x86 Asm FUN!

Paul_Bunn Member Posts: 251
Five years ago, I took a small routine in C and converted it to asm. Since
this loop is executed in our app 30 billion times or so, I figured it was
worth the effort, and indeed this was about 10,000% faster than the code
generated by the x86 compiler of the time (interestingly, the best we could
do in Alpha assembler was about a 5-10% speed increase; the GEM back-end
compiler for Alpha is legendary). I was just looking at the code the other
day and wondering whether there is any way to improve the speed still
further (using MMX instructions, maybe?). Would anyone care to share some
pearls of wisdom on improving the performance of this simple routine...

loop1:
mov bl, byte ptr [eax]      ; fetch the next source byte
xor bl, cl                  ; xor it with the low byte of the running value in ecx
mov edx, [esi + ebx*4]      ; look up the dword table entry (esi = table base)
shr ecx, 8                  ; shift the running value down one byte
xor ecx, edx                ; fold the table entry in
inc eax                     ; advance the source pointer
dec edi                     ; count down the remaining bytes
jnz loop1

Regards,

Paul Bunn, UltraBac.com, 425-644-6000
Microsoft MVP - WindowsNT/2000
http://www.ultrabac.com
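
Paul never posts the original C, but the loop has the classic shape of a
table-driven CRC-32 update, as one of the replies below also guesses: each
source byte is XORed with the low byte of a running 32-bit value, which is
then shifted down one byte and folded with an entry from a 256-entry dword
table. Here is a hedged C sketch of what the loop appears to compute -- the
function and variable names, and the CRC reading itself, are assumptions
rather than anything Paul states:

#include <stddef.h>
#include <stdint.h>

/* eax = src, edi = len, ecx = acc, esi = table in the asm above. */
static uint32_t fold_buffer(const uint8_t *src, size_t len,
                            uint32_t acc, const uint32_t table[256])
{
    while (len--)
        acc = (acc >> 8) ^ table[(uint8_t)(*src++ ^ acc)];
    return acc;
}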

Comments

  • OSR_Community_User-48 Member Posts: 74
    What is it trying to do (at a high level)? What is the equivalent C code?

    I want to try an experiment with this, but I don't feel I have enough info,
    because all you provided was the loop.

  • David_J._Craig Member Posts: 1,885
    There is a lot of context missing from this loop. Why is EDI involved in
    the loop? Where is it used? How do ESI and EBX get set and incremented,
    and what is the structure indexed by them?

  • OSR_Community_User Member Posts: 110,217
    >I was wondering whether anyone
    >would care to share some pearls of wisdom regarding improving performance of
    >this simple routine...
    >
    >loop1:
    > mov bl, byte ptr [eax]
    > xor bl, cl
    > mov edx, [esi + ebx*4]
    > shr ecx, 8
    > xor ecx, edx
    > inc eax
    > dec edi
    > jnz loop1

    Hmmm, is this perhaps a table driven CRC or maybe encryption code?

    The line:

    mov edx, [esi + ebx*4]

    makes conversion to MMX vector instructions ugly, as you can't look up
    multiple values in parallel (unless you use a bigger table, which tends to
    blow the cache). You might be able to calculate the table values faster
    than looking them up, depending on exactly what they are.

    You might try something like:

    mov edi, the negative of the size counter
    mov eax, the source base + size
    loop1:
    mov bl, byte ptr [edi][eax]   ; counts up, subtracting from the end of the block
    xor bl, cl                    ; xor low 8 bits
    shr ecx, 8                    ; discard low 8 bits
    xor ecx, [esi + ebx*4]        ; xor indexed table value
    inc edi                       ; increment index/counter
    jnz loop1

    (warning, code never tested and written by developer at 4:30 am, so may
    contain stupid errors and run 5X slower than expected)

    This at least removes one of the two per-iteration counter updates, and
    frees up edx. If the buffers are big, you might also use the non-temporal
    moves on the P3 to reduce cache pollution.

    - Jan
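
    For readers following along in C, here is a hedged sketch of Jan's
    count-up idea: a single signed index serves as both the element offset
    (relative to the end of the buffer) and the loop-exit test, so the
    separate length counter (dec edi in the original) disappears. Names are
    illustrative, not from the post.

    #include <stdint.h>

    static uint32_t fold_buffer_countup(const uint8_t *src, long len,
                                        uint32_t acc, const uint32_t table[256])
    {
        const uint8_t *end = src + len;   /* eax in Jan's sketch: base + size    */
        long i = -len;                    /* edi: negative count, rises toward 0 */

        while (i != 0) {
            acc = (acc >> 8) ^ table[(uint8_t)(end[i] ^ acc)];
            i++;
        }
        return acc;
    }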
  • OSR_Community_User Member Posts: 110,217
    Here are some items for thought:

    1) Add "ALIGN 16" to the source just before the top of the loop.
    This will add NOPs to get you to the start of a CACHE line.
    Does anyone know the cache line size on Pentium II/III CPUs? Is 16 right?
    Assuming yes, there is 1 byte available to do more work with
    at no effective cost. Right, that isn't much.

    The whole loop fits in a single CACHE line. This is goodness.
    Gotta get it aligned, though, or it is TWO cache hits per loop.
    The ALIGN will knock this down to ZERO.
    On Pentium II/III, this may not matter -- bigger cache lines.

    2) You are fetching BYTEs from memory on the load of BL.
    Behind the scenes, this is the same fetch 4 times.
    You can save 3 per loop by borrowing EBP to store the data in;
    don't forget about the byte swap. (A C sketch of this
    dword-at-a-time idea follows the listing below.)

    3) Reorder the instructions so that the Pentium chip can make
    use of the second execution pipe.
    I'm not precisely sure whether the code below meets the pairing rules,
    but it should be close.
    The INC EAX and SHR ECX, 8 are reordered in the code that follows.
    These save 1 and 3 clocks respectively.

    It is a start.

    Joe Nord

    .386P
    00000000 CODE SEGMENT PARA 'CODE' USE32

    ALIGN 16
    00000000 loop1:
    00000000 8A 18 mov bl, byte ptr [eax]
    00000002 32 D9 xor bl, cl
    00000004 40 inc eax
    00000005 C1 E9 08 shr ecx, 8
    00000008 8B 14 9E mov edx, [esi + ebx*4]
    0000000B 33 CA xor ecx, edx
    0000000D 4F dec edi
    0000000E 75 F0 jnz loop1

    0010 CODE ENDS
    END
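
    A hedged C sketch of the dword-at-a-time idea in point (2): read the
    source a dword at a time and consume its four bytes from a register,
    instead of issuing four separate byte loads. It assumes the byte count is
    a multiple of 4; on little-endian x86 the low byte of the loaded dword is
    the first byte in memory, so the bytes come out in order as the word is
    shifted right. Names are illustrative, not from the post.

    #include <stdint.h>
    #include <string.h>

    static uint32_t fold_buffer_by_dword(const uint8_t *src, size_t len,
                                         uint32_t acc, const uint32_t table[256])
    {
        size_t i;
        for (i = 0; i + 4 <= len; i += 4) {
            uint32_t w;
            int b;
            memcpy(&w, src + i, 4);        /* one 32-bit load ("borrow EBP") */
            for (b = 0; b < 4; b++) {
                acc = (acc >> 8) ^ table[(uint8_t)(w ^ acc)];
                w >>= 8;
            }
        }
        return acc;
    }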
  • OSR_Community_User Member Posts: 110,217
    I didn't measure it, and I don't know if you need to keep the upper half of
    ebx, but you can try something like this:

    mov ebx, ecx             ; compute the first table index outside the loop
    xor ebx, [eax]           ; (dword read; only the low byte matters)
    and ebx, 0ffh
    loop1:
    shr ecx, 8               ; discard the low 8 bits of the running value
    xor ecx, [esi + ebx*4]   ; fold in the table entry for the previous byte
    mov ebx, ecx             ; start computing the next index early...
    add eax, 1
    xor ebx, [eax]           ; ...so it is ready well before it is used
    and ebx, 0ffh
    dec edi
    jnz loop1

    Some of the ideas are: avoid byte arithmetic; keep the setting of ebx far
    enough away from its use; take maximum advantage of memory-to-register
    arithmetic. At this point, unrolling the loop might even help! Also,
    aligning loop1 on a cache line boundary might be a good idea, and inserting
    one NOP at the right place might make a difference -- I don't know, you're
    going to have to experiment with it. I suggest you write a little C program
    with a rdtsc before the loop and another rdtsc after the loop, and loop a
    zillion times, measuring the elapsed time; don't forget to hike the priority
    of the thread to try to avoid getting preempted while the test is running.
    The other idea is, if you can do this four at a time, maybe you could
    consider using the Pentium III SIMD instructions. Hope this helps!

    Alberto.
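
    Alberto's measurement suggestion is easy to sketch in C with MSVC's
    __rdtsc intrinsic and the Win32 thread-priority call; the routine being
    timed here is only a stand-in, and the iteration count is arbitrary.

    #include <stdio.h>
    #include <windows.h>
    #include <intrin.h>                     /* __rdtsc */

    #define ITERATIONS 1000000              /* "loop a zillion times" */

    static volatile unsigned int sink;

    static void routine_under_test(void)    /* replace with the code being tuned */
    {
        sink += 1;
    }

    int main(void)
    {
        unsigned __int64 start, stop;
        long i;

        /* hike the thread priority so the test is less likely to be preempted */
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

        start = __rdtsc();
        for (i = 0; i < ITERATIONS; i++)
            routine_under_test();
        stop = __rdtsc();

        printf("%I64u cycles total, %I64u per call\n",
               stop - start, (stop - start) / ITERATIONS);
        return 0;
    }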
  • OSR_Community_User Member Posts: 110,217
    One thing:

    xor bl, cl

    This instruction will later cause a partial register stall
    on a Pentium II or Pentium III processor
    when the next instruction:

    mov edx, [esi + ebx*4]

    is executed. This causes a big penalty on the Pentium II and III.
    As Alberto posted, you should try to use
    32-bit arithmetic as much as possible! How to overcome
    the stall depends on whether you need the upper part of ebx.

    Guy Bonneau