x86 Asm FUN!

Five years ago, I took a small routine in C and converted it to asm. Since
this loop is executed in our app 30 billion times or so, I figured it was
worth the effort, and indeed this was about 10,000% faster than the code
generated by the x86 compiler of the time (interestingly, the best we could
do in Alpha assembler was about a 5-10% speed increase, the GEM back-end
compiler for Alpha is legendary). I was just looking at the code the other
day, and was wondering whether there is any way to improve the speed still
further (using MMX instructions, maybe?). I was wondering whether anyone
would care to share some pearls of wisdom regarding improving performance of
this simple routine…

loop1:
mov bl, byte ptr [eax]
xor bl, cl
mov edx, [esi + ebx*4]
shr ecx, 8
xor ecx, edx
inc eax
dec edi
jnz loop1

Regards,

Paul Bunn, UltraBac.com, 425-644-6000
Microsoft MVP - WindowsNT/2000
http://www.ultrabac.com

What is it trying to do (at a high level)? What is the equivalent C code?

I want to try an experiment with this, but I feel I don't have enough info
because all you provided was the loop.


There is a lot of context missing from this loop. Why is EDI involved in
the loop? Where is it used? How do ESI and EBX get set and incremented, and
what is the structure indexed by them?


> I was wondering whether anyone
> would care to share some pearls of wisdom regarding improving performance of
> this simple routine…
>
> loop1:
> mov bl, byte ptr [eax]
> xor bl, cl
> mov edx, [esi + ebx*4]
> shr ecx, 8
> xor ecx, edx
> inc eax
> dec edi
> jnz loop1

Hmmm, is this perhaps a table driven CRC or maybe encryption code?
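
If that guess is right, the fragment corresponds roughly to the usual byte-at-a-time,
table-driven form in C. The register roles here are my reading of the fragment, not
anything Paul stated: eax = input pointer, edi = byte count, esi = a 256-entry DWORD
table, ecx = the running value, and the upper bits of ebx assumed pre-zeroed.

/* Hypothetical C rendering of the posted loop -- a sketch, not Paul's code;
   all names are made up. */
unsigned long table_loop(unsigned long val, const unsigned char *p,
                         unsigned long len, const unsigned long table[256])
{
    while (len--)
        val = (val >> 8) ^ table[(unsigned char)(val ^ *p++)];
    return val;
}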

The line:

mov edx, [esi + ebx*4]

makes conversion to MMX vector instructions ugly, as you can’t look up
multiple values in parallel (unless you use a bigger table, which tends to
blow the cache). You might be able to calculate the table values faster
than looking them up, depending on exactly what they are.
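
For example, if the table happens to be the standard reflected CRC-32 table
(polynomial 0xEDB88320 -- purely a guess), each lookup can be replaced by eight
shift-and-conditional-xor steps; whether that is actually faster depends on how
often the 1 KB table misses the cache.

/* Bit-at-a-time step for one input byte of a reflected CRC-32 -- an
   assumption about what the table holds. */
unsigned long crc32_bitwise_byte(unsigned long crc, unsigned char byte)
{
    int k;
    crc ^= byte;
    for (k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xEDB88320UL & (0UL - (crc & 1UL)));
    return crc;
}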

You might try something like:

mov edi, the negative of the size counter
mov eax, the source base + size
loop1:
mov bl,byte ptr [edi][eax] ; counts up, subtracting from the end of the block
xor bl,cl ; xor low 8 bits
shr ecx,8 ; discard low 8 bits
xor ecx,[esi + ebx *4] ; xor indexed table value
inc edi ; increment index/counter
jnz loop1

(warning, code never tested and written by developer at 4:30 am, so may
contain stupid errors and run 5X slower than expected)

This at least removes one of the counter decrements, and frees up edx. If
the buffers are big, you might also use the non-temporal moves in the P3,
to reduce cache pollution.
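
From C, since this loop only reads, one way to act on the cache-pollution idea is
the prefetchnta hint (_mm_prefetch with _MM_HINT_NTA, part of the Pentium III's SSE
support), which fetches the input with minimal cache pollution. A rough sketch, with
all names assumed:

#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch / _MM_HINT_NTA -- Pentium III SSE */

/* Sketch only: the same byte loop, plus a non-temporal prefetch so a very
   large input is less likely to evict the 256-entry table. */
unsigned long table_loop_nta(unsigned long val, const unsigned char *buf,
                             size_t len, const unsigned long table[256])
{
    size_t i;
    for (i = 0; i < len; i++) {
        if ((i & 31) == 0)   /* one hint per 32-byte cache line */
            _mm_prefetch((const char *)buf + i + 256, _MM_HINT_NTA);
        val = (val >> 8) ^ table[(unsigned char)(val ^ buf[i])];
    }
    return val;
}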

- Jan

Here are some items for thought:

  1. Add "ALIGN 16" to the source just before the top of the loop.
    This will add NOPs to get you to the start of a CACHE line.
    Anyone know the cache line size on Pentium II/III CPUs? Is 16 right?
    Assuming yes, there is 1 byte available to do more work with
    at no effective cost. Right, that isn't much.

The whole loop fits in a single CACHE line. This is goodness.
Gotta get it aligned, though, or it is TWO cache hits per loop.
The ALIGN will knock this down to ZERO.
On Pentium II/III, this may not matter – bigger cache lines.

  2. You are fetching BYTEs from memory on the load of BL.
    Behind the scenes, this is the same fetch 4 times.
    You can save 3 per loop by borrowing EBP to store the data in;
    don't forget about the byte swap. (A C sketch of this word-at-a-time
    fetch follows the listing below.)

  3. Reorder the instructions so that the Pentium chip can make
    use of the second execution pipe.
    I'm not precisely sure whether the code below meets the pairing rules,
    but it should be close.
    The INC EAX and SHR ECX, 8 are reordered in the code that follows.
    These save 1 and 3 clocks respectively.

It is a start.

Joe Nord

.386P
00000000 CODE SEGMENT PARA 'CODE' USE32

ALIGN 16
00000000 loop1:
00000000 8A 18 mov bl, byte ptr [eax]
00000002 32 D9 xor bl, cl
00000004 40 inc eax
00000005 C1 E9 08 shr ecx, 8
00000008 8B 14 9E mov edx, [esi + ebx*4]
0000000B 33 CA xor ecx, edx
0000000D 4F dec edi
0000000E 75 F0 jnz loop1

0010 CODE ENDS
END
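
Along the lines of item 2 above, here is a rough C sketch of pulling the input in
32 bits at a time and peeling off four bytes per load. On little-endian x86 the low
byte of the DWORD is the first byte in memory, which is the byte-order detail item 2
warns about. A length that is a multiple of 4 is assumed, and all names are made up.

/* Sketch of item 2: one 32-bit load feeds four table steps.
   Unaligned DWORD loads are fine on x86. */
unsigned long table_loop_dword(unsigned long val, const unsigned char *buf,
                               unsigned long len, const unsigned long table[256])
{
    unsigned long i, w;
    for (i = 0; i < len; i += 4) {
        w = *(const unsigned long *)(buf + i);  /* little-endian: low byte first */
        val = (val >> 8) ^ table[(unsigned char)(val ^ w)];  w >>= 8;
        val = (val >> 8) ^ table[(unsigned char)(val ^ w)];  w >>= 8;
        val = (val >> 8) ^ table[(unsigned char)(val ^ w)];  w >>= 8;
        val = (val >> 8) ^ table[(unsigned char)(val ^ w)];
    }
    return val;
}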

I didn’t measure it, and I don’t know if you need to keep the upper half of
ebx, but you can try something like this:

mov ebx,ecx            ; ebx = running value
xor ebx,[eax]          ; xor in a dword of input (only the low byte is used)
and ebx,0ffh           ; keep the low byte as the table index
loop1:
shr ecx,8              ; drop the low 8 bits of the running value
xor ecx,[esi + ebx*4]  ; xor in the table entry selected by ebx
mov ebx,ecx            ; start computing the next index
add eax,1              ; advance the input pointer
xor ebx,[eax]          ; xor in the next input dword
and ebx,0ffh           ; again keep only the low byte
dec edi                ; one fewer byte to go
jnz loop1

Some of the ideas are: avoid byte arithmetic; keep the setting of ebx far enough
away from its use; take maximum advantage of memory-to-register arithmetic. At
this point, unrolling the loop might even help! Also, aligning loop1 on a cache
line boundary might be a good idea. And inserting one NOP at the right place
might make a difference, I don't know, but you're going to have to experiment
with it. I suggest you write a little C program with a rdtsc before the loop and
another rdtsc after the loop, and loop a zillion times, measuring the elapsed
time - don't forget to hike the priority of the thread to try to avoid getting
preempted while the test is running. The other idea is, if you can do this four
at a time, maybe you could consider using the Pentium III SIMD instructions.
Hope this helps!
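
A minimal harness along the lines of that rdtsc suggestion, assuming a Microsoft
compiler that provides __rdtsc in <intrin.h> (older compilers would need a small
__asm rdtsc wrapper instead); the_loop() is only a stand-in for the routine being
measured.

#include <stdio.h>
#include <windows.h>
#include <intrin.h>

static unsigned long the_loop(void)   /* stand-in: replace with the real routine */
{
    return 0;
}

int main(void)
{
    unsigned __int64 start, stop;
    unsigned long sink = 0;            /* keeps the compiler from discarding calls */
    long i;

    /* Reduce the chance of being preempted during the measurement. */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    start = __rdtsc();
    for (i = 0; i < 1000000; i++)
        sink += the_loop();
    stop = __rdtsc();

    printf("%I64u cycles per call (sink=%lu)\n",
           (stop - start) / 1000000, sink);
    return 0;
}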

Alberto.


One thing,

xor bl, cl

This instruction will cause a partial register stall
on a Pentium II or Pentium III processor
when the next instruction:

mov edx, [esi + ebx*4]

is executed. This causes a big penalty on the Pentium II or III. As Alberto
posted, you should try to use 32-bit arithmetic as much as possible! How to
overcome the stall depends on whether you need the upper part of ebx.

Guy Bonneau