At 02:21 PM 5/6/00 -0700, you wrote:
> Is there some reason you just don’t write a little loop, as in:
> for (i = 0; i < length; i++) {
>     if (*a++ != *b++)
>         break;
> }
I can’t speak for the original poster, but you should always use the library
functions (memcmp, memcpy, RtlCompareMemory, etc.) for that sort of thing,
since they use the string instructions built into the Intel CPU. I think I
measured them long ago and found them to be about 25% faster than the
optimized versions of the code I wrote. Of course, today’s compilers may be
much smarter, and there may not actually be a difference anymore.
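For concreteness, here is the quoted loop next to the library call, wrapped in functions (a user-mode sketch; the function names and signatures are mine, not from any library):

```c
#include <stddef.h>
#include <string.h>

/* The hand-rolled loop from the quoted post, as a function.
   Returns nonzero if the buffers differ in the first `length` bytes. */
static int compare_loop(const unsigned char *a, const unsigned char *b,
                        size_t length)
{
    size_t i;
    for (i = 0; i < length; i++) {
        if (a[i] != b[i])
            return 1;
    }
    return 0;
}

/* The library equivalent: memcmp gives the same yes/no answer and
   lets the C run time (or the compiler's intrinsic) pick the
   string instructions for you. */
static int compare_lib(const unsigned char *a, const unsigned char *b,
                       size_t length)
{
    return memcmp(a, b, length) != 0;
}
```

Both return the same answer for any pair of buffers; the only question is which one the compiler and run time turn into faster machine code.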
My guess is the run-time functions (kernel or C) do NOT take advantage of
processor-unique instructions, such as the cache-prefetch and non-caching
move instructions on recent processors. It’s been a couple of years since I
measured all the tradeoffs, but I very much believe the Intel string
instructions cause horrible cache pollution (anybody correct me if
this is not the case). Ideally, a kernel run-time function for
things like this would dispatch to processor-specific code (different on
Pentium II and Pentium III), as the OS certainly has to know about
different processors. I don’t know if the W2K kernel functions have this
enhancement.
The Microsoft C compiler does know how to inline many of the low-level
run-time functions. I know for sure it can generate inline string-move
instructions to replace memcpy (I’m not sure about memcmp). If the code
optimizer were really sharp, it would detect a simple loop that moves
memory and use the appropriate instructions.
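To make the two forms concrete (my own user-mode sketch; whether a given compiler actually pattern-matches the loop is its business, not something I’m asserting here):

```c
#include <stddef.h>
#include <string.h>

/* A simple copy loop of the kind an optimizer might (or might not)
   recognize and replace with string-move instructions. */
static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* The explicit alternative: call memcpy, which MSVC can expand
   inline (e.g. as rep movsd) when intrinsics are enabled. */
static void copy_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Either way the result is identical; the difference is only in what instructions the compiler ends up emitting.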
If one of the operands is device memory, over say a PCI bus, there may be
some advantage to using larger data sizes (like 64- or 128-bit move
operations), as the hardware should know to do a 2- or 4-DWORD burst instead
of single-DWORD bursts. Burst length is extremely important to getting good
PCI bus efficiency. Strip-mining into registers may also generate even better
PCI bus bursts, like this (for moving memory from a PCI target to main memory):
        ; esi = source (PCI target memory), edi = destination buffer,
        ; ecx = byte count (assumed nonzero and a multiple of 64)
next:
        movq    mm0,[esi+0x00]
        movq    mm1,[esi+0x08]
        movq    mm2,[esi+0x10]
        movq    mm3,[esi+0x18]
        movq    mm4,[esi+0x20]
        movq    mm5,[esi+0x28]
        movq    mm6,[esi+0x30]
        movq    mm7,[esi+0x38]
        movntq  [edi+0x00],mm0          ; non-temporal stores: bypass the cache
        movntq  [edi+0x08],mm1
        movntq  [edi+0x10],mm2
        movntq  [edi+0x18],mm3
        movntq  [edi+0x20],mm4
        movntq  [edi+0x28],mm5
        movntq  [edi+0x30],mm6
        movntq  [edi+0x38],mm7
        add     esi,0x40                ; advance source and destination
        add     edi,0x40
        sub     ecx,0x40
        jnz     next
        emms                            ; clear MMX state before any FP code runs
This MAY generate 64-byte PCI bursts (a bus analyzer would be needed to see
what really happens) and writes the data to memory without cache pollution.
I know there was a discussion here a while back about getting burst PCI
target read transfers if the source device memory is declared uncached
(don’t remember the conclusion).
Also note that MMX instructions share state with the floating-point
registers, so care must be taken to make them work in kernel mode
(especially at IRQL above PASSIVE_LEVEL).
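A sketch of what that care looks like in an NT driver (DDK environment assumed, not standalone C; KeSaveFloatingPointState/KeRestoreFloatingPointState are the documented calls, while MyMmxCopy is a hypothetical wrapper around the movq/movntq loop above):

```c
#include <ntddk.h>

NTSTATUS CopyWithMmx(PVOID Dest, PVOID Src, ULONG Bytes)
{
    KFLOAT_SAVE FloatSave;
    NTSTATUS Status = KeSaveFloatingPointState(&FloatSave);

    if (!NT_SUCCESS(Status))
        return Status;             /* e.g. IRQL too high for the save */

    MyMmxCopy(Dest, Src, Bytes);   /* hypothetical MMX copy routine */

    KeRestoreFloatingPointState(&FloatSave);
    return STATUS_SUCCESS;
}
```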
My understanding of the actual application by xxxxx@usa.net is to sweep
128K of PCI target memory, and look for changes against a memory buffer.
Getting fast PCI bursts, and not flushing the processor cache on every
“poll” seems like a desirable quality. Twenty carefully chosen instructions
will potentially perform MUCH better than a generic memory comparison
function. Of course, if the polling happens once every 60 seconds, it really
doesn’t matter. If the polling happens 100 times/sec, it really does.
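The arithmetic behind that claim (my numbers, just multiplying the figures from this thread):

```c
/* Bytes per second consumed by sweeping a region `region_bytes` long,
   `polls_per_sec` times a second. With the thread's figures, 128 KB at
   100 polls/sec works out to 12.5 MB/s of reads, a sizable slice of
   the ~132 MB/s theoretical peak of a 32-bit/33 MHz PCI bus. */
static double poll_bandwidth(double region_bytes, double polls_per_sec)
{
    return region_bytes * polls_per_sec;
}
```

At that rate, any per-byte inefficiency in the comparison loop is multiplied 100 times a second.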
I guess my point is that there may be HUGE efficiency losses in just using
some generic run-time function, and whether that is actually a problem
“depends” on lots of factors.