RtlCopyMemory() vs. memcpy()

You won’t be surprised to see that each NT component has its own function
for the same thing. For spinlocks, you will see video port spinlocks,
NDIS spinlocks, etc.
I can think of the following reasons:

  1. Their driver model needs to support NT and non-NT based kernel.
  2. They believe they could do it smarter in the future and reserve the
    opportunity/convenience to do so.
  3. They don’t like the names defined by other teams and want to roll their
    own “OS” :slight_smile: JK

On Mon, May 6, 2013 at 7:59 AM, wrote:

> Is it safer to use memcpy() rather than RtlCopyMemory() in a driver?
>
> What are the pros and cons of using each?
>
> Thanks
>
> —
> NTDEV is sponsored by OSR
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer
>

> Speer, Kenny wrote:

> This is the sort of advice that I would love to use, but this statement
> in the documentation is too ambiguous:
>
> “The RtlCopyMemory routine runs faster than RtlMoveMemory.”
>
> What does “run faster” mean? Is it 1000x faster or just .0005x faster?
> Some products will be very dependent on this statement.

It’s a ridiculous statement that has been handed down since the mists of
time, based on the documentation for the original C run-time library
some 40 or 50 years ago.
*****
Actually, it can still be true. The important part of the text below is
the reference to the “normal” direction, and that is key. memcpy uses the
“normal” direction. “Normal” is whatever is most efficient on the hardware
platform. It can be either low-to-high or high-to-low, and the C library
carefully leaves this unspecified. Note that if the source and
destination overlap, the results are undefined. Thus, the “move” forms
were introduced. If the buffers do not overlap, memmove can do whatever
memcpy does. But if they overlap, then the test below determines whether
the data is copied in the forward direction (not the “normal” direction)
or the backward direction. This allows you to “slide” a group of
contiguous values in either direction with the same call, and ensures
that the results are well-defined.

Now, as far as “efficiency” is concerned, part of the issue is that many
compilers, including Microsoft’s, can implement “memcpy” as an “intrinsic”,
meaning the compiler replaces the call with actual inline code. The
inline code moves data in the “normal” direction. The “move” version
generally has no intrinsic implementation, so it requires an out-of-line
call.

On modern pipelined superscalar multilevel-cache architectures, the
differences in performance will not show up as long as the move is in the
“normal” direction. But if it requires moving in the “non-normal”
direction, the performance degradation is highly platform-specific. In
modern Intel x86/x64 machines, however, these differences may be minor.
Hint: if uncertain, do the experiment: measure and see.

The bottom line: for non-overlapping copies, use the “copy” form. If
there is any difference in performance, this will get the best
performance. If the source and destination can overlap, you ***MUST***
use the “move” forms; this is not optional. In this case, correctness
trumps performance.
joe

Here is a pseudo-code implementation of memmove:
if the source address is greater than the destination address
    copy in the forward direction
else
    copy in the backward direction

Here is a pseudo-code implementation of memcpy:
    copy in the normal direction

You can see how significant the performance impact is likely to be. It
is so close to 0 that it cannot be measured, except for very small
copies. The “copy” algorithms themselves are identical.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.



It’s OT, but in my programming career I have never bumped into the awkward
situation where src and dest overlap. Am I just lucky?

Calvin


> It’s OT, but in my programming career I have never bumped into the awkward
> situation where src and dest overlap. Am I just lucky?

Part of it depends on what kind of algorithms you are using. For example,
inserting something into an array involves sliding things up to create
space, and deleting an element from the array involves sliding things down
to fill in the space. One common scenario for this is inserting or
deleting contents of a memory-mapped file.
joe




>“Calvin Guan (news)” wrote in message
>news:xxxxx@ntdev…
>It’s OT but in my programming career, I have never bumped into awkward
>situation that src and dest overlap. Am I just lucky?

It was very common in the days when CPUs did not have the luxury of a STOS
instruction. Then a simple memset had to be achieved by setting the source
and destination operands 1 byte apart and then copying to fill the block
(LDIR on a Z80).

//Daniel

Zzzzzzzz…

It seems Mr. Garner has disappeared, rather than attempting to substantiate his nonsense claim about RtlCopyMemory being different from memcpy in Windows kernel-mode code.

How unfortunate. I was looking forward to him provoking me further, to the point where he could become the first person in NTDEV history ever to be put on moderation.

Peter
OSR

> It was very common in the days when CPUs did not have the luxury of a STOS instruction.
> Then a simple memset had to be achieved by setting the source and destination operands 1 byte apart
> and then copying to fill the block (LDIR on a Z80).

Actually, this “luxury” is still unavailable in RISC processors, for understandable reasons: operations
other than LOAD and STORE on memory operands are not allowed (hence the term “load-store architecture”). It means you cannot have complex instructions like STOS, MOVS and friends. Therefore,
memcpy has to be done in software. For example, let’s look at ARM: memcpy involves
moving from the source to the registers (ARM allows you to load and store multiple registers in one go), then
from the registers to the destination, and doing the above in a loop…

Anton Bassov

I can’t remember whether it was the 486 or the Pentium, but prior to that generation of CPU the REP MOVSD instruction was the quickest way to move memory. Then, with the new CPUs, a 6-instruction loop using basic instructions (mov, add, sub, jmp) significantly outperformed REP MOVSD, and in those days that really mattered to a lot of software.

I agree with the recommendations here–always use the DDK functions when you can. Never make assumptions based on reverse engineering today’s environment. Whether it is a wrapper for memcpy today is totally irrelevant. Instead rely on what the documentation says.

xxxxx@gmail.com wrote:

> I can’t remember whether it was the 486 or the Pentium, but prior to that generation of CPU the REP MOVSD instruction was the quickest way to move memory. Then, with the new CPUs, a 6-instruction loop using basic instructions (mov, add, sub, jmp) significantly outperformed REP MOVSD, and in those days that really mattered to a lot of software.

That has never been true. Starting with the 486, REP MOVSD runs 1
transfer per cycle. Even with Pentium pairing, you can’t beat that with
traditional instructions. It IS possible to beat that with the new
instruction sets (like XMM), because you can transfer 8 or 16 bytes per
cycle instead of 4.

Now, there certainly are some very counterintuitive optimizations in the
Pentium. For example, the LOOP instruction is always slower than a
DEC/JNZ pair. JCXZ and JECXZ are also slower than the equivalent
sequences. A single LODS, STOS or MOVS is slower than the equivalent
sequences, but once you put REP on there, that changes.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I don’t know if this is relevant to the discussion but thought I would throw it out there. Say you have a struct like this:

struct Foo
{
    char buf_[4095];
};

I recently noticed that the VS2010 compiler will use something like the following in the compiler-generated copy constructor for Foo:

00BF10E3 lea esi,[esp+24h]
00BF10E7 mov ecx,3FFh
00BF10EC rep movs dword ptr es:[edi],dword ptr [esi]
00BF10EE movs word ptr es:[edi],word ptr [esi]
00BF10F0 movs byte ptr es:[edi],byte ptr [esi]

However if you make buf_ be 4096 size, you get this instead:

009810D9 push 1000h
009810DE lea ecx,[esp+28h]
009810E2 push ecx
009810E3 push eax
009810E4 call memcpy (981C7Ah)
009810E9 add esp,0Ch

> That has never been true. Starting with the 486, REP MOVSD runs

1 transfer per cycle. Even with Pentium pairing, you can’t beat that
with traditional instructions.

You might have missed out, then. It was indeed pairing that allowed 3 pairs of instructions to beat REP MOVSD. We are going back to the 90’s here, and at first I couldn’t accept that a 6-instruction loop could possibly beat the tidy single instruction meant for the job, so I benchmarked to confirm. I think this info can be found on the net pretty easily. The loop is below for completeness.

l:  mov eax,[esi]
    mov [edi],eax
    add esi,4
    add edi,4
    sub ecx,4
    jnz l

>> That has never been true. Starting with the 486, REP MOVSD runs

> 1 transfer per cycle. Even with Pentium pairing, you can’t beat that
> with traditional instructions.

> You might have missed out, then. It was indeed pairing that allowed 3 pairs
> of instructions to beat REP MOVSD. We are going back to the 90’s here, and
> at first I couldn’t accept that a 6-instruction loop could possibly beat
> the tidy single instruction meant for the job, so I benchmarked to
> confirm. I think this info can be found on the net pretty easily. The loop
> is below for completeness.
>
> l:  mov eax,[esi]
>     mov [edi],eax
>     add esi,4
>     add edi,4
>     sub ecx,4
>     jnz l

One of the problems of modern architectures is that your “intuitions”
start falling apart. Caches, instruction pipelining, dynamic register
renaming, and opportunistic asynchronous superscalar architectures (just
to hit the highlights) have resulted in very odd rules about what
constitutes “good” code. Remember also that, in its heart-of-hearts, the
x86 is a RISC machine that emulates the Intel instruction set (and don’t
tell me I’m talking about the Itanic). In some cases, the wizards who obsess
about such things are responsible for the code you see. If you are not
similarly obsessive, and/or do not have the NDA materials from Intel, it
is hard to say, just looking at an instruction sequence, exactly how fast
it really is. Benchmarking is what you have to do.

Some years ago, I published something on my Web site which used a computed
shift value. A reader chastised me because “everyone knows” that a computed
shift is not only more expensive than a constant shift, but also that shift
times “of course” are proportional to the shift distance. He pointed me
at a Web site that “proved” he was right. The Web site talked about a
Pentium 3, and was about as relevant to modern computers as timings on an
8088. The result was that I did some benchmarking, and wrote the essay

http://www.flounder.com/shiftcost.htm

Shift cost is constant, independent of shift distance. If I cared, I could
benchmark various implementations of “memcpy”, but I’ll be happy to send
my benchmarking program to anyone who asks.
joe



xxxxx@gmail.com wrote:

> That has never been true. Starting with the 486, REP MOVSD runs
> 1 transfer per cycle. Even with Pentium pairing, you can’t beat that
> with traditional instructions.
> You might have missed out, then. It was indeed pairing that allowed 3 pairs of instructions to beat REP MOVSD. We are going back to the 90’s here, and at first I couldn’t accept that a 6-instruction loop could possibly beat the tidy single instruction meant for the job, so I benchmarked to confirm. I think this info can be found on the net pretty easily. The loop is below for completeness.
>
> l:  mov eax,[esi]
>     mov [edi],eax
>     add esi,4
>     add edi,4
>     sub ecx,4
>     jnz l

Nope, that doesn’t do it. Even with pairing, that loop requires at
least 3 cycles per transfer. “rep movsd”, once you get rolling, runs 1
cycle per transfer.

However, on a side note, that loop can’t be paired, because the second
instruction has a blocking dependency on the first. You would have to
prefetch the first dword and swap those two instructions.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

>“rep movsd”, once you get rolling, runs 1 cycle per transfer.

Where ‘transfer’ in favorable conditions might mean 64 bytes.

xxxxx@broadcom.com wrote:

> “rep movsd”, once you get rolling, runs 1 cycle per transfer.
Where ‘transfer’ in favorable conditions might mean 64 bytes.

Not with “rep movsd”. It only does 4 bytes at a time. It’s true that
you’re going to bring in a whole cache line at a time, but the copying
will only be 4 bytes per cycle. In 64-bit mode, “rep movsq” will do 8
bytes per cycle.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Since the start of processor families, the only thing you can really rely
on about speed is:

“Hardware developers will change the fastest sequence for some set of
operations with every new model, then complain when all the software out
there does not immediately adopt the changes.”

Don Burn
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr

>Not with “rep movsd”. It only does 4 bytes at a time.

See “Fast string operations” in the IA-32 docs. I was partially correct: it takes 1 clock per 16 bytes (for all flavors of data size). In the Nehalem architecture, the alignment requirements are relaxed: the src/dst alignment doesn’t have to match.