memcpy inlined, memmove not

I had to disassemble memcpy to find out how it is implemented. The header files don’t specify anything about it being inlined. Using rep movs byte ptr on a memory mapped device results in a performance bottleneck.

I did some experiments with memmove which passed. For how long will it not be inlined? Also, the x WinDBG command shows the same location for memcpy and memmove.

To conclude on WinDDK\7600.16385.1

  • memcpy is inlined on chk builds
  • the crt does export memcpy and memmove at the same location
  • memmove is not inlined.

Calin Iaru wrote:

I had to disassemble memcpy to find out how is implemented. The header
files don’t specify anything about it being inlined. rep movs byte ptr
-> on a memory mapped device results in a performance bottleneck.

I don’t understand your comment. The external version of memcpy and
memmove both do a “rep movsd”, exactly like the inlined version. Why do
you say it is a performance bottleneck?

I did some experiments with memmove which passed. For how long will it
not be inlined? Also, the x WinDBG command shows the same location for
memcpy and memmove.

This is internal to the compiler. If you #include <intrin.h> or add the
/Oi compiler flag, then memcpy will be inlined. memmove will not,
because the additional checking required makes it easier just to call
the external. memmove is the stricter of the two, so there’s no need
for two implementations.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

The external version of memcpy does 8-byte loads and stores with mov rax, dword ptr[src]. It transfers 32 bytes per loop iteration.

You talk about priorities between memcpy and memmove. Is there a WDK header that orders this? intrin.h is part of the SDK or Visual Studio.


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

For a memory mapped device, use WRITE_REGISTER_BUFFER_ULONG().

Calin Iaru wrote:

The external version of memcpy does 8 bytes load/store with mov rax,
dword ptr[src]. It transfers 32 bytes per loop.

Ah, I looked at the 32-bit version. Interesting that the 64-bit
compiler inlines memcpy to a “rep movsb” instead of a “rep movsd”; I
would argue that was a compiler bug. I will have to do some testing to
see if the hardware combines that.

You talk about priorities between memcpy and memmove. Is there a WDK
header that orders this?

No, that behavior is an ancient part of the standard C run-time
library. “memcpy” copies source to destination, but is undefined if the
two buffers overlap. “memmove” copies source to destination, but does
the right thing if the two overlap. Thus, “memmove” can always handle a
“memcpy” call, but not vice versa.
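
To make the difference concrete, here is a minimal C sketch. The helper names shift_right and copy_disjoint are made up for this illustration; only memcpy and memmove come from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move the first `len` bytes of `buf` up by `shift`
 * bytes. Source and destination overlap whenever shift < len, so memmove
 * is required; calling memcpy here would be undefined behavior. */
static void shift_right(char *buf, size_t len, size_t shift)
{
    assert(buf != NULL);
    memmove(buf + shift, buf, len);
}

/* Hypothetical helper: copy between two distinct buffers. No overlap is
 * possible, so memcpy is sufficient (memmove would also be correct). */
static void copy_disjoint(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
}
```

The caller of shift_right must make sure the buffer has room for len + shift bytes.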

intrin.h is part of the sdk or Visual Studio.

It is included with Visual Studio, but /Oi serves the same purpose. Or
you can say “#pragma intrinsic(memcpy)”.
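
As a hedged illustration of the pragma form: the sketch below requests the intrinsic form of memcpy under MSVC and compiles as plain C elsewhere. The struct header type and the copy_header function are hypothetical, and whether a given call is actually expanded inline remains the compiler's decision, so checking the disassembly is still the only way to be sure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(_MSC_VER)
/* MSVC only: ask the compiler to use the intrinsic form of memcpy.
 * The opposite request is "#pragma function(memcpy)", which forces
 * a call to the external routine. */
#pragma intrinsic(memcpy)
#endif

/* Hypothetical fixed-size record, used only for this illustration. */
struct header {
    unsigned char bytes[16];
};

/* Small copy with a compile-time size: a typical inlining candidate. */
static void copy_header(struct header *dst, const struct header *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```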


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

To my surprise, on an x64 processor, “rep movsb”, “rep movsd” and “rep
movsq” all take exactly the same amount of time to copy the same number
of bytes. That’s a change from the x86 world, where the instruction was
one cycle per iteration.

On my Intel Core2 Quad, it takes 20,000 cycles to copy 65,536 bytes with
“rep movs”. The external “memmove” beats that by 10%.
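
For scale, those figures can be turned into throughput with a little arithmetic. The sketch below assumes that "beats that by 10%" means 10% fewer cycles; the function names are made up for the example. Under that assumption, rep movs moves about 3.3 bytes per cycle and the external memmove about 3.6.

```c
#include <assert.h>

/* Average copy throughput in bytes per CPU cycle. */
static double bytes_per_cycle(double bytes, double cycles)
{
    assert(cycles > 0.0);
    return bytes / cycles;
}

/* Cycle count after a fractional improvement, e.g. 0.10 for "10% faster",
 * interpreted here (an assumption) as 10% fewer cycles. */
static double cycles_after_gain(double cycles, double gain)
{
    assert(gain >= 0.0 && gain < 1.0);
    return cycles * (1.0 - gain);
}
```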


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:bounce-472675-xxxxx@lists.osr.com] On Behalf Of Tim Roberts
Sent: Tuesday, August 30, 2011 3:19 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] memcpy inlined, memmove not

To my surprise, on an x64 processor, “rep movsb”, “rep movsd” and “rep
movsq” all take exactly the same amount of time to copy the same number
of bytes. That’s a change from the x86 world, where the instruction was
one cycle per iteration.

On my Intel Core2 Quad, it takes 20,000 cycles to copy 65,536 bytes with
“rep movs”. The external “memmove” beats that by 10%.

Are you saying the lib implementation only takes 18,000 cycles? If so, have
you disassembled it to see if it’s using a different construct?

Phil

Philip D. Barila

Philip D Barila wrote:

Are you saying the lib implementation only takes 18,000 cycles?

Yes.

If so, have
you disassembled it to see if it’s using a different construct?

Yes, as Calin said, it unrolls the loop using registers. It also uses
the non-temporal cache hint instructions (prefetchnta and movnti). It’s
possible the cache hinting alone is enough to squeeze that additional 10%.
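
A rough C analogue of the unrolled loop described above, for illustration only. It is not the CRT code: copy_qwords_unrolled is a hypothetical name, the buffers are assumed not to overlap, and the non-temporal hints (prefetchnta, movnti) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copy `count` 64-bit words from src to dst.
 * The two buffers must not overlap. */
static void copy_qwords_unrolled(uint64_t *dst, const uint64_t *src, size_t count)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* Main loop: four 8-byte loads followed by four 8-byte stores,
     * i.e. 32 bytes per iteration. */
    for (; i + 4 <= count; i += 4) {
        uint64_t a = src[i];
        uint64_t b = src[i + 1];
        uint64_t c = src[i + 2];
        uint64_t d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Tail: the remaining 0 to 3 words, one at a time. */
    for (; i < count; ++i) {
        dst[i] = src[i];
    }
}
```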


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> I had to disassemble memcpy to find out how is implemented. The header
> files don’t specify anything about it being inlined. rep movs byte ptr ->
> on a memory mapped device results in a performance bottleneck.
****
That’s a little surprising, given the source is included with the compiler
install, at least for the Visual Studio version.

The C version, for my install, is found in

c:\Program Files (x86)\Microsoft Visual Studio 10.0\vc\crt\src\memcpy.c

and the assembly-code version is found in

c:\Program Files (x86)\Microsoft Visual Studio 10.0\vc\crt\src\intel\memccpy.asm

Note that the meanings of memcpy and memmove are explicitly documented.

I did some experiments with memmove which passed. For how long will it not
be inlined? Also, the x WinDBG command shows the same location for memcpy
and memmove.
****
memmove semantics are a superset of memcpy semantics. The assembly code
source for memmove.asm, in its entirety, is

;***
;memmove.asm -
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; memmove() copies a source memory buffer to a destination buffer.
; Overlapping buffers are treated specially, to avoid propogation.
;
; NOTE: This stub module scheme is compatible with NT build
; procedure.
;
;*******************************************************************************

MEM_MOVE EQU 1
INCLUDE Intel\MEMCPY.ASM

******
Note that a casual reading of the source shows why they appear to be the
same location.
******

>inlined. rep movs byte ptr -> on a memory mapped device results in a performance bottleneck.

Try turning on write combining via the CacheType parameter of MmMapIoSpace (MmWriteCombined).

  • the crt does export memcpy and memmove at the same location

Oh yes! They have really listened to Linus, who suggested the same for glibc! :-)


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> the external. memmove is the stricter of the two, so there’s no need
> for two implementations.

The well-known story (occurred half a year ago IIRC) about Linux and this exact topic:

  • memcpy was always documented to NOT tolerate overlaps, while memmove was guaranteed to tolerate them.
  • nevertheless, the common memcpy implementations actually tolerated overlaps in one direction but not the other.
  • this was the GNU libc implementation
  • some software, like Adobe Flash Player on Linux, had bugs where it called memcpy on overlapping ranges; these went unnoticed thanks to the “lucky direction” of the overlap.
  • then the glibc guys reworked memcpy to introduce some new optimizations from Intel
  • the result was - kaboom! for many binary Linux apps, including Adobe Flash
  • during the long discussion that followed, Linus suggested making memcpy a synonym of memmove (which is the way MS does this)
  • but the glibc guys resisted a lot, arguing that “broken software must be fixed”
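
The "lucky direction" effect is easy to reproduce with hand-written loops. The sketch below is an assumption-laden illustration, not glibc code: copy_forward, copy_backward and copy_overlap_safe are hypothetical names. A plain forward copy gives the right answer when the destination lies below the source, and silently corrupts data when it lies above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive copy that walks from the lowest address upward. */
static void copy_forward(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

/* Naive copy that walks from the highest address downward. */
static void copy_backward(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    while (n > 0) {
        --n;
        dst[n] = src[n];
    }
}

/* Overlap-safe copy: choose the direction from the address order, which
 * is the guarantee memmove gives. Addresses are compared as integers so
 * the comparison is also usable for unrelated buffers. */
static void copy_overlap_safe(unsigned char *dst, const unsigned char *src, size_t n)
{
    if ((uintptr_t)dst <= (uintptr_t)src) {
        copy_forward(dst, src, n);
    } else {
        copy_backward(dst, src, n);
    }
}
```

With the 8-byte buffer "abcdefgh", copy_forward(buf, buf + 2, 6) yields the expected "cdefghgh", while copy_forward(buf + 2, buf, 6) produces "abababab" instead of "ababcdef" because source bytes are overwritten before they are read.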


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Hi Tim,

I can see the /Oi switch being used. The list of intrinsics for Visual Studio 2008 is documented at

http://msdn.microsoft.com/en-us/library/26td21ds(v=VS.90).aspx

Regards,

Calin


From: Tim Roberts
To: Windows System Software Devs Interest List
Sent: Tuesday, August 30, 2011 10:24 PM
Subject: Re: [ntdev] memcpy inlined, memmove not

This is internal to the compiler. If you #include <intrin.h> or add the
/Oi compiler flag, then memcpy will be inlined. memmove will not,
because the additional checking required makes it easier just to call
the external. memmove is the stricter of the two, so there’s no need
for two implementations.

The SDK memcpy implementation is not the same as the WDK.


From: xxxxx@flounder.com
To: Windows System Software Devs Interest List
Sent: Wednesday, August 31, 2011 1:46 AM
Subject: Re: [ntdev] memcpy inlined, memmove not

That’s a little surprising, given the source is included with the compiler
install, at least for the Visual Studio version.

Yes, overlaps can be categorized into four different types, and with some of them the bugs can stay hidden for quite some time because the result is accidentally correct. And some overlaps can restrict block moves, IIRC.

-pro

On Aug 31, 2011, at 12:38 AM, Maxim S. Shatskih wrote:

> The well-known story (occurred half a year ago IIRC) about Linux and this exact topic: [...]

I’ve always preferred to force memcpy to be inlined, but that would prevent taking advantage of improvements such as this in the runtime. I’m convinced. :-)

Phil

Philip D. Barila

Regarding the movsb/movsd/etc controversy: I’ve found what appears to be
suboptimal code generated by the 64-bit compiler, and have been told “Yes,
it looks bad, but if you understand how the pipelines, prefetch and caches
work, then construct A is as good as or better than construct B.”

For example, I have seen code of the form
…compute result in R?X
mov local, R?X
mov R?X, local

but because of speculative execution and the write pipe, the second mov is
said to take 0 CPU clock cycles to execute, so there is apparently no
attempt made to optimize this. I was surprised.
joe

xxxxx@flounder.com wrote:

For example, I have seen code of the form
…compute result in R?X
mov local, R?X
mov R?X, local

but because of speculative execution and the write pipe, the second mov is
said to take 0 CPU clock cycles to execute, so there is apparently no
attempt made to optimize this. I was surprised.

I’m afraid I’ve lost track of micro-optimizing for the x64 architecture,
but I was surprised to see the following in the disassembly of the
“memmove” function:

00000000`774ce761 666666666666660f1f840000000000 nop word ptr [rax+rax]

That’s a 15-byte no-op, in order to force the next instruction to be
16-byte aligned. In the x86, you couldn’t have more than one of each
prefix. Apparently, that limitation has been lifted…


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

“Calin Iaru” wrote in message news:xxxxx@ntdev…
> I had to disassemble memcpy to find out how is implemented. The header
> files don’t specify anything about it being inlined. rep movs byte ptr ->
> on a memory mapped device results in a performance bottleneck.
>
> I did some experiments with memmove which passed. For how long will it not
> be inlined? Also, the x WinDBG command shows the same location for memcpy
> and memmove.
>
> To conclude on WinDDK\7600.16385.1
> - memcpy is inlined on chk builds
> - the crt does export memcpy and memmove at the same location
> - memmove is not inlined.

Correct. Memory-mapped I/O registers are not plain vanilla RAM. If you
issue the wrong cycle type, the device can behave in various interesting
ways. Only your hardware/FPGA guys can tell how.

So better not to use memcpy or memmove when at least one side is I/O memory.
Use READ/WRITE_REGISTER_xxx or the intrinsics defined in wdm.h: __movsb,
__movsw, __movsd

Regards,
– pa

What about the READ/WRITE_REGISTER_BUFFER_xxxx operations?
joe

See the macro in wdm.h for those.

I seem to recall that older versions simply looped calling
READ/WRITE_REGISTER_xxxx. The latest wdm.h has a loop writing to memory
with the volatile attribute. However, before it does so it starts with
a fence. So this should be quicker than calling individual
READ/WRITE_REGISTER_xxxx since it eliminates (n - 1) fence calls.
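
The shape of that "one fence, then a volatile loop" pattern can be sketched in portable C11. This is a user-mode stand-in written from the description above, not the wdm.h macro: write_buffer_u32 is a hypothetical name, and atomic_thread_fence stands in for whatever barrier the WDK header actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: issue a single full fence, then perform `count`
 * 32-bit volatile stores to consecutive locations starting at `reg`. */
static void write_buffer_u32(volatile uint32_t *reg, const uint32_t *buf, size_t count)
{
    assert(reg != NULL && buf != NULL);

    /* One barrier for the whole transfer instead of one per element. */
    atomic_thread_fence(memory_order_seq_cst);

    for (size_t i = 0; i < count; ++i) {
        /* volatile: the compiler must perform every store, in order. */
        reg[i] = buf[i];
    }
}
```

In a real driver the destination would be a pointer obtained by mapping the device, and the documented WRITE_REGISTER_BUFFER_ULONG routine should be preferred over a hand-rolled loop.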

Mark.

On 31/08/2011 19:36, xxxxx@flounder.com wrote:

What about the READ/WRITE_REGISTER_BUFFER_xxxx operations?
joe
