What are load and store CPU instructions?

I am debugging a data corruption issue in an RMA library, and the problem
seems to be fixed by using a different memory barrier. I would never have
expected this, but here is the situation: when we use mfence to flush the
transfers, data corruption occurs; when we use sfence instead, it does not.

Here are some excerpts from Intel VTune Analyzer documentation:
"
The SFENCE instruction is ordered with respect to store instructions, other
SFENCE instructions, any MFENCE instructions, and any serializing
instructions (such as the CPUID instruction). It is not ordered with
respect to load instructions or the LFENCE instruction.
"
I cannot really understand what this means, or what a load instruction is.
My guess is that load instructions are those instructions that copy data
from memory into CPU registers. Please confirm.

And the documentation mentions mfence too:
"
Performs a serializing operation on all load and store instructions that
were issued prior to the MFENCE instruction. This serializing operation
guarantees that every load and store instruction that precedes the MFENCE
instruction in program order is globally visible before any load or store
instruction that follows the MFENCE instruction is globally visible.
"
So I would think that mfence is stronger than sfence, because every load
and store operation is guaranteed to become globally visible before the barrier.
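
To check my understanding of the difference, here is a small, untested
sketch (the names are invented, not from our library) of a pattern where the
ordering actually matters: a producer fills a buffer with streaming stores
and then sets a flag. The sfence orders the buffer stores against the flag
store; mfence would additionally order loads, which this pattern does not
need.

#include <emmintrin.h>

static volatile int ready;     /* completion flag polled by another CPU */
static __m128i payload[8];     /* 128 bytes being published             */

void publish(const __m128i *src)
{
    int i;

    for (i = 0; i < 8; i++)
        _mm_stream_si128(&payload[i], src[i]);  /* weakly ordered non-temporal stores */

    _mm_sfence();   /* drain/order the streaming stores before the flag store */
    ready = 1;      /* only now may the consumer read payload                  */
}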

The library uses a customized copy function that assumes the buffers are
properly aligned (also a performance requirement) and then transfers from
one location to the other using the widest available registers (XMM; if
those are not available, it falls back to MMX, and so on).

Let's go into the details and find out what could cause this. By the way,
the library works well on a 4-CPU AMD cluster, but shows this data failure
on a 4-core Intel cluster.



Yes, a load is a read from memory into a processor register, and a store is
a write from a processor register to memory.
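
For example (just an illustrative sketch, not your library's code), with the
SSE2 intrinsics the two directions look like this; _mm_load_si128 compiles
to a load (MOVDQA, memory to register) and _mm_store_si128 to a store
(MOVDQA, register to memory):

#include <emmintrin.h>

void copy_one_xmm(const __m128i *src, __m128i *dst)
{
    __m128i tmp = _mm_load_si128(src);   /* load:  memory -> XMM register */
    _mm_store_si128(dst, tmp);           /* store: XMM register -> memory */
}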


Mark Roddy

Hi Mark,

Thanks for your quick response.

Best regards,
Calin


And here are the implementations of the XMM copy with MFENCE and the XMM
copy with SFENCE. I don't really believe that sfence fixes the data issue,
but nonetheless, please look at both of them; perhaps one of you sees
something wrong. (I know, I should be more positive, but bear with me.)

__inline static void
MOVE128_SSE2_INSTRINSIC_MFENCE(volatile unsigned *src,
                               volatile unsigned *dst)
{
    __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;

    __m128i *src_ptr = (__m128i *)src;
    __m128i *dst_ptr = (__m128i *)dst;

    if (((u_vuaddr_t)src % SSE2_MOVE_ALIGNMENT) ||
        ((u_vuaddr_t)dst % SSE2_MOVE_ALIGNMENT)) {
        MOVE128_SSE2_INSTRINSIC_MFENCE_U(src, dst);
        return;
    }

    /*
     * Prefetch to the L1 cache - loads one cache line of data from the
     * address to a location "closer" to the processor (L1 cache).
     */
    _mm_prefetch((char *)src_ptr, _MM_HINT_T0);

    xmm0 = _mm_load_si128(src_ptr);     /* Move bytes   0 -  15 into XMM reg */
    xmm1 = _mm_load_si128(src_ptr + 1); /* Move bytes  16 -  31 into XMM reg */
    xmm2 = _mm_load_si128(src_ptr + 2); /* Move bytes  32 -  47 into XMM reg */
    xmm3 = _mm_load_si128(src_ptr + 3); /* Move bytes  48 -  63 into XMM reg */

    _mm_prefetch((char *)(src_ptr + 4), _MM_HINT_T0);

    xmm4 = _mm_load_si128(src_ptr + 4); /* Move bytes  64 -  79 into XMM reg */
    xmm5 = _mm_load_si128(src_ptr + 5); /* Move bytes  80 -  95 into XMM reg */
    xmm6 = _mm_load_si128(src_ptr + 6); /* Move bytes  96 - 111 into XMM reg */
    xmm7 = _mm_load_si128(src_ptr + 7); /* Move bytes 112 - 127 into XMM reg */

    _mm_lfence();

    _mm_store_si128(dst_ptr,     xmm0); /* Move bytes   0 -  15 to destination */
    _mm_store_si128(dst_ptr + 1, xmm1); /* Move bytes  16 -  31 to destination */
    _mm_store_si128(dst_ptr + 2, xmm2); /* Move bytes  32 -  47 to destination */
    _mm_store_si128(dst_ptr + 3, xmm3); /* Move bytes  48 -  63 to destination */

    _mm_mfence();

    _mm_store_si128(dst_ptr + 4, xmm4); /* Move bytes  64 -  79 to destination */
    _mm_store_si128(dst_ptr + 5, xmm5); /* Move bytes  80 -  95 to destination */
    _mm_store_si128(dst_ptr + 6, xmm6); /* Move bytes  96 - 111 to destination */
    _mm_store_si128(dst_ptr + 7, xmm7); /* Move bytes 112 - 127 to destination */

    /* Flushing */
    _mm_mfence();
}

__inline static void
MOVE128_SSE2_INSTRINSIC_SFENCE(volatile unsigned *src,
                               volatile unsigned *dst)
{
    __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7;

    __m128i *src_ptr = (__m128i *)src;
    __m128i *dst_ptr = (__m128i *)dst;

    if (((u_vuaddr_t)src % SSE2_MOVE_ALIGNMENT) ||
        ((u_vuaddr_t)dst % SSE2_MOVE_ALIGNMENT)) {
        MOVE128_SSE2_INSTRINSIC_SFENCE_U(src, dst);
        return;
    }

    /*
     * Prefetch to the L1 cache - loads one cache line of data from the
     * address to a location "closer" to the processor (L1 cache).
     */
    _mm_prefetch((char *)src_ptr, _MM_HINT_T0);

    xmm0 = _mm_load_si128(src_ptr);     /* Move bytes   0 -  15 into XMM reg */
    xmm1 = _mm_load_si128(src_ptr + 1); /* Move bytes  16 -  31 into XMM reg */
    xmm2 = _mm_load_si128(src_ptr + 2); /* Move bytes  32 -  47 into XMM reg */
    xmm3 = _mm_load_si128(src_ptr + 3); /* Move bytes  48 -  63 into XMM reg */

    _mm_prefetch((char *)(src_ptr + 4), _MM_HINT_T0);

    xmm4 = _mm_load_si128(src_ptr + 4); /* Move bytes  64 -  79 into XMM reg */
    xmm5 = _mm_load_si128(src_ptr + 5); /* Move bytes  80 -  95 into XMM reg */
    xmm6 = _mm_load_si128(src_ptr + 6); /* Move bytes  96 - 111 into XMM reg */
    xmm7 = _mm_load_si128(src_ptr + 7); /* Move bytes 112 - 127 into XMM reg */

    _mm_lfence();

    _mm_store_si128(dst_ptr,     xmm0); /* Move bytes   0 -  15 to destination */
    _mm_store_si128(dst_ptr + 1, xmm1); /* Move bytes  16 -  31 to destination */
    _mm_store_si128(dst_ptr + 2, xmm2); /* Move bytes  32 -  47 to destination */
    _mm_store_si128(dst_ptr + 3, xmm3); /* Move bytes  48 -  63 to destination */

    _mm_sfence();

    _mm_store_si128(dst_ptr + 4, xmm4); /* Move bytes  64 -  79 to destination */
    _mm_store_si128(dst_ptr + 5, xmm5); /* Move bytes  80 -  95 to destination */
    _mm_store_si128(dst_ptr + 6, xmm6); /* Move bytes  96 - 111 to destination */
    _mm_store_si128(dst_ptr + 7, xmm7); /* Move bytes 112 - 127 to destination */

    /* Flushing */
    _mm_sfence();
}


Maybe it's time to back up and check some basics, if you haven't done so
already.

First, you are presumably using 8 SSE registers. Have you looked at the
generated code to see if you really are using the registers you expect? My
experience is that those intrinsics tend to generate a lot of code that is
unnecessary and that I wouldn't expect. Throw the optimizer into the mix,
and you can get drastic code reordering. So you might not have the loads
and stores happening the way you expect at all.

Why do you bother with the load fence? If you just start storing, I think
the fact that you are using a register with an outstanding load pending will
cause the needed stall. You might be slowing the algorithm down
unnecessarily by placing the load fence there.

I'm also not sure why you are doing two store fences. Perhaps you are
trying to ensure that the data appears in memory in ascending address order?
If so, I'm not sure what it is in your program logic that would or should
require something like that. I can see the requirement for a store fence at
the end of the routine, but I can't see the need for one in the middle of
the routine.

BTW, if this routine is called from an outer routine that loops to transfer
larger items, you would probably be better off putting the mfence (or
whichever fence) instructions there, and letting the inner loop run as fast
as it can.
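
Something like this is what I mean (an untested sketch; move128_no_fence is
a made-up name standing in for your copy routine with the fences taken out):

#include <emmintrin.h>
#include <stddef.h>

/* hypothetical: the 128-byte copy from earlier, minus the fences */
void move128_no_fence(const __m128i *src, __m128i *dst);

void copy_region(const __m128i *src, __m128i *dst, size_t nblocks)
{
    size_t i;

    for (i = 0; i < nblocks; i++)          /* inner loop runs fence-free */
        move128_no_fence(src + i * 8, dst + i * 8);

    _mm_sfence();   /* one fence after the whole region has been written */
}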

Loren


This mail is about basics, really. I don't know for sure what the pipeline
is, or how barriers are supposed to synchronize between CPUs. I will read
more on that until I get a clear picture. So, here is what I know: we were
looking at a performance issue and experimented with different CPUs (AMD
and Intel) on x86 and x64. Some of the functions are strategically placed
to fill a cache line (not sure - aren't cache lines 512 bytes in size?) or
to offer uniform performance. We could have achieved this with some
function pointers, but we already had so many, and we got the performance
we needed just by doing these tricks. It could be that some of them are not
necessary, but I know that Intel does not give the same performance as AMD
unless those specific tricks are used. This is really low-level stuff that
I must admit I am not very familiar with. But aside from these details,
nobody seems to point out a data issue. I kind of expected that.


What exactly is the data corruption you are seeing? Missing data? Data in
the wrong place?

The various processor manuals from both Intel and AMD talk about the
pipelines in the basic-architecture volumes, and in more detail in (I think
it is) the system programming books, usually Volume 2 or so for most
processors.

Cache line sizes vary. They used to be 64 bytes; I doubt it is more than
128 bytes on most processors now, but it could be. You have to check the
manual for the specific chips you are using.
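
If you would rather check at run time than dig through the manuals, something
like this (untested, MSVC-style sketch) reads the CLFLUSH line size from
CPUID leaf 1 (EBX bits 15:8, in units of 8 bytes), which matches the cache
line size on current Intel and AMD parts:

#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4];    /* EAX, EBX, ECX, EDX */

    __cpuid(regs, 1);
    printf("cache line size: %d bytes\n", ((regs[1] >> 8) & 0xff) * 8);
    return 0;
}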

In general you shouldn't need fences in transfers except for special cases.
Since I don't understand exactly what you are doing, this may be one of
those cases. But maybe they aren't required at all, and are just masking
some other problem.

Note that if you copy A to B on a multiprocessor system on CPU X and don't
flush to memory, the memory system will remember that CPU X has that cache
line in a modified state. Then when CPU Y tries to read that cache line, it
will be determined that it needs to get the data from CPU X rather than from
memory. (Or, alternatively, CPU X will have to flush the data to memory and
then CPU Y will read it from memory; it depends on the memory implementation.)

If you know the purpose of the transfer from A to B is to give it to another
CPU, then you should probably do a flush on the results, as that may make it
faster for the other CPU to get to. But it really shouldn't be necessary in
most cases. If the purpose of the move is to hand the data to someone else
and you don't plan on reading it yourself, you should also look at
non-temporal (streaming) stores for the data, as this may be faster.
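
Roughly what I have in mind (an untested sketch, not your library's code):
the loads stay cached, the destination stores stream past the cache, and a
single sfence at the end makes the weakly ordered streaming stores globally
visible before anything signals completion.

#include <emmintrin.h>
#include <stddef.h>

void move_region_nt(const __m128i *src, __m128i *dst, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {            /* count is in 16-byte units */
        __m128i v = _mm_load_si128(src + i); /* normal cached load        */
        _mm_stream_si128(dst + i, v);        /* non-temporal store, bypasses the cache */
    }

    _mm_sfence();   /* required after streaming stores, before handing off */
}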

Loren
