Synchronization for the memory shared between user app and driver

OK, this is a really good article about sharing memory between a user app and a driver: http://www.osronline.com/article.cfm?article=39. Thanks, OSR, for putting it all together! I am using the MmMapLockedPagesSpecifyCache method to share the memory.

Now the challenging part is the synchronization. My data is a ring buffer with a load pointer and an unload pointer. The user-mode app loads data into the buffer and then updates the load pointer; the driver unloads the buffer and updates the unload pointer. The driver code that accesses the shared memory runs at DISPATCH_LEVEL, and our code needs to be safe on multi-core/multi-CPU systems. I understand I could just use a named event to do the job if we could run the driver at PASSIVE_LEVEL. However, that is not an option for me, as one of the key reasons for processing the data in kernel mode is that our code can run at DISPATCH_LEVEL. So is there any way to protect the shared data if the driver is running at DISPATCH_LEVEL?

Well, a second thought I had is: do I really need synchronization in this case? The driver and the user app do not write to the same data buffer, and each writes only its own load/unload pointer. The only possible issue I can think of is when the user-mode app reads the unload pointer (to make sure the buffer is not full) on one CPU while the driver is updating the same unload pointer at the same time on another CPU. Can the x86 CPU/memory controller guarantee this will not be a problem? If so, can I say the code has no synchronization issue even without any synchronization mechanism?

Thanks,
William

A couple of observations here.

First of all, in most cases sharing a buffer is not really a wise idea - it is much better to use an “inverted call” technique (search the OSR archives for more info). It looks like you have arrived at this conclusion yourself. More on that below.

Second, it does not matter whether the event is named or not - the driver accesses it via a PKEVENT, rather than by name. As long as the KEVENT object itself is allocated from non-paged pool, you can access it at DISPATCH_LEVEL. The only thing you cannot do is wait for a non-zero interval at elevated IRQL, which holds true for both named and unnamed synchronization objects.
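To make that concrete, here is a minimal sketch (my own, with hypothetical names) of the usual arrangement: the app hands the driver an event handle in an IOCTL, the driver captures it as a PKEVENT at PASSIVE_LEVEL, and later signals it from a DPC:

#include <ntddk.h>

/* Capture a user-supplied event handle as a PKEVENT. Must run at
   PASSIVE_LEVEL, in the context of the requesting process. Drop the
   reference with ObDereferenceObject() when you are done with it. */
NTSTATUS CaptureUserEvent(HANDLE UserEventHandle, PKEVENT *Event)
{
    return ObReferenceObjectByHandle(UserEventHandle,
                                     EVENT_MODIFY_STATE,
                                     *ExEventObjectType,
                                     UserMode,
                                     (PVOID *)Event,
                                     NULL);
}

/* Signalling from a DPC is fine: the KEVENT lives in nonpaged memory and
   KeSetEvent with Wait == FALSE is legal at DISPATCH_LEVEL. What you cannot
   do at this IRQL is KeWaitForSingleObject with a non-zero timeout. */
VOID SignalCompletionFromDpc(PKEVENT Event)
{
    KeSetEvent(Event, IO_NO_INCREMENT, FALSE);
}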

Third, I just wonder what the following statement means:

[begin quote]

…one of the key reasons for processing the data in kernel mode is that our code can run at DISPATCH_LEVEL.

[end quote]

[begin quote]

Do I really need synchronization for this case: the driver and the user app do not write to the same data buffer, and they write to their own load/unload pointers.

[end quote]

It looks like you have almost arrived at the conclusion that you don’t really need to share a buffer, because, in your particular situation, the whole thing can be handled simply with IOCTLs and “inverted calls”…

Anton Bassov

Very bad design.

Send IOCTLs (maybe overlapped) with data to the driver instead of using
this unsafe thing. If the user app misbehaves, possibly intentionally,
the whole OS will crash.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


xxxxx@sbcglobal.net wrote:

Well, a second thought I had is: do I really need synchronization in this case? The driver and the user app do not write to the same data buffer, and each writes only its own load/unload pointer. The only possible issue I can think of is when the user-mode app reads the unload pointer (to make sure the buffer is not full) on one CPU while the driver is updating the same unload pointer at the same time on another CPU. Can the x86 CPU/memory controller guarantee this will not be a problem? If so, can I say the code has no synchronization issue even without any synchronization mechanism?

Circular buffers can often be designed so there is no need for
synchronization, since each pointer is one-writer/many-reader. The key
point is to ask yourself, “is there ANY point in this code sequence
where the data structure is in an inconsistent state?” If the answer is
“yes”, then you need synchronization. If “no”, then you don’t.

Consider the case you talked about, where one side reads the IN pointer
at the same time the other side writes it. What’s the risk? It will
either read the old value or the new value. If it reads the old value,
it won’t see the new data, but will check again later.

As long as it can’t read a half-updated value, everything is fine.
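For illustration only, here is a minimal one-writer/one-reader sketch along these lines; the names, the power-of-two size, and the user-mode barrier call are my assumptions, not anything from this thread:

#include <windows.h>

#define RING_SIZE 1024                    /* power of two, so indices can wrap freely */

typedef struct _RING {
    volatile ULONG Load;                  /* written only by the user-mode producer   */
    volatile ULONG Unload;                /* written only by the kernel-mode consumer */
    UCHAR Data[RING_SIZE];
} RING;

/* Producer side (user app): write the payload first, publish the index last. */
BOOL RingPut(RING *Ring, UCHAR Byte)
{
    ULONG load = Ring->Load;
    ULONG unload = Ring->Unload;          /* may be slightly stale; that is harmless  */

    if (load - unload >= RING_SIZE)
        return FALSE;                     /* looks full; caller retries later         */

    Ring->Data[load % RING_SIZE] = Byte;
    MemoryBarrier();                      /* payload must be visible before the index */
    Ring->Load = load + 1;                /* single aligned 32-bit store              */
    return TRUE;
}

The consumer side is the mirror image: it reads Load and writes only Unload. Since each index has exactly one writer, and every access to an index is a single aligned 32-bit operation, the other side can only ever observe the old value or the new one.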


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Maxim S. Shatskih wrote:

Very bad design.

… unless you’re smart…

Of course, you’re not going to do anything silly like actually have a
ring-buffer of pointers (either user or kernel mode)… instead, you’re
going to have a ring buffer of descriptors that your driver supplied and
can validate appropriately, where each descriptor references some memory
that can be copied in the kernel, and the mapping information between
descriptors and pointers is supplied and validated by the driver…

… unless you’re *very* smart, and have designed your hardware in an
appropriately “virtual” fashion and with a lot of care with respect to
security holes …

… in which case, no kernel access to the memory will be required
except mapping it, and you can then do zero copy DMA operations to user
mode.
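A minimal sketch of the descriptor-ring idea above; the structure layout, names, and limits are purely illustrative:

#include <ntddk.h>

/* User mode only ever fills in offset/length pairs that refer into a region
   the driver itself created and mapped; it never passes raw pointers. */
typedef struct _XFER_DESC {
    ULONG Offset;                 /* offset into the shared region */
    ULONG Length;                 /* bytes the driver should copy  */
} XFER_DESC;

/* The driver validates every descriptor before touching anything, so a
   misbehaving (or hostile) app cannot make the kernel read or write memory
   outside the region the driver handed out. */
BOOLEAN ValidateDesc(const XFER_DESC *Desc, ULONG RegionSize)
{
    if (Desc->Length == 0 || Desc->Length > RegionSize)
        return FALSE;
    if (Desc->Offset > RegionSize - Desc->Length)    /* also catches overflow */
        return FALSE;
    return TRUE;
}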

MH.

Out of curiosity, what is the data rate that you are dealing with? In my
experience a shared memory approach makes sense only at very high data rates
(hundreds of MB/s). Otherwise, direct i/o combined with overlapped i/o is
extremely efficient.
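As a rough user-mode sketch of that combination (the device name and IOCTL code below are invented for illustration, and error handling is trimmed):

#include <windows.h>
#include <winioctl.h>

/* The driver side would define the same code with METHOD_OUT_DIRECT, so the
   I/O manager locks and maps the caller's buffer on its behalf. */
#define IOCTL_MYDEV_GET_DATA \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

int main(void)
{
    HANDLE dev = CreateFileW(L"\\\\.\\MyDevice", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE)
        return 1;

    static BYTE buffer[64 * 1024];
    OVERLAPPED ov = { 0 };
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

    /* Post the request; the driver pends it and completes it when data is
       available - the "inverted call" pattern mentioned earlier. Several of
       these can be kept in flight at once. */
    if (!DeviceIoControl(dev, IOCTL_MYDEV_GET_DATA, NULL, 0,
                         buffer, sizeof(buffer), NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
        return 1;

    DWORD bytes = 0;
    GetOverlappedResult(dev, &ov, &bytes, TRUE);     /* wait for completion */

    CloseHandle(ov.hEvent);
    CloseHandle(dev);
    return 0;
}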

Have a nice day
GV


Gianluca Varenni, Windows DDK MVP

CACE Technologies
http://www.cacetech.com

> except mapping it, and you can then do zero copy DMA operations to user
> mode.

The same can be done with IOCTLs and METHOD_xxx_DIRECT


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Actually, I suspect it is even higher these days. Ten years ago I did
some experiments and found the break-even point was in the 50MB/sec range;
processor and memory speeds are now so much higher that I suspect we are
approaching the 1GB/sec threshold. And that assumes you are doing things
really well: most implementations that try to share memory mess it up, and in
many cases end up slower than the standard techniques.


Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Remove StopSpam to reply


The ring-buffer scenario makes sense for other reasons, but nothing the OP
has written points to any of the ones I can think of.


Mark Roddy

Thanks for all the replies!

Yes, I understand the standard IOCTL method can do the same job and is less error-prone. The main concern is the performance, as the data access rate can be as high as 10K per second when running the maximum number of instances of our data channels. I am a little bit worried about the OS overhead of using IOCTLs. Well, I could probably reduce the data access rate here, but the MmMapLockedPagesSpecifyCache way seems to be the most effective from a performance standpoint. Yes, the user app has to be written very carefully; any mistake in writing the mapped kernel memory will crash the system. Tim’s input makes me worry much less about the synchronization, but back to one of my questions about Tim’s statement:

[Quote]

As long as it can’t read a half-updated value, everything is fine.

[End quote]

Can this really happen on an x86 multi-CPU system, assuming the data is 32-bit and a 32-bit CPU/OS?

Ten years ago I could easily do 4000 times that amount on a 200MHz Pentium
with what is, by today's standards, slow memory. Doing anything other than
simple calls is really a bad idea here.


Don Burn (MVP, Windows DDK)
Windows 2k/XP/2k3 Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Remove StopSpam to reply


Thanks Gianluca and Don. The maximum data rate is about 20MB/s. Based on your previous experience, maybe it is not that bad to use METHOD_xxx_DIRECT/METHOD_xxx_NEITHER with overlapped I/O.

> Can this really happen on an x86 multi-CPU system, assuming the data is
> 32-bit and a 32-bit CPU/OS?

Yes, if the 32-bit shared memory variable is not aligned on a 32-bit boundary.
As long as you don’t force the packing of your structures (e.g. with #pragma
pack(1)) and let the compiler align the fields of structures with their
“natural” alignment, it shouldn’t happen.
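A minimal illustration of that point (the struct name is hypothetical): with natural alignment each index sits on a 4-byte boundary, and the compile-time assert documents the assumption:

#include <windows.h>

typedef struct _SHARED_HEADER {
    ULONG LoadIndex;          /* naturally 4-byte aligned: a 32-bit read or */
    ULONG UnloadIndex;        /* write of it cannot be torn on x86          */
    UCHAR Data[1];            /* payload follows the header                 */
} SHARED_HEADER;

/* Fails to compile if someone later wraps this in #pragma pack(1) and breaks
   the alignment the lock-free scheme relies on. */
C_ASSERT(FIELD_OFFSET(SHARED_HEADER, UnloadIndex) % sizeof(ULONG) == 0);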

Have a nice day
GV


>as high as 10K per second

10 kilobytes per second?

This is tiny traffic. Even 10MB/s is easily sustainable on a
Pentium-3-generation machine, if not older.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> Thanks Gianluca and Don. The maximum data rate is about 20MB/s.

Less than what a 10GB hard drive from 1999 could produce.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

xxxxx@sbcglobal.net wrote:

Thanks for all the replies!

Yes, I understand the standard IOCTL method can do the same job and is less error-prone. The main concern is the performance, as the data access rate can be as high as 10K per second when running the maximum number of instances of our data channels.

Do you mean 10k bytes per second, or 10k ioctls per second? 10k bytes
per second is completely trivial, and you’ve already spent more time
worrying about optimization than it is worth.

Tim’s input makes me worry much less about the synchronization, but back to one of my questions about Tim’s statement:

[Quote]

As long as it can’t read a half-updated value, everything is fine.

[End quote]

Can this really happen on an x86 multi-CPU system, assuming the data is 32-bit and a 32-bit CPU/OS?

It depends on your design more than the CPU architecture. Take, for
example, the following theoretical design for a circular buffer:

m_AvailableBytes += newData;
m_In += newData;
if( m_In >= m_Limit ) {
    // We wrapped.
    m_In -= m_BufferSize;
}

This contains several places where the structure is left inconsistent.
After the first statement, we’re advertising more available bytes than
the buffer pointers actually have. After the second statement, if there
was a wraparound, we’ve left the “in” pointer pointing into outer space.

Just remember that you can be interrupted at absolutely any point. You
have to ask yourself at each and every point, “what happens if another
consumer tries to access the buffer here?”
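For contrast, a minimal sketch (my own, with hypothetical names) of the same update done so the structure is consistent at every point: all the arithmetic happens in locals, the byte count is derived from the indices rather than stored separately, and the only shared write is one aligned 32-bit store:

#include <windows.h>

/* The available-byte count is computed from the two indices on demand instead
   of being kept as a third shared variable that could get out of step. */
void ProducerAdvance(volatile ULONG *In, ULONG newData, ULONG bufferSize)
{
    ULONG in = *In + newData;     /* private copy: no one else can see it yet */
    if (in >= bufferSize)
        in -= bufferSize;         /* wrap while the value is still private    */
    *In = in;                     /* single publishing store: a reader sees
                                     the old value or the new one, never a
                                     half-updated state                       */
}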


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@sbcglobal.net wrote:

Thanks Gianluca and Don. The maximum data rate is about 20MB/s. Based on your previous experience, maybe it is not that bad to use METHOD_xxx_DIRECT/METHOD_xxx_NEITHER with overlapped I/O.

Yes. Do the math. Even if you used METHOD_BUFFERED (which I’m not sure
I would), the overhead of the extra copy would only be 0.2% of a 3 GHz
processor. It’s easy to forget how blinking fast CPUs are these days.

Forget about METHOD_NEITHER. The circumstances where it makes sense are
so specific and narrow that the typical driver writer never needs to use it.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Forget about METHOD_NEITHER. The circumstances where it makes sense are
> so specific and narrow that the typical driver writer never needs to use it.

Actually, neither I/O for read/write is intended more for FSDs, to allow easy
passing of Irp->UserBuffer to CcCopyRead/Write, which will do the proper
validation (actually it will not: it will page-fault on a bad Irp->UserBuffer
and raise an exception to be caught by the FSD).

As for METHOD_NEITHER in IOCTLs - the only ones I know of are
FSCTL_GET_VOLUME_BITMAP and FSCTL_GET_RETRIEVAL_POINTERS.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> Circular buffers can often be designed so there is no need for
> synchronization, since each pointer is one-writer/many-reader.

Yes, but readers still have to signal to the writer that it can write to the buffer - otherwise, the writer may write data to the buffer while someone is trying to read it. Therefore, you still need some synchronization construct here (in this particular case an ERESOURCE is the best option). If you mean updating a pointer into the circular buffer… well, then the readers have to synchronize it among themselves, because, as far as the position pointer is concerned, they are writers.
The only situation where you may avoid synchronization is when you have one reader and one writer - the reader reads data and then updates a pointer, and the writer writes at the position indicated by the pointer…

Anton Bassov