The good old Sharing Memory between Driver and User, with a twist

I have read the article and associated code (OsrMemDrv) and I believe that I understand much of it. I have to do something slightly different in that
I have to share PCI memory with the user space. After calling
devContext->m_MemoryAddress0 = (PUCHAR) MmMapIoSpace( resourceTrans->u.Memory.Start, resourceTrans->u.Memory.Length, MmNonCached);
what do I need to do to map that to user space? OsrMemDrv uses MmMapLockedPagesSpecifyCache, but I believe that I don’t want to use MDLs in a KMDF
driver, correct? Is there an equivalent call for KMDF? There is WdfMemoryCreatePreallocated, but I’m not sure that does what I want. Any advice?

Thanks again,
Michael

> I have to do something slightly different in that I have to share PCI memory with the user space.

Why do you think you need something like that, in the first place???

I can accept sharing a buffer, but allowing UM access to device’s memory-mapped registers…ugh, dodgy, don’t you think…

Anton Bassov

Michael Wade wrote:

> I have read the article and associated code (OsrMemDrv) and I believe that I understand much of it. I have to do something slightly different in that
> I have to share PCI memory with the user space. After calling
> devContext->m_MemoryAddress0 = (PUCHAR) MmMapIoSpace( resourceTrans->u.Memory.Start, resourceTrans->u.Memory.Length, MmNonCached);
> what do I need to do to map that to user space?

The FIRST thing you need to do is sit back and consider why you are
doing this. Besides the security risk, this makes it impossible for you
to control shared access to the device resources. In virtually every
device I’ve ever worked on, we threw out the “map to user mode” approach
in favor of a “simple ioctl access” approach, and I have never been sorry.

> OsrMemDrv uses MmMapLockedPagesSpecifyCache, but I believe that I don’t want to use MDLs in a KMDF
> driver, correct?

No, MDLs still work just fine. KMDF is built on top of WDM, which is
still quite present, but it doesn’t cover every eventuality. KMDF makes
things easier, but when you run out of KMDF pavement, you can still
drive on WDM gravel.

Gosh, that’s a pretty cute metaphor.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

We build very custom hardware that only talks to our software in systems that we build. This driver will never ever be used by anyone but us. There
are many extremely time-critical things that need to be done, which makes it very desirable for the application program to be able to talk to our PCI card
directly, rather than going through many ring transitions. There are many calculations that need to be done at 100Hz based on data from the PCI card
and then sent back to the PCI card. When we tried doing either Read/Write or Ioctl the overhead just killed us. So, yes, while this is not the
normal way that one would handle this, it has been decided that this is the best way to approach the problem.

So given that we’ve mapped the appropriate BAR, using MmMapIoSpace, how does one create an MDL with the returned pointer and length? And do we need
to call ZwMapViewOfSection?

Thanks again,
Michael


Michael Wade wrote:

> … When we tried doing either Read/Write or Ioctl the overhead just killed us. So, yes, while this is not the normal way that one would handle this, it has been decided that this is the best way to approach the problem.

Well, it’s neither here nor there, but that suggests you were “doing it
wrong”. Overhead is not usually a concern for a thoughtfully-designed
driver API. That’s a debate for another day.

> So given that we’ve mapped the appropriate BAR, using MmMapIoSpace, how does one create an MDL with the returned pointer and length? And do we need to call ZwMapViewOfSection?

There’s actually an old knowledge base article about this:
http://support.microsoft.com/kb/189327

MmMapIoSpace, then IoAllocateMdl, then MmBuildMdlForNonPagedPool, then
MmMapLockedPages.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> When we tried doing either Read/Write or Ioctl the overhead just killed us.

What about CPU starvation??? It never occurred to you to think that the target thread, probably, just did not have a chance to process data or send it to the device because it did not have a chance to get to the CPU???

Anton Bassov

No. The main target thread, as you call it, does an ioctl that doesn’t complete until an interrupt from the card is received. The process is running
at REALTIME_PRIORITY_CLASS and the main thread at tpTimeCritical. I believe that this is the highest priority that a user thread can run at. We have
done much analysis of interrupt response time, ring transition overhead, PCI bus overhead/throughput, etc. There is a very large amount of data that
needs to be exchanged between the hardware and the software in a very short period of time. Transferring the data back and forth to the kernel is not
necessary, does not really give us any benefit and complicates the design. Windows is most definitely not a real time operating system. It is,
however, the most popular OS right now, and the most familiar to our customers. So when the decision was made to go to Windows (back in Win NT 3.5) we
knew that we had to work within the constraints of a non-real time OS as best we could. As I said, this device is never communicated with by any
other user, so I don’t really see the benefit of implementing ioctl calls to get/set memory on the PCI card. Whether drivers should or should not
share memory with user space is a very old argument, and I believe it depends on the specifics of the situation. Just my two cents.

Thanks,
Michael


On Tue, 14 Dec 2010 15:56:14 -0800, Tim Roberts wrote:

>Michael Wade wrote:
>> … When we tried doing either Read/Write or Ioctl the overhead just killed us. So, yes, while this is not the normal way that one would handle this, it has been decided that this is the best way to approach the problem.
>
>Well, it’s neither here nor there, but that suggests you were “doing it
>wrong”. Overhead is not usually a concern for a thoughtfully-designed
>driver API. That’s a debate for another day.

Absolutely. I’m really not that interested in arguing the point. I know our system, we did extensive throughput analysis, optimization, response
time, etc. To us, this particular situation is far from usual.

>
>> So given that we’ve mapped the appropriate BAR, using MmMapIoSpace, how does one create an MDL with the returned pointer and length? And do we need to call ZwMapViewOfSection?
>
>There’s actually an old knowledge base article about this:
>http://support.microsoft.com/kb/189327

Thanks!
Michael

>
>MmMapIoSpace, then IoAllocateMdl, then MmBuildMdlForNonPagedPool, then
>MmMapLockedPages.

> There is a very large amount of data that needs to be exchanged between the hardware and
> the software in a very short period of time. Transferring the data back and forth to the kernel
> is not necessary,

> Whether drivers should or should not share memory with user space is a very old argument,
> and I believe it depends on the specifics of the situation.

Hold on - are you sharing memory buffer or memory-mapped BARs??? The former approach may be, indeed, quite useful in some situations (for example, if you want to move large amounts of data within a short timeframe). However, the way you presented your question suggests the latter, and this is what we are speaking about…

In any case, I am going to help you - even if you want to make your pig fly, I will tell you how to attach wings to it.

BarAddress = MmMapIoSpace(Phys, Length, MmNonCached);

PMdl = IoAllocateMdl(BarAddress, Length, FALSE, FALSE, NULL);

MmBuildMdlForNonPagedPool(PMdl);

The above lines may run in the context of any process, but the final one below is different:

MmMapLockedPagesSpecifyCache(PMdl, UserMode, MmNonCached, NULL, FALSE, NormalPagePriority);

This one should be invoked only in context of a process that will actually access mapped memory, because
you specify UserMode, so that the address will be valid only in context of a calling process.
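
For reference, a slightly fuller sketch of the same sequence. The BAR_USER_MAPPING structure and the helper names MapBarToUser and UnmapBarFromUser are hypothetical and not from this thread; the DDK calls are the ones listed above. Two details the one-liners gloss over: with UserMode, MmMapLockedPagesSpecifyCache reports failure by raising an exception, so the call wants a __try/__except around it, and the user mapping should be torn down in the same process context before the MDL and the I/O-space mapping are released.

```c
#include <ntddk.h>

/* Hypothetical bookkeeping for one mapped BAR (names are illustrative). */
typedef struct _BAR_USER_MAPPING {
    PVOID  KernelVa;  /* returned by MmMapIoSpace */
    SIZE_T Length;    /* size of the BAR in bytes */
    PMDL   Mdl;       /* describes the pages behind KernelVa */
    PVOID  UserVa;    /* valid only in the process that created the mapping */
} BAR_USER_MAPPING, *PBAR_USER_MAPPING;

/* Undo whatever MapBarToUser managed to set up.
 * Must run in the same process context that owns UserVa. */
VOID UnmapBarFromUser(PBAR_USER_MAPPING Map)
{
    if (Map->UserVa != NULL) {
        MmUnmapLockedPages(Map->UserVa, Map->Mdl);
        Map->UserVa = NULL;
    }
    if (Map->Mdl != NULL) {
        IoFreeMdl(Map->Mdl);
        Map->Mdl = NULL;
    }
    if (Map->KernelVa != NULL) {
        MmUnmapIoSpace(Map->KernelVa, Map->Length);
        Map->KernelVa = NULL;
    }
}

/* Must be called at IRQL <= APC_LEVEL, in the context of the process
 * that will actually touch the device memory through UserVa. */
NTSTATUS MapBarToUser(PHYSICAL_ADDRESS Phys, SIZE_T Length, PBAR_USER_MAPPING Map)
{
    RtlZeroMemory(Map, sizeof(*Map));
    Map->Length = Length;

    Map->KernelVa = MmMapIoSpace(Phys, Length, MmNonCached);
    if (Map->KernelVa == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    Map->Mdl = IoAllocateMdl(Map->KernelVa, (ULONG)Length, FALSE, FALSE, NULL);
    if (Map->Mdl == NULL) {
        UnmapBarFromUser(Map);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    MmBuildMdlForNonPagedPool(Map->Mdl);

    /* With UserMode, a failed mapping raises an exception instead of returning NULL. */
    __try {
        Map->UserVa = MmMapLockedPagesSpecifyCache(Map->Mdl, UserMode, MmNonCached,
                                                   NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        Map->UserVa = NULL;
    }

    if (Map->UserVa == NULL) {
        UnmapBarFromUser(Map);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    return STATUS_SUCCESS;
}
```

MapBarToUser would typically be called from a handler that is guaranteed to run in the requesting process (in KMDF, an EvtIoInCallerContext callback is one such place), with UnmapBarFromUser called from that same process before its handle goes away.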

That’s all. May your pig enjoy a flight…

Anton Bassov

We have some legacy (circa 1996) proprietary products that map memory on our PCI card into a Service, and these pigs fly quite nicely, thank you very much. I have added some lipstick to our pigs by updating the driver to KMDF, but the service code is not allowed to change.

I don’t know if this is correct but it works for us. In a nutshell we do the following:

Service requests mapping
Get a handle to physical memory using ZwOpenSection()
Get a pointer to physical memory object using ObReferenceObjectByHandle()
Get the start of your PCI memory from the BAR
Calculate the length
Map a virtual view using ZwMapViewOfSection()
Pass the virtual view pointer back to the Service
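
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the helper name MapPhysicalToCurrentProcess is hypothetical, the physical address and length are assumed to come from the translated BAR resource, and the ObReferenceObjectByHandle step is left out because the sketch only needs the section handle. The PAGE_NOCACHE flag follows the suggestion made later in this thread for Device\PhysicalMemory.

```c
#include <ntddk.h>

/* Hypothetical helper: maps Length bytes of physical address space starting
 * at Phys into the current process and returns the user-mode pointer.
 * Call it in the context of the service that requested the mapping. */
NTSTATUS MapPhysicalToCurrentProcess(PHYSICAL_ADDRESS Phys, SIZE_T Length, PVOID *UserVa)
{
    UNICODE_STRING name = RTL_CONSTANT_STRING(L"\\Device\\PhysicalMemory");
    OBJECT_ATTRIBUTES oa;
    HANDLE section = NULL;
    PVOID base = NULL;          /* NULL lets the system choose the user address */
    LARGE_INTEGER offset;
    SIZE_T viewSize = Length;
    NTSTATUS status;

    *UserVa = NULL;

    InitializeObjectAttributes(&oa, &name,
                               OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE,
                               NULL, NULL);

    /* Get a handle to the physical memory section. */
    status = ZwOpenSection(&section, SECTION_ALL_ACCESS, &oa);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    /* Map a view of the BAR into the calling process, uncached. */
    offset.QuadPart = Phys.QuadPart;
    status = ZwMapViewOfSection(section, ZwCurrentProcess(), &base,
                                0, viewSize, &offset, &viewSize,
                                ViewShare, 0, PAGE_READWRITE | PAGE_NOCACHE);
    if (NT_SUCCESS(status)) {
        /* The system may round the section offset down to an allocation
         * boundary; add the difference back so the pointer lands on Phys. */
        *UserVa = (PUCHAR)base + (ULONG_PTR)(Phys.QuadPart - offset.QuadPart);
    }

    /* The view stays mapped after the section handle is closed. */
    ZwClose(section);
    return status;
}
```

To undo the mapping, the unadjusted base address returned by ZwMapViewOfSection is the one to pass to ZwUnmapViewOfSection, so a real driver would want to remember both pointers.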

Larry C

> We have some legacy (circa 1996) proprietary products that map memory on our PCI card into a
> Service and these pigs fly quite nicely thank you very much.

How many pig flights have been observed so far? Just a couple of observations:

  1. ZwMapViewOfSection() will map memory as write-back cacheable; it does not seem to take a CacheType argument, does it? If you map BARs into the kernel address space as noncacheable and into userland as WB,
    it may well happen that the machine hangs for no obvious reason as a result of a cache conflict.

  2. If you map device BARs as cacheable you may be getting “funny” results from time to time.

Anton Bassov

Even though we are not part of the aviation/aerospace division, we probably have anywhere from 3000+ (my guess but could be much higher) pigs in flight at any given moment since 1996 and have not seen either of the problems you mention.

LarryC

>decision was made to go to Windows (back in Win NT 3.5)

Well, performance has changed a lot since those old times.

A modern computer and OS will probably run IOCTLs faster, so they are no longer the overhead they were in NT 3 times.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> Even though we are not part of the aviation/aerospace division, we probably have anywhere
> from 3000+ (my guess but could be much higher) pigs in flight at any given moment since 1996
> and have not seen either of the problems you mention.

Perhaps no one has told you about these problems? After all, if a Windows machine suddenly hangs once,
the user will simply reboot it. Certainly, if the problem persists they may get upset and start finding out what’s going on, but this is the kind of problem that does not necessarily arise on every other run…

Anton Bassov

I wonder: have you ever heard of something called a “scheduler”? Expecting
a user-level program to be able to access a device without delays that can
be tens or hundreds of milliseconds is a fictitious hope. Your thread can
be preempted by any number of kernel threads, or by other user-level threads
(see: Balance Set Manager) for arbitrary periods. Devices which need fast
response usually have to have interrupt-driven response times, and you have
to manage buffering to allow for arbitrary delays from user space.

One way to manage this is large onboard buffers on the device. Another
method is to put large numbers of driver buffers in a list or other
construct, and keep data in the kernel. A third method is to always open
the device in async mode, and do lots of I/O calls in sequence (I once had a
high-speed device where I needed to put 50 async ReadFiles down into the
kernel; when we had only 40, we lost data due to the performance issues;
with 50 we never had a data loss). But I had a question (in of all places,
the MFC newsgroup I hang out in) which asked “I need to poke the parallel
port every 100us, and I can’t find a timer with resolution less than 1ms,
such as SetTimer” (not realizing SetTimer has integer multiples of 15ms
resolution, no matter what value is given). The presumption that it was
possible to even realize 100us resolution in user space was clearly a
failure in understanding how application programs work.

And saying “Well, our code is the only code running, so this
should not be a problem” doesn’t work, because there are kernel threads which
can preempt you, and if you manage to get a process running in high-priority
mode and run a thread at level 31, you can completely shut down features
like the entire file system, generally not considered the best of all
possible ideas.

Mapping the device to user space does NOT guarantee responsiveness, and in
fact generally guarantees the opposite. People I know have measured
interrupt-to-run time of applications as being on the average 250ms and
peaking at 450ms, because even if you do something like IoCompleteRequest
with a huge priority boost, all you are doing is asking the scheduler to run
the thread at some indeterminate time in the future, at its convenience.

So you need to seriously reconsider this idea. It is almost always going to
be fatal, and it sounds like your assumptions are not supportable.
joe


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer


This message has been scanned for viruses and dangerous content by
MailScanner, and is believed to be clean.

Joseph M. Newcomer wrote:

> I wonder: have you ever heard of something called a “scheduler”? Expecting
> a user-level program to be able to access a device without delays that can
> be tens or hundreds of milliseconds is a fictitious hope. Your thread can
> be preempted by any number of kernel threads, or by other user-level threads
> (see: Balance Set Manager) for arbitrary periods. Devices which need fast
> response usually have to have interrupt-driven response times, and you have
> to manage buffering to allow for arbitrary delays from user space.

He did say this was custom hardware on systems they build. Given that,
the hope is not fictitious. As long as one uses a disk that doesn’t do
“thermal recalibration”, and eliminates the audio card, you can get
latency that is quite reproducible.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

But it sounds like you are stuck in a synchronous I/O model. You have to
consider using async I/O, and pumping LOTS of I/O requests down, because
IoStartNextPacket (or its KMDF equivalent) is actually very fast.

Even TIME_CRITICAL threads have response time issues, and can be preempted
by file system threads. In addition, the Balance Set Manager can promote a
lower-priority thread to TIME_CRITICAL and give it a double timeslice if it
feels like it, so you are competing with an unknown and unknowable set of
threads. File system threads and other kernel threads are likely to preempt
you, and starting in Vista, there is something called the Multimedia
Scheduler Service which allows you to have user threads running up to
priority 26. You can take advantage of this service to get higher
priorities for your app threads, or you can be a victim of it.

In cases like this, I use async I/O and tend to use one or more threads
connected to an I/O Completion Port to handle the responses, but that’s just
my preferred application architecture. For high-performance devices,
synchronous I/O is a dead-end design; I’ve seen it fail consistently, and
therefore I don’t touch it or even acknowledge its existence in such cases.

In using async I/O for high-performance devices, I usually parameterize the
number of I/O operations I’m going to queue up, so I can tune the
performance once the app is running. In one app, 40 failed, and 50 worked.
In another app, 8 failed and 10 worked. Which is why it should be a tunable
parameter.
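
A minimal user-mode sketch of that pattern, under assumptions: the device name \\.\MyPciDevice, the 4096-byte buffer size and the PENDING_READ structure are hypothetical, and the queue depth is taken from the command line so it can be tuned without rebuilding. It keeps a fixed number of overlapped ReadFile calls outstanding and reposts each buffer as soon as its completion is pulled off an I/O Completion Port.

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 4096   /* hypothetical transfer size */

/* One outstanding read: the OVERLAPPED plus the buffer it fills. */
typedef struct {
    OVERLAPPED    ov;
    unsigned char data[BUF_SIZE];
} PENDING_READ;

/* Post (or repost) one overlapped read. Returns FALSE on a real failure. */
static BOOL PostRead(HANDLE dev, PENDING_READ *r)
{
    ZeroMemory(&r->ov, sizeof(r->ov));
    if (ReadFile(dev, r->data, BUF_SIZE, NULL, &r->ov)) {
        return TRUE;    /* finished at once; a completion packet is still queued */
    }
    return GetLastError() == ERROR_IO_PENDING;
}

int main(int argc, char **argv)
{
    int depth = (argc > 1) ? atoi(argv[1]) : 50;   /* tunable queue depth */
    int i;
    HANDLE dev, port;
    PENDING_READ *reads;

    if (depth <= 0) {
        depth = 50;
    }

    /* Hypothetical device name; FILE_FLAG_OVERLAPPED selects async I/O. */
    dev = CreateFileA("\\\\.\\MyPciDevice", GENERIC_READ, 0, NULL,
                      OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "open failed: %lu\n", GetLastError());
        return 1;
    }

    port = CreateIoCompletionPort(dev, NULL, 0, 0);
    reads = (PENDING_READ *)calloc((size_t)depth, sizeof(*reads));
    if (port == NULL || reads == NULL) {
        return 1;
    }

    /* Push 'depth' reads down to the driver before waiting for any of them. */
    for (i = 0; i < depth; i++) {
        if (!PostRead(dev, &reads[i])) {
            return 1;
        }
    }

    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED *ov = NULL;
        PENDING_READ *r;

        if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE)) {
            break;      /* failed I/O or port error */
        }
        r = CONTAINING_RECORD(ov, PENDING_READ, ov);

        /* ... process r->data[0 .. bytes) here ... */

        if (!PostRead(dev, r)) {
            break;      /* keep the queue full by reposting the same buffer */
        }
    }

    CloseHandle(port);
    CloseHandle(dev);
    free(reads);
    return 0;
}
```

Running it with different depth arguments (for example 40 versus 50) is the kind of tuning described above; more threads can be attached to the same completion port if one is not enough.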

Key here is that if you have not tried the async I/O approach, or considered
any of the other optimization strategies I suggested, you should not be
redesigning the interface to let apps touch registers on the device.
joe


> Probably no one just told you about these problems? After all, if Windows
> machine hangs once in a sudden
> the user will simply reboot it. Certainly if the problem persists they may get
> upset and start finding out what’s going on, but this is the kind of a problem
> that does not necessarily arise on every other run…

The expectation of our customers is that these machines should never reboot.
If one of our boxes even burps/farts/slows, our technical assistance people get a call. The only reboot allowed is for updates, and since these boxes are not on the internet, updates are very few, with months (for newer machines) or years (for older machines) between reboots.

Larry C

OP:

We still have and support products on NT3.5x and Win2000 that run 24/7 without reboots for years. We have newer OS versions of the same product, but the application software will never change even though the driver/card must support the newer versions of the PCI bus.

Larry C

ZwMapViewOfSection takes page protection as a parameter, so you can specify PAGE_NOCACHE there. (This works only for Device\PhysicalMemory; for regular sections cache attributes must be specified when the section is created).
