Since you are using such tiny buffers (2MB), the ideal solution would have
been to use direct mode I/O and have your device transfer everything
directly into user space. But that does require having scatter/gather
capability (hint: when designing real devices to run in real operating
systems, having “unlimited scatter gather” as part of your basic design is a
Really Good Idea). But given you don’t have that, the common buffer
approach with the horrendous copy is probably the best you are going to be
able to do. So what you seem to have is a set of N buffers, and you cycle
through them, if I’ve understood this correctly. Then, it sounds like a
single image can be 2GB, which means you are going to have to transfer 2GB
in 2MB chunks, that is, 1024 transfers per frame time. So one of my
concerns is a standard back-of-the-envelope computation to show if the data
rate can be sustained, no matter how “efficient” PCIe might be. If we
assume the buffers are DWORD-aligned, transferring a 2MB buffer requires
512K DWORD fetches and 512K DWORD stores (memcpy does a MOVSD for the main
body), and that’s starting to push the performance envelope. But you have
to do 1024 of these per frame time, which I think exceeds what most systems
could hope to accomplish.
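To make that concrete, here is the arithmetic with a purely hypothetical
frame rate (the real numbers are what I ask for below):

    2GB/frame at 2MB/buffer          = 1024 buffer-fills per frame
    assume (hypothetically) 10 frames/sec:
        DMA into the common buffers  = 20 GB/s written to host memory
        memcpy to the user buffer    = 20 GB/s read + 20 GB/s written
        total memory traffic         = ~60 GB/s
        interrupt rate               = 10,240 interrupts/sec
        time per 2MB buffer-fill     = ~100 microseconds

Even at a modest frame rate, the staging copy triples the memory traffic,
and the copy has well under a millisecond to service each buffer.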
With four 2MB buffers, you can read 8MB before you have to reuse one of the
common buffers, so you need to make sure that the copy can complete before
you have to recycle the first buffer of the set. If you are doing this copy
in a passive thread, you are at the mercy of the thread scheduler. This
means there is *substantial* latency between the time you determine the
thread should run and when it actually *does* run; if this latency is more
than three buffer-fills, you’re dead. You can’t assume zero latency to
activate a thread (I’ve seen people in user-level apps make this assumption,
and wonder why their projects fail to work as defined). You can’t assume
that thread runs 100% of the time, because interrupts and DPCs will preempt
it, so you can’t assume that even after it starts, it completes its action
in a predictable amount of time.
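For concreteness, the passive-thread pattern we are discussing looks
roughly like this (a minimal sketch; the DEVICE_EXTENSION fields and
constants are invented for illustration):

    /* Copy thread, started once via PsCreateSystemThread. BufferFullEvent
       is a SynchronizationEvent signaled by the DPC. */
    VOID CopyThread(PVOID Context)
    {
        PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)Context;

        /* Raising priority reduces, but does not eliminate, the
           scheduling latency described above. */
        KeSetPriorityThread(KeGetCurrentThread(), LOW_REALTIME_PRIORITY);

        for (;;) {
            /* Sleep until the DPC says a common buffer is full. */
            KeWaitForSingleObject(&ext->BufferFullEvent, Executive,
                                  KernelMode, FALSE, NULL);
            if (ext->StopThread)
                break;

            /* Even after the thread wakes, interrupts and DPCs can
               preempt this copy, so its completion time is not
               predictable. (Real code must also advance the
               destination pointer per chunk.) */
            RtlCopyMemory(ext->DestinationVa,
                          ext->CommonBufferVa[ext->NextBuffer],
                          COMMON_BUFFER_SIZE);
            ext->NextBuffer = (ext->NextBuffer + 1) % NUM_COMMON_BUFFERS;
        }
        PsTerminateSystemThread(STATUS_SUCCESS);
    }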
This is why most devices with high bandwidth requirements have onboard
memory and scatter/gather DMA, and even then, you’re pushing the envelope
with a 2GB frame. But with intermediate staging and explicit memcpy, you
double or triple the number of memory accesses required over straight
smart-DMA. Or maybe worse. Before you start trying to make this work, you
should work out the math to see if it can ever be made to work. I once
participated in evaluating a project whose required bandwidth was something
like six times the available I/O bandwidth (what bothered ME was the group
had defined complex communication packets down to the bit level without once
saying what problem they were trying to solve; a colleague from EE worked
out the required bandwidth and said “Hey, guys…”, and the project
disappeared without a trace three days later).
I’d like to see some performance numbers, or at least what you can say
without violating proprietary information, such as bytes/frame, frames/sec,
etc.
joe
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of ???
Sent: Tuesday, July 12, 2011 11:16 AM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] 0xc0000005 (Access Violation) in MmProbeAndLockPages
Thank you for your reply, Mr. Newcomer.
Your reply gave me some hints about how I can make this work.
(Do you mean describing the user buffer for scatter/gather DMA using a
partial MDL?)
First of all, let me explain my situation.
I designed a PCI Express interface card three years ago with an FPGA.
That board doesn’t have any onboard memory (which is why I’m facing this
problem).
I also wrote a WDM driver, and I transferred data in common-buffer DMA mode
(because the target system had a huge amount of memory).
In that system I initialized the DMA buffers in the StartDevice PnP routine
(using AllocateCommonBuffer).
(If my memory serves, that was two 1MB buffers for transfer data and one
4KB buffer for the command interface.)
I can also obtain both the physical address and the virtual address, so I
can copy between the FPGA and the driver.
At that time I did the copy in my dispatch routine, from the common buffer
to user memory
(IoAllocateMdl -> MmProbeAndLockPages -> MmGetSystemAddressForMdlSafe ->
memcpy).
I thought this method could be used in a system thread as well, because I
had used it in a dispatch routine before.
Now, let’s talk about my current system.
I initialized four 2MB transfer-data buffers and one 4KB buffer with
common-buffer DMA.
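The allocation looks roughly like this (simplified; dmaAdapter came from
IoGetDmaAdapter at StartDevice time, and the field names are placeholders
from my driver):

    #define XFER_BUF_SIZE (2 * 1024 * 1024)
    #define NUM_XFER_BUFS 4

    PHYSICAL_ADDRESS la;
    ULONG i;

    for (i = 0; i < NUM_XFER_BUFS; i++) {
        /* Returns a kernel virtual address; 'la' is the bus-logical
           address I program into the FPGA's DMA engine. */
        ext->XferBufVa[i] = dmaAdapter->DmaOperations->AllocateCommonBuffer(
            dmaAdapter, XFER_BUF_SIZE, &la, TRUE /* cache-enabled */);
        if (ext->XferBufVa[i] == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;
        ext->XferBufLa[i] = la;
    }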
Because the new system doesn’t have any memory, the target system just
pushes all the line-scan camera images to my interface card as streaming
data.
My board doesn’t have any memory either, so I also have to push all the
data straight through.
That’s why I decided to use a PCI Express interrupt to notify the driver
that a buffer has been filled, and then transfer each common buffer to the
user buffer.
(I think this is the fastest way for me to implement it without modifying
my current FPGA design very much.)
And because this design is for a line-scan camera grabber, the transfer
length is determined by the size of the LCD glass
(that’s why the transfer size can grow beyond 2GB).
It’s midnight here, so I’ll try tomorrow morning to get the scatter/gather
list.
I think that after I get the system addresses of this user buffer, I can
copy the camera data (in the common buffers) to the system address of the
user buffer in a system thread.
Is there anything wrong with this plan?
Any comments would be helpful.
Thank you for reading.
2011/7/12 Joseph M. Newcomer
See below…
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@gmail.com
Sent: Tuesday, July 12, 2011 6:34 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] 0xc0000005 (Access Violation) in MmProbeAndLockPages
Hi, nice to meet you all.
I’m having a problem.
I created a 256MB user-memory buffer using the VirtualAlloc function, and I
stored this virtual address in the device extension structure.
That is wrong. If you use direct I/O, the MdlAddress field of the IRP will
hold the MDL that describes this user buffer. Or you use “mode neither”,
and use the user-mode address to create an MDL, which you must allocate in
your top-level dispatch routine. After that, you build partial MDLs for
transferring segments.
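In outline (a sketch, not drop-in code; the extension field is invented),
the direct I/O case is simply:

    /* IRP_MJ_READ dispatch routine; device was created with DO_DIRECT_IO.
       The I/O manager has already probed, locked, and described the user
       buffer with an MDL; keep the MDL, never the user-mode address. */
    NTSTATUS DispatchRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PDEVICE_EXTENSION ext =
            (PDEVICE_EXTENSION)DeviceObject->DeviceExtension;

        ext->UserMdl = Irp->MdlAddress;  /* usable later in any context */
        IoMarkIrpPending(Irp);
        /* ... queue the IRP and start the transfer ... */
        return STATUS_PENDING;
    }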
The user buffer address is totally useless to you in your DPC or a system
thread. It is valid only in the top-level dispatch routine.
And I let my PCI Express interface card transfer into my DMA buffer.
After that I generate an interrupt to notify the device driver that the
transfer is done.
This cannot work if the address you give it is the user address. You have
not said if you are trying to use the user-mode address for your DMA or you
are using the internal address of a buffer you are allocating, and you do
not say how you are allocating it, so we don’t know if it is a physically
contiguous buffer or just a “buffer”, which can be scattered over many
discontiguous pages, which means you have to deal with scatter/gather DMA.
In the ISR I queue a DPC, and the DPC signals an event that wakes the
system thread up.
After the system thread gets this event, it allocates an MDL with
IoAllocateMdl.
What good is this going to do? You are “years” (in computer time) too late
to do anything with a user-mode address at this point. It has no meaning
whatsoever in a system thread.
And after that I tried to lock the memory with MmProbeAndLockPages, but
then I got an exception, 0xC0000005.
First, if you do MmProbeAndLockPages you need to do it in the context of
an exception frame to capture any exceptions so you don’t get a BSOD. Note
that if you get an exception, it means your locking failed, and you cannot
complete the I/O. Note also that MmProbeAndLockPages is probably going to
fail because you are not using the MDL in the IRP (if direct I/O) or the
MDL you allocated in your top-level dispatch routine (if using “mode
neither”), but instead one you have “allocated” yourself (and you do not
tell us how you got the parameters for allocating it!) and tried to
initialize using some nonsense random number (aka “the user-mode address”)
in a context in which this address is pretty much guaranteed to be complete
nonsense.
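For the record, the correct “mode neither” sequence in the top-level
dispatch routine is shaped roughly like this (a sketch; 'length' and the
extension field are placeholders, and the completion-time cleanup, i.e.
MmUnlockPages then IoFreeMdl, is omitted):

    /* In the dispatch routine, i.e., in the caller's process context. */
    PMDL mdl = IoAllocateMdl(Irp->UserBuffer, length, FALSE, FALSE, NULL);
    if (mdl == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    __try {
        /* The device writes into the buffer, hence IoWriteAccess. */
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        /* The exception frame: locking failed, so fail the I/O
           gracefully instead of taking a BSOD. */
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    ext->UserMdl = mdl;  /* save the MDL, not the user-mode address */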
I want to know what is causing this problem.
That you are not using direct I/O would be a good first guess. Allocating your
own MDL in a context in which the user address is a meaningless random
number is a likely contributor. Storing a user-mode address is the most
fatal aspect I can identify. Doing an MmProbeAndLockPages without an
exception frame is the direct cause.
You must allocate the MDL in your top-level entry routine (e.g., the handler
for IRP_MJ_READ or IRP_MJ_DEVICE_CONTROL, depending on how you are reading
the data) so that it has the correct information in it. Then you will use
“mode neither” so the I/O manager doesn’t try to lock all the pages down.
You can then do DMA into pieces of the MDL using a “partial MDL”, which
you can create from the original MDL.
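If you take the staging-copy route, the per-chunk step is on the order of
the following (again a sketch; 'chunkIndex', 'commonBufferVa', and the
extension fields are illustrative, and ext->PartialMdl was allocated once,
sized for one chunk):

    PUCHAR userBase = (PUCHAR)MmGetMdlVirtualAddress(ext->UserMdl);
    PVOID  chunkVa  = userBase + (SIZE_T)chunkIndex * CHUNK_SIZE;
    PVOID  sysVa;

    MmPrepareMdlForReuse(ext->PartialMdl);   /* drop any stale mapping */
    IoBuildPartialMdl(ext->UserMdl, ext->PartialMdl, chunkVa, CHUNK_SIZE);

    sysVa = MmGetSystemAddressForMdlSafe(ext->PartialMdl,
                                         NormalPagePriority);
    if (sysVa != NULL)
        RtlCopyMemory(sysVa, commonBufferVa, CHUNK_SIZE);  /* staging copy */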
And I want to know: is there any method by which I can transfer streaming
data from the PCI Express card to the user buffer?
Handle your MDL in the top-level dispatch routine.
I think I have to allocate an MDL on each interrupt in this system thread,
because I have to transfer more than 2GB of memory, and I can’t allocate an
MDL and a physical DMA buffer that large.
It is far more complex than you think. You just said that the user
allocates 256MB of memory, so how did you suddenly get this 2GB number? You
don’t need to transfer, by your specification, more than 256MB of memory.
You seem to feel a need to transfer it in pieces, which suggests the card
has a seriously bad design (no scatter/gather capability), but there are
ways of dealing with that.
How did you jump from 256MB to 2GB?