Architecture of Windows PCIe driver?

Hi

I have an architecture question for a Windows XP PCIe driver. I have a custom-made PCIe FPGA board from Xilinx and I have implemented the Xilinx XAPP1052 example. It is a PCIe Bus Master DMA endpoint (x1 lane).
The XAPP1052 supplied code consists of the kernel driver, a C++ DriverMgr.dll and a small VB application to verify DMA performance.

The XAPP1052 example code is working fine, but I need a data rate of approx. 20 Mbps (approx. 600 bytes at a 4 kHz rate) from the board, so the PC GUI should not poll for the data. Instead I have this idea of using MSI interrupts, but I need someone to verify if it is the right way to go:

  • The FPGA should initiate every transfer by signaling an MSI to the kernel driver.
  • The kernel driver then initiates a transfer from FPGA memory to the PC.
  • The BusMaster DMA controller in the FPGA will then transfer the data.
  • The kernel driver will then either know when data transfer is done, or the FPGA will send another MSI?
  • The kernel driver will then inform the DriverMgr.Dll that new data is available (using an event).
  • The DriverMgr.Dll will inform the user application (it will be a C# app.) that new data is available, with a pointer to the “TLPWriteBuffer” (where the received data is placed by DriverMgr.Dll).
  • The application will save the newly received data and process it (in another thread?).

Is this in any way possible to do? Is it the right way to handle a data stream of 20 Mbps?

Any hints or experiences would be appreciated.

Thanks.

Best Regards
Brian

(I have a new email interface, and I mis-typed something and the message
disappeared; it may have been sent, or not, so this may be a repeat)

See comments inline below.

  • The FPGA should initiate every transfer by signaling an MSI to the
    kernel driver.

((((((((((((((((((((((((((
I do not understand why this should be so. In fact, it is somewhat
suspect. What MIGHT happen is that the chip can signal a “data ready”
condition via an interrupt.
))))))))))))))))))))))))))

  • The kernel driver then initiates a transfer from FPGA memory to the PC.

((((((((((((((((((((((((((
This would normally mean that the DMA address and count register(s) were
loaded with the address(es) and count(s) of the transfer, and the chip was
then signaled (via a control register) to initiate the DMA transfer.
)))))))))))))))))))))))))))
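
[Editor's illustration] The usual shape of "load the registers, then kick the control bit" looks something like the sketch below. The offsets and bit names are invented for illustration; they are not the XAPP1052 register map.

// Hypothetical BAR0 layout -- offsets and bits are for illustration only.
#define DMA_ADDR_LO  0x00   // low 32 bits of host physical address
#define DMA_ADDR_HI  0x04   // high 32 bits of host physical address
#define DMA_COUNT    0x08   // transfer length in bytes
#define DMA_CTRL     0x0C   // control/status register
#define DMA_CTRL_GO  0x01   // write 1 to start the bus-master transfer

VOID StartDmaTransfer(PUCHAR Bar0, PHYSICAL_ADDRESS HostAddr, ULONG Bytes)
{
    WRITE_REGISTER_ULONG((PULONG)(Bar0 + DMA_ADDR_LO), HostAddr.LowPart);
    WRITE_REGISTER_ULONG((PULONG)(Bar0 + DMA_ADDR_HI), HostAddr.HighPart);
    WRITE_REGISTER_ULONG((PULONG)(Bar0 + DMA_COUNT),   Bytes);
    WRITE_REGISTER_ULONG((PULONG)(Bar0 + DMA_CTRL),    DMA_CTRL_GO);
}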

  • The BusMaster DMA controller in the FPGA will then transfer the data.

(((((((((((((((((((((((((((
Yes
))))))))))))))))))))))))))))

  • The kernel driver will then either know when data transfer is done, or
    the FPGA will send another MSI?

(((((((((((((((((((((((((((((
There is no “either”. The way the driver knows the transfer is done is
when it receives an interrupt indicating the transfer is done.
))))))))))))))))))))))))))))))

  • The kernel driver will then inform the DriverMgr.Dll that new data is
    available (using an event).

((((((((((((((((((((((((((((((
There are so many things wrong here. I’m not sure where to begin. First,
the concept that an “event” is involved is something that the client
programmer, not the driver, determines; this is done by opening the device
in asynchronous mode and putting an event handle in the OVERLAPPED
structure; this whole mechanism is completely invisible to the driver
writer.

Normally, the way you “signal” that I/O is complete is you complete the
IRP that was sent down. The very concept that there is a
driver-manager DLL is something that is immediately suspect here, more so
after I read the next point.

The design is completely wrong and should be scrapped.
)))))))))))))))))))))))))))))))
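
[Editor's illustration] In concrete WDM terms (the driver model of the XAPP1052-era sample), "completing the IRP" is all the driver does; a minimal sketch from a DPC, assuming bytesDone was recorded by the transfer logic:

// Complete the pended read IRP once the DMA has finished. Whether the
// client is woken via an event, a callback, or a completion port is the
// I/O manager's business, not the driver's.
Irp->IoStatus.Status = STATUS_SUCCESS;
Irp->IoStatus.Information = bytesDone;   // bytes placed in the buffer
IoCompleteRequest(Irp, IO_NO_INCREMENT);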

  • The DriverMgr.Dll will inform the user application (it will be a C#
    app.) that new data is available, with a pointer to the “TLPWriteBuffer”
(where the received data is placed by DriverMgr.Dll).

(((((((((((((((((((((((((((((((
The way the application knows the I/O has completed is that the
ReadFile/DeviceIoControl that initiated the transfer completes. For
synchronous I/O, it means the call returns. For asynchronous I/O, one of
the numerous asynchronous notification mechanisms is invoked. I do not
know how C# handles asynchronous I/O, but the whole IDEA that there is an
asynchronous notification requires careful design of the application
level, and should not be left up to the driver writer.
))))))))))))))))))))))))))))))))
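
[Editor's illustration] For concreteness, the classic client-side shape of this, with the client (not the driver) choosing the notification mechanism. The device name is hypothetical, and note that joe argues further down that waiting on an OVERLAPPED event is a poor fit for high throughput:

// The client, not the driver, decides to use an event: open the device
// overlapped and put an event handle in the OVERLAPPED structure.
HANDLE h = CreateFileW(L"\\\\.\\MyFpgaDevice",      // hypothetical name
                       GENERIC_READ, 0, nullptr,
                       OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);

OVERLAPPED ov = {};
ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

BYTE buf[4096];
DWORD got = 0;
if (!ReadFile(h, buf, sizeof buf, nullptr, &ov) &&
    GetLastError() == ERROR_IO_PENDING)
{
    // The driver pends the IRP; when it completes the IRP, the I/O
    // manager signals ov.hEvent and fills in the transfer size.
    GetOverlappedResult(h, &ov, &got, TRUE);
}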

  • The application will save the new received data and process them (in
    another thread?).

((((((((((((((((((((((((((((((((
The entire user-level architecture seems wrong. This sounds like an
attempt to shoehorn the I/O design of some historically boring operating
system into Windows, and is doomed. The result of this mess will be
frustration, heartbreak, unreliable code, and a long-term disaster.
Completely rethink this problem, preferably by having an experienced
Windows application programmer tell you what SHOULD happen. Nothing in
the design I’ve read makes any sense whatsoever, and if someone came to me
with an application-level design that looked like what you proposed I
would send them back to redesign it. It is a mess. I don’t even know if
it COULD be made to work in C#; I’d have enough trouble trying to get it
to work in C or C++.

You do not start the design from the card; you start the design from the
application interface and this tells you what the card and driver must do.
The proposed design is horrible.
joe
))))))))))))))))))))))))))))))))))


Brian
The link below is to the Xilinx site with notes about that example code.
Hopefully it will be of some help.
Patrick

http://forums.xilinx.com/t5/PCI-Express/Problem-with-PCI-Express-Example-XAPP1052-on-an-ML605/td-p/159632


xxxxx@amfitech.dk wrote:

The XAPP1052 example code is working fine, but I need a data rate of approx. 20 Mbps (approx. 600 bytes at a 4 kHz rate) from the board, so the PC GUI should not poll for the data.

Semantic issue: Mbps is the traditional symbol for “megabits per
second”. You mean 20 megabytes per second. I usually see that rendered
as “20 MB/s”. I try to be careful to use “b” for bits and “B” for bytes
to avoid confusion.


The driver cannot tell when the DMA operation is complete, so you need
an interrupt at that point. It seems unfortunate that you also need an
interrupt to say “data is available”. What happens if you issue a
bus-master request before data is ready? Does the FPGA hold off, or
does it transfer old data?

Don’t use an event to signal between the driver and the DLL. Instead,
just have the DLL submit read requests (possibly through an ioctl), and
have the read requests wait in the driver. When the data is available,
you fill in the buffer and complete the request. That way, you can even
avoid an extra copy by having the board DMA directly into the user’s buffer.
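
[Editor's illustration] A minimal sketch of that pended-read pattern in KMDF terms (the XAPP1052 driver itself is WDM; PendingReadQueue, GetDeviceContext, and BytesTransferred are assumed to be set up elsewhere):

// EvtIoRead: no data yet, so park the request in a manual-dispatch queue.
VOID EvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    if (!NT_SUCCESS(WdfRequestForwardToIoQueue(Request, ctx->PendingReadQueue)))
        WdfRequestComplete(Request, STATUS_UNSUCCESSFUL);
}

// DPC for the "DMA done" interrupt: complete the oldest pended read.
// The data was DMAed straight into the request's buffer, so no copy.
VOID EvtDmaDoneDpc(WDFDPC Dpc)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfDpcGetParentObject(Dpc));
    WDFREQUEST req;
    if (NT_SUCCESS(WdfIoQueueRetrieveNextRequest(ctx->PendingReadQueue, &req)))
        WdfRequestCompleteWithInformation(req, STATUS_SUCCESS,
                                          ctx->BytesTransferred);
}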


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi guys

Thanks for your quick replies.

@joe: I will try to clear a few things up (I have not been precise in my description above):
I need to create a data stream of 20 Mbps from the FPGA to my C# application. The FPGA is a measuring system with 20 channels, and it outputs 8 bytes / channel at a 4 kHz rate. The PCIe bus should then transfer the data to the PC application.

A simple layer model of the actual design (Xilinx XAPP1052) is like this:
C# GUI Application -> DriverMgr.Dll (responsible for device IO calls) -> PCIe driver -> FPGA PCIe Endpoint (BusMaster DMA).

My idea of the communication will then be:

  1. The FPGA issues an MSI interrupt (data is ready).
  2. The driver writes the address into the DMA registers in the FPGA and initiates the transfer.
  3. The driver is informed by another MSI interrupt (transfer ended).
  4. The driver informs DriverMgr.dll (the Xilinx XAPP1052 polls the driver using a DeviceIoControl call. Isn’t there a better way to do this?).
  5. DriverMgr.Dll will issue an event to the application that new data has arrived.

Basically my question is:
Is this doable? If the DriverMgr.Dll is changed to a normal application (and the C# app is thrown away) is it then still not the way to do data acquisition at 20 Mbps?
If you were to do 20 Mbps data acquisition, what would you do? Use polling from the application, or isn’t there a way to be informed by the driver when data arrives?

@ Patrick: The physical data rate is not a problem. It is measured with the Xilinx app at 900 Mbps. The problem is I can only transfer 32 or 64 kB at 900 Mbps once per second, and this is far from the required continuous data stream of 20 Mbps.

I hope you can point me in the right direction for the PCIe driver design?

Thanks.

Best regards
Brian

Hi Tim

Thanks for your reply. I actually mean 20 megabits per second. :-)

I have tried to issue a BusMaster request and then let the FPGA start the transmission. It only worked a few times in a row, but this might be due to a required re-design of the driver…

Best regards
Brian

> Basically my question is: Is this doable? If the DriverMgr.Dll is changed to a normal application (and the C# app is thrown away) is it then still not the way to do data acquisition at 20 Mbps? If you were to do 20 Mbps data acquisition, what would you do? Use polling from the application, or isn’t there a way to be informed by the driver when data arrives?

The speed of transferring data mostly depends on the PCIe transfer itself. You have a x1 PCIe link, which has a raw speed of 250 MB/s. But that is the speed for transferring all data, including control information; the real throughput would be 60%-70% of that. Thus, you have about 150 MB/s.
Usually the driver cannot introduce significant delay unless you do very nasty stuff.
For your design I would recommend a thread in the application which sends a direct-I/O read request to the driver. If data is not available, the driver marks the request as PENDING. As soon as you get an interrupt from the hardware, the driver takes the addresses from the MDL into an SGL and performs the DMA. After getting an interrupt on DMA completion, the driver completes the request and the application thread wakes up.
Look at the sample in the WDK: \WinDDK\7600.16385.1\src\general\PLX9x5x\sys. It may help you.

Igor Sharovar
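
[Editor's illustration] The MDL-to-SGL-to-DMA step Igor describes is what KMDF's DMA transaction objects automate; a compressed sketch of the pattern (in the spirit of the PLX9x5x sample, with enabler setup and error handling omitted; ctx and register programming are assumptions):

// Turn a pended read request into a DMA transaction; KMDF builds the
// scatter/gather list from the request's MDL and calls EvtProgramReadDma.
VOID StartReadDma(PDEVICE_CONTEXT ctx, WDFREQUEST request)
{
    WDFDMATRANSACTION txn;
    WdfDmaTransactionCreate(ctx->DmaEnabler, WDF_NO_OBJECT_ATTRIBUTES, &txn);
    WdfDmaTransactionInitializeUsingRequest(txn, request, EvtProgramReadDma,
                                            WdfDmaDirectionReadFromDevice);
    WdfDmaTransactionExecute(txn, ctx);
}

BOOLEAN EvtProgramReadDma(WDFDMATRANSACTION Txn, WDFDEVICE Device,
                          WDFCONTEXT Context, WDF_DMA_DIRECTION Direction,
                          PSCATTER_GATHER_LIST SgList)
{
    // Push each Address/Length pair into the FPGA's DMA registers and
    // kick the transfer. Completion is reported by the "DMA done"
    // interrupt, after which WdfDmaTransactionDmaCompleted() is called
    // and the request can be completed.
    for (ULONG i = 0; i < SgList->NumberOfElements; i++) {
        /* program SgList->Elements[i].Address and .Length here */
    }
    return TRUE;
}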

If there are 8 channels, it suggests that you have eight queues, one for
each channel, and each element completes asynchronously when its channel
has provided the data. So this suggests a multiple-queue driver.

My previous comments about the bizarre structure hold. When “an event” is
signaled, which of the eight channels completed? It looks like
DeviceIoControl might be the API-of-choice for reading data. The correct
behavior, what I would expect of a device, is that I would program its DMA
register(s) with the address/length pairs required for the transfer (you
DO have scatter/gather I/O, right?), and initiate a transfer. When that
channel completes, it generates an interrupt, and the device makes it known
what channel has caused the interrupt (note that it is possible for
multiple channels to complete, and it must handle that case cleanly).
There are no “data present” interrupts at all; there is only “data
complete”. If there is no data, the device holds the request pending
until data appears.

To support IRP cancellation, it would be nice if the device had a way to
have the driver abort a pending request. This would generate an interrupt
and the status would indicate that I/O on that channel had been terminated
programmatically. Otherwise, you have to make an assumption that *ALL*
I/O will complete “real soon now” and just wait for the device to complete
the transfer, which it MUST do within a bounded time period (otherwise,
there’s no way to stop the application, and other disasters soon follow).
Devices which can have unbounded completion times need a way to cancel the
hardware transaction. For example, most disk drivers don’t try to cancel
the current IRP that is running, because most disks will complete the
transaction within a few tens of milliseconds. On the other hand, it gets
messier with devices like serial ports where the time to the next
character could be measured in weeks.
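
[Editor's illustration] In KMDF terms, the driver side of such an abort hook might look like the sketch below, reusing the hypothetical DMA_CTRL register from the earlier sketch; DMA_CTRL_ABORT and ctx->Bar0 are likewise invented:

#define DMA_CTRL_ABORT 0x02  // invented abort bit

// Called if the I/O manager cancels a request we marked cancelable.
VOID EvtRequestCancel(WDFREQUEST Request)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(
        WdfIoQueueGetDevice(WdfRequestGetIoQueue(Request)));

    // Ask the hardware to abort the in-flight transfer, then complete.
    WRITE_REGISTER_ULONG((PULONG)(ctx->Bar0 + DMA_CTRL), DMA_CTRL_ABORT);
    WdfRequestComplete(Request, STATUS_CANCELLED);
}

// When handing a request to the hardware, make it cancelable:
//     WdfRequestMarkCancelableEx(Request, EvtRequestCancel);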

When designing a device, you start from the application requirements
FIRST, design the API SECOND, and then design the card and driver together
to cleanly support the API. It is generally a serious design error to
design the card, then design the driver, and then finally think about what
the API should look like.

Devices that don’t have scatter/gather I/O are usually thought of as
driver-hostile. They are difficult to program, and hard to use in
contexts of high bandwidth because you get an interrupt on every page
instead of an interrupt for an entire transfer. The alternative, to have
“bounce buffers”, requires allocating buffers in the kernel, not a good
idea if they are large buffers, and then you get tons of overhead on the
gratuitous copies from the kernel buffers to user space.

Key here is design the API *FIRST*. To do this, you need an experienced
Windows programmer (in your case, an experienced C# programmer) to decide
what the best high-level architecture must be to get minimum overhead and
maximum flexibility. The whole design of using events to signal user
space is a design that works against simplicity, robustness, and “cultural
compatibility” with Windows applications.

Frankly, I have found that callback I/O is just about the worst way to do
asynchronous I/O; I prefer I/O Completion Ports. I’m not a C# programmer,
so I don’t know what is best in that environment.

If you have eight channels, the high-level architecture might suggest one
thread per channel for simplicity. But this depends on what your
application writers need to do.

The limitation of 32KB or 64KB may be due to the rather convoluted design
you suggested. You have not said what your maximum DMA transfer block
size is, but it should be unlimited-scatter-gather to get maximum
throughput. Direct transfers to application space.

As to “doable”, yes, the design is probably “doable”, but there are
questions to consider about why it should ever be done. Even bad designs
can be forced to work. But if I were looking at the project, the first
thing I’d do is scrap the current design as being clumsy, convoluted, and
incompatible with how Windows works.

Referring to an application that is not written in C# as “normal” seems
odd; there is an entire subculture of programmers who think that C# is a
normal way to write applications. My client base prefers C applications,
although I can sneak C++/MFC in when they aren’t looking. But if your
design cannot support C, C++/MFC, VB and C#, then this is a serious
indication there is something wrong with it. All you have to do in the
driver is transfer data. Issues about notification, completion handling,
result processing, etc. belong in the high-level program and your driver
must be compatible with any standard I/O the user chooses to use. I find
that for high-bandwidth communications, using async I/O and “priming” the
queue with a number of input requests gives the best response. For
example, the following is a bad structure

while (true)
{
    ReadFile/DeviceIoControl…
    …process data…
}

but I find that, for high-performance devices, the correct approach is to
open it in asynchronous mode and do

for (int i = 0; i < SOME_LIMIT; i++)
{
    ReadFile/DeviceIoControl…
}

Now, when the packets complete, you have to worry about the sequence (the
sequence they return in is not guaranteed to be the original order) but I
find that using an I/O Completion Port is one of the more elegant ways to
handle this. As soon as the completion notification is seen, I pump a new
ReadFile/DeviceIoControl down, then queue the existing one for later
processing in its own thread. Maximizes throughput in multicore systems,
and usually results in simpler code for the threads.
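
[Editor's illustration] A condensed sketch of that prime-then-repost structure. The device name and buffer size are placeholders, and the 50-deep priming echoes the preload anecdote further down in this post:

#include <windows.h>

struct Req { OVERLAPPED ov; BYTE buf[64 * 1024]; };

// Post (or repost) one overlapped read; it pends in the driver.
static void Post(HANDLE dev, Req* r)
{
    ZeroMemory(&r->ov, sizeof r->ov);
    ReadFile(dev, r->buf, sizeof r->buf, nullptr, &r->ov);
}

int main()
{
    HANDLE dev = CreateFileW(L"\\\\.\\MyFpgaDevice",     // hypothetical name
                             GENERIC_READ, 0, nullptr,
                             OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    HANDLE iocp = CreateIoCompletionPort(dev, nullptr, 0, 0);

    for (int i = 0; i < 50; i++) Post(dev, new Req{});   // prime the queue

    for (;;) {
        DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
        GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        Req* r = CONTAINING_RECORD(ov, Req, ov);
        // Queue a copy of r->buf (bytes long) to a worker thread here,
        // then repost immediately so the driver's queue never runs dry.
        Post(dev, r);
    }
}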

Bottom line: for high bandwidth devices, if you get the scheduler in the
way of your throughput, you’re doomed. The while-readfile-process loop is
exactly this model. It is not survivable. Note that having an event
being signaled is exactly the same problem, which is why using
asynchronous I/O with an event in the OVERLAPPED structure and doing
WaitFor…() is also a Really Bad Idea. Callback I/O forces the callback
to execute in the context of the initiating thread, and it forces you to
enter an Alertable Wait State sufficiently frequently that you don’t become
the bottleneck. Just about the worst possible architecture you could
imagine for asynchronous I/O. Puts the entire processing burden on a
single thread, which means that most of your cores are sitting idle while
one is overheating and starting to melt (figuratively speaking). Completion
Port I/O allows maximum concurrency, and generally keeps the scheduler out
of the picture. The priming with lots of packets means the
IoStartNextPacket or its WDF analog, an operation whose total overhead is
probably measurable in single-digit microseconds, and this gets the
scheduler pretty much out of the picture.

Note that if the processing time exceeds the expected interpacket
completion time, you will always have data overruns, so ultimately you
want to keep this small. This suggests transferring the largest block of
data possible in the user-level call. So if you need to read and process
20MB/sec, then you will need to process the data at a rate > 20MB/sec or
you will always run out of buffers.

Note that people have measured interrupt-to-user-space delays in the low
hundreds of MILLIseconds, which is why you want to keep the scheduler out
of the picture as much as possible.

So if you have a “device manager” DLL, it might be handling all this
queueing and doing asynchronous notifications outward by queueing up work
items for processing threads. In one app I did, to avoid any dropped
data, I had to preload 50 ReadFile IRPs (I still lost data at 40, and I
think they ended up using about 60 so there was some headroom). Note that
you might have different device manager DLLs for C/C++, C#, and perhaps
even VB, whose interfaces are culturally compatible with those languages’
paradigms.

But you seem to be putting the wrong amounts of work into the wrong
levels. Notifications to application space that an I/O operation has
completed are handled by the I/O Manager, not by the driver. The driver
should neither know nor care how this is done. Your responsibility for
what to do with the IRP ends when you complete the IRP.
joe


Hi Joe

Thanks for your answer and for pointing me in the right direction.

Brian

> Is this in any way possible to do? Is it the right way to handle a data stream of 20 Mbps?

Yes. One of the right ways.

Another way is to:

a) the DLL sends many overlapped reads to the driver
b) the driver queues these reads in a queue (like a KMDF one)
c) the driver pings the FPGA that it is ready to consume the data
d) when the data arrives, the FPGA issues an interrupt (for a small data rate like 20 MB/s the usual interrupt is OK; no need for MSI)
e) the driver’s ISR/DPC routines consume the first read request from the queue
f) they pass this request through the Windows/KMDF DMA machinery to get its SGL
g) they push the first (or first several) SGL entries to the FPGA
h) they ping the FPGA to go run the DMA
i) the FPGA runs the DMA
j) the FPGA issues a “DMA done” interrupt
k) the driver’s ISR/DPC do the necessary cleanup and complete the read
l) the DLL assembles the data buffer from this completed read with other such data buffers in the proper order
m) the DLL’s caller consumes this list of buffers as a data stream

This provides you with the stream of data in user mode.
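
[Editor's illustration] Step l) is the only subtle part of the user-mode side, since overlapped reads can complete out of order. One illustrative way (not from any sample) is to stamp each read with a sequence number and release buffers only in sequence:

#include <cstdint>
#include <map>
#include <vector>

// Holds out-of-order completions until the next expected sequence number
// arrives, then delivers a contiguous run of buffers to the consumer.
class Reassembler {
    uint64_t next_ = 0;
    std::map<uint64_t, std::vector<uint8_t>> pending_;
public:
    template <typename Deliver>
    void Complete(uint64_t seq, std::vector<uint8_t> data, Deliver deliver) {
        pending_.emplace(seq, std::move(data));
        for (auto it = pending_.find(next_); it != pending_.end();
             it = pending_.find(next_)) {
            deliver(std::move(it->second));   // in-order hand-off
            pending_.erase(it);
            ++next_;
        }
    }
};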


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> If you were to do 20 Mbps data acquisition, what would you do?

I would carefully examine the latency requirements for the user app, and, if they are generous - then use a rather large buffer queue.

If they are NOT generous - then sorry, run all of this on another OS like QNX.
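
[Editor's illustration] To put rough numbers on "rather large" versus the latency budget (the buffer sizes are arbitrary examples; the 50 × 64 KB figure reuses numbers mentioned elsewhere in the thread):

  20 Mbps ≈ 2.5 MB/s of payload
  1 MB of queued buffers  → 1 MB / 2.5 MB/s = 400 ms of slack
  50 reads × 64 KB = 3.2 MB → 3.2 MB / 2.5 MB/s ≈ 1.3 s of slack

So even a modest queue covers quite generous latency requirements at this rate.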


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Maxim S. Shatskih wrote:

> If you were to do 20 Mbps data acquisition, what would you do?
I would carefully examine the latency requirements for the user app, and, if they are generous - then use a rather large buffer queue.

If they are NOT generous - then sorry, run all of this on another OS like QNX.

20 Mbps is a trivial data rate. Extraordinary measures are not required.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

As a slight refinement, try tracking an IOP sequence per handle associated
with the IOCP. This approach, combined with some neat queuing logic to
handle the out-of-order completion, can eliminate an extra context switch
and allow additional parallelism to dramatically improve the results when
many handles (thousands) are open.
