As a slight refinement, try tracking an IOP sequence per handle associated
with the IOCP. This approach, combined with some queuing logic to
handle out-of-order completions, can eliminate an extra context switch
and allow additional parallelism to dramatically improve the results when
many handles (thousands) are open.
wrote in message news:xxxxx@ntdev…
If there are 8 channels, it suggests that you have eight queues, one for
each channel, and each element completes asynchronously when its channel
has provided the data. So this suggests a multiple-queue driver.
My previous comments about the bizarre structure hold. When “an event” is
signaled, which of the eight channels completed? It looks like
DeviceIoControl might be the API-of-choice for reading data. The correct
behavior, what I would expect of a device, is that I would program its DMA
register(s) with the address/length pairs required for the transfer (you
DO have scatter/gather I/O, right?), and initiate a transfer. When that
channel completes, it generates an interrupt, and the device makes it known
what channel has caused the interrupt (note that it is possible for
multiple channels to complete, and it must handle that case cleanly).
There are no “data present” interrupts at all; there is only “data
complete”. If there is no data, the device holds the request pending
until data appears.
To support IRP cancellation, it would be nice if the device had a way to
have the driver abort a pending request. This would generate an interrupt
and the status would indicate that I/O on that channel had been terminated
programmatically. Otherwise, you have to make an assumption that *ALL*
I/O will complete “real soon now” and just wait for the device to complete
the transfer, which it MUST do within a bounded time period (otherwise,
there’s no way to stop the application, and other disasters soon follow).
Devices which can have unbounded completion times need a way to cancel the
hardware transaction. For example, most disk drivers don’t try to cancel
the current IRP that is running, because most disks will complete the
transaction within a few tens of milliseconds. On the other hand, it gets
messier with devices like serial ports where the time to the next
character could be measured in weeks.
When designing a device, you start from the application requirements
FIRST, design the API SECOND, and then design the card and driver together
to cleanly support the API. It is generally a serious design error to
design the card, then design the driver, and then finally think about what
the API should look like.
Devices that don’t have scatter/gather I/O are usually thought of as
driver-hostile. They are difficult to program, and hard to use in
contexts of high bandwidth because you get an interrupt on every page
instead of an interrupt for an entire transfer. The alternative, using
“bounce buffers”, requires allocating buffers in the kernel (not a good
idea if they are large), and then you pay tons of overhead for the
gratuitous copies from the kernel buffers to user space.
Key here is design the API *FIRST*. To do this, you need an experienced
Windows programmer (in your case, an experienced C# programmer) to decide
what the best high-level architecture must be to get minimum overhead and
maximum flexibility. The whole design of using events to signal user
space is a design that works against simplicity, robustness, and “cultural
compatibility” with Windows applications.
Frankly, I have found that callback I/O is just about the worst way to do
asynchronous I/O; I prefer I/O Completion Ports. I’m not a C# programmer,
so I don’t know what is best in that environment.
If you have eight channels, the high-level architecture might suggest one
thread per channel for simplicity. But this depends on what your
application writers need to do.
The limitation of 32KB or 64KB may be due to the rather convoluted design
you suggested. You have not said what your maximum DMA transfer block
size is, but you want unlimited scatter/gather to get maximum
throughput, with transfers going directly to application space.
As to “doable”, yes, the design is probably “doable”, but there are
questions to consider about why it should ever be done. Even bad designs
can be forced to work. But if I were looking at the project, the first
thing I’d do is scrap the current design as being clumsy, convoluted, and
incompatible with how Windows works.
Referring to an application that is not written in C# as “normal” seems
odd; there is an entire subculture of programmers who think that C# is a
normal way to write applications. My client base prefers C applications,
although I can sneak C++/MFC in when they aren’t looking. But if your
design cannot support C, C++/MFC, VB and C#, then this is a serious
indication there is something wrong with it. All you have to do in the
driver is transfer data. Issues about notification, completion handling,
result processing, etc. belong in the high-level program and your driver
must be compatible with any standard I/O the user chooses to use. I find
that for high-bandwidth communications, using async I/O and “priming” the
queue with a number of input requests gives the best response. For
example, the following is a bad structure
while (true)
{
    ReadFile/DeviceIoControl…
    …process data…
}
but I find that, for high-performance devices, the correct approach is to
open it in asynchronous mode and do
for (int i = 0; i < SOME_LIMIT; i++)
{
    ReadFile/DeviceIoControl…
}
Now, when the packets complete, you have to worry about the sequence (the
sequence they return in is not guaranteed to be the original order) but I
find that using an I/O Completion Port is one of the more elegant ways to
handle this. As soon as the completion notification is seen, I pump a new
ReadFile/DeviceIoControl down, then queue the existing one for later
processing in its own thread. Maximizes throughput in multicore systems,
and usually results in simpler code for the threads.
Bottom line: for high bandwidth devices, if you get the scheduler in the
way of your throughput, you’re doomed. The while-readfile-process loop is
exactly this model. It is not survivable. Note that having an event
being signaled is exactly the same problem, which is why using
asynchronous I/O with an event in the OVERLAPPED structure and doing
WaitFor…() is also a Really Bad Idea. Callback I/O forces the callback
to execute in the context of the initiating thread, and it forces you to
enter an alertable wait state sufficiently frequently that you don’t become
the bottleneck. Just about the worst possible architecture you could
imagine for asynchronous I/O. Puts the entire processing burden on a
single thread, which means that most of your cores are sitting idle while
one is overheating and starting to melt (figuratively speaking). Completion
Port I/O allows maximum concurrency, and generally keeps the scheduler out
of the picture. Priming with lots of packets means the next transfer is
started by IoStartNextPacket or its WDF analog, an operation whose total
overhead is probably measurable in single-digit microseconds, which keeps
the scheduler pretty much out of the loop.
Note that if the processing time exceeds the expected interpacket
completion time, you will always have data overruns, so ultimately you
want to keep this small. This suggests transferring the largest block of
data possible in the user-level call. So if you need to read and process
20MB/sec, then you will need to process the data at a rate > 20MB/sec or
you will always run out of buffers.
Note that people have measured interrupt-to-user-space delays in the low
hundreds of MILLIseconds, which is why you want to keep the scheduler out
of the picture as much as possible.
So if you have a “device manager” DLL, it might be handling all this
queueing and doing asynchronous notifications outward by queueing up work
items for processing threads. In one app I did, to avoid any dropped
data, I had to preload 50 ReadFile IRPs (I still lost data at 40, and I
think they ended up using about 60 so there was some headroom). Note that
you might have different device manager DLLs for C/C++, C#, and perhaps
even VB, whose interfaces are culturally compatible with those languages’
paradigms.
But you seem to be putting the wrong amounts of work into the wrong
levels. Notifications to application space that an I/O operation has
completed are handled by the I/O Manager, not by the driver. The driver
should neither know nor care how this is done. Your responsibility for
what to do with the IRP ends when you complete the IRP.
joe
Hi guys
Thanks for your quick replies.
@joe: I will try to clear a few things out (I have not been precise in my
description above):
I need to create a datastream of 20Mbps from the FPGA to my C#
application. The FPGA is a measuring system with 20 channels, and it
outputs 8 bytes / channel at 4kHz rate. The PCIe bus should then transfer
the data to the PC application.
A simple layer model of the actual design (Xilinx XAPP1052) is like this:
C# GUI Application -> DriverMgr.Dll (responsible for device IO calls) ->
PCIe driver -> FPGA PCIe Endpoint (BusMaster DMA).
My idea of the communication will then be:
- The FPGA issues an MSI interrupt (Data is ready)
- The driver writes address into DMA registers in FPGA and initiates the
transfer.
- The driver is informed by another MSI interrupt (Transfer ended).
- Driver informs DriverMgr.dll (The Xilinx XAPP1052 polls the driver by
using a DeviceIoControl call. Isn’t there a better way to do this?).
- DriverMgr.Dll will issue an event to the application that new data has
arrived.
Basically my question is:
Is this doable? If the DriverMgr.Dll is changed to a normal application
(and the C# app is thrown away) is it then still not the way to do data
acquisition at 20Mbps?
If you had to do 20Mbps data acquisition, what would you do? Use polling
from the application, or is there a way to be informed by the driver
when data arrives?
@ Patrick: The physical data rate is not a problem. It is measured with
the Xilinx app at 900Mbps. The problem is that I can only transfer 32 or
64 kB at 900Mbps once per second, and this is far from the required
continuous data stream of 20Mbps.
I hope you can point me in the right direction for the PCIe driver design?
Thanks.
Best regards
Brian
NTDEV is sponsored by OSR
For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer