I’m in the process of designing a software stack that will have to consume data produced by a PCIe device (nothing very original I guess).
The specifications changed recently (!), and now the acquisition is asynchronous (i.e. the time between data packets is variable) and the data packets don’t have a fixed size anymore.
The PCIe device is scatter-gather-DMA capable.
The PCIe device is MSI capable and sends three types of MSIs (to simplify):
Error
Data ready
DMA done
The driver is responsible for transferring data produced by the PCIe device to a user-allocated buffer each time a new packet is ready inside the PCIe device.
It’s the responsibility of the PCIe device to avoid interrupting the software too often (using a set of timers inside the firmware).
Here is how I imagine an acquisition:
The application allocates a buffer.
The application performs a DeviceIoControl on the driver to put it in Acquisition mode (i.e. it must wait for data to be ready to complete ReadFile requests).
Upon reception of the DeviceIoControl, the driver sets its data ready counter to 0, and changes its internal mode to Acquisition.
The application performs a ReadFile on the driver, specifying the write pointer on the buffer (currently the base pointer), the read offset on the device (always 0 since the offset is internally managed by the driver, the driver doesn’t have to know the size of internal device buffers) and the bytes requested (initially the whole size of the buffer).
Upon reception of the ReadFile, the request is not completed and is only marked as pending since no data is available for the moment (data ready counter is 0).
A ‘Data ready’ MSI is received by the driver.
Inside the ISR, the driver queries the device to know how much data is ready and increments its data ready counter accordingly.
Inside the DPC, the driver sees that a ReadFile is pending and starts a DMA.
A ‘DMA done’ MSI is received by the driver.
Inside the ISR, the driver does nothing since it already knows the source of the interrupt (through the message ID parameter).
Inside the DPC, the driver sees that a DMA transaction is pending and completes the associated ReadFile request (but with a NumberOfBytesRead lower than the specified NumberOfBytesToRead).
The application receives the event associated with the Overlapped I/O and sees that the NumberOfBytesRead is lower than the NumberOfBytesToRead.
The application immediately performs a new ReadFile on the driver, specifying the write pointer on the buffer (base pointer + NumberOfBytesRead), the read offset on the device (always 0 since the offset is internally managed by the driver, the driver doesn’t have to know the size of internal device buffers) and the bytes requested (size of the buffer - NumberOfBytesRead).
… until the end of the acquisition.
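The sequence above can be modelled in plain C to check the bookkeeping. This is only a sketch, not driver code: the `acq_state` type and the `acq_*` functions are hypothetical names, and starting the DMA right away when a ReadFile arrives while data is already counted as ready is an assumption the steps above do not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the driver state (not KMDF code). */
typedef struct {
    size_t data_ready;    /* bytes the device has reported as ready       */
    bool   read_pending;  /* a ReadFile request is waiting for data       */
    size_t read_capacity; /* NumberOfBytesToRead of the pending request   */
    size_t dma_bytes;     /* size of the DMA in flight, 0 if none         */
} acq_state;

/* Start a DMA if a read is pending, data is ready and no DMA is in flight. */
static void try_start_dma(acq_state *s)
{
    if (s->read_pending && s->data_ready > 0 && s->dma_bytes == 0) {
        s->dma_bytes = s->data_ready < s->read_capacity
                     ? s->data_ready : s->read_capacity;
        s->data_ready -= s->dma_bytes;
    }
}

/* DeviceIoControl: enter Acquisition mode, data ready counter set to 0. */
void acq_start(acq_state *s)
{
    assert(s != NULL);
    s->data_ready = 0;
    s->read_pending = false;
    s->read_capacity = 0;
    s->dma_bytes = 0;
}

/* ReadFile arrives. Returns false if a read is already pending (one at a time). */
bool acq_on_read(acq_state *s, size_t bytes_to_read)
{
    if (s->read_pending || bytes_to_read == 0)
        return false;
    s->read_pending = true;
    s->read_capacity = bytes_to_read;
    try_start_dma(s);          /* assumption: use data that is already ready */
    return true;
}

/* 'Data ready' MSI: the ISR updates the counter, the DPC may start a DMA. */
void acq_on_data_ready(acq_state *s, size_t bytes)
{
    s->data_ready += bytes;
    try_start_dma(s);
}

/* 'DMA done' MSI: complete the pending read; returns NumberOfBytesRead. */
size_t acq_on_dma_done(acq_state *s)
{
    size_t done = s->dma_bytes;
    if (done == 0)
        return 0;              /* spurious: no DMA was in flight */
    s->dma_bytes = 0;
    s->read_pending = false;
    s->read_capacity = 0;
    return done;
}
```

In this model a read for the whole buffer followed by a 1000-byte 'Data ready' completes with NumberOfBytesRead = 1000, after which the application has to post the next ReadFile itself.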
You see the drawback of this method: as the number of bytes produced by the PCIe device is not known in advance, I’m forced to trigger only one ReadFile at a time.
And as the application buffer is a circular buffer, the ReadFile will specify a NumberOfBytesToRead that will be lower and lower until the end of the buffer is reached.
It means that the last ReadFile can lead to a small and non-optimal DMA transfer (in terms of bandwidth usage).
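The wrap-around arithmetic behind this drawback can be sketched as follows. The `ring_segment` type and `ring_split` function are hypothetical; the sketch only shows how a transfer that crosses the end of the ring breaks into a tail piece and a head piece, which is the split that a single contiguous ReadFile cannot express.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous piece of the ring buffer (hypothetical helper type). */
typedef struct {
    size_t offset;
    size_t length;
} ring_segment;

/* Split a write of `length` bytes starting at `write_off` in a ring of
 * `ring_size` bytes into at most two contiguous segments.
 * Returns the number of segments stored in `out` (0, 1 or 2). */
size_t ring_split(size_t ring_size, size_t write_off, size_t length,
                  ring_segment out[2])
{
    assert(ring_size > 0 && write_off < ring_size && length <= ring_size);
    if (length == 0)
        return 0;

    size_t tail = ring_size - write_off;   /* contiguous bytes before the wrap */
    if (length <= tail) {
        out[0] = (ring_segment){ write_off, length };
        return 1;
    }
    out[0] = (ring_segment){ write_off, tail };     /* up to the end of the ring */
    out[1] = (ring_segment){ 0, length - tail };    /* remainder from the start  */
    return 2;
}
```

With one ReadFile at a time, only the first segment can be requested, so a packet that arrives near the end of the ring is delivered as a short tail transfer followed by a second request.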
My concern is that I want to stick to the “KMDF way of doing things” and use only proven “design patterns” for this software layer, so that I can avoid as many bugs as possible.
Does this approach sound to you like a “KMDF design pattern”, or is it weird?
Is there a way to use ReadFileScatter API with a non-storage device like this to avoid the small DMA issue above?
Upon ReadFile call, the framework will lock the user buffers pages in physical memory, right?
Is there a way to keep them locked during the acquisition, so that I can avoid this extra-overhead on each ReadFile call?
> Upon ReadFile call, the framework will lock the user buffers pages in physical memory, right? Is there a way to keep them locked during the acquisition, so that I can avoid this extra-overhead on each ReadFile call?
You could complete the requested IRP only when the number of bytes left to read reaches zero. The application buffer would stay locked all this time. Is that a problem for your design?
It would lead to latency issues; I must be able to transfer and then process the data as soon as they are made available by the device.
In fact, I think the best idea for my problem is to give control of the acquisition buffer to the driver:
The application creates the buffer and sends the pointer to the driver.
The driver creates an MDL from the user pointer, locks it in physical memory and manages the DMA transfers internally without requiring intervention from the application.
When the application is ready to process new data, it sends an IOCTL to the driver containing its read pointer on the buffer.
Upon receiving the IOCTL, the driver updates its internal read pointer (for the flow control): if data is already available in the buffer, it replies immediately with its write pointer, else it marks the IRP as pending until new data is available.
This way, the driver can use IRPs to signal data to the user side, without requiring named events or named semaphores.
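The pointer exchange described above can be modelled in a few lines of C. This is a sketch under assumptions, not driver code: `flow_ctl` and the `flow_*` functions are hypothetical, and positions are kept as free-running byte counters (the offset in the ring is the counter modulo the ring size) so that a full ring and an empty ring stay distinguishable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow-control state kept by the driver. */
typedef struct {
    uint64_t ring_size;
    uint64_t written;       /* total bytes DMA'd into the ring            */
    uint64_t consumed;      /* total bytes the application has released   */
    bool     ioctl_pending; /* an IOCTL is waiting for new data           */
} flow_ctl;

/* Bytes the device may still write without overrunning unread data. */
uint64_t flow_free_space(const flow_ctl *f)
{
    return f->ring_size - (f->written - f->consumed);
}

/* IOCTL from the application carrying its read position.
 * Returns true and sets *write_pos if it can complete immediately,
 * otherwise marks the request as pending and returns false. */
bool flow_on_ioctl(flow_ctl *f, uint64_t app_consumed, uint64_t *write_pos)
{
    assert(app_consumed >= f->consumed && app_consumed <= f->written);
    f->consumed = app_consumed;
    if (f->written > f->consumed) {
        *write_pos = f->written;
        return true;
    }
    f->ioctl_pending = true;
    return false;
}

/* A DMA of n bytes finished. Returns true if a pended IOCTL should now be
 * completed with *write_pos as its reply. */
bool flow_on_dma_done(flow_ctl *f, uint64_t n, uint64_t *write_pos)
{
    assert(n <= flow_free_space(f));
    f->written += n;
    if (f->ioctl_pending && n > 0) {
        f->ioctl_pending = false;
        *write_pos = f->written;
        return true;
    }
    return false;
}
```

The read position sent by the application is what lets the driver compute `flow_free_space` and hold back the device when the consumer falls behind.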
What do you think of that?
But I have some questions:
Is it possible for a driver to construct an MDL based on a user space pointer?
Is it possible to do that without mapping the corresponding buffer in kernel space (because my driver doesn’t need to access the buffer through the CPU)?
To be clearer, what I would like to do is send a structure like this with the IOCTL:
struct
{
    unsigned int parameter1;
    unsigned int parameter2;
    void* user_buffer;
};
And make the driver retrieve the MDL associated with user_buffer without mapping user_buffer in kernel space.
Thank you.
Sorry, I found this entry too late: http://www.osronline.com/showThread.cfm?link=131633
IoAllocateMdl() + MmProbeAndLockPages() seems to do the job.
Can you confirm that I can pass a user-space pointer to IoAllocateMdl()?
This is not clearly stated inside the documentation…
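Whatever routine builds it, an MDL describes the pages backing a virtual range, so the cost of locking grows with the number of pages the buffer spans rather than with its byte length alone. The sketch below only illustrates that count; `pages_spanned` is a hypothetical helper similar in spirit to the WDK's ADDRESS_AND_SIZE_TO_SPAN_PAGES macro, and the 4 KiB page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */

/* Number of pages touched by the range [va, va + length). */
size_t pages_spanned(uintptr_t va, size_t length)
{
    if (length == 0)
        return 0;
    assert(va <= UINTPTR_MAX - (length - 1));   /* range must not wrap */

    uintptr_t first = va / MODEL_PAGE_SIZE;                 /* first page index */
    uintptr_t last  = (va + (length - 1)) / MODEL_PAGE_SIZE; /* last page index  */
    return (size_t)(last - first + 1);
}
```

A buffer that is exactly one page long but not page-aligned spans two pages, which is why the starting address matters as well as the length.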
Did you take a look at the buffering methods? METHOD_IN_DIRECT or METHOD_OUT_DIRECT will do what you want for you.
Gary G. Little
----- Original Message -----
From: “vincent saint-martin” To: “Windows System Software Devs Interest List” Sent: Tuesday, January 25, 2011 3:04:57 AM Subject: RE:[ntdev] Asynchronous acquisition of variable-sized packets
Hi Gary,
You mean performing a DeviceIoControl() from the user application with the address of my buffer as lpInBuffer parameter?
And then calling WdfRequestRetrieveInputWdmMdl() from inside the EvtIoDeviceControl() callback of my driver?
If I read the documentation for WdfRequestRetrieveInputWdmMdl(), it states that “The driver must not access a request’s MDL after completing the I/O request”, so I guess the MDL pointer is not valid anymore after completing the DeviceIoControl().
Or must I copy the MDL structure? I’m not sure it’s something that can be done…
Vincent
For DeviceIoControl, you optionally define two buffers, sometimes called Input and Output. When using METHOD_IN_DIRECT or METHOD_OUT_DIRECT you must define both, and I think of them as Input=Command and Output=Data. Use the command buffer to control the transport; the data buffer then contains the data either written to or read from the device. The point is that the I/O Manager has already probed and locked the data buffer, created the MDL, etc., so you don’t have to. The MDL, if you need it, is available in the IRP. I think the easiest way to solve your issue is to send multiple DeviceIoControl requests to the driver. When the driver fills a buffer, it simply completes the request associated with that buffer and begins filling the buffer from the next request. The application gets the completed I/O and sends back another I/O request with another buffer to be filled.
And yes, once you complete, or pass off, the WDF request or the IRP, you cannot touch any part of the request or IRP.
Gary G. Little
----- Original Message -----
From: “vincent saint-martin” To: “Windows System Software Devs Interest List” Sent: Tuesday, January 25, 2011 7:53:34 AM Subject: RE:[ntdev] Asynchronous acquisition of variable-sized packets
In fact, this was the way I intended to implement the communication between my application and my driver: using ReadFile() in place of DeviceIoControl(), but the idea is the same.
The application posts several ReadFile() requests and lets the driver complete them one after the other as data becomes ready inside the device.
But the problem is that the device specification changed a little and now the data packets don’t have a fixed size (an interrupt can be for 1 MB or 1.1 MB for example); and once signaled, the data available on the device must be provided to the application layer as soon as possible.
I cannot wait for a fixed-size chunk of data to be available.
So issuing several fixed-size ReadFile() requests is not possible.
That’s why I asked if it was possible to get the MDL representing the user buffer; that way, I could manage the DMAs directly inside the driver without a ReadFile() request from the application layer, and only send updates of the write pointer to the application (using an inverted call or I/O completion).
Is there something going on in the hardware you have not mentioned? Once the HW signals done, you should be able to complete a request and queue up the next buffer for the next receive cycle. Whether a buffer is “full” or contains only 1 byte is irrelevant. Size your buffers such that they can hold the largest transfer, e.g. 1.1 MB.
You can acquire the MDL in your driver, and using WDM you may even manipulate that MDL, but NOT after you complete the request containing that MDL.
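The queue-of-buffers scheme described here can be sketched as a plain C model. It is not driver code: `req_queue`, `req_enqueue` and `req_complete_next` are hypothetical names, and the fixed queue depth is an arbitrary choice for the illustration. Each pending request owns a buffer sized for the largest packet and is completed with however many bytes actually arrived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8   /* arbitrary depth for this model */

/* One queued request and the size of the buffer it carries. */
typedef struct {
    int    id;
    size_t capacity;
} pending_req;

/* FIFO of requests waiting for data. */
typedef struct {
    pending_req slots[MAX_PENDING];
    size_t      head;
    size_t      count;
} req_queue;

/* The application posts a request with a buffer of `capacity` bytes. */
bool req_enqueue(req_queue *q, int id, size_t capacity)
{
    assert(q != NULL);
    if (q->count == MAX_PENDING)
        return false;
    q->slots[(q->head + q->count) % MAX_PENDING] = (pending_req){ id, capacity };
    q->count++;
    return true;
}

/* A packet of `packet_len` bytes has been transferred: complete the oldest
 * request with that byte count. Returns the request id, or -1 if no request
 * is queued or the packet does not fit in its buffer. */
int req_complete_next(req_queue *q, size_t packet_len, size_t *bytes_read)
{
    if (q->count == 0)
        return -1;
    pending_req *r = &q->slots[q->head];
    if (packet_len > r->capacity)
        return -1;
    *bytes_read = packet_len;      /* may be far less than the capacity */
    int id = r->id;
    q->head = (q->head + 1) % MAX_PENDING;
    q->count--;
    return id;
}
```

The trade-off is visible in the model: every queued buffer reserves the worst-case size even when the packet that lands in it is small.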
Gary G. Little
----- Original Message -----
From: “vincent saint-martin” To: “Windows System Software Devs Interest List” Sent: Tuesday, January 25, 2011 8:27:19 AM Subject: RE:[ntdev] Asynchronous acquisition of variable-sized packets
Thank you Gary and Maxim.
I searched for IoAllocateMdl inside the WDK samples.
There is such an example inside \WinDDK\7600.16385.1\src\general\ioctl\wdm\sys\sioctl.c
IoAllocateMdl() on line 441 and then MmProbeAndLockPages() on line 458
Gary,
The application receive buffer is a ring buffer where the received data are stored like in a bytestream.
The application consumers are free to access the data with the granularity they want.
And I don’t want to allocate memory that can be wasted because of a small DMA transfer, as I have strong memory usage requirements.
Vincent
Vincent,
You could do the following:
-Create a thread. In this thread, create an IOCTL request that contains the user ring buffer. This request stays in the driver until the number of bytes left to read reaches zero.
-Create a second thread in the application. In this thread, the application creates another IOCTL whose output data is the following structure:
struct
{
    long current_offset;
    long size_of_reading_data; //Here is the size of the last DMA transaction
};
As soon as the driver gets new data, it completes the second IOCTL, returning the current offset of the data to read and the size of the last transaction. Once the application has finished all necessary work, it sends the second IOCTL again. Meanwhile, the first IOCTL, which contains the ring buffer, stays pending. When the driver reaches zero in the ring buffer, it completes the first IOCTL. Of course, you should provide proper synchronization between the two threads in the application.
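On the application side, handling one such notification might look like the sketch below. The `dma_notice` name and the `ring_copy_out` helper are hypothetical, and it is an assumption that `current_offset` is the offset in the ring where the last transaction starts; the region may wrap past the end of the ring, so it is copied out in up to two pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Output of the second IOCTL, with the fields of the structure above. */
typedef struct {
    long current_offset;        /* assumed: start of the last transaction */
    long size_of_reading_data;  /* size of the last DMA transaction       */
} dma_notice;

/* Copy the region announced by `n` out of the ring into `dst`.
 * Returns the number of bytes copied. */
size_t ring_copy_out(const unsigned char *ring, size_t ring_size,
                     const dma_notice *n, unsigned char *dst)
{
    assert(n->current_offset >= 0 && n->size_of_reading_data >= 0);
    size_t off = (size_t)n->current_offset;
    size_t len = (size_t)n->size_of_reading_data;
    assert(off < ring_size && len <= ring_size);

    size_t first = ring_size - off;     /* bytes available before the wrap */
    if (first > len)
        first = len;
    memcpy(dst, ring + off, first);         /* piece up to the end of the ring */
    memcpy(dst + first, ring, len - first); /* wrapped piece, possibly empty   */
    return len;
}
```

Consumers that can work in place would skip the copy and just walk the two pieces; the copy is used here only to keep the example short.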
Vincent,
You didn’t provide many details about your design, but I would say that your design would be the same as mine if you use more than one IRP. One IRP would be used for holding the ring buffer and the others for updating the current offset in the buffer.