BuildScatterGatherList with DMA remapping, for bus-master DMA device

I am enabling DMA remapping in a Windows driver (mostly WDF-based) for a PCI bus-master device.
I am getting BSODs (including, not surprisingly, DRIVER_VERIFIER_DMA_VIOLATION).

Could you help me confirm (or point out any problems you see) whether the approach below, which was proven to work without DMA remapping, is still valid when DMA remapping is enabled?
I believe a redesign is needed, so I would appreciate your suggestions on that as well.

The current outline is as follows:
On device initialization, WdfDeviceSetAlignmentRequirement is called with (PAGE_SIZE - 1).
On device initialization, the DMA enabler is created with WdfDmaEnablerCreate and WdfDmaProfileScatterGather64.
On device initialization, the driver allocates a few common buffers which are used to program the device.

  1. A user-space app sends a direct-I/O IOCTL request (METHOD_OUT_DIRECT, FILE_ANY_ACCESS) with a previously allocated (malloc) memory buffer.
  2. Driver obtains the MDL (WdfRequestWdmGetIrp(request)->MdlAddress) and calls:
    a. WdfDmaEnablerWdmGetDmaAdapter
    b. CalculateScatterGatherList
    c. BuildScatterGatherList to obtain the ScatterGatherBuffer
  3. Driver iterates over the ScatterGatherBuffer to program the device DMA.
  4. Driver puts the address of each page from the ScatterGatherBuffer into the device configuration area in memory (a previously allocated common buffer).
  5. Driver calls PutScatterGatherList.
  6. Driver starts the device processing.
  7. During processing, the device both reads from and writes to this memory buffer via DMA.
  8. Device raises a completion interrupt.
  9. Driver completes the I/O request.
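In (heavily simplified) code, steps 2-4 look roughly like this. This is only a sketch: it will not compile outside the WDK, error handling and the CalculateScatterGatherList call that sizes the buffer are omitted, and HwProgramElement is a made-up placeholder for the device-specific write into the common buffer:

```c
/* DRIVER_LIST_CONTROL callback passed to BuildScatterGatherList. */
VOID
SgListReady(PDEVICE_OBJECT DeviceObject, PIRP Irp,
            PSCATTER_GATHER_LIST SgList, PVOID Context)
{
    if (SgList->NumberOfElements == 0) {
        /* This is the case that is currently NOT handled. */
    }

    for (ULONG i = 0; i < SgList->NumberOfElements; i++) {
        /* Step 4: copy the element's address/length into the device
           configuration area (the common buffer). Device-specific. */
        HwProgramElement(Context,
                         SgList->Elements[i].Address,
                         SgList->Elements[i].Length);
    }

    /* Step 5 (as currently written): release the list. */
    /* Step 6: start the device processing. */
}

/* Steps 2a-2c, at request time: */
PDMA_ADAPTER Adapter =
    WdfDmaEnablerWdmGetDmaAdapter(DmaEnabler, WdfDmaDirectionReadFromDevice);
PMDL Mdl = WdfRequestWdmGetIrp(Request)->MdlAddress;

Adapter->DmaOperations->BuildScatterGatherList(
    Adapter, DeviceObject, Mdl,
    MmGetMdlVirtualAddress(Mdl), MmGetMdlByteCount(Mdl),
    SgListReady, RequestContext,
    FALSE,                              /* WriteToDevice */
    SgListBuffer, SgListBufferLength);
```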

I have doubts about the direction of the transfer, which seems inconsistent to me:

  • The IOCTL request is ‘Out’ only (METHOD_OUT_DIRECT)
  • For WdfDmaEnablerWdmGetDmaAdapter the driver passes WdfDmaDirectionReadFromDevice
  • BuildScatterGatherList is called with WriteToDevice == FALSE, while PutScatterGatherList is called with WriteToDevice == TRUE
  • The device both reads from and writes to the memory buffer

In the BSOD scenario (buffer length 128 bytes) I can see that BuildScatterGatherList returns a ScatterGatherBuffer whose NumberOfElements is zero. The driver certainly does not handle this case properly, which may be the reason for the BSOD. But I don’t know why NumberOfElements == 0 can happen at all.
Apart from NumberOfElements == 0, BuildScatterGatherList returns successfully. The preceding call to CalculateScatterGatherList returned ScatterGatherListSize == 232 and NumberOfMapRegisters == 1.
Could you help me understand the reason for NumberOfElements == 0, and whether the driver can handle that case, i.e., still get a proper address with which to program the device for DMA?

What is more interesting, this driver seems to work fine (with DMA remapping enabled) for memory buffers larger than PAGE_SIZE.
For memory buffers <= PAGE_SIZE the driver sometimes also works fine, but only for the first I/O request sent. The device stays plugged in for the whole test and is not unplugged, but it may go to D3 between the first and second requests being sent.

Let me first be clear: If your driver doesn’t work properly with DMA remapping, it’s not written properly for any case in Windows. So, the driver is broken… not because of any special DMA-R requirements, but because it’s just basically broken. Having thus chastised you, let me hasten to add that there is always a chance that there’s a bug in the DMA-R code. It’s relatively new, after all.

OK… now that I got the lecture out of the way: Your flow of control sounds fundamentally correct. But, the devil is always in the details.

There is no valid case where you should get an SGL with ZERO elements… so something is wrong somewhere.

Soooo…what to do? You’ve got WDF verifier enabled. Good! Make sure you’ve got VERBOSE logging enabled. Get the debugger hooked up. Look at the WDF log when you get the verifier error. It can be surprisingly helpful. And carefully read the error message and stack dump from the crash. They try very hard to make these errors clearly point to a problem (these days).

But, in any case, be reassured that I don’t see anything fundamentally wrong with your approach. You’ve just got a bug or an issue somewhere.

Peter

Thanks for your answer! It is encouraging that not everything is broken :).
I experimented with commenting out step 5 (“Driver calls PutScatterGatherList”) and the results are better (it leaks, but the BSOD doesn’t appear as quickly), so I believe something is wrong with this step.
Moreover, after reading Driver Basics - DMA Concepts,
I suspect that the presented control flow has to be changed to postpone the call to PutScatterGatherList until after the device finishes the DMA transfers (i.e., step 8, the completion interrupt).
I know you said the “flow of control sounds fundamentally correct”, but could you take a look at it once again, please?

Bruno

So, let’s back up a bit and ask a somewhat more fundamental question: you seem to be mixing the WDF and WDM models here. Why?

And, yes… you call PutScatterGatherList after the transfer is done.

Peter

And, yes… you call PutScatterGatherList after the transfer is done.

If I may expand, you call it ONLY after the transfer is done. PSGL can cause some of the related data structures to be unlocked and unmapped, so if you’re using those structures, you’ll be in empty space.
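In sketch form, the ordering looks like this (names and the per-request context are hypothetical; HwProgramAndStart stands in for programming and starting the device):

```c
/* In the BuildScatterGatherList callback: program and start the device,
   and keep the SGL pointer around - do NOT put the list here. */
VOID
SgListReady(PDEVICE_OBJECT DeviceObject, PIRP Irp,
            PSCATTER_GATHER_LIST SgList, PVOID Context)
{
    PREQ_CONTEXT ctx = (PREQ_CONTEXT)Context;   /* hypothetical context */
    ctx->SgList = SgList;                       /* keep it alive for later */
    HwProgramAndStart(ctx, SgList);             /* hypothetical */
}

/* In the DpcForIsr, after the completion interrupt has fired: */
VOID
CompletionDpc(PREQ_CONTEXT ctx)
{
    /* Only now is it safe to release the list; after this call the
       underlying mappings may be unlocked and unmapped. */
    ctx->Adapter->DmaOperations->PutScatterGatherList(
        ctx->Adapter, ctx->SgList, FALSE /* WriteToDevice */);
    ctx->SgList = NULL;
    /* ...then complete the I/O request. */
}
```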

Thanks for the confirmation on PSGL.
As for the WDM/WDF mixture I don’t think there’s a valid excuse, it’s just badly programmed and definitely full WDF transition is on the todo list. Probably now is a good moment for that. Assuming a change from BSGL/PSGL to WdfDmaTransactionInitialize, etc., still I have doubts on the direction.

  1. As the device’s reads and writes are not separable from the driver’s perspective: is it safe to use only METHOD_OUT_DIRECT, or should a second, parallel direct-I/O request with METHOD_IN_DIRECT (on the same memory buffer) be created, to be 100% compliant with the Windows DMA abstraction?

  2. What about the direction of the transfer when initializing the transaction (e.g., the DmaDirection parameter of WdfDmaTransactionInitialize)? Should the driver create two transactions (read & write) before starting the device just to be sure it is 100% compliant, or can I assume that a single transaction with DmaDirection == WdfDmaDirectionReadFromDevice is sufficient, despite the fact that the device both reads and writes during the transaction?
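In sketch form, what I have in mind for the WDF-only version is roughly this (no error handling; the transaction handle would live in a per-request or per-device context):

```c
/* At request time: */
WDFDMATRANSACTION txn;
status = WdfDmaTransactionCreate(DmaEnabler, WDF_NO_OBJECT_ATTRIBUTES, &txn);
status = WdfDmaTransactionInitializeUsingRequest(
             txn, Request, EvtProgramDma, WdfDmaDirectionReadFromDevice);
status = WdfDmaTransactionExecute(txn, NULL);

/* The framework calls this when the scatter/gather list is ready: */
BOOLEAN
EvtProgramDma(WDFDMATRANSACTION Transaction, WDFDEVICE Device,
              WDFCONTEXT Context, WDF_DMA_DIRECTION Direction,
              PSCATTER_GATHER_LIST SgList)
{
    /* Program the device from SgList and start processing. */
    return TRUE;
}

/* In EvtInterruptDpc, after the completion interrupt: */
NTSTATUS dmaStatus;
(VOID) WdfDmaTransactionDmaCompleted(txn, &dmaStatus);
WdfObjectDelete(txn);
/* ...then complete the request. */
```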

Bruno

METHOD_OUT_DIRECT is fine for both read and write. In Windows, we don’t have write access without read access… and memory access permission is what this is about.

The second question… about dual-direction DMA… is much more difficult. So, your device does both read and write within each transaction? Normally, what I see is a device that has a set of shared data structures that are set up initially and that are used as descriptors to control DMA operations. This area is indeed periodically read/written (every few ms from the device side) with continuous-mode DMA. Once that area is set up, the driver then services newly arriving read and write requests (“packet-based” operations) from users by making entries in the previously established data structures.

Your device does NOT work this way? Can you tell us a BIT more about what your device does and how it works, so we can be sure we’re steering you down the right road?

Peter

Thanks for the confirmation.
As for the details…

This device could be seen as a specialized computational device, and it is programmed via a common buffer, so no problem at this point.
However, ‘the program’ can be rather complicated - e.g., it can have thousands of steps. The inputs and outputs for every step can be chosen arbitrarily via ‘the program’. The ‘program’ is created by the user; he can choose the types, sizes, number, and pointers of the steps.
The ‘program’ is executed in a single step from the driver’s perspective: it is started by some register write, and then the driver waits for a single completion interrupt.
AFAIK, behind the scenes there are many DMA transfers done by the device to realize the ‘program’.

One step (e.g., an addition) can produce results that some later step (e.g., a multiplication) may use as an input.
In theory, the user buffer could be logically separated (into R, W, and RW regions), although that would massively complicate the driver.

Bruno

There are some pretty serious security implications here. User mode code is easily subverted. I hope you have included methods to allow the driver to scan the “program” to validate that all potential DMA actions lie within the user buffer. Consider what fun an attacker could have by altering your program.

What you’re describing does not fit neatly into the KMDF DMA abstraction. It would not work, for example, in a system using map registers, nor in a system with an IOMMU, nor in a system where you have to worry about cache coherency.

If I were doing a design like this, I’d model it after GPUs. So, you’d use DMA to transfer all the incoming data into device memory, then let the program operate on device memory to its heart’s content producing output in some device location, then DMA the final results back to the app. That’s certainly going to be much higher performance than transferring buffers over the bus all the time.

Sounds like a type of FPGA accelerator. We’ve done one or two of those. And assuming I remember correctly, we did it exactly the way Mr. Roberts described: The data gets moved to the DRAM associated with the FPGA card, and the work is done that way.

Definitely one of the more complex projects I’ve worked on in a while.

Peter

Thanks

There are some pretty serious security implications here.
Hmm… there are security measures for that in the HW/SW design.
What you’re describing does not fit neatly into the KMDF DMA abstraction.
You mean the device behavior, the driver, or both?
It would not work, for example, in a system using map registers, nor in a system with an IOMMU, nor in a system where you have to worry about cache coherency.
I have to consult on this point; probably I’ve not revealed all the device capabilities.
Could you elaborate a bit on the IOMMU problems you see? I was pretty sure that the SGL we get in the driver contains valid addresses (w.r.t. IOMMU translation) and the device can use it.
Sounds like a type of FPGA accelerator
It may be seen like that.
The data gets moved to the DRAM associated with the FPGA card
The device may not have DRAM of the required size (e.g., up to hundreds of MB) associated.

I was pretty sure that the SGL we get in the driver contains valid addresses (w.r.t. IOMMU translation) and the device can use it.

As one example, let’s say your device didn’t understand 64-bit physical addressing. In that case, you’d be using the system’s “bounce buffers”. That means the system has to know which direction each request is going, because it has to pre-process host-to-card transfers, and it has to post-process card-to-host transfers. If a single buffer can be used both ways, that’s a problem.

I don’t have direct experience with any modern IOMMUs, but I can imagine the same kind of thing might be needed in some special cache coherency situations.

The device may not have DRAM of the required size (e.g., up to hundreds of MB) associated.

A gigabyte DDR chip is about $20. I think you’re going to be disappointed by the performance if every instruction has to block to wait for bus transfers before and after.

Thanks for your valuable comments.
AFAIK the device understands 64-bit addressing, so at least that can be assumed to be no problem.
It’s probably true that attaching DDR would be a good idea; however, AFAIK the problem is that the main device use-case currently assumes the device to be physically integrated into the platform/SoC.
Nevertheless, this discussion convinced us that it is highly probable that it

does not fit neatly into the KMDF DMA abstraction
So we’ve just requested support from MSFT and will see what they propose.