Implement DMA in packet-mode in a PCI/VME Bridge

Hello Tim,
Thank you very much, you have pointed out the issue: in my EvtIoDeviceControl() callback, it was possible for me to call WdfRequestComplete() before WdfDmaTransactionDmaCompleted() had been called.
In my implementation, I complete the request only in the ISR/DPC when WdfDmaTransactionDmaCompleted() returns TRUE; if any error occurs before the end of the DMA, I call WdfObjectDelete() and then WdfRequestCompleteWithInformation().
Is this the right way to proceed?

Do you mean WdfObjectDelete on the transaction object? That’s not right; you need to call WdfDmaTransactionDmaCompletedFinal to tell the system how much did get done. That lets the framework clean up the resources it allocated. If that returns true, then you complete the request.
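In rough sketch form, the DPC-side completion path looks something like this (a minimal sketch, assuming the usual wdf.h includes and a device context that stashes the transaction and request handles; GetDeviceContext and the context fields are placeholders for your own code):

    VOID EvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
    {
        UNREFERENCED_PARAMETER(AssociatedObject);

        PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
        NTSTATUS status;

        // Report the completed transfer to the framework. If the transaction
        // needs more transfers, the framework programs the next one and
        // returns FALSE; complete the request only when it returns TRUE.
        if (WdfDmaTransactionDmaCompleted(ctx->DmaTransaction, &status)) {

            size_t bytesDone =
                WdfDmaTransactionGetBytesTransferred(ctx->DmaTransaction);

            // Let the framework free the resources it allocated for the
            // transaction before the request goes away.
            WdfObjectDelete(ctx->DmaTransaction);
            ctx->DmaTransaction = NULL;

            WdfRequestCompleteWithInformation(ctx->CurrentRequest,
                                              status, bytesDone);
            ctx->CurrentRequest = NULL;
        }
    }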

There’s a very nice article on completing DMA transactions here: https://docs.microsoft.com/en-us/windows-hardware/drivers/wdf/completing-a-dma-transfer

Hello Tim,
I have not succeeded in completing the DMA transaction in the DPC; I get a bugcheck with the following error:
DRIVER_VERIFIER_IOMANAGER_VIOLATION (c9)
The IO manager has caught a misbehaving driver.
Arguments:
Arg1: 000000000000000e, Irql > DPC at IoCompleteRequest
Arg2: 0000000000000009, the current Irql
Arg3: ffff9d077d39cd80, the IRP
Arg4: 0000000000000000

Reading the article: if the full DMA length has not been reached, do I have to call WdfDmaTransactionDmaCompletedWithLength or WdfDmaTransactionDmaCompletedFinal? My hardware does not report any error, but I can see that the full DMA length is not reached.

If you call WdfDmaTransactionDmaCompletedWithLength and the length is less than the full transfer, the system will submit another transfer to complete the rest. If you need to abort the transfer and can’t do any more, then you call WdfDmaTransactionDmaCompletedFinal.
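In sketch form (the hardware reads are hypothetical placeholders for however your device reports its transfer count and errors):

    // In the DPC, after the device stops short of the requested length:
    size_t bytesDone = ReadDeviceTransferCount(ctx);    // placeholder
    BOOLEAN fatalError = DeviceReportsFatalError(ctx);  // placeholder
    BOOLEAN done;
    NTSTATUS status;

    if (!fatalError) {
        // Partial transfer: the framework resumes the transaction and
        // submits another transfer for the remaining bytes.
        done = WdfDmaTransactionDmaCompletedWithLength(ctx->DmaTransaction,
                                                       bytesDone, &status);
    } else {
        // Abort: no further transfers are attempted; bytesDone tells the
        // framework how much actually got done.
        done = WdfDmaTransactionDmaCompletedFinal(ctx->DmaTransaction,
                                                  bytesDone, &status);
    }
    // Either way, complete the request only if 'done' comes back TRUE.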

The BSOD you got is because you tried to call IoCompleteRequest in your ISR. You need to defer almost all of your processing to your DPC. The ISR should do very little more than acknowledge the interrupt and trigger your DPC.
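Something on this order (a sketch; the register names are made up, so substitute your device’s interrupt status/acknowledge registers):

    BOOLEAN EvtInterruptIsr(WDFINTERRUPT Interrupt, ULONG MessageID)
    {
        UNREFERENCED_PARAMETER(MessageID);

        PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));

        // Hypothetical status register read: is this interrupt ours?
        ULONG intStatus = READ_REGISTER_ULONG(&ctx->Regs->IntStatus);
        if ((intStatus & INT_DMA_DONE) == 0) {
            return FALSE;                     // not our interrupt
        }

        // Acknowledge at the device and stash the status for the DPC.
        WRITE_REGISTER_ULONG(&ctx->Regs->IntAck, intStatus);
        ctx->LastIntStatus = intStatus;

        // Everything else -- completing the transaction, completing the
        // request -- happens in the DPC, at DISPATCH_LEVEL.
        WdfInterruptQueueDpcForIsr(Interrupt);
        return TRUE;
    }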

Thank you very much Tim.
I have fixed my issue and my DMA is working now!!
Thank you again for your great advice!!

Debugging my DMA, I print the Address in SgList.Elements[0], and it is not incremented on each call of EvtProgramDma (after each transfer completes in the DPC). I do not understand why the buffer address is not incremented for each DMA transfer. Could you explain why?

Is your device limited to 32-bit physical addresses? If so, then the operating system has allocated a “bounce buffer” below the 4GB mark, and is copying the user’s buffer into that space. So, each call to EvtProgramDma is getting the same physical buffer with different data.
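For reference, the place you’d observe this is your EvtProgramDma callback; a sketch (the device-register programming is hypothetical):

    BOOLEAN EvtProgramDma(
        WDFDMATRANSACTION Transaction,
        WDFDEVICE Device,
        WDFCONTEXT Context,
        WDF_DMA_DIRECTION Direction,
        PSCATTER_GATHER_LIST SgList)
    {
        UNREFERENCED_PARAMETER(Transaction);
        UNREFERENCED_PARAMETER(Context);
        UNREFERENCED_PARAMETER(Direction);

        PDEVICE_CONTEXT ctx = GetDeviceContext(Device);

        // On a 32-bit-limited device in a >4GB system, this physical address
        // is the bounce buffer -- it can be identical on every call, even
        // though the framework copies fresh data into it each time.
        KdPrint(("SG[0]: PA=0x%I64x, Len=%lu\n",
                 SgList->Elements[0].Address.QuadPart,
                 SgList->Elements[0].Length));

        // Hypothetical device programming for the first SG element.
        WRITE_REGISTER_ULONG(&ctx->Regs->DmaAddress,
                             SgList->Elements[0].Address.LowPart);
        WRITE_REGISTER_ULONG(&ctx->Regs->DmaLength,
                             SgList->Elements[0].Length);
        WRITE_REGISTER_ULONG(&ctx->Regs->DmaControl, DMA_GO);

        return TRUE;
    }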

Yup. What Mr. Roberts said. Exactly.

Peter

Yes, my device is limited to 32-bit physical addresses. So I understand that the “bounce buffer” is at the same address, but the data is copied into this buffer for each transfer.
Thank you for your reply.

There’s no excuse for any hardware design from the last 20 years to use 32-bit physical addressing. The hardware engineers should have their fingernails removed.

(Tangent: Mr. Roberts is well known for (quite rightly) deriding DMA designs that do not support scatter/gather. I had to laugh when I was watching a video of Darryl Havens doing a presentation on the I/O Subsystem from 1993 (the first PDC where the details of Windows NT were discussed) where he said “Windows NT basically assumes everything supports scatter/gather. More and more hardware has that capability these days.” I immediately thought of Mr. Roberts. Later in the same presentation, Mr. Havens also was remarking on devices that only supported 24-bit addressing…)

Supporting scatter/gather and requiring 64-bit addressing are not the same thing.
It is not wise (neither from a security POV nor from an electronic-engineering POV) to give a device full access to the host address space.
The host should have I/O map registers (an IOMMU or whatever) that map limited pieces of host memory for the device.
For many reasonable devices, 24-bit-sized fragments (even 16-bit) are enough.
When the host address space grows beyond 64 bits, this approach will accommodate old 64-bit devices, and 32-bit ones as well.

Consider the cost of the extra transistors to do this over the production run of the device versus the cost of the development. Add in the unnecessary performance cost and the fact that nearly every system has more than 4 GB of RAM. Transfer sizes are a different thing from being able to access the whole address space.

@MBond2 What unnecessary performance cost?
Besides implementing 64-bit address registers in the device (peripheral), consider that modern buses are serial, such as PCIe.
The overhead of moving extra address bytes back and forth is a performance cost, both in time and energy.
– pa

… consider that modern buses are serial, such as PCIe.

You’re surely not arguing that this is a cause of lower performance, are you? The whole reason we moved from parallel PCI to serial PCIe is that you get SIGNIFICANTLY higher bits-per-second rates with a serial scheme than you can with a parallel scheme at the same cost. The current generation of PCIe can move 8 GB per second across one lane. That’s nearly as fast as memory-to-memory transfers.

I’m not sure I understand what you are saying here. Yes, if a device has to receive or send only 32 bits versus 64 bits, it can take less time or power to do so, but it seems impossible that overall system efficiency will be better because a single transfer can omit those bits, when every IO to that device needs a bounce buffer (and the corresponding memory copy).

My, how we can drift to… you know… pretty silly topics here.

Supporting scatter/gather and requiring 64-bit addressing are not the same thing.

No shit. Really??

Are you, Mr. @Pavel_A, actually, seriously, arguing that device manufacturers should build devices with less than 64-bit addressing in the 21st century? If you are, that’s an astonishingly ridiculous argument… considering most DMA transfers do NOT involve use of the IOMMU, and most operating systems don’t enable IOMMU use by default (for internal PCIe devices in some OSes, for any PCIe devices in other OSes).

The overhead of moving extra address bytes back and forth is a performance cost, both in time and energy.

Are you trying to argue that doing a DMA transfer using a short address and via an IOMMU takes less time and overhead than a DMA transfer with a “direct” 64-bit address on the bus without an IOMMU? Because, if that’s what you’re trying to argue, I’d suggest you do some checking around. There’s a reason most operating systems don’t enable use of the IOMMU by default… and why when they DO enable it they enable it in limited – sometimes VERY limited – situations.

Peter

@MBond2 said:
I’m not sure I understand what you are saying here. Yes, if a device has to receive or send only 32 bits versus 64 bits, it can take less time or power to do so, but it seems impossible that overall system efficiency will be better because a single transfer can omit those bits, when every IO to that device needs a bounce buffer (and the corresponding memory copy).

Pardon me, I obviously failed to explain. If the host does address translation from the device’s address space to the host’s large address space (64 bits or more) by means of an IOMMU or hardware “mapping registers”, no bounce buffer is needed and there are no security compromises.
And the device will support SG.
Of course, implementing such mapping registers increases the cost of the CPU, but, these days, if something has security/privacy implications, CPU makers should be willing to do it.

– pa

Thank you, Mr. @“Peter_Viscarola_(OSR)”, for not passing me over with your kind attention.

Yes, I remember what you said about IOMMUs. Really.
In recent years I have worked with customers that make their own silicon: wireless, AR glasses, and others.
Most of their newly designed devices do not have 64-bit DMA.
Or they had 64-bit support at the FPGA stage, but reduced it to 32 bits in the ASIC.
Otherwise they are quite standard PCIe. I’ve made simple Windows drivers for these things, to pass the PCIe compliance test, which runs on Windows.

The engineers are not crazy or incompetent. The reason is that none of these devices was specifically designed for Windows PCs or servers. Their intended host systems are either Linux on x64 platforms or new ARMs.
Linux lets us allocate shared buffers and all other needed memory in the low 32-bit address space, so it never was an issue.
For devices such as glasses, the designers wanted to squeeze every bit of energy and chip area. It’s the weight of the battery that users haul around on them. The savings that I mentioned are noticeable.

So how is this relevant to Windows at all?
Most of these devices are cool and quite usable on a Windows machine.
These companies wanted to offer evaluation kits and SDKs on Windows, but decided not to go for it because of complications (driver signing, among others).
For Wi-Fi drivers, SG is often not required; a shared buffer is enough (TL;DR), so it’s OK.
OTOH, SG is needed for the glasses (video streaming) - but the host does not run Windows, so no one cares.


The bottom line is, folks make all sorts of interesting hardware, but Windows users have no chance to get it without adapting the platform.
Apple these days makes their own silicon… how about Microsoft?

P.S. Yes, I’ve checked the PCIe spec. Packets with a 32-bit address are indeed 4 bytes shorter than packets with a 64-bit address, thus saving bus cycles and room in FIFOs.

Thank you, Mr. @Pavel_A for the kind favor of a reply to a topic that you yourself started.

And excellent work, I might add, changing your argument mid-stream when the original was found wanting. You went from arguing:

It is not wise (neither from a security POV nor from an electronic-engineering POV) to give a device full access to the host address space. The host should have I/O map registers (an IOMMU or whatever) that map limited pieces of host memory for the device.

and

The overhead of moving extra address bytes back and forth is a performance cost, both in time and energy.

… to arguing:

their intended host systems are either Linux on x64 platforms… Linux lets us allocate shared buffers and all other needed memory in the low 32-bit address space

which is a nicely dramatic change, given that most Linux platforms (such as RHEL) do not enable use of the IOMMU by default.

Let’s also be sure we note that devices with 64-bit addressing capabilities do not use TLP headers with 64-bit addresses in them unless the destination address actually requires it. So… a device that has 64-bit addressing capability, but that only does transfers to buffers below 0xFFFFFFFF, will always use 32-bit headers.

And, of course, Windows drivers that support DMA but not 64-bit addressing on x64 machines and that need to transfer data above the 32-bit max have MULTIPLE choices: They can let Windows just handle the problem for them (by doing intermediate buffering) or they can use the same solution we would use on Linux by allocating memory below the max memory address that’s supported by the device, and sharing that memory with the user. It’s not rocket science.
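For instance, the second option is just a common buffer; a sketch (sizes and context fields are placeholders):

    // Ask the framework for DMA memory the 32-bit device can actually
    // reach, instead of taking a bounce-buffer copy on every transfer.
    WDF_DMA_ENABLER_CONFIG dmaConfig;
    WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                                WdfDmaProfileScatterGather,  // 32-bit profile
                                MAX_TRANSFER_SIZE);          // placeholder

    NTSTATUS status = WdfDmaEnablerCreate(device, &dmaConfig,
                                          WDF_NO_OBJECT_ATTRIBUTES,
                                          &ctx->DmaEnabler);

    if (NT_SUCCESS(status)) {
        // The common buffer is allocated within the enabler's addressing
        // limit (below 4GB here), so no intermediate copy is needed.
        status = WdfCommonBufferCreate(ctx->DmaEnabler,
                                       SHARED_BUFFER_SIZE,   // placeholder
                                       WDF_NO_OBJECT_ATTRIBUTES,
                                       &ctx->CommonBuffer);
    }

WdfCommonBufferGetAlignedVirtualAddress and WdfCommonBufferGetAlignedLogicalAddress then give you the CPU’s and the device’s views of the same memory.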

Packets with a 32-bit address are indeed 4 bytes shorter than packets with a 64-bit address

Ah, yes… you save 32 bits… PER BURST. Pick a burst size. But only on every packet that’s targeted to an address above the 32-bit max. Wow… there’s savings!! But you put every DWORD in the burst through the IOMMU, regardless of the address size (according to your previous, unmodified, argument), thereby dwarfing any savings you might have incurred. Truly, arguing that shorter addresses save time or energy because you can use the expensive IOMMU to relocate requests is a silly argument, as I think you’ve concluded.

Let’s just face it: Hardware designers do all sorts of things. Many are guilty of the same types of misguided or premature optimizations that we driver devs are. I’ve found this particularly true these days when practically anyone can design their own device, using an FPGA, a vast array of pre-written IP blocks, and a big pile of reference implementations. Everyone is a hardware designer now. Everybody thinks they need a pile of different MSIs to signal different conditions on their devices, that in fact do nothing to change the flow of control in the driver… or, better still, still require the driver to read a set of registers to process most of the conditions. Many such hardware folks rightly look to us, the experienced OS people who’ve seen a vast array of devices over the years, for guidance and sanity. I think advising such folks that they should limit their implementations to only having 32-bit addressing as an option is doing them a dis-service.

Peter