Implement DMA in packet-mode in a PCI/VME Bridge

If this is a DMA write, then it’s really METHOD_OUT_DIRECT, but that doesn’t really matter very much. Where are you completing the request? Is it possible you complete the request before you call WdfDmaTransactionDmaCompleted?

Hello Tim,
Thank you very much, you have pointed out the issue: in my EvtIoDeviceControl() callback, it is possible for me to call WdfRequestComplete() before performing WdfDmaTransactionDmaCompleted().
In my implementation, I complete the request only in the ISR/DPC if WdfDmaTransactionDmaCompleted() returns true; or, if any error occurs before the end of the DMA, I call WdfObjectDelete() and then WdfRequestCompleteWithInformation().
Is this the right way to proceed?

Do you mean WdfObjectDelete on the transaction object? That’s not right; you need to call WdfDmaTransactionDmaCompletedFinal to tell the system how much did get done. That lets the framework clean up the resources it allocated. If that returns true, then you complete the request.

There’s a very nice article on completing DMA transactions here: https://docs.microsoft.com/en-us/windows-hardware/drivers/wdf/completing-a-dma-transfer
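To make the point concrete, here is a hedged sketch of the pattern Tim is describing, assuming a packet-mode DMA enabler already created during device initialization. The names DEVICE_CONTEXT, GetDeviceContext, DmaEnabler, and EvtProgramWriteDma are illustrative, not from the original poster’s driver:

```c
#include <ntddk.h>
#include <wdf.h>

VOID
EvtIoDeviceControl(
    WDFQUEUE Queue, WDFREQUEST Request,
    size_t OutputBufferLength, size_t InputBufferLength, ULONG IoControlCode)
{
    PDEVICE_CONTEXT devContext = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    WDFDMATRANSACTION transaction = NULL;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(OutputBufferLength);
    UNREFERENCED_PARAMETER(InputBufferLength);
    UNREFERENCED_PARAMETER(IoControlCode);

    status = WdfDmaTransactionCreate(devContext->DmaEnabler,
                                     WDF_NO_OBJECT_ATTRIBUTES,
                                     &transaction);
    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionInitializeUsingRequest(
                     transaction, Request, EvtProgramWriteDma,
                     WdfDmaDirectionWriteToDevice);
    }
    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionExecute(transaction, WDF_NO_CONTEXT);
    }

    if (!NT_SUCCESS(status)) {
        // Error path only: tear down the transaction and fail the request.
        if (transaction != NULL) {
            WdfObjectDelete(transaction);
        }
        WdfRequestComplete(Request, status);
        return;
    }

    // Success path: do NOT complete the request here. The transfer is in
    // flight; completion happens in the DPC, after the DMA-complete
    // interrupt, once WdfDmaTransactionDmaCompleted returns TRUE.
}
```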

Hello Tim,
I have not succeeded in completing the DMA transaction in the DPC; I get a bugcheck with the following error:
DRIVER_VERIFIER_IOMANAGER_VIOLATION (c9)
The IO manager has caught a misbehaving driver.
Arguments:
Arg1: 000000000000000e, Irql > DPC at IoCompleteRequest
Arg2: 0000000000000009, the current Irql
Arg3: ffff9d077d39cd80, the IRP
Arg4: 0000000000000000

Reading the article: if the full DMA length is not reached, do I have to call WdfDmaTransactionDmaCompletedWithLength or WdfDmaTransactionDmaCompletedFinal? My hardware does not report any error, but I can see that the DMA length is not reached.

If you call WdfDmaTransactionDmaCompletedWithLength and the length is less than the full transfer, the system will submit another transfer to complete the rest. If you need to abort the transfer and can’t do any more, then you call WdfDmaTransactionDmaCompletedFinal.

The BSOD you got is because you tried to call IoCompleteRequest in your ISR. You need to defer almost all of your processing to your DPC. The ISR should do very little more than acknowledge the interrupt and trigger your DPC.
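The ISR/DPC split Tim describes can be sketched roughly as follows. This is a hedged outline, not the poster’s actual driver: DeviceInterruptPending, DeviceAckInterrupt, DEVICE_CONTEXT, GetDeviceContext, and CurrentDmaTransaction are all hypothetical names standing in for the real hardware access and bookkeeping:

```c
#include <ntddk.h>
#include <wdf.h>

BOOLEAN
EvtInterruptIsr(WDFINTERRUPT Interrupt, ULONG MessageId)
{
    PDEVICE_CONTEXT devContext =
        GetDeviceContext(WdfInterruptGetDevice(Interrupt));

    UNREFERENCED_PARAMETER(MessageId);

    // Check and acknowledge the interrupt in hardware -- nothing more.
    if (!DeviceInterruptPending(devContext)) {   // hypothetical helper
        return FALSE;                            // not our interrupt
    }
    DeviceAckInterrupt(devContext);              // hypothetical helper

    // Defer everything else, including request completion, to the DPC.
    WdfInterruptQueueDpcForIsr(Interrupt);
    return TRUE;
}

VOID
EvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT devContext = GetDeviceContext((WDFDEVICE)AssociatedObject);
    WDFDMATRANSACTION transaction = devContext->CurrentDmaTransaction;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(Interrupt);

    // Tell the framework this device-level transfer is done; this returns
    // TRUE only when the whole transaction has been satisfied. For a short
    // transfer you want continued, call ...DmaCompletedWithLength instead;
    // to stop early, call ...DmaCompletedFinal with the bytes that made it.
    if (WdfDmaTransactionDmaCompleted(transaction, &status)) {
        WDFREQUEST request = WdfDmaTransactionGetRequest(transaction);
        size_t bytes = WdfDmaTransactionGetBytesTransferred(transaction);

        devContext->CurrentDmaTransaction = NULL;
        WdfObjectDelete(transaction);
        WdfRequestCompleteWithInformation(request, status, bytes);
    }
}
```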

Thank you very much Tim.
I have fixed my issue and my DMA is working now!!
Thank you again for your great advice!!

Debugging my DMA, I print the address in SgList.Element[0] and it is not incremented at each call of EvtProgramDma (after completing in the DPC). I do not understand why the buffer address is not incremented for each DMA transfer. Could you explain why?

Is your device limited to 32-bit physical addresses? If so, then the operating system has allocated a “bounce buffer” below the 4GB mark, and is copying the user’s buffer into that space. So, each call to EvtProgramDma is getting the same physical buffer with different data.
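For reference, the addressing capability Tim mentions is something the driver declares when it creates the DMA enabler: the profile passed to WdfDmaEnablerCreate tells KMDF whether the device can address above 4 GB. A minimal sketch, with MAX_TRANSFER_LENGTH as an illustrative value:

```c
#include <ntddk.h>
#include <wdf.h>

#define MAX_TRANSFER_LENGTH (1024 * 1024)   // illustrative per-transfer limit

NTSTATUS
CreateDmaEnabler(WDFDEVICE Device, WDFDMAENABLER *Enabler)
{
    WDF_DMA_ENABLER_CONFIG config;

    // WdfDmaProfilePacket declares a packet-mode device limited to 32-bit
    // physical addresses, so buffers above the 4GB mark get bounced through
    // a low buffer. A device with 64-bit address registers would declare
    // WdfDmaProfilePacket64 instead and avoid the copy.
    WDF_DMA_ENABLER_CONFIG_INIT(&config,
                                WdfDmaProfilePacket,
                                MAX_TRANSFER_LENGTH);

    return WdfDmaEnablerCreate(Device, &config,
                               WDF_NO_OBJECT_ATTRIBUTES, Enabler);
}
```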

Yup. What Mr. Roberts said. Exactly.

Peter

Yes, my device is limited to 32-bit physical addresses. So I understand that the “bounce buffer” is at the same address, but the data are copied into this buffer for each transfer.
Thank you for your reply.

There’s no excuse for any hardware design from the last 20 years to use 32-bit physical addressing. The hardware engineers should have their fingernails removed.

(Tangent: Mr. Roberts is well known for (quite rightly) deriding DMA designs that do not support scatter/gather. I had to laugh when I was watching a video of Darryl Havens doing a presentation on the I/O Subsystem from 1993 (the first PDC where the details of Windows NT were discussed) where he said “Windows NT basically assumes everything supports scatter/gather. More and more hardware has that capability these days.” I immediately thought of Mr. Roberts. Later in the same presentation, Mr. Havens also was remarking on devices that only supported 24-bit addressing…)

Supporting scatter/gather and requiring 64-bit addresses are not the same thing.
It is not wise (neither from a security POV nor from an electronic engineering POV) to give a device full access to the host address space.
The host should have I/O map registers (an IOMMU or whatever) that map limited pieces of host memory for the device.
For many reasonable devices, 24-bit sized fragments (even 16-bit) are enough.
When the host address space grows beyond 64 bits, this approach will accommodate old 64-bit devices, and 32-bit ones as well.

Consider the cost of the extra transistors to do this over the production run of the device versus the cost of the development. Add in the unnecessary performance cost and the fact that nearly every system has more than 4 GB of RAM. Transfer sizes are different from being able to access the whole address space.

@MBond2 What unnecessary performance cost?
Besides implementing 64-bit address registers in the device (peripheral), consider that modern buses are serial, such as PCIe.
The overhead of moving extra address bytes back and forth is a performance cost, both in time and energy.
– pa

… consider that modern buses are serial, such as PCIe.

You’re surely not arguing that this is a cause of lower performance, are you? The whole reason we moved from parallel PCI to serial PCIe is that you get SIGNIFICANTLY higher bits-per-second rates with a serial scheme than you can with a parallel scheme at the same cost. The current generation of PCIe can move 8 GB per second across one lane. That’s nearly as fast as memory-to-memory transfers.

I’m not sure I understand what you are saying here. Yes, if a device has to receive or send only 32 bits versus 64 bits, it can take less time or power to do so, but it seems impossible to think that the overall system efficiency will be better because that single transfer can omit those bits, when every I/O to that device needs a bounce buffer (and the corresponding memory copy).

My, how we can drift to… you know… pretty silly topics here.

Supporting scatter/gather and requiring 64-bit addresses are not the same thing.

No shit. Really??

Are you, Mr. @Pavel_A, actually, seriously, arguing that device manufacturers should build devices with less than 64-bit addressing in the 21st century? If you are, that’s an astonishingly ridiculous argument… considering most DMA transfers do NOT involve use of the IOMMU, and most operating systems don’t enable IOMMU use by default (for internal PCIe devices in some OSes, for any PCIe devices in other OSes).

Overhead of moving extra address bytes forth and back is performance cost, both in time and energy.

Are you trying to argue that doing a DMA transfer using a short address via an IOMMU takes less time and overhead than a DMA transfer with a “direct” 64-bit address on the bus without an IOMMU? Because, if that’s what you’re trying to argue, I’d suggest you do some checking around. There’s a reason most operating systems don’t enable use of the IOMMU by default… and why, when they DO enable it, they enable it in limited – sometimes VERY limited – situations.

Peter

@MBond2 said:
I’m not sure I understand what you are saying here. Yes, if a device has to receive or send only 32 bits versus 64 bits it can take less time or power to do so, but it seems impossible to think that the overall system efficiency will be better because that single transfer can omit those bits, but every IO to that device needs a bounce buffer (and the corresponding memory copy)

Pardon me, I obviously failed to explain. If the host does address translation from the device’s address space to the host’s large address space (64 bits or more) by means of an IOMMU or hardware “mapping registers”, no bounce buffer will be needed and there will be no security compromises.
And the device will support SG.
Of course, implementing such mapping registers will increase the cost of the CPU, but these days, if something has security/privacy implications, CPU makers should be willing to do this.

– pa

Thank you, Mr. @“Peter_Viscarola_(OSR)”, for not passing me by with your kind attention.

Yes I remember what you said about IOMMUs. Really.
In recent years I worked with customers that make their own silicon: wireless, AR glasses and others.
Most of their newly designed devices do not have 64-bit DMA.
Or they had 64-bit support at the FPGA stage, but reduced it to 32-bit in the ASIC.
Otherwise they are quite standard PCIe. I’ve made simple Windows drivers for these things, to pass the PCIe compliance test, which runs on Windows.

The engineers are not crazy or incompetent. The reason is that none of these devices were specially designed for Windows PC or servers. Their intended host systems are either Linux on x64 platforms or new ARMs.
Linux lets us allocate shared buffers and all other needed memory in the low 32-bit address space, so it never was an issue.
For devices such as glasses, the designers wanted to squeeze every bit of energy and chip area. It’s the weight of the battery that users haul around with them. The savings that I mentioned are noticeable.

So how is this relevant to Windows at all?
Most of these devices are cool and quite usable on a Windows machine.
These companies wanted to offer evaluation kits and SDKs on Windows but decided not to go for it because of complications (driver signing among others).
For Wi-Fi drivers, SG is often not required; a shared buffer is enough (TL;DR), so it’s OK.
OTOH, SG is needed for the glasses (video streaming) - but the host does not run Windows, so no one cares.


The bottom line is, folks do all sorts of interesting hardware but Windows users have no chance to get that without adapting the platform.
Apple these days makes their own silicon… how about Microsoft?

P.S. Yes, I’ve checked the PCIe spec. Memory request TLPs with a 32-bit address indeed are 4 bytes shorter than TLPs with a 64-bit address, thus saving bus cycles and room in FIFOs.