
Implement DMA in packet-mode in a PCI/VME Bridge

Comments

  • croy_kfr Member Posts: 22

    Hello Tim,
    Thank you very much, you have pointed out the issue: in my EvtDeviceIOControl() callback it was possible to call WdfRequestComplete() before WdfDmaTransactionDmaCompleted() had been called.
    In my implementation now, I complete the request in the ISR/DPC only if WdfDmaTransactionDmaCompleted() returns TRUE; if any error occurs before the end of the DMA, I call WdfObjectDelete() and then WdfRequestCompleteWithInformation().
    Is this the right way to proceed?

  • Tim_Roberts Member Posts: 13,958

    Do you mean WdfObjectDelete on the transaction object? That's not right; you need to call WdfDmaTransactionDmaCompletedFinal to tell the system how much did get done. That lets the framework clean up the resources it allocated. If that returns true, then you complete the request.

    There's a very nice article on completing DMA transactions here: https://docs.microsoft.com/en-us/windows-hardware/drivers/wdf/completing-a-dma-transfer
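
    For illustration only -- this is not code from the thread -- a minimal sketch of that completion sequence at DPC level. The helper name CompleteDmaFromDpc, the HardwareReportedError flag, and the choice to delete (rather than reuse) the transaction object are assumptions made for the sketch.

        #include <ntddk.h>
        #include <wdf.h>

        VOID
        CompleteDmaFromDpc(
            _In_ WDFDMATRANSACTION DmaTransaction,
            _In_ BOOLEAN HardwareReportedError
            )
        {
            NTSTATUS status;
            BOOLEAN transactionComplete;
            WDFREQUEST request;
            size_t bytesTransferred;

            request = WdfDmaTransactionGetRequest(DmaTransaction);

            if (HardwareReportedError) {
                //
                // Abort path: report how much actually transferred (here, nothing)
                // so the framework can release the map registers and other
                // resources it allocated for this transaction.
                //
                (VOID) WdfDmaTransactionDmaCompletedFinal(DmaTransaction, 0, &status);
                transactionComplete = TRUE;
                status = STATUS_DEVICE_DATA_ERROR;
            } else {
                //
                // Normal path: returns TRUE only when the entire transaction
                // (all intermediate transfers) has been carried out.
                //
                transactionComplete = WdfDmaTransactionDmaCompleted(DmaTransaction, &status);
            }

            if (transactionComplete) {
                bytesTransferred = HardwareReportedError
                                       ? 0
                                       : WdfDmaTransactionGetBytesTransferred(DmaTransaction);
                WdfObjectDelete(DmaTransaction);
                WdfRequestCompleteWithInformation(request, status, bytesTransferred);
            }
            //
            // If transactionComplete is FALSE, the framework will call
            // EvtProgramDma again for the next piece; do not touch the request.
            //
        }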

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • croy_kfr Member Posts: 22

    Hello Tim,
    I have not succeeded in completing the DMA transaction in the DPC; I get a bugcheck with the following error:
    DRIVER_VERIFIER_IOMANAGER_VIOLATION (c9)
    The IO manager has caught a misbehaving driver.
    Arguments:
    Arg1: 000000000000000e, Irql > DPC at IoCompleteRequest
    Arg2: 0000000000000009, the current Irql
    Arg3: ffff9d077d39cd80, the IRP
    Arg4: 0000000000000000

    Reading the article, if the full DMA length is not reached, do I have to call WdfDmaTransactionDmaCompletedWithLength or WdfDmaTransactionDmaCompletedFinal? My hardware does not report any error, but I can see that the full DMA length has not been transferred.

  • Tim_Roberts Member Posts: 13,958

    If you call WdfDmaTransactionDmaCompletedWithLength and the length is less than the full transfer, the system will submit another transfer to complete the rest. If you need to abort the transfer and can't do any more, then you call WdfDmaTransactionDmaCompletedFinal.

    The BSOD you got is because you tried to call IoCompleteRequest in your ISR. You need to defer almost all of your processing to your DPC. The ISR should do very little more than acknowledge the interrupt and trigger your DPC.
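
    A rough sketch of that ISR/DPC split, assuming a KMDF interrupt object with EvtInterruptIsr/EvtInterruptDpc callbacks. The DEVICE_CONTEXT layout, the register helpers ReadDmaStatus/AckDmaInterrupt, and the status bits are hypothetical stand-ins for the real hardware; CompleteDmaFromDpc is the sketch shown earlier in the thread.

        #include <ntddk.h>
        #include <wdf.h>

        //
        // Hypothetical per-device context and hardware helpers for this sketch.
        //
        typedef struct _DEVICE_CONTEXT {
            WDFDMATRANSACTION CurrentDmaTransaction;   // transaction currently on the hardware
            ULONG             LastDmaStatus;           // status latched by the ISR for the DPC
        } DEVICE_CONTEXT, *PDEVICE_CONTEXT;

        WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext);

        ULONG ReadDmaStatus(_In_ PDEVICE_CONTEXT DevCtx);    // hypothetical register read
        VOID  AckDmaInterrupt(_In_ PDEVICE_CONTEXT DevCtx);  // hypothetical interrupt acknowledge
        VOID  CompleteDmaFromDpc(_In_ WDFDMATRANSACTION DmaTransaction,
                                 _In_ BOOLEAN HardwareReportedError); // earlier sketch

        #define DMA_DONE_BIT   0x00000001
        #define DMA_ERROR_BIT  0x00000002

        BOOLEAN
        EvtInterruptIsr(
            _In_ WDFINTERRUPT Interrupt,
            _In_ ULONG MessageId
            )
        {
            PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
            ULONG hwStatus = ReadDmaStatus(devCtx);

            UNREFERENCED_PARAMETER(MessageId);

            if ((hwStatus & (DMA_DONE_BIT | DMA_ERROR_BIT)) == 0) {
                return FALSE;                          // not our interrupt
            }

            devCtx->LastDmaStatus = hwStatus;          // remember why we fired
            AckDmaInterrupt(devCtx);                   // quiet the hardware -- nothing more here
            WdfInterruptQueueDpcForIsr(Interrupt);     // defer the real work to the DPC
            return TRUE;
        }

        VOID
        EvtInterruptDpc(
            _In_ WDFINTERRUPT Interrupt,
            _In_ WDFOBJECT AssociatedObject
            )
        {
            PDEVICE_CONTEXT devCtx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));

            UNREFERENCED_PARAMETER(AssociatedObject);

            //
            // We are at DISPATCH_LEVEL here, so completing the transaction and
            // the request is legal. Doing it in the ISR (at DIRQL) is exactly
            // what Driver Verifier flags with bugcheck 0xC9.
            //
            CompleteDmaFromDpc(devCtx->CurrentDmaTransaction,
                               (devCtx->LastDmaStatus & DMA_ERROR_BIT) != 0);
        }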

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • croy_kfr Member Posts: 22

    Thank you very much, Tim.
    I have fixed my issue and my DMA is working now!!
    Thank you again for your great advice!!

  • croy_kfr Member Posts: 22

    Debugging my DMA, I print the Address in SgList->Elements[0] and it is not incremented at each call of EvtProgramDma (after completing the transfer in the DPC). I do not understand why the buffer address is not incremented for each DMA transfer. Could you explain why?

  • Tim_Roberts Member Posts: 13,958

    Is your device limited to 32-bit physical addresses? If so, then the operating system has allocated a "bounce buffer" below the 4GB mark, and is copying the user's buffer into that space. So, each call to EvtProgramDma is getting the same physical buffer with different data.
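
    Not thread code, but for context: the DMA profile the driver passes to WdfDmaEnablerCreate is how KMDF learns whether the hardware can address 64-bit physical memory, and therefore whether it must bounce-buffer. A sketch, with MAX_DMA_TRANSFER_LENGTH and the packet-mode profiles chosen purely for illustration:

        #include <ntddk.h>
        #include <wdf.h>

        #define MAX_DMA_TRANSFER_LENGTH (1024 * 1024)   // illustrative limit

        NTSTATUS
        CreateDmaEnabler(
            _In_ WDFDEVICE Device,
            _Out_ WDFDMAENABLER *DmaEnabler
            )
        {
            WDF_DMA_ENABLER_CONFIG dmaConfig;

            //
            // WdfDmaProfilePacket   -> 32-bit addressing: Windows bounce-buffers
            //                          anything above 4 GB, as described above.
            // WdfDmaProfilePacket64 -> 64-bit addressing: EvtProgramDma typically
            //                          receives the caller's real physical pages,
            //                          with no intermediate copy.
            //
            WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                                        WdfDmaProfilePacket64,
                                        MAX_DMA_TRANSFER_LENGTH);

            return WdfDmaEnablerCreate(Device,
                                       &dmaConfig,
                                       WDF_NO_OBJECT_ATTRIBUTES,
                                       DmaEnabler);
        }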

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,485

    Yup. What Mr. Roberts said. Exactly.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • croy_kfr Member Posts: 22

    Yes, my device is limited to 32-bit physical addresses. So I understand that the "bounce buffer" stays at the same address, but the data are copied into it for each transfer.
    Thank you for your reply.

  • Tim_Roberts Member Posts: 13,958

    There's no excuse for any hardware design from the last 20 years to use 32-bit physical addressing. The hardware engineers should have their fingernails removed.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,485

    (Tangent: Mr. Roberts is well known for (quite rightly) deriding DMA designs that do not support scatter/gather. I had to laugh when I was watching a video of Darryl Havens doing a presentation on the I/O Subsystem from 1993 (the first PDC where the details of Windows NT were discussed) where he said "Windows NT basically assumes everything supports scatter/gather. More and more hardware has that capability these days." I immediately thought of Mr. Roberts. Later in the same presentation, Mr. Havens also was remarking on devices that only supported 24-bit addressing...)

    Peter Viscarola
    OSR
    @OSRDrivers

  • Pavel_A Member Posts: 2,781
    edited April 27

    Supporting scatter/gather and requiring 64-bit addressing are not the same thing.
    It is not wise (neither from a security POV nor from an electronic-engineering POV) to give a device full access to the host address space.
    The host should have I/O map registers (an IOMMU or whatever) that map limited pieces of host memory for the device.
    For many reasonable devices, 24-bit sized fragments (even 16-bit) are enough.
    When the host address space grows beyond 64 bits, this approach will accommodate old 64-bit devices and 32-bit ones as well.

  • MBond2 Member Posts: 328

    Consider the cost of the extra transistors to do this over the production run of the device versus the cost of the development. Add in the unnecessary performance cost and the fact that nearly every host system now has more than 4 GB of RAM. Transfer sizes are a different matter from being able to address the whole address space.

  • Pavel_A Member Posts: 2,781

    @MBond2 What unnecessary performance cost?
    Besides the cost of implementing 64-bit address registers in the device (peripheral), consider that modern buses, such as PCIe, are serial.
    Moving extra address bytes back and forth is a performance cost, both in time and energy.
    -- pa

  • Tim_Roberts Member Posts: 13,958

    ... consider that modern buses, such as PCIe, are serial.

    You're surely not arguing that this is a cause of lower performance, are you? The whole reason we moved from parallel PCI to serial PCIe is that you get SIGNIFICANTLY higher bits-per-second rates with a serial scheme than you can with a parallel scheme at the same cost. The current generation of PCIe can move 8 GB per second across one lane. That's nearly as fast as memory-to-memory transfers.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • MBond2 Member Posts: 328

    I'm not sure I understand what you are saying here. Yes, if a device has to receive or send only 32 address bits instead of 64 it can take a little less time or power to do so, but it seems impossible that overall system efficiency will be better just because each transfer can omit those bits, when every I/O to that device then needs a bounce buffer (and the corresponding memory copy).

  • Peter_Viscarola_(OSR) Administrator Posts: 8,485

    My, how we can drift to... you know... pretty silly topics here.

    Supporting scatter/gather and requiring 64-bit addressing are not the same thing.

    No shit. Really??

    Are you, Mr. @Pavel_A, actually, seriously, arguing that device manufacturers should build devices with less than 64-bit addressing in the 21st century? If you are, that's an astonishingly ridiculous argument... considering most DMA transfers do NOT involve use of the IOMMU, and most operating systems don't enable IOMMU use by default (for internal PCIe devices in some OSes, for any PCIe devices in other OSes).

    Moving extra address bytes back and forth is a performance cost, both in time and energy.

    Are you trying to argue that doing a DMA transfer using a short address and via an IOMMU takes less time and overhead than a DMA transfer with a "direct" 64-bit address on the bus without an IOMMU? Because, if that's what you're trying to argue, I'd suggest you do some checking around. There's a reason most operating systems don't enable use of the IOMMU by default... and why when they DO enable it they enable it in limited -- sometimes VERY limited -- situations.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Pavel_A Member Posts: 2,781
    edited May 2

    @MBond2 said:
    I'm not sure I understand what you are saying here. Yes, if a device has to receive or send only 32 address bits instead of 64 it can take a little less time or power to do so, but it seems impossible that overall system efficiency will be better just because each transfer can omit those bits, when every I/O to that device then needs a bounce buffer (and the corresponding memory copy).

    Pardon me, I obviously failed to explain. If the host does address translation from the device's address space to the host's large address space (64 bits or more) by means of an IOMMU or hardware "mapping registers", no bounce buffer will be needed and there will be no security compromise.
    And the device will still support SG.
    Of course, implementing such mapping registers will increase the cost of the CPU, but these days, if something has security/privacy implications, CPU makers should be willing to do it.

    -- pa

  • Pavel_A Member Posts: 2,781
    edited May 4

    Thank you, Mr. @Peter_Viscarola_(OSR), for not passing me over with your kind attention.

    Yes I remember what you said about IOMMUs. Really.
    In recent years I have worked with customers that make their own silicon: wireless, AR glasses, and others.
    Most of their newly designed devices do not have 64-bit DMA.
    Or they had 64-bit support at the FPGA stage, but reduced it to 32-bit in the ASIC.
    Otherwise they are quite standard PCIe. I've written simple Windows drivers for these things, to pass the PCIe compliance test, which runs on Windows.

    The engineers are not crazy or incompetent. The reason is that none of these devices was specifically designed for Windows PCs or servers. Their intended host systems are either Linux on x64 platforms or new ARMs.
    Linux lets us allocate shared buffers and all other needed memory in the low 32-bit address space, so it has never been an issue.
    For devices such as the glasses, the designers wanted to squeeze out every bit of energy and chip area; it's the weight of the battery that users haul around with them. The savings I mentioned are noticeable.

    So how is this relevant to Windows at all?
    Most of these devices are cool and quite usable on a Windows machine.
    These companies wanted to offer evaluation kits and SDKs on Windows, but decided not to go for it because of complications (driver signing, among others).
    For Wi-Fi drivers SG is often not required; a shared buffer is enough (TL;DR), so it's OK.
    OTOH, SG is needed for the glasses (video streaming), but the host does not run Windows, so no one cares.


    The bottom line is, folks build all sorts of interesting hardware, but Windows users have no chance to get it without adapting the platform.
    Apple these days makes their own silicon... how about Microsoft?

    P.S. Yes, I've checked the PCIe spec. Packets with a 32-bit address are indeed 4 bytes shorter than packets with a 64-bit address, thus saving bus cycles and room in FIFOs.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,485

    Thank you, Mr. @Pavel_A for the kind favor of a reply to a topic that you yourself started.

    And excellent work, I might add, changing your argument mid-stream when the original was found wanting. You went from arguing:

    It is not wise (neither from a security POV nor from an electronic-engineering POV) to give a device full access to the host address space. The host should have I/O map registers (an IOMMU or whatever) that map limited pieces of host memory for the device.

    and

    Moving extra address bytes back and forth is a performance cost, both in time and energy.

    ... to arguing:

    their intended host systems are either Linux on x64 platforms... Linux lets us allocate shared buffers and all other needed memory in the low 32-bit address space

    which is a nicely dramatic change, given that most Linux platforms (such as RHEL) do not enable use of the IOMMU by default.

    Let's also be sure we note that devices with 64-bit addressing capabilities do not use TLP headers with 64-bit addresses in them unless the destination address actually requires it. So... a device that has 64-bit addressing capability, but that only does transfers to buffers below 0xFFFFFFFF, will always use 32-bit headers.

    And, of course, Windows drivers that support DMA but not 64-bit addressing on x64 machines and that need to transfer data above the 32-bit max have MULTIPLE choices: They can let Windows just handle the problem for them (by doing intermediate buffering) or they can use the same solution we would use on Linux by allocating memory below the max memory address that's supported by the device, and sharing that memory with the user. It's not rocket science.
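
    A sketch (not from the thread) of that second option using a KMDF common buffer, which the framework allocates within the DMA enabler's addressing limits and which is therefore always reachable by the device. The function name and COMMON_BUFFER_SIZE are illustrative:

        #include <ntddk.h>
        #include <wdf.h>

        #define COMMON_BUFFER_SIZE (64 * 1024)   // illustrative size

        NTSTATUS
        CreateDeviceReachableBuffer(
            _In_  WDFDMAENABLER DmaEnabler,       // created with the device's (32-bit) profile
            _Out_ WDFCOMMONBUFFER *CommonBuffer,
            _Out_ PVOID *VirtualAddress,          // CPU-visible address (share/copy as needed)
            _Out_ PHYSICAL_ADDRESS *DeviceAddress // address to program into the device
            )
        {
            NTSTATUS status;

            //
            // WdfCommonBufferCreate honors the enabler's addressing constraints,
            // so a 32-bit DMA profile yields memory the device can reach directly.
            //
            status = WdfCommonBufferCreate(DmaEnabler,
                                           COMMON_BUFFER_SIZE,
                                           WDF_NO_OBJECT_ATTRIBUTES,
                                           CommonBuffer);
            if (!NT_SUCCESS(status)) {
                return status;
            }

            *VirtualAddress = WdfCommonBufferGetAlignedVirtualAddress(*CommonBuffer);
            *DeviceAddress  = WdfCommonBufferGetAlignedLogicalAddress(*CommonBuffer);
            return STATUS_SUCCESS;
        }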

    Packets with a 32-bit address are indeed 4 bytes shorter than packets with a 64-bit address

    Ah, yes... you save 32 bits... PER BURST. Pick a burst size. And only on packets targeted to an address above the 32-bit max. Wow... there's savings!! But (according to your previous, unmodified argument) you put every DWORD in the burst through the IOMMU -- regardless of the address size -- thereby dwarfing any savings you might have incurred. Truly, arguing that shorter addresses save time or energy because you can use the expensive IOMMU to relocate requests is a silly argument, as I think you've concluded.

    Let's just face it: Hardware designers do all sorts of things. Many are guilty of the same types of misguided or premature optimizations that we driver devs are. I've found this particularly true these days, when practically anyone can design their own device using an FPGA, a vast array of pre-written IP blocks, and a big pile of reference implementations. Everyone is a hardware designer now. Everybody thinks they need a pile of different MSIs to signal different conditions on their devices, conditions that in fact do nothing to change the flow of control in the driver... or, better still, that still require the driver to read a set of registers to process most of them. Many such hardware folks rightly look to us, the experienced OS people who've seen a vast array of devices over the years, for guidance and sanity. I think advising such folks to limit their implementations to only 32-bit addressing is doing them a disservice.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers
