
Best approach for sharing a buffer between kernel driver and UM service for read and write?


Comments

  • Peter_Viscarola_(OSR) Administrator Posts: 8,671

    The unfortunate case is that the prototype design needs exactly that to be accomplished on Windows 10; otherwise, what's the point of having a PCIe-connected, $5000 FPGA with 2GB of SDRAM full of data that isn't easily read/write accessible from a Windows 10 user-mode application...

    One more note: you may want to have a look at the OpenCL standard… for how this is typically done. I know at least one FPGA vendor supports this for some of their generic accelerator boards.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • craig_howard Member Posts: 183

    @Peter_Viscarola_(OSR) said:
    Well, wait. You can certainly write a driver that takes a big chunk of device memory and maps it into the user mode address space of a given process.

    Mr @craig_howard is being a bit overly… dramatic? restrictive? … about this.

    Perhaps ... but I've played in this rodeo before; here's how it usually goes ...

    Me: We are limited in how much contiguous memory we can get in kernel space for mapping the device memory; 2MB looks to be a safe number
    Linux-addled Mgr: We need more!
    Me: Well, we have a better chance of getting a larger contiguous chunk if we load earlier; if we're an ELAM driver then we can probably get 8 or 16MB
    Linux-addled Mgr: We need even more!
    Me: Well, we can have a thread doing aggressive maps every few seconds (essentially the allocation sketch after this exchange), so that if by some chance a large chunk of contiguous memory gets freed up we can grab it, maybe 20MB, but we'll have to wait a while for that
    Linux-addled Mgr: OK, make it so; let's write an ELAM driver that has a thread aggressively trying to get big chunks of contiguous memory so we can map them into the device space
    ...
    Reality-facing Support Mgr: Why does the shipping machine run slower than turtles stampeding through peanut butter on a cold Yukon morning, and why can't customers load anti-virus software anymore?
    Smarter replacement Mgr: OK, let's use a 2MB window for the smaller stuff and use scatter/gather DMA for the big stuff
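
    (For context: the "aggressive maps" in the exchange above amount to repeated calls of roughly this shape. A minimal sketch only; the helper name is made up, the retry policy is omitted, and it simply returns NULL when no contiguous run of the requested size exists.)

    PVOID TryGrabContiguousChunk(SIZE_T NumberOfBytes)
    {
        PHYSICAL_ADDRESS low, high, boundary;

        low.QuadPart = 0;        // accept any starting physical address
        high.QuadPart = -1;      // ...up to the top of physical memory
        boundary.QuadPart = 0;   // no alignment-boundary requirement

        // Fails (returns NULL) if no physically contiguous run of
        // NumberOfBytes is available right now -- hence the "try again
        // later" thread in the story above.
        return MmAllocateContiguousMemorySpecifyCache(NumberOfBytes,
                                                      low, high, boundary,
                                                      MmCached);
    }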

    Same with Mr @Tim_Roberts' statement about 32-bit processes being too vulnerable to shellcode to be worth letting them connect to my driver ... I've played in that rodeo too, and the first thing a Red Team from FireEye will do is use a 32-bit application with a known CVE to connect to your driver. If they can do that (and they are very good at fuzzing things and sniffing out IOCTLs), your company gets put on the naughty list and the original dev is likely encouraged to seek out a different occupational path

    :smiley:

    And increasingly, the idea is to return results directly to host memory from the FPGA using a “slave bridge” IP (I think we’re supposed to say “address translator” now, but nobody would know what I was talking about) to do the DMA transfers without having to set up and do traditional “DMA transfers”; I’ve done that in two projects recently. This, instead of putting the result in device address space and having the host (or device) DMA it back.

    @Peter_Viscarola_(OSR) I assume that you're talking about AXI memory [ https://www.xilinx.com/support/documentation/ip_documentation/axi_pcie/v2_8/pg055-axi-bridge-pcie.pdf ] and the access through this concept [ https://www.xilinx.com/Attachment/Xilinx_Answer_65444_Windows.pdf ] ... very interesting, I'm going to have to dig more into this! Thanks for the new info!

  • Peter_Viscarola_(OSR) Administrator Posts: 8,671

    We are limited in how much contiguous memory we can get in kernel space for mapping the device memory

    I don’t follow this… at all. One maps device memory into virtual address space, and on a 64-bit machine we have plenty of it.

    It sounds to me like you’re talking about allocating large chunks of physically contiguous memory. And I’m entirely over this complaint. Because the answer to this is simple: Buy more memory. Memory is cheap and every system is running 64-bit Windows. I have a driver that allocates multiple 2GB Common Buffers…. And was allocating as much as 16GB contiguous physical memory before I decided I needed to “play nice” and support the IOMMU. You just need to have enough memory on the machine. Put 96GB on a machine…. I guarantee you your contiguous memory allocation problems are gone.
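
    (For the curious: with KMDF, that kind of large Common Buffer is a single call once a DMA enabler exists. A minimal sketch, assuming the enabler was already created in EvtDevicePrepareHardware; the 2GB size just mirrors the figure above, and error handling is trimmed.)

    NTSTATUS CreateBigCommonBuffer(WDFDMAENABLER DmaEnabler,
                                   WDFCOMMONBUFFER *CommonBuffer)
    {
        NTSTATUS status;
        size_t length = 2ULL * 1024 * 1024 * 1024;   // 2 GB, illustrative only

        status = WdfCommonBufferCreate(DmaEnabler, length,
                                       WDF_NO_OBJECT_ATTRIBUTES, CommonBuffer);
        if (!NT_SUCCESS(status)) {
            // Typically STATUS_INSUFFICIENT_RESOURCES when no contiguous
            // physical range of that size can be found.
            return status;
        }

        // A real driver would stash these in its device context: the kernel VA
        // for CPU access, the bus-logical address for programming the device.
        PVOID kernelVa = WdfCommonBufferGetAlignedVirtualAddress(*CommonBuffer);
        PHYSICAL_ADDRESS deviceLa = WdfCommonBufferGetAlignedLogicalAddress(*CommonBuffer);

        UNREFERENCED_PARAMETER(kernelVa);
        UNREFERENCED_PARAMETER(deviceLa);
        return STATUS_SUCCESS;
    }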

    Needless to say: I am not talking about writing drivers for commodity devices on commodity PCs here. That’s a whole different set of engineering constraints.

    I assume that you're talking about AXI memory

    That is one incarnation of the approach, in the Xilinx world, yes. Same feature is available in the Altera universe. I’ve worked with both.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • craig_howard Member Posts: 183
    edited August 29

    @Peter_Viscarola_(OSR) said:

    We are limited in how much contiguous memory we can get in kernel space for mapping the device memory

    I don’t follow this… at all. One maps device memory into virtual address space, and on a 64-bit machine we have plenty of it.
    ... good stuff ...
    Needless to say: I am not talking about writing drivers for commodity devices on commodity PCs here. That’s a whole different set of engineering constraints.

    That's true, and that's why I usually ask what the target platform is. The PCs I work with are very much in the commodity class, usually NUCs or similar small form factors. Blade servers don't work well in doctors' offices, exam rooms, or on factory floors, and I'm usually lucky if I can get machines with 32GB of memory installed (normally it's 16GB, sometimes as low as 8GB) ... once you have all the other things competing for memory, that gets to be pretty tight ... :smiley:

  • rusakov2 Member Posts: 54
    edited September 3

    @craig_howard said:
    @rusakov2 Simply explain to your firmware designer and the board EE that they have made a common and tragic mistake in not doing a little bit of GoogleFu to see that Windows is not Linux, and that because of that misunderstanding the Windows 10 marketplace is no longer available to them ...

    Craig, thanks for the reply, but all of that is known and was communicated to the EE a while ago. No, they won't consider Windows requirements. From what I've seen, it works the opposite way here: invest millions in hardware made for something like an RTOS or VxWorks, sell it, and then decide to try to connect it to Windows 10.
    Offers to "design for Windows" are not considered for hardware that has already been in production for years.
    Anyhow, they would rather drop the Windows 10 option than change anything in the hardware.

    I think what you describe is most applicable to Microsoft's policy of pushing customers to update their hardware every N years to take advantage of emerging new software.
    Sadly, for some specific system markets this approach of periodic updates to hardware and bundled software (Android's approach is similar) doesn't work.

    • Your only way of getting to that 2GB of data is through DMA to a much smaller buffer, so I would tell your firmware designer and EE to start looking for DMA IP ...

    I believe I already mentioned that even if we acquire a DMA IP block, it won't work. Per clarification from the hardware vendor, this IP block needs at least four lanes of Gen2 PCIe, and the device is connected via only one lane. Consider the cost of tooling for a new revision of a 14-layer PCB... no hope.

    Well, wait. You can certainly write a driver that takes a big chunk of device memory and maps it into the user mode address space of a given process.

    Exactly, Peter!
    Just expose the 2GB of device memory via a PCIe BAR, then map that memory into the user-mode application's process and let the application developers show what they can do with the data.
    No DMA necessary. Not pretty, but it works with existing hardware.
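
    (A minimal sketch of that mapping path, for concreteness. It assumes the BAR's translated physical address and length were already saved at PrepareHardware time; the DevExt field names and the helper name are made up, and cleanup/teardown when the owning process exits is omitted.)

    // Map the (already discovered) BAR into the calling process.
    NTSTATUS MapBarToCurrentProcess(PDEVICE_EXTENSION DevExt, PVOID *UserVa)
    {
        // Kernel-mode mapping of the BAR (device memory, so no caching).
        DevExt->BarKva = MmMapIoSpaceEx(DevExt->BarPhysical, DevExt->BarLength,
                                        PAGE_READWRITE | PAGE_NOCACHE);
        if (DevExt->BarKva == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        // Describe that mapping with an MDL so it can be re-mapped to UM.
        DevExt->BarMdl = IoAllocateMdl(DevExt->BarKva, (ULONG)DevExt->BarLength,
                                       FALSE, FALSE, NULL);
        if (DevExt->BarMdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        MmBuildMdlForNonPagedPool(DevExt->BarMdl);

        // Map into the address space of the requesting process. This raises an
        // exception on failure when AccessMode is UserMode, hence the __try.
        __try {
            *UserVa = MmMapLockedPagesSpecifyCache(DevExt->BarMdl, UserMode,
                                                   MmNonCached, NULL, FALSE,
                                                   NormalPagePriority);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            return GetExceptionCode();
        }

        // Remember to MmUnmapLockedPages/IoFreeMdl/MmUnmapIoSpace on cleanup,
        // and to tear the UM mapping down before the owning process goes away.
        return STATUS_SUCCESS;
    }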

    And increasingly, the idea is to return results directly to host memory from the FPGA using a “slave bridge” IP (I think we’re supposed to say “address translator” now, but nobody would know what I was talking about)

    There was a project at Intel that solved this differently: the Xeon CPU and the FPGA were on the same memory bus, in the same silicon (>150W, but who cares, it's not for the average computer). No PCIe, no DMA; it all looked good, but that combined silicon was dropped by Intel.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,671

    Not pretty but works with existing hardware.

    Exactly.

    “A person has got to do what they have to do.” (an updated wording of an old aphorism)

    Xeon CPU and FPGA were on the same memory bus, and in same silicon

    Yeah! I like it! There are soooo many, really excellent, FPGA-based computing accelerators these days, some of which are co-located with a CPU on-board. Xilinx has a great line of them. Haven’t seen anything like a board with an FPGA and a Xeon-class processor out there, though…. But I want one ;-)

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Tim_Roberts Member - All Emails Posts: 14,095

    Xilinx has a great line of them.

    As long as you're not in a hurry. Because of supply chain issues, the lead time on many Xilinx parts is up to 12 months. We have some clients who are considering black market purchases.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,671

    the lead time on many Xilinx parts is up to 12 months

    Wow... interesting. Thanks for that. I wasn't aware that was the case.

    I was, however, thinking in general of complete add-in accelerator boards that you plug into more general-purpose systems.

    For clarity: neither I nor OSR holds any stock in Xilinx or stands to directly profit from the sale of their products.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • MBond2 Member Posts: 364

    There is a lot of stuff here

    First, 2 or 20 or 200 GB of contiguous virtual address space is easily obtained in UM or KM on most x64 systems. The same is not true of physical address space - but that should never be needed.

    Second, 96 GB of RAM is a moderate amount. It is in the realm of a high-end desktop or a crappy server. As I write this, my desktop has 256 GB and the servers that I am writing software for have between 2 and 10 TB of RAM.

    The most important point is fitness for a purpose. It is entirely inappropriate to map device memory into UM for a commodity mass-produced device. It is entirely appropriate for a closed or controlled system - as are often found in medical, industrial control, and line-of-business systems.

    But if you control the UM application design, you can probably get effectively equivalent performance using an IOCTL interface without any of the caveats of the direct memory approach. That's a big if, because a good design here depends greatly on what kind of interactions the app needs with the device, and good threading design in Windows is very different from what might work well in *nix.
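
    (To make that concrete, a hedged sketch of the IOCTL route in KMDF. IOCTL_MYDEV_READ_WINDOW, GetDeviceExtension, and the BarKva/BarLength fields are all hypothetical names; the point is simply that the driver copies out of its own kernel-mode BAR mapping and never exposes device memory to UM.)

    VOID EvtIoDeviceControl(WDFQUEUE Queue, WDFREQUEST Request,
                            size_t OutputBufferLength, size_t InputBufferLength,
                            ULONG IoControlCode)
    {
        PDEVICE_EXTENSION devExt = GetDeviceExtension(WdfIoQueueGetDevice(Queue));
        NTSTATUS status = STATUS_INVALID_DEVICE_REQUEST;
        size_t bytes = 0;

        UNREFERENCED_PARAMETER(OutputBufferLength);
        UNREFERENCED_PARAMETER(InputBufferLength);

        if (IoControlCode == IOCTL_MYDEV_READ_WINDOW) {
            PVOID outBuf;
            status = WdfRequestRetrieveOutputBuffer(Request, 1, &outBuf, &bytes);
            if (NT_SUCCESS(status)) {
                if (bytes > devExt->BarLength) {
                    bytes = devExt->BarLength;
                }
                // Use the register-buffer accessor for device memory rather
                // than a plain RtlCopyMemory from uncached space.
                READ_REGISTER_BUFFER_UCHAR((PUCHAR)devExt->BarKva,
                                           (PUCHAR)outBuf, (ULONG)bytes);
            } else {
                bytes = 0;
            }
        }

        WdfRequestCompleteWithInformation(Request, status, bytes);
    }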

  • rusakov2 Member Posts: 54

    @Peter_Viscarola_(OSR) said:

    Xeon CPU and FPGA were on the same memory bus, and in same silicon

    Yeah! I like it! There are soooo many, really excellent, FPGA-based computing accelerators these days, some of which are co-located with a CPU on-board. Xilinx has a great line of them. Haven’t seen anything like a board with an FPGA and a Xeon-class processor out there, though…. But I want one ;-)

    I placed a back order a while ago; then an Intel rep called and said no, it has been dropped, don't hope.
    https://fpgaer.files.wordpress.com/2018/05/xeon-fpga-front-back-1.jpg?w=1024
    However, he advised looking into similar Xeon+FPGA silicon, but with a UPI option, for possible future products:
    https://fpgaer.wordpress.com/2018/05/24/intel-xeon-processor-with-fpga-now-shipping/

  • rusakov2 Member Posts: 54
    edited September 5

    @MBond2 said:
    There is a lot of stuff here

    Yes, and the answer is often "it depends" :)

    The most important point is fitness for a purpose.

    Agreed. Recall that the original poster asked the question but did not mention which system this is or which market it is for, so I threw in my two cents.

    It is entirely inappropriate to map device memory into UM for a commodity mass produced device.

    Conceptually, yes, agreed.

    It is entirely appropriate for a closed or controlled system - as are often found in medical, industrial control and line of business systems

    That's the case, yes. Is it a bird? Or an airplane? Or an air-delivery drone? No... the AI says, with 87% probability, it is an intruding unmanned aerial vehicle :( Add security and surveillance systems to the list. Now give me the bounding box coordinates, then "medium range, three sweeps" (c) :)

    but if you control the UM application design, you can probably get effectively equivalent performance using an IOCTL interface without any of the caveats of the direct memory approach. That's a big if, because a good design here depends greatly on what kind of interactions the app needs with the device

    Correct. Although app developers are not as rigid as the hardware team, it is still a challenge to convince them to design their app in a specific way for particular hardware.

  • MBond2 Member Posts: 364

    I'm glad that you generally agree with me. But let's also agree that UM application design is far easier to change than anything else we have discussed. It is even possible to provide a UM DLL that acts as a 'shim' between an IOCTL KM interface and a UM program that expects a shared-memory interface - the write-watch family of functions enables this, and it performs better than you might think.
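
    (A user-mode sketch of that write-watch idea, for the curious. The buffer below stands in for the shared window the shim DLL would hand the application; IOCTL_MYDEV_WRITE_RANGE, the device handle, and the DeviceIoControl plumbing are hypothetical and elided.)

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SIZE_T size = 2 * 1024 * 1024;   // a 2MB shared window, per the thread
        void *buf = VirtualAlloc(NULL, size,
                                 MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                 PAGE_READWRITE);
        if (buf == NULL) {
            return 1;
        }

        // The application writes into the buffer as if it were device memory.
        ((char *)buf)[0x1000] = 42;

        // The shim periodically asks which pages were dirtied (and resets the
        // watch), then forwards only those pages to the driver via IOCTL.
        PVOID dirty[512];
        ULONG_PTR count = 512;           // in: array capacity, out: pages found
        DWORD granularity = 0;           // out: page size
        if (GetWriteWatch(WRITE_WATCH_FLAG_RESET, buf, size,
                          dirty, &count, &granularity) == 0) {
            for (ULONG_PTR i = 0; i < count; i++) {
                // e.g. DeviceIoControl(hDevice, IOCTL_MYDEV_WRITE_RANGE, ...)
                printf("dirty page at %p\n", dirty[i]);
            }
        }

        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }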
