
Best approach for sharing a buffer between kernel driver and UM service for read and write?

henrik_meida Member Posts: 76
edited August 25 in NTDEV

Hi,

I want to share a buffer between a UM service and a kernel driver.

I know we can use Direct I/O, but the problem with that is that I want both sides to be able to read and write this buffer at any time, without needing an IOCTL for every access.

My questions are:

  1. What is the best way to implement this?

  2. What is the best way to synchronize them, so that they are not reading and writing the buffer at the same time?

  3. Which of them should create the buffer? The UM service, using something like malloc, or the kernel-mode driver, using something like ExAllocatePool? Is there any difference?


Comments

  • Peter_Viscarola_(OSR) Administrator Posts: 8,665

    Have the app create the buffer, and send a Direct I/O IOCTL to the driver describing that buffer. The driver then keeps the Request in progress. The app and the driver can then exchange data in this buffer anytime.

    When the user app closes the handle (EvtFileCleanup), the driver completes the IOCTL (thereby invalidating the mapping).

    Easy to implement and to understand. Clean.
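
    A minimal WDM-style sketch of the driver side of this recipe (hypothetical names; a METHOD_OUT_DIRECT IOCTL is assumed, and the cancel/cleanup handling discussed later in the thread is omitted):

        #include <ntddk.h>

        // Hypothetical per-device state for the one long-lived "map the shared
        // buffer" request. Illustration only, not anyone's production code.
        typedef struct _SHARED_BUF_CONTEXT {
            PIRP   PendingIrp;   // the in-progress IOCTL
            PVOID  SystemVa;     // kernel-visible mapping of the app's buffer
            SIZE_T Length;
        } SHARED_BUF_CONTEXT, *PSHARED_BUF_CONTEXT;

        NTSTATUS
        HandleMapSharedBufferIoctl(PSHARED_BUF_CONTEXT Ctx, PIRP Irp, PIO_STACK_LOCATION IrpSp)
        {
            // For a direct-I/O IOCTL the I/O manager has already probed and locked
            // the caller's output buffer and described it with Irp->MdlAddress.
            PVOID va = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, NormalPagePriority);

            if (va == NULL) {
                return STATUS_INSUFFICIENT_RESOURCES;   // dispatch completes the IRP with this
            }

            Ctx->SystemVa   = va;
            Ctx->Length     = IrpSp->Parameters.DeviceIoControl.OutputBufferLength;
            Ctx->PendingIrp = Irp;

            // Keep the Request in progress: the pages stay locked and mapped until
            // the driver completes this IRP (at cleanup, or on cancellation).
            IoMarkIrpPending(Irp);
            return STATUS_PENDING;                      // dispatch routine returns this too
        }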

    Synchronization is a problem that's inherent in this approach. There really isn't a great way; it's all dependent on exactly what you need.

    Hope that helps,

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • henrik_meida Member Posts: 76

    @Peter_Viscarola_(OSR) said:
    Have the app create the buffer, and send a Direct I/O IOCTL to the driver describing that buffer. The driver then keeps the Request in progress. The app and the driver can then exchange data in this buffer anytime.

    So user mode sends the address of a buffer it allocated (with malloc, say) to the driver, and the driver creates an MDL describing that address and locks the pages in physical memory using MmProbeAndLockPages, right?

    When the user app closes the handle (EvtFileCleanup), the driver completes the IOCTL (thereby invalidating the mapping).

    Easy to implement and to understand. Clean.

    What happens if the UM service crashes for whatever reason? How can I get notified of the crash, so that I don't keep accessing the buffer and cause a BSOD?

    How long after the UM service crashes does the UM buffer address become invalid? Given that I locked it in physical memory beforehand, will it ever become invalid?

    Synchronization is a problem that's inherent in this approach. There really isn't a great way; it's all dependent on exactly what you need.

    So what do you suggest? The buffer is simply a big array of structs, and both the UM service and the kernel driver read from and write to it. I need to lock it whenever either of them reads or writes.

  • henrik_meida Member Posts: 76
    edited August 26

    @Peter_Viscarola_(OSR) said:
    When the user app closes the handle (EvtFileCleanup), the driver completes the IOCTL (thereby invalidating the mapping).

    Also, I should note that I am not using KMDF.
    I assume EvtFileCleanup is the equivalent of IRP_MJ_CLEANUP in WDM? So how will I get notified when the user closes the handle to a buffer that I have locked? Will my driver's IRP_MJ_CLEANUP get called? If so, I assume I have to do more than just lock the user buffer in memory, right?

  • Tim_Roberts Member - All Emails Posts: 14,093

    So user mode sends the address of a buffer it allocated (with malloc, say) to the driver, and the driver creates an MDL describing that address and locks the pages in physical memory using MmProbeAndLockPages, right?

    No. The point of using a METHOD_IN_DIRECT ioctl is that the operating system handles the locking and the creation of the MDL. You don't have to do it.
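
    For concreteness, a hypothetical user-mode side of that exchange might look like the sketch below (made-up IOCTL code, names, and size; METHOD_OUT_DIRECT is shown so the buffer is probed for write access, since the driver needs to write into it as well, and the device handle must be opened with FILE_FLAG_OVERLAPPED):

        #include <windows.h>
        #include <winioctl.h>

        // Made-up control code for illustration only.
        #define IOCTL_MYDEV_MAP_SHARED_BUFFER \
            CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

        #define SHARED_BUFFER_SIZE (20 * 1024 * 1024)   // example size

        // Allocates the buffer and hands it to the driver as the "output" buffer of
        // an overlapped IOCTL that stays in flight for the life of the connection.
        BOOL MapSharedBuffer(HANDLE hDevice, void **Buffer, OVERLAPPED *Ov)
        {
            *Buffer = VirtualAlloc(NULL, SHARED_BUFFER_SIZE,
                                   MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
            if (*Buffer == NULL) return FALSE;

            ZeroMemory(Ov, sizeof(*Ov));
            Ov->hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

            BOOL ok = DeviceIoControl(hDevice, IOCTL_MYDEV_MAP_SHARED_BUFFER,
                                      NULL, 0,                       // no input
                                      *Buffer, SHARED_BUFFER_SIZE,   // the shared buffer
                                      NULL, Ov);

            // The driver pends the request, so the call should "fail" with
            // ERROR_IO_PENDING and remain in flight until the driver completes it.
            return (!ok && GetLastError() == ERROR_IO_PENDING);
        }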

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • henrik_meida Member Posts: 76

    No. The point of using a METHOD_IN_DIRECT ioctl is that the operating system handles the locking and the creation of the MDL. You don't have to do it.

    Doesn't the I/O manager unlock the locked pages after the return from the IOCTL? I've never seen anyone manually call MmUnlockPages before returning from a direct-I/O IOCTL, and I assume that's because the I/O manager, having locked the pages itself, unlocks them as well; therefore I can't use the buffer after the IOCTL is finished.

    This is important because I need the buffer to stay locked in physical memory until my driver unloads; I want to use the same buffer for communication for as long as my driver is running.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,665
    edited August 26

    Really, I have to say: You’d almost certainly be better off using normal Direct I/O for sending data back and forth. This type of shared-memory scheme is almost never the best way to implement a driver in Windows, and in the vast majority of cases its use illustrates a fundamental lack of understanding on the part of the developer. You know: “This is how we do it in Linux” or “This sounds really fast and will make a zero-copy scheme”…

    What happens if the UM service crashes for whatever reason?

    You need to handle cancellation of that pending Request.

    I am not using KMDF.

    That is probably a serious mistake.

    Doesn't the I/O manager unlock the locked pages after the return from the IOCTL?

    Good heavens, man… please take a minute to think! The I/O manager can’t unlock the pages until the IOCTL has been completed, right? That’s the purpose of keeping the IOCTL in progress.

    So what do you suggest? The buffer is simply a big array of structs

    To repeat myself, it’s all dependent on your needs… and on what assumptions you can make about the environment. I’m working right now on a driver that uses a scheme similar to the one you described; it runs on a dedicated system that’s part of a medical instrument. Among other things, the assumption is that the application will always “play nice” with the driver… so we simply trust the app to update its tail pointer (kept in the shared memory area) and to not write to the driver’s head pointer (also kept in the shared memory area). We don’t do any locking, and we have safeguards in place to ensure that there’s no overrun in a working system. In this system, this scheme is sufficient. On a commercial PC in a desktop or server environment, it would be grossly insufficient.
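
    As a purely illustrative sketch (not the actual layout of that medical-instrument driver), the shared area for a scheme like this might be laid out as a single-producer, single-consumer ring, with each index owned by exactly one side:

        // Both sides must compile this identically (fixed-size types, same packing).
        #define SHARED_RING_CAPACITY 4096

        typedef struct _SHARED_RECORD {
            ULONG Type;
            ULONG Length;
            UCHAR Payload[4088];            // example fixed-size record body
        } SHARED_RECORD;                    // 4096 bytes per record here

        typedef struct _SHARED_RING {
            volatile ULONG Head;            // written only by the driver (producer)
            volatile ULONG Tail;            // written only by the app (consumer)
            SHARED_RECORD  Records[SHARED_RING_CAPACITY];
        } SHARED_RING;

        // Producer: the ring is full when ((Head + 1) % SHARED_RING_CAPACITY) == Tail.
        // Consumer: the ring is empty when Head == Tail.
        // No locks; correctness depends on each index having exactly one writer and
        // on the safeguards/trust assumptions described above.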

    Again… this shared memory scheme is almost never the best approach. I’m using it in the driver I’m writing largely because of backwards compatibility. I’m not saying other approaches don’t have challenges, but the challenges they have are far smaller/simpler than those for a shared memory approach.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • henrik_meida Member Posts: 76
    edited August 26

    I did a little experiment:

    I received the user buffer address from the UM service, then did IoAllocateMdl -> MmProbeAndLockPages -> MmGetSystemAddressForMdlSafe.

    It works fine and I can access the user buffer at any time, but when the UM service exits, I get a BSOD: PROCESS_HAS_LOCKED_PAGES.

    But why? I thought I had locked the physical pages in memory? How do I prevent this BSOD?

    And how do I properly synchronize the UM service and the kernel driver?

  • henrik_meida Member Posts: 76

    @Peter_Viscarola_(OSR) said:
    Really, I have to say: You’d almost certainly be better off using normal Direct I/O for sending data back and forth. This type of shared-memory scheme is almost never the best way to implement a driver in Windows, and in the vast majority of cases its use illustrates a fundamental lack of understanding on the part of the developer. You know: “This is how we do it in Linux” or “This sounds really fast and will make a zero-copy scheme”…

    Right now I am just experimenting to see how much of a speed improvement I get, but it is a very, very large array and the communication is two-way, so the simple Direct I/O case is not enough for me: the driver also needs to modify this large array at arbitrary times, and both sides might read or change it a large number of times every second.

  • Don_Burn Member - All Emails Posts: 1,746

    "Right now i am just experimenting to see how much speed improvement i get, but its a very very large array and the communication is two way, so just with the simple case of DIRECT IO is not enough for me, because the driver also needs to modify this large array in arbitrary times, and both of them might change or read this a large number of times every second."

    First, what do you define as a very large array? I did a number of drivers that needed high speed and was amazed at the number of times clients would insist on shared memory for what actually wasn't that large an array. Tell us the amount of memory you are talking about, and what throughput your environment needs!

    You commented about the problem with PROCESS_HAS_LOCKED_PAGES; of course you hit that, since you passed process memory to the driver, which locked it. You don't own that memory, so when the process exits you get a crash.

    What Peter is proposing is one of the safest ways to do this. The kernel has the memory as long as it does not COMPLETE THE IOCTL: you can return STATUS_PENDING and everything is fine. But you must realize that when the process exits you will receive a cancel for that IOCTL and you must handle it.

    Anything else will involve allocating the memory in the kernel and sharing it, and that has its own set of problems with coordination and with security. Take some time and learn the approach Peter is proposing, but first tell us the memory size and throughput needed for the application.
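
    For illustration, the cancel side of that might look roughly like the sketch below (WDM-style, hypothetical names, reusing the SHARED_BUF_CONTEXT idea from the earlier sketch; the classic race between queuing the IRP and cancellation, and synchronization with any code still touching the buffer, still need the usual care, which is exactly what KMDF or a cancel-safe queue would give you for free):

        // Cancel routine for the long-lived mapping IOCTL: called (with the cancel
        // spin lock held) when the app exits or the request is otherwise cancelled.
        VOID SharedBufferCancel(PDEVICE_OBJECT DeviceObject, PIRP Irp)
        {
            PSHARED_BUF_CONTEXT ctx = DeviceObject->DeviceExtension;   // assumed layout

            IoReleaseCancelSpinLock(Irp->CancelIrql);

            // Stop using the shared mapping *before* completing the request;
            // once the IRP is completed the pages are unlocked and unmapped.
            ctx->PendingIrp = NULL;
            ctx->SystemVa   = NULL;

            Irp->IoStatus.Status = STATUS_CANCELLED;
            Irp->IoStatus.Information = 0;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
        }

        // In the IOCTL dispatch, before returning STATUS_PENDING:
        //     IoSetCancelRoutine(Irp, SharedBufferCancel);
        //     IoMarkIrpPending(Irp);
        //     return STATUS_PENDING;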

  • craig_howard Member Posts: 182
    edited August 26

    Good questions, it's something that comes up a lot!

    1) @Peter_Viscarola_(OSR)'s method is probably the only way to accomplish a shared memory segment between user mode and kernel mode, and you need to follow that recipe precisely: allocate the buffer in user mode, push it in an IRP, keep the IRP pending until you're all done and shutting down, complete the IRP, done. The key is that the IRP stays pending for the lifetime of the shared memory, which is the lifetime of the user-mode connection to the driver.

    2) You mentioned that you have a large array with many elements, all of which are being modified by either user mode or kernel mode, which means you need to consider the data structure you're going to use to manage that array. What you are really doing is sharing a data structure between two threads of execution; it just happens that one thread is in kernel mode and the other is in user mode. It's that data structure, and the thread-locking method, that is going to give you the biggest performance gain (orders of magnitude more than the user-mode-versus-kernel-mode question). There are a number of ways to do this (maps, hash tables, vectors, etc.); I would simply do some GoogleFu and then experiment to find what works best.

    3) User-mode malloc, always, always, always! Two good rules of thumb are that a) memory is allocated where it's used [user mode in user mode, kernel mode in kernel mode] and b) always go from least privileged to most privileged. In this case rule b) applies, user mode to kernel mode (least to most), so the allocation is done in user mode and passed to kernel mode. There are a ton of issues with allocating in kernel mode and passing to user mode; I have never seen a use case where that is needed, and I will be astonished if I ever do.

    What I would do first is determine that data structure and get the thread locking dialed in: allocate a block of memory of size X, populate it with your very, very large array of objects, and have two threads access it at the same time (see the sketch below).

    Get that working properly, safely, and quickly; then it's a simple pivot to use @Peter_Viscarola_(OSR)'s specific recipe for sharing memory, move one of those threads into a kernel system thread, and you're done ...

    But start with getting the data structure and thread locking dialed in (and you might have already); that is going to be the big performance bottleneck ...
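
    A minimal user-mode sketch of that prototype (made-up sizes and names; an SRW lock is used here purely for the two-thread experiment, since it won't carry over to the kernel side, where you'd use events or the lock-free head/tail indices discussed earlier):

        #include <windows.h>
        #include <stdio.h>

        #define ENTRY_COUNT 4000

        typedef struct _ENTRY { LONG State; UCHAR Payload[4996]; } ENTRY;  // ~5000-byte records

        static ENTRY  *g_array;
        static SRWLOCK g_lock = SRWLOCK_INIT;

        static DWORD WINAPI Producer(LPVOID arg)        // stands in for the kernel side
        {
            UNREFERENCED_PARAMETER(arg);
            for (int i = 0; i < 1000000; i++) {
                AcquireSRWLockExclusive(&g_lock);
                g_array[i % ENTRY_COUNT].State++;       // "driver writes a record"
                ReleaseSRWLockExclusive(&g_lock);
            }
            return 0;
        }

        static DWORD WINAPI Consumer(LPVOID arg)        // stands in for the UM service
        {
            UNREFERENCED_PARAMETER(arg);
            LONGLONG sum = 0;
            for (int i = 0; i < 1000000; i++) {
                AcquireSRWLockShared(&g_lock);
                sum += g_array[i % ENTRY_COUNT].State;  // "service scans a record"
                ReleaseSRWLockShared(&g_lock);
            }
            return (DWORD)(sum & 1);
        }

        int main(void)
        {
            HANDLE t[2];

            g_array = VirtualAlloc(NULL, ENTRY_COUNT * sizeof(ENTRY),
                                   MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
            if (g_array == NULL) return 1;

            t[0] = CreateThread(NULL, 0, Producer, NULL, 0, NULL);
            t[1] = CreateThread(NULL, 0, Consumer, NULL, 0, NULL);
            WaitForMultipleObjects(2, t, TRUE, INFINITE);
            printf("done\n");
            return 0;
        }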

    You also mentioned another issue in a later post which needs some further thought on your part ... the lifetime and scope of this shared array. Is this array truly meant to live for the lifetime of the driver, and how will you manage multiple user-mode clients? If the lifetime is truly the lifetime of the driver and not the client connection, then you will need to persist the array after the client has disconnected, which means the driver keeps a copy of the array lying around ... the pattern would be:

    • Usermode connects with its big-blob IRP
    • Driver checks to see if there is an existing big blob from a prior session; if so, it copies that into the user-mode big blob
    • Do whatever between kernel and usermode
    • When usermode signals it's done, the driver copies the big-blob data into a kernel-allocated big blob
    • Driver completes the IRP

    Note that we are keeping a copy of the big blob between sessions and copying it into a user-mode big blob; we are not sharing the kernel-allocated big blob itself.

    The other item is scope: remember that there may be multiple clients attaching to the driver; how are those handled? Multiple big blobs, or one "blob to rule them all"? It gets complex pretty quickly ...

  • Don_Burn Member - All Emails Posts: 1,746

    Craig stated "But start with getting the data structure and thread locking dialed in (and you might already), that is going to be the big performance bottleneck .." That is why I asked the actual size of the data and the throughput. I've ripped out over the years at least a dozen shared memory approaches and replaced them with IOCTL's (admittedly in some cases with WDM bypassing some or all of the standard model) and gotten faster speeds. Ten years ago on not the fastest dual processor system for a test I got over 1,000,000 IOCTL's per second on with a driver.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,665
    edited August 26

    It works fine and I can access the user buffer at any time, but when the UM service exits, I get a BSOD: PROCESS_HAS_LOCKED_PAGES.

    But why? I thought I had locked the physical pages in memory? How do I prevent this BSOD?

    Because you are, for some reason that only you know, not following the instructions I provided.

    With all due respect, Mr. @henrik_meida ... You really need to learn more about how to write Windows drivers. Especially before you "break the rules" and write a driver like that with a shared memory approach.

    And how do I properly synchronize the UM service and the kernel driver?

    Again, with all due respect, Mr. @henrik_meida, you need to stop asking the same question, the one that I have now answered multiple times.

    You're welcome.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Tim_Roberts Member - All Emails Posts: 14,093

    Doesn't the I/O manager unlock the locked pages after the return from the IOCTL?

    No. The I/O manager unlocks the pages when you complete the IRP. That's why the driver has to queue the IRP long-term. Returning from the handler is a non-event.

    I received the user buffer address from the UM service, then did IoAllocateMdl -> MmProbeAndLockPages -> MmGetSystemAddressForMdlSafe.

    It works fine and I can access the user buffer at any time, but when the UM service exits, I get a BSOD: PROCESS_HAS_LOCKED_PAGES.

    Just because it worked once for you in ideal circumstances does not mean it's right. Windows systems are extremely dynamic. Processes die suddenly for many reasons, so the address might become invalid before you are able to probe and lock it. Yes, that can be handled, but it's tricky, and what's the point of taking on that burden when the I/O manager is perfectly happy to do it for you?

    Look, at least three of the people in this thread have been writing Windows drivers for 30 years. We KNOW what we're talking about. You aren't going to stumble upon a better approach that we haven't seen before.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • henrik_meida Member Posts: 76
    edited August 26

    First off, thank you everyone for taking the time to answer my question.

    @Don_Burn said:
    First, what do you define as a very large array? I did a number of drivers that needed high speed and was amazed at the number of times clients would insist on shared memory for what actually wasn't that large an array. Tell us the amount of memory you are talking about, and what throughput your environment needs!

    Every member of that array is around 5000 bytes, and there are usually 3000-4000 array members (so roughly 15-20 MB in total), but they come and go; think of them as a linked list. Every array member is a structure that reports something that happened at a specific time; the UM service scans them and modifies them based on the result, but the details don't really matter right now.

    @craig_howard said:
    2) You mentioned that you have a large array with many elements, all of which are being modified by either user mode or kernel mode, which means you need to consider the data structure you're going to use to manage that array. What you are really doing is sharing a data structure between two threads of execution; it just happens that one thread is in kernel mode and the other is in user mode. It's that data structure, and the thread-locking method, that is going to give you the biggest performance gain (orders of magnitude more than the user-mode-versus-kernel-mode question). There are a number of ways to do this (maps, hash tables, vectors, etc.); I would simply do some GoogleFu and then experiment to find what works best.

    I have used events to synchronize between UM and the kernel before; I'll give that a try and see how well it works. But if there are any faster or better approaches that you know of, please let me know. I'll probably implement multiple approaches and benchmark them to see which works best.

    @Peter_Viscarola_(OSR) said:

    It works fine and I can access the user buffer at any time, but when the UM service exits, I get a BSOD: PROCESS_HAS_LOCKED_PAGES.

    But why? I thought I had locked the physical pages in memory? How do I prevent this BSOD?

    Because you are, for some reason that only you know, not following the instructions I provided.

    Sorry, I didn't understand it the first time; now I get it. Use Direct I/O -> get the locked user buffer -> pend the IRP with STATUS_PENDING -> now when the UM process exits I get notified, and no more BSOD.

    @Tim_Roberts said:
    Look, at least three of the people in this thread have been writing Windows drivers for 30 years. We KNOW what we're talking about. You aren't going to stumble upon a better approach that we haven't seen before.

    Oh, I know; trust me, I have been following this forum for a long time, and the fact that so many veteran driver developers are answering my question is an honor. Thank you everyone for taking the time to give me suggestions.

  • craig_howard Member Posts: 182

    Events are fast, safe, efficient mechanisms for communicating state changes, I've found; I would use them. Remember that the allocations happen in user mode, so create the event handles in user mode and pass them to the driver in the IRP, in a structure of some sort.

    Get your algorithm working first in user mode, between two threads using events to synchronize, then move it to the driver. That's also the place to figure out how to handle multiple clients (multiple threads), clients detaching/attaching (threads coming and going), and whether you want a persistent array or not ...
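
    To make that concrete, a hypothetical kernel-side capture of those handles might look like this (made-up structure and names; the calls must run at PASSIVE_LEVEL in the requesting process's context, i.e. in the IOCTL dispatch, before the request is pended):

        // Input-buffer layout the app would use to pass its event handles (hypothetical).
        typedef struct _SHARED_EVENTS_IN {
            HANDLE DataReadyEvent;        // driver signals, app waits
            HANDLE SpaceAvailableEvent;   // app signals, driver waits
        } SHARED_EVENTS_IN;

        NTSTATUS
        CaptureSharedEvents(const SHARED_EVENTS_IN *In, PKEVENT *DataReady, PKEVENT *SpaceAvailable)
        {
            NTSTATUS status;

            // Turn the user-mode handles into referenced KEVENT pointers the driver
            // can use from any context. UserMode enforces the handle validity checks.
            status = ObReferenceObjectByHandle(In->DataReadyEvent, EVENT_MODIFY_STATE,
                                               *ExEventObjectType, UserMode,
                                               (PVOID *)DataReady, NULL);
            if (!NT_SUCCESS(status)) return status;

            status = ObReferenceObjectByHandle(In->SpaceAvailableEvent, SYNCHRONIZE,
                                               *ExEventObjectType, UserMode,
                                               (PVOID *)SpaceAvailable, NULL);
            if (!NT_SUCCESS(status)) {
                ObDereferenceObject(*DataReady);
                return status;
            }
            return STATUS_SUCCESS;
        }

        // Signal from the driver with KeSetEvent(*DataReady, IO_NO_INCREMENT, FALSE),
        // and ObDereferenceObject() both events when the pended IOCTL is completed.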

  • Peter_Viscarola_(OSR) Administrator Posts: 8,665

    And shall we discuss why this isn’t a KMDF driver that you’re writing?

    I mean, maybe there’s a good reason…

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • MBond2 Member Posts: 360

    There seems to be a lot of naïve advice on this thread. If you care about performance, don't use malloc. And if you can't already do regular multi-threading in UM within a single process, and across shared-memory IPC (for example, a shared DLL section or a memory-mapped file), don't try to do it between UM and KM.

    There are other issues that have not been mentioned. Size of pointer issues between 32 bit and 64 bit calling processes. And what happens when there is data overflow or underflow? There are many complexities that have to be handled in one way shape or form

    As Peter says, if you are working on a closed system, or developing a learning or hobby project the required standards are different than if you propose a mass market deployment

  • Tim_Roberts Member - All Emails Posts: 14,093

    If you care about performance, don't use malloc.

    You know that's mostly nonsense, don't you? The VC++ malloc routine is little more than a wrapper around HeapAlloc. Unless you're doing tens of thousands of allocations a second, malloc is not a concern. There are many more important things to worry about.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • craig_howard Member Posts: 182
    edited August 27

    @MBond2 said:
    There seems to be a lot of naïve advice on this thread. If you care about performance, don't use malloc.

    Ah, the malloc wars ... I remember when those swept through the forums; those were good times! Lots of folks got put on moderation ... they eventually led to discussions of nontemporal allocations [ https://vgatherps.github.io/2018-09-02-nontemporal/ ] which are very interesting but, as a practical matter, pointless.

    I personally always allocate in user mode with VirtualAlloc(), and in kernel mode with ExAllocatePoolWithTag() (now ExAllocatePool2), and I have never been disappointed ... and if I'm trying to allocate a few hundred thousand times a second, then the performance problem is sitting at the keyboard, not in the compiler.

    I recall that the OP wanted a buffer allocated once, exactly once, and accessed by both UM and KM. I'm pretty confident that the speed of that single malloc() won't cripple the performance of the operation ...

    There are other issues that have not been mentioned. Size of pointer issues between 32 bit and 64 bit calling processes.

    Don't care; it's a single blob of memory allocated on a 64-bit OS by a (presumably) 64-bit application ... there is exactly one pointer that will be passed, and unless the OP does some special machinations that pointer is going to be 64-bit ... but yes, the OP should check in the IRP-handler firewall whether the caller is a 32-bit client (I personally fail any 32-bit caller I see) ...
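
    For what it's worth, that check is tiny; a sketch of the "fail 32-bit callers up front" idea in a 64-bit WDM dispatch routine might be (hypothetical; the status code chosen is a matter of taste):

        #if defined(_WIN64)
            if (IoIs32bitProcess(Irp)) {
                // Refuse IOCTLs from 32-bit callers outright.
                Irp->IoStatus.Status = STATUS_NOT_SUPPORTED;
                Irp->IoStatus.Information = 0;
                IoCompleteRequest(Irp, IO_NO_INCREMENT);
                return STATUS_NOT_SUPPORTED;
            }
        #endif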

    And what happens when there is data overflow or underflow?

    It's a blob of memory; what is going to overflow or underflow?

    There are many complexities that have to be handled in one way shape or form

    As Peter says, if you are working on a closed system, or developing a learning or hobby project the required standards are different than if you propose a mass market deployment

    @Peter_Viscarola_(OSR) The OP might (for some reason) be writing a kernel service or something else that doesn't need to be concerned about PnP events, but yes, going non-KMDF needs a really good reason ...

  • rusakov2 Member Posts: 54

    @Peter_Viscarola_(OSR) said:
    Have the app create the buffer, and send a Direct I/O IOCTL to the driver describing that buffer. The driver then keeps the Request in progress. The app and the driver can then exchange data in this buffer anytime.

    When the user app closes the handle (EvtFileCleanup), the driver completes the IOCTL (thereby invalidating the mapping).

    Easy to implement and to understand. Clean.

    Do I understand correctly, Peter, that this suggestion is for the case where all the memory resides in the DDR memory of the Windows platform, not in device memory (say, a PCIe-connected VLIW device with 6 GB of its own memory, of which 3 GB needs to be made read-write accessible to a user-mode process)?

  • henrik_meida Member Posts: 76
    edited August 27

    @MBond2 said:
    There are other issues that have not been mentioned. Size of pointer issues between 32 bit and 64 bit calling processes. And what happens when there is data overflow or underflow? There are many complexities that have to be handled in one way shape or form

    But will that be an issue if I am using Direct I/O's output buffer to pass the user-mode buffer address to my driver, from a 32-bit application to a 64-bit driver? Surely the I/O manager will handle any complications, right?

    @Peter_Viscarola_(OSR) said:
    And shall we discuss why this isn’t a KMDF driver that you’re writing?

    I mean, maybe there’s a good reason…

    Peter

    Well, in my particular case I didn't see any benefit, at least speed-wise, to using KMDF, plus I'm more comfortable writing WDM. In my driver I'm basically monitoring every event that happens on the system, such as process/thread creation, handles, events, and almost anything else that is happening (think of Procmon but with even more events), and reporting it back to UM to decide whether something bad is happening. The UM service uses complex machine-learning-based logic to decide, so I can't move it to the kernel (or at least it would be a LOT of work to move it; it uses a large number of UM libraries).

  • craig_howard Member Posts: 182

    @rusakov2 said:

    Do I understand correctly, Peter, that this suggestion is for the case where all the memory resides in the DDR memory of the Windows platform, not in device memory (say, a PCIe-connected VLIW device with 6 GB of its own memory, of which 3 GB needs to be made read-write accessible to a user-mode process)?

    You're conflating three completely separate topics ...

    DDR memory, or the "stuff on the sticks", is the only memory that Windows can use as general system memory. Memory located on a separate PCIe card belongs to that device; that's why my graphics card has 64GB of video memory and I only see the 16GB of memory on my motherboard ...

    For the OS to be able to access any of that 64GB of video memory, the PCIe device firmware needs to expose part of it as a BAR to the OS; the rub is that every byte exposed takes up a byte of the system's physical address space, plus the PTEs and other resources needed to map it, so making 3GB visible to the OS costs you 3GB of address space and the mapping resources that go with it. Essentially, nothing is free; you can't just drop in a card with 64GB of memory and poof, your 16GB machine is now 80GB.

    Finally, memory windowed from PCIe space is kernel mode only; there's no way that a usermode app will be able to get to that memory (again not exactly, but doing that will cause all kinds of new issues and is best ignored) but essentially the only memory available to usermode is what the SDK gives you

    Typically a PCIe card will expose a small region of memory as a BAR, say 4MB or so, and use that as a DMA target. The driver will map that into its kernel address space and use it for its DMA operations with the card. The card's onboard firmware will use the rest of the memory for its own purposes, usually for display pages and such.
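
    A hypothetical sketch of that "map the BAR into kernel space" step, using the translated memory resource the PnP manager hands the driver at start-up (IRP_MN_START_DEVICE in WDM, EvtDevicePrepareHardware in KMDF):

        // Maps a CmResourceTypeMemory resource (a BAR) into system address space.
        // Illustration only; real code checks the resource type and saves the length.
        PVOID MapDeviceBar(const CM_PARTIAL_RESOURCE_DESCRIPTOR *Desc)
        {
            return MmMapIoSpace(Desc->u.Memory.Start,
                                Desc->u.Memory.Length,
                                MmNonCached);
        }

        // Unmap with MmUnmapIoSpace(Va, Desc->u.Memory.Length) at stop/remove time.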

  • craig_howard Member Posts: 182

    @henrik_meida said:

    @MBond2 said:
    There are other issues that have not been mentioned. Size of pointer issues between 32 bit and 64 bit calling processes. And what happens when there is data overflow or underflow? There are many complexities that have to be handled in one way shape or form

    But will that be an issue if I am using Direct I/O's output buffer to pass the user-mode buffer address to my driver, from a 32-bit application to a 64-bit driver? Surely the I/O manager will handle any complications, right?

    No, there are two big problems with 32-bit applications calling into a 64-bit driver ... First, you will need to make sure that the data written to and read from the user-mode (32-bit) buffer is always laid out the way the 32-bit caller expects; since the APIs in the 64-bit driver use 64-bit data, you'll be doing lots of casts and truncations and such. That's a hassle, and it's easy to miss something. The second issue is security; a 32-bit application has many more vulnerabilities to malware than a 64-bit one. With shellcode [ https://www.sentinelone.com/blog/malicious-input-how-hackers-use-shellcode/ ], it's all about two pieces of information: location and a vulnerability ... you need both to be able to execute a pivot. With a 64-bit process and ASLR it's very difficult for shellcode to determine location; with a 32-bit application, even with ASLR, there's only a total of 4GB for the shellcode to scan for a signature or a heap spray, and then it's game over. There are other issues with 32-bit applications and the WOW64 DLL layers and call gates that are used and such, but the bottom line is that I simply refuse to allow a 32-bit application to send IOCTLs to my drivers (ReadFile/WriteFile is different; I don't care about that) and I've never been called on it. 32 bits: just say "no" ...

    @Peter_Viscarola_(OSR) said:
    And shall we discuss why this isn’t a KMDF driver that you’re writing?

    I mean, maybe there’s a good reason…

    Peter

    Well, in my particular case I didn't see any benefit, at least speed-wise, to using KMDF, plus I'm more comfortable writing WDM. In my driver I'm basically monitoring every event that happens on the system, such as process/thread creation, handles, events, and almost anything else that is happening (think of Procmon but with even more events), and reporting it back to UM to decide whether something bad is happening. The UM service uses complex machine-learning-based logic to decide, so I can't move it to the kernel (or at least it would be a LOT of work to move it; it uses a large number of UM libraries).

    From your description I kind of figured you were doing some kind of monitor-and-report application, using APIs in the kernel that are not available in user mode to gather info and push it back to user mode ... and that's fine; quite a few utilities do the same thing. You need to be clear about what KMDF gives you versus WDM, though ... KMDF is not simply a wrapper for PnP but a framework for the WDM stuff that you would otherwise need to write yourself. I'm not going to repeat what you can find right here about KMDF vs WDM, but an analogy I like to use is a manual transmission versus an automatic in a car. Both will get you down the road, but a manual has a lot more things that you, the driver, have to think about and master, while with an automatic you just have to master pressing the gas pedal. Just like an automatic transmission lets you focus on driving, KMDF lets you focus on the business logic of your driver ...

    You seem to be fixated on speed, which is understandable, but IMHO by far the biggest performance wins will be in the event-insertion code on the kernel side, the event-reading code on the user-mode side, the event cleanup handling in user mode, and the signaling mechanism between them. The "speed" you feel you might lose with KMDF is trivial compared to getting that dialed in, and the time you'll spend debugging and fixing problems by doing things the WDM way that are already done for you in KMDF could easily increase the time spent on the project by an order of magnitude.

    As for KMDF being unfamiliar, well the adage is that progress is always forward and usually once we move to something new we can look back and wonder why we ever did it the "old" way ...

  • rusakov2 Member Posts: 54
    edited August 27

    @craig_howard said:

    You're conflating three completely separate topics ...

    Sorry, the topic sounded similar to me. Perhaps I extended it too far. I thought the question was how to share some kernel-mode-accessible memory, whatever that memory is, between a KMDF driver and a user-mode app. Apparently it is not "whatever that memory is".
    My question was about the applicability of the suggested approach, where malloc or an equivalent is recommended, when the memory in question (its size and addresses) is defined by the device.

    Typically a PCIe card will expose a small region of memory as a BAR, say 4MB or so, and use that as a DMA target. The driver will map that into its kernel address space and use it for its DMA operations with the card.

    The legacy VLIW-based PCIe device used in my example has no DMA that Windows 10 would understand; in this particular case, it was never designed to run with Windows 10.

  • Tim_Roberts Member - All Emails Posts: 14,093

    But will that be an issue if I am using Direct I/O's output buffer to pass the user-mode buffer address to my driver, from a 32-bit application to a 64-bit driver? Surely the I/O manager will handle any complications, right?

    As long as you're passing your buffer AS the direct I/O output buffer, then yes, your driver won't even need to know. Problems arise when you try to pass pointers or addresses in your data. If you're passing a structure that contains void * or ULONG_PTR, then your driver has to translate the structures depending on the calling process' bit width. The problem is not as severe as @craig_howard makes it seem; most people don't pass pointers in their data, so the problems that arise are generally from careless structure packing.
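
    As a hypothetical illustration of that translation, for a structure that (unwisely) carries a pointer, a 64-bit driver might keep a 32-bit mirror of the layout and capture it like this (made-up names; note that the two layouts also pack differently, which is the structure-packing point above):

        typedef struct _REQ {              // layout a 64-bit caller uses (16 bytes)
            PVOID Buffer;
            ULONG Length;
        } REQ;

        typedef struct _REQ_32 {           // layout a 32-bit caller uses (8 bytes)
            ULONG Buffer;                  // 32-bit user-mode address
            ULONG Length;
        } REQ_32;

        NTSTATUS CaptureRequest(PIRP Irp, const VOID *InBuf, ULONG InLen, REQ *Out)
        {
        #if defined(_WIN64)
            if (IoIs32bitProcess(Irp)) {
                const REQ_32 *r32 = InBuf;
                if (InLen < sizeof(*r32)) return STATUS_BUFFER_TOO_SMALL;
                Out->Buffer = (PVOID)(ULONG_PTR)r32->Buffer;   // zero-extend the address
                Out->Length = r32->Length;
                return STATUS_SUCCESS;
            }
        #endif
            if (InLen < sizeof(*Out)) return STATUS_BUFFER_TOO_SMALL;
            RtlCopyMemory(Out, InBuf, sizeof(*Out));
            return STATUS_SUCCESS;
        }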

    Well, in my particular case I didn't see any benefit, at least speed-wise, to using KMDF, plus I'm more comfortable writing WDM.

    In terms of driver performance, no, there is no speed gain. That was not its purpose. In terms of your development time and your future maintenance time, the gain is enormous. It has been said, with only a slight exaggeration, that there has never been a WDM driver that handled PnP and Power in a correct way. KMDF handles all of that for you.

    Remember that KMDF is, at this point, 16 years old. KMDF has been a part of Windows now for more than half its life. It is the established standard, proven and tested. It is, frankly, unprofessional for you to ignore it.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,665
    edited August 28

    Well, in my particular case I didn't see any benefit, at least speed-wise, to using KMDF, plus I'm more comfortable writing WDM.

    OK. So, as I suspected, no good reason for not using KMDF.

    Writing a new WDM driver at this point, except in certain very narrow situations, falls well short of the standard of professional practice for a kernel-mode software engineer. I mean, using WDF isn’t only “best practice”… it’s actually the minimum standard of “reasonable” practice.

    There are exceptions where WDM makes sense: kernel services, for one. File systems and file system filters… they’re in a category by themselves and can’t really be termed WDM. And a very limited number of other types of drivers and filters.

    I urge you to rethink your approach.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • rusakov2 Member Posts: 54

    @craig_howard said:

    Finally, memory windowed from PCIe space is kernel mode only; there's no way that a usermode app will be able to get to that memory

    The unfortunate case is that the prototype design needs to accomplish exactly that on Windows 10; otherwise, what's the point of having a PCIe-connected, $5000 piece of FPGA silicon with 2GB of SDRAM full of data that is not easily read-write accessible from a Windows 10 user-mode application...

    (again not exactly, but doing that will cause all kinds of new issues and is best ignored)

    It works; it's just not easy.

  • craig_howard Member Posts: 182
    edited August 29

    @rusakov2 Simply explain to your firmware designer and the board EE that they have made a common and tragic mistake in not doing a little bit of GoogleFu to see that Windows is not Linux, and that because of that misunderstanding the Windows 10 marketplace is no longer available to them ...

    I would first suggest that you, the EE and the firmware designer understand what you are working with here [ https://answers.microsoft.com/en-us/windows/forum/all/physical-and-virtual-memory-in-windows-10/e36fb5bc-9ac8-49af-951c-e7d39b979938 ] ... then you and your firmware designer and board EE's all need to realize a few truths about Windows ...

    • You live in the Windows world as a tightly controlled, constrained guest entirely at the whim of the OS. You may ask for resources, but it's up to Windows if it wants to grant them to you or not; you have no control over that, and the OS may give you less and less over time and definitely not as much as you might want
    • You live in a world with other drivers and devices, each of which is treated equally and fairly by Windows. You don't get any more special privileges just because you're a PCIe card than some cheap USB thumb drive gets, and that thumb drive may very well grab those same resources you were looking for
    • Your only way of getting to that 2GB of data is through DMA to a much smaller buffer, so I would tell your firmware designer and EE to start looking for DMA IP ...

  • Tim_Roberts Member - All Emails Posts: 14,093

    And do remember that today's processors copy data very, very quickly. The cost of an extra copy only becomes an issue in absurd situations like 10Gb networking.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,665

    Well, wait. You can certainly write a driver that takes a big chunk of device memory and maps it into the user-mode address space of a given process.

    Mr @craig_howard is being a bit overly… dramatic? restrictive? … about this.

    OTOH, there’s no doubt that Mr @craig_howard is right … it is almost universally true that you should DMA to this memory. Or copy to it, as Mr @Tim_Roberts suggested. That’s how we typically get data into a big, fast FPGA with multiple GB of onboard DDR. The idea certainly isn’t to bring those GBs out via the BARs and map them into the user app.

    And increasingly, the idea is to return results directly to host memory from the FPGA, using a “slave bridge” IP (I think we’re supposed to say “address translator” now, but nobody would know what I was talking about) to do the DMA transfers without having to set up and do traditional “DMA transfers” — I’ve done that in two projects recently. This, instead of putting the result in device address space and having the host (or device) DMA it back.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers
