IOCTL as an alternative to WINSOCK for the most direct data path to user space?

OSR_Community_User Member Posts: 110,217
Hi everyone!

I am working on a virtual audio driver that sends audio data from the kernel
to a user-space application, which streams the received audio data over a
network.

I am currently using a kernel socket to send the audio data from kernel to
user space.

Because some of you told me I should use IOCTLs instead of WINSOCK, I would
like to have your opinion about IOCTLs and make sure they would be a better
solution than WINSOCK.

So, here is what I do:
* I am developing a virtual audio driver, playback only and WaveCyclic based
* I want to send audio data from kernel space to user space as soon as possible
* I currently copy the audio data and schedule a DPC each time
  IDMAChannel::CopyTo is called
* I send the audio data over a kernel socket each time my custom DPC executes
* the communication is essentially from kernel to user space

What is the best strategy, in my case, to send audio data as soon as possible
using IOCTLs? Do I still need to use a DPC? Is sending data from kernel to
user space synchronous?

Is there any other alternative?

Thanks in advance.

Matt

Comments

  • Slava_Imameev Member Posts: 480
    If you want to continue with IOCTLs, then the simplest solution is to issue an IOCTL from the user application and either block the IRP in the driver or return STATUS_PENDING (do not forget IoMarkIrpPending). In the DPC you fill the IRP's buffer and call IoCompleteRequest for the pending/blocked IRP.

    An evolution of this approach, to reduce latency and increase throughput, is to issue multiple asynchronous IOCTLs and wait for their completion (overlapped I/O). The driver puts the IRPs in a list and returns STATUS_PENDING. The DPC removes an IRP from the list, fills in the data, and completes the IRP.
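
    Here is a minimal WDM-style sketch of that pended-IOCTL queue. The device-extension layout, the routine names, and the audio-copy step are placeholders invented for illustration; cancellation handling (IoSetCancelRoutine) and output-buffer size validation are omitted, so treat this as the shape of the approach rather than a finished implementation.

    #include <ntddk.h>

    /* Hypothetical device extension: only the fields this sketch needs. */
    typedef struct _AUDIO_DEVICE_EXTENSION {
        KSPIN_LOCK PendingLock;
        LIST_ENTRY PendingIrps;      /* IOCTLs waiting for audio data */
    } AUDIO_DEVICE_EXTENSION, *PAUDIO_DEVICE_EXTENSION;

    /* IRP_MJ_DEVICE_CONTROL: park the request instead of completing it now. */
    NTSTATUS DispatchDeviceControl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PAUDIO_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;
        KIRQL irql;

        KeAcquireSpinLock(&ext->PendingLock, &irql);
        IoMarkIrpPending(Irp);                    /* must precede STATUS_PENDING */
        InsertTailList(&ext->PendingIrps, &Irp->Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&ext->PendingLock, irql);

        return STATUS_PENDING;
    }

    /* Custom DPC: pop one parked IRP, copy the captured audio, complete it. */
    VOID AudioDpc(PKDPC Dpc, PVOID Context, PVOID SysArg1, PVOID SysArg2)
    {
        PAUDIO_DEVICE_EXTENSION ext = Context;
        PLIST_ENTRY entry;
        PIRP irp;
        KIRQL irql;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(SysArg1);
        UNREFERENCED_PARAMETER(SysArg2);

        KeAcquireSpinLock(&ext->PendingLock, &irql);
        if (IsListEmpty(&ext->PendingIrps)) {
            KeReleaseSpinLock(&ext->PendingLock, irql);
            return;                               /* nobody is waiting right now */
        }
        entry = RemoveHeadList(&ext->PendingIrps);
        KeReleaseSpinLock(&ext->PendingLock, irql);

        irp = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);

        /* For METHOD_BUFFERED, copy the new audio block into the system buffer,
           e.g. RtlCopyMemory(irp->AssociatedIrp.SystemBuffer, data, bytes). */
        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = 0;            /* set to the number of bytes copied */
        IoCompleteRequest(irp, IO_NO_INCREMENT);
    }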

    There is another solution which probably has the lowest latency and highest throughput: a circular buffer. The user application allocates a buffer and an event and sends them to the driver via an IOCTL. The driver locks the buffer (MmProbeAndLockPages) so it can be used in the DPC, and gets a pointer to the event object (ObReferenceObjectByHandle). Both the driver and the user application implement a circular buffer; the driver writes into it and the application reads from it. A single-reader/single-writer circular buffer can be implemented without any lock as long as the pointer arithmetic is atomic, which is the case on IA-32 and AMD-64 for aligned pointers, i.e. when (address mod sizeof(void*)) == 0. The DPC routine writes into the circular buffer and sets the event to the signaled state. The user application waits on the event; when WaitForSingleObject returns, it reads data from the buffer until the buffer is empty and then goes back to waiting on the event.
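
    The registration step could look roughly like the following sketch, assuming a METHOD_BUFFERED IOCTL whose input structure (AUDIO_SHARED_BUFFER_IN) is invented for illustration, and assuming the device extension also carries Mdl, SharedBase, SharedSize and DataEvent fields. Input-size validation and synchronization with the DPC are left out.

    /* Additional device-extension fields assumed by this sketch:
     *     PMDL    Mdl;         locked MDL for the shared buffer
     *     PVOID   SharedBase;  system-space mapping of the buffer
     *     ULONG   SharedSize;  buffer size in bytes
     *     PKEVENT DataEvent;   referenced "data available" event
     */
    typedef struct _AUDIO_SHARED_BUFFER_IN {
        PVOID  UserBuffer;   /* ring buffer allocated by the application */
        ULONG  Size;         /* its size in bytes */
        HANDLE DataEvent;    /* event the application waits on */
    } AUDIO_SHARED_BUFFER_IN, *PAUDIO_SHARED_BUFFER_IN;

    NTSTATUS RegisterSharedBuffer(PAUDIO_DEVICE_EXTENSION ext, PIRP Irp)
    {
        PAUDIO_SHARED_BUFFER_IN in = Irp->AssociatedIrp.SystemBuffer;
        NTSTATUS status;

        ext->Mdl = IoAllocateMdl(in->UserBuffer, in->Size, FALSE, FALSE, NULL);
        if (ext->Mdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        __try {
            /* Pin the user pages so the DPC can touch them at DISPATCH_LEVEL. */
            MmProbeAndLockPages(ext->Mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(ext->Mdl);
            ext->Mdl = NULL;
            return GetExceptionCode();
        }

        /* System-space mapping that stays valid in any process context. */
        ext->SharedBase = MmGetSystemAddressForMdlSafe(ext->Mdl, NormalPagePriority);

        /* Turn the user-mode event handle into a referenced KEVENT pointer. */
        status = ObReferenceObjectByHandle(in->DataEvent, EVENT_MODIFY_STATE,
                                           *ExEventObjectType, UserMode,
                                           (PVOID *)&ext->DataEvent, NULL);

        if (ext->SharedBase == NULL || !NT_SUCCESS(status)) {
            if (NT_SUCCESS(status)) {
                ObDereferenceObject(ext->DataEvent);
                ext->DataEvent = NULL;
            }
            MmUnlockPages(ext->Mdl);
            IoFreeMdl(ext->Mdl);
            ext->Mdl = NULL;
            return NT_SUCCESS(status) ? STATUS_INSUFFICIENT_RESOURCES : status;
        }

        ext->SharedSize = in->Size;
        return STATUS_SUCCESS;
    }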
  • Slava_Imameev Member Posts: 480
    If you decide to continue with a circular buffer, you need to unlock the buffer in IRP_MJ_CLEANUP in case the application crashes without gracefully closing the connection via an IOCTL (or in case you decide not to implement explicit connection termination). The application should hold a handle to the driver's communication object until it has finished processing the data stream. If you fail to do this, the system will BSOD with Bug Check 0x76: PROCESS_HAS_LOCKED_PAGES.
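
    A sketch of that cleanup, using the same hypothetical extension fields as above. Note that real code must first make sure the DPC can no longer touch the buffer (for example by clearing a flag under the lock and calling KeFlushQueuedDpcs) before unlocking; that synchronization is omitted here.

    NTSTATUS DispatchCleanup(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PAUDIO_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        if (ext->Mdl != NULL) {
            MmUnlockPages(ext->Mdl);          /* release the pinned user pages */
            IoFreeMdl(ext->Mdl);
            ext->Mdl = NULL;
            ext->SharedBase = NULL;
        }
        if (ext->DataEvent != NULL) {
            ObDereferenceObject(ext->DataEvent);
            ext->DataEvent = NULL;
        }

        Irp->IoStatus.Status = STATUS_SUCCESS;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_SUCCESS;
    }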
  • OSR_Community_User Member Posts: 110,217
    I found an interesting article about IOCTLs and the Inverted Call Model.
  • Don_Burn Member - All Emails Posts: 1,715
    Premature optimization, such as the circular buffer suggested below, is the
    biggest cause of unreliable drivers. Start with the IOCTL approach; only
    after you have measured the performance, found it unacceptable, and profiled
    the driver to determine that the IOCTL mechanism is the problem should you
    consider other mechanisms.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 14:58:51 +0000, [email protected] said:

    > If you want to continue with IOCTLs, then the simplest solution is to issue
    > an IOCTL from the user application and either block the IRP in the driver
    > or return STATUS_PENDING [...]

    Hi!

    Several solutions to try. I'll probably start with the first two, which seem
    simpler.

    Thanks for your answer!
  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 15:12:55 +0000, Don Burn said:

    > Premature optimization, such as the circular buffer suggested below, is the
    > biggest cause of unreliable drivers. Start with the IOCTL approach [...]

    Hi!

    By the IOCTL approach, do you mean the first two solutions suggested by slavaim?

    I have not yet read the article about the Inverted Call Model; what's your
    opinion of it?

    Thanks for your help!

  • Slava_Imameev Member Posts: 480
    A few words on the circular buffer implementation: the IRP_MJ_CLEANUP solution is not perfect because of handle duplication (if somebody decides to attack the system). The safer solution is holding an IRP (i.e. an IOCTL) with a cancel routine and detecting abnormal process termination there. But it is more elaborate.
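
    For illustration, a stripped-down sketch of that idea: the application parks one long-lived "control" IOCTL, and the cancel routine (which the I/O manager invokes when the handle is closed or the process dies) performs the teardown. TeardownSharedBuffer and the ControlIrp field are hypothetical, and the classic set-cancel-routine races (Irp->Cancel already set, completion racing cancellation) are not handled here.

    VOID TeardownSharedBuffer(PAUDIO_DEVICE_EXTENSION ext);  /* hypothetical helper:
                                 MmUnlockPages, IoFreeMdl, ObDereferenceObject */

    VOID ControlIrpCancel(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PAUDIO_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        IoReleaseCancelSpinLock(Irp->CancelIrql);   /* entered with the lock held */

        ext->ControlIrp = NULL;
        TeardownSharedBuffer(ext);                  /* owning process is going away */

        Irp->IoStatus.Status = STATUS_CANCELLED;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }

    NTSTATUS ParkControlIrp(PAUDIO_DEVICE_EXTENSION ext, PIRP Irp)
    {
        IoMarkIrpPending(Irp);
        ext->ControlIrp = Irp;
        IoSetCancelRoutine(Irp, ControlIrpCancel);
        return STATUS_PENDING;
    }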
  • Don_Burn Member - All Emails Posts: 1,715
    Yes, the IOCTL approach is what should be used, and it is basically the
    inverted call model if you use multiple IOCTLs and STATUS_PENDING.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 15:24:31 +0000, Don Burn said:

    > Yes, the IOCTL approach is what should be used, and is basically the
    > inverted call if you use multiple IOCTL and STATUS_PENDING.

    Ok

  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 15:19:51 +0000, [email protected] said:

    > Some words on circular buffer implementation. The solution with
    > IRP_MJ_CLEANUP is not perfect because of handle duplication ( if
    > somebody decides to attack the system ). The safer solution is holding
    > an IRP( i.e. IOCTL) with a cancel routine and detecting abnormal
    > process termination in it. But it looks more elaborate.

    Ok
  • OSR_Community_User Member Posts: 110,217
    Isn't a device's driver supposed to transfer data from its ... EvtIoRead callback routine?
  • Tim_Roberts Member - All Emails Posts: 13,694
    [email protected] wrote:
    > Isn't a device's driver supposed to transfer data from it's ... EvtIoRead callback routine ?

    No. A device driver transfers its data at whatever point it actually
    has data. If you happen to have data already queued up, then you can
    certainly transfer it in EvtIoRead or EvtIoDeviceControl, but most
    drivers aren't that lucky. They have to tuck those requests into a
    queue somewhere. Later on, when the driver actually receives data,
    whether from a device, or a bus, or another driver, it can pop the next
    waiting request and complete it.
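
    As a rough KMDF illustration of that pattern (assuming a KMDF driver rather than the OP's PortCls/WaveCyclic miniport; the context type, the PendingQueue field and AudioDataArrived are names invented for the sketch): park arriving requests on a manual queue, and complete them from wherever the data actually shows up.

    #include <ntddk.h>
    #include <wdf.h>

    /* Hypothetical per-device context holding a manual queue for parked requests. */
    typedef struct _DEVICE_CONTEXT {
        WDFQUEUE PendingQueue;   /* created with WdfIoQueueDispatchManual */
    } DEVICE_CONTEXT, *PDEVICE_CONTEXT;
    WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext)

    /* EvtIoRead: no data on hand yet, so park the request on the manual queue. */
    VOID EvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
    {
        PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
        NTSTATUS status;

        UNREFERENCED_PARAMETER(Length);

        status = WdfRequestForwardToIoQueue(Request, ctx->PendingQueue);
        if (!NT_SUCCESS(status)) {
            WdfRequestComplete(Request, status);
        }
    }

    /* Called from wherever the data actually arrives (DPC, lower driver, ...). */
    VOID AudioDataArrived(PDEVICE_CONTEXT ctx, PVOID Data, size_t Bytes)
    {
        WDFREQUEST request;
        WDFMEMORY outMem;
        NTSTATUS status;

        if (!NT_SUCCESS(WdfIoQueueRetrieveNextRequest(ctx->PendingQueue, &request))) {
            return;                       /* nobody is waiting right now */
        }

        status = WdfRequestRetrieveOutputMemory(request, &outMem);
        if (NT_SUCCESS(status)) {
            status = WdfMemoryCopyFromBuffer(outMem, 0, Data, Bytes);
        }

        WdfRequestCompleteWithInformation(request, status,
                                          NT_SUCCESS(status) ? Bytes : 0);
    }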

    --
    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.


  • MBond Member - All Emails Posts: 846
    More elaborate? From what point of view?



    The system does all of the work for you. You don't need to worry about tracking when it is safe to use the buffer, or when it should be mapped / unmapped, or about handling all of those corner cases and race conditions. The engineers at Microsoft have done this for you, and what use would it be to re-implement their work, even assuming you could without spending years working on it?

    In general, IMHO, shared buffer schemes that are implemented correctly are generally not any more efficient than simply using IRPs, as the overhead of doing it right is exactly what the Microsoft engineers have coded; do you think they purposely made these calls slow? Having said that, the long-lived IRP can be an exception to that rule and is used safely in specific cases, but as others have said, this should not be your first design regardless of what performance you think you need.



  • anton_bassov Member MODERATED Posts: 5,190
    > The engineers at Microsoft have done this for you and what use would it be to re-implement
    > their work, even assuming you could without spending years working on it.


    As usually happens, anything that gets blown out of proportion starts looking and sounding ridiculous, and the above statement is no exception. Surely sharing a buffer gives you a few more things to worry about, and, as other posters have already pointed out, in most cases this extra pain is simply unnecessary. However, "spending years working on it" is just a gross exaggeration that sounds more like propaganda from our "Professor Joe Flounder".



    > In general, IMHO, shared buffer schemes that are implemented correctly are generally
    > not any more efficient than simply using IRPs as the overhead of doing it right is exactly
    > what the Microsoft engineers have coded

    Yet another piece of nonsense. It is a well-known fact that sharing memory between an app and a driver may offer a dramatic performance improvement compared to a file-I/O-style interface (including ioctl()). This is what mmap() was designed for.


    > do you think they purposely make these calls slow?

    Well, they just had no option other than to follow the instructions of someone who is known to regard UNIX (as well as the GUI) as a life-long foe. Do you really think they would have been allowed to implement a single, uniform interface for both disk-file and driver I/O operations under these circumstances? As a result, you now have to invent various mechanisms (like sharing events) to work around the shortcomings and limitations arising from the lack of an mmap() system call. As someone said, "Those who don't understand UNIX are doomed to re-invent it. Poorly"......



    Anton Bassov
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    In general, IMHO, shared buffer schemes that are implemented correctly are
    generally not any more efficient than simply using IRPs as the overhead of doing
    it right is exactly what the Microsoft engineers have coded
    </QUOTE>

    A shared buffer has nearly zero overhead compared to an IOCTL implementation; the only calls are a KeSetEvent from the DPC, which results in the scheduler being invoked, and a WaitForSingleObject in the application. The shared buffer is mapped in both system and process space and its physical pages are locked, so no page faults are generated. Because the buffer pages are locked, the data stays in the CPU cache (at least L2) when the user process is woken up on the same CPU. There is no cache thrashing, as there is with buffer copying, where the CPU cache has to evict twice as much data to accommodate the copy. The IOCTL implementation, in addition to the same KeSetEvent on IRP completion, requires entering the kernel for file I/O, which is not cheaper than entering the kernel to wait on an event (WaitForSingleObject), since it encompasses it. An IOCTL requires multiple memory allocations (IRP + buffer), memory releases, and either copying buffers between kernel and user space OR Memory Manager involvement to lock and map a user buffer and then unlock and unmap it. The IOCTL implementation executes a couple of orders of magnitude more code to move data from a DPC to a user application, on top of everything a shared-buffer implementation does. IOCTLs also put a burden on the CPU caches (both code and data), and on PTE and TLB management.
  • Don_Burn Member - All Emails Posts: 1,715
    >Yet another piece of nonsense. It is a well-known fact that sharing memory
    >between an app and a driver may offer a dramatic performance improvement
    >compared to a file-I/O-style interface (including ioctl()).

    Note the "MAY OFFER" in this statement; unfortunately, plenty of
    implementations don't offer any performance improvement. In fact, Anton,
    please give some references for your "well known fact". The few I know of
    have a lot of qualifiers, which makes that statement closer to urban legend
    than fact.

    >Well, they just had no option other than to follow the instructions of
    >someone who is known to regard UNIX (as well as the GUI) as a life-long foe

    A number of the senior folks who worked on the I/O subsystem over the years
    came from a UNIX background. For example, Nar Ganapathy, who was one of the
    architects of KMDF, was previously an architect at Sun. Anton, your
    conspiracy theories are becoming ridiculous.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Don_Burn Member - All Emails Posts: 1,715
    <QUOTE>
    A shared buffer has nearly zero overhead compared to an IOCTL implementation;
    the only calls are a KeSetEvent from the DPC, which results in the scheduler
    being invoked, and a WaitForSingleObject in the application.
    </QUOTE>

    Have you measured the cost of the KeSetEvent/WaitForSingleObject pair versus
    a DeviceIoControl for all situations and approaches? You may be surprised to
    find that the overhead is similar, and that with some of the optimizations
    provided to user-space applications for I/O calls, DeviceIoControl can
    actually be made faster.

    Of course this whole discussion is based on the assumption that the OP needs
    an extremely fast data path; in almost all cases, when you really pin people
    down, what they think is fast is actually pretty much average for a modern
    OS.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    Have you measured the cost of KeSetEvent/WaitForSingleObject pair versus a DeviceIoControl for all situations and approaches.
    </QUOTE>

    It is amazing that I should have to remind you that the DeviceIoControl solution includes exactly the same calls to KeSetEvent and WaitForSingleObject/KeWaitForSingleObject.

    <QUOTE>
    Of course this whole discussion is based on the assumption that the OP needs an extremely fast data path
    </QUOTE>

    Actually, the OP needs a real-time data path for audio processing. Contrary to common belief, RT is not about being fast but about being predictable, and nothing beats a shared buffer in that respect.
  • Don_Burn Member - All Emails Posts: 1,715
    <QUOTE>
    It is amazing that I should have to remind you that the DeviceIoControl
    solution includes exactly the same calls to KeSetEvent and
    WaitForSingleObject/KeWaitForSingleObject.
    </QUOTE>

    So you have read the Windows source and seen this? And of course calls like
    SetFileCompletionNotificationModes or the use of completion ports do not
    impact this mechanism at all? You obviously have limited knowledge of the
    capabilities of Windows in the I/O path.

    <QUOTE>
    Actually the OP needs a real time data path for audio processing. Contrary
    to common belief RT is not about fast but about predictable and nothing
    beats shared buffer in this.
    </QUOTE>

    Actually, shared buffers have little or nothing to do with predictability.
    This is a scheduling issue, which depends on a number of things, but the way
    that data is copied from kernel to user space is not one of them. Don't
    apply the biases of another OS to Windows; each OS is different, and
    assuming they will react the same without testing the hypothesis just shows
    ignorance.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    So you have read the Windows source and seen this?
    </QUOTE>

    Yes, I did.

    <QUOTE>
    You obviously have limited knowledge of the capabilities of Windows in the I/O path.
    </QUOTE>

    You are obviously trying to start personal attacks (not for the first time) to cover your limited knowledge of OS design in general. When you are going to wait for data, you have to wait. Waiting means either polling or releasing the CPU; the latter means a thread is removed from the run queue and inserted into some event's wait queue.

    <QUOTE>
    Actually, share buffers have little or nothing to do with predictability.
    </QUOTE>

    Actually it does. I am not going to educate you on RT design here.
  • Don_Burn Member - All Emails Posts: 1,715
    Sorry, you did not read the source well, or you would have seen that there
    are mechanisms for getting the event out of the I/O path, and other
    optimizations. In fact I have implemented a number of high-speed I/O models
    on Windows for clients, including shared memory models, and none of these
    uses a user-space event directly, since it was not efficient.

    As far as my OS knowledge goes, I've worked on OSes and system software for
    over 40 years. In that time I have been involved with 5 commercial OSes,
    four of which I was part of the original team for.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    Don, I am sorry for you if you believe that waiting for data that has not yet arrived is possible without some sort of blocking event mechanism at the core level, which always involves the scheduler. I do not consider polling with a blocked scheduler, as that is an extreme case.

    In Windows, the blocking event mechanism is implemented by objects containing a DISPATCHER_HEADER, so it doesn't matter whether you are talking about an event or something else like a process, thread, semaphore, mutex or timer. You are really talking about DISPATCHER_HEADER, and all blocking synchronization ends up manipulating a DISPATCHER_HEADER. I believe it is not a secret to you that KEVENT is just a wrapper for DISPATCHER_HEADER.

    <QUOTE>
    Sorry you did not read the source well,
    </QUOTE>

    You will always correct me if I am wrong.

    <QUOTE>
    In fact I have implemented a number of high speed I/O
    models on Windows for clients including shared memory models, and none of
    these use a user space event directly
    </QUOTE>

    A good man knows his limitations.
  • Daniel_Terhell Member Posts: 1,355
    <QUOTE>
    Actually, share buffers have little or nothing to do with predictability.
    </QUOTE>

    They do. Because the memory is locked down, it prevents hard page faults.
    DeviceIoControl calls incur user/kernel switches, which in turn might lead
    to the scheduler running another thread on the CPU. They may be good for
    general throughput but not when latency is a concern.

    //Daniel
  • Don_Burn Member - All Emails Posts: 1,715
    Shared buffers do not equal locked-down memory; they are independent things.
    I have seen implementations of shared buffers that used pageable memory, and
    I have seen IOCTL mechanisms where the user-space program, through one of
    several mechanisms, handed the kernel non-paged buffers.

    There are ways to limit the scheduler impact in the IOCTL path (FastIO,
    using SetFileCompletionNotificationModes and completion ports, etc.).

    Overall there are a lot of tools; assuming shared buffers are going to be
    the fastest is a poor design decision, and that is what led to this long
    thread.
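
    For reference, a user-mode sketch of the completion-port variant of the IOCTL path. The device name and the control code are placeholders (the real driver would define its own), and error handling is minimal; the point is only to show SetFileCompletionNotificationModes and overlapped DeviceIoControl used together.

    #define _WIN32_WINNT 0x0600
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    /* Hypothetical control code; the real driver defines its own. */
    #define IOCTL_AUDIO_GET_DATA \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

    int main(void)
    {
        HANDLE dev = CreateFileW(L"\\\\.\\MyVirtualAudio",       /* placeholder name */
                                 GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                 OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (dev == INVALID_HANDLE_VALUE) return 1;

        HANDLE iocp = CreateIoCompletionPort(dev, NULL, 0, 0);

        /* Skip the per-handle event signalling and skip queueing a completion
           packet when the IOCTL happens to complete synchronously. */
        SetFileCompletionNotificationModes(dev,
            FILE_SKIP_SET_EVENT_ON_HANDLE | FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);

        BYTE buffer[4096];
        OVERLAPPED ov = {0};
        DWORD bytes = 0;

        if (!DeviceIoControl(dev, IOCTL_AUDIO_GET_DATA, NULL, 0,
                             buffer, sizeof(buffer), &bytes, &ov)) {
            if (GetLastError() != ERROR_IO_PENDING) return 1;
            ULONG_PTR key;
            LPOVERLAPPED pov;
            GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE);
        }

        printf("received %lu bytes\n", (unsigned long)bytes);
        CloseHandle(iocp);
        CloseHandle(dev);
        return 0;
    }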


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    > Shared buffers do not equal locked down memory, they are independent items.
    </QUOTE>

    We are talking here about locked buffers in case you didn't notice the word DPC.

    <QUOTE>
    > I have seen implementations of share buffers that used pageable memory
    </QUOTE>

    We all have seen this. This is called the file system cache. Or do you want to tell us that you have seen a driver that mapped a user-space address range backed by a pagefile/file into system space without locking the pages? Think twice before answering.

    <QUOTE>
    > I have seen IOCTL mechanisms where the user space program through one of
    several mechanisms handed the kernel non-paged buffers.
    </QUOTE>

    We all have seen this. This is called buffered I/O. Or do you want to tell us that a driver used a locked user-space address range in an arbitrary context (we are talking about DPCs here)?

    <QUOTE>
    > and that what lead to this long thread.
    </QUOTE>

    Your personal attacks led to this long thread.
  • Tim_Roberts Member - All Emails Posts: 13,694
    [email protected] wrote:
    > Shared buffer has nearly zero overhead compared to IOCTL implementation, the only calls are a DPC call to KeSetEvent that results in a scheduler being called and WaitForSingleObject by an application.

    Yes, but how is that any different from a long-term IRP? You have the exact
    same transactions: set an event and wake the user-mode process.

    The shared buffer FEELS like it ought to be more efficient, but the overhead
    is essentially identical.

    --
    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.


  • anton_bassov Member MODERATED Posts: 5,190
    <quote>

    As far as my OS knowledge goes, I've worked on OSes and system software for over 40 years. In that time I have been involved with 5 commercial OSes, four of which I was part of the original team for.


    </quote>


    Whenever you see a statement like that from Mr. Burn, you can be at least 500% sure that his previous statements on the topic in question are, softly speaking, "not-so-sound" from the technical standpoint, which has been pointed out to him. Whenever he puts his foot in his mouth he starts referring to his 50+ years of experience (as well as throwing out personal attacks and ad hominem arguments). The best option in such a situation is simply to ignore that part and to ask him to back up his statements technically; at that point it becomes plainly obvious that he is just full of shit. We've been here on quite a few occasions, Don, don't you think? (If my statements need backing up, links to at least a dozen such threads on NTDEV are available upon request.) As we are going to see shortly, this particular thread is not an exception to the above-mentioned general rule.



    <quote>

    Shared buffers do not equal locked down memory, they are independent items. I have seen implementations of share buffers that used pageable memory, I have seen IOCTL mechanisms where the user space program through one of several mechanisms handed the kernel non-paged buffers.

    </quote>


    You must have implemented them yourself, then....

    Look - we are speaking about writing to a memory buffer in the context of a DPC. If a driver does not lock the MDL that describes the shared memory buffer in advance, its very first page fault is going to result in a BSOD. What are you arguing about here?

    In general, "I have seen XYZ" is a pretty weak argument in itself. For example, if you speak to any ambulance or emergency-department doctor you will hear plenty of stories about things
    that were _intentionally_ used in "very non-conventional" ways by patients - from peas in the nose to light bulbs up the arse. However, that does not necessarily imply that peas and bulbs are inherently dangerous items per se, don't you think? The situation here is exactly the same - if XYZ has been misused by someone, it does not necessarily mean that XYZ is bad.

    <quote>

    ... the "MAY OFFER" in this statement, unfortunately plenty of implementations don't offer performance improvements. In fact Anton, please give some references to your "well known fact", the few I know of have a lot of qualifiers, that make that statement closer to urban legend than fact.

    </quote>


    Well, your statement may make sense only if we assume that it was you who provided these implementations, effectively washing away all the potential performance benefits of shared memory. If you need a more precise explanation or proof of the above-mentioned benefits, I would suggest carefully reading post #16 in this thread - I simply don't want to replicate work that has already been done here.



    <quote>

    A number of the senior folks who worked on the I/O subsystem over the years came from a UNIX background. For example, Nar Ganapathy, who was one of the architects of KMDF, was previously an architect at Sun.


    </quote>


    The "only" problem here is that the whole NT IO subsystem was _architecturally_ defined
    in the very first version of NT, i.e. more than a decade before these folks had turned up, and had not changed since. The only thing that these folks could do was improving certain implementation details, which, however, cannot change the original design choices. For example, AFAIK, FastIO was introduced exactly for the purpose of fixing the deficiencies and shortcoming of IRP-based IO model that became obvious at the very first stages of NT's life (IIRC, it was mentioned in Rajeev Nagar's book).


    > Anton your conspiracy theories are becoming ridiculous.

    Sorry, Don, but the only ridiculous thing in this thread so far is your participation in it - unfortunately, practically all the statements you have made here are nonsensical. You seem to be arguing for the very sake of arguing about something....



    Anton Bassov
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    Yes, but how is that any different than a long-term IRP? You have the
    exact same transactions: set an event and wake the user-mode process.

    The shared buffer FEELS like it ought to be more efficient, but the
    overhead is essentially identical
    </QUOTE>

    The overhead is much lower if you consider the CPU caches and the PTE and TLB management done by the CPU and the system. I already outlined this above.

    A shared buffer is much better when a burst happens: consecutive data packets can be processed without entering kernel mode at all.

    If you consider the latency of a relatively slow communication channel, it can be the same simply because most of the time is spent waiting, so the cost is amortized by the long wait for data. But the overhead is never the same - the number of CPU instructions executed is orders of magnitude larger in the IRP case. This has its toll on the CPU caches, the TLB, and the branch prediction tables in the CPU.
  • Tim_Roberts Member - All Emails Posts: 13,694
    [email protected] wrote:
    > If you consider a latency for a relatively slow communication it can be the same just because most of the time is spent in waiting. So it is just being amortized by long waiting for data. But the overhead is never the same - the number of CPU instructions being executed is orders of magnitude more for IRP case.

    All of which are MANY orders of magnitude less than any I/O operation
    that was involved. The difference is insignificant.


    > This has its toll on CPU caches, TLB and branch prediction table in CPU.

    Insignificant. CPU cycles are not the only metric of efficiency in the
    world today. If it costs me additional design, programming, debugging,
    and maintenance time to get a result that micro-optimizes the branch
    prediction tables, then my work was a fool's errand.

    --
    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.


  • Slava_Imameev Member Posts: 480
    <QUOTE>
    All of which are MANY orders of magnitude less than any I/O operation that was involved.
    </QUOTE>

    Not so. If this were considered a generally acceptable idea, Windows NT would have been implemented as the microkernel it was originally conceived as.


    <QUOTE>
    Insignificant. CPU cycles are not the only metric of efficiency in the world today.
    </QUOTE>

    When we are talking about code-execution overhead, this is one of the most important metrics. If it were not significant, we would not have superscalar CPUs with more than 100 instructions in flight, huge register files, huge reorder buffers, out-of-order execution, branch prediction tables, and huge caches.

    You can devise other metrics depending on the goal, for example man-hours spent on design and development, or the number of pizzas eaten by the team.

    <QUOTE>
    If it costs me additional design, programming, debugging,
    and maintenance time to get a result that micro-optimizes the branch
    prediction tables, then my work was a fool's errand.
    </QUOTE>

    Actually, it doesn't cost much more than an IOCTL implementation. All you need is to implement a single-reader/single-writer lock-free circular buffer over an allocated and locked user buffer. Such buffer implementations are available free of charge from many sources.
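
    For the sake of completeness, a minimal sketch of such a ring, written from the user-mode side (the driver would use KeMemoryBarrier() instead of MemoryBarrier()). The structure layout, the field names, and the idea of overlaying the header on the start of the locked shared buffer are illustrative choices, not taken from any particular implementation; wrap-around is handled bytewise for clarity rather than speed.

    #include <windows.h>

    /* Overlaid on the start of the shared, locked buffer; Data[] is the rest. */
    typedef struct _AUDIO_RING {
        volatile ULONG Head;      /* next byte the consumer will read  */
        volatile ULONG Tail;      /* next byte the producer will write */
        ULONG          Capacity;  /* number of bytes in Data[]         */
        UCHAR          Data[1];   /* audio payload follows the header  */
    } AUDIO_RING;

    /* Producer side (the driver's DPC): returns the number of bytes stored. */
    ULONG RingWrite(AUDIO_RING *r, const UCHAR *src, ULONG len)
    {
        ULONG head = r->Head, tail = r->Tail;
        ULONG used = (tail - head + r->Capacity) % r->Capacity;
        ULONG space = r->Capacity - 1 - used;        /* one slot kept empty */
        ULONG todo = (len < space) ? len : space;
        ULONG i;

        for (i = 0; i < todo; i++)
            r->Data[(tail + i) % r->Capacity] = src[i];

        MemoryBarrier();                             /* publish data before Tail */
        r->Tail = (tail + todo) % r->Capacity;
        return todo;
    }

    /* Consumer side (the application): drain whatever is available. */
    ULONG RingRead(AUDIO_RING *r, UCHAR *dst, ULONG max)
    {
        ULONG head = r->Head, tail = r->Tail;
        ULONG avail = (tail - head + r->Capacity) % r->Capacity;
        ULONG todo = (avail < max) ? avail : max;
        ULONG i;

        for (i = 0; i < todo; i++)
            dst[i] = r->Data[(head + i) % r->Capacity];

        MemoryBarrier();                             /* consume before moving Head */
        r->Head = (head + todo) % r->Capacity;
        return todo;
    }

    /* Typical consumer loop: wait on the driver's event, then drain the ring:
     *     WaitForSingleObject(hDataEvent, INFINITE);
     *     while (RingRead(ring, chunk, sizeof(chunk)) != 0)
     *         ... hand the chunk to the network sender ...
     */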