NetBufferList chain using same MDL

Menachem_Shapira Member - All Emails Posts: 34
Hi,

We have a Virtual NDIS Miniport driver that receives a chain of L2 frames
from a user-mode application.
The application passes the head of the chain (a constant 65K buffer) to the
driver's IRP_MJ_WRITE routine (we're using NDIS_WDM mode).

The driver in turn builds a list of NBLs, each NBL describing a certain
frame within the large buffer. In practice, we don't use more than 10
frames for each indication.
In order to conserve memory mappings, we decided to use the same MDL
created by the I/O manager for the IRP, and have each NB point to a
different chunk of the 65K buffer.
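
To illustrate, a rough sketch of the indication loop (names like Adapter, NblPool, and Frames[] are placeholders for this post, not our actual code):

typedef struct _FRAME_DESC {
    ULONG Offset;    // where this L2 frame starts within the 65K buffer
    ULONG Length;    // length of this L2 frame
} FRAME_DESC;

VOID
IndicateSharedMdlFrames(
    NDIS_HANDLE Adapter,      // miniport adapter handle
    NDIS_HANDLE NblPool,      // pool from NdisAllocateNetBufferListPool
    PIRP Irp,                 // the IRP_MJ_WRITE carrying the 65K buffer
    FRAME_DESC *Frames,       // per-frame offsets/lengths parsed from the buffer
    ULONG FrameCount)
{
    PMDL writeMdl = Irp->MdlAddress;   // single MDL describing the whole buffer

    for (ULONG i = 0; i < FrameCount; i++) {
        // Every NET_BUFFER shares the same MDL; only DataOffset/DataLength differ.
        PNET_BUFFER_LIST nbl = NdisAllocateNetBufferAndNetBufferList(
            NblPool, 0, 0,
            writeMdl, Frames[i].Offset, Frames[i].Length);
        if (nbl == NULL) {
            break;                     // handle allocation failure appropriately
        }
        NdisMIndicateReceiveNetBufferLists(Adapter, nbl,
                                           NDIS_DEFAULT_PORT_NUMBER, 1, 0);
    }
}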


1) Is sharing the MDL permissible in such a configuration?
2) This solution seems to work in practice, and we achieve nice throughput.
However, when we attach an upper filter driver (just running Wireshark, really,
which attaches WinPcap), we get a bugcheck in tcpip.sys (see below). It looks
like some bad pointer. Additionally, we see that our MDL has been meddled
with. And here I was thinking that the MDL should describe a constant block
of memory, and other drivers should "look, but not touch".
3) If my approach is not acceptable, what better approach can be suggested?

Summary of the bugcheck:

1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                            Bugcheck Analysis                                *
*                                                                             *
*******************************************************************************

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
...
Arguments:
Arg1: 00000006, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: 89ac75aa, address which referenced memory
...
STACK_TEXT:
...
92438360 89ac75aa 0000000a 00000006 00000002 nt!KiTrap0E+0x1ca
92438498 89ac7339 8e5c60e4 924384e4 00000000 tcpip!Ipv4pValidateNetBuffer+0x2a
92438514 89acaa84 00000000 00000000 00000002 tcpip!IppFlcReceivePacketsCore+0x259
92438584 89acade8 89ae5500 91553000 b9142395 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x214
92438640 81ea3e73 924386b4 70147e0b 00000001 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd8
92438678 81ea3dbe bd6422c0 00000002 01000000 nt!KeExpandKernelStackAndCalloutInternal+0xa3
9243868c 89ae5988 89acad10 924386b4 00002400 nt!KeExpandKernelStackAndCalloutEx+0x1e
924386d4 8963769c 00628650 beff2e01 00000000 tcpip!FlReceiveNetBufferListChain+0x98
9243874c 896373b9 00000000 00000002 00000000 ndis!ndisMIndicateNetBufferListsToOpen+0x26c
92438798 8964dfdc 8e6370e8 beff2ef0 00000000 ndis!ndisMTopReceiveNetBufferLists+0x209
924387fc 8965ba91 00000002 00000000 8e6370e8 ndis!ndisInvokeNextReceiveHandler+0x25
92438844 89ce1159 cbd94de8 beff2ef0 00000000 ndis!NdisFIndicateReceiveNetBufferLists+0x1eb81
924388f4 89638c97 97066a88 beff2ef0 00000000 wfplwfs!LwfLowerRecvNetBufferLists+0x149
92438948 8964c2a9 cbd94de8 89ce1010 97066a88 ndis!ndisCallReceiveHandler+0x287
9243896c 81ea3e73 92438a74 701471d7 8e6370e8 ndis!ndisDataPathExpandStackCallback+0x27
924389a4 81ea3dbe bd6422c0 00000000 00000000 nt!KeExpandKernelStackAndCalloutInternal+0xa3
924389b8 89642605 8964c282 92438a74 00002666 nt!KeExpandKernelStackAndCalloutEx+0x1e
924389d0 896595dd d1768f20 8fb110e8 8fb11030 ndis!ndisExpandStack+0x11
92438ab4 90397e22 8e6370e8 beff2ef0 00000000 ndis!NdisMIndicateReceiveNetBufferLists+0x2171d
92438b10 9039307d d1768f20 00000000 92438b30 mydriver!Receive_DeviceWrite+0x562 [myfile.c @ 305]
92438b38 8969d1e4 8fb11030 d1768f20 d1768f20 mydriver!VpnDeviceWrite+0x1d [myfile.c @ 558]

Comments

  • Jan_Bottorff Member - All Emails Posts: 472
    Sounds like a wireshark bug. Seems like there was a conversation on this list in the last year about wireshark causing some perhaps similar crash. I’d suggest you search this list for wireshark, and see if anything sounds similar.

    Did you Google the issue of sharing MDLs? I instantly found (by Googling "ndis partial mdl") http://www.osronline.com/showThread.cfm?link=147694, a thread from 2009 on the Windows Xen driver. It pointed out that NdisCopyBuffer is essentially IoAllocateMdl + IoBuildPartialMdl.

    Typically, there will be one MDL per receive NB, as if it were a real receive buffer. NDIS did have data/body split for a while, which put the headers and body in different MDLs and different buffers. On the receive path, there is a requirement that the NDIS-specified lookahead size bytes be in the first MDL of the receive NB. This is so upper layers can parse the packet headers without having to deal with disjoint chunks in multiple MDLs. NDIS is free to set the lookahead size for a NIC miniport to anything up to the MTU, although I don't believe it typically does. I've recently seen it set to a value around 550 bytes (I don't remember exactly), but I do remember rounding what NDIS was setting up to the next cache-line multiple, 576.

    I know there is a bug in the WHQL test that fails if the NB offset is not zero on receive packets, and if you are sharing an MDL across many packets, the NB offset will necessarily be non-zero on some packets. I also know that on the send side into a NIC miniport, the WHQL tests share memory buffers (and maybe MDLs) between different packets. On the send side, buffers and MDLs need to be treated as read-only (also note that the bit in the NB that marks a packet as read-only may not be set). The only workaround I could find for this WHQL bug was to adjust the MDL to always use an NB offset of zero. If you share MDLs, that won't be possible.

    It offhand seems like there may be an issue of concurrent updating of a shared MDL. The MDL you get from the write may not have a kernel address mapping, and upper-level protocols will need to access a virtual address of the buffer. If, for example, you indicate some NBs and some get processed on a different core in a worker thread, you may then have multiple cores needing to create a kernel address mapping; if those happen at the same moment, you have two cores writing to the MDL with no locking. You might mitigate this by assuring the MDL is mapped before indicating it. A second issue is that filter drivers might want to insert/remove bytes in the packet, and they sometimes do so by manipulating the MDL chain. Filters are supposed to restore an NB back to its original state before returning a packet, but it offhand seems like while they own the packet they are free to manipulate it however they want, which might include manipulating the MDL. For the typical case of received packets with non-shared MDLs, this will not cause issues, but in your case it will. I'd say the read-only flag could be set in the NB if it's read-only, but my experience has also been that the read-only bit was not set on definitely read-only packets.

    I can’t absolutely say it’s illegal to share MDLs across receive packets, but it does tread on a path that is uncommon, and anytime you go off the path you might find a new bug, in OS or other ISV code.

    If it were me, I think I’d preallocate MDL memory (largest size needed, allocated perhaps in the NBL context), and fill it in with IoBuildPartialMdl, which should not require any allocations or page table manipulation, just some copying/adjustment of the source MDL into the destination MDL. Each indicated NB could then have a unique MDL, which gets reused when the NBL/NB comes back. So your conversion from the write to the NBL indicate goes something like:

    MmGetSystemAddressForMdlSafe on the write mdl to assure it has a mapping, so the partial mappings will use the same mapping
    While (there are more packets to indicate in this write) {
    Pull an NBL/NB with embedded MDL from the free pool (if pool is empty, do something appropriate)
    Increment a reference count to the write
    Use IoBuildPartialMdl to create a slice of the write buffer
    Set appropriate NBL and NB fields for this packet
    Put this NBL at the end of a chain to indicate
    }

    Indicate chain of NBLs

    When you get the NBLs back, you just need to put it back on the free list, and decrement the reference count to the original write. When the reference count goes to zero, you can complete the write.
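
    Roughly, in C (WRITE_CONTEXT, NBL_CONTEXT, PullFreeNbl, GetNextFrame, and g_Adapter are just placeholders for this sketch, and error handling is abbreviated):

    typedef struct _WRITE_CONTEXT {
        PIRP          Irp;        // the IRP_MJ_WRITE being carved into frames
        volatile LONG RefCount;   // one reference per outstanding NBL
    } WRITE_CONTEXT, *PWRITE_CONTEXT;

    typedef struct _NBL_CONTEXT {
        PMDL           PartialMdl; // preallocated via IoAllocateMdl for the largest frame
        PWRITE_CONTEXT Write;      // back-pointer used when the NBL is returned
    } NBL_CONTEXT, *PNBL_CONTEXT;

    NDIS_HANDLE      g_Adapter;                                            // miniport adapter handle (illustrative)
    PNET_BUFFER_LIST PullFreeNbl(VOID);                                    // hypothetical: pop an NBL/NB with embedded MDL from the free pool
    BOOLEAN          GetNextFrame(PWRITE_CONTEXT W, PULONG Off, PULONG Len); // hypothetical: parse the next frame out of the 65K buffer

    VOID IndicateFramesFromWrite(PWRITE_CONTEXT Write)
    {
        PMDL writeMdl = Write->Irp->MdlAddress;
        PNET_BUFFER_LIST head = NULL, *tail = &head;
        ULONG count = 0, offset, length;

        // Make sure the write MDL has a system mapping so every partial MDL
        // built from it shares that one mapping.
        if (MmGetSystemAddressForMdlSafe(writeMdl, NormalPagePriority) == NULL) {
            return;  // fail the write appropriately in real code
        }
        PUCHAR writeVa = (PUCHAR)MmGetMdlVirtualAddress(writeMdl);

        while (GetNextFrame(Write, &offset, &length)) {
            PNET_BUFFER_LIST nbl = PullFreeNbl();
            if (nbl == NULL) {
                break;                                       // pool empty: queue or drop
            }
            PNBL_CONTEXT ctx = (PNBL_CONTEXT)NET_BUFFER_LIST_CONTEXT_DATA_START(nbl);

            InterlockedIncrement(&Write->RefCount);          // one reference per indicated NBL
            ctx->Write = Write;

            // Describe this frame's slice of the write buffer with its own MDL.
            IoBuildPartialMdl(writeMdl, ctx->PartialMdl, writeVa + offset, length);

            PNET_BUFFER nb = NET_BUFFER_LIST_FIRST_NB(nbl);
            NET_BUFFER_FIRST_MDL(nb)          = ctx->PartialMdl;
            NET_BUFFER_CURRENT_MDL(nb)        = ctx->PartialMdl;
            NET_BUFFER_DATA_OFFSET(nb)        = 0;
            NET_BUFFER_CURRENT_MDL_OFFSET(nb) = 0;
            NET_BUFFER_DATA_LENGTH(nb)        = length;

            *tail = nbl;                                     // append to the chain to indicate
            tail  = &NET_BUFFER_LIST_NEXT_NBL(nbl);
            count++;
        }

        if (head != NULL) {
            NdisMIndicateReceiveNetBufferLists(g_Adapter, head,
                                               NDIS_DEFAULT_PORT_NUMBER, count, 0);
        }
    }

    The matching return handler then just puts each NBL back on the free list and drops the write's reference count, completing the IRP when it reaches zero, exactly as described above.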

    Jan


  • Menachem_Shapira Member - All Emails Posts: 34
    Thanks for the write up.
    Sounds simple enough now...

    Sent from mobile

  • zx2c4 Member Posts: 6

    @Jan_Bottorff said:
    For the typical case of received packets with non-shared MDLs, this will not cause issues, but in your case, it will. I’d say the read-only flag in the NB could be set in the NB if it’s read-only, but my experience has also been the read-only bit was not set on definitely read-only packets.

    Are you referring to NDIS_NBL_FLAGS_RECV_READ_ONLY? If so, I've found that calling NdisMIndicateReceiveNetBufferLists on an NBL with that flag set will still sometimes result in writes to the NB's MDL somewhere in the TCP stack. Any insights as to why this still might be happening?

  • Jeffrey_Tippet_[MSFT] Member - All Emails Posts: 577

    The concepts of NDIS_NBL_FLAGS_RECV_READ_ONLY and NDIS_NBL_FLAGS_SEND_READ_ONLY are essentially dead (just like this thread was). In fact, I didn't even know the flags existed, and I'm Microsoft's architect for NDIS. There are a variety of places in the OS that will ignore these flags when writing to MDLs. The reality is that the OS considers the MDL and packet payload to always be writable.

  • zx2c4 Member Posts: 6
    edited June 2019

    @Jeffrey_Tippet_[MSFT]

    Ah, that's a rough situation. It's surprising to me that you say NDIS considers MDLs always writable. On Linux, it's usually necessary to call a particular function before modifying an skb, which will optionally do a copy-on-write, because the skb's page fragments might be shared between several skbs or belong to some physical device under various conditions. I assume NDIS would have something similar, to at least deal with GRO and whatnot.

    Right now in Wintun I've got an IRP_MJ_WRITE function on a DO_DIRECT_IO device object that essentially does:

    // Pseudocode: GetHeaders walks the frames packed into the write buffer;
    // the adapter handle, port number, NBL count, and receive flags are elided.
    for (offset, size in GetHeaders(Irp)) {
        NdisMIndicateReceiveNetBufferLists(NdisAllocateNetBufferAndNetBufferList(ctx->NBLPool, 0, 0, Irp->MdlAddress, offset, size));
    }
    

    I can map that Mdl as writable -- without MdlMappingNoWrite -- which makes things "work". But of course that allows userspace to scribble over read-only files when the TCP stack fixes up checksums and such.

    The bigger picture is that NtWriteFile uses MmProbeAndLockPages(IoReadAccess) and ProbeForRead(), but it of course doesn't use IoWriteAccess/ProbeForWrite, since that's not what NtWriteFile is intended for, taking only const buffers. This poses a problem.

    It seems like my options are:

    • Use neither direct nor buffered I/O, and do the ProbeAndLock dance myself with IoWriteAccess and ProbeForWrite, to just return an error to userspace if they are actually passing a readonly page.
    • Introduce a dreaded allocation and memcpy.
    • Restrict the file object to SYSTEM, live with the security implications, and call it a day.

    None of those options sound remotely pleasant to me. Is there a good alternative?

  • Jeffrey_Tippet_[MSFT] Member - All Emails Posts: 577

    This is, I think, the same problem that winsock solves: how to efficiently get a buffer from usermode. It's a bit more nuanced than "memcpy == bad". It turns out that it's cheaper to copy small buffers than it is to update the page tables. So winsock will just memcpy for buffers that are smaller than a particular threshold. Its threshold is on the order of 1500 bytes, but you should benchmark your code to see where the break-even threshold is.

    Allocations from the general pool are bad, but you can get a fast allocation from a lookaside list, if you don't mind wasting a bit of memory. So, for example, you could create lookaside lists for Big, Medium, and Small buffers, and allocate a buffer that best fits the current I/O.
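
    To sketch that idea (the class sizes, tags, and names below are purely illustrative, not anything prescribed here):

    static NPAGED_LOOKASIDE_LIST g_SmallPool;   // e.g. buffers up to 256 bytes
    static NPAGED_LOOKASIDE_LIST g_MediumPool;  // e.g. buffers up to 2 KB
    static NPAGED_LOOKASIDE_LIST g_LargePool;   // e.g. buffers up to 64 KB

    VOID InitBufferPools(VOID)
    {
        ExInitializeNPagedLookasideList(&g_SmallPool,  NULL, NULL, 0, 256,       'fBuS', 0);
        ExInitializeNPagedLookasideList(&g_MediumPool, NULL, NULL, 0, 2048,      'fBuM', 0);
        ExInitializeNPagedLookasideList(&g_LargePool,  NULL, NULL, 0, 64 * 1024, 'fBuL', 0);
    }

    PVOID AllocBestFit(SIZE_T bytes)
    {
        // Pick the smallest size class that fits; anything bigger falls back
        // to the caller's slow path (or a plain pool allocation).
        if (bytes <= 256)       return ExAllocateFromNPagedLookasideList(&g_SmallPool);
        if (bytes <= 2048)      return ExAllocateFromNPagedLookasideList(&g_MediumPool);
        if (bytes <= 64 * 1024) return ExAllocateFromNPagedLookasideList(&g_LargePool);
        return NULL;
    }

    (The caller also has to remember which list a buffer came from so it can be returned with ExFreeToNPagedLookasideList.)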

    If you have the luxury of changing your userspace API, I'd heartily suggest making an API where userspace pre-registers the buffers, then just reuses them over and over again. This is what RIO does, and I understand Linux is going in that direction too. If you can do that, then you only pay the expensive cost of probing & locking the buffers once, up-front.

    Beyond the above discussion, you should probably use METHOD_NEITHER and manually probe for read+write and lock it in kernel VA. It's not completely clear to me why this is unappealing; it's not that slow (winsock does it for larger buffers).
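
    A rough sketch of that probe-and-lock path (names and error handling are illustrative; the buffer pointer and length come from wherever your METHOD_NEITHER request surfaces them):

    NTSTATUS ProbeAndLockUserBuffer(
        PVOID UserBuffer, SIZE_T Length, PMDL *MdlOut, PVOID *SystemVa)
    {
        PMDL mdl = IoAllocateMdl(UserBuffer, (ULONG)Length, FALSE, FALSE, NULL);
        if (mdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        __try {
            // Raises an exception if the region isn't writable user memory.
            ProbeForWrite(UserBuffer, Length, 1);
            // Lock for read+write so the TCP/IP stack may legitimately
            // scribble on the payload (checksum fixups and the like).
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }
        *SystemVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
        if (*SystemVa == NULL) {
            MmUnlockPages(mdl);
            IoFreeMdl(mdl);
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        *MdlOut = mdl;
        return STATUS_SUCCESS;
    }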

  • zx2c4 Member Posts: 6

    @Jeffrey_Tippet_[MSFT] said:
    It turns out that it's cheaper to copy small buffers than it is to update the page tables.

    Indeed we see this in the Linux stack too: page table manipulations aren't cheap.

    If you have the luxury of changing your userspace API, I'd heartily suggest making an API where userspace pre-registers the buffers, then just reuses them over and over again.

    That's not a bad idea. The poor man's way of doing this is to just map the buffer on the first Read/WriteFile, and then enforce that the virtual address is the same for subsequent calls (even though userspace could obviously remap that address - mostly a sanity check, I guess). A better way might be ioctls or some other syscall to send commands to the device.
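
    To make that concrete (purely illustrative, not what Wintun actually does), the check on later writes could be as simple as:

    typedef struct _BUFFER_REGISTRATION {
        PVOID  UserVa;   // VA captured (and probed/locked) on the first write
        SIZE_T Length;   // length captured on the first write
        PMDL   Mdl;      // locked once, reused for every later write
    } BUFFER_REGISTRATION, *PBUFFER_REGISTRATION;

    NTSTATUS ValidateOrRegister(PBUFFER_REGISTRATION Reg, PVOID UserVa, SIZE_T Length)
    {
        if (Reg->Mdl == NULL) {
            // First call: the caller probes/locks the buffer and fills in Reg->Mdl.
            Reg->UserVa = UserVa;
            Reg->Length = Length;
            return STATUS_SUCCESS;
        }
        // Later calls must present exactly the same registration.
        return (UserVa == Reg->UserVa && Length == Reg->Length)
                   ? STATUS_SUCCESS
                   : STATUS_INVALID_PARAMETER;
    }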

    This is what RIO does, and I understand Linux is going in that direction too. If you can do that, then you only pay the expensive cost of probing & locking the buffers once, up-front.

    Yea, actually the Linux direction is even sexier: you can avoid syscalls entirely during heavy load by registering a ring buffer. On write, userspace atomically increments a counter in the ring header, and when the kernel has completed its first read of the ring it doesn't return to userspace if the counter has been incremented again in the meantime. On read, it's the reverse, with the kernel incrementing the counter and userspace not making synchronous syscalls unless there's nothing to be read, in which case it blocks.

    Beyond the above discussion, you should probably use METHOD_NEITHER and manually probe for read+write and lock it in kernel VA. It's not completely clear to me why this is unappealing; it's not that slow (winsock does it for larger buffers).

    Yea, fair enough. Doing the mapping once, up front, makes this more viable too.

    Thanks for the suggestions.


    On the actual original topic of this thread, in case folks revisit in the archives, I've seen basically no problems (yet?) with MDL sharing when doing:

    ULONG offset = 0;
    // Pseudocode: every NB shares the one write MDL, differing only in offset/length;
    // the adapter handle, port number, NBL count, and receive flags are elided.
    for (size in stuffToReceive(mdl)) {
        NdisMIndicateReceiveNetBufferLists(NdisAllocateNetBufferAndNetBufferList(pool, 0, 0, mdl, offset, size));
        offset += size;
    }
    
  • Peter_Viscarola_(OSR) Administrator Posts: 8,493

    It's a bit more nuanced than "memcpy == bad".

    Yes, indeed. People are too quick to say "Buffered I/O BAD... Direct I/O GOOD!"

    As Mr. Roberts has noted here before, processors these days are very, very, good at doing block move operations. It's something that the CPU architects actually focus on.

    It turns out that it's cheaper to copy small buffers than it is to update the page tables.

    And, while I'm very certain Mr. Tippet knows this, I'll just add for others playing along at home that accounting for the overhead of Direct I/O requires an extremely difficult calculus. It's not so much the cost of allocating and filling PDEs/PTEs when the Direct I/O mapping is created... it's the cost of the "TLB shootdown" (the cross-processor TLB invalidation that needs to be done) as a result of unmapping the region when the I/O is completed.

    Back in the day, we used to recommend that people start considering Direct I/O for read/write operations over 4K. In the last few years, we've changed to recommending people at least consider Direct I/O all the way up to 64K read/write operations.

    In many cases, as well, it comes down to what you're trying to optimize: CPU utilization, elapsed time, or something else (like network bandwidth utilization). As a former business partner of mine used to repeat regularly "anything regarding performance analysis you've determined by intuition is likely to be wrong, because there's almost nothing intuitive about performance."

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • anton_bassov Member MODERATED Posts: 5,258

    Indeed we see this in the Linux stack too: page table manipulations aren't cheap.

    Well, Linux does not really have to do that many page table manipulations as far as KM address space is concerned, does it? After all, use of the vmalloc() area (i.e. the one where the 'virtual_address == physical_address + fixed_offset' equation does not hold true) by KM code is generally frowned upon.....

    Anton Bassov

  • mengxp Member Posts: 1

    @Jeffrey_Tippet_[MSFT] said:
    The concepts of NDIS_NBL_FLAGS_RECV_READ_ONLY and NDIS_NBL_FLAGS_SEND_READ_ONLY are essentially dead (just like this thread was). In fact, I didn't even know the flags existed, and I'm Microsoft's architect for NDIS. There are a variety of places in the OS that will ignore these flags when writing to MDLs. The reality is that the OS considers the MDL and packet payload to always be writable.

    Thanks to IIS, HTTP.sys will post read-only memory to NDIS in the case of an HTTP error.

    refer:
    https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/a63fef1c-c7e6-451d-adc5-16de1e8c0b62/ndis-51-intermediate-driver-for-packet-encryptiondecryption-not-working-with-httpsys?forum=wdk

  • Peter_Viscarola_(OSR) Administrator Posts: 8,493

    And thanks to you, Mr. @mengxp for updating a two year old, dead, thread. Awesome example of necroposting.

    Peter Viscarola
    OSR
    @OSRDrivers

  • MBond2 Member Posts: 331

    As you have not yet locked this old thread, allow me to add an observation about the security implications of these techniques: they all allow UM processes to lock up physical memory in a way that could be exploited. That's why the Lock Pages in Memory privilege was created - and it should be checked by anyone rolling their own scheme like this.
