NetBufferList chain using same MDL

Hi,

We have a Virtual NDIS Miniport driver that receives a chain of L2 frames
from a user-mode application.
The application passes the head of the chain (a constant 65K buffer) to the
driver's IRP_MJ_WRITE routine (we're using NDIS_WDM mode…)

The driver in turn builds a list of NBLs, each NBL describing a certain
frame from the large buffer. In practice, we don't use more than 10
frames for each indication.
In order to conserve memory mappings, we decided to use the same MDL
created by the I/O manager for the IRP, and have each NB point to a
different chunk of the 65K buffer.

  1. Is sharing an MDL permissible in such a configuration?
  2. This solution seems to work in practice; we achieve nice throughput.
    However, when we attach an upper filter driver (just running Wireshark, really,
    which attaches WinPcap), we get a bugcheck in tcpip.sys (see below). It looks
    like some bad pointer. Additionally, we see that our MDL has been meddled
    with. And here I was thinking that the MDL should describe a constant block
    of memory, and that other drivers should “look, but not touch”.
  3. If my approach is not acceptable, what better approach can be suggested?

Summary of the bugcheck:

1: kd> !analyze -v

*******************************************************************************
*                                                                             *
*                             Bugcheck Analysis                               *
*                                                                             *
*******************************************************************************

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

Arguments:

Arg1: 00000006, memory referenced

Arg2: 00000002, IRQL

Arg3: 00000000, value 0 = read operation, 1 = write operation

Arg4: 89ac75aa, address which referenced memory

STACK_TEXT:

92438360 89ac75aa 0000000a 00000006 00000002 nt!KiTrap0E+0x1ca
92438498 89ac7339 8e5c60e4 924384e4 00000000 tcpip!Ipv4pValidateNetBuffer+0x2a
92438514 89acaa84 00000000 00000000 00000002 tcpip!IppFlcReceivePacketsCore+0x259
92438584 89acade8 89ae5500 91553000 b9142395 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x214
92438640 81ea3e73 924386b4 70147e0b 00000001 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd8
92438678 81ea3dbe bd6422c0 00000002 01000000 nt!KeExpandKernelStackAndCalloutInternal+0xa3
9243868c 89ae5988 89acad10 924386b4 00002400 nt!KeExpandKernelStackAndCalloutEx+0x1e
924386d4 8963769c 00628650 beff2e01 00000000 tcpip!FlReceiveNetBufferListChain+0x98
9243874c 896373b9 00000000 00000002 00000000 ndis!ndisMIndicateNetBufferListsToOpen+0x26c
92438798 8964dfdc 8e6370e8 beff2ef0 00000000 ndis!ndisMTopReceiveNetBufferLists+0x209
924387fc 8965ba91 00000002 00000000 8e6370e8 ndis!ndisInvokeNextReceiveHandler+0x25
92438844 89ce1159 cbd94de8 beff2ef0 00000000 ndis!NdisFIndicateReceiveNetBufferLists+0x1eb81
924388f4 89638c97 97066a88 beff2ef0 00000000 wfplwfs!LwfLowerRecvNetBufferLists+0x149
92438948 8964c2a9 cbd94de8 89ce1010 97066a88 ndis!ndisCallReceiveHandler+0x287
9243896c 81ea3e73 92438a74 701471d7 8e6370e8 ndis!ndisDataPathExpandStackCallback+0x27
924389a4 81ea3dbe bd6422c0 00000000 00000000 nt!KeExpandKernelStackAndCalloutInternal+0xa3
924389b8 89642605 8964c282 92438a74 00002666 nt!KeExpandKernelStackAndCalloutEx+0x1e
924389d0 896595dd d1768f20 8fb110e8 8fb11030 ndis!ndisExpandStack+0x11
92438ab4 90397e22 8e6370e8 beff2ef0 00000000 ndis!NdisMIndicateReceiveNetBufferLists+0x2171d
92438b10 9039307d d1768f20 00000000 92438b30 mydriver!Receive_DeviceWrite+0x562 [myfile.c @ 305]
92438b38 8969d1e4 8fb11030 d1768f20 d1768f20 mydriver!VpnDeviceWrite+0x1d [myfile.c @ 558]

Sounds like a Wireshark bug. There was a conversation on this list in the last year about Wireshark causing a perhaps-similar crash. I’d suggest you search this list for Wireshark and see if anything sounds similar.

Did you Google the issue of sharing MDLs? Searching for “ndis partial mdl” I instantly found http://www.osronline.com/showThread.cfm?link=147694, a 2009 thread about the Windows Xen driver. It pointed out that NdisCopyBuffer is essentially IoAllocateMdl + IoBuildPartialMdl.

Typically, there will be one MDL per receive NB, as if it were a real receive buffer. (NDIS did have header/data split for a while, which put the headers and the body in different MDLs and different buffers.) On the receive path, there is a requirement that the NDIS-specified lookahead size’s worth of bytes be in the first MDL of the receive NB, so that upper layers can parse the packet headers without having to deal with disjoint chunks spread across multiple MDLs. NDIS is free to set the lookahead size for a NIC miniport to anything up to the MTU, although I don’t believe it typically does. I’ve recently seen it set to a value around 550 bytes (I don’t remember exactly), but I do remember rounding what NDIS set up to the next cache-line multiple, 576.

I know there is a bug in the WHQL test that fails if the NB offset is not zero on receive packets, and if you are sharing an MDL across many packets, the NB offset will necessarily be non-zero on some of them. I also know that on the send side into a NIC miniport, the WHQL tests share memory buffers (and maybe MDLs) between different packets. On the send side, buffers and MDLs need to be treated as read-only (also note that the bit in the NB that marks a packet as read-only may not be set). The only workaround I could find for this WHQL bug was to adjust the MDL so that the NB offset is always zero. If you share MDLs, that won’t be possible.

It offhand seems like there may be an issue of concurrent updating of a shared MDL. The MDL you get from the write may not have a kernel address mapping, and upper-level protocols will need a virtual address for the buffer. If, for example, you indicate some NBs and some get processed on a different core in a worker thread, you may have multiple cores needing to create a kernel address mapping; if those happen at the same moment, you have two cores writing to the MDL with no locking. You might mitigate this by assuring the MDL is mapped before indicating it.

A second issue is that filter drivers might want to insert or remove bytes in the packet, and they sometimes do so by manipulating the MDL chain. Filters are supposed to restore an NB to its original state before returning a packet, but it offhand seems like while they own the packet they are free to manipulate it however they want, which might include manipulating the MDL. For the typical case of received packets with non-shared MDLs this causes no issues, but in your case it will. I’d say the read-only flag could be set in the NB if the data really is read-only, but my experience has been that the read-only bit was not set even on definitely read-only packets.
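A minimal sketch of that mitigation (forcing the system mapping up front on the write path, assuming the direct-I/O MDL from the write IRP):

/* Force the kernel VA mapping once, on the write path, so that upper layers
 * and filters never race to create it on multiple cores. */
if (MmGetSystemAddressForMdlSafe(Irp->MdlAddress, NormalPagePriority) == NULL) {
    /* No system PTEs available: fail the write instead of indicating. */
    return STATUS_INSUFFICIENT_RESOURCES;
}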

I can’t absolutely say it’s illegal to share MDLs across receive packets, but it does tread on an uncommon path, and any time you go off the common path you may find a new bug, in the OS or in other ISV code.

If it were me, I think I’d preallocate MDL memory (largest size needed, allocated perhaps in the NBL context), and fill it in with IoBuildPartialMdl, which should not require any allocations or page table manipulation, just some copying/adjustment of the source MDL into the destination MDL. Each indicated NB could then have a unique MDL, which gets reused when the NBL/NB comes back. So your conversion from the write to the NBL indicate goes something like:

MmGetSystemAddressForMdlSafe on the write MDL, to assure it has a mapping so the partial MDLs will use the same mapping
While (there are more packets to indicate in this write) {
    Pull an NBL/NB with embedded MDL from the free pool (if the pool is empty, do something appropriate)
    Increment a reference count on the write
    Use IoBuildPartialMdl to create a slice of the write buffer
    Set the appropriate NBL and NB fields for this packet
    Put this NBL at the end of the chain to indicate
}

Indicate the chain of NBLs

When you get the NBLs back, you just put them back on the free list and decrement the reference count on the original write. When the reference count goes to zero, you can complete the write.
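A rough C sketch of the approach described above; the context structures and helpers (WRITE_CTX, RX_NBL_CTX, PopFreeNbl, GetNextFrame, MAX_FRAME_SIZE) are illustrative names, not a real API:

#include <ndis.h>

typedef struct _WRITE_CTX {                 /* one per IRP_MJ_WRITE */
    PIRP Irp;
    LONG RefCount;                          /* one reference per indicated NB */
} WRITE_CTX;

typedef struct _RX_NBL_CTX {                /* stored in the NBL context area */
    PMDL PartialMdl;                        /* preallocated at pool-creation time */
    WRITE_CTX *Write;
} RX_NBL_CTX;

/* Pool setup, once per NBL: a worst-case partial MDL (a frame of MAX_FRAME_SIZE
 * bytes starting at any page offset); IoBuildPartialMdl rewrites it each time:
 *     ctx->PartialMdl = IoAllocateMdl((PVOID)(ULONG_PTR)(PAGE_SIZE - 1),
 *                                     MAX_FRAME_SIZE, FALSE, FALSE, NULL); */

VOID IndicateFramesFromWrite(NDIS_HANDLE Adapter, WRITE_CTX *Write, PMDL WriteMdl)
{
    PNET_BUFFER_LIST head = NULL, *tail = &head;
    ULONG count = 0, offset = 0, length = 0;

    /* Make sure the 65K buffer has a system mapping before anything is
     * indicated, so (per the description above) the partial MDLs share it. */
    if (MmGetSystemAddressForMdlSafe(WriteMdl, NormalPagePriority) == NULL)
        return;                                 /* complete the write with an error */

    while (GetNextFrame(&offset, &length)) {    /* hypothetical frame parser */
        PNET_BUFFER_LIST nbl = PopFreeNbl();    /* hypothetical free list */
        if (nbl == NULL)
            break;                              /* do something appropriate */

        RX_NBL_CTX *ctx = (RX_NBL_CTX *)NET_BUFFER_LIST_CONTEXT_DATA_START(nbl);
        PNET_BUFFER nb = NET_BUFFER_LIST_FIRST_NB(nbl);

        /* Describe just this frame; no allocation, no page-table work. */
        IoBuildPartialMdl(WriteMdl, ctx->PartialMdl,
                          (PUCHAR)MmGetMdlVirtualAddress(WriteMdl) + offset, length);

        NET_BUFFER_FIRST_MDL(nb) = ctx->PartialMdl;
        NET_BUFFER_CURRENT_MDL(nb) = ctx->PartialMdl;
        NET_BUFFER_DATA_OFFSET(nb) = 0;
        NET_BUFFER_CURRENT_MDL_OFFSET(nb) = 0;
        NET_BUFFER_DATA_LENGTH(nb) = length;

        ctx->Write = Write;
        InterlockedIncrement(&Write->RefCount);

        NET_BUFFER_LIST_NEXT_NBL(nbl) = NULL;
        *tail = nbl;
        tail = &NET_BUFFER_LIST_NEXT_NBL(nbl);
        count++;
    }

    if (head != NULL)
        NdisMIndicateReceiveNetBufferLists(Adapter, head,
                                           NDIS_DEFAULT_PORT_NUMBER, count, 0);
}

/* MiniportReturnNetBufferLists: push each NBL back on the free list, drop its
 * write reference, and complete the IRP when the last reference goes away. */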

Jan


Winpcap’s source code is available.

https://github.com/wireshark/winpcap/tree/master/packetNtx/driver

W. N.

Thanks for the write up.
Sounds simple enough now…


@Jan_Bottorff said:
For the typical case of received packets with non-shared MDLs, this will not cause issues, but in your case, it will. I’d say the read-only flag in the NB could be set in the NB if it’s read-only, but my experience has also been the read-only bit was not set on definitely read-only packets.

Are you referring to NDIS_NBL_FLAGS_RECV_READ_ONLY? If so, I’ve found that calling NdisMIndicateReceiveNetBufferLists on an NBL with that flag set will still sometimes result in writes to the NB’s MDL somewhere in the TCP stack. Any insights as to why this still might be happening?

The concepts of NDIS_NBL_FLAGS_RECV_READ_ONLY and NDIS_NBL_FLAGS_SEND_READ_ONLY are essentially dead (just like this thread was). In fact, I didn’t even know the flags existed, and I’m Microsoft’s architect for NDIS. There are a variety of places in the OS that will ignore these flags when writing to MDLs. The reality is that the OS considers the MDL and packet payload to always be writable.

@“Jeffrey_Tippet_[MSFT]”

Ah, that’s a rough situation. It’s surprising to me that you say NDIS considers MDLs always writable. On Linux, it’s usually necessary to call a particular function before modifying an skb, which will optionally do a copy-on-write, because its page fragments might be shared between several skbs or belong to some physical device with various constraints. I assume NDIS would have something similar, to at least deal with GRO and whatnot.

Right now in Wintun I’ve got an IRP_MJ_WRITE handler on a DO_DIRECT_IO device object that essentially does:

for (offset,size in GetHeaders(Irp)) {
    NdisMIndicateReceiveNetBufferLists(NdisAllocateNetBufferAndNetBufferList(ctx->NBLPool, 0, 0, Irp->MdlAddress, offset, size));
}

I can map that MDL as writable (without MdlMappingNoWrite), which makes things “work”. But of course that allows read-only files to be scribbled over when the TCP stack fixes up checksums and such.
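For reference, the difference is just the page-priority flag used when the system mapping is created; MdlMappingNoWrite (Windows 8 and later) requests a read-only mapping. A minimal sketch:

/* Writable system mapping: makes things "work", but lets the stack modify
 * pages that userspace may have mapped from a read-only file. */
PVOID rwVa = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, NormalPagePriority);

/* Read-only system mapping (Windows 8+): any in-place fix-up by upper
 * layers would then fault instead of silently modifying the file pages. */
PVOID roVa = MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                          NormalPagePriority | MdlMappingNoWrite);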

The bigger picture is that NtWriteFile uses MmProbeAndLockPages(IoReadAccess) and ProbeForRead(), but it of course doesn’t use IoWriteAccess/ProbeForWrite, since that’s not what NtWriteFile is intended for, taking only const buffers. This poses a problem.

It seems like my options are:

  • Use neither direct nor buffered I/O, and do the ProbeAndLock dance myself with IoWriteAccess and ProbeForWrite, to just return an error to userspace if it is actually passing a read-only page.
  • Introduce a dreaded allocation and memcpy.
  • Restrict the file object to SYSTEM, live with the security implications, and call it a day.

None of those options sound remotely pleasant to me. Is there a good alternative?

This is, I think, the same problem that winsock solves: how to efficiently get a buffer from usermode. It’s a bit more nuanced than “memcpy == bad”. It turns out that it’s cheaper to copy small buffers than it is to update the page tables. So winsock will just memcpy for buffers that are smaller than a particular threshold. Its threshold is on the order of 1500 bytes, but you should benchmark your code to see where the break-even threshold is.

Allocations from the general pool are bad, but you can get a fast allocation from a lookaside list, if you don’t mind wasting a bit of memory. So, for example, you could create lookaside lists for Big, Medium, and Small buffers, and allocate a buffer that best fits the current I/O.
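A minimal sketch of that size-bucketed lookaside idea; the bucket sizes and names here are illustrative, not anything Winsock or NDIS defines:

#include <ntddk.h>

static NPAGED_LOOKASIDE_LIST SmallList, MediumList, LargeList;

VOID InitBufferLists(VOID)
{
    /* Bucket sizes are made up; tune them from benchmarks. */
    ExInitializeNPagedLookasideList(&SmallList,  NULL, NULL, POOL_NX_ALLOCATION, 256,   'SfuB', 0);
    ExInitializeNPagedLookasideList(&MediumList, NULL, NULL, POOL_NX_ALLOCATION, 2048,  'MfuB', 0);
    ExInitializeNPagedLookasideList(&LargeList,  NULL, NULL, POOL_NX_ALLOCATION, 65536, 'LfuB', 0);
}

/* Pick the smallest bucket that fits. The caller must remember which list the
 * buffer came from so it can ExFreeToNPagedLookasideList back to the same one. */
PVOID AllocateBuffer(SIZE_T Size, NPAGED_LOOKASIDE_LIST **FromList)
{
    if (Size <= 256)        *FromList = &SmallList;
    else if (Size <= 2048)  *FromList = &MediumList;
    else if (Size <= 65536) *FromList = &LargeList;
    else                    return NULL;        /* larger than any bucket */
    return ExAllocateFromNPagedLookasideList(*FromList);
}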

If you have the luxury of changing your userspace API, I’d heartily suggest making an API where userspace pre-registers the buffers, then just reuses them over and over again. This is what RIO does, and I understand Linux is going in that direction too. If you can do that, then you only pay the expensive cost of probing & locking the buffers once, up-front.
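One way the register-once idea could be shaped, sketched with hypothetical names (RING_REGISTRATION, IOCTL_REGISTER_BUFFER, etc.); this is not RIO’s or Wintun’s actual interface:

/* Hypothetical per-file-object state for a "register once, reuse forever" API. */
typedef struct _RING_REGISTRATION {
    PMDL   Mdl;        /* probed and locked exactly once, at registration time */
    PVOID  KernelVa;   /* system mapping created exactly once */
    SIZE_T Length;
} RING_REGISTRATION;

/* IOCTL_REGISTER_BUFFER (once per file object):
 *     IoAllocateMdl + MmProbeAndLockPages(..., IoWriteAccess)
 *     + MmGetSystemAddressForMdlSafe, stored in RING_REGISTRATION
 *     (see the probe-and-lock sketch further down for the details).
 *
 * Data path (every subsequent write/indication):
 *     slice Reg->Mdl with IoBuildPartialMdl, or read/write through
 *     Reg->KernelVa directly; no further probing, locking, or mapping.
 *
 * IOCTL_UNREGISTER_BUFFER / IRP_MJ_CLEANUP:
 *     MmUnlockPages(Reg->Mdl); IoFreeMdl(Reg->Mdl);
 */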

Beyond the above discussion, you should probably use METHOD_NEITHER and manually probe for read+write and lock it in kernel VA. It’s not completely clear to me why this is unappealing; it’s not that slow (winsock does it for larger buffers).
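What that probe-and-lock dance might look like in the write dispatch, assuming METHOD_NEITHER semantics (raw pointer in Irp->UserBuffer, length in the stack location); this is an illustration of the approach, not Wintun’s actual code:

/* With neither buffered nor direct I/O, the user pointer arrives unvalidated. */
PIO_STACK_LOCATION sl = IoGetCurrentIrpStackLocation(Irp);
PMDL mdl = IoAllocateMdl(Irp->UserBuffer, sl->Parameters.Write.Length,
                         FALSE, FALSE, NULL);
if (mdl == NULL)
    return STATUS_INSUFFICIENT_RESOURCES;

__try {
    /* IoWriteAccess: fail up front if the caller handed us pages that are
     * not writable (for example, a read-only file mapping). */
    MmProbeAndLockPages(mdl, Irp->RequestorMode, IoWriteAccess);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    IoFreeMdl(mdl);
    return STATUS_INVALID_USER_BUFFER;
}

/* Map once here, so tcpip and filters never race to create the mapping. */
PVOID va = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
if (va == NULL) {
    MmUnlockPages(mdl);
    IoFreeMdl(mdl);
    return STATUS_INSUFFICIENT_RESOURCES;
}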

@“Jeffrey_Tippet_[MSFT]” said:
It turns out that it’s cheaper to copy small buffers than it is to update the page tables.

Indeed we see this in the Linux stack too: page table manipulations aren’t cheap.

If you have the luxury of changing your userspace API, I’d heartily suggest making an API where userspace pre-registers the buffers, then just reuses them over and over again.

That’s not a bad idea. The poor man’s way of doing this is to just map it on the first Read/WriteFile and then enforce that the virtual address is the same for subsequent calls (even though userspace could obviously remap that address; it’s mostly a sanity check, I guess). A better way might be with ioctls or some other syscall to send commands to the device.

This is what RIO does, and I understand Linux is going in that direction too. If you can do that, then you only pay the expensive cost of probing & locking the buffers once, up-front.

Yea, actually the Linux direction is even sexier: you can avoid syscalls entirely during heavy load by registering a ring buffer. On write, userspace atomically increments a counter in the ring header, and the kernel, once it has finished its first pass over the ring, doesn’t return to userspace if the counter has been incremented in the meantime. On read, it’s the reverse: the kernel increments the counter, and userspace doesn’t make a synchronous syscall unless there’s nothing to read, in which case it blocks.

Beyond the above discussion, you should probably use METHOD_NEITHER and manually probe for read+write and lock it in kernel VA. It’s not completely clear to me why this is unappealing; it’s not that slow (winsock does it for larger buffers).

Yea, fair enough. Doing the mapping once, up front, makes this more viable too.

Thanks for the suggestions.


On the actual original topic of this thread, in case folks revisit in the archives, I’ve seen basically no problems (yet?) with MDL sharing when doing:

ULONG offset = 0;
for (size in stuffToReceive(mdl)) {
    NdisMIndicateReceiveNetBufferLists(NdisAllocateNetBufferAndNetBufferList(pool, 0, 0, mdl, offset, size));
    offset += size;
}

It’s a bit more nuanced than “memcpy == bad”.

Yes, indeed. People are too quick to say “Buffered I/O BAD… Direct I/O GOOD!”

As Mr. Roberts has noted here before, processors these days are very, very good at doing block move operations. It’s something that the CPU architects actually focus on.

It turns out that it’s cheaper to copy small buffers than it is to update the page tables.

And, while I’m very certain Mr. Tippet knows this, I’ll just add for others playing along at home that accounting for the overhead of Direct I/O requires an extremely difficult calculus. It’s not so much the cost of allocating and filling PDEs/PTEs when the Direct I/O mapping is created… it’s the cost of the “TLB shootdown” (the cross-processor invalidations that need to be done) as a result of unmapping the region when the I/O is completed.

Back in the day, we used to recommend that people start considering Direct I/O for read/write operations over 4K. In the last few years, we’ve changed to recommending people at least consider Direct I/O all the way up to 64K read/write operations.

In many cases, as well, it comes down to what you’re trying to optimize: CPU utilization, elapsed time, or something else (like network bandwidth utilization). As a former business partner of mine used to repeat regularly “anything regarding performance analysis you’ve determined by intuition is likely to be wrong, because there’s almost nothing intuitive about performance.”

Peter

Indeed we see this in the Linux stack too: page table manipulations aren’t cheap.

Well, Linux does not really have to do that many page table manipulations as far as KM address space is concerned, does it? After all, use of the vmalloc() area (i.e. the one where the ‘virtual_address == physical_address + fixed_offset’ equation does not hold) by KM code is generally frowned upon…

Anton Bassov

@“Jeffrey_Tippet_[MSFT]” said:
The concepts of NDIS_NBL_FLAGS_RECV_READ_ONLY and NDIS_NBL_FLAGS_SEND_READ_ONLY are essentially dead (just like this thread was). In fact, I didn’t even know the flags existed, and I’m Microsoft’s architect for NDIS. There are a variety of places in the OS that will ignore these flags when writing to MDLs. The reality is that the OS considers the MDL and packet payload to always be writable.

Thanks to IIS, HTTP.sys will post read-only memory to NDIS in the case of HTTP errors.

refer:
https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/a63fef1c-c7e6-451d-adc5-16de1e8c0b62/ndis-51-intermediate-driver-for-packet-encryptiondecryption-not-working-with-httpsys?forum=wdk

And thanks to you, Mr. @mengxp for updating a two year old, dead, thread. Awesome example of necroposting.

As you have not yet locked this old thread, allow me to add an observation about the security implications of these techniques: they all allow UM processes to lock up physical memory in a way that could be exploited. That’s why the “lock pages in memory” privilege was created, and it should be checked by anyone rolling their own scheme like this.
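For completeness, such a check in the registration path might look roughly like this (SeLockMemoryPrivilege is the kernel name of the “lock pages in memory” privilege):

/* Refuse to pin memory on behalf of callers that don't hold the
 * "lock pages in memory" privilege. */
LUID lockMemory = RtlConvertLongToLuid(SE_LOCK_MEMORY_PRIVILEGE);
if (!SeSinglePrivilegeCheck(lockMemory, Irp->RequestorMode))
    return STATUS_PRIVILEGE_NOT_HELD;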