NDIS 6.40 miniport driver (PCIe): NdisMAllocateNetBufferSGList crashes with IRQL_NOT_LESS_OR_EQUAL

I am developing an NDIS miniport driver (NDIS 6.40) for a 10G PCIe card. When the driver called NdisMAllocateNetBufferSGList() to build the scatter/gather list for a given NET_BUFFER, the system crashed with IRQL_NOT_LESS_OR_EQUAL. The crash occurred about 12 hours into a stress test run with NTttcp.

HOW TO REPRODUCE THE ISSUE:

  1. Take two Windows 10 RS5 systems (T1 and T2).

  2. T1 contains my 10G card and T2 contains a 10G card from another reputable vendor.

  3. Both cards have two ports.

  4. Assign static IPs to each port on T1 and T2 (Ip1 and Ip2).
    T1

    Port 1: 192.168.0.1
    Port 2: 172.168.0.1

    T2

    Port 1: 192.168.0.2
    Port 2: 172.168.0.2

  5. Connect Port 1 of T1 to Port 1 of T2 using a CAT 7 cable (or optical).

  6. Connect Port 2 of T1 to Port 2 of T2 using a CAT 7 cable (or optical).

  7. Run the following commands.


On T1: (System under test)
NTttcps.exe -s -m 8,,192.168.0.2 -a 2 -t 50400 -wu 10 -cd 10
NTttcps.exe -s -m 8,,172.168.0.2 -a 2 -t 50400 -wu 10 -cd 10
NTttcpr.exe -r -m 8,,192.168.0.1 -a 16 -t 50400 -wu 10 -cd 10
NTttcpr.exe -r -m 8,,172.168.0.1 -a 16 -t 50400 -wu 10 -cd 10

On T2: (Support machine)
NTttcps.exe -s -m 8,,192.168.0.1 -a 2 -t 50400 -wu 10 -cd 10
NTttcps.exe -s -m 8,,172.168.0.1 -a 2 -t 50400 -wu 10 -cd 10
NTttcpr.exe -r -m 8,,192.168.0.2 -a 16 -t 50400 -wu 10 -cd 10
NTttcpr.exe -r -m 8,,172.168.0.2 -a 16 -t 50400 -wu 10 -cd 10

Below are the bugcheck details and the stack trace:

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 0000000000000008, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff80312425f40, address which referenced memory
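
For readers following along, the four arguments above can be decoded mechanically. The sketch below is a user-mode illustration only; the `BugcheckA` type and `decode_bugcheck_a` helper are hypothetical names invented here, not part of any Windows header. The bit meanings for Arg3 are taken from the debugger text above. The "near NULL" test (an address inside the first 4 KB page) is an assumption used as a heuristic: a read of address 0x8 is often what a load from a field at offset 8 of a NULL structure pointer looks like, but the dump alone does not prove that.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not a Windows definition. */
typedef struct {
    uint64_t address;    /* Arg1: memory referenced */
    unsigned irql;       /* Arg2: IRQL at the time of the fault */
    bool     is_write;   /* Arg3 bit 0: 0 = read, 1 = write */
    bool     is_execute; /* Arg3 bit 3: 1 = execute operation */
    bool     near_null;  /* heuristic: address lies in the first 4 KB page */
} BugcheckA;

/* Decode the first three arguments of bugcheck 0xA as described above. */
static BugcheckA decode_bugcheck_a(uint64_t arg1, uint64_t arg2, uint64_t arg3)
{
    BugcheckA d;
    d.address    = arg1;
    d.irql       = (unsigned)arg2;
    d.is_write   = (arg3 & 0x1) != 0;
    d.is_execute = (arg3 & 0x8) != 0;
    d.near_null  = arg1 < 0x1000; /* assumed 4 KB page size */
    return d;
}
```

Calling `decode_bugcheck_a(0x8, 0x2, 0x0)` with the values from this dump yields a read (not a write or execute) of a near-NULL address at IRQL 2 (DISPATCH_LEVEL).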


nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x454
hal!HalpDmaAllocateScatterPagesFromScatterPoolV2+0x90
hal!HalpDmaAllocateScatterPagesFromScatterPool+0x4a
hal!HalpDmaAllocateMapRegisters+0xc7
hal!HalAllocateAdapterChannelV2+0xd2
hal!HalAllocateAdapterChannel+0x45
hal!HalBuildScatterGatherListV2+0x1850b
NDIS!NdisMAllocateNetBufferSGList+0x1f9 <--------------------------------
MyNetworkCard!MPSendNetBufferLists+0x2fb <-------------------------------
NDIS!ndisMSendNBLToMiniportInternal+0x11a
NDIS!ndisMSendNBLToMiniport+0xe
NDIS!ndisCallSendHandler+0xb8
NDIS!NdisSendNetBufferLists+0x2de
tcpip!FlFastSendPackets+0x93
tcpip!IpNlpFastContinueSendDatagrams+0x4f3
tcpip!IpNlpFastSendDatagram+0x3f1
tcpip!TcpTcbSend+0x572
tcpip!TcpEnqueueTcbSend+0x462
tcpip!TcpTlConnectionSendCalloutRoutine+0x24
nt!KeExpandKernelStackAndCalloutInternal+0x78
nt!KeExpandKernelStackAndCalloutEx+0x1d
tcpip!TcpTlConnectionSend+0x77
afd!AfdSend+0x5cf
afd!AfdDispatch+0x154
nt!IofCallDriver+0x59
nt!IopSynchronousServiceTail+0x1b1
nt!NtWriteFile+0x8bd
nt!KiSystemServiceCopyEnd+0x25
ntdll!NtWriteFile+0x14
KERNELBASE!WriteFile+0xfd
NTttcps!PostAsynchBuffer+0xc1
NTttcps!DoAsynchSendsReceives+0x278
NTttcps!StartSenderReceiver+0x45f
KERNEL32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Initial Diagnosis:

  1. This crash happened at an early stage of the data transfer. Up to this point, the driver does nothing apart from extracting the NET_BUFFERs from the NET_BUFFER_LISTs and queueing them in order. In the transmit handler, the driver dequeues each NET_BUFFER and then calls NdisMAllocateNetBufferSGList(). It does not do anything else. So could this be an OS/NDIS issue?

  2. I analyzed the NET_BUFFER being processed. Its DataOffset was 0xCA and its DataLength was 0x546. It had two MDLs: the first had a ByteCount of 0x100 and the second 0x510 (0x546 = 0x100 + 0x510 - 0xCA), so it passes MDL data length validation.

  3. The first MDL's flags value is 4 and the second MDL's flags value is 18.
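
The checks in items 2 and 3 can be reproduced offline with a small user-mode sketch. `SimMdl`, `chain_covers`, and `mdl_has_flag` are hypothetical names made up for this illustration; `SimMdl` only mirrors the three MDL fields discussed above and is not the kernel structure. The `MDL_*` values are the ones defined in wdm.h. Whether the "18" quoted above is hexadecimal or decimal is not stated, so the sketch does not assume either reading.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-mode stand-in for the MDL fields of interest (hypothetical type). */
typedef struct SimMdl {
    struct SimMdl *next;       /* next MDL in the chain, or NULL */
    uint32_t       byte_count; /* ByteCount */
    uint16_t       flags;      /* MdlFlags */
} SimMdl;

/* MdlFlags bit values as defined in wdm.h. */
#define MDL_MAPPED_TO_SYSTEM_VA     0x0001
#define MDL_PAGES_LOCKED            0x0002
#define MDL_SOURCE_IS_NONPAGED_POOL 0x0004
#define MDL_ALLOCATED_FIXED_SIZE    0x0008
#define MDL_PARTIAL                 0x0010

/* True if the chain holds at least data_offset + data_length bytes. */
static bool chain_covers(const SimMdl *mdl, uint32_t data_offset,
                         uint32_t data_length)
{
    uint64_t total = 0;
    for (; mdl != NULL; mdl = mdl->next) {
        total += mdl->byte_count;
    }
    return total >= (uint64_t)data_offset + data_length;
}

/* True if every bit in flag is set in the MDL's flags field. */
static bool mdl_has_flag(const SimMdl *mdl, uint16_t flag)
{
    return (mdl->flags & flag) == flag;
}
```

With the values from item 2 (0x100 and 0x510 bytes, offset 0xCA, length 0x546) `chain_covers` returns true, and the chain is exactly full: one more byte of DataLength would fail the check. A flags value of 4 corresponds to MDL_SOURCE_IS_NONPAGED_POOL. The second value decodes as MDL_PARTIAL | MDL_ALLOCATED_FIXED_SIZE if it is 0x18, or MDL_PARTIAL | MDL_PAGES_LOCKED if it is decimal 18 (0x12).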

Please help me with your thoughts.
I can provide additional info on request.
Thanks
Sam

I can offer nothing but very general advice.

It is very unlikely that this is a problem with Windows or another driver in this system. It is much more likely that your code causes some kind of subtle corruption that goes undetected for a long time. That is exactly the sort of bug that a stress test like this is designed to expose.

Start with !analyze -v.

I would follow up with additional tests using a different machine to host your card. Choose a different manufacturer. Make sure the motherboard is different. And then do it again with a different CPU brand (AMD vs Intel). If you can reproduce the crash on all platforms, you can be nearly sure that it is your problem. If not, then whatever the failing platforms have in common will be telling.

I would also try testing with UDP traffic and look at what kind of TCP offload settings you have on your card.

These are all random guesses about where to look, based on not a shred of tangible evidence.