help tracking down memory corruption...

James_Harper · June 28, 2011, 8:21pm

I have the following code that is misbehaving with a 0xD1 (9, 2, 1, x)
BSoD, and the debugger says this is happening when referencing the Blink
of a LIST_ENTRY, where the Flink of the previous entry has been set to 1
instead of a pointer. I peppered a bunch of ASSERTS around to catch when
this happens and it’s happening after a KdPrint as per code below:

#define FUNCTION_MSG(...) KdPrint((__DRIVER_NAME " " __VA_ARGS__))
#define NBL_LIST_ENTRY_FIELD MiniportReserved[0]
#define NBL_LIST_ENTRY(_nbl) \
(*(PLIST_ENTRY)&(_nbl)->NBL_LIST_ENTRY_FIELD)

while (!IsListEmpty(&nbl_head))
{
PNET_BUFFER_LIST nbl;
nbl_entry = RemoveHeadList(&nbl_head);
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
FUNCTION_MSG(" %p retrieved flink = %p, blink = %p\n", nbl_entry,
nbl_entry->Flink, nbl_entry->Blink);
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1); /* <----- this assert fails */
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
nbl = CONTAINING_RECORD(nbl_entry, NET_BUFFER_LIST,
NBL_LIST_ENTRY_FIELD);
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
nbl->Status = NDIS_STATUS_SUCCESS;
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
FUNCTION_MSG("A %p\n", nbl);
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
NdisMSendNetBufferListsComplete(xi->adapter_handle, nbl,
NDIS_SEND_COMPLETE_FLAGS_DISPATCH_LEVEL);
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
}

nbl_head is a local variable containing a list of packets that have been
retrieved from the io ring of the hardware (xen virtual network device
actually) and I gather them all with a lock held and then indicate them
after releasing the lock (the code above).

The fact that the only thing that happens before the breakage is a
KdPrint presumably means that I’ve previously corrupted memory, but
tracking it down is proving to be an exercise in frustration. Aside from
the verifier (which isn’t helping), are there any other tricks I can use
to find out where my bug is?

Thanks

James

mjd · June 28, 2011, 9:45pm

James -

I’m assuming the list head (_nbl)->NBL_LIST_ENTRY_FIELD is protected by a lock as you said. What type of lock? For instance, your

while ( nbl_entry = ExInterlockedRemoveHeadList(&nbl_head,&spinlock))

mjd · June 28, 2011, 9:56pm

James – Sorry I hit the post button by accident while typing.

Elsewhere, in your DriverEntry or wherever it makes sense, initialize your ListSpinlock…

while ( nbl_entry = ExInterlockedRemoveHeadList(&nbl_head,&ListSpinlock))
{
PNET_BUFFER_LIST nbl;
ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
FUNCTION_MSG(" %p retrieved flink = %p, blink = %p\n", nbl_entry,
nbl_entry->Flink, nbl_entry->Blink);
/* rest of your loop */
}

There are other, equally valid ways to code this protection. And there are instances where one would obviously want to protect the whole loop and not just the linked list access. I’m assuming you’ve protected every access to your list with the same lock of some sort.

The only other suggestion, aside from verifier and pool tagging, is to try running your driver through PREfast (I’m naively assuming it works with network miniport drivers, anyone?)

–Mjd

James_Harper · June 28, 2011, 9:58pm

> James -
>
> I’m assuming the list head (_nbl)->NBL_LIST_ENTRY_FIELD is protected by a lock
> as you said. What type of lock? For instance, your
>
> while ( nbl_entry = ExInterlockedRemoveHeadList(&nbl_head,&spinlock))
Yes, it’s protected by a lock. I’m a little closer to the problem…
Somehow I’m putting the same NBL on the complete-list multiple times,
and because this is an SMP system and I’ve already given the NBL back to
NDIS, it gets changed while I think I still own it. The reason it
happens after the KdPrint is that the KdPrint is a time-consuming
operation (I hook debugprint and send the output to xen), so there is
plenty of time for something to happen to the NBL.

The funny thing is that when I’m creating the list I get the NBL and
then I get the LIST_ENTRY (MiniportReserved[0]) from it, and the pointer
to the LIST_ENTRY isn’t anything like the pointer to the NBL, when they
should be only 0x30 bytes apart. This only occurs when many packets are
completed at once though.

Getting closer at least

Thanks

James

James_Harper · June 28, 2011, 10:00pm

> James – Sorry I hit the post button by accident while typing.
>
> Elsewhere in your DriverEntry or where ever it makes sense, Initialize your
> ListSpinlock…
>
> while ( nbl_entry = ExInterlockedRemoveHeadList(&nbl_head,&ListSpinlock))
> {
> PNET_BUFFER_LIST nbl;
> ASSERT((ULONG_PTR)nbl_head.Flink->Flink != 1);
> ASSERT((ULONG_PTR)nbl_head.Blink->Flink != 1);
> FUNCTION_MSG(" %p retrieved flink = %p, blink = %p\n", nbl_entry,
> nbl_entry->Flink, nbl_entry->Blink);
> Rest of your loop
> }
>
> There are other, equally valid ways to code this protection. And there are
> instances where one would obviously want to protect the whole loop and not
> just the linked list access. I’m assuming you’ve protected every access to
> your list with the same lock of some sort.

nbl_head is a local variable, so it doesn’t need any protection.

James

mjd · June 28, 2011, 10:08pm

>This only occurs when many packets are completed at once though.

Sorry to sound like a broken record, but I’ve been through this recently with a different style of driver with a high I/O rate. Is your complete-list processing protected by the same common SMP-capable lock?

Glad you’re getting closer to a solution.

Mike

Jonathan_Edwards · June 28, 2011, 11:39pm

If it happens every time, or often enough, a memory access breakpoint will help:

ba w4 nbl_head.Blink->Flink


James_Harper · June 29, 2011, 2:00am

> >This only occurs when many packets are completed at once though.
>
> Sorry to sound like a broken record, but I’ve been through this recently with
> a different style of driver with a high I/O rate. Is your complete list
> processing protected by the same common smp capable lock?

No, but as per previous email it’s a local list so doesn’t need any
protection.

I’m inclined to agree that it’s a synchronisation issue somewhere
though… I’m running under a single core/thread now to see if that
makes a difference.

James

James_Harper · June 29, 2011, 2:39am

Found the problem… I skipped the bit in the docs where it says
NdisMSendNetBufferListsComplete completes the whole linked list of NBLs. I
was returning them one at a time without clearing the ‘next’ pointer.

Thanks

James
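
For anyone who finds this thread later, here is a small user-mode sketch of why leaving the 'next' pointer set does so much damage. It is only a model, not driver code: mock_nbl, mock_complete and complete_individually are hypothetical names made up for illustration, and mock_complete merely imitates the "completes the whole chain" behaviour James describes for NdisMSendNetBufferListsComplete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NET_BUFFER_LIST; only the chain link is modelled. */
typedef struct mock_nbl {
    struct mock_nbl *next;  /* plays the role of the NBL 'next' pointer */
    int completions;        /* how many times this NBL was handed back */
} mock_nbl;

/* Stand-in for NdisMSendNetBufferListsComplete: it completes every NBL
 * reachable from nbl, not just the first one. */
static void mock_complete(mock_nbl *nbl)
{
    for (; nbl != NULL; nbl = nbl->next)
        nbl->completions++;
}

/* Completes three chained NBLs one at a time and returns the largest
 * per-NBL completion count. A non-zero clear_next selects the fix. */
static int complete_individually(int clear_next)
{
    mock_nbl nbls[3] = { { NULL, 0 } };
    int i, worst = 0;

    /* The NBLs arrive chained together: nbls[0] -> nbls[1] -> nbls[2]. */
    nbls[0].next = &nbls[1];
    nbls[1].next = &nbls[2];

    for (i = 0; i < 3; i++) {
        if (clear_next)
            nbls[i].next = NULL;  /* detach from the chain before completing */
        mock_complete(&nbls[i]);
    }

    for (i = 0; i < 3; i++)
        if (nbls[i].completions > worst)
            worst = nbls[i].completions;
    return worst;
}
```

Without the detach step the last NBL in the chain is completed once per predecessor plus once for itself, so the driver keeps touching memory it has already handed back. In a real NDIS 6 miniport the detach step would presumably be NET_BUFFER_LIST_NEXT_NBL(nbl) = NULL before the completion call, but that is an assumption about James's fix, since he did not post the corrected code.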