> Hi all,
This is going to be a somewhat ambiguous question unfortunately, and I
apologize in advance, but I am really at a loss for how to proceed with
the problem, so I was hoping I could pick up some pointers here.
We are currently maintaining a WDF xHCI driver and have been chasing some
nagging memory corruption problems. We’ve narrowed the issue down to a
change made in the past to address a different problem; undoing that
change fixes the memory corruption, but also reintroduces the original
problem.
Here are the details of the two problems and the “fix”:
Original problem:
Our devices transfer a large amount of data at once at regular intervals
on USB bulk pipes (2MB-16MB), and we have designed our controller driver
to allocate large ring buffers for the bulk endpoints to accommodate these
transfers. We calculate our ring sizes so that they are just large enough
to hold all the data that the device will transfer at once. We have found
that this works for devices that send less data (2MB-8MB), but the larger
data transferring devices would fail. Looking at USB traces, it seemed
that part-way through data transfer (very strange times, like after
transferring 2-3MB), the host would stop providing ACKs and the transfer
would simply never complete. It would indefinitely stall at this point.
Fix:
The fix was to arbitrarily pad the ring buffer size by some amount. We
ultimately settled on padding it by 10%, and the problem above stopped
happening. Why this fixed the problem was never determined.
New problem:
This “fix” seems to cause the memory corruption problems mentioned above.
Essentially, in the course of using our driver and device, we occasionally
get BSODs which, when examined in WinDbg, are all over the place, usually
in other drivers. We have not isolated an exact reproduction case, but
certain system+hardware setups seem to trigger these BSODs with higher
frequency.
With only the information provided above, can anyone even guess what could
be happening?
The ring buffer allocation code is quite straightforward:
- Calculate the required ring size (and add padding)
- Calculate and store the number of required segments
- Use a standard ExAllocatePoolWithTag to allocate the entire ring
- For each segment (Windows page size), WdfCommonBufferCreate is called
for the DMA routines to use
- Store pointers to the segments and set up links between the segments
(link TRBs, according to the xHCI spec)
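In rough C terms, the sizing arithmetic in the first two steps might look like the sketch below. The constant and function names are illustrative, not our actual code, though the 10% figure is the padding we actually use:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants -- the real driver's values may differ. */
#define SEGMENT_SIZE_BYTES  4096u  /* one segment per Windows page   */
#define RING_PAD_PERCENT    10u    /* the 10% padding described above */

/* Pad the transfer size, then round up to a whole number of
 * page-sized segments (steps 1-2 of the list above). */
static size_t RingSegmentCount(size_t transferBytes)
{
    size_t padded = transferBytes +
                    (transferBytes * RING_PAD_PERCENT) / 100u;
    return (padded + SEGMENT_SIZE_BYTES - 1u) / SEGMENT_SIZE_BYTES;
}
```

For a 2MB transfer this yields 564 segments rather than the unpadded 512, which is the difference between the "works" and "stalls" behavior described above.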
The clean-up code is equally simple:
- For each segment, call WdfObjectDelete on the CommonBuffers created
during allocation.
- Call ExFreePool to clean up the ring.
Does anyone have any hints of where to start looking at this?
Are you checking all the calls to make sure they have returned valid
values? There is a tendency in app code to call something and just assume
it has worked. For a lot of calls (such as graphics calls) this is
sort-of-OK, unless you make a mistake and pass invalid parameters. My
tendency these days is to enclose such calls in a VERIFY macro, which is
just like ASSERT except that when compiled in release mode the expression
is still evaluated (for ASSERT, the expression is never evaluated in
release mode).
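A minimal version of such a VERIFY macro, built on the standard assert, might look like this (the driver-call usage shown in the comment is hypothetical):

```c
#include <assert.h>

/* VERIFY: in debug builds, behaves like assert(); in release builds
 * (NDEBUG defined), the check disappears but the expression is still
 * evaluated, so calls with side effects are never silently skipped. */
#ifdef NDEBUG
#define VERIFY(expr) ((void)(expr))
#else
#define VERIFY(expr) assert(expr)
#endif

/* Hypothetical usage around a status-returning call:
 *     VERIFY(NT_SUCCESS(WdfCommonBufferCreate(...)));
 */
```

The point is that VERIFY is safe to wrap around calls you must make anyway, whereas wrapping them in plain ASSERT would make the release build skip the call entirely.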
But I might suggest getting a PCI bus analyzer and tracing the bus
traffic. Put in enough debug output statements of your choice so that you
have a set of addresses you can compare to what you see on the bus. At one
point, on a hardware platform decades ago, and at least a decade before
personal computers, there was a DMA problem: due to a bug in the CPU logic
(which was hardwired rather than microcoded), if a DMA came in while a
“rotate register” or “shift register” instruction was being executed, the
contents of the target register were scrambled. I’ve heard others tell of
bugs in the bus negotiation, for example, that if a bus request is
retried, it might use the wrong address, although I’ve never encountered
any of these myself.
The “increase the buffer size by 10%” fix, where you don’t know why it
works, is a bit scary. It may have the same root cause as the problem you
are seeing now. Or maybe not. Make sure you don’t have race conditions
between your deallocation and your usage, such that a buffer can be freed
while a transfer is in progress. When a buffer is freed, make sure its
pointer is set to NULL; that is effective for detecting errors as long as
you don’t have copies of that pointer floating around.
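One way to make that discipline mechanical is a free-and-null helper. This sketch uses the C runtime’s free() purely for illustration; in the driver the same pattern would wrap ExFreePool or WdfObjectDelete:

```c
#include <stdlib.h>
#include <stddef.h>

/* Free a buffer and clear the caller's pointer in one step, so any
 * later dereference of the stale pointer faults immediately instead
 * of silently corrupting freed (and possibly reallocated) memory.
 * Note this only protects the one variable passed in; copies of the
 * pointer elsewhere remain dangling. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)
```

The do/while(0) wrapper is the usual trick to make the macro behave like a single statement after an if with no braces.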
Memory damage bugs, especially those caused by hardware, rank among the
most obnoxious, intransigent, and difficult bugs to find. One time, years
ago, I spent a lot of effort “bulletproofing” an app, and no matter what I
did, the app (which had to be up 24/7 and required a reboot of the whole
machine if the app stopped) continued to crash, usually on an access
fault. I finally gave up and put in handlers to catch this condition
(which is hard to do, or sometimes, impossible to do, in Windows kernel
mode). It continued to crash at least once a day, and the suspicion was
always the memory controller (which we had designed; it was a 16x16
crosspoint switch), which we suspected would give bad data of the kind we
saw in the equivalent of !analyze -v, otherwise known as “Roy Levin”, who
would pore over huge octal dumps and try to see what happened (there are
some fascinating hardware failures he discovered, all OT). The data we
saw in the register, which supposedly came from a specific memory
location, did not match the data seen in the dump.
You have a nontrivial problem, and if you can’t identify a software cause,
it’s time to go for the bus analyzer.
joe
Thanks all!
Rich
NTDEV is sponsored by OSR
Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev
OSR is HIRING!! See http://www.osr.com/careers
For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer