GetBusData() limitation in which cases

rusakov2 · October 22, 2019, 4:58pm

Hi,
First, let me describe the platform where the situation occurs.
There is a custom PCIe endpoint device on a fairly typical Core i5 based mainboard. Since this endpoint device is custom, it has some limitations and needs its own self-diagnostics. To accomplish that, there is a timer-based self-check in its device driver, where several device registers are read and their contents validated. Several of the registers to check are located in PCIe configuration address space.
The periodically called function attempts to read such registers by issuing a read to the parent driver, that is, the PCIe bus driver, using GetBusData(bus_ctx, PCI_WHICHSPACE_CONFIG, …) function calls.
If the registers are read and their contents show the device is healthy, then no further action is needed. If the registers are read and the contents show an internal problem in the PCIe device, or the registers could not be read, then the device is declared as having a problem and the driver will soon be unloaded.
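To make the decision logic above concrete, here is a minimal, portable C sketch of how the result of one config-space read could be classified. It is not the actual driver code: the enum, the function name, and the fault bit are hypothetical, and it assumes the read reports how many bytes it returned and that a device that does not answer typically reads back as all ones (vendor ID 0xFFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for one pass of the periodic self-check. */
typedef enum {
    HEALTH_OK,          /* registers read and contents look valid     */
    HEALTH_READ_FAILED, /* fewer bytes came back than were requested  */
    HEALTH_DEVICE_GONE, /* all-ones vendor ID: device did not answer  */
    HEALTH_BAD_STATUS   /* device answered but reports a fault        */
} health_result;

/* Assumption for illustration: bit 0 of a device-specific status
 * dword means "internal fault". The real register layout is unknown. */
#define STATUS_FAULT_BIT 0x00000001u

/* Classify one config-space read from its byte count and contents. */
health_result classify_health(size_t bytes_read, size_t bytes_wanted,
                              uint16_t vendor_id, uint32_t status_reg)
{
    if (bytes_read != bytes_wanted)
        return HEALTH_READ_FAILED;   /* "registers were not read" case */
    if (vendor_id == 0xFFFFu)
        return HEALTH_DEVICE_GONE;   /* no completion from the device  */
    if (status_reg & STATUS_FAULT_BIT)
        return HEALTH_BAD_STATUS;    /* "content shows problem" case   */
    return HEALTH_OK;
}
```

Anything other than HEALTH_OK would lead to the "declare the device as having a problem" path described above.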
This built-in health check method of the custom device has been working well for several years, since Windows 10 RS1 at least. It came to my attention that a crash in the driver has recently been reported in rare cases (approximately 1 in 1650). Investigation showed that the crash is caused by the GetBusData() call.
All variables were inspected and seem to be valid at the moment the crash occurs.

My feeling is that in normal cases the call to GetBusData() goes through without issues. In the crash case something goes wrong in the GetBusData() function in the parent device driver: the analyze command output claims the device object has been blocking an IRP for too long.

My question to the experts is: under which circumstances may the PCIe bus driver behave that way, i.e. when would a call to GetBusData() take too long?

Here is the call stack from the crash case:

STACK_TEXT:
ffffd18065baeab0 fffff80261d94f47 : ffff830fc2e49130 ffff830fbff0e080 0000000000000000 0000000000000000 : nt!KiSwapContext+0x76
ffffd18065baebf0 fffff80261d94ab9 : 0000000000000000 ffff830fc24dd2a0 0000000000000000 ffff830fc0ee6b00 : nt!KiSwapThread+0x297
ffffd18065baecb0 fffff80261d93840 : ffff830fc0ee6b00 ffff830f00000000 0000000000000000 ffffd18065baedc1 : nt!KiCommitThreadWait+0x549
ffffd18065baed50 fffff80275a0cae6 : ffffd18065baee60 ffffd18000000000 0000000000000000 0000000000000000 : nt!KeWaitForSingleObject+0x520
ffffd18065baee20 fffff80275a64c90 : 0000000000000000 ffffd18000000000 0000000000000000 0000000000000000 : Wdf01000!FxIoTarget::SubmitSync+0x192 [minkernel\wdf\framework\shared\targets\general\fxiotarget.cpp @ 1839]
ffffd18065baeee0 fffff80275a66158 : ffff830fc24dd2a0 0000000000000000 0000000000000001 fffff802780ea078 : Wdf01000!FxIoTargetSendIo+0x290 [minkernel\wdf\framework\shared\targets\general\fxiotargetapi.cpp @ 812]
ffffd18065baf150 fffff802780c9d3f : ffff830f00000002 303578303d746573 ffffd18065baf340 fffff802780e54fd : Wdf01000!imp_WdfIoTargetSendWriteSynchronously+0x38 [minkernel\wdf\framework\shared\targets\general\fxiotargetapi.cpp @ 1035]
ffffd18065baf1a0 0000000000000000 : 000000020000ff12 000002130000fe12 000001120000ff13 000000000200002c : nnnn!UpdateDeviceStatusInfo+0x42f [L:\drivers\nnnn\driver\device.cpp @ 2494]

Thanks,
Sergey

Pavel_A1 · October 25, 2019, 1:13am

What is the crash error code and parameters?

Let’s think a bit differently. You read the config space because the device is known to fail sometimes.
What if the “crash” occurs not because of the call to GetBusData, but because the device is in a bad state?

– pa

Peter_Viscarola_OSR · October 25, 2019, 4:17pm

Hmmm… I’m kinda lost.

The stack shown in the OP isn’t doing GetBusData. It seems to me to be a synchronous write operation to an I/O Target.

Soooo… what are we looking at/diagnosing?

Peter

rusakov2 · October 25, 2019, 6:21pm

Hi
Peter, according to MSDN, GetBusData is a pointer to some internal function for which I cannot see any details. All I have is the call stack as shown. Yes, the GetBusData function is not shown in the call stack.

Pavel, if your idea is right, then I cannot simply use GetBusData() when something inside the device has gone so bad that the parent device driver, that is, the PCIe bus driver, is unable to read configuration space. In that case the endpoint link should go down and trigger a surprise removal for the device driver, which did not seem to happen. One thought here is that the timer-driven call to UpdateDeviceStatusInfo() may be in progress while surprise removal is taking place, since the OnSurpriseRemoval() callback is one of the places where the timer that checks the device's internal health is stopped. If this happens by rare coincidence, then the behavior of GetBusData() for a device being surprise removed due to an internal error may be the issue.
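The suspected race between the timer-driven check and surprise removal can be illustrated with a small gate built on C11 atomics. This is only a portable sketch of the idea, not driver code: the health_gate type and its functions are hypothetical. In a real Windows driver the same effect is usually obtained from existing mechanisms such as remove locks, or in KMDF by stopping the timer with WdfTimerStop(Timer, TRUE) so that a running callback finishes first.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate shared by the timer callback and the removal path. */
typedef struct {
    atomic_bool removing;   /* set once when removal starts            */
    atomic_int  in_flight;  /* health checks currently past the gate   */
} health_gate;

void health_gate_init(health_gate *g)
{
    atomic_init(&g->removing, false);
    atomic_init(&g->in_flight, 0);
}

/* Call at the top of the timer callback. Returns false when removal
 * has started, in which case the config-space read must be skipped. */
bool health_gate_enter(health_gate *g)
{
    atomic_fetch_add(&g->in_flight, 1);      /* announce ourselves first */
    if (atomic_load(&g->removing)) {
        atomic_fetch_sub(&g->in_flight, 1);  /* back out, removal wins   */
        return false;
    }
    return true;
}

/* Call when the health check is done. */
void health_gate_leave(health_gate *g)
{
    atomic_fetch_sub(&g->in_flight, 1);
}

/* Call from the removal path. Returns true when no health check is in
 * flight; a real driver would wait until this becomes true. */
bool health_gate_begin_removal(health_gate *g)
{
    atomic_store(&g->removing, true);
    return atomic_load(&g->in_flight) == 0;
}
```

Because the check increments in_flight before testing removing, and removal sets removing before testing in_flight, at least one side always notices the other, so a read is never started after removal has been allowed to proceed.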

Thanks,
Sergey

Peter_Viscarola_OSR · October 25, 2019, 10:00pm

> GetBusData is a pointer to some internal function for which I cannot see any details.
> All I have is a call stack as shown. Yes GetBusData function is not shown in call stack.

Then you’re looking at the wrong call stack, the call stack for the wrong thread, correct? Might be a bit more useful/interesting to see the call stack where the problem is.

> One thought here may be that timer driven call to UpdateDeviceStatusInfo() is in progress while surprise removal is taking place

Well… again… surely you can know this from the crash dump, or at least from debugging a live system that you’ve caught in this situation.

As a diagnostic, you could map PCIe Extended Config Space into kernel virtual address space and see if … when you get this situation… you can successfully examine config space via that mapping.
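For the mapping approach, the register address inside the memory-mapped config window (ECAM) follows the standard PCIe layout of 4 KiB per function. The helper below is a small sketch with a hypothetical name. It assumes the ECAM base address has been obtained separately (for example from the ACPI MCFG table), that the bus number is relative to the start bus of that window, and that the region is mapped by some means such as MmMapIoSpace.

```c
#include <stdint.h>

/* Byte offset of a config register inside an ECAM window.
 * Layout: bus in bits 27:20, device in 19:15, function in 14:12,
 * register offset in 11:0 (4 KiB of config space per function). */
uint64_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg)
{
    return ((uint64_t)bus << 20) |
           ((uint64_t)(dev & 0x1F) << 15) |   /* 32 devices per bus      */
           ((uint64_t)(func & 0x07) << 12) |  /* 8 functions per device  */
           (uint64_t)(reg & 0x0FFF);          /* 4 KiB per function      */
}
```

Reading a dword at the mapped base plus this offset bypasses the bus driver's access path, which is what makes it useful as an independent cross-check. Offsets 0x100 and above fall in extended config space.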

As a total aside: why oh why do people persist in putting registers in extended config space?

ETA: Is this an FPGA? Can you use something like chipscope or signaltap or whatever to see what’s going on?

Peter

rusakov2 · October 28, 2019, 5:13pm

Hi,
Thank you again for your replies; they are very helpful, which is why I ask. The investigation is complete and the problem is solved.

Thank you, Pavel, for your hint. The issue happens on only one device in a batch of hundreds of identical ones, and it happens randomly. That should have rung a bell for me. It didn’t, sorry.
Peter, you are very right. I went too far in the wrong direction. The call stack is invalid. I was fooled by the fact that the calling function UpdateDeviceStatusInfo() calls only one function, and that is GetBusData(). But the call stack is clearly messed up. That should have stopped me, but it didn’t. Sorry again.

Now, after a weekend of additional test runs, the resolution came easily.
Combining the two conclusions above (the points addressed to Pavel and to Peter), the issue is in one specific device, which becomes hosed during the test. The cause could be flaky DDR, an overheated chipset, or strong EMI. Either way, that specific device (one in a batch) has been removed from production and sent to parts recycling. As no other identical devices reproduce the issue, no changes are needed in the driver.

And for asides.

> why oh why do people persist in putting registers in extended config space?

There are reasons, some valid, some not. There is manufacturing cost, and there is the cost of an extra IP block if you want one (the config space IP comes included).
Also, if you want to have a PCIe device health register somewhere other than config space, then your PCIe device must be enumerated first, i.e. have valid config space. Only after that can you access other device registers, if I read the PCIe specs correctly. So PCIe config space is the first place where you can store something device specific, and you can check it even when the other memory-mapped areas are dead.

For instance, suppose the PCIe peripheral device firmware has crashed. Then its watchdog will reset it, and the PCIe device restarts. This takes a short time, during which the PCIe device's other memory and registers are dead, but its PCIe config space may still be alive if it is driven by a separate, small, robust IP block - in theory.

Yes, it is an expensive aircraft-grade FPGA. Sadly, the architecture is limited to PCIe, since Windows 10 doesn’t support the other transports that the FPGA supports.

Thanks,
Sergey

Peter_Viscarola_OSR · October 29, 2019, 11:09am

> say if you want to have PCIe device health register somewhere not in config space, then your PCIe device must be enumerated first

That’s a good and helpful answer. Thanks.

And thanks for getting back to us with your ultimate solution. It’s much more satisfying knowing something about how a problem was resolved than just having the thread… die.