Reset/restore PCIe config area

Hello
This question might sound strange but if you have any idea I’d welcome it.
We have a PCIe card with a CPU that has an integrated PCIe controller (LS1046A, configured as an endpoint). If the CPU locks up, we need to reboot the PC to get the card running again, which is obviously not really usable. We can issue a hard reset to the CPU, but that also resets the whole config area where the BARs and MSIs are set up. That in turn kills communication with the PC (understandable). I tried to save and restore the whole PCIe area to work around this problem. With the BAR values restored, data exchange via memory is up and running again. However, sending an MSI still doesn’t work, even though the MSI address and data (as assigned by the RC) are restored as well, including the address translation unit used to send the write out onto the bus.
What could be the reason that the MSI interrupts still don’t work even after restoring the whole PCIe area? I also had to confirm and clear some errors in these registers but that didn’t help. Any other ideas?

(The LS1046A also offers core resets that reset only the cores and not the rest of the chip. That would leave the PCIe area untouched, but it only works if the cores are still responding. That kind of invalidates the reset as an emergency measure in the first place.)

Thanks
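
For the MSI part of the question above, one thing that may be worth ruling out (an assumption, not something confirmed anywhere in this thread): an endpoint only issues MSI writes when both Bus Master Enable in the Command register and MSI Enable in the MSI capability's Message Control register are set. Below is a minimal C sketch that checks both bits in a saved 256-byte snapshot of the config space. The helper names (`rd16`, `find_msi_cap`, `msi_can_fire`) are made up for illustration; the offsets and bit positions are the standard PCI ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND          0x04    /* Command register offset */
#define PCI_COMMAND_MASTER   0x0004  /* Bus Master Enable bit */
#define PCI_STATUS           0x06    /* Status register offset */
#define PCI_STATUS_CAP_LIST  0x0010  /* Capabilities List present */
#define PCI_CAPABILITY_PTR   0x34    /* Pointer to first capability */
#define PCI_CAP_ID_MSI       0x05    /* Capability ID for MSI */
#define PCI_MSI_FLAGS_ENABLE 0x0001  /* MSI Enable in Message Control */

/* Read a little-endian 16-bit value from the snapshot. */
static uint16_t rd16(const uint8_t *cfg, size_t off)
{
    assert(off + 1 < 256);
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Walk the capability list; return the MSI capability offset, or 0. */
static size_t find_msi_cap(const uint8_t cfg[256])
{
    if (!(rd16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    size_t off = cfg[PCI_CAPABILITY_PTR] & 0xFC;
    /* The guard counter protects against a corrupted, looping list. */
    for (int guard = 0; off >= 0x40 && guard < 48; guard++) {
        if (cfg[off] == PCI_CAP_ID_MSI)
            return off;
        off = cfg[off + 1] & 0xFC;   /* next-capability pointer */
    }
    return 0;
}

/* True only if the snapshot shows both enables needed for MSI writes. */
static bool msi_can_fire(const uint8_t cfg[256])
{
    size_t cap = find_msi_cap(cfg);
    if (cap == 0)
        return false;

    bool bus_master = (rd16(cfg, PCI_COMMAND) & PCI_COMMAND_MASTER) != 0;
    bool msi_enable = (rd16(cfg, cap + 2) & PCI_MSI_FLAGS_ENABLE) != 0;
    return bus_master && msi_enable;
}
```

Running `msi_can_fire` on a snapshot taken before the hard reset and on one taken after the restore would show whether either enable bit was lost along the way. It says nothing about state held on the host side.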

Well… you can’t just save/restore the PCI Config Space. It doesn’t BELONG to you. It’s the data area that’s set up and managed by the PCI Bus Driver.

What I think you want to do is to restart the device. Do what you need to do on the device itself, and then tell the system to restart the device by calling

WdfDeviceSetFailed(wdfDevice, WdfDeviceFailedAttemptRestart);

That should unload your driver, restart your device, and reload your driver. Is that what you want??

I sincerely hope you are talking about debugging purposes only here. On general principles, no piece of hardware should ever require a reset during normal operation, and no production driver should trigger a reset like this. If your motherboard includes PCIe hotplug support (only a few do), you can have the PCIe block drop off the bus and reappear, where it will be re-enumerated.

And if it IS for debugging purposes only, then it isn’t really that burdensome to restart the PC in this case, right? That time penalty is a good incentive to fix the bugs that are causing the CPU to lock up in the first place.

This is a CPU card where users can write and run their own (low-level) code, so a lockup is something we have to deal with (access to unmapped memory, blocked interrupts, etc.) and not just during debugging. I’m aware that I shouldn’t touch the PCIe area; that was kind of a last-resort measure.
I’ll gladly look into WdfDeviceSetFailed if this gives me an accepted way of restarting the card. What if I have two such cards in the same PC? Would unloading the driver also stop communication (temporarily) with the second card?
Thanks

On general principles, no piece of hardware should ever require a reset during normal operation

I would have agreed with that a year or two ago. And then I spent a lot of time, almost too much time, with high-end FPGAs that can get reconfigured.

OK… niche market… but, still. There is sometimes a legit need to do a complete reset of a device, with a resulting (sometimes dramatic) config space change.

What if I have two such cards in the same PC? Would unload the driver also stop communication (temporarily) to the second card?

Device objects are all entirely separate. The fact that they happen to use the same driver is an irrelevant detail.

I don't know if I'm missing something. I tried calling WdfDeviceSetFailed, but then I lost the connection to my card, even though the card itself was still working fine. As it also has an Ethernet port, I could look inside and check that everything was fine. The PCIe area also looked the same (at first glance, no thorough check). Do I need to do something else after calling this function?

Hmmm... well... yes, you WILL lose connection to your device. That's kind of the point.

Your device will be re-enumerated and your driver will be reloaded for that device instance.

Is that not what you wanted to happen?

I thought that it would re-establish the connection, kind of the same as after a reboot but without rebooting the PC. I don't mind losing the connection temporarily; that's the point of a reset. But now I don't get the connection back as I thought I would. Do I need to handle some events in the driver (other than those already present)? Do I need to do something else to get the connection back after enumeration (on the PC side)?

Well, no… you shouldn’t have to do anything special. You’ve read the docs for WdfDeviceSetFailed, right? Specifying “attempt restart” in that second parameter should be all you need to get your driver called again.

So, hmmm…

Yes, I did read about it and used it as you suggested. I don't know yet why exactly communication doesn't work anymore after calling it. But if it doesn't work while everything is in a normal running state, I doubt it will work any better after issuing a reset on the card.
Without any other (hardware) changes in the PC, should the card get the same resources (BAR addresses, MSI numbers) as the first time, or can they change during re-enumeration?

Sorry that it's not helping.

Yes, the resources can change.

In fact, that's the reason I used it: I programmed the card to use different resources.
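
Since the resources can change, a more thorough version of the "looked the same at first glance" check mentioned earlier could compare the BAR registers in two config-space snapshots, one taken before the restart and one after. The following is a minimal sketch in C, assuming a type 0 header and 256-byte snapshots; `rd32` and `changed_bars` are hypothetical helper names, not part of any driver framework.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_BAR0     0x10  /* Offset of BAR0 in a type 0 header */
#define PCI_NUM_BARS 6     /* BAR0..BAR5 occupy 0x10..0x27 */

/* Read a little-endian 32-bit value from the snapshot. */
static uint32_t rd32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off]
         | ((uint32_t)cfg[off + 1] << 8)
         | ((uint32_t)cfg[off + 2] << 16)
         | ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Compare the six BAR dwords of two snapshots.
 * Returns a bitmask in which bit n is set when BARn differs.
 * A 64-bit BAR spans two consecutive dwords, so a change in it
 * can set one or two adjacent bits.
 */
static unsigned changed_bars(const uint8_t before[256],
                             const uint8_t after[256])
{
    unsigned mask = 0;
    for (int n = 0; n < PCI_NUM_BARS; n++) {
        size_t off = PCI_BAR0 + 4 * (size_t)n;
        if (rd32(before, off) != rd32(after, off))
            mask |= 1u << n;
    }
    return mask;
}
```

A non-zero result would mean the host assigned different addresses the second time, so anything on the card that cached the old values would have to be refreshed.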