How to get details of PCI device enumeration when Windows starts up

Hi everyone,
We made a motherboard. A PCIe switch (1 upstream port, 4 downstream ports) is used. Each downstream port directly leads to a PCIe slot.

One problem has plagued us for a long time: slot No. 2 does not recognize the device, and this happens on multiple motherboards.
We checked the power supply and wiring of slot 2, and everything looked fine. We also carefully reviewed the PCB design and layout; slot 2 is no different from the other slots.
So far we have found nothing suspicious in the hardware, so I wonder whether I can get further fault information through the operating system.
Can I get the details of the device enumeration by enabling some operating system setting or by kernel debugging?
Thanks.

For this type of issue I would probably start my troubleshooting at the EFI shell. There’s a pci command that can help you identify enumerated devices and read/write config space to your heart’s desire.

Thank you for your advice.
I used the pci command and saw that the device in slot No. 2 is online and that its configuration space can be read. However, after booting into Windows, the slot No. 2 device is still not visible in Device Manager. When I right-clicked in Device Manager and rescanned for hardware changes, the slot No. 2 device appeared! Running our test application showed that the board then operates normally.
As a comparison, I plugged a Samsung PCIe SSD into slot 2. That device showed up properly both in the EFI shell and in Windows.
I don’t know what else could be wrong in this situation.

It’s interesting that doing a refresh in Device Manager caused the device to appear. That makes me suspect something around the hot-plug configuration of either the switch or the device. I have a vague memory of debugging a similar issue for a PCIe device behind Thunderbolt. Look for the hot-plug capable or hot-plug surprise fields in the PCI Express capability of the switch downstream ports. I think if a port isn’t marked as hot-plug capable, Windows may not check it for changes after the initial PCIe bus enumeration.
Take the above with a big grain of salt but hope it helps.
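
For anyone who wants to decode those fields from a raw register dump, here is a minimal sketch. It assumes the standard PCI Express Capability layout (Slot Capabilities is the 32-bit register at offset 0x14 of the capability, Slot Status the 16-bit register at offset 0x1A). The struct and function names are made up for illustration, and the register values in the usage note are hypothetical, not taken from this board.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions in the PCI Express Capability structure. */
#define SLOT_CAP_HOTPLUG_SURPRISE  (1u << 5)   /* Slot Capabilities, bit 5 */
#define SLOT_CAP_HOTPLUG_CAPABLE   (1u << 6)   /* Slot Capabilities, bit 6 */
#define SLOT_CAP_PHYS_SLOT_SHIFT   19          /* Slot Capabilities, bits 31:19 */
#define SLOT_STS_PRESENCE_CHANGED  (1u << 3)   /* Slot Status, bit 3 */
#define SLOT_STS_PRESENCE_DETECT   (1u << 6)   /* Slot Status, bit 6 */

/* Hypothetical helper type: the fields relevant to this discussion. */
struct slot_info {
    bool hotplug_surprise;
    bool hotplug_capable;
    bool card_present;
    bool presence_changed;
    unsigned physical_slot;
};

/* Decode raw Slot Capabilities and Slot Status values. */
static struct slot_info decode_slot(uint32_t slot_cap, uint16_t slot_sts)
{
    struct slot_info s;
    s.hotplug_surprise = (slot_cap & SLOT_CAP_HOTPLUG_SURPRISE) != 0;
    s.hotplug_capable  = (slot_cap & SLOT_CAP_HOTPLUG_CAPABLE) != 0;
    s.physical_slot    = slot_cap >> SLOT_CAP_PHYS_SLOT_SHIFT;
    s.card_present     = (slot_sts & SLOT_STS_PRESENCE_DETECT) != 0;
    s.presence_changed = (slot_sts & SLOT_STS_PRESENCE_CHANGED) != 0;
    return s;
}

/* Print one line per port so two ports can be compared side by side. */
static void print_slot(const char *label, const struct slot_info *s)
{
    assert(label != NULL && s != NULL);
    printf("%s: slot=%u hp_capable=%d hp_surprise=%d present=%d changed=%d\n",
           label, s->physical_slot, s->hotplug_capable, s->hotplug_surprise,
           s->card_present, s->presence_changed);
}
```

Feed it the two values read from each downstream port (for example with the EFI shell pci command or RWEverything) and compare the output for a working port against the failing one.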

You can use the tool RWEverything to examine the link state and all the other PCIe registers.
In WinDbg there are !pcitree and !pci, which will let you do similar things.
I’m guessing you don’t have a PCIe protocol analyzer or you would already have used it. If you are in the Portland area, let me know; I have access to one.

Eric

Sorry for not responding promptly. I followed your instructions to check the relevant registers of the switch ports and found something strange.
The correct one (my No. 1 slot):


The incorrect one (my No. 2 slot):

They have different Slot Status values. :disappointed:
In fact, my board was not designed for hot-plug, in either hardware or software.

@coder_lizzoe said:
They have different Slot Status values. :disappointed:
In fact, my board was not designed for hot-plug, in either hardware or software.

These are the downstream ports of the PLX switch, correct? I would expect that the PLX switch should allow you to configure these settings for each port.

As for designing your HW for hot plug, there is hot-plug and there is **Hot-Plug**. In the context of your issue, having the slot advertise hot-plug capabilities should be sufficient. If not then your other slots would also have issues. The hot-plug support tells the OS it should expect and watch for devices coming and going on this slot instead of just enumerating it on boot and expecting it to never change.

On the SW side, I would urge anyone writing drivers for PCIe devices to implement at least basic hot-plug support. The reason for this is Thunderbolt. Thunderbolt is at least partly a tree of PCIe switches with devices coming and going on the Thunderbolt (PCIe) slots. The chance that a PCIe product could end up either built into a Thunderbolt dock or used by an end user in a Thunderbolt-to-PCIe expansion chassis is non-zero and growing fast. I knew about the hot-plug capability on the switch port from debugging a failure reported for a PCIe product that was put in a Thunderbolt dock.

Eric

I would urge anyone writing drivers for PCIe devices to implement at least basic hot-plug support

I’m guessing that what you mean by this is that you urge folks to implement basic PnP support, right? That is best practice, I agree.

You have to expect dynamic removes of your device (a call to Exit D0 and then a call to Release Hardware, speaking in WDF terms) for any general purpose PCI family device. This allows your device to properly accommodate resource rebalancing operations. Plus, you want to be able to support dynamically disabling/unloading your device via Device Manager.

Are you advocating folks implement more than this?

Peter

Peter said:

Are you advocating folks implement more than this?

Yes I am. Your response made me realize I could be much more specific about what I suggest and why.

From the experience I had updating a PCIe driver to work through Thunderbolt and other transports where hot removal is common, two situations caused most of the issues. Let me also acknowledge that both of these should be checked for and handled correctly in all drivers.

  1. Accessing registers in the device, both config space and memory mapped:
    Make sure you validate return data carefully. In general, a read of an MMIO register should return all F’s (0xFFFFFFFF) if the PCIe transaction fails, so check for that at a minimum. A good example of this is walking a capability list, either in config space or MMIO. The first part of each capability contains an offset to the next capability. If the code takes 0xFF as a valid offset and tries to access the next capability, where it finds an offset of 0xFF again, things can go bad quickly.

  2. Waiting for a device without a timeout.
    Even if the HW will always return an event or interrupt after a specific time, code for the case when the hardware never responds.
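
To make the capability-walk example from item 1 concrete, here is a minimal sketch of a defensive walk over the standard PCI capability list in config space. The read callback, the return codes, and the sample config space are hypothetical stand-ins so the example runs on its own; a real driver would use whatever config-space accessor its platform provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_NOT_FOUND      0    /* list ended without a match */
#define CAP_DEVICE_ERROR  (-1)  /* device gone or list malformed */

/* Hypothetical accessor: read one byte of PCI config space. */
typedef uint8_t (*cfg_read8_fn)(const void *ctx, uint8_t offset);

/* Walk the capability list. Returns the offset of wanted_id,
 * CAP_NOT_FOUND, or CAP_DEVICE_ERROR. */
static int find_capability(cfg_read8_fn read8, const void *ctx, uint8_t wanted_id)
{
    assert(read8 != NULL);

    /* A vendor ID of 0xFFFF means nothing answered the read. */
    if (read8(ctx, 0x00) == 0xFF && read8(ctx, 0x01) == 0xFF)
        return CAP_DEVICE_ERROR;

    /* Status register (offset 0x06), bit 4: capability list present. */
    if ((read8(ctx, 0x06) & 0x10) == 0)
        return CAP_NOT_FOUND;

    /* Capabilities pointer; the low two bits are reserved. */
    uint8_t ptr = read8(ctx, 0x34) & 0xFC;

    /* At most 48 dword-aligned capabilities fit in 0x40..0xFF, so a
     * longer walk means the list loops. */
    for (int guard = 0; ptr != 0 && guard < 48; guard++) {
        if (ptr < 0x40)
            return CAP_DEVICE_ERROR;      /* points into the header */

        uint8_t id   = read8(ctx, ptr);
        uint8_t next = read8(ctx, (uint8_t)(ptr + 1));
        if (id == 0xFF)
            return CAP_DEVICE_ERROR;      /* failed read, stop walking */
        if (id == wanted_id)
            return ptr;
        ptr = next & 0xFC;
    }
    return ptr == 0 ? CAP_NOT_FOUND : CAP_DEVICE_ERROR;
}

/* Stand-in accessors for illustration only. */
static uint8_t array_read8(const void *ctx, uint8_t offset)
{
    return ((const uint8_t *)ctx)[offset];
}

static uint8_t gone_read8(const void *ctx, uint8_t offset)
{
    (void)ctx; (void)offset;
    return 0xFF;                          /* what a removed device returns */
}

/* Made-up config space: PM capability at 0x40, PCIe capability at 0x50. */
static const uint8_t sample_cfg[256] = {
    [0x00] = 0xB5, [0x01] = 0x10, [0x06] = 0x10, [0x34] = 0x40,
    [0x40] = 0x01, [0x41] = 0x50,
    [0x50] = 0x10, [0x51] = 0x00,
};
```

With the sample data, looking up ID 0x10 (PCI Express) finds the capability at offset 0x50, while the same lookup through gone_read8 stops immediately with an error instead of chasing 0xFF offsets.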

Some things I’ve found helpful:

  • Use a function to check if the HW is actually still present. The function reads a register in the device that always has the same value and checks for that value. If the value is wrong, set a variable to record that the device is gone and don’t check again if that variable is set.
  • Skip device cleanup routines, where appropriate, once the device is known to be gone.
  • Review timeout values and shorten the appropriate ones when the device is known to be gone.

In general, a read of an MMIO register should return all F’s (0xFFFFFFFF) if the PCIe transaction fails, so check for that at a minimum.

What a déjà vu moment. I had to do all of this when I adapted PCI drivers to PCMCIA and CardBus, let’s recall… 200 years ago?

– pa