PCIE device's BAR0 cleared after WHEA event 17 corrected error on root port

vx600 Member Posts: 1

I'm bringing up a new board that integrates an ancient proprietary ASIC supporting only PCIE Gen1. The ASIC has been problem-free for years in other designs, and its PCIE IP is from Synopsys. In the latest design, I'm seeing a frequent problem in which the device's BAR0 address gets zeroed out at random intervals (typically less than 30 minutes apart), which cuts off access to the device's registers. Every time this occurs, a WHEA event ID 17 corrected error (classified as a warning) is logged in Event Viewer for the PCIE root port to which the device is connected:

A corrected hardware error has occurred.

Component: PCI Express Root Port
Error Source: Generic

Primary Bus:Device:Function: 0xE2:0x2:0x0
Secondary Bus:Device:Function: 0xE3:0x3:0x0
Primary Device Name:PCI\VEN_8086&DEV_347A&SUBSYS_00008086&REV_04
Secondary Device Name:
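
To correlate many of these events with the times BAR0 goes to zero, the Bus:Device:Function fields can be pulled out of the event text with a script. The following is only a minimal sketch that assumes the event text has the layout shown above; the function name `parse_whea_bdf` is made up for illustration.

```python
import re

# Matches lines such as "Primary Bus:Device:Function: 0xE2:0x2:0x0".
BDF_LINE = re.compile(
    r"^(Primary|Secondary) Bus:Device:Function:\s*"
    r"0x([0-9A-Fa-f]+):0x([0-9A-Fa-f]+):0x([0-9A-Fa-f]+)\s*$",
    re.MULTILINE,
)


def parse_whea_bdf(event_text):
    """Return {"Primary": (bus, dev, fn), "Secondary": (bus, dev, fn)} for the lines found."""
    result = {}
    for role, bus, dev, fn in BDF_LINE.findall(event_text):
        result[role] = (int(bus, 16), int(dev, 16), int(fn, 16))
    return result


# Sample text copied from the event above.
event_text = """Component: PCI Express Root Port
Primary Bus:Device:Function: 0xE2:0x2:0x0
Secondary Bus:Device:Function: 0xE3:0x3:0x0
"""
print(parse_whea_bdf(event_text))
```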

The failure occurs even when the board is sitting idle, aside from periodic BAR0 reads and writes that check whether the registers are still accessible. After the failure, Device Manager continues to report that the device is functioning normally, but BAR0 remains zeroed. Disabling and re-enabling the device restores access. The OS version is Windows 10 IoT Enterprise LTSC 21H2 build 19044.1645.
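
One way to narrow down what "BAR0 zeroed" means is to look at the rest of the device's configuration header at the moment of failure. The sketch below is an illustration only: it assumes a raw snapshot of the first 64 bytes of a type 0 config header has already been captured by some tool (how is not shown), and it assumes BAR0 is a 32-bit memory BAR. The function names, the category strings, and the IDs in the sample headers are all hypothetical. It relies on three standard PCI facts: Vendor ID is at offset 0x00, the Command register at 0x04 has Memory Space Enable in bit 1, and BAR0 is at 0x10. Reads from a device that is not responding return all ones.

```python
import struct

MEMORY_SPACE_ENABLE = 0x0002  # Command register bit 1


def classify_config_header(header):
    """Classify a snapshot of a type 0 PCI config header (assumes a 32-bit memory BAR0)."""
    if len(header) < 0x14:
        raise ValueError("need at least the first 0x14 bytes of config space")
    vendor_id, _device_id, command = struct.unpack_from("<HHH", header, 0x00)
    (bar0,) = struct.unpack_from("<I", header, 0x10)

    if vendor_id == 0xFFFF:
        # All-ones reads: the device is not answering config cycles at all.
        return "no-response"
    base = bar0 & 0xFFFFFFF0  # low 4 bits of a memory BAR are type/prefetch flags
    if base == 0 and not (command & MEMORY_SPACE_ENABLE):
        # BAR and Command both at reset defaults: consistent with the device having been reset.
        return "looks-reset"
    if base == 0:
        # Decoding still enabled but the address is gone: only the BAR was rewritten.
        return "bar-cleared"
    return "programmed"


def make_header(vendor_id, device_id, command, bar0):
    """Build a 64-byte sample header for demonstration (values are made up)."""
    header = bytearray(64)
    struct.pack_into("<HHH", header, 0x00, vendor_id, device_id, command)
    struct.pack_into("<I", header, 0x10, bar0)
    return bytes(header)


print(classify_config_header(make_header(0x1234, 0x5678, 0x0000, 0x00000000)))
```

If the snapshot classifies as "looks-reset", that would hint at a device-side reset (for example after the link went down) rather than software rewriting the BAR alone, but this is a heuristic, not proof of either.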

The frequency of the problem varies across CPUs and PCIE slots but seems particularly bad with Xeon 4310s. I think it's likely there's a problem with the latest board design, and the hardware engineers are investigating, but in the meantime I'd like to understand what's going on with BAR0 being cleared. Is this action taken by the OS, the CPU's PCIE subsystem, or the device itself? Why is this happening in response to a "corrected" error, and can this action be disabled? Is this considered a surprise removal? Googling around, it seems these WHEA event 17s are pretty common among gamers stressing their GPUs, but the devices don't seem to be incapacitated when they occur.

Has anyone seen a similar problem on past projects?
