Fixing PCI BIOS enumeration problems in a device driver. (Long)

All,

Our company has recently designed a couple of products that use PCI-to-PCI
bridges. If we put one card in a system, everything usually works fine.
With two cards, one VIA-based board fails. With four cards in a system, all
Intel boards fail.

It appears that there is a bug in the motherboard BIOS during PCI
enumeration. Depending on the board maker (and probably the original BIOS
version), it fails to initialize some of our devices behind the bridge, the
bridge itself, or one of the devices on the primary bus. Interestingly, all
these failures occur on an AMI (American Megatrends, Inc.) BIOS.

I should also point out that we’re dealing with two separate bridges. One
board uses a DEC/Intel bridge; the other uses a DEC/Intel bridge with a HINT
bridge behind it.

I’ve implemented a workaround for our NT4 driver that seems to work – it
re-enumerates the devices on its own, based on the assignments it found
during an initial scan of the buses – at least when only the devices behind
the bridge are improperly initialized. Unfortunately the fix doesn’t work
under W2K. I use HalSetBusData() to reprogram the device and that works
fine; but HalAssignSlotResources() appears to use a cached copy of the
config space and simply writes the wrong data back to the device.
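For anyone attempting the same kind of re-enumeration: before a scan can hand out addresses, it has to learn how much space each BAR decodes. Below is a minimal sketch of the standard write-all-ones sizing probe. It runs against a simulated device so it can be tried in user mode. `SimDevice`, `cfg_read32`, `cfg_write32` and `pci_probe_bar_size` are hypothetical names made up for this illustration, and this is not the code from the driver described above. In a real driver the two accessors would presumably wrap the HAL bus-data calls such as HalSetBusData(). 64-bit memory BARs are not handled.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_BAR0_OFFSET 0x10u
#define PCI_BAR_COUNT   6u

/* Hypothetical user-mode stand-in for one function's config space. */
typedef struct {
    uint32_t regs[64];                /* 256 bytes of config space */
    uint32_t bar_size[PCI_BAR_COUNT]; /* bytes decoded per BAR, 0 = not implemented */
} SimDevice;

uint32_t cfg_read32(const SimDevice *d, unsigned off)
{
    assert(off % 4 == 0 && off < 256);
    return d->regs[off / 4];
}

/* Model BAR behavior: flag bits are read-only and address bits below the
 * decoded size are hardwired to zero. Other registers store what is written. */
void cfg_write32(SimDevice *d, unsigned off, uint32_t value)
{
    assert(off % 4 == 0 && off < 256);
    if (off >= PCI_BAR0_OFFSET && off < PCI_BAR0_OFFSET + 4 * PCI_BAR_COUNT) {
        unsigned bar = (off - PCI_BAR0_OFFSET) / 4;
        uint32_t old = d->regs[off / 4];
        uint32_t flag_mask = (old & 1u) ? 0x3u : 0xFu;
        if (d->bar_size[bar] == 0)
            return;                   /* unimplemented BAR always reads 0 */
        uint32_t addr_mask = ~(d->bar_size[bar] - 1u) & ~flag_mask;
        d->regs[off / 4] = (value & addr_mask) | (old & flag_mask);
        return;
    }
    d->regs[off / 4] = value;
}

/* Return the number of bytes a 32-bit BAR decodes, or 0 if it is not
 * implemented. On real hardware, decoding should be switched off in the
 * Command register while the probe value is in the BAR. */
uint32_t pci_probe_bar_size(SimDevice *d, unsigned bar)
{
    assert(bar < PCI_BAR_COUNT);
    unsigned off = PCI_BAR0_OFFSET + 4 * bar;
    uint32_t saved = cfg_read32(d, off);

    cfg_write32(d, off, 0xFFFFFFFFu);
    uint32_t probe = cfg_read32(d, off);
    cfg_write32(d, off, saved);       /* put the original assignment back */

    if (probe == 0)
        return 0;

    /* Strip the flag bits: 2 for I/O BARs, 4 for memory BARs. */
    uint32_t mask = (probe & 1u) ? (probe & ~0x3u) : (probe & ~0xFu);

    /* I/O BARs may implement only the low 16 address bits. */
    if ((probe & 1u) && (mask & 0xFFFF0000u) == 0)
        mask |= 0xFFFF0000u;

    return ~mask + 1u;
}
```

With a simulated device whose BAR0 decodes 1 MB of memory, `pci_probe_bar_size(&dev, 0)` returns 0x100000 and leaves the register holding whatever it held before the probe.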

At this point, I wonder if the cached copy used by HalAssignSlotResources()
isn’t simply part of the data written to the RESOURCEMAP key. But since we
have the additional problem of some versions of the BIOS not properly
assigning the bridge resources – this only occurs on the card that has a
bridge behind another bridge – I’m wondering if I shouldn’t just create a
separate driver that re-enumerates the PCI buses ahead of the HAL/PCI.SYS,
or if I should create some kind of filter driver for PCI.SYS.
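To make the idea of re-enumerating the buses concrete, here is a rough sketch of a depth-first walk that follows bridges behind bridges. It is only an illustration under stated assumptions: the topology is simulated in memory, `SimFunction`, `SimTopology`, `sim_cfg_read32` and `pci_scan_bus` are hypothetical names, and the device/vendor ID returned is a placeholder. The walk also trusts the secondary bus numbers already programmed into each bridge, so a bridge the BIOS left unprogrammed would simply not be descended into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One simulated PCI function; a real driver would read hardware instead. */
typedef struct {
    uint8_t bus, dev, fn;
    uint8_t header_type;   /* bit 7 = multi-function, low 7 bits = layout */
    uint8_t secondary_bus; /* meaningful for bridges (type 1) only */
} SimFunction;

typedef struct {
    const SimFunction *fns;
    size_t count;
} SimTopology;

typedef struct {
    unsigned functions;    /* functions that responded */
    unsigned bridges;      /* PCI-to-PCI bridges among them */
    unsigned max_bus;      /* highest bus number visited */
} ScanResult;

/* Return a config dword, or all ones when nothing responds. */
uint32_t sim_cfg_read32(const SimTopology *t, unsigned bus, unsigned dev,
                        unsigned fn, unsigned off)
{
    assert(off % 4 == 0 && off < 256);
    for (size_t i = 0; i < t->count; i++) {
        const SimFunction *f = &t->fns[i];
        if (f->bus != bus || f->dev != dev || f->fn != fn)
            continue;
        switch (off) {
        case 0x00: return 0x00011234u;                    /* placeholder IDs */
        case 0x0C: return (uint32_t)f->header_type << 16; /* byte 0x0E */
        case 0x18: return (uint32_t)f->secondary_bus << 8;/* byte 0x19 */
        default:   return 0;
        }
    }
    return 0xFFFFFFFFu;
}

/* Walk one bus and recurse through every bridge found on it. */
void pci_scan_bus(const SimTopology *t, unsigned bus, ScanResult *out)
{
    assert(bus < 256);
    if (bus > out->max_bus)
        out->max_bus = bus;

    for (unsigned dev = 0; dev < 32; dev++) {
        for (unsigned fn = 0; fn < 8; fn++) {
            uint32_t id = sim_cfg_read32(t, bus, dev, fn, 0x00);
            if ((id & 0xFFFFu) == 0xFFFFu) {
                if (fn == 0)
                    break;            /* no device in this slot */
                continue;             /* gap in a multi-function device */
            }
            out->functions++;

            uint32_t hdr = (sim_cfg_read32(t, bus, dev, fn, 0x0C) >> 16) & 0xFFu;
            if ((hdr & 0x7Fu) == 0x01u) {
                unsigned secondary =
                    (sim_cfg_read32(t, bus, dev, fn, 0x18) >> 8) & 0xFFu;
                out->bridges++;
                if (secondary > bus)  /* skip unprogrammed or looping bridges */
                    pci_scan_bus(t, secondary, out);
            }
            if (fn == 0 && !(hdr & 0x80u))
                break;                /* single-function device */
        }
    }
}
```

Calling `pci_scan_bus(&topology, 0, &result)` with a zeroed `ScanResult` visits bus 0 and everything reachable through programmed bridges. A real re-enumerator would also have to assign bus numbers itself wherever the BIOS left them unset.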

Do any of you know if anything like this has already been done? I find it
hard to believe that we’re the first ones to encounter this problem. My
questions boil down to:

  1. On NT4, how do I get a driver to load before the HAL?
  2. On W2K, how do I get a driver to load ahead of PCI.SYS?
  3. On W2K, can I create a filter driver for PCI.SYS and if so, is that
    even a reasonable approach to the problem?
  4. On W2K, do you know of some way around the problem with
    HalAssignSlotResources() caching config data?
  5. On W98, you can fail the resource assignments and force PCI.VXD to
    stop all the other devices on your bus and force a re-enumeration. Is there
    any way to do the same in W2K?

Any source you could point me to that would get me started would be helpful.
And yes, I have the DDKs, MSDN subscriptions and NuMega’s DriverWorks if
referring to any of that would be helpful.

BTW: I’ve found that W2K/HalAssignSlotResources() does attempt to allocate
BARs when it finds they’re NULL. However, I’ve also found that if all the
devices on the bus are not initialized, it will erroneously assign the first
device an illegal BAR that starts just beyond the parent bridge’s limit for
that space type. This is a most annoying bug. But since
HalAssignSlotResources doesn’t bother to assign an IRQ pin, a Latency Timer
value, nor turn on any address spaces in the Command register, this problem
is kind of trivial in comparison. It also leads me to believe that even if
this feature is available, PCI.SYS can’t be trusted to handle device
reassignment/re-enumeration and I’m going to have to do it myself.
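The "illegal BAR just beyond the parent bridge's limit" case can be detected mechanically. The sketch below decodes a bridge's non-prefetchable memory window from the Memory Base (offset 0x20) and Memory Limit (offset 0x22) registers of a type 1 header and checks whether a BAR assignment fits inside it. `PciWindow`, `pci_bridge_mem_window` and `pci_window_contains` are hypothetical helper names for illustration. The I/O and prefetchable windows would need the same treatment with their own registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range forwarded by a PCI-to-PCI bridge. */
typedef struct {
    uint32_t base;
    uint32_t limit;
} PciWindow;

/* Decode the memory window of a type 1 header. The top 12 bits of each
 * 16-bit register hold address bits 31:20, so the window is 1 MB aligned
 * and the limit implicitly ends in 0xFFFFF. */
PciWindow pci_bridge_mem_window(uint16_t mem_base_reg, uint16_t mem_limit_reg)
{
    PciWindow w;
    w.base  = (uint32_t)(mem_base_reg & 0xFFF0u) << 16;
    w.limit = ((uint32_t)(mem_limit_reg & 0xFFF0u) << 16) | 0x000FFFFFu;
    return w;
}

/* True when [bar_addr, bar_addr + bar_size) lies entirely inside the window. */
bool pci_window_contains(PciWindow w, uint32_t bar_addr, uint32_t bar_size)
{
    assert(bar_size != 0);
    if (w.base > w.limit)
        return false;                 /* base above limit: window is closed */
    if (bar_addr < w.base || bar_addr > w.limit)
        return false;
    return bar_size - 1u <= w.limit - bar_addr;
}
```

As an example, base and limit registers both reading 0xFE00 describe the window 0xFE000000 through 0xFE0FFFFF. A 64 KB BAR at 0xFE000000 fits, while a BAR placed at 0xFE100000, the first address past the limit, does not.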

And before you tell me to tell our customers to get a BIOS fix, let me point
out that these BIOSes are very common (Intel uses them) and there isn’t
currently a fix for the problem. Also, one of these boards is a redesign of
an older product, so telling our customers to throw out their inventory of
motherboards won’t be very well received. Since the problem is so
widespread, I don’t see how we can avoid addressing the problem ourselves
at the driver level.

Thanks in advance,

  • Joel Corley

> And before you tell me to tell our customers to get a BIOS fix, let me point
> out that these BIOSes are very common (Intel uses them) and there isn’t
Have you switched “PnP OS” to “No” in BIOS setup?

Max

Max,

I’ve tried both. It makes no difference.

The root failure occurs when the BIOS initially goes through its discovery,
enumeration and assignment process of the PCI bus. I’ve observed the
behavior after-the-fact using a DOS boot disk and a PCI configuration space
dump utility. I’ve also probed the bus with a logic analyzer during POST and
found that the BIOS explicitly writes zeros to the aforementioned registers
in the configuration space – explicitly disabling our devices. Ironically,
it allocates plenty of space for these disabled devices in the parent
bridge.
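A check of this kind can be automated over a 256-byte config space dump. The following is a minimal sketch, assuming the zeroed registers in question are the BARs and the Command register of a type 0 header. `pci_function_present` and `pci_function_left_disabled` are hypothetical names, and a function that legitimately implements no BARs would be reported as disabled too, so treat the result as a hint rather than a verdict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    PCI_VENDOR_ID      = 0x00,
    PCI_COMMAND        = 0x04,
    PCI_HEADER_TYPE    = 0x0E,
    PCI_BAR0           = 0x10,
    PCI_CMD_IO_ENABLE  = 0x1,  /* Command bit 0: respond to I/O cycles */
    PCI_CMD_MEM_ENABLE = 0x2   /* Command bit 1: respond to memory cycles */
};

/* Config space is little-endian; assemble values byte by byte. */
uint16_t cfg16(const uint8_t *cfg, size_t off)
{
    assert(off + 2 <= 256);
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

uint32_t cfg32(const uint8_t *cfg, size_t off)
{
    assert(off + 4 <= 256);
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* A vendor ID of 0xFFFF means nothing answered the config cycle. */
bool pci_function_present(const uint8_t *cfg)
{
    return cfg16(cfg, PCI_VENDOR_ID) != 0xFFFFu;
}

/* True when a type 0 function responds but has both decode enables clear
 * and no address programmed into any of its six BARs. */
bool pci_function_left_disabled(const uint8_t *cfg)
{
    if (!pci_function_present(cfg))
        return false;
    if ((cfg[PCI_HEADER_TYPE] & 0x7F) != 0)
        return false;                 /* bridges are not handled here */
    if (cfg16(cfg, PCI_COMMAND) & (PCI_CMD_IO_ENABLE | PCI_CMD_MEM_ENABLE))
        return false;

    for (size_t bar = 0; bar < 6; bar++) {
        uint32_t raw = cfg32(cfg, PCI_BAR0 + 4 * bar);
        uint32_t addr = (raw & 1u) ? (raw & ~0x3u) : (raw & ~0xFu);
        if (addr != 0)
            return false;             /* at least one BAR was assigned */
    }
    return true;
}
```

Running `pci_function_left_disabled()` over each function's dump gives a quick list of devices that were skipped during POST, which can then be compared against the space reserved for them in the parent bridge.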

Something I may not have mentioned is that some of these systems become
somewhat unstable when we have “too many” of our cards in the box.
Specifically, the Intel 845 motherboards have a nasty tendency to get into
a state where they simply report “Boot Failure”, even though one or more
boot devices are attached.

Once the Intel board enters this mode, you have to strip all the cards from
the system for it to recover. Once it recovers, it seems to work normally
again. The problem only shows up when we put a fourth or fifth card in the
system and only occasionally. Once we get one of these motherboards to boot
with four (4) or five (5) cards, it operates normally – except that it
fails to properly initialize the last few devices during POST.

FWIW: Our boards have as many as 10 PCI devices per slot – adding as many
as 60 PCI devices in a single Intel D845EBT motherboard. (In practice you
can’t use more than 5 cards because you’ll need a high-performance storage
system to support so many devices.) I suspect that the AMI BIOS is using a
fixed size table for its enumeration and is simply overflowing that table.
Past products used a separate coprocessor and a non-transparent bridge; this
is our first generation of product to use so many devices behind a
transparent bridge.

  • Joel Corley

http://www.enseo.com

“Maxim S. Shatskih” wrote in message
news:xxxxx@ntdev…
>
> > And before you tell me to tell our customers to get a BIOS fix, let me point
> > out that these BIOSes are very common (Intel uses them) and there isn’t
>
> Have you switched “PnP OS” to “No” in BIOS setup?
>
> Max