I need a little help on interpreting a BUGCHECK WHEA_UNCORRECTABLE_ERROR
(124). If anyone could provide tips on how to interpret the
WHEA_ERROR_RECORD, I would appreciate it. It appears to be some sort of
PCIe protocol error.
((&->
Here is a copy of the bugcheck and dump of the WHEA_ERROR_RECORD:
2: kd> !analyze -v
ERROR: FindPlugIns 8007007b
*******************************************************************************
*
*
* Bugcheck Analysis
*
*
*
*******************************************************************************
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000005, Generic Error
Arg2: fffffa81391a5028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000
Debugging Details:
BUGCHECK_STR: 0x124_5
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
PROCESS_NAME: System
CURRENT_IRQL: f
STACK_TEXT:
fffff8800247c2b8 fffff80001a21a3b : 0000000000000124 0000000000000005 fffffa81391a5028 0000000000000000 : nt!KeBugCheckEx
fffff8800247c2c0 fffff80001be4b43 : 0000000000000001 fffffa8231e207b0 0000000000000000 0000000000000000 : hal!HalBugCheckSystem+0x1e3
fffff8800247c300 fffff80001a1b6be : fffffa8100002ba0 fffffa813856cbf0 fffff8800247c3f0 fffff80001a39470 : nt!WheaReportHwError+0x263
fffff8800247c360 fffff80001b86c61 : fffff8800247c530 0000000000000001 0000000000000001 0000000000000001 : hal!HalHandleNMI+0x66
fffff8800247c390 fffff80001ad4502 : 0000000000000001 0000000000000000 0000000000000000 0000000000000002 : nt!KiProcessNMI+0x131
fffff8800247c3f0 fffff80001ad4363 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KxNmiInterrupt+0x82
fffff8800247c530 fffff88005932c61 : fffff80001adfcf9 00000000001dba6f fffffa8231e13450 0000000000000000 : nt!KiNmiInterrupt+0x163
fffff88002499c98 fffff80001adfcf9 : 00000000001dba6f fffffa8231e13450 0000000000000000 0000000000000000 : intelppm!MWaitIdle+0x19
fffff88002499ca0 fffff80001acee9c : fffff88002471180 fffff88000000001 0000000000000000 fffff88000000000 : nt!PoIdle+0x52a
fffff88002499d80 0000000000000000 : fffff8800249a000 fffff88002494000 fffff88002499d40 0000000000000000 : nt!KiIdleLoop+0x2c
STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: hardware
IMAGE_NAME: hardware
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: X64_0x124_5_PCIEXPRESS
BUCKET_ID: X64_0x124_5_PCIEXPRESS
Followup: MachineOwner
2: kd> dt _WHEA_ERROR_RECORD fffffa81391a5028 -b
nt!_WHEA_ERROR_RECORD
+0x000 Header : _WHEA_ERROR_RECORD_HEADER
+0x000 Signature : 0x52455043
+0x004 Revision : _WHEA_REVISION
+0x000 MinorRevision : 0x10 ''
+0x001 MajorRevision : 0x2 ''
+0x000 AsUSHORT : 0x210
+0x006 SignatureEnd : 0xffffffff
+0x00a SectionCount : 1
+0x00c Severity : 1 ( WheaErrSevFatal )
+0x010 ValidBits : _WHEA_ERROR_RECORD_HEADER_VALIDBITS
+0x000 PlatformId : 0y0
+0x000 Timestamp : 0y1
+0x000 PartitionId : 0y0
+0x000 Reserved : 0y00000000000000000000000000000 (0)
+0x000 AsULONG : 2
+0x014 Length : 0x198
+0x018 Timestamp : _WHEA_TIMESTAMP
+0x000 Seconds : 0y00000100 (0x4)
+0x000 Minutes : 0y00100001 (0x21)
+0x000 Hours : 0y00001101 (0xd)
+0x000 Precise : 0y0
+0x000 Reserved : 0y0000000 (0)
+0x000 Day : 0y00011110 (0x1e)
+0x000 Month : 0y00000110 (0x6)
+0x000 Year : 0y00001100 (0xc)
+0x000 Century : 0y00010100 (0x14)
+0x000 AsLARGE_INTEGER : _LARGE_INTEGER 0x140c061e000d2104
+0x000 LowPart : 0xd2104
+0x004 HighPart : 0n336332318
+0x000 u : <unnamed-tag>
+0x000 LowPart : 0xd2104
+0x004 HighPart : 0n336332318
+0x000 QuadPart : 0n1444536306398732548
+0x020 PlatformId : _GUID {00000000-0000-0000-0000-000000000000}
+0x000 Data1 : 0
+0x004 Data2 : 0
+0x006 Data3 : 0
+0x008 Data4 : ""
[00] 0 ''
[01] 0 ''
[02] 0 ''
[03] 0 ''
[04] 0 ''
[05] 0 ''
[06] 0 ''
[07] 0 ''
+0x030 PartitionId : _GUID {00000000-0000-0000-0000-000000000000}
+0x000 Data1 : 0
+0x004 Data2 : 0
+0x006 Data3 : 0
+0x008 Data4 : ""
[00] 0 ''
[01] 0 ''
[02] 0 ''
[03] 0 ''
[04] 0 ''
[05] 0 ''
[06] 0 ''
[07] 0 ''
+0x040 CreatorId : _GUID {cf07c4bd-b789-4e18-b3c4-1f732cb57131}
+0x000 Data1 : 0xcf07c4bd
+0x004 Data2 : 0xb789
+0x006 Data3 : 0x4e18
+0x008 Data4 : "???"
[00] 0xb3 ''
[01] 0xc4 ''
[02] 0x1f ''
[03] 0x73 's'
[04] 0x2c ','
[05] 0xb5 ''
[06] 0x71 'q'
[07] 0x31 '1'
+0x050 NotifyType : _GUID {3e62a467-ab40-409a-a698-f362d464b38f}
+0x000 Data1 : 0x3e62a467
+0x004 Data2 : 0xab40
+0x006 Data3 : 0x409a
+0x008 Data4 : "???"
[00] 0xa6 ''
[01] 0x98 ''
[02] 0xf3 ''
[03] 0x62 'b'
[04] 0xd4 ''
[05] 0x64 'd'
[06] 0xb3 ''
[07] 0x8f ''
+0x060 RecordId : 0x1cd56b99e48c4c6
+0x068 Flags : _WHEA_ERROR_RECORD_HEADER_FLAGS
+0x000 Recovered : 0y0
+0x000 PreviousError : 0y0
+0x000 Simulated : 0y0
+0x000 Reserved : 0y00000000000000000000000000000 (0)
+0x000 AsULONG : 0
+0x06c PersistenceInfo : _WHEA_PERSISTENCE_INFO
+0x000 Signature : 0y0101001001000101 (0x5245)
+0x000 Length : 0y000000000000000000000000 (0)
+0x000 Identifier : 0y0000000000000000 (0)
+0x000 Attributes : 0y00
+0x000 DoNotLog : 0y0
+0x000 Reserved : 0y00000 (0)
+0x000 AsULONGLONG : 0x5245
+0x074 Reserved : ""
[00] 0 ''
[01] 0 ''
[02] 0 ''
[03] 0 ''
[04] 0 ''
[05] 0 ''
[06] 0 ''
[07] 0 ''
[08] 0 ''
[09] 0 ''
[10] 0 ''
[11] 0 ''
+0x080 SectionDescriptor :
[00] _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR
+0x000 SectionOffset : 0xc8
+0x004 SectionLength : 0xd0
+0x008 Revision : _WHEA_REVISION
+0x000 MinorRevision : 0x1 ''
+0x001 MajorRevision : 0x2 ''
+0x000 AsUSHORT : 0x201
+0x00a ValidBits : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_VALIDBITS
+0x000 FRUId : 0y0
+0x000 FRUText : 0y0
+0x000 Reserved : 0y000000 (0)
+0x000 AsUCHAR : 0 ''
+0x00b Reserved : 0 ''
+0x00c Flags : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_FLAGS
+0x000 Primary : 0y1
+0x000 ContainmentWarning : 0y0
+0x000 Reset : 0y0
+0x000 ThresholdExceeded : 0y0
+0x000 ResourceNotAvailable : 0y0
+0x000 LatentError : 0y0
+0x000 Reserved : 0y00000000000000000000000000 (0)
+0x000 AsULONG : 1
+0x010 SectionType : _GUID {d995e954-bbc1-430f-ad91-b44dcb3c6f35}
+0x000 Data1 : 0xd995e954
+0x004 Data2 : 0xbbc1
+0x006 Data3 : 0x430f
+0x008 Data4 : "???"
[00] 0xad ''
[01] 0x91 ''
[02] 0xb4 ''
[03] 0x4d 'M'
[04] 0xcb ''
[05] 0x3c '<'
[06] 0x6f 'o'
[07] 0x35 '5'
+0x020 FRUId : _GUID {00000000-0000-0000-0000-000000000000}
+0x000 Data1 : 0
+0x004 Data2 : 0
+0x006 Data3 : 0
+0x008 Data4 : ""
[00] 0 ''
[01] 0 ''
[02] 0 ''
[03] 0 ''
[04] 0 ''
[05] 0 ''
[06] 0 ''
[07] 0 ''
+0x030 SectionSeverity : 1 ( WheaErrSevFatal )
+0x034 FRUText : ""
[00] 0 ''
[01] 0 ''
[02] 0 ''
[03] 0 ''
[04] 0 ''
[05] 0 ''
[06] 0 ''
[07] 0 ''
[08] 0 ''
[09] 0 ''
[10] 0 ''
[11] 0 ''
[12] 0 ''
[13] 0 ''
[14] 0 ''
[15] 0 ''
[16] 0 ''
[17] 0 ''
[18] 0 ''
[19] 0 ''
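A few of the header fields in the dump above can be decoded by hand. The following is a minimal Python sketch, not a full record parser. It assumes the Timestamp bytes are plain binary values (which is how the debugger shows them here) and that the SectionType GUID is the PCI Express error section from the UEFI Common Platform Error Record format, which would match the PCIEXPRESS bucket name. The function names are made up for illustration.

```python
import struct
import uuid
from datetime import datetime

# Values copied from the dt _WHEA_ERROR_RECORD output above.
SIGNATURE = 0x52455043
TIMESTAMP = 0x140C061E000D2104
SECTION_TYPE = uuid.UUID("d995e954-bbc1-430f-ad91-b44dcb3c6f35")

# Assumption: section-type GUID as listed in the UEFI CPER appendix.
KNOWN_SECTIONS = {
    uuid.UUID("d995e954-bbc1-430f-ad91-b44dcb3c6f35"): "PCI Express error section",
}


def decode_signature(value):
    """Return the 4-byte little-endian signature as ASCII text."""
    return struct.pack("<I", value).decode("ascii")


def decode_timestamp(value):
    """Split the 64-bit timestamp into its byte fields, lowest byte first:
    seconds, minutes, hours, flags, day, month, year, century."""
    seconds, minutes, hours, _flags, day, month, year, century = struct.pack("<Q", value)
    return datetime(century * 100 + year, month, day, hours, minutes, seconds)


if __name__ == "__main__":
    print(decode_signature(SIGNATURE))
    print(decode_timestamp(TIMESTAMP).isoformat())
    print(KNOWN_SECTIONS.get(SECTION_TYPE, "unknown section type"))
```

With the values from this dump, the signature reads as "CPER" and the timestamp as 2012-06-30 13:33:04, two days before the post.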
Try to use a PCIe analyzer.
WHEA (weird) is one of the most useless bugchecks. It says: hey, there is a
little rat with wings somewhere in your 13-acre farm house, go find it…
You start with a PCI analyzer to prove or disprove that the PCI device being
monitored generated an NR error when the system raised the WHEA bugcheck.
Calvin
Thanks, guys.
I found that “!errrec ” does a fair job at interpreting the WHEA
error record. I was able to see which VEN_ID and DEV_ID caused the
problem with the Command register, Status register and the Uncorrectable
Error Status. Basically, the PLX bridge fell off the bus for some
reason. The hardware guys are now investigating.
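For anyone who wants to read the Uncorrectable Error Status value by hand, here is a minimal Python sketch. The bit positions follow the Advanced Error Reporting capability in the PCIe Base Specification as I understand it. The register value in the example is hypothetical, since the actual value was not posted, and the function name is made up for illustration.

```python
# Bit positions in the AER Uncorrectable Error Status register.
UNCORRECTABLE_ERROR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable_status(value):
    """Return the names of all error bits set in the register value."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_ERROR_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Hypothetical value with only bit 5 set; not taken from the real dump.
    example_status = 0x00000020
    print(decode_uncorrectable_status(example_status))
```

A device dropping off the bus would typically show up as the Surprise Down bit on the port above it.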
I love the rat-with-wings-on-a-13-acre-farm analogy. That is too good.
The problem with this error is that it only happens once a year, so
putting a LeCroy PCIe analyzer on it is not really an option.
Thx
((&->
Nice trick! I usually bounced through all the PCI devices in the system,
starting from the suspected path, if I was lucky enough that the system was
still cooperating.
Well, for a problem that happens once a year, it’s hard to declare victory. In
general, at least one negative and one positive result are required to claim a
valid fix.
The victory is that it is not my problem any more – let someone else
sweat bullets for a change.
However, we have already figured it out. It turns out that the system’s
BIOS lied to us. The slot that is having problems is actually limited
to 15W of power, while the BIOS reports that it will supply 25W of
power. As a result, we consistently draw 25W of power on that slot.
Most of the time, the cards in the system are not busy. However, every
so often, all of the cards draw 25W on all of the PCIe slots and
overtax the power supply. This causes a brownout on the slot, which causes
our device to issue a Surprise Down PCIe Uncorrectable Error. This of
course causes Windows to bug check.
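One place a platform advertises a slot's power budget is the Slot Power Limit Value and Scale fields of the PCIe Slot Capabilities register on the port above the slot. Whether that is where this BIOS reported 25W is an assumption; the mechanism was not stated. Under that assumption, a minimal Python sketch for turning the register into watts might look like this. The function name and the example register values are hypothetical, and the sketch ignores the special encodings that newer spec revisions define for very high power values.

```python
# Slot Power Limit Scale (bits 16:15) selects a divisor for the value field.
SCALE_DIVISORS = {0b00: 1, 0b01: 10, 0b10: 100, 0b11: 1000}


def slot_power_limit_watts(slot_capabilities):
    """Decode Slot Power Limit Value (bits 14:7) and Scale (bits 16:15)."""
    value = (slot_capabilities >> 7) & 0xFF
    scale = (slot_capabilities >> 15) & 0x3
    return value / SCALE_DIVISORS[scale]


if __name__ == "__main__":
    # Hypothetical register contents: value 25 with scale 1.0x.
    reported = slot_power_limit_watts(25 << 7)
    # Hypothetical register contents: value 150 with scale 0.1x.
    actual = slot_power_limit_watts((150 << 7) | (0b01 << 15))
    print(reported, actual, reported - actual)
```

Comparing the advertised figure against what the board documentation says the slot can really deliver would expose this kind of mismatch before the once-a-year brownout does.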
((&->
It’s always nice when we can blame the hardware guys. I’ve had to do this
several times (we once had a computer that would occasionally push the
return address onto the stack, but then not call the subroutine. This
same architecture would also scramble the contents of a register if a DMA
grant happened during a rotate-bit operation).
Stories like this help me set students’ expectations. Although I am no
longer actively teaching, I’m working on mentoring opportunities in local
universities, but I have other distractions in my life right now that are
keeping me very busy.
Thank you for sharing this gem.
joe