Bugcheck: WHEA_UNCORRECTABLE_ERROR

Hi folks,

Got a nice little bugcheck that I’m having trouble debugging. I can
decode the basic record header and section descriptor structures, but I
don’t know how to decode the actual data, and hence don’t know how to
determine what actually happened (e.g. PCI device X asserted PERR# or
SERR#).

The info I have is:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 00000000, MCA_ASSERT
Arg2: 85045028, Address of WHEA_ERROR_RECORD structure
Arg3: b2000000, High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
Arg4: 00070f0f, Low 32 bits of MCi_STATUS MSR for the MCA bank that had the error

Debugging Details:

WHEA_ERROR_RECORD: !errrec ffffffff85045028



0: kd> ??(_WHEA_ERROR_RECORD *)(0x85045028);.echo done
struct _WHEA_ERROR_RECORD * 0x85045028
+0x000 Header : _WHEA_ERROR_RECORD_HEADER
+0x088 SectionDescriptor : [1] _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR
done
0: kd> ??((_WHEA_ERROR_RECORD *)(0x85045028))->Header;.echo done
struct _WHEA_ERROR_RECORD_HEADER
+0x000 Signature : 0x52455043
+0x004 Revision : 0x100
+0x006 Reserved1 : 0xffff
+0x008 Reserved2 : 0xffff
+0x00a SectionCount : 1
+0x00c Severity : 1 ( WheaErrSevFatal )
+0x010 ValidationBits : 2
+0x014 Length : 0x2e8
+0x018 Timestamp : _LARGE_INTEGER 0x1c73fd6`1d68ff38
+0x020 PlatformId : _GUID {00000000-0000-0000-0000-000000000000}
+0x030 PartitionId : _GUID {00000000-0000-0000-0000-000000000000}
+0x040 CreatorId : _GUID {f9de0c24-0e4d-4c87-b410-f5701cab65c3}
+0x050 NotifyType : _GUID {e8f56ffe-919c-4cc5-ba88-65abe14913bb}
+0x060 RecordId : 1
+0x068 Flags : 0
+0x070 PersistenceInfo : _WHEA_PERSISTENCE_INFO
+0x078 Reserved3 : [12] ""
done
0: kd> ??((_WHEA_ERROR_RECORD *)(0x85045028))->SectionDescriptor;.echo done
struct _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR [1] 0x850450b0
+0x000 SectionOffset : 0xd0
+0x004 SectionLength : 0x218
+0x008 Revision : 0x100
+0x00a ValidationBits : 0 ''
+0x00b Reserved : 0 ''
+0x00c Flags : 1
+0x010 SectionType : _GUID {e71254e9-c1b9-4940-ab76-909703a4320f}
+0x020 FRUId : _GUID {00000000-0000-0000-0000-000000000000}
+0x030 SectionSeverity : 1 ( WheaErrSevFatal )
+0x034 FRUText : [20] ""
done
0: kd> dd (0x85045028 + 0xd0) L 86
850450f8 74507245 00000000 00000218 00000000
85045108 00000100 00000000 00000000 00000000
85045118 00000000 00000001 00000000 00000000
85045128 00000000 00000002 00000002 00000000
85045138 0000017f 00000000 00030000 00000300
85045148 00020f12 00000000 00000000 00000000
85045158 00000000 00000000 00000000 00000000
85045168 00000000 00000000 00000000 00000000
85045178 00000000 00000000 00000000 00000000
85045188 00000000 00000000 00000000 00000000
85045198 00000000 00000000 00000000 00000000
850451a8 00000000 00000000 00000000 00000000
850451b8 00000000 00000000 00000000 00000000
850451c8 00000000 00000000 00000002 00000000
850451d8 00000000 00000000 00000000 00000000
850451e8 00000000 00000000 00000000 00000000
850451f8 00000000 00000000 00000000 00000000
85045208 00000001 00000000 00000001 00000001
85045218 1d68ff38 01c73fd6 00000002 00000000
85045228 00000004 00000000 00070f0f b2000000
85045238 00000000 00000000 00000000 00000000
85045248 00000000 00000000 00000000 00000000
85045258 00000000 00000000 00000000 00000000
85045268 00000000 00000000 00000000 00000000
85045278 00000000 00000000 00000000 00000000
85045288 00000000 00000000 00000000 00000000
85045298 00000000 00000000 00000000 00000000
850452a8 00000000 00000000 00000000 00000000
850452b8 00000000 00000000 00000000 00000000
850452c8 00000000 00000000 00000000 00000000
850452d8 00000000 00000000 00000000 00000000
850452e8 00000000 00000000 00000000 00000000
850452f8 00000000 00000000 00000000 00000000
85045308 00000000 00000000
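
A note on reading the raw dump above: `dd` prints little-endian DWORDs, so a 64-bit field appears as its low DWORD followed by its high DWORD, and a four-character signature appears byte-reversed. The sketch below is only an illustration of that arithmetic, using values copied from the output above. The helper names are made up for this example, and treating the timestamp as a FILETIME (100 ns ticks since 1601-01-01) is an assumption based on the `_LARGE_INTEGER` type shown in the header, not something the debugger output states.

```python
import struct
from datetime import datetime, timedelta, timezone


def dword_to_ascii(value):
    """Return the four bytes of a little-endian DWORD as ASCII text."""
    return struct.pack("<I", value).decode("ascii")


def join_dwords(low, high):
    """Combine two consecutive DWORDs from a `dd` dump into one 64-bit value."""
    return (high << 32) | low


def filetime_to_utc(filetime):
    """Convert 100 ns ticks since 1601-01-01 to a UTC datetime.

    Assumption: the field is a FILETIME; the dump does not say so.
    """
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


# Header.Signature from the record header.
record_signature = dword_to_ascii(0x52455043)
# First DWORD of the section data, at 850450f8.
section_signature = dword_to_ascii(0x74507245)
# The two DWORDs at 85045218, matching Header.Timestamp.
timestamp = filetime_to_utc(join_dwords(0x1d68ff38, 0x01c73fd6))
# The two DWORDs at 85045230, matching Arg4 (low) and Arg3 (high).
mci_status = join_dwords(0x00070f0f, 0xb2000000)

print(record_signature, section_signature)
print(timestamp.isoformat())
print(hex(mci_status))
```

The last value is the same machine check status that the bugcheck reports in Arg3 and Arg4, so the section data repeats that status here rather than adding device-level detail at this offset.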

Hi Martin, I asked one of the developers who works on this if he could
help. Here is what he said.

I get the daily summary, but that’s all.

I can give you what sounds like a good description of the error.
However, this particular error is somewhat generic and very difficult to
root cause.

Concatenating bugcheck arguments 3 and 4, you can get the full machine
check status code for the error: 0xb200000000070f0f. This is an
AMD-specific error code reported on the processor’s Northbridge machine
check status bank. The error code means a HyperTransport Watchdog
Timeout (WDTO) occurred. Basically, this means a PCI transaction failed
to complete for some reason, HT timed out, and the processor raised the
fatal machine check. There is some additional snooping that is possible
if this error occurs under a debugger, but, due to the likelihood of a
hard lockup, the OS does not attempt to probe the buses to find which
device(s) have error bits set. Under a debugger, you can try to dump
PCI config space to determine which device(s) report errors. This
doesn’t necessarily offer a great deal of help, but it may indicate
which device(s) are involved.
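
To make the status code a little less opaque, here is a minimal sketch that splits the 64-bit value into the commonly documented MCi_STATUS flag bits and the low 16-bit error code. The function name is hypothetical. The position of the extended error code (bits 19:16) is an assumption about this generation of AMD Northbridge bank and should be checked against the BIOS and Kernel Developer's Guide for the processor in question. The sketch only extracts fields; the interpretation as a watchdog timeout comes from the description above, not from the code.

```python
def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag bits and error code fields."""
    flag_bits = {
        "VAL": 63,    # register contents are valid
        "OVER": 62,   # an earlier error was overwritten
        "UC": 61,     # error was not corrected
        "EN": 60,     # reporting was enabled for this error
        "MISCV": 59,  # MCi_MISC holds valid data
        "ADDRV": 58,  # MCi_ADDR holds a valid address
        "PCC": 57,    # processor context may be corrupt
    }
    fields = {name: bool((status >> bit) & 1) for name, bit in flag_bits.items()}
    fields["error_code"] = status & 0xFFFF
    # Assumption: extended error code lives in bits 19:16 on this hardware.
    fields["ext_error_code"] = (status >> 16) & 0xF
    return fields


# Arg3 is the high DWORD and Arg4 is the low DWORD of the status value.
status = (0xb2000000 << 32) | 0x00070f0f
decoded = decode_mci_status(status)

for name, value in decoded.items():
    print(name, value if isinstance(value, bool) else hex(value))
```

For this bugcheck, VAL, UC, EN and PCC are set while MISCV and ADDRV are clear, so the bank did not capture an address or miscellaneous data that could point at a specific device.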

The root cause could be any one or more of the following: a device
driver is misprogramming hardware, thus hanging the bus; a device is not
in a state such that it can properly respond to accesses (frequently the
device is found to be in a low power state); or possibly this is a case
where the HT timeout threshold is too sensitive. This could potentially
happen on DMA or PIO requests to storage devices and possibly network
devices. This could again be hardware, firmware, or a device driver
issue.
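
On the earlier suggestion of dumping PCI config space under a debugger: the error bits of interest live in the 16-bit Status register at config offset 0x06, which is the upper half of the DWORD at offset 0x04. The following is a rough sketch of checking that DWORD for the standard error bits. The function name and the sample values are hypothetical and are not taken from the crash discussed here.

```python
# Error-reporting bits of the PCI Status register (config offset 0x06).
PCI_STATUS_ERROR_BITS = {
    15: "Detected Parity Error",
    14: "Signaled System Error (SERR#)",
    13: "Received Master Abort",
    12: "Received Target Abort",
    11: "Signaled Target Abort",
    8: "Master Data Parity Error",
}


def pci_status_errors(command_status_dword):
    """Return the names of error bits set in the Status half of the DWORD
    read from config offset 0x04 (Command in bits 15:0, Status in 31:16)."""
    status = (command_status_dword >> 16) & 0xFFFF
    return [
        name
        for bit, name in PCI_STATUS_ERROR_BITS.items()
        if status & (1 << bit)
    ]


# Hypothetical values for illustration only.
example_with_error = 0x22100007  # Status = 0x2210, Command = 0x0007
example_clean = 0x02100007       # Status = 0x0210, Command = 0x0007

print(pci_status_errors(example_with_error))
print(pci_status_errors(example_clean))
```

A device showing a master or target abort after the crash is a candidate, not a verdict, since these bits are sticky and may have been set long before the timeout.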

The Windows Hardware Error Architecture (WHEA) does allow the platform
to cooperate with the OS to provide additional details about errors such
as this. This particular error, for instance, could potentially be much
better described if the BIOS were to identify the device(s) involved in
the error, but until WHEA-aware platforms are available from vendors,
details like this are not included in the error record.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Martin Harvey
Sent: Wednesday, January 24, 2007 9:37 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Bugcheck: WHEA_UNCORRECTABLE_ERROR


Ian Service wrote:

Hi Martin, I asked one of the developers who works on this if he could
help. Here is what he said.

Ian,

Many thanks for your colleague’s time and that information - that’s
enough information to get me started down a fruitful line of
investigation. It’s probably a bus analyzer and a fair amount of head
scratching from here on in!

MH.