Hi all,
I am running WHQL certification for a PCI device, and two test cases are still failing: the CHAOS test and the Sleep and PnP (Disable and Enable) with IO Before and After test. Both tests fail with the same error: the machine bugchecks (BSOD) while the test is running. The analyzed dump output is below.
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000005, Generic Error
Arg2: fffffa800e354028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000
Debugging Details:
BUGCHECK_STR: 0x124_GenuineIntel
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
PROCESS_NAME: watchdogmgr.ex
CURRENT_IRQL: f
STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: GenuineIntel
IMAGE_NAME: GenuineIntel
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: 0x124_GenuineIntel_VRF_PCIEXPRESS
BUCKET_ID: 0x124_GenuineIntel_VRF_PCIEXPRESS
Followup: MachineOwner
10: kd> !errrec fffffa800e354028
Common Platform Error Record @ fffffa800e354028
Record Id : 01cf6ddc4f631b93
Severity : Fatal (1)
Length : 408
Creator : Microsoft
Notify Type : Generic
Timestamp : 5/12/2014 12:43:02 (UTC)
Flags : 0x00000000
===============================================================================
Section 0 : PCI Express
Descriptor @ fffffa800e3540a8
Section @ fffffa800e3540f0
Offset : 200
Length : 208
Flags : 0x00000001 Primary
Severity : Fatal
Port Type : Root Port
Version : 1.0
Command/Status: 0x0546/0x4010
Device Id :
VenId:DevId : 8086:3410
Class code : 060400
Function No : 0x00
Device No : 0x09
Segment : 0x0000
Primary Bus : 0x80
Second. Bus : 0x84
Slot : 0x0000
Sec. Status : 0x0000
Bridge Ctl. : 0x0007
Express Capability Information @ fffffa800e354124
Device Caps : 00008021 Role-Based Error Reporting: 1
Device Ctl : 012e UR FE NF ce
Dev Status : 0004 ur FE nf ce
Root Ctl : 000e FS NFS cs
AER Information @ fffffa800e354160
Uncorrectable Error Status : 00004000 ur ecrc mtlp rof uc ca CTO fcp ptlp sd dlp und
Uncorrectable Error Mask : 00218000 ur ecrc mtlp rof UC CA cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00067030 ur ecrc MTLP ROF uc ca CTO FCP PTLP SD DLP und
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 000031c1 ADV RTTO RNRO DLLP TLP RE
Caps & Control : 0000000e ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000054 MSG# 00 FER nfer FUF mur UR mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 80,09,00
From this output we conclude that a Completion Timeout (CTO) is occurring, since the CTO bit is set in the Uncorrectable Error Status register.
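For anyone who wants to check that reading, here is a minimal sketch (plain Python, not driver code) that decodes the three Uncorrectable Error registers from the dump. The bit positions follow the AER capability layout in the PCI Express Base Specification as I understand it; the names `AER_UNCORRECTABLE_BITS` and `decode` are made up for this example.

```python
# Bit positions in the AER Uncorrectable Error Status / Mask / Severity
# registers, keyed by the abbreviations that !errrec prints.
AER_UNCORRECTABLE_BITS = {
    "und": 0,    # Undefined (formerly Link Training Error)
    "dlp": 4,    # Data Link Protocol Error
    "sd": 5,     # Surprise Down
    "ptlp": 12,  # Poisoned TLP
    "fcp": 13,   # Flow Control Protocol Error
    "cto": 14,   # Completion Timeout
    "ca": 15,    # Completer Abort
    "uc": 16,    # Unexpected Completion
    "rof": 17,   # Receiver Overflow
    "mtlp": 18,  # Malformed TLP
    "ecrc": 19,  # ECRC Error
    "ur": 20,    # Unsupported Request
}


def decode(value):
    """Return the sorted abbreviations of all known bits set in value."""
    return sorted(
        name for name, bit in AER_UNCORRECTABLE_BITS.items()
        if value & (1 << bit)
    )


# Values copied from the "AER Information" part of the dump.
status = 0x00004000
mask = 0x00218000
severity = 0x00067030

logged = decode(status)                     # errors recorded in Status
reported = decode(status & ~mask)           # logged and not masked
fatal = decode(status & ~mask & severity)   # reported with fatal severity

print("logged:  ", logged)
print("reported:", reported)
print("fatal:   ", fatal)
```

With these values CTO is the only bit logged, it is not masked, and its severity bit is set, which is consistent with the Fatal severity shown in the error record.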
We tried the following to get rid of the error:
- disabled all interrupts, but that didn't help
- tried to mask (reset) the CTO bit, but that didn't help either
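To be precise about "mask" versus "reset", here is a small sketch of the register arithmetic only. It does not touch hardware, and the helper names `mask_error` and `write_one_to_clear` are hypothetical. It assumes the usual AER semantics: a bit set in the Uncorrectable Error Mask register suppresses reporting of that error, and the Status register bits are write-1-to-clear.

```python
CTO_BIT = 1 << 14  # Completion Timeout bit in the AER uncorrectable registers


def mask_error(mask_reg, bit):
    """Value to write to the Mask register so that `bit` is masked."""
    return mask_reg | bit


def write_one_to_clear(status_reg, written):
    """Model a write to a write-1-to-clear register: every bit written
    as 1 is cleared, every bit written as 0 keeps its old value."""
    return status_reg & ~written & 0xFFFFFFFF


# Values copied from the root port dump.
mask = 0x00218000
status = 0x00004000

new_mask = mask_error(mask, CTO_BIT)
cleared = write_one_to_clear(status, CTO_BIT)
unchanged = write_one_to_clear(status, 0)  # writing 0 does NOT clear the bit

print(f"mask with CTO masked: {new_mask:#010x}")
print(f"status after writing 1 to CTO: {cleared:#010x}")
print(f"status after writing 0: {unchanged:#010x}")
```

So masking means writing 0x0021c000 to the Mask register, while resetting the logged error means writing a 1 (not a 0) to bit 14 of the Status register. Neither operation removes the cause of the timeout itself.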
From http://msdn.microsoft.com/en-us/library/windows/hardware/ff557321(v=vs.85).aspx we understand that this is a generic hardware error. Can it still be fixed by adding some code?
I have read about the two tests at the links below:
http://msdn.microsoft.com/en-us/library/windows/hardware/jj673008(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/hardware/jj673017(v=vs.85).aspx#About_sleep_with_IO
All the test cases that these two tests depend on are passing. So why do I get the BSOD only in these two tests (CHAOS and Sleep and PnP IO Before and After), and not in the other tests related to them?
Can we fix this by adding some code, or do we need a hardware fix?
I appreciate all the suggestions given in the past, but I am curious whether there is any other way to solve this.
If you have any ideas, please share.
Thanks in advance.