CHAOS test and Sleep and PnP IO Before and After tests failed

Hi all,
I am running WHQL certification for a PCI device, and two test cases are still failing: the CHAOS test and the Sleep and PnP (Disable and Enable) with IO Before and After test. Both tests fail with the same error, a BSOD. The analyzed dump data is below.

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000005, Generic Error
Arg2: fffffa800e354028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:

BUGCHECK_STR: 0x124_GenuineIntel

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT

PROCESS_NAME: watchdogmgr.ex

CURRENT_IRQL: f

STACK_COMMAND: kb

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: GenuineIntel

IMAGE_NAME: GenuineIntel

DEBUG_FLR_IMAGE_TIMESTAMP: 0

FAILURE_BUCKET_ID: 0x124_GenuineIntel_VRF_PCIEXPRESS

BUCKET_ID: 0x124_GenuineIntel_VRF_PCIEXPRESS

Followup: MachineOwner

10: kd> !errrec fffffa800e354028

Common Platform Error Record @ fffffa800e354028

Record Id : 01cf6ddc4f631b93
Severity : Fatal (1)
Length : 408
Creator : Microsoft
Notify Type : Generic
Timestamp : 5/12/2014 12:43:02 (UTC)
Flags : 0x00000000

===============================================================================
Section 0 : PCI Express

Descriptor @ fffffa800e3540a8
Section @ fffffa800e3540f0
Offset : 200
Length : 208
Flags : 0x00000001 Primary
Severity : Fatal

Port Type : Root Port
Version : 1.0
Command/Status: 0x0546/0x4010
Device Id :
VenId:DevId : 8086:3410
Class code : 060400
Function No : 0x00
Device No : 0x09
Segment : 0x0000
Primary Bus : 0x80
Second. Bus : 0x84
Slot : 0x0000
Sec. Status : 0x0000
Bridge Ctl. : 0x0007
Express Capability Information @ fffffa800e354124
Device Caps : 00008021 Role-Based Error Reporting: 1
Device Ctl : 012e UR FE NF ce
Dev Status : 0004 ur FE nf ce
Root Ctl : 000e FS NFS cs

AER Information @ fffffa800e354160
Uncorrectable Error Status : 00004000 ur ecrc mtlp rof uc ca CTO fcp ptlp sd dlp und
Uncorrectable Error Mask : 00218000 ur ecrc mtlp rof UC CA cto fcp ptlp sd dlp und
Uncorrectable Error Severity : 00067030 ur ecrc MTLP ROF uc ca CTO FCP PTLP SD DLP und
Correctable Error Status : 00000000 adv rtto rnro dllp tlp re
Correctable Error Mask : 000031c1 ADV RTTO RNRO DLLP TLP RE
Caps & Control : 0000000e ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
Header Log : 00000000 00000000 00000000 00000000
Root Error Command : 00000000 fen nfen cen
Root Error Status : 00000054 MSG# 00 FER nfer FUF mur UR mcr cer
Correctable Error Source ID : 00,00,00
Correctable Error Source ID : 80,09,00

From this output we conclude that a Completion Timeout (CTO) is occurring, since the CTO bit is set in the Uncorrectable Error Status register (0x00004000, bit 14).
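As a cross-check of that reading, here is a minimal sketch that decodes the three AER uncorrectable registers from the dump above. The bit positions are the ones defined for the AER capability in the PCI Express Base Specification; the helper name `decode_bits` and the table name are made up for illustration, and the "reported and fatal" calculation is only my understanding of how mask and severity combine, not something taken from the debugger.

```python
# Bit positions of the AER Uncorrectable Error Status / Mask / Severity
# registers. Abbreviations follow the !errrec output; "ACS" (bit 21) is
# not printed by !errrec in this dump and is named here for completeness.
AER_UNCORRECTABLE_BITS = {
    0: "UND",    # Undefined
    4: "DLP",    # Data Link Protocol Error
    5: "SD",     # Surprise Down
    12: "PTLP",  # Poisoned TLP
    13: "FCP",   # Flow Control Protocol Error
    14: "CTO",   # Completion Timeout
    15: "CA",    # Completer Abort
    16: "UC",    # Unexpected Completion
    17: "ROF",   # Receiver Overflow
    18: "MTLP",  # Malformed TLP
    19: "ECRC",  # ECRC Error
    20: "UR",    # Unsupported Request
    21: "ACS",   # ACS Violation
}


def decode_bits(value, names=AER_UNCORRECTABLE_BITS):
    """Return the names of all set bits, lowest bit first (hypothetical helper)."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


# Register values copied from the "AER Information" section of the dump.
status = 0x00004000
mask = 0x00218000
severity = 0x00067030

# Assumption: an error is reported if its status bit is set and not masked,
# and it is treated as fatal if its severity bit is also set.
reported = status & ~mask
fatal = reported & severity

print("status  :", decode_bits(status))
print("masked  :", decode_bits(mask))
print("severity:", decode_bits(severity))
print("fatal   :", decode_bits(fatal))
```

With these values, CTO is the only status bit set, it is not masked, and its severity bit is set, which is consistent with the "Severity : Fatal" line in the record.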

We tried the following to get rid of the error:

  1. Disabled all the interrupts, but that didn’t work.
  2. Tried to mask (reset) the CTO bit, but that didn’t work either.

From http://msdn.microsoft.com/en-us/library/windows/hardware/ff557321(v=vs.85).aspx we understand that this is a generic hardware error, but can we get rid of it by adding some code?

I have read about the tests at the links below:
http://msdn.microsoft.com/en-us/library/windows/hardware/jj673008(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/hardware/jj673017(v=vs.85).aspx#About_sleep_with_IO

Since I am passing all the test cases on which these two tests depend, why do I get the BSOD only in these two cases, and not in the other tests related to them (CHAOS and Sleep and PnP IO Before and After)?

Can we correct this by adding some code, or do we need to fix the hardware?

I respect all the earlier suggestions, but I am curious to know whether there is any other way to solve it.

If you have any ideas, please share.

Thanks in advance.

Hardware error? Perhaps, ah, fix the hardware??

Maybe I’m missing something, but this actually sounds like it’d be the right answer…

Peter
OSR
@OSRDrivers

Hi Peter,

Before concluding that this is a hardware error: is there anything we need to do before running the tests?
Or are we missing something?

CTO is normally a non-fatal error. The severity shown here is “fatal”, which is NOT recoverable. Further, the error is being shown as coming from the Root Port.

So, if I were to guess, and that’s what I’m really doing here, I would say your hardware is not coming ready quickly enough, or something like that.

Like I said before, this looks like a hardware error not a software error in your driver.
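For anyone who wants to check the Root Port observation against the dump, here is a rough sketch. The bit layout of the Root Error Status register is from the AER capability in the PCI Express Base Specification; reading the "80,09,00" source ID field as bus, device, function in hex is an assumption about how !errrec formats it, and the helper names are invented for this example.

```python
# Root Error Status register bits (AER capability). Abbreviations follow
# the !errrec output in the original post.
ROOT_ERROR_STATUS_BITS = {
    0: "CER",   # ERR_COR received
    1: "MCR",   # multiple ERR_COR received
    2: "UR",    # ERR_FATAL/NONFATAL received
    3: "MUR",   # multiple ERR_FATAL/NONFATAL received
    4: "FUF",   # first uncorrectable error was fatal
    5: "NFER",  # non-fatal error messages received
    6: "FER",   # fatal error messages received
}


def decode_root_error_status(value):
    """Return the names of the set bits, lowest bit first (hypothetical helper)."""
    return [name for bit, name in sorted(ROOT_ERROR_STATUS_BITS.items())
            if value & (1 << bit)]


def parse_source_id(text):
    """Parse a 'bus,device,function' string of hex fields.

    Assumption: this field order is how !errrec prints the source ID.
    """
    bus, device, function = (int(part, 16) for part in text.split(","))
    return bus, device, function


# Values copied from the dump: Root Error Status, the non-zero source ID,
# and the Root Port's own Primary Bus / Device No / Function No.
root_error_status = 0x00000054
source = parse_source_id("80,09,00")
root_port = (0x80, 0x09, 0x00)

print("root error status:", decode_root_error_status(root_error_status))
print("source is the root port itself:", source == root_port)
```

The decoded status shows a fatal error received with the first uncorrectable error being fatal, and the source ID matches the Root Port at bus 0x80, device 0x09, function 0.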

Peter
OSR
@OSRDrivers

xxxxx@gmail.com wrote:

I am running WHQL certification for a PCI device, and two test cases are still failing: the CHAOS test and the Sleep and PnP (Disable and Enable) with IO Before and After test. Both tests fail with the same error, a BSOD. The analyzed dump data is below.

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000005, Generic Error
Arg2: fffffa800e354028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000

We tried the following to get rid of the error:

  1. Disabled all the interrupts, but that didn’t work.
  2. Tried to mask (reset) the CTO bit, but that didn’t work either.

From http://msdn.microsoft.com/en-us/library/windows/hardware/ff557321(v=vs.85).aspx we understand that this is a generic hardware error, but can we get rid of it by adding some code?

This is a HARDWARE error. You cannot work around this in software.
Even if software is triggering the error, you cannot release a product
that can disable the system like this through the use of an ordinary
user-mode application. You need to buy or rent a PCIExpress bus
analyzer and capture the failing transaction so your hardware people can
fix your PCIExpress IP.

Since I am passing all the test cases on which these two tests depend, why do I get the BSOD only in these two cases, and not in the other tests related to them (CHAOS and Sleep and PnP IO Before and After)?

There is simply some sequence of operations that is causing your
PCIExpress IP to miss a completion, and apparently those are the only
tests that happen to execute that sequence of operations.

Can we correct this by adding some code, or do we need to fix the hardware?

You must fix the hardware.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Have I mentioned that I’ve been ASTONISHED lately at the quality of some of the PCIe interface IP blocks out there?

I don’t know where these things come from, but it’s starting to feel like PCIe IP is being developed on Code Project.

Peter
OSR
@OSRDrivers

xxxxx@osr.com wrote:

Have I mentioned that I’ve been ASTONISHED lately at the quality of some of the PCIe interface IP blocks out there?

I assume you mean “poor quality”. I agree.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi All,
I have a silly question. Before installing my driver, I release-sign it and turn TESTSIGNING off. Then I install the driver and run the tests. I have observed that the system automatically switches into test mode while the tests are running. Is that normal? (This may not be the cause, but I would like to know.)