Bugcheck 0xC4 (0xA0) disk verifier

James_Harper · April 17, 2009, 10:06am

I’m getting occasional crashes due to the verifier’s disk integrity
checking, and have a couple of questions…

1. The docs say that the verifier calculates a checksum for each sector
accessed, and then compares the checksum next time the sector is
accessed. Is this checksum just stored in memory?

2. The docs also say that you can get false errors if you do ‘memory
writes to in-flight write buffers’ or ‘concurrent in-flight reads and
writes to the same sector’. I don’t ever do the former, but could
possibly allow the latter if Windows gave me such a request combination.
My drivers are PV drivers so I just stuff each request I get onto the
ring… For the problem to occur, would re-ordering of requests be
required? I don’t reorder anything but there are a few layers below me
(Xen block device backend driver, Linux SCSI/SAS/SATA driver, Linux
SCSI/SAS/SATA controller, physical disk) that could potentially re-order
the requests.

Could the issue happen if Windows sent requests in this order:
a. Write value #1234 to sector
b. Read value from sector
c. Write value #4567 to sector

If (c) was sent before (b) had completed, would it be a problem
that (b) returned #1234 when Windows had since calculated the checksum
to be #4567?
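
As a rough illustration of the (a)/(b)/(c) sequence, here is a small
user-mode model in Python. It is only a sketch: the `CrcTracker` class,
the `sector` helper, and the assumption that the checksum is recorded
when the write is issued (rather than when it completes) are all
hypothetical, and say nothing about how the verifier is really
implemented.

```python
import zlib

SECTOR_SIZE = 512


def sector(value):
    """Build a sector-sized buffer by repeating a 16-bit value."""
    return value.to_bytes(2, "big") * (SECTOR_SIZE // 2)


class CrcTracker:
    """Hypothetical in-memory checksum table, keyed by sector number."""

    def __init__(self):
        self.crcs = {}

    def on_write_issued(self, lba, data):
        # Assumption: the checksum is remembered when the write is sent down.
        self.crcs[lba] = zlib.crc32(data)

    def on_read_completed(self, lba, data):
        # True if the data matches the remembered checksum, or if the
        # sector has never been seen before.
        expected = self.crcs.get(lba)
        return expected is None or expected == zlib.crc32(data)


tracker = CrcTracker()
disk = {}

# (a) write #1234 is issued and completes.
tracker.on_write_issued(7, sector(0x1234))
disk[7] = sector(0x1234)

# (b) read is issued; a lower layer captures the data now,
#     but the completion is delayed.
read_data = disk[7]

# (c) write #4567 is issued before (b) has completed.
tracker.on_write_issued(7, sector(0x4567))
disk[7] = sector(0x4567)

# (b) finally completes, carrying the older data.
false_alarm = not tracker.on_read_completed(7, read_data)
print("false CRC mismatch reported:", false_alarm)
```

Under these assumptions no reordering below the driver is needed: the
mismatch comes purely from the read and the second write being in
flight at the same time.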

Maybe I am accepting concurrent requests when I shouldn’t be? I could
probably add some code to detect the above pattern and find out, but the
crash has happened only once in about a week of testing. Or maybe I
really am corrupting sectors… it’s a bit hard to tell when there are so
many unknowns involved.
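
The detection idea mentioned above could look roughly like the
following Python model. It is a sketch only: `InFlightTracker`, its
method names, and the request ids are hypothetical, and a real driver
would do this bookkeeping in its own request path. It flags any new
request whose sector range overlaps an outstanding request of the
opposite direction (read against write).

```python
class InFlightTracker:
    """Hypothetical bookkeeping of requests currently on the ring."""

    def __init__(self):
        # request id -> (is_write, first_sector, sector_count)
        self.in_flight = {}
        # (new request id, outstanding request id) pairs that overlap
        self.conflicts = []

    def submit(self, req_id, is_write, first_sector, count):
        for other_id, (other_write, other_first, other_count) in self.in_flight.items():
            overlaps = (first_sector < other_first + other_count
                        and other_first < first_sector + count)
            # Only a read mixed with a write is flagged here.
            if overlaps and is_write != other_write:
                self.conflicts.append((req_id, other_id))
        self.in_flight[req_id] = (is_write, first_sector, count)

    def complete(self, req_id):
        self.in_flight.pop(req_id)


ring = InFlightTracker()
ring.submit("a", True, 100, 8)    # (a) write, completes before anything else
ring.complete("a")
ring.submit("b", False, 100, 8)   # (b) read, still outstanding
ring.submit("c", True, 104, 8)    # (c) write overlapping (b)
print(ring.conflicts)
```

Logging the conflict list instead of failing the request keeps the
check passive, which suits a problem that shows up about once a week.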

Thanks

James

Mark_Roddy · April 17, 2009, 10:49am

The CRCs are kept in memory.

You might not do in-flight buffer writes, but somebody else might. Test
on a disk that does not hold the paging file. In my experience,
in-flight writes do happen.

Your example should never happen with standard filesystem I/O. The
filesystem is going to hold the write to the sector until the
outstanding read has completed; you would have to write an application
that explicitly violated ordering rules on its own.

Mark Roddy


Alex_Grig · April 17, 2009, 11:01am

Mark,

If the paging system ever allows a stale version of a page to be read back from the pagefile, it is broken. Normally, when a dirty page is flushed to the pagefile, its dirty flag should be cleared before the write is posted. If the page is modified (dirtied) during the write, it will still be considered dirty afterwards, and the copy that was written will never be read back.
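
The ordering described above can be modelled in a few lines of Python. This is a simplified sketch, not the memory manager's actual logic: the `Page` class, the `pagefile` dictionary, and the helper names are hypothetical, and the in-flight write is modelled as a snapshot taken when the write is posted.

```python
class Page:
    """Hypothetical page with contents and a dirty flag."""

    def __init__(self, data):
        self.data = data
        self.dirty = True


pagefile = {}


def begin_flush(page):
    """Clear the dirty flag first, then post the write (modelled as a snapshot)."""
    page.dirty = False
    return bytes(page.data)


def complete_flush(slot, written):
    """The posted write lands in the pagefile."""
    pagefile[slot] = written


def modify(page, data):
    """Any modification marks the page dirty again."""
    page.data = data
    page.dirty = True


def can_discard(page):
    """Only a clean page may be dropped and later re-read from the pagefile."""
    return not page.dirty


page = Page(bytearray(b"old"))
written = begin_flush(page)
modify(page, bytearray(b"new"))   # page is dirtied while the write is in flight
complete_flush(0, written)

print("pagefile copy:", pagefile[0])
print("page may be discarded:", can_discard(page))
```

Because the page is dirty again when the write completes, it cannot be discarded, so the stale pagefile copy is never the one that gets read back; the page has to be flushed again first.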

Mark_Roddy · April 17, 2009, 11:10am

Hmm… yes, that is true. So for the CRC calculations, in-flight writes
to page files aren’t a problem. Good point. My experience was with
validating mirrors: each mirror could end up with a different version
of discarded page data, so validation turned out to be an impossible
task. It should not affect the CRC test, as these pages should never be
read back in.
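
To illustrate the mirror case, here is a tiny Python model. It is
purely a sketch: `mirrored_write` is a hypothetical helper, and the two
`bytes(...)` copies stand in for each mirror's transfer reading the
same source buffer at slightly different times.

```python
def mirrored_write(buffer, mutate_between_copies):
    """Copy one source buffer to two mirrors, optionally mutating it in between."""
    mirror_a = bytes(buffer)       # first mirror reads the buffer
    if mutate_between_copies:
        buffer[0] ^= 0xFF          # owner touches the buffer mid-flight
    mirror_b = bytes(buffer)       # second mirror reads the buffer
    return mirror_a, mirror_b


a, b = mirrored_write(bytearray(b"page data"), mutate_between_copies=True)
print("mirrors identical:", a == b)
```

With the in-flight modification the two mirrors legitimately differ,
which is why comparing them says nothing about data that will never be
read back.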

Mark Roddy
