Bugcheck 0xC4 (0xA0) disk verifier

I’m getting occasional crashes due to the verifier’s disk integrity
checking, and have a couple of questions…

  1. The docs say that the verifier calculates a checksum for each sector
    accessed, and then compares the checksum next time the sector is
    accessed. Is this checksum just stored in memory?

  2. The docs also say that you can get false errors if you do ‘memory
    writes to in-flight write buffers’ or ‘concurrent in-flight reads and
    writes to the same sector’. I don’t ever do the former, but could
    possibly allow the latter if Windows gave me such a request combination.
    My drivers are PV drivers so I just stuff each request I get onto the
    ring… For the problem to occur, would re-ordering of requests be
    required? I don’t reorder anything but there are a few layers below me
    (xen block device backend driver, linux scsi/sas/sata driver, linux
    scsi/sas/sata controller, physical disk) that could potentially re-order
    the requests.

Could the issue happen if Windows sent requests in this order:
a. Write value #1234 to sector
b. Read value from sector
c. Write value #4567 to sector

If (c) were sent before (b) completed, would it be a problem
that (b) returned #1234 when Windows had since calculated the checksum
to be #4567?
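
To make the a/b/c question concrete, here is a toy user-mode model of it. It is only a sketch: I am assuming the checksum is recorded when a write is issued and compared when a read completes, which the docs do not spell out, and `SectorChecksumModel` and the scenario functions are made-up names, not anything from the verifier.

```python
import zlib


class SectorChecksumModel:
    """Toy model: one CRC per sector, kept only in memory (an assumption)."""

    def __init__(self):
        self.crcs = {}

    def record_write(self, sector, data):
        # Assumed behaviour: the checksum is updated when the write is issued.
        self.crcs[sector] = zlib.crc32(data)

    def check_read(self, sector, data):
        # The first access to a sector just records its checksum.
        expected = self.crcs.get(sector)
        if expected is None:
            self.crcs[sector] = zlib.crc32(data)
            return True
        return expected == zlib.crc32(data)


def overlapped_scenario():
    model = SectorChecksumModel()
    model.record_write(7, b"1234")      # (a) write completes
    read_payload = b"1234"              # (b) device picks up the old data
    model.record_write(7, b"4567")      # (c) issued before (b) completes
    return model.check_read(7, read_payload)   # (b) completes last


def serialized_scenario():
    model = SectorChecksumModel()
    model.record_write(7, b"1234")      # (a)
    ok = model.check_read(7, b"1234")   # (b) completes before (c) is sent
    model.record_write(7, b"4567")      # (c)
    return ok
```

In this model the overlapped ordering reports a mismatch even though nothing was corrupted, while the serialized ordering passes. Whether the real verifier behaves this way depends on exactly when it updates its checksum.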

Maybe I am accepting concurrent requests when I shouldn’t be? I could
probably add some code to detect the above pattern and find out, but the
crash has happened once in about a week of testing. Or maybe I really am
corrupting sectors… it’s a bit hard to tell when there are so many
unknowns involved :(
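
The detection code I have in mind would look roughly like this. It is a user-mode sketch of the bookkeeping only, not driver code, and `Request` and `InFlightTracker` are names invented for the illustration. A new request is flagged if its sector range overlaps a request that is still in flight and at least one of the two is a write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    req_id: int
    op: str       # "read" or "write"
    start: int    # first sector
    count: int    # number of sectors

    @property
    def end(self):
        return self.start + self.count   # exclusive end sector


class InFlightTracker:
    def __init__(self):
        self.in_flight = {}

    def submit(self, req):
        """Record req and return the ids of in-flight requests it conflicts with."""
        conflicts = [
            other.req_id
            for other in self.in_flight.values()
            if req.start < other.end and other.start < req.end   # ranges overlap
            and "write" in (req.op, other.op)                    # read/read is harmless
        ]
        self.in_flight[req.req_id] = req
        return conflicts

    def complete(self, req_id):
        self.in_flight.pop(req_id, None)
```

Calling `submit` when a request goes onto the ring and `complete` when its response comes back would be enough to log the pattern if it ever shows up. A linear scan is fine for a sketch; a real driver would want something cheaper per request.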

Thanks

James

The CRCs are kept in memory.

You might not do in-flight buffer writes, but somebody else might. Test
on a disk that does not hold the paging file. My experience
indicates that in-flight writes do happen.
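
One way to see whether somebody is modifying write buffers while they are in flight is to checksum the buffer at submission and again at completion. The following is a hedged user-mode sketch of that idea; `WriteBufferGuard` is a hypothetical name, and a real driver would do this against the request's data buffer rather than a Python bytearray.

```python
import zlib


class WriteBufferGuard:
    """Remembers a CRC of each write buffer from submit until completion."""

    def __init__(self):
        self.submitted = {}

    def on_submit(self, req_id, buffer):
        self.submitted[req_id] = zlib.crc32(buffer)

    def on_complete(self, req_id, buffer):
        """Return True if the buffer is unchanged since it was submitted."""
        return self.submitted.pop(req_id) == zlib.crc32(buffer)
```

A `False` from `on_complete` means the buffer changed underneath the write, which is the first of the two false-error cases the docs mention.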

Your example should never happen with standard filesystem I/O. The
filesystem will hold the write to the sector until the
outstanding read has completed; you would have to write an application
that explicitly violated ordering rules on its own.

Mark Roddy

(In reply to James Harper's message of Fri, Apr 17, 2009 at 10:06 AM.)

Mark,

If the paging system allows anything other than the most recent version of a page to be read back from the pagefile, it is broken. Normally, when a dirty page is flushed to the pagefile, it is marked clean before the write is posted. If the page is modified (dirtied) during the write, it is still considered dirty afterwards, and its written copy will never be read back.
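
The rule described above can be shown with a small model. This is only an illustration of the ordering argument, not how the Windows memory manager is implemented, and the `Page` class and its methods are invented for the example.

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.dirty = True
        self.pagefile_copy = None
        self._pending = None

    def modify(self, data):
        # Any modification marks the page dirty, including one made mid-write.
        self.data = data
        self.dirty = True

    def begin_flush(self):
        # Clear the dirty flag BEFORE posting the write.
        self.dirty = False
        self._pending = self.data

    def finish_flush(self):
        # The write lands with whatever was captured when it was posted.
        self.pagefile_copy = self._pending
        self._pending = None

    def pagefile_copy_usable(self):
        # A dirty page must be written again before it can be discarded,
        # so a stale pagefile copy is never the one that gets read back.
        return self.pagefile_copy is not None and not self.dirty
```

If `modify` runs between `begin_flush` and `finish_flush`, the page ends up dirty again and the stale copy is not usable; without the intervening `modify`, the copy is current and safe to read back.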

Hmm… yes, that is true. So for CRC calculations, the in-flight writes that
occur on page files aren’t a problem. Good point. My experience was
with validating mirrors, and as each mirror could have a different
version of discarded page data, this turned out to be an impossible
task. It should not affect the CRC test, as these pages should never be
read back in.

Mark Roddy

(In reply to the message of Fri, Apr 17, 2009 at 11:01 AM.)