Magnetic media deteriorates over time. All disks use what is called a
“sector sparing algorithm”. SCSI devices expose this mechanism to the host
for fine-grained control; IDE devices do not. Each sector has an ECC that is
calculated and stored each time the sector is written. When the sector is
read, the ECC can correct most errors. If an error is discovered, the sector
is logically removed from use, and the (hopefully corrected) data is written
to a spare sector, which is remapped to take the old sector’s place (causing
invisible fragmentation in the process). This happens transparently with
IDE devices (and with SCSI devices that have the AWRE/ARRE bits set via
MODE SELECT), so the user (and host OS) is typically not aware that sectors
have failed and been remapped from the spare sector pool.

Typically, once the magnetic media starts to fail, the number of bad sectors
rises exponentially, and it is not uncommon for the spare sectors to be
exhausted. When this happens, the device can no longer hide the errors, and
the bad sectors are reported to the host OS, which then removes the sectors
(or clusters) from use by the file system. By this point, though, you’re
already starting to lose actual data. A power failure in the middle of all
this isn’t likely to help either.
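
To make the remapping sequence concrete, here is a toy Python model. It is
only a sketch under stated assumptions: the class and method names
(SparingDisk, mark_failing) are made up for illustration, the ECC step is
assumed to always recover the data while spares remain, and real drive
firmware is far more involved.

```python
class SparingDisk:
    """Toy model of a drive that remaps failing sectors to a spare pool."""

    def __init__(self, num_sectors, num_spares):
        self.data = {}      # physical sector -> contents
        self.remap = {}     # logical sector -> spare physical sector
        # Hypothetical layout: spares sit past the user-visible area.
        self.spares = list(range(num_sectors, num_sectors + num_spares))
        self.failing = set()  # physical sectors with decaying media

    def _physical(self, lba):
        return self.remap.get(lba, lba)

    def write(self, lba, payload):
        self.data[self._physical(lba)] = payload

    def mark_failing(self, lba):
        """Simulate media decay under a logical sector."""
        self.failing.add(self._physical(lba))

    def read(self, lba):
        phys = self._physical(lba)
        payload = self.data.get(phys)
        if phys not in self.failing:
            return payload
        if not self.spares:
            # Spare pool exhausted: the error finally reaches the host.
            raise OSError(f"unrecoverable error at logical sector {lba}")
        # Assumed: ECC recovered the data, so move it and remap silently.
        spare = self.spares.pop(0)
        self.data[spare] = payload
        self.remap[lba] = spare
        return payload


disk = SparingDisk(num_sectors=8, num_spares=1)
disk.write(2, b"boot")
disk.write(5, b"data")

disk.mark_failing(2)
print(disk.read(2), disk.remap)   # host still gets its data; remap is invisible

disk.mark_failing(5)
try:
    disk.read(5)                  # no spares left
except OSError as exc:
    print("reported to host:", exc)
```

The first failure is absorbed without the host noticing; only the second
one, with the pool empty, surfaces as an error.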
STATUS_IMAGE_CHECKSUM_MISMATCH is basically reporting that an image file
does not match the checksum stored in its header. In other words, there is
a corrupt file on the disk. At this point, it’s time to return the drive to
the manufacturer for replacement if it’s still under warranty. (It seems
exceptionally common these days for disks to fail well within the typical
5-year warranty period.)
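
The general idea of a checksum mismatch can be shown in a few lines of
Python. This is an illustration only: it uses CRC-32 from zlib as a
stand-in, which is not the algorithm NT applies to images, and the function
names are hypothetical.

```python
import zlib


def compute_checksum(data: bytes) -> int:
    # CRC-32 used purely as a stand-in checksum for this sketch.
    return zlib.crc32(data) & 0xFFFFFFFF


def matches(data: bytes, stored: int) -> bool:
    return compute_checksum(data) == stored


original = bytes(range(256)) * 2      # stand-in for 512 bytes of file data
stored = compute_checksum(original)   # value recorded when the file was written

corrupted = bytearray(original)
corrupted[100] ^= 0x01                # a single flipped bit, as from a bad sector

print(matches(original, stored))
print(matches(bytes(corrupted), stored))
```

Any change to the data after the checksum was recorded, even one bit, makes
the recomputed value disagree with the stored one.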
Regards,
Paul Bunn, UltraBac.com, 425-644-6000
Microsoft MVP - WindowsNT/2000
http://www.ultrabac.com
-----Original Message-----
From: Taed Nelson [mailto:xxxxx@vertical.com]
Sent: Friday, August 18, 2000 11:56 AM
To: NT Developers Interest List
Subject: [ntdev] Mirrored file system errors and verification.
[While I realize this is not a development issue, it’s fairly low-level and
requires some driver knowledge.]
We ship systems which use mirrored IDE drives for the boot (and only)
partition. The other day, we had a customer whose system would not boot up
(after a power outage, which may be relevant since the system went down
suddenly) due to one of the hard blue screens (usually given for bus and
memory errors) with the error:
0xC0000221 UNKNOWN_HARD_ERROR
However, that error is really STATUS_IMAGE_CHECKSUM_MISMATCH.
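
As an aside, the code can be pulled apart by hand. The sketch below assumes
the standard NTSTATUS bit layout (2-bit severity, customer bit, reserved
bit, 12-bit facility, 16-bit code); the function name and the one-entry name
table are hypothetical and only cover the value from this message.

```python
SEVERITY = ["SUCCESS", "INFORMATIONAL", "WARNING", "ERROR"]

# Only the code discussed in this thread; not a complete table.
KNOWN_NAMES = {0xC0000221: "STATUS_IMAGE_CHECKSUM_MISMATCH"}


def decode_ntstatus(value: int) -> dict:
    """Split a 32-bit NTSTATUS value into its bit fields."""
    return {
        "severity": SEVERITY[(value >> 30) & 0x3],   # bits 31-30
        "customer": bool((value >> 29) & 0x1),       # bit 29
        "facility": (value >> 16) & 0xFFF,           # bits 27-16
        "code": value & 0xFFFF,                      # bits 15-0
        "name": KNOWN_NAMES.get(value, "unknown"),
    }


print(decode_ntstatus(0xC0000221))
```

For 0xC0000221 this yields severity ERROR, facility 0, and code 0x221.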
I had them switch to their mirror, and that booted fine. (Yay, mirroring!)
I’ve investigated the drive a bit, and it had a bunch of bad sectors on it.
So, I’m assuming that the master disk developed a bad sector in an area
required for system booting – prior to the fault tolerant driver running
(and thus, it could not get the good data off the mirror). However, the
disk would boot for me, which I don’t understand.
First, I’m interested in understanding what could have happened. I don’t
understand how a drive can “suddenly” develop bad sectors in the system boot
files. If it weren’t sudden, it seems that the fault tolerant driver would
have fixed the problem earlier.
Secondly, I’m wondering if there is some way that I could make my system
even more fault tolerant to problems such as this. There seems to be no way
to do a fail-over boot since it always chooses the IDE master. (Anyone have
any clever ideas?) I was also thinking that perhaps the fault tolerant
driver could be set to do even more checking?
Thanks for any info or ideas!