Sudden powerdown disk corruption

Loren_Wilton · February 5, 2013, 10:39pm

I’ve got a bunch of vehicle-mounted systems in the field that are based on
NT Embedded. They use an 8G flash disk so that vehicle vibration won’t cause
head crashes.

The systems work fine, with one problem. The spec called for normal power
sequences. When those occur things work fine. However, it turns out that the
machines are subject to anything from 1-2 to 10 sudden power losses every
day, with no warning. This is outside the spec for the unit, but the vehicle
manufacturer simply said “we didn’t mention that because we didn’t think it
was a problem”.

Of course, it is. I’m getting systems returned as non-functional, and it
turns out they are all suffering from disk corruption. I suspect this is due
to a power drop in the middle of a disk write.

I’m fishing for possible workarounds for this problem. There is nothing I
can do to prevent the power losses, and the vehicle manufacturer can’t fix
that either. So I have to live with it and find a way to keep them from
causing disk corruption.

Could I use the UPS “low battery” warning? What will NT do when it sees
that? I can get around 75ms notification from the vehicle when power is
going down before it dumps completely. I know that is real close, but is
there a path where I could get disk writes inhibited before the power fail?
Lost data is preferable to trashed disk directories. (Almost all disk write
activity should be to pre-opened log files, so there generally should be no
directory activity at the time of a power failure.)

Thanks for any suggestions!

Loren

OSR_Community_User · February 6, 2013, 1:01am

What file system are you using on your SSDs? FATxx, for example, is
notoriously unreliable if a “transaction” (what should be a transaction,
but isn’t) does not complete.

When maintaining log files, I discovered the only semi-reliable way to
manage this on FATxx systems was that every message logged looked like
this:

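#include <windows.h>
#include <tchar.h>

#define LOG_PATH _T("app.log")   /* placeholder path */

BOOL DoLog(LPCTSTR Msg)
{
    DWORD n;
    HANDLE h = CreateFile(LOG_PATH, GENERIC_WRITE, FILE_SHARE_READ, NULL,
                          OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return FALSE;
    SetFilePointer(h, 0, NULL, FILE_END);   /* seek to end */
    WriteFile(h, Msg, (DWORD)(_tcslen(Msg) * sizeof(TCHAR)), &n, NULL);
    CloseHandle(h);                         /* close after every message */
    return TRUE;
}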

Note: testing for success of operations other than opening the file is
left as An Exercise For The Reader.

I once destroyed a disk because each open required a couple seek
operations, and in a year of development I ran the seek mechanics through
five or ten years of ordinary usage. But with SSDs, this may not be an
issue, depending on how the wear-leveling algorithms work.

But it really is difficult to answer a question for which critical
information is missing.
joe

NTDEV is sponsored by OSR

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · February 6, 2013, 1:38am

Also: are you sure the corruption is due to software? Vehicular systems
are notorious for things like power spikes, or failures that can be
portrayed as |||||||||||||...|...|.|.|...|.|.|...|.|.|
where the | indicates power present and . is power absent. I knew someone
who had worked in the auto industry and had a real set of horror stories
about power and temperature problems they were encountering during the
initial digital revolution of the 1980s. His previous experience had been
in “battle-hardened” military systems, and he said these were easy
compared to the automobile industry.

One possible way of extending that 75ms could be a big MF capacitor. I
once did an embedded system with about 50,000mf in the power supply (it
was a low-power system-on-a-board, and I had a couple hundred ms to “safe”
the system. It had no secondary storage, and this was well before flash
drives). I had about a second before the cutoff circuitry cut the power to
zero (I didn’t design that circuit, I found it in an application note.
The problem was that like most circuitry of that era it ran just dandy at
5.0 volts, but erratically at 4.3 volts, so the power curve was a slow
decay from 5.0 to about 4.7 volts, then complete cutout to zero; startup
was that power did not appear until it reached about 4.8 volts, with the
“reset” line held low until somewhere between 4.9 and 5.0 volts had been
present for more than a couple seconds.) The lesson here is that if you
continue writing to the SSD at below-spec voltages, it doesn’t matter that
the CPU and write circuitry continue to run; the data may not arrive at
the SSD intact, or the voltage may be too low to get it written reliably
(and when the address-select circuits aren’t working right, all bets are
off). I was never very good at analog circuitry, so I either bought power
supplies or relied on serious experts to give me designs.
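
As a rough software model of the cut-in/cut-out behaviour described above (the real thing was an analog circuit; the thresholds are the approximate figures from this post, and the function name is made up for illustration), the hysteresis might be sketched like this:

```c
#include <stdbool.h>

#define CUT_OUT_VOLTS 4.7   /* falling supply: drop the rail below this     */
#define CUT_IN_VOLTS  4.8   /* rising supply: enable the rail at or above   */

/* Return the new rail state given the previous state and the input voltage.
   The gap between the two thresholds keeps the rail from chattering on and
   off while the supply hovers near a single trip point. */
bool rail_enabled(bool was_enabled, double volts)
{
    if (was_enabled)
        return volts >= CUT_OUT_VOLTS;
    return volts >= CUT_IN_VOLTS;
}
```

At 4.75 volts the rail stays on if it was already on, and stays off if it was off. The delayed release of the reset line is not modelled here.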

So before you go searching for a software solution, make sure the hardware
is going to function correctly during those 75ms.

(A friend designed a disk controller. It had special power-detection
circuitry, and would retract the heads from the disk cartridge on power
failure. What he forgot to do was turn off power to the write head. So
if there was a power failure during a write operation, the heads wrote a
spiral as they retracted outwards. The disk was unrecoverable. He could
handle, in the added bits on each track, recovery from an 8-bit burst
error. He later computed that based on the head retraction, the
decay-spiral corrupted 11 bits on each sector it passed over. There was
no software solution for this; several hundred controllers required
hardware mods). So consider my log file example of the previous post and
make sure that if you are trying to do some kind of commit or rollback in
the power-failure window that the hardware itself will be reliable for the
75ms, or whatever you end up with. Just because the CPU can continue to
fetch and execute instructions doesn’t guarantee that your SSD will have
the power (in watt-seconds at nominal operating voltage) to successfully
write the data.
joe

Loren_Wilton · February 6, 2013, 1:40am

> What file system are you using on your SSDs?

Apologies, meant to mention that and forgot.
This is NTFS. I agree, FAT has problems in this sort of environment.

On three machines I’ve been looking at for the last few days the corruption
was fixed simply by mounting the disk on another machine and letting
chkdsk do its thing. However, the corruption had been bad enough that the
machine couldn’t boot far enough to run chkdsk itself. In all cases it
seems that the system32 directory had been corrupted and made inaccessible, or in
one case a critical DLL had gone missing from system32. Why that directory
should be corrupted in all three cases is unclear, since I can’t think of
anything that should have likely been accessing it at the times of failure.

Loren

David_R_Cattley · February 6, 2013, 9:48am

> big MF capacitor

Funny how this is a suggestion that someone unfamiliar with the SI unit
named for Mr. Faraday might still find instantly recognizable as a
‘quantity’, and large at that.

Cheers,
Dave Cattley.

Peter_Viscarola_OSR · February 6, 2013, 10:05am

I know from working with amateur radio equipment that there are power interruptions in cars all the time. The power is an incredible mess, in fact, with huge spikes and sags that look effectively like complete drop-outs. It’s lovely.

As Dr. Joe said, I think you’re looking at a hardware solution here, not a software one. I was also going to suggest a capacitor – You might not even need one that’s too very large.

Would that sort of approach meet your requirements?

Peter
OSR

Alex_Grig · February 6, 2013, 10:46am

Another problem with flash drives is that any interruption during a flash block write will corrupt the whole flash page, which could span multiple allocation units unrelated to the current write operation.
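
To make the "multiple allocation units" point concrete, here is a small sketch of which clusters share one flash block with the cluster being written. The sizes are assumptions picked for illustration (4 KiB clusters, a 128 KiB flash block, cluster 0 aligned to a block boundary), and a real drive's controller may remap blocks so the neighbours are not this predictable.

```c
#include <stdint.h>

/* Hypothetical geometry, chosen only for illustration. */
#define CLUSTER_BYTES      4096u            /* file system allocation unit */
#define FLASH_BLOCK_BYTES  (128u * 1024u)   /* flash block size            */

typedef struct {
    uint64_t first_cluster;   /* first cluster sharing the flash block */
    uint64_t last_cluster;    /* last cluster sharing the flash block  */
} cluster_range;

/* Return every cluster that lives in the same flash block as `cluster`,
   assuming cluster 0 starts on a flash-block boundary. */
cluster_range clusters_at_risk(uint64_t cluster)
{
    const uint64_t per_block = FLASH_BLOCK_BYTES / CLUSTER_BYTES;  /* 32 */
    cluster_range r;
    r.first_cluster = (cluster / per_block) * per_block;
    r.last_cluster  = r.first_cluster + per_block - 1;
    return r;
}
```

Under these assumptions a write to cluster 100 puts clusters 96 through 127 at risk, whether or not they belong to the file being written.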

A good suggestion would be to have a (not so MoFo) capacitor dedicated to the flash. When the power drops, make sure the flash storage has it for long enough to finish writing a page.

Also, are you getting power from after the switch, or directly from the battery?

OSR_Community_User · February 6, 2013, 12:46pm

Generally, “MF” stands for MicroFarad, but “MF” stands for something else,
so a “Big MF” capacitor could mean one of two things. I leave the second
thing as an Exercise For The Reader. (Both translate to the same idea, by
the way).

I once figured out that using low-voltage tantalum capacitors (7.5V), it
would be possible to construct a 1-Farad capacitor for under $5,000 (1977
$) and it would fit in a relatively small space.

OSR_Community_User · February 6, 2013, 1:42pm

> I once figured out that using low-voltage tantalum capacitors (7.5V), it
> would be possible to construct a 1-Farad capacitor for under $5,000 (1977$)
> and it would fit in a relatively small space.

Capacitor technology has advanced a bit: a 10 Farad 2.5V super capacitor is about $5 now. For example, see https://www.sparkfun.com/products/746. For those of us not doing electronics all day, a comment on that page says a typical 2.8 amp hour 1.5 Volt AA battery is the equivalent of 10,080 Farads.
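
For what it's worth, the 10,080 figure matches the cell's charge in coulombs (2.8 Ah times 3600 seconds per hour). Since a farad is a coulomb per volt, comparing stored energy is probably the fairer test. A quick sketch of the arithmetic, treating the cell voltage as constant (which is only an approximation) and using function names made up for this example:

```c
/* Charge held by a cell rated in amp-hours, in coulombs (1 Ah = 3600 C). */
double amp_hours_to_coulombs(double amp_hours)
{
    return amp_hours * 3600.0;
}

/* Energy in joules delivered by a cell at a roughly constant voltage. */
double cell_energy_joules(double amp_hours, double volts)
{
    return amp_hours_to_coulombs(amp_hours) * volts;
}

/* Energy in joules stored in a capacitor charged to `volts`: E = C*V^2/2. */
double capacitor_energy_joules(double farads, double volts)
{
    return 0.5 * farads * volts * volts;
}
```

With these numbers the AA cell holds about 15,120 joules, against 31.25 joules for a 10 F capacitor charged to 2.5 V.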

Jan

David_R_Cattley · February 6, 2013, 1:43pm

> 1-Farad capacitor for under $5,000

Now < $100 US and available at just about any car stereo shop or ‘removable’
from the trunk of most thumping Honda Accords tooling about the
neighborhood.

Funny what drives technology prices down sometimes…

Cheers,
Dave Cattley

Peter_Viscarola_OSR · February 6, 2013, 1:52pm

Today, 1F 2.7V super capacitor… less than $3, quantity one. No, definitely not tantalum.

https://www.sparkfun.com/products/10068

Peter
OSR

OSR_Community_User · February 6, 2013, 6:06pm

Utterly cool! Back when I was 12, I built a strobe light. The goal was
to allow me to take a series of photos of images like the parabolic
bounces of a ping-pong ball. To get the required voltage, I put three
450V metal-can electrolytic capacitors in series, which gave me 1350V
dielectric. I had a transformer that was rated at 1200 volts output.

I left it running while I went to get a root beer. I heard a loud bang,
and the room I had been in was filled with smoke. One of the capacitors
had exploded, covering the area with tar-mixed-with-paper-shreds and
aluminum confetti. There was a hole in the (drywall-finished) ceiling
where the can hit it.

My uncle took the time to explain that AC voltage was measured as RMS,
and the peak voltage after rectification was sqrt(2) times the RMS
voltage, and that was what the capacitors were seeing. My 1350V capacitor
was seeing close to 1700V.
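
The arithmetic behind that, as a small sketch (function names are made up for illustration, and it assumes an ideal sine wave with no rectifier drop):

```c
/* sqrt(2), written out so the example needs no math library. */
static const double SQRT2 = 1.4142135623730951;

/* Peak voltage of a sine wave whose RMS value is `v_rms`. */
double peak_from_rms(double v_rms)
{
    return v_rms * SQRT2;
}

/* Nominal rating of `count` identical capacitors in series, assuming the
   voltage divides evenly between them. */
double series_rating(int count, double volts_each)
{
    return count * volts_each;
}
```

peak_from_rms(1200.0) comes to about 1697 volts, against a series_rating(3, 450.0) of 1350 volts.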

This was in the days before silicon diodes, and to get the current
requirements I had to use a mercury-vapor rectifier tube (which I still
have, 54 years later…). To build the RC circuit to change the flash
rate, I had a “potentiometer” made from 10W ceramic resistors (I needed
about 7W capacity given the high charging current) and a high-voltage
selector switch (the whole potentiometer unit was about eight inches in
diameter and about five inches in depth behind the panel. I still have
that, too). I spent months on that design, and most calculations were done
on a slide rule. I got everything right except the transformer output
voltage.

We found the can several years later, when we moved a shelving system away
from the wall. It had hit the ceiling and flown about fifteen feet across
the room.

To return to the OP’s question, note that the ad warns to not
reverse-polarize the capacitor. In auto electric systems, there are not
only massive spikes that could destroy the capacitor, but back-EMF from
inductive loads (window motors and the starter being the most common, but
not only, source of such spikes) will generate massive *inverse* spikes.
Typically you protect against positive spikes with Zener diodes and
inverse spikes by a reversed diode. At this point, considerations like
the inductance of the protection circuitry can still allow these spikes
through. Power supply design in such environments is not for the
faint-of-heart (and while I understand and appreciate the underlying
problems, hearing someone who has done this describe the details gives me
the same feeling as listening to Tony Mason give one of his
file-system-filter talks: you need a whole whopping lot of knowledge and
experience just to sit at the table, never mind playing the game).
joe

Loren_Wilton · February 6, 2013, 6:31pm

> I know from working with amateur radio equipment that there are power
> interruptions in cars all the time. The power is an incredible mess, in
> fact, with huge spikes and sags that look effectively like complete
> drop-outs. It’s lovely.

If you think road vehicles are bad, try working in the railroad locomotive
power environment sometime. 20KV 5ms spikes, both positive and negative, on
the 74VDC lines are commonplace. (Think about what happens when you switch
the 0.1 ohm resistor across the 600VDC generator, or when you disconnect it,
and you do this with something like 200 feet of cable wrapped all around the
vehicle.)

> As Dr. Joe said, I think you’re looking at a hardware solution here, not a
> software one. I was also going to suggest a capacitor – You might not
> even need one that’s too very large.
>
> Would that sort of approach meet your requirements?

I may need to put some filtering on the line in the end. That costs money,
and a lot of it, because this is a high vibration high reliability
environment. This isn’t just a big cap and a tie-wrap to hold it. So I’m
trying to avoid that if I can. (This isn’t a car on a road. This is a ride
vehicle in a theme park attraction.)

Even with a big cap, I still need some sort of power-fail indication into
the PC so I can make sure writes stop before the power does. Can I use any
of the built-in UPS “battery fail” indications to get things to shut down,
or am I going to have to roll some code of my own to somehow try to block
writes on the system? What does NT *do* with the “battery fail” indication
from a UPS?

Loren

Loren_Wilton · February 6, 2013, 6:37pm

> Another problem with the flash drives that any interruption during the
> flash block write will corrupt the whole flash page which could span
> multiple allocation units, unrelated to the current write operation.

Ah! I didn’t know that, but that helps explain what I’ve been seeing.

> A good suggestion would be to have a (not so MoFo) capacitor dedicated to
> the flash. When the power drops, make sure the flash storage has it for
> long enough to finish writing a page.

This is something I might be able to build into the PC units, I’ll
investigate that. Thanks.

> Also, are you getting power from after the switch, or directly from the
> battery?

There is no battery. The vehicles are powered by 48V track power, which runs
a 28VDC filtered supply to run the onboard PLCs and the like. I’m told
this is a very well behaved supply with monitoring and logging if the
voltage varies by more than a tenth of a volt or so. I can’t prove that, but
it does seem to be moderately good. I haven’t had anything fried by
overvoltage spikes yet. The problem is track power comes and goes randomly,
so 28V also comes and goes, a little more cleanly, but just as abruptly.

I still think I need some way to detect that I’ve lost input power. I can
hold it up a short while with a cap, but I still need to tell the software
to stop writing before the cap dies.

Loren

Peter_Viscarola_OSR · February 6, 2013, 7:14pm

Cool! Why don’t *I* get to work on cool shit like that? The most interesting thing *I* get to work on is a device that inserts mail into envelopes.

I’m not sure if you’re up on capacitors, so just in case… there are issues with capacitors, right? Their voltage sags quickly, and you need to be careful how they’re charged. In ANY case, you can’t just attach some caps across the power supply and expect them to live.

In my experience, Windows itself doesn’t do ANYTHING. The UPS manufacturers have a program that runs and, when power goes to a low level, it just initiates a normal shutdown.

You *can* hold writes for a while… Think of the accelerometers in laptops that detect if the laptop is being dropped too hard and (hold disk I/Os and) park the disk heads before the device drops. We’ve designed those systems here at OSR… I’ll be happy to give you a few free hints based on our experience if you contact me off-line.

Peter
OSR

Alex_Grig · February 6, 2013, 7:31pm

>In my experience, Windows itself doesn’t do ANYTHING.

Windows 7 has all that built in. When the UPS tells it the battery is down to some %, Windows will do whatever it’s set up to do.

If you have 48 VDC, a 10000 uF capacitor stores 11 Joules which will sustain a 5W device for a while.
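
A quick way to check that figure and turn it into a hold-up time. This is only a back-of-envelope sketch: it assumes an ideal constant-power load with no converter losses, the function names are made up for the example, and the 36 V minimum input used in the note below is an invented figure, not something from this thread.

```c
/* Energy in joules stored in a capacitor charged to `volts`: E = C*V^2/2. */
double stored_energy_j(double farads, double volts)
{
    return 0.5 * farads * volts * volts;
}

/* Seconds a constant-power load can run while the capacitor sags from
   v_start down to v_min, the lowest voltage the load can still use.
   Returns 0 for a non-positive load or when v_min is not below v_start. */
double holdup_seconds(double farads, double v_start, double v_min,
                      double load_watts)
{
    if (load_watts <= 0.0 || v_min >= v_start)
        return 0.0;
    return 0.5 * farads * (v_start * v_start - v_min * v_min) / load_watts;
}
```

stored_energy_j(0.01, 48.0) comes to about 11.5 J. If the load stops working once the input falls to 36 V, holdup_seconds(0.01, 48.0, 36.0, 5.0) gives roughly one second, since only the energy above the minimum voltage is usable.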

But seriously, you need a cheap low power ARM based board instead of full-featured x86. Then you can use a backup battery.

Loren_Wilton · February 6, 2013, 7:45pm

> But seriously, you need a cheap low power ARM based board instead of
> full-featured x86. Then you can use a backup battery.

I may, but I don’t want to have to rewrite around a half million lines of
NT-based applications, much less design device drivers for a handful of
proprietary boards I don’t own and whose vendors aren’t going to be much
interested in an effort like that to solve my disk corruption problem. Not
to mention the minor cost hit to the company to replace a few hundred
installed units in the field, all over the world.

No, I think I’m better off just finding a workaround for the power drop
problem on this little industrial PC I’m using.

Loren

anton_bassov · February 6, 2013, 7:46pm

> When the power drops, make sure the flash storage has it for long enough to finish writing a page.

Please note that a computer does not react to a power failure as a single entity that fails all at once. Instead, different pieces of hardware may fail at different rates, depending on the specifics of each. RAM is one of the most fragile things in this respect - it is going to lose its contents straight away. However, the hard disk may stay operational for a little bit longer.

I hope by now you already see the potential problem here - if the disk is still operational and RAM is not, you may be writing data that has already become binary garbage due to the power failure, effectively ending up with long-term data corruption.

Therefore, what you are offering is not a solution at all - in fact, it is just a recipe for a disaster. This problem has to be solved at the level of the software, rather than hardware…

Anton Bassov

Loren_Wilton · February 6, 2013, 8:02pm

> Therefore, what you are offering is not a solution at all - in fact, it is
> just a recipe for a disaster. This problem has to be solved at the level
> of the software, rather than hardware…

I think I’ve suggested that. Actually it needs to be solved in both arenas:
I need to get a ‘power good’ line out of the main power supply that will go
false some determinate time before power drops below some specific voltage,
and then I need some software to do something with that power good signal.

I’m negotiating with the vehicle vendor on getting the power good signal and
the guarantee on how long it is good for (which will be not many
milliseconds). That isn’t something that can be solved in this forum.

Now, what can I do with the signal to make writes stop fairly quickly?

I could feed it into my logging program and tell it to not do any more
writes. But there are some moderately big reads that might be happening from
other components. What if the write is queued behind 200ms of reads? If I’m
lucky power will drop before the write starts. If I’m not lucky, well, then
I’m not lucky. So feeding the signal to the logging program isn’t a very
good idea. Not to mention that NT may decide to do a write or two of its own
for some reason.
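
For reference, the application-level version of that idea might look something like the sketch below. All the names are hypothetical, and how the power-good signal actually reaches on_power_fail() is left open. As noted above, it only gates this program's own writes: it does nothing about writes already queued in the storage stack or writes issued by NT itself.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Set once by whatever receives the (hypothetical) power-fail signal. */
static atomic_bool g_power_failing = false;

/* Called from the power-fail notification path; must be cheap. */
void on_power_fail(void)
{
    atomic_store(&g_power_failing, true);
}

/* Append one record unless power is failing. Returns true only if the
   record was written and flushed to the OS; false means it was dropped
   on purpose or the write failed. */
bool log_record(FILE *log, const char *msg)
{
    if (atomic_load(&g_power_failing))
        return false;                 /* lose the data, keep the disk */
    if (fputs(msg, log) == EOF)
        return false;
    return fflush(log) == 0;          /* reaches the OS, not necessarily the disk */
}
```

Once on_power_fail() has run, every later log_record() call returns false without touching the file.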

Loren

James_Harper · February 6, 2013, 8:26pm

If you get a ‘power not good’ signal then is doom certain? Or could you get a ‘power not good’ signal but the interruption is brief enough that there is no reset? If doom is certain then it wouldn’t be unreasonable to trigger a BSoD (with the system configured to not write a crash dump and to auto reset). That should very quickly prevent further writes from happening and shut the system down (or does a BSoD always result in a reboot?)

Are there any knobs you can tweak to minimise disk queue length?

Another option would be to use something like deepfreeze to guarantee getting a good reboot even if part of the SSD is now corrupt (depending on the actual storage layout…). With enough memory you might even be able to use a ramdisk as the ‘working area’ for deepfreeze, if that is possible. Logging could be done to an SD card set to ‘optimise for quick removal’ to keep write queues short. You might even be able to get a USB flash drive and rig up a battery to power it independently to ensure write operations get completed (although that sounds a bit fragile for the environment you describe).

James