Is calling FlushFileBuffers(hVolume) necessary?

Our application services remote clients by reading and writing files on NTFS. Our service opens all files with FILE_FLAG_WRITE_THROUGH set and no other flags set (in particular, FILE_FLAG_NO_BUFFERING is not set, so NTFS manages the file cache). Our files reside on a non-boot partition.
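
(For illustration, a minimal sketch of how such an open might look; the path and helper name are placeholders, not our actual code:)

#include <windows.h>

// Open a file write-through: each WriteFile completes only after the data
// has been sent to the device, but NTFS still keeps the file in its cache.
HANDLE OpenWriteThrough(const wchar_t *path /* placeholder */)
{
    return CreateFileW(path,
                       GENERIC_READ | GENERIC_WRITE,
                       FILE_SHARE_READ,
                       NULL,
                       OPEN_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
                       NULL);
}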

Occasionally, our service suspends all file writes to the volume, and uses various SAN-based methods (e.g. fracture clone or snapcopy) to create a point-in-time copy of the LUN containing the volume. This copy of the LUN is used in our backup process.

Our question is: do we need to call FlushFileBuffers with a handle to the volume after we suspend writes and before we create the copy? FlushFileBuffers can take a long time to complete, and our application is quiesced during that time, so we would prefer to eliminate the call. Our research indicates that the call should not be necessary, although the language in Microsoft’s documentation is ambiguous.
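
(For concreteness, the call in question is essentially the following; the drive letter is a placeholder and error handling is reduced to the minimum:)

#include <windows.h>

// Flushing a *volume* handle asks NTFS to write out all modified data
// and metadata for the whole volume, not just one file's buffers.
BOOL FlushVolume(void)
{
    HANDLE hVolume = CreateFileW(L"\\\\.\\E:",   // placeholder data volume
                                 GENERIC_WRITE,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE,
                                 NULL,
                                 OPEN_EXISTING,
                                 0,
                                 NULL);
    if (hVolume == INVALID_HANDLE_VALUE)
        return FALSE;

    BOOL ok = FlushFileBuffers(hVolume);         // this is the slow call
    CloseHandle(hVolume);
    return ok;
}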

Empirically, even in carefully controlled tests, ProcMon running during our call to FlushFileBuffers shows metadata ($MFT) being written to disk. We’re puzzled by this behavior and welcome any insights.

> Our question is, do we need to call FlushFileBuffers

Not only do you need a flush, you also need the full VSS state machine to notify the DB software that the snapshot is being created.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Well, there are various issues here. You should really go the route that Maxim suggested and work with VSS.

First, there are certain things that your service doesn’t control that might require flushing (think about directory entries; the file system itself uses caching for those). So if you don’t flush the whole volume, you might miss some of the metadata and the volume won’t be in a consistent state (meaning your snapshot is corrupted from an NTFS perspective).

Second, even if the volume is NTFS-consistent, it is possible that some applications have not flushed their data, so their state is inconsistent. If you only do this to access certain files on the volume, files that YOUR PRODUCT owns, then you might be safe, since you can guarantee that the program state is saved to files (because presumably you’ve told your product to flush all its state) and that the data is written to disk (this is the only place where FILE_FLAG_WRITE_THROUGH helps).

If you touch any other application’s files (and the OS is just another application from this perspective, so any system files fall under this), then you must use VSS to let the application know to flush its data. Moreover, not only might state fail to be preserved across different files, it is also possible that an individual file isn’t in a consistent state (if an application opens a file without sharing read access, it doesn’t expect anyone else to be reading the file, so it doesn’t have to keep it consistent on disk). So this is the case for VSS integration; hope it makes sense…
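
(To make the VSS suggestion concrete, here is a bare-bones sketch of the documented requester-side sequence; all HRESULT checking, writer metadata processing, and BackupComplete handling are omitted, so treat it as an outline rather than working backup code:)

#include <windows.h>
#include <cguid.h>      // GUID_NULL
#include <vss.h>
#include <vswriter.h>
#include <vsbackup.h>
// link with vssapi.lib

void SnapshotViaVss(wchar_t *volume /* e.g. L"E:\\" */)
{
    CoInitializeEx(NULL, COINIT_MULTITHREADED);

    IVssBackupComponents *bc = NULL;
    CreateVssBackupComponents(&bc);
    bc->InitializeForBackup();
    bc->SetBackupState(false, false, VSS_BT_FULL, false);

    IVssAsync *async = NULL;
    bc->GatherWriterMetadata(&async);   // writers describe themselves
    async->Wait(); async->Release();

    VSS_ID setId, snapId;
    bc->StartSnapshotSet(&setId);
    bc->AddToSnapshotSet(volume, GUID_NULL, &snapId);

    bc->PrepareForBackup(&async);       // writers get ready to freeze
    async->Wait(); async->Release();

    bc->DoSnapshotSet(&async);          // freeze, flush-and-hold, snapshot
    async->Wait(); async->Release();

    bc->Release();
    CoUninitialize();
}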

Thanks,
Alex.


Hi Maxim, Alex:

I work with Tom. Thanks for your quick replies.

Please note that for all practical purposes our service has exclusive ownership of the volume. Ours is the only application reading or writing to the disk. There is no other application installed. Also, this volume is not the boot volume. It’s created for our exclusive use.

Also note our backup method is proprietary and does make use of VSS. In effect though, we act as both the Writer and Provider in that design. Our application calls the APIs or command line instructions provided by the SAN vendor to manage LUNs and orchestrate the creation of the point-in-time copies for backup.

So the question comes down to this: if we use FILE_FLAG_WRITE_THROUGH for all file I/O, can we be assured that, after we pause all file access, all user data and metadata have been written to disk, so that we can then create a valid snap copy or fracture a clone? Or do we have to first call FlushFileBuffers with a handle to the volume? And if we must make that call, what conditions would cause it to take minutes to complete (keeping in mind the constraints under which our application runs)?

Thanks again for your input.

Mike

Key correction to the above statement. It should have read:

Also note our backup method is proprietary and does NOT make use of VSS.

Mike

I’m sorry but I don’t have a clear answer…

I don’t remember exactly what FlushFileBuffers does (and I don’t have a debugger handy to look at it :( ). But in general, if nobody else is writing to the volume, the only things I can think of are file system metadata operations; those would need to be flushed as well. I’m not sure how long that would take, or whether that’s what you’re seeing.

For debugging purposes I’d suggest closing all the files you have open on the volume, then trying FlushFileBuffers and seeing whether it still takes this long. It would be interesting to know whether it’s your files or some other data causing the delay.
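
(A trivial way to measure that, with the volume path as a placeholder:)

#include <windows.h>
#include <stdio.h>

// Time a volume flush to see where the seconds actually go.
void TimeVolumeFlush(const wchar_t *volPath /* e.g. L"\\\\.\\E:" */)
{
    HANDLE h = CreateFileW(volPath, GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return;

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    FlushFileBuffers(h);
    QueryPerformanceCounter(&t1);
    printf("flush took %.3f s\n",
           (double)(t1.QuadPart - t0.QuadPart) / freq.QuadPart);
    CloseHandle(h);
}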

If you really have exclusive ownership of this volume, you could try dismounting it and taking the snapshot while it’s dismounted (at that point the snapshot is basically a plain copy). At least you could see how long the dismount takes and compare that with FlushFileBuffers.
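
(Sketch of that experiment using the standard lock/dismount FSCTLs; the lock will fail if anything else still has handles open on the volume, and hVolume would be opened as in the flush sketch earlier in the thread:)

#include <windows.h>
#include <winioctl.h>

// Lock the volume (fails if other handles are open), then dismount it.
// Dismounting forces all cached data for the volume to be written out.
BOOL LockAndDismount(HANDLE hVolume)
{
    DWORD bytes;
    if (!DeviceIoControl(hVolume, FSCTL_LOCK_VOLUME,
                         NULL, 0, NULL, 0, &bytes, NULL))
        return FALSE;
    return DeviceIoControl(hVolume, FSCTL_DISMOUNT_VOLUME,
                           NULL, 0, NULL, 0, &bytes, NULL);
}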

Finally, to really investigate this I would issue a FlushFileBuffers and look at what’s actually being written to the volume and where those writes go.

Thanks,
Alex.

What kind of disk drives are you using?

We found situations in the past where flushes combined with certain kinds of non-cached I/O caused extended delays on the underlying disk drives. I’d be tempted to monitor how much time the disk drives spend processing those requests, because you may well be looking at a feature of your disk drives and not of the file system itself.
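
(One way to watch that from user mode is the PhysicalDisk latency counters; a rough sketch via PDH, where the instance name "0 E:" is only an example and would need to match your system:)

#include <windows.h>
#include <pdh.h>
#include <stdio.h>
// link with pdh.lib

// Sample "Avg. Disk sec/Write" once a second while the flush runs.
void SampleDiskWriteLatency(void)
{
    PDH_HQUERY query = NULL;
    PDH_HCOUNTER counter = NULL;
    if (PdhOpenQueryW(NULL, 0, &query) != ERROR_SUCCESS)
        return;
    PdhAddEnglishCounterW(query,
        L"\\PhysicalDisk(0 E:)\\Avg. Disk sec/Write",  // example instance
        0, &counter);

    PdhCollectQueryData(query);          // first sample for the rate counter
    for (int i = 0; i < 10; ++i) {
        Sleep(1000);
        PdhCollectQueryData(query);
        PDH_FMT_COUNTERVALUE v;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &v);
        printf("avg sec/write: %.6f\n", v.doubleValue);
    }
    PdhCloseQuery(query);
}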

There is plenty of other information that NTFS does not flush during a non-cached I/O operation but that will be flushed when you flush the entire volume. There is also the heritage of POSIX semantics (“sync; sync; sync; halt”) that may be playing a subtle game here as well. Of course, non-cached I/O is only a hint to the file system, and it can (and will) be demoted to write-through under some circumstances (this is how compressed files behave, for example). That’s probably good enough for your purposes as well.

But I suspect this is a disk drive level issue. The Faustian bargain you make with the disk drive is that you typically get either *correct* (transactional) semantics or *fast* (non-transactional) semantics. I know that Microsoft recommends using SCSI disk units with SQL precisely because many cheap SATA drives don’t even bother to implement “write-through”.

This is a good general write-up of the issue with an SQL focus: http://disruptivesql.wordpress.com/2012/05/08/sata-and-write-through/

Indeed, one of the real motivations for something like ReFS is the harsh reality that disk drives suffer from Byzantine failure conditions (http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx is a nice description of some of these issues). I don’t think they use the term “Byzantine” but that’s really what we’re talking about here - the disk drives lie to you.

When you stop trusting the underlying disk drive, the world gets very interesting - I participated in a design for a proposed system like this many years ago. It was quite a challenge to figure out how to deal with the myriad sources of failures that exist in the real world.

But this is probably beyond your concern. My observation: you’re not getting anything more from your paranoid flush than you are probably already getting from the underlying drive, other than long pauses. But of course that would depend upon the characteristics of the disk drives you are using.

Tony
OSR

> might miss some of the metadata so the volume won’t be in a consistent state (meaning your snapshot is corrupted from an NTFS perspective).

Flush is not enough. There will be a window between the flush and the reading of the backup copies.

Flush-and-hold is needed, which is provided by VSS.

Also, an FS-level flush-and-hold has nothing to do with transaction boundaries in DB engines. To ensure that the data copy that is on disk at the moment of snapshot creation (== the snapshot) contains a transaction-consistent DB, the notion of VSS writers was invented.
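
(The writer side is a CVssWriter subclass whose freeze/thaw callbacks bracket the snapshot; a skeleton only, with Initialize/Subscribe and metadata reporting omitted:)

#include <windows.h>
#include <vss.h>
#include <vswriter.h>
// link with vssapi.lib

// Minimal writer skeleton: the DB engine would flush and pause its own
// transactions in OnPrepareSnapshot/OnFreeze and resume them in OnThaw.
class CMyDbWriter : public CVssWriter
{
public:
    bool STDMETHODCALLTYPE OnIdentify(IVssCreateWriterMetadata *pMetadata)
        { return true; }                       // report components here
    bool STDMETHODCALLTYPE OnPrepareBackup(IVssWriterComponents *pComponents)
        { return true; }
    bool STDMETHODCALLTYPE OnPrepareSnapshot()
        { return true; }                       // flush application state
    bool STDMETHODCALLTYPE OnFreeze()
        { return true; }                       // pause at a txn boundary
    bool STDMETHODCALLTYPE OnThaw()
        { return true; }                       // resume writes
    bool STDMETHODCALLTYPE OnAbort()
        { return true; }
};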

Without these, if you restore your backup, it will behave the same way as if the computer had undergone a sudden power failure or blue screen.

Also, VSS provides a way for writer-specific post-processing of the backup image just before it is handed to the backup app, for things like resetting the “Windows did not shut down normally” flag and various AD work (to allow restoring a DC from the backup without disturbing DC-to-DC replication). This is called “auto-recovery” and also “post final commit”.

Also, VSS provides some restore facilities, which IIRC can allow some post-restore processing as well.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Tony - the “disks” we are writing to are LUNs presented to our hosts by a SAN over a fibre channel HBA. The vendor and model of SAN will vary, but all are “industrial grade.” Those expensive storage devices tend to feature gobs of (write) cache. When our app “writes to disk” it is really writing to this layer of cache. Hopefully, we are getting both *fast* and *correct* (our customers are certainly paying for that). And we long ago stopped trusting the actual disk drives – hopefully, the RAID behind our LUNs protects us from most actual disk failures. Our overall experience has been quite good in this area.

Maxim - Sorry this wasn’t stated more clearly in the original message, but our software *is* the database engine, and I believe we are doing what you would call flush-and-hold: we pause all file activity at a transactional boundary, flush the file system, then use SAN-specific software to create a point-in-time copy, and only after that is complete do we resume file I/O. (I don’t want to get into a debate about VSS, but we did evaluate it enthusiastically and wrote a writer for our database. While it met our needs in concept, the provider software from SAN vendors was weak, and the API exposed only the lowest common denominator of functionality at the storage end. Also, very few backup vendors handle the generic case of a VSS writer; most see VSS only as a means of backing up SQL Server and Exchange Server data stores.)
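
(In outline, the sequence looks like this; SuspendAllWrites, ResumeAllWrites, and RunSanSnapshotTool are stand-ins for our internal code and the vendor tools, not real names:)

#include <windows.h>

extern void SuspendAllWrites(void);    // pause at a transaction boundary
extern void ResumeAllWrites(void);
extern BOOL RunSanSnapshotTool(void);  // vendor CLI/API for the LUN copy
extern BOOL FlushVolume(void);         // as sketched earlier in the thread

BOOL TakePointInTimeCopy(void)
{
    SuspendAllWrites();                // quiesce the database engine
    BOOL ok = FlushVolume();           // the call we would like to eliminate
    if (ok)
        ok = RunSanSnapshotTool();     // fracture clone / snapcopy
    ResumeAllWrites();
    return ok;
}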

The advice is still valid: find a way to monitor what commands are being sent to your disk system via the SAN during the flush, and see whether any of them will conflict with your backup.
