Sorry to bring up such a confusing topic!
I still stand behind Peter W’s logic. From the upper level, all that matters
is that A goes before B; what the lower level does is up to it. Whether there
is partial ordering or whatever is immaterial. Flushing will race with other
I/O too…
Also, I don’t buy the two-or-more-disks argument. A disk could not conceivably
think about how many more disks would be involved in a sane transactional
system. Also, a flush to disk and an ack from the disk are all that matter to
the layer that is flushing; what the disk does after that is irrelevant. So
essentially reordering/caching everything is fine, as long as it maintains
consistency: if A is modified to A’, you get A’. It is up to the disk to keep
it in NVRAM or wherever before it is permanently magnetized … AND THAT IS
MY SIMPLE WAY TO LOOK AT IT, possibly called transitive trust of
consistency …
Of course, most of you are far more familiar with this area, but that is
just my pi/2 cents
-pro
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Ravisankar
Pudipeddi
Sent: Tuesday, November 02, 2004 6:38 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Ordering of I/O requests
Here’s some clarification:
From Windows 2000 SP3 onwards, we use FUA on SCSI when WRITE_THROUGH is
supplied. This will force immediate write back to the disk from the
cache (no guarantees on IDE controllers).
FlushFileBuffers - this is very expensive, but it will do a complete
sync of the cache with the disk on both SCSI and IDE.
Given that information it is up to you to choose what semantics you
want. Remember FlushFileBuffers() is very expensive…
Ravi
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Tony Mason
Sent: Tuesday, November 02, 2004 4:54 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Ordering of I/O requests
Jan,
First, let me note (again) that I have a very file-systems-centric perspective
in this regard. I was also careful to mention that this is a feature of
the Win32 API (it sets FILE_SYNCHRONOUS_IO_NONALERT when opening the
file). But this is specific to CreateFile (except for overlapped I/O).
Your example is a good one - it strips away the Win32 processing to look
at just what the OS itself does.
The I/O Manager in this case does not impose serialization (I’m assuming
you did NOT ask for FILE_SYNCHRONOUS_IO_ALERT or
FILE_SYNCHRONOUS_IO_NONALERT) and passes the calls to the underlying
file system. The file system might impose its own internal
serialization, but from my experience this is THE performance path and
we have to make sure there’s as little serialization as possible.
If the file is compressed, NTFS will not honor the
FILE_NO_INTERMEDIATE_BUFFERING request you made - the file will be
written to the cache. If you are building a transactional system, you
should NOT allow the file system to store the file compressed.
Assuming that is not the case, NTFS will rather quickly try to post that
asynchronous I/O operation and return STATUS_PENDING back to you. This
doesn’t even mean the I/O has been STARTED to disk, so certainly there
is no ordering. When the second I/O begins it might start out to disk
before - or after - the first. There is no ordering between the two.
When they have BOTH completed, then you do in fact know they are BOTH on
disk (assuming FILE_WRITE_THROUGH or FILE_NO_INTERMEDIATE_BUFFERING and
no compression…) But if one completes you know nothing about the
state of the other I/O request.
In the case you provided, it is quite possible that the second I/O
operation could complete before the first one is completed. This is not
what you would want in such a case - so yes, you need to block and wait
for the log to be written BEFORE you write the update to the end-of-log
information. When I constructed the file systems journal those many
years ago, the log pages actually had a “pass number” field. I’d find
the beginning (most recent) record of the log by doing a binary search
on the log file (sounds simple, but there are some annoying edge
conditions). This eliminated the ordered writes you are describing.
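A minimal sketch of the pass-number idea, assuming a circular log whose pages are written strictly in order and whose pass number is bumped on every wrap-around. The `log_page` layout and the `find_log_head` name are hypothetical, not taken from the journal described above, and the "annoying edge conditions" (torn or unwritten pages, for example) are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page header; the real journal's layout is not given in the thread. */
typedef struct {
    uint32_t pass_number;   /* bumped each time the writer wraps to page 0 */
} log_page;

/*
 * Return the index of the most recently written page.
 * Assumption: pages [0, k) carry the newest pass number and pages [k, n)
 * carry an older one, so the head is the last page matching page 0.
 */
size_t find_log_head(const log_page *pages, size_t n)
{
    assert(pages != NULL && n > 0);

    uint32_t newest = pages[0].pass_number;
    size_t lo = 0;   /* pages[lo] is known to carry the newest pass */
    size_t hi = n;   /* pages[hi], if it exists, carries an older pass */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (pages[mid].pass_number == newest)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}
```

With pass numbers 5, 5, 5, 4, 4 the function returns index 2. Because the head is derived from the pages themselves, no separate end-of-log write has to be ordered after the data write.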
When we tested this, we used a simulator that would snapshot the disk
state after every sector update and verify that we could recover the
log. We (literally) ran this in an automated fashion every night on
various workstations for almost the entire lifetime of the project,
often testing thousands of “disk states” each night. I’ve gratefully
forgotten the amazingly weird bugs that we saw over time. Transactional
systems can be quite amazing when they work properly.
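The snapshot-style testing described above might look roughly like the toy below. It is not a reconstruction of that simulator: `disk_image`, `sector_write`, `recover_head` and `count_recoverable_states` are all hypothetical names, each page update is assumed to be atomic, and recovery is a simple linear scan over pass numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAGES 8   /* hypothetical log size, small enough to test exhaustively */

typedef struct {
    uint32_t pass[LOG_PAGES];   /* pass number per page; 0 means never written */
} disk_image;

typedef struct {
    size_t   page;   /* page this update targets */
    uint32_t pass;   /* pass number it stamps on that page */
} sector_write;

/* Recovery by linear scan: the head is the last page carrying page 0's pass. */
static size_t recover_head(const disk_image *d)
{
    size_t head = 0;
    while (head + 1 < LOG_PAGES && d->pass[head + 1] == d->pass[0])
        head++;
    return head;
}

/*
 * Replay n updates in the order given, snapshot the image after each one,
 * and count the snapshots on which recovery finds the page just written.
 */
size_t count_recoverable_states(const sector_write *w, size_t n)
{
    disk_image disk;
    size_t ok = 0;

    memset(&disk, 0, sizeof disk);
    for (size_t i = 0; i < n; i++) {
        assert(w[i].page < LOG_PAGES);
        disk.pass[w[i].page] = w[i].pass;

        disk_image snapshot = disk;   /* the state a crash here would leave */
        if (recover_head(&snapshot) == w[i].page)
            ok++;
    }
    return ok;
}
```

Feeding it an in-order write sequence should report every snapshot as recoverable; feeding it a reordered sequence (page 1 before page 0 on a fresh log, say) shows snapshots where recovery picks the wrong head.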
Regards,
Tony
Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
Sent: Tuesday, November 02, 2004 7:03 PM
To: ntdev redirect
Subject: Re:[ntdev] Ordering of I/O requests
Tony,
By “serialize” you mean wait for completion, or wait for the write call
to return with a PENDING status?
Let me be specific: say a driver opens a file with ZwCreateFile and a
flag of FILE_NO_INTERMEDIATE_BUFFERING. If I issue an IRP_MJ_WRITE and
wait for the call to IoCallDriver to return STATUS_PENDING, and then
issue an IRP_MJ_WRITE to the same offset, is it guaranteed the disk will
contain the data from the second write after both IRPs complete? Or
will the data on the disk be indeterminate, and the only way to make
this work is to wait for the full completion of the first IRP before
issuing the second one?
Is everything the same if I open a raw volume instead of a file on an NTFS
file system?
The application is a log, where I write some data in the log file, and
then want to update the header saying the offset of the last data
written. I would like to maximize performance by issuing I/O as soon as
I know what to write, instead of waiting for the previous write of
header data to complete.
The goal is to be assured the header will always contain the latest
pointer update. This will all run at PASSIVE level, probably in an
arbitrary process and thread context. Imagine a log device where apps
pass down IOCTLs to write to the log. The IOCTL calls the write IRP to
append the log data block, and then when IoCallDriver returns PENDING,
issues another write to update the log ending pointer. The IOCTL returns
SUCCESS after the header write returns PENDING, and the completion
routines clean up.
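A small model of the scheme just described may help show what can go wrong if the two writes reach the disk in either order. This is plain C, not driver code; `disk_state`, `write_kind` and `survives_crash` are hypothetical names, and each write is assumed to land atomically, which real multi-sector writes need not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the on-disk log; names are illustrative only. */
typedef struct {
    size_t header_end;   /* header's claim: bytes of valid log data */
    size_t data_end;     /* bytes of log data that actually reached the disk */
} disk_state;

typedef enum { WRITE_DATA, WRITE_HEADER } write_kind;

/* Apply one write: either the appended data or the header pointer update. */
static void apply_write(disk_state *d, write_kind k, size_t new_end)
{
    if (k == WRITE_DATA)
        d->data_end = new_end;
    else
        d->header_end = new_end;
}

/* Recovery invariant: the header never points past data that is on disk. */
static bool header_is_valid(const disk_state *d)
{
    return d->header_end <= d->data_end;
}

/*
 * Apply the first `completed` writes of `order` to a copy of `start`,
 * as if the machine crashed at that point, and report whether the
 * surviving state satisfies the recovery invariant.
 */
bool survives_crash(disk_state start, const write_kind order[2],
                    size_t new_end, int completed)
{
    assert(completed >= 0 && completed <= 2);
    for (int i = 0; i < completed; i++)
        apply_write(&start, order[i], new_end);
    return header_is_valid(&start);
}
```

If the data write always lands first, every crash point leaves a valid header. If the header write lands first and the machine stops before the data write, the header points at data that was never written, which is the case that waiting for the first completion rules out.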
If they used the Win32 API and specified overlapped I/O, but themselves
serialize the I/O internally, then the I/O would be serialized.
If they don’t serialize the I/O internally between two threads then the
application itself really has no idea in which order the two operations
were issued.
Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256
You are currently subscribed to ntdev as: xxxxx@osr.com To unsubscribe
send a blank email to xxxxx@lists.osr.com