Ordering of I/O requests

In the situation you describe below, the result is indeterminate - the
requests can be reordered by any number of hardware or software
components between your application and the disk drive. If you need to
write A and then B, you must wait for A to complete. It doesn’t matter
how you open the volume, what caching you specify, or what controller or
disk you use, etc…

There is no guarantee of ordering between two in-flight I/Os to the disk
in Windows. “In flight” can be defined as the period between when you
initiate the I/O (ReadFile/WriteFile or IoCallDriver) and the time you
receive notification that the request has completed (completion
event/APC/port, or completion routine).

-p
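Peter’s rule - if B depends on A, then A must complete before B is issued - can be illustrated with a small, self-contained simulation (plain C, no Windows APIs; every name below is made up for illustration). The toy “disk” completes in-flight writes in reverse submission order, which is one of the reorderings real hardware or software queues are allowed to perform:

```c
#include <assert.h>
#include <string.h>

/* Toy model of an async disk: submitted writes sit in a queue and may
 * complete in any order. Here we adversarially complete them LIFO to
 * show why "submitted first" does not mean "hits the platter first".
 * Nothing below is a Windows API - it is only an illustration. */

#define MAX_INFLIGHT 8

struct pending { int sector; char data; };

static struct pending queue[MAX_INFLIGHT];
static int queued;
static char disk[16];           /* the "platters" */

/* Submit a write; returns immediately (think: STATUS_PENDING). */
static void submit_write(int sector, char data) {
    queue[queued].sector = sector;
    queue[queued].data = data;
    queued++;
}

/* Complete every in-flight write, newest first (worst-case reorder). */
static void drain(void) {
    while (queued > 0) {
        queued--;
        disk[queue[queued].sector] = queue[queued].data;
    }
}

/* Two writes to the same sector, both in flight at once: the disk may
 * apply them in either order, so 'B' can be overwritten by 'A'. */
static char racy(void) {
    memset(disk, 0, sizeof disk);
    submit_write(3, 'A');
    submit_write(3, 'B');       /* issued second, but completes first */
    drain();
    return disk[3];             /* 'A' survives - not what we wanted */
}

/* Waiting for the first write's completion before issuing the second
 * is the only way to guarantee the final contents. */
static char serialized(void) {
    memset(disk, 0, sizeof disk);
    submit_write(3, 'A');
    drain();                    /* wait for A to complete */
    submit_write(3, 'B');
    drain();
    return disk[3];             /* 'B', guaranteed */
}
```

Only the version that waits for the first completion before submitting the second write guarantees the final contents - exactly the behavior described above, regardless of caching flags.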

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
Sent: Tuesday, November 02, 2004 4:03 PM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] Ordering of I/O requests

Tony,

By “serialize” do you mean wait for completion, or wait for the
write call to return with STATUS_PENDING?

Let me be specific: say a driver opens a file with
ZwCreateFile and the FILE_NO_INTERMEDIATE_BUFFERING flag. If
I issue an IRP_MJ_WRITE, wait for the call to IoCallDriver
to return STATUS_PENDING, and then issue another IRP_MJ_WRITE to
the same offset, is it guaranteed the disk will contain the
data from the second write after both IRPs complete? Or will
the data on the disk be indeterminate, the only way to
make this work being to wait for the full completion of the
first IRP before issuing the second one? Is everything the
same if I open a raw volume instead of a file on an NTFS file system?

The application is a log, where I write some data in the log
file and then want to update the header recording the offset of
the last data written. I would like to maximize performance
by issuing I/O as soon as I know what to write, instead of
waiting for the previous write of header data to complete.
The goal is to be assured the header will always contain the
latest pointer update. This will all run at PASSIVE level,
probably in an arbitrary process and thread context. Imagine
a log device where apps pass down IOCTLs to write to the
log. The IOCTL handler issues the write IRP to append the log data
block, and then, when IoCallDriver returns PENDING, issues
another write to update the log ending pointer. The IOCTL
returns SUCCESS after the header write returns PENDING, and
the completion routines clean up.

- Jan

>If they used the Win32 API and specified overlapped I/O, but themselves
>serialize the I/O internally, then the I/O would be serialized.
>If they don’t serialize the I/O internally between two threads then the
>application itself really has no idea in which order the two operations
>were issued.


Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

You are currently subscribed to ntdev as:
xxxxx@windows.microsoft.com To unsubscribe send a blank
email to xxxxx@lists.osr.com

> Transactional semantics do NOT require serializing writes.

At least the log block with the record describing the operation must be flushed
before the data blocks updated by that operation.

From what I know of Windows, the Cache Manager has a facility for a) assigning
LSNs to blocks on update and b) calling an FSD-provided callback before
flushing a block, passing the block’s LSN as a parameter. The callback has the
power to veto the flush, in which case Cc will skip this block and continue the
dirty-block collection run with other blocks.

The LFS component in NTFS assigns incrementing LSNs to the log records, and the
callback is the LFS routine which checks whether the log has been physically
flushed up to this LSN. NTFS also assigns LSNs to cache blocks - the LSNs
obtained from LFS after the operation is logged.

Another great approach to FS consistency and fault tolerance is FreeBSD’s
“softdep”. It does not maintain any on-disk log, but imposes a metadata flush
ordering, and, if the ordering is circular, it rolls back the update in the
block, writes the block to disk, and then rolls the same update forward in the
in-memory copy.
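The rollback/roll-forward step can be sketched as follows (illustrative structures and names only - this is not FreeBSD’s actual code; a directory entry pointing at a not-yet-written inode stands in for “an update whose dependency is not satisfied”):

```c
#include <assert.h>

/* Toy sketch of the softdep trick described above: a cached block
 * carries an update whose ordering dependency is not yet satisfied
 * (e.g. the inode it references has not reached disk). Rather than
 * delay the whole flush, roll that one update back, write the safe
 * image, then roll the update forward again in the in-memory copy. */

struct dir_block {
    int entries[4];     /* directory entries: inode numbers, 0 = free */
};

/* In-memory state: entry 2 references inode 99, but inode 99 itself
 * has not been written yet, so entry 2 must not reach disk first. */
static struct dir_block cached = { { 7, 12, 99, 0 } };
static int unsafe_slot = 2;         /* dependency not yet satisfied */

static struct dir_block on_disk;    /* what the "disk" ends up holding */

static void flush_with_rollback(void) {
    int saved = cached.entries[unsafe_slot];
    cached.entries[unsafe_slot] = 0; /* roll back the dependent update */
    on_disk = cached;                /* "write" the safe image to disk */
    cached.entries[unsafe_slot] = saved; /* roll forward in memory */
}
```

After the flush, the on-disk image at worst leaks the entry (recoverable by fsck), while the in-memory copy still carries the update - matching the “weak consistency” guarantee described next.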

This provides a “weak consistency” guarantee, where the only damage in the
on-disk FS can be leaked blocks or inodes (which are easily recovered by a
special, very fast mode of fsck), but no other corruption.

This serves lots of practical purposes. No full fsck on each boot, as in Linux
with ext2. No additional disk writes to a separate log file that can be far
away disk-wise, as in NTFS or ReiserFS, which are slow due to head seeks.

I can email McKusick’s (the FreeBSD FS author’s) paper on softdep to all
interested parties.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Jan,

First, let me note (again) that I come at this from a very file-systems
perspective. I was also careful to mention that this serialization is a
feature of the Win32 API (it sets FILE_SYNCHRONOUS_IO_NONALERT when
opening the file). But this is specific to CreateFile (except for
overlapped I/O).

Your example is a good one - it strips away the Win32 processing to look
at just what the OS itself does.

The I/O Manager in this case does not impose serialization (I’m assuming
you did NOT ask for FILE_SYNCHRONOUS_IO_ALERT or
FILE_SYNCHRONOUS_IO_NONALERT) and passes the calls to the underlying
file system. The file system might impose its own internal
serialization, but from my experience this is THE performance path and
we have to make sure there’s as little serialization as possible.

If the file is compressed, NTFS will not honor the
FILE_NO_INTERMEDIATE_BUFFERING request you made - the file will be
written to the cache. If you are building a transactional system, you
should NOT allow the file system to store the file compressed.

Assuming that is not the case, NTFS will rather quickly try to post that
asynchronous I/O operation and return STATUS_PENDING back to you. This
doesn’t even mean the I/O has been STARTED to disk, so there is certainly
no ordering. When the second I/O begins, it might start out to disk
before - or after - the first. There is no ordering between the two.

When they have BOTH completed, then you do in fact know they are BOTH on
disk (assuming FILE_WRITE_THROUGH or FILE_NO_INTERMEDIATE_BUFFERING and
no compression…). But if only one completes, you know nothing about the
state of the other I/O request.

In the case you provided, it is quite possible that the second I/O
operation could complete before the first one is completed. This is not
what you would want in such a case - so yes, you need to block and wait
for the log to be written BEFORE you write the update to the end-of-log
information. When I constructed the file systems journal those many
years ago, the log pages actually had a “pass number” field. I’d find
the beginning (most recent) record of the log by doing a binary search
on the log file (sounds simple, but there are some annoying edge
conditions). This eliminated the ordered writes you are describing.
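The pass-number trick described above amounts to a rotation-point binary search. Here is a hypothetical reconstruction (the actual on-disk format is not shown in the thread, and this deliberately glosses over the “annoying edge conditions” such as torn writes at the boundary):

```c
#include <assert.h>

/* Each log page is stamped with the number of times the circular log
 * has wrapped. Pages written on the current pass carry a higher pass
 * number than the stale pages ahead of them, so the page array looks
 * like [N, N, N, N-1, N-1, ...], and the most recent page is the last
 * one carrying pass N - found with a binary search instead of an
 * ordered header write. */

static int find_head(const int pass[], int npages) {
    int lo = 0, hi = npages - 1;
    if (pass[hi] == pass[0])
        return hi;              /* no wrap point: last page is newest */
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (pass[mid] == pass[0])
            lo = mid;           /* still on the current pass */
        else
            hi = mid - 1;       /* past the drop: look left */
    }
    return lo;                  /* last page of the current pass */
}
```

This is why no ordered “write the header after the data” step is needed: the head of the log is computed at recovery time from the pages themselves.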

When we tested this, we used a simulator that would snapshot the disk
state after every sector update and verify that we could recover the
log. We (literally) ran this in an automated fashion every night on
various workstations for almost the entire lifetime of the project,
often testing thousands of “disk states” each night. I’ve gratefully
forgotten the amazingly weird bugs that we saw over time. Transactional
systems can be quite amazing when they work properly.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com


As I already admitted, I oversimplified here. I should have known
better: on a topic of this technical complexity, simplifying ANYTHING
will come back to haunt me.

Of course, the details will depend entirely upon the specifics of the
underlying hardware.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Maxim S. Shatskih
Sent: Tuesday, November 02, 2004 6:21 PM
To: ntdev redirect
Subject: Re: [ntdev] Ordering of I/O requests

> NTFS does exactly the same thing - it explicitly requests that
> write-back caching be DISABLED in the disk controller and on the disk

Am I wrong that, on SCSI, NTFS uses the FUA request flag for log writes,
and still runs with write cache on for other writes?

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com



I did not say that there were NO ordering constraints, only that a
system of this type does not require full serialization. For example,
it was common for us to have multiple blocks heading out to the log
simultaneously - if they committed out-of-order we’d deal with that when
it came time to do rollback (read: system crashed, we replayed the log).

In fact, in each page we wrote out to the log we kept track of the
“active region” of the log - the last committed page of the log and the
oldest ACTIVE page of the log. When we replayed, we’d check from last
committed page to current page and ensure they were all written out
correctly. If not, we’d go back to the last contiguously written log
page.
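The replay check described above - walk forward from the last committed page and trust only the contiguously written prefix - might look like the following simplified sketch (`written[]` stands in for whatever per-page stamp the real system used to decide a page made it to disk; the name is invented for illustration):

```c
#include <assert.h>

/* Writes to the log can complete out of order, so after a crash the
 * pages past the last committed point may contain a gap. Replay walks
 * forward from the last committed page and stops at the first page
 * that never reached disk, even if later pages did. */

static int last_recoverable(const int written[], int npages,
                            int last_committed) {
    int i = last_committed;
    /* advance while the next page made it to disk */
    while (i + 1 < npages && written[i + 1])
        i++;
    return i;   /* last contiguously written page: replay stops here */
}
```

Note that pages beyond the gap are discarded even if they were written - they committed out of order and cannot be trusted without the intervening page.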

On the write side, we’d write out in an LRU-style fashion and (much as
you described in LFS) we knew, when it came time to flush out a disk page,
whether it was safe (based upon the log state). We didn’t have
transactional dependencies, just log page dependencies, because we did
old value/new value logging (so we could process UNDO operations as well
as REDO operations).

The advantage of journaling file systems versus “careful write” systems
(McKusick’s) is that they can actually be faster in most any meta-data
intensive operation - there is no need to write meta-data out
aggressively, because the journal can be used to restore the correct
(consistent) state.

Of course, log structured file systems are another interesting example -
and one that we don’t see much in the Windows environment (I’ve seen one
over the years, developed by the most talented file systems team I’ve
ever had the pleasure to work with, but they were dismantled after an
acquisition of their company and the technology was given away and
shelved).

I’ve gone through this argument (over and over again, it seems) about
journaling. Careful update is actually very I/O intensive, and while it
can be made to WORK right, it isn’t necessarily the best solution for
all circumstances. If you have to make it work with an existing on-disk
format, it makes sense. Building something new, it does not. In
clustered environments (shared on-disk file systems) careful update
doesn’t work.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com


Here’s some clarification:
From Windows 2000 SP3 onwards, we use FUA on SCSI when WRITE_THROUGH is
supplied. This forces the write through to the disk immediately, rather
than leaving it in the drive cache (no guarantees on IDE controllers).
FlushFileBuffers - this is very expensive, but it will do a complete
sync of the cache with the disk on both SCSI and IDE.
Given that information, it is up to you to choose what semantics you
want. Remember, FlushFileBuffers() is very expensive…
Ravi
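In Win32 terms, the two options Ravi describes look roughly like this. This is a hedged, Windows-only fragment with error handling omitted; `log.dat` is a hypothetical file name, and the FUA behavior is as stated above (Windows 2000 SP3+, SCSI only):

```c
#ifdef _WIN32
#include <windows.h>

static void durability_options(void) {
    DWORD n;

    /* Option 1: write-through. On Win2K SP3+ each write becomes a FUA
     * request on SCSI - cheaper than a full cache sync, but with no
     * guarantee on IDE controllers. */
    HANDLE h = CreateFileA("log.dat", GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS,
                           FILE_FLAG_WRITE_THROUGH, NULL);
    WriteFile(h, "hdr", 3, &n, NULL);   /* forced out per write */
    CloseHandle(h);

    /* Option 2: buffered writes plus an explicit, expensive sync of
     * the drive cache - works on both SCSI and IDE. */
    h = CreateFileA("log.dat", GENERIC_WRITE, 0, NULL,
                    OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    WriteFile(h, "hdr", 3, &n, NULL);
    FlushFileBuffers(h);                /* full cache sync */
    CloseHandle(h);
}
#endif
```

Neither option changes the ordering discussion earlier in the thread: even with write-through, two in-flight writes can still reach the medium in either order.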


Sorry to bring up such a confusing topic!

I still stand behind Peter W’s logic. From the upper level, all that matters
is that A goes before B; what the lower level does is up to it. Whether there
is partial ordering or whatever is immaterial. Flushing will race with other
I/O too…

Also, I don’t buy the two-or-more-disks argument. A disk could not conceivably
know how many more disks would be involved in a sane transactional system.
Flushing to disk, and an ack from the disk, is all that matters to the layer
doing the flushing; what the disk does after that is irrelevant. So essentially
reordering/caching everything is fine, as long as it maintains consistency: if
A is modified to A’, you get A’, and it is up to the disk to keep it in NVRAM
or wherever before it is permanently magnetized… AND THAT IS MY SIMPLE WAY TO
LOOK AT IT, possibly called transitive trust of consistency…

Of course, most of you are far more familiar with this area, but that is just
my pi/2 cents :)

-pro


> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of Maxim S. Shatskih
> Sent: Tuesday, November 02, 2004 6:21 PM
> To: Windows System Software Devs Interest List
> Subject: Re: [ntdev] Ordering of I/O requests
>
> > NTFS does exactly the same thing - it explicitly requests that
> > write-back caching be DISABLED in the disk controller and on the disk
>
> Am I wrong that, on SCSI, NTFS uses the FUA request flag for
> log writes, and still runs with write cache on for other writes?

That is what I have seen.


> The advantage of journaling file systems versus “careful write” systems
> (McKusick’s) is that they actually can be faster in most any meta-data
> intensive operation

Looks like we must agree on what “careful write” is.

Usually, “careful write” means synchronously flushing the cache blocks just
after update.

From what I know of McKusick’s “softdep”, it does not do this. It only imposes
an order on block flushing, and the flushing itself can be done in a lazy way.
So maybe softdep is not a careful-write system.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

In my experience, “careful write” means correctly ordering meta-data write
operations (or, specifically, the blocks that contain said information) in
order to minimize the “holes” that can otherwise exist (using “arbitrary
write” semantics, for example). My understanding is that this is essentially
the same as McKusick’s work.

There are clearly times when you must flush meta-data in order to
guarantee correct application semantics (write followed by flush, for
example), but no file system I’ve ever seen that cares about performance
does aggressive flushing of I/O operations. FAT, for instance, uses a
more aggressive write-back scheme when running on top of a removable
device than NTFS does, and it DOES do synchronous write-back in a
significant number of cases. But I wouldn’t want to use that as my
general-purpose I/O device, either.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

Looking forward to seeing you at the Next OSR File Systems Class October
18, 2004 in Silicon Valley!
