Ordering of I/O requests

I’m trying to decide if there is actually an “ordering” to I/O requests. Say
I have two threads on two processors and they both issue a disk write in
some known order on a single file handle. It seems like both those requests
will pass down the storage stack in parallel, and not get serialized until
they reach either the disk bus or possibly the SCSI port driver target
queue. As either thread can be interrupted by DPC or ISR processing, it’s
not known which request will actually happen first. If I’m running
multipathing hardware, they might even take different routes to the drive,
and get serialized there.

I care about this because I would like to be able to pipeline a bunch of
disk writes with overlapping headers and trailers, and assure they get
applied in the pipeline order. I don’t really want to wait for one to
complete before issuing the next one, which seems like the only way to
ASSURE an ordering. The end data on the disk will not be correct unless the
header of the second write is applied after the trailer of the first write,
and the header of the third write is applied after the trailer of the second
write.
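
For concreteness, here is roughly what that pipelined pattern looks like in
Win32 terms (a sketch with invented names; the handle is assumed to have been
opened with FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING, and nothing in
it constrains the order in which the two writes reach the media, which is
exactly the question):

    #include <windows.h>

    /* Illustrative only: two pipelined writes on one handle. Each OVERLAPPED
     * should carry its own event (ov->hEvent) so the completions can be told
     * apart. With FILE_FLAG_NO_BUFFERING, buffers, offsets, and lengths must
     * all be sector-aligned. */
    BOOL IssuePipelinedWrites(HANDLE hFile,
                              const void *buf1, DWORD len1, LONGLONG ofs1,
                              const void *buf2, DWORD len2, LONGLONG ofs2,
                              OVERLAPPED *ov1, OVERLAPPED *ov2)
    {
        ov1->Offset     = (DWORD)ofs1;
        ov1->OffsetHigh = (DWORD)(ofs1 >> 32);
        if (!WriteFile(hFile, buf1, len1, NULL, ov1) &&
            GetLastError() != ERROR_IO_PENDING)
            return FALSE;

        /* Issued without waiting for the first write; from here on, the
         * storage stack is free to complete (and commit) these in either
         * order. */
        ov2->Offset     = (DWORD)ofs2;
        ov2->OffsetHigh = (DWORD)(ofs2 >> 32);
        if (!WriteFile(hFile, buf2, len2, NULL, ov2) &&
            GetLastError() != ERROR_IO_PENDING)
            return FALSE;

        return TRUE;
    }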

I put on my user mode hat and think “of course the order the writes get
initially issued in will be preserved on the physical disk”, but I put on my
driver writer hat and say “what specifically will keep the writes in the
same order as they started in?”.

Anybody know the real answer?

  • Jan

The wonderful thing about this topic is it all depends upon processing
at each layer. Since you specified they are on the same handle, let’s
note that the first thing will depend upon how the user opened the file.
If they used the Win32 API and did not specify overlapped I/O, then the
I/O Manager will serialize the operations on behalf of the original
user. If they used the Win32 API and specified overlapped I/O, but
themselves serialize the I/O internally, then the I/O would be
serialized. If they don’t serialize the I/O internally between two
threads then the application itself really has no idea in which order
the two operations were issued.

At the file system level, if the file was opened for cached I/O, the
operations are satisfied by copying data into the cache. Upon return,
the I/O is “done” in memory but not on disk. There is no guarantee as
to the order the data will be written. If the file was opened for
non-cached I/O but the file is compressed, NTFS will treat it as cached
I/O anyway (this is one reason why you will notice that database files
can’t be compressed - they can’t guarantee order of operations as is
required for transactional semantics).

So, let’s assume that the USER did everything right - file is opened for
overlapped I/O, non-cached, not compressed. I/O operations are directly
issued to the disk subsystem. Of course, a single user-level I/O might
be split into multiple I/O operations (if the allocation on disk is
fragmented). These I/O operation(s) are then sent to the volume
manager. Again, the volume manager may split a single I/O operation up
into multiple I/O operations (mirroring, striping, or striping with
parity).
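
As a point of reference, the “did everything right” open looks roughly like
this from user mode (a sketch with invented names; FILE_FLAG_NO_BUFFERING
obliges the caller to sector-aligned buffers, offsets, and lengths):

    #include <windows.h>

    /* Sketch of opening for overlapped, non-cached I/O as described above. */
    HANDLE OpenDirect(const wchar_t *path)
    {
        return CreateFileW(path,
                           GENERIC_READ | GENERIC_WRITE,
                           0,                       /* no sharing */
                           NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING,
                           NULL);
    }

    /* Sector-aligned buffers are easiest to get from VirtualAlloc, which
     * returns page-aligned (hence sector-aligned) memory. */
    void *AllocAligned(SIZE_T bytes)
    {
        return VirtualAlloc(NULL, bytes, MEM_COMMIT | MEM_RESERVE,
                            PAGE_READWRITE);
    }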

By the time this seemingly simple set of I/O operations reaches the SCSI
port layer there’s potentially little relationship between the original
(user) I/O and the actual (disk) I/O. Further, high-performance port
drivers often implement additional queuing algorithms (think of
simplistic algorithms like elevator scans, or more complex tag queuing
scans, or priority based real-time algorithms). Some port drivers
implement RAID features as well.

Both controllers and disks also add hardware (memory) caching. This can
(and typically does) again change the actual order in which things
occur.

Application programmers that believe the ordering of two arbitrary I/O
operations is in some way constrained are fooling themselves. Database
people (that is, the folks that implement SQL or Jet) understand that
nothing is certain until such time as the I/O is acknowledged from the
drive. They go to great lengths to disable features like disk (or
controller) caching because they need to know (for transactional
correctness) that I/O has in fact been committed to disk. The fastest
way to screw up a transactional system is to lie to it and claim blocks
are committed to disk when they in fact are not.

Regards,

Tony

Tony Mason

Consulting Partner

OSR Open Systems Resources, Inc.

http://www.osr.com

Looking forward to seeing you at the Next OSR File Systems Class October
18, 2004 in Silicon Valley!



the only way to assure ordering is to wait for one request to complete
before starting the next one.

-p
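
In Win32 terms, that rule looks something like the following sketch
(illustrative names, overlapped handle assumed): the next write is not
issued until GetOverlappedResult confirms the previous one has completed.

    #include <windows.h>

    /* Ordered writes the safe way: do not issue write N+1 until write N has
     * completed back to you. hFile is assumed to be an overlapped handle. */
    BOOL WriteOrdered(HANDLE hFile, const void *buf, DWORD len, LONGLONG ofs)
    {
        OVERLAPPED ov = {0};
        DWORD written = 0;
        BOOL ok;

        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (!ov.hEvent)
            return FALSE;
        ov.Offset     = (DWORD)ofs;
        ov.OffsetHigh = (DWORD)(ofs >> 32);

        ok = WriteFile(hFile, buf, len, NULL, &ov);
        if (!ok && GetLastError() == ERROR_IO_PENDING)
            ok = GetOverlappedResult(hFile, &ov, &written, TRUE); /* block */

        CloseHandle(ov.hEvent);
        return ok;
    }

    /* Caller: WriteOrdered(h, trailer, ...); then WriteOrdered(h, header,
     * ...); the header write is only sent down after the trailer write has
     * completed. */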



Watching order of completion isn’t even sufficient at the disk interface
level, let alone at user application level. Disks and controllers both
implement caching logic that might reorder the writes (and return
premature results) prior to actually being committed to disk. If an
application must know the order, it must be cognizant of these
distinctions and disable them or provide additional logic to ensure data
has been written through (ultimately, this becomes hardware dependent as
well.)
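
From user mode, the usual “additional logic” boils down to
FILE_FLAG_WRITE_THROUGH at open time and/or an explicit flush after the
writes that matter - a hedged sketch; whether the hardware honors either is,
as noted above, hardware dependent:

    #include <windows.h>

    /* Write synchronously, then ask the stack to push anything it has cached
     * for this handle down to the device. hFile is assumed to be a
     * synchronous (non-overlapped) handle. FlushFileBuffers eventually
     * becomes a flush request to the disk; a cache that lies can still
     * defeat it. */
    BOOL WriteAndFlush(HANDLE hFile, const void *buf, DWORD len)
    {
        DWORD written = 0;

        if (!WriteFile(hFile, buf, len, &written, NULL))
            return FALSE;
        return FlushFileBuffers(hFile);
    }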

Regards,

Tony

Tony Mason

Consulting Partner

OSR Open Systems Resources, Inc.

http://www.osr.com



Now again I’m lost. Just out of curiosity: if a disk controller does
reordering of writes and reads and could not feed the upper interface layer
with the most current data (transaction semantics), then I would assume it
is nothing but a brick. So the question becomes, when would it really be
necessary to turn off disk caching, assuming that the caching (say, at the
sector level) is implemented correctly at the disk firmware level?

-pro

Say you have a concurrent read and write to the same sector … now define
“most current data”!


I have a very file systems centric view here - and 15 years ago I was
developing the transaction/journaling components of a journaling file
system, so I’ve been over this territory before.

The general rule is: nobody cares unless they explicitly ask. In other
words, if you don’t ask, the underlying system will assume that you
don’t care about the ordering of specific operations.

For a journaling file system, we have the same problem as a database.
In my personal example from so many years ago, we used an old value/new
value journal, which means that we stored the data BEFORE the change and
the data AFTER the change. We then would periodically write a block of
such change records out to disk.

There are ordering constraints here - we must ensure that pieces of the
log are committed (permanently recorded on disk) before the
corresponding changes to meta-data are written out to disk. Thus to
guarantee our transactional semantics, we needed to ensure that any
hardware level out-of-order caching was disabled. Otherwise, we
couldn’t guarantee correct recovery.

NTFS does exactly the same thing - it explicitly requests that
write-back caching be DISABLED in the disk controller and on the disk
drive itself (see IOCTL_DISK_SET_CACHE_INFORMATION and
IOCTL_DISK_SET_CACHE_SETTING for example, no doubt there are also other
mechanisms floating around through here).
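
For reference, a hedged user-mode sketch of using the first of those IOCTLs
- query the current cache information, clear the write-cache flag, and set
it back. Whether a particular drive or controller actually honors the
request is up to the hardware:

    #include <windows.h>
    #include <winioctl.h>

    /* Attempt to disable the drive's write-back cache. hDisk is a handle to
     * the physical disk (e.g. \\.\PhysicalDrive0) opened with
     * GENERIC_READ | GENERIC_WRITE. */
    BOOL DisableWriteCache(HANDLE hDisk)
    {
        DISK_CACHE_INFORMATION info;
        DWORD bytes = 0;

        if (!DeviceIoControl(hDisk, IOCTL_DISK_GET_CACHE_INFORMATION,
                             NULL, 0, &info, sizeof(info), &bytes, NULL))
            return FALSE;

        info.WriteCacheEnabled = FALSE;

        return DeviceIoControl(hDisk, IOCTL_DISK_SET_CACHE_INFORMATION,
                               &info, sizeof(info), NULL, 0, &bytes, NULL);
    }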

If you want to implement caching, the only safe transparent way to do
this (from a single disk transactional perspective) is to guarantee no
reordering of operations. But databases where the log is on one disk
and the database on another will NOT be happy if they find out that data
they were told had committed to the log didn’t, while the updates (which
must now be aborted) were written to the 2nd disk drive. The only safe
way to do this is to disable caching on both drives, so that once data
has been acknowledged back, we know that it has been written out to
disk.

A quick search also turns up yet more information. For example
http://www.storagereview.com/guide2000/ref/hdd/if/scsi/protCQR.html
discusses tagged command queuing (and this DOES improve performance).
There is a wealth of information about this topic floating around. This
just demonstrates that systems DO reorder disk operations.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com


The concurrent read and write can happen at any place (memory, cache, disk
cache). From the receiver’s point of view, whatever comes first. As an
example, I can submit two transactions to my bank (a withdrawal and a
deposit); if the withdrawal goes before the deposit, I’m fined, but from
the receiving end - in this case the banking system - the serialization in
time is more important than anything in the read/write reordering at any
level. If a write is withheld and a read is allowed, it is a consistency
problem at that level, and the read should get the cached data that is not
yet committed. I DON’T EVEN THINK THAT YOU COULD DO A CONCURRENT READ AND
WRITE TO A SPECIFIC DATA ELEMENT (BE IT THE SMALLEST ADDRESSABLE UNIT OR A
SECTOR OR WHATEVER) WHEN YOU NEED DATA CONSISTENCY.

So I’m still not clear about why this read/write reordering cannot be
trusted - if, of course, it is implemented correctly!

-pro

Thanks Tony,

The first one you sent was very nice, and I saved it :).

Yes, I agree with what you are saying, but my point is very specific to a
single disk. The two-or-more-disk scenario you just gave is similar to MP
caching consistency, and that is a very valid point … and I’m not sure how
I’m going to point my gun at the read/write reordering or disk-level
caching!

I was partly involved in some work above the firmware level concerning
read/write reordering, so that just got me into confusion.

Best Regards,
-pro

How do you implement it correctly if you rely upon two sectors of two
different disks? What is the inherent ordering of such operations?

Heck, the same problem arises with processors - what is the ordering of
writes between two different processors? Maybe you think that they are
well ordered, but in the real world that leads to performance
bottlenecks, so in fact the rules are generally relaxed: reads and writes
are ordered within a given processor, but not between processors. Let’s
scale this up - we could have different processors,
executing different I/O operations to different disks located on
different controllers.
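
The processor analogy, reduced to a small user-mode C sketch (not from the
thread): without the barriers, the reader may observe g_ready set while
g_data is still stale - the same kind of relaxed ordering being described.

    #include <windows.h>

    /* Hypothetical flag/data pair shared between two threads. */
    static volatile LONG g_data  = 0;
    static volatile LONG g_ready = 0;

    DWORD WINAPI Writer(LPVOID unused)
    {
        g_data = 42;
        MemoryBarrier();        /* make g_data visible before g_ready */
        g_ready = 1;
        return 0;
    }

    DWORD WINAPI Reader(LPVOID unused)
    {
        while (!g_ready)
            YieldProcessor();   /* spin until the flag is observed */
        MemoryBarrier();        /* pair the barrier on the consuming side */
        return (DWORD)g_data;   /* guaranteed to see 42 */
    }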

Sometimes it amazes me that any of this stuff works right, ever.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com


Disk controllers that reorder writes when you have specified
WRITE_THROUGH and synchronously waited for a request to complete before
sending down another are buggy. Or they are lying so as to get a boost
in performance at the expense of correctness. You will see chkdsk
running on those machines. Caching in the controller is one thing but
not ordering writes is something else altogether, when WRITE_THROUGH is
supplied.

Actually NTFS doesn’t disable write back caching - because it assumes
ordering is never broken, it uses WRITE_THROUGH.

Ravi


This statement is STILL too strong. Here are two distinct cases that I
would consider to obey WRITE_THROUGH and yet still allow a disk
controller to reorder writes:

  • NTFS writes I/O operations in “clusters” that consist of one (or more)
    disk sectors. A single cluster might involve I/O to disjoint regions on
    the disk (sector sparing or some bizarre striping implementation).
    There is *no* ordering constraint on the writes within that region.
    Essentially, then, the “WRITE_THROUGH” request (which is nothing more
    than a request to the disk and/or controller to disable write-back
    caching for that given instance) merely says “tell me when the I/O is
    really committed to disk”. Nothing within the operation requires that
    the sectors be written in the correct order (I’m not sure how NTFS
    handles this issue, but we used to compute a checksum over log records
    to detect sector-write failures on replay; see the sketch after this
    list).

  • Distinct I/O operations to distinct regions of the disk may be
    interleaved. There is no “atomicity” or ordering with respect to two
    different I/O operations to the same disk. Combining that with the
    previous example, there’s no ordering relative to sector-level writes
    split between two different I/O operations.
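
The checksum trick mentioned in the first bullet, sketched with invented
names: the record is sealed with a checksum before it is queued for the
disk, and replay treats any mismatch as a torn (partially written) record.

    #include <stddef.h>
    #include <stdint.h>

    /* Record layout: header immediately followed by its payload in one
     * contiguous buffer (purely illustrative). */
    typedef struct {
        uint64_t lsn;             /* log sequence number */
        uint32_t payload_length;  /* bytes of payload following the header */
        uint32_t checksum;        /* covers the whole record except itself */
    } LOG_RECORD_HEADER;

    /* Fletcher-style sum over the record, skipping the checksum field. */
    static uint32_t ChecksumRecord(const uint8_t *record, size_t total_len)
    {
        uint32_t a = 0, b = 0;
        size_t i;

        for (i = 0; i < total_len; i++) {
            if (i >= offsetof(LOG_RECORD_HEADER, checksum) &&
                i <  offsetof(LOG_RECORD_HEADER, checksum) + sizeof(uint32_t))
                continue;
            a = (a + record[i]) % 65535;
            b = (b + a) % 65535;
        }
        return (b << 16) | a;
    }

    /* Writer side: seal the record before it is queued for the disk. */
    void SealLogRecord(uint8_t *record, size_t total_len)
    {
        ((LOG_RECORD_HEADER *)record)->checksum =
            ChecksumRecord(record, total_len);
    }

    /* Replay side: a mismatch means the sectors backing this record were not
     * all written (a torn write), so recovery stops or skips the record. */
    int LogRecordIsValid(const uint8_t *record, size_t total_len)
    {
        const LOG_RECORD_HEADER *hdr = (const LOG_RECORD_HEADER *)record;
        return hdr->checksum == ChecksumRecord(record, total_len);
    }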

While Ravi’s point is correct - my comments about NTFS were overly
strong (think of it as disabling write-back caching on the data IT cares
about, rather than ALL data on the drive) - the underlying point remains:
any journaling system needs to know that once its write to disk has been
acknowledged it will never go away. There is no other way to ensure
inter-disk operations remain consistent with one another. That
shouldn’t be an issue for NTFS (where, as I understand it the journal is
part of the NTFS file system) but is an issue if you store the journal
on a different disk drive (witness IBM’s AIX where JFS used a separate
logical volume for storing the journal for a file system example).

Of course, this does not equate to requiring that the data be “on disk”,
merely that it be persistent. A disk (or controller) with NVRAM can
provide blazingly fast performance by lying about such I/O operations -
but they guarantee the data will eventually be written back to disk.
That’s sufficient for our purposes.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com


Yeah, I agree that lying or incorrect assumptions are there when there is
no strong order, even on a simple system with various levels of caching
and, optionally, reordering.

I THINK THE MAIN QUESTION IS, EVEN IF IT IS IMPLEMENTED, THERE HAS TO BE A
REPLAY MECHANISM TO CAPTURE ANY HICCUP. As an example, I write a
transaction on a single CPU, write-back enabled, written to cache, and the
system fails before it gets written out to disk; similarly for any place
that does caching and read/write transaction reordering …

It is an ever-interesting area, I suppose.

Thank you both!

-pro

Hi,

The conversation here has been great!

It seems like thinking of I/O as a cloud and not a queue is a much better
metaphor. When an I/O comes out of the cloud (i.e. is completed), you can’t
make ANY assumptions about other I/Os still in the cloud, and there is NO
ordering once an I/O is in the cloud. So let me ask about a FLUSH IRP. My
context is from the point of view of a disk volume filter driver, just so
you know my problem space. Would the correct metaphor for a flush be that
when a flush returns from the I/O cloud, you can be assured that all I/O
that has already come out of the cloud is committed to physical disk? Or
will the flush also not come out of the cloud until ALL other I/O has come
out of the cloud? Let me define “being in the cloud” as: I have passed an
IRP down to the driver below me with IoCallDriver and I have not had my
completion routine for that IRP called yet.

Thanks :-)

  • Jan

Flush races with all other disk I/O. If your completion routine hasn’t
been called, then you can’t assume the driver has even received your
request, much less accounted for it, by the time it gets the flush.

The effect of a flush on any I/O that hasn’t completed back to you is
indeterminate.

-p
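
For the volume-filter context, a kernel-mode sketch (LowerDeviceObject is a
placeholder for the next device in the stack) of sending a flush down
synchronously. Per the point above, this only tells you something about
writes whose completion routines had already run before the flush was built
and sent:

    #include <ntddk.h>

    /* Illustrative: send IRP_MJ_FLUSH_BUFFERS to the device below and wait
     * for it to complete. Called at PASSIVE_LEVEL. */
    NTSTATUS SendSynchronousFlush(PDEVICE_OBJECT LowerDeviceObject)
    {
        KEVENT event;
        IO_STATUS_BLOCK iosb;
        PIRP irp;
        NTSTATUS status;

        KeInitializeEvent(&event, NotificationEvent, FALSE);

        irp = IoBuildSynchronousFsdRequest(IRP_MJ_FLUSH_BUFFERS,
                                           LowerDeviceObject,
                                           NULL, 0, NULL, &event, &iosb);
        if (irp == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        status = IoCallDriver(LowerDeviceObject, irp);
        if (status == STATUS_PENDING) {
            KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
            status = iosb.Status;
        }
        return status;
    }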


> Now again I’m lost, just out of curiosity, if a disk controller does
reordering of writes and reads,

Not controller, but a SCSI LUN itself (the disk drive) does this reordering.

> and could not feed the upper interface layer with the most current data
> ( transaction semantics ) then I would assume it is nothing but a brick.

If you need transaction semantics - then wait for the completion of write 1
before submitting write 2.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> NTFS does exactly the same thing - it explicitly requests that

> write-back caching be DISABLED in the disk controller and on the disk

Am I wrong that, on SCSI, NTFS uses the FUA request flag for log writes, and
still runs with write cache on for other writes?

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Most disk drives implement CSCAN or some variation on CSCAN. You can find
lots of information on CSCAN by doing a web search.

Jamey Kirby
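
For illustration only, a toy C-SCAN pass over a pending-request queue
(invented structures, nothing like real firmware): requests at or beyond the
current head position are serviced in ascending LBA order, and the remainder
wait for the wrap-around.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        uint64_t lba;   /* starting logical block address */
        /* ... buffer, length, completion information ... */
    } REQUEST;

    static int CompareLba(const void *a, const void *b)
    {
        uint64_t x = ((const REQUEST *)a)->lba;
        uint64_t y = ((const REQUEST *)b)->lba;
        return (x > y) - (x < y);
    }

    /* Reorders 'queue' in place into C-SCAN service order for a head that
     * sweeps toward higher LBAs and then wraps back to the lowest one. */
    void CScanOrder(REQUEST *queue, size_t count, uint64_t headLba)
    {
        size_t split = 0;

        qsort(queue, count, sizeof(REQUEST), CompareLba);

        /* Find the first request at or beyond the current head position. */
        while (split < count && queue[split].lba < headLba)
            split++;

        /* Rotate: service [split..count) first, then wrap to [0..split). */
        if (split > 0 && split < count) {
            REQUEST *tmp = malloc(split * sizeof(REQUEST));
            if (tmp == NULL)
                return;   /* fall back to plain ascending order */
            memcpy(tmp, queue, split * sizeof(REQUEST));
            memmove(queue, queue + split, (count - split) * sizeof(REQUEST));
            memcpy(queue + (count - split), tmp, split * sizeof(REQUEST));
            free(tmp);
        }
    }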


Tony,

By “serialize” you mean wait for completion, or wait for the write call to
return with a PENDING status?

Let me be specific: say a driver opens a file with ZwCreateFile and a flag
of FILE_NO_INTERMEDIATE_BUFFERING. If I issue an IRP_MJ_WRITE and wait for
the call to IoCallDriver to return STATUS_PENDING, and then issue an
IRP_MJ_WRITE to the same offset, is it guaranteed the disk will contain the
data from the second write after both IRPs complete? Or will the data on
the disk be indeterminate, and the only way to make this work is to wait
for the full completion of the first IRP before issuing the second one? Is
everything the same if I open a raw volume instead of a file on an NTFS file
system?

The application is a log, where I write some data in the log file, and then
want to update the header saying the offset of the last data written. I
would like to maximize performance by issuing I/O as soon as I know what to
write, instead of waiting for the previous write of header data to complete.
The goal is to be assured the header will always contain the latest pointer
update. This will all run at PASSIVE level, probably in an arbitrary process
and thread context. Imagine a log device where apps pass down IOCTL’s to
write to the log. The IOCTL calls the write irp to append the log data
block, and then when IoCallDriver returns PENDING, issues another write to
update the log ending pointer. The IOCTL returns SUCCESS after the header
write returns PENDING, and the completion routines clean up.

  • Jan


Transactional semantics do NOT require serializing writes. However,
while a write is “in flight” you cannot depend upon it to be committed.
Databases (and journaling file systems) ROUTINELY allow multiple
asynchronous I/Os and there are well understood techniques for ensuring
correctness.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com
