Write optimization in File system

I’m working on a file system driver. Reads and writes are normally done in units of 4 KB; call this a block. In addition, we subdivide the disk space into 2 MB chunks; call each one a segment. During a write operation, after every 511 blocks of user data, the file system has to write 1 block of metadata (511 + 1 = 512 blocks of 4 KB = one 2 MB segment).

Currently, for a big write request (say 2 MB of user data), I end up passing 3 IRPs to the low-level driver, i.e.:

  • First IRP to write 511 blocks of user data
  • Second IRP to write 1 block of metadata
  • Third IRP to write remaining 1 block of user data
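
The split listed above can be sketched as follows. This is a minimal illustration in Python, not driver code. It assumes that the metadata block is the last (512th) block of each segment and that the on-disk block number where the write starts is known. `split_write` and its parameters are hypothetical names used only for illustration.

```python
BLOCKS_PER_SEGMENT = 512        # 2 MB segment = 512 blocks of 4 KB
DATA_BLOCKS_PER_SEGMENT = 511   # assumption: the 512th block holds the metadata


def split_write(start_block, user_blocks):
    """Return (kind, first_disk_block, block_count) runs for one write.

    start_block is the on-disk block where the first user block lands.
    Each run corresponds to one IRP in the scheme described in the post.
    """
    if start_block % BLOCKS_PER_SEGMENT == DATA_BLOCKS_PER_SEGMENT:
        raise ValueError("start_block is a metadata slot")
    runs = []
    block = start_block
    remaining = user_blocks
    while remaining > 0:
        offset = block % BLOCKS_PER_SEGMENT
        room = DATA_BLOCKS_PER_SEGMENT - offset   # data blocks left in this segment
        count = min(room, remaining)
        runs.append(("data", block, count))
        block += count
        remaining -= count
        if count == room:
            # The data area of this segment is full, so its metadata block is due.
            runs.append(("metadata", block, 1))
            block += 1
    return runs
```

Under these assumptions, `split_write(0, 512)` returns `[("data", 0, 511), ("metadata", 511, 1), ("data", 512, 1)]`, which is the three-IRP pattern described here.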

I want to reduce it to one IRP. Can this 1 block of metadata be embedded in the user data buffer and written together with it? Of course, I can’t afford the overhead of allocating a large temporary buffer to achieve this. Please suggest.

> I want to reduce it to one IRP. Can this 1 block of metadata be embedded with user data buffer and written together?

It looks like the answer is no, not without an intermediate buffer and copying into it.

I have major doubts that the volume/disk stack supports MDL chains.
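
For reference, the intermediate-buffer approach amounts to something like the sketch below. It is written in Python rather than kernel code, and it assumes the write starts on a segment boundary, the user data is a whole number of blocks, and the same metadata block is used for every segment (a simplification). `build_compound_buffer` is a hypothetical name.

```python
BLOCK_SIZE = 4 * 1024
DATA_BLOCKS_PER_SEGMENT = 511   # assumption: metadata is the 512th block of a segment


def build_compound_buffer(user_data, metadata_block):
    """Copy user data into a new buffer, inserting metadata_block after
    every 511 blocks of user data, so that one contiguous write suffices."""
    if len(metadata_block) != BLOCK_SIZE:
        raise ValueError("metadata must be exactly one block")
    if len(user_data) % BLOCK_SIZE != 0:
        raise ValueError("user data must be a whole number of blocks")
    chunk = DATA_BLOCKS_PER_SEGMENT * BLOCK_SIZE
    out = bytearray()
    for start in range(0, len(user_data), chunk):
        piece = user_data[start:start + chunk]
        out += piece                  # this full copy is the overhead in question
        if len(piece) == chunk:
            out += metadata_block     # segment data area full: append its metadata
    return bytes(out)
```

For 2 MB of user data this produces a 513-block buffer (2 MB + 4 KB), at the price of copying every byte of user data once.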


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Ummm…

Before you spend too much time on this, do you have any hard data that
sending down a single IRP will be faster than sending 3 (or even 10)? I could
hypothesise a badly written disk firmware/drive pair in which 10 IRPs in
parallel might well be faster than just one…

Rod


> Before you spend too much time on this, do you have any hard data that sending down a single IRP will be faster than send 3 (or even 10)?

I used Diskperf and observed that, for the same non-cached write request (2 MB of user data), NTFS generates fewer disk I/Os than my file system.

> Looks like no without having an intermediate buffer and copying to it. I have major doubts the volume/disk stack supports MDL chains.

Now I have a better understanding of what one feels when a nightmare comes true. I read the same thing in OSR’s “Windows NT Device Driver Development”: MDL chains are designed for use by network drivers only and are not supported by most of the standard I/O Manager functions.
I suppose this is the right time to listen to Enigma’s “I’m asking why”.

> I used Diskperf and observed that for the same non-cached write request (2MB user data), NTFS generates fewer disk IO than my file-system.

Maybe you can delay your metadata updates till some later time?


Maxim S. Shatskih

> > Before you spend too much time on this, do you have any hard data that
> > sending down a single IRP will be faster than send 3 (or even 10)?
> I used Diskperf and observed that for the same non-cached write request
> (2MB user data), NTFS generates fewer disk IO than my file-system.

Yes, but is it any faster against an elapsed clock? The time spent sending
the data will consist of three chunks:

  1. The CPU time consumed
  2. The time consumed actually sending the data onto the oxide
  3. The time consumed waiting for the heads to be over the correct bit of
    oxide.

Taking them backwards:

(3) certainly used to dominate, and there is nothing you can do about it - it
will penalize NTFS as much as you. There is, I suppose, a chance that you
might come out better if the driver does write re-ordering.

The fact that you are sending more data will cost you more in (2).

As for (1), sending more IRPs is certainly going to cost more CPU (but how
much is unknown, since caching will fool you). But if your IRPs
are asynchronous, the time taken to send them down will not count towards the
elapsed time, because while you are doing that the disk is busy worrying
about (2) and (3).
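
The point about asynchronous IRPs can be illustrated with a toy model. This is only a sketch under strong assumptions: a single disk queue, a fixed CPU cost per IRP, and a fixed disk time per IRP. The function names and parameters are hypothetical, the units are arbitrary, and nothing here is measured data.

```python
def elapsed_sync(irp_count, cpu_per_irp, disk_time_per_irp):
    """Each IRP is issued and waited for before the next one is sent."""
    return irp_count * (cpu_per_irp + disk_time_per_irp)


def elapsed_async(irp_count, cpu_per_irp, disk_time_per_irp):
    """IRPs are issued back to back; the disk works on earlier IRPs
    while the CPU prepares later ones (toy pipeline, one disk queue)."""
    cpu_done = 0.0    # time at which the current IRP has been handed down
    disk_free = 0.0   # time at which the disk finishes everything queued so far
    for _ in range(irp_count):
        cpu_done += cpu_per_irp
        # The disk starts this IRP once it is idle and the IRP has arrived.
        disk_free = max(disk_free, cpu_done) + disk_time_per_irp
    return disk_free
```

In this model, when the disk time per IRP is larger than the CPU cost per IRP, only the first IRP's CPU cost shows up in the asynchronous total. The model deliberately ignores any fixed per-request cost inside the device itself, which would still be paid once per IRP.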

Rod

First things first: thank you, Maxim and Rod, for your attention :)

> Maybe you can delay your metadata updates till some later time?

Even if I manage to do that, I will end up sending 2 IRPs.
Moreover, that only redistributes the time: whatever I save now will be spent at some later point.

> The time spent sending the data will consist of three chunks: 1) the CPU time consumed, 2) the time consumed actually sending the data onto the oxide, 3) the time consumed waiting for the heads to be over the correct bit of oxide.

Well, I thought that if allocating and sending one IRP down involves “x” amount of overhead, 3 IRPs for the same amount of data will make it “3*x”.
My arithmetic may not be quite right here, but when it comes to choosing between 3 disk I/Os with small buffers and 1 disk I/O with a relatively large compound buffer, the three points above do very little to tilt the scale in favor of the former.
No oxide, no heads. I’m dealing with a flash-based SSD.