Writing faster to disks

Gary_Leonne-2 · December 14, 2005, 6:21am

Hi all

I am toying with different ways to write data to SCSI disks faster than the
standard WriteFile() API call allows. My constraint is that the data must be
recognisable to NTFS, so I cannot write at sector level. One approach I came
across was:
Share memory between kernel and user level and call the NTFS driver with
IRP_MJ_WRITE, which contains the buffer to write.
My question is: will this give me an appreciable gain in speed? I would only
take the pain of writing the kernel driver if the gain over the normal
WriteFile() API is large enough. Has anyone had experience with this? Is
there some other way as well?

regards
Gary

Valeriy_Glushkov · December 14, 2005, 7:17am

Gary,

I’m not sure that writing a kernel-mode helper driver as you described will
give you much gain in speed, and the solution will be much more complex than
using only user-mode code.

I would rather try multiple overlapped WriteFileEx() calls to increase
throughput, instead of synchronous WriteFile().
This should definitely give you better performance, but it means changing
your code to support the asynchronous I/O model, which will take a bit of
work.

I hope this helps.

Best regards,
Valeriy Glushkov

Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

Mark_Roddy · December 14, 2005, 7:30am

WriteFile translates into IRP_MJ_WRITE using direct IO, which would make it
more efficient than putting your driver in the middle of the operation. What
exactly do you mean by ‘write data faster to SCSI disks’ and how exactly are
you measuring this?

=====================
Mark Roddy DDK MVP
Windows 2003/XP/2000 Consulting
Hollis Technology Solutions 603-321-1032
www.hollistech.com

Gary_Leonne-2 · December 14, 2005, 7:57am

Hi

I measure the write speed with files of around 5 MB on one disk. If this is
x, then writing in parallel (using threads) to, let's say, 8 disks should
give around 8*x. This is of course a very simple assumption (and probably
incorrect). The number of logical CPUs is a constraint. I am reaching a
speed of around 6*x and would like to extract more juice.

To measure the speed, I have used the high-performance counter and a buffer
overhead of approx. 5 MB. In my case x is around 70 MB/s. I am using no
intermediate buffering. Any suggestions?
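
To put a number on the shortfall described in this post, here is a minimal sketch in Python (chosen only for brevity; nothing in the thread uses it). The function name `scaling_efficiency` is hypothetical, and the inputs are simply the figures quoted above: x of about 70 MB/s, 8 disks, and roughly 6*x measured.

```python
def scaling_efficiency(single_disk_mbps, disks, measured_total_mbps):
    """Fraction of the ideal linear speed-up that was actually achieved."""
    ideal_total = single_disk_mbps * disks
    return measured_total_mbps / ideal_total


if __name__ == "__main__":
    x = 70.0            # single-disk write speed quoted in the post, MB/s
    measured = 6 * x    # aggregate speed quoted in the post
    eff = scaling_efficiency(x, 8, measured)
    print(f"ideal {8 * x:.0f} MB/s, measured {measured:.0f} MB/s, "
          f"efficiency {eff:.0%}")
```

With those figures the ideal is 560 MB/s against 420 MB/s measured, so the question is where the missing quarter goes: the disks, the bus, or the way the writes are issued.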

regards
Gary

PS: Valeriy, I will look into using the OVERLAPPED structure and see if I can
get better performance with WriteFileEx().

OSR_Community_User · December 14, 2005, 9:41am

Doing the write from kernel mode is not going to improve your
performance. I might be misreading you, but if you are getting 6*70 MB/s
of I/O to 6 disks on one SCSI bus, then perhaps 420 MB/s is all that the
SCSI bus can possibly do?
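
The bus-ceiling question can be checked with simple arithmetic. The sketch below is only an illustration: the thread never says which SCSI or PCI variant is installed, so the table holds nominal peak rates for a few common buses as assumptions, and `buses_needed` is a hypothetical helper name.

```python
import math

# Nominal peak rates in MB/s. Which bus the original poster actually has is
# not stated in the thread, so these entries are assumptions for illustration.
NOMINAL_BUS_MBPS = {
    "Ultra160 SCSI": 160,
    "Ultra320 SCSI": 320,
    "PCI 32-bit/33 MHz": 133,
    "PCI-X 64-bit/133 MHz": 1066,
}


def buses_needed(aggregate_mbps, bus_mbps):
    """Smallest number of buses whose nominal rates add up to the target."""
    return math.ceil(aggregate_mbps / bus_mbps)


if __name__ == "__main__":
    aggregate = 6 * 70  # MB/s, the figure discussed in this post
    for name, limit in NOMINAL_BUS_MBPS.items():
        count = buses_needed(aggregate, limit)
        print(f"{name}: at least {count} needed for {aggregate} MB/s")
```

Real buses deliver less than their nominal rate, so a result close to a whole multiple of the limit suggests the bus rather than the software is the bottleneck.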

Gary_Little-2 · December 14, 2005, 10:37am

Is this a SCSIPORT or STORPORT mini-port? If it’s SCSIPORT, transfers to
each LUN are basically synchronous, so it won’t matter how many threads
send how much data to a given LUN. How large is your SCSI block transfer
size? Unless you have changed it, I believe that 64K is the standard, so
every 5MB block you send from the application threads gets broken up into
lots of 64k chunks, and each chunk gets transferred synchronously. Now
consider map registers. When your mini-port allocates its DMA adapter you
may have a delay as the mini-port waits for resources because not enough
map registers were allocated.

The point is LOTS of things in the system cause overhead that slows down
throughput. The application layer has limited control, but there are tweaks
that can be done. If you own the mini-port you have a little more control,
but when all is said and done, you have a storage stack consisting of many
device drivers and services that have to have their say. The only way you
can do what you want is to write a PCI driver for a given SCSI HBA, but
even that will have its limitations.

Gary G. Little

OSR_Community_User · December 14, 2005, 1:43pm

Gary, I’m not sure where you got the idea that in SCSIPORT transfers to
a LUN are synchronous. SCSIPORT can handle up to 250ish requests
outstanding at a time (device and memory conditions permitting) spread
across all LUNs. Most modern controllers can handle more than one
request at a time to a particular ITL nexus (there was a time when this
wasn’t true) and most drives have a reasonable queue depth before they
start fending off requests.

You are correct about request splitting: requests are split up based on
the maximum transfer size reported by the port driver, which is based
more on the number of SG list breaks the device says it can support than
on any transfer size limitations. There are some registry settings the
admin can use to raise the number of breaks allowed to the maximum that
the controller can support, but pushing this limit up costs more in
pre-allocated memory (SRB extension sizes go up with this count). 68KB
ends up being the default size (to allow for a 64KB transfer buffer
which is sector aligned but not page aligned).
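
The arithmetic behind the splitting can be sketched as follows. This is one reading of the 68KB figure, not something the post spells out: it assumes 4 KB pages (as on x86) and that each page touched by a buffer may need its own scatter/gather entry. The helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, as on x86


def pages_spanned(start_address, length, page_size=PAGE_SIZE):
    """Number of pages touched by a buffer of `length` bytes at `start_address`."""
    offset_in_page = start_address % page_size
    return (offset_in_page + length + page_size - 1) // page_size


def split_count(request_bytes, max_transfer_bytes):
    """How many pieces a request is cut into for a given maximum transfer size."""
    return -(-request_bytes // max_transfer_bytes)  # ceiling division


if __name__ == "__main__":
    kib = 1024
    print(pages_spanned(0, 64 * kib))      # page-aligned 64 KB buffer
    print(pages_spanned(512, 64 * kib))    # sector-aligned but not page-aligned
    print(split_count(5 * kib * kib, 64 * kib))  # a 5 MB write in 64 KB pieces
```

Under those assumptions a page-aligned 64 KB buffer touches 16 pages and a merely sector-aligned one touches 17, and 17 pages of 4 KB is 68 KB. A 5 MB write cut into 64 KB pieces becomes 80 separate transfers.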

The OP will get the best performance benefit first by switching to an
asynchronous I/O model that uses completion ports. This will let them
send the most I/O with the fewest threads.

They should also be pre-allocating the space for files by creating them
and then setting the valid data length (not just the file size) out in
large chunks. Otherwise the writes through the file system will result
in a lot of file extensions, which are synchronizing operations.

They should also examine their SCSI configuration to be sure they aren’t
saturating the PCI bus or the SCSI bus. SCSI starts to degrade (if I
recall) once you go past 3 devices on a chain.

Finally, they should look into the configuration settings the SCSI
adapters provide, including the number of physical breaks that are
allowed, and then adjust their I/O size accordingly to avoid the need for
the driver to split requests.

There’s plenty that can be done before writing your own miniport or
trying to redesign the way NT I/O works.

-p

OSR_Community_User · December 14, 2005, 2:26pm

I agree with most of the recommendations below from Peter: use
asynchronous (i.e. overlapped in Win32 API terms), non-buffered I/O if
you want to talk to the metal and get close to raw disk bandwidth. You
do not necessarily need completion ports unless you are doing
multi-threaded I/O; note that you don’t need multi-threaded I/O to
saturate disk bandwidth. You just need a deep enough pipeline (i.e.
enough I/O requests pending). Use large buffers, post as much as you can
afford without tanking the system, and keep the pipeline going. You can
do all of this from user mode, with no need for kernel drivers or fancier
miniports.
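
Why pipeline depth matters can be seen with a toy model. The sketch below is an assumption-laden estimate, not a measurement: it treats throughput as bytes in flight divided by per-request latency, capped at the device ceiling. The 10 ms latency is a made-up illustrative value, the 70 MB/s ceiling is borrowed from the single-disk figure earlier in the thread, and both function names are hypothetical.

```python
import math


def modeled_throughput_mib_s(depth, io_size_kib, latency_ms, ceiling_mib_s):
    """Bytes in flight divided by latency, capped at the device ceiling."""
    uncapped = depth * io_size_kib * 1000 / latency_ms / 1024
    return min(uncapped, ceiling_mib_s)


def min_depth_to_saturate(io_size_kib, latency_ms, ceiling_mib_s):
    """Smallest number of pending requests that reaches the ceiling in this model."""
    per_request = io_size_kib * 1000 / latency_ms / 1024
    return math.ceil(ceiling_mib_s / per_request)


if __name__ == "__main__":
    # Illustrative inputs only: 64 KiB requests, 10 ms each, 70 MiB/s ceiling.
    for depth in (1, 4, 16):
        rate = modeled_throughput_mib_s(depth, 64, 10, 70)
        print(f"depth {depth}: {rate} MiB/s")
    print("depth needed:", min_depth_to_saturate(64, 10, 70))
```

In this model a single synchronous writer leaves most of the bandwidth idle, and larger requests reach the ceiling with fewer of them pending, which matches the advice to use large buffers and keep many requests outstanding.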

Setting end-of-file ahead is imperative, as Peter points out. However,
please do not set the valid data length: when you call
SetFileValidData(), NTFS simply updates the VDL, which means that if
there is a crash before the file has been completely overwritten, users
can read uninitialized data, which has bad security implications
(disclosure). SetFileValidData() is also a privileged operation.

If this is a file you repeatedly do I/O to and from (such as a database
file), you can benefit by simply zeroing out the file to begin with. It is
a one-time cost, but from then on I/O to the file incurs minimal FS overhead.
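
The idea of sizing the file once and then writing inside it can be sketched portably. This is a stand-in, not the Win32 call sequence: Python's `truncate()` plays the role of moving end-of-file ahead, the helper names are hypothetical, and the 5 MB size mirrors the file size mentioned earlier in the thread.

```python
import os
import tempfile

CHUNK = 64 * 1024  # write size used for this illustration


def preallocate(path, final_size):
    """Create the file and move end-of-file out to final_size in one step."""
    with open(path, "wb") as f:
        f.truncate(final_size)


def overwrite_in_place(path, payload):
    """Write payload chunk by chunk into the already-sized file."""
    with open(path, "r+b") as f:
        for offset in range(0, len(payload), CHUNK):
            f.seek(offset)
            f.write(payload[offset:offset + CHUNK])


def demo():
    """Return the file size before and after the writes."""
    final_size = 5 * 1024 * 1024
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        preallocate(path, final_size)
        size_before = os.path.getsize(path)
        overwrite_in_place(path, b"\xAB" * final_size)
        size_after = os.path.getsize(path)
        return size_before, size_after
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The file size is the same before and after the writes, so none of them has to extend the file. Whether the file system still does extra work per write depends on the valid data length handling described above.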

Ravi

OSR_Community_User · December 14, 2005, 2:43pm

More recommendations:

* Allocate memory in 64K chunks, aligned to 64K boundaries, and issue writes
in 64K chunks. This means use VirtualAlloc instead of malloc.

* I/O completion ports work great, even for single-threaded apps. They work
very well on SMP platforms: Create one thread per physical processor, and
have them service the IOCP queue.

* Queue queue queue.
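
The first recommendation above comes down to alignment arithmetic, sketched here under stated assumptions. The helper names are hypothetical, and padding the final chunk up to a whole 64K is a choice made for the illustration (an application may need to trim the file afterwards). Non-buffered I/O additionally requires sizes that are multiples of the sector size, which a 64K chunk satisfies for common sector sizes.

```python
CHUNK = 64 * 1024  # 64K chunk size from the recommendation above


def align_up(value, alignment=CHUNK):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def is_aligned(value, alignment=CHUNK):
    """True when value is an exact multiple of alignment."""
    return value % alignment == 0


def plan_writes(total_bytes, chunk=CHUNK):
    """(offset, length) pairs covering total_bytes, padded up to whole chunks."""
    padded = align_up(total_bytes, chunk)
    return [(offset, chunk) for offset in range(0, padded, chunk)]


if __name__ == "__main__":
    plan = plan_writes(5_000_000)
    print(len(plan), "writes, last one at offset", plan[-1][0])
```

Every offset and length in the plan is a multiple of 64K, so each write can be issued from a 64K-aligned buffer without the lower layers having to cope with a ragged tail.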

OP: You can definitely saturate the SCSI buses, without resorting to
building your own miniport, if you use the existing system effectively. The
I/O paths make trade-offs between performance and flexibility. Just calling
WriteFile in a loop is easy, but is definitely not the fastest way. You
need to understand the I/O architecture a bit more, in order to exploit the
“fast” paths.

– arlie

Gary_Little-2 · December 14, 2005, 2:44pm

The synchronous LUN thought was incorrect, but it was from my own
experience down in WinDbg watching a LUN hang when an SRB was not
completed. However, that debugging session was 3 to 4 years ago, and since
it was a bug, that behavior stopped once SRB completion was working on all
Loons.

Gary G. Little

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Peter Wieland
Sent: Wednesday, December 14, 2005 12:43 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Writing faster to disks

Gary, i’m not sure where you got the idea that in SCSIPORT transfers to a
LUN are synchronous. \SCSIPORT can handle up to 250ish requests
outstanding at a time (device and memory conditions permitting) spread
across all LUNs. Most modern controllers can handle more than one request
at a time to a particular ITL nexus (there was a time when this wasn’t
true) and most drives have a reasonable queue depth before they start
fending off requests.

You are correct about request splitting - requests are split up based on
the maximum transfer size reported by the port driver, which is based more
on the number of SG list breaks the device says it can support than on any
transfer size limitations. There are some registry settings the admin can
use to up the number of breaks allowed to the maximum that the controller
can support, but pushing this limit up costs more in pre-allocated memory
(SRB extension sizes go up with this count). 68KB ends up being the
default size (to allow for a 64KB transfer buffer which is sector-aligned
but not page-aligned).
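
One way to read the 68KB figure is as page arithmetic: a 64KB buffer that starts on a sector boundary but not on a page boundary touches 17 pages rather than 16, and 17 x 4KB = 68KB. A minimal sketch, assuming 4KB pages; `PAGE_SIZE_BYTES` and `pages_spanned` are illustrative names, not SCSIPORT identifiers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4KB pages */

/* Number of pages touched by a buffer of `len` bytes starting at `addr`. */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    assert(len - 1 <= UINTPTR_MAX - addr); /* buffer must not wrap around */
    uintptr_t first_page = addr / PAGE_SIZE_BYTES;
    uintptr_t last_page = (addr + len - 1) / PAGE_SIZE_BYTES;
    return (size_t)(last_page - first_page + 1);
}
```

With a page-aligned start such as 0x10000, `pages_spanned(0x10000, 64 * 1024)` is 16. Shift the start by one 512-byte sector and `pages_spanned(0x10000 + 512, 64 * 1024)` is 17, which is where the extra page comes from.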

The OP will get the best performance benefit first by switching to an
asynchronous I/O model that uses completion ports. This will let them
send the most I/O with the least number of threads.

He should also be pre-allocating the space for files by creating them and
then setting the valid data length (not just the file size) out in large
chunks. Otherwise the writes through the file system will result in a lot
of file extensions, which are synchronizing operations.

And he should examine their SCSI configurations to be sure they aren’t
saturating the PCI bus or the SCSI bus - SCSI starts to degrade (if I
recall) once you go past 3 devices on a chain.

Finally he should look into the configuration settings the SCSI adapters
provide - including the number of physical breaks that are allowed - and
then adjust their I/O size accordingly to avoid the need for the driver to
split them.

There’s plenty that can be done before writing your own miniport or trying
to redesign the way NT I/O works.

-p

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@seagate.com
Sent: Wednesday, December 14, 2005 7:36 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Writing faster to disks

Is this a SCSIPORT or STORPORT mini-port? If it’s SCSIPORT, transfers to
each LUN are basically synchronous, so it won’t matter how many threads
send how much data to a given LUN. How large is your SCSI block transfer
size? Unless you have changed it, I believe that 64K is the standard, so
every 5MB block you send from the application threads gets broken up into
lots of 64k chunks, and each chunk gets transferred synchronously. Now
consider map registers. When your mini-port allocates its DMA adapter you
may have a delay as the mini-port waits for resources because not enough
map registers were allocated.
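
For a sense of scale on the splitting described above: if the maximum transfer size really is 64K, a 5MB write becomes 80 separate transfers. A minimal sketch of that arithmetic, taking 5MB as 5 x 1024 x 1024 bytes; `chunks_needed` is a hypothetical helper, not a port driver routine:

```c
#include <assert.h>
#include <stddef.h>

/* Number of transfers needed to move `total_bytes` when each transfer
 * carries at most `max_transfer_bytes` (rounds up for a partial last chunk). */
static size_t chunks_needed(size_t total_bytes, size_t max_transfer_bytes)
{
    assert(max_transfer_bytes > 0);
    return (total_bytes + max_transfer_bytes - 1) / max_transfer_bytes;
}
```

Here `chunks_needed(5 * 1024 * 1024, 64 * 1024)` gives 80. Raising the maximum transfer size lowers the count proportionally.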

The point is LOTS of things in the system cause overhead that slows down
throughput. The application layer has limited control, but there are tweaks
that can be done. If you own the mini-port you have a little more control,
but when all is said and done, you have a storage stack consisting of many
device drivers and services that have to have their say. The only way you
can do what you want is to write a PCI driver for a given SCSI HBA, but
even that will have its limitations.

Gary G. Little

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@googlemail.com
Sent: Wednesday, December 14, 2005 6:57 AM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Writing faster to disks

Hi

I check the writing speed with files of around 5 MB for one disk. If this
is x, then writing in parallel (using threads) to, let’s say, 8 disks should
be around 8*x. This is of course a very simple assumption (and probably
incorrect). The number of logical CPUs is a constraint. I am reaching a
speed of around 6*x and would like to extract more juice.

To measure the speed, I have used high performance counter and a buffer
overhead of approx. 5 MB. In my case x is around 70 MBps. I am using no
intermediate buffering. Any suggestions?
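
To put numbers on the scaling described above, with x = 70 MBps the ideal for 8 disks is 560 MBps, and 6*x is 420 MBps, i.e. 75% of the ideal. A small sketch of that calculation; the function names are made up for illustration, and the linear-scaling ideal is the simple assumption stated above, not a measured limit:

```c
#include <assert.h>

/* Ideal aggregate throughput if every disk added its full single-disk rate. */
static unsigned ideal_mbps(unsigned per_disk_mbps, unsigned disks)
{
    return per_disk_mbps * disks;
}

/* Observed throughput as a whole-number percentage of the ideal. */
static unsigned efficiency_percent(unsigned observed_mbps, unsigned ideal)
{
    assert(ideal > 0);
    return observed_mbps * 100u / ideal;
}
```

So `ideal_mbps(70, 8)` is 560 and `efficiency_percent(6 * 70, 560)` is 75.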

regards

Gary

PS: Valeriy, I would check the use of overlapped structure and see if I
can get better performance with WriteFileEx().

On 12/14/05, Mark Roddy wrote:

WriteFile translates into IRP_MJ_WRITE using direct IO, which would make
it more efficient than putting your driver in the middle of the operation.
What exactly do you mean by ‘write data faster to SCSI disks’ and how
exactly are you measuring this?

=====================
Mark Roddy DDK MVP
Windows 2003/XP/2000 Consulting
Hollis Technology Solutions 603-321-1032
www.hollistech.com

— Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256 You are currently subscribed
to ntdev as: unknown lmsubst tag argument: ‘’ To unsubscribe send a blank
email to xxxxx@lists.osr.com

OSR_Community_User · December 14, 2005, 2:53pm

I defer to Ravi on the point of setting the valid data length.

-p

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Ravisankar
Pudipeddi
Sent: Wednesday, December 14, 2005 11:26 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Writing faster to disks