DMA + common buffer ARC

Hello there,

I am developing DMA support for the transfer of high-rate stream data between
my device and the driver. The task is to flush this data to my hard disk.
I chose to use the common-buffer architecture because no user-specific data
will be transferred between the card and user space. I have some questions,
though, that remain unclear to me.

  1. Will the logical address returned by the “AllocateCommonBuffer”
    function be accessible from my hardware? I understand that my PLX chip is
    aware of all bus addresses and so will be able to access this buffer
    through this address. Am I right?
  2. Is there an optimal solution concerning the size of the buffer and
    the number of buffers to be allocated? I am expecting to receive more
    than 100 Mb/s.
  3. Is there documentation that really explains DMA transactions? The
    books I am reading explain the functions but not the actual architecture,
    and that makes it very difficult to understand even which type of DMA to
    use.

Thank you in advance.

Stylianides Nikolas.

Hi Nikolas,

  1. I think so - a logical address is what the hardware understands in
    the Windows DMA model - but double-check with the DDK documentation to
    be sure.

  2. It depends…

  3. There is a really good paper on Microsoft’s web site describing both
    the WDM and WDF models of DMA:
    http://www.microsoft.com/whdc/driver/kernel/DMA.mspx

Regards,
-Mike

Nikolas Stylianides wrote:


  1. Will the logical address returned by the “AllocateCommonBuffer”
    function be accessible from my hardware? I understand that my PLX
    chip is aware of all bus addresses and so will be able to access
    this buffer through this address. Am I right?

AllocateCommonBuffer returns a physical address and a linear address.
The linear address is used by your driver. The physical address is the
bus address your hardware will see.
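
As a rough WDM sketch of that call (MY_DEVICE_EXTENSION and its fields are
made-up names, and the DMA_ADAPTER is assumed to have been obtained earlier
from IoGetDmaAdapter):

    #include <wdm.h>

    /* Minimal sketch: allocate a common buffer and record both views
     * of it. MY_DEVICE_EXTENSION is our own (hypothetical) type. */
    NTSTATUS AllocateMyCommonBuffer(MY_DEVICE_EXTENSION *Ext, ULONG Length)
    {
        PHYSICAL_ADDRESS logical;  /* bus address - programmed into the card */
        PVOID va;                  /* kernel virtual address - used by driver */

        va = Ext->DmaAdapter->DmaOperations->AllocateCommonBuffer(
                 Ext->DmaAdapter,
                 Length,
                 &logical,         /* receives the device-visible address */
                 TRUE);            /* CacheEnabled - typical on x86 */
        if (va == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        Ext->CommonBufferVa      = va;
        Ext->CommonBufferLogical = logical;
        /* Write logical.LowPart (and HighPart, for a 64-bit capable
         * device) into the PLX chip's DMA address register. */
        return STATUS_SUCCESS;
    }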

  2. Is there an optimal solution concerning the size of the buffer
    and the number of buffers to be allocated? I am expecting to
    receive more than 100 Mb/s.

The optimal value can really only be determined by experimentation. If
your disk cannot keep up at 12 megabytes per second, then the amount of
buffering is irrelevant: you are going to overflow sooner or later, no
matter how large the buffer is. If your disk CAN keep up, then the
buffering acts as a cushion to handle the occasional delay until you can
write it all to disk. If you want to be able to handle a 1/4-second delay,
then you will need about 3 megabytes of buffering (100 Mb/s is roughly
12.5 MB/s, and 12.5 MB/s over 0.25 s is just over 3 MB).

Who is doing the disk write? Is it a user-mode app?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for your answers. I really had to get this clarified, because none of
the books I have read makes it clear.

Flushing all that data to disk is another concern of mine. The driver
should write the buffered data to disk. I have not figured out yet how I am
going to do this.

I know that there are two ways to do that: (a) the NTFS file system via the
ZwXXX functions, or (b) raw sector writes (faster). But I have not yet
checked either.

Is there another way?



Neither would be my first choice. One should always have in mind a
primary goal of putting as little as possible into kernel mode. Given
that, I would expect to have a user-mode application acting as a conduit
between my DMA driver and the hard disk. If the disk can keep up, this
will work just fine, and it’s certainly more flexible and easier to
debug. If the disk can’t keep up, you’re screwed anyway; kernel mode
won’t help. If the disk is right at the bleeding edge, latency is going
to kill you sooner or later. Buy a faster disk.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Will the logical address returned by the “AllocateCommonBuffer” function
> be accessible from my hardware?

Yes.

> Is there an optimal solution concerning the size of the buffer and the
> number of buffers to be allocated? I am expecting to receive more than
> 100 Mb/s.

Yes: no buffers at all :-) The app sends lots of overlapped IOs to the
driver; the driver chains their MDLs together, feeds them to the DMA engine,
and runs DMA over them.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> I know that there are two ways to do that. (a) The NTFS file system via
> ZwXXX functions, (b) raw sector writes (faster).

Hardly any faster if the output file is not compressed.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

You mean that I will have the same throughput if I upload my buffer to user
space and then flush it to disk?


The bottleneck is going to be getting the disk head in the right position,
not the cost of flushing the buffers from user mode.

=====================
Mark Roddy
Windows .NET/XP/2000 Consulting
Hollis Technology Solutions 603-321-1032
www.hollistech.com


Nikolas Stylianides wrote:

You mean that I will have the same throughput if I upload my buffer to user
space and then flush it to disk?

What I mean is that the difference in throughput is not going to be an
issue unless you are already in real trouble, but the ease of development
and debugging is significantly greater, and the risk in case of a problem
is significantly lower.

Kernel mode should always be a last resort.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Not necessarily the same throughput (there will be some added latency
involved in getting the user-mode process to push the next request to
the disk, but disks add unpredictable latency anyway) but the chance of
bugchecking the system due to a bug in your kernel-mode code is reduced.

Of course if you’re building an embedded system to do this then there’s
probably little difference to you between a user-mode and kernel-mode
crash. But if this is supposed to run on any system then the user would
probably appreciate it.

-p


I have already begun to design the kernel-mode version. I want to test it
and see if there is a significant improvement. I will send my results back
to the newsgroup.

Thank you very much everybody. See you in about a week.



How are you going to judge “improvement”? Are you already doing it in
user mode and falling behind? If not, why go to the trouble?

Some people seem to think that the CPU somehow runs its cycles faster in
kernel mode than it does in user mode. It ain’t so. The only thing you
would gain by doing your disk writes in the kernel is a reduction in the
kernel/user transitions. For the most part, that overhead is going to
be completely lost in the overhead of writing to the disk itself.

There is more at stake here than raw performance. If a user-mode scheme
handles your data and runs the CPU at 35% load, and a kernel-mode scheme
would run at 30% load, it would be a huge mistake to release the kernel
scheme.

Now, if your user-mode scheme runs the CPU at 110%, so that you drop
data, and you have already exhausted other performance issues (like
buffer sizes, buffer counts, batching up work), then it might be worth
an experiment.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim, I have already tested it in user mode. I am able to flush about
8 Mbit/s and that's it. I want to eliminate the step of copying the
buffer from the DMA transaction into a user-space buffer before flushing it
to disk. Have you ever tested this? How can you be sure that it is a waste
of time to try?

P.S. I even thought of using disk.sys directly and staying sector-aligned in
order to be faster.


Perhaps you should consider using direct I/O and MDL-based transfers from
your PCI device into your user-mode writer app, instead of using a
common-buffer scheme that forces you into a buffer copy.

=====================
Mark Roddy


How does this scheme work? How does it eliminate the buffer copy?
You mean that before the transfer my device interrupts to let me know how
much data it has for me; after that I prepare a buffer in user mode and
give it to the device to fill; and after completion my device interrupts
again, so I flush the data to disk. Is this the scheme?


Your application sends some number of requests to your PCI driver, using an
IOCTL (or a read, for that matter) with DIRECT_IO. These requests are each
marked pending and queued in your driver, and are used as the DMA sink for
data transfer from your device. As each buffer is filled, you complete the
corresponding I/O request, which notifies the app that the data is available
for writing to disk. No copies are involved other than the original DMA
operation from the PCI device to system memory and the DMA operation from
system memory to the HBA. Your app has to feed your driver at an appropriate
rate, but you certainly ought to be able to do better than 8 Mb/s. Your app
provides 'big enough' buffers - I don't know the details of your data
format, so you will have to figure this out. The idea is that your app keeps
the 'pump primed' so that there is always another buffer waiting to be
filled in the queue. You should, with a bit of work, easily be able to keep
up with the disk, which ought to be the bottleneck. A rough sketch of the
app side is below.

Of course, if your device doesn't do scatter/gather DMA, this probably won't
help, as the extra copy operation is going to happen anyhow.
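
Something like this (user-mode side only; "\\.\MyCapture" and the plain
ReadFile interface are placeholders for whatever the real driver exposes,
and error handling is trimmed):

    #include <windows.h>

    #define NUM_SLOTS 8                /* reads kept in flight at once */
    #define SLOT_SIZE (1024 * 1024)    /* 1 MB per buffer */

    int main(void)
    {
        HANDLE dev = CreateFileA("\\\\.\\MyCapture", GENERIC_READ,
                                 0, NULL, OPEN_EXISTING,
                                 FILE_FLAG_OVERLAPPED, NULL);
        OVERLAPPED ov[NUM_SLOTS];
        PUCHAR     buf[NUM_SLOTS];
        DWORD      got;
        int        i;

        /* Prime the pump: post every read up front, so the driver
         * always has another buffer queued as a DMA sink. */
        for (i = 0; i < NUM_SLOTS; i++) {
            buf[i] = (PUCHAR)VirtualAlloc(NULL, SLOT_SIZE,
                                          MEM_COMMIT, PAGE_READWRITE);
            ZeroMemory(&ov[i], sizeof(ov[i]));
            ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            ReadFile(dev, buf[i], SLOT_SIZE, NULL, &ov[i]);
        }

        /* Consume the slots round-robin (this assumes the driver fills
         * queued buffers FIFO), write each to disk, and repost at once. */
        for (i = 0; ; i = (i + 1) % NUM_SLOTS) {
            GetOverlappedResult(dev, &ov[i], &got, TRUE);   /* wait */
            /* ... hand buf[i] (got bytes) to the disk writer here ... */
            ReadFile(dev, buf[i], SLOT_SIZE, NULL, &ov[i]); /* repost */
        }
    }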

=====================
Mark Roddy


I have to agree with the assessment about the speed. We have a PCI device
that does DMA from our own custom hardware. Initially we didn't support
hardware scatter/gather, but we do now. Our benchmark doesn't actually move
the data to/from disk; it is processed internally. We could achieve
~140 MB/s write (from memory to the PCI device - actually a bus-master
read, since it is described in terms of the DMA master, which is the PCI
device) and ~56 MB/s read from the PCI device to PC memory. After
scatter/gather was implemented, this went up to 220 MB/s write and
140 MB/s read. I should add that our card is 64-bit/66 MHz.

I have benchmarked this on different platforms, and the particular chipset
does affect the performance. Some chipsets seem to perform much better
than others. For example, a SuperMicro motherboard using the ServerWorks
GC-SL chipset performed noticeably worse than the Intel E7501 or E7505
chipsets (these are Xeon-based motherboards). An AMD Athlon MP based system
using the AMD-762 chipset performed on par with the Intel E7501 based
system.


Russ Poffenberger
Credence Systems Corp.
xxxxx@credence.com

That is normal. Allocate the space in the app, send overlapped IO to your
driver, then - after IO completion - do a WriteFile to the disk file.
Use FILE_FLAG_NO_BUFFERING, and allocate the memory sector-aligned
(VirtualAlloc will give you page-aligned memory). A sketch is below.
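
Something along these lines ("capture.dat" is a placeholder; with
FILE_FLAG_NO_BUFFERING the write length and file offset must be multiples
of the sector size, which a page-aligned VirtualAlloc buffer and a
power-of-two length satisfy):

    #include <windows.h>

    int main(void)
    {
        DWORD  len = 1024 * 1024;        /* multiple of the sector size */
        PUCHAR buf = (PUCHAR)VirtualAlloc(NULL, len, MEM_COMMIT,
                                          PAGE_READWRITE);
        DWORD  written;

        HANDLE file = CreateFileA("capture.dat", GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS,
                                  FILE_FLAG_NO_BUFFERING, NULL);

        /* ... fill buf with data received from the driver here ... */

        /* Both the length and (for seeks) the file offset must stay
         * sector-aligned while the file is opened unbuffered. */
        WriteFile(file, buf, len, &written, NULL);

        CloseHandle(file);
        VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }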

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


> How does this scheme work? How does it eliminate the buffer copy?

DMA of your card runs over the same memory as the disk controller’s DMA.

Your driver must:

  • use DO_DIRECT_IO
  • pass Irp->MdlAddress to the DMA routines; they will give you a
    scatter/gather list, which you feed to the hardware (see the sketch
    after this list).
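
A minimal sketch of the driver side (MY_DEVICE_EXTENSION is a made-up type
holding the DMA_ADAPTER; error handling and the actual hardware programming
are omitted):

    #include <wdm.h>

    /* Callback invoked with the ready scatter/gather list; runs at
     * DISPATCH_LEVEL. The IRP is carried in via Context. */
    VOID MyListControl(PDEVICE_OBJECT DeviceObject, PIRP Reserved,
                       PSCATTER_GATHER_LIST SgList, PVOID Context)
    {
        PIRP irp = (PIRP)Context;
        ULONG i;

        /* Feed each element (bus address + length) to the card's
         * scatter/gather engine, then start the transfer. */
        for (i = 0; i < SgList->NumberOfElements; i++) {
            /* SgList->Elements[i].Address, SgList->Elements[i].Length */
        }
    }

    /* Assumes DO_DIRECT_IO, so Irp->MdlAddress describes the app's
     * buffer. Must be called at DISPATCH_LEVEL. */
    NTSTATUS StartDmaForIrp(MY_DEVICE_EXTENSION *Ext,
                            PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PMDL mdl = Irp->MdlAddress;

        return Ext->DmaAdapter->DmaOperations->GetScatterGatherList(
                   Ext->DmaAdapter, DeviceObject, mdl,
                   MmGetMdlVirtualAddress(mdl), MmGetMdlByteCount(mdl),
                   MyListControl, Irp /* context */,
                   FALSE /* device-to-memory */);
    }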

Your app must:

  • allocate aligned memory with VirtualAlloc
  • send lots of overlapped reads to the driver
  • after the reads complete, the app must arrange them in the proper order
    (for instance, allocate the OVERLAPPED as part of your own structure and
    keep a sequence number there - see the sketch below), and do a WriteFile
    to the disk file.
  • the file must be opened with FILE_FLAG_NO_BUFFERING. Note that in this
    case all your lengths must be sector-aligned.

This is the fastest possible way.
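
For the ordering part, a minimal sketch of the "OVERLAPPED as part of your
structure" idea (READ_SLOT is a made-up name):

    #include <windows.h>

    typedef struct _READ_SLOT {
        OVERLAPPED Ov;       /* must stay valid while the read is pending */
        ULONG      Sequence; /* restores order before the disk write */
        PUCHAR     Buffer;   /* VirtualAlloc'd: page- and so sector-aligned */
        DWORD      Length;
    } READ_SLOT;

    /* When a completion hands back the OVERLAPPED pointer, recover the
     * enclosing slot; slot->Sequence then tells the writer where this
     * buffer belongs in the output file. */
    static READ_SLOT *SlotFromOverlapped(OVERLAPPED *ov)
    {
        return CONTAINING_RECORD(ov, READ_SLOT, Ov);
    }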

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com