RE(3): problems with DMA transfer with size > 4k

There are user-mode mechanisms available for allocating locked memory and/or a
single page of memory.

To allocate a single page, use VirtualAlloc() with MEM_COMMIT, passing the
page size reported by GetSystemInfo() (SYSTEM_INFO.dwPageSize).
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/virtualalloc.asp
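
For concreteness, here is a minimal user-mode sketch of that allocation,
including an optional VirtualLock() to pin the page in the working set:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    void *page;

    GetSystemInfo(&si);                        /* page size is si.dwPageSize */

    /* Commit one page of read/write memory. */
    page = VirtualAlloc(NULL, si.dwPageSize,
                        MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (page == NULL) {
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    /* Optionally pin the page in physical memory (working set permitting). */
    if (!VirtualLock(page, si.dwPageSize))
        printf("VirtualLock failed: %lu\n", GetLastError());

    /* ... hand the buffer to the driver, e.g. via DeviceIoControl ... */

    VirtualUnlock(page, si.dwPageSize);
    VirtualFree(page, 0, MEM_RELEASE);
    return 0;
}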

If you need more than one page, you need additional privileges in the
user-mode process. Look at both the MEM_LARGE_PAGES and MEM_PHYSICAL
flags on the above page as starting points. It sounds like you’ll need a
service running under a user account you create (with the special
privileges), or your clients will have to be restricted to administrators
(or have the privileges granted to them).
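
And a rough sketch of the MEM_PHYSICAL (AWE) route, assuming the account
already holds the "Lock pages in memory" privilege. Note that the pages AWE
returns are pinned but not guaranteed to be physically contiguous, so by
itself this does not remove the need for scatter/gather on the card:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    SYSTEM_INFO si;
    ULONG_PTR   numPages = 16;                 /* 16 pages, e.g. 64 KB */
    ULONG_PTR  *pfnArray;
    void       *va;

    GetSystemInfo(&si);
    pfnArray = (ULONG_PTR *)malloc(numPages * sizeof(ULONG_PTR));
    if (pfnArray == NULL)
        return 1;

    /* Grab physical pages; fails unless SeLockMemoryPrivilege is held. */
    if (!AllocateUserPhysicalPages(GetCurrentProcess(), &numPages, pfnArray)) {
        printf("AllocateUserPhysicalPages failed: %lu\n", GetLastError());
        return 1;
    }

    /* Reserve a MEM_PHYSICAL region and map the pages into it. */
    va = VirtualAlloc(NULL, numPages * si.dwPageSize,
                      MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
    if (va == NULL || !MapUserPhysicalPages(va, numPages, pfnArray)) {
        printf("mapping failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... use the buffer, then unmap and free ... */
    MapUserPhysicalPages(va, numPages, NULL);   /* passing NULL unmaps */
    FreeUserPhysicalPages(GetCurrentProcess(), &numPages, pfnArray);
    return 0;
}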

Anyways, the above should get you started working with this board. Try
to influence future designs to support scatter/gather, eh?
.

-----Original Message-----
From: Russ Poffenberger [mailto:xxxxx@nptest.com]
Sent: Friday, January 16, 2004 9:17 AM
Subject: Re[2]: problems with DMA transfer with size > 4k

At 10:06 AM 1/16/2004 -0700, you wrote:

> Isn’t a common buffer allocated from the non-paged pool? If so, you have
> to be careful with how large a buffer you allocate. Mapping it back to the
> user program sounds like a bad idea. I would think you want to copy it to
> the users buffer before completing the IRP.

Copying to the user’s buffer is good for low data rates but not when
data rates are very large.

True, but if the hardware doesn’t support scatter/gather, you already have a
performance headache to deal with. You could, as previously suggested, use
multiple buffers with a round-robin approach.

If the DMA is large, then using a kernel-allocated common buffer for all of
it will run into resource problems anyway.

Russ Poffenberger
NPTest, Inc.
xxxxx@NPTest.com

Thanks to all who took the time to answer.

After understanding the page-size problem, we decided to try splitting the
DMA up into 4k chunks. I changed the Dpc and the AdapterControl routine and
checked it out with a 100M DMA transfer (actually 245 pages). We measured a
speed of about 20MB/s. The theoretically reachable value for PCI (32-bit,
33MHz) is about 120MB/s. The measured speed is not astonishing, but we know
how to increase it: scatter/gather. The hardware designer is now thinking
about a redesign (a bigger FPGA). We decided on page-sized DMA because if we
allocate contiguous memory in the kernel it’s not certain we’ll get it, or
that it will be big enough, especially if the card is running on a system
with low memory. So I thought it would be more reliable to split the transfer
into small DMA chunks but write directly to user memory, and also not burden
the processor with an additional copy from kernel to user memory.

OK, so far so good. We’re not completely satisfied, but we can live with it
for the moment.
See ya
Daniel

Daniel,

Beware that your value for “theoretical maximum” is very dependent on how
the card interacts with the bus. If your bus transfers are not in “burst
mode”, you’re not going to get anywhere near 120MB/s. Check this with your
HW designer and maybe a logic analyzer (or better yet, a PCI bus analyzer)
before deciding to add scatter/gather.

How large are your average packets (not bounding them at 4K boundaries)?

What is causing the drop in transfer rate from the theoretical? Is it that
you’re waiting for the first packet to be sent before the next one can be
started? If so, can you prepare another packet so that it’s just a question
of adding it to the chip registers when the interrupt comes in, rather than
having to go through a lot of processing steps to get it started?

Are you copying large chunks of data (> 50K) in the driver? If so, consider
writing a special memory-copy routine that uses SSE registers (if available
in the processor) and uses either non-cacheable memory or non-temporal
stores, so that the data is written directly to memory. There’s a lot to be
gained from copying directly to memory rather than going through the cache,
especially if it’s going to be DMA’d anyway. [1]

I’m sorry if I’m asking questions that you’ve already answered, but I’m just
trying to make sure you don’t redesign the hardware and then find that it’s
not that much better (or no better at all).

[1] There is a little-known effect of writing data to the cache: the first
write to a cache line forces the processor to read in the remaining data for
that cache line, which for modern processors is either 64 or 128 bytes. Not
only does this put extra pressure on the memory bandwidth, it also causes a
read from a completely different area of memory than the previous read, which
will interfere with the reads that fetch the actual data. And if you’re
spilling the cache contents to memory later on anyway, the read was wasted…
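
As an illustration of the kind of copy routine suggested above (not anyone’s
actual code), here is a minimal sketch using SSE2 non-temporal stores. It
assumes 16-byte-aligned buffers and a length that is a multiple of 64 bytes;
a real driver would also need a runtime SSE2 check plus, in x86 kernel mode,
KeSaveFloatingPointState/KeRestoreFloatingPointState around the copy:

#include <emmintrin.h>    /* SSE2 intrinsics */
#include <stddef.h>

/* Copy 'bytes' from src to dst with streaming (non-temporal) stores so the
 * destination does not pollute the cache.  Assumes 16-byte alignment and
 * bytes % 64 == 0; head/tail handling is omitted for brevity. */
static void copy_nontemporal(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t         i;

    for (i = 0; i < bytes / 16; i += 4) {
        __m128i a = _mm_load_si128(s + i);
        __m128i b = _mm_load_si128(s + i + 1);
        __m128i c = _mm_load_si128(s + i + 2);
        __m128i e = _mm_load_si128(s + i + 3);
        _mm_stream_si128(d + i,     a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();    /* make the streaming stores globally visible */
}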



Here is what I do for our custom PCI card. It too does not support
scatter/gather. I am using Compuware’s DriverWorks, but these are just
wrappers around the standard kernel routines for the most part, so there
should be a DDK equivalent.

The DW DMA concept has you set up a KDmaAdapter object. In this you specify
attributes of the DMA transfer such as scatter/gather support (true/false),
interface type (PCI), width, etc. When it comes time to perform the DMA, a
KDmaTransfer object is initialized (it takes the KDmaAdapter instance as an
argument), and the Initiate method is called on it (which takes an mdl and a
callback as arguments). This method sets up the mapping and invokes the
callback. My callback can get (from the KDmaTransfer object) the number of
bytes remaining, and if it is non-zero, calls GetTransferDescriptors (on the
KDmaTransfer) to get a DMA segment pair. This pair is a PCI physical
address and length. The card’s DMA engine is programmed and the DMA is
started. When the ISR gets the interrupt, it calls the Continue method on the
KDmaTransfer. This sets up the next contiguous segment (from the original
mdl) and invokes the callback again. If the card had implemented
scatter/gather, the first invocation of the callback would get enough
transfer descriptors to program however many scatter/gather registers the
hardware supported. If there were more transfers than registers, then you
would use the above method to program and start another sequence of
segments. So in a way, not having scatter/gather is just a special case
where only one register exists.

In this way, it breaks the DMA up into contiguous segments directly from
the mdl. In my experience, the first segment is typically smaller than the
rest, but usually 6 or 7K. After that, segments are usually 32k or more.
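
For readers without DriverWorks, here is a rough sketch of the plain WDM/DDK
equivalent of that callback. The DEVICE_EXTENSION fields and
StartHardwareDma() are hypothetical placeholders; the point is that each
MapTransfer() call hands back one physically contiguous segment (a logical
address plus a trimmed length), which is exactly the "single scatter/gather
register" special case described above:

#include <ntddk.h>

typedef struct _DEVICE_EXTENSION {      /* hypothetical driver context */
    PDMA_ADAPTER DmaAdapter;            /* from IoGetDmaAdapter()       */
    PMDL         Mdl;                   /* MDL for the caller's buffer  */
    PVOID        CurrentVa;             /* current VA within the MDL    */
    ULONG        BytesRemaining;
    PVOID        MapRegisterBase;
    BOOLEAN      WriteToDevice;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

VOID StartHardwareDma(PDEVICE_EXTENSION Ext, PHYSICAL_ADDRESS Seg, ULONG Len);

/* Called by the system when map registers are available (via
 * AllocateAdapterChannel).  The DPC repeats the MapTransfer/program/start
 * cycle until BytesRemaining is zero, calling FlushAdapterBuffers() after
 * each completed segment. */
IO_ALLOCATION_ACTION
AdapterControl(PDEVICE_OBJECT DeviceObject, PIRP Irp,
               PVOID MapRegisterBase, PVOID Context)
{
    PDEVICE_EXTENSION ext    = (PDEVICE_EXTENSION)Context;
    ULONG             length = ext->BytesRemaining;     /* ask for all of it */
    PHYSICAL_ADDRESS  seg;

    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Irp);

    ext->MapRegisterBase = MapRegisterBase;

    /* Returns the bus-visible address of one contiguous segment and trims
     * 'length' down to the size of that segment. */
    seg = ext->DmaAdapter->DmaOperations->MapTransfer(
              ext->DmaAdapter, ext->Mdl, MapRegisterBase,
              ext->CurrentVa, &length, ext->WriteToDevice);

    StartHardwareDma(ext, seg, length);   /* program address/length, go */

    ext->CurrentVa       = (PUCHAR)ext->CurrentVa + length;
    ext->BytesRemaining -= length;

    /* Bus-master device: return the adapter object but keep the map
     * registers for the remaining segments. */
    return DeallocateObjectKeepRegisters;
}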

Our card is a 64bit/66Mhz device, and we routinely achieve 250Mbytes/sec on
DMA writes (PCI device to memory) and 90Mbytes/sec on DMA reads (memory to
PCI device). It took a bit of tuning on the hardware side as well. A good
PCI analyzer is a must (we have a VMetro). What is very important to
getting good DMA performance is getting the burst size as large as
possible. When bursting, transfers are 15ns each (66MHz PCI bus, 30ns for
33MHz). However, if the burst is short (a burst can be terminated by either
the PCI device or the CPU), then it can take several hundred nanoseconds to
re-negotiate the bus and for the PCI device to re-acquire bus mastership;
this is the killer for DMA performance.

You need to keep burst sizes above 128 transfers to start reaping
performance benefits.

I have also noticed that different motherboard/chipset combos have
different performance, mainly due to their ability to sustain a burst,
which is probably a measure of their PCI bridge and memory bandwidth. The
best results I have seen are on an Intel motherboard using the E7501
chipset. A close second is a SuperMicro MB using an Intel E7505 chipset.
(both using P4 Xeon processors). I have an AMD Athlon MP based MB which
comes in third (at about 150Mbytes/s DMA write), and last was a SuperMicro
motherboard (P4 Xeon) using the ServerWorks GC-LE chipset at 75Mbytes/sec
DMA write.

Russ Poffenberger
NPTest, Inc.
xxxxx@NPTest.com

Well, maybe I don’t understand the main problem, but we have recently been
working on a PCI card that was also a bus master without scatter/gather. All
the logic was in the end done by the card; the only thing it needed from my
driver was the address and length of the buffer. There were two buffers,
about 512 kB long, allocated using AllocateCommonBuffer, and a switching
mechanism that was exposed to the user-mode program. When the transfer ended,
the PCI card issued an interrupt. The transfer speed was about 30 MDW/s,
which is practically the full speed of PCI. Internally the PCI card was
transferring the data in smaller chunks and did disconnects between them, but
that was all in the FPGA logic, not in the driver. Honestly, I don’t
understand why there is so much talk about busmaster DMA (without
scatter/gather), because I think busmaster is really only about hardware; the
software does a minimum of the work. But of course everything depends on the
structure and size of the data you want to transfer… Because I haven’t seen
any description here of the data, times and sizes you are transferring, the
only reason for this small transfer is IMO bad design of the hardware.

Martin
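
For reference, a minimal sketch of the kind of setup Martin describes, using
the raw DDK call rather than his actual code (BUFFER_SIZE and the extension
fields are hypothetical; DmaAdapter would come from an earlier
IoGetDmaAdapter() call):

#include <ntddk.h>

#define BUFFER_SIZE (512 * 1024)        /* two 512 kB DMA buffers */

typedef struct _DEVICE_EXTENSION {      /* hypothetical driver context */
    PDMA_ADAPTER     DmaAdapter;
    PVOID            Buffer[2];         /* kernel VAs of the buffers   */
    PHYSICAL_ADDRESS BufferLogical[2];  /* bus addresses for the card  */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS AllocateDmaBuffers(PDEVICE_EXTENSION Ext)
{
    ULONG i;

    for (i = 0; i < 2; i++) {
        Ext->Buffer[i] = Ext->DmaAdapter->DmaOperations->AllocateCommonBuffer(
                             Ext->DmaAdapter,
                             BUFFER_SIZE,
                             &Ext->BufferLogical[i],  /* what the card gets */
                             TRUE);                   /* CacheEnabled       */
        if (Ext->Buffer[i] == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;
    }
    return STATUS_SUCCESS;
}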



Well, sure, AllocateCommonBuffer allocates a physically contiguous memory
space. However, you can’t always use the common buffer; it is a valuable,
limited resource. If you are reading data from the card, this works. If you
are writing data to the card, say from a user application, then it doesn’t,
since the data buffer is allocated by the user-mode app and may not be
physically contiguous; hence the need for scatter/gather, unless you work
around it by breaking the transfer up into smaller contiguous chunks or by
copying the user buffer to a common buffer. Both will reduce performance.

Russ Poffenberger
NPTest, Inc.
xxxxx@NPTest.com

Well, why not map this contiguous memory, allocated in kernel mode using
AllocateCommonBuffer, into the user-mode program? Our program had a very
complicated algorithm that generated data for the card. Copying the resulting
data into the buffer was really not noticeable, neither on the computer nor
in the transfer performance… As I’ve said, we got a 30 MDW/s transfer rate.

Martin
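
One common recipe for that mapping (not necessarily what Martin’s driver did)
is to wrap the common buffer in an MDL and map it into the calling process
from an IOCTL handler. A sketch, with hypothetical device-extension fields:

#include <ntddk.h>

typedef struct _DEVICE_EXTENSION {      /* hypothetical driver context   */
    PVOID  CommonBuffer;                /* from AllocateCommonBuffer()   */
    ULONG  CommonBufferSize;
    PMDL   CommonBufferMdl;
    PVOID  UserVa;                      /* mapping in the client process */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

/* Call in the context of the user process (e.g. from an IOCTL dispatched by
 * that process).  Undo with MmUnmapLockedPages() + IoFreeMdl() before the
 * process goes away. */
NTSTATUS MapCommonBufferToUser(PDEVICE_EXTENSION Ext)
{
    PMDL mdl = IoAllocateMdl(Ext->CommonBuffer, Ext->CommonBufferSize,
                             FALSE, FALSE, NULL);

    if (mdl == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    MmBuildMdlForNonPagedPool(mdl);     /* the common buffer is non-paged */

    __try {
        Ext->UserVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
                                                   NULL, FALSE,
                                                   NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    Ext->CommonBufferMdl = mdl;
    return STATUS_SUCCESS;
}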


Russ, Mats and Martin,

Thanks for your answers. After a careful look with the PCI snooper (we also
use a VMetro) I noticed where we lose so much time in the transfer: the
DbgPrint() calls, which I used quite liberally to watch every step in the
driver, really took a lot of time, I think about 10us each (on a 2GHz P4!).
After changing DbgPrint() to KdPrint(()), the driver is free of debug prints
when I build it in the fre environment. The speed increase was dramatic: we
reached 98MB/s (before, 20MB/s)! The snooper showed us that the time between
two bursts (one burst is actually one page: 1024DW, which I think is the
fastest way) decreased from previously 100us down to 12us (the PCI bus is
33MHz, 32-bit)!
So we’re now really satisfied with that speed, which is pretty close to the
theoretical speed. Of course, the time between two bursts, where we have to
reload the PCI bus-master card with the new logical address, depends very
much on the speed of the PC, but so far it looks good and the need for
scatter/gather has become less urgent!
Actually, the driver is now switched to scatter/gather in
IoGetDmaAdapter(DevDescr…), kind of like a single scatter/gather register.
This way I can be sure the user buffer is not copied intermediately.

Daniel
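
For readers following along, this is roughly what that IoGetDmaAdapter()
setup looks like; the field values are assumptions matching the card
described in this thread, and MAX_TRANSFER_BYTES is a hypothetical limit:

#include <ntddk.h>

#define MAX_TRANSFER_BYTES (1024 * 1024)   /* hypothetical per-transfer cap */

PDMA_ADAPTER GetBusMasterAdapter(PDEVICE_OBJECT Pdo, PULONG MapRegisters)
{
    DEVICE_DESCRIPTION dd;

    RtlZeroMemory(&dd, sizeof(dd));
    dd.Version           = DEVICE_DESCRIPTION_VERSION;
    dd.Master            = TRUE;            /* bus-master card              */
    dd.ScatterGather     = TRUE;            /* no intermediate buffer copy  */
    dd.InterfaceType     = PCIBus;
    dd.Dma32BitAddresses = TRUE;
    dd.DmaWidth          = Width32Bits;
    dd.DmaSpeed          = Compatible;
    dd.MaximumLength     = MAX_TRANSFER_BYTES;

    return IoGetDmaAdapter(Pdo, &dd, MapRegisters);
}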


Good! There you go.

BTW, on a 64bit/66Mhz bus, we see 270Mbytes/sec. The limit is the other end
of our card. Our card acts like a bridge between the PCI bus and our custom
hardware bus which goes through a 3Gbit/sec serdes link to our testhead
bay. This link is more the limiting factor. I have verified (using the
VMetro as a bus exerciser) that the PCI bus is theoretically capable of
over 400Mbytes/sec.

Russ Poffenberger
NPTest, Inc.
xxxxx@NPTest.com

Daniel,

Good that you got the speed up without having to change the hardware.

A comment on the debug situation. We have LOTS of debug statements in our
source code. For this reason, we have a macro that becomes nothing when we
build a release build. This means that you can stick as many debug printouts
as you like in the driver, but the customer won’t see any of it (not in
performance, and not when searching for text strings in the driver). Same
thing with ASSERTs, of course.

Of course, we also make use of the “DebugLevel” variable to make sure that
only the debug info that is of interest is printed.


Mats
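
A sketch of that kind of macro, assuming the DDK checked/free build
convention (DBG) and a hypothetical global DebugLevel:

#include <ntddk.h>

#if DBG
extern ULONG DebugLevel;                 /* set from the registry or debugger */
#define MYDRV_PRINT(level, args)         \
    do {                                 \
        if ((level) <= DebugLevel) {     \
            DbgPrint args;               \
        }                                \
    } while (0)
#else
/* Expands to nothing in a free build: no CPU cost, no strings in the binary. */
#define MYDRV_PRINT(level, args)
#endif

/* Usage: MYDRV_PRINT(2, ("DMA segment %p len %lu\n", seg, len)); */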



Yes, it is a pretty standard approach. Also, distributing drivers with debug
prints is seen as irresponsible (at least) by some other developers. The
exception is limited field testing, and even there most messages should be
disabled by default.

WPP should solve both problems, visible text strings and unwanted debug
output, at once.

Best regards,

Michal Vodicka
STMicroelectronics Design and Application s.r.o.
[michal.vodicka@st.com, http://www.st.com]

It’s great to hear you found the root cause of the problem. Debug
prints which are formatted on the same system that is running the code
are always going to affect perf. There are alternatives, however…

Please allow me to suggest you look into the use of Event Tracing for
Windows (ETW). It offers very quick conversion from any of your current
macros/functions, and allows you to selectively enable tracing on FRE
builds without causing major slowdowns when not enabled. I’ve used this
for all my components since XP, and one or two of the ones in Windows
2000. This allows you to get logs from a customer without shipping a
new binary to them, without using a debugger (although that’s an option
too), and without the customer being able to decode the log (unless you give
them the magic decoder ring).

ETW is only documented in the DDK currently (I couldn’t find it in MSDN,
at least).

.


Henry,

First, thank you for your answer. You mention Event Tracing for Windows (ETW)
as a usable tool. I’m not sure whether this is a standard Windows service. Is
it the Event Viewer, which can be started via:

My Computer (right-click) > Manage > Computer Management > System Tools > Event Viewer?

I have never used this tool; what is it good for? It seems to trace some more
or less important info, warnings and errors.

regards
Daniel
