Funky DMA Problem

Hi, Guys,

I have an interesting problem. My application sends me a buffer
(it’s a 3-dimensional buffer, but for the sake of simplicity you
can assume it’s a 2D buffer), say, with x,y coordinates running
from 0,0 to xmax,ymax. Now, inside that buffer there’s a rectangular
subbuffer from xbegin,ybegin to xend,yend. I want to DMA the
subbuffer out to my chip, but I do not want to move data around:
it would be easy to copy the subbuffer to a kernel buffer and
DMA it from there, but I do not want to spend the time and the
memory doing that, if nothing else because even the subvolumes
on occasion can be pretty big.

I know how to generate a scatter-gather list by hand, no
problem, but I want to give Windows a chance. So, here’s my
question: is there any functionality in the DMA calls to set up
a scatter-gather list for such a subbuffer? Note that the
scanlines are linearly contiguous within themselves but not
relative to each other, and they may not be physically contiguous:
there’s no assurance that a scanline will not cross a
page boundary and be broken into two physically discontiguous
line segments. The problem is way more complex when I have a 3D
image, because now I have not only scanlines but also plane
slices that build up a subvolume of the big volume.

Is there any hope the API can help me, or must I do it by hand?
Thanks,

Alberto.
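
For concreteness, the layout described above can be sketched in C. This is only an illustration, not code from the thread: the struct and function names are hypothetical, the volume is assumed to be stored x-fastest with no padding between rows or planes, and the end coordinates are treated as exclusive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a box-shaped subvolume inside a larger
 * volume stored x-fastest, then y, then z, with no padding (assumption). */
typedef struct {
    size_t xmax, ymax, zmax;        /* full volume size, in elements  */
    size_t xbegin, ybegin, zbegin;  /* first element of the subvolume */
    size_t xend, yend, zend;        /* one past the last element      */
    size_t elem_size;               /* bytes per element              */
} subvolume;

/* Bytes in one scanline of the subvolume; each one is linearly contiguous. */
static size_t scanline_bytes(const subvolume *s)
{
    assert(s->xbegin < s->xend && s->xend <= s->xmax);
    return (s->xend - s->xbegin) * s->elem_size;
}

/* Number of scanlines: one per (y, z) pair inside the subvolume. */
static size_t scanline_count(const subvolume *s)
{
    return (s->yend - s->ybegin) * (s->zend - s->zbegin);
}

/* Byte offset, from the start of the big buffer, of scanline (y, z). */
static size_t scanline_offset(const subvolume *s, size_t y, size_t z)
{
    assert(y >= s->ybegin && y < s->yend);
    assert(z >= s->zbegin && z < s->zend);
    return ((z * s->ymax + y) * s->xmax + s->xbegin) * s->elem_size;
}
```

Consecutive scanlines of the subvolume start `xmax * elem_size` bytes apart, which is why they are contiguous within themselves but not relative to each other.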

The usual way: create the MDL and pass it to ->GetScatterGatherList, or
even run a ->MapTransfer loop.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

----- Original Message -----
From: “Alberto Moreira”
To: “Windows System Software Devs Interest List”
Sent: Tuesday, December 06, 2005 5:42 AM
Subject: [ntdev] Funky DMA Problem

> —
> Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256
>
> You are currently subscribed to ntdev as: xxxxx@storagecraft.com
> To unsubscribe send a blank email to xxxxx@lists.osr.com

I had the exact same requirement about a year ago. I looked, but didn’t
find anything, and ended up writing my own routines in the driver to
build scatter-gather descriptors to do a 2D transfer in virtual address
space overlaid on the physical pages.

One complicating factor, which I had to deal with but you probably don’t,
was that in my case the rows could overlap each other (i.e. the distance
between rows could be less than the length of the row).

Transferring every other byte (which the API allowed the user to do) is
an excellent way to

a) slow the DMA to a crawl
b) chew up lots of SG descriptors.

Good Luck!

Don
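
A minimal sketch of the kind of routine described above, written as plain C so it can be tried outside a driver. It is not Don's code. The `pfn` array and `first_off` are hypothetical stand-ins for the page list and byte offset that an MDL describes, `sg_elem` is an invented descriptor layout, and overlapping rows are deliberately not handled. A real driver would also have to hand the device bus-logical addresses obtained through the DMA adapter rather than raw physical addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SG_PAGE_SIZE 4096u

/* One scatter-gather element: a physically contiguous run of bytes. */
typedef struct {
    uint64_t phys;   /* physical address of the run */
    uint32_t len;    /* length of the run in bytes  */
} sg_elem;

/*
 * Append the fragments covering `len` bytes at byte offset `off` of a
 * locked buffer. pfn[i] is the physical page number backing virtual page
 * i of the buffer; first_off is the offset of the buffer's first byte
 * inside page 0. Returns the new element count, or (size_t)-1 if the
 * list would exceed `cap` elements.
 */
static size_t sg_append(sg_elem *sg, size_t n, size_t cap,
                        const uint64_t *pfn, size_t first_off,
                        size_t off, size_t len)
{
    size_t va = first_off + off;   /* offset from the start of page 0 */

    while (len > 0) {
        size_t in_page = va % SG_PAGE_SIZE;
        size_t chunk = SG_PAGE_SIZE - in_page;   /* bytes left in this page */
        uint64_t phys;

        if (chunk > len)
            chunk = len;
        phys = pfn[va / SG_PAGE_SIZE] * SG_PAGE_SIZE + in_page;

        if (n > 0 && sg[n - 1].phys + sg[n - 1].len == phys) {
            sg[n - 1].len += (uint32_t)chunk;    /* physically adjacent: merge */
        } else {
            if (n == cap)
                return (size_t)-1;
            sg[n].phys = phys;
            sg[n].len = (uint32_t)chunk;
            n++;
        }
        va += chunk;
        len -= chunk;
    }
    return n;
}

/* Build the list for `rows` rows of `row_bytes` each, `pitch` bytes apart. */
static size_t sg_build_2d(sg_elem *sg, size_t cap,
                          const uint64_t *pfn, size_t first_off,
                          size_t start, size_t row_bytes,
                          size_t pitch, size_t rows)
{
    size_t n = 0, r;

    assert(row_bytes <= pitch);   /* overlapping rows are not handled here */
    for (r = 0; r < rows && n != (size_t)-1; r++)
        n = sg_append(sg, n, cap, pfn, first_off,
                      start + r * pitch, row_bytes);
    return n;
}
```

Each row is cut wherever it crosses a page boundary, and two fragments are merged only when the pages behind them happen to be physically adjacent, so the element count depends on the actual page mapping as well as on the geometry.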


You’re right, my rows don’t overlap. It’s rather the other way
around, rows aren’t consecutive in linear space! My chip has a
DMA command queue and it takes the SG list from that queue, so,
I probably don’t risk running out of descriptors. Now, the DMA
may be slow as molasses, depending on how things go, so, the
decision’s going to be whether to do a host copy of the
subvolume to a kernel buffer and DMA it from there. There are pros
and cons: my volumes can be rather big and place some pressure
on memory availability, and because they’re big, copying takes a
good amount of time. On the other hand, if DMAing directly from
the user buffer proves to be slow enough, maybe it’s a bad idea
not to copy the data to a kernel buffer before doing the DMA. I
guess in the end I may end up having to measure the throughput
both ways to see which way the pendulum goes!

Alberto.

----- Original Message -----
From: “Don Ward”
To: “Windows System Software Devs Interest List”

Sent: Tuesday, December 06, 2005 4:57 AM
Subject: RE: [ntdev] Funky DMA Problem


> good amount of time. On the other hand, if DMAing directly from
> the user buffer proves to be slow enough, maybe it’s a bad idea
> not to copy the data to a kernel buffer before doing the DMA. I

Why do you think that DMA from a user buffer is slower than from a kernel
buffer?

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> You’re right, my rows don’t overlap. It’s rather the other way
> around, rows aren’t consecutive in linear space! My chip has a
> DMA command queue and it takes the SG list from that queue, so,
> I probably don’t risk running out of descriptors.

Well, you know your hardware better than me … but consider the
pathological case of transferring every other byte. My hardware
needed 64 bytes per descriptor. Thus, if I attempted to build a
single scatter gather list, transferring every other byte from
a 200MB user buffer would have required 640MB of dma descriptors!

In this, and similar, cases I split the transfer up on the user side
and sent a bunch of overlapped transfer requests to the driver: it
was going to be as slow as hell anyway, so I didn’t much care about
the extra user/kernel transitions. Splitting up the transfer inside
the driver would perhaps have been slightly better, but I doubt
if anybody would have noticed the improvement in performance.

> … if DMAing directly from
> the user buffer proves to be slow enough, maybe it’s a bad idea
> not to copy the data to a kernel buffer before doing the DMA. I
> guess in the end I may end up having to measure the throughput
> both ways to see which way the pendulum goes!

Measurement beats guessing every time but, like Maxim, I don’t
understand why you think copying non-contiguous slices into a
contiguous buffer and then doing a DMA is going to be quicker than
a SG DMA transfer straight from, or into, the user’s buffer.

Even if your kernel buffer is physically contiguous (which would
save the overhead of the DMA engine fetching a new descriptor
every 4KB) I wouldn’t have expected it to outperform a non-copy
solution.

The only possibility I can come up with is if your rows are really
sparse: In that case the I/O manager will have locked an awful lot of
memory down (because it doesn’t know you only intend to transfer
a subset of the buffer) and other programs might suffer a performance
hit because of memory starvation. But, if you do this copy in the
driver, the memory will have been locked down already, unless you
give up on direct I/O and manage the whole thing yourself by switching
to neither I/O - which introduces a bunch of other problems.

If forcing the I/O manager to lock down huge buffers, most of which
are not actually transferred, is the concern then I’d be inclined to
split a problematic transfer into smaller pieces before it gets to
the driver.

Don
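
The splitting suggested above (breaking one large transfer into several requests so that no single request needs too many descriptors) might be planned along these lines. This is a sketch under assumptions, not anything from the thread: `row_chunk` and `split_rows` are hypothetical names, and the per-row descriptor cost is taken as a caller-supplied worst case.

```c
#include <assert.h>
#include <stddef.h>

/* A run of consecutive rows to be sent as one transfer request. */
typedef struct {
    size_t first_row;
    size_t row_count;
} row_chunk;

/*
 * Split `rows` rows into requests so that no request needs more than
 * `max_desc` descriptors, assuming each row costs at most `desc_per_row`
 * descriptors (for example 2 if a row can straddle one page boundary).
 * Writes up to `cap` chunks into `out` and returns how many chunks are
 * needed in total, which may be larger than `cap`.
 */
static size_t split_rows(size_t rows, size_t desc_per_row, size_t max_desc,
                         row_chunk *out, size_t cap)
{
    size_t per_req, first, n = 0;

    assert(desc_per_row > 0 && max_desc >= desc_per_row);
    per_req = max_desc / desc_per_row;   /* rows that fit in one request */

    for (first = 0; first < rows; first += per_req) {
        size_t count = rows - first;

        if (count > per_req)
            count = per_req;
        if (n < cap) {
            out[n].first_row = first;
            out[n].row_count = count;
        }
        n++;
    }
    return n;
}
```

The same planning works whether the split is done on the user side, as overlapped requests, or inside the driver; only the place where the chunks are issued changes.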

> Well, you know your hardware better than me … but consider the
> pathological case of transferring every other byte. My hardware
> needed 64 bytes per descriptor. Thus, if I attempted to build a
> single scatter gather list, transferring every other byte from
> a 200MB user buffer would have required 640MB of dma descriptors!

typo: that should be 6400MB of descriptor space.
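
The corrected figure can be checked with a few lines of C. The helper name is hypothetical, and it assumes one descriptor for every byte actually transferred, which is what the every-other-byte case amounts to.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Descriptor memory, in bytes, needed to move every `stride`-th byte of
 * a buffer when each transferred byte takes its own descriptor.
 */
static uint64_t strided_desc_bytes(uint64_t buf_bytes, uint64_t stride,
                                   uint64_t desc_bytes)
{
    uint64_t count;

    assert(stride > 0);
    count = (buf_bytes + stride - 1) / stride;   /* bytes actually moved */
    return count * desc_bytes;
}
```

For a 200MB buffer, a stride of 2 and 64-byte descriptors, this gives 100M descriptors and 6400MB of descriptor space, 32 times the size of the buffer itself.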

I hear you! Typically a user buffer will be spread over a number
of 4K pages, and each scanline needs one descriptor per page it
touches. The worst case will be
when every scanline spreads over multiple pages, typically two
(although I can see some pathological cases where three or even
four pages per scanline might be the case, because a scanline
may be longer than 4K bytes). Say I have a 512x512x512
subvolume, 4 bytes per element, two pages per scanline: that’s
512x512 scanlines times two, or 19 bits, that is, 512K
descriptors, or 4 MB assuming I need a 4-byte address and a
4-byte length per DMA descriptor. However,
the chip’s architecture is such that the descriptors are fetched
from a memory-resident hw queue, so, I don’t need to produce all
those descriptors at once: I can flow-control them by monitoring
a “queue full” situation and delaying my host accordingly.

I know this sounds a bit extreme, but such is the world I now
inhabit. The alternative would be to allocate a 512 MB kernel
buffer, copy the subvolume into it, peg it down into physical
memory, and DMA from it: I would need only 29-12 = 17 bits, that
is, 128K descriptors, or 1 MB of memory (I hope my arithmetic is
right!). However, that would require the additional big buffer,
and if I have a few of those volumes being rendered at the same
time, the amount of memory required may make this approach
unfeasible. One of the applications my company sells is an
Imaging Server where doctors can remotely connect and request
for images to be rendered at the server and then shipped over
the network to the doctor’s workstation: that requires servers
to be big and to have multiple cards (typically four), and each
card has multiple megabytes of memory (typically 2 to 4 Mb), so,
the server must be able to render several volumes concurrently.

Looks like I’m going to have to measure the performance of the
two approaches before I decide which one to ship!

Alberto.
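
The flow control mentioned above, where descriptors are produced only as fast as the hardware queue drains, could look roughly like this. It is a host-side simulation under assumptions: the queue depth, the `desc_queue` layout, and the `row_desc` generator are all hypothetical, and a real driver would also need memory barriers and whatever doorbell or register writes the chip requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u   /* hypothetical depth; must be a power of two */

typedef struct {
    uint64_t addr;
    uint32_t len;
} dma_desc;

/* Memory-resident ring: the host advances `tail`, the device `head`. */
typedef struct {
    dma_desc slot[QUEUE_SLOTS];
    uint32_t head;   /* next slot the device will consume */
    uint32_t tail;   /* next slot the host will fill      */
} desc_queue;

static int queue_full(const desc_queue *q)
{
    return q->tail - q->head == QUEUE_SLOTS;
}

/* Stand-in for the device consuming up to `n` descriptors. */
static uint32_t device_consume(desc_queue *q, uint32_t n)
{
    uint32_t avail = q->tail - q->head;

    if (n > avail)
        n = avail;
    q->head += n;
    return n;
}

/* Example generator (assumption): descriptor i covers one 2048-byte row. */
static dma_desc row_desc(size_t index)
{
    dma_desc d;

    d.addr = (uint64_t)index * 2048u;
    d.len = 2048u;
    return d;
}

/*
 * Queue descriptors [*next, total), generated on demand by `make`, and
 * stop at the first "queue full". Returns how many were queued; the
 * caller resumes later from *next, so the complete list never has to
 * exist in memory at once.
 */
static size_t feed_queue(desc_queue *q, size_t *next, size_t total,
                         dma_desc (*make)(size_t index))
{
    size_t queued = 0;

    while (*next < total && !queue_full(q)) {
        q->slot[q->tail % QUEUE_SLOTS] = make(*next);
        q->tail++;
        (*next)++;
        queued++;
    }
    return queued;
}
```

The host would call `feed_queue` again each time the device signals progress, so descriptor memory stays bounded by the queue depth rather than by the size of the subvolume.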

----- Original Message -----
From: “Don Ward”
To: “Windows System Software Devs Interest List”

Sent: Wednesday, December 07, 2005 6:35 AM
Subject: RE: [ntdev] Funky DMA Problem


Because chances are that the kernel buffer will be page aligned
and occupy an integral number of pages. By copying the subvolume
into a kernel buffer, I make it linearly contiguous, and I can
take maximum advantage of the physical contiguity of each page.
With the user buffer, first, I don’t know what kind of
allocation I get, and second, my scanlines aren’t contiguous and
hence even in the best of worlds I would need one separate
DMA descriptor per scanline. So, if I have a 64x64x64 buffer, 4
bytes per element, I’d need a minimum of 64x64 = 4K descriptors
for the user buffer, while if I’m lucky in the kernel buffer I’d need
64x64x64x4 / 4K = 256 descriptors of 4K bytes each. And things
get worse if I have a big subvolume, say, 512x512x512.

Alberto.
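
The descriptor counts being compared can be worked out mechanically. The sketch below uses hypothetical names and rests on stated assumptions: 4K pages, one descriptor per page fragment of each scanline, no merging of physically adjacent pages, and a page-aligned kernel copy that needs one descriptor per page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u

typedef struct {
    uint64_t best;     /* every scanline touches as few pages as possible */
    uint64_t worst;    /* every scanline starts on the last byte of a page */
    uint64_t copied;   /* page-aligned contiguous copy, one per page       */
} desc_counts;

/* Descriptor counts for an nx * ny * nz subvolume of elem_bytes elements. */
static desc_counts count_descriptors(uint64_t nx, uint64_t ny, uint64_t nz,
                                     uint64_t elem_bytes)
{
    desc_counts c;
    uint64_t row_bytes = nx * elem_bytes;
    uint64_t rows = ny * nz;

    assert(row_bytes > 0);
    c.best = rows * ((row_bytes + PAGE_BYTES - 1) / PAGE_BYTES);
    c.worst = rows * ((row_bytes + PAGE_BYTES - 2) / PAGE_BYTES + 1);
    c.copied = (rows * row_bytes + PAGE_BYTES - 1) / PAGE_BYTES;
    return c;
}
```

Under these assumptions a 64x64x64 subvolume of 4-byte elements needs between 4,096 and 8,192 descriptors straight from the user buffer against 256 from a contiguous copy. A 512x512x512 subvolume needs between 262,144 and 524,288 against 131,072, so the gap narrows as scanlines approach the page size.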

----- Original Message -----
From: “Maxim S. Shatskih”
To: “Windows System Software Devs Interest List”

Sent: Tuesday, December 06, 2005 11:33 PM
Subject: Re: [ntdev] Funky DMA Problem


> unfeasible. One of the applications my company sells is an
> Imaging Server where doctors can remotely connect and request
> for images to be rendered at the server and then shipped over
> the network to the doctor’s workstation: that requires servers
> to be big and to have multiple cards (typically four), and each
> card has multiple megabytes of memory (typically 2 to 4 Mb), so,

Why use any cards on the server? Why not do the server rendering in
user-mode software?

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Because rendering in software is slow as molasses.

Alberto.

----- Original Message -----
From: “Maxim S. Shatskih”
To: “Windows System Software Devs Interest List”

Sent: Thursday, December 08, 2005 3:53 AM
Subject: Re: [ntdev] Funky DMA Problem
