DMA setup overhead too long

Hi All,

We have a PCI bus-master device, and I am testing its DMA read/write
performance. On a PII 400 MHz Windows 2000 Server machine, the core
transfer rate can exceed 70 MB per second. But every transfer incurs a
long setup overhead, most of it spent between my call to
AllocateAdapterChannel and the invocation of my AdapterControl callback
function. For example, when I read 1024 bytes from the device, about
35 us pass between the call to AllocateAdapterChannel and the call to my
AdapterControl callback, so the actual data transfer rate is only about
30 MB per second. There is only one transaction in progress, so the
system should be idle. What does the OS do during this time? Do map
register allocation and software scatter/gather really take this long?
Is this normal for all PCI master devices on the Windows platform? Can
anybody here explain it?
Many thanks!

William
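
To see how a fixed setup cost eats into throughput, here is a minimal C
sketch. It assumes the simplest possible model (setup time plus bus time,
nothing overlapped) and 1 MB = 10^6 bytes. The function names are invented
for illustration and are not part of any Windows API.

```c
#include <assert.h>

/* Effective throughput in MB/s for one transfer that pays a fixed setup
 * cost before any data moves. With 1 MB = 10^6 bytes, MB/s is the same
 * number as bytes per microsecond, which keeps the units simple. */
double effective_mb_per_s(double bytes, double core_mb_per_s, double setup_us)
{
    assert(bytes > 0.0 && core_mb_per_s > 0.0 && setup_us >= 0.0);
    double transfer_us = bytes / core_mb_per_s;   /* time spent on the bus */
    return bytes / (setup_us + transfer_us);
}

/* Transfer size in bytes at which setup time equals bus time, i.e. the
 * size where the effective rate drops to half of the core rate. */
double break_even_bytes(double core_mb_per_s, double setup_us)
{
    assert(core_mb_per_s > 0.0 && setup_us >= 0.0);
    return core_mb_per_s * setup_us;
}
```

Plugging in the figures from the post (1024 bytes, 70 MB/s core rate,
35 us setup) gives roughly 20.6 MB/s under this model, in the same range
as the reported figure of about 30 MB/s though lower, which may reflect
rounding in the measurements. break_even_bytes(70, 35) is 2450 bytes, so
a 1024-byte transfer spends more time in setup than on the bus.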

“William Zhang” wrote in message news:xxxxx@ntdev…
>
> callback function. For example, if I read 1024 bytes of data from the
> device, it took about 35 us to call my AdapterControl callback function
> after I called AllocateAdapterChannel. So actually the data transfer rate
> is only about 30 MB per second. There is only one transaction on
>

Wellllll… 35 microseconds isn’t really THAT much time. Consider that
there could be intervening interrupts from other devices, etc. In most
cases, the callback should occur without delay, before you even return from
the call to AllocateAdapterChannel.

Some more detail might help: How wide is the PCI bus, the device, how much
memory on the host, does your DMA device support scatter/gather, and what’s
the operating system?

Is there some reason you’re not using GetScatterGatherList (which is highly
preferred)?

Peter
OSR

Thanks Peter!

> Some more detail might help: How wide is the PCI bus, the device, how much
> memory on the host, does your DMA device support scatter/gather, and what’s
> the operating system?

I have a very typical test environment: PII 400 MHz, 128 MB RAM, 10 GB HD,
Windows 2000 Server. The hardware supports a 32-bit PCI bus but does not
support hardware scatter/gather.

I agree that the OS does a lot of protection work during this period, but
in 35 us a PII 400 MHz runs 35*400 = 14000 instructions. Would the OS
really spend that much time on protection?

Does GetScatterGatherList support system scatter/gather? I thought
GetScatterGatherList does the same work as AllocateAdapterChannel plus
MapTransfer if the hardware does not support scatter/gather. So are you
saying GetScatterGatherList still offers some performance improvement
over the old way even with system scatter/gather?

Best Regards,
William

> I have very typical test environment: PII400MHz, 128Mbyte, 10G HD,
> Windows2000 Server, and hardware supports 32bit PCI bus wide, but does not
> support hardware scatter/gather.
>
> I agree that the OS will do a lot protection work during this period, but

I suspect the OS is doing double buffering there, not protection.

There is a BOOLEAN field, DEVICE_DESCRIPTION::ScatterGather, which
governs this. If it is set to TRUE, the OS will not double buffer and
will return you a scatter-gather list instead of a single range.
You then do the scatter-gather yourself in the driver, usually
interrupt-driven (an interrupt at the end of each chunk), if the
hardware does not support SG lists.

Max
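
The interrupt-per-chunk scheme described above can be sketched as a small
state machine. This is a user-mode model only, not driver code: the types
sg_element and sg_transfer and the helper program_device are hypothetical
stand-ins for a real scatter-gather list and for the writes to the
device's address and count registers.

```c
#include <assert.h>
#include <stddef.h>

/* One physically contiguous piece of the buffer (hypothetical type that
 * mirrors the address/length pairs of a scatter-gather list). */
typedef struct {
    unsigned long address;
    unsigned long length;
} sg_element;

/* Driver-side state for a transfer fed to the device one chunk at a time. */
typedef struct {
    const sg_element *list;
    size_t count;
    size_t next;                 /* index of the next element to program */
    unsigned long bytes_done;    /* bytes in chunks that have completed */
    unsigned long last_address;  /* what was last "written" to the device */
    unsigned long last_length;
} sg_transfer;

/* Stand-in for programming the device's address and count registers. */
static void program_device(sg_transfer *t, const sg_element *e)
{
    t->last_address = e->address;
    t->last_length = e->length;
}

/* Begin a transfer by programming the first chunk.
 * Returns 1 if a chunk was started, 0 if the list is empty. */
int sg_start(sg_transfer *t, const sg_element *list, size_t count)
{
    assert(t != NULL && (list != NULL || count == 0));
    t->list = list;
    t->count = count;
    t->next = 0;
    t->bytes_done = 0;
    t->last_address = 0;
    t->last_length = 0;
    if (count == 0)
        return 0;
    program_device(t, &list[t->next++]);
    return 1;
}

/* Call once per end-of-chunk interrupt. Accounts for the chunk that just
 * finished and programs the next one.
 * Returns 1 while a chunk is still in flight, 0 when the transfer is done. */
int sg_on_chunk_done(sg_transfer *t)
{
    assert(t != NULL && t->next > 0);
    t->bytes_done += t->list[t->next - 1].length;
    if (t->next == t->count)
        return 0;
    program_device(t, &t->list[t->next++]);
    return 1;
}
```

The trade-off in this approach is one interrupt per contiguous chunk in
exchange for avoiding the intermediate copy.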

The OS is certainly reorganizing the buffer, because my hardware does not
support scatter/gather and so that flag is FALSE.
But I found one more important clue yesterday: the long overhead occurs
only when I read from the hardware, and it is always about the same 35 us
regardless of the buffer length. When I write data to the device, the
setup always takes about 5 us, which I think is acceptable.
So the question now is: is a 35 us DMA read setup time typical for all
DMA PCI devices, such as mass storage devices and Fast Ethernet NICs,
under Windows?

William

-----Original Message-----
From: Maxim S. Shatskih [mailto:xxxxx@storagecraft.com]
Sent: May 31, 2002 3:30
To: NT Developers Interest List
Subject: [ntdev] Re: DMA setup overhead too long


You are currently subscribed to ntdev as: xxxxx@altigen.com
To unsubscribe send a blank email to %%email.unsub%%

RE: [ntdev] Re: DMA setup overhead too long

> So now the question is that 35us DMA read setup time a typical value for
> all the DMA PCI device such as mass storage device, fast Ethernet NIC in
> windows OS?

Usually, PCI hardware does support scatter-gather: commodity IDE, USB, and
1394 controllers support it, as do the majority of NICs and SCSI HBAs.
That’s why they have no performance issues.
Can you emulate scatter-gather in software, for example by taking an
interrupt after each chunk? If so, set the parameter to TRUE to keep the
HAL from double buffering, which is really slow.

Max
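
To make the cost of double buffering concrete, here is a rough user-mode
model of a device-to-memory transfer that goes through an intermediate
buffer. It is only an illustration of the idea, not how the HAL is
actually implemented; the fragment type and read_via_bounce function are
invented names, and a memcpy stands in for the DMA itself.

```c
#include <assert.h>
#include <string.h>

/* One piece of the caller's buffer; pieces need not be adjacent in memory. */
typedef struct {
    unsigned char *base;
    size_t length;
} fragment;

/* Model of a read through an intermediate ("bounce") buffer: the device
 * fills one contiguous buffer, then the CPU copies the data out to each
 * fragment of the real buffer. Returns the number of bytes the CPU had to
 * copy, which is the extra work that direct scatter-gather would avoid. */
size_t read_via_bounce(const unsigned char *device_data, size_t total,
                       unsigned char *bounce, size_t bounce_size,
                       const fragment *frags, size_t nfrags)
{
    assert(total <= bounce_size);

    /* Step 1: stands in for the DMA into the contiguous bounce buffer. */
    memcpy(bounce, device_data, total);

    /* Step 2: CPU copy-out into the scattered destination fragments. */
    size_t copied = 0;
    for (size_t i = 0; i < nfrags && copied < total; i++) {
        size_t n = frags[i].length;
        if (n > total - copied)
            n = total - copied;
        memcpy(frags[i].base, bounce + copied, n);
        copied += n;
    }
    return copied;
}
```

In this model every byte crosses memory twice, once by DMA and once by
the CPU, so the copy cost grows with the transfer length.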

Hi Max,

I ran a test with the ScatterGather flag intentionally set to TRUE in
the DEVICE_DESCRIPTION structure. It helped, but it only dropped the
overhead from 35 us to 25 us; a large amount of time is still spent
waiting for the callback function. What I find very strange is that the
long setup overhead occurs only when I read data from the device, while
writing data to the device takes only about 5 us of setup whether the
ScatterGather flag is TRUE or FALSE. I believe there must be something
behind this.

William

-----Original Message-----
From: Maxim S. Shatskih [mailto:xxxxx@storagecraft.com]
Sent: May 31, 2002 10:39
To: NT Developers Interest List
Subject: [ntdev] Re: DMA setup overhead too long


“William Zhang” wrote in message news:xxxxx@ntdev…
>

So, the system will be intermediately buffering all your transfers. You
know that, right? But that does NOT happen on the call to
AllocateAdapterChannel… only the allocation of the buffer takes place at
this time.

> I agree that the OS will do a lot protection work during this period, but
> 35us on a PII400M will runs 35*400=14000 instructions, will the OS spend
> this much time to do its protection?
>

Well, that’s sorta pushing it, right? You’re assuming here that every
instruction takes one CPU cycle, which isn’t really a good assumption.
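
The arithmetic behind this exchange can be written out as a short C
sketch. The helper names are invented for illustration, and the average
cycles-per-instruction (CPI) value is a free parameter: nothing in this
thread says what it actually is for this code path.

```c
#include <assert.h>

/* CPU cycles that elapse in an interval: cycles = microseconds * MHz. */
unsigned long cycles_elapsed(unsigned long micros, unsigned long clock_mhz)
{
    return micros * clock_mhz;
}

/* Rough number of instructions executed in that many cycles, given an
 * assumed average cycles-per-instruction figure. Cache misses and
 * accesses to device registers can push the average well above 1. */
double instructions_estimate(unsigned long cycles, double avg_cpi)
{
    assert(avg_cpi > 0.0);
    return (double)cycles / avg_cpi;
}
```

35 us at 400 MHz is 14000 cycles; that equals 14000 instructions only if
the average CPI is exactly 1. With a purely hypothetical CPI of 2, the
same interval covers 7000 instructions.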

> Does the GetScatterGatherList support system scatter/gather?

Yes

> I thought
> the GetScatterGatherList do the same work as AllocateAdapterChannel,
> MapTransfer if the hardware does not support scatter/gather. So are you
> saying the GetScatterGatherList still have some performance improvement
> than the old way even it is system scatter/gather?
>

Yes, there’s SOME performance improvement. Not to mention it’s safer and
bypasses at least a few common bugs. I’d recommend you try it…

peter
OSR

Thanks, Peter! I will try it; it is probably the only thing that may help.

William

-----Original Message-----
From: Peter Viscarola [mailto:xxxxx@osr.com]
Sent: June 2, 2002 16:40
To: NT Developers Interest List
Subject: [ntdev] Re: DMA setup overhead too long
