Hello ‘NTDEV’ gents,
I have a WDF driver that runs on both 64-bit and 32-bit Windows XP platforms. The driver uses S&G list chains.
The DMA operation is launched by the write event (when the API calls WriteFile()). The size passed to WriteFile() is always 8MB.
The driver does not call WdfDmaEnablerSetMaximumScatterGatherElements(), so the WDF framework itself should not impose any limit.
One argument passed to the driver’s wait event is ‘IN PSCATTER_GATHER_LIST SgList’
and then ‘SgList->NumberOfElements’ is used to allocate the S&G list.
My problem is that on some systems ‘SgList->NumberOfElements’ corresponds to an 8MB list, 2048 elements (2048 x page size is 8388608 bytes, which is 8MB). On other systems the operating system has decided to start 8 DMAs, each consecutive DMA being 1MB (256 elements).
Having each 8MB DMA transfer split into 8 consecutive 1MB DMA writes decreases overall performance.
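For reference, the element counts above follow from simple page arithmetic. A minimal C sketch, assuming a 4096-byte page, a page-aligned buffer, and one scatter/gather element per page (no physically contiguous pages merged into one element); the function names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed x86/x64 page size */

/* Number of page-sized scatter/gather elements needed for a
 * page-aligned transfer of `bytes` bytes. */
size_t sg_elements_for(size_t bytes)
{
    return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Number of separate DMA operations when each one is capped
 * at `max_elements` scatter/gather elements. */
size_t dma_operations_for(size_t bytes, size_t max_elements)
{
    assert(max_elements > 0);
    size_t total = sg_elements_for(bytes);
    return (total + max_elements - 1) / max_elements;
}
```

Under these assumptions an 8MB transfer needs 2048 elements, and capping each DMA at 256 elements gives 8 operations.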
This is happening on a dual-Xeon main board populated with 16GB of memory. In this test case all the test boxes are running XP64. The Athlon 64 box (with one processor only), populated with 4GB of memory, and the other systems do not show this behavior.
The same driver, same API and same software application is used on all the machines.
As you guys are experts, what could make Windows dictate smaller DMA bursts through WDF?
Thank you,
Arnaud
I think that this is exactly what can happen on different systems - the DMA management is simply doing whatever is necessary to mesh correctly with the DMA hardware and memory situation. For example, if the particular DMA controller involved could not handle transfers of more than 1 MByte (no idea if this is likely but it is at least theoretically possible) then that is what you would see.
Or at least I think so - I don’t have the level of expertise of some people who post here - but I have done a fair amount of DMA programming and you see all sorts of behaviours.
The DMA controller is actually on the PCI (PMC) card. To initiate a DMA
transfer we write a few registers in the PCI configuration
space, the DMA engine reads/writes data from/to memory, and an
interrupt is asserted by the PCI card as soon as the DMA engine has
finished.
I do not think any additional DMA controller is involved in this case, so
in this respect it seems to be purely operating system
related. We do not use any DMA engine in the computer's chipset.
Thanks
NTDEV is sponsored by OSR
For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer
It is likely that only people from Microsoft could give more information.
Are you using a WDF or WDM style driver?
If you are using WDF, you could probably get a quicker response from Doron, who checks this forum every day.
My guess is that, in your particular configuration, for some reason Windows does not have enough memory to give you more locked pages.
While running your test program, you could look at the resources Windows has available.
Igor Sharovar
Privet Igor,
xxxxx@hotmail.com wrote:
> It is likely that only people from Microsoft could give more information.
> Are you using a WDF or WDM style driver?
This is a WDF driver actually.
> If you are using WDF, you could probably get a quicker response from Doron, who checks this forum every day.
That would be awesome.
> My guess is that, in your particular configuration, for some reason Windows does not have enough memory to give you more locked pages.
> While running your test program, you could look at the resources Windows has available.
I can see that 570MB out of 16GB are used, which I believe should be
enough to obtain 8MB of pages.
Thank you,
Arnaud
Sorry to tell you, but WDF has nothing to do with this; it is the underlying DMA APIs that build the SG list.
d
Sent from my phone with no t9, all spilling mistakes are not intentional.
> I can see that 570MB out of 16GB are used, which I believe should
be enough to obtain 8MB of pages.
I would not bet on it - NT is very restrictive about the amount of memory it allows you to lock down. Have you tried looking at and modifying the working set? If you increased your process's working set significantly, it might change what happens. We use code like this (trivial bits removed):
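    #include <windows.h>
    #include <process.h>    // _getpid()

    // Now attempt to get information on working set size.
    // hProcess, dwMin, dwMax and sRetVal are assumed to be declared
    // in the parts omitted here.
    hProcess = OpenProcess(STANDARD_RIGHTS_REQUIRED |
                           PROCESS_QUERY_INFORMATION |
                           PROCESS_SET_QUOTA,
                           FALSE, _getpid());
    if (hProcess)
    {
        // These APIs take SIZE_T, which is 64 bits wide on XP64,
        // so DWORD variables must not be passed by address here.
        SIZE_T minSize, maxSize;
        if (GetProcessWorkingSetSize(hProcess, &minSize, &maxSize))
        {
            minSize = dwMin;    // requested minimum, in bytes
            maxSize = dwMax;    // requested maximum, in bytes
            if (!SetProcessWorkingSetSize(hProcess, minSize, maxSize))
                sRetVal = 3;    // failed to change size
        }
        CloseHandle(hProcess);
    }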
xxxxx@4dsp.com wrote:
> I have a WDF driver that runs on both 64-bit and 32-bit Windows XP platforms. The driver uses S&G list chains.
> …
> My problem is that on some systems ‘SgList->NumberOfElements’ corresponds to an 8MB list, 2048 elements (2048 x page size is 8388608 bytes, which is 8MB). On other systems the operating system has decided to start 8 DMAs, each consecutive DMA being 1MB (256 elements).
> Having each 8MB DMA transfer split into 8 consecutive 1MB DMA writes decreases overall performance.
> This is happening on a dual-Xeon main board populated with 16GB of memory. In this test case all the test boxes are running XP64. The Athlon 64 box (with one processor only), populated with 4GB of memory, and the other systems do not show this behavior.
> The same driver, same API and same software application is used on all the machines.
> As you guys are experts, what could make Windows dictate smaller DMA bursts through WDF?
My guess is that your PCI device only handles 32-bit addresses. In that
case, on a system where it is possible to use memory above the 4GB mark
(meaning 64-bit systems with more than 4GB of RAM), the system is forced
to allocate “bounce buffers” for you in the region below 4GB, so that
incoming transfers can be copied there before creating your
scatter/gather list. The system only allocates about a megabyte for
those bounce buffers.
If you want maximum throughput, you need to modify your device to handle
64-bit physical addresses. Alternatively, you can allocate your own
common buffer in the low space, and do the copies yourself.
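To illustrate the "do the copies yourself" idea, here is a small user-mode simulation, not driver code. It stages a write through a fixed-size buffer and counts how many DMA starts result. `fake_device`, `fake_device_dma` and `staged_write` are hypothetical names; in a real driver the staging buffer would be a common buffer allocated below 4GB, and the device hook would program the hardware and wait for its interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the device: it only counts what it is given. */
typedef struct {
    size_t dma_starts;
    size_t bytes_sent;
} fake_device;

/* Stands in for "program the device and wait for completion". */
void fake_device_dma(fake_device *dev, const unsigned char *low_buf, size_t len)
{
    (void)low_buf;              /* a real device would read from this buffer */
    dev->dma_starts += 1;
    dev->bytes_sent += len;
}

/* Copy `len` bytes from `src` through a staging buffer of `stage_len`
 * bytes, starting one DMA per staged chunk.
 * Returns the number of DMA starts that were needed. */
size_t staged_write(fake_device *dev, const unsigned char *src, size_t len,
                    unsigned char *stage, size_t stage_len)
{
    assert(stage_len > 0);
    size_t offset = 0;
    size_t starts = 0;
    while (offset < len) {
        size_t chunk = len - offset;
        if (chunk > stage_len)
            chunk = stage_len;
        memcpy(stage, src + offset, chunk);   /* the "bounce" copy */
        fake_device_dma(dev, stage, chunk);
        offset += chunk;
        starts++;
    }
    return starts;
}
```

With a staging buffer as large as the transfer the loop runs once; with one an eighth of that size it runs 8 times, which mirrors the 8 x 1MB pattern described earlier in the thread.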
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
Doron,
Is my understanding right that, in this case, the GetSgl callback can get called 8 times with 1 MB chunks?
xxxxx@broadcom.com wrote:
> Is my understanding right that the GetSgl callback can get in this case called 8 times with 1 MB chunks?
It certainly CAN do so, in the right circumstances.
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
Tim Roberts wrote:
> xxxxx@broadcom.com wrote:
>> Is my understanding right that the GetSgl callback can get in this case called 8 times with 1 MB chunks?
>
> It certainly CAN do so, in the right circumstances.
Well, it does; I am 100% sure about this. Tim, I assume you must be right.
Our device does not support 64-bit physical addressing.
Moreover, this is the only 64-bit/16GB box we have tried. Taking into account
that Windows allocates about 1MB for bounce buffers, we
can rule out a coincidence across these three points.
So unless there is a way to tell Windows to allocate more than 1MB for
bounce buffers, we are indeed back to the drawing board.
At least the whole thing makes sense. This scenario can obviously
happen, since the driver queries WDF to find out how many bytes
have already been transferred. This is how the driver behaves: it is able
to chain requests, updating the offset on each sub-call.
The performance loss is quite high, approximately 150MB/s less (out of
600MB/s) when Windows decides to limit DMA operations to
1MB. Well, it makes sense: restarting the DMA operation is what costs time…
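A rough back-of-the-envelope estimate of what that loss means per chunk, assuming the rate drops from 600MB/s to 450MB/s and that the whole difference is overhead spread evenly over the 8 chunks (both are assumptions, and the function name is made up):

```c
#include <assert.h>

/* Estimate the extra time, in milliseconds, that each chunk costs when a
 * transfer of `mbytes` MB slows from `fast_mb_s` to `slow_mb_s` MB/s
 * after being split into `chunks` pieces. Assumes the slowdown is
 * entirely per-chunk overhead, which is a simplification. */
double extra_ms_per_chunk(double mbytes, double fast_mb_s,
                          double slow_mb_s, int chunks)
{
    assert(chunks > 0 && fast_mb_s > 0.0 && slow_mb_s > 0.0);
    double fast_ms = mbytes / fast_mb_s * 1000.0;
    double slow_ms = mbytes / slow_mb_s * 1000.0;
    return (slow_ms - fast_ms) / chunks;
}
```

Under those assumptions an 8MB write takes about 13.3 ms at 600MB/s and about 17.8 ms at 450MB/s, so each 1MB chunk carries roughly 0.56 ms of extra cost (bounce copy plus restart).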
The firmware engineers are not willing to update the VHDL base for 64-bit
addressing, so I will look for a way to implement common
low buffers.
Thank you for all your answers, guys. Special thanks go to Tim because
he made the most relevant point, I would say.
Arnaud
Sorry to be “late to the party” with a reply…
As Tim said, it looks to ME like you’re seeing exactly the sort of issue you’d expect ASSUMING you’re seeing it on systems with 4GB or more of memory.
The easy way to test this is to enable Driver Verifier’s DMA Verification. IIRC, this causes ALL transfers to be intermediately buffered (via the bounce buffers), so you should see the same makeup of your SGL.
I’m not sure what the motivation is for the existing limit on Windows’ (actually the HAL’s) bounce buffer size, but if you think about it, it’s a very difficult thing to size properly. You need to allocate physically contiguous memory, you might need multiple blocks for multiple devices, etc. It would be nice if there were a tuning parameter for this, I guess, but the bottom line is: From the start, a 32-bit busmaster DMA device is fatally flawed in a 64-bit world. From the perspective of the OS as a whole, it probably makes little sense to talk about “optimizing” behavior in this situation. I suspect those are some of the factors at play (no “inside knowledge” here, I’m not even sure who owns that code these days).
Of course, as device/driver developers that’s not what we’d prefer to hear.
The good news is that “rolling your own” isn’t conceptually difficult. The bad news is that doing a really good job of the code, and handling all the error and edge cases gracefully, is more difficult than you might think. For example, if you decide to roll your own bounce buffer scheme, will you bounce every DMA, or only those buffers with fragments physically above the 4GB mark?
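The "only those buffers" option needs a per-transfer test. A minimal sketch of such a test, assuming a simplified element type (`sg_elem` is a hypothetical stand-in, not the WDK's SCATTER_GATHER_ELEMENT) and a device limited to 32-bit physical addresses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one scatter/gather element (hypothetical). */
typedef struct {
    uint64_t phys_addr;   /* physical start address of the fragment */
    uint32_t length;      /* fragment length in bytes */
} sg_elem;

#define LIMIT_4GB 0x100000000ULL

/* Returns 1 if any fragment touches physical memory at or above 4GB,
 * i.e. a 32-bit bus-master device could not reach it directly.
 * Returns 0 if every fragment lies entirely below 4GB. */
int needs_bounce(const sg_elem *list, size_t count)
{
    assert(list != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (list[i].length == 0)
            continue;                       /* empty fragment, nothing to reach */
        uint64_t last = list[i].phys_addr + list[i].length - 1;
        if (last >= LIMIT_4GB)
            return 1;
    }
    return 0;
}
```

A fragment that starts below 4GB but runs past the boundary also counts, which is why the check uses the last byte of each fragment and not its start address.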
Give it some thought… and the list will certainly be here to advise,
Peter
OSR
The maximum buffer size for a DMA supported by the hardware is 8MB. So
in this respect I would just need
a sticky 8MB buffer below 4GB.
I will give it some thought and keep you guys updated.
Thank you.
Arnaud
Or, two or more… so you can increase your hardware utilization?? You don’t want to unnecessarily delay the next transfer with the overhead of calling IoCompleteRequest on a completed request (assuming this makes sense on your device).
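One way to picture the "two or more" suggestion is a tiny pool of staging buffers, so that the next transfer can be prepared while the device is still busy with the previous one. This is only a bookkeeping sketch with made-up names (`buffer_pool`, `pool_acquire`, `pool_release`); it has no locking, which a real driver would need, and the pool size of 2 is an assumption:

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 2   /* "two or more" staging buffers; 2 is an assumption */

typedef struct {
    int busy[POOL_SIZE];   /* 1 while the device is still using the buffer */
} buffer_pool;

/* Returns the index of a free buffer and marks it busy,
 * or -1 if every buffer is still in flight. */
int pool_acquire(buffer_pool *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->busy[i]) {
            p->busy[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Called when the device signals completion for buffer `i`. */
void pool_release(buffer_pool *p, int i)
{
    assert(p != NULL && i >= 0 && i < POOL_SIZE && p->busy[i]);
    p->busy[i] = 0;
}
```

With two buffers, a second write can be staged as soon as the first DMA has been started; a third has to wait until one of them completes.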
Peter
OSR
Is it correct to assume that if the buffer passed to WriteFile()
actually resides below 4GB, the OS will not “bounce” it?
Thanks,
Arnaud
If Driver Verifier DMA Verification is OFF, yes. The OS will not “bounce” it in this case.
Peter
OSR
So actually I would “just” need to ensure the “DMA buffer” passed from
the application via the API is allocated in low memory, right?
Please mind the quotes around “just”, because I guess it can be tricky.
Arnaud
Yes.
Similarly I “just” need to make a platinum record to be a rock star.
Peter
OSR
User mode generally has no control over where its virtual pages will be physically backed, if they are indeed backed by physical memory.
It is no longer uncommon for consumer machines to have physical addresses above 4GB, given the other consumers of physical address space (e.g. video card memory-mapped ranges).
Tim, but that would break many PCI bus-master drivers. How would a SCSI miniport perform data-in if the buffer is not ready?
> xxxxx@broadcom.com wrote:
>> Is my understanding right that the GetSgl callback can get in this case called 8 times with 1 MB chunks?
>
> It certainly CAN do so, in the right circumstances.