Sporadic very strange behaviour with scatter/gather DMA to System memory

Hi,

I am observing a very strange but reproducible behaviour with an embedded (Windows XP
Embedded) application and an Intel Atom Z5xx/US15W chipset.

My system has four DMA channels, each writing continuously to a 20 MB sized
circular buffer over PCI Express.

The failure I see is that periodically 4 KBytes (1K successive DWord ring-buffer
entries) of memory contain all zeros. These 4 KByte holes are repeated at a distance
of 64 KByte. The holes are also somewhat strange: the last 16 bytes of a page (to be
more precise, of the previous 1K DWord buffer entries) are missing, and the next page
contains all zeros except for the last sixteen bytes, which again contain valid data.

I have a test suite which runs for between 15 minutes and maybe two hours. Once the
test is finished it reboots the system (shutdown /r /t 20) and begins again
automatically. On my system about 1/4 of all test runs fail; on my customer's systems
it can be up to 1/2. This may depend on whether EWF is active or not. Whether the
test suite fails or not seems to depend on the boot session, i.e. a session either
has some failures or none at all. If I manually interrupt a successful session, it
will run on for days without errors. If I manually interrupt a failed session, the
same test will always fail.

One thing I noticed in connection with the 16 byte offset at the end of a page is the
following. When my service (which keeps the circular buffers locked down using a
pended/asynchronous request) starts up, I log the size of the first scatter/gather
element, the size of the last one and the largest size encountered for any
scatter/gather element. The first element always (well, every time I've looked) has a
size of 0xff0, the last a size of 0x10, and the largest scatter/gather element(s)
have a size of 0x3000. The scatter/gather list for each ring-buffer is stored in a
buffer (about 1200 elements) since it is too large to fit in the FPGA, which can only
hold 512 entries. The scatter/gather FIFO in the FPGA is periodically topped up.

My FPGA always tries to transfer 128 bytes at a time, except of course at page boundaries etc.

The driver alignment requirement, by the way, has been set to 64 bytes.
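
For reference, in KMDF that requirement boils down to a single call; a minimal sketch
(where exactly it is made is up to the driver):

    // Sketch: declare a 64-byte DMA alignment requirement for the device.
    // FILE_64_BYTE_ALIGNMENT (0x3f) comes from wdm.h.
    WdfDeviceSetAlignmentRequirement(device, FILE_64_BYTE_ALIGNMENT);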

Has anybody seen anything similar?

If you do a large malloc in your application, it may use a page-aligned piece of memory and use the first 0x10 bytes as the chunk header, so you get a buffer which starts at offset 0x10.

If you run under an x64 client, or under a server SKU, you may be getting memory pages with physical addresses over 4 GB. If your device doesn't support 64-bit addresses, you need to declare that when you call IoAllocateDmaAdapter. Note that you have to specify the correct direction of transfer in the GetScatterGatherList and PutScatterGatherList calls.

On 17.02.2012 04:46, xxxxx@broadcom.com wrote:

If you do a large malloc in your application, it may use a page-aligned piece
of memory and use first 0x10 bytes as the chunk header, so you get a buffer
which starts from 0x10 offset.

The malloc (really C++ new()) in the Windows service which owns the circular buffers
currently requests 20 MByte. There are four such buffers.

If you run under x64 client, or under server SKU, you may be getting memory
pages with physical address over 4GB. If your device doesn’t support 64bit
address, you need to declare that when you call IoAllocateDmaAdapter. Note that
you have to specify correct direction of transfer in GetScatterGatherList and
PutScatterGatherList calls.

The FPGA implementation supports addresses above 4 GB and this is also set when
the DMA adapters (KMDF driver) are set up. Here, I don’t use IoAllocateDmaAdapter,
GetScatterGatherList or PutScatterGatherList directly.
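
For reference, in KMDF the 64-bit capability is declared through the DMA enabler
profile; a minimal sketch (names illustrative, not the actual driver code):

    WDF_DMA_ENABLER_CONFIG dmaConfig;
    WDFDMAENABLER dmaEnabler;
    NTSTATUS status;

    // Sketch: a scatter/gather profile that allows 64-bit physical addresses.
    WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                                WdfDmaProfileScatterGather64,
                                MAX_TRANSFER_LENGTH);   // illustrative constant

    status = WdfDmaEnablerCreate(device, &dmaConfig,
                                 WDF_NO_OBJECT_ATTRIBUTES, &dmaEnabler);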

With a call like
    status = WdfDmaTransactionInitializeUsingRequest(bufferContext->pendedTransaction,
                                                     Request,
                                                     VibEvtProgramDataDma,
                                                     WdfDmaDirectionReadFromDevice);
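
The surrounding transaction handling is roughly the following (a simplified sketch;
error handling and the exact context bookkeeping are omitted, and the member names
are illustrative):

    // Sketch: create the transaction once, initialize it from the pended
    // request, then execute it so that EvtProgramDma runs and the user
    // buffer pages stay locked for as long as the request stays pending.
    status = WdfDmaTransactionCreate(devExt->dmaEnabler,
                                     WDF_NO_OBJECT_ATTRIBUTES,
                                     &bufferContext->pendedTransaction);

    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionInitializeUsingRequest(
                     bufferContext->pendedTransaction,
                     Request,
                     VibEvtProgramDataDma,
                     WdfDmaDirectionReadFromDevice);
    }

    if (NT_SUCCESS(status)) {
        // The context passed here arrives in EvtProgramDma as its Context parameter.
        status = WdfDmaTransactionExecute(bufferContext->pendedTransaction,
                                          transactionContext);
    }
    // The request itself is left pending; completing it later releases the
    // locked pages.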

The VibEvtProgramDataDma() callback gets passed the scatter/gather list, which is
then written to a (WDM) buffer. From there, the scatter/gather entries are
periodically written to a FIFO in the FPGA and simply popped from it by the DMA
controller. The error trigger for FIFO underrun never fires, so I am assuming this is
not an FPGA problem. The FIFO entries are 64 bits wide: a 48-bit page address, with
the rest being the current S/G element size plus a flag bit indicating the start of
the circular buffer.
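
For illustration, the entry format described above amounts to something like the
following packing (a sketch only; the exact width of the size field and the position
of the flag bit are assumptions, since they are not spelled out here):

    // Sketch: pack one S/G element into the 64-bit FPGA FIFO entry format:
    // bits 47:0 hold the physical address, the remaining bits hold the
    // element size and a start-of-buffer flag (field layout assumed).
    static ULONGLONG PackFifoEntry(PHYSICAL_ADDRESS pageAddress,
                                   ULONG lengthBytes,
                                   BOOLEAN startOfBuffer)
    {
        ULONGLONG entry;

        entry  = (ULONGLONG)pageAddress.QuadPart & 0x0000FFFFFFFFFFFFULL;
        entry |= ((ULONGLONG)(lengthBytes & 0x7FFF)) << 48;
        entry |= startOfBuffer ? (1ULL << 63) : 0;
        return entry;
    }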

Charles wrote:

I am observing a very strange but reproducible behaviour with an
embedded (Win-XP embedded) application and Intel Atom Z5xx/US15W
chipset

My system has four DMA channels, each writing continuously to a 20 MB
sized circular buffer over PCI Express.

The failure I see is that periodically 4K Bytes (1K successive DWord
ring-buffer entries) of memory contains all zeros. These 4 KByte
holes are repeated with a distance of 64 KByte. The holes are also
somewhat strange, the last 16 bytes of a page (to be more precise, of
the previous 1K buffer Dword entries) are missing and the next page
contains all zeros except for the last sixteen bytes which again
contains valid data. … One thing I noticed about the 16 byte offset
at the end of a page is also the following. When my service (which
keeps the circular buffers locked down using a pended/asynchronous
request) starts up, I log the size of the first scatter/gather page,
the size of the last one and the largest size encountered on a
scatter/gather element. The first page always (well anytime I’ve
looked) has a size of 0xff0, the last a size of 0x10 and the largest
scatter/gather element(s) have a size of 0x3000.

It is hard to see this as anything other than a DMA failure in the
hardware. Did you buy the PCIExpress IP or did you build it yourself?
Do you have a PCIe bus analyzer?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

On 17.02.2012 19:08, Tim Roberts wrote:

Charles wrote:
> I am observing a very strange but reproducible behaviour with an

It is hard to see this as anything other than a DMA failure in the
hardware. Did you buy the PCIExpress IP or did you build it yourself?
Do you have a PCIe bus analyzer?

Built it myself, and I've been looking for the cause in my hardware first. It has
been simulated extensively (VHDL/SystemVerilog) with bus-functional models for PCI
Express, the external data sources and even a scatter/gather model for the target
memory which is designed to catch addresses straying outside of predefined areas.
Plenty of randomisation in there too: packet sizes, target memory areas etc.
Simulation is still ongoing but nothing has turned up so far.

Theoretically, yes: the missing page would suggest two scatter/gather entries being
popped from the FIFO instead of one, or one entry not arriving at all, but there is
logic in there to detect exactly that. What does not point to the FPGA is the period
of 64K DWords between offending pages. That is far more entries than would fit on the
FPGA at any one time.

Also, I have an Intel Crown Beach reference board running the same FPGA content,
except with programmable data generators instead of the external real-world DSPs.
This runs for weeks without a single error at widely varying data bandwidths, from
about 10 MB/s up to 40 MB/s.

On 17.02.2012 04:46, xxxxx@broadcom.com wrote:

If you do a large malloc in your application, it may use a page-aligned piece
of memory and use first 0x10 bytes as the chunk header, so you get a buffer
which starts from 0x10 offset.

Oh, the 16 byte offset is a pain. I have discovered that Atom based systems are very,
very short of breath. Previously, the hardware sent packets to system memory as they
came in. Changing the DMA controllers to always send 128 bytes at a time (wherever
possible) brought a fantastic, nearly 30% increase in attainable PCI Express
bandwidth, presumably because of the x86 cache line size.

This sixteen byte header means that at the end of every scatter/gather page, this
rule is broken. Is there any way of increasing the chunk header to align it to a
64-byte boundary? Well, I suppose I could just pad the circular buffers with a
64 - 16 byte additional dummy header to push everything down and get it aligned
again. That should maybe bring another few percent.
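
For what it's worth, a minimal user-mode sketch of the two obvious ways to get a
64-byte aligned buffer (sizes and names are only illustrative):

    #include <stdlib.h>
    #include <malloc.h>

    /* Option 1: let the CRT align it (release with _aligned_free). */
    unsigned char *buf = (unsigned char *)_aligned_malloc(20u * 1024u * 1024u, 64);

    /* Option 2, the manual padding described above: over-allocate by 64 bytes
       and round the usable start up to the next 64-byte boundary. */
    unsigned char *raw     = (unsigned char *)malloc(20u * 1024u * 1024u + 64);
    unsigned char *aligned = (unsigned char *)(((size_t)raw + 63) & ~(size_t)63);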

Thanks for the info.

Have you run your driver under driver verifier with DMA checking turned on?
This will force transfers to use bounce buffers and will apply things like
buffer boundary checking.

You may actually want to run the FULL checked OS, with driver verifier enabled for
ALL drivers. Since it seems like you're saying a DMA transfer isn't showing up in the
buffer as expected, one question is whether the data shows up someplace else, like in
memory that's not your buffer. The checked build will be looking for other memory
that has been corrupted, especially if you enable driver verifier with the options
turned way up on everything. The tricky part is finding a modern system that
correctly runs this. I've had better success with Intel motherboards running the full
checked build. This integrity-checked environment may also degrade performance so
much that your device/driver has a problem, which in itself is a good test of
behavior under high system load.

If you can generate known data patterns in the transfers, it can also be useful to
add test code in your driver that verifies these patterns as early as possible, such
as right after the transfer-complete interrupt. If you break just after a corrupted
transfer, you can do things like run the pool walkers and look for pool header
corruption.
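
For example, something along these lines in the DPC (a sketch only; it assumes the
device can be told to generate an incrementing DWORD pattern, and all names are
illustrative):

    // Sketch: verify an incrementing-DWORD test pattern over the region the
    // device just reported as filled; break on the first mismatch so the
    // system can be inspected as close to the corruption as possible.
    static VOID CheckTestPattern(const ULONG *ringBuffer,
                                 ULONG firstIndex,
                                 ULONG count,
                                 ULONG expectedStart)
    {
        ULONG i;

        for (i = 0; i < count; i++) {
            if (ringBuffer[firstIndex + i] != expectedStart + i) {
                DbgPrint("Pattern miss at index 0x%x: got 0x%08x, expected 0x%08x\n",
                         firstIndex + i, ringBuffer[firstIndex + i], expectedStart + i);
                DbgBreakPoint();
                return;
            }
        }
    }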

You should also try varying the SMP characteristics. I once debugged a DMA
corruption issue that ONLY would show up if two or more physical processors
(before multicore or HT) were doing transfers. If you can, use 1 or more
PHYSICAL processors and compare 1 vs more than 1. It turned out to be a
hardware issue.

Certainly if you have a PCIe bus analyzer that may help. It can be helpful
to get the driver to validate the data, and if corrupted trigger the bus
analyzer to stop tracing. Bus analyzers have less than infinite trace memory
and it helps a LOT if you can freeze things as soon as possible after
transfer corruption.

PCIe analyzers can be rented, at least they used to be rentable, although
it’s now been some years since I’ve done so. I’ve been told some FPGAs can
have simple PCIe bus analyzers added to their PCIe block.

A useful OS validation feature would be the ability to turn on DMA transfer bounds
validation using the hardware IOMMU found in many modern systems.

Jan


About 35 years ago, when I was writing drivers for new devices, I learned
several lessons:

  1. The hardware doesn’t work as documented
  2. The documentation is incomplete, especially on “corner cases”
  3. The documentation does not agree with the code sample provided by the
    vendor
  4. The hardware doesn’t work

In those days, when 7400 TTL was state of the art, I remember one
afternoon working with a colleague on reading the circuit diagrams and
reverse-engineering the specs (which resulted in one of those AHA! moments
when we realized a race condition would leave one set of bus drivers high
and the other drivers low, instead of “floating” as they were supposed to,
to which the vendor’s response was “Oh, that must be why the diagnostics
keep reporting an error!”) This added a rule “The documentation does not
correspond to the actual implementation”, something which is nearly
impossible for a customer to verify these days, except by noticing that a
driver written to the specs fails. It is a special subcase of Rule 1.

As far as I can tell, the only change today is that the devices are
physically smaller and with vastly higher gate density than in 1975. The
rules are still good.

You will have to prove this to the hardware people; plan on renting a bus
analyzer.

I remember one course I taught on kernel debugging where the students gave
ME homework. They gave me a crash dump to analyze. Cold, no hints or
background. I determined that it was most probably a PCI bus timing
problem in the hardware, and gave them a two-hour lecture on why this was
the only possible conclusion. Turns out none of the driver team knew
anything at all about how PCI busses worked, except by magic.
joe


Charles:

A common approach to getting an aligned buffer is either to use VirtualAlloc (which guarantees you 64 KB alignment), or to malloc a bigger buffer and use an aligned piece of it.
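
For instance (a user-mode sketch; buffer size and names are illustrative):

    #include <windows.h>

    /* Sketch: VirtualAlloc returns a region aligned to the allocation
       granularity (normally 64 KB), so there is no malloc chunk header in
       front of the buffer at all. */
    SIZE_T bufSize = 20 * 1024 * 1024;
    void *ringBuffer = VirtualAlloc(NULL, bufSize,
                                    MEM_COMMIT | MEM_RESERVE,
                                    PAGE_READWRITE);
    /* ... use the buffer ... */
    VirtualFree(ringBuffer, 0, MEM_RELEASE);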

On 17.02.2012 19:08, Tim Roberts wrote:

It is hard to see this as anything other than a DMA failure in the
hardware. Did you buy the PCIExpress IP or did you build it yourself?
Do you have a PCIe bus analyzer?

Well, I think today I have been able to show that the error is coming from the WDF
DMA adapter setup. On my Crown Beach reference system (i.e. the Intel reference
board, not the customer board), I have had a situation where the error occurred
exactly once about every 3 seconds and only on one DMA channel. Again there was a 1K
DWord sized memory hole, but only in the buffer offset range 0x2037fc - 0x203bfb.

One of the things I write to syslog when the error occurs is the iteration number
through the circular buffer. The error occurred exactly once per iteration. There
were 0x1371 (4977 decimal) scatter/gather elements describing the 20 MByte
ring-buffer. The missing buffer indexes are somewhere in the middle, the highest
buffer DWord index being 0x500000. Of the 4977 scatter/gather entries, only 512 fit
in the FPGA at any one time.

Simply restarting the service which owns the circular buffers was enough to get
rid of the error which, so far, I haven’t seen again today.

The 4977 scatter/gather elements are stored in an array in the device context,
really an array of arrays since there is one scatter/gather table for each of the
four DMA channels.

I set up the scatter/gather tables using the pretty simple code below.
VibEvtProgramDataDma() is the callback for EVT_WDF_PROGRAM_DMA, which is assigned
when the adapter is set up. The scatter/gather list tables are each stored in a
WDFMEMORY object.

bufferContext->sgElements is a ULONG; when the table is complete, it holds the number
of entries in the scatter/gather table.

bufferContext->sgListMax is configured during driver set-up and is derived from a
registry setting defining the maximum ring-buffer size. The formula is
(BufSizeInMBytes / 4096) + 2

Is there any way to check the validity of the SgList received in the EvtProgramDma
callback?

    BOOLEAN VibEvtProgramDataDma(IN WDFDMATRANSACTION Transaction,
                                 IN WDFDEVICE Device,
                                 IN WDFCONTEXT Context,
                                 IN WDF_DMA_DIRECTION Direction,
                                 IN PSCATTER_GATHER_LIST SgList)
    {
        PCIRCULAR_BUFFER_CONTEXT bufferContext;
        PDEVICE_EXTENSION devExt = NULL;
        PVIB_TRANSACTION_CONTEXT transactionContext;
        ULONG ix;

        TraceEvents(TRACE_LEVEL_INFORMATION, DBG_INIT,
                    "VibEvtProgramDataDma called");
        TraceEvents(TRACE_LEVEL_INFORMATION, DBG_INIT,
                    " Scatter-Gather List Size: %d", SgList->NumberOfElements);

        devExt = WdfObjectGetTypedContext(Device, DEVICE_EXTENSION);

        transactionContext = (PVIB_TRANSACTION_CONTEXT)Context;
        bufferContext = transactionContext->bufContext;

        // Copy the scatter/gather elements into the per-buffer table, up to
        // the configured maximum. Elements beyond the maximum are dropped
        // and the truncation is logged.
        if (SgList->NumberOfElements > 0) {
            for (ix = 0; ix < SgList->NumberOfElements; ix++) {
                if (bufferContext->sgElements < bufferContext->sgListMax) {
                    bufferContext->sgList[bufferContext->sgElements] = SgList->Elements[ix];
                    bufferContext->sgElements += 1;
                }
                else {
                    VibLogEventDpc(Device, STATUS_SUCCESS,
                                   VMPCI_MSG_SG_LIST_TRUNCATED_WRN,
                                   transactionContext->bufSel);
                    TraceEvents(TRACE_LEVEL_INFORMATION, DBG_INIT,
                                " ** !!! Too many elements for scatter gather list. Truncating");
                }
            }
        }

        return TRUE;
    }
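
On the question of checking the SgList: a basic consistency check along the following
lines can at least catch gross problems (a sketch; the expected total length would
have to come from the buffer context):

    // Sketch: sanity-check the SgList handed to EvtProgramDma. It cannot
    // prove the page frame numbers are correct, but it catches zero-length
    // or zero-address elements and a total length that does not match the
    // locked-down buffer.
    static BOOLEAN VibCheckSgList(PSCATTER_GATHER_LIST SgList,
                                  ULONGLONG expectedTotalBytes)
    {
        ULONGLONG total = 0;
        ULONG ix;

        for (ix = 0; ix < SgList->NumberOfElements; ix++) {
            if (SgList->Elements[ix].Length == 0 ||
                SgList->Elements[ix].Address.QuadPart == 0) {
                DbgPrint("Suspect SG element %u: addr 0x%I64x len 0x%x\n",
                         ix,
                         SgList->Elements[ix].Address.QuadPart,
                         SgList->Elements[ix].Length);
                return FALSE;
            }
            total += SgList->Elements[ix].Length;
        }
        return (total == expectedTotalBytes);
    }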

> rule is broken. Is there any way of increasing the chunk header to align it to a
> 64-byte boundary?

Use VirtualAlloc instead of malloc(); this will give you page alignment.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Actually, it gives allocation-granularity alignment, which is nominally
64K-boundary alignment. So the page alignment falls out by accident.
There’s an API I always have to look up that tells you what the allocation
granularity is.
joe
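
(For reference, that API is GetSystemInfo; a minimal sketch:)

    #include <windows.h>
    #include <stdio.h>

    /* Reads the allocation granularity (normally 64 KB) that VirtualAlloc
       regions are aligned to, along with the page size. */
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    printf("Allocation granularity: %lu bytes, page size: %lu bytes\n",
           si.dwAllocationGranularity, si.dwPageSize);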


Charles wrote:

Well, I think today, I have been able to show that the error is coming from the
WDF DMA Adapter setup.

So, you are suggesting that the scatter/gather list you are given has an
incorrect page number somewhere in the middle? I have to admit to a
VERY great deal of skepticism. Building a scatter/gather list is just
not that hard. If it works once, then it’s pretty much going to work
forever.

It would be more interesting to know what the physical address of the
hole region was. Are you printing the contents of the s/g list to the
debugger? Consider, for example, if you have a path in your PCIExpress
core where the setup and hold times are right on the edge. You might
step from physical address 2FFFF000 to 30000000, but have one of the
bits “stick” in the 1 position. That, in my opinion, is the kind of
thing that will be at the root of this problem.
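
A sketch of that kind of dump, done once per list from the EvtProgramDma callback
(illustrative only):

    // Sketch: print each scatter/gather element so the physical address of a
    // later hole can be matched against the list, e.g. to spot a stuck bit
    // at a 2FFFF000 -> 30000000 style transition.
    ULONG ix;

    for (ix = 0; ix < SgList->NumberOfElements; ix++) {
        DbgPrint("SG[%4u]: PA 0x%016I64x  len 0x%05x\n",
                 ix,
                 SgList->Elements[ix].Address.QuadPart,
                 SgList->Elements[ix].Length);
    }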

On my Crown-Beach reference system (i.e. Intel reference
board, not the customer board). I have had a situation where the error occurred
exactly once about every 3 secs and only on 1 DMA channel. Again here there was a
1K Dword sized memory hole but only on buffer offset range (0x2037fc - 0x203bfb)

Crown Beach includes system DMA. Are you using your own DMA engine, or
the motherboard’s DMA engine?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.