DMA operations on WinXp64 introduce artifacts into the data

Hi,

I have a DMA problem that looks rather spooky, so I would be grateful if you could contribute any ideas towards overcoming it.

Summary: DMA operations on WinXp64 introduce artifacts into the data.
Background: a 1394 OHCI adapter using our drivers (not the MS drivers) that receives large images from a 1394 digital camera. The adapter can only do DMA to 32-bit addresses. The target system is WinXP x64 on exotic hardware (NUMA, from what I have come to understand).

Images can be quite large, 5-6 MB, and we generally write our software so that one frame corresponds to “one” DMA operation. The whole buffer gets locked down and mapped, the required number of map registers is calculated, a call to DmaOperations->AllocateAdapterChannel is made, and then the physical addresses are obtained with successive calls to DmaOperations->MapTransfer, etc.

Everything works like a breeze on x86 systems and on normal x64 systems. Then we met the exotic system, which as far as I have understood is a NUMA creature. On this system the call to IoGetDmaAdapter returns a maximum of only 256 map registers, no matter what value I specify in DEVICE_DESCRIPTION::MaximumLength. Of course, if I specify less, I get less; for example, specifying a MaximumLength of 262144 yields 65 map registers, which also leads me to believe that the page size is 4 KB.
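As an aside, the 65-register figure is consistent with worst-case page alignment: a buffer that is not page aligned can touch one more page than its length alone suggests. A small user-mode sketch mirroring the arithmetic of the WDK's ADDRESS_AND_SIZE_TO_SPAN_PAGES macro (assuming a 4 KB page, as observed):

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Mirrors the arithmetic of the WDK's ADDRESS_AND_SIZE_TO_SPAN_PAGES macro:
 * the number of 4 KB pages a buffer of 'size' bytes starting at virtual
 * address 'va' can touch. */
static uint32_t span_pages(uintptr_t va, uint32_t size)
{
    return (uint32_t)(((va & (PAGE_SIZE - 1)) + size + PAGE_SIZE - 1) / PAGE_SIZE);
}
```

A 262144-byte buffer starting at the worst-case offset (the last byte of a page) touches 65 pages, matching the map-register count the HAL reported; page aligned, it would touch only 64.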

Although a max of 256 map registers should be available, we found out that we could only DMA buffers of up to 56 KB. Of course we queued several of these smaller DMA requests, with the AdapterControl routine returning DeallocateObjectKeepRegisters each time. At some point there were no more map registers, so the IRPs were queued internally.

So we modified the software to split the single user-mode buffer over multiple DMA requests. The size of each DMA request is not a multiple of the page size, but a multiple of the 1394 image packet size. This means that a single physical page might get partially mapped in one DMA operation, with the rest of it mapped in the next. This made me skeptical initially, but the solution seemed to work fine on the first 8-CPU/8-GB machine where we tested.
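To illustrate the page-straddling concern with some (purely hypothetical) numbers: if the chunk size is a multiple of the packet size but not of the page size, the page containing a chunk boundary is mapped by two consecutive DMA operations. A minimal sketch, assuming a 4 KB page and an illustrative 56000-byte chunk (14 packets of 4000 bytes):

```c
#include <stdint.h>

#define PAGE_SIZE  4096u
#define CHUNK_SIZE 56000u /* hypothetical: 14 x 4000-byte packets, not page aligned */

/* Index of the first page touched by chunk k, counting pages from the
 * start of the user buffer. */
static uint32_t first_page_of_chunk(uint32_t k)
{
    return (k * CHUNK_SIZE) / PAGE_SIZE;
}

/* Index of the last page touched by chunk k. */
static uint32_t last_page_of_chunk(uint32_t k)
{
    return ((k + 1) * CHUNK_SIZE - 1) / PAGE_SIZE;
}
```

With these numbers, last_page_of_chunk(0) == first_page_of_chunk(1) == 13: page 13 is partially mapped by the first DMA operation and mapped again by the second.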

However, after moving the solution to a different WinXP x64 machine, the client started seeing a weird phenomenon: every second or third frame, artifacts would appear in the image. We had the camera transmit in test mode, where all quadlets are 0xBCBCBCBC, captured the frames at the lowest possible level, and the artifacts were still there.
The artifacts were in all cases exactly 56 bytes long, except for exactly one case where the length was 32. All 56-byte artifacts contained exactly the same byte pattern. Moreover, since the application recycles the user-mode buffers it uses to receive frames, the offsets of the artifacts were exactly the same whenever the same user-mode buffer was being used. There might have been a small difference, like one artifact present in one frame and absent in the next capture, but other than that the artifacts in the same user buffer were at exactly the same offsets.

To make matters even more interesting, the client told me that if he reduces memory from 4GB to 2GB then the artifacts are gone.

Does anyone have an idea why this is happening? My main suspect is the fact that the same page is often used in two independent, consecutive DMA transfers, and we might be facing some weird cache-coherency issue on these machines.

The only option I can currently consider is to use separately allocated user-mode buffers for each of the DMA operations and then reconstruct the image from the pieces, at the cost of an extra memory copy that is currently not needed.

Any advice will be greatly appreciated.

Dimitris Staikos
Unibrain

So, to rule out the obvious (that your code isn’t working correctly) I have to ask: Have you run this driver under driver verifier with DMA verifier enabled?

Peter
OSR

((Blinding flash of the obvious))…
No, we haven’t run Verifier on the client’s problematic system, only on our test systems. I’ll give it a try and keep the post updated. Thanks :-)

We ran Verifier on the client system with DMA checking enabled and there were no blue screens, but on the output of DebugView the following message appears sometimes:

* * * * * * * * HAL Verifier Detected Violation * * * * * * * *
* *
* * VF: Allocated too many map registers : 10e (max 100)
* *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

What should I make of it?
IoGetDmaAdapter returns a max of 256 map registers (0x100), but individual calls to AllocateAdapterChannel will only succeed if I map less than 56 KB. Our requests are broken down into these small chunks, and then several of them are programmed through AllocateAdapterChannel while the rest are queued. Isn’t AllocateAdapterChannel responsible for keeping track of how many map registers are allocated on the DmaAdapter object? How could I possibly allocate more than the maximum?

Thanks,
Dimitris

AllocateAdapterChannel will return STATUS_INSUFFICIENT_RESOURCES if in any SINGLE call you request more than the allowed maximum number of map registers. The *non-verifier* version does not, however, keep a running-tally of the number of map registers you’ve used. The verifier version does keep track, and this is the message you’re seeing.

To be honest, I was surprised to see that verifier considers this an error – though it may be impolite to use a running total of map registers greater than the max, I have always been under the impression that such behavior was in fact allowed.

Apparently I am mistaken.

I know there’s been a lot of change in the HALs over the past few years, and maybe this is one of them. I’d be interested in hearing the definitive answer from one of the MSFT authorities.

Now, having said all that, I’d be surprised if this was the underlying problem in your driver. But, I guess it’s easy enough for you to test to see if it is. Just limit the running total of map registers that you allocate to no more than whatever’s returned from IoGetDmaAdapter.
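The suggested limit can be sketched as a simple running tally kept alongside the DmaAdapter. The structure and function names below are illustrative, not WDK API; only max_registers corresponds to something real (the NumberOfMapRegisters value returned by IoGetDmaAdapter):

```c
#include <stdint.h>

/* Illustrative bookkeeping, not a WDK structure. max_registers would hold
 * the NumberOfMapRegisters value returned by IoGetDmaAdapter. */
typedef struct {
    uint32_t max_registers;
    uint32_t in_use;
} map_register_budget;

/* Charge the budget before calling AllocateAdapterChannel; on failure the
 * caller queues the transfer and retries after a later FreeMapRegisters. */
static int try_reserve(map_register_budget *b, uint32_t needed)
{
    if (needed > b->max_registers - b->in_use)
        return 0;
    b->in_use += needed;
    return 1;
}

/* Call when FreeMapRegisters has returned the registers to the HAL. */
static void release(map_register_budget *b, uint32_t freed)
{
    b->in_use -= freed;
}
```

With a 256-register adapter, a request that would push the running total past 256 is deferred rather than handed to AllocateAdapterChannel.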

I’ll be very curious to hear if that makes the difference.

Peter
OSR

We finally resolved some other issues and got to implement a limit on how many map registers we request. Sure as heck the Verifier stopped complaining, but unfortunately the artifacts issue is not gone.

The funny thing is that for some minutes the application will work OK. We restart it several times, to make sure that cleanup is OK, and it behaves as expected. Then at some moment the artifacts start to appear in the images captured.

The interesting thing is that they are not random. The application internally uses 4 buffers that it recycles to the driver. So for example what happens is this: Buffer 0 is always good, Buffer 1 always has some artifacts in the top part of the picture, Buffer 2 has artifacts all over the picture, Buffer 3 is good. No matter how many pictures we shoot, each buffer produces the same results! We have verified in two different ways that the artifacts are not introduced by the camera, so that’s not the place to look for the problem.

I have studied the code as carefully as I could and have found two suspects.

First, do I need to call FlushAdapterBuffers once for each call to MapTransfer, or is one call enough, with the ‘initial’ CurrentVa as stated in the DDK documentation (This value must be the same as the initial CurrentVa value passed to MapTransfer)?

Second is the one that has me worried most. The 1394 adapter is programmed so that it splits incoming packets and receives the header into one buffer and the data into another buffer. This means that the driver is programming a DMA operation into two distinct buffers. The code proceeds through the buffers calling MapTransfer for the header buffer, then several calls to MapTransfer for the data buffer, then again MapTransfer for the header buffer, etc, using the *same* MapRegisterBase. When the DMA operation completes two calls are made to FlushAdapterBuffers and one call to FreeMapRegisters.
All this seems to work fine, up to the point where something triggers the system into the weird, buffer-specific behaviour I described above.
Is there something invalid with this approach? It has never caused trouble on x86 systems.

Another clue: if the debugger is attached, it is much more difficult to make the artifacts appear. No debug messages are printed throughout the receive operations, yet it is still much harder to drive the system into the dizzy state.

Warm Regards
Dimitris Staikos

Once is enough. You’re doing it the right way. And, if it wasn’t right, verifier would smack you in any case.

This has me a little confused. When you call FlushAdapterBuffers, the CurrentVa and Length parameters are used to reflect the VA and length in the MDL. The current VA is just used to determine the starting offset of the transfer. By definition, a single MDL describes a single, virtually contiguous, data buffer. So, if you’re doing one DMA transfer that’s putting the data physically into multiple locations, and those physical locations are described by a single MDL, you only call FlushAdapterBuffers ONCE with the CurrentVa indicating the starting location within the MDL for the transfer and Length indicating the overall transfer length.

You typically only set CurrentVa and Length as anything other than the starting VA of your MDL and the length of the MDL, respectively, when you break a given transfer into multiple chunks (as when you’re limited by the max number of map registers).
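One way to keep that pairing straight when a transfer is split into chunks is to derive each call's CurrentVa offset and Length from a single helper, so MapTransfer and the matching FlushAdapterBuffers always describe the same portion of the MDL. A minimal sketch; the cap and frame size are illustrative:

```c
#include <stdint.h>

#define MAX_CHUNK 57344u /* illustrative per-transfer cap, e.g. 56 KB */

/* Computes the CurrentVa offset and Length for chunk i of a total_len-byte
 * transfer. The identical pair would be passed to MapTransfer and, on
 * completion, to FlushAdapterBuffers for that chunk. */
static void chunk_params(uint32_t total_len, uint32_t i,
                         uint32_t *offset, uint32_t *len)
{
    uint32_t start = i * MAX_CHUNK;
    uint32_t remaining = (start < total_len) ? total_len - start : 0;
    *offset = start;
    *len = (remaining < MAX_CHUNK) ? remaining : MAX_CHUNK;
}
```

For a 6 MB (6291456-byte) frame this yields 110 chunks, the last one 40960 bytes long; every Flush sees exactly the offset/length its Map saw.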

I’m not sure that’ll help any, or if I just don’t understand your situation… which is always a possibility.

Peter
OSR