My KMDF confuses the PCIe firmware

Hi all,

this is a follow-up to my topic “hardware Scatter/Gather table size” (http://www.osronline.com/showThread.cfm?link=263332), which has drifted beyond the scope of its title.

After a deep investigation and consultation with the FPGA developer, I am still stuck at the same point:
Something in my KMDF driver (or in the Windows kernel itself) seems to confuse the PCIe device.
I have set up a Linux driver for my device on the same machine, and it works as expected (the device has been in production for some years). When I do exactly the same in my KMDF driver (same order of register reads/writes, same DMA setup), the result usually differs from the Linux driver. In most cases the PCIe core or the FPGA firmware does unexpected things or writes to wrong addresses.
Now we have a new PCIe device that implements similar behaviour. It uses completely different hardware (Xilinx Virtex-7 FPGA Gen3), and we experience similar mysteries: the firmware shows bugs that have never been seen with the reference Linux implementation.

I’ve tried using the macros READ_REGISTER_ULONG/WRITE_REGISTER_ULONG instead of direct read/write operations. This did not fix the problem, although the behaviour changed slightly. Then I added a simple dummy read operation to the write macro: again different behaviour, but still not correct.

I’ve tried some more actions:

  • Changed PC hardware
  • Changed DMA memory area to be 32bit addressable
  • Put all CommonBuffers into one page aligned buffer, made every block page aligned.

At this point I have no more ideas and one big question:

What can my driver possibly do or miss that confuses 2 different hardware devices in 2 different PCs, keeping in mind that the initialization sequence is exactly the same as in the Linux driver?

Thanks a lot!

How are you mapping your device registers and the shared memory section?

-p

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@t-online.de
Sent: Wednesday, January 14, 2015 10:55 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] My KMDF confuses the PCIe firmware

NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

In EvtDevicePrepareHardware I enumerate the resources. I ignore all resource types (CmResourceTypePort, CmResourceTypeInterrupt) other than CmResourceTypeMemory. I treat the first descriptor of this type as BAR0 and map my MMIO:

ASSERT(descriptor->u.Memory.Length >= ZFAS_PCI_REGS_LEN);
pDeviceExtension->PciMemBAR0 = MmMapIoSpace(
    descriptor->u.Memory.Start,
    ZFAS_PCI_REGS_LEN,
    MmNonCached);
TraceEvents(TRACE_LEVEL_VERBOSE, TRACE_DRIVER, "PciMemBAR0=%p\n", pDeviceExtension->PciMemBAR0);

Q: Should I map the whole descriptor->u.Memory.Length rather than the portion of ZFAS_PCI_REGS_LEN?

The common buffers for control structures and scatter/gather table are situated all together in one buffer, created as:
status = WdfCommonBufferCreate(
    pDeviceExtension->WdfDmaEnabler,
    sizeof(zFas_CommonBuffer),
    WDF_NO_OBJECT_ATTRIBUTES,
    &pDeviceExtension->DmaCommonBuffer);
if (STATUS_SUCCESS == status) {
    PHYSICAL_ADDRESS pa;
    pDeviceExtension->DmaVirtualAddr = (PzFas_CommonBuffer)WdfCommonBufferGetAlignedVirtualAddress(pDeviceExtension->DmaCommonBuffer);
    TraceEvents(TRACE_LEVEL_INFORMATION, TRACE_DEVICE, "VirtAddresses - Head=%p, SGTable=%p, Data=%p\n",
        &pDeviceExtension->DmaVirtualAddr->Head, &pDeviceExtension->DmaVirtualAddr->Desc, &pDeviceExtension->DmaVirtualAddr->Data);
    pa = WdfCommonBufferGetAlignedLogicalAddress(pDeviceExtension->DmaCommonBuffer);
    pDeviceExtension->DmaHeadLogicalAddr.QuadPart = pa.QuadPart + FIELD_OFFSET(zFas_CommonBuffer, Head);
    pDeviceExtension->SGTableLogicalAddr.QuadPart = pa.QuadPart + FIELD_OFFSET(zFas_CommonBuffer, Desc);
    pDeviceExtension->DataLogicalAddr.QuadPart = pa.QuadPart + FIELD_OFFSET(zFas_CommonBuffer, Data);
    TraceEvents(TRACE_LEVEL_INFORMATION, TRACE_DEVICE, "LogAddresses - Head=0x%llx, SGTable=0x%llx, Data=0x%llx\n",
        pDeviceExtension->DmaHeadLogicalAddr.QuadPart, pDeviceExtension->SGTableLogicalAddr.QuadPart, pDeviceExtension->DataLogicalAddr.QuadPart);
}

The code seems to work: I can read and write registers as expected, and the scatter/gather structures get filled correctly. There are just some misbehaviours that are not visible at first sight:

  • sometimes the result length, which the hardware updates inside the SG descriptors, is written to a completely wrong memory location (crashing the kernel after some time)
  • a scatter/gather operation that is expected to run over 16 elements stops after 6 elements
  • and so on.

Quite surely there may be bugs in the firmware of the PCIe devices too. But the question is: why don’t they occur with the Linux driver?

Thanks a lot.

Solution is in sight!

Instead of using a WdfCommonBuffer for the DMA control structures and the scatter/gather table, I have now tried the classical way:
I allocated the buffer using MmAllocateContiguousMemorySpecifyCache() with MmNonCached and gave the device the address returned by MmGetPhysicalAddress() as its logical address. This way none of my long-standing misbehaviours have occurred.
This is very surprising to me; after all I have read about the WDF DMA abstraction and the concept of WdfCommonBuffer, it was difficult to believe!

What should be my next steps from here?

  1. Find out why WdfCommonBufferCreate() fails to provide a non-cached (or contiguous) buffer to my hardware device?
  2. Integrate the WDF DMA abstraction with my manually created physical buffer. How?
  3. Go back to the classical way, ignoring all the advice not to work around the DMA abstraction?

Thanks to all!

Does your device support 64 bit physical address in DMA operations?

A WdfCommonBuffer is physically contiguous, so I have to believe what’s different is the cache attribute.

A problem with using MmGetPhysicalAddress() is that it returns the processor-relative physical address, which will only be the same as the device-bus-relative physical address if the system is using an identity bus translation, which is typical but not guaranteed. The DMA abstraction handles bus address translation transparently.

It’s common for mapped device BARs to be non-cached, but it seems pretty strange for a shared region of memory, a common buffer, to be non-cached. This sounds like your device might be incorrectly managing cache coherency. I’d have to go refresh my memory of who is responsible for assuring cache coherency on PCIe (the device or the root complex). On PCI busses, the device has to use the correct cache flushing bus transactions. On PCI/PCIe bus systems, main memory is pretty much always cache coherent with devices, at least on x86 processors.

So when the processor goes to access your non-cached shared region, it’s going to have horrible performance compared to normal cached memory. One of the big reasons you want a device interface through shared memory, which the device does DMA to, is so the processor can access that memory with high performance.

I think you found an interesting clue, but I am not sure you have really arrived at a correctly functioning, high-performance final solution. I would go ask the firmware/FPGA folks if they REALLY expect the shared memory to not be cached. You say it works now, but by using non-cached memory you have dramatically changed the timing of things, and perhaps it only works because things are much slower.

Jan

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@t-online.de
Sent: Thursday, January 15, 2015 8:47 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] My KMDF confuses the PCIe firmware


STOP! I was fooling myself, sorry.

I had made my scatter/gather table much smaller because MmAllocateContiguousMemorySpecifyCache() failed with the standard size. I have now seen that the bug is gone only because the whole buffer is much smaller.
I then tried WdfCommonBufferCreate() again with the smaller SG table, and this also works. AND: I then tried MmAllocateContiguousMemorySpecifyCache() with the large SG table (with more RAM installed) => the error still occurs!

Conclusion for now: WdfCommonBufferCreate() works fine. But there’s a problem when the SG table is larger than 16 entries.

So I still do not have a clue and am desperately searching for ideas. What can I try or check?

Here is a short recap of my environment:

  • my PCIe device (Xilinx Virtex-7 based) provides a scatter/gather DMA engine
  • the engine uses 2 common buffers:
    1. The Head table, where the hardware writes the current SG index. This is read by the driver.
    2. The SG table:
       • contains a number of descriptors (16, 1024, 4096 or 16384). We need the 16384 configuration.
       • one descriptor is 16 bytes long
       • each descriptor contains the target data address (each one is used for at most 4 kB)
       • the hardware READS the target data address from the descriptor AND WRITES the actual length of the data written to the target
       • thus the descriptor is both read and written by the hardware (to update the length written).

The DMA engine automatically scans through this SG table and wraps after the configured number of descriptors.
The software sets the target addresses inside the descriptors and updates a tail pointer (hardware register) to tell the engine the valid range.
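
The wrap-around described above can be sketched in plain user-mode C. This is only an illustration of the index arithmetic, not the driver's code: the names RING_ENTRIES, ring_next and ring_completed are hypothetical, and it assumes the head index written by the hardware and the driver's own consumer index both count descriptors from 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring size; the post lists 16, 1024, 4096 or 16384. */
#define RING_ENTRIES 16384u

/* A power-of-two size keeps the modulo arithmetic below cheap and exact. */
static_assert((RING_ENTRIES & (RING_ENTRIES - 1u)) == 0u,
              "ring size is assumed to be a power of two");

/* Advance an index by one descriptor, wrapping at the end of the table. */
static uint32_t ring_next(uint32_t index)
{
    return (index + 1u) % RING_ENTRIES;
}

/* Number of descriptors between the driver's consumer index and the
   head index reported by the hardware, taking the wrap into account. */
static uint32_t ring_completed(uint32_t sw_index, uint32_t hw_head)
{
    return (hw_head + RING_ENTRIES - sw_index) % RING_ENTRIES;
}
```

For example, with a consumer index of 16380 and a hardware head of 4, ring_completed() yields 8 descriptors, because the head has wrapped past the end of the table.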

Thanks very much for any hint!

xxxxx@t-online.de wrote:

Here I shortly repeat my environment:

  • my PCIe Device (Xilinx-Virtex 7 based) provides a scatter/gather DMA engine

Whose PCIe IP is it? Is it from Xilinx?

  • the engine uses 2 common buffers:
  1. The Head table, where the hardware writes the current SG index. This is read by the driver.
  2. The SG Table
  • contains a number of descriptors (16, 1024, 4096 or 16384). We need the 16384 config.
  • one descriptor has 16 byte
  • each descriptor contains the target data address (each one is used for max 4kB)
  • The hardware READS the target data address from the descriptor AND WRITES the actual length of data written to the target.
  • Thus, the descriptor is read by the hardware as well as it is written (update the length written).

And the problem, as I recall, is that you aren’t able to see the
hardware’s update in the driver? Are you just dereferencing a pointer,
or are you using READ_REGISTER_ULONG? If you are just dereferencing a
pointer, did you declare it “volatile”, so the compiler doesn’t optimize
away the reads?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I have 2 different hardware devices. The newer one uses the Xilinx Virtex-7 FPGA Gen3 Integrated Block for PCI Express v3.0.
The other hardware uses a 3rd-party IP core; I have to ask for its name. I was told its vendor works closely with Xilinx, so it seems possible that both have similar features/bugs.

Yes, the original problem was that the “length written”, which is updated by the hardware, does not end up in the right place but somewhere in Nirvana - the computer crashes after 1 second to 5 minutes.
Then I tried the newer hardware (the one with the “Integrated Block for PCI Express v3.0”). That seemed to work fine at first sight. But after a short time I saw the firmware making errors that the developer cannot explain. These errors do not occur with the Linux driver.
So I have 2 devices and 2 PCs. In all combinations the Linux driver does not show any bug at all, and in all combinations my KMDF driver fails in a very mysterious way. Or better: it seems the firmware of the PCIe device goes wrong…

What value does your hardware set for the No Snoop attribute in the TLP header? DO NOT set it to 1.

@Alex: Thank you. I’ll consult the FPGA developer tomorrow (here it is 9:00pm now). And yes, the HW is 64bit.

> Instead of using a WdfCommonBuffer for the DMA control structures an Scatter/Gather table

Are you sure you provided correct init params for the common buffer?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

// Create a new DMA object for scatter/gather DMA mode.
WDF_DMA_ENABLER_CONFIG_INIT(
    &dmaConfig,
    // only Read direction in 1st version
    WdfDmaProfileScatterGather64,
    // 16MiB read request buffer, i.e. 4096 SG elements of 4kB = 1/4 SG table
    SG_CONFIG.IrpBufferSize);

// Set the maximum allowable DMA scatter/gather list fragmentation size.
WdfDmaEnablerSetMaximumScatterGatherElements(
    pDeviceExtension->WdfDmaEnabler,
    SG_CONFIG.NumDescEntries); // is 16384

// Create the common buffer for the DMA scatter/gather structures for zFAS
WdfCommonBufferCreate(pDeviceExtension->WdfDmaEnabler,
    sizeof(zFas_CommonBuffer), // see below
    WDF_NO_OBJECT_ATTRIBUTES, &pDeviceExtension->DmaCommonBuffer);

… zFas_CommonBuffer is
/**
 * DMA Head struct for zFas Monitor
 */
typedef struct zFas_DMA_HEAD_t {
    ULONG Rx0Head;
    ULONG fill0;
    ULONG RxHead;
    ULONG fill1;
} zFas_DMA_HEAD, *PzFas_DMA_HEAD;

/**
 * SG Descriptor for zFas Monitor
 */
typedef struct zFas_DMA_SG_DESCR_t {
    ULONG opts1;
    ULONG opts2;
    ULONG TargetLow;
    ULONG TargetHigh;
} zFas_DMA_SG_DESCR, *PzFas_DMA_SG_DESCR;

// all together in a common buffer
typedef struct zFas_CommonBuffer_t {
    zFas_DMA_HEAD Head;
    ULONG Gap1[1024 - (sizeof(zFas_DMA_HEAD)/sizeof(ULONG))]; // margin to next page
    zFas_DMA_SG_DESCR Desc[16384];
} zFas_CommonBuffer, *PzFas_CommonBuffer;
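
The layout above can be checked at compile time outside the kernel. The following is a user-mode sketch only: it substitutes uint32_t for the Windows ULONG, assumes a 4096-byte page (PAGE_BYTES is a made-up name), and desc_logical_address is a hypothetical helper that mirrors the FIELD_OFFSET arithmetic used earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ULONG;            /* stand-in for the 32-bit Windows ULONG */
#define PAGE_BYTES 4096u           /* assumed page size */

typedef struct zFas_DMA_HEAD_t {
    ULONG Rx0Head;
    ULONG fill0;
    ULONG RxHead;
    ULONG fill1;
} zFas_DMA_HEAD;

typedef struct zFas_DMA_SG_DESCR_t {
    ULONG opts1;
    ULONG opts2;
    ULONG TargetLow;
    ULONG TargetHigh;
} zFas_DMA_SG_DESCR;

typedef struct zFas_CommonBuffer_t {
    zFas_DMA_HEAD Head;
    ULONG Gap1[1024 - (sizeof(zFas_DMA_HEAD) / sizeof(ULONG))];
    zFas_DMA_SG_DESCR Desc[16384];
} zFas_CommonBuffer;

/* One descriptor is 16 bytes, as stated in the thread. */
static_assert(sizeof(zFas_DMA_SG_DESCR) == 16, "descriptor must be 16 bytes");
/* Head (16 bytes) plus Gap1 (1020 * 4 bytes) fills exactly one page. */
static_assert(offsetof(zFas_CommonBuffer, Desc) == PAGE_BYTES,
              "SG table must start on the second page");
/* One page for the head plus 16384 * 16 bytes for the table. */
static_assert(sizeof(zFas_CommonBuffer) == PAGE_BYTES + 16384u * 16u,
              "unexpected padding in the common buffer");

/* Device-visible address of descriptor i, given the buffer's base
   logical address. */
static uint64_t desc_logical_address(uint64_t base, uint32_t i)
{
    return base + offsetof(zFas_CommonBuffer, Desc)
                + (uint64_t)i * sizeof(zFas_DMA_SG_DESCR);
}
```

If any of the static assertions fired, the compiler would be inserting padding that the hardware does not know about, which would be one way for descriptor updates to land at unexpected addresses.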

xxxxx@t-online.de wrote:

… zFas_CommonBuffer is
/**
* DMA Head struct for zFas Monitor
*/
typedef struct zFas_DMA_HEAD_t{
ULONG Rx0Head;
ULONG fill0;
ULONG RxHead;
ULONG fill1;
}zFas_DMA_HEAD, *PzFas_DMA_HEAD;

Any field that can be changed behind your back should be declared volatile.

typedef struct zFas_DMA_HEAD_t{
volatile ULONG Rx0Head;
ULONG fill0;
volatile ULONG RxHead;
ULONG fill1;
}zFas_DMA_HEAD, *PzFas_DMA_HEAD;

Otherwise, the compiler is allowed to assume that whatever you wrote
there will still be there the next time you ask. It’s not required to
read from memory.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

To add to what Tim said, even if you declare something volatile, this is
ONLY a hint to the compiler, not the processor. Due to speculative execution
it does NOT ensure the processor will access memory in the order of your
source code. To force a specific run-time execution ordering you need to
make sure there are appropriate execution fence operations.

Modern processors essentially execute down all possible undetermined
branches in parallel, and when the correct branch is determined, the
execution results from the wrong branches are discarded. The impact of
this is memory reads may be reordered, unless you take action to force a
specific ordering. The processor knows how to execute a single thread “as
if” it was executed in the expected order. The instant you have other
processor cores executing threads in parallel, or external devices doing
DMA, the processor no longer can enforce this “as if” ordering.
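
A rough user-mode illustration of the ordering point, using C11 atomics rather than any kernel API: the head index is loaded with acquire semantics so that the descriptor read that follows cannot be moved ahead of it. The types dma_head_t and sg_desc_t and the function read_opts1_at_head are hypothetical, and the sketch treats the head as a C11 atomic, which is a simplification; memory written by a device through DMA is outside the C memory model, and a real driver would rely on the kernel's own barrier primitives (for example KeMemoryBarrier()) instead.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical head structure: the index is published by another agent. */
typedef struct {
    _Atomic uint32_t head;
} dma_head_t;

/* Hypothetical descriptor with the same four 32-bit fields as in the thread. */
typedef struct {
    uint32_t opts1, opts2, target_low, target_high;
} sg_desc_t;

/* Load the head index first, then read the descriptor it points at.
   The acquire load keeps the descriptor read from being reordered
   before the head read, by the compiler or by the processor. */
static uint32_t read_opts1_at_head(dma_head_t *h, const sg_desc_t *ring,
                                   uint32_t entries)
{
    uint32_t head = atomic_load_explicit(&h->head, memory_order_acquire);
    return ring[head % entries].opts1;
}
```

A plain volatile read of the head would stop the compiler from caching the value, but on its own it would not give the ordering guarantee that the acquire load expresses here.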

EVERYBODY doing driver development should watch Herb Sutter’s talks on
concurrency issues, see
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

Jan

On 1/16/15, 1:05 AM, “Tim Roberts” wrote:


NO SNOOP - That’s it!

Hi,

Thanks all! The EnableNoSnoop=0 setting in the PCIE_DeviceControl did it for my new hardware. I have now tested for 2 hours and have not seen the mysterious bugs again.
OK, on the other (older) hardware the bug did not disappear; I suspect there is still another issue with it. But I now have an environment to work with, and I am really excited!
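
For readers who want to see what this setting amounts to at the register level, here is a small sketch. It assumes that EnableNoSnoop corresponds to the Enable No Snoop bit (bit 11) of the PCI Express Device Control register; the macro and function names are hypothetical, and the code only manipulates a 16-bit value, it does not access configuration space.

```c
#include <assert.h>
#include <stdint.h>

/* Enable No Snoop is bit 11 of the PCI Express Device Control register. */
#define PCIE_DEVCTL_ENABLE_NO_SNOOP (1u << 11)

static_assert(PCIE_DEVCTL_ENABLE_NO_SNOOP == 0x0800u, "bit 11 expected");

/* Return the register value with Enable No Snoop cleared, so the device
   may not set the No Snoop attribute on the requests it issues. */
static uint16_t devctl_disable_no_snoop(uint16_t devctl)
{
    return (uint16_t)(devctl & ~PCIE_DEVCTL_ENABLE_NO_SNOOP);
}

/* Report whether Enable No Snoop is currently set. */
static int devctl_no_snoop_enabled(uint16_t devctl)
{
    return (devctl & PCIE_DEVCTL_ENABLE_NO_SNOOP) != 0;
}
```

With the bit cleared, the device's DMA accesses are snooped and the platform keeps them coherent with the CPU caches, which fits the observation that cached common buffers started to behave.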

@Tim and Jan (volatile):
Yes, at least the Head members should be volatile; thank you for the valuable hint. I had simply forgotten it. It was not the reason for the problems, but of course I have corrected it. Access to the BARs is done using WRITE_REGISTER_ULONG/READ_REGISTER_ULONG.

@Jan (Cache/NonCache)
I am not an expert in PC hardware or PCIe. But I tend to believe the hardware developer when he says that there is not much to gain from using the cache for reading and writing the scatter/gather control structures (i.e. the CommonBuffer). The target data, of course, should be cached and is cached.
During a running DMA, for each 4096 bytes of payload transferred from/to the target buffer, the hardware DMA engine has to read 8 bytes from the SG descriptor and then write 2 bytes to a different location within the same descriptor. Finally it sets the head index, 2 bytes, at a different location. The CPU needs to read the head index and write to the descriptors only once per 4096 descriptors.
My takeaway is that the PC hardware ensures coherence and thus it is best to use the “system default”.
Am I wrong?

Nope. Not in my book you’re not. That’s how I do it as well for common buffer DMA control structures.

And now I’m off to watch the talk on concurrency suggested by Mr. Bortorff…

Peter
OSR
@OSRDrivers

xxxxx@t-online.de wrote:

@Tim and Jan (volatile):
Yes at least the Head members should be volatile, thank you for the worthwhile hint. I’ve just forgot it. It was not the reason for the problems but of course I have corrected it. Access to the BARs is done using WRITE_REGISTER_ULONG/READ_REGISTER_ULONG.

For what it’s worth, if you religiously use WRITE_REGISTER_ULONG and
READ_REGISTER_ULONG, then you don’t need the separate “volatile”
declarations. Those macros/functions (depending on processor) cast to
volatile pointers on your behalf.
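
To illustrate the idea, a register accessor that does the volatile cast internally could look roughly like the sketch below. This is not the WDK implementation; read_reg32 and write_reg32 are hypothetical names, and the real macros may add barriers or other platform-specific details.

```c
#include <stdint.h>

/* Read a 32-bit register through a volatile-qualified pointer so the
   compiler must perform the load every time it is called. */
static inline uint32_t read_reg32(const volatile uint32_t *reg)
{
    return *reg;
}

/* Write a 32-bit register through a volatile-qualified pointer so the
   compiler cannot drop or merge the store. */
static inline void write_reg32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}
```

Because the qualifier sits on the accessor's parameter, the caller's own pointer does not need to be declared volatile, as long as every access goes through these functions.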


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.