hardware Scatter/Gather table size

Happy new year all!

While drafting the basic concept for a KMDF driver for existing PCIe hardware, I
got stuck on two major questions:

  1. Sizes of SG Table, Transfers and Transactions:
    The hardware takes an SG table with 16384 entries; each occupies 16 bytes and
    contains a 64-bit target address. Each SG element’s address points to a fixed
    4096-byte chunk in the target buffer/MDL, i.e. it addresses full pages.
    The control structure with the SG head index is maintained in another memory
    location. The head index wraps automatically at the end of the SG table.
    My plan is to create a WdfCommonBuffer of 256 MiB memory for the SG table
    (accessed by the device contiguously) plus a smaller WdfCommonBuffer for the
    head structure. The read or write requests from the application will always
    use 16 MiB buffers at 4 KB page-aligned addresses. This way I can set up DMA
    for 4 I/O requests at one time (16 MiB is 4096 chunks of 4 KB, and the SG
    table allows 4*4096 elements).

Do I have it right if I implement this as follows (see also the rough sketch after these two questions):

  • WdfDeviceSetAlignmentRequirement(…, FILE_OCTA_ALIGNMENT)
  • WDF_DMA_ENABLER_CONFIG_INIT(…, WdfDmaProfileScatterGather64,
    16*1024*1024); i.e. transfer size is 16MiB (program the interrupt to fire
    after 4096 SG elements)
  • Transaction size is the same as transfer size = 16MiB
  • WdfDmaEnablerSetMaximumScatterGatherElements(…, 16*1024); i.e., the SG
    table contains 16384 entries.
  2. Terms of “Read” or “Write”:
    The hardware always does both: it reads the SG table and the head structure and
    writes back there (it updates the head index and writes the actual length
    transferred within the 4 KB chunk to each SG element). Is that random R/W access
    to the control structures covered by the DMA abstraction?
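
To make the bullets above concrete, here is a rough sketch of what I intend to put into EvtDeviceAdd; device and devCtx are just placeholders for my device handle and context, not real code yet:

```c
// Sketch of the planned EvtDeviceAdd-time DMA setup (device/devCtx are placeholders).
#define MAX_TRANSFER_SIZE (16 * 1024 * 1024)   // one transfer = 16 MiB = 4096 pages

WDF_DMA_ENABLER_CONFIG dmaConfig;
NTSTATUS status;

// Alignment requirement for the buffers handed to the driver.
WdfDeviceSetAlignmentRequirement(device, FILE_OCTA_ALIGNMENT);

// 64-bit scatter/gather profile with a maximum transfer length of 16 MiB.
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather64,
                            MAX_TRANSFER_SIZE);

status = WdfDmaEnablerCreate(device,
                             &dmaConfig,
                             WDF_NO_OBJECT_ATTRIBUTES,
                             &devCtx->DmaEnabler);
if (!NT_SUCCESS(status)) {
    return status;
}

// The hardware SG table holds 16384 entries.
WdfDmaEnablerSetMaximumScatterGatherElements(devCtx->DmaEnabler, 16 * 1024);
```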

Thanks very much, best regards,
Thoralf

Hi all,

Updating my last post: currently the main problem seems to be about read/write direction or maybe alignment. My driver crashes after the first SG chunk of the first transfer, but the crash does not occur immediately.
The situation after this first 4KB chunk of my transfer is:

  • The DMA machine of the hardware has written the data correctly to the destination buffer, i.e. it has read the logical address from the SG table correctly.
  • The head pointer in the common buffer holding the head structure is also set correctly.
  • The result-length field in the SG table entry, which should be updated by the DMA machine, is left unchanged.
    To me this looks like the DMA machine has written the result length somewhere other than the SG table, which then causes the crash.

Is there some sample code available dealing with hardware SG DMA, i.e. a ‘hybrid DMA’? Maybe a sample could show me what I am doing wrong. I just have no idea where to start debugging.

Thank you for help!

Thoralf Baum wrote:

Happy new year all!

While drafting the basic concept for a KMDF driver for existing PCIe hardware, I
got stuck on two major questions:

  1. Sizes of SG Table, Transfers and Transactions:
    The hardware takes an SG table with 16384 entries; each occupies 16 bytes and
    contains a 64-bit target address. Each SG element’s address points to a fixed
    4096-byte chunk in the target buffer/MDL, i.e. it addresses full pages.
    The control structure with the SG head index is maintained in another memory
    location. The head index wraps automatically at the end of the SG table.
    My plan is to create a WdfCommonBuffer of 256 MiB memory for the SG table
    (accessed by the device contiguously) plus a smaller WdfCommonBuffer for the
    head structure.

I’m wondering about your math here. 16,384 x 16 is 262k bytes, not 256M
bytes.

Even 16,384 x 4,096 is only 64M bytes.

BTW, there’s no reason not to put everything into one common buffer.

The read or write requests from the application will always
use 16 MiB buffers at 4 KB page-aligned addresses. This way I can set up DMA
for 4 I/O requests at one time (16 MiB is 4096 chunks of 4 KB, and the SG
table allows 4*4096 elements).

Does the hardware allow you to consolidate contiguous addresses? That
is, does each S/G element include a length, or is it really limited to one
entry == 4096 bytes?

  2. Terms of “Read” or “Write”:
    The hardware always does both: it reads the SG table and the head structure and
    writes back there (it updates the head index and writes the actual length
    transferred within the 4 KB chunk to each SG element). Is that random R/W access
    to the control structures covered by the DMA abstraction?

When you use a common buffer, you aren’t really using the DMA
abstraction. You don’t need the system to create a scatter/gather list,
so you won’t use WdfDmaTransactionExecute. It’s up to you to do
whatever you need to do to make sure the memory view is consistent.

You WILL use the DMA transaction for the application’s buffer, and that
is presumably unidirectional.

So, you’ll do WdfDmaTransactionExecute with the application’s buffer.
In the EvtProgramDma callback, you’ll convert the WDF scatter/gather
list to your hardware’s format in the common buffer. Then, you’ll write
the common buffer address to your hardware’s register, and initiate the
transaction. When you get the “transaction complete” interrupt, you’ll
complete the transaction. The fact that the hardware also happened to
read and write your common buffer is irrelevant to the transaction.
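
In rough outline, the EvtProgramDma callback ends up looking something like the sketch below. The descriptor layout, register block, and context definitions are invented here for illustration; substitute whatever your hardware spec and driver actually define.

```c
#define HW_CTRL_START 0x1               // invented "start transfer" control bit

// Invented 16-byte descriptor layout -- replace with the real hardware definition.
typedef struct _HW_SG_ENTRY {
    UINT64 TargetAddress;               // bus ("logical") address of one 4 KB chunk
    UINT32 ResultLength;                // written back by the device after the transfer
    UINT32 Reserved;
} HW_SG_ENTRY;

typedef struct _HW_REGS {               // invented register block in a mapped BAR
    UINT64 SgTableBase;
    ULONG  ElementCount;
    ULONG  Control;
} HW_REGS;

typedef struct _DEVICE_CONTEXT {
    WDFDMAENABLER    DmaEnabler;
    PVOID            SgTableVa;         // WdfCommonBufferGetAlignedVirtualAddress
    PHYSICAL_ADDRESS SgTableLa;         // WdfCommonBufferGetAlignedLogicalAddress
    HW_REGS         *Regs;              // mapped register space
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;

WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext)

BOOLEAN
EvtProgramDma(
    _In_ WDFDMATRANSACTION Transaction,
    _In_ WDFDEVICE Device,
    _In_ WDFCONTEXT Context,
    _In_ WDF_DMA_DIRECTION Direction,
    _In_ PSCATTER_GATHER_LIST SgList
    )
{
    PDEVICE_CONTEXT devCtx = GetDeviceContext(Device);
    HW_SG_ENTRY *table = (HW_SG_ENTRY *)devCtx->SgTableVa;  // common buffer VA
    ULONG i;

    UNREFERENCED_PARAMETER(Transaction);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Direction);

    // Translate the WDF-built scatter/gather list into the hardware's
    // descriptor format inside the common buffer.
    for (i = 0; i < SgList->NumberOfElements; i++) {
        table[i].TargetAddress = (UINT64)SgList->Elements[i].Address.QuadPart;
        table[i].ResultLength  = 0;
        table[i].Reserved      = 0;
    }

    // Point the device at the common buffer and start the transfer
    // (register names are made up).
    WRITE_REGISTER_ULONG64(&devCtx->Regs->SgTableBase,
                           (ULONG64)devCtx->SgTableLa.QuadPart);
    WRITE_REGISTER_ULONG(&devCtx->Regs->ElementCount, SgList->NumberOfElements);
    WRITE_REGISTER_ULONG(&devCtx->Regs->Control, HW_CTRL_START);

    return TRUE;
}
```

When the “transaction complete” interrupt fires, the DPC calls WdfDmaTransactionDmaCompleted and completes the request as usual.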


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@t-online.de wrote:

Updating my last post: currently the main problem seems to be about read/write direction or maybe alignment. My driver crashes after the first SG chunk of the first transfer, but the crash does not occur immediately.
The situation after this first 4KB chunk of my transfer is:

  • The DMA machine of the hardware has written the data correctly to the destination buffer, i.e. it has read the logical address from the SG table correctly.
  • The head pointer in the common buffer holding the head structure is also set correctly.
  • The result-length field in the SG table entry, which should be updated by the DMA machine, is left unchanged.
    To me this looks like the DMA machine has written the result length somewhere other than the SG table, which then causes the crash.

Is there some sample code available dealing with hardware SG DMA, i.e. a ‘hybrid DMA’? Maybe a sample could show me what I am doing wrong. I just have no idea where to start debugging.

Do not discard the possibility that (a) your hardware is buggy, or (b)
you have misunderstood the hardware specs. PCIe bus master DMA is not
easy. Is your hardware supposed to handle 64-bit physical addresses?
Are you quite sure the hardware is supposed to update the S/G list in
place? It’s not impossible to imagine a design where there is a
“result” table that is separate from the “input” table.

Does the crash actually occur in your driver? If an out of control DMA
engine writes one value to a random location in physical memory, the
odds of it causing an immediate crash are very, very small. Such a
problem isn’t usually detected until LONG after the spurious write
actually occurs.

Can you share the actual !analyze -v output?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks a lot.

  • Yes, 256 kB table size, sorry. So every read/write request with a 16 MiB buffer will be set up as one transaction containing one transfer, OK.
  • The hardware is limited to one entry == 4096 bytes, but there’s no problem with that for our application.
  • The PCIe hardware is working well with a Linux driver. Since I can watch its behavior there, I am quite sure it’s working as expected.
  • It is well possible that the crash does not occur inside my driver. The debugger displayed something about the network driver, but I suspected the debugger in vain.

Unfortunately, I could not reproduce the crash for the last hour. But the result length is still not found at the place where I expect it (and where I find it with the Linux driver).
I will send the !analyze -v output as soon as I have it.

BTW: Is there anything to consider regarding alignment inside the common buffers? I have just defined my data structures using #pragma pack(4) and cast the virtual addresses to them.

Thoralf Baum

xxxxx@t-online.de wrote:

Unfortunately, I could not reproduce the crash for the last hour. But the result length is still not found at the place where I expect it (and where I find it with the Linux driver).
I will send the !analyze -v output as soon as I have it.

BTW: Is there anything to consider regarding alignment inside the common buffers? I have just defined my data structures using #pragma pack(4) and cast the virtual addresses to them.

The key is to make sure the compiler’s packing matches the hardware
packing. Usually, I use pack(1) and specify all of the padding myself,
just to avoid surprises. Have you disassembled the code to make sure
the compiler’s address arithmetic matches what you expect?
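
For example, something like this (the field names are invented; the point is the explicit layout plus the compile-time checks):

```c
#pragma pack(push, 1)
typedef struct _HW_SG_ENTRY {
    UINT64 TargetAddress;   // offset 0:  64-bit chunk address
    UINT32 ResultLength;    // offset 8:  written back by the device
    UINT32 Pad;             // offset 12: explicit padding up to 16 bytes
} HW_SG_ENTRY;
#pragma pack(pop)

// Catch layout surprises at compile time instead of on the bus.
C_ASSERT(sizeof(HW_SG_ENTRY) == 16);
C_ASSERT(FIELD_OFFSET(HW_SG_ENTRY, ResultLength) == 8);
```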


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I concede the last question :-) I did not disassemble the code, but I can see all other struct members and MMIO registers as expected, so this should not be the cause of the DMA engine’s misfire.

I will simplify my code now and do some more tests. Then I’ll return here with better-targeted questions. We’ll be back in about 17 hrs…

After some investigation, I now suspect a problem in the hardware/firmware, but I’m not sure:

  • I’ve compared the behavior with the Linux driver step by step. The only difference I found was that the logical addresses in the Linux driver were all below 4 GB.
  • So I switched the profile from WdfDmaProfileScatterGather64 to WdfDmaProfileScatterGather, presenting 32-bit addresses to the hardware - same result.

My next step is to consult the firmware developer. Do you see anything that I am missing in my investigation of the KMDF integration? It seems confusing that the PCIe device is able to read the logical addresses from the common buffer and write to two different locations there (target data and head index) successfully, but fails to write the result-length value.

Do you have any idea for deeper investigation?

xxxxx@t-online.de wrote:

After some investigation, I now suspect a problem in the hardware/firmware, but I’m not sure:

  • I’ve compared the behavior with the Linux driver step by step. The only difference I found was that the logical addresses in the Linux driver were all below 4 GB.
  • So I switched the profile from WdfDmaProfileScatterGather64 to WdfDmaProfileScatterGather, presenting 32-bit addresses to the hardware - same result.

Was your common buffer actually getting allocated above 4GB? Does the
hardware advertise support for 64-bit physical addresses? It’s hard to
imagine that some operations would work and some wouldn’t. There is a
performance penalty for dropping 64-bit support, so you should only do
that if you really need it.

Technically, I think your device would actually be
WdfDmaProfileScatterGather64Duplex, but that shouldn’t affect the common
buffer operation.

My next step is to consult the firmware developer. Do you see anything that I am missing in my investigation of the KMDF integration? It seems confusing that the PCIe device is able to read the logical addresses from the common buffer and write to two different locations there (target data and head index) successfully, but fails to write the result-length value.

No, it’s hard to come up with any reasonable software explanation for this.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hello Tim,
thank you. Yes, the CommonBuffer addresses in 64-bit mode are actually above 4 GB (Logical-Addr: 0x00000001.00790000, VirtAddr: 0xFFFFE001.70790000). The device is developed for 64-bit only.

I didn’t find a definition for the SG Duplex profile, but I “feel” I have a single-direction operation, since data is only transferred from the PCIe hardware to the PC’s memory. I would not consider the reading/updating of the control structures as defining the character of the operation…

Thoralf

Duplex allocates two adapter channels under the covers for you, one read and one write, so that you can simultaneously map independent operations (one of each type). The target use case is for a host adapter with independent tx and rx channels.
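
In configuration terms that is just a different profile value, roughly:

```c
WDF_DMA_ENABLER_CONFIG dmaConfig;

// Duplex profile: the framework keeps separate read and write adapter
// channels, so one read and one write transaction can be mapped at once.
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather64Duplex,
                            16 * 1024 * 1024);   // same maximum transfer length
```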

-p


As it seems now, a solution is in sight. Follow the new thread at http://www.osronline.com/cf.cfm?PageURL=showlists.cfm?list=NTDEV .

Thanks to all

Sorry, the link was wrong:
http://www.osronline.com/showthread.cfm?link=263554