KMDF PCIE driver, DMA failure

Hi guys,

I am developing a KMDF PCIe driver; I based my code on the PLX9X5X sample.

The hardware platform is a Microsemi FPGA. It comes with sample firmware for a DMA controller that takes a logical address, a transfer length, and a direction to start a transfer. Here is a summary of the hardware:

  • Supports only 32-bit addressing
  • 8KB buffer for DMA operations
  • Supports hardware scatter/gather, 4 elements (application buffer is 8KB)

From the application's point of view, DMA transfers are done by calling the WriteFile and ReadFile APIs.

When I first wrote the driver, the buffer access method for I/O operations (WriteFile, ReadFile) was Buffered I/O and the DMA enabler was configured with WdfDmaProfileScatterGather. With this configuration my driver works fine and performs DMA transfers correctly.

Later I realized that Buffered I/O reduces driver performance by copying data from user memory to kernel memory. Also, because my hardware does not support 64-bit addressing, the framework allocates map registers (when the buffer is located above the 4 GB mark), which means it copies the buffer again to a location in the lower 4 GB of the address space so the hardware can access it.

As a first step to improve my driver, I changed the I/O method to DirectIO and decided to change the DMA enabler profile to WdfDmaProfile (because my hardware does not support 64-bit addressing). Now the device registers are configured correctly, but DMA does not start. I debugged the driver and everything seems all right: the scatter/gather list is created with only one element and the device registers are configured.

It seems to me that the DMA engine in the hardware cannot perform DMA when the buffer access method is DirectIO.

Any idea why changing the I/O access method to DirectIO could cause this issue?
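
For reference, here is roughly how I set up the enabler and start a transfer (simplified, error handling removed; MAX_TRANSFER_LENGTH, devContext and EvtProgramReadDma are placeholder names from my code):

// At device creation: create the DMA enabler for a 32-bit, scatter/gather capable device.
WDF_DMA_ENABLER_CONFIG dmaConfig;
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather,
                            MAX_TRANSFER_LENGTH);       // 8 KB in my case
status = WdfDmaEnablerCreate(device, &dmaConfig,
                             WDF_NO_OBJECT_ATTRIBUTES,
                             &devContext->DmaEnabler);

// In EvtIoRead / EvtIoWrite: one transaction per request.
status = WdfDmaTransactionCreate(devContext->DmaEnabler,
                                 WDF_NO_OBJECT_ATTRIBUTES,
                                 &dmaTransaction);
status = WdfDmaTransactionInitializeUsingRequest(dmaTransaction,
                                                 Request,
                                                 EvtProgramReadDma,
                                                 WdfDmaDirectionReadFromDevice);
status = WdfDmaTransactionExecute(dmaTransaction, WDF_NO_CONTEXT);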

xxxxx@gmail.com wrote:

As a first step to improve my driver, I changed the I/O method to DirectIO and decided to change the DMA enabler profile to WdfDmaProfile (because my hardware does not support 64-bit addressing). Now the device registers are configured correctly, but DMA does not start.

Which DMA profile did you choose? (WdfDmaProfile is not a choice.)

How do you know that DMA does not start? Do you actually know the
operation did not start, or are you just guessing that because you don’t
see any results in memory?

I hope it is clear to you that the problem has to be in your setup of
the hardware. In the driver, there is no difference at all between
buffered and direct I/O. In both cases, you get a kernel address, which
you have to convert to a physical address. An address is an address is
an address.
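
To illustrate: a typical EvtProgramDma callback just walks the scatter/gather list it is handed and pokes the hardware; it neither knows nor cares how the request's buffer was described. (The device-context and register names below are invented for illustration; substitute your FPGA's actual register layout.)

BOOLEAN
EvtProgramReadDma(
    WDFDMATRANSACTION Transaction,
    WDFDEVICE Device,
    WDFCONTEXT Context,
    WDF_DMA_DIRECTION Direction,
    PSCATTER_GATHER_LIST SgList
    )
{
    PDEVICE_CONTEXT devContext = GetDeviceContext(Device);
    ULONG i;

    UNREFERENCED_PARAMETER(Transaction);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Direction);

    for (i = 0; i < SgList->NumberOfElements; i++) {
        // Program one hardware descriptor per scatter/gather element.
        WRITE_REGISTER_ULONG(&devContext->Regs->Desc[i].AddrLo,
                             SgList->Elements[i].Address.LowPart);
        WRITE_REGISTER_ULONG(&devContext->Regs->Desc[i].AddrHi,
                             (ULONG)SgList->Elements[i].Address.HighPart);
        WRITE_REGISTER_ULONG(&devContext->Regs->Desc[i].Length,
                             SgList->Elements[i].Length);
    }

    // Kick off the transfer.
    WRITE_REGISTER_ULONG(&devContext->Regs->Control, DMA_START);

    return TRUE;
}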

I debugged the driver and everything seems all right: the scatter/gather list is created with only one element and the device registers are configured.
It seems to me that the DMA engine in the hardware cannot perform DMA when the buffer access method is DirectIO.
Any idea why changing the I/O access method to DirectIO could cause this issue?

It cannot. You must have changed something else. Perhaps you changed
the code so that you’re not triggering the DMA operation.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

In addition to what Tim said…

As a blanket statement this is not necessarily true. The overhead of Direct
I/O is not zero and in some cases this overhead can be worse than the copy.

Your hardware does not support 64-bit PHYSICAL addressing; this has nothing
to do with VIRTUAL addressing. Direct I/O makes no guarantees about where
the physical buffer is located, so you'll still need map registers for your
transfer when necessary.

-scott
OSR
@OSRDrivers


Tim,
Sorry, that was a typo; I meant WdfDmaProfilePacket.

Scott,
“As a blanket statement this is not necessarily true. The overhead of Direct I/O is not zero and in some cases this overhead can be worse than the copy.”

  • I would like to know when the overhead of Direct I/O is higher than that of Buffered I/O. I imagine one case is when the buffer is large and heavily fragmented, so that many hardware registers have to be updated before starting the DMA transfer. Is that the case?
  • If so, given that the buffer size in my application is only 8KB, can I say Direct I/O is always better than Buffered I/O in my case?

I found a way to change the addressing capability of the FPGA to 64-bit, and my hardware can now support up to 4 scatter/gather elements.
The I/O method is set to WdfDeviceIoDirect and I changed the DMA profile to WdfDmaProfileScatterGather64.
I was expecting to get one or two elements in the SG list (because PAGE_SIZE is 4KB), but I got three elements, as shown below:

SgList->Elements[0].Address.LowPart 0x6268af98
SgList->Elements[0].Address.HighPart 0n0
SgList->Elements[0].Length 0x68

SgList->Elements[1].Address.LowPart 0x624cd000
SgList->Elements[1].Address.HighPart 0n0
SgList->Elements[1].Length 0x1000

SgList->Elements[2].Address.LowPart 0x6bdac000
SgList->Elements[2].Address.HighPart 0n0
SgList->Elements[2].Length 0xf98

Can someone explain why the buffer is scattered across three memory regions?

Um, because an 8kB buffer can span three pages if it isn’t page aligned.

You have 104 bytes at the top of the first page (starting at 0:0x6268af98),
then a full page (0:0x624cd000) and finally 3992 bytes at the bottom of a
third page (0:0x6bdac000). 104 + 4096 + 3992 = 8192.
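
If you want to sanity-check it, the page-span arithmetic is the same one captured by the WDK's ADDRESS_AND_SIZE_TO_SPAN_PAGES macro; here it is reproduced as a small user-mode program (assuming 4 KB pages):

#include <stdio.h>

#define PAGE_SIZE 4096ULL
/* Same arithmetic as the WDK macro ADDRESS_AND_SIZE_TO_SPAN_PAGES. */
#define SPAN_PAGES(va, len) \
    ((((unsigned long long)(va) & (PAGE_SIZE - 1)) + (len) + PAGE_SIZE - 1) / PAGE_SIZE)

int main(void)
{
    printf("%llu\n", SPAN_PAGES(0x6268af98, 0x2000)); /* prints 3: unaligned start   */
    printf("%llu\n", SPAN_PAGES(0x6268b000, 0x2000)); /* prints 2: page-aligned start */
    return 0;
}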

Jeff


My idea for using Direct I/O is that I want to keep the same physical address for subsequent DMA transfers, so there is no need to update the device address registers for each transfer and a DMA transfer can be initiated just by writing to one register. Can I achieve this by using Direct I/O? Is this understanding correct?

Jeff,

Is it possible to create a page-aligned buffer in the application?

>Can I achieve this by using Direct I/O?

No, only with a common buffer.

Direct I/O is per-request, so each request will have different physical addresses.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

> Is it possible to create a page-aligned buffer in the application?

Yes, allocate it with VirtualAlloc.
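
Something like this in the application (the size here just matches your 8 KB transfer):

#include <windows.h>

// VirtualAlloc commits whole pages, so the returned pointer is page aligned.
void *buffer = VirtualAlloc(NULL, 8192,
                            MEM_COMMIT | MEM_RESERVE,
                            PAGE_READWRITE);

/* ... hand the buffer to ReadFile / WriteFile ... */

VirtualFree(buffer, 0, MEM_RELEASE);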


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Whether or not Direct I/O is faster than Buffered I/O depends on a lot of
factors, including the size of the transfers, the frequency of the transfers,
and whether or not your driver needs a pointer to the user buffer. 8K is a
very small transfer, so I wouldn't expect the switch to Direct I/O to show
an immediate impact on performance (though I'd be interested to see the
results of the performance analysis; in my experience, performance is never
very intuitive).

That being said, for a DMA operation Direct I/O is a more natural choice.

-scott
OSR
@OSRDrivers


xxxxx@gmail.com wrote:

My idea for using Direct I/O is that I want to keep the same physical address for subsequent DMA transfers, so there is no need to update the device address registers for each transfer and a DMA transfer can be initiated just by writing to one register. Can I achieve this by using Direct I/O? Is this understanding correct?

That is a silly micro-optimization. You’re talking about a few register
writes per transfer.

One alternative is for you to allocate a common buffer at
initialization. That way, you’re guaranteed a fixed address, and
guaranteed physically contiguous memory. The downside is an additional
copy back to the user buffer, but it’s easy to overestimate the cost of
such a copy. What’s your bandwidth?
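
If you do go that route, it is only a couple of calls at prepare-hardware time. A minimal sketch, assuming you already have the DMA enabler in your device context (names are placeholders):

WDFCOMMONBUFFER commonBuffer;
NTSTATUS status;

// 8 KB, allocated once; the logical address never changes for the life of the device.
status = WdfCommonBufferCreate(devContext->DmaEnabler,
                               8192,
                               WDF_NO_OBJECT_ATTRIBUTES,
                               &commonBuffer);
if (NT_SUCCESS(status)) {
    devContext->CommonBufferVa = WdfCommonBufferGetAlignedVirtualAddress(commonBuffer);
    devContext->CommonBufferLa = WdfCommonBufferGetAlignedLogicalAddress(commonBuffer);
}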


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

You could try _aigned_malloc with alignment set to 4096. I haven’t used it
myself but according to the documentation it should work.

Jeff


xxxxx@gmail.com wrote:

Is it possible to create a page-aligned buffer in the application?

You know, even without a special API, you can always do your own
arbitrary alignment.

To get 4096 byte alignment:
void * original = malloc( size + 4095 );
void * aligned = (void*)((ULONG_PTR)original + 4095 & ~4095);
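
(If you do this, remember to keep and eventually free the original pointer, not the aligned one.)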


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Or even use the PAGE_ALIGN macro for the second line:

void * aligned = PAGE_ALIGN(original);

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com


> You could try _aigned_malloc

Of course I meant _aligned_malloc …

Jeff

One important point for the OP:

Even though there are many ways a user-mode application can allocate a buffer with a specific alignment, your driver cannot assume that it has done so. If your design has alignment requirements, then you must check the buffer from the IRP, and if the alignment is wrong, either fail the request or have fallback logic.

Most user-mode applications do not pay attention to this, so be careful if your driver is intended to work with third-party software. If it will only work with your own proprietary software, you still need the check to avoid creating a security hole, but it is less of an issue to simply fail misaligned requests.
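
In KMDF terms, for a direct-I/O read path the check could look something like this sketch (error handling trimmed):

PMDL mdl;
NTSTATUS status;

// For direct I/O, the framework describes the caller's buffer with an MDL.
status = WdfRequestRetrieveOutputWdmMdl(Request, &mdl);
if (!NT_SUCCESS(status)) {
    WdfRequestComplete(Request, status);
    return;
}

// Fail (or fall back to an internal bounce buffer) if the buffer is not page aligned.
if (BYTE_OFFSET(MmGetMdlVirtualAddress(mdl)) != 0) {
    WdfRequestComplete(Request, STATUS_INVALID_PARAMETER);
    return;
}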
