How to get physical pages from user space for a DMA transaction.

Hi Tim,

It depends on what you need to send. If you need to transfer from user-mode
buffers, then you don’t want to use a common buffer. You should build a DMA
transaction from the WDFREQUEST, and build descriptors out of the scatter/gather
list it gives you.

Yes, I need to transfer from user-mode buffers. I initialize the DMA transaction as follows:

WdfDmaTransactionInitializeUsingRequest(
devExt->WriteDmaTransaction,
Request,
EvtProgramWriteDma,
WdfDmaDirectionWriteToDevice);

The call that initializes the DMA transaction succeeds.
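(The callback is not invoked until the transaction is executed; a minimal sketch of that step, assuming the same transaction object created earlier:)

// EvtProgramWriteDma fires only after the transaction is executed;
// WDF_NO_CONTEXT means no extra context is passed to the callback.
status = WdfDmaTransactionExecute(devExt->WriteDmaTransaction,
                                  WDF_NO_CONTEXT);
if (!NT_SUCCESS(status)) {
    // On failure, release the transaction and complete the request as appropriate.
    WdfDmaTransactionRelease(devExt->WriteDmaTransaction);
}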

In the callback I get a PSCATTER_GATHER_LIST, SgList:

EvtProgramWriteDma(
IN WDFDMATRANSACTION Transaction,
IN WDFDEVICE Device,
IN PVOID Context,
IN WDF_DMA_DIRECTION Direction,
IN PSCATTER_GATHER_LIST SgList
)

What I am doing now is programming the descriptor using SgList->Elements[0].Address.LowPart.
Is that the correct method?

Thanks,
Kishan Patel

xxxxx@gmail.com wrote:

In the callback I get a PSCATTER_GATHER_LIST, SgList:

EvtProgramWriteDma(
IN WDFDMATRANSACTION Transaction,
IN WDFDEVICE Device,
IN PVOID Context,
IN WDF_DMA_DIRECTION Direction,
IN PSCATTER_GATHER_LIST SgList
)

What I am doing now is programming the descriptor using SgList->Elements[0].Address.LowPart.
Is that the correct method?

When you called WdfDmaEnablerCreate, did you declare yourself as
WdfDmaProfileScatterGather?  If so, then the system should copy any user
buffers that are above 4GB down into low memory, and this is the correct
process.
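
For reference, a minimal sketch of creating the enabler that way (DevExt and MAX_TRANSFER_LENGTH are placeholder names, not from your code):

WDF_DMA_ENABLER_CONFIG dmaConfig;
NTSTATUS status;

// Declare scatter/gather capability; the framework/HAL will bounce any
// buffer pages the device cannot reach.
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather,
                            MAX_TRANSFER_LENGTH);   // placeholder maximum transfer length

status = WdfDmaEnablerCreate(Device,
                             &dmaConfig,
                             WDF_NO_OBJECT_ATTRIBUTES,
                             &DevExt->DmaEnabler);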


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi Tim,

I tried it and it is working, but I found one issue.

After initializing the DMA transaction I received the EvtProgramWriteDma callback, and in that callback I used the PSCATTER_GATHER_LIST.

WdfDmaTransactionInitialize(devExt->ReadDmaTransaction,
                            EvtProgramReadDma,
                            WdfDmaDirectionReadFromDevice,
                            mdl,
                            virtualAddress,
                            1024);    // length

WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather,
                            4096);

The problem is that to generate an interrupt I need to send at least 1024 bytes in a descriptor, but in the callback SgList->NumberOfElements is 2, and the 1024 bytes are split at some seemingly arbitrary point. Is there any way I can fix this so it goes out as a single transfer?

Thanks,
Kishan Patel

xxxxx@gmail.com wrote:

I tried it and it is working, but I found one issue.

After initializing the DMA transaction I received the EvtProgramWriteDma callback, and in that callback I used the PSCATTER_GATHER_LIST.

WdfDmaTransactionInitialize(devExt->ReadDmaTransaction,
                            EvtProgramReadDma,
                            WdfDmaDirectionReadFromDevice,
                            mdl,
                            virtualAddress,
                            1024);    // length

WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather,
                            4096);

The problem is that to generate an interrupt I need to send at least 1024 bytes in a descriptor…

Who is designing this hardware?  Have these hardware engineers never
actually used their systems for real work?

…but in the callback SgList->NumberOfElements is 2, and the 1024 bytes are split at some seemingly arbitrary point. Is there any way I can fix this so it goes out as a single transfer?

You should have been able to figure this out by looking at the entries. 
If your 1024-byte region happens to fall across a page boundary, then it
lives in two different pages and will require two scatter/gather
entries.  Now, you can check to see if those two pages are consecutive,
and combine them into one descriptor if they are, but if the buffer
really does cross two separated pages, then it cannot be done with one
descriptor.
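
Roughly along these lines; this is only a sketch, and WriteDescriptor / DevExt are hypothetical stand-ins for however you program your hardware descriptors:

// Walk the SG list and merge elements whose physical ranges happen to be
// contiguous, so a buffer that merely straddles a page boundary can still
// be described by a single hardware descriptor.
PHYSICAL_ADDRESS runStart  = SgList->Elements[0].Address;
ULONG            runLength = SgList->Elements[0].Length;
ULONG            i;

for (i = 1; i < SgList->NumberOfElements; i++) {
    if (runStart.QuadPart + runLength == SgList->Elements[i].Address.QuadPart) {
        runLength += SgList->Elements[i].Length;      // contiguous: extend the run
    } else {
        WriteDescriptor(DevExt, runStart, runLength); // flush the accumulated run
        runStart  = SgList->Elements[i].Address;
        runLength = SgList->Elements[i].Length;
    }
}
WriteDescriptor(DevExt, runStart, runLength);         // flush the final run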

You might need to allocate your own “bounce buffer” and copy the small
user buffers in there.
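
Something along these lines, again only as a sketch; BOUNCE_BUFFER_SIZE and the field names are assumptions:

// Allocate a device-reachable common buffer once (e.g. at device start) and
// reuse it as a bounce buffer for small user transfers.
WDFCOMMONBUFFER  bounceBuffer;
NTSTATUS         status;

status = WdfCommonBufferCreate(devExt->DmaEnabler,
                               BOUNCE_BUFFER_SIZE,        // assumed size
                               WDF_NO_OBJECT_ATTRIBUTES,
                               &bounceBuffer);
if (NT_SUCCESS(status)) {
    PVOID            va = WdfCommonBufferGetAlignedVirtualAddress(bounceBuffer);
    PHYSICAL_ADDRESS la = WdfCommonBufferGetAlignedLogicalAddress(bounceBuffer);

    // For a write: copy the user data into 'va', then program one
    // descriptor using 'la' and the transfer length.
}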


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Windows will automagically combine adjacent pages into a single SGL entry.

This. Unfortunately. I’ve had to do it at times, too.

Peter
OSR
@OSRDrivers

> Who is designing this hardware? Have these hardware engineers never actually used their systems for real work?

Well, up to the moment this thread grew to 36 posts we had discussed several approaches to handling common-buffer DMA, the issues that have to be taken into consideration while implementing some of them, as well as certain aspects of the Windows security model. To make it even more exciting, I got a free pass for making an inflammatory, rude, arrogant and socially irresponsible post from “The Hanging Judge”, which does not seem to be happening every other day.

At this point the OP made post No. 37, effectively revealing that, in actuality, he had no idea about the capabilities of the piece of hardware that he was/is attempting to write a driver for. Do you really think that ANY statement made by the OP is worth taking into serious consideration under these circumstances?

Anton Bassov

Hi,

I have implemented scatter/gather and tested it, and it is working properly, but I am still not getting good throughput.

//Initialization
WdfDeviceSetAlignmentRequirement(DevExt->Device,
                                 FILE_64_BYTE_ALIGNMENT);
{
    WDF_DMA_ENABLER_CONFIG dmaConfig;
    WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                                WdfDmaProfileScatterGatherDuplex,
                                (2*1024*1024));

    status = WdfDmaEnablerCreate(DevExt->Device,
                                 &dmaConfig,
                                 WDF_NO_OBJECT_ATTRIBUTES,
                                 &DevExt->DmaEnabler);
    if (!NT_SUCCESS(status)) {
        KdPrintEx((DPFLTR_DEFAULT_ID, DPFLTR_ERROR_LEVEL,
                   "\n WdfDmaEnablerCreate failed. status: %ld\n", status));
        return status;
    }

    WdfDmaEnablerSetMaximumScatterGatherElements(
        DevExt->DmaEnabler,
        256);
}

Now the problem: using Jungo’s WDC_DMALock and its related functions I can read/write 8 MB within 15 ms, whereas with this method a write takes 90 to 110 ms. What could be the reason?

Thanks,
Kishan Patel

On Oct 10, 2018, at 8:14 PM, kishan_patel wrote:
>
> Now the problem: using Jungo’s WDC_DMALock and its related functions I can read/write 8 MB within 15 ms, whereas with this method a write takes 90 to 110 ms. What could be the reason?

How do you know? How, exactly, are you measuring this? From where to where, using what clock? How many lanes and which PCIe generation is your device? You’re saying Jungo got 500MB/s. That’s quite a claim.

Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Hi Tim,

I am using clock_t to measure time.

How do you know? How, exactly, are you measuring this?
clock_t cl;
cl = clock();
//write fn
cl = clock() - cl;

and computing the elapsed time as cl / CLOCKS_PER_SEC.

How many lanes and which PCIe generation is your device? You’re saying Jungo got 500MB/s. That’s quite a claim.
The device is PCIe generation 2.0 with 4 lanes.

For reads and writes I dispatch requests sequentially:
WDF_IO_QUEUE_CONFIG_INIT(&queueConfig,
WdfIoQueueDispatchSequential);

and the application writes/reads the data using the WriteFile/ReadFile functions.

Thanks,
Kishan Patel

I can see at least three potential problems here:

First, how many times do you issue the call from user space and then average the results? Your code basically implies a single call; if that is the case, so many things can impact the number that it is worthless. I would try 10 million calls and take the average.

Second, you are using a sequential-queue call model, so your performance will be impacted significantly by the system-call overhead and by the locking needed to make things sequential.

Third, you are stating this is a DMA problem, but have you measured the same call model without actually doing the DMA or its setup?

I have converted a number of Jungo drivers for clients in the past; in no case did I find that Jungo was faster. Throw in the number of bugs and limitations in Jungo, and there are many good reasons to junk any driver that uses it.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

First, how many times do you issue the call from user space and then average the results? Your code basically implies a single call; if that is the case, so many things can impact the number that it is worthless. I would try 10 million calls and take the average.
Second, you are using a sequential-queue call model, so your performance will be impacted significantly by the system-call overhead and by the locking needed to make things sequential.
Third, you are stating this is a DMA problem, but have you measured the same call model without actually doing the DMA or its setup?

I don’t mean it’s a DMA problem; the problem is the way the data is handled. I am able to transfer data using the scatter/gather method, and my diagnostic application is sending 8 MB of data in different chunk sizes (512 KB * 16, 1 MB * 8, 2 MB * 4, 8 MB * 1).
My device supports up to 256 descriptors (MAX_DESC), so I can send 1 MB of data on every iteration. The code is working fine, but when it comes to throughput I don’t understand why it is so low.

On every iteration I send an aligned virtual address, and in the read/write queue I initialize the DMA transaction using an MDL chain, with this virtual address as one of the arguments. Once it runs I get the scatter/gather list and the number of transfer elements, and I send the descriptor once it is full.

Can you help me find out where the problem is? I suspect it could be either too many round trips between user mode and the driver, or the sequential queue I am using.

Regards,
Kishan Patel

> I can see at least three potential problems here:

First, how many times do you issue the call from user space and then average the results? Your code basically implies a single call; if that is the case, so many things can impact the number that it is worthless. I would try 10 million calls and take the average.
My diagnostic application is sending data in the following pattern:
(512 KB * 16, 1 MB * 8, 2 MB * 4, 8 MB * 1)

In the current driver I have enabled the scatter/gather functionality and I can send at most 1 MB of data, because the device supports up to 256 descriptors (MAX_DESCRIPTOR) of 4K page size. Running the diagnostic application against this gives me only around 220 MB/s (the maximum I have seen). Jungo is doing the same thing, except for WD_DMALock() (how that function obtains physical addresses, I don’t know).

Second, you are using a sequential-queue call model, so your performance will be impacted significantly by the system-call overhead and by the locking needed to make things sequential.
What can be used in place of this sequential call?
Would an inverted IOCTL be helpful?

Third, you are stating this is a DMA problem, but have you measured the same call model without actually doing the DMA or its setup?
I am not saying DMA is the problem; the way the DMA transfer is handled is the problem. I measured this with DMA in place.

Please let me know a better way to handle this.

Regards,
Kishan Patel

On Oct 11, 2018, at 2:08 AM, kishan_patel wrote:
>
> I am using clock_t to measure time.

There’s your problem. The “clock” function is only updated once per scheduler interval, which is once every 16ms. If you’re measuring items for multiple seconds, it works fine, but for anything less than about 150ms, it is meaningless. You need to be using QueryPerformanceCounter or std::chrono::steady_clock.
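
A minimal sketch of the higher-resolution approach (Win32; the handle and buffer names in the comment are placeholders for your write function):

LARGE_INTEGER freq, start, stop;
double elapsedMs;

QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);

// write fn, e.g. WriteFile(hDevice, buffer, length, &bytesWritten, NULL);

QueryPerformanceCounter(&stop);
elapsedMs = (double)(stop.QuadPart - start.QuadPart) * 1000.0 / (double)freq.QuadPart;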

>> How do you know? How, exactly, are you measuring this?
>
> clock_t cl;
> cl = clock();
> //write fn
> cl = clock() - cl;

How do you know that the write function waited until the write was complete? The whole purpose of DMA is to let the transfer complete while the application moves on to something else. In other words, you may have just been measuring the submission time, not the transfer time.

Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Hi Tim,

I tried with QueryPerformanceCounter and I am still measuring lower throughput.

In my code I use DMA transactions, and in each transaction I send 512 KB. Jungo does the same thing with its WD_DMALock() function (I don’t know its internals).

In my case I send a 512 KB buffer on every iteration using an IOCTL. Inside the driver I get the MDL for the buffer, use that MDL to initialize the DMA transaction, and execute it with WdfDmaTransactionExecute.
Note: I am using a synchronous (sequential) queue for the IOCTL.

My question is: why does it take more time compared to WD_DMALock() and its friends?

Regards,
Kishan Patel

kishan_patel wrote:

I tried with QueryPerformanceCounter and I am still measuring lower throughput.

In my code I use DMA transactions, and in each transaction I send 512 KB. Jungo does the same thing with its WD_DMALock() function (I don’t know its internals).

In my case I send a 512 KB buffer on every iteration using an IOCTL. Inside the driver I get the MDL for the buffer, use that MDL to initialize the DMA transaction, and execute it with WdfDmaTransactionExecute.

Note: I am using a synchronous (sequential) queue for the IOCTL.

My question is: why does it take more time compared to WD_DMALock() and its friends?

There’s still no way for us to tell whether you’re actually measuring
the transfer time, or just measuring the ioctl time. Remember, the only
fair comparison is to time from the point the transfer is submitted
until the time the transfer is known to be complete.  The actual bus
throughput is going to be the same no matter how you start it.  That’s
what leads me to think you have a measurement issue, not a speed issue.

Can you help me find out where the problem is?

The problem is that your “measurements” do not make any sense - just like everything else that you said on this thread, they are simply nonsensical…

DMA is asynchronous by its very definition - the CPU instructs the device to start a transfer and goes about its own business in the meantime. When the transfer is complete, or if an error occurs, the device informs the CPU by raising an interrupt.

Now let’s look at the code you have presented:

clock_t cl;
cl = clock();
//write fn
cl = clock() - cl;

Assuming that the DMA transfer is complete by the time your write_fn() returns, this invariably implies that write_fn() makes a blocking call behind the scenes to WaitForXXX() and friends, so that other threads run on the CPU before your thread gets a chance to resume execution. Therefore, the delay between the opening and closing measurements includes the delay introduced by the scheduler, which may be (and normally is) SIGNIFICANTLY longer than the time the DMA operation itself takes.

Therefore, your measurements are so imprecise that they become simply meaningless. The same thing holds true for your measurements of both read and write transfers…

Anton Bassov

Thank you, Anton Bassov and Tim. :)

Regards,
Kishan Patel

> Remember, the only fair comparison is to time from the point the transfer is submitted until the time the transfer is known to be complete.

And this is an excellent reason to use ETW tracing, the application in user-mode can write events, and the kernel driver can write events, and both event sources can write into the same trace session. It’s potentially possible to use correlation ids to tag the user-mode events with the same correlation id tag used in kernel mode, so when you analyze the trace, and sort events by correlation id+timestamp, you will see the flow of requests all the way from user-mode, through the driver, and back to user-mode, with a fraction of a microsecond resolution. And doing this ETW trace may take as little as 100 lines of code. Having good data helps a lot when analyzing performance issues.
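
As one hedged illustration (TraceLogging is just one of several ways to emit ETW events, and the provider name, GUID, and helper function below are made up for the example), the user-mode side might look roughly like this:

#include <windows.h>
#include <TraceLoggingProvider.h>

// Hypothetical provider; substitute your own name and GUID, and call
// TraceLoggingRegister(g_hProvider) at startup / TraceLoggingUnregister at exit.
TRACELOGGING_DEFINE_PROVIDER(
    g_hProvider,
    "MyCompany.DmaPerfTrace",
    (0x3f4d9a12, 0x6b7c, 0x4e1d, 0x9a, 0x2b, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66));

void TraceSubmit(UINT64 correlationId, UINT32 length)
{
    // Emitted just before the request is submitted to the driver; the kernel
    // side logs a matching event carrying the same correlation id.
    TraceLoggingWrite(g_hProvider,
                      "DmaSubmit",
                      TraceLoggingUInt64(correlationId, "CorrelationId"),
                      TraceLoggingUInt32(length, "Length"));
}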

Jan
 

Jan_Bottorff wrote:

And this is an excellent reason to use ETW tracing, the application in user-mode can write events, and the kernel driver can write events, and both event sources can write into the same trace session. It’s potentially possible to use correlation ids to tag the user-mode events with the same correlation id tag used in kernel mode, so when you analyze the trace, and sort events by correlation id+timestamp, you will see the flow of requests all the way from user-mode, through the driver, and back to user-mode, with a fraction of a microsecond resolution. And doing this ETW trace may take as little as 100 lines of code. Having good data helps a lot when analyzing performance issues.

Have you ever written a blog post that summarizes ETW tracing for those
of us who are still stuck in the KdPrint world?  You had an [ntdev] post
a year or two ago which I had flagged, but it’s awfully easy to lose those.

Have you ever written a blog post that summarizes ETW tracing for those
of us who are still stuck in the KdPrint world?

Awesome idea.

I’d be more than happy to put a guest article in The NT Insider if you were willing to do that!

Peter