Are Single-Copy architectures impossible in NT/2000?

Hi everyone. I am developing a device driver for a
speed-hungry device that sends and receives large amounts
of data. This made me search for ways to avoid copying
between the application and the driver. I control the
development of both the app and the driver.
My first version of the code utilized METHOD_NEITHER
on both the transmit and the receive side, and completed
the IRPs in the dispatch functions (synchronously). Putting
the transmit aside (it is easy to code), the code for the
IOCTL_RECEIVE dispatch function COPIED the data into the
app's buffer from a queue of ExAllocatePool'd buffers that
were allocated from within the DPC (in response to
interrupts indicating that data had arrived). In pseudo-code:

DPC:
Allocate buffer
Read data from card into buffer
Add buffer into queue

IOCTL_RECEIVE
Extract buffer from queue
Copy data from buffer into user space buffer
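
In rough WDM terms, the receive dispatch amounted to something like the
sketch below (RX_BUF and DequeueReceivedBuffer are made-up names here, and
error paths are trimmed):

#include <ntddk.h>

typedef struct _RX_BUF {            /* made-up per-packet buffer */
    LIST_ENTRY Link;
    ULONG      Length;
    UCHAR      Data[1];             /* ExAllocatePool'd with the real payload size */
} RX_BUF, *PRX_BUF;

PRX_BUF DequeueReceivedBuffer(PVOID DevExt);   /* made-up helper: pops the DPC-filled queue */

NTSTATUS DispatchReceiveCopy(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PIO_STACK_LOCATION sp     = IoGetCurrentIrpStackLocation(Irp);
    PVOID              uBuf   = Irp->UserBuffer;   /* raw user pointer: METHOD_NEITHER */
    ULONG              uLen   = sp->Parameters.DeviceIoControl.OutputBufferLength;
    PRX_BUF            rx     = DequeueReceivedBuffer(DeviceObject->DeviceExtension);
    NTSTATUS           status = STATUS_SUCCESS;
    ULONG              copied = 0;

    if (rx == NULL) {
        status = STATUS_NO_MORE_ENTRIES;           /* nothing queued yet */
    } else {
        __try {
            copied = (uLen < rx->Length) ? uLen : rx->Length;
            ProbeForWrite(uBuf, copied, sizeof(UCHAR));  /* must probe a raw user pointer */
            RtlCopyMemory(uBuf, rx->Data, copied);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            status = GetExceptionCode();
            copied = 0;
        }
        ExFreePool(rx);             /* the extra allocate/copy/free per packet */
    }

    Irp->IoStatus.Status      = status;
    Irp->IoStatus.Information = copied;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;
}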

This of course made the receiving side 'heavier'
than the transmitting one, because of the extra allocate-
copy-free involved for each data packet. I thus decided to
change the IOCTL_RECEIVE mode to METHOD_OUT_DIRECT and the
model of usage from synchronous to asynchronous. The app
issues a number of IOCTL_RECEIVEs at the beginning, and
then waits in a WaitForMultipleObjects. The IOCTL_RECEIVE
dispatch function puts the IRP in a queue of pending
receive-requests and returns STATUS_PENDING. When
receive-data interrupts arrive from the card, the DPC
removes pending IRPs from the queue, copies the data
into the (locked, thanks to METHOD_OUT_DIRECT) buffer,
and completes the IRP. In pseudo-code:

IOCTL_RECEIVE
Put IRP in receive-requests queue
return STATUS_PENDING

DPC:
Take IRP from queue
Read data directly from card into buffer
IoCompleteRequest
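
Again roughly sketched (DEV_EXT, RxLock/RxPendingList and ReadCardData are
made-up names, and cancel handling is left out):

#include <ntddk.h>

typedef struct _DEV_EXT {           /* minimal, made-up device extension */
    KSPIN_LOCK RxLock;
    LIST_ENTRY RxPendingList;
    /* ... hardware state ... */
} DEV_EXT, *PDEV_EXT;

ULONG ReadCardData(PDEV_EXT Dev, PVOID Dest);   /* hypothetical hardware read */

/* Dispatch: park the IRP on the pending-receive list and return immediately. */
NTSTATUS DispatchReceivePend(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PDEV_EXT dev = DeviceObject->DeviceExtension;
    KIRQL    irql;

    IoMarkIrpPending(Irp);
    KeAcquireSpinLock(&dev->RxLock, &irql);
    InsertTailList(&dev->RxPendingList, &Irp->Tail.Overlay.ListEntry);
    KeReleaseSpinLock(&dev->RxLock, irql);
    /* a real driver must also arm a cancel routine here (omitted) */
    return STATUS_PENDING;
}

/* DPC: pop the oldest pending IRP and read the card straight into its locked buffer. */
VOID RxDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    PDEV_EXT    dev   = (PDEV_EXT)Context;
    PLIST_ENTRY entry = NULL;
    PIRP        irp;
    PVOID       sysVa;
    ULONG       bytes;

    KeAcquireSpinLockAtDpcLevel(&dev->RxLock);
    if (!IsListEmpty(&dev->RxPendingList))
        entry = RemoveHeadList(&dev->RxPendingList);
    KeReleaseSpinLockFromDpcLevel(&dev->RxLock);
    if (entry == NULL)
        return;                                   /* no pending IRP: the packet is lost */

    irp   = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);
    sysVa = MmGetSystemAddressForMdl(irp->MdlAddress);  /* locked METHOD_OUT_DIRECT buffer */
    bytes = ReadCardData(dev, sysVa);

    irp->IoStatus.Status      = STATUS_SUCCESS;
    irp->IoStatus.Information = bytes;
    IoCompleteRequest(irp, IO_NO_INCREMENT);
}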

Naturally, when the WaitForMultipleObjects wakes up,
the app quickly issues another pending request
(IOCTL_RECEIVE) to compensate for the completed one,
before it starts reading the arrived data.
This seemed to work (in terms of rate achieved),
but unfortunately, I quickly realized that it couldn't
be relied on. Why? Because the order of data arrival
IS IMPORTANT, and sadly, the I/O manager doesn't
'SetEvent' in the same order in which the driver
calls IoCompleteRequest. Even though the architecture is
a uni-processor one, for reasons unknown to me, under
moderate CPU load, something like this eventually happens:

App IOCTL_RECEIVE bufferA
Driver (Dispatch function) Queue IRPofBufferA
App IOCTL_RECEIVE bufferB
Driver (Dispatch function) Queue IRPofBufferB
App WaitForMultipleObjects(2, …)

Driver (DPC)
IoCompleteRequest(IRPofBufferA)
IoCompleteRequest(IRPofBufferB)
App
WaitForMultipleObjects returns with B signaled
WaitForMultipleObjects returns with A signaled

This reversal of notifications is of course messing
things up; it requires special code in the app to cope with it,
and generally makes the whole thing not worthwhile
(take into account that the first approach doesn't
require Cancel processing...)

Which brings us to the subject: Are Single-Copy
receives possible in NT/2000, when order of
arrival is vital? Or does one have to use intermediate
queues, like I did in the first approach?

Thanks for reading this (rather long) question.

Thanassis Tsiodras
xxxxx@4Plus.com
4Plus Technologies



>Perhaps you should only wait on a single event in the app, for the NEXT
>buffer, in the order that you queued them. You can then wait on the following
>buffer, which may have already been signaled, so you will not actually
>suspend.
>I think the question is whether there is an ordering guarantee for
>WaitForMultipleObjects, and I think the answer is no. In the above example,
>if we assume both buffer A and B get their events signaled in the SAME run
>of the DPC, the app thread will not get a chance to run before BOTH events
>are signaled.

That's what I meant by 'special code to cope with this in the app',
but the question indeed refers to ordering guarantees.
There is another matter as well: the delay involved between
IoCompleteRequest and the waking-up of WaitForXXXXObject(s). If this is too
long, the driver will complete all pending IRPs and subsequently lose
arriving packets (no place to store them!). This means that I can't
complete too many IRPs in the DPC, because I NEED the user thread to
replenish the queue with fresh pending requests.

To put it simply, I see a lot of difficulties in the
single-copy receiving path, so the question could be re-phrased like this:

Have any of you successfully implemented single-copy receiving paths
for high-data-rate devices?



Would mapping the buffer into the app's address space help?

Pseudo-code:
DPC:
Allocate buffer
Allocate MDL for buffer (IoAllocateMdl)
Lock pages (MmProbeAndLockPages)

IOCTL_RECEIVE:
Extract buffer from queue
Map into current user address space (MmMapLockedPagesSpecifyCache)
Build a context containing MDL & ptr (you will need it to clean up)
Send back mapping address to user (using the IOCTL output parameter)

IOCTL_FREE:
Sanity-check ptr
Unmap ptr (MmUnmapLockedPages) and recycle the buffer (or free it)

CLOSE:
Unmap all in-flight ptrs. (This is important; otherwise the machine
will crash with PROCESS_HAS_LOCKED_PAGES.)

The app should call IOCTL_RECEIVE, process the data, then call IOCTL_FREE
to recycle the buffer.
Or you may combine IOCTL_RECEIVE & IOCTL_FREE in a single call ...
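
A rough sketch of the two sides, assuming a hypothetical per-buffer
structure buf with SystemVa/Length/Mdl/UserVa fields (error handling and
the context bookkeeping are omitted):

/* DPC side: describe the freshly filled pool buffer with an MDL and lock it. */
buf->Mdl = IoAllocateMdl(buf->SystemVa, buf->Length, FALSE, FALSE, NULL);
MmProbeAndLockPages(buf->Mdl, KernelMode, IoModifyAccess);
/* (for a nonpaged-pool buffer, MmBuildMdlForNonPagedPool would also do) */

/* IOCTL_RECEIVE dispatch, i.e. in the app's context: map and hand back the pointer. */
__try {
    buf->UserVa = MmMapLockedPagesSpecifyCache(buf->Mdl, UserMode, MmCached,
                                               NULL, FALSE, NormalPagePriority);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    buf->UserVa = NULL;                      /* mapping into user space can fail/raise */
}
/* return buf->UserVa through the IOCTL's output buffer, and remember buf for cleanup */

/* IOCTL_FREE (and CLOSE, for every in-flight buffer): undo everything. */
MmUnmapLockedPages(buf->UserVa, buf->Mdl);
MmUnlockPages(buf->Mdl);
IoFreeMdl(buf->Mdl);
ExFreePool(buf->SystemVa);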

Regards,
Pascal.



I think Jan gave you the solution. Let me restate it.

Your app presumably does the I/O from a single thread, or else there could be no ordering
guarantee.

Your app starts out by issuing n reads, r1…rn, then it waits for a single event: r1. As soon
as it is done with r1, it issues rn+1, and waits for r2. If r2 has already been signaled, there is no wait.

The driver receives the IRPs in order, and transfers the data in IRP order, then completes the IRPs in some order.
The app ignores IRPs completed out of order because it is waiting only for the in-order one.
If the app does not run in time to pass the IRPs back to the driver, you'll overrun, so you'll want to
look at thread priorities. There is not a whole lot of reason for n to be significantly greater than about 4, if you can vary
the size of the read. If the input
is in relatively small frames, e.g., 2K, you can reduce the overhead significantly, and thereby increase throughput,
by having each read give back multiple frames.
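
Roughly, the app side then looks like this (the device name, IOCTL code, N and
BUFSZ are placeholders, and error checks are omitted):

#include <windows.h>
#include <winioctl.h>

#define N      4                    /* depth of the pending-read ring (illustrative)   */
#define BUFSZ  (64 * 1024)          /* per-request buffer size (illustrative)          */
/* placeholder IOCTL code; use whatever your driver really defines */
#define IOCTL_RECEIVE CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

int main(void)
{
    HANDLE      hDev = CreateFileA("\\\\.\\MyDevice", GENERIC_READ | GENERIC_WRITE,
                                   0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    OVERLAPPED  ov[N];
    static BYTE buf[N][BUFSZ];
    DWORD       i, got;

    /* Prime the driver with N pending receives, each with its own event. */
    for (i = 0; i < N; i++) {
        ZeroMemory(&ov[i], sizeof(ov[i]));
        ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        DeviceIoControl(hDev, IOCTL_RECEIVE, NULL, 0, buf[i], BUFSZ, NULL, &ov[i]);
    }

    /* Consume strictly in the order the requests were queued. */
    for (i = 0; ; i = (i + 1) % N) {
        GetOverlappedResult(hDev, &ov[i], &got, TRUE);  /* waits only for the in-order one */

        /* ... process buf[i][0..got-1] here ... */

        /* Recycle the slot so the driver never runs dry of pending IRPs. */
        ResetEvent(ov[i].hEvent);                        /* defensive before reuse */
        DeviceIoControl(hDev, IOCTL_RECEIVE, NULL, 0, buf[i], BUFSZ, NULL, &ov[i]);
    }
    /* not reached */
}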

Also, WaitForMultipleObjects is significantly less efficient than WaitForSingleObject, especially when n gets large.

If you decide to run at a real-time priority, I'd recommend the following to avoid locking up the machine with a bug:
create two other threads, one at prio+1 and the other at prio-1. The lower-prio thread just sleeps and increments
a counter. The higher-prio thread sleeps and watches the counter. If the counter is not changing, suspend the real thread
for a while. This allows the debugger to run if you get stuck in an infinite loop, or if the thread can't keep up with the data
arrival rate.
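
A sketch of that watchdog pair (handles, priorities and periods are of course
placeholders; gWorker is assumed to be the real-time receive thread's handle):

#include <windows.h>

static volatile LONG gHeartbeat;    /* incremented by the low-priority thread          */
static HANDLE        gWorker;       /* the real-time receive thread (assumption)       */

/* Runs just below the worker: if the worker ever spins, this thread starves too. */
static DWORD WINAPI LowPrioBeat(LPVOID arg)
{
    (void)arg;
    for (;;) {
        InterlockedIncrement(&gHeartbeat);
        Sleep(100);
    }
}

/* Runs just above the worker: if the heartbeat stops, suspend the worker for a while
   so the debugger (and the rest of the machine) can get some CPU. */
static DWORD WINAPI HighPrioWatchdog(LPVOID arg)
{
    LONG last = gHeartbeat;
    (void)arg;
    for (;;) {
        Sleep(1000);
        if (gHeartbeat == last) {            /* worker is hogging the CPU */
            SuspendThread(gWorker);
            Sleep(2000);
            ResumeThread(gWorker);
        }
        last = gHeartbeat;
    }
}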

Lastly, make sure you properly code the cancel function here. It's quite easy to miss the case where the
request is cancelled before it hits your dispatch routine and sets the cancel handler. This is what Irp->Cancel is for.
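
For example, something along these lines in the dispatch routine, right after
queueing the IRP and with the queue lock still held (RxCancel, dev->RxLock and
irql are placeholder names; this is only a sketch of the standard pattern):

IoSetCancelRoutine(Irp, RxCancel);
if (Irp->Cancel && IoSetCancelRoutine(Irp, NULL) != NULL) {
    /* The IRP was cancelled before RxCancel was armed, and the I/O manager will
       never call it (we just cleared it ourselves), so complete it here. */
    RemoveEntryList(&Irp->Tail.Overlay.ListEntry);
    KeReleaseSpinLock(&dev->RxLock, irql);
    Irp->IoStatus.Status      = STATUS_CANCELLED;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_PENDING;              /* the IRP was already marked pending */
}
KeReleaseSpinLock(&dev->RxLock, irql);
return STATUS_PENDING;                  /* RxCancel (if invoked) will dequeue and complete */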

-DH


Concluding remarks

My sincerest thanks to everyone who helped.
To summarize, single-copy receive implementations face two problems:
A. Out-of-order signaling of events by the I/O manager
(in a different order than that of the IoCompleteRequest calls)
B. Being too late in replenishing receive-request IRPs
(the replenishment is done from the application)

Problem A:
…is indeed handled optimally by what Jan and Dave described.
I had implemented a similar approach when I first posted in ntdev
(it is what I called 'special code to cope with
this in the app' in my first mail, although no-one noticed it :-) ), but still,
I consistently lost packets. To be honest, I still can't
understand how two IoCompleteRequests can trigger two swapped
KeSetEvents (from the I/O mgr) in a UNI-processor environment,
unless WaitForMultipleObjects works in a very different
way from the one I hoped.

Problem B:
…was solved by Dave's proposal. I didn't have to resort to the
realtime class, though: for my needs (400 MBps), priority 15
(TIME_CRITICAL) was enough. Still, I find Dave's idea about the
two monitoring threads quite ingenious: I'd never touch the
realtime class if it weren't for this method.

Lastly, cancellation does indeed hit the case where the
request is cancelled before it reaches the dispatch routine and the
cancel handler is set. However, thanks to Jamie Hanrahan's
gracious donation to the world (see the USENET thread about cancel
processing two years ago), I didn't have any problem
handling this.

Thanks again to all that replied.

Thanassis Tsiodras, Dr.-Ing
xxxxx@4plus.com
4Plus Technologies

