The best TDI data receive method?

Hi All !

I’m trying to implement a TDI client module.

One of the project’s goals is to send/receive data as fast as possible (I know it depends on the hardware).

I guess there is no choice in how to implement send operations: simply submit IRPs with the TDI_SEND minor code asynchronously.

But the question is: what is the best way to receive data?

About my input data stream:
Every received packet has a fixed-length header structure and a variable-length data part (the length is carried in the header).
To speed things up, the data part of a packet should be placed directly into an already-available buffer that is big enough.

I learned there are 3 TDI data receive methods suitable for my purposes:

  1. To use IoCallDriver with a TDI_RECEIVE IRP to get a packet’s header, then another TDI_RECEIVE IRP to get the packet’s data
    directly into the corresponding data buffer.

  2. To register ClientEventReceive and receive up to a full TSDU each time the callback is called.

  3. To register ClientEventChainedReceive and receive a full TSDU per call (registration for both event-based methods is
    sketched just below this list).
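
For the event-based methods, I register the handler by submitting a TDI_SET_EVENT_HANDLER request on the address object,
roughly like this (just a sketch of what I have so far: the helper name is mine and error handling is trimmed):

#include <ntddk.h>
#include <tdikrnl.h>

/* Sketch: register a receive event handler on an already-opened TDI address
 * object (AddrDeviceObject/AddrFileObject come from ZwCreateFile on the
 * transport device plus ObReferenceObjectByHandle).  Runs at PASSIVE_LEVEL. */
NTSTATUS
RegisterReceiveHandler(PDEVICE_OBJECT AddrDeviceObject,
                       PFILE_OBJECT AddrFileObject,
                       LONG EventType,      /* TDI_EVENT_RECEIVE or TDI_EVENT_CHAINED_RECEIVE */
                       PVOID EventHandler,  /* matching ClientEventReceive / ClientEventChainedReceive */
                       PVOID EventContext)
{
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    PIRP irp;
    NTSTATUS status;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    irp = TdiBuildInternalDeviceControlIrp(TDI_SET_EVENT_HANDLER,
                                           AddrDeviceObject, AddrFileObject,
                                           &event, &iosb);
    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    TdiBuildSetEventHandler(irp, AddrDeviceObject, AddrFileObject,
                            NULL, NULL,                  /* no completion routine */
                            EventType, EventHandler, EventContext);

    status = IoCallDriver(AddrDeviceObject, irp);
    if (status == STATUS_PENDING) {
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }
    return status;
}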

Which of these methods is the fastest and most reliable?

Is it possible for a TDI client to receive data directly from the NIC into a buffer it has allocated?
I am 99% sure that such zero-copy receiving is not possible, because the network stack must copy data from an NDIS packet
allocated by the NDIS miniport into my buffers.
But maybe I’m wrong?

Any help or suggestions are appreciated.

Best regards,
Valeriy Glushkov

Without having done chained receive myself, I believe it returns a pointer to the packet payload, including headers, and I imagine it
does so without copying the payload. See the doc on chained receive and, for example, Max Shatskih’s remark in this newsgroup under
“How to modify the packet” on 14 July.

Valeriy Glushkov wrote:

Is it possible for a TDI client to receive data directly from the NIC into a buffer it has allocated?
I am 99% sure that such zero-copy receiving is not possible, because the network stack must copy data from an NDIS packet
allocated by the NDIS miniport into my buffers.



James Antognini
Windows DDK MVP

> Is it possible for a TDI client to receive data directly from the NIC into a buffer it has allocated?

No. This is impossible even theoretically on the NdisMIndicateReceivePacket path, since the data are already in memory: in
the miniport’s common buffer.

It is theoretically possible with old-style indications, but IP will do this for you and call NdisTransferData into an
IP-allocated buffer.

So, on the path from the NIC chip to a TDI client, at least one memcpy() will occur.

Max

The TDI chained receive path provides you with an MDL chain containing the data, an offset/length pair for it, and an opaque
pointer that describes the packet and must be handed back to the protocol via TdiReturnChainedReceives once the TDI client no
longer needs the data.

The “opaque pointer” is really an NDIS_PACKET pointer, and TdiReturnChainedReceives is really NdisReturnPackets.
So, the protocol can indicate via TDI chained receive only if it received a packet-based indication from NDIS. In that case,
the protocol just indicates the NDIS packet upward after the checksum check, possibly trimming some data from the beginning
(the header, and any overlap with earlier packets) and from the end; the trimming is expressed through the Offset and Length
parameters to ClientEventChainedReceive.
This, in turn, is possible only if the NIC miniport uses NdisMIndicateReceivePacket, which means the packet memory is
allocated by the miniport, usually by carving it from the DMA common buffer the miniport uses for receives.
If the miniport is running short of this common buffer, it sets NDIS_STATUS_RESOURCES on the packet before calling
NdisMIndicateReceivePacket, which causes NDIS to indicate the packet to protocols via the old, non-packet-based path. In that
case, the protocols are obliged to memcpy() the data into their own buffers by calling NdisTransferData, and the TransferData
routine (which actually does the memcpy()) is provided by NDIS itself. The original packet then returns to the miniport as
soon as NdisMIndicateReceivePacket returns, and the miniport can reclaim that common buffer space.
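
Just to show the shape of the handler, a sketch only (the per-connection context is made up, there is no framing or
partial-buffer handling, and the copy below is exactly the one memcpy() discussed above):

#include <ntddk.h>
#include <tdikrnl.h>

/* Hypothetical per-connection context holding the "big enough" buffer from
 * the original question. */
typedef struct _MY_CONNECTION {
    PUCHAR DataBuffer;
    ULONG  DataBufferSize;
    ULONG  BytesReceived;
} MY_CONNECTION, *PMY_CONNECTION;

NTSTATUS
MyChainedReceiveHandler(PVOID TdiEventContext,
                        CONNECTION_CONTEXT ConnectionContext,
                        ULONG ReceiveFlags,
                        ULONG ReceiveLength,
                        ULONG StartingOffset,
                        PMDL  Tsdu,
                        PVOID TsduDescriptor)
{
    PMY_CONNECTION conn = (PMY_CONNECTION)ConnectionContext;
    ULONG room = conn->DataBufferSize - conn->BytesReceived;
    ULONG want = (ReceiveLength < room) ? ReceiveLength : room;
    ULONG bytesCopied = 0;

    UNREFERENCED_PARAMETER(TdiEventContext);
    UNREFERENCED_PARAMETER(ReceiveFlags);
    UNREFERENCED_PARAMETER(TsduDescriptor);

    /* Copy out of the MDL chain starting at StartingOffset; the destination
     * size is shrunk so that at most 'want' bytes are copied. */
    TdiCopyMdlToBuffer(Tsdu, StartingOffset,
                       conn->DataBuffer, conn->BytesReceived,
                       conn->BytesReceived + want, &bytesCopied);
    conn->BytesReceived += bytesCopied;

    /* STATUS_SUCCESS: we are done with the TSDU right now.  Returning
     * STATUS_PENDING instead would let us keep the MDL chain and hand it
     * back later via TdiReturnChainedReceives(&TsduDescriptor, 1). */
    return STATUS_SUCCESS;
}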

For UDP, the chained receive path is IIRC not implemented at all.

Max


Max,

Thank you for the thorough answers.

You confirmed my own thoughts about the subject.

So zero-copy is not possible. OK.

Best regards,
Valeriy Glushkov


Valeriy,

memcpy() itself is not a performance killer. You should just code both the
sender and the receiver so that they build a pipeline. Then the latency that
memcpy() adds to the processing of a single data packet is hidden by having
multiple packets in different pipeline stages at the same time. In other
words, while you are sending packet N, the server’s NDIS miniport is receiving
packet N-1 into its internal buffer, the TDI client on the server side is
copying packet N-2 from the NDIS miniport’s memory into its own buffer, at the
same time the server-side TDI client is sending you packet N-3, and you are
receiving packet N-4 that was sent earlier, and so on. In such a case the
throughput of the pipeline (how often results come out of its end) is limited
by the slowest stage (the atomic operation in the pipeline that takes the
longest to complete). In your case the network interface (even a multigigabit
one) will be much slower than a memory-to-memory copy. The only thing you
should care about is CPU usage (keeping it high is not a good idea, see the
example of “why?” below, and we no longer have a DMA engine to handle
memory-to-memory operations, so the CPU does the copying). So try to avoid
memcpy(), but don’t be scared of it ))

In the case of send you’re absolutely correct. Just use an asynchronous
TDI_SEND with an MDL chain that describes the scattered data packet. For your
information, you can search the list to find out why the sends should be
asynchronous and use chained MDLs (the keywords are “delayed ACK”).
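
Something like this, for example (a sketch only: the helper name is mine, the MDL chain is assumed to describe locked
buffers, and the completion routine must free the IRP with IoFreeIrp and return STATUS_MORE_PROCESSING_REQUIRED):

#include <ntddk.h>
#include <tdikrnl.h>

/* Sketch: fire-and-forget TDI_SEND on a connected TDI connection endpoint.
 * TransportDevice/ConnectionFileObject come from the usual open/connect
 * sequence; SendMdl may be a chained MDL (header + payload). */
NTSTATUS
SubmitAsyncSend(PDEVICE_OBJECT TransportDevice,
                PFILE_OBJECT ConnectionFileObject,
                PMDL SendMdl, ULONG SendLength,
                PIO_COMPLETION_ROUTINE SendCompletion, PVOID Context)
{
    PIRP irp = IoAllocateIrp(TransportDevice->StackSize, FALSE);
    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* TdiBuildSend fills the next stack location with
     * IRP_MJ_INTERNAL_DEVICE_CONTROL / TDI_SEND, attaches the MDL chain and
     * hooks up the completion routine. */
    TdiBuildSend(irp, TransportDevice, ConnectionFileObject,
                 SendCompletion, Context,
                 SendMdl, 0 /* send flags */, SendLength);

    /* Do not wait here: the pipelining comes from keeping many of these in
     * flight at once. */
    return IoCallDriver(TransportDevice, irp);
}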

In the case of receive, things are not so clear. The answer to “what should I
use?” is “it depends”. The key point: you have to take the incoming data as
fast as you can. If I understand the process correctly, if you stall the
receive path, the other side will not get ACKs for its sent packets quickly,
and the TCP window size will shrink (gurus, please correct me if this
assertion is wrong). That will result in performance degradation. So if all
the code will be written by you and you can guarantee that the actual
processing of a single packet (you’re not writing “echo” code, are you?) will
not take longer than sending that packet over the network, just use receive
event handlers that copy the data from the NDIS miniport’s memory into your
internal memory, as that is the easiest way. If you cannot guarantee that you
will be able to process the data quickly (the reason doesn’t really matter…),
use some sort of intermediate buffer that acts as temporary storage between
the NDIS miniport’s buffer (which you must keep as free as you can, remember!)
and your real data buffers.
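
A trivial receive event handler of the first kind looks roughly like this (again a sketch: MY_CONNECTION is a made-up
context, there is no packet framing, the BytesAvailable > BytesIndicated case is ignored, and remember the handler runs at
DISPATCH_LEVEL, so keep it short):

/* Same #includes as the send sketch above. */

typedef struct _MY_CONNECTION {          /* hypothetical per-connection state */
    PUCHAR DataBuffer;
    ULONG  DataBufferSize;
    ULONG  BytesReceived;
} MY_CONNECTION, *PMY_CONNECTION;

NTSTATUS
MyReceiveHandler(PVOID TdiEventContext,
                 CONNECTION_CONTEXT ConnectionContext,
                 ULONG ReceiveFlags,
                 ULONG BytesIndicated,
                 ULONG BytesAvailable,
                 PULONG BytesTaken,
                 PVOID Tsdu,              /* for TCP this is a flat buffer of BytesIndicated bytes */
                 PIRP *IoRequestPacket)
{
    PMY_CONNECTION conn = (PMY_CONNECTION)ConnectionContext;
    ULONG room   = conn->DataBufferSize - conn->BytesReceived;
    ULONG toCopy = (BytesIndicated < room) ? BytesIndicated : room;

    UNREFERENCED_PARAMETER(TdiEventContext);
    UNREFERENCED_PARAMETER(ReceiveFlags);
    UNREFERENCED_PARAMETER(BytesAvailable);

    /* This is the one memcpy(): transport/miniport memory -> our buffer. */
    RtlCopyMemory(conn->DataBuffer + conn->BytesReceived, Tsdu, toCopy);
    conn->BytesReceived += toCopy;

    *BytesTaken = toCopy;
    *IoRequestPacket = NULL;   /* not posting a TDI_RECEIVE IRP for the rest */
    return STATUS_SUCCESS;
}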

We have just completed a project where we had to use this technique. The
results were quite impressive: our code outperformed Microsoft’s iSCSI
initiator (114 MB/sec for MS vs. 156 MB/sec for ours in full-duplex operation)
and simply ran circles around IBM’s iSCSI initiator. However, be very careful
with this stuff. When the pair of test machines was changed from dual-channel
DDR400-equipped P4s to PC133 P3s, the situation changed: we were no longer
able to catch up with MS and were losing ~20%-25% to them, because of the high
CPU usage…

Another way to keep the receive path fast is to pre-post receive buffers to
the transport. The idea is to submit an asynchronous TDI_RECEIVE IRP carrying
the response buffer before you send the TDI_SEND with the request buffer, so
that when your machine gets the reply from the server the data is already
stored in your main buffer. Of course you have to take care of packet ordering
in this case. I have not tried this personally, so if you find the energy to
check it, please let me know the results ))
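
The mechanics would look just like the send sketch above, only built with TdiBuildReceive and submitted ahead of time (again
a sketch, same caveat that the completion routine owns the IRP):

/* Sketch: pre-post a receive buffer on the connection before issuing the
 * corresponding send.  ReceiveMdl describes the locked response buffer. */
NTSTATUS
PrepostReceive(PDEVICE_OBJECT TransportDevice,
               PFILE_OBJECT ConnectionFileObject,
               PMDL ReceiveMdl, ULONG ReceiveLength,
               PIO_COMPLETION_ROUTINE ReceiveCompletion, PVOID Context)
{
    PIRP irp = IoAllocateIrp(TransportDevice->StackSize, FALSE);
    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    TdiBuildReceive(irp, TransportDevice, ConnectionFileObject,
                    ReceiveCompletion, Context,
                    ReceiveMdl, TDI_RECEIVE_NORMAL, ReceiveLength);

    return IoCallDriver(TransportDevice, irp);
}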

Good luck!

Anton Kolomyeytsev


This is the sort of real-life information that makes following this newsgroup truly useful. Thanks.

Anton Kolomyeytsev wrote:

We have just completed a project where we had to use this technique. The
results were quite impressive: our code outperformed Microsoft’s iSCSI
initiator (114 MB/sec for MS vs. 156 MB/sec for ours in full-duplex operation)
and simply ran circles around IBM’s iSCSI initiator. However, be very careful
with this stuff. When the pair of test machines was changed from dual-channel
DDR400-equipped P4s to PC133 P3s, the situation changed: we were no longer
able to catch up with MS and were losing ~20%-25% to them, because of the high
CPU usage…



James Antognini
Windows DDK MVP

On Fri, 2003-07-25 at 19:12, Anton Kolomyeytsev wrote:

memcpy() itself is not a performance killer. You should just code both the
sender and the receiver so that they build a pipeline.

This isn’t necessarily true: you do increase delay, which increases the
bandwidth-delay product and, for a fixed window size, decreases the throughput
of any reliable connection. You’d have to be talking about a pretty fast
network to see the difference, though.
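
(Rough numbers, with values I am just assuming for illustration: a connection limited to a 64 KB window cannot exceed
window/RTT, which is about 65 MB/s at a 1 ms round trip; if the extra copies add 100 microseconds to that round trip, the
ceiling drops to roughly 60 MB/s, call it a 9% loss.)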

In your case the network interface (even a multigigabit one) will be much
slower than a memory-to-memory copy.

This is probably true in the Fast Ethernet realm, but much faster
networks do exist.

Steve,

You’re absolutely correct. However, I pointed out in my original post that
the delay we’re adding does not hurt performance if:

  1. the pipeline is 100% loaded

  2. the delay added to a single atomic operation of the pipeline does not
    make that operation the most time-consuming one

This is true at least for gigabit Ethernet (a good gigabit Ethernet
implementation, I have to admit). The other fast networks we potentially have
access to do not have an IP stack under Windows yet.

Anton Kolomyeytsev

P.S. If you want to discuss the theory of parallel computation, let’s do it
in private. I don’t think that topic will be very interesting to the rest of
the guys here.

AK
