Valery,
memcpy() itself is not a performance killer. You should just code both
sender and receiver so that they build a pipeline. The latency that
memcpy() adds to the processing of a single data packet is then hidden by
having multiple packets in different pipeline stages at the same time. In
other words, while you are sending packet N, the server's NDIS miniport
will be receiving packet N-1 into its internal buffer, the TDI client on
the server side will be copying packet N-2 from the NDIS miniport's memory
into its own buffer, at the same time the TDI client on the server side
will be sending you packet N-3, and you will be receiving packet N-4 that
was sent earlier, etc. etc. etc. In such a case the throughput of the
pipeline (the frequency at which results come out of its end) is limited
by the slowest stage (the atomic operation in the pipeline that takes the
longest to complete). In your case the network interface (even a
multigigabit one) will be much slower than a memory-to-memory copy
operation. The only thing you should care about is CPU usage (keeping it
high is not a good idea, see the example of "why?" below, and we no longer
have a DMA engine to handle memory-to-memory operations). So try to avoid
memcpy(), but don't be scared of it ))
In the case of send you're absolutely correct. Just use an asynchronous
TDI_SEND with an MDL that describes the scattered data packet. For your
information, you can search the list to find out why sends should be
asynchronous and use chained MDLs (keywords are "delayed ACK").
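A minimal sketch of such a scattered send, assuming you already hold the connection endpoint's DeviceObject/FileObject, that Header and Payload live in nonpaged pool, and that SendComplete is a hypothetical completion routine that unchains and frees the MDLs:

```c
/* Sketch only -- error handling omitted; needs tdikrnl.h. */
PIRP irp = TdiBuildInternalDeviceControlIrp(TDI_SEND, DeviceObject,
                                            ConnectionFileObject, NULL, NULL);
PMDL headerMdl = IoAllocateMdl(Header, HeaderLen, FALSE, FALSE, NULL);
PMDL dataMdl   = IoAllocateMdl(Payload, PayloadLen, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(headerMdl);
MmBuildMdlForNonPagedPool(dataMdl);
headerMdl->Next = dataMdl;       /* chain: header + payload go out as one
                                    TSDU, no staging memcpy() needed     */
TdiBuildSend(irp, DeviceObject, ConnectionFileObject,
             SendComplete, Context,      /* completion frees the MDLs    */
             headerMdl, 0, HeaderLen + PayloadLen);
IoCallDriver(DeviceObject, irp);         /* returns at once: asynchronous */
```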
In the case of receive everything is not so clear. The answer to the
question "what to use?" is "it depends". Key point: you have to take the
incoming data as fast as you can. If I understand the process correctly,
if you stall the incoming data path the other side will not get ACKs for
its sent packets quickly and the TCP window size will shrink (gurus,
please point out if this assertion is wrong). This will result in
performance degradation. So if all the code will be written by you and you
can guarantee that the actual processing of a single packet (you're not
writing "echo" code, are you?) will not take longer than sending that
packet over the network, just use a receive event handler that copies the
data from the NDIS miniport's memory into your internal memory, as this is
the easiest way. If you cannot guarantee that you'll be able to process
the data fast enough (the reason doesn't really matter...), use some sort
of intermediate buffer that acts as temporary storage between the NDIS
miniport's buffer (which you must keep as free as you can, remember!) and
your real data buffers.
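A sketch of the event-handler variant, with the standard ClientEventReceive signature from tdikrnl.h; MyBuffer and the context bookkeeping are hypothetical placeholders for your intermediate storage:

```c
/* Sketch only: take everything indicated immediately so the miniport's
   buffer stays free. For ClientEventReceive, Tsdu points directly at the
   indicated data.                                                       */
NTSTATUS
MyEventReceive(PVOID TdiEventContext, CONNECTION_CONTEXT ConnectionContext,
               ULONG ReceiveFlags, ULONG BytesIndicated, ULONG BytesAvailable,
               ULONG *BytesTaken, PVOID Tsdu, PIRP *IoRequestPacket)
{
    PMY_CONTEXT ctx = (PMY_CONTEXT)TdiEventContext;     /* hypothetical */

    RtlCopyMemory(ctx->MyBuffer + ctx->Offset, Tsdu, BytesIndicated);
    ctx->Offset += BytesIndicated;
    *BytesTaken = BytesIndicated;    /* we consumed all indicated data  */

    /* If BytesAvailable > BytesIndicated you would build a TDI_RECEIVE
       IRP for the remainder, return it via *IoRequestPacket and return
       STATUS_MORE_PROCESSING_REQUIRED; omitted in this sketch.         */
    *IoRequestPacket = NULL;
    return STATUS_SUCCESS;
}
```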
We just completed a project where we had to use this technique. The
results were quite impressive: our code outperformed Microsoft's iSCSI
initiator (full-duplex operation: 114MB/sec MS vs. 156MB/sec ours) and
just ran circles around the IBM iSCSI initiator. However, be very careful
with this stuff. When the pair of test machines was changed from
dual-channel DDR400-equipped P4s to PC133 P3s the situation changed: we
were not able to catch up with MS and were losing ~20%-25% to them,
because of the high CPU usage...
Another way to keep receives fast is to prepost buffers to the NDIS
miniport. The idea is to send an asynchronous TDI_RECEIVE IRP with the
response buffer before you send the TDI_SEND with the request buffer. Then
when your machine gets the feedback from the server, the data will already
be stored in your main buffer. Of course you have to take care of packet
ordering in this case. I have not tried this personally, so if you find
the time to check it, please let me know the results ))
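A sketch of that preposting idea, again assuming you hold the connection's DeviceObject/FileObject, that ResponseBuf is nonpaged, and that ReceiveComplete is a hypothetical completion routine:

```c
/* Sketch only: post the receive for the response *before* issuing the
   request, so the reply lands straight in ResponseBuf with no extra copy
   on our side.                                                          */
PMDL mdl = IoAllocateMdl(ResponseBuf, ResponseLen, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(mdl);
PIRP rcvIrp = TdiBuildInternalDeviceControlIrp(TDI_RECEIVE, DeviceObject,
                                               ConnectionFileObject,
                                               NULL, NULL);
TdiBuildReceive(rcvIrp, DeviceObject, ConnectionFileObject,
                ReceiveComplete, Context,   /* hypothetical completion */
                mdl, 0, ResponseLen);
IoCallDriver(DeviceObject, rcvIrp);         /* receive is now pending  */
/* ...only now build and send the TDI_SEND with the request buffer...  */
```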
Good luck!
Anton Kolomyeytsev
Hi All!
I’m trying to implement a TDI client module.
One of the project’s goals is to send/receive data as fast as possible (I know it depends on the hardware).
I guess there is no choice in how to implement send operations: simple send IRPs with the TDI_SEND minor code, issued asynchronously.
But the question is: what is the best way to receive data?
About my input data stream:
every received packet has a fixed-length header structure and a variable-length data part (the length is in the header).
To speed things up, the data part of a packet should be placed into an already available buffer, which is big enough.
I learned there are 3 TDI data receive methods suitable for my purposes:
- Use IoCallDriver with a TDI_RECEIVE IRP to get the header of a packet, then another TDI_RECEIVE IRP to get the packet’s data directly into the corresponding data buffer.
- Register ClientEventReceive and receive up to a full TSDU each time the callback is called.
- Register ClientEventChainedReceive and receive a full TSDU per call.
Which of these methods is the fastest and most reliable one?
Is it possible for a TDI client to receive data directly from a NIC into an allocated data buffer?
I am 99% sure that such a zero-copy receive is not possible, because the network stack must copy data from an NDIS packet allocated by the NDIS miniport into my buffers.
But maybe I’m wrong?
Any help or suggestions are appreciated.
Best regards,
Valeriy Glushkov