TCP building of send packet

Hi all,

Let's say that a user allocates a buffer for a send with the Winsock interface.
Does the TCP/IP driver copy the payload into a driver-allocated buffer in every case?

How does this work with the NDIS 6 NET_BUFFER? How does it avoid allocating an additional
buffer for the header data?

Thanks
Tom

tomerg22@012.net.il wrote:

> Let's say that a user allocates a buffer for a send with the Winsock interface.
> Does the TCP/IP driver copy the payload into a driver-allocated buffer in every case?
>
> How does this work with the NDIS 6 NET_BUFFER? How does it avoid allocating an additional
> buffer for the header data?

Clearly, the TCP layer has to add its wrapper, and the IP layer has to
add its wrapper. Copies are unavoidable. But why does it matter? Even
for gigabit Ethernet, the additional copies are unimportant. x86
processors can copy memory blocks pretty darned fast these days.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

TCP/IP allocates buffers for the Ethernet, IP, and TCP
headers, and maps the user memory “in place” for
transmits. TCP edits the network headers directly for
each packet, and avoids copying the application data on
the transmit path.

Cheers

— tomerg22@012.net.il wrote:

Hi all,

Let's say that a user allocates a buffer for a send
with the Winsock interface. Does the TCP/IP driver copy
the payload into a driver-allocated buffer in every case?

How does this work with the NDIS 6 NET_BUFFER? How does
it avoid allocating an additional buffer for the header
data?

Thanks
Tom



> Let's say that a user allocates a buffer for a send with the Winsock interface.
> Does the TCP/IP driver copy the payload into a driver-allocated buffer in every
> case?

Pre-Vista TCP/IP never does this.

Instead, it just pends the TDI_SEND IRP until all ACKs have arrived, so the IRP’s
data is available if the protocol needs to retransmit it. There are no retransmits
after the ACK, so TCPIP can (and does) complete the TDI_SEND IRP.

It is AFD.SYS, not TCPIP.SYS, that makes the data copy. AFD implements the
SO_SNDBUF support and keeps a “bytes currently inside TCPIP” counter.

If this counter is less than the SO_SNDBUF value, AFD creates a temporary memory
buffer, copies the data into it, immediately completes the user’s send IRP, and
sends this temporary buffer down to TCPIP in a TDI_SEND.

Otherwise, it reuses the user’s send IRP as the TDI_SEND for TCPIP.

If you set SO_SNDBUF to 0, AFD will always reuse the user’s send IRP as the
TDI_SEND. Advantage: zero-copy through the whole stack. Drawback: every send must be
large enough to saturate the network; otherwise this becomes a performance penalty,
because the user’s send will not complete until all ACKs have arrived for that portion.
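
For reference, a minimal user-mode sketch of opting into that path by setting SO_SNDBUF to 0; the socket setup around the setsockopt() call is only illustrative:

#include <winsock2.h>
#include <stdio.h>

#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    SOCKET s;
    int zero = 0;

    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (s == INVALID_SOCKET) {
        WSACleanup();
        return 1;
    }

    /* SO_SNDBUF = 0: AFD stops double-buffering sends and forwards the
       caller's buffer straight down as the TDI_SEND.  Sends then complete
       only after the peer has ACKed the data, so the application should
       keep several large (overlapped) sends outstanding to keep the pipe
       full. */
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   (const char *)&zero, sizeof(zero)) == SOCKET_ERROR) {
        printf("setsockopt failed: %d\n", WSAGetLastError());
    }

    closesocket(s);
    WSACleanup();
    return 0;
}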

In Vista, TCP/IP has an undocumented upper edge instead of TDI, but I expect that
the text above is valid for Vista too.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> Clearly, the TCP layer has to add its wrapper, and the IP layer has to
> add its wrapper. Copies are unavoidable.

MS’s TCP/IP stack has been zero-copy on sends (when SO_SNDBUF is zero) for around
10 years.

Headers are added by chaining the buffers in front.

At least on NT4, the outgoing TCP packet had the following buffers at the
miniport level:

  • MAC header
  • IP and TCP headers
  • data

So, the IP and TCP headers are combined. I suspect that MS’s TCP/IP has no strong
division between the TCP and IP layers of the kind the UNIX implementations have.
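
A rough kernel-mode sketch of that chaining idea (the function and buffer names are illustrative, not actual TCPIP internals): the sender builds one MDL for the headers it owns and simply links the caller's payload MDL behind it, so the payload bytes are never copied.

#include <ntddk.h>

/* Illustrative only: chain a driver-owned header buffer in front of the
   caller's payload MDL (e.g. the one found at Irp->MdlAddress).  The
   header layout and ownership are assumptions made for this sketch. */
PMDL
BuildSendChain(PVOID HeaderBuffer, ULONG HeaderLength, PMDL PayloadMdl)
{
    PMDL headerMdl = IoAllocateMdl(HeaderBuffer, HeaderLength,
                                   FALSE, FALSE, NULL);
    if (headerMdl == NULL)
        return NULL;

    /* HeaderBuffer is nonpaged memory we own, so just fill in the MDL
       page array without locking any user pages. */
    MmBuildMdlForNonPagedPool(headerMdl);

    /* Link the payload behind the headers; the miniport (or its DMA
       engine) walks Mdl->Next to gather the whole frame. */
    headerMdl->Next = PayloadMdl;
    return headerMdl;
}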


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> TCP/IP allocates buffers for the Ethernet, IP, and TCP
> headers, and maps the user memory “in place” for
> transmits.

It just reuses the MDL built by the I/O manager when the MJ_WRITE is sent to AFD.

It maps this MDL to system address space only for the checksum calculation.
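
A hedged sketch of what that mapping step could look like; the helper below is illustrative, not actual TCPIP code, and assumes every fragment except the last has an even length.

#include <ntddk.h>

/* Sketch: walk an MDL chain, map each piece into system address space,
   and fold the bytes into a 16-bit one's-complement sum (RFC 1071 style).
   The mapping is the only reason the user memory needs a kernel virtual
   address here; the payload itself is never copied. */
ULONG
SumMdlChain(PMDL Mdl)
{
    ULONG sum = 0;
    ULONG i;

    for (; Mdl != NULL; Mdl = Mdl->Next) {
        PUCHAR va  = MmGetSystemAddressForMdlSafe(Mdl, NormalPagePriority);
        ULONG  len = MmGetMdlByteCount(Mdl);

        if (va == NULL)
            return 0;                 /* low resources; caller must handle */

        for (i = 0; i + 1 < len; i += 2)
            sum += ((ULONG)va[i] << 8) | va[i + 1];
        if (len & 1)
            sum += (ULONG)va[len - 1] << 8;
    }

    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return sum;                       /* caller complements and stores it */
}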


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Maxim S. Shatskih wrote:

>> Clearly, the TCP layer has to add its wrapper, and the IP layer has to
>> add its wrapper. Copies are unavoidable.
>
> MS’s TCP/IP stack has been zero-copy on sends (when SO_SNDBUF is zero) for
> around 10 years.
>
> Headers are added by chaining the buffers in front.
>
> At least on NT4, the outgoing TCP packet had the following buffers at the
> miniport level:
>
>   • MAC header
>   • IP and TCP headers
>   • data
>
> So, the IP and TCP headers are combined. I suspect that MS’s TCP/IP has no
> strong division between the TCP and IP layers of the kind the UNIX
> implementations have.

Really? So, the DMA setup pulls 8 bytes from one place, then 10 bytes
from another, then 32 bytes from another? I would have guessed that was
worse than copying the data and DMAing one larger block.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Yes, and in some cases it is indeed worse. It’s not
uncommon for a high-performance NIC driver to “coalesce”
those first few short segments into a “coalesce buffer”
before sending, to squeeze a few more bps of performance
out of the send path.
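
A rough sketch of such coalescing on a miniport's send path; the threshold, the preallocated buffer, and the function name are invented for illustration:

#include <ntddk.h>

#define COALESCE_THRESHOLD 128   /* invented cutoff for a "short" fragment */

/* Sketch: copy the short leading fragments of an MDL chain into one
   preallocated, DMA-able coalesce buffer and return how many bytes were
   coalesced.  The remaining (large) fragments would still be handed to
   the scatter/gather DMA engine unchanged. */
ULONG
CoalesceLeadingFragments(PMDL Chain, PUCHAR CoalesceBuffer, ULONG BufferSize)
{
    ULONG used = 0;

    while (Chain != NULL && MmGetMdlByteCount(Chain) <= COALESCE_THRESHOLD) {
        ULONG  len = MmGetMdlByteCount(Chain);
        PUCHAR va  = MmGetSystemAddressForMdlSafe(Chain, NormalPagePriority);

        if (va == NULL || used + len > BufferSize)
            break;                    /* fall back to plain scatter/gather */

        RtlCopyMemory(CoalesceBuffer + used, va, len);
        used += len;
        Chain = Chain->Next;
    }
    return used;
}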

— Tim Roberts wrote:

> Maxim S. Shatskih wrote:
> >> Clearly, the TCP layer has to add its wrapper, and the IP layer has to
> >> add its wrapper. Copies are unavoidable.
> >
> > MS’s TCP/IP stack has been zero-copy on sends (when SO_SNDBUF is zero) for
> > around 10 years.
> >
> > Headers are added by chaining the buffers in front.
> >
> > At least on NT4, the outgoing TCP packet had the following buffers at the
> > miniport level:
> >
> > - MAC header
> > - IP and TCP headers
> > - data
> >
> > So, the IP and TCP headers are combined. I suspect that MS’s TCP/IP has no
> > strong division between the TCP and IP layers of the kind the UNIX
> > implementations have.
>
> Really? So, the DMA setup pulls 8 bytes from one place, then 10 bytes from
> another, then 32 bytes from another? I would have guessed that was worse than
> copying the data and DMAing one larger block.
>
> –
> Tim Roberts, xxxxx@probo.com
> Providenza & Boekelheide, Inc.

> -----Original Message-----

From: xxxxx@lists.osr.com [mailto:bounce-282829-xxxxx@lists.osr.com] On Behalf Of Tim Roberts
Sent: Thursday, April 05, 2007 5:24 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] TCP building of send packet

Maxim S. Shatskih wrote:
>> Clearly, the TCP layer has to add its wrapper, and the IP layer has to
>> add its wrapper. Copies are unavoidable.
>
> MS’s TCP/IP stack has been zero-copy on sends (when SO_SNDBUF is zero) for
> around 10 years.
>
> Headers are added by chaining the buffers in front.
>
> At least on NT4, the outgoing TCP packet had the following buffers at the
> miniport level:
>
> - MAC header
> - IP and TCP headers
> - data
>
> So, the IP and TCP headers are combined. I suspect that MS’s TCP/IP has no
> strong division between the TCP and IP layers of the kind the UNIX
> implementations have.

Really? So, the DMA setup pulls 8 bytes from one place, then 10 bytes from
another, then 32 bytes from another? I would have guessed that was worse than
copying the data and DMAing one larger block.

[PCAUSA] NDIS uses a feature of MDLs called “MDL chaining”. If you examine the MDL structure you will see a field called “Next”. For most subsystems Next is NULL. However, NDIS can use multiple MDLs chained together, each describing memory with a different origin.

For example, the “data” MDL is typically a mapping of part of the user-mode sender’s send buffer. Lower levels chain MDLs describing the MAC header and other headers at the front. The MDL chain (called an “NDIS_BUFFER chain” in NDIS lingo) is wrapped in an NDIS_PACKET structure, which is used to carry additional information (OOB data, NDIS private data, etc.), and then the “packet” is passed along.

MDLs for the headers can be re-used efficiently.
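
A hedged NDIS 5-era sketch of that chaining; the pool handles, the header buffer, and the existing data-buffer chain are assumed to have been set up elsewhere in the driver, and the function name is invented:

#include <ndis.h>

/* Sketch (NDIS 5.x style): wrap an existing data-buffer chain in an
   NDIS_PACKET and chain a driver-owned header buffer at the front. */
NDIS_STATUS
BuildHeaderFirstPacket(NDIS_HANDLE PacketPool, NDIS_HANDLE BufferPool,
                       PVOID HeaderVa, UINT HeaderLen,
                       PNDIS_BUFFER DataBuffer, PNDIS_PACKET *PacketOut)
{
    NDIS_STATUS  status;
    PNDIS_PACKET packet;
    PNDIS_BUFFER headerBuffer;

    NdisAllocatePacket(&status, &packet, PacketPool);
    if (status != NDIS_STATUS_SUCCESS)
        return status;

    NdisAllocateBuffer(&status, &headerBuffer, BufferPool,
                       HeaderVa, HeaderLen);
    if (status != NDIS_STATUS_SUCCESS) {
        NdisFreePacket(packet);
        return status;
    }

    /* Chain the data first, then push the header buffer onto the front,
       so the on-the-wire order is header followed by payload. */
    NdisChainBufferAtFront(packet, DataBuffer);
    NdisChainBufferAtFront(packet, headerBuffer);

    *PacketOut = packet;
    return NDIS_STATUS_SUCCESS;
}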

Similar to U**x “mbufs”?

Thomas F. Divine

Tim,

> So, the DMA setup pulls 8 bytes from one place, then 10 bytes
> from another, then 32 bytes from another? I would have guessed that was
> worse than copying the data and DMAing one larger block.

Please think about it carefully, and you will understand that copying is much, much worse…

The problem is that the copying approach inevitably either involves allocations and deallocations from non-paged pool, or requires TCPIP to keep a large store of unpageable memory if it goes for some custom memory-management scheme. However, if you just chain MDLs, you can reuse the client-provided buffer that holds the payload…

Anton Bassov

> Similar to U**x “mbufs”?

Yes, similar to the FreeBSD ones.

Linux uses the “skb” packet structure, which is allocated by TCP large enough to
leave room for the headers. The headers are then created by simply filling in
their slots.

Yes, Linux is not zero-copy: a copy from the user buffer to the skb is necessary,
but the copy routine is combined with the checksum calculation, and the checksum
calculation is inevitable anyway.
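
A tiny generic-C sketch of that fusion (not the actual Linux csum-and-copy routine): copy the bytes and accumulate the Internet checksum in the same pass, so the data is touched only once:

#include <stddef.h>
#include <stdint.h>

/* Sketch of a fused copy-and-checksum: copy 'len' bytes from 'src' to
   'dst' and return a 16-bit one's-complement sum (RFC 1071 style) over
   the copied data.  Generic C for illustration only. */
uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
    const uint8_t *s = (const uint8_t *)src;
    uint8_t *d = (uint8_t *)dst;
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        d[i]     = s[i];
        d[i + 1] = s[i + 1];
        sum += ((uint32_t)s[i] << 8) | s[i + 1];
    }
    if (len & 1) {
        d[len - 1] = s[len - 1];
        sum += (uint32_t)s[len - 1] << 8;
    }

    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;   /* caller folds this into the TCP checksum */
}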


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com