>
> With scsiport I can tell windows the maximum number of sg
> breaks I can have, but I can’t seem to find the same setting
> for ndis. Does it exist?
You have to copy all or part of the packet to private buffers if the
number of fragments exceeds your hardware capability. That’s interesting
to know: iperf with a send size of 1 is a test of lots of fragments. I’m
curious why that is so.
I think iperf just sends a whole load of 1 byte buffers, and NT’s TCP
consolidates them into one TCP packet.
It can actually improve your performance to copy fragments under
certain other conditions, as PCI(e) buses are slower with many tiny
transfer bursts. You might even want to measure PCI bus burst speed and
CPU memory copy speed at driver initialization, and dynamically adjust
fragment consolidation. It will depend on your chip architecture, and on
a virtual NIC like yours this may not be meaningful, or very different
optimizations may apply.
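
To make the trade-off concrete, here is a rough user-mode sketch of such
a decision, not driver code. The cost model, its field names and the
numbers fed into it are all assumptions for illustration; a real driver
would have to measure them on its own hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning values, e.g. measured once at driver init. */
struct dma_cost_model {
    double burst_setup_ns;   /* fixed cost of starting one bus burst */
    double bus_ns_per_byte;  /* streaming cost per byte on the bus   */
    double copy_ns_per_byte; /* CPU memcpy cost per byte             */
};

static size_t total_bytes(const size_t *len, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += len[i];
    return sum;
}

/* Estimated cost of letting the device fetch every fragment separately. */
static double cost_scatter(const struct dma_cost_model *m,
                           const size_t *len, size_t n)
{
    return (double)n * m->burst_setup_ns
         + (double)total_bytes(len, n) * m->bus_ns_per_byte;
}

/* Estimated cost of copying everything into one buffer, then one burst. */
static double cost_coalesce(const struct dma_cost_model *m,
                            const size_t *len, size_t n)
{
    double bytes = (double)total_bytes(len, n);
    return bytes * m->copy_ns_per_byte
         + m->burst_setup_ns
         + bytes * m->bus_ns_per_byte;
}

/* Coalesce only when the copy is cheaper than the bursts it saves. */
static bool should_coalesce(const struct dma_cost_model *m,
                            const size_t *len, size_t n)
{
    assert(m != NULL);
    if (n < 2)
        return false; /* nothing to merge */
    return cost_coalesce(m, len, n) < cost_scatter(m, len, n);
}
```

With a model like this, a handful of tiny header fragments gets merged,
while two page-sized fragments are left alone because copying 8K costs
more than the single burst it would save.
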
Actually, in this case I wrote the DMA handler object (Xen PV drivers)
and I have now put some code in to make the BuildScatterGatherList
routine fail with STATUS_INSUFFICIENT_RESOURCES if there would be too
many SG entries. When that happens, Windows does the
copy-into-a-single-buffer itself and retries.
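
The fail, copy, retry flow can be simulated in plain C as below. This is
only a model of the behaviour described above, not the real DMA adapter
interface: build_sg_list, map_packet, MAX_SG and the status enum are
made-up names, and map_packet stands in for the part that Windows does
on the driver's behalf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SG 4 /* hypothetical limit on SG entries */

enum sg_status { SG_OK = 0, SG_INSUFFICIENT_RESOURCES = -1 };

struct frag {
    const unsigned char *data;
    size_t len;
};

struct sg_list {
    size_t count;
    struct frag elem[MAX_SG];
};

/* Stand-in for the DMA handler: refuses lists that are too long. */
static enum sg_status build_sg_list(const struct frag *frags, size_t n,
                                    struct sg_list *out)
{
    assert(out != NULL);
    if (n > MAX_SG)
        return SG_INSUFFICIENT_RESOURCES;
    out->count = n;
    for (size_t i = 0; i < n; i++)
        out->elem[i] = frags[i];
    return SG_OK;
}

/* Stand-in for the caller: on failure, flatten into one buffer, retry. */
static enum sg_status map_packet(const struct frag *frags, size_t n,
                                 unsigned char *bounce, size_t bounce_len,
                                 struct sg_list *out)
{
    enum sg_status st = build_sg_list(frags, n, out);
    if (st != SG_INSUFFICIENT_RESOURCES)
        return st;

    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (frags[i].len > bounce_len - used)
            return SG_INSUFFICIENT_RESOURCES; /* bounce buffer too small */
        memcpy(bounce + used, frags[i].data, frags[i].len);
        used += frags[i].len;
    }

    struct frag flat = { bounce, used };
    return build_sg_list(&flat, 1, out); /* retry with a single entry */
}
```

Six 1-byte fragments (the iperf case) exceed the limit of 4 here, so they
end up as one 6-byte entry pointing at the bounce buffer.
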
Some NIC chips allow some of the data to be inline in the send queue
command entry, so you can consolidate the typical header fragments into
memory the NIC is going to read anyway, essentially consolidating the
command queue read with the (often small) data fragment reads that make
up the packet headers. This can get the average number of PCI bus read
bursts to less than 2 per send packet: one large burst read of a bunch
of command+inline data entries, and a large burst for the body of the
packet, which sometimes crosses a physical page boundary and so
sometimes takes two fragments (or more for jumbo packets). The headers
are typically already in the CPU cache, because they were just created
as an application send request was getting transformed into some send
packet requests. For small packets (typical Internet traffic is about
33% small packets, like TCP acks), some NIC hardware just consolidates
these inline into the command queue entry, allowing less than 1 DMA per
small packet (because you get a large burst for multiple command entries
and there is no fragment for the packet body), although with some memory
copy overhead. The send command queue is often in normal cache-coherent
memory, so copying from the packet headers to the memory that will be
DMA’d for the send command entry is often pretty fast. It’s one of those
things that needs careful tuning with a bus analyzer in combination with
CPU profiling.
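
As an illustration of the inline idea, a command entry could be filled
along these lines. The layout is invented for the example (INLINE_MAX,
MAX_PTRS and struct send_cmd do not describe any particular chip); the
point is only that leading small fragments are copied into the entry
while the packet body is left to be fetched by reference.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INLINE_MAX 64 /* hypothetical inline capacity of one entry */
#define MAX_PTRS   3  /* hypothetical number of fragment pointers  */

struct frag {
    const unsigned char *data;
    size_t len;
};

struct send_cmd {
    unsigned char inline_data[INLINE_MAX]; /* read with the entry itself */
    size_t inline_len;
    struct frag ptr[MAX_PTRS];             /* fetched by separate DMA    */
    size_t ptr_count;
};

/*
 * Copy leading fragments inline for as long as each one fits whole,
 * then reference the rest. Returns 0 on success, -1 if more fragments
 * remain than the entry has pointers for.
 */
static int build_send_cmd(const struct frag *frags, size_t n,
                          struct send_cmd *cmd)
{
    size_t i = 0;

    assert(cmd != NULL);
    cmd->inline_len = 0;
    cmd->ptr_count = 0;

    while (i < n && frags[i].len <= INLINE_MAX - cmd->inline_len) {
        memcpy(cmd->inline_data + cmd->inline_len,
               frags[i].data, frags[i].len);
        cmd->inline_len += frags[i].len;
        i++;
    }

    for (; i < n; i++) {
        if (cmd->ptr_count == MAX_PTRS)
            return -1;
        cmd->ptr[cmd->ptr_count++] = frags[i];
    }
    return 0;
}
```

A full-size TCP segment split as 14 + 20 + 20 + 1460 bytes ends up with
54 inline bytes and one pointer; a bare ack with only the three header
fragments needs no pointer at all.
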
In addition to jumbo packets (which I’ve never tested), LSO (which I do
support) allows packets up to just under 64K, which is up to 16 SG
segments (more, actually, because of the way Windows splits the headers
into separate buffers).
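
The figure of 16 comes from the number of physical pages a buffer can
touch. A small helper shows the arithmetic; the 4096-byte page size is
an assumption (it is the usual x86 value), and pages_spanned is just a
name picked for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size */

/* Number of pages touched by len bytes starting at address addr. */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    assert(addr <= UINTPTR_MAX - (len - 1)); /* no address wrap-around */

    uintptr_t first = addr / PAGE_SIZE_BYTES;
    uintptr_t last = (addr + (len - 1)) / PAGE_SIZE_BYTES;
    return (size_t)(last - first + 1);
}
```

A page-aligned buffer of 65535 bytes spans 16 pages, but the same buffer
starting 100 bytes into a page spans 17, and that is before counting any
separate header buffers.
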
Because I am sending to Xen, I have to deal with the limitations of the
Linux network stack when it comes to SG buffers. One of the limitations
is that the ETH+IP+TCP headers must all be in the same SG buffer. This
is seldom the case with packets my driver receives from Windows, so I am
already doing some coalescing. I have a pool of 512 byte ‘common
buffers’ that I use for this. 512 bytes is plenty of room to hold these
‘small packets’ that you are referring to, and one improvement I planned
to make was to put in a heuristic to copy some or all of the rest of the
packet into that buffer too depending on the effort required (e.g. mapping
and copying bytes from x buffers vs using up x ring slots). I’ll
definitely make that change based on what you have said. I think it
would be reasonable to assume that there is no mapping overhead for the
headers as they are already coming from kernel space, but the data could
be coming from user space so the heuristic might be different for
headers vs data…
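
For what it is worth, the header coalescing step looks roughly like this
when written as stand-alone C. It is a sketch under assumptions, not the
driver's actual code: struct frag, pull_headers and should_copy_tail are
illustrative names, and the "copy the tail if it fits" rule is the
simplest possible policy, ignoring the mapping cost question raised
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define COMMON_BUF_SIZE 512 /* size of one 'common buffer' */

struct frag {
    const unsigned char *data;
    size_t len;
};

/*
 * Copy the first hdr_len bytes of a fragment chain into buf so that the
 * ETH+IP+TCP headers are contiguous. On success returns 0 and reports
 * where the remaining data starts (fragment index and byte offset).
 */
static int pull_headers(const struct frag *frags, size_t n, size_t hdr_len,
                        unsigned char *buf,
                        size_t *next_frag, size_t *next_off)
{
    size_t copied = 0, i = 0, off = 0;

    assert(buf != NULL && next_frag != NULL && next_off != NULL);
    if (hdr_len > COMMON_BUF_SIZE)
        return -1;

    while (copied < hdr_len) {
        if (i == n)
            return -1; /* chain is shorter than the headers */

        size_t take = frags[i].len - off;
        if (take > hdr_len - copied)
            take = hdr_len - copied;

        memcpy(buf + copied, frags[i].data + off, take);
        copied += take;
        off += take;

        if (off == frags[i].len) { /* fragment used up, move on */
            i++;
            off = 0;
        }
    }

    *next_frag = i;
    *next_off = off;
    return 0;
}

/* Simplest policy: copy the rest too if it fits behind the headers. */
static bool should_copy_tail(size_t hdr_len, size_t tail_bytes)
{
    return hdr_len <= COMMON_BUF_SIZE
        && tail_bytes <= COMMON_BUF_SIZE - hdr_len;
}
```

If the TCP header shares a buffer with the payload, pull_headers reports
an offset into that buffer, so the remainder can still be passed on as
an SG entry without copying it.
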
Thanks for the response - very informative.
James