IP Reassembly Timeout

Perhaps one of you networking gurus can enlighten me.

I’m working with a telemetry system that ships 100 megabytes/second
continuously over a fiber gigabit Ethernet. The stream is sent as UDP
packets. The thing generates 8k byte samples at roughly a 12 kHz rate.
To keep the interrupt rate down, we actually send these as IP fragments,
with 7 fragments making up one 56k byte UDP packet.

The IP header includes a 16-bit identification number for each big
packet, so that all of the fragments can be identified and collected.
If you do the math, you’ll see that this identification number rolls
over every 37.5 seconds.
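A rough check of that math, assuming a decimal 100 MB/s and 56 KiB datagrams (both figures are as stated above; the exact byte counts are my assumptions):

```python
# Back-of-envelope check of the IP ID rollover period.
BYTES_PER_SEC = 100_000_000        # ~100 megabytes/second
DATAGRAM_BYTES = 7 * 8 * 1024      # 7 fragments x 8 KB samples = 56 KB

datagrams_per_sec = BYTES_PER_SEC / DATAGRAM_BYTES   # ~1744 datagrams/s
rollover_sec = 65536 / datagrams_per_sec             # 16-bit ID wraps here

print(round(rollover_sec, 1))      # ~37.6 s
```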

Here’s the problem. Say that, for whatever reason, the first three
fragments after power up never reach the destination. The receiving
system (running Windows Server 2008 R2) will hold on to the next four
fragments. When the identification number rolls over, 37.5 seconds
later, it takes the first three fragments of this new packet, combines
them with the four old fragments, and sends that out as the UDP
packet. It then holds on to the last four fragments of this new packet
until another 37.5 seconds passes. It stays out of sync like that forever.

My friend Google tells me that there used to be a registry parameter to
control this (IpReassemblyTimeout) but that the parameter is not used by
any 21st Century Windows system, where the timeout is hard-coded to 60
seconds.

Clearly, this 60 second number was invented before there were practical
real-life scenarios where these numbers could overlap in less time than
that. Is there really no solution to this problem?



Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Interesting problem…

If the problem is specifically at power-up (though it occurs to me this can
happen at any time), could you use a protocol where, at power-up, you send a
series of (say) eight short UDP packets? The protocol is that the receiver
would recognize short UDP datagrams as a power-up preamble and ignore them.
It would only handle your longer multi-fragment datagrams as data when they
arrive later.
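A minimal receiver-side sketch of that convention (the port number and preamble size threshold are made up; the post doesn't specify either):

```python
import socket

PREAMBLE_MAX = 64      # assumed cutoff: datagrams this short are preamble
PORT = 5000            # hypothetical telemetry port

def is_preamble(payload: bytes) -> bool:
    """Short datagrams are the power-up preamble; real data is ~56 KB."""
    return len(payload) <= PREAMBLE_MAX

def handle_sample(payload: bytes) -> None:
    pass  # application processing would go here

def receive_loop(sock: socket.socket) -> None:
    while True:
        payload, _addr = sock.recvfrom(65536)
        if is_preamble(payload):
            continue  # ignore preamble datagrams entirely
        handle_sample(payload)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    receive_loop(s)
```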

Another thought is to use two server sockets. Send 30 seconds of data to
one, then switch to the other.

Just grasping at straws.

FWIW.

Thomas F. Divine


From: “Tim Roberts”
Sent: Tuesday, February 22, 2011 8:32 PM
To: “Windows System Software Devs Interest List”
Subject: [ntdev] IP Reassembly Timeout

> —
> NTDEV is sponsored by OSR


Is the fragmentation reassembly offloaded to the card? I assume it is, as
otherwise it wouldn’t have any effect on interrupts.

If the stream is constant and runs 24/7, I’d go back to 9000-byte jumbo
packets (which should fit your 8K sample) and switch the network adapter
to a polled mode, with a polling interval chosen so that on average you
are picking up the packets before the incoming buffer is around 50% full.
I’m not sure how you’d do that under Windows, but I’m sure it’s possible;
in fact, I’m surprised that the adapter doesn’t detect a high interrupt
rate and do that automatically.

James

What exactly does IP fragmentation have to do with the NIC’s interrupt rate? The receiving NIC likely has lots of fancy options to trade latency for interrupt frequency, so why not just send each sample as a separate UDP datagram, and not do any IP fragmentation? Since it sounds like you’re already using jumbo frames, I suppose you can afford an extra few bytes of UDP headers.

Also, have you actually identified interrupt rate as a real problem? 12 kHz is certainly not zero, but it’s hardly the worst I’ve ever seen. (Jumbo UDP is such an efficient way to move bits). I can push 30 kHz of receive interrupts on my desktop box while the CPU is hovering around 10%. I guess if you can’t afford to spare any CPU at all, it matters.

Incidentally, you may find interesting a new performance counter in Windows Server 2008 R2: “\Per Processor Network Interface Card Activity(_TOTAL)\Interrupts/sec”.

As a prophylactic perhaps you could introduce a WFP callout or some other sort of filter that understands the flow and that will discard all fragments in the flow until it observes an initial fragment. Then, presuming a nearly reliable link, the system will be ‘synchronized’ and remain so without the oddity of being out of phase by an amount smaller than the reassembly timeout.

By your description I am assuming that the only thing you can ‘modify’ is the receiving system and not the source telemetry generator.

Good Luck,
Dave Cattley

David R. Cattley wrote:

> As a prophylactic perhaps you could introduce a WFP callout or some other sort of filter that understands the flow and that will discard all fragments in the flow until it observes an initial fragment. Then, presuming a nearly reliable link, the system will be ‘synchronized’ and remain so without the oddity of being out of phase by an amount smaller than the reassembly timeout.

Yes, we have tried something similar. If we have the hardware send a
burst of non-fragmented packets until the id rolls over once, that does
flush out any dangling fragments. We could do that at startup, but if
we happen to drop a packet in mid-flight (unlikely with fiber, but not
impossible), we’d be back in the same boat.

> By your description I am assuming that the only thing you can ‘modify’ is the receiving system and not the source telemetry generator.

Actually, no – we control the FPGA that generates the data stream.
We’re using Rocket I/O as the PHY, so there isn’t a separate NIC. We
have to generate everything ourselves.

We initially thought that switching to IPv6 would solve this, because
the identification number grows to 32 bits. However, in IPv6 the UDP
header is required to have a checksum, and at the time we have to send
the first fragment, the data for the other 6 fragments doesn’t even
exist yet.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Ok, given that you control the source, I think any scheme that avoids IP fragments is the direction I would go.
That could then be just generating a UDP stream using jumbo frames, as was suggested earlier, or handling the reassembly yourself inline above UDP. I would lean towards application-level reassembly.

Your FPGA is chunking out IP fragment sets one after the other in effect building the IP datagram ‘on the fly’ which contains a single UDP message. Instead, how about chunking out multiple UDP messages that start with a ‘sequence number’. The receiving software can easily resynchronize to that stream and detect out-of-order/loss conditions and just throw away data until it resynchronizes again. The sequence number can be as big as you want (32-bit or 64-bit) and either structured to know ‘start of frame’ or rely on knowledge of the frame size to UDP message size modulus.

The software gobbling up the UDP reads already has a buffer large enough to ‘reassemble’ the next frame of data into. It would just need to deal with reading the data piece-meal.
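That scheme might look roughly like this on the receive side (all names are invented; the 32-bit big-endian sequence header and seven-messages-per-frame layout are assumptions for illustration):

```python
import struct

MSGS_PER_FRAME = 7            # assumed: one 56K frame = 7 UDP messages
HEADER = struct.Struct(">I")  # assumed: 32-bit big-endian sequence number

class FrameAssembler:
    """Reassembles frames from sequence-numbered UDP messages, throwing
    away data after a gap until the next frame boundary comes around."""

    def __init__(self):
        self.chunks = []
        self.expected = None   # next sequence number we will accept

    def feed(self, msg: bytes):
        """Feed one UDP message; returns a completed frame, or None."""
        seq, = HEADER.unpack_from(msg)
        payload = msg[HEADER.size:]
        if seq != self.expected:
            # Loss or reordering: drop the partial frame and wait for a
            # message whose sequence number starts a new frame.
            self.chunks = []
            self.expected = None
            if seq % MSGS_PER_FRAME != 0:
                return None
        self.chunks.append(payload)
        self.expected = seq + 1
        if len(self.chunks) == MSGS_PER_FRAME:
            frame = b"".join(self.chunks)
            self.chunks = []
            return frame
        return None
```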

Just one of many possibly reasonable (but work) options. Not knowing how hard it is to change the FPGA logic in your specific case, the trade-off metrics are not all that clear.

Good Luck,
Dave Cattley

David R. Cattley wrote:

> Ok, given that you control the source, I think any scheme that avoids IP fragments is the direction I would go.
> That could then be just generating a UDP stream using jumbo frames, as was suggested earlier, or handling the reassembly yourself inline above UDP. I would lean towards application-level reassembly.

> Your FPGA is chunking out IP fragment sets one after the other in effect building the IP datagram ‘on the fly’ which contains a single UDP message. Instead, how about chunking out multiple UDP messages that start with a ‘sequence number’. The receiving software can easily resynchronize to that stream and detect out-of-order/loss conditions and just throw away data until it resynchronizes again. The sequence number can be as big as you want (32-bit or 64-bit) and either structured to know ‘start of frame’ or rely on knowledge of the frame size to UDP message size modulus.

Yesterday, we did try changing things to send 12,000 unfragmented
packets per second, instead of using the fragment trick. In that case,
I lost a significant fraction of the packets (like 20%). I have
SO_RCVBUF set to 6MB, and CPU loading is not an issue, so I “assumed”
that the issue was simply the overhead of so many socket reads.
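For what it's worth, the stack is free to clamp an SO_RCVBUF request, so it can be worth reading the value back to confirm what was actually granted, e.g.:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 6 * 1024 * 1024)
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
# Note: Linux reports double the requested value here; Windows reports
# the value actually in effect, which may have been clamped by policy.
print(granted)
```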

We seem to have come up with a workaround that solves the problem.
Instead of sequencing the identification numbers as 0, 1, 2, 3, …,
65534, 65535, 0, …, we are now sending them as
0,1,0,1,0,1,0,1,2,3,2,3,2,3,2,3,4,5,4,5, … That way, the number
doesn’t roll over for 150 seconds, which is beyond the 60 second
timeout. If there is a glitch, it only hits four packets, and then
cleans up. As slimy as that sounds, it does solve the problem. We’ve
run 100 million packets through a dozen power cycles this morning
without a single glitch.
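The pattern is easy to generate; roughly (a sketch reconstructed from the description above, not our actual FPGA logic):

```python
def ip_id_sequence():
    """Yield IP identification numbers as 0,1,0,1,0,1,0,1,2,3,2,3,...
    Each pair of IDs is repeated four times, so the full 16-bit space
    takes 4x as long (~150 s) to wrap as straight sequential numbering."""
    base = 0
    while True:
        for _ in range(4):
            yield base
            yield (base + 1) & 0xFFFF
        base = (base + 2) & 0xFFFF
```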

I appreciate the feedback from y’all.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.