Windows networking bottleneck - all socket notifications coming from 1 thread inside winsock?

Hi all

I’m sorry if this has been asked elsewhere.

We’ve been auditioning various frameworks for a socket-based
server application that has to handle high load (lots of connections).

We’ve tried:

* IOCP using boost asio, and hand-coded
* blocking calls
* overlapped calls with callback + wait state

The test cases are a number of connections making consecutive http
requests on each connection, so there’s not much connection churn;
nearly all the IO is send/recv.

In all cases, however, we’ve noticed a bottleneck in Windows, where core
0 goes to 100% (all in kernel), and after that, even though overall CPU
load may be less than 50% and the network is nowhere near pegged,
performance levels out.

It looks like the OS is doing all the completion notifications from a
single thread. Running under the debugger, it’s in a function inside
Winsock with a name like postsocketnotification or something (sorry, I
can’t remember the exact name).

Is this a known architectural issue with Windows? I’m not certain if it
happens on server OSes as well or not.

Anyone know if there are any TCP or winsock tuning parameters that could
maybe make the system use multiple cores for this?

Thanks

Adrien
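
(As a concrete illustration of the hand-coded IOCP model mentioned above: a minimal worker-thread loop might look roughly like the following. This is only a sketch, not the actual test code; the PerIoContext layout, buffer size, and shutdown handling are illustrative assumptions.)

#include <winsock2.h>
#include <windows.h>

// Illustrative per-operation context. The OVERLAPPED must be the first
// member so the pointer handed back by GetQueuedCompletionStatus can be
// cast back to the context.
struct PerIoContext {
    OVERLAPPED ov;
    WSABUF     wsabuf;
    char       data[64 * 1024];
};

// One of these runs on each worker thread; all of them drain the same port.
unsigned long __stdcall IocpWorker(void* iocpHandle)
{
    HANDLE iocp = static_cast<HANDLE>(iocpHandle);
    for (;;) {
        DWORD        bytes = 0;
        ULONG_PTR    key   = 0;   // per-connection key from CreateIoCompletionPort
        LPOVERLAPPED ov    = nullptr;

        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (ov == nullptr) {
            if (!ok) continue;    // the wait itself failed; real code checks GetLastError()
            break;                // NULL overlapped posted as a shutdown sentinel
        }
        PerIoContext* ctx = reinterpret_cast<PerIoContext*>(ov);
        // Handle the completed WSARecv/WSASend for the connection identified
        // by 'key' ('ok' and 'bytes' describe the result), then post the next
        // overlapped operation on that socket.
        (void)ok; (void)bytes; (void)ctx;
    }
    return 0;
}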

What kind of socket notifications are being posted?

It will be read and write operations (no files involved in one test -
serving from memory - no difference). Connections are being reused, so
there’s basically no open/close activity.

It happens with all models we’ve tried: IOCP, blocking sends,
callback+WaitForEtc. Doesn’t seem to matter which way we handle
completion.

I’ve seen it on several desktop OSes as well, from Win7 onwards. Didn’t
try earlier.

Adrien

------ Original Message ------
From: "xxxxx@broadcom.com"
To: "Windows System Software Devs Interest List"
Sent: 3/05/2017 4:57:02 AM
Subject: RE: [ntdev] Windows networking bottleneck - all socket notifications coming from 1 thread inside winsock?

> What kind of socket notifications are being posted?

What size of buffer are you posting from the application? How many TCP packets come with the PSH flag?

Our server sends back whatever it received. It uses 64 KB buffers for
reading, but the actual payloads (1 KB) in the http transactions are
small (under one packet), so probably a PSH on every packet in each direction.

Cheers

Adrien

------ Original Message ------
From: "xxxxx@broadcom.com"
To: "Windows System Software Devs Interest List"
Sent: 3/05/2017 11:05:22 AM
Subject: RE: [ntdev] Windows networking bottleneck - all socket notifications coming from 1 thread inside winsock?

> What size of buffer are you posting from the application? How many TCP
> packets come with the PSH flag?

Uh, no – there is no such architectural limitation in Windows.

Checking a simple test program of mine on my desktop (Windows 10 14393.1066) I can saturate a 10 Gb/s link (Intel X540 copper) with UDP traffic with only 20-30% usage of my i7-6700K @ 4GHz (4 cores, 8 logical CPUs).

TCP traffic is somewhat worse, as network saturation depends to a great degree on other network factors (including packet loss / reordering and the mix of traffic between hosts), but you should have no trouble handling several Gb/s of traffic with an IOCP-based design or thread pool IO on similar hardware.

Many factors can affect your performance, so without more information, I can’t help you much further.
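
(A hedged sketch of the thread pool IO alternative mentioned here, using the Vista+ Win32 thread pool; the per-socket context handling is an illustrative assumption, not a recommendation specific to this workload.)

#include <winsock2.h>
#include <windows.h>

// Called on a system thread-pool thread whenever an overlapped operation
// on the associated socket completes.
void CALLBACK OnIoComplete(PTP_CALLBACK_INSTANCE /*instance*/, void* context,
                           void* overlapped, ULONG ioResult,
                           ULONG_PTR bytesTransferred, PTP_IO io)
{
    // 'context' is whatever was passed to CreateThreadpoolIo for this socket.
    // Process the completed WSARecv/WSASend, then call StartThreadpoolIo(io)
    // again before issuing the next overlapped call on the socket.
    (void)context; (void)overlapped; (void)ioResult;
    (void)bytesTransferred; (void)io;
}

PTP_IO AttachSocketToThreadPool(SOCKET s, void* perSocketContext)
{
    PTP_IO io = CreateThreadpoolIo(reinterpret_cast<HANDLE>(s),
                                   OnIoComplete, perSocketContext, nullptr);
    // Before every overlapped WSARecv/WSASend on 's':
    //   StartThreadpoolIo(io);
    //   ...issue the call...
    //   if it fails immediately (not WSA_IO_PENDING), call CancelThreadpoolIo(io).
    return io;
}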

The select and callback models can be considered essentially useless – ignore them. They make it easier to port existing applications and provide compatibility with older software. Also don’t bother with older OSes, as the performance will only be worse unless you have a UP machine.

For IOCP or thread pool, what depth of pending reads have you been using?
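
(For context, "depth of pending reads" here means keeping several overlapped receives outstanding on each socket at once, so the stack always has user-mode buffers to complete into rather than one read at a time. A rough sketch under that assumption; the depth and buffer size are arbitrary.)

#include <winsock2.h>
#include <windows.h>

constexpr int kReadDepth = 4;   // outstanding WSARecv operations per socket (arbitrary)

struct PerIoContext {
    OVERLAPPED ov{};            // must be the first member
    WSABUF     wsabuf{};
    char       data[64 * 1024];
};

// Post kReadDepth overlapped receives on the socket. Completions arrive on
// the completion port as usual. NB: with several reads outstanding,
// completions can be processed out of order unless handling is serialized
// per connection.
bool PostInitialReads(SOCKET s, PerIoContext* contexts /* array of kReadDepth */)
{
    for (int i = 0; i < kReadDepth; ++i) {
        PerIoContext& ctx = contexts[i];
        ctx.wsabuf.buf = ctx.data;
        ctx.wsabuf.len = sizeof(ctx.data);
        DWORD flags = 0;
        int rc = WSARecv(s, &ctx.wsabuf, 1, nullptr, &flags, &ctx.ov, nullptr);
        if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING)
            return false;       // genuine failure; already-posted reads still complete
    }
    return true;
}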

OK, so the question then is what is causing the saturation of core 0
and leading to the bottleneck.

We’re certainly seeing that.

The test is a large number of small transactions. I don’t think a
comparison with flooding an interface with UDP is really relevant.
Perhaps if your test were a UDP echo server?

Effectively, in this test the client makes a number of connections and
sends individual http requests (all the same) on each, one request per
send call. Each request is less than the MTU, so it will be one packet
(possibly Nagled at the sender, I guess), and each received packet will
have the PSH flag set.

Our test server using IOCP reads the request, doesn’t even parse it,
and sends back a pre-cooked response, so there’s no cost incurred in the
preparation of the response or processing of the request. We are purely
trying to find out the best architecture to handle high load.

The saturation of core 0 causes a bottleneck long before the network is
saturated.

If we increase the payload sizes, then sure, we can easily saturate the
network. That’s not the point of this test.

So I’m wondering whether the outstanding recv buffer depth will make any
difference.

Is it possibly the NIC driver? E.g. is the ISR maybe affinitized to one
core or something?

The CPU is showing as kernel time. I imagine that Winsock posting IOCP
completions to a port is done in ring 3?

So maybe the serialization / affinitization is in the ISR, or in NDIS or
TCP?

I’ve seen it on several computers, with different network hardware. I’d
expect that if it were the NIC ISR, then it wouldn’t make a difference
how big the payload is, since it will be just packets. So I feel it
must be in TCP or higher.

Thanks

Adrien
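
(On that last question: the completion packet for an overlapped socket operation is queued to the port by the kernel as part of I/O completion; the application only dequeues it in user mode. If the sheer rate of completion packets turns out to matter, one standard, hedged option is SetFileCompletionNotificationModes, which suppresses the packet when a call happens to complete synchronously. Whether it helps with the core 0 saturation described here is an open question.)

#include <winsock2.h>
#include <windows.h>

// Ask the kernel not to queue a completion packet when an overlapped
// WSARecv/WSASend on this socket succeeds immediately; such results are
// handled inline by the caller instead. Documented caveat: not safe when a
// non-IFS Winsock LSP is installed.
bool SkipCompletionOnSuccess(SOCKET s)
{
    UCHAR flags = FILE_SKIP_COMPLETION_PORT_ON_SUCCESS |
                  FILE_SKIP_SET_EVENT_ON_HANDLE;
    return SetFileCompletionNotificationModes(
               reinterpret_cast<HANDLE>(s), flags) != FALSE;
}

// After this, a WSARecv/WSASend that returns 0 has already delivered its
// result: handle it inline and do not expect a matching completion packet.
// Calls that return WSA_IO_PENDING still complete through the port.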

------ Original Message ------
From: "Marion Bond"
To: "Windows System Software Devs Interest List"
Sent: 3/05/2017 1:04:17 PM
Subject: RE: [ntdev] Windows networking bottleneck - all socket notifications coming from 1 thread inside winsock?

> Uh, no – there is no such architectural limitation in Windows.

Interestingly, we don’t see the core 0 saturation if we run the client
and server on the same computer.

These tests in which we saw core 0 saturation were running across a LAN.

So that implicates something below Winsock?

We’re just trying a similar test bashing against IIS on 2k12 R2, and
seeing slightly elevated load on core 0 as well (yes we disabled
logging) compared to the other cores. Next test I guess is IIS on
Windows 10.

Adrien

Is RSS enabled on your NIC? Note that it will be disabled if there is a Hyper-V switch on the NIC.

Adrien de Croy wrote:

> We’re certainly seeing that.
>
> The test is a large number of small transactions. I don’t think a
> comparison with flooding an interface with UDP is really relevant.
> Perhaps if your test were a UDP echo server?
>
> The saturation of core 0 causes a bottleneck long before the network
> is saturated.
>
> If we increase the payload sizes, then sure, we can easily saturate the
> network. That’s not the point of this test.

Well then, what is the point? What are you expecting? You now have a
benchmark that shows that a Windows network server handling very small
packets will achieve CPU saturation before it achieves network
saturation. That’s a useful piece of information, and not terribly
surprising. Handling small packets quite obviously will have higher CPU
overhead per packet than large packets.

What are the actual numbers you are seeing? How many requests per
second, how many bytes per request?

You seem to think that there is a network knob you can turn somewhere
that will magically reduce the CPU load and make network throughput the
bottleneck again. I think your expectation is unrealistic. The real
question here is this: is the performance you are achieving enough for
your needs? If your benchmark represents a truly realistic load, and
you are handling the number of requests per second that you need, then
WHO CARES what resource runs out first?

Now, if you are reaching a limit below what you expect to require, then
you may need to look at multiple servers and some kind of load balancing.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Your observation that the loopback adapter has higher throughput for all types of network traffic is an unsurprising one. Microsoft has a very effective design, but even a brain-dead one could not fail to achieve an order of magnitude better than any real NIC that can effectively be driven by the host.

The more interesting question is: what is the request / response rate you see from IIS versus the one you see from your own software?

IIS has been heavily optimized by MSFT to the point that many of the operations execute entirely within KM. One of the key assumptions for this has been that requests will come from many diverse hosts at a high rate rather than from a single or few hosts at a high rate. Many HW, driver, and OS components operate under this assumption as well.

And to this end I asked you what your queue depth of pending reads was. If you will saturate the network with a single TCP connection (as opposed to the usual case of having traffic from many diverse hosts) you cannot hope to achieve performance without a significant queue of pending read buffers from UM.

Unfortunately the Winsock / Win32 APIs only provide a poor way to support this paradigm, where the UM app is required to 'lock the world' while queuing a new read (and this is fundamental to the OS design so no fault to the API), and none of the standard samples demonstrate the use of this technique. The good news is that even with these limitations to this paradigm (which are even worse on *nix) and barring brain-dead hardware / NIC driver setup, it is possible to nearly saturate even a 10 Gb/s NIC with a single TCP connection between appropriately configured hosts.

Yeah I checked that.

There is Hyper-V on the NIC, but RSS is still enabled (on both the
virtual switch and the core NIC).

Also interrupt moderation is on.

Adrien

------ Original Message ------
From: "xxxxx@broadcom.com"
To: "Windows System Software Devs Interest List"
Sent: 4/05/2017 10:45:59 AM
Subject: RE: [ntdev] Windows networking bottleneck - all socket notifications coming from 1 thread inside winsock?

> Is RSS enabled on your NIC? Note that it will be disabled if there is a
> Hyper-V switch on the NIC.

>> If we increase the payload sizes, then sure, we can easily saturate the
>> network. That’s not the point of this test.
>
> Well then, what is the point? What are you expecting? You now have a
> benchmark that shows that a Windows network server handling very small
> packets will achieve CPU saturation before it achieves network
> saturation.

No. It’s only saturating core 0. There are another 11 that could be
doing some more work to get the network utilisation higher.

------ Original Message ------
From: "Marion Bond"
To: "Windows System Software Devs Interest List"
Sent: 4/05/2017 12:12:47 PM
Subject: RE: [ntdev] Windows networking bottleneck - all socket notifications coming from 1 thread inside winsock?

> Your observation that the loopback adapter has higher throughput for
> all types of network traffic is an unsurprising one. Microsoft has a
> very effective design, but even a brain-dead one could not fail to
> achieve an order of magnitude better than any real NIC that can
> effectively be driven by the host.

Sure, I expect loopback to be a lot higher; it’s optimised a lot for
IPC.

The key difference is it’s not saturating core 0. So whatever it is
that causes core 0 saturation to be a bottleneck is happening below
loopback. That’s the point of making that observation.

> The more interesting question is: what is the request / response rate
> you see from IIS versus the one you see from your own software?

Yeah, serving the same file in IIS, we got about 160k/s whereas we
struggle to break 40k/s.

We aren’t doing things like kernel-mode SendFile though (and can’t since
we have to be able to filter it).

Also we aren’t caching file content, which I’m confident IIS must be
doing.
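
(For reference, the kernel-mode SendFile being alluded to is presumably Winsock's TransmitFile, which hands the transfer to the kernel and so, as noted, cannot be filtered in user mode. A minimal, illustrative call is sketched below; the handle names and flags are assumptions, not the thread's actual code.)

#include <winsock2.h>
#include <mswsock.h>            // TransmitFile, TRANSMIT_FILE_BUFFERS
#pragma comment(lib, "mswsock.lib")

// Send pre-built HTTP response headers followed by the whole file contents,
// letting the kernel drive the file transfer. Note: client SKUs of Windows
// limit the number of simultaneous TransmitFile operations.
bool SendFileWithHeaders(SOCKET s, HANDLE file, void* headers, DWORD headerLen)
{
    TRANSMIT_FILE_BUFFERS tfb = {};
    tfb.Head = headers;         // pre-cooked response headers
    tfb.HeadLength = headerLen;

    // 0 for the byte count and per-send chunk size means "send the entire
    // file with default chunking"; a non-NULL OVERLAPPED would make it async.
    return TransmitFile(s, file, 0, 0, nullptr, &tfb, TF_USE_KERNEL_APC) != FALSE;
}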

> IIS has been heavily optimized by MSFT to the point that many of the
> operations execute entirely within KM. One of the key assumptions for
> this has been that requests will come from many diverse hosts at a high
> rate rather than from a single or few hosts at a high rate. Many HW,
> driver, and OS components operate under this assumption as well.

OK, so even having a lot of connections from a single host could be
causing too many entries in a hash bucket or something, and slowing down
packet processing?

> And to this end I asked you what your queue depth of pending reads was.
> If you will saturate the network with a single TCP connection (as
> opposed to the usual case of having traffic from many diverse hosts)
> you cannot hope to achieve performance without a significant queue of
> pending read buffers from UM.

Understood. We aren’t saturating with a single connection though. We
are running anywhere from 10 to 10000 connections.

With 10000 sockets each with a pending read, there’s a lot of buffer
space there ready to copy packets into.

But each request and response is a single packet or less, and the client
won’t send another request until it gets a response (it’s not a
pipelining client). So I don’t think having multiple pending reads on
each socket will make a difference in this test.

> Unfortunately the Winsock / Win32 APIs only provide a poor way to
> support this paradigm, where the UM app is required to 'lock the world'
> while queuing a new read (and this is fundamental to the OS design so
> no fault to the API), and none of the standard samples demonstrate the
> use of this technique. The good news is that even with these
> limitations to this paradigm (which are even worse on *nix) and barring
> brain-dead hardware / NIC driver setup, it is possible to nearly
> saturate even a 10 Gb/s NIC with a single TCP connection between
> appropriately configured hosts.


Understood.

At the moment I’m really trying to figure out the cause of the core 0
saturation, and it seems to be related to the rate of notifications
(whether IOCP completions, callbacks, or some other socket notification
mechanism). It looks like the serialization is happening below Winsock
(at least below loopback).

Even the IIS test was very interesting. Core 0 was doing more work than
the other cores, but perhaps more interestingly, its work was 100%
kernel work, while other cores were a mixture and one was 100% UM work.

Very odd workload distribution.

Adrien
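
(Given that the rate of notifications is the suspect, one mitigation on the dequeue side, assuming an IOCP-based server like the one described, is draining completions in batches with GetQueuedCompletionStatusEx instead of one at a time. Whether it moves the core 0 bottleneck, which appears to be below Winsock, is an open question; the batch size below is arbitrary.)

#include <winsock2.h>
#include <windows.h>

// Drain up to 64 completions per wake-up to amortize the per-notification
// dequeue overhead across many small transactions.
void DrainCompletions(HANDLE iocp)
{
    OVERLAPPED_ENTRY entries[64];
    ULONG removed = 0;

    for (;;) {
        if (!GetQueuedCompletionStatusEx(iocp, entries, ARRAYSIZE(entries),
                                         &removed, INFINITE, FALSE))
            break;              // real code would inspect GetLastError()

        for (ULONG i = 0; i < removed; ++i) {
            // entries[i].lpCompletionKey identifies the connection,
            // entries[i].lpOverlapped the operation, and
            // entries[i].dwNumberOfBytesTransferred the byte count.
            // Process each completion and post the next read/write here.
        }
    }
}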

@Adrien de Croy

> There is Hyper-V on the NIC, but RSS is still enabled (on both the virtual switch and the core NIC).

There is traditional RSS (the *RSS registry value) and VPORT RSS (the *RssOnHostVPorts registry value). With a virtual switch created on a NIC, the *RSS registry value is ignored by the NIC.

*RssOnHostVPorts enables RSS with a virtual switch, but only on 2016+, on NICs that support it.

First, how does Hyper-V come into this? If you are doing performance tests on a VM, unless you are specifically testing VM performance, I recommend that you use real hardware, as there are so many sources of interference from the hypervisor, other VMs, and the host OS that it is usually a waste of time to try performance metrics that way.

Re your other points, this clarifies your test setup tremendously. It is now clear that you have many connections from a single host, each with a single request / response pattern.

The saturation that you are seeing is most likely caused by RSS being completely disabled, or by the algorithm in use being defeated by the fact that all connections are from a single host. The purpose of RSS is to distribute NIC interrupt processing across several cores so that independent streams of packets can be processed in parallel. In the absence of this, a core (usually 0) will handle all of this processing. This will also be greatly affected by the design of the NIC driver. Which NIC / driver version do you have?

Are the data rates you list in bytes or transactions? In either case they seem low for both your software and IIS.

I wouldn’t read too much into the IIS workload distribution, as in my experience it always looks odd due to the design of IIS. The fact that they implement so much in KM inherently skews the results, but then there is the effect of multiple app pools and the many other features that IIS has.

Hi

Thanks.

I took the Hyper-V switch off, so there’s just the bare-metal NIC /
driver etc.

It’s an Intel 82579V on-board (Asus MoBo). Driver e1i63x64.sys ver
12.15.22.6 (5 April 2016)

If I connect over localhost, we don’t see core 0 saturation.

RSS is enabled.

There is an advanced attribute called “no description” with value 1… I
wonder if that’s the number of cores to use for processing interrupts or
something.

I looked in the registry under the Enum\PCI key for the device, and
there are keys like “interrupt management” and “affinity policy”… I
wonder if a bit of playing with some of those attributes may help.

Adrien

------ Original Message ------
From: “Marion Bond”
To: “Windows System Software Devs Interest List”
Sent: 5/05/2017 11:11:10 AM
Subject: RE: [ntdev] Windows networking bottleneck - all socket
notifications coming from 1 thread inside winsock?

>First, how does Hyper-V come into this? If you are doing performance
>tests on a VM, unless you are specifically testing VM performance, I
>recommend that you use real hardware as there are so many sources of
>interference from both the hypervisor as well as other VMs and the host
>OS that it is usually a waste of time to try performance metrics that
>way.
>
>
>
>Re your other points, this clarifies your test setup tremendously. It
>is now clear that you have many connections from a single host – each
>with a single request / response pattern.
>
>
>
>The saturation that you are seeing is most likely caused by RSS being
>completely disabled or the algorithm being used being defeated by the
>fact that all connections are from a single host. The purpose of RSS
>is to distribute NIC interrupt processing access several cores so that
>independent streams of packets can be processed in parallel. In the
>absence of this, a core (usually 0) will handle all of this processing.
> This will also be greatly affected by the design of the NIC driver.
>Which NIC / driver version do you have?
>
>
>
>Are the data rates you list in bytes or transactions? In either case
>they seem low for both your software and IIS
>
>
>
>I wouldn’t read too much into the IIS workload distribution as in my
>experience it always looks odd due to the design of IIS. The fact that
>they implement so much in KM inherently skews the results, but then
>there is the effect of multiple app pools and the many other features
>that IIS has.
>
>
>
>
>
>Sent from Mail https: for
>Windows 10
>
>
>
>From: Adrien de Croy mailto:xxxxx
>Sent: May 3, 2017 11:09 PM
>To: Windows System Software Devs Interest List
>mailto:xxxxx
>Subject: Re: [ntdev] Windows networking bottleneck - all socket
>notifications coming from 1 thread inside winsock?
>
>
>
>
>
>
>
>------ Original Message ------
>
>From: “Marion Bond”
>
>To: “Windows System Software Devs Interest List”
>
>Sent: 4/05/2017 12:12:47 PM
>
>Subject: RE: [ntdev] Windows networking bottleneck - all socket
>notifications coming from 1 thread inside winsock?
>
>
>
>>Your observation that the loopback adapter has higher throughput for
>>all types of network traffic is an unsurprising one. Microsoft has a
>>very effective design, but even a brain dead one could not fail to
>>achieve an order of magnitude better than any real NIC than can
>>effectively be driven by the host.
>>
>>
>>
>Sure, I expect loopback to be a lot higher, it’s optimised a lot for
>IPC.
>
>
>
>The key difference is it’s not saturating core 0. So whatever it is
>that causes core 0 saturation to be a bottleneck is happening below
>loopback. That’s the point of making that observation.
>
>
>
>
>
>
>>
>>
>>The more interesting question, is what is the request / response rate
>>you see from IIS versus the one you see from your own software?
>>
>Yeah, serving the same file in IIS, we got about 160k/s whereas we
>struggle to break 40k/s.
>
>
>
>We aren’t doing things like kernel-mode SendFile though (and can’t
>since we have to be able to filter it).
>
>
>
>Also we aren’t caching file content, which I’m confident IIS must be
>doing.
>
>
>
>>
>>
>>IIS has been heavily optimized by MSFT to the point that many of the
>>operations execute entirely within KM. One of the key assumptions for
>>this has been that requests will come from many diverse hosts at a
>>high rate rather than from a single of few hosts at a high rate. Many
>>HW, driver, OS components operate under this assumption as well.
>>
>OK, so even having a lot of connections from a single host could be
>putting too many entries in a hash bucket or something, and slowing
>down packet processing?
>
>
>
>
>>
>>
>>And to this end I asked you what your queue depth of pending reads
>>was.  If you want to saturate the network with a single TCP connection
>>(as opposed to the usual case of having traffic from many diverse
>>hosts), you cannot hope to achieve performance without a significant
>>queue of pending read buffers from UM.
>>
>Understood. We aren’t saturating with a single connection though. We
>are running anywhere from 10 to 10000 connections.
>
>
>
>With 10000 sockets each with a pending read, there’s a lot of buffer
>space there ready to copy packets into.
>
>
>
>But each request and response is a single packet or less, and the
>client won’t send another request until it gets a response (the client
>doesn’t pipeline).  So I don’t think having multiple pending reads on
>each socket will make a difference in this test.
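
For what it’s worth, keeping several reads pending per socket looks
roughly like the sketch below (overlapped WSARecv on a socket already
associated with an IOCP; the buffer size, depth, and context structure
are illustrative, not taken from the thread):

    #include <winsock2.h>
    #pragma comment(lib, "ws2_32.lib")

    struct RecvContext {
        WSAOVERLAPPED ov;    // one OVERLAPPED per outstanding read
        WSABUF        buf;
        char          data[64 * 1024];
    };

    // Post 'depth' concurrent reads on a socket that is already
    // associated with a completion port.
    bool PostReads(SOCKET s, int depth)
    {
        for (int i = 0; i < depth; ++i) {
            RecvContext* ctx = new RecvContext{};
            ctx->buf.buf = ctx->data;
            ctx->buf.len = sizeof(ctx->data);
            DWORD flags = 0;
            int rc = WSARecv(s, &ctx->buf, 1, NULL, &flags,
                             &ctx->ov, NULL);
            if (rc == SOCKET_ERROR &&
                WSAGetLastError() != WSA_IO_PENDING) {
                delete ctx;
                return false;
            }
            // Success or WSA_IO_PENDING: the completion (and ctx)
            // will be delivered through the IOCP.
        }
        return true;
    }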
>
>
>
>
>>
>>
>>Unfortunately the Winsock / Win32 APIs only provide a poor way to
>>support this paradigm: the UM app is required to ‘lock the world’
>>while queuing a new read (this is fundamental to the OS design, so no
>>fault of the API), and none of the standard samples demonstrate the
>>use of this technique.  The good news is that even with these
>>limitations to this paradigm (which are even worse on *nix), and
>>barring a brain dead hardware / NIC driver setup, it is possible to
>>nearly saturate even a 10 Gb/s NIC with a single TCP connection
>>between appropriately configured hosts.
>>
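
For concreteness, the completion side of that paradigm is usually a
small per-thread worker loop that drains completions in batches, for
example with GetQueuedCompletionStatusEx; a minimal sketch, with the
per-operation dispatch left as a comment:

    #include <windows.h>

    DWORD WINAPI IocpWorker(LPVOID param)
    {
        HANDLE iocp = (HANDLE)param;
        OVERLAPPED_ENTRY entries[64];
        for (;;) {
            ULONG removed = 0;
            if (!GetQueuedCompletionStatusEx(iocp, entries, 64,
                                             &removed, INFINITE,
                                             FALSE)) {
                break;  // port closed or fatal error
            }
            for (ULONG i = 0; i < removed; ++i) {
                // entries[i].lpOverlapped identifies the completed
                // operation; recover the per-operation context and
                // re-post the next read here.
            }
        }
        return 0;
    }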
>
>
>understood.
>
>
>
>At the moment I’m really trying to figure out the cause of the core 0
>saturation, and it seems to be related to the rate of notifications
>(whether IOCP completions, callback notifications, or some other
>socket notification mechanism).  It looks like the serialization is
>happening below winsock (at least below loopback).
>
>
>
>Even the IIS test was very interesting.  Core 0 was doing more work
>than the other cores, but perhaps more interestingly, its work was 100%
>kernel work, while other cores were a mixture and one was 100% UM work.
>
>
>
>Very odd workload distribution.
>
>
>
>Adrien
>
>
>
>
>
>
>>
>>
>>
>>
>>Sent from Mail for Windows 10
>>
>>
>>
>>From: Adrien de Croy mailto:xxxxx
>>Sent: May 3, 2017 6:32 PM
>>To: Windows System Software Devs Interest List
>>mailto:xxxxx
>>Subject: Re: [ntdev] Windows networking bottleneck - all socket
>>notifications coming from 1 thread inside winsock?
>>
>>
>>
>>
>>
>>Interestingly, we don’t see the core 0 saturation if we run the client
>>and server on the same computer.
>>
>>
>>
>>These tests which we saw core 0 saturation were running across a LAN.
>>
>>
>>
>>So that implicates something below winsock?
>>
>>
>>
>>We’re just trying a similar test bashing against IIS on 2k12 R2, and
>>seeing slightly elevated load on core 0 compared to the other cores as
>>well (yes, we disabled logging).  Next test I guess is IIS on
>>Windows 10.
>>
>>
>>Adrien
>>
>>
>>
>>
>>
>>------ Original Message ------
>>
>>From: “Adrien de Croy”
>>
>>To: “Windows System Software Devs Interest List”
>>
>>Sent: 4/05/2017 9:33:02 AM
>>
>>Subject: Re: [ntdev] Windows networking bottleneck - all socket
>>notifications coming from 1 thread inside winsock?
>>
>>
>>
>>>OK, so the question then is what is causing the saturation of core 0
>>>and leading to a bottleneck.
>>>
>>>
>>>
>>>We’re certainly seeing that.
>>>
>>>
>>>
>>>The test is a large number of small transactions.  I don’t think a
>>>comparison with flooding an interface with UDP is really relevant.
>>>Perhaps if your test were a UDP echo server?
>>>
>>>
>>>
>>>Effectively, in this test the client makes a number of connections
>>>and sends individual HTTP requests (all the same) on each of them,
>>>one request per send call.  Each request is less than the MTU, so it
>>>fits in 1 packet (possibly Nagled at the sender, I guess), and each
>>>received packet will have a PSH flag.
>>>
>>>
>>>
>>>Our test server using IOCP reads the request, doesn’t even parse it,
>>>and sends back a pre-cooked response, so there’s no cost incurred in
>>>the preparation of the response or processing of the request.  We are
>>>purely trying to find out the best architecture to handle high load.
>>>
>>>
>>>
>>>The saturation of core 0 causes a bottleneck long before the network
>>>is saturated.
>>>
>>>
>>>
>>>If we increase the payload sizes, then sure we can easily saturate
>>>the network.  That’s not the point of this test.
>>>
>>>
>>>
>>>So I’m wondering whether the outstanding recv buffer depth will make
>>>any difference.
>>>
>>>
>>>
>>>Is it possibly the NIC driver?  E.g. the ISR may be affinitized to 1
>>>core or something?
>>>
>>>
>>>
>>>The CPU is showing as kernel time.  I imagine that winsock posting a
>>>completion to an IOCP port is done in ring 3?
>>>
>>>
>>>
>>>So maybe the serialization / affinitization is in the ISR, NDIS, or
>>>TCP?
>>>
>>>
>>>
>>>I’ve seen it on several computers with different network hardware.
>>>I’d expect that if it were the NIC ISR, then it wouldn’t make a
>>>difference how big the payload is, since it would just be packets.
>>>So I feel it must be in TCP or higher.
>>>
>>>
>>>
>>>Thanks
>>>
>>>
>>>
>>>Adrien
>>>
>>>
>>>
>>>------ Original Message ------
>>>
>>>From: “Marion Bond”
>>>
>>>To: “Windows System Software Devs Interest List”
>>>
>>>
>>>Sent: 3/05/2017 1:04:17 PM
>>>
>>>Subject: RE: [ntdev] Windows networking bottleneck - all socket
>>>notifications coming from 1 thread inside winsock?
>>>
>>>
>>>
>>>>Uh, no – there is no such architectural limitation in Windows
>>>>
>>>>
>>>>
>>>>Checking a simple test program of mine on my desktop (windows 10
>>>>14393.1066) I can saturate a 10 Gb/s link (intel X540 copper) with
>>>>UDP traffic with only 20-30% usage of my I7-6700K @ 4GHz (4 cores, 8
>>>>Logical CPUs)
>>>>
>>>>
>>>>
>>>>TCP traffic is somewhat worse as network saturation depends to a
>>>>great degree on other network factors (including packet loss /
>>>>reordering and the mix of traffic between hosts), but you should
>>>>have no trouble with handing several Gb/s of traffic with an IOCP
>>>>based design or thread pool IO on similar hardware.
>>>>
>>>>
>>>>
>>>>Many factors can affect your performance, so without more
>>>>information, I can’t help you much further
>>>>
>>>>
>>>>
>>>>Sent from Mail for Windows 10
>>>>
>>>>
>>>>
>>>>From: Adrien de Croy mailto:xxxxx
>>>>Sent: May 2, 2017 1:14 AM
>>>>To: Windows System Software Devs Interest List
>>>>mailto:xxxxx
>>>>Subject: [ntdev] Windows networking bottleneck - all socket
>>>>notifications coming from 1 thread inside winsock?
>>>>
>>>>
>>>>
>>>>Hi all
>>>>
>>>>
>>>>
>>>>I’m sorry if this has been asked elsewhere.
>>>>
>>>>
>>>>
>>>>We’ve been auditioning various different frameworks for socket-based
>>>>server application that has to handle high load (lots of
>>>>connections).
>>>>
>>>>
>>>>
>>>>We’ve tried:
>>>>
>>>>
>>>>
>>>>
>>>>* IOCP using boost asio, and hand-coded
>>>>
>>>>* blocking calls
>>>>
>>>>* overlapped calls with callback + wait state
>>>>
>>>>
>>>>
>>>>The test cases are a number of connections making consecutive http
>>>>requests on the connection, so there’s not much connection churn,
>>>>nearly all the IO is send/recv.
>>>>
>>>>
>>>>
>>>>In all cases however, we’ve noticed a bottleneck in Windows, where
>>>>core 0 goes to 100% (all in kernel), and after that, even though
>>>>overall CPU load may be less than 50%, and network is nowhere near
>>>>pegged, the performance levels out.
>>>>
>>>>
>>>>
>>>>It looks like the OS is doing all the completion notifications from
>>>>a single thread. Running in the debugger it’s in a function inside
>>>>winsock with a name like postsocketnotification or something (sorry
>>>>can’t remember exact name).
>>>>
>>>>
>>>>
>>>>Is this a known architectural issue with Windows? I’m not certain
>>>>if it happens on server OSes as well or not.
>>>>
>>>>
>>>>
>>>>Anyone know if there are any TCP or winsock tuning parameters that
>>>>could maybe make the system use multiple cores for this?
>>>>
>>>>
>>>>
>>>>Thanks
>>>>
>>>>
>>>>Adrien
>>>>
>>>
>>
>

OK, now run some multithreaded netperf or ntttcp, to verify that RSS actually works.

Adrien de Croy wrote:

I looked in the registry under the Enum\PCI key for the device, and
there are keys like “interrupt management” and “affinity policy”… I
wonder if a bit of playing with some of those attributes may help.

It is possible that dinking with the “affinity policy” might force the
system to spread the interrupts to other processors.

HOWEVER, the authors of the driver would not have forced an affinity
policy without a good reason.  To be more specific, it may be that all
of their interrupts are forced to CPU 0 because their driver cannot
handle multiple simultaneous interrupts on multiple processors.
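
For reference, a sketch of reading the value in question; the device
instance path below is a placeholder (the real one comes from the NIC’s
Enum\PCI key), and DevicePolicy holds an IRQ_DEVICE_POLICY value such
as 5 (IrqPolicySpreadMessagesAcrossAllProcessors):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        // Placeholder instance path; substitute the real device key.
        const char* key =
            "SYSTEM\\CurrentControlSet\\Enum\\PCI\\"
            "VEN_XXXX&DEV_XXXX\\<instance>\\"
            "Device Parameters\\Interrupt Management\\Affinity Policy";
        HKEY h;
        if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, key, 0, KEY_READ,
                          &h) == ERROR_SUCCESS) {
            DWORD policy = 0, size = sizeof(policy);
            if (RegQueryValueExA(h, "DevicePolicy", NULL, NULL,
                                 (LPBYTE)&policy,
                                 &size) == ERROR_SUCCESS) {
                printf("DevicePolicy = %lu\n", policy);
            }
            RegCloseKey(h);
        }
        return 0;
    }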


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.