Hi
Thanks.
I took the Hyper-V switch off, so there’s just the bare-metal NIC /
driver etc.
It’s an on-board Intel 82579V (Asus motherboard). Driver e1i63x64.sys,
version 12.15.22.6 (5 April 2016).
If I connect over localhost, we don’t see core 0 saturation.
RSS is enabled.
There is an advanced attribute called “no description” with value 1… I
wonder if that’s the number of cores to use for processing interrupts or
something.
I looked in the registry under the Enum\PCI key for the device, and
there are keys like “Interrupt Management” and “Affinity Policy”… I
wonder whether a bit of experimenting with some of those values might
help.
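
FWIW, a quick way to double-check what the stack itself reports is the
documented SIO_QUERY_RSS_SCALABILITY_INFO WSAIoctl (Windows 8+,
mstcpip.h). A minimal, untested sketch; note it reports stack-wide
state, not per-adapter:

#include <winsock2.h>
#include <mstcpip.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    // Any socket will do; the ioctl queries global TCP/IP stack state.
    SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    RSS_SCALABILITY_INFO info = { 0 };
    DWORD bytes = 0;
    if (WSAIoctl(s, SIO_QUERY_RSS_SCALABILITY_INFO, NULL, 0,
                 &info, sizeof(info), &bytes, NULL, NULL) == 0)
        printf("RssEnabled: %d\n", info.RssEnabled);
    else
        printf("WSAIoctl failed: %d\n", WSAGetLastError());

    closesocket(s);
    WSACleanup();
    return 0;
}
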
Adrien
------ Original Message ------
From: “Marion Bond”
To: “Windows System Software Devs Interest List”
Sent: 5/05/2017 11:11:10 AM
Subject: RE: [ntdev] Windows networking bottleneck - all socket
notifications coming from 1 thread inside winsock?
>First, how does Hyper-V come into this? If you are doing performance
>tests on a VM, unless you are specifically testing VM performance, I
>recommend that you use real hardware; there are so many sources of
>interference from the hypervisor, other VMs, and the host OS that it is
>usually a waste of time to try performance metrics that way.
>
>
>
>Re your other points, this clarifies your test setup tremendously. It
>is now clear that you have many connections from a single host – each
>with a single request / response pattern.
>
>
>
>The saturation that you are seeing is most likely caused by RSS being
>completely disabled, or by the algorithm in use being defeated by the
>fact that all connections are from a single host. The purpose of RSS
>is to distribute NIC interrupt processing across several cores so that
>independent streams of packets can be processed in parallel. In the
>absence of this, a single core (usually 0) will handle all of this
>processing. This will also be greatly affected by the design of the
>NIC driver. Which NIC / driver version do you have?
>
>
>
>Are the data rates you list in bytes or transactions? In either case
>they seem low, for both your software and IIS.
>
>
>
>I wouldn’t read too much into the IIS workload distribution, as in my
>experience it always looks odd due to the design of IIS. The fact that
>they implement so much in KM inherently skews the results, but then
>there is the effect of multiple app pools and the many other features
>that IIS has.
>
>
>
>
>
>
>
>
>From: Adrien de Croy
>Sent: May 3, 2017 11:09 PM
>To: Windows System Software Devs Interest List
>Subject: Re: [ntdev] Windows networking bottleneck - all socket
>notifications coming from 1 thread inside winsock?
>
>
>
>
>
>
>
>------ Original Message ------
>
>From: “Marion Bond”
>
>To: “Windows System Software Devs Interest List”
>
>Sent: 4/05/2017 12:12:47 PM
>
>Subject: RE: [ntdev] Windows networking bottleneck - all socket
>notifications coming from 1 thread inside winsock?
>
>
>
>>Your observation that the loopback adapter has higher throughput for
>>all types of network traffic is an unsurprising one. Microsoft has a
>>very effective design, but even a brain-dead one could not fail to
>>achieve an order of magnitude better than any real NIC that can
>>effectively be driven by the host.
>>
>>
>>
>Sure, I expect loopback to be a lot higher; it’s optimised a lot for
>IPC.
>
>
>
>The key difference is that it’s not saturating core 0. So whatever it
>is that causes core 0 saturation to be a bottleneck is happening below
>loopback. That’s the point of making that observation.
>
>
>
>
>
>
>>
>>
>>The more interesting question is: what request / response rate do you
>>see from IIS versus from your own software?
>>
>Yeah, serving the same file in IIS, we got about 160k/s, whereas we
>struggle to break 40k/s.
>
>
>
>We aren’t doing things like kernel-mode SendFile though (and can’t,
>since we have to be able to filter it).
>
>
>
>Also we aren’t caching file content, which I’m confident IIS must be
>doing.
>
>
>
>>
>>
>>IIS has been heavily optimized by MSFT, to the point that many of the
>>operations execute entirely within KM. One of the key assumptions for
>>this has been that requests will come from many diverse hosts at a
>>high rate, rather than from a single host or a few hosts at a high
>>rate. Many HW, driver, and OS components operate under this
>>assumption as well.
>>
>OK, so even having a lot of connections from a single host could be
>causing too many entries in a hash bucket or something, and slowing
>down packet processing?
>
>
>
>
>>
>>
>>And to this end I asked you what your queue depth of pending reads
>>was. If you want to saturate the network with a single TCP connection
>>(as opposed to the usual case of having traffic from many diverse
>>hosts), you cannot hope to achieve performance without a significant
>>queue of pending read buffers from UM.
>>
>Understood. We aren’t saturating with a single connection though. We
>are running anywhere from 10 to 10000 connections.
>
>
>
>With 10000 sockets, each with a pending read, there’s a lot of buffer
>space there ready to copy packets into.
>
>
>
>But each request and response is a single packet or less, and the
>client won’t send another request until it gets a response (the client
>doesn’t pipeline). So I don’t think having multiple pending reads on
>each socket will make a difference in this test.
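>
>(For reference, this is roughly what a deeper read queue would look
>like on our side; a minimal, untested sketch, with the per-read
>bookkeeping simplified:)
>
>#include <winsock2.h>
>
>// One OVERLAPPED + buffer per outstanding read.
>struct ReadCtx {
>    WSAOVERLAPPED ov;     // must be re-zeroed before each WSARecv
>    WSABUF        buf;
>    char          data[4096];
>};
>
>// Post 'depth' concurrent overlapped reads on one socket; completions
>// arrive on whatever IOCP the socket is associated with.
>bool PostReads(SOCKET s, ReadCtx* ctxs, int depth)
>{
>    for (int i = 0; i < depth; ++i) {
>        ZeroMemory(&ctxs[i].ov, sizeof(ctxs[i].ov));
>        ctxs[i].buf.buf = ctxs[i].data;
>        ctxs[i].buf.len = sizeof(ctxs[i].data);
>        DWORD flags = 0;
>        if (WSARecv(s, &ctxs[i].buf, 1, NULL, &flags, &ctxs[i].ov, NULL) != 0
>                && WSAGetLastError() != WSA_IO_PENDING)
>            return false;  // genuine failure; WSA_IO_PENDING is normal
>    }
>    return true;
>}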
>
>
>
>
>>
>>
>>Unfortunately the Winsock / Win32 APIs only provide a poor way to
>>support this paradigm, where the UM app is required to ‘lock the
>>world’ while queuing a new read (and this is fundamental to the OS
>>design, so no fault to the API), and none of the standard samples
>>demonstrate the use of this technique. The good news is that even
>>with these limitations of this paradigm (which are even worse on
>>*nix), and barring a brain-dead hardware / NIC driver setup, it is
>>possible to nearly saturate even a 10 Gb/s NIC with a single TCP
>>connection between appropriately configured hosts.
>>
>
>
>Understood.
>
>
>
>At the moment I’m really trying to figure out the cause of the core 0
>saturation, and it seems to be related to the rate of notifications
>(whether IOCP completions, event notifications, or some other socket
>notification mechanism). It looks like the serialization is happening
>below winsock (at least below loopback).
>
>
>
>Even the IIS test was very interesting. Core 0 was doing more work
>than the other cores, but perhaps more interestingly, its work was 100%
>kernel work, while other cores were a mixture and one was 100% UM work.
>
>
>
>Very odd workload distribution.
>
>
>
>Adrien
>
>
>
>
>
>
>>
>>
>>
>>
>>
>>
>>
>>From: Adrien de Croy
>>Sent: May 3, 2017 6:32 PM
>>To: Windows System Software Devs Interest List
>>Subject: Re: [ntdev] Windows networking bottleneck - all socket
>>notifications coming from 1 thread inside winsock?
>>
>>
>>
>>
>>
>>Interestingly, we don’t see the core 0 saturation if we run the
>>client and server on the same computer.
>>
>>
>>
>>The tests in which we saw core 0 saturation were running across a LAN.
>>
>>
>>
>>So that implicates something below winsock?
>>
>>
>>
>>We’re just trying a similar test bashing against IIS on 2k12 R2, and
>>seeing slightly elevated load on core 0 compared to the other cores as
>>well (yes, we disabled logging). Next test I guess is IIS on
>>Windows 10.
>>
>>
>>Adrien
>>
>>
>>
>>
>>
>>------ Original Message ------
>>
>>From: “Adrien de Croy”
>>
>>To: “Windows System Software Devs Interest List”
>>
>>Sent: 4/05/2017 9:33:02 AM
>>
>>Subject: Re: [ntdev] Windows networking bottleneck - all socket
>>notifications coming from 1 thread inside winsock?
>>
>>
>>
>>>OK, so the question then is what is causing the saturation of core 0
>>>and leading to a bottleneck.
>>>
>>>
>>>
>>>We’re certainly seeing that.
>>>
>>>
>>>
>>>The test is a large number of small transactions. I don’t think a
>>>comparison with flooding an interface with UDP is really relevant.
>>>Perhaps it would be if your test were a UDP echo server?
>>>
>>>
>>>
>>>Effectively, in this test the client makes a number of connections
>>>and sends individual HTTP requests (all the same) on each, one per
>>>send call. Each request is less than the MTU, so it will be in one
>>>packet (possibly Nagled at the sender, I guess), and each received
>>>packet will have the PSH flag set.
>>>
>>>
>>>
>>>Our test server, using IOCP, reads the request (it doesn’t even
>>>parse it) and sends back a pre-cooked response, so there’s no cost
>>>incurred in preparing the response or processing the request. We are
>>>purely trying to find the best architecture to handle high load.
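>>>
>>>(The guts of it are just the standard GetQueuedCompletionStatus loop;
>>>a trimmed sketch, with g_response / g_response_len standing in for
>>>the canned reply, the initial WSARecv posting assumed, and error
>>>handling / cleanup omitted:)
>>>
>>>#include <winsock2.h>
>>>
>>>struct ReadCtx { WSAOVERLAPPED ov; WSABUF buf; char data[4096]; };
>>>
>>>extern const char* g_response;      // the pre-cooked HTTP response
>>>extern int         g_response_len;
>>>
>>>// Worker: drain completions, fire the canned response, repost the read.
>>>DWORD WINAPI Worker(LPVOID port)
>>>{
>>>    for (;;) {
>>>        DWORD        bytes = 0;
>>>        ULONG_PTR    key   = 0;     // completion key carries the SOCKET
>>>        LPOVERLAPPED ov    = NULL;
>>>
>>>        if (!GetQueuedCompletionStatus((HANDLE)port, &bytes, &key, &ov,
>>>                                       INFINITE) || bytes == 0)
>>>            continue;               // failed I/O or peer close; cleanup omitted
>>>
>>>        SOCKET s = (SOCKET)key;
>>>        ReadCtx* ctx = CONTAINING_RECORD(ov, ReadCtx, ov);
>>>
>>>        // No parsing; blocking send of the pre-cooked response for brevity.
>>>        send(s, g_response, g_response_len, 0);
>>>
>>>        // Keep a recv pending on the socket at all times.
>>>        ZeroMemory(&ctx->ov, sizeof(ctx->ov));
>>>        DWORD flags = 0;
>>>        WSARecv(s, &ctx->buf, 1, NULL, &flags, &ctx->ov, NULL);
>>>    }
>>>}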
>>>
>>>
>>>
>>>The saturation of core 0 causes a bottleneck long before the network
>>>is saturated.
>>>
>>>
>>>
>>>If we increase the payload sizes, then sure, we can easily saturate
>>>the network. That’s not the point of this test.
>>>
>>>
>>>
>>>So I’m wondering whether the outstanding recv buffer depth will make
>>>any difference.
>>>
>>>
>>>
>>>Is it possibly the NIC driver? E.g. is the ISR maybe affinitized to
>>>one core or something?
>>>
>>>
>>>
>>>The CPU time is showing as kernel time. I imagine that winsock
>>>posting a completion to an IOCP is done in ring 3?
>>>
>>>
>>>
>>>So maybe the serialization / affinitization is in the ISR, or in
>>>NDIS or TCP?
>>>
>>>
>>>
>>>I’ve seen it on several computers with different network hardware.
>>>I’d expect that if it were the NIC ISR, it wouldn’t make a difference
>>>how big the payload is, since it’s all just packets. So I feel it
>>>must be in TCP or higher.
>>>
>>>
>>>
>>>Thanks
>>>
>>>
>>>
>>>Adrien
>>>
>>>
>>>
>>>------ Original Message ------
>>>
>>>From: “Marion Bond”
>>>
>>>To: “Windows System Software Devs Interest List”
>>>
>>>
>>>Sent: 3/05/2017 1:04:17 PM
>>>
>>>Subject: RE: [ntdev] Windows networking bottleneck - all socket
>>>notifications coming from 1 thread inside winsock?
>>>
>>>
>>>
>>>>Uh, no – there is no such architectural limitation in Windows.
>>>>
>>>>
>>>>
>>>>Checking a simple test program of mine on my desktop (Windows 10
>>>>14393.1066), I can saturate a 10 Gb/s link (Intel X540 copper) with
>>>>UDP traffic using only 20-30% of my i7-6700K @ 4 GHz (4 cores, 8
>>>>logical CPUs).
>>>>
>>>>
>>>>
>>>>TCP traffic is somewhat worse, as network saturation depends to a
>>>>great degree on other network factors (including packet loss /
>>>>reordering and the mix of traffic between hosts), but you should
>>>>have no trouble handling several Gb/s of traffic with an IOCP-based
>>>>design or thread-pool IO on similar hardware.
>>>>
>>>>
>>>>
>>>>Many factors can affect your performance, so without more
>>>>information I can’t help you much further.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>From: Adrien de Croy
>>>>Sent: May 2, 2017 1:14 AM
>>>>To: Windows System Software Devs Interest List
>>>>Subject: [ntdev] Windows networking bottleneck - all socket
>>>>notifications coming from 1 thread inside winsock?
>>>>
>>>>
>>>>
>>>>Hi all
>>>>
>>>>
>>>>
>>>>I’m sorry if this has been asked elsewhere.
>>>>
>>>>
>>>>
>>>>We’ve been auditioning various frameworks for a socket-based server
>>>>application that has to handle high load (lots of connections).
>>>>
>>>>
>>>>
>>>>We’ve tried:
>>>>
>>>>
>>>>
>>>>* IOCP (using boost asio, and hand-coded)
>>>>
>>>>* blocking calls
>>>>
>>>>* overlapped calls with callback + wait state
>>>>
>>>>
>>>>
>>>>The test cases are a number of connections making consecutive HTTP
>>>>requests on the same connection, so there’s not much connection
>>>>churn; nearly all the IO is send/recv.
>>>>
>>>>
>>>>
>>>>In all cases, however, we’ve noticed a bottleneck in Windows, where
>>>>core 0 goes to 100% (all in kernel), and after that, even though
>>>>overall CPU load may be less than 50% and the network is nowhere
>>>>near pegged, the performance levels out.
>>>>
>>>>
>>>>
>>>>It looks like the OS is doing all the completion notifications from
>>>>a single thread. Running in the debugger, it’s in a function inside
>>>>winsock with a name like postsocketnotification or something (sorry,
>>>>can’t remember the exact name).
>>>>
>>>>
>>>>
>>>>Is this a known architectural issue with Windows? I’m not certain
>>>>whether it happens on server OSes as well.
>>>>
>>>>
>>>>
>>>>Anyone know if there are any TCP or winsock tuning parameters that
>>>>could maybe make the system use multiple cores for this?
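>>>>
>>>>(For reference, this is the sort of thing our hand-coded IOCP
>>>>variant does: several threads blocked in GetQueuedCompletionStatus
>>>>on one port, one per logical CPU. A minimal sketch, where Worker is
>>>>whatever routine drains the completions:)
>>>>
>>>>#include <winsock2.h>
>>>>
>>>>DWORD WINAPI Worker(LPVOID port);   // assumed: drains completions
>>>>
>>>>HANDLE StartWorkers(void)
>>>>{
>>>>    SYSTEM_INFO si;
>>>>    GetSystemInfo(&si);
>>>>
>>>>    // One completion port shared by all sockets.
>>>>    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
>>>>
>>>>    // One worker per logical CPU; the kernel wakes waiters LIFO, so
>>>>    // idle workers soak up bursts.
>>>>    for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i)
>>>>        CreateThread(NULL, 0, Worker, port, 0, NULL);
>>>>
>>>>    // Each accepted socket is then associated with the same port,
>>>>    // using the socket handle as the completion key:
>>>>    //   CreateIoCompletionPort((HANDLE)s, port, (ULONG_PTR)s, 0);
>>>>    return port;
>>>>}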
>>>>
>>>>
>>>>
>>>>Thanks
>>>>
>>>>
>>>>Adrien
>>>>