NdisGetCurrentProcessorCpuUsage doesn't change

Hi,
I am working on a Windows 8 kernel driver, and I need to know the current processor usage.
For that I am using NdisGetCurrentProcessorCpuUsage,
but for some reason the value it returns doesn’t change (it stays at 32…),
even when I drive the CPU to around 100% (Resource Monitor shows it at 100%).
Any idea why?

NdisGetCurrentProcessorCpuUsage is not useful. It returns the average amount of activity on that processor since boot. So if there’s a spike in activity, it won’t be reflected in NdisGetCurrentProcessorCpuUsage.
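
To see why a since-boot average barely moves, here is a small illustration in plain C (not driver code). The numbers are made up: a machine that has been up for 24 hours at 32% average load, followed by one minute at 100%. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of a since-boot average: busy time divided by
 * total uptime, expressed as a whole percentage. Units are seconds here,
 * but any consistent tick unit would do. */
static unsigned since_boot_percent(uint64_t busy_ticks, uint64_t total_ticks)
{
    if (total_ticks == 0)
        return 0; /* nothing measured yet */
    return (unsigned)((busy_ticks * 100u) / total_ticks);
}
```

With those figures, `since_boot_percent(27648, 86400)` (before the spike) and `since_boot_percent(27708, 86460)` (after a full minute at 100%) both evaluate to 32, which would match a reading that appears stuck.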

Estimating processor usage is tricky, and generally drivers don’t do it.

If you absolutely must, you can get better information from NdisGetCurrentProcessorCounts. Poll it, and compute the deltas. If KernelAndUser increases a lot, but IdleCount doesn’t increase much, then the CPU is loaded. If IdleCount increases much more than KernelAndUser increases, then the CPU is not loaded.
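
The polling approach above could be sketched roughly as follows. This is plain C so that it can be compiled and tried outside the kernel; in a driver the two counters would come from NdisGetCurrentProcessorCounts for the current processor. The struct and function names are invented for the example, and it assumes (as the MSDN formula quoted later in this thread implies) that KernelAndUser also advances while the processor is idle.

```c
#include <stdint.h>

/* One reading of the two counters. Names are made up for this sketch. */
struct cpu_sample {
    uint32_t idle;            /* idle count at sample time */
    uint32_t kernel_and_user; /* kernel + user count at sample time */
};

/* Returns the load in percent (0..100) over the interval between two
 * samples taken on the same processor, or -1 if no time has elapsed. */
static int cpu_load_percent(struct cpu_sample prev, struct cpu_sample cur)
{
    /* Unsigned subtraction stays correct across one counter wraparound. */
    uint32_t d_idle  = cur.idle - prev.idle;
    uint32_t d_total = cur.kernel_and_user - prev.kernel_and_user;

    if (d_total == 0)
        return -1;
    if (d_idle > d_total) /* guard against skew between the two counters */
        d_idle = d_total;

    /* 100 - 100 * (idle delta) / (total delta), in 64-bit to avoid overflow */
    return (int)(100u - (uint32_t)(((uint64_t)d_idle * 100u) / d_total));
}
```

For example, if IdleCount advances by 100 while KernelAndUser advances by 1000, the function returns 90. Both samples have to come from the same processor for the deltas to mean anything.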

This still doesn’t tell you how *important* the load is. E.g., the CPU might be mining for imaginary currency using a lowest-priority thread. It also doesn’t tell you how badly latency affects the user. E.g., if you’re sharing the CPU with the input thread of the foreground application, the CPU might appear to be lightly loaded, but when you start stealing cycles, the visible user experience quickly starts to suffer.

How often are you calling it? Remember that CPU usage is not an
instantaneous measure. At any given instant, a CPU is either 100% busy
or 0% busy. To get the kind of measure you’re thinking of, the number has to
be integrated over time.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Yikes. The average amount of CPU activity since BOOT?

I’m afraid that’s not likely to be useful at all. For almost anyone.

Somebody probably needs to file a bug against the doc page for this function, then (somebody other than ME… given I don’t know much about NDIS anymore). Because the docs are, at the very least, misleading (http://msdn.microsoft.com/en-us/library/windows/hardware/ff562627(v=vs.85).aspx):

“For example, a miniport driver might call this function periodically and, as its percentage of CPU usage trends higher, disable interrupts on a NIC and switch to polling the state of the NIC.”

Yuck.

Peter
OSR

On 04-Dec-2013 20:52, Jeffrey Tippet wrote:

Estimating processor usage is tricky, and generally drivers don’t do it.

So how does Windows or NDIS help to distribute load without RSS? Or with
RSS, does it consider the kind of work going on on the CPU?
– pa

> So how does Windows or NDIS help to distribute load without RSS?

Am I wrong that this relies on the APIC’s logic for distributing interrupts across cores?

MSI-based netcards can be smarter and make this decision themselves.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

[Peter] Somebody probably needs to file a bug against the doc page for this function

Done.

[Pavel] So how does Windows or NDIS help to distribute load without RSS?

RSS is the mechanism by which NDIS distributes load. So without RSS, NDIS does not distribute load. (NDIS may still use RST, receive side throttle, to make receive processing play well with other tasks running on the same CPU, but RST fundamentally doesn’t move traffic across CPUs.) Higher levels of the stack may still distribute load across processors; e.g., http.sys has a set of worker threads affinitized to certain CPUs.

[Pavel] Or with RSS, does it consider kind of work going on the CPU?

RSS doesn’t use this API – nothing in the OS actually uses this API :) RSS changes with each OS version, and its current version actually has several algorithms (“profiles”) to choose from. So I hate to specify exactly how it works, because it depends on several factors. In general, I don’t believe that RSS takes usermode activity into consideration when load-balancing; it’s mostly meant to balance one TCP stream against another TCP stream.

@Maxim: Once RSS is enabled over a NIC that supports MSI-X, the NIC uses hints from RSS to program the APIC and determine which processors to interrupt. Generally NICs don’t implement their own load distribution algorithms; they just follow the instructions of RSS (or VMQ, which is similar).
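
As a rough picture of why RSS balances flows against each other rather than reacting to other CPU activity: the NIC hashes each packet's flow identifiers and uses the low bits of the hash to look up a processor in an indirection table supplied by the stack. The sketch below shows only that lookup. The hash values are arbitrary stand-ins (a real NIC computes a Toeplitz hash), and the function name and table contents are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Maps a per-flow hash to a processor number through an indirection table.
 * table_len must be a nonzero power of two so that the low bits of the
 * hash can be used directly as the table index. */
static uint8_t rss_pick_cpu(uint32_t flow_hash, const uint8_t *table,
                            size_t table_len)
{
    return table[flow_hash & (uint32_t)(table_len - 1)];
}
```

Every packet of a given flow produces the same hash, so it always lands on the same processor. Rebalancing then amounts to rewriting table entries, which moves whole flows and says nothing about what else is running on a CPU.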

Jeffrey Tippet has his facts straight nearly all the time he opens his mouth. (That’s most of the reason that I read his posts here – to learn from him.) In this case, though, he’s wrong on one tiny and mostly inconsequential point. The NIC never programs the APIC. The NIC programs its interrupt generation logic (which is usually done through MSI-X but doesn’t have to be) so that the interrupt is sent to the local APIC of the processor associated with the queue in the NIC. This association is managed by TCP/IP and NDIS, and delivered (I think - this is a case where Jeffrey will probably correct me) as part of the RSS hashing mechanism. The data programmed into the NIC will generally target exactly one local APIC, thus specifying the exact target processor.
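
To make "target exactly one local APIC" concrete, here is a sketch of how an x86 MSI message address encodes a destination APIC ID in physical destination mode, following the layout in the Intel SDM (0xFEE in bits 31:20, destination ID in bits 19:12). It is illustrative only: on Windows the OS supplies the message address and data, and a driver does not normally compute them by hand. The function name is made up.

```c
#include <stdint.h>

/* Builds an x86 MSI message address that targets a single local APIC.
 * Redirection hint (bit 3) and destination mode (bit 2) are left at 0,
 * i.e. physical mode with no redirection, so exactly one processor
 * receives the interrupt. */
static uint32_t msi_address_for_apic(uint8_t dest_apic_id)
{
    return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
}
```

For instance, `msi_address_for_apic(3)` yields `0xFEE03000`. An MSI-X table holds one such address/data pair per message, which is what lets each queue in the NIC interrupt a different processor.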

  • Jake Oshins
    (former interrupt guy)
    Windows Kernel Team

NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

If the NIC can’t do RSS (as is covered well in Jeffrey Tippet’s response to this) then NDIS and the networking stack don’t do anything to spread load across multiple cores.

Most NICs that don’t support RSS, however, will also not support MSI-X (or any other mechanism for directly targeting a specific processor with an interrupt) and their INF will leave the interrupt routing policy at its default, by simply not including anything about interrupt policy.

When the default interrupt policy is used on x86 or x64, the policy boils down to “pick one APIC cluster and target the device’s interrupt at all the processors in that APIC cluster, with each individual interrupt delivered to exactly one of them.” What happens then depends on the chipset hardware. Some chipsets will round-robin the interrupts. Some chipsets will always pick the processor with the lowest (or highest) APIC ID. Some very old chipsets will inspect the contents of all the targeted local APIC TPR registers and pick the processor with the lowest value.
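
The three chipset behaviors described above can be modeled with a toy simulation. This is not how any particular chipset is implemented; the type names, the policy enum, and the sample APIC IDs and TPR values used with it are all invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arbitration policies, one per behavior described above. */
enum arb_policy { ARB_ROUND_ROBIN, ARB_LOWEST_APIC_ID, ARB_LOWEST_TPR };

/* Minimal stand-in for a local APIC in the targeted cluster. */
struct lapic {
    uint8_t apic_id; /* APIC ID of the processor */
    uint8_t tpr;     /* current task priority register value */
};

/* Returns the index (within set[0..n-1], n > 0) of the processor that
 * receives the next interrupt. rr_cursor carries round-robin state
 * between calls and is ignored by the other two policies. */
static size_t pick_target(enum arb_policy policy, const struct lapic *set,
                          size_t n, size_t *rr_cursor)
{
    size_t best = 0;
    size_t i;

    switch (policy) {
    case ARB_ROUND_ROBIN:
        best = *rr_cursor % n;
        *rr_cursor = (best + 1) % n;
        break;
    case ARB_LOWEST_APIC_ID:
        for (i = 1; i < n; i++)
            if (set[i].apic_id < set[best].apic_id)
                best = i;
        break;
    case ARB_LOWEST_TPR:
        for (i = 1; i < n; i++)
            if (set[i].tpr < set[best].tpr)
                best = i; /* ties keep the earlier entry */
        break;
    }
    return best;
}
```

Only the round-robin policy spreads interrupts evenly. The lowest-APIC-ID policy always picks the same processor, which would be consistent with one core carrying most of the load.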

Thus interrupts from a non-RSS NIC can still be spread out among a set of processors. On machines with fewer than 8 logical processors (HyperThreads, cores, etc.) this will be all the processors in the machine. On machines with 8 or more logical processors running in xAPIC mode, this will be some set of 4 or fewer processors that are part of an APIC cluster. In machines running in x2APIC mode, the notion of APIC cluster is a little more fluid and there might be larger sets of targeted processors.

And while this won’t specifically help the way that RSS does, it can often allow a NIC to get past the bottleneck of a single core.

  • Jake Oshins
    (former interrupt guy)
    Windows Kernel Team

Thank you Jeffrey for the clarification.
I hope this will be useful for the OP in the context of his question…
If I understand correctly, the advice is basically not to use the CPU
load to change the behavior of the driver.
Instead, measure % CPU and other performance metrics with external tools,
in specially created test workloads, correlate them with the driver activity
and optimize whatever makes sense?

thanks,
– pa

>@Maxim: Once RSS is enabled over a NIC that supports MSI-X, the NIC uses hints from RSS to

Thanks!

And, before RSS, a random CPU (picked by the APIC) was chosen for both the NIC’s ISR and the NIC’s NDIS DPC (which executed the whole receive path) - is this correct?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

It depends on what the OP is trying to do. I wouldn’t really recommend that an NDIS driver attempt to look at CPU load and try to do something fancy. If you need to distribute network traffic load across processors, use RSS. That’s absolutely the tried and successful path.

>And, before RSS, the random CPU (by APIC) was chosen for both NIC’s ISR and NIC’s NDIS DPC (which executed the whole receive path) - is this correct?

I don’t think you can really use more than one CPU for a particular source of PCI-style level driven interrupts.

MSI(X) interrupts can have multiple CPU affinity.

Thanks everyone! This thread is very informative indeed.

My driver is for a Wi-Fi NIC.
We have a problem with a specific Atom CPU, which has 4 cores.
Our NIC causes the CPU to be loaded as follows:
Core0 - 100%
Core1 - 30%
Core2 - 30%
Core3 - 30%

The problem in my case is that we don’t support RSS at the moment.

I will try to use NdisGetCurrentProcessorCounts as suggested by Jeffrey,
using the formula from MSDN (http://msdn.microsoft.com/en-us/library/windows/hardware/ff562625(v=vs.85).aspx):
CpuUsage = 100-100*(Idle - Idle[n])/(KernelAndUser - KernelAndUser[n]);
Or I may even just round-robin across all 4 cores when calling NdisMQueueDpcEx in interrupt context.

But this may solve only the DPC CPU load-balancing problem, not the interrupt problem.
Any other suggestions?
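
For what it's worth, the round-robin idea could look something like the sketch below: pick the next processor in turn and build a one-bit affinity mask, which a driver would then place in the GROUP_AFFINITY it passes to NdisMQueueDpcEx. This is plain C with hypothetical names. It ignores processor groups and concurrent callers, and it would only move DPC work, not the interrupt itself.

```c
#include <stdint.h>

/* Returns a mask with exactly one bit set: the next processor after
 * *last_cpu (wrapping around) whose bit is set in active_mask.
 * Updates *last_cpu to the chosen processor number.
 * Returns 0 and leaves *last_cpu unchanged if active_mask is empty. */
static uint64_t next_dpc_target(uint64_t active_mask, unsigned *last_cpu)
{
    unsigned cpu;

    if (active_mask == 0)
        return 0;

    cpu = *last_cpu;
    do {
        cpu = (cpu + 1) % 64; /* step through bit positions 0..63 */
    } while (!(active_mask & ((uint64_t)1 << cpu)));

    *last_cpu = cpu;
    return (uint64_t)1 << cpu;
}
```

With `active_mask = 0xF` and `*last_cpu` starting at 3, successive calls return 0x1, 0x2, 0x4, 0x8 and then 0x1 again. In a real ISR the cursor update would need to be safe against concurrent interrupts, for example with an interlocked operation.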

Does your driver spend excessive time in DPC or in ISR?

Do you use bus mastering or PIO?

Do you have interrupt moderation in the chip?

Have you profiled it to find bottlenecks?

Do you rely on any expensive software routines, such as encryption, etc?

Could you offload any non-hardware stuff to workitems?

Does your driver spend excessive time in DPC or in ISR? - It spends most of the time in DPC.

Do you use bus mastering or PIO? - Bus mastering.

Do you have interrupt moderation in the chip? - Yes.

Have you profiled it to find bottlenecks? - No, can you explain how?

Do you rely on any expensive software routines, such as encryption, etc? - No, only HW.

Could you offload any non-hardware stuff to workitems? - Yes, but how can that help level the CPU usage?

> Our NIC causes the CPU to work as follows:
> Core0 - 100%
> Core1 - 30%
> Core2 - 30%
> Core3 - 30%
>
> The problem in my case is that we don’t support RSS at the moment.

So what? That’s the normal picture if you don’t support RSS. Why do you dislike it?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

I dislike it because poor CPU utilization keeps us from getting high throughput.
If we could level the RX processing across the 4 cores, we should see better throughput.

>I dislike it because we can’t get high throughput because of bad CPU utilization,

Isn’t Wi-Fi too slow to really have such issues?

>If we could level the RX processing on the 4 cores, we should see better throughput.

Then do support RSS.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Such high CPU consumption is quite unusual for data rates as moderate as Wi-Fi’s.

This may indicate a design error. I suggest running xperf as a first step.

Is the device on PCIe?