Looking for xperf/kernrate/wpa performance analysis best practices

When Windows 7 was released I had a performance issue that I was able to tediously work through with kernrate. It’s my understanding that the tools have only improved since then. I have a renewed interest in analyzing the portions of my code paths that show increased (perhaps unnecessary) delays. CPU utilization is uncharacteristically high, and I’d like to correct it.

What tools/workflow are people using to evaluate degraded performance in their drivers? I ran into this article about collecting detailed performance data with xperf. Is this still the way to go? Are there alternatives I should consider? Does anybody have a good walkthrough that could bring me up to speed quickly? I’ve found that most of the documentation details what the tool(s) can do, but rarely walks through usage scenarios.

kernrate was so painful…

Try using the Windows Performance Recorder (WPR) to collect the logs instead of xperf (same data, just more convenient). The Windows Performance Analyzer (WPA) has also undergone a big UI change since I wrote that article but the same ideas hold.

Scott,

I have to respectfully disagree on kernrate being painful. While it had its quirks, like so many Microsoft driver tools it had two big advantages that WPA eliminated:

  1. The data came out as text, so it was possible with a slight massage to make files that could be compared to see what the changes were actually doing to the drivers.

  2. The other thing was the events. These allowed me on multiple occasions to do some complex setup in the benchmarking program, then start the data collection in kernrate. When done with the pounding, I could stop kernrate from collecting data immediately, then do any cleanup.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

I ran WPR during a test that exercises the condition of interest and am trying to navigate WPA now. Seems like a pretty powerful visualization tool with a lot of options. I’m going to need some time to figure out how to use this tool in a meaningful way to understand how my driver performs in the areas of interest.

Do you have any tips or references that could help accelerate my evaluation of the critical path? I’ve opened the DPC/ISR duration by Module, Function graph but haven’t yet figured out how to get symbols loaded or interrogate a call stack with execution times.

Don,
I suspect the pain of using kernrate is directly proportional to my familiarity with it. I remember having a difficult time understanding the workflow, so I ended up fighting the tool more than the problem I was trying to solve. Again probably my own ignorance. Do you know of any good references for effectively using kernrate?

I have to respectfully disagree on kernrate being painful. While it had its quirks, like so many Microsoft driver tools it had two big advantages that WPA eliminated:

  1. The data came out as text, so it was possible with a slight massage to make files that could be compared to see what the changes were actually doing to the drivers.

xperf/WPR generate ETW trace logs that you can format however you like, WPA just happens to give you the GUI view (e.g. tracerpt lets you dump as a CSV). Admittedly WPA also automatically does the symbolic name resolution which makes the logs much more useful…
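
A minimal sketch of the kind of before/after text comparison discussed here, assuming each trace has already been exported to a CSV of per-function sample counts. The column names `Function` and `Count` and the function names in the usage note are hypothetical placeholders, not the layout of any particular tool's output; adjust them to whatever your export actually contains.

```python
import csv
import io
from typing import Dict, List, Tuple


def load_counts(csv_text: str,
                key_column: str = "Function",   # hypothetical column name
                count_column: str = "Count"     # hypothetical column name
                ) -> Dict[str, int]:
    """Sum the sample counts per function from CSV text."""
    counts: Dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[key_column]
        counts[name] = counts.get(name, 0) + int(row[count_column])
    return counts


def diff_counts(before: Dict[str, int],
                after: Dict[str, int]) -> List[Tuple[str, int, int, int]]:
    """Return (function, before, after, delta) rows, largest change first."""
    names = set(before) | set(after)
    rows = [(n, before.get(n, 0), after.get(n, 0),
             after.get(n, 0) - before.get(n, 0)) for n in names]
    # Sort by magnitude of change, then by name for a stable order.
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return rows
```

For example, feeding it a "before" export with `MyDrv!Dpc,300` and an "after" export with `MyDrv!Dpc,120` puts `MyDrv!Dpc` near the top with a delta of -180, which is the sort of at-a-glance comparison the text output from kernrate made easy.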

  2. The other thing was the events. These allowed me on multiple occasions to do some complex setup in the benchmarking program, then start the data collection in kernrate. When done with the pounding, I could stop kernrate from collecting data immediately, then do any cleanup.

You can run wpr.exe from the command line to start/stop traces if that’s what you’re looking for.
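
A rough sketch of that start/stop pattern, assuming `wpr.exe` is on the PATH and the script runs from an elevated prompt. The helper names (`trace_workload` and friends) are made up for illustration; the only WPR options used are `-start <profile>` and `-stop <file>`. Setup and cleanup stay outside the traced region, so only the "pounding" is recorded.

```python
import subprocess
from typing import Callable, List, TypeVar

T = TypeVar("T")


def wpr_start_command(profile: str = "CPU") -> List[str]:
    """Command line that starts a WPR trace with the given profile."""
    return ["wpr.exe", "-start", profile]


def wpr_stop_command(etl_path: str) -> List[str]:
    """Command line that stops the trace and writes it to etl_path."""
    return ["wpr.exe", "-stop", etl_path]


def trace_workload(workload: Callable[[], T],
                   etl_path: str,
                   profile: str = "CPU",
                   run: Callable[[List[str]], object] = subprocess.check_call
                   ) -> T:
    """Trace only the workload: start WPR, run it, always stop WPR."""
    run(wpr_start_command(profile))
    try:
        return workload()
    finally:
        # Stop even if the workload raises, so no trace is left running.
        run(wpr_stop_command(etl_path))
```

Typical use would be to do the expensive benchmark setup first, call `trace_workload(run_benchmark, "bench.etl")` where `run_benchmark` is your own function, and then do cleanup after it returns. The `run` parameter exists only so the sequencing can be exercised without actually launching WPR.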

Scott,

I could never get WPA to give me data in a useable format to compare, it always wanted to do crap like XML. As far as events, they were a lot more powerful than the command line stuff. Kernrate also handled symbols and could use the symbol server with the same formatting of the target string as Windbg.

Shane, the only reference I ever found was the complete command line definition that Microsoft used to provide. At the end of the command line definitions they had some good examples and explanations.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

@Shane_Corbin said:
I ran WPR during a test that exercises the condition of interest and am trying to navigate WPA now. Seems like a pretty powerful visualization tool with a lot of options. I’m going to need some time to figure out how to use this tool in a meaningful way to understand how my driver performs in the areas of interest.

Yes, definitely… It takes a while to get used to what this is telling you. I don’t use it daily, so it always takes me a bit to get back into it.

Do you have any tips or references that could help accelerate my evaluation of the critical path? I’ve opened the DPC/ISR duration by Module, Function graph but haven’t yet figured out how to get symbols loaded or interrogate a call stack with execution times.

I wrote another article that goes over the basics again here: https://www.osr.com/nt-insider/2015-issue1/happiness-xperf/. It also demonstrates adding your own trace events to the log to isolate issues if that’s useful…

You load symbols with Trace->Load Symbols.

Stack walk data isn’t enabled by default and is not available for all trace providers. If you really need them I’m not sure you can get them with a trace started by WPR, you might need to use xperf.

If you can share a trace I can try to grovel through it and see if there’s anything I can point to that might be interesting. Might make a good basis for an article in the next issue of The NT Insider (yes, we are working on one!).

I’ve been trying to find a way to profile my driver code. The closest I’ve seen is old documentation on xperf using the stack walk feature. The documentation I was reading was 11 years old, though, and when I tried to find the latest, I found information about Windows Performance Toolkit, with WPR and WPA. Many of the pages are marked as no longer maintained, and I can’t find any mention of the stack walk feature or another means of profiling a driver.

Do I understand from this discussion that WPR/WPA is the way to go for performance measurement, despite the pages marked as no longer being updated, and that xperf is still the tool for the stack walk feature?

Does the stack walk feature allow you to do something similar to a normal code profiler?

The original documentation I was reading said that the stack walk feature was difficult to get to work on a 64-bit system. Is that still the case?

We’re going through some transitional pains because we’re migrating from one doc publishing platform to another. Sometimes you have to go on a hunt to find the “new” version of the docs. For perf analysis, it starts here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/

The names are a jumble (typical Microsoft, etc.). WPT is the entire bundle. WPR is the part that collects traces, and WPA is the GUI to analyze them. WPT also includes xperf.exe if you want it, but xperf is indeed deprecated. WPA can do anything that xperf can do. If you’re going to spend time learning something, learn WPA. If you already have xperf skills and don’t want anyone to rock your boat, well, xperf is still there & still works.

Stackwalking is not a hassle on recent kernels. Several years ago, the kernel learned how to dynamically toggle whether stacks and unwind data are paged out. So when WPR starts a trace session, it pins everything in memory, and stack traces “just work”.

WPT is still supported, it still gets new features, and I personally still use it routinely to do my job. IMHO, it’s an awesome tool. It definitely supports stack walking: by default, it’ll collect a stack backtrace 1000 times a second on every processor. Once you get the hang of WPA, you can organize these stacks into a tree view, a flame graph, or (if you want to do your own data processing) just a flat array of 1000’s of callstacks.
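
To make the "flat array of callstacks" idea concrete, here is a small sketch of how sampled stacks turn into a profile. It assumes each sample is a list of frame names ordered from the outermost caller to the leaf, and that samples arrive at the default rate mentioned above (1000 per second, so roughly 1 ms each). The input layout and the frame names in the usage note are illustrative, not the format of any WPA export.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def summarize_samples(stacks: Iterable[Sequence[str]],
                      interval_ms: float = 1.0
                      ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (inclusive_ms, exclusive_ms) per frame.

    inclusive: the frame was anywhere on the stack when sampled.
    exclusive: the frame was the leaf, i.e. actually executing.
    """
    inclusive: Counter = Counter()
    exclusive: Counter = Counter()
    for stack in stacks:
        if not stack:
            continue  # skip samples with no stack captured
        exclusive[stack[-1]] += 1
        # Use a set so a recursive frame counts once per sample.
        for frame in set(stack):
            inclusive[frame] += 1
    return ({f: n * interval_ms for f, n in inclusive.items()},
            {f: n * interval_ms for f, n in exclusive.items()})
```

With three samples `["Isr", "Dpc", "Copy"]`, `["Isr", "Dpc"]` and `["Isr", "Dpc", "Copy"]`, the hypothetical `Copy` frame gets about 2 ms exclusive while `Dpc` gets about 3 ms inclusive and 1 ms exclusive. A tree view or flame graph is essentially this same aggregation, grouped by call path instead of by individual frame.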

Does the stack walk feature allow you to do something similar to a normal code profiler?

Yes, although I’m not sure what counts as “normal” :wink: VS’s built-in native code profiler is way behind WPA for CPU analysis, but much further ahead for heap footprint analysis. (And even more so for its managed code profiler.) Intel has a cool profiler, but it gets deep into the architectural details of your processor, so I’d be hesitant to call that tool “normal”. There are old-timey tools like kernrate, which WPA can replicate, but WPA integrates so much more information into your workbench that it’s not even in the same league. Linux “perf” has a few neat features that WPA doesn’t, but they both do the same basic stackwalking + profiling.

@“Jeffrey_Tippet_[MSFT]” said:
We’re going through some transitional pains because we’re migrating from one doc publishing platform to another. Sometimes you have to go on a hunt to find the “new” version of the docs. For perf analysis, it starts here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/

That’s the link I ended up at. But over half of the links from that page, including most of the WPR and WPA ones, point to Windows 8/8.1 pages that say they are no longer maintained, so I got concerned. E.g. the WPA Features page (WPA Features | Microsoft Learn).

Stackwalking is not a hassle on recent kernels. Several years ago, the kernel learned how to dynamically toggle whether stacks and unwind data are paged out. So when WPR starts a trace session, it pins everything in memory, and stack traces “just work”.

That’s fantastic news, I’m excited now

Yes, although I’m not sure what counts as “normal” :wink:

I’m looking for something where I can capture data while I put my driver through some paces, and then get a diagram that will show where my driver is spending most of its time, so I know where to concentrate my effort.

Here’s a diagram from the 11 year old page that looks like it would do the trick:

_Ron

Yeah it’s quite straightforward to get that much in WPA, on any ~Win10 kernel and later.

@rstruempf said:

I’m looking for something where I can capture data while I put my driver through some paces, and then get a diagram that will show where my driver is spending most of its time, so I know where to concentrate my effort.

This seems like the most common usage scenario. @“Jeffrey_Tippet_[MSFT]” mentioned it’s quite straightforward. I haven’t yet figured it out. I can be pretty dense sometimes though. I know I’d greatly benefit from a walkthrough of the most common usage scenarios. So far I’ve only gotten it to identify time spent in ISR/DPCs. I haven’t figured out how to get the stackwalk feature to measure the call chain.

This has a screenshot-by-screenshot example: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/optimizing-performance-and-responsiveness-exercise-2

There’s some good depth here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/cpu-analysis With that background, you’ll have an idea of the difference between the “CPU (sampled)” and “CPU (precise)” graphs, and why having both is valuable. There are other good docs near that page in the table of contents about how to tailor the WPA GUI to your tastes.

If you’re having trouble getting the callstacks to even show up, make sure:

  • You collect the ETL with the “CPU” profiling enabled. From the command line, that’s wpr.exe -start CPU. From the GUI, make sure the “CPU” checkbox is checked.
  • When you load the ETL in WPA, make sure symbols get loaded ( https://docs.microsoft.com/en-us/windows-hardware/test/wpt/load-symbols-or-configure-symbol-paths ). As with the debugger, fussing with symbols is a tiresome timesink. WPA can’t draw any callstacks until it has loaded symbols for nearly every binary on the system, which can take an hour if you haven’t cached symbols from that OS build before. So if possible, I like to get one throwaway ETL loaded up right away, so it can start caching symbols, while in parallel I can begin collecting a more perfect ETL.
  • Once symbols look good, double-click on Graph Explorer > Computation > CPU Usage (sampled) > Utilization by Process, Thread, Stack.
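
Since symbol setup is the step that tends to eat the most time, here is a small sketch of building a symbol path string in the same `srv*<cache>*<server>` form that WinDbg uses, which can then be placed in the `_NT_SYMBOL_PATH` environment variable before launching WPA. The helper name and the example directories are made up; the server URL is the public Microsoft symbol server.

```python
from typing import Iterable

MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir: str,
                local_dirs: Iterable[str] = (),
                server: str = MS_SYMBOL_SERVER) -> str:
    """Build a symbol path: local PDB folders first, then the cached server."""
    parts = list(local_dirs)
    # srv*<local cache>*<server>: downloaded symbols are cached locally,
    # so later trace loads do not have to fetch them again.
    parts.append(f"srv*{cache_dir}*{server}")
    return ";".join(parts)
```

For instance, `symbol_path(r"C:\Symbols", [r"C:\build\mydriver"])` yields a path that searches the (hypothetical) driver build folder for your own PDBs before falling back to the symbol server. Keeping the same cache directory across sessions is what makes the "throwaway ETL" trick above pay off.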

(One secret of WPA is that all the dozens of built-in graphs are actually the same graph, but with different options set by default. So if you squint, it doesn’t matter which graph you start with, since you can bang on it until it looks like any other graph.)

Thank you for the excellent response! This really helped me.

I second that. Thanks, @“Jeffrey_Tippet_[MSFT]”!

I would add to that the trove of articles by Bruce Dawson on https://randomascii.wordpress.com. Though his articles are mostly usermode code, the methods are very relevant and helped me a lot. You can also take a look at his WPA course on Wintellect that has 3 or 4 guided lectures if you want something more structured.

I would also look at Stack-Tags to help you dig into your code faster when analyzing large recordings.
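
As a rough illustration of Stack-Tags, the sketch below generates a small tag file that groups stacks by module and function pattern. It assumes the commonly seen layout of a root `Tag` element with an empty `Name`, nested `Tag` elements, and `Entrypoint` children carrying `Module` and `Method` attributes; treat that layout as an assumption and compare it against the default.stacktags file that ships with WPT before relying on it. The driver and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Tuple


def build_stacktags(tags: Dict[str, Iterable[Tuple[str, str]]]) -> str:
    """Build stack-tag XML from {tag name: [(module, method pattern), ...]}."""
    root = ET.Element("Tag", Name="")  # assumed root element, empty name
    for tag_name, entrypoints in tags.items():
        tag = ET.SubElement(root, "Tag", Name=tag_name)
        for module, method in entrypoints:
            # Each Entrypoint matches frames in a module by method pattern.
            ET.SubElement(tag, "Entrypoint", Module=module, Method=method)
    return ET.tostring(root, encoding="unicode")
```

Calling `build_stacktags({"Receive path": [("mydriver.sys", "MyDrvReceive*")]})` produces one tag that would bucket every matching stack under "Receive path", which is handy when a large recording has thousands of distinct call paths.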

Another nice feature of WPA is that if you instrument your code using TraceLogging (and use advanced features, like Activities), it will really help you see the different stages of your workflow happening on the same timeline as the other graphs.

This helped me identify areas where unrelated workflows were happening at the same time and starving each other of resources. In another case, we had tried to speed up a workflow by parallelizing it, only to see that the parallel pieces were actually contending on a lock inside a Microsoft implementation.