Looking for xperf/kernrate/wpa performance analysis best practices

Shane_Corbin Member Posts: 247

When Windows 7 was released I had a performance issue that I was able to tediously work through with kernrate. It's my understanding that the tools have only improved since this time. I have a renewed interest in analyzing portions of my code paths with increased (perhaps unnecessary) delays. CPU utilization is uncharacteristically high, and I'd like to correct it.

What tools/workflow are people using to evaluate degraded performance in their drivers? I ran into this article about collecting detailed performance data with xperf. Is this still the way to go? Are there alternatives I should consider? Does anybody have a good walkthrough that could bring me up to speed quickly? I've found that most of the documentation details what the tool(s) can do, but rarely walks through usage scenarios.

Comments

  • Scott_Noone_(OSR) Administrator Posts: 3,151

    kernrate was so painful...

    Try using the Windows Performance Recorder (WPR) to collect the logs instead of xperf (same data, just more convenient). The Windows Performance Analyzer (WPA) has also undergone a big UI change since I wrote that article but the same ideas hold.

    -scott
    OSR

  • Don_Burn Member - All Emails Posts: 1,653
    via Email
    Scott,

    I have to respectfully disagree on kernrate being painful. While it had its quirks, like so many Microsoft driver tools it had two big advantages that WPA eliminated:

    1. The data came out as text, so it was possible with a slight massage to make files that could be compared to see what the changes were actually doing to the drivers.

    2. The other thing was the events. On multiple occasions they allowed me to do some complex setup in the benchmarking program and only then start the data collection in kernrate. When done with the pounding, I could stop kernrate from collecting data immediately and then do any cleanup.
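
    The run-to-run comparison described in point 1 can be sketched in a few lines. This is only an illustration: the `function,count` text format and the `MyDrv!...` names are assumptions made up for the example, not something kernrate or WPA emits verbatim.

```python
def parse_counts(text):
    """Parse 'function,count' lines (a hypothetical export format) into a dict."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the LAST comma so function names containing commas survive.
        name, _, value = line.rpartition(",")
        name = name.strip()
        counts[name] = counts.get(name, 0) + int(value)
    return counts


def compare_profiles(before, after):
    """Return (name, before, after, delta) rows, largest absolute change first."""
    rows = []
    for name in set(before) | set(after):
        b = before.get(name, 0)
        a = after.get(name, 0)
        rows.append((name, b, a, a - b))
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return rows


if __name__ == "__main__":
    # Hypothetical sample counts from two benchmark runs.
    run1 = parse_counts("MyDrv!EvtIoRead,420\nMyDrv!DpcRoutine,100\n")
    run2 = parse_counts("MyDrv!EvtIoRead,180\nMyDrv!DpcRoutine,110\nMyDrv!NewHelper,5\n")
    for name, b, a, delta in compare_profiles(run1, run2):
        print(f"{name:24} {b:6} {a:6} {delta:+6}")
```

    Sorting by absolute change puts the functions most affected by a driver change at the top, whichever direction they moved.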


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Shane_Corbin Member Posts: 247
    edited May 2

    I ran WPR during a test that exercises the condition of interest and am trying to navigate WPA now. Seems like a pretty powerful visualization tool with a lot of options. I'm going to need some time to figure out how to use this tool in a meaningful way to understand how my driver performs in the areas of interest.

    Do you have any tips or references that could help accelerate my evaluation of the critical path? I've opened the DPC/ISR duration by Module, Function graph but haven't yet figured out how to get symbols loaded or how to interrogate a call stack with execution times.

    Don,
    I suspect the pain of using kernrate is directly proportional to my familiarity with it. I remember having a difficult time understanding the workflow, so I ended up fighting the tool more than the problem I was trying to solve. Again probably my own ignorance. Do you know of any good references for effectively using kernrate?

  • Scott_Noone_(OSR) Administrator Posts: 3,151

    @Don_Burn said:
    I have to respectfully disagree on kernrate being painful. While it had its quirks, like so many Microsoft driver tools it had two big advantages that WPA eliminated:

    1. The data came out as text, so it was possible with a slight massage to make files that could be compared to see what the changes were actually doing to the drivers.

    xperf/WPR generate ETW trace logs that you can format however you like, WPA just happens to give you the GUI view (e.g. tracerpt lets you dump as a CSV). Admittedly WPA also automatically does the symbolic name resolution which makes the logs much more useful...
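
    A rough sketch of the CSV route: build the tracerpt command line, then tally the dump with ordinary text tools. The `-o`, `-of CSV` and `-y` switches are standard tracerpt options; the helper names are mine, and the assumption that the dump has a header row with the event name in the first column should be checked against your own output.

```python
import csv
import io
from collections import Counter


def tracerpt_csv_command(etl_path, csv_path):
    """Command line that asks tracerpt to dump an ETL file as CSV."""
    # -o names the output file, -of selects the format, -y overwrites silently.
    return ["tracerpt", etl_path, "-o", csv_path, "-of", "CSV", "-y"]


def count_by_first_column(csv_text):
    """Count rows per value of the first column (assumed: event name)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumed: the first row is a header
    return Counter(row[0].strip() for row in reader if row)


if __name__ == "__main__":
    print(" ".join(tracerpt_csv_command("trace.etl", "trace.csv")))
    # Synthetic stand-in for a real dump, just to show the tally.
    sample = "Event Name, Type\nPerfInfo, SampleProf\nPerfInfo, SampleProf\nThread, CSwitch\n"
    print(count_by_first_column(sample))
```

    Once the data is plain text like this, two runs can be compared with any diff or spreadsheet tool.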

    2. The other thing were the events, this allowed me on multiple occasions to do some complex setup in the benchmarking program, then start the data collection in kernrate. When done with the pounding, I could stop kernrate immediately from collecting data then do any cleanup.

    You can run wpr.exe from the command line to start/stop traces if that's what you're looking for.
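
    As a minimal sketch of that start/stop pattern, the benchmark harness can bracket just the interesting part of the run. It assumes wpr.exe is on the PATH, the script runs elevated, and the built-in "CPU" profile is the one you want; `wpr_trace` is a helper name invented for the example.

```python
import contextlib
import subprocess


def wpr_start_command(profile="CPU"):
    """Command line that starts a WPR trace with the given profile."""
    return ["wpr.exe", "-start", profile]


def wpr_stop_command(etl_path):
    """Command line that stops the running trace and saves it to etl_path."""
    return ["wpr.exe", "-stop", etl_path]


@contextlib.contextmanager
def wpr_trace(etl_path, profile="CPU", run=subprocess.check_call):
    """Trace only the body of the 'with' block.

    'run' is injectable so the sequencing can be checked without WPR installed.
    """
    run(wpr_start_command(profile))
    try:
        yield
    finally:
        # Stop even if the workload throws, so no trace session is left running.
        run(wpr_stop_command(etl_path))


# Usage (Windows, elevated prompt):
#   do_complex_setup()
#   with wpr_trace("pounding.etl"):
#       pound_on_driver()
#   do_cleanup()
```

    Setup and cleanup stay outside the `with` block, so they never show up in the trace.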

    -scott
    OSR

  • Don_Burn Member - All Emails Posts: 1,653
    via Email
    Scott,

    I could never get WPA to give me data in a usable format to compare; it always wanted to do crap like XML. As far as events go, they were a lot more powerful than the command line stuff. Kernrate also handled symbols and could use the symbol server with the same formatting of the target string as WinDbg.

    Shane, the only reference I ever found was the complete command line definition that Microsoft used to provide. At the end of the command line definitions they had some good examples and explanations.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Scott_Noone_(OSR) Administrator Posts: 3,151

    @Shane_Corbin said:
    I ran WPR during a test that exercises the condition of interest and am trying to navigate WPA now. Seems like a pretty powerful visualization tool with a lot of options. I'm going to need some time to figure out how to use this tool in a meaningful way to understand how my driver performs in the areas of interest.

    Yes, definitely... It takes a while to get used to what this is telling you. I don't use it daily, so it always takes me a bit to get back into it.

    Do you have any tips or references that could help accelerate my evaluation of critical path? I've opened the DPC/ISR duration by Module,Function but haven't yet figured out how to get symbols loaded or interrogate a call stack with execution times or anything yet.

    I wrote another article that goes over the basics again here: https://www.osr.com/nt-insider/2015-issue1/happiness-xperf/. It also demonstrates adding your own trace events to the log to isolate issues if that's useful...

    You load symbols with Trace->Load Symbols.

    Stack walk data isn't enabled by default and is not available for all trace providers. If you really need stacks, I'm not sure you can get them with a trace started by WPR; you might need to use xperf.
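
    For the xperf route, a sketch of the command lines involved. The kernel flags (PROC_THREAD, LOADER, PROFILE) and the `-stackwalk Profile` switch are written from memory of `xperf -help`, so treat them as assumptions and check them against your WPT version; the helper names are invented for the example.

```python
def xperf_start_command(kernel_flags=("PROC_THREAD", "LOADER", "PROFILE"),
                        stackwalk=("Profile",)):
    """Command line that starts the kernel logger with sampled-profile stacks."""
    cmd = ["xperf", "-on", "+".join(kernel_flags)]
    if stackwalk:
        # Ask for a call stack to be captured on each listed event type.
        cmd += ["-stackwalk", "+".join(stackwalk)]
    return cmd


def xperf_stop_command(etl_path):
    """Command line that stops the kernel logger and merges it into etl_path."""
    return ["xperf", "-d", etl_path]


if __name__ == "__main__":
    # Run both from an elevated prompt, with the workload in between.
    print(" ".join(xperf_start_command()))
    print(" ".join(xperf_stop_command("trace.etl")))
```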

    If you can share a trace I can try to grovel through and see if there's anything I can point to that might be interesting. Might make a good basis for an article in the next issue of The NT Insider (yes, we are working on one!).

    -scott
    OSR

  • rstruempf Member Posts: 103

    I've been trying to find a way to profile my driver code. The closest I've seen is old documentation on xperf using the stack walk feature. The documentation I was reading was 11 years old, though, and when I tried to find the latest, I found information about Windows Performance Toolkit, with WPR and WPA. Many of the pages are marked as no longer maintained, and I can't find any mention of the stack walk feature or another means of profiling a driver.

    Do I understand from this discussion that WPR/WPA is the way to go for performance measurement, despite the pages marked as no longer being updated, and that xperf is still the tool for the stack walk feature?

    Does the stack walk feature allow you to do something similar to a normal code profiler?

    The original documentation I was reading said that the stack walk feature was difficult to get to work on a 64-bit system. Is that still the case?

  • Jeffrey_Tippet_[MSFT] Member - All Emails Posts: 545

    We're going through some transitional pains because we're migrating from one doc publishing platform to another. Sometimes you have to go on a hunt to find the "new" version of the docs. For perf analysis, it starts here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/

    The names are a jumble (typical Microsoft, etc.). WPT is the entire bundle. WPR is the part that collects traces, and WPA is the GUI to analyze them. WPT also includes xperf.exe if you want it, but xperf is indeed deprecated. WPA can do anything that xperf can do. If you're going to spend time learning something, learn WPA. If you already have xperf skills and don't want anyone to rock your boat, well, xperf is still there & still works.

    Stackwalking is not a hassle on recent kernels. Several years ago, the kernel learned how to dynamically toggle whether stacks and unwind data are paged out. So when WPR starts a trace session, it pins everything in memory, and stack traces "just work".

    WPT is still supported, it still gets new features, and I personally still use it routinely to do my job. IMHO, it's an awesome tool. It definitely supports stack walking: by default, it'll collect a stack backtrace 1000 times a second on every processor. Once you get the hang of WPA, you can organize these stacks into a tree view, a flame graph, or (if you want to do your own data processing) just a flat array of thousands of callstacks.
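
    To illustrate the "do your own data processing" case: a minimal sketch that folds a flat list of sampled callstacks into per-function counts. The frame names are illustrative, and the 1 ms-per-sample conversion simply follows from the 1000 samples per second default mentioned above.

```python
from collections import Counter

SAMPLE_INTERVAL_MS = 1.0  # 1000 samples per second => about 1 ms per sample


def summarize(stacks):
    """Return (inclusive, exclusive) sample counts per function.

    Each stack is a root-first list of frames. Exclusive counts samples where
    the function was on top; inclusive counts samples where it was anywhere
    on the stack.
    """
    inclusive = Counter()
    exclusive = Counter()
    for stack in stacks:
        if not stack:
            continue
        exclusive[stack[-1]] += 1
        # set() so a recursive function is counted once per sample.
        for frame in set(stack):
            inclusive[frame] += 1
    return inclusive, exclusive


def to_ms(samples):
    """Rough CPU time estimate for a number of samples."""
    return samples * SAMPLE_INTERVAL_MS


if __name__ == "__main__":
    # Hypothetical samples, root frame first.
    samples = [
        ["nt!KiRetireDpcList", "MyDrv!DpcRoutine", "MyDrv!CopyBuffer"],
        ["nt!KiRetireDpcList", "MyDrv!DpcRoutine", "MyDrv!CopyBuffer"],
        ["nt!KiRetireDpcList", "MyDrv!DpcRoutine"],
        ["nt!KiIdleLoop"],
    ]
    inclusive, exclusive = summarize(samples)
    for name, count in inclusive.most_common():
        print(f"{name:24} incl ~{to_ms(count):.0f} ms  excl ~{to_ms(exclusive[name]):.0f} ms")
```

    Inclusive time shows which call chains are expensive; exclusive time shows which function is actually burning the cycles. The same split is what separates the tree view from a flat per-function list.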

    Does the stack walk feature allow you to do something similar to a normal code profiler?

    Yes, although I'm not sure what counts as "normal" ;) VS's built-in native code profiler is way behind WPA for CPU analysis, but much further ahead for heap footprint analysis. (And even more so for its managed code profiler.) Intel has a cool profiler, but it gets deep into the architectural details of your processor, so I'd be hesitant to call that tool "normal". There are old-timey tools like kernrate, which WPA can replicate, but WPA integrates so much more information into your workbench that it's not even in the same league. Linux "perf" has a few neat features that WPA doesn't, but they both do the same basic stackwalking + profiling.

  • rstruempf Member Posts: 103

    @Jeffrey_Tippet_[MSFT] said:
    We're going through some transitional pains because we're migrating from one doc publishing platform to another. Sometimes you have to go on a hunt to find the "new" version of the docs. For perf analysis, it starts here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/

    That's the link I ended up at. But over half of the links from that page, most of the WPR and WPA ones, go to docs, but point to Windows 8/8.1 pages that say they are no longer maintained, so I got concerned. E.g. for WPA Features - https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-8.1-and-8/hh448220(v=win.10)

    Stackwalking is not a hassle on recent kernels. Several years ago, the kernel learned how to dynamically toggle whether stacks and unwind data are paged out. So when WPR starts a trace session, it pins everything in memory, and stack traces "just work".

    That's fantastic news, I'm excited now

    Yes, although I'm not sure what counts as "normal" ;)

    I'm looking for something where I can capture data while I put my driver through some paces, and then get a diagram that will show where my driver is spending most of its time, so I know where to concentrate my effort.

    Here's a diagram from the 11 year old page that looks like it would do the trick:

    _Ron

  • Jeffrey_Tippet_[MSFT] Member - All Emails Posts: 545

    Yeah it's quite straightforward to get that much in WPA, on any ~Win10 kernel and later.

  • Shane_Corbin Member Posts: 247

    @rstruempf said:

    I'm looking for something where I can capture data while I put my driver through some paces, and then get a diagram that will show where my driver is spending most of its time, so I know where to concentrate my effort.

    This seems like the most common usage scenario. @Jeffrey_Tippet_[MSFT] mentioned it's quite straightforward. I haven't yet figured it out. I can be pretty dense sometimes though. I know I'd greatly benefit from a walkthrough of the most common usage scenarios. So far I've only gotten it to identify time spent in ISR/DPCs. I haven't figured out how to get the stackwalk feature to measure the call chain.

  • Jeffrey_Tippet_[MSFT] Member - All Emails Posts: 545

    This has a screenshot-by-screenshot example: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/optimizing-performance-and-responsiveness-exercise-2

    There's some good depth here: https://docs.microsoft.com/en-us/windows-hardware/test/wpt/cpu-analysis With that background, you'll have an idea what is the difference between "CPU (sampled)" and "CPU (precise)" graphs, and why having both is valuable. There's other good docs near that page in the table-of-contents about how to tailor the WPA GUI to your tastes.

    If you're having trouble getting the callstacks to even show up, make sure:

    • You collect the ETL with the "CPU" profiling enabled. From the command line, that's wpr.exe -start CPU. From the GUI, make sure the "CPU" checkbox is checked.
    • When you load the ETL in WPA, make sure symbols get loaded ( https://docs.microsoft.com/en-us/windows-hardware/test/wpt/load-symbols-or-configure-symbol-paths ). As with the debugger, fussing with symbols is a tiresome timesink. WPA can't draw any callstacks until it has loaded symbols for nearly every binary on the system, which can take an hour if you haven't cached symbols from that OS build before. So if possible, I like to get one throwaway ETL loaded up right away, so it can start caching symbols, while in parallel I can begin collecting a more perfect ETL.
    • Once symbols look good, double-click on Graph Explorer > Computation > CPU Usage (sampled) > Utilization by Process, Thread, Stack.
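
    For the symbol step in the checklist above, a small sketch of building a symbol path so cached symbols get reused across traces. It assumes WPA picks up the `_NT_SYMBOL_PATH` environment variable in the same `srv*cache*server` form the debuggers use; the directory names are placeholders and the helper names are invented for the example.

```python
import os

MS_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir, local_dirs=(), server=MS_SYMBOL_SERVER):
    """Build a debugger-style symbol path: local PDB folders first, then a
    symbol server with a local download cache."""
    parts = list(local_dirs)
    parts.append("srv*{}*{}".format(cache_dir, server))
    return ";".join(parts)


def wpa_environment(sym_path, base_env=None):
    """Copy of the environment with _NT_SYMBOL_PATH set, for launching WPA."""
    env = dict(os.environ if base_env is None else base_env)
    env["_NT_SYMBOL_PATH"] = sym_path
    return env


if __name__ == "__main__":
    # Placeholder paths: your driver's PDB folder and a local symbol cache.
    path = symbol_path(r"C:\symbols", local_dirs=[r"C:\build\mydrv\x64\Release"])
    print(path)
    # On Windows: subprocess.Popen(["wpa.exe", "trace.etl"], env=wpa_environment(path))
```

    Listing the driver's own PDB folder first means private symbols are found without a round trip to the symbol server.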

    (One secret of WPA is that all the dozens of built-in graphs are actually the same graph, but with different options set by default. So if you squint, it doesn't matter which graph you start with, since you can bang on it until it looks like any other graph.)

  • Shane_Corbin Member Posts: 247

    Thank you for the excellent response! This really helped me.

  • rstruempf Member Posts: 103

    I second that. Thanks, @Jeffrey_Tippet_[MSFT]!

  • Vadim_Sirotnikov Member Posts: 28

    I would add to that the trove of articles by Bruce Dawson on https://randomascii.wordpress.com. Though his articles are mostly usermode code, the methods are very relevant and helped me a lot. You can also take a look at his WPA course on Wintellect that has 3 or 4 guided lectures if you want something more structured.

    I would also look at Stack-Tags to help you dig into your code faster when analyzing large recordings.

    Another nice feature of WPA is that if you instrument your code using TraceLogging (and use advanced features, like Activities), it will really help you see the different stages of your workflow happening on the same timeline as the other graphs.

    This helped me identify areas where unrelated workflows were happening at the same time and starving each other of resources. In another case, we had tried to speed up a workflow by parallelizing it, only to see that the parallel pieces were actually hitting a lock inside a Microsoft implementation.
