Windows System Software -- Consulting, Training, Development -- Unique Expertise, Guaranteed Results

Home NTDEV

Before Posting...

Please check out the Community Guidelines in the Announcements and Administration Category.

More Info on Driver Writing and Debugging


The free OSR Learning Library has more than 50 articles on a wide variety of topics about writing and debugging device drivers and Minifilters. From introductory level to advanced. All the articles have been recently reviewed and updated, and are written using the clear and definitive style you've come to expect from OSR over the years.


Check out The OSR Learning Library at: https://www.osr.com/osr-learning-library/


ReadThreadProfilingData function not returning useful / sensible / consistent results

eric_mayereric_mayer Member Posts: 4

I'm trying to develop some performance test automation techniques that are more predictable and stable than simple elapsed time measurement. I plan for the tests to be broken down into discrete "use-case" based unit-test style tests rather than simply measuring background activity of our application. The measurements need to cover activity in both user and kernel mode.

My litmus test effort was to come up with a baseline low-ambiguity sample test to show that I could measure some kind of consistent number on CPU time ticks on a given test thread running on a given test O/S on a given test machine for a pure math algorithm happening in active memory, not calling any WinAPIs, not touching file system, etc. The test deliberately created "turbulence" by having a configurable number of competing threads periodically run the same algorithm as the thread being measured. I wanted to see if I could get the elapsed time to vary significantly on the thread being measured but get a solid/consistent number for its CPU time. I figured if I couldn't achieve at least this, attempting measurement of any more complex/messy code would be pointless.
 
At first I tried using QueryThreadCycleTime (or its Kernel equivalent KeQueryTotalCycleTimeThread) but found that I was easily able to show test results where elapsed time (measured with QueryPerformanceCounter or it's kernel equivalent) varied significantly but the CPU time on the measured thread hugged that varying curve tightly rather than being stable/consistent. That was disappointing, but I wasn't too surprised because the documentation on these APIs is extremely spare and trying to find example usage turned up very little.
 
Then I came across the usermode Thread Profiling APIs (i.e. ReadThreadProfilingData), which seem a little better supported. I can provide this code if anyone wants to look it over or try it out. Unfortunately, though, the results looked strikingly similar to the previous effort. No additional measurement stability using thread-specific CPU time ticks rather than overall elapsed time. This time I also measured the thread context switch counts, trying to establish any pattern of potential cost of doing the context switch that might account for the additional time spent in the thread, but no such coherent pattern emerged.
 
The test output showing the tight correlation between overall time and thread-specific CPU time (plus context switch counts) follows. Also following are the specs on my test machine, as I understand the behavior of these profiling APIs can vary with hardware, but I don't see any issue based on my best research effort on the topic -- though I must admit as far as average software developers go, I'm not a hardware / CPU export in any sense. Anyone get any clues / tips? Is this enough to say "Hey what's up with this Microsoft?" or did I fail to account for any other obvious test parameters / conditions?

Comments

  • eric_mayereric_mayer Member Posts: 4

    Test output:

    C:\Source\endpoint_nav1_dev\x64\Release>ConsoleApplication1.exe o 12
    Running test with offline thread and 12 competing threads on 16 processors.
     
    main: Contested Test iteration 1 results:
    main: j is 1804570059818208285, cpu ticks is 477569794.
    main: elapsed microsec is 199718.
    main: ratio is 2391.
    main: Thread context switch count = 13
     
    main: Free Test iteration 2 results:
    main: j is 1804570059818208285, cpu ticks is 362215977.
    main: elapsed microsec is 145259.
    main: ratio is 2493.
    main: Thread context switch count = 1
     
    main: Free Test iteration 3 results:
    main: j is 1804570059818208285, cpu ticks is 359002716.
    main: elapsed microsec is 144025.
    main: ratio is 2492.
    main: Thread context switch count = 2
     
    main: Contested Test iteration 4 results:
    main: j is 1804570059818208285, cpu ticks is 491884161.
    main: elapsed microsec is 197114.
    main: ratio is 2495.
    main: Thread context switch count = 0
     
    main: Free Test iteration 5 results:
    main: j is 1804570059818208285, cpu ticks is 359517039.
    main: elapsed microsec is 144111.
    main: ratio is 2494.
    main: Thread context switch count = 1
     
    main: Free Test iteration 6 results:
    main: j is 1804570059818208285, cpu ticks is 356299170.
    main: elapsed microsec is 142808.
    main: ratio is 2494.
    main: Thread context switch count = 0
     
    main: Contested Test iteration 7 results:
    main: j is 1804570059818208285, cpu ticks is 473551735.
    main: elapsed microsec is 189981.
    main: ratio is 2492.
    main: Thread context switch count = 4
     
    main: Free Test iteration 8 results:
    main: j is 1804570059818208285, cpu ticks is 359549314.
    main: elapsed microsec is 144151.
    main: ratio is 2494.
    main: Thread context switch count = 1
     
    main: Free Test iteration 9 results:
    main: j is 1804570059818208285, cpu ticks is 357777276.
    main: elapsed microsec is 143635.
    main: ratio is 2490.
    main: Thread context switch count = 6
     
    main: Contested Test iteration 10 results:
    main: j is 1804570059818208285, cpu ticks is 472087235.
    main: elapsed microsec is 189280.
    main: ratio is 2494.
    main: Thread context switch count = 1
     
    main: Free Test iteration 11 results:
    main: j is 1804570059818208285, cpu ticks is 356203917.
    main: elapsed microsec is 142790.
    main: ratio is 2494.
    main: Thread context switch count = 3
     
    main: Free Test iteration 12 results:
    main: j is 1804570059818208285, cpu ticks is 354169145.
    main: elapsed microsec is 142164.
    main: ratio is 2491.
    main: Thread context switch count = 1
     
    main: Contested Test iteration 13 results:
    main: j is 1804570059818208285, cpu ticks is 475484813.
    main: elapsed microsec is 190663.
    main: ratio is 2493.
    main: Thread context switch count = 1
     
    main: Free Test iteration 14 results:
    main: j is 1804570059818208285, cpu ticks is 358580220.
    main: elapsed microsec is 143718.
    main: ratio is 2495.
    main: Thread context switch count = 6
     
    main: Free Test iteration 15 results:
    main: j is 1804570059818208285, cpu ticks is 354703498.
    main: elapsed microsec is 142150.
    main: ratio is 2495.
    main: Thread context switch count = 1
     
    main: Contested Test iteration 16 results:
    main: j is 1804570059818208285, cpu ticks is 482798437.
    main: elapsed microsec is 193497.
    main: ratio is 2495.
    main: Thread context switch count = 0
     
    main: Free Test iteration 17 results:
    main: j is 1804570059818208285, cpu ticks is 355251649.
    main: elapsed microsec is 142356.
    main: ratio is 2495.
    main: Thread context switch count = 0
     
    main: Free Test iteration 18 results:
    main: j is 1804570059818208285, cpu ticks is 357151258.
    main: elapsed microsec is 143176.
    main: ratio is 2494.
    main: Thread context switch count = 0
     
    main: Contested Test iteration 19 results:
    main: j is 1804570059818208285, cpu ticks is 469841993.
    main: elapsed microsec is 188402.
    main: ratio is 2493.
    main: Thread context switch count = 3
     
    main: Free Test iteration 20 results:
    main: j is 1804570059818208285, cpu ticks is 356370399.
    main: elapsed microsec is 143112.
    main: ratio is 2490.
    main: Thread context switch count = 0
     
     
    main: Total results:
    main: cpu ticks min/max/variance is 354169145 / 491884161 / 38%.
    main: elapsed microsec min/max/variance is 142150 / 199718 / 40%.

  • eric_mayereric_mayer Member Posts: 4

    CPU specs:
     
     
    11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
    Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
    Microcode signature: 0000003E
    HTT * Hyperthreading enabled
    CET * Supports Control Flow Enforcement Technology
    Kernel CET - Kernel-mode CET Enabled
    User CET * User-mode CET Allowed
    HYPERVISOR * Hypervisor is present
    VMX - Supports Intel hardware-assisted virtualization
    SVM - Supports AMD hardware-assisted virtualization
    X64 * Supports 64-bit mode
     
    SMX - Supports Intel trusted execution
    SKINIT - Supports AMD SKINIT
    SGX - Supports Intel SGX
     
    NX * Supports no-execute page protection
    SMEP * Supports Supervisor Mode Execution Prevention
    SMAP * Supports Supervisor Mode Access Prevention
    PAGE1GB * Supports 1 GB large pages
    PAE * Supports > 32-bit physical addresses
    PAT * Supports Page Attribute Table
    PSE * Supports 4 MB pages
    PSE36 * Supports > 32-bit address 4 MB pages
    PGE * Supports global bit in page tables
    SS * Supports bus snooping for cache operations
    VME * Supports Virtual-8086 mode
    RDWRFSGSBASE * Supports direct GS/FS base access
     
    FPU * Implements i387 floating point instructions
    MMX * Supports MMX instruction set
    MMXEXT - Implements AMD MMX extensions
    3DNOW - Supports 3DNow! instructions
    3DNOWEXT - Supports 3DNow! extension instructions
    SSE * Supports Streaming SIMD Extensions
    SSE2 * Supports Streaming SIMD Extensions 2
    SSE3 * Supports Streaming SIMD Extensions 3
    SSSE3 * Supports Supplemental SIMD Extensions 3
    SSE4a - Supports Streaming SIMDR Extensions 4a
    SSE4.1 * Supports Streaming SIMD Extensions 4.1
    SSE4.2 * Supports Streaming SIMD Extensions 4.2
     
    AES * Supports AES extensions
    AVX * Supports AVX instruction extensions
    AVX2 * Supports AVX2 instruction extensions
    AVX-512-F * Supports AVX-512 Foundation instructions
    AVX-512-DQ * Supports AVX-512 double and quadword instructions
    AVX-512-IFAMA * Supports AVX-512 integer Fused multiply-add instructions
    AVX-512-PF - Supports AVX-512 prefetch instructions
    AVX-512-ER - Supports AVX-512 exponential and reciprocal instructions
    AVX-512-CD * Supports AVX-512 conflict detection instructions
    AVX-512-BW * Supports AVX-512 byte and word instructions
    AVX-512-VL * Supports AVX-512 vector length instructions
    FMA * Supports FMA extensions using YMM state
    MSR * Implements RDMSR/WRMSR instructions
    MTRR * Supports Memory Type Range Registers
    XSAVE * Supports XSAVE/XRSTOR instructions
    OSXSAVE * Supports XSETBV/XGETBV instructions
    RDRAND * Supports RDRAND instruction
    RDSEED * Supports RDSEED instruction
     
    CMOV * Supports CMOVcc instruction
    CLFSH * Supports CLFLUSH instruction
    CX8 * Supports compare and exchange 8-byte instructions
    CX16 * Supports CMPXCHG16B instruction
    BMI1 * Supports bit manipulation extensions 1
    BMI2 * Supports bit manipulation extensions 2
    ADX * Supports ADCX/ADOX instructions
    DCA - Supports prefetch from memory-mapped device
    F16C * Supports half-precision instruction
    FXSR * Supports FXSAVE/FXSTOR instructions
    FFXSR - Supports optimized FXSAVE/FSRSTOR instruction
    MONITOR * Supports MONITOR and MWAIT instructions
    MOVBE * Supports MOVBE instruction
    ERMSB * Supports Enhanced REP MOVSB/STOSB
    PCLMULDQ * Supports PCLMULDQ instruction
    POPCNT * Supports POPCNT instruction
    LZCNT * Supports LZCNT instruction
    SEP * Supports fast system call instructions
    LAHF-SAHF * Supports LAHF/SAHF instructions in 64-bit mode
    HLE - Supports Hardware Lock Elision instructions
    RTM - Supports Restricted Transactional Memory instructions
     
    DE * Supports I/O breakpoints including CR4.DE
    DTES64 * Can write history of 64-bit branch addresses
    DS * Implements memory-resident debug buffer
    DS-CPL - Supports Debug Store feature with CPL
    PCID * Supports PCIDs and settable CR4.PCIDE
    INVPCID * Supports INVPCID instruction
    PDCM * Supports Performance Capabilities MSR
    RDTSCP * Supports RDTSCP instruction
    TSC * Supports RDTSC instruction
    TSC-DEADLINE * Local APIC supports one-shot deadline timer
    TSC-INVARIANT * TSC runs at constant rate
    xTPR * Supports disabling task priority messages
     
    EIST * Supports Enhanced Intel Speedstep
    ACPI * Implements MSR for power management
    TM * Implements thermal monitor circuitry
    TM2 * Implements Thermal Monitor 2 control
    APIC * Implements software-accessible local APIC
    x2APIC * Supports x2APIC
     
    CNXT-ID - L1 data cache mode adaptive or BIOS
     
    MCE * Supports Machine Check, INT18 and CR4.MCE
    MCA * Implements Machine Check Architecture
    PBE * Supports use of FERR#/PBE# pin
     
    PSN - Implements 96-bit processor serial number
     
    PREFETCHW * Supports PREFETCHW instruction
     
    Maximum implemented CPUID leaves: 0000001B (Basic), 80000008 (Extended).
    Maximum implemented address width: 48 bits (virtual), 39 bits (physical).
     
    Processor signature: 000806D1
     
    Logical to Physical Processor Map:
    -------------- Physical Processor 0 (Hyperthreaded)
    --
    ------------ Physical Processor 1 (Hyperthreaded)
    -------------- Physical Processor 2 (Hyperthreaded)
    ------
    -------- Physical Processor 3 (Hyperthreaded)
    -------------- Physical Processor 4 (Hyperthreaded)
    ----------
    ---- Physical Processor 5 (Hyperthreaded)
    -------------- Physical Processor 6 (Hyperthreaded)
    --------------
    Physical Processor 7 (Hyperthreaded)
     
    Logical Processor to Socket Map:
    **************** Socket 0
     
    Logical Processor to NUMA Node Map:
    **************** NUMA Node 0
     
    No NUMA nodes.
     
    Logical Processor to Cache Map:
    -------------- Data Cache 0, Level 1, 48 KB, Assoc 12, LineSize 64
    **-------------- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
    **-------------- Unified Cache 0, Level 2, 1 MB, Assoc 20, LineSize 64
    **************** Unified Cache 1, Level 3, 24 MB, Assoc 12, LineSize 64
    --
    ------------ Data Cache 1, Level 1, 48 KB, Assoc 12, LineSize 64
    -------------- Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
    --
    ------------ Unified Cache 2, Level 2, 1 MB, Assoc 20, LineSize 64
    -------------- Data Cache 2, Level 1, 48 KB, Assoc 12, LineSize 64
    ----
    ---------- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
    -------------- Unified Cache 3, Level 2, 1 MB, Assoc 20, LineSize 64
    ------
    -------- Data Cache 3, Level 1, 48 KB, Assoc 12, LineSize 64
    -------------- Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
    ------
    -------- Unified Cache 4, Level 2, 1 MB, Assoc 20, LineSize 64
    -------------- Data Cache 4, Level 1, 48 KB, Assoc 12, LineSize 64
    --------
    ------ Instruction Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
    -------------- Unified Cache 5, Level 2, 1 MB, Assoc 20, LineSize 64
    ----------
    ---- Data Cache 5, Level 1, 48 KB, Assoc 12, LineSize 64
    -------------- Instruction Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
    ----------
    ---- Unified Cache 6, Level 2, 1 MB, Assoc 20, LineSize 64
    -------------- Data Cache 6, Level 1, 48 KB, Assoc 12, LineSize 64
    ------------
    -- Instruction Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
    -------------- Unified Cache 7, Level 2, 1 MB, Assoc 20, LineSize 64
    --------------
    Data Cache 7, Level 1, 48 KB, Assoc 12, LineSize 64
    --------------** Instruction Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
    --------------** Unified Cache 8, Level 2, 1 MB, Assoc 20, LineSize 64
     
    Logical Processor to Group Map:
    **************** Group 0

Sign In or Register to comment.

Howdy, Stranger!

It looks like you're new here. Sign in or register to get started.

Upcoming OSR Seminars
OSR has suspended in-person seminars due to the Covid-19 outbreak. But, don't miss your training! Attend via the internet instead!
Kernel Debugging 13-17 May 2024 Live, Online
Developing Minifilters 1-5 Apr 2024 Live, Online
Internals & Software Drivers 11-15 Mar 2024 Live, Online
Writing WDF Drivers 20-24 May 2024 Live, Online