ReadThreadProfilingData function not returning useful / sensible / consistent results

eric_mayer · September 22, 2023, 1:34am

I’m trying to develop some performance test automation techniques that are more predictable and stable than simple elapsed time measurement. I plan for the tests to be broken down into discrete “use-case” based unit-test style tests rather than simply measuring background activity of our application. The measurements need to cover activity in both user and kernel mode.

My litmus test effort was to come up with a baseline low-ambiguity sample test to show that I could measure some kind of consistent number on CPU time ticks on a given test thread running on a given test O/S on a given test machine for a pure math algorithm happening in active memory, not calling any WinAPIs, not touching file system, etc. The test deliberately created “turbulence” by having a configurable number of competing threads periodically run the same algorithm as the thread being measured. I wanted to see if I could get the elapsed time to vary significantly on the thread being measured but get a solid/consistent number for its CPU time. I figured if I couldn’t achieve at least this, attempting measurement of any more complex/messy code would be pointless.

At first I tried using QueryThreadCycleTime (or its Kernel equivalent KeQueryTotalCycleTimeThread) but found that I was easily able to show test results where elapsed time (measured with QueryPerformanceCounter or it’s kernel equivalent) varied significantly but the CPU time on the measured thread hugged that varying curve tightly rather than being stable/consistent. That was disappointing, but I wasn’t too surprised because the documentation on these APIs is extremely spare and trying to find example usage turned up very little.

Then I came across the usermode Thread Profiling APIs (i.e. ReadThreadProfilingData), which seem a little better supported. I can provide this code if anyone wants to look it over or try it out. Unfortunately, though, the results looked strikingly similar to the previous effort. No additional measurement stability using thread-specific CPU time ticks rather than overall elapsed time. This time I also measured the thread context switch counts, trying to establish any pattern of potential cost of doing the context switch that might account for the additional time spent in the thread, but no such coherent pattern emerged.

The test output showing the tight correlation between overall time and thread-specific CPU time (plus context switch counts) follows. Also following are the specs on my test machine, as I understand the behavior of these profiling APIs can vary with hardware, but I don’t see any issue based on my best research effort on the topic – though I must admit as far as average software developers go, I’m not a hardware / CPU export in any sense. Anyone get any clues / tips? Is this enough to say “Hey what’s up with this Microsoft?” or did I fail to account for any other obvious test parameters / conditions?

eric_mayer · September 22, 2023, 1:35am

Test output:

C:\Source\endpoint_nav1_dev\x64\Release>ConsoleApplication1.exe o 12
Running test with offline thread and 12 competing threads on 16 processors.

main: Contested Test iteration 1 results:
main: j is 1804570059818208285, cpu ticks is 477569794.
main: elapsed microsec is 199718.
main: ratio is 2391.
main: Thread context switch count = 13

main: Free Test iteration 2 results:
main: j is 1804570059818208285, cpu ticks is 362215977.
main: elapsed microsec is 145259.
main: ratio is 2493.
main: Thread context switch count = 1

main: Free Test iteration 3 results:
main: j is 1804570059818208285, cpu ticks is 359002716.
main: elapsed microsec is 144025.
main: ratio is 2492.
main: Thread context switch count = 2

main: Contested Test iteration 4 results:
main: j is 1804570059818208285, cpu ticks is 491884161.
main: elapsed microsec is 197114.
main: ratio is 2495.
main: Thread context switch count = 0

main: Free Test iteration 5 results:
main: j is 1804570059818208285, cpu ticks is 359517039.
main: elapsed microsec is 144111.
main: ratio is 2494.
main: Thread context switch count = 1

main: Free Test iteration 6 results:
main: j is 1804570059818208285, cpu ticks is 356299170.
main: elapsed microsec is 142808.
main: ratio is 2494.
main: Thread context switch count = 0

main: Contested Test iteration 7 results:
main: j is 1804570059818208285, cpu ticks is 473551735.
main: elapsed microsec is 189981.
main: ratio is 2492.
main: Thread context switch count = 4

main: Free Test iteration 8 results:
main: j is 1804570059818208285, cpu ticks is 359549314.
main: elapsed microsec is 144151.
main: ratio is 2494.
main: Thread context switch count = 1

main: Free Test iteration 9 results:
main: j is 1804570059818208285, cpu ticks is 357777276.
main: elapsed microsec is 143635.
main: ratio is 2490.
main: Thread context switch count = 6

main: Contested Test iteration 10 results:
main: j is 1804570059818208285, cpu ticks is 472087235.
main: elapsed microsec is 189280.
main: ratio is 2494.
main: Thread context switch count = 1

main: Free Test iteration 11 results:
main: j is 1804570059818208285, cpu ticks is 356203917.
main: elapsed microsec is 142790.
main: ratio is 2494.
main: Thread context switch count = 3

main: Free Test iteration 12 results:
main: j is 1804570059818208285, cpu ticks is 354169145.
main: elapsed microsec is 142164.
main: ratio is 2491.
main: Thread context switch count = 1

main: Contested Test iteration 13 results:
main: j is 1804570059818208285, cpu ticks is 475484813.
main: elapsed microsec is 190663.
main: ratio is 2493.
main: Thread context switch count = 1

main: Free Test iteration 14 results:
main: j is 1804570059818208285, cpu ticks is 358580220.
main: elapsed microsec is 143718.
main: ratio is 2495.
main: Thread context switch count = 6

main: Free Test iteration 15 results:
main: j is 1804570059818208285, cpu ticks is 354703498.
main: elapsed microsec is 142150.
main: ratio is 2495.
main: Thread context switch count = 1

main: Contested Test iteration 16 results:
main: j is 1804570059818208285, cpu ticks is 482798437.
main: elapsed microsec is 193497.
main: ratio is 2495.
main: Thread context switch count = 0

main: Free Test iteration 17 results:
main: j is 1804570059818208285, cpu ticks is 355251649.
main: elapsed microsec is 142356.
main: ratio is 2495.
main: Thread context switch count = 0

main: Free Test iteration 18 results:
main: j is 1804570059818208285, cpu ticks is 357151258.
main: elapsed microsec is 143176.
main: ratio is 2494.
main: Thread context switch count = 0

main: Contested Test iteration 19 results:
main: j is 1804570059818208285, cpu ticks is 469841993.
main: elapsed microsec is 188402.
main: ratio is 2493.
main: Thread context switch count = 3

main: Free Test iteration 20 results:
main: j is 1804570059818208285, cpu ticks is 356370399.
main: elapsed microsec is 143112.
main: ratio is 2490.
main: Thread context switch count = 0

main: Total results:
main: cpu ticks min/max/variance is 354169145 / 491884161 / 38%.
main: elapsed microsec min/max/variance is 142150 / 199718 / 40%.

eric_mayer · September 22, 2023, 1:35am

CPU specs:

11th Gen Intel(R) Core™ i7-11850H @ 2.50GHz
Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
Microcode signature: 0000003E
HTT * Hyperthreading enabled
CET * Supports Control Flow Enforcement Technology
Kernel CET - Kernel-mode CET Enabled
User CET * User-mode CET Allowed
HYPERVISOR * Hypervisor is present
VMX - Supports Intel hardware-assisted virtualization
SVM - Supports AMD hardware-assisted virtualization
X64 * Supports 64-bit mode

SMX - Supports Intel trusted execution
SKINIT - Supports AMD SKINIT
SGX - Supports Intel SGX

NX * Supports no-execute page protection
SMEP * Supports Supervisor Mode Execution Prevention
SMAP * Supports Supervisor Mode Access Prevention
PAGE1GB * Supports 1 GB large pages
PAE * Supports > 32-bit physical addresses
PAT * Supports Page Attribute Table
PSE * Supports 4 MB pages
PSE36 * Supports > 32-bit address 4 MB pages
PGE * Supports global bit in page tables
SS * Supports bus snooping for cache operations
VME * Supports Virtual-8086 mode
RDWRFSGSBASE * Supports direct GS/FS base access

FPU * Implements i387 floating point instructions
MMX * Supports MMX instruction set
MMXEXT - Implements AMD MMX extensions
3DNOW - Supports 3DNow! instructions
3DNOWEXT - Supports 3DNow! extension instructions
SSE * Supports Streaming SIMD Extensions
SSE2 * Supports Streaming SIMD Extensions 2
SSE3 * Supports Streaming SIMD Extensions 3
SSSE3 * Supports Supplemental SIMD Extensions 3
SSE4a - Supports Streaming SIMDR Extensions 4a
SSE4.1 * Supports Streaming SIMD Extensions 4.1
SSE4.2 * Supports Streaming SIMD Extensions 4.2

AES * Supports AES extensions
AVX * Supports AVX instruction extensions
AVX2 * Supports AVX2 instruction extensions
AVX-512-F * Supports AVX-512 Foundation instructions
AVX-512-DQ * Supports AVX-512 double and quadword instructions
AVX-512-IFAMA * Supports AVX-512 integer Fused multiply-add instructions
AVX-512-PF - Supports AVX-512 prefetch instructions
AVX-512-ER - Supports AVX-512 exponential and reciprocal instructions
AVX-512-CD * Supports AVX-512 conflict detection instructions
AVX-512-BW * Supports AVX-512 byte and word instructions
AVX-512-VL * Supports AVX-512 vector length instructions
FMA * Supports FMA extensions using YMM state
MSR * Implements RDMSR/WRMSR instructions
MTRR * Supports Memory Type Range Registers
XSAVE * Supports XSAVE/XRSTOR instructions
OSXSAVE * Supports XSETBV/XGETBV instructions
RDRAND * Supports RDRAND instruction
RDSEED * Supports RDSEED instruction

CMOV * Supports CMOVcc instruction
CLFSH * Supports CLFLUSH instruction
CX8 * Supports compare and exchange 8-byte instructions
CX16 * Supports CMPXCHG16B instruction
BMI1 * Supports bit manipulation extensions 1
BMI2 * Supports bit manipulation extensions 2
ADX * Supports ADCX/ADOX instructions
DCA - Supports prefetch from memory-mapped device
F16C * Supports half-precision instruction
FXSR * Supports FXSAVE/FXSTOR instructions
FFXSR - Supports optimized FXSAVE/FSRSTOR instruction
MONITOR * Supports MONITOR and MWAIT instructions
MOVBE * Supports MOVBE instruction
ERMSB * Supports Enhanced REP MOVSB/STOSB
PCLMULDQ * Supports PCLMULDQ instruction
POPCNT * Supports POPCNT instruction
LZCNT * Supports LZCNT instruction
SEP * Supports fast system call instructions
LAHF-SAHF * Supports LAHF/SAHF instructions in 64-bit mode
HLE - Supports Hardware Lock Elision instructions
RTM - Supports Restricted Transactional Memory instructions

DE * Supports I/O breakpoints including CR4.DE
DTES64 * Can write history of 64-bit branch addresses
DS * Implements memory-resident debug buffer
DS-CPL - Supports Debug Store feature with CPL
PCID * Supports PCIDs and settable CR4.PCIDE
INVPCID * Supports INVPCID instruction
PDCM * Supports Performance Capabilities MSR
RDTSCP * Supports RDTSCP instruction
TSC * Supports RDTSC instruction
TSC-DEADLINE * Local APIC supports one-shot deadline timer
TSC-INVARIANT * TSC runs at constant rate
xTPR * Supports disabling task priority messages

EIST * Supports Enhanced Intel Speedstep
ACPI * Implements MSR for power management
TM * Implements thermal monitor circuitry
TM2 * Implements Thermal Monitor 2 control
APIC * Implements software-accessible local APIC
x2APIC * Supports x2APIC

CNXT-ID - L1 data cache mode adaptive or BIOS

MCE * Supports Machine Check, INT18 and CR4.MCE
MCA * Implements Machine Check Architecture
PBE * Supports use of FERR#/PBE# pin

PSN - Implements 96-bit processor serial number

PREFETCHW * Supports PREFETCHW instruction

Maximum implemented CPUID leaves: 0000001B (Basic), 80000008 (Extended).
Maximum implemented address width: 48 bits (virtual), 39 bits (physical).

Processor signature: 000806D1

Logical to Physical Processor Map:
-------------- Physical Processor 0 (Hyperthreaded)
–------------ Physical Processor 1 (Hyperthreaded)
-------------- Physical Processor 2 (Hyperthreaded)
-------------- Physical Processor 3 (Hyperthreaded)
-------------- Physical Processor 4 (Hyperthreaded)
-------------- Physical Processor 5 (Hyperthreaded)
------------– Physical Processor 6 (Hyperthreaded)
-------------- Physical Processor 7 (Hyperthreaded)

Logical Processor to Socket Map:
**************** Socket 0

Logical Processor to NUMA Node Map:
**************** NUMA Node 0

No NUMA nodes.

Logical Processor to Cache Map:
-------------- Data Cache 0, Level 1, 48 KB, Assoc 12, LineSize 64
-------------- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-------------- Unified Cache 0, Level 2, 1 MB, Assoc 20, LineSize 64
**************** Unified Cache 1, Level 3, 24 MB, Assoc 12, LineSize 64
–------------ Data Cache 1, Level 1, 48 KB, Assoc 12, LineSize 64
–------------ Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
–------------ Unified Cache 2, Level 2, 1 MB, Assoc 20, LineSize 64
-------------- Data Cache 2, Level 1, 48 KB, Assoc 12, LineSize 64
-------------- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
-------------- Unified Cache 3, Level 2, 1 MB, Assoc 20, LineSize 64
-------------- Data Cache 3, Level 1, 48 KB, Assoc 12, LineSize 64
-------------- Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
-------------- Unified Cache 4, Level 2, 1 MB, Assoc 20, LineSize 64
-------------- Data Cache 4, Level 1, 48 KB, Assoc 12, LineSize 64
-------------- Instruction Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
-------------- Unified Cache 5, Level 2, 1 MB, Assoc 20, LineSize 64
-------------- Data Cache 5, Level 1, 48 KB, Assoc 12, LineSize 64
-------------- Instruction Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
-------------- Unified Cache 6, Level 2, 1 MB, Assoc 20, LineSize 64
------------– Data Cache 6, Level 1, 48 KB, Assoc 12, LineSize 64
------------– Instruction Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------------– Unified Cache 7, Level 2, 1 MB, Assoc 20, LineSize 64
-------------- Data Cache 7, Level 1, 48 KB, Assoc 12, LineSize 64
--------------** Instruction Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
--------------** Unified Cache 8, Level 2, 1 MB, Assoc 20, LineSize 64

Logical Processor to Group Map:
**************** Group 0