Intel hardware performance counters on Windows

David_Yeager · March 13, 2014, 3:56pm

Hi,

Does anybody know of a way to statistically sample Intel hardware (CPU) performance/event counters on Windows in a way that includes the CPU program counter and process ID for each sampled event? I’ve developed a run-time binary optimizer for Linux that requires this and would like to port it to Windows.

Would I have to write a custom device driver from scratch? If so, do you know of any open source examples? Are there any APIs in Windows that export this to user space? In Linux, for example, this can easily be accomplished with a combination of perf_event_open(), mmap() and ioctl().
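For reference, the Linux setup mentioned above can be sketched roughly as follows. This is a minimal illustration, not the optimizer's actual code: the helper names `init_cycle_sampler` and `open_sampler`, the choice of the CPU-cycles event, and the decision to exclude kernel samples are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe a sampling counter that fires every
   `period` CPU cycles and records the instruction pointer and PID/TID
   in each sample. */
static void init_cycle_sampler(struct perf_event_attr *attr,
                               unsigned long long period)
{
    assert(period > 0);
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period = period;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr->disabled = 1;        /* start stopped; enable later via ioctl() */
    attr->exclude_kernel = 1;  /* assumption: user-mode samples only */
    attr->exclude_hv = 1;
}

/* Hypothetical helper: glibc provides no wrapper for perf_event_open,
   so the call goes through syscall(2). Returns a file descriptor, or
   -1 with errno set (for example when sampling is not permitted). */
static int open_sampler(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

Once a descriptor is obtained, the sample ring buffer is mapped with mmap() on that descriptor and counting is started with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0).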

Thanks,
David

Daniel_Terhell · March 14, 2014, 12:21am

The Intel Performance Counter Monitor includes source code of a driver that
reads the counters.

http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

//Daniel

David_Yeager · March 14, 2014, 10:41am

Thanks for your response Daniel.

I already looked at that driver, and unfortunately it only reads total event counts. It doesn’t provide the program counter and PID/TID for each event. That would require an interrupt handler for the hardware event and streaming those values to user space.
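To make the streaming half of that requirement concrete, here is a rough sketch of a fixed-size single-producer, single-consumer ring that a counter-overflow handler could fill and a reader could drain. Everything in it (`pc_sample`, `sample_ring`, `ring_push`, `ring_pop`, the slot count) is hypothetical and uses plain C11 atomics; it is not Windows driver code and leaves out the interrupt hookup and the mapping of the buffer into user space.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 1024u  /* must be a power of two */

/* One record per sampled event. */
struct pc_sample {
    uint64_t pc;   /* program counter at the time of the event */
    uint32_t pid;
    uint32_t tid;
};

struct sample_ring {
    _Atomic uint32_t head;  /* next slot to write; advanced by producer */
    _Atomic uint32_t tail;  /* next slot to read; advanced by consumer */
    struct pc_sample slots[RING_SLOTS];
};

/* Producer side (the handler). Never blocks: when the ring is full the
   sample is dropped and false is returned. */
static bool ring_push(struct sample_ring *r, struct pc_sample s)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slots[head & (RING_SLOTS - 1)] = s;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (the reader). Returns false when the ring is empty. */
static bool ring_pop(struct sample_ring *r, struct pc_sample *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Dropping samples when the ring is full is a deliberate choice in this sketch: for statistical sampling a lost record is usually cheaper than stalling the handler.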

I looked at the PAPI project and apparently they once had something that supported this for 32-bit Windows XP on only one CPU core years ago, but no longer.

So I guess the consensus is that I must build such a driver myself? No built-in Windows support, and no driver + user space interface publicly available on Windows?

Thanks,
David

Tim_Roberts · March 14, 2014, 12:52pm

xxxxx@gmail.com wrote:

> I already looked at that driver, and unfortunately it only reads total event counts. It doesn’t provide the program counter and PID/TID for each event. That would require an interrupt handler for the hardware event and streaming those values to user space.
>
> I looked at the PAPI project and apparently they once had something that supported this for 32-bit Windows XP on only one CPU core years ago, but no longer.
>
> So I guess the consensus is that I must build such a driver myself? No built-in Windows support, and no driver + user space interface publicly available on Windows?

This is exactly what Intel’s VTune product does. If your goal is to do
performance analysis, acquiring it would be a much better solution than
rolling your own.

If you are trying to build a performance analysis tool to distribute or
sell, then as you say you will pretty much be on your own. The
performance counters vary considerably from model to model and
manufacturer to manufacturer, so you are not going to find any generic
APIs to help you.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

David_Yeager · March 14, 2014, 4:30pm

Thanks Tim. Yeah, I suppose I’ll have to get this built then. I’ve used VTune for years and it does a great job, but I’m not looking for a performance analysis tool or trying to build one. The profiling stream is for a run-time, profile-directed JIT compiler. It’s a shame, because Linux provides a simple API for this and there are a few open source Linux drivers that also do it, but I guess there is nothing on Windows yet.

OSR_Community_User · March 15, 2014, 9:39pm

Note that program-counter-samplers can tell you WHAT is going wrong, but
not WHY.

Some years ago, I wrote an incredibly efficient storage allocator. On a
uniprocessor, we could have had the compiler expand the allocation call
inline, because it was faster to allocate directly than to call the
allocator (remember, we were running on 1 MIPS machines in those days).

The compiler group came to me and said that my allocator was far, far too
slow, and was unusable. To prove this, they showed me the PC-sample run
on Unix. Yes, the gigantic spike was the storage allocator.

So I started up their test, but reached in with the debugger and turned on
a couple internal counters I had in the allocator. Yes, the storage
allocator was the most-used piece of code. On the other hand, there were
2,000,000+ allocate requests and the matching 2,000,000+ release requests.
So I then set some breakpoints, and traced it back by the return address
(we had no source debugger, just to add to the fun). I found an internal
loop that generated all these requests. I recoded it to do just one
allocation per call, and the number of allocations dropped to under
300,000, and the storage allocator was not the problem. Performance data
is always suspect unless you truly understand what it is telling you.
joe
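The kind of internal counter described above can be sketched in a few lines. This is only an illustration of the idea, not the original allocator: `counted_alloc`, `alloc_calls` and the two `sum_squares_*` workloads are made-up names, and malloc() stands in for the custom allocator.

```c
#include <stdlib.h>

/* Number of requests seen by the allocator wrapper. */
static unsigned long alloc_calls;

static void *counted_alloc(size_t n)
{
    alloc_calls++;
    return malloc(n);
}

/* Caller that requests a scratch cell on every loop iteration.
   A PC sampler would show the allocator as hot, not this loop. */
static long sum_squares_naive(int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        long *tmp = counted_alloc(sizeof *tmp);
        if (!tmp)
            return -1;
        *tmp = (long)i * i;
        total += *tmp;
        free(tmp);
    }
    return total;
}

/* Same result with one allocation per call. */
static long sum_squares_hoisted(int n)
{
    long *tmp = counted_alloc(sizeof *tmp);
    if (!tmp)
        return -1;
    long total = 0;
    for (int i = 0; i < n; i++) {
        *tmp = (long)i * i;
        total += *tmp;
    }
    free(tmp);
    return total;
}
```

Comparing alloc_calls after each version shows the difference that the PC histogram alone hides: the allocator is equally fast in both, it is simply asked far less often in the second.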

NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · March 15, 2014, 9:49pm

> Thanks Tim. Yeah I suppose I’ll have to get this built then. I’ve used
> vtune for years and it does a great job, but I’m not looking for a
> performance analysis tool or trying to build one. The profiling stream is
> for a run-time profile directed JIT compiler. It’s a shame because linux
> provides a simple API for this and there are a few open source linux
> drivers that also do it but I guess nothing on Windows yet.

Note that a JIT optimizer would not be able to optimize the bad loop I
found in my previous example. In fact, it probably couldn’t even identify
that the allocator was being called in an inner loop without massive
examination of the call tree, which is not cost-effective for a JIT.

I know people who do massive JIT optimizations on Java code, and have
found that the only effective tool is a bus analyzer, about $500,000 for a
modern multicore system, and the real “family jewels” are their analysis
algorithms. They then use the information derived from these analyses to
build a smarter JIT. I can’t say too much because of NDA, but the cache
impact of software tools was substantial, and pretty much invalidated the
use of software monitoring. This knowledge is public, so I’m not
violating any part of the NDA in revealing it.
joe
