Hello all. I was just wondering if there is a kernel data structure or possibly even a user level API that I can use to obtain the total number of instructions executed by a process. I recall possibly happening upon this information when I was involved with one of my previous driver projects, but I cannot determine if that was a false memory or if I just forgot the exact specifications! Any insight into this would be fantastic!
xxxxx@gmail.com wrote:
> Hello all. I was just wondering if there is a kernel data structure or possibly even a user level API that I can use to obtain the total number of instructions executed by a process. I recall possibly happening upon this information when I was involved with one of my previous driver projects, but I cannot determine if that was a false memory or if I just forgot the exact specifications! Any insight into this would be fantastic!
No such thing exists, because the operating system does not know this.
It’s possible to use the Pentium performance counters to track this, but
they are non-trivial to use. The rdtsc cycle counter will get you a
close estimate, and it can be used from user mode.
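
For reference, a minimal sketch of reading the cycle counter from user mode. It assumes an x86 compiler that provides the `__rdtsc()` intrinsic (`<intrin.h>` on MSVC, `<x86intrin.h>` on GCC/Clang). The names `read_ticks`, `workload` and `time_workload` are made up for illustration, and the `clock()` fallback for non-x86 builds is an assumption added only so the sketch compiles everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      /* MSVC: declares __rdtsc() */
#define HAVE_RDTSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   /* GCC/Clang: declares __rdtsc() */
#define HAVE_RDTSC 1
#else
#include <time.h>        /* assumption: coarse fallback on non-x86 targets */
#define HAVE_RDTSC 0
#endif

/* Read the time-stamp counter. rdtsc is normally not a privileged instruction. */
uint64_t read_ticks(void)
{
#if HAVE_RDTSC
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Hypothetical workload standing in for the code being measured. */
uint32_t workload(uint32_t n)
{
    volatile uint32_t sum = 0;   /* volatile keeps the loop from being optimized away */
    for (uint32_t i = 0; i < n; ++i)
        sum += i;
    return sum;
}

/* Ticks elapsed while running workload(n) once; the workload result goes to *result. */
uint64_t time_workload(uint32_t n, uint32_t *result)
{
    assert(result != NULL);
    uint64_t start = read_ticks();
    *result = workload(n);
    uint64_t end = read_ticks();
    return end - start;
}
```

The returned delta counts ticks, not instructions, and it includes anything else that ran on the core between the two reads, so a single reading is only a rough estimate.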
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
Tim Roberts wrote:
> No such thing exists, because the operating system does not know this. It’s possible to use the Pentium performance counters to track this, but they are non-trivial to use. The rdtsc cycle counter will get you a close estimate, and it can be used from user mode.
Thank you very much for your information!
Tim Roberts wrote:
> It’s possible to use the Pentium performance counters to
> track this, but they are non-trivial to use.
Why are they “non-trivial to use”?
> No such thing exists, because the operating system does not know this.
It is better to say “because the operating system just does not care”…
Given that the number of clock cycles required to execute an instruction varies (for example, a single multiplication with IMUL is going to take more CPU cycles than several bit shifts and additions, which is why you may sometimes want to replace a multiplication with bit shifts and additions), the whole thing just does not make sense in itself…
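
To make the multiplication example concrete, here is a small sketch of the same computation written both ways. The function names are hypothetical. The two versions return identical results but use a different number of instructions, which is why an instruction count alone says little about run time. Whether the shift-and-add form is actually faster depends on the CPU and compiler, and the sketch makes no claim either way.

```c
#include <assert.h>
#include <stdint.h>

/* One multiply instruction (if the compiler emits it literally). */
uint32_t times10_mul(uint32_t x)
{
    return x * 10u;
}

/* Same value from two shifts and an add: 10x = 8x + 2x. */
uint32_t times10_shift(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Sanity check that both forms agree, including on unsigned wraparound. */
void check_times10(uint32_t x)
{
    assert(times10_mul(x) == times10_shift(x));
}
```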
> The rdtsc cycle counter will get you a close estimate, and it can be used from user mode.
This is already a more indicative approach: your measurement at least becomes meaningful if you do things this way. However, in order to make any practical use of it, you have to measure the execution speed of some particular piece of code, rather than of a process/thread, and, if you want your results to be more or less indicative, raise IRQL at least to DPC level (if you want high-precision results, you have to disable interrupts on the given CPU). Any high-precision measurement at IRQL < DPC level is just pointless, because your target thread is subject to context switches.
Taking into consideration the above, using cycle counter from the user mode does not really make sense…
Anton Bassov
xxxxx@hotmail.com wrote:
> Taking into consideration the above, using cycle counter from the user mode does not really make sense…
Nonsense. As long as you are conscious of the limitations, repeat the
tests enough times, and take appropriate steps to eliminate wild points
caused by context switches, it provides a perfectly useful metric for
many situations. I wouldn’t necessarily trust it for “this sequence
takes exactly N cycles” statements, but it is certainly very useful for
“this sequence takes longer than this sequence” comparisons.
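
One way to "eliminate wild points" is sketched below: collect many cycle deltas for the same code, then report the minimum or the median instead of the mean, so that a few runs inflated by context switches do not skew the result. The helper names are hypothetical, and the choice of minimum/median is an assumption about what "appropriate steps" might look like, not a description of any particular tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t that avoids overflow from subtraction. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns the median (lower median for even n). */
uint64_t median_cycles(uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[(n - 1) / 2];
}

/* Smallest sample: the run least disturbed by interrupts and context switches. */
uint64_t min_cycles(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; ++i) {
        if (samples[i] < best)
            best = samples[i];
    }
    return best;
}
```

With samples such as 120, 118, 9000, 121, 119, the 9000 outlier (a run that was presumably interrupted) affects neither the minimum nor the median, whereas it would dominate a plain average.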
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
xxxxx@hushmail.com wrote:
Tim Roberts wrote:
> It’s possible to use the Pentium performance counters to
> track this, but they are non-trivial to use.
>Why are they “non-trivial to use”?
What one would like to do is just go access one of the machine-specific
registers and say “give me the current value of this counter”, but it’s
not that easy. The available counters vary wildly from chip to chip,
even within the same family. Most of the counters have various
configuration parameters, not all of them consistent. You have to set
affinity to make sure you keep talking to the same core. Each counter
you want to use has to be configured, initialized, and started. At a
later point, you stop it, and read the value. On a P4, only two
counters can be accessed at a time. And, of course, rdmsr and wrmsr are
protected instructions, so all of this has to be done from a kernel driver.
It’s not rocket science, but it was a lot more obscure than I expected.
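
To give a feel for the configuration step, here is a rough sketch of building the event-select value for an "instructions retired" counter. It assumes the later Intel architectural performance-monitoring layout (IA32_PERFEVTSEL0/IA32_PMC0), not the P4 scheme mentioned above, which uses a different register set. The helper name `make_evtsel` is hypothetical. Only the value computation is shown; actually writing the MSRs still has to happen in kernel mode on a thread pinned to one core.

```c
#include <assert.h>
#include <stdint.h>

/* MSR addresses in the architectural performance-monitoring scheme. */
#define IA32_PMC0        0x0C1u   /* counter 0 */
#define IA32_PERFEVTSEL0 0x186u   /* event select for counter 0 */

/* IA32_PERFEVTSELx control bits. */
#define EVTSEL_USR (1u << 16)     /* count while in user mode (CPL > 0) */
#define EVTSEL_OS  (1u << 17)     /* count while in kernel mode (CPL 0) */
#define EVTSEL_EN  (1u << 22)     /* enable the counter */

/* Architectural "instructions retired" event. */
#define EVENT_INSTR_RETIRED 0xC0u
#define UMASK_INSTR_RETIRED 0x00u

/* Build the 64-bit value to write to an IA32_PERFEVTSELx register. */
uint64_t make_evtsel(uint8_t event, uint8_t umask, int count_user, int count_kernel)
{
    assert(count_user || count_kernel);   /* otherwise nothing would be counted */
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;
    if (count_user)
        v |= EVTSEL_USR;
    if (count_kernel)
        v |= EVTSEL_OS;
    return v;
}

/*
 * Outline of the driver side (kernel mode only, not compiled here):
 *   1. pin the thread to one core;
 *   2. write 0 to IA32_PMC0 to initialize the counter;
 *   3. write make_evtsel(...) to IA32_PERFEVTSEL0 to start counting;
 *   4. run the code of interest;
 *   5. clear EVTSEL_EN in IA32_PERFEVTSEL0 to stop, then read IA32_PMC0.
 */
```

Before relying on any of this, the availability of the counter should be checked with CPUID, since, as noted, what exists varies from chip to chip.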
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
> Nonsense. As long as you are conscious of the limitations, repeat the tests enough times, and take appropriate steps to eliminate wild points caused by context switches, it provides a perfectly useful metric for many situations. I wouldn’t necessarily trust it for “this sequence takes exactly N cycles” statements, but it is certainly very useful for “this sequence takes longer than this sequence” comparisons.
Well, a “this sequence takes longer than this sequence” comparison is not that useful in itself, is it? In order to make any practical use of it, you have to be able to say “this sequence takes N nanoseconds (microseconds, milliseconds, etc.) longer than this sequence”, i.e. normally you have to decide whether the difference is significant in a given context. Therefore, in practical terms, a “this sequence takes exactly N cycles”-style judgement just cannot be completely avoided in performance tests…
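
Turning a cycle difference into a time difference is simple arithmetic once the counter frequency is known. The sketch below assumes the counter ticks at a constant, known rate `tsc_hz`. That rate is an assumption the caller has to measure or look up, and the conversion is meaningless on CPUs where the counter rate changes with clock speed. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count to nanoseconds, given the counter frequency in Hz. */
double cycles_to_ns(uint64_t cycles, double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return (double)cycles * 1e9 / tsc_hz;
}

/* How many nanoseconds longer sequence A took than sequence B (negative if shorter). */
double delta_ns(uint64_t cycles_a, uint64_t cycles_b, double tsc_hz)
{
    return cycles_to_ns(cycles_a, tsc_hz) - cycles_to_ns(cycles_b, tsc_hz);
}
```

For example, with a 3 GHz counter, 3000 cycles correspond to 1000 ns, so a 600-cycle difference between two sequences is about 200 ns.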
Anton Bassov