Profiling Win32 applications (amateur)

Hi,
I have developed a profiler which uses the Microsoft VC compiler
switch “/Gh” to insert a call to a hook function at the start of each
function. The drawback of this hooking method is that it has to be done
at compile time (you need to recompile to switch between a regular build
and a profile build) and it needs the source (not an issue for developers
profiling their own code).
I am exploring a mechanism for runtime hooking, either by binary
patching or by using some callback (if any) on each function call. I have
seen some profilers which can attach to a running process and generate
profile information (they need debug info, which is not an issue).
Can someone give me some leads on this? I am willing to
explore suggestions.
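
For reference, the bookkeeping behind such a hook can be sketched in portable C. This is only an illustration, not the profiler described above: `record_call`, `call_count` and the fixed-size table are hypothetical, and the code assumes the hook that /Gh makes the compiler call (`_penter` in MSVC) has already worked out the non-zero address of the function being entered.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u  /* must be a power of two; assumed large enough */

struct call_entry {
    uintptr_t addr;        /* function address, 0 means "slot unused" */
    unsigned long count;   /* number of times the function was entered */
};

static struct call_entry table[TABLE_SIZE];

/* Find the slot for addr using open addressing with linear probing.
   Returns NULL only if the table is completely full. */
static struct call_entry *find_slot(uintptr_t addr)
{
    size_t start = (size_t)(addr >> 4) & (TABLE_SIZE - 1);
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        struct call_entry *e = &table[(start + i) & (TABLE_SIZE - 1)];
        if (e->addr == addr || e->addr == 0)
            return e;
    }
    return NULL;
}

/* Called by the (hypothetical) entry hook for every function call. */
void record_call(uintptr_t addr)
{
    struct call_entry *e = find_slot(addr);
    if (e != NULL) {
        e->addr = addr;
        e->count++;
    }
}

/* Number of recorded calls for addr, 0 if it was never seen. */
unsigned long call_count(uintptr_t addr)
{
    struct call_entry *e = find_slot(addr);
    return (e != NULL && e->addr == addr) ? e->count : 0;
}
```

The sketch is not thread-safe; a real hook would need per-thread tables or atomic updates, and would also have to preserve registers around the call.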

with best regards,
dhruva

PS: I am posting this question to both the NTdev and Windbg lists as I do
not know whether we need to work at driver level (I have no experience
there) or use the DbgHelp/ImageHlp libraries (some experience).


Proud FSF member: #1935
http://schemer.fateback.com/

The usual way of profiling is to have the OS run a periodic interrupt and,
on each invocation, look at the EIP value. Then find the area that this EIP
falls into, and increment that area's usage count.
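
A minimal sketch of that counting step in C. The names (`struct profile`, `profile_hit`) and the bucket layout are illustrative assumptions, not the actual interface of the Windows Profile object; the sampled EIP is assumed to be handed in by whatever the timer interrupt provides.

```c
#include <stddef.h>
#include <stdint.h>

/* One counter per fixed-size "area" of the profiled address range. */
struct profile {
    uintptr_t base;          /* first address covered */
    size_t length;           /* number of bytes covered */
    unsigned bucket_shift;   /* each bucket spans (1 << bucket_shift) bytes */
    unsigned long *buckets;  /* caller-provided array of counters */
    size_t bucket_count;     /* number of elements in buckets */
};

/* Account for one sampled instruction pointer.
   Returns 1 if the sample fell inside the range, 0 otherwise. */
int profile_hit(struct profile *p, uintptr_t eip)
{
    if (eip < p->base || eip - p->base >= p->length)
        return 0;
    size_t index = (size_t)((eip - p->base) >> p->bucket_shift);
    if (index >= p->bucket_count)
        return 0;
    p->buckets[index]++;
    return 1;
}
```

Smaller buckets give finer resolution at the cost of a larger counter array; mapping buckets back to function names is a separate step that needs symbol information.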

This requires some OS support. In Windows, this is the Profile object. In
UNIX, this is a syscall such as ptrace():

http://www.freebsd.org/cgi/man.cgi?query=ptrace&apropos=0&sektion=0&manpath=FreeBSD+5.2.1-RELEASE+and+Ports&format=html

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

I’m not sure what you’re trying to achieve (i.e. are you researching this
for a product), so I may be answering the wrong thing…

Intel’s VTune can do this for you, and there is a product from AMD,
CodeAnalyst, that does profiling too.

I’m more familiar with VTune (although I worked at AMD, and I have used
CodeAnalyst too), and it basically uses a timer interrupt to figure out
where in the code execution was at that moment, and then digs out which
application this belongs to, etc.

There are similar tools for Linux, oprofile is one of the ones I’ve used.

I’ve written code that does the same thing in real-time OSes, using a timer
interrupt and grabbing the “return address” off the stack, and you can quite
easily make a histogram of where the code is executing. With a reasonably
precise timer (1ms or so), you get a good idea of where you’ve executed for
most of the time. Obviously, depending on the run-time of the application,
a more or less precise timer is needed. 1ms is good enough for something
that runs for a few seconds and upwards. If the app only runs for a second
or so, you’re not going to get enough samples to get any good statistics.
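
To put rough numbers on that, here is a small C sketch. The function names are made up for the example, and it assumes samples are taken at a perfectly regular interval.

```c
/* Number of samples collected over run_ms milliseconds when sampling
   every interval_ms milliseconds (interval_ms must be non-zero). */
unsigned long total_samples(unsigned long run_ms, unsigned long interval_ms)
{
    return run_ms / interval_ms;
}

/* Expected number of samples landing in code that uses the given share
   of the run time, expressed in tenths of a percent (10 == 1%). */
unsigned long expected_hits(unsigned long samples, unsigned long share_permille)
{
    return samples * share_permille / 1000;
}
```

With a 1ms timer, a 1 second run yields 1000 samples, so code using 1% of the time is expected to collect only about 10 hits. A 60 second run gives 60000 samples and about 600 hits for the same code.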

I hope this helps somewhat…


Mats

On Wed, 22 Sep 2004 15:56:02 +0400, Maxim S. Shatskih
wrote:
> The usual profiling is running a periodic interrupt by the OS means, and,
> on each invocation, looks at EIP value. Then find the area to which this EIP
> hits, and increment the area usage count.
>

With the periodic timer, can there be occasions where you miss
something that happens in the interval between successive
interrupts? I am not sure sampling is a good idea, as
I also need the number of times a function is called in a given
scenario. Using the hook function concept, I can record 100% of
function calls.

-dhruva



On Wed, 22 Sep 2004 13:00:58 +0100, Mats PETERSSON
wrote:
>
>
> I’m not sure what you’re trying to achieve (i.e. are you researching this
> for a product), so I may be answering the wrong thing…
Rational Quantify failed to profile one of our applications. I later
developed a profiler which works but needs compile-time changes (as
mentioned in my earlier mail). So, I am researching this for the tool
I have developed for in-house use (I hope to make it Open Source if
the company allows).

> Intel’s VTune can do this for you, and there is a product from AMD, Code
> Analyst, that does profiling too.

Will check Intel VTune (used it a while ago). Thanks for pointing it out.

> I’ve written code that does the same thing in real-time OS’s, using a timer
> interrupt and grabbing the “return address” of the stack, and you can quite
> easily make a histogram of where the code is executing. With a reasonably
> precise timer (1ms or so), you get a good idea of where you’ve executed for
> most of the time. Obviously, depending on the run-time of the application,
> more or less precise timer is needed. 1ms is good enough for something that
> runs for a few seconds and upwards. If the app only runs for a second or
> so, you’re not going to get enough samples to get any good statistics.
>

This approach can miss function calls which take less time than
the interval of the timer. With this, you will not get exact
information about the number of calls to a given method. In profiling,
we need both the time per call (can be an average) and the number of
calls per method/function. Maybe I can still use this approach and
calibrate the profiler to find out the percentage of misses for a given
scenario (not sure how complex this can get!).

-dhruva



I guess it depends on what you’re trying to achieve: Show the NUMBER OF
TIMES a particular section of code is executed, or the AMOUNT OF TIME used,
and at what granularity.

Figuring out that the function “Do_all_the_work” is using 98.9% of the CPU
time may not tell you much of interest, but finding that the particular
line that traverses a linked list is using a large portion of the time IN
that function may be very helpful.

Of course, having the choice of both would be even better… ;-)

The periodic timer will naturally not work to catch all function calls,
because you can probably get thousands of function calls in a millisecond.

The periodic timer method’s advantage is, however, that it’s completely
non-intrusive (aside from using a small percentage of the CPU time), whilst
modifying existing code and adding calls to some statistics function will
cause all sorts of havoc with the original code (knowing how many bytes the
next instruction is when you’ve overwritten half of it will be just one of
many problems). Really simple functions may not even be long enough to be
overwritten with your interception code… Consider something like:

    int globalValue;

    int func(void)
    {
        return globalValue;
    }

In optimised code, this becomes:

    mov eax, globalValue
    ret

About 6 bytes on 32-bit x86 (5 for the mov, 1 for the ret). A call at the
beginning and end of the function requires at least 5 bytes each => 10 bytes
(you can probably use the return address to figure out which function it
is, so saving the need to add parameters to the function as well).


Mats

> This require some OS support. In Windows, this is a Profile object. In
> UNIX, this is a ptrace() syscall like:

Actually, the OS cannot support profiling with complete statistics, simply
because of the performance cost; it cannot afford to.

If the hardware itself supported some kind of hook on each instruction
(optional, and with solid rules to prevent misuse), that would make it
possible to collect all kinds of data, and the report would be accurate
(when we get accurate data, we can care less about performance, I guess? ;-) ).

When improving a device for its next version, what we need is complete
statistics from the earlier versions. If a device does not give software
complete control to report all the operations it performs internally, then
collecting such statistics is not possible. Without proper data, the chances
of delivering a better product are lower (or it takes many more versions to
release, and we still have no proper data on how the device works).

Just a personal opinion…

Regards,
Satish K.S

On Wed, 22 Sep 2004 13:41:29 +0100, Mats PETERSSON
wrote:
>
>
> The periodic timer will naturally not work to catch all function calls,
> because you can probably get thousands of function calls in a millisecond.
>
> The periodic timer method’s advantage is, however, that it’s completely
> non-intrusive (aside from using a small percentage of the CPU-time), whilst
> modifiying existing code and adding calls to some statistics function will
> cause all sorts of havoc with the original code (knowing how many bytes the
> next instruction is when you’ve overwritten half of it will be just one of
> many problems). Really simple functions may not even be long enough to be
> overwritten with your interception code… Consider something like:
>

With this, I will take the following approach:
1. For really fine-grained profiling, use the compile-time switch from MSVC
2. For regular profiling to get average results (based on which one
can then do fine-grained profiling), use the timer-based non-intrusive
profiling.

Since #1 is done only occasionally, IMO, this is the best course to take.
Thank you for pointing out the finer details which I had missed.

-dhruva


> With this, I will take the following approach:
> 1. For a real fine grain profiling, use the compile time switch from MSVC

Note that this also severely affects the run-time of the application,
especially if you have tiny functions that get called often (say, for
instance, reading a large file using fgetc: fgetc will be called a LARGE
number of times). So the overhead will be quite large under some
circumstances.

> 2. For regular profiling to get average results (based on which one
> can do a fine grained profiling), use the timer based non-intrusive
> profiling.

Yes, this will at least give you a good idea of which area of code is using
most of the time. With some experience, you can actually figure out
exactly what is happening inside the processor as well, and find problems
such as cache misses and memory-latency-related problems.

As long as the code is being run in a similar pattern for most of the
profiling run, then this approach works well. And if the code doesn’t spend
much time in any particular place, you’ll get a few hits all over the
place, which in itself is an indication of what’s going on, but perhaps
more in line with reading between the lines rather than being given the
data that it’s happening.


Mats

On Wed, 2004-09-22 at 06:47, Dhruva Krishnamurthy wrote:
> I was exploring a method/mechanism to do a runtime hooking by either
> some binary patching or using some callback (if any) on each function
> call. I have seen some profilers which can attach to a running process
> and generate profile information (needs debug info which is not an
> issue).
> Can someone give me some leads regarding this. I am willing to
> explore suggestions.

Microsoft Research published a library project called “Detours” some
time ago that may be of some help to you. It is licensed on
non-commercial terms, but it includes source code, so you can get an
idea of how to do what you want to do.


Steve Dispensa
MVP - Windows DDK
www.kernelmustard.com

> I was exploring a method/mechanism to do a runtime hooking by either
> some binary patching or using some callback (if any) on each function
> call. I have seen some profilers which can attach to a running process
> and generate profile information (needs debug info which is not an
> issue).

If you hook an application after it is already running, you are going to
miss many calls, so you need to watch it from launch. If the application
just uses DLLs, you can patch many calls easily.

But if the application interacts with out-of-process COM servers and other
processes, it is difficult to get all the data related to it.

To do at run time what you have done at compile time, you have a few
possibilities:

  • Patch all the DLLs it loads (both the IAT, Import Address Table, and
    the EAT, Export Address Table).
  • Analyse the binary of the .EXE to enumerate all “CALL instructions” and
    patch them.
  • Deal with function calls that result in more API calls in another
    process context (COM).

While patching any CALL you need to take care of multi-threading issues and
the stability of the application you patch. The best way to patch is to
replace the first few bytes of a function with a JMP instruction (IIRC,
this is the method used by Detours). But with this method you have a few
things to take care of:

  • When you patch, after getting control inside your function you need to
    call the original function. The naive method is to replace the JMP
    instruction with the original bytes, make the call, and then put your JMP
    back again. This causes problems with multiple threads. To solve this,
    you can try copying the overwritten 5 bytes into a buffer and executing
    them from there, but that will not work unless the 5 bytes are complete
    instructions.
  • Any function you patch must be at least 5 bytes long.
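
As an illustration of the 5-byte patch mentioned above, this C sketch only builds the bytes of an x86 `JMP rel32` instruction in a buffer; it does not write into any process. `encode_jmp_rel32` and `can_patch` are hypothetical helpers, addresses are assumed to be 32-bit, and this is not claimed to be how Detours implements it.

```c
#include <stdint.h>

#define JMP_REL32_SIZE 5  /* opcode byte + 32-bit displacement */

/* Write "JMP target" as it would be encoded at address `from`.
   The displacement is relative to the end of the instruction. */
void encode_jmp_rel32(uint8_t out[JMP_REL32_SIZE], uint32_t from, uint32_t target)
{
    uint32_t rel = target - (from + JMP_REL32_SIZE);  /* wraps modulo 2^32 */
    out[0] = 0xE9;                    /* JMP rel32 opcode */
    out[1] = (uint8_t)(rel & 0xFF);   /* displacement, little-endian */
    out[2] = (uint8_t)((rel >> 8) & 0xFF);
    out[3] = (uint8_t)((rel >> 16) & 0xFF);
    out[4] = (uint8_t)((rel >> 24) & 0xFF);
}

/* A function shorter than the patch cannot be hooked this way. */
int can_patch(uint32_t function_length)
{
    return function_length >= JMP_REL32_SIZE;
}
```

The length check is necessary but not sufficient: the overwritten bytes must also end on an instruction boundary, which needs a disassembler to verify.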

Hmm… I never really found a better method that always works properly.

Regards,
Satish K.S

On Wed, 22 Sep 2004 08:54:22 -0500, Steve Dispensa
wrote:
> On Wed, 2004-09-22 at 06:47, Dhruva Krishnamurthy wrote:
> > Can someone give me some leads regarding this. I am willing to
> > explore suggestions.
> >
>
> Microsoft Research published a library project called “Detours” some
> time ago that may be of some help to you. It is licensed on
> non-commercial terms, but it includes source code, so you can get an
> idea of how to do what you want to do.
>

IMO, Detours is again compile time. It does something like I have
already implemented (but more professionally).

-dhruva



> > Microsoft Research published a library project called “Detours” some
> > time ago that may be of some help to you. It is licensed on
> > non-commercial terms, but it includes source code, so you can get an
> > idea of how to do what you want to do.
>
> IMO, Detours is again compile time. It does something like I have
> already implemented (but more professionally).

Actually, it supports hooking at run time as well. You can check the
samples which attach to a process.

Regards,
Satish K.S

On Thu, 23 Sep 2004 12:57:24 +0530, int3 wrote:
> >
> > > Microsoft Research published a library project called “Detours” some
> > > time ago that may be of some help to you. It is licensed on
> > > non-commercial terms, but it includes source code, so you can get an
> > > idea of how to do what you want to do.
> > >
> >
> > IMO, Detours is again compile time. It does something like I have
> > already implemented (but more professionally).
>
> Actually, its supports hooking runtime also. you can check the samples which
> attaches to Process.

I have checked the README.txt, which has a brief write-up of all the
samples, and I did not find any sample that supports hooking ALL
methods in a running process. Hooking some pre-defined methods is
trivial (as explained in Jeffrey Richter’s famous book) using
WriteProcessMemory. Please let me know if I have missed something
here.

-dhruva



I guess this can be achieved (on paper at least) with the following approach:
1. Read all the loaded modules, load their debug info
2. Instrument LoadLibrary to trap further DLL loads
3. Enumerate through all functions
4. Instrument all functions.
5. Repeat #3 and #4 every time a new DLL is loaded.

Let me try this approach. I request the experienced to comment on it
so that I do not go on a wild goose chase.

-dhruva



> I have checked the README.txt which has a brief write up of all
> samples and I did not find any sample that supports hooking ALL
> methods in a running process. Hooking some pre-defined methods is
> trivial (as explained in Jeffery Richter’s famous book) using
> WriteProcessMemory. Please let me know if I have missed something
> here.

I have never heard of any library that enumerates all the CALL instructions
in a process and patches them. Even if you develop your own library, you
will need to build it on WriteProcessMemory().

Regards,
Satish K.S

On Thu, 23 Sep 2004 15:57:59 +0530, int3 wrote:

> I never heard of any Library, which enumerated all CALL instructions in a
> Process and patches it. Even if you develop your own Lib, you need to use
> WriteProcessMemory() and develop it.
>

Check out http://www.dyninst.org/ It looks promising.

-dhruva



Hi,
If you start the process being profiled as its debugger (check John
Robbins), then you receive an event each time a DLL is loaded into the
process’s memory, which gives you the DLL’s load address. Now, if you stop
the thread being profiled at fixed intervals and look at the current
address in its context, then you have your answer.
However, in essence the main problem will be the granularity of the sampling
time.
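
The address-to-module step can be sketched like this in C. The `struct module` list is a stand-in for whatever the profiler records from the debugger's DLL-load events; the names and layout are assumptions for the example, and the debugger loop itself is not shown.

```c
#include <stddef.h>
#include <stdint.h>

struct module {
    const char *name;     /* e.g. the DLL file name */
    uintptr_t base;       /* load address reported when the DLL was mapped */
    size_t size;          /* size of the mapped image in bytes */
    unsigned long hits;   /* samples that landed inside this module */
};

/* Return the module containing addr, or NULL if none does. */
struct module *find_module(struct module *mods, size_t count, uintptr_t addr)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= mods[i].base && addr - mods[i].base < mods[i].size)
            return &mods[i];
    }
    return NULL;
}

/* Record one sampled address; returns 1 if it was attributed to a module. */
int record_sample(struct module *mods, size_t count, uintptr_t addr)
{
    struct module *m = find_module(mods, count, addr);
    if (m == NULL)
        return 0;
    m->hits++;
    return 1;
}
```

Resolving a sample further, down to a function within the module, would need the debug information for that module.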

kutty

Did you get this approach working? Actually, I implemented much the same
thing some time back. Search Google for StraceNT or visit
http://www.intellectualheaven.com
It uses the IAT patching technique to hook all the APIs exported by DLLs.
What it lacks is the patching of functions compiled inside the exe or
statically linked from libraries. For that you will need debug
information.

By the way, the steps you listed will work for sure. I guess that’s the way
commercial tools like AppSight (Identify Software) work, which trace the
complete call flow of a running program.
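
The idea behind IAT patching can be shown without any Windows-specific code: imported functions are called through a table of pointers, so swapping a table entry redirects every call made through it. The C sketch below uses an ordinary array as a stand-in for the import table; it is not how StraceNT is implemented, and all names are invented for the example.

```c
typedef int (*api_fn)(int);

/* Stand-in for an API exported by some DLL. */
static int real_double(int x) { return 2 * x; }

/* Stand-in for the import table: callers always go through this slot. */
static api_fn import_table[1] = { real_double };

static api_fn original_entry;     /* saved pointer to the real function */
static unsigned long hook_calls;  /* how many calls went through the hook */

/* Replacement entry: count the call, then forward to the original. */
static int hooked_double(int x)
{
    hook_calls++;
    return original_entry(x);
}

/* Patch the slot once; later calls through import_table[0] hit the hook. */
void install_hook(void)
{
    if (import_table[0] != hooked_double) {
        original_entry = import_table[0];
        import_table[0] = hooked_double;
    }
}

/* What a call site compiled against the import table effectively does. */
int call_imported(int x)
{
    return import_table[0](x);
}

/* Number of calls observed by the hook so far. */
unsigned long hooked_call_count(void)
{
    return hook_calls;
}
```

Calls that do not go through the table, such as functions inside the exe or statically linked code, are never seen by this kind of hook, which matches the limitation described above.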


Pankaj Garg
This posting is provided “AS IS” with no warranties and confers no rights.
