Why adding one extra KdPrint can increase the performance of an user application

Hi,

I have noticed that adding one extra KdPrint() in my Windows server 2012 driver can at least boost my user apps performance about 42%. I thought it should be other way around.

Basically I use __rdtsc() before and after the apps sending the ioctl to the driver and calculate the delta of both __rdtsc(), then summarize the deltas for all ioctl sent to the and derive the average time spent by the ioctl by dividing the total number of the ioctls sent.

The average time spent for DeviceIoControl() is 231174 ticks. After adding the KdPrint() to the driver, the average time is 133702.

What is the cause?

Thanks,

Ying

__rdtsc readings may not be consistent across different processors (depends on the model). KdPrint may cause hidden CPU switch which makes the reading differ by the TSC skew.

Use Performance Counter instead.

Alex Grig wrote:

__rdtsc readings may not be consistent across different
processors (depends on the model).

Invariant TSC has been around since the Nehalem days.

RDTSC is subject to out of order excecution, it needs to be preceded with a
serializing instruction. Use KeQueryPerformanceCounter instead.

//Daniel

>Use KeQueryPerformanceCounter instead.

Sorry, in usermode, use QueryPerformanceCounter instead.

//Daniel

See the Intel white paper on using RDTSC and RDTSCP

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf

On the system I use, KeQueryPerformanceCounter seems to have a frequency of cpu clock speed / 1024, which is kind of slow for high speed measurements.

Jan

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@resplendence.com
Sent: Wednesday, March 9, 2016 10:58 PM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] Why adding one extra KdPrint can increase the performance of an user application

RDTSC is subject to out of order excecution, it needs to be preceded with a serializing instruction. Use KeQueryPerformanceCounter instead.

//Daniel

Jan Bottorff wrote:

On the system I use, KeQueryPerformanceCounter seems to
have a frequency of cpu clock speed / 1024, which is kind of
slow for high speed measurements.

Not to mention if its hitting the HPET that’s going to be very expensive as compared to just reading the TSC.

If OP is worried about out of order execution then just issue a cpuid before it. cpuid is serializing.

xxxxx@resplendence.com wrote:

RDTSC is subject to out of order excecution, it needs to be preceded with a
serializing instruction. Use KeQueryPerformanceCounter instead.

Yes, but that produces differences of a few cycles, not the kind of
deltas the original poster is experiencing.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

IIRC these warnings only ever applied to specific AMD chipsets that are long obsolete.

Timing differences caused by the overhead of logging is always a problem regardless of architecture or the choice of a timestamp source. This is simply a consequence of parallelism and any analysis of logs must account for the fact that events that are near to one another have occurred in arbitrary order rather than in the order that they appear in the log

Sent from Mailhttps: for Windows 10

From: xxxxx@gmail.commailto:xxxxx
Sent: March 10, 2016 12:01 AM
To: Windows System Software Devs Interest Listmailto:xxxxx
Subject: RE:[ntdev] Why adding one extra KdPrint can increase the performance of an user application

Alex Grig wrote:

> __rdtsc readings may not be consistent across different
> processors (depends on the model).

Invariant TSC has been around since the Nehalem days.


NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:</http:></http:></http:></mailto:xxxxx></mailto:xxxxx></https:>

>IIRC these warnings only ever applied to specific AMD chipsets that are long obsolete.

That’s wrong, read the article that Jan posted.

Yes, but that produces differences of a few cycles, not the kind of deltas the original poster is experiencing.

Instruction pipelines have a lot of stages nowadays, I think those can add up to hundreds of cycles. It’s one of the many problems that you encounter when using a RDTSC instruction. I suppose we have to do things right before starting to explain where the differences come from.

Not to mention if its hitting the HPET that’s going to be very expensive as compared to just reading the TSC.

Do you have any information that backs this up ? I am reading on this and getting mixed messages. I cannot believe that reading that register can be as expensive as a RDTSC instruction plus an entire pipeline flush.

//Daniel

Daniel Terhell wrote:

Do you have any information that backs this up ? I am reading on
this and getting mixed messages.

Test it yourself and see. Two ways:

  1. Force the clocksource to “hpet” on Linux and see how many times you can call gettimeofday() in one microsecond. Now do the same for clocksource “tsc”.

  2. Find a machine where QueryPerformanceFrequency gives 14.318mhz. Then examine the cost of calling QueryPerformanceCounter().

I am so glad that you have contradicted me flatly. It does make me feel so much better than when I was worried about whether this part of the problem was relevant to the actual issue of performance measuring via logs on MP systems in a preemtive OS like Windows.

It also does sort of imply that I have only read about this topic, rather than having actual experience. Again this makes me feel better because now that I know I don?t know anything about it, I hardly need to worry about attempting to help the OP: After all, we never want to help the OP when we respond.

For completeness, except for the systemically asymmetric TSC counts in various AMD based multiprocessor systems of the 2002 ? 2006 vintage, the expected direcepencies in TSC values are at least an order of magnitude below the overhead of the logging itself. Even so called light weight logging methods must either be ?lossy? or touch heap or pool memory and thereby disrupt timings, lock patterns and other characteristics of normal execution. Somewhere in the mix thread sync must come into play too as at least the EOF location must be protected to avoid overlapping writes to the same bytes. Fortunately for all of us poor slobs who must engineer working software of every kind, most often none of these problems actually discount the utility of the logs.

Sent from Mailhttps: for Windows 10i
From: xxxxx@resplendence.commailto:xxxxx
Sent: March 10, 2016 10:50 PM
To: Windows System Software Devs Interest Listmailto:xxxxx
Subject: Re:[ntdev] Why adding one extra KdPrint can increase the performance of an user application

>IIRC these warnings only ever applied to specific AMD chipsets that are long obsolete.

That?s wrong, read the article that Jan posted.

>Yes, but that produces differences of a few cycles, not the kind of deltas the original poster is experiencing.

Instruction pipelines have a lot of stages nowadays, I think those can add up to hundreds of cycles. It?s one of the many problems that you encounter when using a RDTSC instruction. I suppose we have to do things right before starting to explain where the differences come from.

>Not to mention if its hitting the HPET that’s going to be very expensive as compared to just reading the TSC.

Do you have any information that backs this up ? I am reading on this and getting mixed messages. I cannot believe that reading that register can be as expensive as a RDTSC instruction plus an entire pipeline flush.

//Daniel


NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:</http:></http:></http:></mailto:xxxxx></mailto:xxxxx></https:>

>After all, we never want to help the OP when we respond.

You are speaking for yourself only, not on my behalf.

//Daniel