I have noticed that adding one extra KdPrint() in my Windows server 2012 driver can at least boost my user apps performance about 42%. I thought it should be other way around.
Basically I use __rdtsc() before and after the apps sending the ioctl to the driver and calculate the delta of both __rdtsc(), then summarize the deltas for all ioctl sent to the and derive the average time spent by the ioctl by dividing the total number of the ioctls sent.
The average time spent for DeviceIoControl() is 231174 ticks. After adding the KdPrint() to the driver, the average time is 133702.
__rdtsc readings may not be consistent across different processors (depends on the model). KdPrint may cause hidden CPU switch which makes the reading differ by the TSC skew.
On the system I use, KeQueryPerformanceCounter seems to have a frequency of cpu clock speed / 1024, which is kind of slow for high speed measurements.
Jan
-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@resplendence.com
Sent: Wednesday, March 9, 2016 10:58 PM
To: Windows System Software Devs Interest List Subject: Re:[ntdev] Why adding one extra KdPrint can increase the performance of an user application
RDTSC is subject to out of order excecution, it needs to be preceded with a serializing instruction. Use KeQueryPerformanceCounter instead.
IIRC these warnings only ever applied to specific AMD chipsets that are long obsolete.
Timing differences caused by the overhead of logging is always a problem regardless of architecture or the choice of a timestamp source. This is simply a consequence of parallelism and any analysis of logs must account for the fact that events that are near to one another have occurred in arbitrary order rather than in the order that they appear in the log
From: xxxxx@gmail.commailto:xxxxx Sent: March 10, 2016 12:01 AM To: Windows System Software Devs Interest Listmailto:xxxxx Subject: RE:[ntdev] Why adding one extra KdPrint can increase the performance of an user application
Alex Grig wrote:
> __rdtsc readings may not be consistent across different > processors (depends on the model).
Invariant TSC has been around since the Nehalem days.
>IIRC these warnings only ever applied to specific AMD chipsets that are long obsolete.
That’s wrong, read the article that Jan posted.
Yes, but that produces differences of a few cycles, not the kind of deltas the original poster is experiencing.
Instruction pipelines have a lot of stages nowadays, I think those can add up to hundreds of cycles. It’s one of the many problems that you encounter when using a RDTSC instruction. I suppose we have to do things right before starting to explain where the differences come from.
Not to mention if its hitting the HPET that’s going to be very expensive as compared to just reading the TSC.
Do you have any information that backs this up ? I am reading on this and getting mixed messages. I cannot believe that reading that register can be as expensive as a RDTSC instruction plus an entire pipeline flush.
I am so glad that you have contradicted me flatly. It does make me feel so much better than when I was worried about whether this part of the problem was relevant to the actual issue of performance measuring via logs on MP systems in a preemtive OS like Windows.
It also does sort of imply that I have only read about this topic, rather than having actual experience. Again this makes me feel better because now that I know I don?t know anything about it, I hardly need to worry about attempting to help the OP: After all, we never want to help the OP when we respond.
For completeness, except for the systemically asymmetric TSC counts in various AMD based multiprocessor systems of the 2002 ? 2006 vintage, the expected direcepencies in TSC values are at least an order of magnitude below the overhead of the logging itself. Even so called light weight logging methods must either be ?lossy? or touch heap or pool memory and thereby disrupt timings, lock patterns and other characteristics of normal execution. Somewhere in the mix thread sync must come into play too as at least the EOF location must be protected to avoid overlapping writes to the same bytes. Fortunately for all of us poor slobs who must engineer working software of every kind, most often none of these problems actually discount the utility of the logs.
Sent from Mailhttps: for Windows 10i From: xxxxx@resplendence.commailto:xxxxx Sent: March 10, 2016 10:50 PM To: Windows System Software Devs Interest Listmailto:xxxxx Subject: Re:[ntdev] Why adding one extra KdPrint can increase the performance of an user application
>IIRC these warnings only ever applied to specific AMD chipsets that are long obsolete.
That?s wrong, read the article that Jan posted.
>Yes, but that produces differences of a few cycles, not the kind of deltas the original poster is experiencing.
Instruction pipelines have a lot of stages nowadays, I think those can add up to hundreds of cycles. It?s one of the many problems that you encounter when using a RDTSC instruction. I suppose we have to do things right before starting to explain where the differences come from.
>Not to mention if its hitting the HPET that’s going to be very expensive as compared to just reading the TSC.
Do you have any information that backs this up ? I am reading on this and getting mixed messages. I cannot believe that reading that register can be as expensive as a RDTSC instruction plus an entire pipeline flush.