Note that the RDTSC instruction is subject to prefetch and speculative
execution. The usual advice is to place a “serializing” instruction
ahead of it to drain the pipeline, so that all the instructions issued
before entering your measured block have completed, and likewise that all
the instructions in your loop have completed before the counter is read again.
The hazards of KeQueryPerformanceCounter/QueryPerformanceCounter/RDTSC
with regard to thread swapping have been pointed out.
On the 32-bit architectures, there are several serializing instructions in
kernel mode, but there is only one user-mode serializing
instruction: CPUID (see the sketch below). I don’t know if the 64-bit
architecture’s MFENCE is fully-serializing or only partially-serializing,
and I’m not in a position right now to check it (I’m killing time while
waiting for my next appointment, at 4pm, sitting at a cafeteria table and
using my iPad).
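For illustration, here is a minimal user-mode sketch of a CPUID-serialized
TSC read, using the MSVC intrinsics __cpuid and __rdtsc (it assumes an
x86/x64 target and the MSVC toolchain; newer CPUs also expose __rdtscp,
which partially serializes the trailing read):

#include <intrin.h>     // __cpuid, __rdtsc

// Sample the TSC with a serializing CPUID in front of it, so instructions
// issued before the measured block have retired before the counter is read.
static unsigned __int64 SerializedRdtsc(void)
{
    int regs[4];
    __cpuid(regs, 0);   // CPUID is a user-mode serializing instruction
    return __rdtsc();
}

// Usage (names are illustrative):
//   unsigned __int64 start = SerializedRdtsc();
//   ... measured block ...
//   unsigned __int64 cycles = SerializedRdtsc() - start;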
Several things to note:
You have not specified if the kernel-mode pages are locked down.
You are not moving a gigabyte. You are moving a megabyte 1,024 times.
The impact on paging behavior would be quite different. So if you want to
see how long it takes to transfer a gigabyte, use a gigabyte buffer. If
your concern is how long it takes to move a megabyte, then you should try
to measure the cost of moving 1, 2, 4, 8, 16, …, 1,024 MB. Look for
massive discontinuities in the numbers. These will tell you if paging or
context swaps are a significant factor (a sketch of such a sweep follows
this paragraph). Repeat with the user-mode pages locked down. Repeat all
experiments (buffer pageable or locked) at various priorities.
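As an illustration of that sweep, here is a rough user-mode sketch. It
assumes a 64-bit build (so two 1 GB buffers fit in the address space) and
uses QueryPerformanceCounter for the timing; all names are illustrative:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t oneMB = 1024 * 1024;
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    char *src = malloc(1024 * oneMB);
    char *dst = malloc(1024 * oneMB);
    if (src == NULL || dst == NULL)
        return 1;
    memset(src, 0xA5, 1024 * oneMB);   // touch every source page once

    // Copy 1, 2, 4, ..., 1024 MB and report the rate for each size;
    // large discontinuities point at paging or cache effects.
    for (size_t mb = 1; mb <= 1024; mb *= 2) {
        QueryPerformanceCounter(&t0);
        memcpy(dst, src, mb * oneMB);
        QueryPerformanceCounter(&t1);

        double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        printf("%4zu MB: %9.3f ms  (%.1f MB/s)\n",
               mb, seconds * 1000.0, (double)mb / seconds);
    }

    free(src);
    free(dst);
    return 0;
}

Run it with the buffers pageable, then again with them locked down, and
again at different priorities, and compare the shapes of the curves.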
I could spend several days constructing and running experiments. But at
the end, this could save me several weeks of attempting to solve the wrong
problem.
If you are not actually reading from the device, or writing to the device,
be aware that bus cycles to memory on a device are substantially slower
than RAM-to-RAM copies.
You have not reported any test in user mode where you have used
VirtualLock to lock the pages into memory.
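A minimal sketch of what that test setup might look like (the working-set
numbers are arbitrary slack, not recommendations; VirtualLock fails unless
the process working-set quota is large enough to cover the request):

#include <windows.h>
#include <stdio.h>

// Try to pin 'buffer' into physical memory before running the timing loop.
static BOOL PinBuffer(void *buffer, SIZE_T size)
{
    // Raise the working-set quota first; the couple of megabytes of slack
    // here is an arbitrary choice.
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  size + 1 * 1024 * 1024,
                                  size + 2 * 1024 * 1024)) {
        printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return FALSE;
    }
    if (!VirtualLock(buffer, size)) {
        printf("VirtualLock failed: %lu\n", GetLastError());
        return FALSE;
    }
    return TRUE;
}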
While “wall clock” time is, as this post explains, a valid measure, even
allowing for paging and context swaps your numbers are too different from
each other to make sense.
By not maintaining the mean and standard deviation of the numbers you get,
you have no idea what they are telling you. Any good statistics text
should give you the formula for mean and variance that requires only three
variables.
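The three variables are the sample count, the running sum, and the running
sum of squares. A minimal sketch (the naive sum-of-squares form can lose
precision over huge sample counts; Welford’s online algorithm is the more
robust alternative):

#include <math.h>

// Running statistics from three accumulators: n, sum(x), sum(x*x).
typedef struct {
    unsigned long n;
    double sum;
    double sumSquares;
} RUNNING_STATS;

static void StatsAdd(RUNNING_STATS *s, double x)
{
    s->n++;
    s->sum += x;
    s->sumSquares += x * x;
}

static double StatsMean(const RUNNING_STATS *s)
{
    return (s->n != 0) ? s->sum / s->n : 0.0;
}

// Sample variance: (sum(x*x) - sum(x)^2 / n) / (n - 1)
static double StatsStdDev(const RUNNING_STATS *s)
{
    if (s->n < 2)
        return 0.0;
    double variance = (s->sumSquares - (s->sum * s->sum) / s->n) / (s->n - 1);
    return (variance > 0.0) ? sqrt(variance) : 0.0;
}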
The correct way to measure the cost of moving a gigabyte (if you choose to
ignore paging and caching effects) is to measure it in 1 MB chunks. If you
see a delta-T in the tens of milliseconds, that’s a context swap.
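A user-mode sketch of that per-chunk measurement, based on
QueryPerformanceCounter (the 10 ms threshold and all names are
illustrative):

#include <windows.h>
#include <stdio.h>
#include <string.h>

// Copy a gigabyte one megabyte at a time, timing each chunk separately.
// A chunk that takes tens of milliseconds instead of a fraction of a
// millisecond almost certainly caught a context swap or a page-fault burst.
static void TimeChunkedCopy(char *dst, const char *src)
{
    const size_t chunk = 1024 * 1024;
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    for (size_t i = 0; i < 1024; i++) {
        QueryPerformanceCounter(&t0);
        memcpy(dst + i * chunk, src + i * chunk, chunk);
        QueryPerformanceCounter(&t1);

        double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart)
                           / (double)freq.QuadPart;
        if (ms > 10.0)   // threshold is illustrative, not a magic number
            printf("chunk %4zu: %.3f ms  <-- probable context swap\n", i, ms);
    }
}

Feeding each per-chunk time into mean/variance accumulators like the ones
above makes the outliers obvious.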
If context swaps are diagnosed to be the problem, consider running the
thread at a higher priority and repeating the experiments. If priority 15
in user mode doesn’t give adequate numbers, consider using the Vista+
Multimedia Class Scheduler Service to get your user thread as high as 26.
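A sketch of that boost via the MMCSS route (avrt.h, link with Avrt.lib; the
“Pro Audio” task name is just one of the registered classes, pick whatever
fits the experiment):

#include <windows.h>
#include <avrt.h>      // AvSetMmThreadCharacteristics; link with Avrt.lib
#include <stdio.h>

// Push the current thread as high as an ordinary process can: priority 15
// via SetThreadPriority, then higher by registering with the Multimedia
// Class Scheduler Service.
static void BoostMeasurementThread(void)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    DWORD taskIndex = 0;
    HANDLE mmcss = AvSetMmThreadCharacteristicsW(L"Pro Audio", &taskIndex);
    if (mmcss == NULL) {
        printf("AvSetMmThreadCharacteristics failed: %lu\n", GetLastError());
        return;
    }
    AvSetMmThreadPriority(mmcss, AVRT_PRIORITY_HIGH);

    // ... run the timing loops here ...

    AvRevertMmThreadCharacteristics(mmcss);
}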
One experiment whose output is one number tells you precisely nothing.
An experiment that compares user mode and kernel mode, when the
measurement has not been proven reliable or meaningful for either one,
tells you nothing. You’re not comparing apples to oranges; you’re
comparing applesauce to orange juice, and without any prior knowledge of
either apples or oranges, attempting to reverse-engineer the two original
fruits from their products.
You have to ask yourself:
What do I want to measure?
Will this experiment measure it?
What artifacts of this experiment could be affecting what I’m measuring?
Are the numbers I’m getting trustworthy?
Do they actually measure what I set out to measure?
How can I perform this experiment allowing more or fewer degrees of
freedom to outside phenomena?
What is the signal-to-noise ratio of my measurements?
Well, that’s just a start. For example, if you understand the caching
behavior, and can game it properly, you can get a factor of 5-20
performance improvement. If you understand the paging behavior and can
game it, you can see improvements of 10,000-1,000,000 without breaking
into a sweat. But if your original numbers are meaningless, you could end
up directing your effort to solving the wrong problem.
xxxxx@gmail.com wrote:
> In kmode, I test writing speed like this:
Most of the comments you’ve received in this thread miss the point.
You’re not trying to compute cycle-accurate timings. You’re trying to
get order-of-magnitude timings. For that purpose, it’s completely
fine to ignore context switches and interrupts. You just run your test
several times and do some wild-point analysis to throw out the obviously
bogus numbers.
Do the numbers make sense? You are writing 1GB in both cases. If your
timings are correct, that means your kernel loop took 1/5 of a second,
and your user-mode loop took 35 seconds. Does that agree with what you
saw? If not, then you have a math problem, not a performance problem.
Are you sure you are using the correct units? KeQuerySystemTime returns
time in 100 nanosecond units. clock() returns a number in 1 millisecond
units. Those are very different units.
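To compare the two at all, both deltas have to be converted to a common
unit first; something along these lines (names are illustrative):

#include <time.h>

// A KeQuerySystemTime delta is in 100-nanosecond units:
// 10,000,000 of them per second.
static double KernelDeltaToSeconds(long long delta100ns)
{
    return (double)delta100ns / 10000000.0;
}

// A clock() delta is in CLOCKS_PER_SEC units (1000 per second on Windows,
// i.e. one tick per millisecond).
static double ClockDeltaToSeconds(clock_t delta)
{
    return (double)delta / (double)CLOCKS_PER_SEC;
}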
However, there are other issues. KeQuerySystemTime measures elapsed
wall-clock time. That includes interrupts and time spent in other
processes. clock() attempts to measure actual processor time used.
That’s a very different thing to measure.
If I were you, I would add
#include <intrin.h>
and use the __rdtsc() function in both cases to capture elapsed time.
You’ll have to know the clock speed of your hardware to convert that to
seconds. If you don’t like that, you can use KeQueryPerformanceCounter
in kernel mode and QueryPerformanceCounter in user mode. Those use the
same time source, and you can get the frequency from
QueryPerformanceFrequency.
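A sketch of the __rdtsc route, which compiles the same way into either a
driver or a user-mode test; divide the cycle count by the CPU’s nominal
clock rate to get seconds (the function name is illustrative):

#include <intrin.h>    // __rdtsc
#include <string.h>

// Elapsed TSC cycles for one copy of 'bytes' bytes.
static unsigned __int64 TimeCopyCycles(void *dst, const void *src, size_t bytes)
{
    unsigned __int64 start = __rdtsc();
    memcpy(dst, src, bytes);
    return __rdtsc() - start;
}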
–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.