memory operation speed

Hi,
I am working on a Windows device driver which just reads and writes local memory (BIOS-reserved for the device).
Here is what is happening:
1. I map the local memory into kernel space with MmMapIoSpace (cached) and test its speed as follows:
writing: 6 GBytes per second
reading: 5 GBytes per second
2. I map the local memory into user space with MmMapLockedPagesSpecifyCache (cached) and test its speed:
writing: 30 MBytes per second
reading: 33 MBytes per second
As you can see, the read/write speed in user space is too slow to use. I don't know why there is such a big difference between user space and kernel space.

Can somebody point out how to improve performance in user space?

Thanks,
-Ezword

I'm a little suspicious about caching. If the data in the memory is
changed by the device, the cached version will not be updated and will be
stale.

How are you measuring these rates? It is important to know how you get
these numbers.

Generally, mapping kernel memory to user space is risky unless you handle
all the corner cases. Check the archives of this NG to see all the issues
of app termination, for example.
joe


Is the test loop the same for both kmode and umode?


Among other factors. I don’t see how the two numbers can differ by that
much unless the measurement technique is flawed. And, over the years,
I’ve seen about every possible mistake that can be made in doing
measurements (one of the advantages of a good undergraduate physics course
is that it teaches you a lot about experimental design and interpreting
the results).
joe

> Is the test loop the same for both kmode and umode?


In kmode, I test writing speed like this:

KeQuerySystemTime(&starttime);
i = 0;
while (i < 1024)
{
    RtlFillMemory(pKernelAddr, 1024*1024, 0xa5);
    i++;
}
KeQuerySystemTime(&endtime);

In umode, like this:

i = 0;
start = clock();
while (i < 1024)
{
    memset(pUserAddr, 0xAA, 1024*1024);
    i++;
}
end = clock();

Is there anything wrong?

> Is there anything wrong?

Yes there is: you should not use KeQuerySystemTime. Also, I don't know where
the clock function comes from. If the whole loop completes within one clock
tick (which may take anywhere between 0.5 ms and 15.6 ms), you will have
the same system time reported at the beginning and at the end of the loop,
giving the illusion of having consumed 0 ms.

Use (Ke)QueryPerformanceCounter. And make sure you execute the entire test
on a single CPU by using SetThreadAffinityMask, in case the time stamp
counters are not synchronized between CPUs.
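For the user-mode side, a minimal sketch of what I mean (the 1024 x 1 MB loop and pUserAddr are from your post; the rest is my assumption):

#include <windows.h>
#include <stdio.h>
#include <string.h>

/* Sketch: time the user-mode fill with QueryPerformanceCounter,
   pinned to one CPU so the counter reads stay consistent. */
void TimeUserFill(void *pUserAddr)
{
    LARGE_INTEGER freq, t0, t1;
    int i;

    SetThreadAffinityMask(GetCurrentThread(), 1);   /* stay on CPU 0 */

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    for (i = 0; i < 1024; i++)
        memset(pUserAddr, 0xAA, 1024 * 1024);
    QueryPerformanceCounter(&t1);

    /* 1024 MB total; MB/s = MB * frequency / elapsed ticks */
    printf("%.1f MB/s\n",
           1024.0 * freq.QuadPart / (double)(t1.QuadPart - t0.QuadPart));
}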

BTW, at what IRQL are you executing the kernel test, and at what thread
priority the user-mode test? Preemption during execution is a factor that
may further influence the outcome.

//Daniel

> Is there anything wrong?

Where is the guarantee that your operations are atomic??? Look - a context switch that may occur at any moment, at any place in your code, is going to invalidate your results completely, don't you think? As long as you are in kernel mode you can alleviate it by raising IRQL, effectively disabling switches (or, even better, by disabling interrupts on the CPU in order to get more precise results that are less likely to be skewed). However, there is absolutely nothing that you can do about it in userland. Furthermore, a 100% reliable solution for KM does not exist either, at least on x86-based platforms, because you cannot do anything about an SMI, which may take hundreds of milliseconds to get served and is complete 'terra incognita' to the OS software.

I think you should learn a bit of OS theory before you even attempt to write drivers…

Anton Bassov

Of course, a context switch may occur at any moment. Disabling interrupts or raising IRQL may reduce context switches, but I think a context switch would only make the measured speed slower, and I do not care about that.

The point is: why do the two numbers differ by that much, and what can I do to increase the speed in umode?

I am new to Windows drivers, and am just beginning to study and write them.

Thanks a lot.

Ezword

> The point is: why do the two numbers differ by that much, and what can I
> do to increase the speed in umode?

That's because the numbers that you obtained are completely bogus, due to
relying on the system clock, which only gets updated on every clock interrupt.

//Daniel

> Of course, a context switch may occur at any moment. Disabling interrupts
> or raising IRQL may reduce context switches, but I think a context switch
> would only make the measured speed slower, and I do not care about that.

No comments…

> The point is: why do the two numbers differ by that much, and what can I
> do to increase the speed in umode?

Actually, the point is that you just don't want to listen to explanations of why your test results are not indicative of anything at all. Keep on going this way…

Anton Bassov

> That's because the numbers that you obtained are completely bogus, due to
> relying on the system clock, which only gets updated on every clock interrupt.

Timer resolution is also a serious "invalidation factor" here, but I think that unreliability due solely to this factor is just funny compared to the skew that the OP may get from scheduler-related issues. If a thread gets kicked off the CPU, it may potentially have to wait quite a few ticks before it has a chance to run again - it depends on the system load and on the priority of his thread versus those of other threads…

Anton Bassov

Suppose he runs his test with all interrupts disabled, as you suggested. That
means the clock interrupt never occurs, so the clock interrupt never updates
the system time, so all his measurements would ALWAYS be exactly 0.00 ms,
right? That means his MB/elapsed calculation becomes a division by zero, or
rises to +INF. Also, executing at HIGH_LEVEL for too long, as in the loop he
is doing, will cause his system to crash on an MP system.

Raising to HIGH_LEVEL is not a reasonable thing to suggest, because he cannot
do it in production code. He might just want to measure throughput in a
real-life rather than a sterile situation.

//Daniel


> Suppose he runs his test with all interrupts disabled, as you suggested.
> That means the clock interrupt never occurs, so the clock interrupt never
> updates the system time, so all his measurements would ALWAYS be exactly
> 0.00 ms.

True…

In order to get some more or less reliable results, the OP needs RDTSC and, apparently, CLI…
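In kernel mode that might look roughly like this (a sketch only, using the documented _disable/_enable and __rdtsc compiler intrinsics; pKernelAddr is the mapping from the original post):

#include <intrin.h>

/* Sketch: interrupts off on this CPU, RDTSC around one short fill.
   Keep the timed region short - spinning with interrupts masked for
   long will destabilize or crash the system. */
unsigned __int64 t0, t1;

_disable();                                      /* CLI on this CPU */
t0 = __rdtsc();
RtlFillMemory(pKernelAddr, 1024 * 1024, 0xA5);   /* one 1 MB fill   */
t1 = __rdtsc();
_enable();                                       /* STI             */

/* (t1 - t0) is cycles for 1 MB; divide by the core clock in Hz for seconds. */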

> Raising to HIGH_LEVEL is not a reasonable thing to suggest, because he
> cannot do it in production code.

Well, I guess the whole discussion applies strictly to a very specific set of tools that are not meant to run in a production environment in the first place…

> He might just want to measure throughput in a real-life rather than a
> sterile situation.

Well, the beauty of a testing environment is that, assuming you have the right tools, you can simulate any workload. Concerning "real life"… well, we have to accept the inevitable fact that, in some cases, a simulator may be the only thing available to us throughout the entire development process (OK, I'll stop now - otherwise I am bound to go miles off-topic the way I usually do)…

Anton Bassov

> In kmode, I test writing speed like this:
>
> KeQuerySystemTime(&starttime);
> i = 0;
> while (i < 1024)
> {
>     RtlFillMemory(pKernelAddr, 1024*1024, 0xa5);
>     i++;
> }
> KeQuerySystemTime(&endtime);
>
> In umode, like this:
>
> i = 0;
> start = clock();
> while (i < 1024)
> {
>     memset(pUserAddr, 0xAA, 1024*1024);
>     i++;
> }
> end = clock();
>
> Is there anything wrong?

Yes. You are totally clueless. clock() is the first and worst mistake
most programmers make. Its usage means your numbers are meaningless,
which is what I suspected when I saw the first post.

You have done two experiments, used two different measurement techniques,
and have arrived at an irrelevant conclusion…

At the very least, you should have used QueryPerformanceCounter. You have
also done essentially one experiment, and have NO idea of its reliability.
What is the standard deviation across fifty or a thousand experiments? As
I suspected, your numbers are nonsense. I'll wager you have not
subtracted out timer overheads or accounted for discrete clock events (aka
"gating error").



> Of course, a context switch may occur at any moment. Disabling interrupts
> or raising IRQL may reduce context switches, but I think a context switch
> would only make the measured speed slower, and I do not care about that.
>
> The point is: why do the two numbers differ by that much, and what can I
> do to increase the speed in umode?
>
> I am new to Windows drivers, and am just beginning to study and write
> them.
>
> Thanks a lot.
>
> Ezword

You are also new to performance measurement. You did two uncalibrated
experiments which are unrelated to each other, and have somehow reached
the conclusion that the numbers are correlated and meaningful. They are
not. This has nothing to do with drivers, and everything to do with
experimental design and meaningful interpretation of the results.
joe



xxxxx@gmail.com wrote:

> In kmode, I test writing speed like this:

Most of the comments you've received in this thread miss the point.
You're not trying to compute cycle-accurate timings. You're trying to
compute order-of-magnitude timings. For that purpose, it's completely
fine to ignore context switches and interrupts. You just run your test
several times and do some wild-point analysis to throw out the obviously
bogus numbers.

Do the numbers make sense? You are writing 1GB in both cases. If your
timings are correct, that means your kernel loop took 1/5 of a second,
and your user-mode loop took 35 seconds. Does that agree with what you
saw? If not, then you have a math problem, not a performance problem.

Are you sure you are using the correct units? KeQuerySystemTime returns
time in 100 nanosecond units. clock() returns a number in 1 millisecond
units. Those are very different units.

However, there are other issues. KeQuerySystemTime measures elapsed
wall-clock time. That includes interrupts and time spent in other
processes. clock() attempts to measure actual processor time used.
That’s a very different thing to measure.

If I were you, I would add
#include <intrin.h>
and use the __rdtsc() function in both cases to capture elapsed time.
You’ll have to know the clock speed of your hardware to convert that to
seconds. If you don’t like that, you can use KeQueryPerformanceCounter
in kernel-mode and QueryPerformanceCounter in user-mode. Those use the
same time source, and you can get the frequency from
QueryPerformanceFrequency.
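A minimal sketch of the RDTSC variant (the fill loop and pointer come from your post; note the result is in cycles until you divide by your core clock):

#include <intrin.h>

unsigned __int64 t0, t1;
int i;

t0 = __rdtsc();
for (i = 0; i < 1024; i++)
    RtlFillMemory(pKernelAddr, 1024 * 1024, 0xA5);   /* memset() in umode */
t1 = __rdtsc();

/* bytes/second = total bytes * core clock (Hz) / elapsed cycles, e.g. at
   3 GHz: (1024.0 * 1024 * 1024) * 3.0e9 / (double)(t1 - t0)             */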


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@gmail.com wrote:

> In kmode, I test writing speed like this:
>
> RtlFillMemory(pKernelAddr, 1024*1024, 0xa5);
>
> memset(pUserAddr, 0xAA, 1024*1024);
>
> Is there anything wrong?

Also, you are assuming here that RtlFillMemory and memset use the same
algorithm. Now, as it turns out, in this case you got lucky.
RtlFillMemory in the WDK is just a macro that expands to “memset”, so
you’ll get the same function.

However, there is no guarantee that “memset” is optimal. To do a fair
comparison of performance, you’d want to make sure you were writing a
dword at a time. “memset” does not promise to do that.
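A trivial explicit loop like the following takes the library's choices out of the picture (a sketch; FillDwords is my name for it, and it assumes the size is a multiple of 4):

#include <stddef.h>

/* Fill one 32-bit dword at a time, so the kernel and user tests execute
   exactly the same stores; volatile keeps the compiler from quietly
   substituting memset or vectorizing the loop away. */
void FillDwords(void *p, size_t bytes, unsigned long pattern)
{
    volatile unsigned long *d = (volatile unsigned long *)p;
    size_t i, n = bytes / sizeof(unsigned long);

    for (i = 0; i < n; i++)
        d[i] = pattern;
}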


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Note that the RDTSC instruction is subject to prefetch and speculative
execution. The advice is to always place a "serializing" instruction
ahead of it to block the pipeline, so that all the instructions prior to
entering your measured block are completed, and all the instructions in
your loop have completed before it is read again.
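With MSVC this is usually done with the CPUID intrinsic, roughly like so (a sketch; __cpuid and __rdtsc are the documented compiler intrinsics):

#include <intrin.h>

/* Serialize, then read the TSC: CPUID drains the pipeline, so every
   instruction issued before this point has retired when RDTSC executes. */
unsigned __int64 SerializedRdtsc(void)
{
    int regs[4];
    __cpuid(regs, 0);    /* serializing instruction */
    return __rdtsc();
}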

The hazards of KeQueryPerformanceCounter/QueryPerformanceCounter/RDTSC
with regard to thread swapping have been pointed out.

On the 32-bit architectures there are several serializing instructions in
kernel mode, but there is only one user-mode serializing instruction:
CPUID. I don't know whether the 64-bit architecture's MFENCE is
fully-serializing or only partially-serializing, and I'm not in a
position right now to check it (I'm killing time while waiting for my next
appointment, at 4pm, sitting at a cafeteria table and using my iPad).

Several things to note:

You have not specified if the kernel-mode pages are locked down.

You are not moving a gigabyte. You are moving a megabyte 1,024 times.
The impact on paging behavior would be quite different. So if you want to
see how long it takes to transfer a gigabyte, use a gigabyte buffer. If
your concern is how fast it takes to move a megabyte, then you should try
to measure the cost of moving 1, 2, 4, 8, 16, …, 1024 MB (see the sketch
below). Look for massive discontinuities in the numbers. These will tell
you if paging or context swaps are a significant factor. Repeat with the
user-mode pages locked down. Repeat all experiments (buffer pageable or
locked) at various priorities.
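A sketch of that sweep (Fill() and Now() are hypothetical stand-ins for whatever fill routine and timer you settle on):

/* Hypothetical harness: time one fill at each size, doubling from
   1 MB to 1 GB, and look for jumps between adjacent sizes. */
size_t mb;
for (mb = 1; mb <= 1024; mb *= 2) {
    double t0 = Now();                   /* your chosen timer, seconds */
    Fill(pAddr, mb * 1024 * 1024);       /* your chosen fill routine   */
    double t1 = Now();
    printf("%4lu MB: %8.3f ms\n", (unsigned long)mb, (t1 - t0) * 1000.0);
}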

I could spend several days constructing and running experiments. But at
the end, this could save me several weeks of attempting to solve the wrong
problem.

If you are not actually reading from the device, or writing to the device,
be aware that bus cycles to memory on a device are substantially slower
than RAM-to-RAM copies.

You have not reported any test in user mode where you have used
VirtualLock to lock the pages into memory.

While "wall clock" time is, as this post explains, a valid measure,
even with paging and context swaps the numbers are too different to make
sense.

By not maintaining the mean and standard deviation of the numbers you get,
you have no idea what they are telling you. Any good statistics text
should give you the formula for mean and variance that requires only three
variables; see the sketch below.
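Those three variables are the sample count, the running mean, and the running sum of squared deviations; a sketch (this is Welford's method, and AddSample is my name for it):

#include <math.h>

/* Running mean/variance with three state variables. */
static unsigned long n = 0;
static double mean = 0.0, m2 = 0.0;

void AddSample(double x)
{
    double delta = x - mean;

    n++;
    mean += delta / n;
    m2   += delta * (x - mean);
    /* once n > 1: variance = m2 / (n - 1), std dev = sqrt(variance) */
}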

The correct way to measure the cost of moving a gigabyte (if you choose to
ignore paging and caching effects) is to measure it in 1 MB chunks. If you
see a delta-T in the tens of milliseconds, that's a context swap.

If context swaps are diagnosed to be the problem, consider running the
thread at a higher priority and repeating the experiments. If priority 15
in user mode doesn't give adequate numbers, consider using the Vista+
Multimedia Class Scheduler Service to get your user thread as high as 26.
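From user mode that service is reached through AvSetMmThreadCharacteristics in avrt.h; a sketch, assuming the standard "Pro Audio" task name:

#include <windows.h>
#include <avrt.h>    /* link with avrt.lib */

/* Sketch: ask MMCSS to boost the current thread for the duration of
   the timed loop, then revert. */
DWORD  taskIndex = 0;
HANDLE hTask = AvSetMmThreadCharacteristicsW(L"Pro Audio", &taskIndex);

/* ... run the timed loop here ... */

if (hTask != NULL)
    AvRevertMmThreadCharacteristics(hTask);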

One experiment whose output is one number tells you precisely nothing.
An experiment that compares um and km which has not been proven reliable
or meaningful for each tells you nothing. You’re not comparing apples to
oranges; you’re comparing applesauce to orange juice, and without any
prior knowledge of either apples or oranges, attempting to
reverse-engineer the two original fruits from their products.

You have to ask yourself:
What do I want to measure?
Will this experiment measure it?
What artifacts of this experiment could be affecting what I'm measuring?
Are the numbers I'm getting trustworthy?
Do they actually measure what I set out to measure?
How can I perform this experiment allowing more or fewer degrees of
freedom to outside phenomena?
What is the signal-to-noise ratio of my measurements?

Well, that’s just a start. For example, if you understand the caching
behavior, and can game it properly, you can get a factor of 5-20
performance improvement. If you understand the paging behavior and can
game it, you can see improvements of 10,000-1,000,000 without breaking
into a sweat. But if your original numbers are meaningless, you could end
up directing your effort to solving the wrong problem.


xxxxx@flounder.com wrote:

> Note that the RDTSC instruction is subject to prefetch and speculative
> execution.

But do you understand that this is utterly irrelevant to the gross
problem he’s trying to solve? He wants an order-of-magnitude
comparison. What you’re talking about here makes a difference of a few
cycles. In his case, a few tens of thousands of cycles isn’t going to
make a difference.

Advice is a good thing, but how much time have we wasted pointing out
the accuracy problems inherent in micrometers when someone is just
trying to measure the length of his driveway for an estimate on blacktop?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Check to make sure you’re actually getting a cached mapping in the slow case (using !pte on a few addresses from the mapped region). The cache type you pass to MmMapLockedPagesSpecifyCache/MmMapIoSpace may be ignored if there is already a mapping to those same physical pages with a different caching attribute.
