Several issues.
If this is a pure application, the only way you can suck down enough time to
make the overall system unresponsive is to be running it at a raised
priority. Windows makes no attempt at “fairness”, and the closest it comes
to anti-starvation is the Balance Set Manager, which boosts a starved thread
only after a lag of about 3-4 seconds (read Windows Internals for the gory
details of the computation, which is based on several system timing
parameters). Setting the priority to idle means that your app is now the one
most likely to be starved, increasing its overall completion time.
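For reference, here is what those knobs look like in code (a minimal sketch;
background mode is Vista and later):

#include <windows.h>

// Sketch: the two ways to make a worker thread yield to everyone else.
// Note that at idle priority this thread is the one that starves, not
// the one that starves others.
void MakeWorkerUnobtrusive()
{
    // Lowest scheduling priority: runs only when nothing else is ready.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_IDLE);

    // Vista and later: also lowers the thread's I/O and memory priority.
    // Returns FALSE on XP, where background mode does not exist.
    SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);
}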
You can’t compare the time of reading file A with the time of reading file B
based solely on file size; you need to know how fragmented each file is and
how many seek operations were required to read it in.
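If you want real numbers on fragmentation, FSCTL_GET_RETRIEVAL_POINTERS will
give them to you. A minimal sketch (a real version must loop while
DeviceIoControl fails with ERROR_MORE_DATA):

#include <windows.h>
#include <winioctl.h>

// Sketch: count a file's extents. 1 means contiguous; a large number
// means many seeks were needed to read the file.
DWORD CountExtents(const wchar_t* path)
{
    HANDLE h = CreateFileW(path, FILE_READ_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 0;

    STARTING_VCN_INPUT_BUFFER in = {};   // start from the first VCN
    BYTE out[64 * 1024];                 // room for many extents
    DWORD bytes = 0, extents = 0;
    if (DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS, &in, sizeof(in),
                        out, sizeof(out), &bytes, NULL))
        extents = ((RETRIEVAL_POINTERS_BUFFER*)out)->ExtentCount;

    CloseHandle(h);
    return extents;
}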
The most common failure mode is people who try to “optimize the hashing
algorithm” without actually determining whether it IS the bottleneck. This
pointless flailing about in the hope of doing something that might improve
performance is like trying to get downtown from the suburbs by tossing a
coin at every intersection and hoping you eventually arrive.
If you have not gathered detailed performance numbers, you are just wasting
your time. Everything you have done indicates a futile attempt to guess at
what is going on in the total absence of any real performance data.
I did performance measurement for 15 years, and I know one thing: if you ask
someone where the time is going in their program, you will get the wrong
answer! I never found an exception in all those years (including my own
code!).
Without data, you can’t optimize.
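Even crude wall-clock timing of each phase counts as data. A minimal sketch
using QueryPerformanceCounter (HashChunk stands in for whatever your hash
routine is):

#include <windows.h>

// Sketch: accumulate per-phase totals across the whole run, then
// compare them. This alone tells you whether read, hash, or write
// dominates.
struct Stopwatch {
    LARGE_INTEGER freq, t0;
    Stopwatch()  { QueryPerformanceFrequency(&freq); }
    void start() { QueryPerformanceCounter(&t0); }
    double stop()                        // seconds since start()
    {
        LARGE_INTEGER t1;
        QueryPerformanceCounter(&t1);
        return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    }
};

// Usage:
//   sw.start(); ReadFile(...);  tRead  += sw.stop();
//   sw.start(); HashChunk(...); tHash  += sw.stop();
//   sw.start(); WriteFile(...); tWrite += sw.stop();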
Putting sleep calls in is generally silly. That is, the app is too slow, so
let’s artificially make it slower. It may even raise the chances that some
of your pages get paged out, because once the app comes out of its sleep,
its threads are just more threads in the ready queue, and could end up
behind a lot of others. They certainly will if you set the priority low.
My first suspect would be page faults, but you’ve pretty much disproven
that. Note that allocation in the debug version of the runtime library is
VASTLY more expensive than in the release version, and most performance data
you gather under a debug runtime is totally and completely useless for
predicting actual performance. So if you are allocating buffers for each
I/O, it’s going to kill you. If you allocate one buffer, once, and it is
only 4MB, it probably isn’t paging.
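The pattern you want is structural, not clever (a minimal sketch):

#include <windows.h>
#include <vector>

// Sketch: one buffer, allocated once, reused for every read. No
// allocator traffic (debug heap or otherwise) inside the loop.
void ProcessFile(HANDLE hFile)
{
    std::vector<BYTE> buffer(4 * 1024 * 1024);   // allocated exactly once

    DWORD got = 0;
    while (ReadFile(hFile, &buffer[0], (DWORD)buffer.size(), &got, NULL)
           && got > 0)
    {
        // hash/process buffer[0..got) here; no new/delete per pass
    }
}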
If your performance data shows minimal CPU utilization, why did you waste
time redoing hash tables? This is classic time-wasting “optimization”. You
have data that says the computation doesn’t matter, and your response is to
optimize the computation?
The profiling tools (a sampling profiler, as opposed to PGO, which is for
code generation) can tell you how much time you are spending in the kernel
and how much in user space. I think they use the performance counters that
already exist.
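Even without a profiler, GetProcessTimes gives you a coarse split (a minimal
sketch):

#include <windows.h>
#include <stdio.h>

// Sketch: kernel-vs-user CPU time for the current process. The
// FILETIME values here are durations in 100ns units.
void ReportCpuSplit()
{
    FILETIME ftCreate, ftExit, ftKernel, ftUser;
    if (GetProcessTimes(GetCurrentProcess(), &ftCreate, &ftExit,
                        &ftKernel, &ftUser))
    {
        ULARGE_INTEGER k, u;
        k.LowPart = ftKernel.dwLowDateTime; k.HighPart = ftKernel.dwHighDateTime;
        u.LowPart = ftUser.dwLowDateTime;   u.HighPart = ftUser.dwHighDateTime;
        printf("kernel: %.2fs  user: %.2fs\n",
               k.QuadPart / 1e7, u.QuadPart / 1e7);
    }
}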
When I have performance problems, the LAST thing I think about is rewriting
algorithms. The FIRST thing I think about is getting REAL DATA to tell me
where the time is going, so I don’t foolishly waste my time optimizing
things that never mattered at all.
I agree that the kernel should not be driving the system into catatonia.
However, Microsoft has not been known for building the best device drivers;
in an infamous problem some years ago, they shipped one that gratuitously
introduced a 1.5 SECOND dead time into disk I/O (something about ATAPI
protocols), but that was fixed in XP. That doesn’t mean it couldn’t happen
again, particularly if there are third-party drivers involved.
What does diskperf tell you about I/O times? That’s the next thing I’d try.
Also, there are some performance counters for the amount of time spent at
DPC level, which is going to impact the perceived response time. If you see
a high percentage of time at DPC level, it would be worthwhile to figure out
who is doing it.
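If you want to watch that counter programmatically instead of in perfmon,
PDH will do it. A minimal sketch (link with pdh.lib; PdhAddEnglishCounterW
is Vista and later, on XP use PdhAddCounter with the localized counter
name):

#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#pragma comment(lib, "pdh.lib")

// Sketch: sample "% DPC Time" once a second for ten seconds.
void SampleDpcTime()
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PdhOpenQuery(NULL, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% DPC Time",
                          0, &counter);
    PdhCollectQueryData(query);          // baseline sample
    for (int i = 0; i < 10; i++)
    {
        Sleep(1000);
        PdhCollectQueryData(query);
        PDH_FMT_COUNTERVALUE value;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
        printf("%% DPC Time: %5.1f\n", value.doubleValue);
    }
    PdhCloseQuery(query);
}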
joe
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@NeoSmart.net
Sent: Tuesday, February 23, 2010 11:13 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] (Terribly) Degraded System Performance Running User Code
I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a
time (ReadFile), and for each chunk performs a set of mathematical
operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map
and then writes the chunk itself to another file (WriteFile).
That’s all it does. This program will cause certain PCs to choke and die.
The mouse begins to stutter, the task manager takes > 2 min to show,
ctrl+alt+del is unresponsive, running programs are slow… the works.
I’ve done literally everything I can think of to optimize the program, and
have triple-checked all objects.
What I’ve done:
- Tried different (less intensive) hashing algorithms.
- Switched all allocations to nedmalloc instead of the default new operator.
- Switched from std::map to unordered_set, found the performance to still be
abysmal, and switched again to Google’s dense_hash_map.
- Converted all objects to store pointers to objects instead of the objects
themselves.
- Cached all read and write operations: instead of reading a 16k chunk of
the file and performing the math on it, I read 4MB into a buffer and read
16k chunks from there. Same for all write operations; they are coalesced
into 4MB blocks before being written to disk (see the sketch after this
list).
- Ran extensive profiling with Visual Studio 2010, AMD CodeAnalyst, and
perfmon.
- Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN.
- Set the thread priority to THREAD_PRIORITY_IDLE.
- Added a Sleep(100) call after every loop.
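Here is a sketch of that read-coalescing logic (simplified, error handling
elided):

#include <windows.h>
#include <vector>

// One 4MB ReadFile fills the buffer; 16k sub-chunks are then handed
// out from memory.
class CoalescedReader {
    HANDLE            file_;
    std::vector<BYTE> buf_;
    DWORD             valid_;   // bytes currently in buf_
    DWORD             pos_;     // next unread offset within buf_
public:
    explicit CoalescedReader(HANDLE file)
        : file_(file), buf_(4 * 1024 * 1024), valid_(0), pos_(0) {}

    // Returns a pointer to the next sub-chunk (up to 16k), NULL at EOF.
    const BYTE* NextChunk(DWORD* size)
    {
        if (pos_ == valid_) {            // buffer drained: refill it
            if (!ReadFile(file_, &buf_[0], (DWORD)buf_.size(),
                          &valid_, NULL) || valid_ == 0)
                return NULL;
            pos_ = 0;
        }
        *size = min(valid_ - pos_, (DWORD)(16 * 1024));
        const BYTE* chunk = &buf_[0] + pos_;
        pos_ += *size;
        return chunk;
    }
};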
Even after all this, the application still results in a system-wide hang on
certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no
constant reads/writes from disk, few hard pagefaults (and only ~30k
pagefaults in the lifetime of the application on a 5GB input file), little
virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I’ve tested it on run Windows XP - Windows 7, x86 and x64
versions included. None have less than 2GB RAM, though the problem is always
exacerbated under lower memory conditions.
I’m at a loss as to what to do next. I don’t know what’s causing it; I’m
torn between CPU and memory as the culprit. CPU, because without the sleep
and under different thread priorities the system performance changes
noticeably. Memory, because there’s a huge difference in how often the issue
occurs when using unordered_set vs Google’s dense_hash_map.
What’s really weird? Obviously, the NT kernel design is supposed to prevent
this sort of behavior from ever occurring (a user-mode application driving
the system to such extreme poor performance!?)… but when I compile the code
and run it on OS X or Linux (it’s fairly standard C++ throughout) it
performs excellently, even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that
Windows is doing behind the scenes that’s killing system performance, when
all the indicators are that the application itself isn’t doing anything
extreme?
Any advice would be most welcome.