Going Faster...!

We’ve just completed an exhaustive test of several Pentium and Athlon
processors, trying to find the fastest combination of components on which to
run our NTW4 software. Our code is carefully written to run entirely in
memory - once we’re loaded into memory, the hard drive is very seldom
accessed and only then by NT’s internal housekeeping. Our code runs as an NT
Service and performs no video operations of any kind, so video bandwidth
shouldn’t be a factor either. Math operations are almost entirely integer,
not FP. Because of these factors, we believe(d) we are limited only by core
and bus speed… and while we don’t expect speed to scale linearly with core
speed, we expected something.

Our testing showed that above 800MHz core speed there doesn’t appear to be
any benefit to faster cores. Going from 800 to 867 to 900 MHz, which
represents a 13% increase in core speed, we see exactly zero delivered
improvement. Going from 450 to 700 to 800, we do see improvements - but they
level off above 800. Changing the external bus speed (from 100 MHz to 133
MHz) also has no measurable effect.

These NTW4 machines are running very lean. Minimal NT Services are running,
and there are only two cards in the backplane (AGP video and PCI network).
TaskMan reports only 13 processes, most of them NT’s own, and memory
consumption at idle is under 15MB.

Since we’ve factored out the disk and video, what’s left? Any suggestions?

RLH

If your code is multi-threaded, then add more CPUs. That should definitely
speed things up.

Jim

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Richard Hartman
Sent: Friday, May 26, 2000 4:47 PM
To: NT Developers Interest List
Subject: [ntdev] Going Faster…!

Memory bandwidth may be an issue. I wrote a small program to test this and
it is available at:
ftp://ftp.ultrabac.com/pub/utils/bm_mem/x86/bm_mem.zip
One of the tests stress-tests the ability to perform rapid context switches.
I’m rather proud of the fact that this little program exposed a disastrous
performance problem in a beta of Win2K, which MS was able to fix prior to
release (after blaming my code for the problem, of course!). The results
it gives aren’t much use by themselves, but they work well as an “index”
for comparing machines against each other.
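
For a rough memory "index" that can be compared across the test machines, one
option is to time copies of a buffer much larger than the L2 cache. The sketch
below is not bm_mem and does not reproduce its tests; the function name, buffer
size, and repeat count are assumptions chosen for illustration.

```python
import time

def copy_bandwidth_mb_per_s(size_bytes=16 * 1024 * 1024, repeats=5):
    """Return the best observed copy rate, in MB/s, for a size_bytes buffer."""
    src = bytearray(size_bytes)        # assumed to be far larger than L2 cache
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)               # reads the whole buffer and writes a copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)      # keep the fastest pass to reduce noise
        del dst
    best = max(best, 1e-9)             # guard against a zero timer reading
    return (size_bytes / (1024 * 1024)) / best

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_mb_per_s():.0f} MB/s")
```

As with any such number, the absolute value means little; it is only useful
when the same script is run on each machine being compared.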

You may find that the problem is the quantum that NT’s scheduler is letting
you have. One way around this might be to boost your execution priority to
see if that helps (assuming you’re not concerned about the performance of
other applications on the system).

You might also want to run VTune from Intel to see where the processors are
spending their time.

Regards,

Paul Bunn, UltraBac.com, 425-644-6000
Microsoft MVP - WindowsNT/2000
http://www.ultrabac.com

>Since we’ve factored out the disk and video, what’s left? Any suggestions?

Lots of things. For example, memory bandwidth and locality? PCI bus
bandwidth? An 800 MHz processor can thrash the cache just as fast as a 900
MHz processor.
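
One way to picture this: treat run time as core-bound work, which shrinks as
the clock rises, plus memory stall time, which does not. The sketch below is a
deliberately simple model, and the cycle count and stall times are made-up
values (assumptions, not measurements from this thread).

```python
def run_time_s(core_mhz, core_cycles, stall_s):
    """Run time = cycles executed in the core / clock rate + fixed stall time."""
    return core_cycles / (core_mhz * 1e6) + stall_s

def speedup(old_mhz, new_mhz, core_cycles, stall_s):
    """Ratio of old run time to new run time for the same workload."""
    return (run_time_s(old_mhz, core_cycles, stall_s)
            / run_time_s(new_mhz, core_cycles, stall_s))

# Hypothetical workload: 0.8e9 core cycles, i.e. 1 second of work at 800 MHz.
HYPOTHETICAL_CYCLES = 0.8e9

if __name__ == "__main__":
    for stall_s in (0.0, 1.0, 9.0):    # hypothetical memory stall times
        gain = speedup(800, 900, HYPOTHETICAL_CYCLES, stall_s)
        print(f"stall {stall_s:.1f} s: 800 -> 900 MHz gives "
              f"{100 * (gain - 1):.1f}% speedup")
```

With no stall time the 800 to 900 MHz step delivers the full 12.5%; with 9
seconds of stalls against 1 second of core work it delivers about 1.1%, which
would be hard to tell apart from zero in a benchmark.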

If you run a profile with VTune, does it show any hotspots, such as cache
misses?

You could also find the Intel processor performance counter plugin for NT’s
Performance Monitor (Resource Kit?), and look at the many processor-internal
performance measures.

The processor support chipset may have some effect on things too. For
example, the memory latency of a 440BX chipset is lower than the newer 820
chipset running with SDRAM. The 820 was designed for RDRAM, so it has to have
a memory protocol translator device in the memory access path. Are your
comparisons on IDENTICAL systems, except for processor clock speed?

You might also be bumping into some device latency limitation. For example,
a LAN device I once worked on took 100+ microseconds for the firmware to
process a command. Even on an infinitely fast processor, LAN performance
would not have changed much because commands could only get processed at a
firmware-limited rate.
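
The ceiling in that example can be worked out directly. The sketch below
assumes commands are handled strictly one at a time with no overlap between
host and firmware, and the host CPU times are hypothetical; only the 100
microsecond firmware figure comes from the example above.

```python
def commands_per_second(cpu_us_per_command, firmware_us_per_command=100.0):
    """Serial model: each command costs host CPU time plus firmware time."""
    return 1e6 / (cpu_us_per_command + firmware_us_per_command)

if __name__ == "__main__":
    # Hypothetical host costs per command, in microseconds; 0.0 stands in
    # for an infinitely fast processor.
    for cpu_us in (10.0, 5.0, 0.0):
        rate = commands_per_second(cpu_us)
        print(f"host {cpu_us:4.1f} us/command -> {rate:8.1f} commands/sec")
```

Under these assumptions, halving the host cost from 10 to 5 microseconds
raises throughput by less than 5%, and even a zero-cost host tops out at
10,000 commands per second.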

If your product is a PCI device, you might also want to use a PCI bus
analyzer to see how efficient your bus transfers are. Seeing the actual
processor bus activity takes a very expensive piece of equipment (maybe
thousands of dollars a month to rent).

You could also be having some issue like your LAN device interrupting
5000 times/sec and polluting the cache. Even if not that much data is
getting transferred, the cache pollution could hurt. The faster Pentium
IIIs only have 256K of faster L2 cache vs. 512K on slower processors. The
Xeons can have heaps more L2 cache, at a hefty price. I don’t think Xeons
come in core clock speeds as fast as the Pentium IIIs, either.

My suggestion is to collect data, for example from the Intel performance
counters.

  • Jan