>That’s why I expected 100 vs. 133 FSB to make a measurable difference - it
did not. And yes, the working code image exceeds the L1/L2 cache capacity so
we should be hitting main memory.
Let me recall some nitty-gritty about memory timing. If you are accessing
main memory in essentially random cache-line chunks, you have to look at
the page-miss memory timing, which I believe is something like 9-1-1-1 for
a burst.
So we calculate total time per burst as:
12 clocks at 100 MHz = 0.120 microseconds or 8.3 mega cache lines/sec
(which is 266 MBytes/sec)
12 clocks at 133 MHz = 0.090 microseconds or 11.11 mega cache lines/sec
(not sure if 133 MHz memory actually is faster on the first access)
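(If you want to play with these numbers, here’s a throwaway C sketch of the
same arithmetic; the 32-byte cache line and the 9-1-1-1 page-miss timing are
my guesses, not measured values.)

    #include <stdio.h>

    /* Back-of-envelope burst math: total clocks per burst at a given bus
       speed, assuming a 32-byte cache line (an assumption, not a spec). */
    static void burst_rate(const char *label, int clocks, double bus_mhz)
    {
        double usec_per_burst = clocks / bus_mhz;        /* bus_mhz = cycles per usec */
        double mega_lines_sec = 1.0 / usec_per_burst;    /* bursts per microsecond    */
        double mbytes_sec     = mega_lines_sec * 32.0;   /* 32 bytes per cache line   */
        printf("%s: %.3f usec/burst, %.2f mega lines/sec, %.0f MBytes/sec\n",
               label, usec_per_burst, mega_lines_sec, mbytes_sec);
    }

    int main(void)
    {
        burst_rate("9-1-1-1 at 100 MHz", 9 + 1 + 1 + 1, 100.0);
        burst_rate("9-1-1-1 at 133 MHz", 9 + 1 + 1 + 1, 133.0);
        return 0;
    }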
So let’s assume your app is touching a new random memory location every 10
processor clock cycles.
at 800 MHz that’s 80 mega transactions/sec desired, and only 8.3 are
available, so your CPU spends about 90% of its time stalled in wait states
at 500 MHz that’s 50 mega transactions/sec desired, and again only 8.3 are
available, so the result is that 500 MHz and 800 MHz processors will have
the same application performance.
If we slide the curve up to match your reported performance (800 MHz and
900 MHz processors give the same performance), it’s possible you access
memory every 96 processor cycles, so we get:
at 800 MHz that’s 8.3 mega transactions/sec desired and available, and your
processor is not stalled for memory
at 900 MHz that’s 9.37 mega transactions/sec desired and only 8.3
available, so you’re waiting about 12% of the time, and the app is not
going any faster
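Here’s that stall estimate as a little C sketch so you can plug in other
clock speeds or access intervals (the 96-cycle interval and the 8.3 mega
transactions/sec ceiling are the assumptions from above, not measurements;
the 10-cycle case works the same way):

    #include <stdio.h>

    /* Estimate how much of its time the CPU spends stalled if it wants a
       new random cache line every "cycles_per_access" cycles but memory
       can only deliver "avail" mega transactions/sec. */
    static void stall(double cpu_mhz, double cycles_per_access, double avail)
    {
        double desired = cpu_mhz / cycles_per_access;   /* mega transactions/sec */
        double stalled = desired > avail ? 100.0 * (1.0 - avail / desired) : 0.0;
        printf("%4.0f MHz: %5.2f M/sec desired, %4.1f available, ~%2.0f%% stalled\n",
               cpu_mhz, desired, avail, stalled);
    }

    int main(void)
    {
        stall(800.0, 96.0, 8.3);   /* ~8.3 desired, essentially no stall   */
        stall(900.0, 96.0, 8.3);   /* ~9.4 desired, roughly 11-12% stalled */
        return 0;
    }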
So here are some things to measure. Install the Pentium performance
counter module for the NT performance monitor. I believe one of the
available things to measure is memory transaction rate. See if it’s the
same number with an 800 MHz and a 900 MHz processor.
Does the performance of your app scale directly with processor clock speed
from, say, 500 MHz? It could be the “wall” is at 692 MHz, so really you
should check whether things scale smoothly and then hit a ledge where they
suddenly stop improving.
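If the performance counters turn out to be a pain to set up, a crude
user-level pointer chase gives a similar kind of number. This is not the NT
performance counter module, just a rough micro-benchmark sketch: the buffer
size, node layout (32 bytes assumes a 32-bit build), and iteration count are
all guesses.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NODES (1 << 20)            /* 1M nodes x 32 bytes = 32 MB, well past L2 */
    #define STEPS (10 * 1000 * 1000)   /* dependent loads to time                   */

    struct node { struct node *next; char pad[28]; };  /* ~one cache line on 32-bit */

    int main(void)
    {
        struct node *a = malloc(sizeof(struct node) * (size_t)NODES);
        struct node *p;
        clock_t t0;
        double sec;
        long i;

        if (!a) return 1;

        /* Sattolo's algorithm: turn the identity permutation into one big
           random cycle, so the chase below visits every node. */
        for (i = 0; i < NODES; i++) a[i].next = &a[i];
        for (i = NODES - 1; i > 0; i--) {
            long j = (long)((double)rand() / ((double)RAND_MAX + 1.0) * i);
            struct node *t = a[i].next;
            a[i].next = a[j].next;
            a[j].next = t;
        }

        p = &a[0];
        t0 = clock();
        for (i = 0; i < STEPS; i++) p = p->next;   /* each load depends on the last */
        sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Printing p keeps the compiler from throwing the loop away. */
        printf("%.1f mega random loads/sec (p=%p)\n", STEPS / sec / 1e6, (void *)p);
        free(a);
        return 0;
    }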
You should also measure the memory transaction rate at 900 MHz, for both
100 MHz and 133 MHz memory. You may find that what happens is the 100 MHz
memory can deliver a page-miss burst at 9-1-1-1 while the 133 MHz memory
can only do 13-1-1-1, both of which give about 0.120 microseconds per
burst.
For memory page hits, you will get better maximum bandwidth from the 133
MHz memory because page-hit timing is as fast as maybe 2-1-1-1. Different
brands of 133 MHz memory may also have different initial latency.
You might also get clues about memory timing by looking at what the BIOS
setup shows when it automatically sets memory timing using SPD (a tiny
parameter ROM on the memory DIMM).
Can you tell me exactly which chipset is on your Pentium III motherboard
(440BX/GX, 820, 840)? The 820/840 access SDRAM through a memory protocol
translator, as those chipsets basically speak the RDRAM protocol. That
RDRAM bus may always run at the same speed, no matter what the external
processor bus clock runs at; I’d have to look at the block diagrams of the
chipset to see.
There’s lots of confusion about RDRAM. The BIG advantage of RDRAM is that
address requests and data responses are pipelined. The total latency is no
better (and maybe worse) than SDRAM’s. If the processor and chipset have
their act together about predicting upcoming memory accesses (the Pentium
III prefetch instructions especially can help), the pipelined architecture
can
deliver dramatically higher memory access transaction rates. Essentially
the 9 clocks in 9-1-1-1 SDRAM timing are overlapped, so addresses continue
into the pipeline, and data continues to flow out, while the memory is
getting around to finding the data. The RAMBUS web site has some pretty
graphics to show this. I don’t actually know if Intel processors and
chipsets can keep the pipeline full though. If not, RDRAM is hugely more
expensive and will be no faster than SDRAM. Itanium’s execution is even
more speculative than the Pentium’s, so the trend is toward getting things
to effectively use pipelined memory accessing.
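As a side note on those prefetch instructions: with the SSE intrinsics you
can issue them from C through _mm_prefetch, so the next fetch overlaps work
on the current data. The sketch below is only an illustration; the
8-cache-line prefetch distance is an untuned guess, and whether it actually
helps depends on your chipset.

    #include <xmmintrin.h>   /* _mm_prefetch / _MM_HINT_T0 (Pentium III SSE intrinsics) */

    /* Sum an array while asking for data 8 cache lines (32 doubles x 8 bytes
       = 256 bytes) ahead of where we are working.  The distance is untuned. */
    double sum_with_prefetch(const double *a, int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            if (i + 32 < n)
                _mm_prefetch((const char *)&a[i + 32], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }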
If random memory accessing is the limiting factor, you may find that
rearranging your code to improve locality helps a lot. I know things like
matrix multiplies can go lots faster with code that understands memory
delays. For example, you might preallocate data structures instead of just
calling new, so instead of bouncing all over memory, you stay within a
much smaller range when walking a specific data structure. If you have to
make 100 visits to 5000 small nodes, you may see a 10x difference in
performance. A useful rule of thumb on current processors is that you get
something like 10 mega memory transactions/sec for random accesses. If
you’re accessing one byte per transaction, that’s only a bandwidth of 10
MBytes/sec, a dramatically lower number (like 80x lower) than what you
might expect. If you’re accessing large amounts of memory randomly,
performance degrades even more, because the virtual memory page tables may
no longer fit in the L2 cache, so every memory access requires multiple
memory transactions just to load the correct TLB entry first.
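Here’s a rough sketch of the preallocation idea in C (the names and sizes
are made up): instead of a separate malloc/new per node, hand nodes out of
one contiguous slab so a walk over the structure stays inside a small range
of addresses.

    #include <stdlib.h>

    struct item {
        struct item *next;
        int          key;
    };

    struct item_pool {
        struct item *slab;   /* one big contiguous allocation */
        size_t       used;
        size_t       cap;
    };

    int pool_init(struct item_pool *p, size_t cap)
    {
        p->slab = malloc(cap * sizeof *p->slab);
        p->used = 0;
        p->cap  = cap;
        return p->slab != NULL;
    }

    /* Every node comes from the same slab, so a list built out of them
       sits in one compact region instead of being scattered over the heap. */
    struct item *pool_alloc(struct item_pool *p)
    {
        if (p->used == p->cap) return NULL;   /* no growth in this sketch */
        return &p->slab[p->used++];
    }

    void pool_destroy(struct item_pool *p)
    {
        free(p->slab);
        p->slab = NULL;
        p->used = p->cap = 0;
    }

Laid out this way, the 5000-node example above is only about 40 KB on a
32-bit build, which sits comfortably in a 256 KB L2, instead of 5000 little
allocations sprinkled across the heap.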