Re: Going Faster...!

xxxxx@realresume.com said:

Our testing showed that above 800MHz core speed there doesn’t appear
to be any benefit to faster cores.

I’m surprised you did so well. I would think you would be memory (or cache)
limited long before then.


Steve Williams “The woods are lovely, dark and deep.
xxxxx@icarus.com But I have promises to keep,
xxxxx@picturel.com and lines to code before I sleep,
http://www.picturel.com And lines to code before I sleep.”

>Paul Bunn wrote:

>
> Memory bandwidth may be an issue. I wrote a small program to test this and
> it is available at:
> ftp://ftp.ultrabac.com/pub/utils/bm_mem/x86/bm_mem.zip

Excellent! I’ll give it a try.
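I haven't looked inside bm_mem yet, but the general shape of a streaming memory test is roughly the sketch below; the buffer size, pass count, and clock()-based timing are my own placeholder choices, not anything taken from bm_mem itself.

// Minimal memory-bandwidth sketch: time sequential read passes over a
// buffer far larger than L2 cache, so the reads must come from main memory.
// Buffer size and pass count are arbitrary illustration values.
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cstddef>

int main()
{
    const std::size_t words  = (64 * 1024 * 1024) / sizeof(unsigned long); // ~64 MB buffer
    const int         passes = 8;
    volatile unsigned long sink = 0;               // keeps the compiler from dropping the loop

    unsigned long *buf = static_cast<unsigned long *>(std::malloc(words * sizeof(unsigned long)));
    if (!buf) return 1;
    for (std::size_t i = 0; i < words; ++i)
        buf[i] = static_cast<unsigned long>(i);    // touch every page before timing

    std::clock_t start = std::clock();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < words; ++i)
            sink += buf[i];                        // sequential reads: best-case bandwidth
    std::clock_t stop = std::clock();

    double seconds = double(stop - start) / CLOCKS_PER_SEC;
    double mbytes  = double(words * sizeof(unsigned long)) * passes / (1024.0 * 1024.0);
    if (seconds > 0.0)
        std::printf("~%.1f MB/s sequential read (checksum %lu)\n",
                    mbytes / seconds, static_cast<unsigned long>(sink));
    std::free(buf);
    return 0;
}

Sequential reads like this give a best-case figure; random cache-line accesses would come out far lower.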

RLH

>Jan Bottorff wrote:

>
> >Since we’ve factored out the disk and video, what’s left? Any suggestions?
>
> Lots of things. For example, memory bandwidth and locality?

That’s why I expected 100 vs. 133 FSB to make a measurable difference - it
did not. And yes, the working code image exceeds the L1/L2 cache capacity so
we should be hitting main memory.

> PCI bus bandwidth?

This code performs a minimal amount of LAN I/O, which occurs only at the start
and end of the processing cycle. Total data transferred is under 100K on a
100BaseTX connection. FYI: the only cards in the backplane are the NIC and
the AGP video. As for the latter, the code runs as a service and no one is
logged in when the test is run, so there are no actual video operations.

> If you run a profile with VTune does it show any hotspots? Like for cache
> misses?
> You could also find the Intel processor performance counter plugin for NT’s
> performance monitor (resource kit???) and look at lots of internal processor
> performance measures.

I’ll try these - thanks.

> The processor support chipset may have some effect on things too. For
> example, the memory latency of a 440BX chipset is lower than that of the newer
> 820 chipset running with SDRAM. The 820 was designed for RDRAM, so it has to
> have a memory protocol translator device in the memory access path. Are your
> comparisons on IDENTICAL systems, except for processor clock speed?

Yes. We are using one motherboard for all Pentium tests, and another
motherboard for all Athlon tests. The only change from test to test is
swapping out the CPU and changing the jumpers/BIOS settings as necessary to
support the new CPU. Same motherboard, same memory, same NIC, same video.

> You might also be bumping into some device latency limitation. For example,
> a LAN device I once worked on took 100+ microseconds for the firmware to
> process a command. Even on an infinitely fast processor, LAN performance
> would not have changed much, because commands could only get processed at a
> firmware-limited rate.

See above. The only peripherals are the chipset itself, an SMC 9432TX NIC,
and an AOpen 8MB AGP video card.

> If your product is a PCI device…

Our product is software - we’re just trying to find the fastest platform on
which to run it.

RLH

At 08:50 PM 05/27/2000 -0400, Xanrel wrote:

Can you be more specific? You said you got it to fit in memory. Did you mean
cache or RAM?

RAM. It’s too large to fit into L1/L2 cache. I don’t have exact values, but
we know for a fact it’s larger than on-chip cache.

What kind of operations (addition, multiplication, combinations of
operations, etc.)? Almost no FP != no FP: what kinds are there, and what
integer operations are they mixed with?

It’s mostly text crunching and data rearrangement - very little math of any
kind, integer or FP.

If you’re not reading the drive or using video,
RDRAM will do nothing for you as long as your processor bus is
still at 133MHz*64-bit, and the latency is worse (about equivalent to CAS4
or CAS5).

Hmm… first I’ve heard of this. We have a Rambus hardware setup coming in
for evaluation, so we’ll have a chance to put this to the test.

You said changing the external bus speed had no effect. If you just
increased the processor bus speed and not the SDRAM bus, that’s why – up
the SDRAM to 133MHz (and use CAS2 PC133) and you should actually see a
difference, if your data is getting pulled from RAM because it won’t fit in
the cache.

I was specifically referring to the memory bus… the motherboards permit
changing the memory speed separately from the processor bus speed. Playing
with the memory bus speed (from 100MHz to 133MHz) made zero measurable
difference.

It feels like there’s another bottleneck somewhere, but I can’t think of
what it might be. We’ve essentially eliminated disk and video… there’s
very little network I/O, and then only at the start and end of each
processing cycle… there’s enough physical memory to avoid using virtual
memory. What else could there be?

I read a whitepaper a couple of years ago which said that when core speeds
exceeded about 4X the external bus speed, further overall system speed
improvement went asymptotic. 800+ MHz CPUs have multipliers well beyond 4X,
so perhaps what we’re seeing is the whitepaper’s limit: core speed alone
doesn’t yield improvement because the working set is too large for on-chip
cache, and memory is so slow relative to the core that minor improvements
in memory speed don’t yield significant results.

RLH

>That’s why I expected 100 vs. 133 FSB to make a measurable difference - it
>did not. And yes, the working code image exceeds the L1/L2 cache capacity so
>we should be hitting main memory.

Let me recall some of the nitty-gritty about memory timing. If you’re accessing
main memory in essentially random cache-line chunks, you have to look at
the page-miss memory timing, which I believe is something like 9-1-1-1 for a burst.
So we calculate total time per burst as:

12 clocks at 100 MHz = 0.120 microseconds, or 8.3 mega cache lines/sec
(which is 266 MBytes/sec)

12 clocks at 133 MHz = 0.090 microseconds, or 11.11 mega cache lines/sec
(not sure if 133 MHz memory actually is faster on the first access)

So let’s assume your app is touching a new random memory location every 10
processor clock cycles.

At 800 MHz that’s 80 mega transactions/sec desired, and only 8.3 are
available, so your CPU spends 90% of its time stalled in wait states.

At 500 MHz that’s 50 mega transactions/sec desired, and again only 8.3 are
available, so 500 MHz and 800 MHz processors will give the same
application performance.

If we slide the curve up to match your reported performance (800 MHz and
900 MHz processors giving the same performance), it’s possible you access
memory every 96 processor cycles, so we get:

At 800 MHz that’s 8.3 mega transactions/sec desired and available, and your
processor is not stalled for memory.

At 900 MHz that’s 9.37 mega transactions/sec desired and only 8.3
available, so you’re waiting about 12% of the time, and the app is not going
any faster.
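
Put into a few lines of code, that back-of-the-envelope model looks roughly like this (the 100 MHz bus, 12-clock page-miss burst, and 96-cycle access interval are just the assumed numbers from above):

// Back-of-the-envelope stall model from the numbers above.
// Assumptions: 12 bus clocks per random (page-miss) cache-line burst,
// and one new random cache line touched every N core cycles.
#include <cstdio>

int main()
{
    const double bus_mhz   = 100.0;                // SDRAM bus clock
    const double burst_clk = 12.0;                 // 9-1-1-1 page-miss burst
    const double avail     = bus_mhz / burst_clk;  // ~8.3 mega cache lines/sec available

    const double core_mhz[] = { 500.0, 800.0, 900.0 };
    const double cycles_per_access = 96.0;         // assumed interval between random accesses

    for (int i = 0; i < 3; ++i) {
        double wanted = core_mhz[i] / cycles_per_access;          // mega transactions/sec desired
        double stall  = (wanted <= avail) ? 0.0 : 1.0 - avail / wanted;
        std::printf("%4.0f MHz core: want %.2f Mxact/s, have %.2f -> stalled %.0f%%\n",
                    core_mhz[i], wanted, avail, stall * 100.0);
    }
    return 0;
}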

So here are some things to measure. Install the Pentium performance counter
module for the NT performance monitor. I believe one of the available
things to measure is memory transaction rate. See if it’s the same number
with an 800 MHz and a 900 MHz processor.

Does performance of your app scale directly with processor clock speed from,
say, 500 MHz? It could be the “wall” is at 692 MHz, so you should really try
to see whether things scale smoothly and then hit a ledge where they suddenly
stop improving.

You should also measure the memory transaction rate at 900 MHz, for both
100 MHz and 133 MHz memory. You may find that 100 MHz memory can deliver a
page-miss burst at 9-1-1-1 while 133 MHz memory can only do 13-1-1-1, both of
which will give about 0.120 microseconds per burst. For memory page hits, you
will get better maximum bandwidth from the 133 MHz memory, because page-hit
timing is as fast as maybe 2-1-1-1. Different brands of 133 MHz memory may
also have different initial latency.

You might also get clues about the memory timing by looking at what the BIOS
setup shows when it automatically sets the memory timing from SPD (a tiny
parameter ROM on the memory DIMM).

Can you tell me exactly which processor chipset is on your Pentium III
motherboard (440BX/GX, 820, 840)? The 820/840 accesses SDRAM through a
memory protocol translator, as those chipsets basically speak RDRAM protocol.
That RDRAM bus may always run at the same speed, no matter what the
external processor bus clock runs at; I’d have to look at the block
diagrams of the chipset to see.

There’s a lot of confusion about RDRAM. The BIG advantage of RDRAM is that
address requests and data responses are pipelined. The total latency is no
better (and maybe worse) than SDRAM’s. If the processor and chipset have it
together about predicting upcoming memory accesses (the Pentium III
prefetch instructions especially can help), the pipelined architecture can
deliver dramatically higher memory access transaction rates. Essentially
the 9 clocks in 9-1-1-1 SDRAM timing are overlapped, so addresses continue
into the pipeline, and data continues to flow out, while the memory is
getting around to finding the data. The Rambus web site has some pretty
graphics to show this. I don’t actually know if Intel processors and
chipsets can keep the pipeline full, though. If not, RDRAM is hugely more
expensive and will be no faster than SDRAM. The Itanium’s execution is even
more speculative than the Pentium’s, so the trend is toward getting things to
effectively use pipelined memory accessing.
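
As a rough illustration of what helping that pipeline looks like from the software side, here is a minimal sketch using the Pentium III prefetch intrinsic from <xmmintrin.h>; the Node layout and the one-node-ahead prefetch distance are placeholders, not a tuned example.

// Sketch of software prefetching while walking a linked structure.
// The SSE prefetch intrinsic asks the memory system to start fetching
// the *next* node while we process the current one, so the request/response
// pipeline stays busy instead of stalling on every hop.
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

struct Node {
    int   payload;
    Node *next;
};

int sum_list(Node *head)
{
    int total = 0;
    for (Node *n = head; n != 0; n = n->next) {
        if (n->next)
            _mm_prefetch(reinterpret_cast<const char *>(n->next), _MM_HINT_T0);
        total += n->payload;   // useful work overlaps with the outstanding fetch
    }
    return total;
}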

If random memory accessing is the limiting factor, you may find that
rearranging your code to improve locality helps a lot. I know things like
matrix multiplies and such can go a lot faster with code that understands
memory delays. For example, you might preallocate data structures instead of
just calling new, so instead of bouncing all over memory, you stay within a
much smaller range when walking a specific data structure. If you have to
make 100 visits to 5000 small nodes, you may see a 10x difference in
performance. A useful rule of thumb on current processors is that you get
something like 10 mega memory transactions/sec for random accesses. If you’re
accessing one byte at a time, that’s only a bandwidth of 10 MBytes/sec, a
dramatically lower number (something like 80x lower) than what you might
expect. If you’re accessing large amounts of memory randomly, performance
degrades even more, because the virtual memory page tables may no longer
fit in L2 cache, so every memory access requires multiple memory
transactions to load the correct TLB entry first.
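
Here is a simplified sketch of the preallocation idea (the Record type, index links, and pool size are placeholders, not a drop-in design): carve all the nodes out of one contiguous block up front, so walking a chain touches a small, dense address range instead of chasing pointers all over the heap.

#include <vector>
#include <cstddef>

// Records live in one preallocated, contiguous array and link to each
// other by index instead of by heap pointer, so walking a chain stays
// within a small, dense range of memory.
struct Record {
    int key;
    int next;                     // index of the next record, -1 = end of chain
};

class RecordPool {
public:
    explicit RecordPool(std::size_t capacity) : pool_(capacity), used_(0) {}

    // Hand out the next unused slot from the contiguous block.
    int allocate(int key, int next)
    {
        if (used_ == pool_.size())
            return -1;            // pool exhausted
        pool_[used_].key  = key;
        pool_[used_].next = next;
        return static_cast<int>(used_++);
    }

    // Walking the chain touches only this one dense array.
    long walk(int start) const
    {
        long sum = 0;
        for (int i = start; i != -1; i = pool_[i].next)
            sum += pool_[i].key;
        return sum;
    }

private:
    std::vector<Record> pool_;
    std::size_t used_;
};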

- Jan

At 04:06 PM 05/30/2000 -0700, Jan Bottorff wrote:

If random memory accessing is the limiting factor, you may find that
rearranging your code to improve locality helps a lot. I know things like
matrix multiplies and such can go a lot faster with code that understands
memory delays. For example, you might preallocate data structures instead of
just calling new, so instead of bouncing all over memory, you stay within a
much smaller range when walking a specific data structure.

Very true. I take this one step further: not only do I preallocate arrays of
structures, but I also reuse them from bottom to top. This packs the active
structures near one end of the array, making them contiguous and more likely
to fit within fewer NT memory pages. By the way, NT completion ports assign
threads in LIFO order for the same reason: their stacks are more likely to
still be nearby in cache (L1 or L2).
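
A rough sketch of that reuse pattern (the Context type and the linear scan are just for illustration, not our actual code): always hand out the lowest-numbered free slot, so the active entries stay packed at the bottom of the preallocated array.

#include <vector>
#include <cstddef>

struct Context {
    char state[256];              // placeholder for per-cycle working data
};

// Preallocated array of contexts; acquire() always returns the lowest
// free index, so the active contexts stay packed at the bottom of the
// array and span as few pages (and cache lines) as possible.
class ContextPool {
public:
    explicit ContextPool(std::size_t n) : slots_(n), in_use_(n, false) {}

    int acquire()
    {
        for (std::size_t i = 0; i < in_use_.size(); ++i)
            if (!in_use_[i]) { in_use_[i] = true; return static_cast<int>(i); }
        return -1;                // pool exhausted
    }

    void release(int i) { in_use_[i] = false; }

    Context &at(int i) { return slots_[i]; }

private:
    std::vector<Context> slots_;
    std::vector<bool>    in_use_;
};

A real version would track the lowest free index more cleverly than a linear scan, but the packing effect is the same.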

RLH