Windows System Software -- Consulting, Training, Development -- Unique Expertise, Guaranteed Results


Re: Going Faster...!

OSR_Community_User Member Posts: 110,217
[email protected] said:
> Our testing showed that above 800MHz core speed there doesn't appear
> to be any benefit to faster cores.

I'm surprised you did so well. I would think you would be memory (or cache)
limited long before then.

Steve Williams
[email protected]
[email protected]

"The woods are lovely, dark and deep.
But I have promises to keep,
and lines to code before I sleep,
And lines to code before I sleep."


  • OSR_Community_User Member Posts: 110,217
    >Paul Bunn wrote:
    >> Memory bandwidth may be an issue. I wrote a small program to test this and
    >> it is available at:

    Excellent! I'll give it a try.

  • OSR_Community_User Member Posts: 110,217
    >Jan Bottorff wrote:
    >> >Since we've factored out the disk and video, what's left? Any suggestions?
    >> Lots of things. For example, memory bandwidth and locality?

    That's why I expected 100 vs. 133 FSB to make a measurable difference - it
    did not. And yes, the working code image exceeds the L1/L2 cache capacity so
    we should be hitting main memory.

    >> PCI bus bandwidth?

    This code performs a minimum amount of LAN I/O which occurs at the start and
    end of the processing cycle. Total data transferred is under 100K on a
    100BaseTX connection. FYI: The only cards in the backplane are the NIC and
    the AGP video. As for the latter, the code runs as a service and no one is
    logged in when the test is run, so there are no actual video operations.

    >> If you run a profile with VTune does it show any hotspots? Like for cache
    >> misses?
    >> You also could find the Intel processor performance counter plugin for NT's
    >> performance monitor (resource kit???), and look at lots of processor
    >> internal performance measures.

    I'll try these - thanks.

    >> The processor support chipset may have some effect on things too. For
    >> example, the memory latency of a 440BX chipset is lower than the newer 820
    >> chipset running with SDRAM. The 820 was designed for RDRAM, so has to have
    >> a memory protocol translator device in the memory access path. Are your
    >> comparisons on IDENTICAL systems, except for processor clock speed?

    Yes. We are using one motherboard for all Pentium tests, and another
    motherboard for all Athlon tests. The only change from test to test is
    swapping out the CPU and changing the jumpers/BIOS settings as necessary to
    support the new CPU. Same motherboard, same memory, same NIC, same video.

    >> You might also be bumping into some device latency limitation. For example,
    >> a LAN device I once worked on took 100+ microseconds for the firmware to
    >> process a command. Even on an infinitely fast processor, LAN performance
    >> would have not changed much because commands could only get processed at a
    >> firmware limited rate.

    See above. The only peripherals are the chipset itself, an SMC 9432TX NIC,
    and an AOpen 8MB AGP video card.

    >> If your product is a PCI device....

    Our product is software - we're just trying to find the fastest platform on
    which to run it.

  • OSR_Community_User Member Posts: 110,217
    At 08:50 PM 05/27/2000 -0400, Xanrel wrote:
    >Can you be more specific? You said you got it to fit in memory. Did you mean
    >cache or RAM?

    RAM. It's too large to fit into L1/L2 cache. I don't have exact values, but
    we know for a fact it's larger than on-chip cache.

    >What kind of operation (addition, multiplication, combinations of
    >operations, etc.)? Almost no FP != no FP, what kind are there, and what
    >integer operations is it mixed with?

    It's mostly text crunching and data rearrangement - very little math of any
    kind, integer or FP.

    >If you're not reading the drive or using video,
    >RDRAM will do nothing for you as long as your processor bus is
    >still at 133MHz*64-bit, and the latency is worse (about equivalent to CAS4
    >or CAS5).

    Hmm... first I've heard of this. We have a RamBus hardware setup coming in
    for evaluation, so we'll have a chance to prove this.

    >You said changing the external bus speed had no effect. If you just
    >increased the processor bus speed and not the SDRAM bus, that's why -- up
    >the SDRAM to 133MHz (and use CAS2 PC133) and you should actually see a
    >difference, if your data is getting pulled from RAM because it won't fit in
    >the cache.

    I was specifically referring to the memory bus... the motherboards permit
    changing the memory speed separately from the processor bus speed. Playing
    with the memory bus speed (from 100MHz to 133MHz) made zero measurable
    difference.

    It feels like there's another bottleneck somewhere, but I can't think of
    what it might be. We've essentially eliminated disk and video... there's
    very little network I/O, and then only at the start and end of each
    processing cycle... there's enough physical memory to avoid using virtual
    memory. What else could there be?

    I read a whitepaper a couple of years ago which said that when core speeds
    exceeded about 4X the external bus speeds, further overall system speed
    improvement went asymptotic. 800+ MHz CPUs have multipliers well beyond 4X,
    so perhaps what we're seeing is the whitepaper's limit: Core speed alone
    doesn't yield improvement because the working set is too large for on-chip
    cache, and memory speed is so slow relative to the core that minor
    improvements in memory speed don't yield significant results.
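    That whitepaper's limit can be illustrated with a back-of-the-envelope model. The latency and cycle counts below are assumptions picked for illustration, not measurements from this workload:

```python
# Back-of-the-envelope model (assumed numbers, not measurements):
# each unit of work costs some core cycles of computation plus one
# main-memory access whose latency does not scale with the core clock.

MEM_LATENCY_S = 120e-9   # assumed: one random main-memory access, ~120 ns
COMPUTE_CYCLES = 50      # assumed: core cycles of computation per access

def time_per_unit(core_hz):
    """Seconds per work unit: compute shrinks with clock speed, memory doesn't."""
    return COMPUTE_CYCLES / core_hz + MEM_LATENCY_S

for mhz in (500, 800, 900):
    print(f"{mhz} MHz: {time_per_unit(mhz * 1e6) * 1e9:.1f} ns/unit")

# The 500 -> 900 MHz jump is an 80% clock increase but yields only ~25%
# more throughput here, because the fixed memory latency dominates.
```

    With these assumed numbers the fixed memory term swamps the shrinking compute term, which is exactly the saturation the whitepaper described.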

  • OSR_Community_User Member Posts: 110,217
    >That's why I expected 100 vs. 133 FSB to make a measurable difference - it
    >did not. And yes, the working code image exceeds the L1/L2 cache capacity so
    >we should be hitting main memory.

    Let me recall some nitty-gritty about memory timing. If you're accessing
    main memory in essentially random cache-line chunks, you have to look at
    the page-miss memory timing, which I believe is something like 9-1-1-1 for
    a burst. So we calculate total time per burst as:

    12 clocks at 100 MHz = 0.120 microseconds, or 8.3 mega cache lines/sec
    (which is 266 MBytes/sec)

    12 clocks at 133 MHz = 0.090 microseconds, or 11.1 mega cache lines/sec
    (not sure if 133 MHz memory actually is faster on the first access)

    So let's assume your app is touching a new random memory location every 10
    processor clock cycles.

    at 800 MHz that's 80 mega transactions/sec desired, and only 8.3 are
    available, so your CPU spends 90% of its time stalled in wait states;

    at 500 MHz that's 50 mega transactions/sec desired, and only 8.3 are
    available, so 500 MHz and 800 MHz processors will have the same
    application performance.

    If we slide the curve up to match your reported performance (800 MHz and
    900 MHz processors give the same performance), it's possible you access
    memory every 96 processor cycles, so we get:

    at 800 MHz that's 8.3 mega transactions/sec desired and available, and your
    processor is not stalled for memory;

    at 900 MHz that's 9.37 mega transactions/sec desired and only 8.3
    available, so you're waiting about 12% of the time, and the app is not
    going any faster.
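    The stall arithmetic above can be checked in a few lines. The 12-clock burst and the per-access cycle counts come straight from the post; everything else follows from them:

```python
# From the post: a page-miss burst takes 12 bus clocks at 100 MHz,
# i.e. 0.12 us per cache line, or ~8.33 million cache lines per second.
BURST_CLOCKS = 12
BUS_HZ = 100e6
AVAILABLE = BUS_HZ / BURST_CLOCKS          # ~8.33e6 bursts/sec

def stall_fraction(core_hz, cycles_per_access):
    """Fraction of time the CPU waits, if it wants one random cache line
    every `cycles_per_access` core cycles but memory caps the rate."""
    demanded = core_hz / cycles_per_access
    return max(0.0, 1.0 - AVAILABLE / demanded)

# One access every 10 cycles: an 800 MHz core wants 80M lines/sec,
# memory delivers ~8.33M, so it stalls ~90% of the time.
print(f"{stall_fraction(800e6, 10):.0%}")

# One access every 96 cycles: 800 MHz just fits; 900 MHz stalls ~11%.
print(f"{stall_fraction(800e6, 96):.0%}")
print(f"{stall_fraction(900e6, 96):.0%}")
```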

    So here are some things to measure. Install the Pentium performance counter
    module for the NT performance monitor. I believe one of the available
    things to measure is memory transaction rate. See if it's the same number
    with 800 and 900 MHz processors.

    Does performance of your app scale directly with processor clock speed
    from, say, 500 MHz? It could be that the "wall" is at 692 MHz, so you
    should really check whether things scale smoothly and then hit a ledge
    where they suddenly stop improving.

    You should also measure the memory transaction rate at 900 MHz, for both
    100 MHz and 133 MHz memory. You may find that 100 MHz memory can deliver a
    page-miss burst at 9-1-1-1 while 133 MHz memory can only do 13-1-1-1, both
    of which give about 0.120 microseconds per burst. For memory page hits, you
    will get better maximum bandwidth from the 133 MHz memory because page-hit
    timing is as fast as maybe 2-1-1-1. Different brands of 133 MHz memory may
    also have different initial latency.
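    A quick check of the burst-time claim, using only the clock patterns and bus speeds named above:

```python
# Burst time for a 4-transfer SDRAM burst given its clock pattern:
# 9-1-1-1 at 100 MHz vs 13-1-1-1 at 133 MHz come out nearly identical.
def burst_us(timing, bus_mhz):
    """Microseconds for one burst: total clocks divided by bus frequency."""
    return sum(timing) / bus_mhz

print(f"{burst_us((9, 1, 1, 1), 100):.3f} us")   # 12 clocks / 100 MHz
print(f"{burst_us((13, 1, 1, 1), 133):.3f} us")  # 16 clocks / 133 MHz
print(f"{burst_us((2, 1, 1, 1), 133):.3f} us")   # page hit at 2-1-1-1
```

    The page-miss cases both land at about 0.120 microseconds, while the 2-1-1-1 page hit is roughly three times faster, matching the paragraph above.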

    You might also get clues about memory timing by looking at what the BIOS
    setup shows when automatically setting memory timing using SPD (a tiny
    parameter ROM on the memory DIMM).

    Can you tell me exactly which processor chipset is on your Pentium III
    motherboard (440BX/GX, 820, 840)? The 820/840 accesses SDRAM through a
    memory protocol translator, as those chipsets basically speak RDRAM
    protocol. That RDRAM bus may always run at the same speed, no matter what
    the external processor bus clock runs at; I'd have to look at the block
    diagrams of the chipset to see.

    There's lots of confusion about RDRAM. The BIG advantage of RDRAM is that
    address requests and data responses are pipelined. The total latency is no
    better (and maybe worse) than SDRAM's. If the processor and chipset are
    good at predicting upcoming memory accesses (the Pentium III prefetch
    instructions especially can help), the pipelined architecture can deliver
    dramatically higher memory transaction rates. Essentially, the 9 clocks in
    9-1-1-1 SDRAM timing are overlapped, so addresses continue into the
    pipeline, and data continues to flow out, while the memory is getting
    around to finding the data. The Rambus web site has some pretty graphics
    to show this. I don't actually know if Intel processors and chipsets can
    keep the pipeline full, though. If not, RDRAM is hugely more expensive and
    will be no faster than SDRAM. Itanium's execution is even more speculative
    than the Pentium's, so the trend is toward getting things to effectively
    use pipelined memory accessing.

    If random memory accessing is the limiting factor, you may find that
    rearranging your code to improve locality helps a lot. I know things like
    matrix multiplies can go lots faster with code that understands memory
    delays. For example, you might preallocate data structures instead of just
    calling new, so that instead of bouncing all over memory, you stay within a
    much smaller range when walking a specific data structure. If you have to
    make 100 visits to 5000 small nodes, you may see a 10x difference in
    performance. A useful rule of thumb on current processors is that you get
    something like 10 mega memory transactions/sec for random accesses. If
    you're accessing one byte at a time, that's a bandwidth of only 10
    MBytes/sec, a dramatically lower number (like 80x lower) than what you
    might expect. If you're accessing large amounts of memory randomly,
    performance degrades even more, because the virtual memory page tables may
    no longer fit in L2 cache, so every memory access requires multiple memory
    transactions to load the correct TLB entry first.
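    The preallocation pattern can be sketched like this. A toy illustration only, with invented names (NodePool, alloc, release); in a real driver you'd carve nodes out of one contiguous buffer in C rather than calling the allocator per node:

```python
# Sketch of the locality pattern: carve all nodes out of one preallocated,
# contiguous pool instead of allocating each node separately, so walking a
# structure stays within a small address range.
from array import array

POOL_SIZE = 5000

class NodePool:
    def __init__(self, size=POOL_SIZE):
        # Parallel arrays = one contiguous block per field (struct-of-arrays).
        self.value = array("l", [0] * size)
        self.next = array("l", [-1] * size)        # index of next node, -1 = end
        self.free = list(range(size - 1, -1, -1))  # LIFO free list, slot 0 on top

    def alloc(self, value):
        i = self.free.pop()          # reuse the most recently freed slot
        self.value[i] = value
        self.next[i] = -1
        return i

    def release(self, i):
        self.free.append(i)

pool = NodePool()
# Build a small linked list inside the pool: 3 -> 2 -> 1
head = -1
for v in (1, 2, 3):
    i = pool.alloc(v)
    pool.next[i] = head
    head = i

# Walk it: the nodes live in adjacent pool slots, not scattered on the heap.
out = []
i = head
while i != -1:
    out.append(pool.value[i])
    i = pool.next[i]
print(out)   # [3, 2, 1]
```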

    - Jan
  • OSR_Community_User Member Posts: 110,217
    At 04:06 PM 05/30/2000 -0700, Jan Bottorff wrote:
    >If random memory accessing is the limiting factor, you may find rearranging
    >your code to improve locality may help a lot. I know things like matrix
    >multiply's and such can go lots faster with code that understands memory
    >delays. For example, you might preallocate data structures instead of just
    >calling new, so instead of bouncing all over memory, your stay within a
    >much smaller range when walking a specific data structure.

    Very true. I take this one step further: Not only do I preallocate arrays of
    structures, but I also reuse them from bottom to top. This packs the active
    structures near one end of the array, making them contiguous and more likely
    to fit within fewer NT memory pages. By the way, NT completion ports assign
    threads in LIFO order for the same reason: their stacks are more likely to
    be nearby (in L1 or L2).
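    The bottom-to-top reuse can be demonstrated with a toy sketch (invented code, not the original): with a LIFO free list, even after heavy alloc/free churn the long-lived entries stay packed at the low end of the array.

```python
# Toy demonstration: LIFO reuse keeps the live entries of a preallocated
# array packed near index 0, even after lots of alloc/free churn.

free = list(range(99, -1, -1))   # 100-slot pool; .pop() hands out slot 0 first
live = set()

def alloc():
    i = free.pop()
    live.add(i)
    return i

def release(i):
    live.discard(i)
    free.append(i)               # goes back on top of the stack: reused next

retained = []
for step in range(500):
    batch = [alloc() for _ in range(5)]
    if step % 20 == 0:           # occasionally keep a slot live long-term
        retained.append(batch.pop())
    for i in batch:
        release(i)

# 25 long-lived entries survive, and all of them sit in the bottom 30 slots
# of the 100-slot pool: contiguous, cache- and page-friendly.
print(len(live), max(live))
```

    A FIFO free list would instead cycle through all 100 slots, spreading the survivors across the whole pool.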

