I believe the number applies to the total stall time of the thread, which
is why I pointed out that the numbers would possibly be a couple of orders
of magnitude lower for a RAMdisk.
Note that “a few hundred microseconds” is a lot of cycles on a 2.8GHz
processor, and a superscalar CPU can retire several instructions per
cycle (2-6, if I’ve read the Core docs correctly). So at, say, 3
instructions per cycle, that’s roughly 9 instructions per nanosecond,
9000 instructions per microsecond, and 900,000
instructions in a hundred microseconds. So it sounds like under optimal
conditions, that’s six orders of magnitude degradation. Executing the
code out of slow RAM will lose less, because of instruction prefetch,
instruction pipelining, L1 and L2 caching and speculative execution. None
of which we had on the 360/67, and with a factor-of-10 slower memory, the
overhead of the page faults was higher than the cost of direct execution
from slow RAM.
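
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name is made up for illustration, and the 3 instructions per cycle is the guess used above, not a measured figure. Note that 2.8GHz times 3 is 8.4 instructions per nanosecond, which the text rounds up to 9, so the unrounded result for 100 microseconds is about 840,000 rather than 900,000.

```python
def instructions_lost(stall_seconds, clock_hz=2.8e9, ipc=3.0):
    """Instructions the CPU could have retired while a thread was stalled.

    clock_hz and ipc are the assumptions from the text (2.8GHz, 3 per cycle).
    """
    return stall_seconds * clock_hz * ipc


# Stall lengths in microseconds, chosen only to bracket "a few hundred".
for stall_us in (1, 100, 300):
    lost = instructions_lost(stall_us * 1e-6)
    print(f"{stall_us:>4} us stall ~ {lost:,.0f} instructions not executed")
```

Changing ipc to 2 or 6 moves the result by a factor of three at most, so the order of magnitude holds across the whole 2-6 range.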
So there is a simpler solution than using a RAMdisk for paging: Don’t do
paging. You can turn paging off, execute from slow RAM, and still be
ahead of the game. You get your savings in system cost, and quite likely
better performance than the paging solution, with zero effort expended.
By the way, there is no way to accurately predict the performance. A lot
of it depends upon execution patterns and data access patterns. You
actually have to build and measure. But you may well have a simpler
solution if you build only one kind of memory onto the board. That saves
fabrication cost, reduces system complexity, and costs essentially nothing
in effort spent figuring out how to reroute paging to the RAMdisk, so I’d
go that way first.
Otherwise, you are “pre-optimizing” without any substantiating data. This
never works out well.
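
As one illustration of what "build and measure" could look like, here is a rough Python sketch that times the first touch of freshly mapped pages against a second pass over the same pages. It is not a substitute for measuring the real workload on the real board: the function names are hypothetical, interpreter overhead is included in both passes, and features such as large pages or OS prefetching can hide faults entirely. The gap between the two passes is only a crude indication of soft fault cost on the machine where it runs.

```python
import mmap
import time


def touch_pages(buf, n_pages, page_size):
    """Write one byte into each page and return the elapsed nanoseconds."""
    start = time.perf_counter_ns()
    for i in range(n_pages):
        buf[i * page_size] = 1
    return time.perf_counter_ns() - start


def estimate_touch_cost_ns(n_pages=16384):
    """Return (first_pass_ns_per_page, second_pass_ns_per_page).

    The first pass normally takes a demand-zero (soft) fault per page;
    the second pass hits pages that are already resident.
    """
    page_size = mmap.PAGESIZE
    buf = mmap.mmap(-1, n_pages * page_size)  # anonymous, not file-backed
    try:
        first = touch_pages(buf, n_pages, page_size)
        second = touch_pages(buf, n_pages, page_size)
    finally:
        buf.close()
    return first / n_pages, second / n_pages


if __name__ == "__main__":
    first, second = estimate_touch_cost_ns()
    print(f"first touch : {first:8.1f} ns/page")
    print(f"second touch: {second:8.1f} ns/page")
```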
joe
> A page fault costs 30,000,000-40,000,000 clock cycles
That number seems a little fishy; back-of-the-envelope calculations suggest
a very different number than 30 million cycles.
A page fault might take 30-40 million clocks of time to resolve, but
unless the I/O subsystem is just polling, it doesn’t seem likely it will
consume 30-40 million clocks of processor power. On a 3 GHz processor, 30
million clocks is 1/100 of a second, which is about one disk seek time and
would cap the rate at 100 page faults/sec. I believe I’ve seen soft fault
rates (a soft fault is essentially a page fault without the physical I/O)
a couple of orders of magnitude higher, like 10K+/sec, which would work
out to more like 300K clock cycles per fault.
I know there are storage controllers that can execute 500K+ IOPS, and
if page fault handling took a similar number of cycles to an I/O operation,
that would only be 6 thousand cycles. Factoring in multiple cores is a
little fuzzy, as I don’t know whether that 500K IOPS is consuming 16 cores
at 100% CPU load or something a little kinder.
Another data point, some Googling finds
http://blogs.technet.com/b/askperf/archive/2008/01/29/an-overview-of-troubleshooting-memory-issues-part-two.aspx,
where someone reports 200K soft faults/sec, so that would be 15K clocks
each on a 3 GHz processor.
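
All of the cycle counts above come from the same division: clock rate over events per second. A small Python sketch, assuming a single 3 GHz core and treating each rate as sustained, reproduces them. The function names and labels are mine, and the result is an upper bound on cycles per event, since it says nothing about how many cores share the work or how much of the time is spent waiting.

```python
CLOCK_HZ = 3.0e9  # the 3 GHz clock assumed in the text


def cycles_per_event(events_per_sec, clock_hz=CLOCK_HZ):
    """Clock cycles that elapse per event at a sustained event rate."""
    return clock_hz / events_per_sec


def seconds_for_cycles(cycles, clock_hz=CLOCK_HZ):
    """Elapsed time represented by a given number of clock cycles."""
    return cycles / clock_hz


# Rates quoted in the text; labels are shorthand, not measurements.
rates = [
    ("hard faults at disk-seek speed", 100),
    ("soft faults, remembered figure", 10_000),
    ("soft faults, TechNet report", 200_000),
    ("storage controller I/Os", 500_000),
]

for label, rate in rates:
    print(f"{label:<32} {rate:>8,}/sec -> {cycles_per_event(rate):>12,.0f} cycles each")

print(f"30M cycles = {seconds_for_cycles(30e6) * 1e3:.0f} ms of elapsed time")
```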
I might believe a page fault takes 30K cycles, but not 30M. As there are
flash storage devices that can do 500K-1000K random read IOPS, hard
page faults may not really take 1/100 of a second of elapsed time anymore
either.
Jan
NTDEV is sponsored by OSR