Also keep cache effects in mind. A core can sit idle for hundreds of cycles waiting for a read from remote memory in a NUMA system, regardless of how many micro-ops an instruction requires.
Although much maligned, hyper-threading really does improve overall performance when software with a poor memory layout (the default in OO languages) executes.
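To make the memory-layout point concrete, here is a minimal sketch in C. The names sum_array, sum_list and struct node are made up for illustration and do not come from any real code base. Both functions compute the same sum; the difference is only in how they touch memory. The array walk reads consecutive addresses, which the cache and prefetcher handle well. The list walk cannot even issue the next load until the previous one has returned the pointer, so every cache miss is paid in full.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type: each element lives wherever the allocator put it. */
struct node {
    long value;
    struct node *next;
};

/* Contiguous layout: the addresses are known up front, so loads can overlap. */
long sum_array(const long *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Pointer chasing: the address of the next load depends on the previous load. */
long sum_list(const struct node *head)
{
    long s = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```

The sketch deliberately contains no timing code, since the actual gap depends on the machine, the working-set size and where the nodes end up in memory.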
From: Maxim S. Shatskih
Sent: Monday, October 20, 2014 7:00 PM
To: Windows System Software Devs Interest List
However, consider this. If I have an instruction that needs to read
from memory, first I need to figure out what address I need to read.
That might involve pulling values from two registers, adding them, and
adding a constant from the instruction.
…and since the 80286 this has been done by a separate set of silicon gates, in parallel with the main execution flow.
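The address calculation described above (two registers plus a constant from the instruction) can be sketched in C as follows. This is only a software model of the arithmetic, assuming the usual x86 base + index*scale + displacement form; the function name effective_address and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of an x86-style effective address:
 *   base  - value of the base register
 *   index - value of the index register
 *   scale - 1, 2, 4 or 8, encoded in the instruction
 *   disp  - signed constant encoded in the instruction
 */
uint64_t effective_address(uint64_t base, uint64_t index,
                           unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    /* Sign-extend the displacement, then add modulo 2^64. */
    return base + index * scale + (uint64_t)(int64_t)disp;
}
```

For example, effective_address(0x1000, 3, 4, 8) gives 0x1014. In hardware this sum is produced by dedicated address-generation logic, not by the ALU that executes ordinary ADD instructions.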
Division is another interesting example. There is still no algorithm
for doing a division in one cycle.
Yes, though some CPUs (IIRC the DEC Alpha, though I could be wrong, and possibly modern x86 CPUs too) have a “flash multiplier” that can do a MUL in 1 cycle.
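To illustrate why division is hard to do in one step, here is a sketch of textbook shift-and-subtract (restoring) division in C. It is not a description of how any particular CPU divides, and the name divide_restoring is made up. The point is the data dependency: each quotient bit is decided by comparing against the remainder left over from the previous bit, so the steps cannot all be evaluated at once the way the partial products of a multiplication can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit unsigned division, one quotient bit per iteration. */
uint32_t divide_restoring(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    uint32_t q = 0;
    uint64_t r = 0;  /* wider than 32 bits so the shift below cannot overflow */

    for (int i = 31; i >= 0; --i) {
        /* Bring down the next dividend bit. */
        r = (r << 1) | ((n >> i) & 1u);
        /* This comparison depends on r from the previous iteration. */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    if (rem != NULL)
        *rem = (uint32_t)r;
    return q;
}
```

Real dividers typically produce more than one quotient bit per step, but the serial dependency between steps remains.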
–
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com