I have a friend who measured this years ago, and he was seeing
interrupt-to-user-space latencies regularly > 250ms, and often as long as
450ms (that’s ms, not us) in what I think was Win2K. Much of this appeared
to be due to the scheduler. Just because an IRP is completed doesn’t mean
the thread is going to run in the foreseeable future. It is going to run
when the scheduler damn well feels like running it, and even if you give a
huge priority boost on IoCompleteRequest or its WDF equivalent, all you are
saying is “Hey, scheduler, this thread is now runnable, at this priority, so
please run it when you get around to it”. Eventually, it gets around to it.
Since he really cared about real-time responsiveness, this was far too slow
to be usable. He uses an RTOS for his work now.
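In case it helps, the boost is just the PriorityBoost argument to
IoCompleteRequest; a minimal WDM-style sketch (the status and the byte count
are placeholders, not from any particular driver):

    // Complete a read IRP with a priority boost for the waiting thread.
    // The boost only raises the priority at which the thread becomes
    // runnable; it does not force the scheduler to run it right away.
    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = bytesTransferred;  // placeholder byte count
    IoCompleteRequest(Irp, IO_SOUND_INCREMENT);    // one of the large standard boosts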
OTOH, there are many changes in Vista+, such as the Multimedia Class
Scheduler Service (MMCSS), whose purpose is to improve response to very
time-sensitive tasks.
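If you want to opt a user-mode thread into MMCSS, the call is
AvSetMmThreadCharacteristics from avrt.h; a minimal sketch (the “Pro Audio”
task name is just one of the registered classes, pick whatever fits):

    #include <windows.h>
    #include <avrt.h>       // link with avrt.lib

    DWORD taskIndex = 0;
    // Register this thread with MMCSS so it is scheduled ahead of
    // ordinary time-sliced work.
    HANDLE hTask = AvSetMmThreadCharacteristics(TEXT("Pro Audio"), &taskIndex);
    if (hTask != NULL) {
        // ... time-sensitive processing ...
        AvRevertMmThreadCharacteristics(hTask);
    }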
Bottom line, you have to build and measure. And you have to measure on real
loads that your end users will see, not on a dedicated box running in your
development lab.
Note that interrupt rates aren’t always a good measure, because you can get
lots of interrupts on lots of devices and still have high latency to user
space. When I was doing some time-sensitive work some years ago, if we used
synchronous I/O the latencies were high-variance around 100ms. The trick
was to pump down about 50 ReadFiles on an asynchronous open, and I could get
the inter-packet timings down to about 80us, but this meant I would have
about 50x80us = 4000us latency overall. If I dropped to 40 ReadFiles
pending, I got really bad inter-packet timings; they started to scatter up
around 100ms again. And this was on a nearly bare machine (only one app
running, but it was the “real” load for this problem domain, for a Project I
Can’t Talk About).
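The shape of that trick, if it helps, was roughly this; a rough sketch only
(the device path, packet size, and error handling here are placeholders, not
the real project code):

    #include <windows.h>

    #define NUM_PENDING  50      // roughly what worked for us
    #define PACKET_SIZE  4096    // placeholder packet size

    int main(void)
    {
        // Placeholder device path; the point is FILE_FLAG_OVERLAPPED.
        HANDLE h = CreateFile(TEXT("\\\\.\\MyDevice"), GENERIC_READ, 0, NULL,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        static BYTE buf[NUM_PENDING][PACKET_SIZE];
        static OVERLAPPED ov[NUM_PENDING];

        // Pump down NUM_PENDING reads so the driver always has a buffer
        // ready to complete into; each call returns ERROR_IO_PENDING.
        for (int i = 0; i < NUM_PENDING; i++) {
            ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            if (!ReadFile(h, buf[i], PACKET_SIZE, NULL, &ov[i]) &&
                GetLastError() != ERROR_IO_PENDING) return 1;
        }

        for (;;) {
            for (int i = 0; i < NUM_PENDING; i++) {
                DWORD got;
                // Wait for this read, consume the data, then repost it.
                if (!GetOverlappedResult(h, &ov[i], &got, TRUE)) continue;
                /* ... process buf[i] (got bytes) ... */
                ResetEvent(ov[i].hEvent);
                if (!ReadFile(h, buf[i], PACKET_SIZE, NULL, &ov[i]) &&
                    GetLastError() != ERROR_IO_PENDING) return 1;
            }
        }
    }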
Again, note that this is nearly completely independent of bus architecture,
since the dominant cost is scheduler delay, which is not a fixed overhead
but is essentially priority-driven.
joe
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
choward@ix.netcom.com
Sent: Sunday, March 06, 2011 3:47 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] What’s a ballpark figure for PCIe interrupt-to-userspace
latency?
Before you get excited about the potential for sub-microsecond latencies,
keep in mind two things – first, as Peter pointed out there are many, many
factors in a machine which can dramatically gum up the works … I’m using
Daniel’s most excellent latency monitor [much better than Kernrate!!] on my
development machine [which isn’t a slouch by any means] and I’m seeing
latencies of 50-80us for most stuff and a few outliers at 200+ us [1ms,
yeesh, just get a serial port then!]. InfiniBand MPI computers are probably
the most hand-groomed and lovingly optimized boxes on the planet and *they*
don’t get 500ns even under Linux … [our latest benchmarks were at
900-980ns].
Second, there is a *world* of difference between Linux and Windows … for
this purpose, primarily that Linux can be run as an asymmetric
multiprocessing system while Windows is strictly a symmetric multiprocessing
one. A good analogy is a bunch of kids playing in a sandbox. With Linux this
sandbox is basically “Mad Max: Thunderdome”, with each kid hogging as much
sand as they want/can get, hogging the pail/shovel, whacking other kids in
the head if they want, etc.
Under Linux there is no problem at all grabbing a core at the first
interrupt and keeping this core entirely to yourself until you feel like
releasing it (if ever). As long as you can grab enough memory up front you
don’t even care about the OS paging memory, and as long as you can expose a
pinned region or BAR to the other “kids” then you’re good. Contrast this to
Windows, which is basically the sandbox being ruled by a grumpy despot who
barely tolerates the kids at all and doles out sand/pails/shovels on a
miserly basis and requires them to be returned immediately, if not sooner.
This makes the Linux “mine, all mine!” paradigm simply impossible …
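For the curious, that “mine, all mine!” setup on Linux usually boils down to
pinning the thread to a core and locking its memory; a rough sketch (the
core number is an arbitrary example):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mman.h>

    static int claim_core_and_memory(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);              /* arbitrary core for the example */

        /* Keep this thread on one core, away from the other "kids". */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            return -1;

        /* Lock current and future pages so the pager never interferes. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            return -1;

        return 0;
    }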
There is a good catchphrase in the “Mantracker” TV show … “know your land,
know your prey” which is appropriate here. Does your hardware support
MSI-X, and can you factor interrupts into groups to take advantage of it?
Are there other interrupt-intensive cards running on the system? What does
your usermode app do with the data once it gets it? Are you expecting only
a few interrupts or a constant barrage? How many cores do you expect to
have available on the machine? Is this targeting Xeon or Opteron CPUs, and
Gen1 or Gen2 PCIe? Is the platform a Harpertown or Nehalem, Barcelona or
[fill in the blank]? All of these will significantly affect a design for the
lowest possible latency, and all need to be considered.
General-case SWAG: at less than about 10K events per second, with no MSI-X
support, and using an inverted IOCTL, you will likely see ballpark 70-100us
latency with some outliers at 200-300us … but again, I/we/you need to know
a lot more about the capabilities of your HW and the ecosystem it will be
living in to optimize further and get closer to my [and the Mellanox]
numbers …
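By inverted IOCTL I mean the usual inverted-call pattern: the app keeps
overlapped DeviceIoControl requests pended and the driver completes one per
event; a rough user-mode sketch (the IOCTL code, device handle, and buffer
sizes are placeholders):

    #include <windows.h>
    #include <winioctl.h>

    #define NUM_INVERTED 8      // placeholder depth of pended requests
    #define IOCTL_MYDEV_WAIT_EVENT \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

    void pend_inverted_calls(HANDLE hDevice)
    {
        static OVERLAPPED ov[NUM_INVERTED];
        static BYTE out[NUM_INVERTED][64];   // placeholder output buffer size

        for (int i = 0; i < NUM_INVERTED; i++) {
            ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
            // Each call returns ERROR_IO_PENDING; the driver parks the
            // request and completes it when the hardware event arrives.
            DeviceIoControl(hDevice, IOCTL_MYDEV_WAIT_EVENT,
                            NULL, 0, out[i], sizeof(out[i]), NULL, &ov[i]);
        }
        // The app then waits on the events (or a completion port) and
        // reposts each request after consuming its data.
    }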
Cheers!