What's a ballpark figure for PCIe interrupt-to-userspace latency?

Assume you had a PCIe card on a fast, modern system (say, a Sandy Bridge @ ~3.5 GHz). Now say that in response to some stimulus, the card generates an interrupt. The kernel driver acknowledges the interrupt, schedules a DPC, and then the DPC grabs an inverted-call IOCTL from somewhere and completes it. (That’s how it normally goes, right?) Assume the IOCTL has a payload of, say, 200 bytes of information.
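
For reference, the user-mode half of that inverted-call pattern usually looks something like the sketch below. This is only a minimal illustration under assumed names – IOCTL_MYDEV_WAIT_EVENT and the 200-byte MYDEV_EVENT structure are hypothetical, not from any real driver – and it assumes the handle was opened with FILE_FLAG_OVERLAPPED:

    #include <windows.h>
    #include <winioctl.h>

    /* Hypothetical names for illustration only. */
    #define IOCTL_MYDEV_WAIT_EVENT \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
    typedef struct _MYDEV_EVENT { UCHAR Payload[200]; } MYDEV_EVENT;

    void PumpEvents(HANDLE hDev)   /* hDev opened with FILE_FLAG_OVERLAPPED */
    {
        MYDEV_EVENT evt;
        OVERLAPPED  ov = { 0 };
        DWORD       bytes;

        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

        for (;;) {
            /* "Inverted call": hand the driver an IOCTL it will pend until the
               hardware raises an interrupt; the driver's DPC completes it later. */
            if (!DeviceIoControl(hDev, IOCTL_MYDEV_WAIT_EVENT, NULL, 0,
                                 &evt, sizeof(evt), NULL, &ov) &&
                GetLastError() != ERROR_IO_PENDING) {
                break;   /* real code would log the failure */
            }
            if (!GetOverlappedResult(hDev, &ov, &bytes, TRUE))
                break;
            /* ... consume the 200-byte payload in evt here ... */
        }
        CloseHandle(ov.hEvent);
    }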

In your guys’ experience, how long on “average” does this whole process take? A handful of microseconds, perhaps? (Feel free to state any other assumptions necessary besides those that I gave above.)

Two follow-up questions as well:

  1. Could lower latency be achieved by just polling the driver in a tight loop from userspace, and having the ISR store away the relevant information for later retrieval?

  2. Off-topic for this list but I plead for mercy: is the timing significantly different on Linux to accomplish something comparable to this?

You’ve got five steps there; take them in order (this is based on PCIe Gen2 x16 cards on a true Gen2 x16 bus using Server 2K8 R2 on a Nehalem-based chipset) … your mileage may vary, of course, and these are “real world” numbers …

  • Board signals the MSI on the bus, TLPs move around, the OS does its thing, an MSI interrupt is asserted: sub-microsecond, observed 80-160ns
  • Interrupt is triggered, ISR does the right thing by pushing a DPC: can be around 30-50ns if the card passes the “what happened here” info as part of an MSI-X interrupt, as much as 1us if the driver needs to poll the card to figure out what happened (remember, that’s at least two bus accesses and a cache line hit if you need to do this – another excellent reason to use MSI-X to differentiate interrupt sources). This is also where you can do NUMA-based optimization.
  • DPC is dispatched to the handler, which I’ve seen happen as quickly as 500-800ns and as long as 6-8us depending on the DPC queue depth for the processor (which is why it’s important not to try to “optimize” the DPC scheduling by using DPC processor affinity – let the OS do its thing with the round-robin scheduler) … there is a free utility (DPCLat) which can measure this, as can PerfMon
  • The DPC does its thing, which could take as little as 1us and up to the recommended maximum of 10us. Anything more than that should be handed off to a passive-level thread or to a kernel APC
  • A pending IOCTL is completed and dropped back down to userspace, which usually takes a bit under 1us (double that for the round trip); a minimal sketch of this ISR/DPC/completion path follows below
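
Since steps 2-5 come up repeatedly in this thread, here is a minimal WDM-flavoured sketch of that path. It assumes MSI-X (so the message ID already says what happened), a KDPC initialized at AddDevice time, and a hypothetical DequeuePendedEventIoctl() helper that pops one pended, METHOD_BUFFERED inverted-call IRP from a cancel-safe queue – none of this is lifted from a real driver:

    #include <ntddk.h>

    typedef struct _DEVICE_EXTENSION {   /* hypothetical, trimmed to what's used here */
        KDPC  EventDpc;                  /* KeInitializeDpc(&EventDpc, EventDpcRoutine, devExt) at AddDevice */
        UCHAR LatestEvent[200];          /* assumed DMA'd here by the card before it interrupts */
        /* ... cancel-safe queue for the pended inverted-call IRPs, etc. ... */
    } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

    PIRP DequeuePendedEventIoctl(PDEVICE_EXTENSION DevExt);  /* hypothetical cancel-safe-queue pop */

    /* ISR registered for an MSI-X message (KMESSAGE_SERVICE_ROUTINE). */
    BOOLEAN MsiXIsr(PKINTERRUPT Interrupt, PVOID ServiceContext, ULONG MessageId)
    {
        PDEVICE_EXTENSION devExt = (PDEVICE_EXTENSION)ServiceContext;
        UNREFERENCED_PARAMETER(Interrupt);

        /* With MSI-X the MessageId already says "what happened", so no register
           reads are needed here - the 30-50ns case described above. */
        KeInsertQueueDpc(&devExt->EventDpc, (PVOID)(ULONG_PTR)MessageId, NULL);
        return TRUE;
    }

    /* DPC: copy the 200-byte event and complete one pended inverted-call IOCTL. */
    VOID EventDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        PDEVICE_EXTENSION devExt = (PDEVICE_EXTENSION)Context;
        PIRP irp;
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        irp = DequeuePendedEventIoctl(devExt);
        if (irp == NULL)
            return;                               /* nobody waiting; drop or buffer the event */

        RtlCopyMemory(irp->AssociatedIrp.SystemBuffer, devExt->LatestEvent, 200);
        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = 200;
        IoCompleteRequest(irp, IO_NO_INCREMENT);  /* the ~1us completion step above */
    }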

So, an expected latency for a typical PCIe card to push some useful info from the card to userspace would be .05-.1us + .3-1us + .5-8us + 1-10us + 1us (x2) = 5 to 22us … there are some caveats here though. First, once the usermode thread gets the return it has to process the results at passive level, which means it will get preempted by every other DPC and APC scheduled to the processor, including all of the interrupts spooling in from the card (thus the famous interrupt storm and, essentially, a priority-inversion-based denial of service for your card). Next, you will potentially need to handle up to 2.5 million interrupts per second which (if you’re using an inverted-call mechanism to handle each one) means you’re pushing 2.5 million IRPs around per second – don’t think the OS would appreciate that too much. Finally, you need to decide how you’re going to deal with the memory for those 200 bytes per IOCTL – again, 2.5 million small allocations won’t make the OS very happy.
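
On that last point: if each event really does need its own 200-byte record, the conventional mitigation is a lookaside list rather than a pool allocation per event. A minimal sketch, with a made-up EVENT_RECORD type and 'RtvE' tag:

    #include <ntddk.h>

    typedef struct _EVENT_RECORD { UCHAR Payload[200]; } EVENT_RECORD;
    static NPAGED_LOOKASIDE_LIST g_EventLookaside;

    VOID EventPoolInit(VOID)         /* call once from DriverEntry/AddDevice */
    {
        ExInitializeNPagedLookasideList(&g_EventLookaside, NULL, NULL, 0,
                                        sizeof(EVENT_RECORD), 'RtvE', 0);
    }

    EVENT_RECORD *EventAlloc(VOID)   /* non-paged list, so fine at DISPATCH_LEVEL */
    {
        return (EVENT_RECORD *)ExAllocateFromNPagedLookasideList(&g_EventLookaside);
    }

    VOID EventFree(EVENT_RECORD *Rec)
    {
        ExFreeToNPagedLookasideList(&g_EventLookaside, Rec);
    }

    VOID EventPoolTeardown(VOID)     /* call from DriverUnload */
    {
        ExDeleteNPagedLookasideList(&g_EventLookaside);
    }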

Answering your other questions (which I can do based on the WinOF source code and the Mellanox InfiniBand drivers, as well as the QLogic Linux driver source code, all publicly available) …

  1. Polling the driver in a tight loop looks good on paper (have the ISR update a shared memory segment, or even better simply map a BAR segment on the card containing the register you care about into userspace, have a thread poll for a change, easy peasy) but will suffer from dramatic latency variations: the thread doing the polling runs at passive level, which leaves it at the mercy of every DPC, APC and higher-priority thread on that processor, as well as effectively pinning that core down. Add in the race condition inherent in two clock domains sharing the same memory (the card is potentially updating the location while you’re reading it), the delay for the InterlockedExchange to synchronize things for the read, etc. etc. and you’re not heading down a happy path. This is what the Verbs API used by MPI does for InfiniBand, and latencies of 50-100us are not uncommon. (A sketch of such a polling loop follows after this list.)

  2. Linux (again, based on the QLogic and Mellanox drivers) handles everything in an ISR to get around the latency problem – the interrupt, the shuffling of memory addresses, the walking of the dog, the break for coffee, everything. As soon as the first interrupt hits they start work and essentially ignore further interrupts until the traffic stops for however long the traffic runs … seconds, minutes, hours, days, whatever and all at ISR priority, pinning one (or more) cores. Not appropriate for Windows.
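
For what it’s worth, the user-mode half of option 1 (polling a mapped status page or BAR window) tends to boil down to a loop like the one below. It assumes the driver has already mapped the shared page into the process and that the card (or ISR) bumps a monotonically increasing sequence counter at a known offset – both assumptions, not anything from the drivers discussed here:

    #include <windows.h>

    typedef struct _SHARED_STATUS {
        volatile LONG Sequence;      /* bumped by the ISR / the card per event */
        UCHAR         Payload[200];
    } SHARED_STATUS;

    void PollLoop(SHARED_STATUS *shared)   /* 'shared' = page mapped in by the driver */
    {
        LONG last = shared->Sequence;
        LONG cur;

        for (;;) {
            cur = shared->Sequence;
            if (cur != last) {
                MemoryBarrier();     /* read the payload only after seeing the counter move */
                /* ... copy out and consume shared->Payload ...
                   Note the race described above: the card may still be updating
                   Payload when Sequence changes; real designs use a seqlock-style
                   double increment or per-slot ready flags. */
                last = cur;
            } else {
                YieldProcessor();    /* pause instruction; keeps the spin polite */
            }
        }
    }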

Both Linux (the Mellanox and QLogic drivers) and Windows (the Mellanox driver; I can’t discuss the QLogic Windows driver) mainly use “zero-copy” RDMA for userspace interactions, which is essentially an inverted call to the card rather than to the OS. Userspace allocates a big chunk of memory (1-10MB is typical) and pushes it to the driver; the driver pins it, gets PA addresses for the MDL and programs the card with those PAs. The card maintains a very large table of PAs programmed by the driver (typically enough for 256K pages) and as data comes in it DMAs data into those pages and interrupts the driver. This is where Linux and Windows differ – for Linux, they update a memory location that is being monitored by (typically) an MPICH-2 app thread running on another core which is itself spinning in a loop waiting for something to happen. For Windows, the driver replaces the “used” PA pages in the card’s DMA target table with “fresh” PA pages in the ISR, and once a full “buffer” of pages has been filled it triggers a DPC to notify the usermode app of new data – for the NetworkDirect protocol, this is through an inverted call … so the user app allocates a 10MB buffer, does an overlapped “read” to the driver and when the overlapped completes digs through the 10MB to see what it cares about … this gives good bandwidth, but nowhere near the latency that Linux gives (the Linux drivers have sub-us latency for MPICH-2 traffic, Windows is around 5-8us).
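
The “driver pins it and gets PA addresses for the MDL” step above maps onto the standard MDL calls; roughly as follows (error handling trimmed, ProgramCardPageTable() is a made-up stand-in for writing the card’s DMA target table, and a production driver would go through the DMA adapter object / map registers rather than raw PFNs):

    #include <ntddk.h>

    VOID ProgramCardPageTable(PPFN_NUMBER Pfns, ULONG PageCount);  /* hypothetical */

    NTSTATUS PinUserBuffer(PVOID UserVa, ULONG Length, PMDL *MdlOut)
    {
        PMDL        mdl = IoAllocateMdl(UserVa, Length, FALSE, FALSE, NULL);
        PPFN_NUMBER pfns;
        ULONG       pages;

        if (mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            /* Pin the user pages so the card can DMA into them at any time. */
            MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(mdl);
            return GetExceptionCode();
        }

        /* Harvest the physical page numbers backing the buffer ... */
        pfns  = MmGetMdlPfnArray(mdl);
        pages = ADDRESS_AND_SIZE_TO_SPAN_PAGES(MmGetMdlVirtualAddress(mdl),
                                               MmGetMdlByteCount(mdl));

        /* ... and push them into the card's DMA target table. */
        ProgramCardPageTable(pfns, pages);

        *MdlOut = mdl;   /* MmUnlockPages + IoFreeMdl when the app tears down */
        return STATUS_SUCCESS;
    }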

So, long wall of text bottom line … for low interrupt counts (less than around 10K/sec) your inverted call will work fine, plan on around 5-22us latency depending on use of MSI-X interrupt segregation. Lower latencies aren’t really possible in my experience …

As I’ve mentioned previously (somewhere) it would be valuable for you to dig through the WinOF OpenIB source code (it’s in a public SVN repo) written largely by the Mellanox boys, this driver lives and breathes high speed interrupt/ PCI extended space/ low latency/ PCIe Gen2 x16 stuff and is a useful (if obtuse) source of info for this …

Cheers!

I’ve got a utility which displays all DPCs and ISRs, their execution times and
the drivers responsible for them. This should give you some idea (Vista and
higher only).
http://www.resplendence.com/latencymon

//Daniel

You asked about AVERAGE DPC/ISR latency. THE most important thing to understand is that the average is typically pretty good. The PROBLEM is that the distribution has a VERY long tail, out several standard deviations. So, in my experience, what kills you is the WORST CASE latency.

Mr. Terhell’s DPC/ISR latency monitor is a great tool.

Mr. Howard: Thanks for those timings. It’s very generous of you to post those for the community… extremely helpful. However, they seem to me to be best case… the result of careful tuning. There are a lot of audio engineers (and real-time data processing folks) who would *love* to be able to count on no worse than 22us ISR-to-DPC latency.

In my experience, it is VERY difficult to know what worst case latency to expect on an ARBITRARY system, even a new one with PCIe and a card that does MSI. It’s all about system configuration, workloads, and what OTHER drivers are doing on the system at the time. You can measure and configure all you want in the lab, then some user in the field updates a driver or his BIOS and WHAM… the latency profile changes dramatically. Of course, you can prevent this if you can ship a locked-down configuration.

Some systems routinely experience horrific ISR-to-DPC latency… in the past couple of years I’ve seen worst-case latencies measured well over 500us and even more than 1ms. Really, I kid you not.

Even on new, modern systems, I personally wouldn’t count on worst-case ISR-to-DPC latencies being < 150-200us. Though by carefully choosing your hardware (and drivers) you can definitely achieve a latency profile similar to that quoted by Mr. Howard.

I hope that’s helpful,

Peter
OSR

Hey everyone, thanks – especially to Craig, whose explanation was extraordinarily in-depth.

Craig, I’m glad I asked about the difference in performance versus Linux – I’m quite surprised at the magnitude, in fact. Especially since you mentioned that interrupt-to-userspace could be sub-microsecond. That’s twenty times faster than Windows?

I’m assuming you obviously still have to incur the 80-160ns cost of signaling the interrupt on the bus, and the 30-50ns cost of the ISR running and doing whatever it’s going to do such that the userspace component picks up a “change”. And by this time the DMA has already occurred to transfer the appropriate data into system memory.

But is that about all the work that needs to be done? Essentially, you cut out all the latency of scheduling a DPC, having the DPC fire and run, and so on? Such that the end-to-end latency is sub-microsecond? Could you venture a guess here in the “average” case (apologies to Peter) – maybe 500 nanoseconds or so?

Before you get excited about the potential for sub-microsecond latencies, keep in mind two things – first, as Peter pointed out there are many, many factors in a machine which can dramatically gum up the works … I’m using Daniel’s most excellent latency monitor [much better than Kernrate!!] on my development machine [which isn’t a slouch by any means] and I’m seeing latencies of 50-80us for most stuff and a few outliers at 200+us [1ms, yeesh, just get a serial port then!]. InfiniBand MPI computers are probably the most hand-groomed and lovingly optimized boxes on the planet and *they* don’t get 500ns even under Linux … [our latest benchmarks were at 900-980ns]

Second, there is a *world* of difference between Linux and Windows … for this, primarily that Linux is an asymmetric multiprocessing system and Windows is a symmetric multiprocessing one. A good analogy is a bunch of kids playing in a sandbox. With Linux this sandbox is basically “Mad Max: Thunderdome” with each kid hogging as much sand as they want/can get, hogging the pail/shovel, whacking other kids in the head if they want, etc. Under Linux there is no problem at all grabbing a core at the first interrupt and keeping this core entirely to yourself until you feel like releasing it (if ever). As long as you can grab enough memory up front you don’t even care about the OS paging memory, and as long as you can expose a pinned region or BAR to the other “kids” then you’re good. Contrast this to Windows, which is basically the sandbox being ruled by a grumpy despot who barely tolerates the kids at all and doles out sand/pails/shovels on a miserly basis and requires them to be returned immediately if not sooner. This makes the Linux “mine, all mine!” paradigm simply impossible …

There is a good catchphrase in the “Mantracker” TV show … “know your land, know your prey” … which is appropriate here. Does your hardware support MSI-X, and can you factor interrupts into groups to take advantage of it? Are there other interrupt-intensive cards running on the system? What does your usermode app do with the data once it gets it? Are you expecting only a few interrupts or a constant barrage? How many cores do you expect to have available on the machine? Is this targeting Xeon or Opteron CPUs, and Gen1 or Gen2 PCIe? Is the motherboard a Harpertown or Nehalem, Barcelona or [fill in the blank]? All of these will significantly affect designing for the lowest possible latency, and need to be considered.

General case SWAG, less than about 10K events per second with no MSI-X support and using an inverted IOCTL will likely give you ballpark 70-100us latency with some outliers at 200-300us … but again, I/we/you need to know a lot more about the capabilities of your HW and the ecosystem it will be living in for better optimizations to get closer to my [and the Mellanox] numbers …

Cheers!

I have a friend who measured this years ago, and he was seeing
interrupt-to-user-space latency of regularly > 250ms, and often as long as
450ms (that’s ms, not us) in what I think was Win2K. Much of this appeared
to be due to the scheduler. Just because an IRP is completed doesn’t mean
the thread is going to run in the foreseeable future. It is going to run
when the scheduler damn well feels like running it, and even if you give
huge boost points on IoCompleteRequest or its WDF equivalent, all you are
saying is “Hey, scheduler, this thread is now runnable, at this priority, so
please run it when you get around to it”. Eventually, it gets around to it.

Since he really cared about real-time responsiveness, this was far too slow
to be usable. He uses an RTOS for his work now.

OTOH, there are many changes in Vista+, such as the MultiMedia Class
Scheduler Service (MMCSS), whose purpose is to improve response to very
time-sensitive tasks.

Bottom line, you have to build and measure. And you have to measure on real
loads that your end users will see, not on a dedicated box running in your
development lab.
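
One cheap way to “build and measure” the actual distribution (rather than the average) is to have the DPC stamp each completion with KeQueryPerformanceCounter() – user-mode QueryPerformanceCounter() shares the same time base – and bucket the deltas in the app. Sketch only; the kernelStamp value is assumed to arrive in an invented field of the completed payload:

    #include <windows.h>

    static volatile LONG64 g_buckets[6];   /* <10, <50, <100, <500, <1000, >=1000 us */

    void RecordLatency(LONGLONG kernelStamp)   /* stamp copied out of the completed payload */
    {
        static LARGE_INTEGER freq;
        LARGE_INTEGER now;
        double us;
        int b;

        if (freq.QuadPart == 0)
            QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&now);
        us = (now.QuadPart - kernelStamp) * 1e6 / (double)freq.QuadPart;

        /* Keep a coarse histogram so the tail is visible, not just the mean. */
        b = us < 10 ? 0 : us < 50 ? 1 : us < 100 ? 2 : us < 500 ? 3 : us < 1000 ? 4 : 5;
        InterlockedIncrement64(&g_buckets[b]);
    }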

Note that interrupt rates aren’t always a good measure, because you can get
lots of interrupts on lots of devices and still have high latency to user
space. When I was doing some time-sensitive work some years ago, if we used
synchronous I/O the latencies were highly variable, around 100ms. The trick
was to pump down about 50 ReadFiles on an asynchronous open, and I could get
the inter-packet timings down to about 80us, but this meant I would have
about 50x80us = 4000us latency overall. If I dropped to 40 ReadFiles
pending, I got really bad inter-packet timings; they started to scatter up
around 100ms again. And this was on a nearly-bare machine (only one app
running, but it was the “real” load for this problem domain, for a Project I
Can’t Talk About).
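
For anyone wanting to reproduce that “pump down ~50 reads” trick, the shape of it is roughly the following (the device name, packet size and ProcessPacket() consumer are placeholders; error handling trimmed):

    #include <windows.h>

    #define PENDING_READS 50
    #define PKT_SIZE      200

    typedef struct _READ_SLOT { OVERLAPPED Ov; BYTE Buf[PKT_SIZE]; } READ_SLOT;
    static READ_SLOT g_slots[PENDING_READS];

    void ProcessPacket(const BYTE *Data, DWORD Len);   /* placeholder consumer */

    void PumpReads(void)
    {
        HANDLE dev  = CreateFileW(L"\\\\.\\MyDevice", GENERIC_READ, 0, NULL,
                                  OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        HANDLE iocp = CreateIoCompletionPort(dev, NULL, 0, 0);
        int i;

        for (i = 0; i < PENDING_READS; i++)        /* keep ~50 reads permanently in flight */
            ReadFile(dev, g_slots[i].Buf, PKT_SIZE, NULL, &g_slots[i].Ov);

        for (;;) {
            DWORD bytes; ULONG_PTR key; OVERLAPPED *ov; READ_SLOT *slot; BOOL ok;

            ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
            if (ov == NULL)
                break;                             /* the port itself failed */
            slot = CONTAINING_RECORD(ov, READ_SLOT, Ov);
            if (ok)
                ProcessPacket(slot->Buf, bytes);
            ReadFile(dev, slot->Buf, PKT_SIZE, NULL, &slot->Ov);   /* re-arm immediately */
        }
    }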

Again, note that this is nearly completely independent of bus architecture,
since the dominant cost is the scheduler delays, which are not overheads,
but essentially are priority-driven.
joe

Do you need interrupts? If the driver provides a DMA buffer and an API for
the app to access it, the hardware can communicate with the app by both
polling some memory flags and sharing data. Then the latency is as low as
the app thread can manage to keep running. I presume such a thread running
at high priority will virtually (or completely?) monopolise a CPU so should
be able to respond in nanoseconds. Or is this a silly architecture? You
already suggested the app could poll in a tight loop. M
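
A dedicated polling thread along those lines is usually pinned and boosted so the scheduler keeps it running; something like the fragment below (the core number is arbitrary). Note it still shares its core with every ISR and DPC that targets that core, which is exactly the variance described earlier in the thread:

    #include <windows.h>

    void PreparePollingThread(void)
    {
        /* Pin the polling thread to one core and raise its priority. */
        SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 3);   /* core 3, for example */
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

        /* Optionally lift the whole process class as well (use with care). */
        SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    }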
