You’ve got five steps there, so take them in order (this is based on PCIe Gen2 x16 cards on a true Gen2 x16 bus, running Server 2K8 R2 on a Nehalem-based chipset) … your mileage may vary, of course, but these are “real world” numbers …
- The board signals the MSI on the bus, TLPs move around, the OS does its thing, and an MSI interrupt is asserted: sub-microsecond, observed 80-160ns
- The interrupt is triggered and the ISR does the right thing by pushing a DPC: this can be around 30-50ns if the card passes the “what happened here” info as part of an MSI-X interrupt, and as much as 1us if the driver needs to poll the card to figure out what happened (remember, that’s at least two bus accesses and a cache line hit if you need to do this, another excellent reason to use MSI-X to differentiate interrupt sources; see the sketch after this list). This is also where you can do NUMA-based optimization
- The DPC is dispatched to the handler, which I’ve seen happen as quickly as 500-800ns and as slowly as 6-8us depending on the DPC queue depth for the processor (which is why it’s important not to try to “optimize” DPC scheduling with DPC processor affinity; let the OS do its thing with the round-robin scheduler) … there is a free utility (DPCLat) that can measure this, as can PerfMon
- The DPC does its thing, which could take as little as 1us and up to the recommended maximum of 10us. Anything more than that should be handed off to a passive-level thread or a kernel APC
- A pending IOCTL is completed and dropped back down to userspace, which usually takes a bit under 1us (double that for the round trip)
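To put some code behind steps two through five, here’s roughly what the ISR/DPC pair looks like in KMDF. Everything here is a sketch of mine, not lifted from any of the drivers mentioned: the device context, the manual WDFQUEUE holding the pended inverted-call IOCTLs, and the one-ULONG “payload” are all placeholders for whatever your hardware actually hands back.

```c
#include <ntddk.h>
#include <wdf.h>

typedef struct _DEVICE_CONTEXT {
    WDFQUEUE PendingIoctlQueue;   // manual queue holding the user's pended "inverted call" IOCTLs
    ULONG    LastEventSource;     // which MSI-X message fired last (placeholder bookkeeping)
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;

WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext)

BOOLEAN
EvtInterruptIsr(_In_ WDFINTERRUPT Interrupt, _In_ ULONG MessageId)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));

    // With MSI-X the message ID itself says "what happened here", so no
    // register read (no bus round trip) is needed here at DIRQL.
    ctx->LastEventSource = MessageId;

    // Defer the real work: queue the DPC and get out.
    WdfInterruptQueueDpcForIsr(Interrupt);
    return TRUE;
}

VOID
EvtInterruptDpc(_In_ WDFINTERRUPT Interrupt, _In_ WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));
    WDFREQUEST request;
    PVOID outBuf;
    size_t outLen;

    UNREFERENCED_PARAMETER(AssociatedObject);

    // Step five: pull one of the pended inverted-call IOCTLs off the manual
    // queue and complete it back to usermode. Keep the total work here under
    // the ~10us budget; anything heavier belongs in a work item at passive.
    if (NT_SUCCESS(WdfIoQueueRetrieveNextRequest(ctx->PendingIoctlQueue, &request))) {
        if (NT_SUCCESS(WdfRequestRetrieveOutputBuffer(request, sizeof(ULONG), &outBuf, &outLen))) {
            *(PULONG)outBuf = ctx->LastEventSource;   // stand-in for the ~200 bytes of real payload
            WdfRequestCompleteWithInformation(request, STATUS_SUCCESS, sizeof(ULONG));
        } else {
            WdfRequestComplete(request, STATUS_BUFFER_TOO_SMALL);
        }
    }
}
```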
So, the expected latency for a typical PCIe card to push some useful info from the card up to userspace would be 0.05-0.1us + 0.3-1us + 0.5-8us + 1-10us + 1us (x2) = roughly 5 to 22us … there are some caveats here, though.

First, once the usermode thread gets the return it has to process the results at passive level, which means it will get preempted by every other DPC and APC scheduled to that processor, including all of the interrupts spooling in from the card (thus the famous interrupt storm, and essentially a priority-inversion-based denial of service against your own card). Next, you will potentially need to handle up to 2.5 million interrupts per second, which (if you’re using an inverted call mechanism to handle each one) means you’re pushing 2.5 million IRPs around per second; don’t think the OS would appreciate that too much. Finally, you need to decide how you’re going to deal with the memory for those 200 bytes per IOCTL; again, 2.5 million small allocations won’t make the OS very happy.
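For completeness, the usermode half of that inverted call usually looks something like the following. The device name, IOCTL code and payload struct are made up; the point is just to keep several requests pended so the driver always has one to complete, and to recycle each one as it comes back.

```c
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

// Placeholder IOCTL and payload; substitute whatever your driver defines.
#define IOCTL_MYDEV_WAIT_EVENT CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
#define PENDED_REQUESTS 8

typedef struct { UCHAR payload[200]; } EVENT_DATA;          // the ~200 bytes per IOCTL
typedef struct { OVERLAPPED ov; EVENT_DATA data; } PENDED_IO;

static void PostRequest(HANDLE dev, PENDED_IO *io)
{
    ZeroMemory(&io->ov, sizeof(io->ov));
    if (!DeviceIoControl(dev, IOCTL_MYDEV_WAIT_EVENT, NULL, 0,
                         &io->data, sizeof(io->data), NULL, &io->ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        fprintf(stderr, "DeviceIoControl failed: %lu\n", GetLastError());
    }
}

int main(void)
{
    HANDLE dev = CreateFileW(L"\\\\.\\MyLowLatencyDevice", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE) return 1;

    // Drain completions through an I/O completion port.
    HANDLE iocp = CreateIoCompletionPort(dev, NULL, 0, 1);
    PENDED_IO ios[PENDED_REQUESTS];

    for (int i = 0; i < PENDED_REQUESTS; i++)      // prime the driver's manual queue
        PostRequest(dev, &ios[i]);

    for (;;) {
        DWORD bytes; ULONG_PTR key; OVERLAPPED *ov;
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
            continue;
        PENDED_IO *io = CONTAINING_RECORD(ov, PENDED_IO, ov);
        /* ... consume io->data here (runs at passive, so it can be preempted) ... */
        PostRequest(dev, io);                       // immediately re-pend it
    }
}
```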
Answering your other questions (which I can do based on the WinOF source code and Mellanox InfiniBand drivers, as well as the QLogic Linux driver source code, all public domain and available) …
- Polling the driver in a tight loop looks good on paper (have the ISR update a shared memory segment, or better yet simply map the BAR segment on the card containing the register you care about into userspace and have a thread poll for a change, easy peasy), but it will suffer from dramatic latency variations: the thread doing the polling runs at passive, which puts it at the mercy of every DPC, APC and higher-priority thread on that processor, and it effectively pins that core down. Add in the race condition inherent in two clock domains sharing the same memory (the card is potentially updating the location while you’re reading it), the delay for the InterlockedExchange to synchronize things for the read, etc. etc., and you’re not heading down a happy path (rough sketch after this list). This is what the Verbs MPI API does for InfiniBand, and latencies of 50-100us are not uncommon.
- Linux (again, based on the QLogic and Mellanox drivers) handles everything in the ISR to get around the latency problem: the interrupt, the shuffling of memory addresses, the walking of the dog, the break for coffee, everything. As soon as the first interrupt hits they start work and essentially ignore further interrupts until the traffic stops, for however long the traffic runs … seconds, minutes, hours, days, whatever, all at ISR priority, pinning one (or more) cores. Not appropriate for Windows.
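Back to the polling idea in the first bullet above: here’s the rough shape of it in usermode, assuming the driver has (via some IOCTL of its own) either shared the memory segment the ISR updates or mapped the relevant BAR page into the process and handed back a pointer. All names are mine.

```c
#include <windows.h>

// Points at the shared location: either host memory the ISR updates, or the
// mapped BAR register itself (placeholder; the driver would hand this back).
volatile LONG *g_seqReg;

DWORD WINAPI PollThread(LPVOID arg)
{
    LONG lastSeen = 0;
    UNREFERENCED_PARAMETER(arg);

    for (;;) {
        // A no-op InterlockedCompareExchange is just a full-barrier read of a
        // location the card (or the ISR) may be updating at this very moment;
        // this is the InterlockedExchange cost mentioned above.
        LONG now = InterlockedCompareExchange(g_seqReg, lastSeen, lastSeen);
        if (now != lastSeen) {
            lastSeen = now;
            /* ... something changed: go pull the real payload ... */
        }
        // This thread runs at passive, so every DPC/APC and higher-priority
        // thread on the core preempts it right here (hence the 50-100us
        // swings), and the spin itself pins the core at 100%.
        YieldProcessor();
    }
    return 0;
}
```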
Linux (both the Mellanox and QLogic drivers, plus the Mellanox Windows driver; I can’t discuss the QLogic Windows driver) mainly uses “zero-copy” RDMA for userspace interactions, which is essentially an inverted call to the card rather than to the OS. Userspace allocates a big chunk of memory (1-10MB is typical) and pushes it to the driver; the driver pins it, gets the physical addresses for the MDL, and programs the card with those PAs. The card maintains a very large table of PAs programmed by the driver (typically enough for 256K pages), and as data comes in it DMAs the data into those pages and interrupts the driver. This is where Linux and Windows differ. On Linux, the driver updates a memory location that is being monitored by (typically) an MPICH-2 app thread running on another core, which is itself spinning in a loop waiting for something to happen. On Windows, the driver replaces the “used” PA pages in the card’s DMA target table with “fresh” PA pages in the ISR, and once a full “buffer” of pages has been filled it triggers a DPC to notify the usermode app of new data; for the NetworkDirect protocol, this is done through an inverted call. So the user app allocates a 10MB buffer, does an overlapped “read” to the driver, and when the overlapped completes it digs through the 10MB to see what it cares about … this gives good bandwidth, but nowhere near the latency that Linux gives (the Linux drivers have sub-us latency for MPICH-2 traffic, Windows is around 5-8us).
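The pin-and-program step on the Windows side of that flow boils down to something like this; ProgramCardDmaTable() and the rest of the names are placeholders for the card-specific parts, and the real drivers obviously carry a lot more bookkeeping.

```c
#include <ntddk.h>

// Placeholder for however the hardware gets told about each DMA target page.
VOID ProgramCardDmaTable(_In_ ULONG Index, _In_ PHYSICAL_ADDRESS Pa);

NTSTATUS
PinUserBufferForDma(_In_ PVOID UserVa, _In_ SIZE_T Length, _Out_ PMDL *MdlOut)
{
    PMDL mdl;
    PPFN_NUMBER pfns;
    ULONG pageCount, i;

    // Describe the usermode buffer (call this in the context of the process
    // that owns it, e.g. while handling the "register buffer" IOCTL).
    mdl = IoAllocateMdl(UserVa, (ULONG)Length, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        // Pin the pages so they can't move or be repurposed while the card DMAs into them.
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    // Walk the PFN array behind the MDL and hand each page's physical address
    // to the card's DMA target table.
    pfns = MmGetMdlPfnArray(mdl);
    pageCount = ADDRESS_AND_SIZE_TO_SPAN_PAGES(UserVa, Length);
    for (i = 0; i < pageCount; i++) {
        PHYSICAL_ADDRESS pa;
        pa.QuadPart = (LONGLONG)pfns[i] << PAGE_SHIFT;
        ProgramCardDmaTable(i, pa);
    }

    *MdlOut = mdl;   // MmUnlockPages() + IoFreeMdl() when the buffer is retired
    return STATUS_SUCCESS;
}
```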
So, bottom line after that long wall of text … for low interrupt counts (less than around 10K/sec) your inverted call will work fine; plan on around 5-22us of latency depending on your use of MSI-X interrupt segregation. Lower latencies aren’t really possible in my experience …
As I’ve mentioned previously (somewhere), it would be valuable for you to dig through the WinOF OpenIB source code (it’s in a public SVN repo), written largely by the Mellanox boys; this driver lives and breathes high-speed interrupt / PCI extended space / low-latency / PCIe Gen2 x16 stuff and is a useful (if obtuse) source of info for this …
Cheers!