It’s not an error when the BIOS says nothing about the NUMA node of I/O devices.
Many of your replies (quoted below) imply that you think that all I/O
devices have a “home node.” That’s not true in the vast majority of
machines, even those that exhibit a large NUMA ratio. You’re using
processor metrics and assuming that they apply to I/O.
Furthermore, since I/O can’t do interlocked updates to memory (like spinlock
acquisition, etc.), the effects of being far from the node that holds your
common buffer are smaller.
Other than the common buffer (the one that holds your worklist of things
that the device should do) you won’t be able to control which nodes the
other buffers are in. They are where the app is, for the most part. And
your device needs to be able to do I/O on their behalf. That’s what RSS and
NUMA I/O help with.
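If you do want to experiment with placing the common buffer, the newer DMA
interfaces (DMA_OPERATIONS version 3, where your target OS has them) take a
preferred node at allocation time. Very roughly, and assuming DmaAdapter is
what IoGetDmaAdapter handed you and DeviceNode is whichever node you decide
to prefer (MM_ANY_NODE_OK if you have no opinion):

    PHYSICAL_ADDRESS deviceAddress;
    PVOID commonBuffer;

    //
    // Ask for the worklist buffer on (or near) the preferred node.
    // NULL for MaximumAddress means no restriction on the physical range.
    //
    commonBuffer = DmaAdapter->DmaOperations->AllocateCommonBufferEx(
                       DmaAdapter,
                       NULL,              // MaximumAddress
                       PAGE_SIZE,         // length of the worklist
                       &deviceAddress,    // device-visible address (out)
                       TRUE,              // cache-enabled
                       DeviceNode);       // preferred NUMA node

Whether that buys you anything measurable is a separate question, for the
reasons above.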
Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group
This post implies no warranties and confers no rights.
“QuasiCodo” wrote in message news:xxxxx@ntdev…
Jake:
The OS only knows what the BIOS tells it, at least with respect to this.
Ah, this is what I suspected. So the BIOS is probably not enumerating
the cards correctly with respect to NUMA and so the OS doesn’t know what
the assignment is. Thus we get the “unknown” status.
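For what it’s worth, the same thing is visible from the driver side:
IoGetDeviceNumaNode reports the node the PnP manager derived from the
firmware’s proximity information, and it comes back with STATUS_NOT_FOUND
when the BIOS didn’t describe one, which I assume is the “unknown” we are
seeing. A rough sketch, where Pdo is the physical device object for our
stack:

    USHORT node;
    NTSTATUS status;

    //
    // Ask which NUMA node the firmware associated with this device.
    //
    status = IoGetDeviceNumaNode(Pdo, &node);

    if (NT_SUCCESS(status)) {
        DbgPrint("Device is on NUMA node %u\n", node);
    } else if (status == STATUS_NOT_FOUND) {
        //
        // The BIOS/ACPI tables said nothing about proximity, so the
        // node is simply unknown.
        //
        DbgPrint("No NUMA node information for this device\n");
    }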
But, really, are you sure it matters? In almost all machines, I/O is
equally close to every CPU/memory node.
CoreInfo for this system shows the following access matrix:
Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01 02 03
00: 1.4 1.6 1.6 1.7
01: 1.6 1.3 1.6 1.3
02: 1.6 1.6 1.3 1.0
03: 1.6 1.7 1.6 1.0
And even when it’s not, the most
important thing you can do with respect to NUMA for I/O is to complete
the interrupt back to the processor which initiated the I/O. (This is
facilitated through RSS in NDIS and “NUMA I/O” in storport.)
Thanks for the info. We are currently developing a directed completion
thread to handle this.
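The rough shape of it (a simplified sketch rather than our actual code; the
Request structure and its fields are made up for illustration) is to capture
the initiating processor when the request is queued, then aim the completion
DPC back at that processor:

    //
    // At submission time: remember which processor the request came from.
    //
    KeGetCurrentProcessorNumberEx(&Request->OriginProcessor);

    //
    // At completion time: target the completion DPC back at the
    // processor (and therefore the node) that issued the request.
    //
    KeSetTargetProcessorDpcEx(&Request->CompletionDpc, &Request->OriginProcessor);
    KeInsertQueueDpc(&Request->CompletionDpc, Request, NULL);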
Any other affinitization of stuff for a device is truly secondary. I
suppose you could try to allocate your common buffer “close” to the
device, but I’d be surprised if even that makes much difference.
Yes, in general, performance suffers somewhat when I/O requests from other
NUMA nodes come in and we then access memory buffers that were allocated on
the “local” node.
Case in point: there is another NUMA system we have been working on which
has 128 processors split across two processor groups of 64, spanning 8 NUMA
nodes. In that system, cross-NUMA memory access penalties can be as high as
4.2 (the CoreInfo rating).
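If anyone wants to see that layout from kernel mode, something along these
lines (just a sketch using the node-affinity queries) dumps which group and
which processors each node owns:

    USHORT highestNode = KeQueryHighestNodeNumber();
    USHORT node;

    for (node = 0; node <= highestNode; node++) {
        GROUP_AFFINITY affinity;
        USHORT count;

        //
        // Which processor group, and which processors within it,
        // belong to this NUMA node?
        //
        KeQueryNodeActiveAffinity(node, &affinity, &count);

        DbgPrint("Node %u: group %u, %u active processors, mask 0x%Ix\n",
                 node, affinity.Group, count, affinity.Mask);
    }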
Thanks for the info,
((&->