IoGetDeviceNumaNode returns "unknown"

QuasiCodo · June 9, 2011, 2:01pm

I have a server with 64 logical processors and 4 NUMA nodes running WS08R2.

In my driver I am calling IoGetDeviceNumaNode() and it seems to always
return STATUS_NOT_FOUND, which means that the NUMA node number of the
device is unknown.

Is there a way to force the system to assign the devices to “nearby”
NUMA nodes? Or is this rather a function of the BIOS which WS08R2 reads
during OS boot?

thx,

((&->

Alex_Grig · June 9, 2011, 2:08pm

Are you passing the PDO pointer to it?

QuasiCodo · June 9, 2011, 2:32pm

On 6/9/2011 12:11 PM, xxxxx@broadcom.com wrote:

> Are you passing the PDO pointer to it?

Yes. IoGetDeviceNumaNode() works great on other NUMA platforms, but for
some reason, it is not working on this platform:

    WDFDEVICE device;
    PDEVICE_OBJECT wdmPdo;
    USHORT nodeNum;
    NTSTATUS numa_status;

    …
    status = WdfDeviceCreate( &DeviceInit, &fdoAttributes, &device );
    …
    /* IoGetDeviceNumaNode expects the PDO, not the FDO */
    wdmPdo = WdfDeviceWdmGetPhysicalDevice( device );

    numa_status = (*IoGetDeviceNumaNode)( wdmPdo, &nodeNum );
    if ( NT_SUCCESS(numa_status) )
    {
        fdoContext->driverNumaNode = nodeNum;
    }
    else
    {
        print( "IoGetDeviceNumaNode failed with status 0x%X, disabling "
               "affinity for this device. Please try setting affinity manually.",
               numa_status );
    }
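
Since STATUS_NOT_FOUND only means that no node was reported for the device, a driver may want to separate that case from genuine failures. The following is a minimal sketch in portable C, not WDK code: the NTSTATUS typedef and macros are stand-ins so it builds outside the WDK, and the enum and classify_node_status() are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Stand-ins so the sketch builds outside the WDK. */
typedef int32_t NTSTATUS;
#define NT_SUCCESS(s)     (((NTSTATUS)(s)) >= 0)
#define STATUS_SUCCESS    ((NTSTATUS)0x00000000L)
#define STATUS_NOT_FOUND  ((NTSTATUS)0xC0000225L)

/* Hypothetical driver-side classification, not a WDK type. */
enum node_policy {
    NODE_KNOWN,        /* a node was reported; use it */
    NODE_UNSPECIFIED,  /* STATUS_NOT_FOUND: no node reported */
    NODE_QUERY_FAILED  /* any other failure: worth logging */
};

/* Map the status returned by the node query to a policy. */
static enum node_policy classify_node_status(NTSTATUS status)
{
    if (NT_SUCCESS(status))
        return NODE_KNOWN;
    if (status == STATUS_NOT_FOUND)
        return NODE_UNSPECIFIED;
    return NODE_QUERY_FAILED;
}
```

With this split, the NODE_UNSPECIFIED path could skip affinity setup quietly, while NODE_QUERY_FAILED keeps the diagnostic message.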

Is this something to do with the HAL perhaps?

((&->

Alex_Grig · June 9, 2011, 7:53pm

I suspect that whatever is supposed to report the topology to Windows doesn’t do its job.

Jake_Oshins · June 9, 2011, 11:29pm

The OS only knows what the BIOS tells it, at least with respect to this.

But, really, are you sure it matters? In almost all machines, I/O is
equally close to every CPU/memory node. And even when it’s not, the most
important thing you can do with respect to NUMA for I/O is to complete the
interrupt back to the processor which initiated the I/O. (This is
facilitated through RSS in NDIS and “NUMA I/O” in storport.)

Any other affinitization of stuff for a device is truly secondary. I
suppose you could try to allocate your common buffer “close” to the device,
but I’d be surprised if even that makes much difference.
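
The idea of completing an I/O back to the processor that initiated it can be modelled in a few lines of portable C. This is only a sketch of the bookkeeping, not driver code: the structures, the MAX_CPUS limit, and the submit()/complete() helpers are hypothetical. In a real driver the current processor would presumably come from something like KeGetCurrentProcessorNumberEx, and each queue would be drained by a DPC targeted at that processor.

```c
#include <stddef.h>

#define MAX_CPUS 64   /* assumption: a single group of 64 logical processors */

struct request {
    unsigned submit_cpu;      /* processor that issued the I/O */
    struct request *next;
};

struct completion_queue {
    struct request *head;     /* most recently completed request first */
    size_t depth;
};

static struct completion_queue queues[MAX_CPUS];

/* Record the issuing processor at submit time. */
static void submit(struct request *r, unsigned current_cpu)
{
    r->submit_cpu = current_cpu % MAX_CPUS;
    r->next = NULL;
}

/* Route the finished request to its issuer's queue; returns that CPU index. */
static unsigned complete(struct request *r)
{
    struct completion_queue *q = &queues[r->submit_cpu];
    r->next = q->head;
    q->head = r;
    q->depth++;
    return r->submit_cpu;
}
```

The sketch ignores locking; a per-processor queue touched only from its own processor is one way to keep that cheap.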

Jake Oshins
Windows Kernel Team
Microsoft

Alex_Grig · June 10, 2011, 12:37pm

I’m not even sure if there is a way to allocate a common buffer on a given node.

QuasiCodo · June 10, 2011, 1:15pm

Jake:

> The OS only knows what the BIOS tells it, at least with respect to this.

Ah, this is what I suspected. So the BIOS is probably not enumerating
the cards correctly with respect to NUMA and so the OS doesn’t know what
the assignment is. Thus we get the “unknown” status.

> But, really, are you sure it matters? In almost all machines, I/O is
> equally close to every CPU/memory node.

CoreInfo for this system shows the following access matrix:

Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01  02  03
00: 1.4 1.6 1.6 1.7
01: 1.6 1.3 1.6 1.3
02: 1.6 1.6 1.3 1.0
03: 1.6 1.7 1.6 1.0
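
To read the matrix programmatically, a small portable C sketch can report the worst cost seen from each node and the cheapest destination for each node. The figures are copied from the CoreInfo output above; the function names are hypothetical, and treating rows as the source node and columns as the destination is an assumption.

```c
#define NODES 4

/* CoreInfo figures quoted above: cost[from][to], relative to fastest. */
static const double cost[NODES][NODES] = {
    {1.4, 1.6, 1.6, 1.7},
    {1.6, 1.3, 1.6, 1.3},
    {1.6, 1.6, 1.3, 1.0},
    {1.6, 1.7, 1.6, 1.0},
};

/* Highest cost seen from one node to any node. */
static double worst_from(int from)
{
    double worst = cost[from][0];
    for (int to = 1; to < NODES; to++)
        if (cost[from][to] > worst)
            worst = cost[from][to];
    return worst;
}

/* Index of the cheapest destination node for a given source node
 * (the lowest index wins on ties). */
static int cheapest_to(int from)
{
    int best = 0;
    for (int to = 1; to < NODES; to++)
        if (cost[from][to] < cost[from][best])
            best = to;
    return best;
}
```

On these numbers the worst case is 1.7 (from nodes 00 and 03), and node 02's cheapest destination comes out as node 03 rather than itself, a reminder that the figures are approximate.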

> And even when it’s not, the most
> important thing you can do with respect to NUMA for I/O is to complete
> the interrupt back to the processor which initiated the I/O. (This is
> facilitated through RSS in NDIS and “NUMA I/O” in storport.)

Thanks for the info. We are currently developing a directed completion
thread to handle this.

> Any other affinitization of stuff for a device is truly secondary. I
> suppose you could try to allocate your common buffer “close” to the
> device, but I’d be surprised if even that makes much difference.

Yes, in general, performance suffers somewhat when IO requests from
other NUMA nodes come in and then we access memory buffers which have
been allocated on the “local” node.

Case in point, there is another NUMA system we have been working on
which has 128 processors split into 2 groups of 64 processors, with 8
NUMA nodes. In that system, cross-NUMA memory access penalties can be
as high as 4.2 (the CoreInfo rating).

Thanks for the info,

((&->

Jake_Oshins · June 10, 2011, 7:33pm

It’s not an error when the BIOS says nothing about the NUMA node of I/O devices.
Many of your replies (quoted below) imply that you think that all I/O
devices have a “home node.” That’s not true in the vast majority of
machines, even those that exhibit a large NUMA ratio. You’re using
processor metrics and assuming that they apply to I/O.

Furthermore, since I/O can’t do interlocked updates to memory (like spinlock
acquisition, etc.) the effects of being far from the node with your common
buffer are smaller.

Other than the common buffer (the one that holds your worklist of things
that the device should do) you won’t be able to control which nodes the
other buffers are in. They are where the app is, for the most part. And
your device needs to be able to do I/O on their behalf. That’s what RSS and
NUMA I/O help with.

Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group

This post implies no warranties and confers no rights.

QuasiCodo · June 10, 2011, 8:13pm

On 6/10/2011 5:32 PM, Jake Oshins wrote:

> It’s not an error when the BIOS says nothing about NUMA node of I/O
> devices. Many of your replies (quoted below) imply that you think that
> all I/O devices have a “home node.” That’s not true in the vast majority
> of machines, even those that exhibit a large NUMA ratio. You’re using
> processor metrics and assuming that they apply to I/O.

The machine I am working on used to report the NUMA node correctly,
and then all of a sudden it stopped. I was just trying to figure out
what changed, which is why my previous posts imply that something is
wrong.

But, yes, I agree that it is not an error if a device has not been
assigned a NUMA node.

Thx,

((&->

QuasiCodo · June 10, 2011, 8:26pm

Very interesting …

I just removed my 1394 card from the system and put another one of our
devices in its place and all of a sudden IoGetDeviceNumaNode() started
working again.

It can’t be as simple as turning off the debugger, can it?

Or perhaps the 1394 card was somehow interfering with the BIOS with
respect to NUMA.

Thanks for all the help!

((&->

Jake_Oshins · June 10, 2011, 10:46pm

Any PCI device (USB, 1394 or future “other”) in use as a debugger transport
changes the way plug and play works. I wish it weren’t so. (And I doubt
that’s what’s causing your problem, though it could be. Send me mail
privately so I can see if I can replicate your issue.)

When I originally coded the parts of the HAL that set up a PCI device as the
debugger, I tried very hard to make sure that it never perturbed PnP,
preferring to fail instead. This worked just fine until serial ports
started disappearing and people really needed the debugger to work through
some other means. So the people who came after me changed the code to be
less timid and more likely to result in a working debugger. The side effect
is that PnP will often think that the BIOS set something up that was really
configured by the debugger. This can have odd ramifications all through the
PnP subsystem.

UARTs may have been slow, but they never had much of an effect on the rest
of the system.

Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group

This post implies no warranties and confers no rights.

QuasiCodo · June 16, 2011, 1:00pm

Jake:

I tried to send you an e-mail to your Microsoft e-mail address, but it
bounced, so I’ll include my comments here.

The BIOS was 3 months out of date, so I updated it and the problem went
away. I then put the 1394 card back into the system and
IoGetDeviceNumaNode() works, so I think it was a flaky BIOS issue.

Thanks for the info.

((&->