NUMA machines (at least the ones that can run commodity OSes) are still
symmetric with respect to everything except latency. That means that the
physical address of any region of memory is the same no matter which
processor you use to fetch it. This implies that only one node can have
local memory starting at physical location 0. The SRAT covers the whole
system, or at least an entire partition.
I/O and interrupts are the same story. Any I/O operation can be done on any
processor, and any interrupt can be serviced on any processor. (Some
machines cheat with the interrupts, including that x440 from IBM that you
referenced in your other posts. There’s one of those a couple of offices
down from me, by the way, and the HAL for it was written by a couple of
friends of mine from when I worked at IBM.)
Whether you actually want to do I/O from a non-local node is another story.
We’ve been working on whether it really matters. If you put
intelligent-enough I/O controllers into the system, they mostly operate
through DMA anyhow, which minimizes the I/O problem.
As for partitioning, it’s on our roadmap for the future. Most of the NUMA
machines are statically partitionable – i.e., you can bring up multiple OS
instances on them, but you have to shut the whole thing off if you want to
change the partition boundaries. That can be done without OS help. But
Dynamic Partitioning is another story, and much, much harder. We’ve added
most of the interfaces that we’ll need to ACPI 2.0, and we’ll probably get
the rest into ACPI 3.0. Exactly what we’ll productize and when isn’t
something that I can discuss without an NDA.
–
Jake Oshins
Windows Base Kernel Team
This posting is provided “AS IS” with no warranties, and confers no rights.
“Moreira, Alberto” wrote in message news:xxxxx@ntdev…
> Actually, a question for Jake. I read that SRAT piece of paper, and I’m not
> too sure I understood. Say I have four SMPs connected through a switch, and
> each SMP has its own BIOS. Say I want to have one node per SMP machine. I
> assume that one can have four different configurations on the four nodes, so
> that, for example, local memory always starts at location 0? And how would
> I/O and interrupts be handled by Win2003? Am I right in saying that a SRAT
> is basically a node’s view of the system, or does the system assume there’s
> only one SRAT for all processors? Also, did you guys put in any support for
> partitioning the system?
>
> Alberto.
>
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com]On Behalf Of Moreira, Alberto
> Sent: Thursday, January 08, 2004 10:19 AM
> To: Windows System Software Devs Interest List
> Subject: RE: [ntdev] AMD64/Opteron
>
>
> Ok, let’s kill the thread.
>
> I refer you back to that IBM article I mentioned in my post yesterday;
> there’s lots of information in it on how to do it, and on different approaches
> to partitioning and to routing of threads and processes. They managed to
> manufacture a 16x NUMA processor out of four 4x Xeons. But they had to add
> hardware to provide remote memory and I/O access, and also to route
> interrupts beyond the four local processors. They also had to frig the HAL.
> They did OS work to support that system in Linux too, and there are some
> interesting comparisons between the two systems in that paper.
>
> Alberto.
>
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com]On Behalf Of Jake Oshins
> Sent: Thursday, January 08, 2004 1:39 AM
> To: Windows System Software Devs Interest List
> Subject: Re:[ntdev] AMD64/Opteron
>
>
> Of all the responses to this, so far Loren’s has been closest to the facts.
> I’ll just add a couple of things.
>
> 1) Not all AMD64 machines provide the NUMA information in their ACPI BIOS.
> In fact, very few do, and this is intentional. You have a choice when you
> set up your memory controllers: you either stripe the memory between the
> two (or more) processors, usually on cache-line boundaries, or you set the
> memory up into contiguous chunks. Which one performs better depends a lot
> on the workload and on the latency added by getting your data from the “far”
> processor(s). Current Opteron implementations add very little latency for
> “far” processors, so the value that you can get from running the NUMA code
> in Windows Server 2003 on them is small. The value you get from striping,
> on the other hand, can be very large. The choice is up to the BIOS
> provider.
>
> 2) Windows Server 2003, when running on a machine with NUMA information in
> its BIOS, will do two things that are of interest to this thread. First, it
> will attempt to generally keep threads on one node or another. And second,
> it will automatically attempt to allocate memory that is local to the
> processor that the allocation was done from. So, if you have a four-processor
> Opteron system running as a NUMA machine, we might start a thread
> on processor 2, keep it there, and all the memory allocations that that
> thread makes will come from pool that is close to processor 2.
>
> (There are lots more details and corner cases to this stuff, particularly
> when one node gets busy and you’d like to migrate part of its workload to
> another node. But I don’t remember the details off the top of my head. And
> I’m sure that you’ll all argue about the relative points until Alberto jumps
> in, at which point the thread will probably die.)
>
> –
> Jake Oshins
> Windows Base Kernel Team
>
> This posting is provided “AS IS” with no warranties, and confers no rights.
>
> “Loren Wilton” wrote in message news:xxxxx@ntdev…
> > > As far as I understand, Windows Server 2003 has the ability to understand
> > > NUMA and the concept that “this memory belongs to processor X”, and thus,
> > > for instance, when allocating memory, give preference to memory in the range
> > > that is closest to the processor that the code is running on.
> >
> > This is conditional on the ACPI BIOS correctly setting up the memory
> > mappings and marking them, and the processors, with corresponding group
> > numbers. Given this, NT can build processor-specific memory pools, and
> > attempt to allocate things from the local pool. Note that there can also be
> > memory that isn’t local to any processor. Also, NT and ACPI only know about
> > “near” and “not near”; that is, there isn’t a range of distances, only a
> > binary choice.
> >
> > What I don’t recall is if NT will allocate from the processor pool on a
> > normal allocation if you don’t have any sort of affinity set up. I suspect
> > it may not necessarily do that. For instance, if you have a user thread
> > that can run on any processor, and have memory that isn’t “near” any
> > processor, it might actually make sense to allocate from the far memory. If
> > there isn’t any completely “not near” memory, then it might make sense to
> > allocate from the processor pool with the most free space, even if it isn’t
> > the current processor. After all, the thread isn’t affine to any processor,
> > so will likely be running near the allocated memory some of the time, no
> > matter where it is allocated.
> >
> > Loren
> >
> >
>
>
>
> —
> Questions? First check the Kernel Driver FAQ at
> http://www.osronline.com/article.cfm?id=256
>
> You are currently subscribed to ntdev as: xxxxx@compuware.com
> To unsubscribe send a blank email to xxxxx@lists.osr.com
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or disclose
> it to anyone else. If you received it in error please notify us immediately
> and then destroy it.
>
>