AMD64/Opteron

Loren_Wilton · January 9, 2004, 3:11am

From: “Jake Oshins”

> As for partitioning, it’s on our roadmap for the future. Most of the NUMA
> machines are statically partitionable – i.e. you can bring up multiple OS
> instances on them, but you have to shut the whole thing off if you want to
> change the partition boundaries. That can be done without OS help. But
> Dynamic Partitioning is another story, and much, much harder. We’ve added
> most of the interfaces that we’ll need to ACPI 2.0 and we’ll probably get
> the rest into ACPI 3.0. Exactly what we’ll productize and when isn’t
> something that I can discuss without an NDA.

Isn’t DP fun?
Which reminds me, I probably need to talk to you about IO APICs again one of
these days.

Loren

OSR_Community_User · January 9, 2004, 11:14am

Thanks, Jake! I understand it better now. Yet I still believe this style of
addressing is better suited to a centralized or mostly centralized memory
that’s switched among the processors, as opposed to having a true
differentiation between local and remote memory. In which case, of course,
using the off-the-shelf commodity OS might mean that local memory is always
mapped at lower addresses, and that different processors don’t necessarily
need to have the same view of remote memory. Like, it might fit a Profusion
chip very nicely, but it may be a bit more involved if all I wanted is to put
four standalone machines on a crossbar?

Alberto.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of Jake Oshins
Sent: Friday, January 09, 2004 12:27 AM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] AMD64/Opteron

NUMA machines (at least the ones that can run commodity OSes) are still
symmetric with respect to everything except latency. That means that the
physical address of any region of memory is the same no matter which
processor you use to fetch it. This implies that only one node can have
local memory starting at physical location 0. The SRAT covers the whole
system, or at least an entire partition.
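
To make the "one map for the whole system" point concrete, here is a minimal sketch in Python. It is not the real SRAT binary layout and not Windows code: the MemoryAffinity record, the node numbers, and the 4 GiB ranges are all made up for illustration. A real SRAT memory affinity entry does carry a proximity domain, a base address, and a length, but it also has flags and other fields that are left out here.

```python
from dataclasses import dataclass
from typing import List, Optional

GIB = 1 << 30


@dataclass(frozen=True)
class MemoryAffinity:
    """Hypothetical SRAT-like entry: one physical range owned by one node."""
    node: int    # proximity domain that owns the range
    base: int    # first physical address in the range
    length: int  # size of the range in bytes


# One table describes the whole system (or partition). The ranges do not
# overlap, so exactly one node can own physical address 0.
# The four 4 GiB ranges are illustrative values, not taken from any machine.
SYSTEM_MAP: List[MemoryAffinity] = [
    MemoryAffinity(node=0, base=0 * GIB, length=4 * GIB),
    MemoryAffinity(node=1, base=4 * GIB, length=4 * GIB),
    MemoryAffinity(node=2, base=8 * GIB, length=4 * GIB),
    MemoryAffinity(node=3, base=12 * GIB, length=4 * GIB),
]


def node_for_address(phys: int) -> Optional[int]:
    """Return the owning node of a physical address, or None if unmapped.

    Note that the requesting CPU is not an input: every processor sees the
    same physical address for the same memory.
    """
    for entry in SYSTEM_MAP:
        if entry.base <= phys < entry.base + entry.length:
            return entry.node
    return None


def is_local(cpu_node: int, phys: int) -> bool:
    """Locality only affects latency, not which address is used."""
    return node_for_address(phys) == cpu_node
```

In this sketch a processor on node 1 still reads address 0 as address 0; the only difference is that is_local(1, 0) is false, so the access is a remote (slower) one.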

I/O and interrupts are the same story. Any I/O operation can be done on any
processor, and any interrupt can be serviced on any processor. (Some
machines cheat with the interrupts, including that x440 from IBM that you
referenced in your other posts. There’s one of those a couple of offices
down from me, by the way, and the HAL for it was written by a couple of
friends of mine from when I worked at IBM.)

Whether you actually want to do I/O from a non-local node is another story.
We’ve been working on whether it really matters. If you put
intelligent-enough I/O controllers into the system, they mostly operate
through DMA anyhow, which minimizes the I/O problem.

As for partitioning, it’s on our roadmap for the future. Most of the NUMA
machines are statically partitionable – i.e. you can bring up multiple OS
instances on them, but you have to shut the whole thing off if you want to
change the partition boundaries. That can be done without OS help. But
Dynamic Partitioning is another story, and much, much harder. We’ve added
most of the interfaces that we’ll need to ACPI 2.0 and we’ll probably get
the rest into ACPI 3.0. Exactly what we’ll productize and when isn’t
something that I can discuss without an NDA.

–
Jake Oshins
Windows Base Kernel Team

This posting is provided “AS IS” with no warranties, and confers no rights.

“Moreira, Alberto” wrote in message
news:xxxxx@ntdev…
> Actually, a question for Jake. I read that SRAT piece of paper, and I’m not
> too sure I understood. Say I have four SMPs connected through a switch,
> and each SMP has its own BIOS. Say I want to have one node per SMP machine.
> I assume that one can have four different configurations on the four nodes,
> so that, for example, local memory always starts at location 0? And how
> would I/O and interrupts be handled by Win2003? Am I right in saying that
> an SRAT is basically a node’s view of the system, or does the system assume
> there’s only one SRAT for all processors? Also, did you guys put in any
> support for partitioning the system?
>
> Alberto.
>
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com]On Behalf Of Moreira, Alberto
> Sent: Thursday, January 08, 2004 10:19 AM
> To: Windows System Software Devs Interest List
> Subject: RE: [ntdev] AMD64/Opteron
>
>
> Ok, let’s kill the thread.
>
> I refer you back to that IBM article I mentioned in my post yesterday;
> there’s lots of information on how to do it, and on different approaches to
> partitioning and to routing of threads and processes. They managed to
> manufacture a 16x NUMA processor out of four 4x Xeons. But they had to add
> hardware to provide remote memory and I/O access, and also to route
> interrupts beyond the four local processors. They also had to frig the HAL.
> They did OS work to support that system in Linux too, and there are some
> interesting comparisons between the two systems in that paper.
>
> Alberto.
>
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com]On Behalf Of Jake Oshins
> Sent: Thursday, January 08, 2004 1:39 AM
> To: Windows System Software Devs Interest List
> Subject: Re:[ntdev] AMD64/Opteron
>
>
> Of all the responses to this, so far Loren’s has been closest to the facts.
> I’ll just add a couple of things.
>
> 1) Not all AMD64 machines provide the NUMA information in their ACPI BIOS.
> In fact very few do, and this is intentional. You have a choice when you
> set up your memory controllers. You either stripe the memory between the
> two (or more) processors, usually on cache-line boundaries, or you set the
> memory up into contiguous chunks. Which one performs better depends a lot
> on the workload and on the latency added by getting your data from the
> “far” processor(s). Current Opteron implementations add very little latency
> for “far” processors, so the value that you can get from running the NUMA
> code in Windows Server 2003 on them is small. The value you get from
> striping, on the other hand, can be very large. The choice is up to the
> BIOS provider.
>
> 2) Windows Server 2003, when running on a machine with NUMA information in
> its BIOS, will do two things that are of interest to this thread. First, it
> will attempt to generally keep threads on one node or another. And second,
> it will automatically attempt to allocate memory that is local to the
> processor that the allocation was done from. So, if you have a four
> processor Opteron system, running as a NUMA machine, we might start a
> thread on processor 2, keep it there, and all the memory allocations that
> that thread makes will come from a pool that is close to processor 2.
>
> (There are lots more details and corner cases to this stuff, particularly
> when one node gets busy and you’d like to migrate part of its workload to
> another node. But I don’t remember the details off the top of my head. And
> I’m sure that you’ll all argue about the relative points until Alberto
> jumps in, at which point the thread will probably die.)
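
The striping-versus-contiguous choice in point 1 can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the 64-byte stripe granule, the two-node system, the 1 GiB per node, and the function names. It only shows which node ends up owning each cache line under the two layouts; it does not model latency or memory-controller bandwidth, which is where striping gets its benefit.

```python
from typing import Callable, Iterable

CACHE_LINE = 64  # bytes; assumed stripe granule for this sketch


def node_interleaved(phys: int, nodes: int, granule: int = CACHE_LINE) -> int:
    """Striped layout: consecutive granules rotate across the nodes."""
    return (phys // granule) % nodes


def node_contiguous(phys: int, nodes: int, bytes_per_node: int) -> int:
    """Contiguous layout: each node owns one large block of addresses."""
    node = phys // bytes_per_node
    if node >= nodes:
        raise ValueError("address beyond installed memory")
    return node


def remote_fraction(addresses: Iterable[int], cpu_node: int,
                    owner: Callable[[int], int]) -> float:
    """Fraction of the touched addresses that live on some other node."""
    touched = list(addresses)
    remote = sum(1 for a in touched if owner(a) != cpu_node)
    return remote / len(touched)


# A CPU on node 0 walks a 1 MiB buffer, one cache line at a time.
scan = range(0, 1 << 20, CACHE_LINE)
striped = remote_fraction(scan, 0, lambda a: node_interleaved(a, nodes=2))
chunked = remote_fraction(
    scan, 0, lambda a: node_contiguous(a, nodes=2, bytes_per_node=1 << 30))
```

With two nodes, the striped layout makes half of the lines remote for every processor, while the contiguous layout makes all of them local for node 0 and all of them remote for node 1. That is why NUMA-aware placement only pays off when memory is laid out in contiguous chunks and the "far" penalty is large enough to matter.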
>
> –
> Jake Oshins
> Windows Base Kernel Team
>
> This posting is provided “AS IS” with no warranties, and confers no rights.
>
> “Loren Wilton” wrote in message news:xxxxx@ntdev…
> > > As far as I understand, Windows Server 2003 has the ability to
> > > understand NUMA and the concept that “this memory belongs to processor
> > > X”, and thus, for instance, when allocating memory, give preference to
> > > memory in the range that is closest to the processor that the code is
> > > running on.
> >
> > This is conditional on the ACPI BIOS correctly setting up the memory
> > mappings and marking them, and the processors, with corresponding group
> > numbers. Given this, NT can build processor-specific memory pools, and
> > attempt to allocate things from the local pool. Note that there can also
> > be memory that isn’t local to any processor. Also, NT and ACPI only know
> > about “near” and “not near”; that is, there isn’t a range of distances,
> > only a binary choice.
> >
> > What I don’t recall is if NT will allocate from the processor pool on a
> > normal allocation if you don’t have any sort of affinity set up. I
> > suspect it may not necessarily do that. For instance, if you have a user
> > thread that can run on any processor, and have memory that isn’t “near”
> > any processor, it might actually make sense to allocate from the far
> > memory. If there isn’t any completely “not near” memory, then it might
> > make sense to allocate from the processor pool with the most free space,
> > even if it isn’t the current processor. After all, the thread isn’t
> > affine to any processor, so it will likely be running near the allocated
> > memory some of the time, no matter where it is allocated.
> >
> > Loren
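
The "try the local pool first" idea discussed above can be sketched as follows. This is a toy model in Python, not how the NT memory manager is implemented: the NodePools class and its page counts are hypothetical, and the fallback rule (pick the node with the most free pages) is just the option Loren floats, not a documented Windows policy.

```python
from typing import Dict, Optional


class NodePools:
    """Toy per-node free-page pools (hypothetical, for illustration only)."""

    def __init__(self, free_pages: Dict[int, int]) -> None:
        self.free_pages = dict(free_pages)

    def allocate(self, cpu_node: int, pages: int) -> Optional[int]:
        """Return the node that satisfied the request, or None on failure."""
        if self.free_pages.get(cpu_node, 0) >= pages:
            # Preferred case: memory near the processor doing the allocation.
            chosen = cpu_node
        else:
            # Assumed fallback: the node with the most free pages that can
            # still satisfy the request.
            candidates = [n for n, free in self.free_pages.items()
                          if free >= pages]
            if not candidates:
                return None
            chosen = max(candidates, key=lambda n: self.free_pages[n])
        self.free_pages[chosen] -= pages
        return chosen


pools = NodePools({0: 10, 1: 4, 2: 8})
first = pools.allocate(cpu_node=1, pages=3)    # fits in the local pool
second = pools.allocate(cpu_node=1, pages=3)   # local pool now too small
third = pools.allocate(cpu_node=1, pages=100)  # nothing can satisfy this
```

Here the first request is served from node 1, the second spills to node 0 because it has the most free pages, and the third fails. A thread with no affinity would change cpu_node from call to call, which is the case Loren is unsure about.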
> >
> >
>
>
>
> —
> Questions? First check the Kernel Driver FAQ at
> http://www.osronline.com/article.cfm?id=256
>
> You are currently subscribed to ntdev as: xxxxx@compuware.com
> To unsubscribe send a blank email to xxxxx@lists.osr.com
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
disclose
> it to anyone else. If you received it in error please notify us
immediately
> and then destroy it.


Jake_Oshins · January 12, 2004, 1:32pm

Whether that approach works well or not depends on how strongly you’re
trying to leverage a commodity SMP OS on the machines. As you move away
from the premise that the machine is functionally an SMP, you start seeing
vast differences between implementations. These differences imply very
dissimilar changes to the kernel to support the machines. Consequently, we
advocate a strategy that allows us to run one kernel across many vendors’
platforms.

Even if you don’t believe what I just said, the hard reality is that machine
builders need to run old, non-NUMA-aware OSes, even if performance suffers
some. And you can’t do that if your machine doesn’t act like an SMP.

–
Jake Oshins
Windows Base Kernel Team

This posting is provided “AS IS” with no warranties, and confers no rights.
