AMD64/Opteron

I know that each processor in the AMD64 family has its own memory. I am
curious to know if the Windows OS will enable processes to have affinity
to a given processor (typically yes) and that a given block of memory
can be allocated to a given processor affinity as well. Basically I
want to know if it is possible to tune an application to have its
process and memory used by the process to be off of the same physical
processor.

Thanks,
Dominick Cafarelli

This has been there for years. SetThreadAffinityMask,
SetProcessAffinityMask, SetThreadIdealProcessor, etc.
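
For the curious, a minimal user-mode sketch of those calls (processor index
0 chosen arbitrarily here; error handling trimmed):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hard affinity: restrict the current thread to processor 0
           (bit 0 of the mask). Returns the previous mask, or 0 on failure. */
        DWORD_PTR oldMask = SetThreadAffinityMask(GetCurrentThread(), 1);
        if (oldMask == 0) {
            printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }

        /* Soft affinity: hint the scheduler about a preferred processor
           without forbidding it from running the thread elsewhere. */
        SetThreadIdealProcessor(GetCurrentThread(), 0);

        /* ... cache-sensitive work now tends to stay on one processor ... */
        return 0;
    }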

Processors (at least in the NT/SMP model) do not have their “own
memory”, outside of registers and cache. The processor cache hardware
works quite hard to ensure that the caches are coherent in SMP systems
– as much as possible, the system appears to have a single memory
store. (Note: Cache coherence, memory barriers, etc. are a huge topic,
and I’m not trying to do it justice here. Search for “cache coherency”
or “SMP memory design” for more info. And obviously, I’m talking about
cache-coherent SMP systems, since that is what NT is designed for.)

Memory has a sort of “soft” affinity, determined by the cache state of
processors in SMP systems. (For example, if two processors continually
write to the same cache line, or perform interlocked integer access,
performance will be greatly reduced.) Nearly all high-performance
algorithms are dominated by cache behavior.

The short answer is: yes. Read up on processor affinity, both hard
affinity and soft affinity. It’s a very complex topic, and no short
answer can truly do the topic justice.

– arlie


The description you provide is essentially a NUMA architecture, and such
is certainly supported by Windows Server 2003. As part of scheduling,
the preferred node for processing is one of the tracked attributes (in
addition to hard affinity and soft affinity). Certainly key OS
structures are also allocated from appropriate memory locations
(remember the Windows mantra - “performance, performance, performance”).

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

Hope to see you at the next OSR file systems class in Boston, MA,
February 23, 2004!

> I know that each processor in the AMD64 family has its own memory. I am
> curious to know if the Windows OS will enable processes to have affinity
> to a given processor (typically yes) and that a given block of memory
> can be allocated to a given processor affinity as well. Basically I
> want to know if it is possible to tune an application to have its
> process and memory used by the process to be off of the same physical
> processor.

Surely, and no kernel-mode hacking is necessary.

Just use SetThreadAffinityMask, and then maintain per-thread memory as you
see fit - you can even use a per-thread heap with no mutex on it.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Hi Dominick,

> I know that each processor in the AMD64 family has its own memory. I am
> curious to know if the Windows OS will enable processes to have affinity
> to a given processor (typically yes) and that a given block of memory
> can be allocated to a given processor affinity as well. Basically I
> want to know if it is possible to tune an application to have its
> process and memory used by the process to be off of the same physical
> processor.
>
> Thanks,
> Dominick Cafarelli

I’m writing a reply to this e-mail, because I believe that the other
responses have been missing the point to some extent.

First, an explanation of why this question even exists: in an x86-64
architecture processor, at least in the current implementation, the memory
controller is integrated into the processor, rather than having a separate
“northbridge” memory controller outside of the processor. This means that in
a multiprocessor system, it’s possible (and preferred) to have memory
attached to each processor. When processor 0 accesses memory that is owned
by processor 1, for example, there is a small penalty compared to processor
0 accessing its own memory. Thus, for optimal performance, making a best
effort to have thread and memory on the same processor is a good idea.

As far as I understand, Windows Server 2003 has the ability to understand
NUMA and the concept that “this memory belongs to processor X”, and thus,
for instance, when allocating memory, give preference to memory in the range
that is closest to the processor that the code is running on.

For this to work well, you really need to have it working “automagically”,
i.e. the code should not need to know that it runs on a NUMA machine. If it
had to do that, it would mean that 99% of all applications would not
benefit, as they wouldn’t have the right code in them.

Now, some people suggested using SetThreadAffinityMask etc. That’s great if
you want to prevent a thread from switching from one processor to another.
However, it doesn’t really help a whole lot unless it also ties in with the
memory allocation routines. Imagine that we tell Windows that we want to run
on processor 0, and then Windows allocates memory in processor 1’s space.
Also, there is no way (as far as I know) to ask Windows “which processor
does my memory belong to”.

However, SetThreadAffinityMask helps in the case of caching, as it forces a
thread’s cached data (and code) to be kept in that processor’s caches,
rather than moving from one processor to another.

Further, there are two possible memory controller setups for multiple
processor Opteron systems. One is that each processor has its own range of
memory, like this (1GB per processor, quad processor system):

Processor 0: 0x00000000…0x3FFFFFFF
Processor 1: 0x40000000…0x7FFFFFFF
Processor 2: 0x80000000…0xBFFFFFFF
Processor 3: 0xC0000000…0xFFFFFFFF

The other option is to “rotate” every 4KB, so for the first 32K of memory:

Processor 0: 0x00000000…0x00000FFF
Processor 1: 0x00001000…0x00001FFF
Processor 2: 0x00002000…0x00002FFF
Processor 3: 0x00003000…0x00003FFF
Processor 0: 0x00004000…0x00004FFF
Processor 1: 0x00005000…0x00005FFF
Processor 2: 0x00006000…0x00006FFF
Processor 3: 0x00007000…0x00007FFF

This latter setup is for non-NUMA-aware OSes. It ensures that all processors
have “equally bad” [I’m sure AMD has a better term for this :-)] access to
any particular (larger) region of memory, and that at least some of the
memory is held by “your own” processor. Also, if threads are ping-ponged
around between processors, this would be the better memory setup, because
you’ll have no case that is particularly bad.

Finally, I’d like to point out that the overhead for fetching from another
processor than “self” is relatively small. There is an extra latency of the
same size as a “page not open” on a DDR SDRAM controller, which happens
every time your system isn’t accessing a range of memory that has been
accessed very recently (modern SDRAM can have something like 8 pages open at
the same time). But if EVERY access that a particular process does is one
that fetches from another processor, it’s noticeable, especially for
memory-intensive applications.

Hope this helps.


Mats

> Now, some people suggested using SetThreadAffinityMask etc. That’s great if
> you want to prevent a thread from switching from one processor to another.
> However, it doesn’t really help a whole lot unless it also ties in with the
> memory allocation routines. Imagine that we tell Windows that we want to run
> on processor 0, and then Windows allocates memory in processor 1’s space.

HeapCreate() for each CPU, put the heap handle to TLS value, then write your
own Alloc() and operator new() which will allocate from this thread-local heap.

You can even make this heap without a mutex on entry.
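
A minimal sketch of that idea (heap handle kept in TLS, heap created with
HEAP_NO_SERIALIZE so it has no internal lock; the helper names here are made
up for illustration, and error handling is trimmed):

    #include <windows.h>

    static DWORD g_heapTlsIndex;      /* set once with TlsAlloc() at startup */

    void InitHeapTls(void)            /* call before any worker threads start */
    {
        g_heapTlsIndex = TlsAlloc();
    }

    void* ThreadLocalAlloc(SIZE_T bytes)
    {
        HANDLE heap = (HANDLE)TlsGetValue(g_heapTlsIndex);
        if (heap == NULL) {
            /* First allocation on this thread: create an unserialized heap.
               Only this thread ever touches it, so no mutex is needed. */
            heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
            TlsSetValue(g_heapTlsIndex, heap);
        }
        return HeapAlloc(heap, 0, bytes);
    }

    void ThreadLocalFree(void* p)
    {
        /* Valid only if p was allocated by this same thread -- see the
           follow-up discussion about cross-thread frees. */
        HANDLE heap = (HANDLE)TlsGetValue(g_heapTlsIndex);
        HeapFree(heap, 0, p);
    }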

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> As far as I understand, Windows Server 2003 has the ability to understand
> NUMA and the concept that “this memory belongs to processor X”, and thus,
> for instance, when allocating memory, give preference to memory in the range
> that is closest to the processor that the code is running on.

This is conditional on the ACPI BIOS correctly setting up the memory
mappings and marking them, and the processors, with corresponding group
numbers. Given this, NT can build processor-specific memory pools, and
attempt to allocate things from the local pool. Note that there can also be
memory that isn’t local to any processor. Also, NT and ACPI only know about
“near” and “not near”; that is, there isn’t a range of distances, only a
binary choice.

What I don’t recall is if NT will allocate from the processor pool on a
normal allocation if you don’t have any sort of affinity set up. I suspect
it may not necessarily do that. For instance, if you have a user thread
that can run on any processor, and have memory that isn’t “near” any
processor, it might actually make sense to allocate from the far memory. If
there isn’t any completely “not near” memory, then it might make sense to
allocate from the processor pool with the most free space, even if it isn’t
the current processor. After all, the thread isn’t affine to any processor,
so it will likely be running near the allocated memory some of the time, no
matter where it is allocated.

Loren

Max,

> HeapCreate() for each CPU, put the heap handle to TLS value, then write your
> own Alloc() and operator new() which will allocate from this thread-local
> heap.

Are you saying that HeapCreate will automatically figure out that the memory
needs to be allocated in the physical range that is on the current
processor? Because that is what this problem is about.


Mats

> > HeapCreate() for each CPU, put the heap handle to TLS value, then write
> > your own Alloc() and operator new() which will allocate from this
> > thread-local heap.
>
> Are you saying that HeapCreate will automatically figure out that the memory
> needs to be allocated in the physical range that is on the current
> processor? Because that is what this problem is about.

No, it will just divide the allocations to 2 or more huge pools, each touched
by only 1 CPU, so, there will be no cache snoop hits on these blocks, and on
their allocator’s headers.

This works without NUMA hardware too.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> > > HeapCreate() for each CPU, put the heap handle to TLS value, then write
> > > your own Alloc() and operator new() which will allocate from this
> > > thread-local heap.
> >
> > Are you saying that HeapCreate will automatically figure out that the
> > memory needs to be allocated in the physical range that is on the current
> > processor? Because that is what this problem is about.
>
> No, it will just divide the allocations to 2 or more huge pools, each
> touched by only 1 CPU, so, there will be no cache snoop hits on these
> blocks, and on their allocator’s headers.

Yes, that’s fine, except the problem we’re trying to solve isn’t about cache
snoops (although those obviously also need to be taken into consideration).

Let’s assume that we have a 2P machine with 2GB of memory. Processor 0 has
the first GB of memory connected to it, and Processor 1 has the second GB.

If memory is allocated from physical address 0 up, and we have two threads
each allocating a chunk of 5MB, then all of the 10MB is allocated in
processor 0’s memory. When processor 1 reads from this memory, each memory
read will have a penalty of about 40 ns, for an overall access time of about
110 ns instead of the 70 ns that processor 0 sees for the same memory.

Moreover, if the threads are (for argument’s sake) thrashing through this
5MB memory region, processor 0’s memory controller will be very busy
reading, while processor 1’s memory controller is not being used at all,
because all the memory access happens on processor 0.

A much better way would be to allocate thread 0’s memory from processor 0,
and thread 1’s memory from processor 1. That way there are no hold-ups from
waiting out congestion on the other processor’s memory controller, and,
assuming we have thread 0 on processor 0 and thread 1 on processor 1, we
also gain the improvement of lower latency.

But even just spreading allocations across the processors according to some
heuristic will allow the memory controllers to work more efficiently. This
is where the “rotate on 4KB boundaries” scheme comes in very handy, as it
automatically spreads the load between the processors and allows each of the
memory controllers to do an equal share of the memory reads.

> This works without NUMA hardware too.

I see how that is, but it doesn’t actually solve the above problem, as far
as I can see.


Mats

> Let’s assume that we have a 2P machine with 2GB of memory. Processor 0 has
> the first GB of memory connected to it, and Processor 1 has the second GB.

This is NUMA hardware, which is supported starting from XP, and XP has some
documented APIs to work with it.
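
For reference, the documented topology queries look roughly like this (a
sketch only; availability depends on the OS version, and note that these
report which processors belong to each node, not which node a given
allocation landed on):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        ULONG highestNode = 0;

        /* Ask how many NUMA nodes the OS sees; on a plain SMP box this
           reports only node 0. */
        if (!GetNumaHighestNodeNumber(&highestNode)) {
            printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
            return 1;
        }

        for (ULONG node = 0; node <= highestNode; node++) {
            ULONGLONG mask = 0;
            if (GetNumaNodeProcessorMask((UCHAR)node, &mask))
                printf("node %lu: processor mask 0x%llx\n", node, mask);
        }
        return 0;
    }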

With SMP (non-NUMA) hardware, all CPUs are equal in access to any RAM location.

BTW - I will be grossly amused to see some commodity x86 mobo providing NUMA.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Max,

> BTW - I will be grossly amused to see some commodity x86 mobo providing
> NUMA.

Depends a bit on what you mean by commodity. But there are boards on the
market for this.

How about this page then:
http://www.tyan.com/products/html/opteron.html

All of those boards have “NUMA” constructions (well, except for the ones
with ONE cpu, of course). But the fact is of course that the “penalty” for
the “not my CPU” access is pretty small, so you could just look at it as a
2P or 4P SMP system, and ignore the locality of the memory. In this case,
the best option would be to use the “rotate per 4KB” option.


Mats

This can help in some situations. However, please be aware that memory
management becomes much more complex in this situation. Specifically,
you must deal with freeing memory blocks that were not allocated by the
current thread. You can either 1) structure your algorithms so that a
thread only ever manipulates memory allocated by that same thread, or 2)
implement free such that it is aware that a given memory block can come
from multiple different threads.

If you go with 1), then you have some very unpleasant constraints to
deal with. If you go with 2), then you still need the per-heap mutex,
because free() must still be able to interact with the heaps of any
thread.

A better (and much easier) implementation is to continue to use a single
heap, but have thread-local allocation pools. The size of the buffers
in these pools, and the max count of each pool, depends on your
application. Each thread can interact with its thread-local pool
without any synchronization, and when necessary can return memory to the
real heap. This is a far simpler design, and gets you 99% of the
performance gain of the multiple-heap scenario, with far less
(bug-prone) complexity.
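
A rough sketch of that shape, with placeholder sizes (the real block size
and cache depth depend entirely on the application; all names here are
illustrative):

    #include <windows.h>

    #define POOL_BLOCK_SIZE  256   /* fixed block size served from the pool */
    #define POOL_MAX_BLOCKS  64    /* max blocks cached per thread */

    typedef struct LocalPool {
        void* free_list[POOL_MAX_BLOCKS];
        int   count;
    } LocalPool;

    static DWORD g_poolTls;        /* TlsAlloc()'d once at startup */

    static LocalPool* GetPool(void)
    {
        LocalPool* pool = (LocalPool*)TlsGetValue(g_poolTls);
        if (pool == NULL) {
            pool = (LocalPool*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY,
                                         sizeof(*pool));
            TlsSetValue(g_poolTls, pool);
        }
        return pool;
    }

    void* PoolAlloc(void)
    {
        LocalPool* pool = GetPool();
        if (pool->count > 0)                 /* fast path: no synchronization */
            return pool->free_list[--pool->count];
        /* fall back to the real (shared, serialized) heap */
        return HeapAlloc(GetProcessHeap(), 0, POOL_BLOCK_SIZE);
    }

    void PoolFree(void* p)
    {
        LocalPool* pool = GetPool();
        /* Every block ultimately comes from the shared process heap, so it
           is safe to cache a block here even if another thread allocated it. */
        if (pool->count < POOL_MAX_BLOCKS)
            pool->free_list[pool->count++] = p;
        else
            HeapFree(GetProcessHeap(), 0, p);   /* overflow back to the heap */
    }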

Mats Petersson’s description of ccNUMA is certainly useful, and shows
how complex the topic is.

HeapCreate/HeapAlloc definitely does not have any knowledge of memory
regions relevant to ccNUMA. It may assist with cache locality – it
would prevent cache lines from bouncing back and forth between
processors. That, coupled with the “small” NUMA penalty, may be all the
benefit the OP needs. (That comment applies to either approach –
thread-local pools or entire thread-local heaps. However, I’ve worked
on apps that use thread-local heaps for just that reason, and they tend
to be more difficult to get right.)

– arlie


It’s more or less clear that the higher the locality, the higher the
performance. It matters whether thread A uses local or remote memory: remote
memory is slower even when it belongs to thread A. I personally much prefer
to hide this whole morass behind something like MPI!

For a real nice implementation of ccNUMA over Windows, take a look at

http://www.research.ibm.com/journal/rd/452/brock.html

where they took four 4x SMPs and extended WinNT to make it into a
16-processor partitionable ccNUMA system.

Alberto.


Of all the responses to this, so far Loren’s has been closest to the facts.
I’ll just add a couple of things.

1) Not all AMD64 machines provide the NUMA information in their ACPI BIOS.
In fact very few do, and this is intentional. You have a choice when you set
up your memory controllers. You either stripe the memory between the two (or
more) processors, usually on cache-line boundaries, or you set the memory up
into contiguous chunks. Which one performs better depends a lot on the
workload and on the latency added by getting your data from the “far”
processor(s). Current Opteron implementations add very little latency for
“far” processors, so the value that you can get from running the NUMA code
in Windows Server 2003 on them is small. The value you get from striping, on
the other hand, can be very large. The choice is up to the BIOS provider.

2) Windows Server 2003, when running on a machine with NUMA information in
its BIOS, will do two things that are of interest to this thread. First, it
will attempt to generally keep threads on one node or another. And second,
it will automatically attempt to allocate memory that is local to the
processor that the allocation was done from. So, if you have a four
processor Opteron system, running as a NUMA machine, we might start a thread
on processor 2, keep it there, and all the memory allocations that that
thread makes will come from pool that is close to processor 2 (see the
sketch below).
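
For illustration, a sketch of leaning on that behavior from user mode on a
NUMA-reporting box (node number chosen arbitrarily, error handling trimmed;
none of this is needed, or useful, on a machine that reports a single node):

    #include <windows.h>

    /* Pin the current thread to the processors of one NUMA node, so that,
       per the behavior described above, the allocations it makes tend to be
       satisfied from that node's local memory. Note the mask is truncated
       to DWORD_PTR on 32-bit builds. */
    int PinCurrentThreadToNode(UCHAR node)
    {
        ULONGLONG nodeMask = 0;
        if (!GetNumaNodeProcessorMask(node, &nodeMask) || nodeMask == 0)
            return 0;
        return SetThreadAffinityMask(GetCurrentThread(),
                                     (DWORD_PTR)nodeMask) != 0;
    }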

(There are lots more details and corner cases to this stuff, particularly
when one node gets busy and you’d like to migrate part of its workload to
another node. But I don’t remember the details off the top of my head. And
I’m sure that you’ll all argue about the relative points until Alberto jumps
in, at which point the thread will probably die.)


Jake Oshins
Windows Base Kernel Team

This posting is provided “AS IS” with no warranties, and confers no rights.


Jake,

thanks for the useful information. Since I worked with the Linux guys rather
than MS when I worked at AMD, I didn’t know exactly how and what you’d
implemented in the OS.

As regards the “striping”, it’s actually not cache-line boundaries but 4KB
blocks that get “striped”. The way this works is simply that bits 12..14 of
the physical address are used as an index into which processor to ask for
the memory from. Obviously this is wrapped with an “index &
(no_of_processors - 1)” so that you don’t ask for memory on a non-existent
processor. It’s also, obviously, not allowed if the number of processors is
not a power of two (but then I don’t think it works too well to do that
anyway…).
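
Expressed as code, the lookup might look like this (illustrative only, under
the assumptions just described):

    #include <stdint.h>

    /* Which processor's memory controller owns a physical address under the
       4KB interleave described above. Assumes a power-of-two processor count
       (so the index can simply be masked) and an interleave index starting
       at bit 12. */
    unsigned OwningProcessor(uint64_t physAddr, unsigned numProcessors)
    {
        return (unsigned)((physAddr >> 12) & (numProcessors - 1));
    }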


Mats


Ok, let’s kill the thread. :-)

I refer you back to that IBM article I mentioned in my post yesterday;
there’s lots of information there on how to do it, and on different
approaches to partitioning and to routing of threads and processes. They
managed to turn four 4-way Xeon SMPs into a 16-processor NUMA system. But
they had to add hardware to provide remote memory and I/O access, and also
to route interrupts beyond the four local processors. They also had to frig
the HAL. They did OS work to support that system in Linux too, and there are
some interesting comparisons between the two systems in that paper.

Alberto.


Actually, a question for Jake. I read that SRAT piece of paper, and I’m not
too sure I understood. Say I have four SMPs connected through a switch, and
each SMP has its own BIOS. Say I want to have one node per SMP machine. I
assume that one can have four different configurations on the four nodes, so
that, for example, local memory always starts at location 0? And how would
I/O and interrupts be handled by Win2003? Am I right in saying that an SRAT
is basically a node’s view of the system, or does the system assume there’s
only one SRAT for all processors? Also, did you guys put in any support for
partitioning the system?

Alberto.


I wish killing were that easy for any object/subject :-). Then again, there
are people fighting to avoid killing some services and such …

BTW, the article is interesting, though the experience with Linux shows the
moving-target problem. The OP was looking for control without mucking around
with the kernel and/or HAL as such, so it depends on where the playing field
is (I mean, at what level of control). Just to know where all this started,
maybe Leslie Lamport’s paper on sequential consistency would be good to look
at!

-prokash


NUMA machines (at least the ones that can run commodity OSes) are still
symmetric with respect to everything except latency. That means that the
physical address of any region of memory is the same no matter which
processor you use to fetch it. This implies that only one node can have
local memory starting at physical location 0. The SRAT covers the whole
system, or at least an entire partition.

I/O and interrupts are the same story. Any I/O operation can be done on any
processor, and any interrupt can be serviced on any processor. (Some
machines cheat with the interrupts, including that x440 from IBM that you
referenced in your other posts. There’s one of those a couple of offices
down from me, by the way, and the HAL for it was written by a couple of
friends of mine from when I worked at IBM.)

Whether you actually want to do I/O from a non-local node is another story.
We’ve been working on whether it really matters. If you put
intelligent-enough I/O controllers into the system, they mostly operate
through DMA anyhow, which minimizes the I/O problem.

As for partitioning, it’s on our roadmap for the future. Most of the NUMA
machines are statically partitionable – e.g. you can bring up multiple OS
instances on them, but you have to shut the whole thing off if you want to
change the partition boundaries. That can be done without OS help. But
Dynamic Partitioning is another story, and much, much harder. We’ve added
most of the interfaces that we’ll need to ACPI 2.0 and we’ll probably get
the rest into ACPI 3.0. Exactly what we’ll productize and when isn’t
something that I can discuss without an NDA.


Jake Oshins
Windows Base Kernel Team

This posting is provided “AS IS” with no warranties, and confers no rights.
