Hi Guys,
I need to understand when thread migration happens. I understand that thread migration is completely up to the OS and is done in a few cases, such as keeping the load balanced across cores. But does thread migration also happen to preserve data locality? For example:
Assume a NUMA system with two sockets, each socket with 4 cores: cores 0-3 on socket 0 and cores 4-7 on socket 1.
There is a device with submission queues and completion queues. There are 8 completion queues, and each completion queue is bound to its own MSI-X message; there are 8 MSI-X messages, one per completion queue. MSI-X 0-3 are bound to cores 0-3 and MSI-X 4-7 are bound to cores 4-7.
In this way, each completion queue is effectively processed on/bound to a particular core.
A process starts on core 0 and starts submitting data buffers to the device. The completions for these commands are delivered on completion queue 7 (a different socket and a different core). There is no data locality here: since the process is running on core 0, all the completion data has to be moved into its process address space, which means accesses to memory on a different socket.
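To make the remote access concrete, here is a minimal sketch of how I imagine checking it from user mode. GetNumaProcessorNodeEx is the documented API, but the core numbers 0 and 7 and the NodeForCore helper are just assumptions from my example, not a real topology:

```cpp
// Sketch: check which NUMA node the submitting core (0) and the core
// handling completion queue 7 (core 7, per the example above) belong to.
#include <windows.h>
#include <stdio.h>

// Hypothetical helper for this example: map a core number to its NUMA node.
static USHORT NodeForCore(BYTE core)
{
    PROCESSOR_NUMBER pn = {};
    pn.Group  = 0;      // assuming a single processor group
    pn.Number = core;
    USHORT node = 0;
    if (!GetNumaProcessorNodeEx(&pn, &node))
        return 0xFFFF;  // query failed
    return node;
}

int main(void)
{
    USHORT submitNode     = NodeForCore(0); // where the process runs
    USHORT completionNode = NodeForCore(7); // where MSI-X 7 is bound
    printf("submit node=%u, completion node=%u -> %s\n",
           submitNode, completionNode,
           submitNode == completionNode ? "local" : "remote access");
    return 0;
}
```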
Is this a good case for process migration from core 0 to core 7?
I know I must have ignored a lot in this simple example, but please direct me to some sources where these types of things are explained.
Any whitepapers, publications, etc. (or any chapters in the Windows Internals book?) would be great pointers.
This is a very complex subject and there are no hard and fast rules. It was hard enough with classic SMP, and NUMA makes it much harder. The latest generation of CPUs has fast and slow cores, and that makes it much more complex again.
You should not think of it as thread migration, but rather as pre-emption. A thread is assigned to run on a CPU until it is preempted or its quantum runs out and another ready-to-run thread takes over that core. If it remains ready to run and a new core becomes available, the scheduler might decide to assign it there and start it running.
And the APIs that exist do not allow much insight or control. In KM it is even worse, because you are dependent on how the hardware routes interrupts and maps device memory - something which may not correspond at all with how the scheduler assigns threads to CPUs. AFAIK no major operating system has a good solution for these kinds of problems, and academic research is very much lacking.
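To illustrate just how limited the user-mode control is, here is a sketch of the two main knobs. Both are real Win32 APIs, but whether the scheduler honors the hint is another matter, and the core number 7 is only taken from the original example:

```cpp
// Sketch: the two main user-mode placement knobs. Neither gives you
// any insight into what the scheduler actually does.
#include <windows.h>

int main(void)
{
    // Hard affinity: this thread may ONLY run on core 7 (mask bit 7).
    // If core 7 is busy, the thread waits - this can hurt more than it helps.
    SetThreadAffinityMask(GetCurrentThread(), 1ull << 7);

    // Soft hint: ask the scheduler to PREFER core 7. It is free to ignore it.
    PROCESSOR_NUMBER ideal = {};
    ideal.Group  = 0;
    ideal.Number = 7;
    SetThreadIdealProcessorEx(GetCurrentThread(), &ideal, NULL);

    // ... submit I/O and process completions here ...
    return 0;
}
```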
The issues of thread scheduling on fast and slow cores reminded me of my past. For what it's worth, in grad school at UIUC in 1990, I worked a bit on CHOICES, which was an object-oriented OS. It supported a bunch of different CPUs and architectures (the most interesting being the Intel HyperCube). But the interesting bit here is that not only did it have process migration between CPUs of the same machine, it actually had process migration between systems! There are of course all sorts of caveats about whether such migration was possible (same CPU instructions supported, etc.), but it did work.
This is a topic that I am very interested in, but it is very complex. That, combined with my lack of time, means that my best advice is to not worry about it and hope the OS will do a good enough job.
Some large programs (SQL Server) detect the presence of processor groups and NUMA nodes and affinitize different workloads accordingly. This mostly works for CPU-bound work accessing what should be 'local' memory. Paging or delayed commit might move the memory to a different NUMA node or group, of course, but it is a reasonable optimization. But optimizing I/O locality is very difficult, because most IOPS require the network stack, the file system, or some other driver stack code to do work between the UM code and the driver that actually handles the hardware. It is very hard to predict how that might behave, because those things have both code and state - as well as cached data or offload engines - that have to be shared between many processes that run in many different contexts. If it is a custom IOCTL, you can do somewhat better by detecting the machine topology using the SetupDi functions to determine appropriate thread affinity - but the results are nebulous at best.
And all that is before considering hyper-threading or fast/slow cores.
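As a rough illustration of the topology-detection half of that, here is a sketch in the style of what those large servers do. Note I am using GetLogicalProcessorInformationEx here rather than the SetupDi route, and the "bind to node 0" policy is made up for the example:

```cpp
// Sketch: enumerate NUMA nodes and affinitize the current thread to one
// of them. Error handling trimmed for brevity.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationNumaNode, NULL, &len);
    BYTE* buf = (BYTE*)malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(
            RelationNumaNode, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
        return 1;

    for (DWORD off = 0; off < len; ) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buf + off);
        printf("node %lu: group %u mask 0x%llx\n",
               info->NumaNode.NodeNumber,
               info->NumaNode.GroupMask.Group,
               (unsigned long long)info->NumaNode.GroupMask.Mask);

        // Made-up policy for this example: bind this thread to node 0.
        if (info->NumaNode.NodeNumber == 0)
            SetThreadGroupAffinity(GetCurrentThread(),
                                   &info->NumaNode.GroupMask, NULL);
        off += info->Size;
    }
    free(buf);
    return 0;
}
```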
There really is too much here to cover in some forum posts, but Windows NT is a preemptive multi-tasking operating system. The basic unit of scheduling is a thread. Each process has one or more threads, and the basic states for each are running, ready to run, or suspended. Running means that its instruction stream is currently being executed by some CPU. Ready to run means that the thread wants CPU time, but the scheduler has not assigned it to a CPU. And suspended means that the thread is waiting on some scheduler object - a sleep or a wait object. Most threads spend most of their time in the suspended state, and CPU usage as reported by Task Manager indicates the percentage of time that the scheduler managed to find a thread to occupy the available CPUs.
Context switches happen when a different thread is assigned to a CPU. This can happen when a thread re-enters the scheduler (entering a sleep or a wait) or via pre-emption. Pre-emption happens when there is another thread in the ready-to-run state and this thread has used up its quantum (the maximum time that a thread can run before the scheduler will check whether there are others waiting for the CPU it is currently using).
When a thread is rescheduled, it might be rescheduled on the same CPU, or it might be on a different one. This is entirely in the purview of the scheduler, and the thread that was rescheduled does not even know that it has happened. This is sort of like 'migration', but it isn't, because most of the time most threads are not preempted - think about how often your machine runs near 100% CPU usage for an extended period. That's why migration is a bad way to describe this process.
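If you want to watch this happening, a thread can ask which CPU it is currently on. Here is a sketch - GetCurrentProcessorNumberEx is the real API, but the polling loop is just for demonstration, and on a lightly loaded machine you may see very few changes:

```cpp
// Sketch: observe a thread being rescheduled onto different CPUs over time.
// The thread only learns its current CPU because it explicitly asks.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    PROCESSOR_NUMBER last = {}, now = {};
    GetCurrentProcessorNumberEx(&last);
    printf("started on group %u cpu %u\n", last.Group, last.Number);

    for (int i = 0; i < 1000; i++) {
        Sleep(10); // re-enter the scheduler; we may resume on another CPU
        GetCurrentProcessorNumberEx(&now);
        if (now.Group != last.Group || now.Number != last.Number) {
            printf("rescheduled: group %u cpu %u -> group %u cpu %u\n",
                   last.Group, last.Number, now.Group, now.Number);
            last = now;
        }
    }
    return 0;
}
```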