Microkernel discussion (SSDT on FSD continued)

Peter,

But, sadly, you’re right - I think the time has come to end the SSDT thread.
It saddens me to do so, though; I’m kinda interested in microkernel architectures.

Therefore, let’s start another thread - I am quite interested in this as well…

Consider the model where an application doesn’t allocate the buffer in an arbitrary location,
but rather allocates it from a central pool of such buffers managed by the system.
This type of scheme has the advantage that you can enforce constraints such as “one owner at a time”… all very clever stuff, and the data is effectively “pre-marshaled”
and can be passed around without copying.

IMHO, currently known microkernel implementations, with all their ports, pipes, messages, etc., rely upon a concept that is flawed in itself, and trying to improve it is a pretty much useless exercise. I think that what we really need in order to design a high-performance microkernel is simply to rethink the concepts of the process/thread and of the address space (ironically, the former part has already been done in Linux, so microkernels may take advantage of it).

First of all, let’s ask ourselves a question - assuming that the address space is shared, can module X make any use of the fact that it has full access to module Y’s code and data,
provided that they are not in the same compilation unit and the code/data in question is not exported by module Y? Obviously not… Therefore, even if we unmap module Y’s code and data from the target address space while module X runs, we are not going to limit the latter’s functionality in any way (apart, of course, from the ability of modules to register callbacks directly with one another).

Every schedulable execution unit in existence can have its own page directory and descriptor table. Therefore, you can think of threads just as independent processes that happen to share the address space with their parent for read-write access (i.e. the way Linux thinks of threads), only because their PDEs and PTEs are set up this way. However, modifying
thread X’s PDE is not going to affect thread Y’s in any way.
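
Just to make that last point concrete, here is a minimal user-mode sketch (Linux-specific, using clone(2)); it is not part of the model itself, it only shows the existing mechanism - a “thread” is just a task created with CLONE_VM, i.e. a schedulable unit that shares its parent’s page tables, and dropping the flag from the very same call gives you a classic process instead:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared_counter = 0;

static int worker(void *arg)
{
    shared_counter++;   /* visible to the parent only because CLONE_VM was used */
    return 0;
}

int main(void)
{
    enum { STACK_SIZE = 64 * 1024 };
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return 1;

    /* CLONE_VM: the child shares our page tables; remove the flag and the
       child gets a copy-on-write duplicate instead - same API, different
       sharing of the address space. */
    pid_t pid = clone(worker, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
    if (pid == -1)
        return 1;

    waitpid(pid, NULL, 0);
    printf("shared_counter = %d\n", shared_counter);   /* prints 1 */
    free(stack);
    return 0;
}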

From now on, I am going to refer to any schedulable execution unit with the term “task”, and
to the code that it currently runs with the term “module”. Every task can have its address space subdivided into a module area, the task’s private area and the kernel area, with all these areas residing at pre-defined base addresses. Only one module may be mapped into a given task’s PDE at any particular moment (a rough sketch of this layout, in C, follows the list below).

  1. Module area. This area holds the currently running module’s code and data (including dynamic allocations and mappings of shared memory segments), and is shared by all tasks that currently run a given module - which may mean either multiple requests being processed by a driver/server module, or multiple threads of a multithreaded application’s module. Therefore, all tasks that execute a given module’s code have to synchronize their accesses to the module’s global/static data. Since they share the same address space, they also happen to share semaphore indexes, so there is no problem here whatsoever. If the module in question happens to be a device driver, device memory may be mapped into this area.

  2. Task’s private area. This area holds the call stack, plus the IO buffers that server and client receive upon read()/write()/ioctl() and mmap() requests, respectively.

  3. Kernel area.
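
To make these areas a bit more tangible, here is the rough sketch I promised above. All the names and base addresses are made up purely for illustration; on x86_64 each area could simply occupy its own top-level page-directory slot:

#include <stdint.h>

#define MODULE_AREA_BASE   0x0000000000000000ULL  /* code/data of the module the task currently runs */
#define TASK_PRIVATE_BASE  0x0000008000000000ULL  /* call stack plus IO buffers handed to this task  */
#define KERNEL_AREA_BASE   0xFFFF800000000000ULL  /* kernel area, shared by everyone                 */

struct module {
    uint64_t pml4_slot_value;   /* top-level page-directory entry describing this module's area */
    /* ... semaphores, count of tasks currently executing the module, etc. ...                  */
};

struct task {
    uint64_t      *pml4;        /* this task's own top-level page directory                     */
    struct module *current;     /* the only module mapped into its module area at the moment    */
    /* ... call stack in the task-private area, caller stack, scheduling state, etc. ...        */
};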

Please note that apps/servers/drivers are totally unaware of the very existence of the module and task-private areas - the only one who knows about them is the kernel itself.

What has to be done by the kernel when app/server/driver X calls server/driver Y under this model? Not that much. For example, let’s look at what the kernel has to do when processing read/write/ioctl calls:

  1. Check whether the IO buffers are in the module area and, if they are, map them into the task-private area and record this fact - we will need it upon returning. Please note that they may already be resident in the task-private area (for example, be passed on the stack), in which case no mapping is needed.

  2. Replace the module area in the task’s page directory with module Y’s one, and forward execution to module Y’s code (the addresses of the IO buffers that module Y receives upon this call are already in the task-private area).

  3. After module Y returns control, restore the module area in the task’s page directory so that it refers to module X’s area again. If we had to map IO buffers into the task-private area before calling module Y (this is why we recorded it), unmap them.

  4. Return to the original caller.

As you can see, the whole thing involves just updating a few pointers in the target task’s page directory (the call chain may be pretty long, so we will have to record callers in a stack-like fashion).
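
Here is roughly what the four steps above could look like in code. This is a hypothetical sketch rather than working kernel code - the helper functions (in_module_area(), map_into_task_private(), flush_tlb_current_task() and so on) are simply assumed to exist and are declared only so the intent is clear, and struct task / struct module are the ones from the layout sketch above:

#include <stddef.h>
#include <stdint.h>

struct call_frame {
    struct module *caller;       /* module area to restore on return            */
    void          *remapped_buf; /* non-NULL if step 1 had to remap the buffer  */
};

/* Hypothetical helpers - declared here only so the sketch reads as real code. */
int   in_module_area(const void *p);
void *map_into_task_private(struct task *t, void *p, size_t len);
void  unmap_from_task_private(struct task *t, void *p, size_t len);
void  push_call_frame(struct task *t, struct call_frame *f);
void  pop_call_frame(struct task *t);
void  flush_tlb_current_task(void);

#define MODULE_AREA_SLOT 0       /* top-level slot that covers the module area  */

static long forward_call(struct task *t, struct module *y,
                         void *buf, size_t len,
                         long (*y_entry)(void *, size_t))
{
    struct call_frame frame = { .caller = t->current, .remapped_buf = NULL };

    /* 1. If the IO buffer lives in the caller's module area, remap it into
          the task-private area and remember that we did so.                   */
    if (in_module_area(buf)) {
        buf = map_into_task_private(t, buf, len);
        frame.remapped_buf = buf;
    }
    push_call_frame(t, &frame);  /* call chains may be long - keep a stack     */

    /* 2. Point the module-area slot of this task's page directory at Y and
          enter Y's code; the buffer address is already task-private.          */
    t->pml4[MODULE_AREA_SLOT] = y->pml4_slot_value;
    t->current = y;
    flush_tlb_current_task();
    long ret = y_entry(buf, len);

    /* 3. Restore module X's area and drop the temporary mapping, if any.      */
    t->pml4[MODULE_AREA_SLOT] = frame.caller->pml4_slot_value;
    t->current = frame.caller;
    flush_tlb_current_task();
    if (frame.remapped_buf)
        unmap_from_task_private(t, frame.remapped_buf, len);
    pop_call_frame(t);

    /* 4. Return to the original caller.                                       */
    return ret;
}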

If you think about it carefully, you will realize that basically it is still the same monolithic kernel
(at least in terms of performance), because we don’t need to wait for port messages, copy data around or block execution - everything happens in the same address space and in the context of the same task. At the same time, it is a microkernel, because all drivers/servers are isolated from one another. In the above example, module Y has no chance whatsoever to screw up module X’s
code and data, because module X’s area is unmapped from the task’s address space while module Y executes its code, which is particularly useful if module X happens to be a driver with device memory mapped into its module area.

Anton Bassov

wrote in message news:xxxxx@ntdev…

> First of all, let’s ask ourselves a question - assuming that the address
> space is shared, can module X make any use of the fact that it has a full
> access to module Y’s code and data,
> provided that they are not in the same compilation unit and code/data in
> question is not exported by module Y??? Obviously, not… Therefore, even
> if we unmap module Y’s code and data from the target address space while
> module X runs, we are not going to limit the latter’s functionality in any
> possible way( of course apart from the ability of modules to register
> callbacks directly with one another).

Anton, what you describe here reminds me of the ancient Intel segmented model.
There was a certain polymorphism in the instruction set:
the call instruction could call a normal subroutine, or a “call gate” that is
like a handle into another task’s space (so you can call it without
seeing its address space).

–PA

Pavel,

> Anton, what you describe here reminds me of the ancient Intel segmented model.

In the days of 16-bit computing it would have been practically impossible to implement the thing I have described above purely in software, without additional support from the hardware. This is what
the segmented memory model was meant to address. In practical terms, if you make the above model rely upon segmentation, your code would be just a nightmare to implement and maintain.

However, times do change. Just to give you an idea, when x86_64 runs in 64-bit mode a single entry in the top-level page directory describes 512G(!!!) of the virtual address space. Therefore, all you have to do in order to implement the above model on x86_64 is to update a few entries in the PDE (for practical purposes, one seems to be more than enough - 512G seems to be sufficient, don’t you think?)…
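
The arithmetic behind the 512G figure is simply that bits 47:39 of a 64-bit virtual address select the top-level (PML4) entry, so each entry spans 2^39 bytes; a trivial check:

#include <stdint.h>
#include <stdio.h>

static unsigned pml4_index(uint64_t va)
{
    return (unsigned)((va >> 39) & 0x1FF);      /* 9 bits -> 512 entries */
}

int main(void)
{
    uint64_t span = 1ULL << 39;                 /* bytes covered by one PML4 entry */
    printf("one PML4 entry spans %llu GB\n",
           (unsigned long long)(span >> 30));   /* prints 512 */
    printf("slot of 0x0000000000000000: %u\n", pml4_index(0));
    printf("slot of 0x0000008000000000: %u\n",
           pml4_index(0x0000008000000000ULL));  /* the very next slot, i.e. 1 */
    return 0;
}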

> There was a certain polymorphism in the instruction set: the call instruction could call a normal subroutine, or a “call gate” that is like a handle into another task’s space
> (so you can call it without seeing its address space).

Actually, a call gate is just a form of the so-called “far call”, i.e. a call to a different segment (indeed, x86 has a special opcode for it). The purpose of a call gate is to allow inter-privilege calls. As long as the caller and the callee run at the same privilege level, a direct far call will suffice - in this case the segment selector that you specify upon the call may refer to the descriptor of the callee’s code segment itself. However, if the callee is more privileged than the caller, the segment selector that you specify upon the call has to refer to the call gate descriptor, rather than to that of the callee…
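
For the curious, this is roughly what a 32-bit call gate descriptor looks like in memory (field layout as documented in the Intel manuals; the C bit-field packing below assumes a GCC-style little-endian compiler, so treat it as an illustration rather than something to drop into real code):

#include <stdint.h>

struct call_gate_desc {
    uint16_t offset_low;        /* target offset, bits 15:0                               */
    uint16_t selector;          /* code-segment selector of the (more privileged) callee  */
    uint8_t  param_count : 5;   /* dwords copied from the caller's stack to the callee's  */
    uint8_t  reserved    : 3;
    uint8_t  type        : 4;   /* 0xC = 32-bit call gate                                 */
    uint8_t  s           : 1;   /* 0 = system descriptor                                  */
    uint8_t  dpl         : 2;   /* least privileged CPL allowed to call through the gate  */
    uint8_t  present     : 1;
    uint16_t offset_high;       /* target offset, bits 31:16                              */
} __attribute__((packed));

/* A far call whose selector refers to such a descriptor transfers control to
   selector:offset and may raise the privilege level; a far call whose selector
   refers directly to a code-segment descriptor cannot.                          */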

Anton Bassov

wrote in message news:xxxxx@ntdev…

> In the days of 16-bit computing it would be practically impossible to
> implement the thing that I have described above purely in the software
> without the additional support from the hardware. This is what
> segmented memory model was meant to address. In practical terms, if you
> make the above model rely upon segmentation, your code would be just a
> nightmare to implement and maintain.
>

Why 16-bit? I meant the 32-bit segmented model vs. the 32-bit flat address space (+ paging).
In that model, segments served to isolate processes/tasks, not to extend the address space beyond 16 bits.

> However, times do change. Just to give you an idea, when x86_64 runs in
> 64-bit mode a single entry in top-level page directory describes
> 512G(!!!) of the virtual address space. Therefore, all you have to do in
> order to implement the above model on x86_64 is to update few entries in
> PDE ( for the practical purposes, one seems to be more than enough - 512 G
> seems to be sufficient, don’t you think)…
>
>
>> There was certain polymorphism in the instruction set: the call
>> instruction could call a normal > subroutine, or a “call gate” that is
>> like a handle into another task’s space
>> (so you can call it without seeing it’s address space).
>
> The purpose of a call gate is to allow inter-privilege calls. As long as
> caller and callee run at the same privilege level, a direct far call to
> the task will suffice - in this case segment selector that you specify
> upon the call may refer to the descriptor of the task itself. However, if
> callee is more privileged than the caller, segment selector that you
> specify upon the call has to refer to the call gate descriptor, rather
> than that of the callee task…

It is only an example of how you can let one module call another without
exposing their whole address spaces to each other.
Yes, times do change. Maybe it is not worth implementing things like
inter-task calls and messages in the CPU microcode any longer,
but once-abandoned ideas tend to return.

–pa

> Why 16-bit? I meant the 32-bit segmented model vs. the 32-bit flat address space (+ paging).
> In that model, segments served to isolate processes/tasks, not to extend the address space beyond 16 bits.

Please note that the linear address is always calculated as ‘segment base + offset’, regardless of CPU mode.
If the CPU runs in protected mode with paging enabled, it first calculates the linear address, which is virtual, and then translates it into the physical one via PDE/PTE. Therefore, as long as tasks share the same PDE, you cannot make the same virtual address refer to different physical addresses for different tasks, no matter whether you use the flat memory model (i.e. map all segment bases to virtual address 0 with a 4G segment limit) or the segmented one. As a result, your chances of running different tasks in the same address space are dramatically limited if you rely only upon segmentation to isolate them from one another. In order to run different tasks in the same address space, you either have to update the PDE for every task, or just give every task its own PDE. In both cases, it defeats the purpose of using segmentation.
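
Here is a tiny user-mode illustration of that two-stage translation (32-bit, non-PAE page-table indices, just to keep the arithmetic simple): whatever segment base you start from, once two accesses end up at the same linear address and the page directory is shared, they hit the same PDE/PTE and therefore the same physical page:

#include <stdint.h>
#include <stdio.h>

static uint32_t linear_address(uint32_t seg_base, uint32_t offset)
{
    return seg_base + offset;                   /* segmentation stage */
}

int main(void)
{
    /* Two different segment bases that happen to yield the same linear address. */
    uint32_t la1 = linear_address(0x00400000, 0x00001000);
    uint32_t la2 = linear_address(0x00000000, 0x00401000);

    /* Paging stage: top 10 bits select the PDE, next 10 bits the PTE. */
    printf("la1=%#010x la2=%#010x -> PDE %u, PTE %u for both\n",
           la1, la2, la1 >> 22, (la1 >> 12) & 0x3FF);
    return 0;
}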

Therefore, major OSes use segments only in order to separate privileged code from non-privileged code.
Kernel and user code reside in different code segments that are mapped to a base address of zero with a 4G segment limit, so they refer to exactly the same address space (the data segment is the same for both kernel and user modes). Kernel pages are marked as supervisor-only in their PTEs. Therefore, if a page is marked as supervisor-only and code running in the user code segment accesses it, it will cause an access violation, but if exactly the same piece of code at exactly the same virtual address runs in a privileged segment, it can access supervisor-only pages without the slightest problem. This is what the trick of entering the kernel from your program via a call gate is based upon…
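
The check involved is nothing more than the user/supervisor bit in the PTE versus the current privilege level - a minimal sketch of the decision (PTE bit positions as documented for 32-bit paging):

#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT (1u << 0)
#define PTE_WRITE   (1u << 1)
#define PTE_USER    (1u << 2)   /* clear = supervisor-only, set = user-accessible */

static int access_allowed(uint32_t pte, unsigned cpl)
{
    if (!(pte & PTE_PRESENT))
        return 0;                           /* not mapped at all                 */
    if (cpl == 3 && !(pte & PTE_USER))
        return 0;                           /* user-mode access to a kernel page */
    return 1;
}

int main(void)
{
    uint32_t kernel_pte = PTE_PRESENT | PTE_WRITE;   /* supervisor-only page */

    printf("CPL 3: %s\n", access_allowed(kernel_pte, 3) ? "ok" : "fault");  /* fault */
    printf("CPL 0: %s\n", access_allowed(kernel_pte, 0) ? "ok" : "fault");  /* ok    */
    return 0;
}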

> It is only an example of how you can let one module call another without exposing their
> whole address spaces to each other.

As you can see, as long as we are speaking about a 32-bit address space, this is not the right example. However, it is the right example if we speak about 16-bit code - as far as the CPU is concerned, these 16-bit segments are totally unrelated to one another. Therefore, you can use them to isolate tasks from one another if you run in 16-bit mode…

> Maybe it is not worth implementing things like inter-task calls and messages in the
> CPU microcode any longer,

Apparently not, since all major OSes re-implement it in software anyway - especially taking into consideration that giving them a 64-bit address space to play around with (actually, just a 48-bit one on x86_64, but even a 48-bit address space seems to be more than sufficient) simplifies their task dramatically…

Anton Bassov