Reading registry during boot, for hostid.

anton_bassov · May 16, 2019, 3:23am

My guess is that not all of the Registry is loaded and ready to be used during boot.

Of course - the Registry is physically stored in certain files in a system folder. Therefore, you have to wait until both the whole storage stack and NTFS are up and running if you want to make the full use of it.

My driver uses
ServiceType = 1
StartType = 1
ErrorControl = 1
LoadOrderGroup = “File System”

Well, the very first question that gets into my head is why one wants his driver to be loaded at boot time, considering that this driver neither controls any hardware device nor implements a system service that may be essential for booting, and does not have any dependent drivers or services either. Change this part, and the problem will go away by itself…

Anton Bassov

Jorgen_Lundman · May 16, 2019, 5:07am

Booting Windows on ZFS will be something to work on soon.

But, on the original question and lack of replies, I think I will just use the driver’s registry and generate a hostid instead of taking advantage of existing information.

anton_bassov · May 16, 2019, 7:17am

Booting Windows on ZFS will be something to work on soon.

BTW, I had a brief look at your project. Sorry, but things that you do in your code seem to be just horrible

For example, look at the snippet from spl_mutex_enter()

https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/spl/module/spl/spl-mutex.c

    if (mp->m_owner == current_thread())
        panic("mutex_enter: locking against myself!");

    #ifdef DEBUG
    if (*((uint64_t *)mp) == 0xdeadbeefdeadbeef) {
        panic("SPL: mutex_enter");
    }
    #endif
    //lck_mtx_lock((FAST_MUTEX *)&mp->m_lock);
    ExAcquireFastMutex((FAST_MUTEX *)&mp->m_lock);
    //KeWaitForSingleObject((KMUTEX *)&mp->m_lock, Executive, KernelMode, FALSE, NULL);
    mp->m_owner = current_thread();

    // Windows increases irql in fastmutex, this is not how
    // we want to use mutex with unix
    // We should research and check if ExAcquireResourceExclusiveLite() is better for this
    KeLowerIrql(PASSIVE_LEVEL);

You have arbitrarily(!!!) lowered IRQL from APC_LEVEL (which prevents the code from re-entering itself by means of disabling APC delivery to the target thread) down to PASSIVE_LEVEL, effectively enabling APC delivery. Now consider what happens if an APC gets delivered to the thread that owns a mutex, and it tries to acquire this mutex recursively…

To be honest, after having seen this “engineering feat” (and particularly the explanation of your reasoning behind it) I immediately lost my interest in looking any further…

Anton Bassov

[MODS: Edit for clarity, because Anton needs to learn how to use Markdown.]

Jorgen_Lundman · May 16, 2019, 7:32am

Thanks for taking a look anyway, despite your bad experience, and I agree - that was a hack to get long-hold mutex code to work, in the very early days of porting the code.

It seems Unix and Windows have very different ideas of how a mutex can be used. The mutex held in Illumos, can be held for a long time, even days. Perfectly valid way to use mutex locking under Unix Kernel.

Whereas I found that in Windows, it fiddles with the IRQL level even if I don’t want it to, and is expected to be released asap, and by the very same thread.

I suspect Windows simply does not have any kernel built-in options that will fit with Unix usage, and I will simply have to implement my own mutex code. It is on the TODO - but became much lower priority as the hack happens to work (for now).

When I started this port, I did not know if it could be done, let alone by me as I am new to Windows kernels - but since it has become clear that it can be done, it is time to go back and make things proper. But it has been surprisingly hard to get “Windows Best Practises”. Even here, it took you being revolted to increase the mutex TODO entry

Jorgen_Lundman · May 16, 2019, 7:54am

You also mention “recursive mutex” usage; I would guess that is something where Windows differs again. Take the Unix mutex, for example SunOS/Solaris/illumos:
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/mutex.c#L411

Simply not a valid usage - I did not write that panic test above, it came from SYSV.

anton_bassov · May 16, 2019, 9:25am

Even here, it took you being revolted to increase the mutex TODO entry

Well, this is not the question of TODO entry…

Basically, what you did is, in UNIX/Linux terms, exactly the same thing as enabling interrupts in a function that is meant to run with interrupts disabled. In other words, you have intentionally violated one of the most basic rules of Windows kernel programming, and, to make it much worse, found a justification for it…

Simply not a valid usage - I did not write that panic test above, it came from SYSV.

Well, if there was no “panic” part it would be even worse - you would simply deadlock, because fast_mutex cannot be acquired recursively…

It seems Unix and Windows have very different ideas of how a mutex can be used. The mutex held in Illumos, can be held for a long time, even days. Perfectly valid way to use mutex locking under Unix Kernel.

Whereas I found that in Windows, it fiddles with the IRQL level even if I don’t want it to, and is expected to be released asap, and by the very same thread.

…which reveals more and more about the level of your current understanding of Windows kernel…

In actuality, a mutex can be held as long as you wish, which is true of both “regular” mutexes and fast ones. A fast mutex is “fast” because it relies upon a test-and-set operation, which means it cannot be acquired recursively by the same thread. This is why it elevates IRQL to APC_LEVEL, which, however, has absolutely nothing to do with timing. A “regular” mutex can be acquired recursively…

as I am new to Windows kernels

…but still you are trying to do something that happens to be, apparently, one of the most complex Windows projects in existence…

In any case, it seems to me that you are taking the wrong approach from the very beginning. What I would do in your place is discard the ZFS POSIX layer altogether, and write a Windows-centric filesystem that just makes use of the DMU/SPA/MOS/ZIL layers behind the scenes, effectively storing data in ZFS format and presenting ZFS objects (i.e. filesystems/snapshots/clones) as volumes that a Windows filesystem may be mounted on. However, the way all these objects are accessed has to be a Windows-centric one…

Anton Bassov

Jorgen_Lundman · May 16, 2019, 12:10pm

@anton_bassov said:
A fast mutex is “fast” because it relies upon test-and-set operation, which means it cannot be acquired recursively by the same thread.
This is why it elevates IRQL to APC_LEVEL, which, however, has absolutely nothing to do with timing. A “regular” mutex can be acquired recursively…

“test-and-set” is how I would implement it, should I have to do so.

Seeing as recursive call is not allowed on Unix, I have no need for that ability. You are suggesting here that fast_mutex can then be used, any hints as to how? If I don’t have the hack, it dies fairly early on.

…which reveals more and more about the level of your current understanding of Windows kernel…
…but still you are trying to do something that happens to be, apparently, one of the most complex Windows projects in existence…

Of course - why would you ever try to dissuade someone from wanting to learn? Have you no patience for those with less experience than you?
Either way, I see it as a great way to get familiar with another platform - I am already familiar with SYSV, BSD, and MACH kernels, why not one more.

In any case, it seems to me that you are taking the wrong approach from the very beginning. What I would do in your place is discard the ZFS POSIX layer altogether, and write a Windows-centric filesystem that just makes use of the DMU/SPA/MOS/ZIL layers behind the scenes, effectively storing data in ZFS format and presenting ZFS objects (i.e. filesystems/snapshots/clones) as volumes that a Windows filesystem may be mounted on. However, the way all these objects are accessed has to be a Windows-centric one…

Yes, that is how you would do it, and perhaps any Windows dev would do it that way. Never going to happen though, and in 15 years, hasn’t. Unfortunately, I know Unix inside-and-out but not Windows, so I ported it the way I know, the same way I ported it to OSX.

Anyway, in the end I changed the code to use the driver’s registry, and create a hostid if not present, it was quite an easy thing to do.

Jorgen_Lundman · May 16, 2019, 12:48pm

Actually, just to clarify, when you/Windows talk about “recursive mutex” we are talking about

    mutex_enter(A);
        mutex_enter(A);
        code
        mutex_exit(A);
    mutex_exit(A);

Right? Not the standard nesting use:

    mutex_enter(A);
        mutex_enter(B);
        code
        mutex_exit(B);
    mutex_exit(A);

If it is the case that fast_mutex is not able to do the latter example, then it is actually more like a spinlock, and not a mutex at all.
If that was the case, I would totally understand your reaction, that would be a terrible, terrible use.

Jeffrey_Tippet_MSFT · May 16, 2019, 7:44pm

The SYSTEM hive (mounted at \Registry\Machine\SYSTEM) is loaded by the boot loader, so it’s available to boot start drivers. However, most other hives are loaded later by the normal storage stack.

Typically, if you have something that’s critical for a boot driver to start, you’d put it in the SYSTEM hive.

At present, there’s no notification API when other hives are loaded. A boot driver can poll, or you can orchestrate things so that needing to read the SOFTWARE hive is driven by some other event that guarantees the SOFTWARE hive is already loaded. For example, usermode doesn’t start until hives are loaded, so you can assume that if you get an ioctl from usermode, you’re ready to read other hives.
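The "defer the read until the first ioctl" idea can be sketched roughly as follows. This is portable C11 rather than driver code, so it can be compiled outside the kernel: `read_hostid_from_late_hive`, `on_first_usermode_ioctl` and `get_hostid` are hypothetical names, the registry query is a stub that returns a fixed value, and C11 atomics stand in for the Interlocked functions a driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for querying a value from a hive that is loaded
 * late (for example SOFTWARE). It returns a fixed number and counts calls. */
static int g_read_calls;
static uint32_t read_hostid_from_late_hive(void)
{
    g_read_calls++;
    return 0x00c0ffeeu;
}

enum { HOSTID_UNSET, HOSTID_LOADING, HOSTID_READY };

static atomic_int g_hostid_state = HOSTID_UNSET;
static uint32_t g_hostid;

/* Called from the (hypothetical) ioctl dispatch path. Usermode only runs
 * after the other hives are loaded, so the read is safe from here on.
 * The compare-and-swap makes sure only the first caller does the read. */
static void on_first_usermode_ioctl(void)
{
    int expected = HOSTID_UNSET;
    if (atomic_compare_exchange_strong(&g_hostid_state, &expected,
                                       HOSTID_LOADING)) {
        g_hostid = read_hostid_from_late_hive();
        atomic_store(&g_hostid_state, HOSTID_READY);
    }
}

/* Returns true and fills *out once the value has been loaded. */
static bool get_hostid(uint32_t *out)
{
    assert(out != NULL);
    if (atomic_load(&g_hostid_state) != HOSTID_READY)
        return false;
    *out = g_hostid;
    return true;
}
```

Callers that run before any ioctl has arrived simply get `false` from `get_hostid` and have to cope with the value not being known yet.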

Whereas I found that in Windows, it fiddles with the IRQL level even if I don’t want it to, and is expected to be released asap, and by the very same thread.

Yeah, Windows locks don’t really expect to be held for “days”, and are typically thread- or processor-affinitized. You can, of course, build your own lock with an InterlockedCompareExchange. One gotcha though: if you’re running on some usermode process’s context, you need to prevent the current thread from getting suspended while you hold a lock. (Imagine if usermode ioctls into your driver, you grab your lock, then someone suspends that process. The lock is still held, so nobody else can make forward progress, but the thread that holds it isn’t going to run.) The solution to this is KeEnterCriticalRegion(), which postpones thread suspension. If you implement a “thread-neutral lock”, you should think carefully about where to enter the critical region: whenever a thread needs to make forward progress to prevent the lock from getting starved, that thread needs to be in a critical region.
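A rough sketch of where the critical region brackets sit around a compare-and-swap lock. It is portable C11, not kernel code: `enter_critical_region` and `leave_critical_region` are hypothetical stand-ins for KeEnterCriticalRegion() and KeLeaveCriticalRegion() that only count nesting depth, `atomic_compare_exchange_strong` stands in for InterlockedCompareExchange, and the sketch assumes the simple case where the thread that acquires the lock is also the one that releases it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for KeEnterCriticalRegion()/KeLeaveCriticalRegion(). They only
 * track nesting depth so the bracketing can be checked outside the kernel. */
static int g_critical_depth;
static void enter_critical_region(void) { g_critical_depth++; }
static void leave_critical_region(void)
{
    assert(g_critical_depth > 0);
    g_critical_depth--;
}

typedef struct {
    atomic_int held;   /* 0 = free, 1 = held */
} cas_lock;

/* Enter the critical region BEFORE taking the lock, so there is no window
 * in which the lock is held but the thread can still be suspended. */
static bool cas_lock_try_acquire(cas_lock *l)
{
    enter_critical_region();
    int expected = 0;
    if (atomic_compare_exchange_strong(&l->held, &expected, 1))
        return true;            /* stay in the critical region while holding */
    leave_critical_region();    /* did not get the lock: undo the bracket */
    return false;
}

/* Drop the lock first, then leave the critical region. */
static void cas_lock_release(cas_lock *l)
{
    int was = atomic_exchange(&l->held, 0);
    assert(was == 1);
    (void)was;
    leave_critical_region();
}
```

A truly thread-neutral lock, released by a different thread than the one that took it, cannot use this simple pairing, which is the part that needs the careful thought mentioned above.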

anton_bassov · May 16, 2019, 10:35pm

If it is the case that fast_mutex is not able to do the latter example, then it is actually more like a spinlock, and not a mutex at all.

Well, to begin with, the scenario B (i.e. the one of the nested acquisition of 2 different locks) is perfectly fine with ANY constructs, including spinlocks. The only condition here is that the target OS has to take the original IRQL in case of Windows (or the original state of interrupt flag in case of Linux) into consideration when releasing a lock. As long as it gets done by the target OS, there is no problem here whatsoever.

However, the scenario A is a totally different story as far as a spinlock owner is concerned - if you try something like that you are just bound to deadlock because of the test-and-set. This is why spinlock owners have to ensure that this unfortunate scenario cannot occur. In the Windows world they do so by means of elevating IRQL to DPC_LEVEL before attempting test-and-set. The OSes that allow the use of spinlocks in ISRs have to disable interrupts before attempting test-and-set…
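The difference between the two scenarios can be shown with a bare test-and-set lock. This is a minimal sketch in portable C11 with hypothetical names (`tas_lock`, `tas_try_acquire`, `tas_release`); it uses a single non-blocking attempt instead of a spinning acquire, so the recursive case reports failure rather than actually hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag taken;
} tas_lock;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* One test-and-set attempt: true means the caller now owns the lock. */
static bool tas_try_acquire(tas_lock *l)
{
    return !atomic_flag_test_and_set(&l->taken);
}

static void tas_release(tas_lock *l)
{
    atomic_flag_clear(&l->taken);
}

/* Scenario B: nested acquisition of two DIFFERENT locks. */
static bool nested_two_locks(tas_lock *a, tas_lock *b)
{
    assert(a != b);
    if (!tas_try_acquire(a))
        return false;
    bool got_b = tas_try_acquire(b);
    if (got_b)
        tas_release(b);
    tas_release(a);
    return got_b;
}

/* Scenario A: the owner tries to take the SAME lock again. The flag is
 * already set, so the attempt fails; a spinning acquire would never return. */
static bool recursive_same_lock(tas_lock *a)
{
    if (!tas_try_acquire(a))
        return false;
    bool got_again = tas_try_acquire(a);
    tas_release(a);
    return got_again;
}
```

With two free locks, `nested_two_locks` succeeds and `recursive_same_lock` fails, because the lock keeps no record of who owns it and cannot tell the owner from any other contender.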

Concerning the fast mutex, you can think of it just as a combination of a mutex and a spinlock.

First ExAcquireFastMutex() tries test-and-set, and if it fails, it goes blocking on the dispatcher object (probably, after having polled the target flag for some reasonable number of iterations). This is why it is “fast” - if the contention is low and/or the mutex is released quickly, neither an owner nor a contender needs to go to the system dispatcher upon acquisition/release, effectively reducing the acquisition effort to a simple test-and-set. In this sense, it is conceptually similar to the adaptive mutex on Illumos. The only difference is that it polls the state of the flag, rather than the one of the owner thread. However, if you look at the whole thing from the owner’s perspective, it is, for all practical purposes, just a spinlock that cannot be acquired recursively - a recursive acquisition attempt by the owner thread guarantees a deadlock because of the test-and-set. Therefore, in order to avoid this unfortunate scenario, ExAcquireFastMutex() elevates IRQL to APC_LEVEL before attempting test-and-set…

If that was the case, I would totally understand your reaction, that would be a terrible, terrible use.

As I told you already, what you do is equivalent, in Linux terms, to enabling interrupts while holding a spinlock. I really hope there is no need to explain to you the gravity of this mistake…

Yes, that is how you would do it, and perhaps any Windows dev, would do it that way.
…
Unfortunately, I know Unix inside-and-out but not Windows, so I ported it the way I know, the same way I ported it to OSX.

I hope you DO realise that ZFS is a world on its own, with no easy mapping of its operations to ANY major OS in existence, including even Solaris that it was developed on. The ZFS POSIX layer was designed in order to make it usable on the host OS. What it does is just interfacing ZFS DSL objects, and presenting them as files and directories to the target OS. It makes certain assumptions about the target OS, particularly about the way it handles the file system operations. Since ZFS was originally designed for Solaris, the more similar to Solaris in this respect the target OS is, the fewer modifications porting ZFS to it requires.

OSX is more or less similar to FreeBSD by virtue of being a BSD derivative. Therefore, the ZFS POSIX layer naturally maps to OSX file system operations, and does not require significant modifications when it gets ported to OSX. However, the Windows NT kernel is a totally different world, and the Solaris-targeting ZFS POSIX layer is completely foreign to it. This is why you are more than likely to get into trouble if you try to bluntly and dumbly adjust it to Windows. Therefore, assuming that you want it to work fine under Windows, it would be better to interface ZFS objects to Windows in a way that naturally maps to the Windows filesystem operations, rather than trying to emulate the ways that are totally unnatural and foreign for it…

Certainly, the above does not necessarily imply that your code is going to crash and burn straight away - as I can see, you have managed to get away with a deadly serious bug so far. However, it does not necessarily imply that your code is perfectly fine either, does it…

Anton Bassov

Peter_Viscarola_OSR · May 16, 2019, 10:59pm

Either way, I see it as a great way to get familiar with another platform

Hmmmmm… kind of a big project for doing that. A bit like “I wanted to learn about boats, so I decided to build an aircraft carrier.”

Also, getting familiar with another platform entails learning “the ways”… the underlying architecture… the overall design of that platform and the way its pieces interwork. The flavor, the approach… even the coding style. It doesn’t seem to me that taking a chainsaw and a hammer to force some pre-existing code to work “somehow” is the way to achieve the goal of learning.

I would think you would WANT to come out the other side of your learning activity with a series of modules that reflect the best principles and practices of the host OS. But that’s not what you’re doing. So, with all due respect, that’s not really learning… it’s more like “bludgeoning.”

Peter

Jorgen_Lundman · May 17, 2019, 12:28am

At present, there’s no notification API when other hives are loaded. A boot driver can poll,

I didn’t really expect there to be, thought I would ask just in case. Thanks for letting me know.

Yeah, Windows locks don’t really expect to be held for “days”, and are typically thread- or processor-affinitized. You can, of course, build your own lock with an InterlockedCompareExchange.

They do seem to be a little more “specialised” than I was expecting, and perhaps leaning toward “critical section” usage, rather than the more generic “protecting variable access” as they are used in this code.

I actually had “regular” (proper? full?) mutex code initially, but turned out to not work due to “The kernel never permits a thread that owns a mutex to cause a transition to user mode without first releasing the mutex”, which would not work here. It totally expects to be able to do that.

Skimming the documentation, perhaps InitializeSRWLock could be used in this situation, but it is starting to look like I might have to implement my own mutex code.

Jorgen_Lundman · May 17, 2019, 12:38am

Concerning the fast mutex, you can think of it just as a combination of a mutex and a spinlock.

Thank you, that is helpful.

enabling interrupts while holding a spinlock. I really hope there is no need to explain to you the gravity of this mistake…

Of course. From my POV though, I was calling the Windows-equivalent of “pthread_enter()” only to find it disables interrupts as well, a weird thing to do that breaks the specifications. Naturally, Windows isn’t POSIX, nor does it have pthreads. That is my bad, and I’ll own that. Of course I thought we had already settled that that code is wrong, and we had moved on to looking at alternatives.

anton_bassov · May 17, 2019, 7:40am

From my POV though, I was calling the Windows-equivalent of “pthread_enter()” only to find it disables interrupts as well,

Well, it does not disable interrupts. What it disables is, in UNIX terms, a signal delivery to the target thread. The only difference is that we are speaking about the KM, rather than userland, here.

Naturally, Windows isn’t POSIX

Here we go…

Now go and read my previous post again…

Anton Bassov

Jorgen_Lundman · May 17, 2019, 8:05am

Here we go…
Now go and read my previous post again…

And if you’d read what I said, you would have seen I was merely re-enacting my thoughts at the time, and pointing out that “of course” it will not work as I expect, as it is not the system I am used to. It’s as if you guys just read keywords, and assume I’m on some crusade. I am not.

Still, I checked out SRWLock - but it’s userland. Poked at Pushlocks, but they don’t have a *Try method. So in the end, I went with simple Events. I suspect that KeWaitForSingleObject() is not “free”, so I attempt to avoid calling it by using CAS. It is not “sexy”, but it works.
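One common shape for "an event plus CAS to avoid the wait call" is a counter in front of an auto-reset event. The sketch below is portable C11 and is not taken from the pull request linked in this post: `sync_event`, `ev_mutex` and their functions are hypothetical, and the event is a busy-waiting stand-in for a synchronization KEVENT waited on with KeWaitForSingleObject().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for an auto-reset (synchronization) event. A driver would block
 * in KeWaitForSingleObject(); this model spins and counts how often the
 * slow path was taken. */
typedef struct {
    atomic_bool signaled;
    atomic_int  waits;
} sync_event;

static void event_set(sync_event *e) { atomic_store(&e->signaled, true); }

static void event_wait(sync_event *e)
{
    atomic_fetch_add(&e->waits, 1);
    bool expected = true;
    /* Consume the signal: exactly one waiter is released per set. */
    while (!atomic_compare_exchange_weak(&e->signaled, &expected, false))
        expected = true;
}

typedef struct {
    atomic_int users;   /* holder plus waiters */
    sync_event ev;
} ev_mutex;

static void ev_mutex_init(ev_mutex *m)
{
    atomic_init(&m->users, 0);
    atomic_init(&m->ev.signaled, false);
    atomic_init(&m->ev.waits, 0);
}

/* Uncontended case: one atomic add and no wait call at all. */
static void ev_mutex_enter(ev_mutex *m)
{
    if (atomic_fetch_add(&m->users, 1) > 0)
        event_wait(&m->ev);
}

/* The "Try" variant: succeeds only if nobody holds or waits for the lock. */
static bool ev_mutex_try_enter(ev_mutex *m)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&m->users, &expected, 1);
}

/* Signal the event only if someone is actually waiting. */
static void ev_mutex_exit(ev_mutex *m)
{
    int prev = atomic_fetch_sub(&m->users, 1);
    assert(prev > 0);
    if (prev > 1)
        event_set(&m->ev);
}
```

The lock records no owner, so it does not care which thread releases it or how long it is held. The same property means a recursive enter by the holder waits forever, so a check like the existing "locking against myself" panic is still needed.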

https://github.com/openzfsonwindows/ZFSin/pull/138

The fast_mutex is just one of the two hacks I wasn’t happy with, the other one is in vdev_disk.c - I will hopefully find a better way around that one too.

Anyway, that’s all the beating I can take from you guys for now, thanks for the information shared.

Martin_Drab · May 17, 2019, 11:24am

Seeing as recursive call is not allowed on Unix, I have no need for that ability. You are suggesting here that fast_mutex can then be used, any hints as to how? If I don’t have the hack, it dies fairly early on.

Well, executive resources may be the right primitives for you. They are reader-writer locks but can be used as sort of mutexes. And they do not raise IRQL to APC level.

Why do you need to hold a lock for a long time (even across usermode trips of the holding thread)? A Windows driver is something like a dynamic link library – the kernel just calls one of its callbacks, the driver does its job and returns the control back to the kernel. If you need to wait for something that happens in usermode, use appropriate synchronization primitives (events, semaphores…).

Waiting for the registry hive to appear

Well, you may possibly detect the SOFTWARE hive appearing by tracking changes of the HKLM\System\CurrentControlSet\Control\hivelist key. I think it should be present even at boot-time. I don’t remember whether a specific ZwXxx routine is documented for this purpose. Either way, you can use CmRegisterCallback to do the waiting in a passive manner. But personally, I would opt for storing all necessary information in the driver’s registry key.

anton_bassov · May 17, 2019, 11:40am

Well, executive resources may be the right primitives for you. They are reader-writer locks but can be used as sort of mutexes.
And they do not raise IRQL to APC level.

However, they still impose certain limitations and requirements (like, for example, disabling normal kernel APC delivery before calling ExAcquireResourceXXXLite()). Therefore, I believe the best option here is just to implement mutexes, condvars and rwlocks yourself on top of KEVENTs and InterlockedXXX() functions…

Anton Bassov

Martin_Drab · May 17, 2019, 12:32pm

However, they still impose certain limitations and requirements (like, for example, disabling normal kernel APC delivery before calling ExAcquireResourceXXXLite()). Therefore, I believe the best option here is just to implement mutexes, condvars and rwlocks yourself on top of KEVENTs and InterlockedXXX() functions…

I probably missed some paragraphs but I got the impression that the main problem is the APC_LEVEL IRQL (since it restricts the callable API set), not the fact that normal kernel APCs are disabled.

Of course, KEVENT-based mutexes would work fine.

anton_bassov · May 17, 2019, 5:15pm

I probably missed some paragraphs but I got the impression that the main problem is the APC_LEVEL IRQL (since it restricts the callable API set),

Well, actually, the real problem is that the OP tries to use the code that got written for Solaris under Windows, despite the HUGE differences between these systems. His requirement has absolutely nothing to do with the objective APC_LEVEL restrictions - it is based upon the assertion that, in his words, “Windows increases irql in fastmutex, this is not how we want to use mutex with unix” (the reasoning behind the second part is not so clear either).

Therefore, in his situation it would be better to implement everything himself in order to ensure that his emulated API behaves as much Solaris-like as possible…

Anton Bassov

Tim_Roberts · May 17, 2019, 5:28pm

Martin_Dráb wrote:

Windows driver is something like a dynamic link library – the kernel just calls one of its callbacks, the driver does its job and returns the control back to the kernel.

This, I think, is one of the fundamental architectural tenets of Windows drivers, and it is not intuitively obvious to user-mode programmers. In user-mode programming, flow comes in at main(), and until we exit out the bottom of main(), there’s always at least one thread that has its program counter in our code. This is even true in a GUI app; although we may be waiting for events, we’re waiting because we called GetMessage and DispatchMessage.

With drivers, that’s not the case. The “steady state” for a driver is not to have any code active at all. A driver consists of a set of callbacks. The callbacks get fired to respond to events, either in the system or in our hardware. The callback runs, does its work, and returns as quickly as possible. A driver shouldn’t wait; it should remember its state, and return until some other event triggers a change to the next state. That’s a very different way to think about programming.
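The "remember the state and return" style can be sketched as a small state machine. This is plain C with made-up names (`driver_ctx`, `on_event`, the two events); a real driver would have separate dispatch and completion callbacks instead of one function, but the shape is the same: each call advances the state and returns without waiting.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_IDLE, ST_WAITING_FOR_DEVICE } req_state;
typedef enum { EV_REQUEST_ARRIVED, EV_DEVICE_COMPLETED } drv_event;

typedef struct {
    req_state state;       /* remembered between callbacks */
    int       completed;   /* number of requests finished so far */
} driver_ctx;

/* Called once per event. It never blocks: it records what it is now
 * waiting for and returns, and the next event picks up from there. */
static void on_event(driver_ctx *ctx, drv_event ev)
{
    assert(ctx != NULL);
    switch (ctx->state) {
    case ST_IDLE:
        if (ev == EV_REQUEST_ARRIVED)
            ctx->state = ST_WAITING_FOR_DEVICE;  /* start the work, return */
        break;
    case ST_WAITING_FOR_DEVICE:
        if (ev == EV_DEVICE_COMPLETED) {
            ctx->completed++;                    /* finish the request */
            ctx->state = ST_IDLE;
        }
        break;
    }
}
```

Between the two events no driver code is running at all; the only thing that persists is the `state` field.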