Windows kernel XOR chain list

Hence the </> I see you left that out of the quote :slight_smile:

On Sat, Oct 20, 2018 at 8:57 PM anton_bassov wrote:

> > It’s like blockchain version 0.1
>
> Well, this is probably not the best comparison in the context of this
> discussion (i.e. whether the list under consideration is a singly or a
> doubly linked one). Please note that, unlike an XOR-linked list, a
> blockchain is a STRICTLY singly-linked list that allows traversal in only
> one direction and modification only at the end of the list. If you try
> adding or removing any of its entries (apart from the last one, of
> course), you are going to get a checksum mismatch. Actually, this
> protection against modification happens to be the underlying idea behind
> the very concept of a blockchain in the first place.
>
> In fact, I try my best to avoid using this term - in my book, it is
> closely associated with business types and other laymen who like speaking
> about some supposedly "revolutionary blockchain technology", and who want
> to go as far as teaching it in the universities. Being a
> technically-minded person, I realise that, in actuality, the whole
> "revolutionary technology" is nothing more than a special form of a
> Merkle tree, which was invented (and patented) by Ralph Merkle back in
> 1979, i.e. 30 years before the term even came into existence, and which
> has found a wide range of practical applications since. The ZFS and BTRFS
> file systems, as well as various file-sharing protocols, are the very
> first examples that come to my mind.
>
> Therefore, I would say the best term to describe a blockchain is either
> a "hash-linked list" or a "Merkle singly-linked list", although neither
> of the above is, apparently, too palatable to the business/marketing
> types…
>
> Anton Bassov
>
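To make the "hash-linked list" point above concrete, here is a toy user-mode sketch; the names are illustrative, and FNV-1a stands in for the cryptographic hash a real blockchain would use:

```c
#include <stddef.h>
#include <stdint.h>

/* Toy "hash-linked list": each block stores the hash of its predecessor,
 * so editing any interior block invalidates every hash that follows it.
 * FNV-1a is used purely for brevity; it is NOT cryptographically secure. */
typedef struct BLOCK {
    uint64_t      PrevHash;   /* hash of the previous block */
    char          Data[32];   /* payload */
    struct BLOCK *Next;       /* strictly singly linked, forward only */
} BLOCK;

static uint64_t BlockHash(const BLOCK *b)
{
    const unsigned char *p = (const unsigned char *)b->Data;
    uint64_t h = b->PrevHash ^ 1469598103934665603ULL;  /* seed with link */
    size_t   n;
    for (n = 0; n < sizeof(b->Data); n++) {
        h ^= p[n];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 0 if any interior block was modified after being chained. */
static int VerifyChain(const BLOCK *Head)
{
    const BLOCK *b;
    for (b = Head; b != NULL && b->Next != NULL; b = b->Next) {
        if (b->Next->PrevHash != BlockHash(b))
            return 0;
    }
    return 1;
}
```

Rewriting any interior block's Data without recomputing every downstream PrevHash makes VerifyChain() fail, which is exactly the modification protection described above.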

Hence the </> I see you left that out of the quote

As I have experimentally established, the new hosting platform seems, for some reason, to be automatically removing <>-enclosed XML-style markup. For example, on a couple of occasions it removed the opening statement while preserving the closing one (I am pretty sure it is about to do it with this post as well). Therefore, when I saw </> I thought that it could well be just a mangled paragraph-closing tag, so I did not give it any interpretation. As I can see now, it was meant to be some kind of emoticon…

BTW, while we are at it, I think that if some independent observers looked at all these emoticons, they might suspect that the world is trying to develop a new glyph-based (or, probably, even cuneiform-based) writing system for the XXI century. Call me a bore if you wish, but in my book this "writing system" firmly belongs in the same class as the "2B"/"4U"/etc. one, i.e. something more likely to be associated with kids texting one another than with mature, technically-minded people posting to technical NGs…

Anton Bassov

on a couple of occasions it removed the opening statement while preserving the closing one
(I am pretty sure it is about to do it with this post as well)

As I can see, the platform 'improved' itself even further - now it seems to be removing both…

Anton Bassov

Here is a situation where this has been useful:

My data structure with a LIST_ENTRY was 40 bytes. Since the memory manager aligns on a 16-byte boundary, it actually takes 48 bytes. By shaving 8 bytes off with the XOR list, my 40-byte structure is now 32 bytes. Although this seems trivial, it is not:

  1. I save 8 bytes from the node link, and 8 bytes of alignment overhead from the memory manager. That's a savings of 16 bytes (33%).
  2. Two nodes can now reside in a single cache line. Before, only one full node would fit.

Although using an XOR list may add CPU cycles, I suspect I gain more from being cache friendly than I lose in CPU cycles. I'd be happy if it were close in either direction.
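A concrete sketch of the two layouts (user-mode; the 24-byte payload is an assumption chosen to reproduce the 40-byte case, and these are not the actual project definitions):

```c
#include <windows.h>   /* LIST_ENTRY */
#include <stdint.h>
#include <stdio.h>

typedef struct NODE_DL {      /* doubly linked version */
    LIST_ENTRY Links;         /* Flink + Blink: 16 bytes on x64 */
    UCHAR      Payload[24];   /* assumed payload size */
} NODE_DL;                    /* sizeof == 40, padded to 48 by a
                                 16-byte-aligned allocator */

typedef struct NODE_X {       /* XOR-linked version */
    uintptr_t  Link;          /* prev ^ next: one pointer, 8 bytes */
    UCHAR      Payload[24];
} NODE_X;                     /* sizeof == 32: two fit in one 64-byte
                                 cache line */

int main(void)
{
    printf("LIST_ENTRY node: %zu bytes, XOR node: %zu bytes\n",
           sizeof(NODE_DL), sizeof(NODE_X));
    return 0;
}
```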


I fail to understand your logic, though. The way I think about it is that, whatever the underlying data structure, a node is a node, randomly allocated (as we assume for any dynamic structure), so how would the cache pick those nodes up such that accessing them gives cache hits? A cache fill is usually a sequential prefetch, and the nodes are not sequential. So for the data cache, how could you expect them to be in cache, or to see more hits than misses? I'm missing something…
-Pro

When accessing the nodes to XOR the pointers, two full nodes can reside in the same cache line. This is not guaranteed, but the opportunity for two nodes to share a line is there, and when processing millions of records through the list, I'd expect to see an accumulated benefit. I could set up a test and measure performance between XLIST and LIST_ENTRY; maybe I'll do that next weekend when I have more tinker time. With a 40-byte node, I am guaranteed that two full nodes will never share a cache line.
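For reference, a minimal sketch of what "accessing the nodes to XOR the pointers" looks like in a forward walk; the names are illustrative, not the actual XLIST API:

```c
#include <stdint.h>
#include <stddef.h>

/* XOR-linked node: the single Link field stores prev ^ next,
 * with NULL represented as 0. */
typedef struct XNODE {
    uintptr_t Link;
    /* ... payload ... */
} XNODE;

/* Walk forward from the head: to take a step, XOR the stored link
 * with the address of the node you just came from. */
static void Traverse(XNODE *Head)
{
    XNODE *Prev = NULL;
    XNODE *Cur  = Head;
    while (Cur != NULL) {
        /* visit Cur here */
        XNODE *Next = (XNODE *)((uintptr_t)Prev ^ Cur->Link);
        Prev = Cur;
        Cur  = Next;
    }
}
```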

I know for sure that I am saving memory. I have an odd situation where I am supporting an old design (not mine). Occasionally the software is paused, and lots of nodes are queued in NPP. This can go on for an extended period of time, and on systems without loads of RAM. On rare occasions, it can be millions of nodes. It is an issue. A 33% savings in this pathological situation is a temporary solution (tested) until I can re-architect that aspect of the code. This is a very rare situation (right now), but one that needs addressing. When the software was designed (years ago), apparently no one was thinking about the impact of 20+ TB drives.

And I've always wanted to play around with an XOR chain, so now I have.


the way I think about it is that, whatever the underlying data structure, a node is a node, randomly allocated (as we assume for any dynamic structure), so how would the cache pick those nodes up such that accessing them gives cache hits?

The very first thing that gets into my head is a custom slab allocator: first you allocate an array of the needed structures at once by calling ExAllocatePool(), and then you allocate/deallocate them from/to this array, rather than going to the system allocator on every allocation/deallocation. You may get a noticeable performance improvement in some cases if you do it this way. In that case, aligning your target structures on the cache-line boundary is going to enhance performance even further.

Anton Bassov
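A minimal kernel-mode sketch of the kind of custom slab allocator Anton describes; the names and sizes are illustrative, and synchronisation and pool-exhaustion handling are deliberately omitted:

```c
#include <ntddk.h>

#define SLAB_TAG   'balS'
#define SLAB_COUNT 1024

typedef struct _SLAB {
    PVOID  Base;        /* one up-front pool allocation */
    PVOID  FreeList;    /* singly linked through the free blocks */
    SIZE_T BlockSize;   /* rounded up to a 64-byte cache line */
} SLAB, *PSLAB;

NTSTATUS SlabInit(PSLAB Slab, SIZE_T ObjectSize)
{
    SIZE_T Stride = (ObjectSize + 63) & ~(SIZE_T)63;  /* cache-line align */
    PUCHAR p;
    SIZE_T i;

    Slab->Base = ExAllocatePoolWithTag(NonPagedPool,
                                       Stride * SLAB_COUNT, SLAB_TAG);
    if (Slab->Base == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    Slab->BlockSize = Stride;
    Slab->FreeList  = NULL;
    for (i = 0, p = (PUCHAR)Slab->Base; i < SLAB_COUNT; i++, p += Stride) {
        *(PVOID *)p    = Slab->FreeList;   /* thread block onto free list */
        Slab->FreeList = p;
    }
    return STATUS_SUCCESS;
}

PVOID SlabAlloc(PSLAB Slab)                /* pop from the free list */
{
    PVOID Block = Slab->FreeList;
    if (Block != NULL)
        Slab->FreeList = *(PVOID *)Block;
    return Block;
}

VOID SlabFree(PSLAB Slab, PVOID Block)     /* push back onto the free list */
{
    *(PVOID *)Block = Slab->FreeList;
    Slab->FreeList  = Block;
}
```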

Yep, that's the approach to take :smiley: . Surely, Kirby has already made a fine observation about the system alignment. The big pain is debugging: most debuggers show linked structures in the debugger pane, and traversing them is easy. With XOR links, no such luck :smiley:

One thing to notice is that dynamic structures (their arrival, existence, and duration) are assumed to be random — so your apparent neighbors could change randomly, which would make cache misses more probable…
-Pro

I am using it to replace LIST_ENTRY to save space, so the code was already using a dynamic list, backed by a lookaside list.

I was reviewing the WRK last night, and I see the default depth for lookaside lists is 256 entries. Most of the time I will have far fewer than that in my queue. Only in certain situations will the driver have to collect a lot of data in memory, by which time it is allocating/freeing from/to NPP anyway.

I considered a slab allocator. I could then ensure my data was cache aligned without sacrificing too much memory. In reality, I am in the upper region of the block storage stack, not sitting on a hardware interrupt, so I am not going to spend too much time right now on any major modifications or changes for performance. I am only collecting metadata, not block data. However, I am more than happy to add contributors to the project if anyone wants to tinker and push better code :slight_smile:

If I am not mistaken, the Linux kernel has an XOR-based queue implementation that is frequently used, maybe in the storage stack. I think I remember seeing it some years back.

I have made several modifications based on some of the list's input, like renaming ExInterlockedXxx to InterlockedXxx and adding traversal examples, and I even added PushXList() and PopXList() macros to mimic how SLIST works (kinda silly, but I added them anyway). When I get some time, I'll add an enumerator that has an Out parameter containing a state structure with prev, cur, and next pointers, so that you can restart an enumeration where you left off, in either direction.
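The real PushXList()/PopXList() macros live in the project itself; the hypothetical sketch below only illustrates the essential steps a head push/pop on an XOR list has to perform. Note that the head's old neighbour must have its link re-XORed, because its "prev" changes:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct XNODE {
    uintptr_t Link;                        /* prev ^ next, NULL as 0 */
} XNODE;

static void PushX(XNODE **Head, XNODE *Node)
{
    XNODE *Old = *Head;
    Node->Link = (uintptr_t)Old;           /* prev == NULL, next == Old */
    if (Old != NULL)
        Old->Link ^= (uintptr_t)Node;      /* Old's prev: NULL -> Node */
    *Head = Node;
}

static XNODE *PopX(XNODE **Head)
{
    XNODE *Node = *Head;
    XNODE *Next;
    if (Node == NULL)
        return NULL;
    Next = (XNODE *)Node->Link;            /* head's prev is NULL */
    if (Next != NULL)
        Next->Link ^= (uintptr_t)Node;     /* Next's prev: Node -> NULL */
    *Head = Next;
    return Node;
}
```

Pop reverses the push exactly, which is why both stay O(1), just like SLIST's push/pop.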


Jamey, it is possible that some implementations of this kind are inside Linux. A guy from IBM working on the Linux kernel asked me if their team could use it in their kernel work. Of course, it is open source, so I told them about some of the pitfalls, as I mentioned here. Two guys from Europe sent me mails - one was again about Linux, the other about Solaris. He ran it on a different platform and was happy it was open source :smiley: .

I have some other ADTs that I need to adapt to this - it has been pending for some time - and I will let the community know :smiley:

From the caching point of view, unless you come up with your own cache-management techniques (aging, eviction, bitmaps) for a generic implementation, the system caching will not be of much help (one-way or multi-way cache lines)…

The main thing is how we can minimize the traversal when we need to remove from somewhere in the middle. This is the adversary argument - an adversary always tries to force the worst possible scenario… Some form of SKIP-LIST technique could be used :smiley:

-Pro

For those concerned about just going to Amazon and buying extra memory - the paradigm is "Any amount of memory is not enough!". There are many discussions, here and in other places, which emphasize one way or another that you should not waste resources. History says that CPU improvement over the years has followed a steeper curve than memory, all the more so now that IoT has billions of deployments :smiley:
-Pro

Funny you should mention skiplists. I published a Windows kernel implementation of Pugh's skiplist some time ago. I really need to get in there, clean it up a little, and complete the stubbed-out functions. I tried to keep it in line with the RtlGenericTableXxx API. You can find it here:
https://github.com/jameykirby/rtlskip

I am working in my spare time (which is limited these days) on a set library. It will have options to handle adjacent and overlapped subsets. I want to build it using a skiplist so that I can get AVL-like search times while maintaining linked-list semantics.
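For context, here is the core of a Pugh-style skiplist search, the part that buys AVL-like lookup times while the bottom level remains an ordinary linked list. This is an illustrative sketch, not the rtlskip API:

```c
#include <stddef.h>

#define MAX_LEVEL 16

typedef struct SKIPNODE {
    int Key;
    struct SKIPNODE *Forward[MAX_LEVEL];   /* Forward[0] is the full list */
} SKIPNODE;

static SKIPNODE *SkipSearch(SKIPNODE *Head, int Level, int Key)
{
    SKIPNODE *x = Head;
    int i;

    /* At each level, run right while the next key is still too small,
     * then drop down a level; expected O(log n) comparisons overall. */
    for (i = Level - 1; i >= 0; i--) {
        while (x->Forward[i] != NULL && x->Forward[i]->Key < Key)
            x = x->Forward[i];
    }

    x = x->Forward[0];                     /* candidate on the bottom level */
    return (x != NULL && x->Key == Key) ? x : NULL;
}
```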


Nice! Will look at it :smiley:

For the XOR doubly-linked list, the objective is to reduce memory usage without compromising ops time. So the skip list has to be dynamic, so it can grow or shrink based on the load.

Pro,

I fail to understand your logic, though. The way I think about it is that, whatever the underlying data structure, a node is a node, randomly allocated (as we assume for any dynamic structure), so how would the cache pick those nodes up such that accessing them gives cache hits? A cache fill is usually a sequential prefetch, and the nodes are not sequential. So for the data cache, how could you expect them to be in cache, or to see more hits than misses? I'm missing something…

One thing to notice is that dynamic structures (their arrival, existence, and duration) are assumed to be random — so your apparent neighbors could change randomly, which would make cache misses more probable…

Well, of course you are not going to have all the entries in the cache but, as long as your target structures are cache-aligned, accessing multiple members of the same structure is less likely to result in a cache miss, compared to non-aligned ones.

Another point to consider is that traversing an XOR-linked list in itself is not going to result in any cache misses. Why? Simply because a data structure like that does not embed a pointer to the next list element. Therefore, traversing a list like that does not involve accessing the actual list element - the only thing you have to do is perform XOR operations on the same memory locations in a loop. As a result, you can traverse a list with millions of entries without incurring a SINGLE cache miss, at least in theory.

Certainly, in practical terms you are more than likely to actually access your target entries anyway - after all, you would not normally traverse a list for the sole purpose of doing so, would you? However, the argument about proper cache alignment still applies…

Anton Bassov

Anton,
Sure, aligned allocation has its advantages, and I expect that at the kernel layer any allocation should be aligned - for example, if I allocate any amount of data, I would expect the MM infrastructure to give me at least 4-byte or 8-byte boundaries, depending on whether the data is accessed atomically as 32-bit or 64-bit. If it does not, then we will need our own wrapper to handle aligned allocation, which requires some coding to handle free correctly…

A preallocated pool to hold the nodes does not help cache hits: as I mentioned, your neighbor at one instant is no longer your neighbor at the next instant of time (which could be a couple of microseconds later). The contents of those preallocated buffers are by no means static, and under no circumstances should we expect the size of the pool to be static either.
-Pro

Pro,

A preallocated pool to hold the nodes does not help cache hits: as I mentioned, your neighbor at one instant is no longer your neighbor at the next instant of time (which could be a couple of microseconds later).

Please note that we are speaking about a data structure that is "not-so-efficient", so to say, when it comes to adding/removing entries in the middle of the list. Therefore, I suspect that one would use it in situations where removals of entries are VERY infrequent, while accessing individual entries is a common occurrence. In other words, once an entry has got a neighbour, that neighbour is very unlikely to ever change, so the memory access pattern remains the same on all iterations (which are frequent). I guess the CPU may make some cache-related optimisations under these circumstances, especially taking into consideration that all these entries physically reside in close vicinity to one another…

Anton Bassov

You can fix the debugging problem by providing a NATVIS file:

https://docs.microsoft.com/en-us/visualstudio/debugger/create-custom-views-of-native-objects?view=vs-2017


Will look at it, thanks much!

Well, since this thread refuses to die anyway, I am going to make one more post to it.

For the XOR doubly-linked list, the objective is to reduce memory usage without compromising ops time.

After thinking carefully about the whole thing, I realised that it can indeed be highly beneficial in some situations.

For example, consider the scenario where you want every thread to keep a list of the RW locks it has acquired for R access, and every RW lock to keep a list of the readers that have acquired it. Since the same RW lock may be held by multiple readers at once, and the same thread may hold multiple RW locks for R access at a time, the simplistic approach of embedding a pair of LIST_ENTRY structures in the ones representing a thread and a lock, so that the whole thing gets trivially linked together, is not going to work here. Instead, you would need something more complex, which may be anything from "simplistic" dynamic LIST_ENTRY allocations to Solaris-like turnstiles.

This is where an XOR-linked list may come in handy. Since it does not require the target structures to embed a pointer to the next list element, you can simply embed an XOR-linked list into the structures that represent a thread and a lock, and that's it. Certainly, a list like that is not an optimal solution when it comes to removing entries from the middle of the list (i.e. something that we may well be required to do in this case), but let's get real - in practical terms, our lists are very unlikely to grow long, so traversing a list in order to remove an entry from the middle of it is not going to pose any serious practical problem.

In other words, the XOR-linked list's property of not requiring the target structures to embed a pointer to the next list element may find useful practical applications beyond just saving anything from 8 to 56 bytes of memory (which, IMHO, is of very little practical significance in the year 2018 anyway)…

Anton Bassov