best approach to make big cache

Pavel_S · September 8, 2023, 8:48pm

Hi folks.

I have a design question. I need to write a driver that will, over time, process many files up to a certain size. Each file should be processed only once (until some expiry date).
E.g. if process A.exe opens an interesting file, we process it in the driver. If A.exe or X.exe then tries to do the same, we should skip it. After 12 days such a cache entry can be invalidated. Assume the cache is volatile (not surviving a reboot), but I need to account for server machines that are not rebooted often.
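
The skip-or-process rule described above can be sketched roughly as follows. This is only an illustration in portable C, not driver code: the function name `can_skip_file`, the use of seconds as the time unit, and the choice to reprocess when the clock appears to have gone backwards are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 12 days expressed in seconds; the time unit is an assumption. */
#define CACHE_TTL_SECONDS (12ull * 24 * 60 * 60)
static_assert(CACHE_TTL_SECONDS == 1036800ull, "12 days in seconds");

/*
 * Hypothetical helper: returns true when the file was already processed
 * less than 12 days ago, so the driver may skip it.
 */
bool can_skip_file(bool in_cache, uint64_t processed_at, uint64_t now)
{
    if (!in_cache) {
        return false;               /* never seen: must process */
    }
    if (now < processed_at) {
        return false;               /* clock went backwards: reprocess */
    }
    return (now - processed_at) < CACHE_TTL_SECONDS;
}
```

An entry that fails this check can be overwritten in place when the file is processed again, so expiry needs no separate cleanup pass unless memory has to be reclaimed.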

Question: since I need to process each file just once, I think I need some sort of cache in the kernel. I am afraid that if there is a huge number of files on disk (10 million, etc.), such a cache will take a lot of memory, and I am wondering whether this approach is correct. Even if each cache entry is small, I think that with so many files over 12 days it may take a lot of space.
Can you give me some hints on how to address this in a smarter way (shared memory?).

Thank you

Dejan_Maksimovic · September 8, 2023, 10:29pm

Memory mapping a cache file?
You have to test what will work best for you.
Tuning how much is kept in memory vs. in a file is the main work for every
product/environment.

Pavel_S · September 9, 2023, 1:42pm

Thank you for the feedback. Hm, I thought about it, but I am not sure how it would work or what the performance penalty would be. For instance, suppose the cache needs the file ID (FID) as the key and 2 DWORDs as the value. Normally I would do it with an RTL generic table, which is implemented as an AVL or splay tree, or it could be done in many other ways. The point is that all the data sits in an in-memory structure that is easy to alter (add/delete) and query. If it lives in a file, that is hard to achieve: every time the cache needs to be altered, I need to rewrite(?) a block of memory, right? Even if I wrote a binary representation of some tree to this file, once one thread alters the file, a second one would need to reread the whole file to initialize the tree from scratch… or maybe you had something different in mind? Maybe you have a sample? Thank you.

Dejan_Maksimovic · September 10, 2023, 8:22am

An AVL tree may not be a good idea here.
I would implement a sorted array, that is updated on each change. You do a
simple binary search then.
AVL tree may be faster for some uses, and definitely for general searching
but recreation is slower than a file write (allocations are slow, for 1M
IDs even lookaside allocation will show slowness, and definitely won’t have
1M slots on each recreate ready).
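
A minimal user-mode sketch of the sorted-array idea, assuming a fixed-size record made of a 128-bit file ID plus two 32-bit values (the "2 DWORDs" mentioned earlier). The names `file_id128`, `cache_entry`, `sorted_cache`, `cache_find`, and `cache_insert` are hypothetical, and the storage is a caller-provided buffer. A real driver would add locking and use the kernel's memory routines instead of the C library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit file ID, ordered by comparing its 16 raw bytes. */
typedef struct { uint8_t bytes[16]; } file_id128;

/* Hypothetical record: key plus two 32-bit values, 24 bytes in total. */
typedef struct {
    file_id128 id;
    uint32_t value1;
    uint32_t value2;
} cache_entry;

typedef struct {
    cache_entry *entries;   /* caller-provided storage, kept sorted by id */
    size_t count;           /* entries currently in use */
    size_t capacity;        /* total slots available in entries */
} sorted_cache;

static int id_compare(const file_id128 *a, const file_id128 *b)
{
    return memcmp(a->bytes, b->bytes, sizeof a->bytes);
}

/* Binary search: index of the first entry whose id is >= key. */
static size_t cache_lower_bound(const sorted_cache *c, const file_id128 *key)
{
    size_t lo = 0, hi = c->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (id_compare(&c->entries[mid].id, key) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Returns the matching entry, or NULL if the id is not cached. */
const cache_entry *cache_find(const sorted_cache *c, const file_id128 *key)
{
    assert(c != NULL && key != NULL);
    size_t i = cache_lower_bound(c, key);
    if (i < c->count && id_compare(&c->entries[i].id, key) == 0)
        return &c->entries[i];
    return NULL;
}

/* Returns 1 if inserted, 0 if the id was already present, -1 if full. */
int cache_insert(sorted_cache *c, const cache_entry *e)
{
    assert(c != NULL && e != NULL);
    size_t i = cache_lower_bound(c, &e->id);
    if (i < c->count && id_compare(&c->entries[i].id, &e->id) == 0)
        return 0;
    if (c->count == c->capacity)
        return -1;
    /* Shift the tail up by one slot to keep the array sorted. */
    memmove(&c->entries[i + 1], &c->entries[i],
            (c->count - i) * sizeof *c->entries);
    c->entries[i] = *e;
    c->count++;
    return 1;
}
```

Lookups cost O(log n) comparisons and need no per-entry allocation. The trade-off is that each insert may move up to the whole tail of the array, which is why batching updates may be worth measuring.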

Note that if you have a million IDs and an ID is just a file ID (128 bit,
because otherwise it is not unique on ReFS at all), having it all in memory
is not a big deal in terms of memory usage (16MB per million, +8MB for a USN
or file time stamp; but if it’s tens of millions, then I see the problem).
Your much bigger issue is how not to fragment memory, so some special
suballocator for the AVL tree here is crucial.
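
The memory estimate above can be checked with a few lines of C. This is only a sketch: the `flat_entry` layout (16-byte ID followed by an 8-byte stamp) and the helper `cache_bytes` are assumptions made for illustration, and the figure excludes any per-node overhead a tree or allocator would add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flat record: 128-bit file ID plus a 64-bit USN or time stamp. */
typedef struct {
    uint8_t  file_id[16];
    uint64_t stamp;
} flat_entry;

/* 16 + 8 bytes, with no padding on common ABIs. */
static_assert(sizeof(flat_entry) == 24, "expected a 24-byte record");

/* Bytes needed to hold file_count records in one contiguous array. */
uint64_t cache_bytes(uint64_t file_count)
{
    assert(file_count <= UINT64_MAX / sizeof(flat_entry));
    return file_count * sizeof(flat_entry);
}
```

With this layout, one million entries take 24,000,000 bytes (the 16MB + 8MB quoted above), and ten million take 240,000,000 bytes.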

What does the scan do, anyway?
I am skeptical that so many of the files need to be scanned.

Did you consider kernel EAs? They are best for exactly this purpose,
like an AV scanned-files cache.

Pavel_S · September 10, 2023, 2:37pm

Thanks for this great answer. I cannot estimate the number of files to scan yet, but I need to be pessimistic and consider such a scenario too. Your comment about a sorted array is valid and I think correct too. However, I am unsure of one thing: if I map a file in the kernel, won't the view be accessible only in the System process? Sorry if this is a newbie question, I am going to check it soon, but I feel it will not be accessible in an arbitrary thread context, will it? If so, that would mean I would need to map it into every process in whose context my code runs? My project is about calculating a special hash of files along with a special token in it. Unfortunately, all of this has to happen when files are altered, so I don't always know who will alter a file. And I need to do the hashing once per file content, and that info should survive a reboot. EAs look good, but here I have another question: if my driver is uninstalled, how do I remove such EAs from the files? Thanks again!!

Dejan_Maksimovic · September 10, 2023, 7:29pm

Hmm, good point about processes.
Frankly, I don't know what happens if the view is made in the system process, I
never tried… please let me know.

EAs need to be manually deleted. They survive reboots and are very secure
(special kernel EAs, not general EAs)

Pavel_S · September 11, 2023, 3:19pm

Hello, I made a test.
From within a driver I did this:

In DriverEntry I created a section.
Then I started a thread routine and mapped this section into the System process; after that I wrote a string to this memory.
Finally, from the debugger I could see the following: in the context of the System process, da ADDRESS shows the string correctly, but after switching to e.g. explorer.exe the same address clearly shows garbage. Switching back to System, the string is visible again.

To sum up: the view is visible in the System process only.

Dejan_Maksimovic · September 11, 2023, 8:16pm

Pity.
Interlocked system queue with memory mapping then?