Possible solution for generating a unique file id for files in a minifilter for caching?

brad_H · June 11, 2023, 1:24pm

I have a minifilter driver in which, in my pre-create callback, I scan every new file that the minifilter hasn’t seen before and collect some info about that file.

Currently I have a caching mechanism based on the hash of the file path: I scan a file, collect its info, and insert it into an AVL tree whose key is the hash of the path and whose value is the file info that I collected. I also rescan the file after a successful write, since the info might change, and remove it from the cache upon deletion.
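For reference, a path-hash key of the kind described above could be sketched roughly as follows. This is only an illustration: `path_hash` is a hypothetical helper, FNV-1a is a swapped-in hash (the post does not say which hash is used), and the ASCII-only case folding is a simplifying assumption that does not match the file system's real case rules.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit FNV-1a over a path, folded to lower case.
 * ASCII-only folding is an assumption made to keep the sketch short. */
static uint64_t path_hash(const char *path)
{
    uint64_t hash = 14695981039346656037ULL;    /* FNV-1a offset basis */

    for (; *path != '\0'; ++path) {
        /* Fold case so two spellings of the same path share one key. */
        hash ^= (uint64_t)(unsigned char)tolower((unsigned char)*path);
        hash *= 1099511628211ULL;               /* FNV-1a prime */
    }
    return hash;
}
```

A key like this identifies a name rather than a file: a renamed file gets a new key, and a file with several hard links gets one key per link, which is part of why an ID supplied by the file system is attractive.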

Now my question is: is there a better way of generating a unique file ID for files, other than the hash of the path?

rod_widdowson · June 11, 2023, 1:29pm

ReFS/NTFS only? If so, why not use the FileId or the ObjectId?

brad_H · June 11, 2023, 1:35pm

Yes, I’m fine with supporting only ReFS/NTFS, but I assume that if the file system is not ReFS or NTFS, FltQueryInformationFile would just fail, right? I just want to make sure it wouldn’t return a junk ID or the same ID for every file.

And what is the difference between FileId and ObjectId? Which class in FltQueryInformationFile should I use to find the id?

rod_widdowson · June 11, 2023, 1:48pm

It’s probably best to read the docs:

FileId

ObjectId

The doc says NTFS only but I have a query in about that with the doc team.

brad_H · June 11, 2023, 2:31pm

One thing that raises concern in the fileid doc is this:

“File reference numbers, also called file IDs, are guaranteed to be unique only within a static file system. They are not guaranteed to be unique over time, because file systems are free to reuse them. Nor are they guaranteed to remain constant.”

So if by “static file system” they mean a file system that never gets written to (which describes no system in the real world), doesn’t this mean that the file ID isn’t unique at all?

When will the file ID change? If it changes when a file is written to, or when the machine reboots, I’m fine with that. But a file ID that changes randomly over time on a live machine (without the content of the file changing) would be useless! So am I understanding this right?

And considering that there is no such sentence in the object ID doc, does this mean that the object ID is more reliably unique and a better choice in my scenario (if it is in fact supported on ReFS)? This has nothing to do with the FILE_OBJECT though, right? To my understanding, that is per file handle.

rod_widdowson · June 11, 2023, 2:43pm

By observation only, the FileId in NTFS consists of a number (possibly an index into the file table) and a sequence number. This is what ODS2 did 20+ years earlier (and there is good reason to assume that the design of one was influenced by its earlier antecedent). Obviously sequence numbers can and do wrap but…

I wouldn’t like to make any definitive statement about the ObjectId but there are others who watch this space and who could.

And you are correct: this has nothing to do with a FILE_OBJECT. Kernel OBJECTS (of which FILE_OBJECT is just one example) are ephemeral, but referenced, in-memory-only things used to manipulate things (files in the case of FILE_OBJECTS, but there are OBJECTS for threads or processes or security structures).
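To make the "number plus sequence number" observation concrete, here is a minimal sketch of splitting a 64-bit NTFS file ID into those two parts. The 48-bit/16-bit split is the commonly described layout and is treated here as an assumption; `file_id_parts`, `split_file_id` and `same_slot_different_file` are hypothetical names, not part of any Windows API.

```c
#include <stdint.h>

/* Hypothetical view of a 64-bit NTFS file ID.
 * Assumed layout: low 48 bits = file table index, high 16 bits = sequence. */
typedef struct {
    uint64_t record_index;
    uint16_t sequence;
} file_id_parts;

static file_id_parts split_file_id(uint64_t file_id)
{
    file_id_parts parts;
    parts.record_index = file_id & 0x0000FFFFFFFFFFFFULL;
    parts.sequence = (uint16_t)(file_id >> 48);
    return parts;
}

/* Non-zero when two IDs point at the same table slot but carry different
 * sequence numbers, i.e. the slot was reused for a different file. */
static int same_slot_different_file(uint64_t a, uint64_t b)
{
    file_id_parts pa = split_file_id(a);
    file_id_parts pb = split_file_id(b);
    return pa.record_index == pb.record_index && pa.sequence != pb.sequence;
}
```

Under this assumption, a reused slot yields a different full 64-bit value until the 16-bit sequence wraps, which is why the whole value, not just the index, should be used as the key.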

brad_H · June 11, 2023, 2:54pm

Okay so assuming we have the following code:

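    FILE_INTERNAL_INFORMATION fileInternalInfo = { 0 };
    LONGLONG fileId = 0;

    // Ask the file system for the 64-bit file ID of the opened file.
    NTSTATUS status = FltQueryInformationFile(
        FltObjects->Instance,
        fileObject,
        &fileInternalInfo,
        sizeof(fileInternalInfo),
        FileInternalInformation,
        NULL);

    // Only use the ID if the query actually succeeded.
    if (NT_SUCCESS(status)) {
        fileId = fileInternalInfo.IndexNumber.QuadPart;
    }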

I use this FileId as the key in the AVL tree, and I remove the file from the cache when it is written to, deleted, renamed, or moved.

Will I face any issues? I assume that a file ID is only reused after the file is deleted or modified, which means that I am fine. Right?

Dejan_Maksimovic · June 11, 2023, 4:32pm

FileId is static as long as the file is not deleted, yes.
That call won’t work for ReFS, because ReFS uses 16-byte IDs, in the FOLDER_ID SUBFILE_ID format. So if you query internal ID only, it will match so often it would look like the same file has more copies than files on the C drive.
You need to query the 128-bit object ID on ReFS instead.
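One way to let a single AVL tree hold both kinds of ID is to widen every key to 16 bytes and compare all of them. The following is a rough sketch of that idea only; `file_key`, `key_from_id64`, `key_from_id128` and `key_compare` are hypothetical names, and the zero-extension of 64-bit IDs is an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical cache key: always 16 bytes, whatever the file system. */
typedef struct {
    uint8_t bytes[16];
} file_key;

/* Widen a 64-bit ID by zero-filling the remaining 8 bytes. */
static file_key key_from_id64(uint64_t id)
{
    file_key key;
    memset(&key, 0, sizeof(key));
    memcpy(key.bytes, &id, sizeof(id));
    return key;
}

/* Copy a 16-byte ID unchanged. */
static file_key key_from_id128(const uint8_t id[16])
{
    file_key key;
    memcpy(key.bytes, id, sizeof(key.bytes));
    return key;
}

/* Negative, zero or positive, like memcmp. The order is arbitrary but
 * consistent, which is all a tree compare routine needs. */
static int key_compare(const file_key *a, const file_key *b)
{
    return memcmp(a->bytes, b->bytes, sizeof(a->bytes));
}
```

Comparing all 16 bytes is the point: two 16-byte IDs that share one 8-byte half would collide if only that half were used as the key. File IDs are also only unique within a volume, so a tree shared across volumes would need a volume component in the key as well.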

You can use USN to track if a file has changed (note that it won’t change in case of memory mapping until data is flushed).

Regards, Dejan

brad_H · June 12, 2023, 2:20am

So should I use the object ID (I assume by object ID you meant FILE_OBJECTID_INFORMATION) instead of the file ID in order to support both NTFS and ReFS? Will using the object ID have any drawback compared to the file ID as a key for the AVL tree?

Also, why would I need to use the USN (update sequence number, right?) to track file changes when I have a minifilter? Can’t I just track whether a file that I have cached has changed by monitoring the write and setinfo callbacks?

Dejan_Maksimovic · June 12, 2023, 7:16am

Yes, that object ID. But I don’t recall if it works on NTFS, so please check and use the InternalInformation on NTFS if needed (and let us know).

Just mentioned USN. Also, minifilters below you can change the file without you knowing it.

brad_H · June 22, 2023, 4:56am

So based on my research, the best way to achieve what I’m asking is to do it the same way as Microsoft’s avscan sample:
https://github.com/microsoft/Windows-driver-samples/tree/main/filesys/miniFilter/avscan

It uses the file ID, and it works on both ReFS and NTFS.