Here’s a problem that sounds easier than it is:
Say you are writing a file system filter to enforce size limits on directories on a system. One of the things you would need to do is calculate the size of the directories that the driver manages. This is simple enough; just write a function that walks the directory tree and tallies up all files in all subdirectories of the directory.
Since this tally routine is implemented in a file system filter, it seems logical that you should be able to tally the size of the directory while simultaneously allowing IO (possibly IO that changes the logical size of the directory you are tallying) to pass through without screwing up the total. For example, say you receive a write request which will write data beyond the end of some file, thus extending it. Now say that the write is directed to the directory whose size you are tallying. My intuition tells me that this should be a tractable problem, but I am having trouble coming up with a solution.
I guess the main question that need answers at the time of the file size changing io is:
a) Will this IO change the logical size of the directory being tallied.
b) Has the file whose size is changing already been added to the tally by the tally routine.
The first question is easy, the second is a little trickier. If you can know that the file whose size is changing has already been looked at by the tally routine, than you simply adjust the running tally by the delta. If you know that the file size hasn’t been looked at by the tally routine, then you simply let the size change without intervention.
Given intimate knowledge of the tally algorithm, it would be trivial to determine if the file in question has been looked at by the tally routine by examining the path that is currently being examined and comparing it to the path of the file whose size is changing. This would be easy IF you could rely on a query directory request returning directory entries in a predictable order. However, you cannot assume ordering for all file systems (FAT for example, right?). So this wouldn’t work. You could recurse through the directories in a predictable order if you cached directory entries in a sorted data structure and recursed through the subdirectories in sorted order rather than the arbitrary order returned by query directory. This is expensive and painful, however and you’d eat up a lot of memory in very large and deep directory trees.
Another approach would be to accumulate the indexes of the file’s that have been examined by the tally routine in a sorted data structure. When deciding if a particular file has been examined yet, you would simply look into this buffer for the file in question’s index, which would tell you if the file has been added to the tally yet. Again, though, this approach would become prohibitively expensive for very large and deep directories.
Anyway, there are several additional approaches I can come up with, but intuition tells me that they are all overkill and there must be simple solution to this problem that I am not thinking of. Can anyone think of an efficient, elegant approach to solving this problem? Any suggestions would be appreciated.