I have a customer that says a competing product can tell the difference on a file server between when a client computer reads a file for a copy vs a search (searching the contents of the file). This is surely impossible because I could write a trivial program that reads a file from a server into memory, and at any point could search the memory for a string, or it could decide to write the file to disk (so now it's a copy).
But, in trying to meet his needs for maybe 80% of the time (users using Windows Explorer for copy or searching?) does Explorer, or the file copy API, do anything unusual that could be detected in a mini filter vs an application just reading the file? I'm coming up blank.
Perhaps the trick is to study the output file instead! It should eventually reach the same size as the input file (and have the same hash). Perhaps you can try to detect that phenomenon.
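A minimal sketch of that comparison, assuming both the source and the candidate output file are reachable from the machine doing the check (which may not hold if the destination lives on the client). The function name `same_content` and the choice of SHA-256 are illustrative assumptions, not anything the competing product is known to use.

```python
import hashlib
import os


def same_content(src_path, dst_path, chunk_size=1 << 20):
    """Return True if both files have the same size and the same SHA-256 digest."""
    # Cheap check first: a finished copy must have the same length.
    if os.path.getsize(src_path) != os.path.getsize(dst_path):
        return False

    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so large files are not loaded into memory at once.
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
        return h.digest()

    return digest(src_path) == digest(dst_path)
```

A size mismatch while the copy is still in progress simply returns False, so the check would have to be repeated once writes to the output file stop.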
When you say “surely impossible”, I need to ask what assumptions you might be making. What is the probability of success? It is very difficult to do this with 100% accuracy, which no doubt makes you think it impossible.
The real question, I think, is whether there are specific scenarios that occur often enough to make the effort worth it.
Take your counter-example of random code. The more common cases, however, would use either APIs or command line tools. Walking the stack might allow some degree of certainty for the API case, and looking at process information could resolve the command line case. These might detect a high enough percentage of cases to make it worth the effort to implement.
Accessing the user mode stack from kernel mode (in case the code in question is, in fact, kernel mode) adds a bit of complexity, but I suspect even that is doable.
I guess it depends whether you need a marketing or an engineering solution....
But I'd spend some time looking at IO_COPY_CHUNK. I have absolutely no experience of it, and it might be on the wrong machine (client, not server), but it also might be interesting.
They only have access to the server, not the client. I do already keep track of copied ranges so I can tell if the entire file was accessed, but a search could (should?) access the whole file too, I would think.
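For illustration, the "was the entire file accessed" check could look roughly like the sketch below, assuming each read is logged as a hypothetical `(offset, length)` pair. The function names are made up for this example and say nothing about how the actual filter stores its ranges.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (offset, length) reads into sorted (start, end) spans."""
    spans = sorted((off, off + length) for off, length in ranges if length > 0)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covers_whole_file(ranges, file_size):
    """True if the merged reads cover every byte in [0, file_size)."""
    if file_size == 0:
        return True
    merged = merge_ranges(ranges)
    # Spans are sorted, so only the first one can start at offset 0.
    return bool(merged) and merged[0][0] == 0 and merged[0][1] >= file_size
```

As noted, full coverage by itself does not separate a copy from a search; it only rules out partial reads.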
I guess I think it's impossible to be 100% accurate, so I'm taking the approach you recommend by looking for how to do it in probable scenarios.
The copy is initiated on a client computer, and I can only see the server side, so APIs and stack walking won't help in this case, but maybe spying on the client process (Explorer.exe) during my investigations will give me some hints.
As you already know, there is no 100% proof positive way to determine the intent of an application when it opens a file handle. All you know for sure is that read access was requested and is allowed.
The engineering solution to a situation like this is to use a heuristic. For a single handle open and a set of reads on that file, nothing can really be inferred. Both copy and search probably read the data sequentially, so if the IO pattern is not sequential you can probably rule out both. Possibly the search application takes longer between reads, but that is hardly certain and depends on the algorithm. Copying a single file will have a long delay between opening one handle and opening a handle for another file in the same directory, but copying a whole directory won't. Searching from Windows Explorer can't be done within a single file, so presumably it must open a bunch of handles.
Obviously the same considerations apply to recursion into sub-directories.
And these behaviours are subject to change between versions of Windows Explorer and of other applications that do copying or searching.
So without giving any answer that you can use directly, my suggestion is to monitor and test, and then see what conclusions you can draw from what you see.
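As a starting point for that kind of testing, the sequential-read observation could be prototyped offline against logged reads before anything goes into the filter. This is only a sketch: the `(offset, length)` log format, the function names, and the 0.9 threshold are all assumptions chosen for illustration, and reads arriving over the network may not reach the server in strict order.

```python
def sequential_fraction(reads):
    """Fraction of reads that start exactly where the previous read ended.

    `reads` is a list of (offset, length) tuples in arrival order for one handle.
    """
    if len(reads) < 2:
        return 0.0  # a single read carries no pattern information
    hits = 0
    expected = reads[0][0] + reads[0][1]
    for off, length in reads[1:]:
        if off == expected:
            hits += 1
        expected = off + length
    return hits / (len(reads) - 1)


def looks_sequential(reads, threshold=0.9):
    """Heuristic only: True if most reads continue from the previous one."""
    return len(reads) >= 2 and sequential_fraction(reads) >= threshold
```

A handle whose reads fail this test is probably neither a plain copy nor a content search; a handle that passes could still be either, so the result would need to be combined with timing and handle-count observations.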