Network drives becoming disconnected with our HSM driver

The company I work for is working on an HSM product that uses reparse
points and a filter driver to transparently retrieve files that have
been released from local disk to long-term storage. When we use this on
a file server, we run into problems: we hold the create requests in a
pending state while we retrieve the file to local storage, and with
large files, we're holding them too long.
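For context, the "hold the create pending" pattern we use looks roughly like the sketch below. This is illustrative only: `HsmPostCreate`, `RetrievalWorker`, and `RETRIEVAL_CONTEXT` are made-up names, and allocation/error handling is elided.

```c
/* Sketch only: pend a post-create while a worker restores the file. */
FLT_POSTOP_CALLBACK_STATUS
HsmPostCreate(
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID CompletionContext,
    FLT_POST_OPERATION_FLAGS Flags)
{
    PFLT_GENERIC_WORKITEM work = FltAllocateGenericWorkItem();
    PRETRIEVAL_CONTEXT ctx = /* ... allocate; remember Data ... */;

    /* Hand the long-running retrieval to a worker thread. */
    FltQueueGenericWorkItem(work, FltObjects->Filter,
                            RetrievalWorker, DelayedWorkQueue, ctx);

    /* The create stays pended until the worker completes it. */
    return FLT_POSTOP_MORE_PROCESSING_REQUIRED;
}

VOID
RetrievalWorker(
    PFLT_GENERIC_WORKITEM WorkItem,
    PVOID FltObject,
    PVOID Context)
{
    PRETRIEVAL_CONTEXT ctx = Context;

    /* ... long-running restore from long-term storage ... */

    FltCompletePendedPostOperation(ctx->Data);
    FltFreeGenericWorkItem(WorkItem);
}
```

With large files, the worker can sit in the "long-running restore" step for many minutes, which is exactly where we hit trouble.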

The result is that the Windows Server service becomes unresponsive when
a client requests a file that is large enough to take too long to
restore to the server. The share becomes inaccessible from the client
that requested the large file, and other clients are unable to mount
the share. Once the file retrieval completes and
FltCompletePendedPostOperation() is called, everything resumes working.
It's not acceptable that it stops working in the meantime, however, and
I'm looking for a way to keep things alive or reset whatever timeout
mechanism is at work. The limit here appears to be around 10 minutes.

The first question is: is there a hard and fast limit on how long you
can hold an operation pending? If so, how is it enforced by the system,
and is there a direct way to reset it? Second, if we can't reset the
timeout directly, does anybody think that reissuing the request would
keep the Server service from going unresponsive? (I'm guessing that if
the problem is in the Server service, re-requests in the kernel won't
affect it.) Third, does anybody know of any other reason we might be
seeing this problem? We're assuming it's related to the Server service,
since we don't have a problem restoring large files locally.

~Eric

A few bits of further information:

File server is Win2K3, client is WinXP.

Changing the network timeouts referenced in
http://support.microsoft.com/kb/297684 has no effect. Given that the
entire service becomes unavailable, we don't think that issue is
related, at least at the moment. If you can think of a reason it might
be, any light you can shed on it would be appreciated.

>The result of this is that the Windows Server service is becoming
>unresponsive when a client requests a file that is too large and takes
>too long to restore to the server.

A well-known issue with SRV: performance drops sharply when writing at
the end of a huge file.

So, add a “split large files into chunks” feature to your product.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Maxim, thanks. We’re going down the road of adding “split large files into chunks”, but slowly. For now I’m working on moving the retrieval into the reads rather than the creates. It appears that only create operations can capture a reparse point, as discussed here (http://www.osronline.com/showThread.cfm?link=12039) and confirmed by debugging (the status code sure isn’t set to STATUS_REPARSE, at least).

I’m thinking of caching the reparse data in the driver or the user-mode service on create requests, and then using it to retrieve the file once a read comes in. That way, we avoid retrieving the file every time somebody browses the directory containing it with Explorer. That doesn’t happen with a remote directory, but it does happen with a local one.
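The deferred-retrieval idea could look something like the sketch below. Again, this is only a sketch of the approach described above: `CACHED_REPARSE`, `HsmLookupReparse`, and `HsmQueueRetrieval` are hypothetical names, and the lookup/locking details are elided.

```c
/* Sketch only: create just caches the reparse data; the first read
 * triggers the retrieval and is pended until the file is back on disk. */
FLT_PREOP_CALLBACK_STATUS
HsmPreRead(
    PFLT_CALLBACK_DATA Data,
    PCFLT_RELATED_OBJECTS FltObjects,
    PVOID *CompletionContext)
{
    PCACHED_REPARSE entry = HsmLookupReparse(FltObjects->FileObject);

    if (entry == NULL || entry->Retrieved) {
        return FLT_PREOP_SUCCESS_NO_CALLBACK;   /* nothing to restore */
    }

    /* Queue the restore using the reparse data cached at create time;
     * the worker later calls
     * FltCompletePendedPreOperation(Data, FLT_PREOP_SUCCESS_NO_CALLBACK,
     *                               NULL)
     * once the data is local. */
    HsmQueueRetrieval(entry, Data);
    return FLT_PREOP_PENDING;
}
```

Directory browsing then only ever touches the cached reparse data, and the expensive restore happens on the first read that actually needs the contents.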

Once we get that implemented, the next step is the splitting. Not only should it (hopefully) keep the server alive by removing the long delays, it should also make the whole transfer faster: rather than retrieving the file to the file server and then sending it back to the client, we can pipeline the process.
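For what it's worth, the chunk arithmetic itself is simple. Here is a minimal user-mode sketch; `chunk_count`, `chunk_range`, and the 64 MiB `CHUNK_SIZE` are all illustrative values, not anything from our product:

```c
#include <assert.h>

/* Assumed chunk size; in practice this would be a tuning knob. */
#define CHUNK_SIZE (64ULL * 1024 * 1024)   /* 64 MiB */

/* Number of chunks needed to cover file_size bytes (ceiling division). */
static unsigned long long chunk_count(unsigned long long file_size)
{
    return (file_size + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/* Byte range [*offset, *offset + return value) of chunk `index`.
 * Returns 0 when the index is past the end of the file. */
static unsigned long long chunk_range(unsigned long long file_size,
                                      unsigned long long index,
                                      unsigned long long *offset)
{
    unsigned long long remaining;

    *offset = index * CHUNK_SIZE;
    if (*offset >= file_size)
        return 0;

    remaining = file_size - *offset;
    return remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;
}
```

Each retrieved chunk can then be completed toward the client as it lands, instead of waiting for the whole file.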

Thanks again

~Eric