Hi everyone,
I had a very strange problem with my driver that took me one full week to
track down and fix. I thought I'd share it so that you can avoid this kind
of mistake [or at least have a good laugh at my stupid error].
Quick description of my driver: It consists of a function driver
managing a number of virtual disks. Each disk has a system service
attached to it, which in turn loads one of a number of 3rd party IO DLLs
that handle the read/write requests to/from files or devices. In the
simplest case, the driver adds one partition and the service just forwards
all read/write requests to the IO DLL, which updates a single image file.
In more complex scenarios the system service handles a number of
additional 'overlay files' and only some read requests are forwarded to
the IO DLL. Access to those virtual disks is mostly read-only.
The error:
Everything seemed to work well until one day I ran 'chkdsk' on a larger
damaged virtual NTFS volume. When chkdsk checked and corrected the volume
bitmap, it would always say "there is not sufficient space on the disk to
fix the bitmap" and abort.
I then spent a long time checking that all device object flags were OK,
that the device object was created correctly, that I did not miss any
request, and that I replied to all IOCTLs correctly, and found nothing. I
put in lots of breakpoints at places where something could go wrong and
added loads of debug messages: nothing triggered, everything seemed to
work 100% okay, not a single request failed, but 'chkdsk' failed
consistently at the same point [so no timing problem].
As I found nothing, I started to worry about the transmitted data in the
read/write requests. I've modified the driver and the service to write out
all read/write requests to special log files, including the read/written
data, and then compared those two log files with each other and with the
source, and sure enough, there were differences between the driver view of
the data, and the service view -- in all IRP_MJ_WRITE requests with a
length of 65536 bytes.
This finally put me on the right track. At first, as some of the IO
DLLs are limited to reading a maximum of 32kB at once, I had a limitation
in the driver to accept only reads up to 32kB. In the control structure
used to describe a job [between driver<-->service] there is a 'size'
field, and because of that 32kB limitation I thought 'Hey, save two bytes,
take USHORT instead of ULONG'. [No worries, I have already slapped myself
more than once for this, no need to do it again].
Later I found out that Windows usually sends 4kB or 64kB requests, and
since all failed 64kB requests came in again later as a series of smaller
requests, I decided to allow 64kB reads directly in order to reduce the
overhead. I have loads of ASSERTs in my code, and even more if()'s to find
all invalid or fishy requests, but for some reason that's beyond me all
checks on the 'size' field of the control structure in read/write requests
got skipped, both in the driver and in the service. The latter even fully
trusted the (ULONG, sigh) Length field describing the WRITE and ignored
the number of bytes actually returned by DeviceIoControl
[loop: /slap; goto loop]. I must have done that part of the code on a
Monday morning.
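
The missing check on the service side could look roughly like the sketch
below. It is only an illustration: JobHeader, its fields and
job_is_complete() are hypothetical names, not the real control structure,
and it assumes the driver reports exactly header plus payload as the
number of bytes returned. The bytes_returned argument stands for the value
DeviceIoControl hands back in lpBytesReturned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical job header; the real control structure is not shown here. */
typedef struct {
    uint32_t command;
    uint32_t length;   /* payload bytes the driver claims to have sent */
} JobHeader;

/*
 * Returns true only if the number of bytes actually received matches what
 * the header claims. A truncated transfer (header only, no payload) is
 * rejected instead of being processed with stale buffer contents.
 */
static bool job_is_complete(const void *buffer, size_t bytes_returned)
{
    JobHeader hdr;

    if (buffer == NULL || bytes_returned < sizeof hdr)
        return false;

    memcpy(&hdr, buffer, sizeof hdr);   /* avoids alignment assumptions */
    return bytes_returned - sizeof hdr == hdr.length;
}

/* Small self-check showing the intended use. */
void demo_job_check(void)
{
    JobHeader hdr = { 1u, 65536u };

    /* Only the header arrived: this is the failure described above. */
    assert(!job_is_complete(&hdr, sizeof hdr));
    /* Header plus the full 64kB payload arrived. */
    assert(job_is_complete(&hdr, sizeof hdr + 65536u));
}
```

With a check like this, the first truncated 64kB write would have been
rejected in the service instead of silently writing garbage.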
So, when the driver forwarded IRP_MJ_WRITE to the service, it built a
buffer (control structure; last field is variable size and contains the
data to be written) and finished the command with
IoStatus.Status=STATUS_SUCCESS;
IoStatus.Information=cmd->size=(USHORT)(sizeof(ControlStructure)+AmountOfDataToBeWritten);
For a 65536-byte write the sum no longer fits into 16 bits, so the cast
throws away the upper bits and only the size of the control structure is
left. As a result, the service received just a valid control structure but
no data at all, and therefore used some random buffer contents... As
we just use the virtual drives read-only [apart from rare 'chkdsk's and
the little changes the file system does], this error didn't show up for
a long time.
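
To make the arithmetic concrete, here is a small stand-alone sketch of the
truncation, next to a checked variant that refuses oversized totals.
ControlHeader, truncated_size() and checked_size() are made-up names for
illustration; the real structure layout is different, only the 16-bit
'size' field matches what is described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical job header with a 16-bit size field (USHORT equivalent). */
typedef struct {
    uint32_t command;
    uint16_t size;
} ControlHeader;

/* What the buggy assignment stores: the sum cut down to its low 16 bits. */
static uint16_t truncated_size(uint32_t payload_len)
{
    return (uint16_t)(sizeof(ControlHeader) + payload_len);
}

/* Checked variant: fails if the total does not fit, instead of truncating. */
static bool checked_size(uint32_t payload_len, uint16_t *out)
{
    uint64_t total = (uint64_t)sizeof(ControlHeader) + payload_len;

    if (total > UINT16_MAX)
        return false;
    *out = (uint16_t)total;
    return true;
}

/* Small self-check showing both behaviours. */
void demo_truncation(void)
{
    uint16_t size = 0;

    /* 4kB write: header plus payload still fits into 16 bits. */
    assert(truncated_size(4096u) == sizeof(ControlHeader) + 4096u);
    /* 64kB write: the payload length vanishes, only the header is left. */
    assert(truncated_size(65536u) == sizeof(ControlHeader));
    /* The checked variant reports the problem instead of hiding it. */
    assert(!checked_size(65536u, &size));
}
```

The simpler fix is of course to make the field a ULONG, but a range check
like checked_size() catches the problem even if the field type is wrong.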
*cry*
I still can't believe how I managed to omit so many safety checks in one
code path...
Cheers,
Michael B.
Vogon International GmbH