See below…
-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@gmail.com
Sent: Friday, November 26, 2010 9:43 AM
To: Windows File Systems Devs Interest List
Subject: [ntfsd] Processing write instructions using MiniFilters
Hi!
I’m trying to code a MiniFilter. I need to access the data involved in write
operations for further processing. For example, if a user changes one
character in a 200MB text file and saves it, I guess only the block involved
in the transaction will be saved on disk, not all of the 200MB. What I need
is to access that block of data for further processing. The problem is that
the methods I’m trying seem to provide pointers to the whole data block (the
200MB block) not just the transaction in question. I’m using the samples
found in the WinDDK as a reference, particularly the "Scanner" sample.
****
Suppose I change one character in a 200MB file in NotePad, WordPad, Word,
Excel or PowerPoint. What happens? When I do a save, it writes 200MB of
data back out. In fact, nearly every program that exists simply rewrites
the entire file contents. While it is theoretically possible to rewrite
only the page that changes, handling deletions and insertions gets
trickier, and the old ISAM techniques (essentially creating linked lists of
file pages) are generally not supported in most programs.
Thinking that you will see only one block written is naively optimistic.
There may well be programs out there that implement ISAM-style internal
representations (I’ve certainly done this for databases), but this is
actually uncommon, and the increasing speed of hard drives discourages these
sophisticated approaches (why spend a lot of effort making an update take
close-to-zero time when, with no effort, you can work with a "tolerable"
amount of time?). So on the whole, you can expect that the file is not
actually changed; instead, a NEW file is created, the entire contents are
written, and then a sequence of rename-and-delete calls is used to update
the file. You can even read the WordPad code and discover
this is how it is done. The performance of Word and PowerPoint on large
documents makes it evident that this is how they work (save time is
proportional to file size and not proportional to change size).
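The rewrite-and-rename save pattern can be sketched in portable C. This is an illustration, not any editor's actual code; the helper name and file names are hypothetical, and a Windows editor would more likely call ReplaceFile():

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the "write a NEW file, then rename over the old one" save
 * pattern described above. rename() is the portable analogue of the
 * rename-and-delete sequence; names here are illustrative only. */
static int save_file(const char *path, const char *contents)
{
    char tmp[512];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;

    /* The ENTIRE contents are written, even for a one-character edit. */
    size_t len = strlen(contents);
    if (fwrite(contents, 1, len, f) != len) { fclose(f); return -1; }
    if (fclose(f) != 0) return -1;

    /* Replace the old file with the new one. The remove() is needed on
     * Windows, where rename() will not overwrite an existing file. */
    remove(path);
    return rename(tmp, path);
}
```

Note that a one-character change still pushes the full contents through fwrite(), which is exactly what a filter sitting below the application will observe.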
****
What I'm doing is:
- Register the following callback routines: IRP_MJ_CREATE, IRP_MJ_CLEANUP, IRP_MJ_WRITE.
- Register the following context: FLT_STREAMHANDLE_CONTEXT.
- In the create post-operation callback routine, send data to USER_MODE for further processing via FltSendMessage().
Anyway, both of these methods point to the whole data block and not to the
data being operated upon. At a later time I would need to reinsert the
processed transactions. So, can I extract just the write transactions?
*****
While I have, in the past, worked on systems where text editors updated the
file using transactions, and after a crash you could resume typing where you
left off (IBM's AIX system, for example), Windows does not seem to encourage
this model of interaction with the file system. The older
read-everything-then-write-everything model seems to dominate. So whether or
not you are going to learn anything interesting from studying the reading
and writing of files really depends on which programs you are studying.
In general, only databases (SQL Server, Oracle, MySQL, and friends) update
records in place, and in general they implement higher-level transaction
mechanisms to guarantee data integrity across serious failures (application
crashes, database server crashes, OS crashes, and hardware failures).
In general, when you have partial in-place updates, you are limited to the
block size of the atomic unit of storage, which could be either a page or a
disk sector, 4096 or 512 bytes, since these are the only "addressable units"
of the storage device. There is no way on a disk to write the 37th byte of
the 93rd sector of the 87th track of the 233rd cylinder. You write the
entire 93rd sector. At a higher level, you work in terms of the logical
addressable units of the file structure and they get mapped down to the
atomic addressable units of the device; so, for example, your high-level
abstract atomic unit might be the page (4096 bytes) even though the atomic
unit on the device is the sector (512 bytes). File systems which are
transacted tend to have their own notion of atomic entities, and while the
transacted system might still think in terms of the 37th byte of the 93rd
sector, this will almost invariably be mapped to either a logically atomic
or physically atomic unit of storage. My recollection from 1990 is that
the AIX system handled byte-level transactions, but at lower levels they
were implemented as full-sector writes; it has been 20 years since I
attended that talk, though. The files were not simple byte sequences, but
really had a complex internal representation.
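The read-modify-write cycle implied by sector addressing can be sketched as follows, with the "disk" modeled as an in-memory array of sectors (a toy model, not device code):

```c
#include <string.h>

#define SECTOR_SIZE 512

/* To change one byte, the device forces a read-modify-write of the whole
 * sector: read it, patch it in memory, write it back. The "disk" here is
 * just an in-memory array standing in for the device. */
static void patch_byte(unsigned char disk[][SECTOR_SIZE],
                       size_t sector, size_t offset, unsigned char value)
{
    unsigned char buf[SECTOR_SIZE];
    memcpy(buf, disk[sector], SECTOR_SIZE);  /* read the whole sector  */
    buf[offset] = value;                     /* modify one byte        */
    memcpy(disk[sector], buf, SECTOR_SIZE);  /* write the whole sector */
}
```

Writing "the 37th byte of the 93rd sector" thus always costs a full-sector transfer, no matter how small the logical change.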
ISAM technology creates linked lists of information in the file. For
example, one of the parameters to an ISAM (indexed sequential access method,
in IBM-speak) is the “percentage padding” at the end of each logical block,
so that if you do insertions, the insertion appears as a link to the
subblock at the end of the logical block, with a back pointer, e.g., the way
you implemented the insertion of B into the sequence of As below would be
AAAAAAAAAAAAAAAACCCCCCCCCCCCCCC…CCCCCCCCCCCCCC
  => AAAAAAAAABBBBAAAABBBAAACCCCCCCCCCCCCCC…CCCCCCCCCCCCCC
In Windows this is done by reading the entire initial sequence into memory,
and then writing the entire sequence back out. In an ISAM file, it would be
represented as
+----------------------------------+
|                                  |
+-)------------------------+       |
|  |                       |       |
v  v                       |       |
AAAAA*1**2*AAACCCCCCCC  AAAABBBB*3*AAAABBBB*4*
|  |                    ^  ^
|  |                    |  |
+-)-----------------+   |  |
|                   |   |  |
+-------------------+---+--+
The key here is that you can put these pieces out at the end of the logical
area until they get too big; then you have to add a new logical block and
overflow into it. In the VISAM (Virtual Indexed Sequential Access Method)
in IBM's TSS/360, the logical elements were 4096-byte pages. We kept 80% of
the data at the front for the continuous data and 20% at the end for
overflow records. When the 20% filled up, it contained links to overflow pages.
After a while, traversing the file “sequentially” involved so many seeks and
page-faults to move data in from the overflow pages that performance went to
pieces, so we had to apply the equivalent of ‘defrag’ to relinearize the
data with fresh new data pages with fresh new 20% padding at the end.
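The 80%/20% page layout described above can be modeled in a few lines of C. This is a toy model under simplifying assumptions (records are raw byte runs, overflow is just a page number), not a reconstruction of VISAM:

```c
#include <string.h>

#define PAGE_SIZE 4096
#define MAIN_AREA ((PAGE_SIZE * 80) / 100)  /* 80% sequential data,   */
                                            /* 20% slack for inserts  */

/* A VISAM-style page: records occupy the front 80%; insertions land in
 * the 20% slack at the end; when the slack is full, the insertion must
 * spill to an overflow page (modeled here as a page-number link). */
typedef struct {
    unsigned char data[PAGE_SIZE];
    size_t main_used;    /* bytes used in the 80% area  */
    size_t slack_used;   /* bytes used in the 20% slack */
    int overflow_page;   /* -1 = no overflow page yet   */
} isam_page;

/* Try to place an inserted record in the slack area; return 0 on
 * success, -1 if the caller must allocate an overflow page. */
static int insert_record(isam_page *p, const unsigned char *rec, size_t len)
{
    size_t slack_free = (PAGE_SIZE - MAIN_AREA) - p->slack_used;
    if (len > slack_free)
        return -1;                       /* spill to overflow page */
    memcpy(p->data + MAIN_AREA + p->slack_used, rec, len);
    p->slack_used += len;
    return 0;
}
```

The "defrag" pass mentioned above would copy records back into sequential order in fresh pages, resetting slack_used to zero.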
At some Driver Conference or WinHEC I went to Tony Mason's talk on how to do
insertions in a compressed, encrypted file system, and I had this wave of
nostalgia as he described the VISAM approach I'd used back in 1968. (VISAM
was built into the IBM OS, so we could simply say "give us the next
sequential record" and we got the next sequential record; none of these toy
byte-sequence models!) Variable-length VISAM was very powerful and we used
it extensively, but it was the intrinsic support that made it work.
Now, you could actually do all of this with a minifilter, provided you knew
that the program worked with a model of I/O that made sense (for example, it
didn't try to move the contents of the file around, but worked with notions
like "insert this record between these two other records"). So what we'd do
is push the records around just enough to make room for the insertion [which
might just be the record sequence number and the pointer to the record
contents], then insert the entire new record in the slack space at the end of
the page, or in the overflow page. "Read next record" knew how to walk this
structure sequentially by record number (or what we'd recognize today as the
table key of a relational table). The smallest logical atomic entity was
the record, and it would be implemented at lower levels by logical or
physical atomic elements (e.g., ISAM on hard drives was implemented by
sector primitives, even though a record might use only part of a sector or
span several-and-a-fraction sectors). These structures are great student
exercises in an OS course, by the way.
But I could see people in the audience who were reacting to Tony's talk by
seeing a whole new paradigm of file management opening up (besides, he gave
a GREAT talk!), once they got beyond the concept that all files are
necessarily contiguous sequences of bytes with no internal structure. It is
cool to see people outgrow the childish file system models of Unix/Linux,
which they always thought were the only possible representation of
information on a disk. Since I grew up with OS/360 and TSS/360, we were
immediately exposed to sophisticated file architectures; when I got to
TOPS-10 and realized that files were sequences of bytes, I was dismayed to
think that something this weak was the fundamental concept of data
organization in the operating system. Then Unix added more disappointment
by the paucity of its OS (for years, it didn't even support file locking,
and when it did, it was not enforced). NTFS is just a fancy implementation
of the MS-DOS FAT system: the concept that you cannot replace a file that is
in use, for example, was easily dealt with in 1968 as soon as someone knew
that this was a real problem, so NTFS looks 40 years behind the times; and
TOPS-20 handled versioning cleanly because it was built on a model that
realized that file systems are not punched cards.
At least part of the problem here is that you are showing us either the
elephant’s leg or the elephant’s tail and asking us for a description of how
to transport an elephant. You haven’t said what you are trying to
accomplish. Perhaps the goal is to implement a mini-filter driver that
converts a plain-vanilla file into a fully-transacted file by introducing a
transaction layer, which is certainly an interesting and potentially useful
idea, but in the context of a newsgroup like this, you are better off
describing the problem you are trying to solve instead of asking detailed
questions about how to track write requests, which is an implementation
strategy that may or may not solve your problem. There are probably a
dozen readers of this newsgroup who have done something like this
and can offer constructive advice as to how to actually accomplish it, and
what will or will not work. But asking how to make some low-level
mechanism work, such as detecting write operations, or how to create
something that handles this efficiently, masks what your intent is, making
it hard to evaluate the validity of a solution.
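That said, on the narrow question of extracting just the write: a pre-operation IRP_MJ_WRITE callback already describes exactly the span being written. Data->Iopb->Parameters.Write.ByteOffset and Length give the file offset and size of that write, and the write buffer (or MDL) holds only those bytes. A pointer to the whole file usually means you are looking at something else, such as a mapped section or a lazy-writer flush; paging I/O can be distinguished via IRP_PAGING_IO in Iopb->IrpFlags. The bookkeeping can be modeled in user-mode C; the struct below is a stand-in for the kernel-side parameters, not real FltMgr API:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for what a pre-IRP_MJ_WRITE callback sees. In the kernel,
 * these fields come from Data->Iopb->Parameters.Write (ByteOffset,
 * Length, WriteBuffer/MdlAddress). The write carries ONLY the span
 * being written, not the whole file. */
typedef struct {
    long long byte_offset;       /* file offset of this write          */
    size_t length;               /* number of bytes in this write      */
    const unsigned char *buffer; /* the data being written             */
} write_op;

/* Copy out just this write "transaction" for later processing (e.g.,
 * to queue it to user mode). The caller owns the returned buffer. */
static unsigned char *capture_write(const write_op *op)
{
    unsigned char *copy = malloc(op->length);
    if (copy != NULL)
        memcpy(copy, op->buffer, op->length);
    return copy;
}
```

In a real filter, the captured (offset, length, data) tuples are what you would ship to user mode via FltSendMessage() and later replay or reinsert.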
****
And is there a faster way to send data to USER_MODE than using communication
ports? I’m guessing ports are rather slow.
****
Guessing is a poor methodology. There are several dangerous assumptions in
that question. For example, do you have any evidence that IOCPs are in any
way slow? Or that they have any significant impact on overall performance
compared to other methodologies (e.g., waiting on an event, using a
synchronous I/O request, etc.)? If I really cared about the performance, the
first thing I’d do is start doing measurements to see how many events per
second I could process using several different approaches. My own approach
is to favor IOCPs because they scale well, particularly if you can have
multiple threads waiting, because they allow the responses to be handled
concurrently on multiprocessor systems, whereas other approaches tend to
limit you to a single thread, often the same thread that initiated the I/O,
creating a serious response bottleneck. This can actually be measured, and
you can do simulation studies that can demonstrate this (I’ve done them,
years ago, in another lifetime on another operating system).
I tend to worry when someone asks for a “faster way” without giving any
evidence that the way that they have selected exhibits any performance
pathology. In the absence of compelling evidence that an IOCP causes any
performance problem, you should assume it does not; and if you think it
might, do some measurements comparing the various methods.
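Such a measurement needs nothing elaborate. A minimal events-per-second harness might look like the following, where process_event() is a placeholder for whatever dispatch mechanism is under test (IOCP wait, event wait, synchronous request, and so on):

```c
#include <time.h>

/* Minimal throughput harness: run the candidate dispatch mechanism in a
 * loop and divide count by elapsed time. The volatile sink keeps the
 * compiler from optimizing the loop away. */
static volatile unsigned long sink;

static void process_event(unsigned long i)  /* placeholder workload */
{
    sink += i;
}

static double events_per_second(unsigned long n)
{
    clock_t start = clock();
    for (unsigned long i = 0; i < n; i++)
        process_event(i);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (double)n / secs : 0.0;
}
```

Swap each candidate mechanism into process_event() and compare the numbers; only then is it worth arguing about which approach is "slow".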
****
Thank you in advance for your help.
NTFSD is sponsored by OSR
For our schedule of debugging and file system seminars (including our new fs
mini-filter seminar) visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer