Are paging IO requests synchronized by the Memory Manager

Guys,

I’ve got what is, apparently, a pretty dumb question. Let’s say an FSD does not do any synchronization. Can it receive paging IO requests that somehow conflict with one another? It is understandable that they may conflict with cached and non-cached IO if no synchronization is done by the FSD, but can they somehow get into a conflict with one another? Judging from the FASTFAT WDK sample, FASTFAT tends to acquire the paging IO resource shared. My theory (for which, BTW, I found indirect confirmation in Nagar’s book) is that, since an FSD may not receive paging IO directly from anyone apart from the Memory Manager (it may originate either from the Memory Manager itself, i.e. from the Mapped/Modified Page Writer threads or upon a page fault, or from some other component that has called this or that MmXxx routine), the FSD apparently just assumes that the Memory Manager synchronizes all paging IO itself, so it does not bother synchronizing paging IO requests with one another. Am I right or wrong here???

Anton Bassov

FASTFAT’s paging IO synchronization only prevents the file from being
truncated while paging IO is pending.

FASTFAT’s PagingIoResource, as you can notice, is only acquired exclusive
during file truncation. So, acquiring it shared serves the sole purpose of
synchronizing with truncations, so that a truncation can run only when there is
no pending paging IO at all.

This prevents paging IO from overwriting disk sectors that were freed while
truncating a file and possibly reused for another file.
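
In code, the pattern looks roughly like this (a minimal sketch, assuming a
FASTFAT-style FCB that embeds the standard FSRTL common header; the function
names and surrounding code are hypothetical):

#include <ntifs.h>

// Paging write path: shared acquisition keeps truncation out, but lets
// any number of paging writes run in parallel.
VOID SketchPagingWrite(PFSRTL_ADVANCED_FCB_HEADER Header)
{
    FsRtlEnterFileSystem();
    ExAcquireResourceSharedLite(Header->PagingIoResource, TRUE);

    // ... issue the non-cached transfer to disk ...

    ExReleaseResourceLite(Header->PagingIoResource);
    FsRtlExitFileSystem();
}

// Truncation path: exclusive acquisition waits until no paging IO is
// pending, so the freed sectors cannot be hit by an in-flight write.
VOID SketchTruncate(PFSRTL_ADVANCED_FCB_HEADER Header)
{
    FsRtlEnterFileSystem();
    ExAcquireResourceExclusiveLite(Header->PagingIoResource, TRUE);

    // ... shrink the allocation and free the clusters ...

    ExReleaseResourceLite(Header->PagingIoResource);
    FsRtlExitFileSystem();
}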

As for generic paging IO synchronization - it is often done by Cc or Mm
calling the FSD’s FastIo synchronization callbacks.

For instance, Cc’s read-ahead calls the callback provided to
CcInitializeCacheMap, and so do the Mapped Page Writer, CcFlushCache and the
lazy writer.

The goal of these callbacks is to allow Cc and Mm to take the file stream
locks first, and their own internal locks (protecting the list of cache maps or
memory-mapped segments) next.
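
To make the registration point concrete, here is a sketch (the My* names are
hypothetical; CcInitializeCacheMap and the callback structure are the real
interface):

#include <ntifs.h>

BOOLEAN MyAcquireForLazyWrite(PVOID Context, BOOLEAN Wait);
VOID    MyReleaseFromLazyWrite(PVOID Context);
BOOLEAN MyAcquireForReadAhead(PVOID Context, BOOLEAN Wait);
VOID    MyReleaseFromReadAhead(PVOID Context);

CACHE_MANAGER_CALLBACKS MyCacheCallbacks = {
    MyAcquireForLazyWrite,    // lazy writer asks the FSD for the stream locks
    MyReleaseFromLazyWrite,
    MyAcquireForReadAhead,    // read-ahead does the same
    MyReleaseFromReadAhead
};

// On the first cached IO to a stream, the FSD hands these to Cc:
//
//   CcInitializeCacheMap(FileObject, &FileSizes, FALSE,
//                        &MyCacheCallbacks, Fcb /* lazy-write context */);
//
// From then on, Cc calls back into the FSD to take the stream locks
// before touching its own internal locks.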

The notion of a “top level IRP” is about which locks were already taken by
Cc/Mm while submitting this IRP; the FSD can look at it and skip the locking to
avoid recursion if it wants to.
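
A sketch of the resulting check in an FSD dispatch routine (hypothetical
helper; FASTFAT’s FatIsIrpTopLevel is the real-world equivalent):

#include <ntifs.h>

BOOLEAN SketchIsIrpTopLevel(PIRP Irp)
{
    if (IoGetTopLevelIrp() == NULL) {
        // Nothing above us: we are the top-level component and must do
        // the locking ourselves; record the fact for recursive re-entry.
        IoSetTopLevelIrp(Irp);
        return TRUE;
    }

    // Cc/Mm (or a recursive call of our own) is already top-level and
    // already holds the relevant locks, so the FSD may skip them.
    return FALSE;
}

// Whoever set the field must clear it before completing the request:
//   if (WeWereTopLevel) IoSetTopLevelIrp(NULL);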


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


> FASTFAT’s PagingIoResource, as you can notice, is only acquired exclusive
> during file truncation. So, acquiring it shared serves the sole purpose of
> synchronizing with truncations, so that a truncation can run only when there
> is no pending paging IO at all.

Exactly!!!

> For instance, Cc’s read-ahead calls the callback provided to
> CcInitializeCacheMap, and so do the Mapped Page Writer, CcFlushCache and the
> lazy writer.

Let’s concentrate on the last three - they all write, and, therefore, unless paging IO is synchronized by the Memory Manager, they would all have to acquire the main resource exclusively. However, the lazy writer does not do it, does it??? At the same time, the CcFlushCache() callback does take it exclusively. Although FASTFAT does not provide an AcquireForModWrite() callback, comments in the file somehow suggest that it, indeed, pre-acquires the main resource exclusively. Why is there such a discrepancy between these callbacks???

My theory stands as follows. The Mapped Page Writer may increase valid data length, but the lazy writer may not, because, unlike the Mapped Page Writer, it is never a top-level component. When you flush a file,
you don’t really want valid data length to be increased, do you? Therefore, the conclusion I came to is that paging IO uses the paging resource in order to protect itself against truncation, and relies upon the main one in order to synchronize changes to valid data length - it does not seem to synchronize paging reads and writes that don’t change valid data length with one another, because it expects the Memory Manager to serialize paging IO operations to any particular range of the file…

What I am trying to find out is whether this conclusion is right or wrong…

Anton Bassov

The memory manager guarantees that it will not do extending writes during a paging write operation. This allows a file system to keep allocation structures in paged memory (typically memory mapped.) This isn’t required, but it is the purpose of these interlocks.

The locking model has evolved over the years. It started with the two resource objects in the common header (and the two bits that control the specific acquisition rules) and then moved to the fast I/O routines and now the filter callbacks (which are routinely used in file systems.)

The idea is thus that Mm can call and say “I’m writing up to this point, so guarantee the file won’t shrink below it and I’ll guarantee I won’t do an extending write.” Hence, you get a high water mark.
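
In code, this handshake corresponds to the AcquireForModWrite fast-IO entry;
the prototype below is the real one, while the FCB layout and the body are an
illustrative sketch only:

#include <ntifs.h>

typedef struct _MY_FCB {              // hypothetical FCB layout
    FSRTL_ADVANCED_FCB_HEADER Header;
} MY_FCB, *PMY_FCB;

NTSTATUS
MyAcquireForModWrite(
    PFILE_OBJECT FileObject,
    PLARGE_INTEGER EndingOffset,      // Mm's promised high water mark
    PERESOURCE *ResourceToRelease,
    PDEVICE_OBJECT DeviceObject)
{
    PMY_FCB Fcb = (PMY_FCB)FileObject->FsContext;

    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(EndingOffset); // a real FSD can compare this
                                          // against EOF/VDL when deciding
                                          // what (if anything) to lock

    // Shared acquisition keeps truncation out for the duration of the
    // write; Mm, for its part, promises not to extend past EndingOffset.
    if (!ExAcquireResourceSharedLite(Fcb->Header.PagingIoResource, FALSE)) {
        return STATUS_CANT_WAIT;          // Mm will retry the flush later
    }

    *ResourceToRelease = Fcb->Header.PagingIoResource;
    return STATUS_SUCCESS;
}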

Cc does not need this because Cc already calls in advance to ensure the file is big enough (those pesky paging I/O set information calls it sends.) This locking has been around longer than the additions that allow Mm to specify the maximum offset, so it doesn’t offer the “maximum offset” option. Presumably, if it were a bottleneck of some sort, the Windows OS folks would have moved the API forward.

VDL should NEVER be moved forward by Cc, because Cc doesn’t have data for that region of the file - if it did, the VDL would already be moved. MM might have valid data for that region (e.g., via memory mapped files) and thus can move the VDL out. So only Mm and the FSD should move VDL.

Of course, VDL management is even worse than you think. We recently found that Cc special cases a VDL of 8EB when you call CcUninitializeCacheMap (if it is 8EB, it does the cache uninit in thread context. Otherwise it posts it. We found this because we were working out a deadlock.) If this is documented somewhere, I have yet to find it.

Tony
OSR

> The memory manager guarantees that it will not do extending writes during a paging write operation.

What do you mean by “extending write”??? There are file size and valid data length. Although a paging write is not allowed to extend the former, it can still extend the latter, as long as it is done by the top-level component… Since the Cache Manager is never the one, the lazy writer never does it.

In any case, there is still no answer to my main question - can an FSD safely assume that it will never ever receive a non-extending paging write that conflicts with other paging writes/reads to the same byte range of the file, because MM will serialize it with other paging reads/writes to a given byte range??? It is understandable that it may get into a conflict with non-paging IO to the same byte range, but may it somehow get into a conflict with other non-extending paging IO to the same byte range as well???

Anton Bassov

> Let’s concentrate on the last three - they all write, and, therefore, unless
> paging IO is synchronized by the Memory Manager, they would all have to
> acquire the main resource exclusively.

Why? Why serialize writes to some potentially huge file stream? Just imagine
serializing writes to an Oracle database sitting on hardware RAID.

> However, the lazy writer does not do it, does it??? At the same time, the
> CcFlushCache() callback does take it exclusively.

The main resource? Why? Paging IO usually never acquires it. It is usually
acquired to protect file growth (as opposed to truncation). Since paging IO can
never grow the file, there is no need to acquire it in the paging IO path.

This allows cache flushes by the lazy writer to keep running while the file is
being grown.
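
A sketch of why the two paths never collide (hypothetical FCB layout;
FASTFAT’s FatAcquireFcbForLazyWrite does additional bookkeeping): the
lazy-write callback takes only the paging resource shared, while an extending
cached write holds the main resource exclusive.

#include <ntifs.h>

typedef struct _MY_FCB {
    FSRTL_ADVANCED_FCB_HEADER Header;
} MY_FCB, *PMY_FCB;

BOOLEAN MyAcquireForLazyWrite(PVOID Context, BOOLEAN Wait)
{
    PMY_FCB Fcb = (PMY_FCB)Context;   // the context registered with Cc

    // Shared: flushes run in parallel with each other and with file
    // growth, but still exclude truncation.
    return ExAcquireResourceSharedLite(Fcb->Header.PagingIoResource, Wait);
}

VOID MyReleaseFromLazyWrite(PVOID Context)
{
    PMY_FCB Fcb = (PMY_FCB)Context;

    ExReleaseResourceLite(Fcb->Header.PagingIoResource);
}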

> My theory stands as follows. The Mapped Page Writer may increase valid data
> length,

Yes, but not EOF.

> but the lazy writer may not, because, unlike the Mapped Page Writer, it is
> never a top-level component.

No, because of another thing - the very notion of VDL. Data beyond VDL cannot
be flushed by the lazy writer; this is the definition of the in-memory VDL as
used by FAT (NTFS’s on-disk VDL is another matter).
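
Expressed in write-path terms, the rule looks like this (a sketch with a
hypothetical helper; FASTFAT applies the same rule inside FatCommonWrite):

#include <ntifs.h>

// Trim a paging write so it never transfers cache contents from beyond
// VDL to disk. Returns the byte count actually allowed.
ULONG
SketchClampPagingWrite(
    PFSRTL_ADVANCED_FCB_HEADER Header,
    LONGLONG StartingOffset,
    ULONG ByteCount)
{
    LONGLONG Vdl = Header->ValidDataLength.QuadPart;

    if (StartingOffset >= Vdl) {
        return 0;                        // nothing valid to flush at all
    }

    if (StartingOffset + (LONGLONG)ByteCount > Vdl) {
        // The transfer straddles VDL: write only the valid prefix.
        ByteCount = (ULONG)(Vdl - StartingOffset);
    }

    return ByteCount;
}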

> When you flush a file, you don’t really want valid data length to be
> increased, do you?

Increasing VDL and reporting it to Cc is the very action that allows Cc’s lazy
writer to flush the new data at the end of the grown file.

> Therefore, the conclusion I came to is that paging IO uses the paging
> resource in order to protect itself against truncation, and relies upon the
> main one

No, paging IO never acquires the main resource.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> VDL should NEVER be moved forward by Cc, because Cc doesn’t have data for
> that region of the file - if it did, the VDL would already be moved.

Really? I think that a cached write which grows the file does the following
sequence - advance EOF, CcCopyWrite, then advance VDL.
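
In code, that sequence would look roughly like this (a sketch: locking, error
handling and cache-map setup are omitted, the FCB layout is hypothetical, and
only CcSetFileSizes and CcCopyWrite are the real calls):

#include <ntifs.h>

typedef struct _MY_FCB {
    FSRTL_ADVANCED_FCB_HEADER Header;
} MY_FCB, *PMY_FCB;

BOOLEAN
SketchExtendingCachedWrite(
    PFILE_OBJECT FileObject,
    PMY_FCB Fcb,
    LARGE_INTEGER Offset,
    ULONG Length,
    PVOID Buffer)
{
    CC_FILE_SIZES Sizes;

    // 1. Advance EOF and tell Cc, so the cache map can accept the new
    //    data; the range between the old VDL and the new EOF is not yet
    //    valid, hence not flushable by the lazy writer.
    Fcb->Header.FileSize.QuadPart = Offset.QuadPart + Length;
    Sizes.AllocationSize  = Fcb->Header.AllocationSize;
    Sizes.FileSize        = Fcb->Header.FileSize;
    Sizes.ValidDataLength = Fcb->Header.ValidDataLength;
    CcSetFileSizes(FileObject, &Sizes);

    // 2. Copy the caller's data into the cache.
    if (!CcCopyWrite(FileObject, &Offset, Length, TRUE, Buffer)) {
        return FALSE;
    }

    // 3. Only now advance VDL, making the newly written range flushable.
    Fcb->Header.ValidDataLength.QuadPart = Offset.QuadPart + Length;
    return TRUE;
}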

I think Cc cannot hold data past EOF (and the calling FSD must guarantee
this), but it can - for a short period of time - hold data between VDL and EOF,
and that data is not flushable by the lazy writer.

Am I wrong? If yes - what is the correct picture?

> that Cc special cases a VDL of 8EB when you call CcUninitializeCacheMap (if
> it is 8EB, it does the cache uninit in thread context. Otherwise it posts it.
> We found this because we were working out a deadlock.)

Great testing with 8EB per single cached file :) How long did it take to
generate such a file for testing?


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> writes/reads to the same byte range of the file, because MM will serialize
> it with other paging reads/writes to a given byte range???

No, it will not.

BTW - paging IO ignores byte range locks; it is too low-level for them.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

I think you missed my point, Max - *Cc* won’t move VDL in a way that is not initiated by the FSD. If the FSD calls CcCopyWrite, it knows the VDL is moving out. However, in the case of Mm, the VDL may move and the FSD won’t know it. That’s why this is an issue for Mm but not for Cc.

VDL is a squishy concept anyway. For FAT it isn’t persistent; for NTFS it is. Even the point about an “extending write” is now more nebulous than it once was, because the original goal (to keep allocation routines out of the paging path) isn’t meaningful when you combine it with sparse files.

Indeed, it is only more recently that there’s been an API for moving VDL (via SET_INFORMATION now) and I suppose that could now be used.

Earlier this year we found a case for a file system where when we asked to move allocation size (to guarantee that we weren’t going to run out of disk space inside a transactional system) they converted that into a request to move EOF (and of course block zeroed all the data in between.) THAT was rather surprising…

Anton, I’m not sure I understand your distinction between “top level” and “non-top-level” requests. The Lazy Writer and the Mapped Page Writer both behave in a similar fashion from our perspective - they call the FSD, they lock the file against a class of changes, and then they perform their operations. These locks are done to protect write operations and are entirely optional. As I recall, HPFS did NOT implement the locking behavior that FAT and NTFS do - that’s why, if there is no FCB->Resource, no locking is done for the modified page writer (which is really used by the mapped page writer - the mapped page writer was added to break a deadlock that existed when there was JUST a modified page writer.)

When a page is being written out to disk, any other process that faults on it will wait for the I/O to complete. Actually, this applies to read operations as well.

Tony
OSR

> No, paging IO never acquires the main resource.

Please look at the CcFlushCache() callback in the FASTFAT sample - this is exactly what it does, and, apparently, the AcquireForModWrite() callback does the same. Since the lazy writer never increases valid data length, it just does not need it…

>> but the lazy writer may not, because, unlike the Mapped Page Writer, it is
>> never a top-level component.
>
> No, because of another thing - the very notion of VDL. Data beyond VDL cannot
> be flushed by the lazy writer; this is the definition of the in-memory VDL as
> used by FAT

Don’t confuse the reason with the consequence - data beyond VDL may be flushed by anyone who happens to be the top-level component, but since the lazy writer is never the one, it cannot do it…

>> Let’s concentrate on the last three - they all write, and, therefore,
>> unless paging IO is synchronized by the Memory Manager, they would all have
>> to acquire the main resource exclusively.
>
> Why? Why serialize writes to some potentially huge file stream? Just imagine
> serializing writes to an Oracle database sitting on hardware RAID.
>
>> writes/reads to the same byte range of the file, because MM will serialize
>> it with other paging reads/writes to a given byte range???
>
> No, it will not.

My logic stands as follows (in fact, it looks like I am just attempting to answer my own question and to convince myself that I am right).

Although the same mapped file range may have multiple views, all these views
are backed by the same physical pages. Therefore, in order to serialize reads
and writes to any given range of a file, all that MM has to do is serialize the
pages, i.e. ensure that when multiple page faults on the same range occur, or
multiple requests to flush the same page to disk get made, only one of them can
actually result in writing to the disk or updating the cache. The only one who
can do this is MM itself - after all, neither the FSD nor Cc has control over
the pages that individual processes have mapped into their address spaces, i.e.
they can control neither loading pages into RAM (because handling page faults
is MM’s responsibility) nor flushing them to disk with FlushViewOfFile()
(because MM is the only one who knows whether a given page has been modified
and needs to be written to disk, or whether it can just be discarded because
the calling process has not modified it). Therefore, serializing paging reads
against other paging reads, and paging writes against other paging writes, to
the same range has to be the Memory Manager’s responsibility - no one else can
be responsible for it.

In order to get flushed to the disk, a page has to be physically resident in RAM, and, hence, just cannot cause a page fault. The only reason a paging read may be issued is a page fault, which may be caused either by an application/driver that reads a memory-mapped file, or by Cc upon reading data from the disk into the cache. The same page cannot be both resident and non-resident in RAM at the same time, can it??? Therefore, paging reads and writes to the same range are mutually exclusive.

In other words, the conclusion I made is that the FSD does not need to serialize paging IO - the only reason paging IO may acquire any resource is related to file length and valid data length, rather than to serializing paging IO operations with one another…

Anton Bassov

> The only one who can do this is MM itself - after all, neither the FSD nor
> Cc has control over the pages that individual processes have mapped into
> their address spaces,

All of them map the same single page for the given file offset.

There can be only one page reflecting a given FCB and a given offset within it,
or no such page at all.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

Tony,

> Anton, I’m not sure I understand your distinction between “top level” and
> “non-top-level” requests.

This is not my distinction - this is the logic implemented by FASTFAT. When the IRP_MJ_WRITE handler sees that the Modified Page Writer is the top-level component, it sets TopLevelIrp to the IRP itself, i.e. what it does when the IRP has no top-level component, and it leaves the top-level IRP intact if some other component is top-level for the given IRP. As a result, when FatCommonWrite() processes paging IO initiated by the modified writer, it believes that the FSD itself is the top-level component, so it can modify VDL. However, when it sees that some other component is top-level (in the case of the lazy writer’s paging IO it appears to be the Cache Manager, because of the lazy writer callback’s logic), it does not modify VDL…
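
A sketch of that logic (simplified, not a verbatim copy of the FASTFAT
sample):

#include <ntifs.h>

BOOLEAN SketchBecomeTopLevel(PIRP Irp)
{
    PIRP TopLevel = IoGetTopLevelIrp();

    if (TopLevel == NULL ||
        TopLevel == (PIRP)FSRTL_MOD_WRITE_TOP_LEVEL_IRP) {
        // No top-level component, or the modified page writer: the FSD
        // claims top-level status itself, so the write path below is
        // allowed to advance VDL.
        IoSetTopLevelIrp(Irp);
        return TRUE;
    }

    // Somebody else (e.g. Cc on behalf of the lazy writer) is top-level:
    // leave the field intact and do not touch VDL.
    return FALSE;
}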

> When a page is being written out to disk, any other process that faults on
> it will wait for the I/O to complete. Actually, this applies to read
> operations as well.

As long as the above logic is implemented by MM, rather than the FSD, I take the above statement as a confirmation of my theory…

Anton Bassov

> All of them map the same single page for the given file offset.

Exactly, and this is what my whole theory is based upon…

Anton Bassov

The concept of a top level IRP was added in NT 3.51, and at this point about the only thing that directly stores an *IRP* in the TopLevelIrp field is FAT (check out CDFS - it stores an IRP context, as I recall, which is what every other file system does as well.) Thus, I don’t tend to think in terms of “top level” versus “non top level”, which is why I’m asking you what your model for this is. It’s ironic that the one example everyone turns to operates differently from the primary file system everyone uses…

FAT is mostly a snapshot in time. The changes to it have been minimal and mostly required for it to comply with OS changes around it. The FSD/Cc/Mm interactions have preserved existing semantics, so FAT hasn’t really changed in this area.

Be that as it may, you may manage VDL in a much broader number of ways than demonstrated by FASTFAT. One model is to set VDL very, very large (oh, like 8EB!) This will trigger page faults when EOF is moved. It can be filled with highly random data (sometimes, big blocks of zeros are NOT what you want.)

When a page fault occurs, it is MM that serializes other attempts to access that same page. My recollection is that on the write side you are NOT guaranteed a lack of mappings to the page, so you cannot count on the contents not changing while you are writing it. Of course, the PTE dirty bit gets set, so the page will be written again later, but what hits the disk might actually not be what was ever in memory. This is the classic problem of multiple parallel asynchronous writers.

Byte range locks are enforced against user I/O operations (either direct or via a proxy), but then the data is written into the cache. If we enforced byte range locks on paging I/O, the “allowed” user changes would then be “disallowed” paging changes. Memory-mapped applications may USE byte range locks as an advisory locking scheme, but this is not enforced by the OS. It is mixing the two modes (NtReadFile/NtWriteFile versus a mapped view) that can be confusing.
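
A sketch of that enforcement point (FileLock being the FSRTL FILE_LOCK an FSD
typically embeds in its FCB; the helper name is hypothetical):

#include <ntifs.h>

NTSTATUS SketchCheckWriteLocks(PFILE_LOCK FileLock, PIRP Irp)
{
    if (Irp->Flags & IRP_PAGING_IO) {
        // Paging IO bypasses byte range locks entirely: the "allowed"
        // user write already landed in the cache and must stay flushable.
        return STATUS_SUCCESS;
    }

    // User-initiated IO is checked against the recorded locks.
    if (!FsRtlCheckLockForWriteAccess(FileLock, Irp)) {
        return STATUS_FILE_LOCK_CONFLICT;
    }

    return STATUS_SUCCESS;
}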

Hopefully I’ll see you at the file systems class in Stockholm this summer, Anton, and you can pepper me with all sorts of complex questions like this. ;)

Tony
OSR

> There can be only one page reflecting a given FCB and a given offset within
> it, or no such page at all.

Actually, this is not true. There can be more than one:

1 for the data section;
1 for the image section;
1 per transactional (isolated) view (Vista and Server 2008.)

Indeed, it is the fact that we have both data and image sections that causes a certain amount of churning in the file system to ensure that these two copies are properly synchronized (e.g., MmFlushImageSection and CcPurgeCacheSection.) Get this wrong and you’ll see very strange breakage when you copy an executable image onto a different file and then try to execute it (that’d make a good test case: take two programs, “a.exe” and “b.exe”, and in a loop copy each one to “c.exe” and then execute it.)
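
A sketch of that synchronization point (illustrative only; a real FSD does
this in its create/write paths with more checks, and the
SectionObjectPointers come from the FCB):

#include <ntifs.h>

BOOLEAN SketchPrepareForOverwrite(PSECTION_OBJECT_POINTERS SectionPtrs)
{
    // Ask Mm to throw away the image section; this fails while the image
    // is still mapped for execution, which a create path typically turns
    // into STATUS_SHARING_VIOLATION.
    if (!MmFlushImageSection(SectionPtrs, MmFlushForWrite)) {
        return FALSE;
    }

    // Purge the cached (data section) pages so subsequent faults re-read
    // the new contents from disk instead of serving stale bytes.
    CcPurgeCacheSection(SectionPtrs, NULL, 0, FALSE);

    return TRUE;
}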

Tony
OSR

> My recollection is that on the write side you are NOT guaranteed a lack of
> mappings to the page, so you cannot count on the contents not changing while
> you are writing it. Of course, the PTE dirty bit gets set, so the page will
> be written again later, but what hits the disk might actually not be what
> was ever in memory. This is the classic problem of multiple parallel
> asynchronous writers.

Well, this is understandable - if a page is in RAM, it is available to anyone who has it mapped into their address space. Theoretically, MM could mark it as not present (or as RO) in all PTEs, so that writing to it would cause a page fault, and, hence, MM would be able to put the writing process to sleep until the flush operation is complete. However, in order to do that, MM would have to maintain a list of all PTEs that map the page in question, and, AFAIK, MM does not maintain such a table (which, IIRC, is a well-known limitation of NT). This is why the OS cannot guarantee that a purge operation will be successful - it just cannot successfully purge the cache if the target file is mapped into the address space of a process, because MM does not maintain the list of all PTEs that map the target page. In any case, this is beyond the FSD designer’s control - he has no option but to live with this limitation…

> Be that as it may, you may manage VDL in a much broader number of ways than
> demonstrated by FASTFAT.

Of course…

In fact, I would implement a totally different synchronization logic (which is why I asked my original question in the first place)…

> Indeed, it is the fact that we have both data and image sections that causes
> a certain amount of churning in the file system to ensure that these two
> copies are properly synchronized (e.g., MmFlushImageSection and
> CcPurgeCacheSection.) Get this wrong and you’ll see very strange breakage
> when you copy an executable image onto a different file and then try to
> execute it

This is why there is a distinction between data and image sections in the first place. A modified image section is not meant to EVER be written back to the file - instead, it has to be stored in the paging file. In fact, it would be unwise to allow creation of an image section unless the file is opened read-only and the FILE_SHARE_WRITE flag is clear…

> Hopefully I’ll see you at the file systems class in Stockholm this summer

I just wonder why OSR chose Stockholm for its FSD development seminar - I never thought that Sweden was known for its Windows file system developers (which seems to apply to any location in Europe). I would not be surprised if they hosted .NET/Java/PHP/etc. seminars, but when I saw OSR’s advertisement of a Windows file system development seminar in Sweden, my jaw dropped…

Anton Bassov

> Theoretically, MM could mark it as not present (or as RO) in all PTEs, so
> that writing to it would cause a page fault, and, hence, MM would be able to
> put the writing process to sleep until the flush operation is complete.

Writing to a page which is in the process of being flushed sets the dirty bit
in the PTE back, which will later be transferred to the PFN entry, thus causing
a re-flush.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

For anybody having difficulty following this thread and wanting to know more
about what VDLs and high water marks are, here is a very comprehensible
introduction to the Cache Manager:

http://www.i.u-tokyo.ac.jp/edu/training/ss/lecture/new-documents/Lectures/15-CacheManager/CacheManager.pdf

//Daniel

> http://www.i.u-tokyo.ac.jp/edu/training/ss/lecture/new-documents/Lectures/15-CacheManager/CacheManager.pdf

Not so bad, especially if combined with reading Nagar’s book…

It gives a very clear explanation of why the lazy writer cannot extend VDL - it moves backwards, rather than forward…

Thanks, Daniel

Anton Bassov

Still, I don’t have the slightest clue what 8EB means or is about. Can anybody
shed some light on this?

//Daniel
