UNIX-style filesystems and Cache Manager

OSR_Community_User · June 24, 2002, 1:52pm

UNIX-style filesystems like ext2 use indirection blocks for
file mappings.
Indirection blocks are metadata blocks, not a part of some file’s
stream, and can be scattered over the whole volume.
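
As a rough illustration of the mapping being described, the sketch below classifies a file's logical block number by the level of indirection needed to reach it. It assumes the classic ext2 inode layout (12 direct pointers, then one single-, one double-, and one triple-indirect pointer, with 4-byte block numbers); the function name is made up for the example.

```python
EXT2_NDIR_BLOCKS = 12   # direct pointers in the classic ext2 inode
POINTER_SIZE = 4        # bytes per on-disk block number


def indirection_level(logical_block, block_size=1024):
    """Return 0 for a direct block, 1-3 for single/double/triple indirect."""
    ptrs = block_size // POINTER_SIZE      # pointers held by one indirect block
    if logical_block < EXT2_NDIR_BLOCKS:
        return 0
    logical_block -= EXT2_NDIR_BLOCKS
    for level in (1, 2, 3):
        span = ptrs ** level               # data blocks reachable at this level
        if logical_block < span:
            return level
        logical_block -= span
    raise ValueError("block number beyond triple-indirect range")
```

Every block at level 1 or above costs at least one extra metadata read, and those metadata blocks can live anywhere on the volume.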

How can such an FS be implemented on NT with Cc? To cache these
blocks, one would need a virtual volume file the size of the whole
disk. Is that OK? Won’t it impose a large memory load for the cache’s
page tables?

What is the suggested way of doing this?

Max

OSR_Community_User · June 24, 2002, 9:26pm

On Mon, Jun 24, 2002 at 09:42:43PM +0400 Maxim S. Shatskih (xxxxx@storagecraft.com) wrote:
} Subject: [ntfsd] UNIX-style filesystems and Cache Manager

basically, Maxim points out that for ext2 and other UNIX file systems,
meta-data is scattered all over the disk (volume). if you want to use
the NT cache to cache this information in a straightforward fashion,
then you need to map the whole volume into the cache, which will
probably consume a lot of space for page tables (and will also consume a
lot of virtual memory, which is in short supply in the NT cache).
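
A back-of-the-envelope sketch of the page-table cost. It assumes 4 KB pages, 4-byte PTEs (32-bit x86 without PAE), and that every page of the volume gets a prototype PTE, with nothing allocated sparsely; the function name is illustrative only.

```python
def prototype_pte_bytes(volume_bytes, page_size=4096, pte_size=4):
    """Bytes of prototype PTEs needed if every page of the volume is described."""
    pages = -(-volume_bytes // page_size)   # ceiling division
    return pages * pte_size


GIB = 1 << 30
overhead = prototype_pte_bytes(100 * GIB)
print(overhead // (1 << 20), "MiB of PTEs for a 100 GiB volume")
```

Under those assumptions a 100 GiB volume needs 100 MiB of PTEs, before counting the virtual address space used by the mapped views themselves.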

this might also create aliasing problems, unless you are careful about
invalidating cached pages (as file system blocks transition between
use for meta-data and use for file data).

also your file system block size has to be >= memory manager cache
size, or you’ll have a different kind of aliasing problem (although
that can be worked around with some additional complexity).

and, there are likely issues with ordering of i/o operations, since some
unix file system types maintain a degree of consistency by updating
meta-data in a specific order (for example, an indirect block is
initialized before a pointer to the indirect block is put in the
inode).
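
A minimal sketch of that ordering rule. The "disk" is just a dict and each assignment stands in for one synchronous write; none of the names correspond to a real file system's code.

```python
def attach_indirect_block(disk, inode_no, indirect_blk, pointers):
    """Write the indirect block first, then publish it in the inode.

    `disk` is a plain dict standing in for stable storage; each assignment
    models one write that has already reached the disk.
    """
    writes = []
    disk[("block", indirect_blk)] = list(pointers)          # 1. initialise the block
    writes.append(("block", indirect_blk))
    disk[("inode", inode_no)] = {"indirect": indirect_blk}  # 2. then point at it
    writes.append(("inode", inode_no))
    return writes


def is_consistent(disk):
    """An inode may only reference an indirect block that is already on disk."""
    for key, value in disk.items():
        if key[0] == "inode" and ("block", value["indirect"]) not in disk:
            return False
    return True
```

A crash after either write leaves a state that passes `is_consistent`; reversing the two writes does not. A cache that decides for itself when to write which page cannot promise this without extra help from the file system.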

i think the best solution is: implement a buffer cache in NT and cache
your meta-data there. that’s what we did for the vxfs port to NT
(never released), and it was for the above reasons (except for the
virtual memory issue, which we didn’t think of).
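
For readers who have not seen one, here is a very small sketch of a block-granular buffer cache. It is an illustration only, not the vxfs design: the class name, the LRU policy, and the two I/O callbacks are all assumptions.

```python
from collections import OrderedDict


class BufferCache:
    """Tiny LRU cache of metadata blocks keyed by disk block number."""

    def __init__(self, read_block, write_block, capacity=64):
        self.read_block = read_block      # callable: block_no -> bytes
        self.write_block = write_block    # callable: (block_no, bytes) -> None
        self.capacity = capacity
        self.buffers = OrderedDict()      # block_no -> [bytearray, dirty flag]

    def get(self, block_no):
        if block_no in self.buffers:
            self.buffers.move_to_end(block_no)      # mark most recently used
        else:
            if len(self.buffers) >= self.capacity:
                self._evict()
            self.buffers[block_no] = [bytearray(self.read_block(block_no)), False]
        return self.buffers[block_no][0]

    def mark_dirty(self, block_no):
        self.buffers[block_no][1] = True

    def flush(self, block_no):
        data, dirty = self.buffers[block_no]
        if dirty:
            self.write_block(block_no, bytes(data))
            self.buffers[block_no][1] = False

    def _evict(self):
        victim = next(iter(self.buffers))           # least recently used
        self.flush(victim)
        del self.buffers[victim]
```

Because buffers are keyed by block number and flushed one at a time, the file system decides exactly which block goes to disk and when, which is what the ordering and aliasing concerns above call for.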

it might be possible to cache the meta-data by treating it as a series
of private files in the cache. for example, the first level indirect
blocks would be one data stream associated with a file, the second
level indirect blocks another, and so on. reads and writes of the top
level indirect block (third level) would occur to a block specified in
the inode (and the indirection handled in the page-in/page-out
routines).

reads and writes of the second level of indirect blocks would use block
addresses cached in the top level indirect block entry. so
page-in/page-out for the data stream mapping the second level indirect
blocks would need to access the cache entries for the third level
indirect block data stream, and so on.
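
The recursion described here can be sketched as follows. This is a toy model with dicts in place of the disk and the cache; `LevelStream` and its fields are hypothetical names, and nothing here is an NT Cache Manager API.

```python
class LevelStream:
    """One private 'stream' caching all indirect blocks of a given level."""

    def __init__(self, disk, parent=None, root_block=None):
        self.disk = disk              # dict: disk block number -> list of pointers
        self.parent = parent          # stream for the next level up, or None
        self.root_block = root_block  # only set on the top level (from the inode)
        self.cache = {}               # index within this level -> cached pointers
        self.page_ins = 0

    def read(self, index, fanout):
        """Return the pointer list of indirect block `index` at this level."""
        if index not in self.cache:
            self.cache[index] = self.disk[self._locate(index, fanout)]
            self.page_ins += 1
        return self.cache[index]

    def _locate(self, index, fanout):
        if self.parent is None:
            return self.root_block    # address comes straight from the inode
        # Recursion: the disk address lives in a block cached one level up.
        parent_ptrs = self.parent.read(index // fanout, fanout)
        return parent_ptrs[index % fanout]
```

A miss in the lowest stream can trigger a miss in each stream above it, so the page-in path of one stream re-enters the cache for another.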

i don’t know if the NT cache manager can support that kind of recursion
or not …

cheers,

craig.

P.S. on those flavors of UNIX with a page cache used for meta-data
(such as UFS on Solaris), the design of the cache is different from
NT so that the issues with virtual memory consumption and page
table space don’t exist. the cached pages are associated with a
virtual address only when required.


–
{apple,amdahl}!veritas!craig xxxxx@veritas.com
(415) 668-3564 (h) (650) 527-8520 (w)

OSR_Community_User · June 25, 2002, 3:28am

> basically, Maxim points out that for ext2 and other UNIX file systems,
> meta-data is scattered all over the disk (volume). if you want to use
> the NT cache to cache this information in a straightforward fashion,
> then you need to map the whole volume into the cache, which will
> probably consume a lot of space for page tables (and will also consume
> a lot of virtual memory, which is in short supply in the NT cache).

Exactly so. NT’s main problem is in Mm, not Cc: Mm uses prototype
PTE tables for memory-mapped sections.
UNIXen instead use a hashed list of physical pages hanging off the
vnode. The vnode offset and hash entry are kept inside “struct page”
(what NT calls the PFN entry). This seems to be much better than PPTE
tables, especially for huge streams.
With this approach, one can easily create a cache map over the whole
huge volume without wasting memory.
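
To make the contrast concrete, here is a minimal sketch of the hashed lookup. A Python dict stands in for the hash table, and the class and method names are invented for the example rather than taken from any UNIX kernel.

```python
class PageHash:
    """Resident pages found by hashing (vnode, page index); absent pages cost nothing."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.pages = {}                   # (vnode_id, page_index) -> bytearray

    def lookup(self, vnode_id, offset):
        return self.pages.get((vnode_id, offset // self.page_size))

    def insert(self, vnode_id, offset, data):
        self.pages[(vnode_id, offset // self.page_size)] = bytearray(data)
```

The bookkeeping grows with the number of resident pages, not with the size of the stream, so a vnode covering a whole volume with three cached pages costs three entries.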

> also your file system block size has to be >= memory manager cache
> size, or you’ll have a different kind of aliasing problem (although
> that can be worked around with some additional complexity).

Am I wrong that CcPinxxx will solve this automatically?

> i think the best solution is: implement a buffer cache in NT and
> cache your meta-data there.

In fact, this is what CcPinxxx is for. It is not CcPinxxx but the
underlying Mm structures that make this a problem.

Max

OSR_Community_User · June 25, 2002, 12:54pm

> The main NT’s problem is in Mm and not Cc. Using prototype
> PTE tables for memory-mapped sections. UNIXen instead use
> a hashed list of physical pages hanging off the vnode.
> Vnode offset and hash entry is inside “struct page” called
> PFN in NT. This seems to be much better than PPTE tables,
> especially for huge streams.

Mm is doing sparse allocation of subsections (PPTEs) in the current OS
and headlevel 2kSP.

> With this approach, one can easily create a cache map over
> the whole huge volume without wasting memory.

Modulo the interesting aliasing and coherency issues mentioned
previously.

>> also your file system block size has to be >= memory manager
>> cache size, or you’ll have a different kind of aliasing problem
>> (although that can be worked around with some additional
>> complexity).
>
> Am I wrong CcPinxxx will solve this automatically?

Yep, if it’s the one I’m thinking of: PAGE_SIZE. What happens if your
cluster size is less than PAGE_SIZE and two or more logically distinct
objects share the same page in the cache? It’s not so much a problem if
you’re read-only, but now make it read-write.
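
A small sketch of why sub-page clusters alias. It assumes a whole-volume stream cached in 4 KB pages and a cluster size that divides the page size evenly; the helper names are made up.

```python
PAGE_SIZE = 4096


def page_of(cluster_no, cluster_size):
    """Index of the cache page that holds a given cluster of a whole-volume stream."""
    return (cluster_no * cluster_size) // PAGE_SIZE


def clusters_written_by_flush(dirty_cluster, cluster_size):
    """Flushing at page granularity writes every cluster that shares the page."""
    per_page = PAGE_SIZE // cluster_size
    first = page_of(dirty_cluster, cluster_size) * per_page
    return list(range(first, first + per_page))
```

With 1 KB clusters, dirtying cluster 5 and flushing its page also rewrites clusters 4, 6, and 7, whatever they happen to hold. With 4 KB clusters the flush touches cluster 5 alone.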

(I personally like Tylenol …)

The discerning reader will note that Windows 2000 has an implementation
of the UDF filesystem, which has an inode-like structure. To handle this
I build a virtual stream on the fly to cache the metadata otherwise
scattered higgledy-piggledy about the volume, as I think Craig is
pointing to. When you think about it, it’s the coherency problems that’ll
continue to weigh against a whole-volume stream file.
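
One way to picture a virtual stream built on the fly is the toy mapping below. It is only a guess at the general shape, not the actual UDFS code; the class name, the 2048-byte block size, and the append-only slot allocation are all assumptions.

```python
class VirtualMetadataStream:
    """Packs scattered metadata blocks into one dense, cacheable stream."""

    def __init__(self, block_size=2048):
        self.block_size = block_size
        self.disk_to_slot = {}     # disk block number -> slot in the virtual stream
        self.slot_to_disk = []     # slot -> disk block number (for paging I/O)

    def offset_for(self, disk_block):
        """Stream offset for a metadata block, allocating a slot on first use."""
        slot = self.disk_to_slot.get(disk_block)
        if slot is None:
            slot = len(self.slot_to_disk)
            self.disk_to_slot[disk_block] = slot
            self.slot_to_disk.append(disk_block)
        return slot * self.block_size

    def disk_block_for(self, offset):
        """Reverse mapping used when the stream is paged in or written back."""
        return self.slot_to_disk[offset // self.block_size]
```

The stream stays as small as the set of metadata blocks actually touched, and since only metadata is ever mapped into it, file data never shares a page with it.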

NTFS has an MFT for a(nother) reason

This posting is provided “AS IS” with no warranties, and confers no
rights.

OSR_Community_User · June 25, 2002, 5:43pm

>> With this approach, one can easily create a cache map over
>> the whole huge volume without wasting memory.
>
> Modulo the interesting aliasing and coherency issues mentioned
> previously.

Why? Each vnode has its own set of pages. A vnode’s in-memory page is
not necessarily contiguous on disk. IIRC there was never a requirement
that the paging I/O path must not split a request into non-page-aligned
IO_RUNs.

The issues occur when a cluster is freed and then reallocated for
metadata, or the reverse (from metadata to data). In this case, writes
can cause a nightmare.
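
A toy model of that hazard, assuming two independent cached views of the same cluster (one as file data, one as metadata). The class and its methods are hypothetical and only show why the free path has to purge stale copies.

```python
class ClusterCaches:
    """Two views that can alias the same cluster: file data and volume metadata."""

    def __init__(self):
        self.data_view = {}       # cluster -> (bytes, dirty) cached as file data
        self.meta_view = {}       # cluster -> (bytes, dirty) cached as metadata

    def free_cluster(self, cluster):
        """Discard any cached copy so a stale dirty page cannot be written later."""
        self.data_view.pop(cluster, None)
        self.meta_view.pop(cluster, None)

    def pending_writes(self):
        """Clusters that a lazy writer would still write, per view."""
        def dirty(view):
            return sorted(c for c, (_, d) in view.items() if d)
        return dirty(self.data_view), dirty(self.meta_view)
```

If `free_cluster` is skipped, the old dirty copy in one view can be written after the new owner's contents and silently overwrite them.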

> Yep, if it’s the one I’m thinking of - PAGE_SIZE. What happens if
> your cluster size is less than PAGE_SIZE and two or more logically
> distinct objects share the same page in the cache? It’s not so much
> a problem if you’re readonly, now make it readwrite.

This is an issue with both NT’s way and UNIX’s way. In Linux, this is
solved by having a list of buffer descriptors hanging off the page
which is allocated for buffer cache, and IO works in terms of these
descriptors and not pages.
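
A simplified sketch of the descriptor-per-block idea. It is loosely modelled on the description above, not on real Linux structures; the class, its fields, and the `submit_io` callback are assumptions.

```python
class CachedPage:
    """A page split into blocks, each with its own descriptor (a dirty flag here)."""

    def __init__(self, block_numbers, block_size=1024):
        self.block_size = block_size
        self.blocks = list(block_numbers)          # disk block behind each slot
        self.data = bytearray(block_size * len(self.blocks))
        self.dirty = [False] * len(self.blocks)

    def write(self, index, payload):
        """Modify one block of the page (payload must fit in a block)."""
        start = index * self.block_size
        self.data[start:start + len(payload)] = payload
        self.dirty[index] = True

    def flush(self, submit_io):
        """Issue one I/O per dirty block; clean neighbours stay untouched."""
        for i, is_dirty in enumerate(self.dirty):
            if is_dirty:
                start = i * self.block_size
                submit_io(self.blocks[i], bytes(self.data[start:start + self.block_size]))
                self.dirty[i] = False
```

For example, a page built from blocks `[40, 907, 12, 13]` with only slot 1 modified issues a single write to block 907, so an inode block can be flushed without dragging along a directory block in the same page.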

Funny, but they had (or even still have) paging I/O implemented this
way: pseudo-buffer-descriptors are attached to a page and submitted to
the disk stack. Only coalescing of adjacent requests inside the disk
device queue saves them.

Max

Andrey_Shedel · June 26, 2002, 1:26am

Max,

> How can such FS be implemented on NT with Cc? To cache these
> blocks, one will need a virtual volume file of the size of the whole
> disk. Is it OK? Will it not impose large memory load for cache page
> tables?

This works only for an RO implementation. Once you start RW, you will
hit coherency problems.

> UNIX-style filesystems like ext2 use the indirection blocks for
> file mappings.
> Indirection blocks are metadata blocks, not a part of some file’s
> stream, and can be scattered over the whole volume.

For Ext2 (at least), a proven solution is to have one more internal data
stream in your FCB that maps the indirect blocks. This is pretty easy if
you started the FS implementation assuming that multiple data streams
per file are allowed.

Andrey.

OSR_Community_User · June 26, 2002, 1:31am

I suspect this would require an internal block cache mechanism to run in
tandem with the MM cache manager. So, one cache would be for blocks and
the other for stream data. Only add meta-data blocks to the block
cache…

Just my first thoughts.

Jamey


OSR_Community_User · July 5, 2002, 9:23pm

On Tue, Jun 25, 2002 at 09:51:17AM -0700 Daniel Lovinger (xxxxx@windows.microsoft.com) wrote:
} Subject: [ntfsd] Re: UNIX-style filesystems and Cache Manager

} > The main NT’s problem is in Mm and not Cc. Using prototype
} > PTE tables for memory-mapped sections. UNIXen instead use
} > a hashed list of physical pages hanging off the vnode.
} > Vnode offset and hash entry is inside “struct page” called
} > PFN in NT. This seems to be much better then PPTE tables,
} > especially for huge streams.
}
} Mm is doing sparse allocation of subsections (PPTEs) in the current OS
} and headlevel 2kSP.

well, that would help, but it doesn’t really solve the problem. the
meta-data is likely to be scattered more or less randomly around the
disk, so i think you’ll find that many of the sub-sections will need
to be filled in (to map a single page down at the bottom level).

} > With this approach, one can easily create a cache map over
} > the whole huge volume without wasting memory.
}
} Modulo the interesting aliasing and coherency issues mentioned
} previously.

yes. what happens when an inode block and a directory block fall into
the same page, and you want to flush the inode to disk, but don’t want
to flush the directory block?

also, the aliasing issues between meta-data and file data that map to
the same page can be interesting. as others have pointed out, there’s
no requirement that pages be written in their entirety, or even
contiguously on disk. but keeping track of the state would be …
interesting.

} >> also your file system block size has to be >= memory manager
} >> cache size, or you’ll have a different kind of aliasing problem
} >> (although that can be worked around with some additional
} >> complexity).
} >
} > Am I wrong CcPinxxx will solve this automatically?
}
} Yep, if it’s the one I’m thinking of - PAGE_SIZE. What happens if your
} cluster size is less than PAGE_SIZE and two or more logically distinct
} objects share the same page in the cache? Its not so much a problem if
} you’re readonly, now make it readwrite.
}
} (I personally like Tylenol …)

exactly.

} The discerning reader will note that Windows 2000 has an implementation
} of the UDF filesystem, which has an inode-like structure. To handle this
} I build a virtual stream on the fly to cache the metadata otherwise
} scattered higgledy-piggledy about the volume, as I think Craig is
} pointing to. When you think about it it’s the coherency problems that’ll
} continue to weigh against a whole-volume stream file.

that would be another approach – sort of the logical extension of
making the indirect blocks for a file a separate stream associated with
the file.

but i think that mapping back and forth into the virtual stream would
give me a headache.

} NTFS has an MFT for a(nother) reason

but it is nice to be able to dynamically allocate additional inodes.

cheers,

craig.
