Open file by ID

Let me repeat this Q separately :-)
How justified would it be to keep a filepath->fileID map so that the most
often used files are opened via ID? Is there going to be any significant
performance improvement?

TIA,

Vladimir

Defragmentation API. For a file system you probably need not keep an
FN->ID map, since the ID can simply be a disk offset (not required, but
possible).
For a filter, you’d better query.

–
Kind regards, Dejan M. MVP for DDK
http://www.alfasp.com E-mail: xxxxx@alfasp.com
Alfa Transparent File Encryptor - Transparent file encryption services.
Alfa File Protector - File protection and hiding library for Win32
developers.
Alfa File Monitor - File monitoring library for Win32 developers.

Thanks, Dejan!

My question was whether it is feasible to keep a Path->ID map to improve
the performance of creates. I.e. when a create comes in, my filter would
try to map the path to an ID and, if a mapping exists, transform the
path-based create into an ID-based create. Assuming that we have a huge
number of files on disk, finding the actual file entry by path is quite
expensive. So I thought that, given the essence of a file ID as the exact
location of the file entry on disk (correct me if I’m wrong), I may get a
big performance benefit by [quickly] transforming path-based creates into
ID-based creates.
That’s a general thought, an idea. And before jumping into experiments
(i.e. wasting time :-) I wanted to kick the tires and see if anyone has
something to say :-)

Regards,

Vladimir

Questions? First check the IFS FAQ at
https://www.osronline.com/article.cfm?id=17

You are currently subscribed to ntfsd as:
xxxxx@borland.com
To unsubscribe send a blank email to xxxxx@lists.osr.com

C:\Long file name 1\Long file name 2\Long file name 3.…\ Long file name
100…
That’s 2^100 possible path combinations… Would you keep even 1% of them in
memory?
While 2^100 is “unlikely”, 2^10 is not. That’s 1024 paths for one file :-(

Regards, Dejan.
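
To put numbers on the count above: assuming (it is not spelled out in the post) that each path component can be named either by its long name or by a short 8.3 alias, a path of N components has 2^N spellings. A minimal Python sketch; the short names are made up for illustration:

```python
from itertools import product

def path_spellings(components):
    """Yield every spelling of a path whose components each have
    several interchangeable names (e.g. long name and 8.3 alias)."""
    for choice in product(*components):
        yield "C:\\" + "\\".join(choice)

# Hypothetical (long name, short alias) pairs, one per path component.
components = [
    ("Long file name 1", "LONGFI~1"),
    ("Long file name 2", "LONGFI~2"),
    ("Long file name 3", "LONGFI~3"),
]

spellings = list(path_spellings(components))
print(len(spellings))   # 2**3 spellings for 3 components
print(2 ** 10)          # ten components already give 1024
```

A cache keyed on the raw path string would need one entry per spelling unless paths are normalized to a single form first.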

Dejan: Think a bit further :-) Here is what is given:

  1. Tens of millions of files on disk.
  2. Just a fraction of them is actively used (the exact list will change
    as time goes on, but percentage-wise its size is going to stay pretty
    much the same).

Although just a small fraction of those files is used, the FS needs to go
through all these millions of records to locate the records for the used
files. To make things even worse, the machine that runs this mess often
exhausts virtual memory (let alone physical). So not much “metadata” stays
in cache for long.
So, given that, I can imagine that it could be very beneficial to keep a
path->ID map in some sort of MRU/MOU table (of limited size, of course).
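
A limited-size MRU table of the kind described above could be sketched as follows. This is plain user-mode Python, not filter-driver code; the class name `PathIdCache`, its capacity, and the sample paths and file IDs are assumptions made for illustration.

```python
from collections import OrderedDict

class PathIdCache:
    """Bounded path -> file ID map that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._map = OrderedDict()

    @staticmethod
    def _key(path):
        # Case-insensitive match, as for Win32 paths on NTFS (simplified).
        return path.lower()

    def get(self, path):
        key = self._key(path)
        file_id = self._map.get(key)
        if file_id is not None:
            self._map.move_to_end(key)   # mark as most recently used
        return file_id                   # None means: fall back to a path open

    def put(self, path, file_id):
        key = self._key(path)
        self._map[key] = file_id
        self._map.move_to_end(key)
        if len(self._map) > self.capacity:
            self._map.popitem(last=False)  # drop the least recently used entry

cache = PathIdCache(capacity=2)
cache.put(r"C:\data\a.dat", 0x1001)
cache.put(r"C:\data\b.dat", 0x1002)
cache.get(r"C:\DATA\A.DAT")          # touch a.dat
cache.put(r"C:\data\c.dat", 0x1003)  # evicts b.dat
```

On a miss the caller would do the ordinary path-based open and then `put` the resulting ID, so the table converges on whatever set of files is currently hot.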

I ran some perf tests and here are some stats (in case somebody is
interested).
On a system with a total of 32K files spread across 16 folders (i.e. 2K
files/folder), the average time to open one (random) file is 1.5
milliseconds. On the same system with 1M files spread across 16 folders
(i.e. 64K files/folder), the average time to open one file is 13
milliseconds, and that time grows exponentially along with the total
number of files.

Yeah! Now I see your point :-) But I still strongly believe that even if
I restrict the path (index) to LFN only I’m still gonna get my benefits.
At least dental :-)

Anyway, looks like I have to build some prototype to test that
concept…

Regards,

Vladimir

Try doing it the way I do filters: add (wildcard-based) entries that say
which files/folders should be monitored/MRUed.

----- Original Message -----
From: “Vladimir Chtchetkine”
To: “Windows File Systems Devs Interest List”
Sent: Friday, June 11, 2004 12:21 PM
Subject: [ntfsd] Open file by ID

Let me repeat this Q separately :slight_smile:
How justified would be to have a filepath->fileID map in order to have
most often used files being opened via ID? Is there going to be any
significant performance improvement?

If I understand your suggestion correctly, DEC’s 16-bit RSX systems did
something like this nearly 3 decades ago, called a ‘path cache’ IIRC. One
reason was because caching entire directories was often too burdensome for
the 16-bit environment with limited physical memory, so a smallish path
cache could eliminate the common path look-ups far more efficiently (and
with a far lower instruction path-length as well).

You should consider whether to depend solely upon the access control
information for the target file to limit access, or emulate the
per-directory controls in the path (which the requestor would encounter
during a normal path traversal) by maintaining each directory’s ACL in the
cached entry (and changing it if the actual directory’s ACL changes).
Similar considerations apply to removing path entries when a file (or
directory) is moved.

- bill
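
Both concerns above (per-directory access controls, and stale entries after a move) come down to dropping every cached path that runs through the affected directory. A minimal sketch, assuming a simple in-memory dictionary; `PathCache` and `invalidate_subtree` are illustrative names, not an existing API:

```python
class PathCache:
    """Path -> file ID map with subtree invalidation."""

    def __init__(self):
        self._map = {}

    def put(self, path, file_id):
        self._map[path.lower()] = file_id

    def get(self, path):
        return self._map.get(path.lower())

    def invalidate_subtree(self, directory):
        """Forget the directory itself and everything cached beneath it.

        Call this when the directory (or a file) is renamed or moved, or
        when its ACL changes, so later opens take the normal path walk.
        """
        root = directory.lower().rstrip("\\")
        prefix = root + "\\"
        stale = [p for p in self._map if p == root or p.startswith(prefix)]
        for p in stale:
            del self._map[p]
        return len(stale)

cache = PathCache()
cache.put(r"C:\proj\src\main.c", 101)
cache.put(r"C:\proj\src\util.c", 102)
cache.put(r"C:\proj\srcdoc\readme.txt", 103)
removed = cache.invalidate_subtree(r"C:\proj\src")   # e.g. src was renamed
```

The linear scan keeps the sketch short; with many entries a tree keyed by path component would make the invalidation cheaper. Note the trailing separator in the prefix, which keeps `srcdoc` from being treated as part of `src`.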

To augment Bill Todd's comments as well: in UNIX file systems this is
typically referred to as a "DNLC" cache (Directory Name Lookup Cache).
It is a tremendous win, assuming you can fit within the restrictions of
the DNLC cache itself. I've certainly seen file systems for which this
would yield incorrect results and it can also have unexpected
interaction issues with security.

Restricting it to your own (kernel mode) filter's use is likely to
provide a performance win, but also may expose the system to new
exploits and certainly could cause incorrect behavior when used with
arbitrary file systems. At present, this would only work properly with
NTFS and CDFS, both of which support open by file ID. The SFU Server
(a/k/a "NFS Server") uses open by file ID, as does SFM.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.


Bill, Tony:

Thanks for the reply. It looks like my “separate” question was too broad, which caused some confusion. When I referred to “file ID” I meant precisely the NTFS feature; I didn’t mean reinventing the wheel. I guess that security questions related to open by ID are fully addressed by NTFS itself, right?

Unfortunately (and to my surprise) I didn’t see much benefit from using a path->ID map and ID-open instead of path-open. I don’t know how to explain that yet, but in a 1M-file environment I’ve seen just a 2-3% improvement in performance. Strange…

Regards,

Vladimir

----- Original Message -----
From: “Vladimir Chtchetkine”
To: “Windows File Systems Devs Interest List”
Sent: Friday, June 11, 2004 10:56 PM
Subject: Re: [ntfsd] Open file by ID

> Bill, Tony:
>
> Thanks for the reply. It looks like my “separate” question was too broad
> which made some confusion. When I referred “file ID” I meant precisely
> NTFS’s feature, I didn’t mean reinventing the wheel. I guess that security
> questions related to open by ID are fully addressed by NTFS itself, right?

Well, only sort-of. If you effectively perform by-ID access when the
application has specified a directory-path access, you are clandestinely
substituting the semantics of by-ID access for path access. In particular,
unless you guard against it, you may allow access via a directory path which
is either no longer valid (the ‘move’ - actually, rename - issue I
mentioned) or no longer legal for the accessor (the ACL issues I mentioned):
while the file remains legally accessible by ID, it should not be legally
accessible via the path.

Now, since traversing a directory path takes non-zero time, this may only
drastically widen the window in which changes in the portion of the path
already traversed in a tree-walk don’t affect the balance of the look-up
(though my vague recollection is that NTFS may actually guard against
renaming any path-element used in the access path to a file while the file
is accessed, in which case again you would be changing real system
semantics, albeit subtly).

> Unfortunately (and to my surprise) I didn’t see much benefit from using
> path-ID map and using ID-open instead of path-open. I don’t know how to
> explain that yet but on 1M files environment I’ve seen just 2-3% improvement
> in performance.

That may just mean that you’re not cache-constrained in keeping directories
memory-resident once they’ve been accessed, and all you’re seeing is the
difference in instruction path-length (i.e., no, or minimal, net saving in
disk I/O). Cut down significantly on the physical memory available for
caching (e.g., by increasing other system activity that competes with it)
and the value of the path-cache should become greater.

As Tony mentioned, Unix has done this kind of thing for a long time too.
IIRC Linux, for example, caches individual ‘dentries’ for each
recently-accessed target within a given directory, and thus can walk the
directory path for a recently-accessed file by using the dentries in
succession to resolve each inode ID in the path - which also provides help
with the initial path to other unaccessed targets within the same sub-tree
that contains a recently-accessed target.

- bill
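
The per-component caching described above can be sketched like this. It is a toy model only loosely inspired by the Linux dentry cache; the `(parent ID, name) -> child ID` dictionary, the `slow_lookup` callback and the sample IDs are all assumptions for illustration.

```python
def make_resolver(slow_lookup, root_id=0):
    """Return (resolve, cache); resolve walks a path one component at a time."""
    cache = {}   # (parent_id, name) -> child_id

    def resolve(path):
        current = root_id
        for name in path.strip("/").split("/"):
            key = (current, name)
            if key not in cache:
                cache[key] = slow_lookup(current, name)   # the "disk" read
            current = cache[key]
        return current

    return resolve, cache

# Hypothetical directory tree: (parent_id, name) -> child_id.
TREE = {(0, "usr"): 1, (1, "lib"): 2, (2, "a.so"): 3, (2, "b.so"): 4}
calls = []

def slow_lookup(parent_id, name):
    calls.append((parent_id, name))
    return TREE[(parent_id, name)]

resolve, cache = make_resolver(slow_lookup)
resolve("/usr/lib/a.so")   # three slow lookups: usr, lib, a.so
resolve("/usr/lib/b.so")   # only b.so is looked up; usr and lib are cached
```

Unlike a whole-path cache, each directory is still visited in order, so a per-directory check can be applied at every step of the walk.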

> If I understand your suggestion correctly, DEC’s 16-bit RSX systems did
> something like this nearly 3 decades ago, called a ‘path cache’ IIRC.

So does UNIX, and so do all of NT’s filesystems internally.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> “DNLC” cache (Directory Name Lookup Cache). It is a tremendous win, assuming
> you can fit within the restrictions of the DNLC cache itself.

At least in FreeBSD the interface to filesystems is polymorphic, very similar
to the interface to graphics drivers in Windows.

The driver (or UNIX FSD) can define any semantics it wants for an operation,
be it pathname lookup or TextOut.

There are also standard semantics, and the driver can simply adopt them,
either by not hooking the operation or by declaring the standard routine as
part of its dispatch table.

The driver’s semantics can also either wrap the standard ones or fall through
to them in some cases (“punting”).

For instance, the “lookup” call (resolve pathname to vnode) is passed to the
driver. The standard semantics for “lookup” decompose it into 2 calls to the
driver: “resolve pathname to file ID”, then “load vnode by file ID”. These 2
calls also have their standard semantics, where caching is implemented.

So, the particular filesystem can completely override the name cache or vnode
cache, or both.
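
The layering described above can be mimicked in a few lines. This is an analogy in Python, not the FreeBSD VFS interface: the class and method names (`StandardFs`, `resolve_name`, `load_vnode`) are invented for illustration.

```python
class StandardFs:
    """Default semantics: lookup = resolve name to file ID, then load by ID,
    with a cache in front of each step."""

    def __init__(self, names, nodes):
        self._names, self._nodes = names, nodes   # stand-ins for on-disk data
        self.name_cache, self.vnode_cache = {}, {}

    def resolve_name(self, path):
        if path not in self.name_cache:
            self.name_cache[path] = self._names[path]
        return self.name_cache[path]

    def load_vnode(self, file_id):
        if file_id not in self.vnode_cache:
            self.vnode_cache[file_id] = self._nodes[file_id]
        return self.vnode_cache[file_id]

    def lookup(self, path):
        return self.load_vnode(self.resolve_name(path))

class NoNameCacheFs(StandardFs):
    """A filesystem that overrides only the name step (no name cache)."""

    def resolve_name(self, path):
        return self._names[path]

names = {"/etc/hosts": 7}
nodes = {7: "vnode#7"}
std, custom = StandardFs(names, nodes), NoNameCacheFs(names, nodes)
std_node, custom_node = std.lookup("/etc/hosts"), custom.lookup("/etc/hosts")
```

Both objects return the same vnode; only the overriding one bypasses the name cache while still using the inherited vnode cache.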

Bill:

Thanks! Fortunately, I’m not developing a general-purpose system. What
I’m trying to do is to address a particular problem in a particular
product. So, security-related issues are not that much of a concern here.
Moving files, of course, could be. But, I think, on a Windows installation
it’s only defrag and prefetch that move files (correct me if I’m wrong),
and those can be addressed administratively.

Anyway, it looks like I’m going to have a hard time justifying ID-open
vs. path-open. The results of the simple testing are not that compelling
at all. I agree with you that under heavy memory usage the results will
be much different and strongly in favor of ID-open, but the actual problem
was scalability. And that’s another thing I can’t explain…
In the 32K-file environment the average time for open by ID was around
1.5 msec/file. In the 1M-file environment it was around 12 msec/file.
Unless I was missing something really obvious, I was expecting those times
to be close because (as I understood it) open by ID should be independent
of the number of files to search through. Of course, mapping a name to an
ID in a 1M-entry map will be slower than in a 32K-entry map, but not that
significantly…
Any thoughts on that?

Regards,

Vladimir

> What I’m trying to do is to address a particular problem in a particular
> product. So, security-related issues are not that much of concern here.
> Moving file, of course, could be. But, I think, on Win installation it’s
> only defrag and prefetch that move files (correct me if I’m wrong).

The issue I noted wasn’t physical movement on the disk, but logical movement
within the directory structure caused by renaming the file or one of the
directories in the path to it without updating the path-cache entry for it.
I still don’t think you understand that issue, though it’s a subtle one and
may not be a problem for the particular situation you’re trying to address.

…

> Anyways, it looks like I’m going to have a hard time justifying ID-open
> vs. Path-open. Results of the simple testing are not that compelling at
> all. I’m agreed with you that under a heavy memory usage results will be
> much different and in big favor for ID-open but the actual problem was
> scalability. And that’s another thing I can’t explain…
> On 32K files environment average time for open by ID was around 1.5
> msec/file. On 1M files env. it was around 12 msec/file. Unless I was
> missing something really obvious, I was expecting those times to be
> close because (as I understood) open by ID should be independent from
> number of files to search through. Any thoughts on that? Of course,
> mapping name to ID on 1M map will be slower than on 32K map, but not
> that significantly…
> Any thoughts on that?

The only one that comes to mind involves relative caching of the MFT. With
the smaller number of files, the MFT may have been largely cache-resident
(in fact, it pretty well had to be, since the average time for the open was
only a small fraction of the time required for a single disk access) such
that after the first file in a region of the MFT was opened several others
leveraged the cached portion of the MFT to avoid having to perform any disk
access on the open. With the larger number of files, their MFT records may
have been sufficiently spread out that most opens required a single disk
access (to the usually uncached MFT record), which is about what you saw
(assuming a 7200 rpm ATA disk).

- bill
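
As a rough check of that estimate: a 7200 rpm disk takes 60/7200 s per revolution, so the average rotational latency is half of that, and a typical seek comes on top. The 8.5 ms average seek below is an assumed, ballpark figure for a desktop ATA drive of that period, not something measured in this thread.

```python
RPM = 7200
AVG_SEEK_MS = 8.5   # assumption: ballpark average seek for a desktop ATA disk

ms_per_revolution = 60_000 / RPM           # one full rotation, in milliseconds
avg_rotational_ms = ms_per_revolution / 2  # on average, wait half a rotation
avg_access_ms = AVG_SEEK_MS + avg_rotational_ms

print(f"rotation: {ms_per_revolution:.2f} ms")
print(f"average rotational latency: {avg_rotational_ms:.2f} ms")
print(f"average random access: {avg_access_ms:.1f} ms")
```

Under that assumption one uncached random access costs roughly 12 to 13 ms, the same order as the per-open times reported for the 1M-file case.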

Bill: You’re correct in your assumption that rename is not an issue in
my particular situation. I do understand that a rename would break the
path->ID map, so I would have to sync my map (which is a pretty heavy
job), but renames (of either files or paths) are not going to occur in
this particular system. That’s why I didn’t pay much attention to renames
themselves.

But what still puzzles me is that a) the average ID-open time is nowhere
near constant and b) it stays very close to the path-open time (within
the same environment). I understand that caching greatly influences both
pictures, but to my taste there is too much correlation between the
(theoretically independent) path-based times and ID-based times.
Something smells fishy :-)

Currently I’m testing entirely from UM and using UM file IDs (16
bytes). I’m also going to try KM IDs (8 bytes, I believe) and see how
this changes the picture.

Thanks for your willingness to help :-)

Best regards,

Vladimir
