Prefetching data into disk cache.

Is there a userland means to prefetch file data into the cache asynchronously? I’m aware I could accomplish this by various means (e.g., mapping the file into memory and locking or touching pages), but those would of course block, and throwing threads at such a problem is profoundly inelegant.

Ideally one would just submit a read I/O with the user destination buffer omitted.

One idea that just occurred to me would be to map the file into my address space and issue appropriately aligned reads with the view as the destination.
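
For what it’s worth, on Windows 8 / Server 2012 and later there is a documented call that comes close to “a read with the destination buffer omitted” for a mapped view: PrefetchVirtualMemory. A minimal sketch, assuming the view is already mapped (as becomes clear later in the thread, this is too new for the actual targets, so it’s noted only for completeness):

```cpp
// Minimal sketch: hint the memory manager to bring a range of an already-mapped
// view into memory without touching the pages. PrefetchVirtualMemory requires
// Windows 8 / Server 2012 or later.
#define _WIN32_WINNT 0x0602
#include <windows.h>

bool PrefetchView(void* viewBase, SIZE_T bytes)
{
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = viewBase;   // base of a view returned by MapViewOfFile
    range.NumberOfBytes  = bytes;      // length of the region to prefetch

    // The call returns without waiting; the paging I/O proceeds asynchronously.
    return PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0) != FALSE;
}
```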

That’s an odd thing to want to do. Most of the time when the file system cache is discussed here, it is because it has cached too much data, not the reverse.

Forcing a prefetch suggests that you might be trying to optimize an application that performs many small synchronous reads without materially changing its source. If that’s the case, and the reads go through a common function of some kind, I would be more inclined to explore reading the data into UM in large chunks and then feeding the small reads from that data. But that’s a total guess on my part.
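
A minimal sketch of that chunking idea, with hypothetical names and a single cached chunk, assuming the application’s small reads can be routed through one common function with plain synchronous Win32 reads underneath:

```cpp
// Sketch only (hypothetical helper, not the OP's code): route the application's
// many small reads through one function that refills a large user-mode chunk
// and serves the small requests from it. Assumes each request is smaller than
// the chunk and that the handle was opened for synchronous I/O.
#include <windows.h>
#include <cstring>
#include <vector>

class ChunkedReader {
    static constexpr SIZE_T kChunkSize = 4 * 1024 * 1024;  // 4 MiB per refill
    HANDLE file_;
    std::vector<unsigned char> chunk_;
    ULONGLONG chunkOffset_ = 0;
    DWORD chunkValid_ = 0;                                  // bytes currently cached

public:
    explicit ChunkedReader(HANDLE file) : file_(file), chunk_(kChunkSize) {}

    // Serve a small read, refilling the chunk only when the request falls outside it.
    bool Read(ULONGLONG offset, void* dst, DWORD len)
    {
        if (offset < chunkOffset_ || offset + len > chunkOffset_ + chunkValid_) {
            chunkOffset_ = offset;
            OVERLAPPED ov = {};                              // file offset for a synchronous handle
            ov.Offset     = (DWORD)(offset & 0xFFFFFFFF);
            ov.OffsetHigh = (DWORD)(offset >> 32);
            if (!ReadFile(file_, chunk_.data(), (DWORD)kChunkSize, &chunkValid_, &ov))
                return false;                                // blocks here, but only once per chunk
        }
        if (offset + len > chunkOffset_ + chunkValid_)
            return false;                                    // request extends past end of file
        std::memcpy(dst, chunk_.data() + (size_t)(offset - chunkOffset_), len);
        return true;
    }
};
```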

The cache manager will pre-fetch a large chunk of data when a read is initiated; you appear to be trying to amplify this optimization, which, as MBond2 noted, is the opposite of most buffer cache questions here. Any solution you come up with is going to block on a read that isn’t already cached, so I don’t see why ‘throwing threads at such a problem is profoundly inelegant’ when partitioning the work and executing it in multiple parallel tasks is both simple and elegant.

As a minor point, overlapped reads won’t block per se, nor do they require ‘throwing threads’ at the problem. That’s kind of the point of overlapped IO, but in any case, what’s the advantage of reading ahead into the system cache?

Sure, I was just simplifying the discussion.

Regardless of whether the operation is performed using overlapped or synchronous IO, read operations generally cannot be considered complete until the data is available. Sure, you can use overlapped IO, but you still have to wait for completion at some point. I suppose in this bizarre scenario the async reads could in fact be fire-and-forget operations, as the intent is not to access the data but simply to fill up the cache. Note that write operations don’t have this constraint unless you need confirmation that the data has reached the endpoint.
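
To make that concrete, here is a rough sketch of warming the cache with overlapped reads against a buffered handle; the data read is simply discarded. It is not literally fire-and-forget, since the buffers and OVERLAPPED structures must stay valid until each read completes, but nothing on the hot path ever waits for the data. Names and sizes are illustrative.

```cpp
// Sketch: prime the system cache by issuing a window of overlapped reads across
// a range of the file and discarding the data. Assumes the handle was opened
// with FILE_FLAG_OVERLAPPED and *without* FILE_FLAG_NO_BUFFERING (unbuffered
// reads would bypass the cache entirely).
#include <windows.h>
#include <vector>

void WarmCache(HANDLE file, ULONGLONG start, ULONGLONG length)
{
    const DWORD  kStride   = 256 * 1024;   // bytes per request
    const size_t kInFlight = 8;            // cap on outstanding requests

    std::vector<std::vector<unsigned char>> bufs(
        kInFlight, std::vector<unsigned char>(kStride));
    std::vector<OVERLAPPED> ovs(kInFlight);
    std::vector<HANDLE>     events(kInFlight);
    for (size_t i = 0; i < kInFlight; ++i)
        events[i] = CreateEventW(nullptr, TRUE, TRUE, nullptr);  // start signaled

    for (ULONGLONG off = start; off < start + length; off += kStride) {
        size_t slot = (size_t)((off / kStride) % kInFlight);
        WaitForSingleObject(events[slot], INFINITE);   // previous read in this slot done

        ZeroMemory(&ovs[slot], sizeof(OVERLAPPED));
        ovs[slot].Offset     = (DWORD)(off & 0xFFFFFFFF);
        ovs[slot].OffsetHigh = (DWORD)(off >> 32);
        ovs[slot].hEvent     = events[slot];

        if (!ReadFile(file, bufs[slot].data(), kStride, nullptr, &ovs[slot]) &&
            GetLastError() != ERROR_IO_PENDING) {
            SetEvent(events[slot]);                    // immediate failure; keep slot usable
        }
    }

    // The data itself is never looked at; just make sure nothing is still in
    // flight before the buffers and OVERLAPPED structures go out of scope.
    WaitForMultipleObjects((DWORD)kInFlight, events.data(), TRUE, INFINITE);
    for (HANDLE e : events) CloseHandle(e);
}
```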

Hence my speculation that the OP is trying to optimize an existing application that performs many small reads by ensuring that the data it will want soon is as close to it as possible, so that those reads complete as quickly as possible - a typical prefetch scenario. There seems to be no other possible use for this kind of functionality.

Any application that can be sped up using this method, can also be sped up more by redesigning its IO pattern.

Last time I looked, the filesystem stack does read-ahead prefetch so that people don’t have to do stupid shit like this.
Mark Roddy

@MBond2 said:
That’s an odd thing to want to do. Most of the time when the file system cache is discussed here, it is because it has cached too much data, not the reverse.

Forcing a prefetch suggests that you might be trying to optimize an application that performs many small synchronous reads without materially changing its source. If that’s the case, and the reads go through a common function of some kind, I would be more inclined to explore reading the data into UM in large chunks and then feeding the small reads from that data. But that’s a total guess on my part.

I’ll summarize the problem characteristics to give some context:

  • A large amount (several GB) of read-only data on disk (a single file that is essentially many files packed into one).
  • The on-disk layout matches the in-memory layout; there are no transformation steps.
  • A moderately sized, non-contiguous subset of the data (logical files within the pack) is accessed very frequently and is critical to performance.
  • The data is accessed by a potentially large number of processes.
  • There are numerous broad subsets of the data that change periodically, often overlapping one another and frequently shared between processes.
  • New data is added every so often (on the order of days), further permuting the working subsets - writes are exclusively an offline job and therefore not a concern here.
  • Specific access patterns are difficult to predict.
  • The problem can be viewed as essentially an extension of what the kernel already does: managing a working set of active pages.

The existing system, and the interface I need to implement, simply maps this data into memory for access. This causes significant performance issues in the form of stalls: the irregular access patterns turn this into effectively synchronous small reads (hard page faults) on the hot path.

Briefly, my solution is to continue to rely fundamentally on the system disk cache, given its appropriateness to the problem, but to add tracing of I/O requests (the user’s MapFile(…) analog) so that, at runtime, I can identify common access patterns and speculatively prefetch “enough” data, and, more importantly, physically reorganize the on-disk layout to improve locality, maybe even allocating it to the outer rings of the disk if I can (an offline task). Another aspect is to exploit some application-specific behavior that would permit accumulating I/O requests, enabling presorting and submission of larger real I/Os (if I can; hence this question) and reducing the number of context switches.
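
The accumulate-and-presort part could look something like the following sketch: gather the small requests, sort them by offset, merge ranges that are adjacent or nearly so, and issue one larger read per merged range. The types, thresholds, and the detail that callers are then fed their slices from the merged buffer are all assumptions for illustration, not the existing code.

```cpp
// Sketch: collect small read requests, sort them by file offset, merge ranges
// that are adjacent or nearly adjacent, and issue one larger read per merged
// range. Names and thresholds are illustrative.
#include <windows.h>
#include <algorithm>
#include <vector>

struct ReadRequest { ULONGLONG offset; DWORD length; };

void SubmitCoalesced(HANDLE file, std::vector<ReadRequest> pending)
{
    const ULONGLONG kMergeGap = 64 * 1024;   // merge ranges closer than this

    std::sort(pending.begin(), pending.end(),
              [](const ReadRequest& a, const ReadRequest& b) { return a.offset < b.offset; });

    size_t i = 0;
    while (i < pending.size()) {
        ULONGLONG begin = pending[i].offset;
        ULONGLONG end   = begin + pending[i].length;

        // Extend the range while the following requests are close enough to merge.
        while (++i < pending.size() && pending[i].offset <= end + kMergeGap)
            end = std::max(end, pending[i].offset + pending[i].length);

        // One larger read covering the merged range; the original small
        // requests would then be satisfied from slices of this buffer (omitted).
        std::vector<unsigned char> buf((size_t)(end - begin));
        OVERLAPPED ov = {};
        ov.Offset     = (DWORD)(begin & 0xFFFFFFFF);
        ov.OffsetHigh = (DWORD)(begin >> 32);
        DWORD got = 0;
        ReadFile(file, buf.data(), (DWORD)buf.size(), &got, &ov);
    }
}
```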

The other solution would be to just essentially reimplement the service the kernel provides: a shared [large-page-backed] memory cache that I have more control over.
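
For that route, the cache backing could be a named, page-file-backed section shared between the processes, optionally with large pages. A minimal sketch with hypothetical names; note that SEC_LARGE_PAGES requires the “Lock pages in memory” privilege (SeLockMemoryPrivilege) in the caller’s token and a size that is a multiple of GetLargePageMinimum(), per the Microsoft sample linked later in this thread.

```cpp
// Sketch: create (or open) a named shared-memory section to use as a
// user-managed cache shared between processes. SEC_LARGE_PAGES is optional and
// needs SeLockMemoryPrivilege enabled plus a size rounded to GetLargePageMinimum().
#include <windows.h>

void* CreateSharedCache(SIZE_T bytes, bool largePages)
{
    DWORD protect = PAGE_READWRITE | SEC_COMMIT;
    if (largePages) {
        SIZE_T large = GetLargePageMinimum();
        if (large == 0) return nullptr;                    // large pages unsupported
        bytes = (bytes + large - 1) & ~(large - 1);        // round size up
        protect |= SEC_LARGE_PAGES;
    }

    HANDLE section = CreateFileMappingW(
        INVALID_HANDLE_VALUE,               // page-file backed, not a data file
        nullptr,
        protect,
        (DWORD)((ULONGLONG)bytes >> 32),
        (DWORD)(bytes & 0xFFFFFFFF),
        L"Local\\PackFileCache");           // hypothetical name; shared per session
    if (!section) return nullptr;

    void* view = MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, bytes);
    // The section handle is deliberately kept open for the life of the process
    // in this sketch; a real implementation would track and close it.
    return view;
}
```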

physically reorganize the on-disk layout to improve locality

In 2022? Is this data not on an SSD?

essentially reimplement the service the kernel provides

Hmmmm… If physical memory availability is not an issue, why not write a filter and an app that pair up to read up the whole file and put it in your own cache? Maybe, ah, at the volume level?? The filter could transparently pass Requests that aren’t related to your file and serve up your file’s contents from its cached blocks. It can deal with write-down how and when it likes, however you need.

That’s what I would do.

Peter

There are some things to consider

In 2022 even old disk hardware completely virtualizes all of the CHS numbers. You have no way to understand or control where the fragments are stored, even on a simple rotating disk.

In 2022, large files are not measured in GB; they are measured in TB. You should expect to be able to map any file smaller than 10 GB completely into memory on most desktop computers. A simple UM helper that maps the whole file with large pages can do that - and Microsoft even has a sample of how to do it:

https://docs.microsoft.com/en-us/windows/win32/memory/creating-a-file-mapping-using-large-pages

There is a lot more that could be said on this subject.

@“Peter_Viscarola_(OSR)” said:

physically reorganize the on-disk layout to improve locality

In 2022? Is this data not on an SSD?

essentially reimplement the service the kernel provides

Hmmmm… If physical memory availability is not an issue, why not write a filter and an app that pair up to read up the whole file and put it in your own cache? Maybe, ah, at the volume level?? The filter could transparently pass Requests that aren’t related to your file and serve up your file’s contents from its cached blocks. It can deal with write-down how and when it likes, however you need.

That’s what I would do.

Peter

The target hardware is commonly 15-year-old regular consumer desktops (a non-negligible number running WinXP), owing to a large portion of the userbase being from developing regions (notably SEA and parts of South America). 2-4 GB of installed physical memory is typical; considerably less is available, of course.

This unfortunately precludes such an idea (and moving into KM generally is not an option).

@MBond2 said:
There are some things to consider

In 2022 even old disk hardware completely virtualizes all of the CHS numbers. You have no way to understand or control where the fragments are stored, even on a simple rotating disk.

Yes, I understand such assumptions can’t be made; when I’ve done this before, I relied on empirical analysis, simply probing runs of the disk address space and inferring the mapping from the resultant read data rate.
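
Purely as an illustration of what that probing amounted to (and with all the caveats above about virtualized geometry), something along these lines: unbuffered reads at a spread of offsets, each timed to get a data rate per region. Names and sizes are made up for the sketch.

```cpp
// Sketch: measure raw read throughput at a series of file offsets by issuing
// unbuffered reads and timing them. Purely illustrative; as noted, modern
// hardware virtualizes geometry, so this gives only an empirical hint.
#include <windows.h>
#include <cstdio>

void ProbeThroughput(const wchar_t* path, ULONGLONG fileSize)
{
    const DWORD kProbeSize = 8 * 1024 * 1024;      // 8 MiB per probe
    const int   kProbes    = 16;                   // spread across the file

    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                              nullptr);
    if (file == INVALID_HANDLE_VALUE) return;

    // FILE_FLAG_NO_BUFFERING requires sector-aligned buffers; VirtualAlloc
    // returns page-aligned memory, which satisfies that.
    void* buf = VirtualAlloc(nullptr, kProbeSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!buf) { CloseHandle(file); return; }

    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    for (int i = 0; i < kProbes; ++i) {
        ULONGLONG off = (fileSize / kProbes) * i;
        off &= ~(ULONGLONG)(64 * 1024 - 1);        // keep the offset aligned

        OVERLAPPED ov = {};
        ov.Offset     = (DWORD)(off & 0xFFFFFFFF);
        ov.OffsetHigh = (DWORD)(off >> 32);

        LARGE_INTEGER t0, t1;
        DWORD got = 0;
        QueryPerformanceCounter(&t0);
        ReadFile(file, buf, kProbeSize, &got, &ov);
        QueryPerformanceCounter(&t1);

        double secs = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
        if (secs > 0.0)
            wprintf(L"offset %llu: %.1f MB/s\n", off, (got / (1024.0 * 1024.0)) / secs);
    }

    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(file);
}
```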

In 2022, large files are not measured in GB; they are measured in TB. You should expect to be able to map any file smaller than 10 GB completely into memory on most desktop computers. A simple UM helper that maps the whole file with large pages can do that - and Microsoft even has a sample of how to do it:

https://docs.microsoft.com/en-us/windows/win32/memory/creating-a-file-mapping-using-large-pages

There is a lot more that could be said on this subject.

I addressed this above in the reply to Peter.

This unfortunately precludes such an idea

But, but, but, but, but, but, BUT… There’s no such thing as a “free lunch” – If you want the Windows cache manager to cache it, it’s gotta go somewhere. There’s either space to cache it (in which case you’re gonna do a WAY better job of custom-designing a cache for your purposes than any generic cache manager is gonna do) or there’s NOT space to cache it (in which case there’s little/no point in trying to fool with getting the cache manager to cache your stuff).

and moving into KM generally is not an option

Hmmmm… I dunno what to say about that. Except if that’s really the case, you’re probably in the wrong forum, you know?

The memory has to come from somewhere. Full stop. If you’re gonna cache, either you’re gonna cache in main memory or you’re gonna roll it out to the paging file… in which case it’s gonna be on that slow disk (but, at least, perhaps logically contiguously).

So… yeah…

Peter