Caveats of physical disk access

When interfacing with a physical drive directly for I/O (e.g., \\.\PhysicalDrive0), in lieu of going through the usual filesystem (from user mode), for general-purpose reading/writing, what potential pitfalls are there?

The following I’m aware of:

  • Must hold a handle marked with MARK_HANDLE_PROTECT_CLUSTERS for the duration of access so that address translations remain stable (i.e., VCN→LCN(→LBA)).
  • Must not be an encrypted volume.
  • Must be a “simple” volume, i.e., not spanned.
  • Files cannot be sparse.
  • Resident files for NTFS would require special handling; probably fallback to standard I/O for them.
  • Probably needs to be NTFS.
  • Shadow copies and other filter drivers that have been circumvented are a bit of a question mark.

I think all of your questions assume that you will sometimes access the volume with regular I/O handled via the file system, and sometimes via direct I/O that bypasses it. That’s not a good general plan.

You would NEVER do “general purpose reading/writing” with a PhysicalDriveX. Ever. That bypasses the file system. There are no directories and no files, because those are all conventions handled by the file system. You’re reading and writing raw sectors.


I think all of your questions assume that you will sometimes access the volume with regular I/O handled via the file system, and sometimes via direct I/O that bypasses it. That’s not a good general plan.

Indeed, that would not be a good idea (which is why I’m not doing that, with the potential exception as mentioned for MFT-resident files)

You would NEVER do “general purpose reading/writing” with a PhysicalDriveX. Ever. That bypasses the file system. There are no directories and no files, because those are all conventions handled by the file system. You’re reading and writing raw sectors.

You’ve misunderstood (or maybe I wasn’t clear enough) - the idea here is bypassing the majority of the storage stack just for read/write purposes, i.e., to read/write a file the steps become (sketched below):

  • CreateFile(..)
  • FSCTL_MARK_HANDLE IOCTL with MARK_HANDLE_PROTECT_CLUSTERS
  • FSCTL_GET_RETRIEVAL_POINTERS to acquire the VCN→LCN translation
  • Offset LCNs from previously acquired disk extents to get LBA translation
  • Issue I/O in LBA space to associated physical disk.
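
A rough sketch of those steps from user mode might look like the following (Win32/C++). The file path, drive letter, and buffer sizes are placeholders, error handling is omitted, it assumes a simple single-extent volume per the caveats above, and whether MARK_HANDLE_INFO needs a VolumeHandle supplied for this particular flag should be double-checked against the documentation:

    #include <windows.h>
    #include <winioctl.h>
    #include <cstdio>

    int main()
    {
        // 1. Open the file whose on-disk extents we want to resolve.
        HANDLE hFile = CreateFileW(L"C:\\data\\payload.bin", GENERIC_READ,
                                   FILE_SHARE_READ, nullptr, OPEN_EXISTING, 0, nullptr);

        // 2. Pin the file's clusters so the mapping can't move underneath us.
        //    (VolumeHandle left null here - verify against the docs whether it is
        //    required for MARK_HANDLE_PROTECT_CLUSTERS.)
        MARK_HANDLE_INFO mhi{};
        mhi.HandleInfo = MARK_HANDLE_PROTECT_CLUSTERS;
        DWORD bytes = 0;
        DeviceIoControl(hFile, FSCTL_MARK_HANDLE, &mhi, sizeof(mhi),
                        nullptr, 0, &bytes, nullptr);

        // 3. VCN -> LCN mapping for the file, starting at VCN 0.
        STARTING_VCN_INPUT_BUFFER startVcn{};
        BYTE outBuf[4096];
        auto* rp = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(outBuf);
        DeviceIoControl(hFile, FSCTL_GET_RETRIEVAL_POINTERS, &startVcn, sizeof(startVcn),
                        rp, sizeof(outBuf), &bytes, nullptr);

        // 4. LCN -> LBA: offset the LCNs by the volume's extent on the physical disk.
        HANDLE hVol = CreateFileW(L"\\\\.\\C:", 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                                  nullptr, OPEN_EXISTING, 0, nullptr);
        VOLUME_DISK_EXTENTS vde{};
        DeviceIoControl(hVol, IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS, nullptr, 0,
                        &vde, sizeof(vde), &bytes, nullptr);
        LONGLONG volumeOffsetOnDisk = vde.Extents[0].StartingOffset.QuadPart;

        NTFS_VOLUME_DATA_BUFFER nvd{};
        DeviceIoControl(hVol, FSCTL_GET_NTFS_VOLUME_DATA, nullptr, 0,
                        &nvd, sizeof(nvd), &bytes, nullptr);
        LONGLONG clusterSize = nvd.BytesPerCluster;

        // First extent of the file, expressed as a byte offset on \\.\PhysicalDriveN.
        LONGLONG firstLcn = rp->Extents[0].Lcn.QuadPart;
        LONGLONG diskByteOffset = volumeOffsetOnDisk + firstLcn * clusterSize;
        printf("first extent begins at disk byte offset %lld\n", diskByteOffset);

        // 5. Issue sector-aligned I/O at diskByteOffset against \\.\PhysicalDriveN
        //    (opened with FILE_FLAG_NO_BUFFERING) - omitted here.
        CloseHandle(hVol);
        CloseHandle(hFile);
    }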

As for the why: I’m already working in LBA space as part of a usermode I/O scheduler, in addition to the significant performance gain of circumventing most of the storage stack. Some optimizations also simply aren’t possible otherwise, such as the case wherein two files are adjacent in LBA space: going via the fs this would normally require submitting two distinct IRPs, while interfacing with the disk directly it can be sent down as a single, larger merged request (sketch below).
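
A trivial illustration of that merge case (hypothetical types, not from anything above):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Requests that are contiguous in LBA space are coalesced so that, e.g.,
    // two adjacent files go down as one large transfer instead of two.
    struct LbaRequest {
        uint64_t startLba;     // first sector of the request
        uint64_t sectorCount;  // length in sectors
    };

    std::vector<LbaRequest> CoalesceAdjacent(std::vector<LbaRequest> reqs)
    {
        std::sort(reqs.begin(), reqs.end(),
                  [](const LbaRequest& a, const LbaRequest& b) {
                      return a.startLba < b.startLba;
                  });
        std::vector<LbaRequest> merged;
        for (const auto& r : reqs) {
            if (!merged.empty() &&
                merged.back().startLba + merged.back().sectorCount == r.startLba) {
                merged.back().sectorCount += r.sectorCount;  // contiguous: extend in place
            } else {
                merged.push_back(r);
            }
        }
        return merged;
    }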

What is a ‘usermode I/O scheduler’? How does that imply that you are working with LBAs that the file system would normally control?

If you want to work with a volume that has no file system and never will, then go for it. If you want to create your own file system, then go for it. If you want to bypass the file system some of the time and you are working on a diagnostic tool, there are challenges, but it could make sense.

If you think that you are going to achieve significant performance gains via this method, you are probably mistaken. And at best you will make a fragile solution tied to a specific version of a specific file system, with a lot of limitations.


What is a ‘usermode I/O scheduler’? How does that imply that you are working with LBAs that the file system would normally control?

It’s exactly what the name implies: mediating competing streams of requests for device time, along with complicating factors such as achieving certain QoS-esque qualities (latency/deadlines, ordering constraints, etc.). In other words: little different from what’s done in the kernel.

Naively throwing requests at an HDD in FIFO order is not a very smart idea, for instance. Even something as trivial as sorting a set of files to access by starting LBA yields orders-of-magnitude better performance by comparison (note the requisite LBA knowledge, per your question).
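
Concretely (hypothetical helper, just to make the point): resolve each file’s first extent to an LBA as sketched earlier, then order the batch before issuing it, so an HDD services it in one sweep rather than seeking back and forth.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct PendingRead {
        std::wstring path;
        uint64_t     firstLba;  // resolved via the VCN->LCN->LBA translation above
    };

    void SortByStartingLba(std::vector<PendingRead>& batch)
    {
        std::sort(batch.begin(), batch.end(),
                  [](const PendingRead& a, const PendingRead& b) {
                      return a.firstLba < b.firstLba;
                  });
    }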

If you want to work with a volume that has no file system and never will, then go for it. If you want to create your own file system, then go for it. If you want to bypass the file system some of the time and you are working on a diagnostic tool, there are challenges, but it could make sense.

I’m not sure how else I can rephrase this rather straightforward concept: we’re accessing files on a file system, except talking directly to the underlying disk ourselves, taking over the job the fs/volume layer would be doing on our behalf.

If you think that you are going to achieve significant performance gains via this method, you are probably mistaken. And at best you will make a fragile solution tied to a specific version of a specific file system, with a lot of limitations.

This has been standard practice in systems with high-performance I/O demands for decades. You can take a look at, for example, practically every database engine (including some I’ve worked on). Granted, those are mostly Linux-oriented.

Have you actually tried this? They added a security measure in Vista to prevent you from writing directly to a disk with an active file system above it. You can read all you want, but if you want to write you have to lock or dismount the file system first.
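
Roughly, that lock looks like this (sketch, error handling omitted; FSCTL_LOCK_VOLUME fails if anything else still has handles open on the volume, and FSCTL_DISMOUNT_VOLUME is the heavier alternative):

    #include <windows.h>
    #include <winioctl.h>

    // Vista+ blocks raw writes to disk regions backing a mounted file system
    // unless the volume is locked or dismounted first.
    bool LockVolumeForRawWrite(HANDLE& hVol)
    {
        hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                           OPEN_EXISTING, 0, nullptr);
        if (hVol == INVALID_HANDLE_VALUE)
            return false;

        DWORD bytes = 0;
        // FSCTL_UNLOCK_VOLUME (or closing the handle) releases the lock later.
        return DeviceIoControl(hVol, FSCTL_LOCK_VOLUME, nullptr, 0,
                               nullptr, 0, &bytes, nullptr) != FALSE;
    }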

While it is certainly true that this sort of optimization has been attempted on various *nix systems for a long time, it is also certainly true that every storage vendor today says that attempting it will kill performance overall and should not be attempted

The supposed gains are irrelevant for NVMe devices, and if the disk is SAN attached can actively hurt performance

Then there is the question of causing file system corruption

And then the need to actually bypass the security checks designed to prevent you from doing this

If you are interested in database performance metrics, the long-standing standard has been TPC. All of the top benchmarks in all of the categories belong to MS SQL. And for sure it uses standard ReadFile and WriteFile calls. Actually ReadFileScatter and WriteFileGather, APIs created specifically for MSSQL that were part of the anti-trust stuff back in the day. They are exactly tailored to the design of MSSQL’s buffer pool, but easily integrate with the standard IRP mechanism.
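
For what it’s worth, a scatter read looks roughly like this (sketch: the handle must be opened with FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, each segment is one page-aligned system page, the path is a placeholder, and error/completion handling is abbreviated):

    #include <windows.h>

    int main()
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const DWORD page = si.dwPageSize;
        const DWORD pages = 4;

        HANDLE hFile = CreateFileW(L"C:\\data\\payload.bin", GENERIC_READ,
                                   FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                                   FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED,
                                   nullptr);

        // One FILE_SEGMENT_ELEMENT per destination page, plus a terminating
        // null element; the single sequential read is scattered across them,
        // much as a buffer pool would hand out non-contiguous pages.
        FILE_SEGMENT_ELEMENT segs[pages + 1] = {};
        for (DWORD i = 0; i < pages; ++i) {
            segs[i].Buffer = PtrToPtr64(VirtualAlloc(nullptr, page,
                                        MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));
        }

        OVERLAPPED ov = {};  // read from file offset 0
        ReadFileScatter(hFile, segs, pages * page, nullptr, &ov);
        // ... wait for completion (e.g. GetOverlappedResult with bWait = TRUE)
        //     before touching the buffers.

        CloseHandle(hFile);
    }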