
Short reads on FSDs and disks

Maxim_S._Shatskih Member Posts: 10,396
On Linux, a read() call on an on-disk file, or on a DASD disk, can sometimes return less data than requested.

Is this ever possible on Windows (on FSDs, EOF condition aside, and on DASD disks)?

I have never seen such a thing on Windows in all these years, from either user or kernel mode.

Has anybody seen this on FSD-driven disk files or on DASD disks?

Surely (Nt)ReadFile can return short data on pipe/FIFO-style things, TCP sockets, custom drivers of any kind... but what about FSDs and DASD disks?

Also, if this is not possible, is there any official documentation? The ReadFile MSDN page does not state this _explicitly_; instead it lists the cases in which a short read can occur.

--
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Comments

  • Peter_Wieland Member - All Emails Posts: 308
    It is possible but unlikely. The disk firmware could complete the read short without any error, but an error is much, much more likely.

  • Maxim_S._Shatskih Member Posts: 10,396
    So the Windows kernel software will not do this on its own, unless the disk firmware does?
  • Peter_Wieland Member - All Emails Posts: 308
    If you include all the layers I'm less positive.

    I suspect the FSD layer will. If you try to read 1MB out of a 1KB file, you'll only get back 1KB. I don't recall if you get an error in that case, but I don't think so. The RAW file system might similarly protect you for reads past the end of a partition, or the end of the disk.

    I don't know about the partition or disk layer - I think that it will fail a read past the end of a partition, but I don't recall for sure.

    Once it gets below those drivers it's up to the disk to report an error if the LBA being requested is bad.

    -p
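
The EOF case described above is easy to observe from user mode. Here is a minimal sketch in Python (chosen only for illustration; nobody in the thread posted code), assuming a regular file on an ordinary file system. The helper name `read_past_eof` is made up. A request for 1 MB against a 1 KB file should come back with 1 KB and no error, and only the following read returns zero bytes.

```python
import os
import tempfile

def read_past_eof(file_size=1024, request=1024 * 1024):
    """Write file_size bytes, then ask for far more than the file holds."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * file_size)
        os.lseek(fd, 0, os.SEEK_SET)
        first = os.read(fd, request)    # short read: only file_size bytes
        second = os.read(fd, request)   # at EOF: empty result, still no error
        return len(first), len(second)
    finally:
        os.close(fd)
        os.remove(path)

print(read_past_eof())
```

This only demonstrates the uncontroversial EOF behavior; it says nothing about short reads in the middle of a file.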

  • Alex_Grig Member Posts: 3,238
    I don't think disk+classpnp is able to handle partial completion (underrun) of DASD I/O.

    For large operations split into multiple XRBs, the code in the DDK doesn't check whether each of those partial transfers completed fully. If an XRB in the middle had an underrun, it will create a hole in the transferred data, and the total transfer length will then not cover part of the modified buffer.
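
To make that failure mode concrete, here is a small simulation. It is not classpnp code: `split_transfer`, `contiguous_bytes` and the fragment sizes are hypothetical, chosen only for illustration. It splits one large request into fragments and shows why summing the completed lengths is not enough when a middle fragment underruns.

```python
def split_transfer(offset, length, max_fragment):
    """Split one request into (offset, length) fragments of at most max_fragment bytes."""
    fragments = []
    while length > 0:
        size = min(length, max_fragment)
        fragments.append((offset, size))
        offset += size
        length -= size
    return fragments

def contiguous_bytes(fragments, completed):
    """Return how many bytes are valid from the start of the buffer.

    completed[i] is the byte count actually transferred for fragments[i].
    Counting stops at the first fragment that underran, because data after
    the gap no longer forms a contiguous prefix of the request.
    """
    valid = 0
    for (_, requested), done in zip(fragments, completed):
        valid += min(done, requested)
        if done < requested:
            break
    return valid

fragments = split_transfer(offset=0, length=3 * 65536, max_fragment=65536)
completed = [65536, 4096, 65536]                      # the middle fragment underran
naive_total = sum(completed)                          # 135168: hides the hole
safe_total = contiguous_bytes(fragments, completed)   # 69632: stops at the gap
print(naive_total, safe_total)
```

Reporting only the contiguous prefix is one possible policy; failing the whole request on any underrun would be another.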
  • Peter_Viscarola_(OSR) Administrator Posts: 7,345
    (thanks to Max for the most interesting question in weeks)

    You'd think all of us big experts would know the answer to this simple question right off the top of our heads, wouldn't you? LOL...

    Suppose you try to read beyond the disk's capacity? I mean, who checks that? Does the request get to the controller?

    I just don't remember. Back in the day, I *seem* to remember that disk or partition checked to see if you attempted to read past the end of the current partition. But that code has definitely changed since the last time I paid any attention to it.

    Peter
    OSR
    @OSRDrivers


  • Jan_Bottorff Member - All Emails Posts: 471
    I don’t know the exact answer, but I do know something “not quite right”
    happens if the partition table does not match the reality of the disk,
    for example when a partition is larger than the actual disk. The
    user-perceived behavior is that the system occasionally gets “stuck” for
    20-60 seconds. Disk I/O errors are also written to the system event log. I
    assume this is the file system trying to read/write blocks that don’t
    exist. I believe I saw this on Win 7 and Server 2012 R2.

    You would think some layer in the storage stack checks requests
    against the actual disk size, but my experience suggested this was
    not the case. When a disk comes online and the partition
    table is first read, it would be nice if the system logged an event
    saying “Umm, you know, your partition tables seem to not be correct, did
    you incorrectly clone a disk?”. Or perhaps file system mount time
    would be the appropriate time to validate things.

    I saw this happening at one company, on MANY systems, and believe they
    were installing OS images by making block-level images of a physical
    source disk and applying those images to a target disk that was smaller
    in capacity. I wrote a little PowerShell script that used WMI calls to
    query the physical disk size and compare it to the partition offsets and
    sizes. The machines that reliably got the strange stall and system
    event log messages were exactly the systems that had partitions larger
    than the actual disk.

    I personally view this as an OS bug (and a bug in whatever process was
    being used to image systems), although perhaps Microsoft views it as
    important to maintain backward bug compatibility, and there is some reason
    why partition tables that run off the end of a disk are appropriate (so
    it is a feature, not a bug). I initially thought the event log errors meant a
    failing disk, but after deeper investigation the reality was rather
    uglier.

    Jan
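
The check described above comes down to comparing each partition's end offset with the disk's capacity. Here is a rough sketch of that comparison, written in Python rather than PowerShell because the original script was not posted. The function name and the sample sizes are invented; real values would come from WMI or a similar query.

```python
def partitions_past_end(disk_size, partitions):
    """Return the partitions whose end lies beyond the physical disk.

    disk_size: capacity of the physical disk in bytes.
    partitions: iterable of (name, starting_offset, size) tuples, in bytes.
    """
    bad = []
    for name, offset, size in partitions:
        if offset + size > disk_size:
            bad.append((name, offset + size - disk_size))  # overhang in bytes
    return bad

# Hypothetical layout: an image of a 500 GB disk applied to a 480 GB disk.
GB = 1000 ** 3
MB = 1000 ** 2
layout = [
    ("System", 1 * MB, 350 * MB),
    ("Windows", 351 * MB, 499 * GB),
]
print(partitions_past_end(480 * GB, layout))
```

An empty result means every partition fits on the disk; any entry names a partition and the number of bytes by which it runs off the end.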



  • Maxim_S._Shatskih Member Posts: 10,396
    >I suspect the FSD layer will. If you try to read 1MB out of a 1KB file, you'll only get back 1KB.

    Surely so, but I was not speaking about EOFs. With EOFs, everything is obvious.

    > I don't recall if you get an error in that case, but I don't think so.

    You will not. Just a short read.

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com
  • Maxim_S._Shatskih Member Posts: 10,396
    > You'd think all of us big experts

    No, I think the MS people who know the _real_ truth are here and can respond.

    >would know the answer to this simple question

    Oh yes. At work, I have a case where my code, once ported to Linux, had a bug that was definitely due to a short read (my logic treated it as EOF or the like).

    And the Windows version of this code, which is about 10 years old now, has never had such a bug.

    Perhaps it can still have such a bug, and I (along with our QA/Support and the customers) was just lucky enough not to hit it?

    > Suppose you try to read beyond the disk's capacity?

    No, I don't mean any EOF conditions, which are more or less obvious.

    I mean that a read in the middle of a large file, far short of EOF, can - on Linux - suddenly come back short.

    Moreover, Linux web resources say that yes, Linux does short reads, and you must be prepared.

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com
  • Maxim_S._Shatskih Member Posts: 10,396
    > perceived behavior is the system gets “stuck” for 20-60 seconds
    > occasionally

    Yes, I also saw this.

    To me, such a condition is just plain a ruined disk, which must be fixed.

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com
  • Peter_Viscarola_(OSR) Administrator Posts: 7,345
    <quote>
    I mean - reading in the middle of a large file, far short of EOF, can - on Linux
    - suddenly do a short read.
    </quote>

    WHAT? That strikes me as strange.

    So, if you're sequentially reading through a file and you are returned less data than you asked for, but you don't get an "end of file" error, you just keep reading until you get zero bytes and an end of file error??

    That seems... ah... unusual. But if those are the rules, I guess it's fine...

    Peter
    OSR
    @OSRDrivers


  • Maxim_S._Shatskih Member Posts: 10,396
    > So, if you're sequentially reading through a file and you are returned less data than you asked for,
    >but you don't get an "end of file" error, you just keep reading until you get zero bytes and an end of
    >file error??

    Yes. On POSIX, yes.

    And there are web resources where developers are warned about this.

    My Linux bug was this: I have some "chunk headers" inside the file.

    If EOF hits in the middle of the Nth chunk header, truncating the header, then the file is corrupt.

    My code simply read ChunkHeaderSize bytes and failed on any short read, reporting the file as corrupt.

    This _seems_ (I'm now not sure even about this!) to be correct on Windows.

    But on Linux, the OS can return a short read on my chunk header read, and then a valid file is considered broken. What is more, this occurs only _sometimes_ :-)

    All of this is related to Linux signals in some way. A signal can cause a short read.

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com
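
The usual defense against the bug described above is to loop until the requested byte count has arrived and to treat only a zero-byte read as EOF. Below is a minimal sketch of that pattern in Python. The names `read_exact`, `read_chunk_header` and `ShortReader`, and the 8-byte header, are assumptions for illustration and not the code discussed in the post.

```python
import io

def read_exact(read, size):
    """Call read(n) until size bytes arrive or read() reports EOF with b"".

    The result is shorter than size only if EOF was actually reached.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        data = read(remaining)
        if not data:            # zero bytes: genuine end of file
            break
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)

def read_chunk_header(read, header_size):
    """Return the header, None at a clean EOF, or raise on a truncated header."""
    header = read_exact(read, header_size)
    if not header:
        return None
    if len(header) < header_size:
        raise ValueError("truncated chunk header: file is corrupt")
    return header

class ShortReader:
    """Test double that never returns more than 3 bytes per call."""
    def __init__(self, payload):
        self._stream = io.BytesIO(payload)

    def read(self, n):
        return self._stream.read(min(n, 3))

reader = ShortReader(b"HDR-0001" + b"HDR-")
print(read_chunk_header(reader.read, 8))   # complete header despite short reads
```

With a real file descriptor the same helper can be driven by `lambda n: os.read(fd, n)`. A short read in mid-file then costs an extra call, and only a header cut off by a real EOF is reported as corruption.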
  • Peter_Wieland Member - All Emails Posts: 308
    IIRC I tried introducing that check at one point and it broke some set of systems which were already in market. I can't remember the details though.

    -p


  • Pavel_A Member Posts: 2,681
    On 13-Jan-2015 21:19, Peter Wieland wrote:
    > IIRC I tried introducing that check at one point and it broke some set of systems which were already in market. I can't remember the details though.

    Was this related to "magic sectors" beyond the end of the physical disk?
    IIRC disk drivers deliberately allow addressing past the disk size
    because some devices used to have proprietary commands via such "magic
    sectors".

    -- pa
  • Jamey_Kirby Member - All Emails Posts: 433
    Linux block device layers are sophisticated. Windows not so much. Windows
    does no real optimization in the drivers. It assumes the device will handle any
    optimizations (like CSCAN). Linux block layers implement merging, sorting,
    and coalescing of IO in the device queue. These sorts of optimizations can
    result in short reads.

    It is also a good idea to be able to handle short reads in Windows code
    even if you have never seen one. It makes your code more portable and the
    APIs do account for short reads.
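
For readers unfamiliar with the terms, here is a simplified illustration of what sorting and merging requests in a queue means. It is not taken from the Linux block layer; `merge_requests` and the sample queue are invented for the sketch, and it does not model how or whether such merging leads to short reads.

```python
def merge_requests(requests):
    """Sort (start_sector, sector_count) requests and merge adjacent or overlapping ones."""
    merged = []
    for start, count in sorted(requests):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # The request touches or overlaps the previous one: extend it.
            last_start, last_count = merged[-1]
            end = max(last_start + last_count, start + count)
            merged[-1] = (last_start, end - last_start)
        else:
            merged.append((start, count))
    return merged

queue = [(100, 8), (0, 8), (108, 8), (8, 8), (300, 16)]
print(merge_requests(queue))
```

Five queued requests collapse into three larger ones, issued in ascending sector order.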

    --
    Jamey Kirby
    Disrupting the establishment since 1964

    *This is a personal email account and as such, emails are not subject to
    archiving. Nothing else really matters.*
  • Jan_Bottorff Member - All Emails Posts: 471
    I do understand that backward compatibility is really important, but so is
    the CORRECT operation of systems. Would writing a warning into the system
    event log if the partition table was inconsistent with the physical disk
    be viewed as breaking backward compatibility?

    Jan



  • Peter_Viscarola_(OSR) Administrator Posts: 7,345
    [Oh, I feel it coming]

    <quote>
    Linux block device layers are sophisticated. Windows not so much. Windows
    does no real optimization in the drivers. It assumes device will handle any
    optimizations (like CSCAN). Linux block layers implement merging, sorting,
    and coalescing of IO in the device queue. These sorts of optimizations can
    result in short reads.
    </quote>

    That should have read "sophisticated"... with the quotes. I would have said "outdated and meddlesome" myself.

    And I'm not trying to be cute, or take a random dump on Linux here.

    But we've worked very closely with a *lot* of storage vendors and they all, unanimously, want the OS to do as little pre-write "optimization" as possible.

    The type of I/O scheduling that I understand Linux does is based upon some really ancient assumptions. *I* did that sort of coalescing, next sector first, elevator service, nearest sector first with a fairness count... heck, back in the days of the PDP-11. I believe that, and ST-506 disks on PCs, was the last time this type of optimization made real sense.

    These days, there's darn little that you can count on in terms of disk layout. It's better to just jam as many requests down to the disk's control logic as possible (hundreds of simultaneous operations is great) and let the disk figure out what's best for it based on what it knows about the media.

    If you haven't read it (it's several years old) and didn't see it in our pre-Christmas Tweet, anyone interested in this topic should check out the paper entitled "Disks Are Like Snowflakes: No Two Are Alike" (http://www.pdl.cmu.edu/PDL-FTP/Storage/CMU-PDL-11-102.pdf).

    I'd be curious if anybody knows why this type of optimization remains in Linux. I know they're not reluctant to change stuff that's outdated, and that probably means they think this type of optimization is "worth it"... But I'd like to hear what the current argument is.

    Peter
    OSR
    @OSRDrivers


  • Jamey_Kirby Member - All Emails Posts: 433
    Yes Peter, sophisticated was probably the wrong word to use; more like
    complex and burdensome. I was being nice :)

  • Peter_Wieland Member - All Emails Posts: 308
    Sure. But my part of this was a way-back-when discussion, so unless you have a wayback machine I can use to send 1999 me a note*, it's all academic. I'm sorry that I didn't figure out a way to give you a warning.

    I would be interested in knowing what error your disks are giving you back when you read past the end of them. It should be something that doesn't get retried or require any sort of on-disk delay, but they could be doing something really silly and reporting a recoverable error. That would trigger retries.

    -p

    (* to be clear, 1999 is a guess. Don't target your wayback machine there without chatting with me first. Also we should include some lottery numbers in the note. 2015 me just bought a new house)

    -----Original Message-----
    From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
    Sent: Tuesday, January 13, 2015 2:13 PM
    To: Windows System Software Devs Interest List
    Subject: Re: [ntdev] Short reads on FSDs and disks

    I do understand that backward compatibility is really important, but so is the CORRECT operation of systems. Would writing a warning into the system event log if the partition table was inconsistent with the physical disk be viewed as breaking backward compatibility?

    Jan
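
The consistency check Jan asks about (partition table versus physical disk size) boils down to simple range arithmetic. Below is a minimal sketch of that comparison, not the script from the thread and not anything the Windows storage stack is known to do. The `Partition` type, the `partitions_past_end` helper, and all the numbers are hypothetical; on a real system the disk size and the partition offsets and sizes would have to come from some query such as the WMI calls mentioned in the quoted message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    """Hypothetical record of one partition-table entry (all values in bytes)."""
    index: int
    offset: int  # starting byte offset of the partition on the disk
    size: int    # length of the partition in bytes


def partitions_past_end(disk_size: int, partitions: List[Partition]) -> List[Partition]:
    """Return the partitions whose end lies beyond the physical disk size."""
    return [p for p in partitions if p.offset + p.size > disk_size]


if __name__ == "__main__":
    # Made-up example: a layout cloned from a larger disk onto a 320 GB target.
    disk_size = 320 * 10**9
    layout = [
        Partition(index=1, offset=1_048_576, size=100 * 10**6),
        Partition(index=2, offset=105_906_176, size=499 * 10**9),
    ]
    for p in partitions_past_end(disk_size, layout):
        overrun = p.offset + p.size - disk_size
        print(f"partition {p.index} extends {overrun} bytes past the end of the disk")
```

A partition that ends exactly at the last byte of the disk is treated as valid here; only an end offset strictly greater than the disk size is reported.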



    On 1/13/15, 7:19 PM, "Peter Wieland" <xxxxx@microsoft.com> wrote:

    >IIRC I tried introducing that check at one point and it broke some set
    >of systems which were already in market. I can't remember the details
    >though.
    >
    >-p
    >
    >
    >-----Original Message-----
    >From: xxxxx@lists.osr.com
    >[mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
    >Sent: Monday, January 12, 2015 8:33 PM
    >To: Windows System Software Devs Interest List
    >Subject: Re: [ntdev] Short reads on FSDs and disks
    >
    >I don’t know the exact answer, but I do know something “not quite right”
    >happens if the partition table does not match the reality of the disk,
    >for example when a partition is larger than the actual disk. The
    >user-perceived behavior is that the system gets “stuck” for 20-60 seconds
    >occasionally. Disk I/O errors are also written to the system event log.
    >I assume this is the file system trying to read/write blocks that don’t
    >exist. I believe I saw this on Win 7 and Server 2012 R2.
    >
    >You would think some layer in the storage stack checks the requests
    >against the actual disk sizes, but my experience seemed to say this was
    >not the case. Perhaps when a disk comes online and the partition
    >table is first read, it would be nice if it logged some system event
    >saying “Umm, you know, your partition tables seem to not be correct,
    >did you incorrectly clone a disk?”. Or perhaps when a file system is
    >mounted would be the appropriate time to validate things.
    >
    >I saw this happening at one company, on MANY systems, and I believe they
    >were installing OS images by making block-level images of a physical
    >source disk, and applying those images to a target disk that was
    >smaller in capacity. I wrote a little PowerShell script that queried
    >the physical disk size, and then compared it to the partition offsets
    >and sizes, with WMI calls. The machines that reliably got the strange
    >stall and system event log messages were exactly the systems that had
    >partitions larger than the actual disk.
    >
    >I personally view this as an OS bug (and a bug in whatever process was
    >being used to image systems), although perhaps Microsoft considers it
    >important to maintain backward bug compatibility, and there is some
    >reason having partition tables that run off the end of a disk is
    >appropriate (so it is a feature, not a bug). I initially thought the event
    >log errors meant a failing disk, but then after deeper investigation,
    >the reality was rather uglier.
    >
    >Jan
    >
    >
    >
    >On 1/13/15, 2:11 AM, "xxxxx@osr.com" <xxxxx@osr.com> wrote:
    >
    >>(thanks to Max for the most interesting question in weeks)
    >>
    >>You'd think all of us big experts would know the answer to this simple
    >>question right off the top of our heads, wouldn't you? LOL...
    >>
    >>Suppose you try to read beyond the disk's capacity? I mean, who checks
    >>that? Does the request get to the controller?
    >>
    >>I just don't remember. Back in the day, I *seem* to remember that disk
    >>or partition checked to see if you attempted to read past the end of the
    >>current partition. But that code has definitely changed since the last
    >>time I paid any attention to it.
    >>
    >>Peter
    >>OSR
    >>@OSRDrivers
    >
