Memory Corruption Mystery: Any Ideas?

Scott_Noone_(OSR) Administrator Posts: 3,151
A potentially interesting puzzler for a Tuesday...

We're looking at a series of crash dumps from a client and I'm hoping that
this corruption looks familiar to someone. This is an end user installation
and none of our software is running on these machines; it's strictly a crash
analysis exercise.

The crashes are spread out over 1,000+ "identical" machines. No one machine
crashes with great frequency, but spread out across the install there are a
few a day. Systems survive anywhere from a few hours to six days before
crashing.

Just looking at the crash codes isn't helpful: they have about every crash
code you could ever imagine (QUOTA_UNDERRUN??), and the blame falls on various
modules. However, digging deeper, a very consistent pattern emerges.
Specifically, we are consistently seeing one of two values "randomly" appear
in memory:

• 0x00000010`00000004
• 0x00000010`00400004

Interestingly, when the corruption is discovered the value very, very
frequently appears at physical memory page offset 0xFD8 (most common) or
0xD70 (less common).

For example, in one crash the problem was that the MRXSMB20 image was
corrupted in memory:

3: kd> !chkimg -d mrxsmb20
fffff800826a6fd8-fffff800826a6fdd 6 bytes -
mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
[ 89 7d 18 49 89 45:04 00 00 00 10 00 ]
fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
[ e8:00 ]
7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)

Dumping the start of the corrupted range, we see our offset and value:

3: kd> dq fffff800826a6fd8
fffff800`826a6fd8 00000010`00000004 4c2b894c`0000e99c
fffff800`826a6fe8 ade901b6`41986d8b 850f02f8`83fffffd
fffff800`826a6ff8 8bc03345`fffffbb5 445e15ff`ce8b49d7
fffff800`826a7008 fb9f850f`c0840002 03fffffe`fee9ffff

In another crash a pool header is corrupted:

2: kd> !pool ffffc00089adcd70
Pool page ffffc00089adcd70 region is Paged pool
...
ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc

ffffc00089adcd70 doesn't look like a valid small pool allocation, checking
to see
if the entire page is actually part of a large page allocation...

2: kd> dq ffffc00089adcd70
ffffc000`89adcd70 00000010`00000004 8d5eb149`4b83d33a
ffffc000`89adcd80 00000000`00000000 ffffe000`bdecfec0
ffffc000`89adcd90 ffffe000`bc728860 ffffc000`89adcd98

Because the corruption appears at random in different virtual address ranges
(paged pool, non-paged pool, code, proto PTEs, working set lists, etc.), we
believe it must be generated by a device in the system (or by the platform).

We have tried various things to narrow this further and analyzed the
corruption across hundreds of dump files. The systems are generally idle
when the corruption is discovered, which makes it hard to go back in time
and figure out who might be using the value.

So, my question to you all is: does this LOOK like anything to you? Do those
corruption values hold any meaning to you? What about the offsets of 0xFD8
or 0xD70? I realize it's not much to go on, but stranger things have
happened...

Thanks!

-scott
OSR
@OSRDrivers



Comments

  • Pavel_Lebedinsky Member - All Emails Posts: 435
    Such corruptions are often caused by firmware issues, like failing to exclude memory used by devices from the OS memory map.

    Some things you can try:

    1. Check if there is a specific physical page that gets corrupted more often than other pages.
    2. If corruption appears to hit completely random physical pages, do !search 0000001000400004 and see if there are any other instances of that value, especially at page offsets FD8/D70. If yes, check these physical pages in other dumps to see if there is a pattern.
    3. If you can identify a suspect physical page, try excluding it using bcdedit /set badmemorylist.
    4. Optionally, do !pfn and !searchpte on the suspect PFN to see what the page is being used for and where it might be mapped. Check for problems like a valid PTE pointing to a free/standby page.
    5. Try booting with a checked kernel and see if it hits any asserts in the memory manager.
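
    To make steps 2 through 4 concrete, the commands would look roughly like
    the following (the PFN and the bcdedit values here are purely illustrative,
    not taken from these dumps):

    2: kd> !search 0000001000400004
    2: kd> !pfn 11b5a3
    2: kd> !searchpte 11b5a3

    C:\> bcdedit /set {badmemory} badmemorylist 0x11b5a3 0x11b6aa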

  • anton_bassov Member Posts: 5,038
    Can you provide a bit more info about these "identical machines"? The chipset version, as well as the firmware, seem to be of particular interest......

    Anton Bassov
  • Maxim_S._Shatskih Member Posts: 10,396
    0xFD8 looks like the end of the 16MB region for legacy ISA DMA

    Can this be the cause?

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com

    "Scott Noone" <xxxxx@osr.com> wrote in message news:xxxxx@ntdev...
    >A potentially interesting puzzler for a Tuesday...
    >
    > We're looking at a series of crash dumps from a client and am hoping that
    > this corruption looks familiar to someone. This is an end user installation
    > and none of our software is running on these machines, it's strictly a crash
    > analysis exercise.
    >
    > The crashes are spread out over 1,000+ "identical" machines. No one machine
    > crashes with great frequency, but spread out across the install there's a
    > few a day. Systems survive anywhere from a few hours to six days before
    > crashing.
    >
    > Just looking at the crash codes isn't helpful, they have about every crash
    > code you could ever imagine (QUOTA_UNDERRUN??) and it's blamed on various
    > modules. However, digging deeper a very consistent pattern emerges.
    > Specifically, we are consistently seeing one of two values "randomly" appear
    > in memory:
    >
    > • 0x00000010`00000004
    > • 0x00000010`00400004
    >
    > Interestingly, when the corruption is discovered the value very, very
    > frequently appears at physical memory page offset 0xFD8 (most common) or
    > 0xD70 (less common).
    >
    > For example, in one crash the problem was that the MRXSMB20 image file is
    > corrupted:
    >
    > 3: kd> !chkimg -d mrxsmb20
    > fffff800826a6fd8-fffff800826a6fdd 6 bytes -
    > mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
    > [ 89 7d 18 49 89 45:04 00 00 00 10 00 ]
    > fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
    > [ e8:00 ]
    > 7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)
    >
    > Dumping the start of the corrupted range, we see our offset and value:
    >
    > 3: kd> dq fffff800826a6fd8
    > fffff800`826a6fd8 00000010`00000004 4c2b894c`0000e99c
    > fffff800`826a6fe8 ade901b6`41986d8b 850f02f8`83fffffd
    > fffff800`826a6ff8 8bc03345`fffffbb5 445e15ff`ce8b49d7
    > fffff800`826a7008 fb9f850f`c0840002 03fffffe`fee9ffff
    >
    > In another crash a pool header is corrupted:
    >
    > 2: kd> !pool ffffc00089adcd70
    > Pool page ffffc00089adcd70 region is Paged pool
    > ...
    > ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc
    >
    > ffffc00089adcd70 doesn't look like a valid small pool allocation, checking
    > to see
    > if the entire page is actually part of a large page allocation...
    >
    > 2: kd> dq ffffc00089adcd70
    > ffffc000`89adcd70 00000010`00000004 8d5eb149`4b83d33a
    > ffffc000`89adcd80 00000000`00000000 ffffe000`bdecfec0
    > ffffc000`89adcd90 ffffe000`bc728860 ffffc000`89adcd98
    >
    > Due to the fact that the crash appears at random in different virtual
    > address ranges (paged pool, non-paged pool, code, proto PTEs, working set
    > lists, etc.), we believe that the corruption must be generated by a device
    > in the system (or by the platform).
    >
    > We have tried various things to narrow this further and analyzed the
    > corruption across hundreds of dump files. The systems are generally idle
    > when the corruption is discovered, which makes it hard to go back in time
    > and figure out who might be using the value.
    >
    > So, my question to you all is: does this LOOK like anything to you? Do those
    > corruption values hold any meaning to you? What about the offsets of 0xFD8
    > or 0xD70? I realize it's not much to go on, but stranger things have
    > happened...
    >
    > Thanks!
    >
    > -scott
    > OSR
    > @OSRDrivers
    >
    >
    >
  • Scott_Noone_(OSR) Administrator Posts: 3,151
    Thanks for the replies!

    <QUOTE>
    Such corruptions are often caused by firmware issues, like failing to
    exclude memory used by devices from the OS memory map.
    </QUOTE>

    That's an interesting idea, I hadn't thought of that one.

    <QUOTE>
    1. Check if there is a specific physical page that gets corrupted more often
    than other pages.
    </QUOTE>

    It's definitely not always the same page, though the above makes me wonder
    if it's in a specific range of pages. Unfortunately extracting the bad PFN
    is more manual than I would prefer so automating this might be difficult. I
    have a good sample on which this might work though...
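
    (For anyone following along, the per-dump lookup is roughly: run !pte
    against the corrupted virtual address and read the backing physical page
    out of the "pfn" field it prints for the final-level PTE, e.g. using the
    mrxsmb20 address from the original post:

    3: kd> !pte fffff800`826a6fd8

    That's the step that would need scripting to run across hundreds of dumps.)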

    <QUOTE>
    2. If corruption appears to hit completely random physical pages, do !search
    0000001000400004 and see if there are any other instances of that value,
    especially at page offsets FD8/D70. If yes, check these physical pages in
    other dumps to see if there is a pattern.
    </QUOTE>

    This has been a bit of a dead end only because it's taking so long to get
    the results. I ran !search on one of the more recent dumps this AM at around
    7AM ET and it's still going...I'll continue to see if it gets me anywhere
    though.

    <QUOTE>
    3. If you can identify a suspect physical page, try excluding it using
    bcdedit /set badmemorylist.
    </QUOTE>

    I'll definitely try this if I can identify a range.

    <QUOTE>
    4. Optionally, do !pfn and !searchpte on the suspect PFN to see what the
    page is being used for and where it might be mapped. Check for problems like
    a valid PTE pointing to a free/standby page.
    </QUOTE>

    I did try !searchpte on several dumps but didn't find an obvious
    culprit...I'll give this a few more tries on other dumps though to see if I
    can catch one in the act.

    <QUOTE>
    5. Try booting with a checked kernel and see if it hits any asserts in the
    memory manager.
    </QUOTE>

    Verifier has been run (of course), but not the checked images; that's also a
    very good idea.

    Thanks again for the ideas! I'll definitely update if we find the smoking
    gun.

    -scott
    OSR
    @OSRDrivers

    "Pavel Lebedinsky" wrote in message news:xxxxx@ntdev...

    Such corruptions are often caused by firmware issues, like failing to
    exclude memory used by devices from the OS memory map.

    Some things you can try:

    1. Check if there is a specific physical page that gets corrupted more often
    than other pages.
    2. If corruption appears to hit completely random physical pages, do !search
    0000001000400004 and see if there are any other instances of that value,
    especially at page offsets FD8/D70. If yes, check these physical pages in
    other dumps to see if there is a pattern.
    3. If you can identify a suspect physical page, try excluding it using
    bcdedit /set badmemorylist.
    4. Optionally, do !pfn and !searchpte on the suspect PFN to see what the
    page is being used for and where it might be mapped. Check for problems like
    a valid PTE pointing to a free/standby page.
    5. Try booting with a checked kernel and see if it hits any asserts in the
    memory manager.

    -----Original Message-----
    From: xxxxx@lists.osr.com
    [mailto:xxxxx@lists.osr.com] On Behalf Of Scott Noone
    Sent: Tuesday, November 24, 2015 7:34 AM
    To: Windows System Software Devs Interest List <xxxxx@lists.osr.com>
    Subject: [ntdev] Memory Corruption Mystery: Any Ideas?

    A potentially interesting puzzler for a Tuesday...

    We're looking at a series of crash dumps from a client and am hoping that
    this corruption looks familiar to someone. This is an end user installation
    and none of our software is running on these machines, it's strictly a crash
    analysis exercise.

    The crashes are spread out over 1,000+ "identical" machines. No one machine
    crashes with great frequency, but spread out across the install there's a
    few a day. Systems survive anywhere from a few hours to six days before
    crashing.

    Just looking at the crash codes isn't helpful, they have about every crash
    code you could ever imagine (QUOTA_UNDERRUN??) and it's blamed on various
    modules. However, digging deeper a very consistent pattern emerges.
    Specifically, we are consistently seeing one of two values "randomly" appear
    in memory:

    • 0x00000010`00000004
    • 0x00000010`00400004

    Interestingly, when the corruption is discovered the value very, very
    frequently appears at physical memory page offset 0xFD8 (most common) or
    0xD70 (less common).

    For example, in one crash the problem was that the MRXSMB20 image file is
    corrupted:

    3: kd> !chkimg -d mrxsmb20
    fffff800826a6fd8-fffff800826a6fdd 6 bytes -
    mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
    [ 89 7d 18 49 89 45:04 00 00 00 10 00 ]
    fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
    [ e8:00 ]
    7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)

    Dumping the start of the corrupted range, we see our offset and value:

    3: kd> dq fffff800826a6fd8
    fffff800`826a6fd8 00000010`00000004 4c2b894c`0000e99c
    fffff800`826a6fe8 ade901b6`41986d8b 850f02f8`83fffffd
    fffff800`826a6ff8 8bc03345`fffffbb5 445e15ff`ce8b49d7
    fffff800`826a7008 fb9f850f`c0840002 03fffffe`fee9ffff

    In another crash a pool header is corrupted:

    2: kd> !pool ffffc00089adcd70
    Pool page ffffc00089adcd70 region is Paged pool ...
    ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc

    ffffc00089adcd70 doesn't look like a valid small pool allocation, checking
    to see if the entire page is actually part of a large page allocation...

    2: kd> dq ffffc00089adcd70
    ffffc000`89adcd70 00000010`00000004 8d5eb149`4b83d33a
    ffffc000`89adcd80 00000000`00000000 ffffe000`bdecfec0
    ffffc000`89adcd90 ffffe000`bc728860 ffffc000`89adcd98

    Due to the fact that the crash appears at random in different virtual
    address ranges (paged pool, non-paged pool, code, proto PTEs, working set
    lists, etc.), we believe that the corruption must be generated by a device
    in the system (or by the platform).

    We have tried various things to narrow this further and analyzed the
    corruption across hundreds of dump files. The systems are generally idle
    when the corruption is discovered, which makes it hard to go back in time
    and figure out who might be using the value.

    So, my question to you all is: does this LOOK like anything to you? Do those
    corruption values hold any meaning to you? What about the offsets of 0xFD8
    or 0xD70? I realize it's not much to go on, but stranger things have
    happened...

    -scott
    OSR

  • Scott_Noone_(OSR) Administrator Posts: 3,151
    They are the i7 variant of the Lenovo M93p Tiny Desktop. They are running
    various versions of the available firmware, though there has been an effort
    recently to get them all updated to the latest. The RAM map being confused
    is certainly an interesting development.


    -scott
    OSR
    @OSRDrivers

  • Scott_Noone_(OSR) Administrator Posts: 3,151
    The systems aren't ancient, so I wouldn’t expect an errant legacy ISA DMA.

    Also, what's particularly annoying about the problem is that it's not the
    entire region, it's just an 8 byte pattern starting at 0xFD8 (or 0xD70). So,
    unless 0x00000010`00000004 was a common "end of data" marker for that type
    of transfer it doesn't sound related.

    -scott
    OSR
    @OSRDrivers

    "Maxim S. Shatskih" wrote in message news:xxxxx@ntdev...

    0xFD8 is like the end of 16MB region for legacy ISA DMA

    Can this be the cause?

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com

    "Scott Noone" <xxxxx@osr.com> wrote in message news:xxxxx@ntdev...
    >A potentially interesting puzzler for a Tuesday...
    >
    > We're looking at a series of crash dumps from a client and am hoping that
    > this corruption looks familiar to someone. This is an end user
    > installation
    > and none of our software is running on these machines, it's strictly a
    > crash
    > analysis exercise.
    >
    > The crashes are spread out over 1,000+ "identical" machines. No one
    > machine
    > crashes with great frequency, but spread out across the install there's a
    > few a day. Systems survive anywhere from a few hours to six days before
    > crashing.
    >
    > Just looking at the crash codes isn't helpful, they have about every crash
    > code you could ever imagine (QUOTA_UNDERRUN??) and it's blamed on various
    > modules. However, digging deeper a very consistent pattern emerges.
    > Specifically, we are consistently seeing one of two values "randomly"
    > appear
    > in memory:
    >
    > • 0x00000010`00000004
    > • 0x00000010`00400004
    >
    > Interestingly, when the corruption is discovered the value very, very
    > frequently appears at physical memory page offset 0xFD8 (most common) or
    > 0xD70 (less common).
    >
    > For example, in one crash the problem was that the MRXSMB20 image file is
    > corrupted:
    >
    > 3: kd> !chkimg -d mrxsmb20
    > fffff800826a6fd8-fffff800826a6fdd 6 bytes -
    > mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
    > [ 89 7d 18 49 89 45:04 00 00 00 10 00 ]
    > fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
    > [ e8:00 ]
    > 7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)
    >
    > Dumping the start of the corrupted range, we see our offset and value:
    >
    > 3: kd> dq fffff800826a6fd8
    > fffff800`826a6fd8 00000010`00000004 4c2b894c`0000e99c
    > fffff800`826a6fe8 ade901b6`41986d8b 850f02f8`83fffffd
    > fffff800`826a6ff8 8bc03345`fffffbb5 445e15ff`ce8b49d7
    > fffff800`826a7008 fb9f850f`c0840002 03fffffe`fee9ffff
    >
    > In another crash a pool header is corrupted:
    >
    > 2: kd> !pool ffffc00089adcd70
    > Pool page ffffc00089adcd70 region is Paged pool
    > ...
    > ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc
    >
    > ffffc00089adcd70 doesn't look like a valid small pool allocation, checking
    > to see
    > if the entire page is actually part of a large page allocation...
    >
    > 2: kd> dq ffffc00089adcd70
    > ffffc000`89adcd70 00000010`00000004 8d5eb149`4b83d33a
    > ffffc000`89adcd80 00000000`00000000 ffffe000`bdecfec0
    > ffffc000`89adcd90 ffffe000`bc728860 ffffc000`89adcd98
    >
    > Due to the fact that the crash appears at random in different virtual
    > address ranges (paged pool, non-paged pool, code, proto PTEs, working set
    > lists, etc.), we believe that the corruption must be generated by a device
    > in the system (or by the platform).
    >
    > We have tried various things to narrow this further and analyzed the
    > corruption across hundreds of dump files. The systems are generally idle
    > when the corruption is discovered, which makes it hard to go back in time
    > and figure out who might be using the value.
    >
    > So, my question to you all is: does this LOOK like anything to you? Do
    > those
    > corruption values hold any meaning to you? What about the offsets of 0xFD8
    > or 0xD70? I realize it's not much to go on, but stranger things have
    > happened...
    >
    > Thanks!
    >
    > -scott
    > OSR
    > @OSRDrivers
    >
    >
    >

    -scott
    OSR

  • Jan_Bottorff Member - All Emails Posts: 471
    Extending the idea of excluding individual pages, you might try excluding big chunks of memory. I thought there used to be an option to force only memory above 4GB to be used. Ideally you could binary search the excluded memory. You could potentially write a little boot start driver that allocates do-nothing buffers in specific ranges, testing whether you can cause the corruption to only happen in harmless areas.
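
    A rough sketch of the BCD options that carve memory off the top (as far as
    I know there's no direct "use only memory above 4GB" switch, so these test
    the inverse; the values are just illustrative):

    C:\> bcdedit /set {current} truncatememory 0x100000000
    C:\> bcdedit /set {current} removememory 1024

    truncatememory ignores physical memory at or above the given byte address;
    removememory drops the given number of megabytes from the top of RAM.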

    Jan



  • Alex_Grig Member Posts: 3,238
    There are two possible major reasons:

    1. Stray DMA.
    2. A driver writes through a stray mapping of RAM instead of a BAR. For example, the driver writes some acknowledgement intended for the BAR, but it goes to RAM instead.

    In case 1, investigate if some device uses DMA_ADAPTER with bounce buffers (enumerate all DMA_ADAPTERs in the dump). See if the problem disappears if RAM is limited to 3 GB.
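
    A sketch of the enumeration, assuming the HAL type information in the dump
    cooperates: a bare !dma walks the registered adapter objects, and a specific
    adapter object address can then be passed to it to dig into an individual
    adapter.

    2: kd> !dma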

    Do the systems at the client's all have the same inventory? If only some systems exhibit the behavior, analyze what's different between the cohorts. It might even be different chip revisions (you'll have to analyze full PCI\ device ID strings).
  • anton_bassov Member Posts: 5,038
    http://shop.lenovo.com/us/en/desktops/thinkcentre/m-series-tiny/m93-m93p/#tab-tech_specs

    As you can see, they seem to be "conveniently forgetting" to provide any info about the chipset that they use....


    Anton Bassov
  • Scott_Noone_(OSR) Administrator Posts: 3,151
    From the customer:

    <QUOTE>
    Lenovo M93p Tiny model IDs: 10AAS21200, 10AAS2YP00, 10AAS36X00, 10AAS36X02
    All 4 of the different model types run the same firmware. (They are
    identical machines from the driver point of view.)

    We have tried firmware versions: FBKT89AUS, FBKT98AUS, FBKT99AUS, FBKTA1AUS
    (98% of our PCs are now on FBKTB6AUS)

    CPU: i7-4765T (Haswell-DT), iGPU: Intel HD 4600
    Motherboard: Lenovo Sharkbay
    Chipset: Intel Q87 (Lynx Point)
    </QUOTE>

    -scott
    OSR
    @OSRDrivers

  • Scott_Noone_(OSR) Administrator Posts: 3,151
    Thanks everyone for the brain cycles. I've been staring at this and it's
    nice to have some new angles :)

    We have a bunch of 0x1A/0x41792 crashes, which are nice because the Mm
    causes them when it reads a pointer and gets back a value it doesn't expect.
    They are particularly useful in this case because the pointer is in Arg2 and
    the unexpected value is in Arg3, so dumping them in Excel I can see exactly
    which pointer has the bad content.

    I just took 71 of these and grabbed the PFN for the faulting virtual
    address (yes, I did this manually...yes, I probably should have just written
    something to do it...). Here are the resulting PFNs:

    0x7a00c
    0x14290
    0x2da86
    0x108f1b
    0x10974c
    0x10a43d
    0x10a524
    0x10a652
    0x10af48
    0x10b1c1
    0x10b4b3
    0x10b586
    0x10be8e
    0x10e453
    0x10e4d3
    0x10e936
    0x10ee0c
    0x10f016
    0x10fc15
    0x1120eb
    0x112e70
    0x113127
    0x1140e5
    0x114fe5
    0x11523f
    0x115603
    0x115696
    0x11583e
    0x1159fd
    0x115ad0
    0x115b88
    0x1164fb
    0x116604
    0x116812
    0x11689c
    0x116a03
    0x116d21
    0x116d37
    0x116d92
    0x116dff
    0x11704a
    0x11718e
    0x117389
    0x11762d
    0x11785b
    0x117b1f
    0x117c1a
    0x117c22
    0x117c34
    0x117d9e
    0x117e60
    0x11832d
    0x118817
    0x118a0b
    0x118a19
    0x118ade
    0x118f96
    0x118fe6
    0x1190a3
    0x11916e
    0x1192cc
    0x1192d1
    0x119434
    0x1196be
    0x119702
    0x119850
    0x11b5a3
    0x11b6aa
    0x11d224
    0x1d35ba
    0x1d875d

    While it looks sort of pattern-y, the three at the beginning kill the
    "doesn't happen under 4GB" idea.

    I also grabbed a few other random dumps that have the problem but are
    crashing in different places. Here are the PFNs involved there:

    0x10faf8
    0x1146a2
    0x118a28
    0x118e24
    0x216059

    Again this is not exhaustive, just a random sampling.
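
    (If anyone wants to repeat this against their own pile of dumps, the manual
    part scripts reasonably well; something along these lines from a command
    prompt, with the file names and output path being whatever you use locally:

    C:\dumps> for %f in (*.dmp) do kd -z "%f" -c ".bugcheck; q" >> bugchecks.txt

    then feed each Arg2 from the 0x1A/0x41792 lines back through !pte to recover
    the PFN.)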

    -scott
    OSR
    @OSRDrivers

    "Jan Bottorff" <xxxxx@pmatrix.com> wrote in message
    news:xxxxx@ntdev...

    Extending the idea of excluding individual pages, you might try excluding
    big chunks of memory. I thought there used to be an option to force only
    memory above 4GB to be used. Ideally you could binary search excluded
    memory. You potentially could write a little boot start driver that
    allocated do nothing buffers in specific ranges, testing if you can cause
    the corruption to only happen in harmless areas.

    Jan

    -scott
    OSR

  • Maxim_S._Shatskih Member Posts: 10,396
    Overheating/poor quality RAM chips is also a possibility.

    Can you reduce the RAM/FSB clock a bit and retry?

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com

    "Scott Noone" <xxxxx@osr.com> wrote in message news:xxxxx@ntdev...
    > They are the i7 variant of the Lenovo M93p Tiny Desktop. They are running
    > various versions of the available firmware, though there has been an effort
    > recently to get them all updated to the latest. The RAM map being confused
    > is certainly an interesting development.
    >
    >
    > -scott
    > OSR
    > @OSRDrivers
    >
    > wrote in message news:xxxxx@ntdev...
    >
    > Can you provide a bit more info about these "indentical machines" - chipset
    > version, as well as firmare, seem to be of paricular intererst......
    >
    > Anton Bassov
    >
    >
    >
    >
  • Maxim_S._Shatskih Member Posts: 10,396
    Look at the PCI IDs in Device Manager; task done

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com

  • Scott_Noone_(OSR) Administrator Posts: 3,151
    <QUOTE>
    Overheating/poor quality RAM chips is also a possibility.

    Can you reduce the RAM/FSB clock a bit and retry?
    </QUOTE>

    I discounted this as being a RAM problem due to the consistency of the pattern and the recurring bad offset. It really "feels" like a device (or possibly a driver) writing a control/status value where it shouldn't. That being said, I'm still happy to keep guessing...would this type of corruption be consistent with a RAM issue, in your opinion?

    Thanks!

    -scott
    OSR
    @OSRDrivers

  • anton_bassov Member Posts: 5,038
    > Look at PCI IDs in the Device Manager, task done

    Well, in order to do that you need physical access to the machine, which may already be too late (because you have purchased it already). It seems to be a common trick in the computer stores - they display the CPU info in huge letters without saying anything about the chipset. However, if you write down the model and do a bit of googling you may discover that the chipset they use may, in actuality, come from VIA Technologies....


    Anton Bassov
  • Scott_Noone_(OSR) Administrator Posts: 3,151
    <QUOTE>
    In case 1, investigate if some device uses DMA_ADAPTER with bounce buffers
    (enumerate all DMA_ADAPTERs in the dump).
    </QUOTE>

    That's an interesting idea for a culprit. Unfortunately, !dma is broken for
    these dumps due to missing HAL types so I can't easily determine if there
    are any adapters with bounce buffers.

    <QUOTE>
    Do the systems at the client's all have the same inventory? If only some
    systems exhibit the behavior, analyze what's different between the cohorts.
    It might even be different chip revisions (you'll have to analyze full PCI\
    device ID strings).
    </QUOTE>

    They "should" be the same, but it's something else to check.

    Thanks!

    -scott
    OSR
    @OSRDrivers


  • Alex_Grig Member Posts: 3,238
    In the case of the corrupted images, did the corruption happen in a paged or non-paged section? If it's non-paged, then it's definitely NOT a bounce buffer.
  • Peter_Viscarola_(OSR) Administrator Posts: 7,380
    (Answering for Scott... We're both working on the same problem)

    The corruption is in both paged and non-paged memory, which is one of the things that's made it confusing.

    Peter
    OSR
    @OSRDrivers

  • Alex_Grig Member Posts: 3,238
    Are there any fancy drivers in the image? Like USB 3.0.

    Also, when you see these corrupted pages, is there a pattern in the beginning of the page or in the end of the previous page?
  • Alex_Grig Member Posts: 3,238
    0xFD8 is 4056, which is, coincidentally, one of the jumbo MTUs MS uses for Hyper-V. What is the NIC config on the boxes?
  • Peter_Viscarola_(OSR) Administrator Posts: 7,380
    <quote>
    Are there any fancy drivers in the image? Like USB 3.0.
    </quote>

    Yes. These are USB 3 boxes, and there are (typically) USB devices attached.

    <quote>
    is there a pattern in the beginning of
    the page or in the end of the previous page?
    </quote>

    I don't BELIEVE so. I'll have to leave that one to Mr. Noone.

    <quote>
    What is the NIC config on the boxes?
    </quote>

    Excellent observation, and one that we indeed also had. Jumbo-grams are not enabled, unfortunately.

    Peter
    OSR
    @OSRDrivers

  • OSR_Community_User Member Posts: 110,217
    Contributing some information to the thread, as I work for the "customer". We are very grateful for the work OSR has done troubleshooting our mystery. (Although it did take a while for us to get organized, start collecting hundreds of BSOD dumps, and convince OSR that we needed their help.)


    Network card:
    - The M93p's have an Intel I217-ML. They are connected to Cisco 2960s switches on GB ports.
    - Intel drivers: we had version 12.11.96.1 for most of the year, updated to 12.12.80.1920, followed by 12.13.17.7 recently. Using default driver settings. Jumbo frames are not enabled.

    USB:
    - The M93p only has physical USB3 ports. (The Lenovo hardware maintenance manual lists an optional USB2 port, which we don't have.)

    USB info as reported by msinfo32.
    Intel(R) USB 3.0 eXtensible Host Controller - 0100 (Microsoft) PCI\VEN_8086&DEV_8C31&SUBSYS_30A317AA&REV_04\3&11583659&0&A0
    Intel(R) 8 Series/C220 Series USB EHCI #2 - 8C2D PCI\VEN_8086&DEV_8C2D&SUBSYS_30A317AA&REV_04\3&11583659&0&D0
    Intel(R) 8 Series/C220 Series USB EHCI #1 - 8C26 PCI\VEN_8086&DEV_8C26&SUBSYS_30A317AA&REV_04\3&11583659&0&E8

    All of our systems have at least 2 USB devices at all times: keyboard and mouse.

    USB drivers are the Microsoft Windows 8.1 x64 Enterprise drivers:
    USBXHCI.SYS

    The one critical piece of information Scott has omitted (as it might lead to conjecture) is that we recently disabled "Turn off Monitor after Idle" in the Windows power profile.

    The BSODs all but stopped!

    We did this after OSR observed that the memory scribble BSODs were clustered around user logon and power transition events. Our PCs are on 24/7 with weekly reboots. Users will log off or stay logged on at the end of the day, and prior to our recent changes the monitor would power off after 15 minutes; the user returns in the morning, the monitor wakes up, and they log on. We are office workers; the typical person uses the Office 2013 suite (Outlook, Word, Excel) all day. Nothing fancy.

    All of our M93p's are on the High Performance profile; we don't sleep or do any hibernation. Prior to Scott's recommendation on our power settings, we had "turn off monitor after 15 min of idle", USB low power mode, and HD power-off after idle enabled. These settings have now all been disabled.

    After making this change about 1 week ago and rebooting all 1,000+ systems, the BSODs stopped.

    Not knowing at the time which action changed the behavior (we updated the NIC and Intel storage drivers and made the power changes all at the same time), 1 week later we re-enabled the monitor power-off at idle.

    Not more than 30 minutes after making this change we had our first BSOD with the scribbled memory. Upon seeing this change in behavior we reversed course, disabled the monitor power-off again, and rebooted all systems. We have only encountered one BSOD since, in about 6 days. The "normal" trend line for these systems prior was about 2-6 BSODs per day, sometimes hitting peaks of 10-15 unique machines BSODing per day.

    The other interesting part of this problem is that we have about 100+ Dell E7440 laptops; these machines run the same image as our M93p's. At image deployment (SCCM OSD), different driver packages are injected into the image. (Lenovo and Dell both provide driver packages for SCCM deployments; these always contain outdated drivers that are supposed to be vetted and tested.)

    Suspecting possible bad drivers, we now find the most recent drivers for our hardware on the catalog.update.microsoft.com site and download and deploy those versions.

    Post-image we updated the drivers for the common hardware (NIC, storage, Intel HD iGPU), using the same drivers for both the M93p and the E7440. Not a single "memory scribble" BSOD has been found on the E7440.

    The last piece of information regarding our problem: we never knew we had a serious issue with BSODs until we started looking for them and collecting the dumps. But we always knew we had, and still have, a serious issue with applications from A to Z crashing on our systems. About 60-70% of these crashes are buffer overruns (0xC0000005).

    Example: taken from Windows AppCrash events from various PCs; things crash, most often with exception code 0xc0000005.

    Although our BSODs have stopped since we made the power change, the general user-mode crashes have not. Are they getting corrupted by the same memory scribble error?

    Date-Time, Program Name, Module, Exception Code
    11/24/2015 08:24:11 splwow64.exe KERNELBASE.dll e06d7363
    11/24/2015 08:37:40 WINWORD.EXE wwlib.dll c0000005
    11/24/2015 08:43:25 CcmExec.exe ntdll.dll c0000005
    11/24/2015 08:58:29 AUDIODG.EXE WMALFXGFXDSP.dll c0000005
    11/24/2015 08:58:35 splwow64.exe KERNELBASE.dll e06d7363
    11/24/2015 09:00:27 wfcrun32.exe ntdll.dll c0000005
    11/24/2015 09:12:59 xdcla.exe Profiler.dll c0000135
    11/24/2015 09:19:16 WINWORD.EXE mfc100u.dll c0000005
    11/24/2015 09:22:30 OUTLOOK.EXE combase.dll c0000005
    11/24/2015 09:26:06 lync.exe ntdll.dll c0000005
    11/24/2015 09:27:09 IEXPLORE.EXE ntdll.dll c0000005
    11/24/2015 09:30:40 OUTLOOK.EXE MSVCR100.dll 40000015
    11/24/2015 09:33:23 OUTLOOK.EXE mso.dll c0000602
    11/24/2015 09:39:08 splwow64.exe KERNELBASE.dll e06d7363
    11/24/2015 09:45:01 Acrobat.exe ntdll.dll c0000374
    11/24/2015 09:47:31 OUTLOOK.EXE mso.dll c0000602
    11/24/2015 09:51:32 IEXPLORE.EXE MSHTML.dll c0000005
    11/24/2015 09:52:52 IEXPLORE.EXE MSHTML.dll c0000005
    11/24/2015 09:53:01 IEXPLORE.EXE MSHTML.dll c0000005
    11/24/2015 09:55:01 OUTLOOK.EXE unknown c0000005
    11/24/2015 09:58:43 ppscanmg.exe KERNELBASE.dll e06d7363
    11/24/2015 10:03:15 IEXPLORE.EXE ntdll.dll c0000005
    11/24/2015 10:06:20 Acrobat.exe ntdll.dll c0000374
    11/24/2015 10:09:12 IEXPLORE.EXE igd10iumd32.dll c0000005
    11/24/2015 10:09:24 IEXPLORE.EXE ntdll.dll c0000409
    11/24/2015 10:10:14 splwow64.exe KERNELBASE.dll e06d7363
    11/24/2015 10:11:04 IEXPLORE.EXE Flash.ocx c0000005
    11/24/2015 10:11:24 svchost.exe_Dnscache ntdll.dll c0000008
    11/24/2015 10:21:36 OUTLOOK.EXE ntdll.dll c0000374
    11/24/2015 10:24:30 IEXPLORE.EXE ntdll.dll c0000005
    11/24/2015 10:24:40 OUTLOOK.EXE mso.dll c0000602
    11/24/2015 10:31:21 OUTLOOK.EXE ntdll.dll c0000374
    11/24/2015 10:31:30 WINWORD.EXE unknown c0000005
    11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c000041d
    11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c0000005

    Naim
  • Michael_Shatz Member Posts: 4
    I don't have much to contribute except for my personal Windows Troubleshooting Rule of Thumb: when in doubt, blame antivirus.
  • Andrey_Bazhan Member Posts: 30
    Have you tried to narrow down the culprit by running

    !for_each_module ".echo @#ModuleName; s-b @#Base @#End D8 0F 00 00"
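
    A variant of the same idea would be to search the loaded images for the
    corruption value itself rather than the 0xFD8 constant (04 00 00 00 10 00
    00 00 is 0x00000010`00000004 in little-endian byte order):

    !for_each_module ".echo @#ModuleName; s-b @#Base @#End 04 00 00 00 10 00 00 00"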

  • Alex_Ionescu-2 Member Posts: 138
    Have you looked at CPU Microcode updates? We just spent weeks at work diagnosing what should've been an "impossible" crash, only to realize it was a microcode bug related to power transitions on a recent CPU.

    Have you tried the "scientific method" with these machines? That is, remove more and more hardware and more and more software/drivers until the crashes stop. For example, are there crashes if the users never log in? Are there crashes if you boot into the Windows Recovery Environment? You can even go build a native app that gets launched by SMSS and never returns (or waits on a keystroke) and see if crashes still happen at that point.

    --
    Best regards,
    Alex Ionescu
  • MBond Member - All Emails Posts: 846
    Have you checked the return policy from the OEM? It sounds like you got a bunch of broken systems

    While many people like to round on Microsoft, Windows is not expected to crash daily, and drivers from major manufacturers (Intel qualifies) aren't either, so your most likely root cause is bad hardware (firmware bugs?). The fact that this happens during power transitions reinforces this assertion, since power management, as well as being difficult for driver writers to get right (thank you again, KMDF), also exposes problems with non-compliant hardware.

    I once had a long conversation with a co-worker about a particular system that he was having a problem with. He said it works perfectly with Linux but Windows crashes during install every time - what is wrong with the Windows installer? It turned out that the graphics card installed in the system had a nasty bug where a particular change in graphics mode caused it to overwrite random physical memory, and Windows setup just happened to hit this perfect combination while Linux never did.

  • Scott_Noone_(OSR) Administrator Posts: 3,151
    (Happy Holidays everyone :))

    <QUOTE>
    Also, when you see these corrupted pages, is there a pattern in the
    beginning of the page or in the end of the previous page?
    </QUOTE>

    Nothing obvious in the physically contiguous page above or below.

    One interesting data point is that we never found a case where the
    surrounding virtually contiguous pages were actually physically contiguous.
    This made it feel like a stray DMA overrun or underrun, though again we
    never found the evidence for it.

    -scott
    OSR
    @OSRDrivers

  • Scott_Noone_(OSR) Administrator Posts: 3,151
    We searched for the sequence in the "suspect" driver list (NIC, video, etc.)
    using IDA Pro, though it was a long shot. We found various instances of it,
    though just through static analysis it was impossible to say if it was even
    related. Not enough hours in the day to do a complete reversing job on every
    driver :P

    -scott
    OSR
    @OSRDrivers

    "Andrey Bazhan" wrote in message news:xxxxx@ntdev...

    Have you tried to narrow down the culprit by running

    !for_each_module ".echo @#ModuleName; s-b @#Base @#End D8 0F 00 00"

    wrote in message news:xxxxx@ntdev...

    <QUOTE>
    Overheating/poor quality RAM chips is also a possibility.

    Can you reduce the RAM/FSB clock a bit and retry?
    </QUOTE>

    I discounted this as being a RAM problem due to the consistency and the
    pattern and the bad offset. It really "feels" like a device (or possibly
    driver) writing a control/status value where it shouldn't. That being said,
    I'm happy still guessing...Would this type of corruption be consistent with
    a RAM issue in your opinion?

    Thanks!

    -scott
    OSR
    @OSRDrivers

    -scott
    OSR

  • Maxim_S._Shatskih Member Posts: 10,396
    > This made it feel like a stray DMA overrun or underrun, though again we
    > never found the evidence for it.

    DMA verifier?
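
    (That would be Driver Verifier's DMA checking; roughly the following on a
    test machine, followed by a reboot. 0x80 is the DMA verification bit, and
    the driver list here is only an example:

    C:\> verifier /flags 0x80 /driver usbxhci.sys

    Verifier also accepts /all, at the cost of more overhead.)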

    --
    Maxim S. Shatskih
    Microsoft MVP on File System And Storage
    xxxxx@storagecraft.com
    http://www.storagecraft.com
  • OSR_Community_User Member Posts: 110,217
    You might read the errata notes on the processor and chipsets - these are
    often fairly "concerning" if you read them. Other conspiracy theories in
    the firmware regions might also be worth considering: SMM or VMM firmware
    that might be enabled, and the embedded "management engine" (vPro).

    t.
