Memory Corruption Mystery: Any Ideas?

Can you provide a bit more info about these “identical machines” - the chipset version, as well as the firmware, seem to be of particular interest…

Anton Bassov

0xFD8 is like the end of the 16MB region for legacy ISA DMA

Can this be the cause?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

“Scott Noone” wrote in message news:xxxxx@ntdev…
>A potentially interesting puzzler for a Tuesday…
>
> We’re looking at a series of crash dumps from a client and I’m hoping that
> this corruption looks familiar to someone. This is an end user installation
> and none of our software is running on these machines, it’s strictly a crash
> analysis exercise.
>
> The crashes are spread out over 1,000+ “identical” machines. No one machine
> crashes with great frequency, but spread out across the install there’s a
> few a day. Systems survive anywhere from a few hours to six days before
> crashing.
>
> Just looking at the crash codes isn’t helpful, they have about every crash
> code you could ever imagine (QUOTA_UNDERRUN??) and it’s blamed on various
> modules. However, digging deeper a very consistent pattern emerges.
> Specifically, we are consistently seeing one of two values “randomly” appear
> in memory:
>
> • 0x0000001000000004
> • 0x0000001000400004
>
> Interestingly, when the corruption is discovered the value very, very
> frequently appears at physical memory page offset 0xFD8 (most common) or
> 0xD70 (less common).
>
> For example, in one crash the problem was that the MRXSMB20 image file is
> corrupted:
>
> 3: kd> !chkimg -d mrxsmb20
> fffff800826a6fd8-fffff800826a6fdd 6 bytes -
> mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
> [89 7d 18 49 89 45:04 00 00 00 10 00]
> fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
> [e8:00]
> 7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)
>
> Dumping the start of the corrupted range, we see our offset and value:
>
> 3: kd> dq fffff800826a6fd8
> fffff800`826a6fd8  00000010`00000004 4c2b894c`0000e99c
> fffff800`826a6fe8  ade901b6`41986d8b 850f02f8`83fffffd
> fffff800`826a6ff8  8bc03345`fffffbb5 445e15ff`ce8b49d7
> fffff800`826a7008  fb9f850f`c0840002 03fffffe`fee9ffff
>
> In another crash a pool header is corrupted:
>
> 2: kd> !pool ffffc00089adcd70
> Pool page ffffc00089adcd70 region is Paged pool
> …
> ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc
>
> ffffc00089adcd70 doesn’t look like a valid small pool allocation, checking
> to see
> if the entire page is actually part of a large page allocation…
>
> 2: kd> dq ffffc00089adcd70
> ffffc000`89adcd70  00000010`00000004 8d5eb149`4b83d33a
> ffffc000`89adcd80  00000000`00000000 ffffe000`bdecfec0
> ffffc000`89adcd90  ffffe000`bc728860 ffffc000`89adcd98
>
> Due to the fact that the crash appears at random in different virtual
> address ranges (paged pool, non-paged pool, code, proto PTEs, working set
> lists, etc.), we believe that the corruption must be generated by a device
> in the system (or by the platform).
>
> We have tried various things to narrow this further and analyzed the
> corruption across hundreds of dump files. The systems are generally idle
> when the corruption is discovered, which makes it hard to go back in time
> and figure out who might be using the value.
>
> So, my question to you all is: does this LOOK like anything to you? Do those
> corruption values hold any meaning to you? What about the offsets of 0xFD8
> or 0xD70? I realize it’s not much to go on, but stranger things have
> happened…
>
> Thanks!
>
> -scott
> OSR
> @OSRDrivers
>
>
>

Thanks for the replies!

That’s an interesting idea, I hadn’t thought of that one.

It’s definitely not always the same page, though the above makes me wonder
if it’s in a specific range of pages. Unfortunately extracting the bad PFN
is more manual than I would prefer so automating this might be difficult. I
have a good sample on which this might work though…
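
For reference, a minimal sketch of pulling a PFN inside the debugger rather
than fully by hand (the address below is a hypothetical placeholder; this
assumes the corrupted pointer is available from the bugcheck arguments and
that its page is still valid in the dump):

3: kd> .bugcheck
(shows the bugcheck code and the four arguments)
3: kd> !pte ffffd000`12345678
(run against the argument holding the corrupted pointer; the pfn field of the
final PTE is the physical page number)

Batching something like cdb -z <dump> -c ".bugcheck; q" across the dump files
and post-processing the output would be one way to take the manual step out of
collecting the arguments, assuming symbols are configured.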

This has been a bit of a dead end only because it’s taking so long to get
the results. I ran !search on one of the more recent dumps this AM at around
7AM ET and it’s still going…I’ll continue to see if it gets me anywhere
though.
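
(If the full sweep is too slow, I believe !search also accepts an optional
physical start address and length, so it could be restricted to a candidate
range; the values below are placeholders:)

3: kd> !search 0000001000400004 1`00000000 40000000

That would cover only the physical range from 4GB to 5GB instead of all of RAM.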

I’ll definitely try this if I can identify a range.

I did try !searchpte on several dumps but didn’t find an obvious
culprit…I’ll give this a few more tries on other dumps though to see if I
can catch one in the act.

Verifier has been run (of course), but not the checked images; that’s also a
very good idea.

Thanks again for the ideas! I’ll definitely update if we find the smoking
gun.

-scott
OSR
@OSRDrivers

“Pavel Lebedinsky” wrote in message news:xxxxx@ntdev…

Such corruptions are often caused by firmware issues, like failing to
exclude memory used by devices from the OS memory map.

Some things you can try:

  1. Check if there is a specific physical page that gets corrupted more often
    than other pages.
  2. If corruption appears to hit completely random physical pages, do !search
    0000001000400004 and see if there are any other instances of that value,
    especially at page offsets FD8/D70. If yes, check these physical pages in
    other dumps to see if there is a pattern.
  3. If you can identify a suspect physical page, try excluding it using
    bcdedit /set badmemorylist (see the sketch after this list).
  4. Optionally, do !pfn and !searchpte on the suspect PFN to see what the
    page is being used for and where it might be mapped. Check for problems like
    a valid PTE pointing to a free/standby page.
  5. Try booting with a checked kernel and see if it hits any asserts in the
    memory manager.
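
For items 3 and 4, a sketch of the commands involved (the PFN here is a made-up
placeholder, and the {badmemory} form of the bcdedit command is from memory, so
check the bcdedit documentation before deploying it):

2: kd> !pfn 1b2c3
2: kd> !searchpte 1b2c3

C:\> bcdedit /set {badmemory} badmemorylist 0x1b2c3

A reboot is required for the badmemorylist change to take effect.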

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Scott Noone
Sent: Tuesday, November 24, 2015 7:34 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Memory Corruption Mystery: Any Ideas?

A potentially interesting puzzler for a Tuesday…

We’re looking at a series of crash dumps from a client and I’m hoping that
this corruption looks familiar to someone. This is an end user installation
and none of our software is running on these machines, it’s strictly a crash
analysis exercise.

The crashes are spread out over 1,000+ “identical” machines. No one machine
crashes with great frequency, but spread out across the install there’s a
few a day. Systems survive anywhere from a few hours to six days before
crashing.

Just looking at the crash codes isn’t helpful, they have about every crash
code you could ever imagine (QUOTA_UNDERRUN??) and it’s blamed on various
modules. However, digging deeper a very consistent pattern emerges.
Specifically, we are consistently seeing one of two values “randomly” appear
in memory:

• 0x0000001000000004
• 0x0000001000400004

Interestingly, when the corruption is discovered the value very, very
frequently appears at physical memory page offset 0xFD8 (most common) or
0xD70 (less common).

For example, in one crash the problem was that the MRXSMB20 image file is
corrupted:

3: kd> !chkimg -d mrxsmb20
fffff800826a6fd8-fffff800826a6fdd 6 bytes -
mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
[89 7d 18 49 89 45:04 00 00 00 10 00]
fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
[e8:00]
7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)

Dumping the start of the corrupted range, we see our offset and value:

3: kd> dq fffff800826a6fd8
fffff800`826a6fd8  00000010`00000004 4c2b894c`0000e99c
fffff800`826a6fe8  ade901b6`41986d8b 850f02f8`83fffffd
fffff800`826a6ff8  8bc03345`fffffbb5 445e15ff`ce8b49d7
fffff800`826a7008  fb9f850f`c0840002 03fffffe`fee9ffff

In another crash a pool header is corrupted:

2: kd> !pool ffffc00089adcd70
Pool page ffffc00089adcd70 region is Paged pool …
ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc

ffffc00089adcd70 doesn’t look like a valid small pool allocation, checking
to see if the entire page is actually part of a large page allocation…

2: kd> dq ffffc00089adcd70
ffffc000`89adcd70  00000010`00000004 8d5eb149`4b83d33a
ffffc000`89adcd80  00000000`00000000 ffffe000`bdecfec0
ffffc000`89adcd90  ffffe000`bc728860 ffffc000`89adcd98

Due to the fact that the crash appears at random in different virtual
address ranges (paged pool, non-paged pool, code, proto PTEs, working set
lists, etc.), we believe that the corruption must be generated by a device
in the system (or by the platform).

We have tried various things to narrow this further and analyzed the
corruption across hundreds of dump files. The systems are generally idle
when the corruption is discovered, which makes it hard to go back in time
and figure out who might be using the value.

So, my question to you all is: does this LOOK like anything to you? Do those
corruption values hold any meaning to you? What about the offsets of 0xFD8
or 0xD70? I realize it’s not much to go on, but stranger things have
happened…

They are the i7 variant of the Lenovo M93p Tiny Desktop. They are running
various versions of the available firmware, though there has been an effort
recently to get them all updated to the latest. The RAM map being confused
is certainly an interesting development.

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

Can you provide a bit more info about these “identical machines” - the chipset
version, as well as the firmware, seem to be of particular interest…

Anton Bassov

The systems aren’t ancient, so I wouldn’t expect an errant legacy ISA DMA.

Also, what’s particularly annoying about the problem is that it’s not the
entire region, it’s just an 8 byte pattern starting at 0xFD8 (or 0xD70). So,
unless 0x00000010`00000004 was a common “end of data” marker for that type
of transfer it doesn’t sound related.

-scott
OSR
@OSRDrivers

“Maxim S. Shatskih” wrote in message news:xxxxx@ntdev…

0xFD8 is like the end of the 16MB region for legacy ISA DMA

Can this be the cause?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

“Scott Noone” wrote in message news:xxxxx@ntdev…
>A potentially interesting puzzler for a Tuesday…
>
> We’re looking at a series of crash dumps from a client and I’m hoping that
> this corruption looks familiar to someone. This is an end user
> installation
> and none of our software is running on these machines, it’s strictly a
> crash
> analysis exercise.
>
> The crashes are spread out over 1,000+ “identical” machines. No one
> machine
> crashes with great frequency, but spread out across the install there’s a
> few a day. Systems survive anywhere from a few hours to six days before
> crashing.
>
> Just looking at the crash codes isn’t helpful, they have about every crash
> code you could ever imagine (QUOTA_UNDERRUN??) and it’s blamed on various
> modules. However, digging deeper a very consistent pattern emerges.
> Specifically, we are consistently seeing one of two values “randomly”
> appear
> in memory:
>
> • 0x0000001000000004
> • 0x0000001000400004
>
> Interestingly, when the corruption is discovered the value very, very
> frequently appears at physical memory page offset 0xFD8 (most common) or
> 0xD70 (less common).
>
> For example, in one crash the problem was that the MRXSMB20 image file is
> corrupted:
>
> 3: kd> !chkimg -d mrxsmb20
> fffff800826a6fd8-fffff800826a6fdd 6 bytes -
> mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
> [89 7d 18 49 89 45:04 00 00 00 10 00]
> fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
> [e8:00]
> 7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)
>
> Dumping the start of the corrupted range, we see our offset and value:
>
> 3: kd> dq fffff800826a6fd8
> fffff800`826a6fd8  00000010`00000004 4c2b894c`0000e99c
> fffff800`826a6fe8  ade901b6`41986d8b 850f02f8`83fffffd
> fffff800`826a6ff8  8bc03345`fffffbb5 445e15ff`ce8b49d7
> fffff800`826a7008  fb9f850f`c0840002 03fffffe`fee9ffff
>
> In another crash a pool header is corrupted:
>
> 2: kd> !pool ffffc00089adcd70
> Pool page ffffc00089adcd70 region is Paged pool
> …
> ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc
>
> ffffc00089adcd70 doesn’t look like a valid small pool allocation, checking
> to see
> if the entire page is actually part of a large page allocation…
>
> 2: kd> dq ffffc00089adcd70
> ffffc000`89adcd70  00000010`00000004 8d5eb149`4b83d33a
> ffffc000`89adcd80  00000000`00000000 ffffe000`bdecfec0
> ffffc000`89adcd90  ffffe000`bc728860 ffffc000`89adcd98
>
> Due to the fact that the crash appears at random in different virtual
> address ranges (paged pool, non-paged pool, code, proto PTEs, working set
> lists, etc.), we believe that the corruption must be generated by a device
> in the system (or by the platform).
>
> We have tried various things to narrow this further and analyzed the
> corruption across hundreds of dump files. The systems are generally idle
> when the corruption is discovered, which makes it hard to go back in time
> and figure out who might be using the value.
>
> So, my question to you all is: does this LOOK like anything to you? Do
> those
> corruption values hold any meaning to you? What about the offsets of 0xFD8
> or 0xD70? I realize it’s not much to go on, but stranger things have
> happened…
>
> Thanks!
>
> -scott
> OSR
> @OSRDrivers
>
>
>

Extending the idea of excluding individual pages, you might try excluding big chunks of memory. I thought there used to be an option to force only memory above 4GB to be used. Ideally you could binary search the excluded memory. You could potentially write a little boot-start driver that allocates do-nothing buffers in specific ranges, to test whether you can force the corruption to land only in harmless areas.
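
A sketch of the documented bcdedit options for carving large chunks off the
physical memory map (the values are examples; as far as I know there is no
stock option that forces only memory above 4GB to be used, and both of these
exclude memory from the top rather than the bottom):

C:\> bcdedit /set truncatememory 0x100000000
C:\> bcdedit /set removememory 4096

The first ignores all physical memory at or above 4GB; the second removes
4096MB from the top of the memory map. Either would at least let you
binary-search the upper ranges.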

Jan

On Tue, Nov 24, 2015 at 12:42 PM -0800, “Scott Noone” wrote:

Thanks for the replies!



That’s an interesting idea, I hadn’t thought of that one.



It’s definitely not always the same page, though the above makes me wonder
if it’s in a specific range of pages. Unfortunately extracting the bad PFN
is more manual than I would prefer so automating this might be difficult. I
have a good sample on which this might work though…



This has been a bit of a dead end only because it’s taking so long to get
the results. I ran !search on one of the more recent dumps this AM at around
7AM ET and it’s still going…I’ll continue to see if it gets me anywhere
though.



I’ll definitely try this if I can identify a range.



I did try !searchpte on several dumps but didn’t find an obvious
culprit…I’ll give this a few more tries on other dumps though to see if I
can catch one in the act.



Verifier has been run (of course) but not the checked images, also a very
good idea.

Thanks again for the ideas! I’ll definitely update if we find the smoking
gun.

-scott
OSR
@OSRDrivers

“Pavel Lebedinsky” wrote in message news:xxxxx@ntdev…

Such corruptions are often caused by firmware issues, like failing to
exclude memory used by devices from the OS memory map.

Some things you can try:

1. Check if there is a specific physical page that gets corrupted more often
than other pages.
2. If corruption appears to hit completely random physical pages, do !search
0000001000400004 and see if there are any other instances of that value,
especially at page offsets FD8/D70. If yes, check these physical pages in
other dumps to see if there is a pattern.
3. If you can identify a suspect physical page, try excluding it using
bcdedit /set badmemorylist.
4. Optionally, do !pfn and !searchpte on the suspect PFN to see what the
page is being used for and where it might be mapped. Check for problems like
a valid PTE pointing to a free/standby page.
5. Try booting with a checked kernel and see if it hits any asserts in the
memory manager.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Scott Noone
Sent: Tuesday, November 24, 2015 7:34 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Memory Corruption Mystery: Any Ideas?

A potentially interesting puzzler for a Tuesday…

We’re looking at a series of crash dumps from a client and I’m hoping that
this corruption looks familiar to someone. This is an end user installation
and none of our software is running on these machines, it’s strictly a crash
analysis exercise.

The crashes are spread out over 1,000+ “identical” machines. No one machine
crashes with great frequency, but spread out across the install there’s a
few a day. Systems survive anywhere from a few hours to six days before
crashing.

Just looking at the crash codes isn’t helpful, they have about every crash
code you could ever imagine (QUOTA_UNDERRUN??) and it’s blamed on various
modules. However, digging deeper a very consistent pattern emerges.
Specifically, we are consistently seeing one of two values “randomly” appear
in memory:

• 0x0000001000000004
• 0x0000001000400004

Interestingly, when the corruption is discovered the value very, very
frequently appears at physical memory page offset 0xFD8 (most common) or
0xD70 (less common).

For example, in one crash the problem was that the MRXSMB20 image file is
corrupted:

3: kd> !chkimg -d mrxsmb20
fffff800826a6fd8-fffff800826a6fdd 6 bytes -
mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
[89 7d 18 49 89 45:04 00 00 00 10 00]
fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
[e8:00]
7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)

Dumping the start of the corrupted range, we see our offset and value:

3: kd> dq fffff800826a6fd8
fffff800`826a6fd8  00000010`00000004 4c2b894c`0000e99c
fffff800`826a6fe8  ade901b6`41986d8b 850f02f8`83fffffd
fffff800`826a6ff8  8bc03345`fffffbb5 445e15ff`ce8b49d7
fffff800`826a7008  fb9f850f`c0840002 03fffffe`fee9ffff

In another crash a pool header is corrupted:

2: kd> !pool ffffc00089adcd70
Pool page ffffc00089adcd70 region is Paged pool …
ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc

ffffc00089adcd70 doesn’t look like a valid small pool allocation, checking
to see if the entire page is actually part of a large page allocation…

2: kd> dq ffffc00089adcd70
ffffc000`89adcd70  00000010`00000004 8d5eb149`4b83d33a
ffffc000`89adcd80  00000000`00000000 ffffe000`bdecfec0
ffffc000`89adcd90  ffffe000`bc728860 ffffc000`89adcd98

Due to the fact that the crash appears at random in different virtual
address ranges (paged pool, non-paged pool, code, proto PTEs, working set
lists, etc.), we believe that the corruption must be generated by a device
in the system (or by the platform).

We have tried various things to narrow this further and analyzed the
corruption across hundreds of dump files. The systems are generally idle
when the corruption is discovered, which makes it hard to go back in time
and figure out who might be using the value.

So, my question to you all is: does this LOOK like anything to you? Do those
corruption values hold any meaning to you? What about the offsets of 0xFD8
or 0xD70? I realize it’s not much to go on, but stranger things have
happened…



There are two possible major reasons:

  1. Stray DMA.
  2. Driver writes to stray mapping of RAM instead of BAR. For example, the driver writes some acknowledgement to BAR, but goes to RAM instead.

In case 1, investigate if some device uses DMA_ADAPTER with bounce buffers (enumerate all DMA_ADAPTERs in the dump). See if the problem disappears if RAM is limited to 3 GB.
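
(A sketch of the enumeration, assuming the !dma kernel debugger extension
cooperates with the dumps; with no arguments it should list the registered
adapters:)

0: kd> !dma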

Do the systems at the client’s all have the same inventory? If only some systems exhibit the behavior, analyze what’s different between the cohorts. It might even be different chip revisions (you’ll have to analyze full PCI\ device ID strings).
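
(One way to collect the full IDs without visiting each box, assuming the WDK’s
devcon.exe is available; the PCI hardware ID strings include the
VEN/DEV/SUBSYS/REV fields that identify the exact chip revision:)

C:\> devcon hwids *

Comparing that output, filtered to the PCI\VEN_ lines, between a few crashing
and non-crashing machines would make the inventory check straightforward.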

http://shop.lenovo.com/us/en/desktops/thinkcentre/m-series-tiny/m93-m93p/#tab-tech_specs

As you can see, they seem to be "conveniently forgetting" to provide any info about the chipset that they use…

Anton Bassov

From the customer:

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

http://shop.lenovo.com/us/en/desktops/thinkcentre/m-series-tiny/m93-m93p/#tab-tech_specs

As you can see, they seem to be "conveniently forgetting" to provide any info
about the chipset that they use…

Anton Bassov

Thanks everyone for the brain cycles; I’ve been staring at this and it’s
nice to have some new angles :)

We have a bunch of 0x1A/0x41792 crashes, which are nice because the Mm
causes them when it reads a pointer and gets back a value it doesn’t expect.
They are particularly useful in this case because the pointer is in Arg2 and
the unexpected value is in Arg3, so dumping them in Excel I can see exactly
which pointer has the bad content.

I just took 71 of these and grabbed the PFN for the faulting virtual
address (yes, I did this manually…yes, I probably just should have written
something to do it…). Here are the resulting PFNs:

0x7a00c
0x14290
0x2da86
0x108f1b
0x10974c
0x10a43d
0x10a524
0x10a652
0x10af48
0x10b1c1
0x10b4b3
0x10b586
0x10be8e
0x10e453
0x10e4d3
0x10e936
0x10ee0c
0x10f016
0x10fc15
0x1120eb
0x112e70
0x113127
0x1140e5
0x114fe5
0x11523f
0x115603
0x115696
0x11583e
0x1159fd
0x115ad0
0x115b88
0x1164fb
0x116604
0x116812
0x11689c
0x116a03
0x116d21
0x116d37
0x116d92
0x116dff
0x11704a
0x11718e
0x117389
0x11762d
0x11785b
0x117b1f
0x117c1a
0x117c22
0x117c34
0x117d9e
0x117e60
0x11832d
0x118817
0x118a0b
0x118a19
0x118ade
0x118f96
0x118fe6
0x1190a3
0x11916e
0x1192cc
0x1192d1
0x119434
0x1196be
0x119702
0x119850
0x11b5a3
0x11b6aa
0x11d224
0x1d35ba
0x1d875d

While it looks sort of pattern-y, the three at the beginning (0x7a00c,
0x14290, and 0x2da86 correspond to physical addresses of roughly 1.9GB,
320MB, and 730MB) kill the “doesn’t happen under 4GB” idea.

I also grabbed a few other random dumps that have the problem but are
crashing in different places. Here are the PFNs involved there:

0x10faf8
0x1146a2
0x118a28
0x118e24
0x216059

Again this is not exhaustive, just a random sampling.

-scott
OSR
@OSRDrivers

“Jan Bottorff” wrote in message
news:xxxxx@ntdev…

Extending the idea of excluding individual pages, you might try excluding
big chunks of memory. I thought there used to be an option to force only
memory above 4GB to be used. Ideally you could binary search the excluded
memory. You could potentially write a little boot-start driver that
allocates do-nothing buffers in specific ranges, to test whether you can
force the corruption to land only in harmless areas.

Jan

Overheating or poor-quality RAM chips are also a possibility.

Can you reduce the RAM/FSB clock a bit and retry?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

“Scott Noone” wrote in message news:xxxxx@ntdev…
> They are the i7 variant of the Lenovo M93p Tiny Desktop. They are running
> various versions of the available firmware, though there has been an effort
> recently to get them all updated to the latest. The RAM map being confused
> is certainly an interesting development.
>
>
> -scott
> OSR
> @OSRDrivers
>
> wrote in message news:xxxxx@ntdev…
>
> Can you provide a bit more info about these “identical machines” - the chipset
> version, as well as the firmware, seem to be of particular interest…
>
> Anton Bassov
>
>
>
>

Look at PCI IDs in the Device Manager, task done


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

wrote in message news:xxxxx@ntdev…
> http://shop.lenovo.com/us/en/desktops/thinkcentre/m-series-tiny/m93-m93p/#tab-tech_specs
>
> As you can see, they seem to be "conveniently forgetting" to provide any info about the chipset that they use…
>
>
> Anton Bassov
>

I discounted this as a RAM problem due to the consistency of the pattern and the recurring offset. It really “feels” like a device (or possibly a driver) writing a control/status value where it shouldn’t. That being said, I’m happy to keep guessing… Would this type of corruption be consistent with a RAM issue, in your opinion?

Thanks!

-scott
OSR
@OSRDrivers

> Look at PCI IDs in the Device Manager, task done

Well, in order to do so you need physical access to the machine, by which point it may already be too late (because you have already purchased it). It seems to be a common trick in computer stores - they display the CPU info in huge letters without saying anything about the chipset. However, if you write down the model and do a bit of googling, you may discover that the chipset they use may, in actuality, come from VIA Technologies…

Anton Bassov

That’s an interesting idea for a culprit. Unfortunately, !dma is broken for
these dumps due to missing HAL types so I can’t easily determine if there
are any adapters with bounce buffers.

They “should” be the same, but it’s something else to check.

Thanks!

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

There are two possible major reasons:

  1. Stray DMA.
  2. Driver writes to stray mapping of RAM instead of BAR. For example, the
    driver writes some acknowledgement to BAR, but goes to RAM instead.

In case 1, investigate if some device uses DMA_ADAPTER with bounce buffers
(enumerate all DMA_ADAPTERs in the dump). See if the problem disappears if
RAM is limited to 3 GB.

Do the systems at the client’s all have the same inventory? If only some
systems exhibit the behavior, analyze what’s different between the cohorts.
It might even be different chip revisions (you’ll have to analyze full PCI\
device ID strings).

In the case of the corrupted images, did the corruption happen in a paged or nonpaged section? If it’s nonpaged, then it’s definitely NOT a bounce buffer.

(Answering for Scott… We’re both working on the same problem)

The corruption is in both paged and non-paged memory, which is one of the things that’s made it confusing.

Peter
OSR
@OSRDrivers

Are there any fancy drivers in the image? Like USB 3.0.

Also, when you see these corrupted pages, is there a pattern at the beginning of the page or at the end of the previous page?

0xFD8 is 4056, which is, coincidentally, one of the jumbo MTUs Microsoft uses for Hyper-V. What is the NIC config on the boxes?

Yes. These are USB 3 boxes, and there are (typically) USB devices attached.

I don’t BELIEVE so. I’ll have to leave that one to Mr. Noone.

Excellent observation, and one that we indeed also had. Jumbo-grams are not enabled, unfortunately.

Peter
OSR
@OSRDrivers