A potentially interesting puzzler for a Tuesday…
We’re looking at a series of crash dumps from a client and are hoping that
this corruption looks familiar to someone. This is an end user installation
and none of our software is running on these machines; it’s strictly a crash
analysis exercise.
The crashes are spread out over 1,000+ “identical” machines. No one machine
crashes with great frequency, but across the whole install there are a few
crashes a day. Systems survive anywhere from a few hours to six days before
crashing.
Just looking at the crash codes isn’t helpful: they have about every crash
code you could ever imagine (QUOTA_UNDERRUN??) and the crashes are blamed on
various modules. However, digging deeper, a very consistent pattern emerges.
Specifically, we are consistently seeing one of two values “randomly” appear
in memory:
• 0x0000001000000004
• 0x0000001000400004
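As a quick way to compare the two values, here is a minimal Python sketch that splits each one into its 32-bit halves and lists the bits that are set. The helper names (`set_bits`, `split_dwords`) are made up for illustration; nothing here says what the values mean, it only makes their shape easier to see.

```python
# The two corruption values quoted above.
VALUES = [0x0000001000000004, 0x0000001000400004]


def set_bits(value: int) -> list[int]:
    """Return the positions of the bits that are set, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def split_dwords(value: int) -> tuple[int, int]:
    """Split a 64-bit value into its (high, low) 32-bit halves."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


for v in VALUES:
    high, low = split_dwords(v)
    print(f"{v:#018x}: high={high:#010x} low={low:#010x} bits={set_bits(v)}")

# Bits that differ between the two values.
print(f"xor = {VALUES[0] ^ VALUES[1]:#x}")
```

Seen this way, the two values share the same high dword and differ in a single bit of the low dword.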
Interestingly, when the corruption is discovered the value very, very
frequently appears at physical memory page offset 0xFD8 (most common) or
0xD70 (less common).
For example, in one crash the problem was that the MRXSMB20 image file is
corrupted:
3: kd> !chkimg -d mrxsmb20
fffff800826a6fd8-fffff800826a6fdd 6 bytes - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
[89 7d 18 49 89 45:04 00 00 00 10 00]
fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
[e8:00]
7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)
Dumping the start of the corrupted range, we see our offset and value:
3: kd> dq fffff800826a6fd8
fffff800`826a6fd8 00000010`00000004 4c2b894c`0000e99c
fffff800`826a6fe8 ade901b6`41986d8b 850f02f8`83fffffd
fffff800`826a6ff8 8bc03345`fffffbb5 445e15ff`ce8b49d7
fffff800`826a7008 fb9f850f`c0840002 03fffffe`fee9ffff
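To tie the !chkimg byte listing to the qword shown by dq, here is a small Python sketch that reassembles the eight bytes at the start of the corrupted range as a little-endian value. Seven of the bytes are the "after" values reported by !chkimg; the byte at offset +6 was not flagged, so its value (00) is read off the dq dump and is an inference rather than something !chkimg printed.

```python
# Bytes at fffff800826a6fd8 through fffff800826a6fdf after the corruption.
# Offsets +0..+5 and +7 come from the !chkimg "after" columns; offset +6 was
# not reported as changed and is taken from the dq dump (assumed to be 00).
corrupted = bytes([0x04, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00])

# x64 is little-endian, so the first byte in memory is the least significant.
value = int.from_bytes(corrupted, byteorder="little")
print(f"{value:#018x}")
```

The result matches the first of the two corruption values listed earlier.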
In another crash a pool header is corrupted:
2: kd> !pool ffffc00089adcd70
Pool page ffffc00089adcd70 region is Paged pool
…
ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc
ffffc00089adcd70 doesn’t look like a valid small pool allocation, checking to see
if the entire page is actually part of a large page allocation…
2: kd> dq ffffc00089adcd70
ffffc000`89adcd70 00000010`00000004 8d5eb149`4b83d33a
ffffc000`89adcd80 00000000`00000000 ffffe000`bdecfec0
ffffc000`89adcd90 ffffe000`bc728860 ffffc000`89adcd98
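Both example addresses land on the page offsets mentioned earlier. A minimal Python sketch of that check follows, assuming 4 KB pages; the low 12 bits of a virtual address are the offset within the page and are not changed by translation, so the virtual and physical page offsets agree. The `page_offset` helper and the dictionary labels are illustrative names only.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KB pages


def page_offset(address: int) -> int:
    """Return the offset of an address within its page."""
    return address & (PAGE_SIZE - 1)


# Start addresses of the two corrupted ranges shown in the dumps above.
examples = {
    "mrxsmb20 image": 0xFFFFF800826A6FD8,
    "paged pool": 0xFFFFC00089ADCD70,
}

for label, address in examples.items():
    print(f"{label}: {address:#018x} -> page offset {page_offset(address):#05x}")
```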
Because the corruption shows up at random in different virtual address
ranges (paged pool, non-paged pool, code, proto PTEs, working set lists,
etc.), we believe that it must be generated by a device in the system (or
by the platform).
We have tried various things to narrow this further and analyzed the
corruption across hundreds of dump files. The systems are generally idle
when the corruption is discovered, which makes it hard to go back in time
and figure out who might be using the value.
So, my question to you all is: does this LOOK like anything to you? Do those
corruption values hold any meaning to you? What about the offsets of 0xFD8
or 0xD70? I realize it’s not much to go on, but stranger things have
happened…
Thanks!
-scott
OSR
@OSRDrivers