The systems aren’t ancient, so I wouldn’t expect an errant legacy ISA DMA.
Also, what’s particularly annoying about the problem is that it’s not the
entire region that gets stomped, just an 8-byte pattern starting at page
offset 0xFD8 (or 0xD70). So, unless 0x00000010`00000004 was a common “end
of data” marker for that type of transfer, it doesn’t sound related.
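For anyone who wants to grep their own dumps for this, here’s a rough sketch (Python, a hypothetical helper, assuming a raw little-endian physical memory image and 4KB pages) that finds the two values and reports the page offset of each hit:

```python
import struct

PAGE_SIZE = 0x1000

# The two 8-byte corruption values we keep seeing, as little-endian bytes.
PATTERNS = [
    struct.pack("<Q", 0x00000010_00000004),
    struct.pack("<Q", 0x00000010_00400004),
]

def scan_for_corruption(buf):
    """Return (absolute offset, page offset) for every occurrence of a
    known corruption value in a raw memory image."""
    hits = []
    for pat in PATTERNS:
        idx = buf.find(pat)
        while idx >= 0:
            hits.append((idx, idx % PAGE_SIZE))
            idx = buf.find(pat, idx + 1)
    return sorted(hits)
```

If the 0xFD8/0xD70 clustering holds up in raw physical captures too, that would further point at something below the OS.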
-scott
OSR
@OSRDrivers
“Maxim S. Shatskih” wrote in message news:xxxxx@ntdev…
0xFD8 is like the end of the 16MB region for legacy ISA DMA.
Can this be the cause?
--
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com
“Scott Noone” wrote in message news:xxxxx@ntdev…
>A potentially interesting puzzler for a Tuesday…
>
> We’re looking at a series of crash dumps from a client and are hoping that
> this corruption looks familiar to someone. This is an end user installation
> and none of our software is running on these machines; it’s strictly a
> crash analysis exercise.
>
> The crashes are spread out over 1,000+ “identical” machines. No one
> machine crashes with great frequency, but across the install there are a
> few a day. Systems survive anywhere from a few hours to six days before
> crashing.
>
> Just looking at the crash codes isn’t helpful; they have about every crash
> code you could ever imagine (QUOTA_UNDERRUN??) and the crashes are blamed
> on various modules. However, digging deeper, a very consistent pattern
> emerges. Specifically, we are consistently seeing one of two values
> “randomly” appear in memory:
>
> • 0x00000010`00000004
> • 0x00000010`00400004
>
> Interestingly, when the corruption is discovered the value very, very
> frequently appears at physical memory page offset 0xFD8 (most common) or
> 0xD70 (less common).
>
> For example, in one crash the problem was that the in-memory mrxsmb20
> image was corrupted:
>
> 3: kd> !chkimg -d mrxsmb20
> fffff800826a6fd8-fffff800826a6fdd 6 bytes -
> mrxsmb20!Smb2UpdateFileInfoCacheEntry+4c8
> [89 7d 18 49 89 45:04 00 00 00 10 00]
> fffff800826a6fdf - mrxsmb20!Smb2UpdateFileInfoCacheEntry+4cf (+0x07)
> [e8:00]
> 7 errors : mrxsmb20 (fffff800826a6fd8-fffff800826a6fdf)
>
> Dumping the start of the corrupted range, we see our offset and value:
>
> 3: kd> dq fffff800826a6fd8
> fffff800`826a6fd8  00000010`00000004 4c2b894c`0000e99c
> fffff800`826a6fe8  ade901b6`41986d8b 850f02f8`83fffffd
> fffff800`826a6ff8  8bc03345`fffffbb5 445e15ff`ce8b49d7
> fffff800`826a7008  fb9f850f`c0840002 03fffffe`fee9ffff
>
> In another crash a pool header is corrupted:
>
> 2: kd> !pool ffffc00089adcd70
> Pool page ffffc00089adcd70 region is Paged pool
> …
> ffffc00089adcc00 size: 170 previous size: b0 (Free ) MPsc
>
> ffffc00089adcd70 doesn’t look like a valid small pool allocation, checking
> to see if the entire page is actually part of a large page allocation…
>
> 2: kd> dq ffffc00089adcd70
> ffffc000`89adcd70  00000010`00000004 8d5eb149`4b83d33a
> ffffc000`89adcd80  00000000`00000000 ffffe000`bdecfec0
> ffffc000`89adcd90  ffffe000`bc728860 ffffc000`89adcd98
>
> Because the corruption appears at random in different virtual address
> ranges (paged pool, non-paged pool, code, proto PTEs, working set lists,
> etc.), we believe that it must be generated by a device in the system (or
> by the platform).
>
> We have tried various things to narrow this further and analyzed the
> corruption across hundreds of dump files. The systems are generally idle
> when the corruption is discovered, which makes it hard to go back in time
> and figure out who might be using the value.
>
> So, my question to you all is: does this LOOK like anything to you? Do
> those
> corruption values hold any meaning to you? What about the offsets of 0xFD8
> or 0xD70? I realize it’s not much to go on, but stranger things have
> happened…
>
> Thanks!
>
> -scott
> OSR
> @OSRDrivers