Memory Corruption Mystery: Any Ideas?

From the customer:

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

http://shop.lenovo.com/us/en/desktops/thinkcentre/m-series-tiny/m93-m93p/#tab-tech_specs

As you can see, they seem to be "conveniently forgetting" to provide any info
about the chipset that they use…

Anton Bassov

Thanks everyone for the brain cycles, I’ve been staring at this and it’s
nice to have some new angles :slight_smile:

We have a bunch of 0x1A/0x41792 crashes, which are nice because the Mm
causes them when it reads a pointer and gets back a value it doesn’t expect.
They are particularly useful in this case because the pointer is in Arg2 and
the unexpected value is in Arg3, so dumping them in Excel I can see exactly
which pointer has the bad content.

I just took 71 of these and grabbed the PFN for the faulting virtual
address (yes, I did this manually…yes, I probably just should have written
something to do it…). Here are the resulting PFNs:

0x7a00c
0x14290
0x2da86
0x108f1b
0x10974c
0x10a43d
0x10a524
0x10a652
0x10af48
0x10b1c1
0x10b4b3
0x10b586
0x10be8e
0x10e453
0x10e4d3
0x10e936
0x10ee0c
0x10f016
0x10fc15
0x1120eb
0x112e70
0x113127
0x1140e5
0x114fe5
0x11523f
0x115603
0x115696
0x11583e
0x1159fd
0x115ad0
0x115b88
0x1164fb
0x116604
0x116812
0x11689c
0x116a03
0x116d21
0x116d37
0x116d92
0x116dff
0x11704a
0x11718e
0x117389
0x11762d
0x11785b
0x117b1f
0x117c1a
0x117c22
0x117c34
0x117d9e
0x117e60
0x11832d
0x118817
0x118a0b
0x118a19
0x118ade
0x118f96
0x118fe6
0x1190a3
0x11916e
0x1192cc
0x1192d1
0x119434
0x1196be
0x119702
0x119850
0x11b5a3
0x11b6aa
0x11d224
0x1d35ba
0x1d875d

While it looks sort of pattern-y, the three at the beginning kill the
“doesn’t happen under 4GB” idea.
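
The below-4GB check can be sketched in a few lines of Python. This is only an illustration: it assumes 4 KiB pages (so the 4GB boundary falls at PFN 0x100000) and uses just a handful of the PFNs listed above rather than the full set.

```python
PAGE_SIZE = 4096            # assumed small-page size on x64
FOUR_GB = 4 * 1024**3

# A few PFNs copied from the list above (first four and last three).
pfns = [0x7a00c, 0x14290, 0x2da86, 0x108f1b, 0x11d224, 0x1d35ba, 0x1d875d]


def pfn_to_phys(pfn: int) -> int:
    """Return the physical address of the first byte of the page frame."""
    return pfn * PAGE_SIZE


# Frames whose physical address lies under the 4GB boundary.
below_4gb = [pfn for pfn in pfns if pfn_to_phys(pfn) < FOUR_GB]

for pfn in pfns:
    phys = pfn_to_phys(pfn)
    where = "below" if phys < FOUR_GB else "above"
    print(f"PFN {pfn:#x} -> physical {phys:#x} ({where} 4GB)")
```

With these inputs only the first three frames land below 4GB, which matches the observation above.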

I also grabbed a few other random dumps that have the problem but are
crashing in different places. Here are the PFNs involved there:

0x10faf8
0x1146a2
0x118a28
0x118e24
0x216059

Again this is not exhaustive, just a random sampling.

-scott
OSR
@OSRDrivers

“Jan Bottorff” wrote in message
news:xxxxx@ntdev…

Extending the idea of excluding individual pages, you might try excluding
big chunks of memory. I thought there used to be an option to force only
memory above 4GB to be used. Ideally you could binary search excluded
memory. You potentially could write a little boot start driver that
allocated do-nothing buffers in specific ranges, testing if you can cause
the corruption to only happen in harmless areas.

Jan

Overheating/poor quality RAM chips is also a possibility.

Can you reduce the RAM/FSB clock a bit and retry?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

“Scott Noone” wrote in message news:xxxxx@ntdev…
> They are the i7 variant of the Lenovo M93p Tiny Desktop. They are running
> various versions of the available firmware, though there has been an effort
> recently to get them all updated to the latest. The RAM map being confused
> is certainly an interesting development.
>
>
> -scott
> OSR
> @OSRDrivers
>
> wrote in message news:xxxxx@ntdev…
>
> Can you provide a bit more info about these “identical machines” - chipset
> version, as well as firmware, seem to be of particular interest…
>
> Anton Bassov
>
>
>
>

Look at PCI IDs in the Device Manager, task done


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

wrote in message news:xxxxx@ntdev…
> http://shop.lenovo.com/us/en/desktops/thinkcentre/m-series-tiny/m93-m93p/#tab-tech_specs
>
> As you can see, they seem to be "conveniently forgetting" to provide any info about the chipset that they use…
>
>
> Anton Bassov
>

I discounted this as being a RAM problem due to the consistency and the pattern and the bad offset. It really “feels” like a device (or possibly driver) writing a control/status value where it shouldn’t. That being said, I’m happy still guessing…Would this type of corruption be consistent with a RAM issue in your opinion?

Thanks!

-scott
OSR
@OSRDrivers

> Look at PCI IDs in the Device Manager, task done

Well, in order to be able to do so you need to get physical access to the machine, by which point it may already be too late (because you have purchased it already). It seems to be a common trick in computer stores - they display CPU info in huge letters without saying anything about the chipset. However, if you write down the model and do a bit of googling you may discover that the chipset they use may, in actuality, come from VIA Technologies…

Anton Bassov

That’s an interesting idea for a culprit. Unfortunately, !dma is broken for
these dumps due to missing HAL types so I can’t easily determine if there
are any adapters with bounce buffers.

They “should” be the same, but it’s something else to check.

Thanks!

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

There are two possible major reasons:

  1. Stray DMA.
  2. Driver writes to stray mapping of RAM instead of BAR. For example, the
    driver writes some acknowledgement to BAR, but goes to RAM instead.

In case 1, investigate if some device uses DMA_ADAPTER with bounce buffers
(enumerate all DMA_ADAPTERs in the dump). See if the problem disappears if
RAM is limited to 3 GB.

Do the systems at the client’s all have the same inventory? If only some
systems exhibit the behavior, analyze what’s different between the cohorts.
It might even be different chip revisions (you’ll have to analyze full PCI\
device ID strings).

In case of corrupted images, did corruption happen in a paged or nonpaged section? If it’s non-paged, then it’s definitely NOT a bounce buffer.

(Answering for Scott… We’re both working on the same problem)

The corruption is in both paged and non-paged memory, which is one thing that’s made it confusing.

Peter
OSR
@OSRDrivers

Are there any fancy drivers in the image? Like USB 3.0.

Also, when you see these corrupted pages, is there a pattern in the beginning of the page or in the end of the previous page?

0xFD8 is 4056, which is, coincidentally, one of the jumbo MTUs MS uses for Hyper-V. What is the NIC config on the boxes?

Yes. These are USB 3 boxes, and there are (typically) USB devices attached.

I don’t BELIEVE so. I’ll have to leave that one to Mr. Noone.

Excellent observation, and one that we indeed also had. Jumbo-grams are not enabled, unfortunately.

Peter
OSR
@OSRDrivers

Contributing some information to the thread, as I work for the “customer”. We are very grateful for the work OSR has done troubleshooting our mystery. (Although it did take a while for us to get organized, start collecting hundreds of BSOD dumps, and convince OSR that we needed their help.)

Network card

  • The M93p’s have an Intel I217-LM. They are connected to Cisco 2960s switches on gigabit ports.
    Intel drivers: we had version 12.11.96.1 for most of the year, updated to 12.12.80.1920, followed by 12.13.17.7 recently. Using default driver settings. Jumbo frames are not enabled.

USB

  • The M93p only has physical USB3 ports. (The Lenovo hardware maintenance manual lists an optional USB2 port, which we don’t have.)

USB info as reported by msinfo32.
Intel(R) USB 3.0 eXtensible Host Controller - 0100 (Microsoft) PCI\VEN_8086&DEV_8C31&SUBSYS_30A317AA&REV_04\3&11583659&0&A0
Intel(R) 8 Series/C220 Series USB EHCI #2 - 8C2D PCI\VEN_8086&DEV_8C2D&SUBSYS_30A317AA&REV_04\3&11583659&0&D0
Intel(R) 8 Series/C220 Series USB EHCI #1 - 8C26 PCI\VEN_8086&DEV_8C26&SUBSYS_30A317AA&REV_04\3&11583659&0&E8
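
For comparing hardware across machines, the vendor, device, subsystem and revision fields can be pulled out of ID strings like the three above. The following is a minimal sketch, assuming the strings follow the `PCI\VEN_xxxx&DEV_xxxx&SUBSYS_xxxxxxxx&REV_xx` layout shown here; the function name `parse_pci_id` is made up for illustration.

```python
import re

# Matches the hardware-ID portion of a PCI instance path as printed by msinfo32.
PCI_ID = re.compile(
    r"PCI\\VEN_(?P<vendor>[0-9A-F]{4})"
    r"&DEV_(?P<device>[0-9A-F]{4})"
    r"&SUBSYS_(?P<subsys>[0-9A-F]{8})"
    r"&REV_(?P<rev>[0-9A-F]{2})",
    re.IGNORECASE,
)


def parse_pci_id(line):
    """Return a dict of vendor/device/subsys/rev, or None if no ID is found."""
    match = PCI_ID.search(line)
    return match.groupdict() if match else None


lines = [
    r"Intel(R) USB 3.0 eXtensible Host Controller - 0100 (Microsoft) PCI\VEN_8086&DEV_8C31&SUBSYS_30A317AA&REV_04\3&11583659&0&A0",
    r"Intel(R) 8 Series/C220 Series USB EHCI #2 - 8C2D PCI\VEN_8086&DEV_8C2D&SUBSYS_30A317AA&REV_04\3&11583659&0&D0",
    r"Intel(R) 8 Series/C220 Series USB EHCI #1 - 8C26 PCI\VEN_8086&DEV_8C26&SUBSYS_30A317AA&REV_04\3&11583659&0&E8",
]

parsed = [parse_pci_id(line) for line in lines]
for entry in parsed:
    print(entry)
```

Collecting these tuples from crashing and non-crashing machines and diffing them would be one way to spot a chip revision difference between cohorts.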

All of our systems have at least 2 USB devices attached at all times: keyboard and mouse.

USB drivers are the Microsoft Windows 8.1 x64 Enterprise drivers :
USBXHCI.SYS

The one critical piece of information Scott has omitted (as it might lead to conjecture) is that when we recently disabled “Turn off Monitor after Idle” in the Windows power profile, the BSODs all but stopped!

We did this after OSR observed that the memory scribble BSODs were clustered around user logon and power transition events. Our PCs are on 24/7 with weekly reboots. Users log off or stay logged on at the end of the day, and prior to our recent changes the monitor would power off after 15 min. The user returns in the morning, the monitor wakes up, and the user logs on. We are office workers; a typical person uses the Office 2013 suite (Outlook, Word, Excel) all day. Nothing fancy.

All of our M93p’s are on the High performance profile; we don’t sleep or hibernate. Prior to Scott’s recommendation on our power settings, we had Turn off Monitor after 15 min of idle, USB low power mode, and HD power-off after idle. These settings have now all been disabled.

After making this change about 1 week ago and rebooting all 1000+ systems, the BSODs stopped.

We did not know at the time which of our actions had changed the behavior, since we updated the NIC and Intel storage drivers and made the power changes all at the same time. So 1 week later we re-enabled the monitor power-off at idle.

Not more than 30 min after making this change we had our first BSOD with scribbled memory. Upon seeing this change in behavior we reversed course, disabled the monitor power-off again, and rebooted all systems. We have only encountered one BSOD since, in about 6 days. The “normal” trend line for these systems prior was about 2-6 BSODs per day, sometimes hitting peaks of 10-15 unique machines BSODing per day.

The other interesting part of this problem is that we have 100+ Dell E7440 laptops, and these machines run the same image as our M93p’s. At image deployment (SCCM OSD), different driver packages are injected into the image. (Lenovo and Dell both provide driver packages for SCCM deployments; these always contain outdated drivers that are supposed to be vetted and tested.)

Suspecting possible bad drivers, we now find the most recent drivers for our hardware on the catalog.update.microsoft.com site and download and deploy these versions.

Post image, we updated the drivers for common hardware (NIC, storage, Intel HD iGPU), using the same driver for both the M93p and the E7440. Not a single “memory scribble” BSOD has been found on the E7440.

Last piece of information regarding our problem. We never knew we had a serious issue with BSODs until we started looking for them and collecting the dumps. But we always knew we had, and still have, a serious issue with applications from A to Z crashing on our systems. About 60-70% of these crashes are access violations (0xC0000005).

Example: taken from Windows AppCrash events from various PCs. Things most often crash with exception code 0xc0000005.

Although our BSODs have stopped since we made the power change, the general user-mode crashes have not. Are they being caused by the same memory scribble?

Date-Time, Program name, Module, Exception code
11/24/2015 08:24:11 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 08:37:40 WINWORD.EXE wwlib.dll c0000005
11/24/2015 08:43:25 CcmExec.exe ntdll.dll c0000005
11/24/2015 08:58:29 AUDIODG.EXE WMALFXGFXDSP.dll c0000005
11/24/2015 08:58:35 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 09:00:27 wfcrun32.exe ntdll.dll c0000005
11/24/2015 09:12:59 xdcla.exe Profiler.dll c0000135
11/24/2015 09:19:16 WINWORD.EXE mfc100u.dll c0000005
11/24/2015 09:22:30 OUTLOOK.EXE combase.dll c0000005
11/24/2015 09:26:06 lync.exe ntdll.dll c0000005
11/24/2015 09:27:09 IEXPLORE.EXE ntdll.dll c0000005
11/24/2015 09:30:40 OUTLOOK.EXE MSVCR100.dll 40000015
11/24/2015 09:33:23 OUTLOOK.EXE mso.dll c0000602
11/24/2015 09:39:08 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 09:45:01 Acrobat.exe ntdll.dll c0000374
11/24/2015 09:47:31 OUTLOOK.EXE mso.dll c0000602
11/24/2015 09:51:32 IEXPLORE.EXE MSHTML.dll c0000005
11/24/2015 09:52:52 IEXPLORE.EXE MSHTML.dll c0000005
11/24/2015 09:53:01 IEXPLORE.EXE MSHTML.dll c0000005
11/24/2015 09:55:01 OUTLOOK.EXE unknown c0000005
11/24/2015 09:58:43 ppscanmg.exe KERNELBASE.dll e06d7363
11/24/2015 10:03:15 IEXPLORE.EXE ntdll.dll c0000005
11/24/2015 10:06:20 Acrobat.exe ntdll.dll c0000374
11/24/2015 10:09:12 IEXPLORE.EXE igd10iumd32.dll c0000005
11/24/2015 10:09:24 IEXPLORE.EXE ntdll.dll c0000409
11/24/2015 10:10:14 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 10:11:04 IEXPLORE.EXE Flash.ocx c0000005
11/24/2015 10:11:24 svchost.exe_Dnscache ntdll.dll c0000008
11/24/2015 10:21:36 OUTLOOK.EXE ntdll.dll c0000374
11/24/2015 10:24:30 IEXPLORE.EXE ntdll.dll c0000005
11/24/2015 10:24:40 OUTLOOK.EXE mso.dll c0000602
11/24/2015 10:31:21 OUTLOOK.EXE ntdll.dll c0000374
11/24/2015 10:31:30 WINWORD.EXE unknown c0000005
11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c000041d
11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c0000005
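
A quick way to see how the exception codes in a log like this break down is to tally the last column. The sketch below uses only nine rows copied from the table above, and the symbolic names in `NAMES` are added here for readability (the NTSTATUS names are the standard ones; e06d7363 is the code the Microsoft C++ runtime uses for a thrown C++ exception). It takes the last whitespace-separated token as the code, which keeps working for program names that contain spaces.

```python
from collections import Counter

# Symbolic names for the codes that appear in the sample rows below.
NAMES = {
    "c0000005": "STATUS_ACCESS_VIOLATION",
    "c0000135": "STATUS_DLL_NOT_FOUND",
    "c0000374": "STATUS_HEAP_CORRUPTION",
    "c0000602": "STATUS_FAIL_FAST_EXCEPTION",
    "c000041d": "STATUS_FATAL_USER_CALLBACK_EXCEPTION",
    "40000015": "STATUS_FATAL_APP_EXIT",
    "e06d7363": "MSVC C++ exception",
}

# A sample of rows from the table above, not the full log.
rows = """\
11/24/2015 08:24:11 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 08:37:40 WINWORD.EXE wwlib.dll c0000005
11/24/2015 08:43:25 CcmExec.exe ntdll.dll c0000005
11/24/2015 09:12:59 xdcla.exe Profiler.dll c0000135
11/24/2015 09:30:40 OUTLOOK.EXE MSVCR100.dll 40000015
11/24/2015 09:33:23 OUTLOOK.EXE mso.dll c0000602
11/24/2015 09:45:01 Acrobat.exe ntdll.dll c0000374
11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c000041d
11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c0000005
""".splitlines()


def exception_code(row: str) -> str:
    """The exception code is the last whitespace-separated field."""
    return row.split()[-1].lower()


counts = Counter(exception_code(row) for row in rows)

for code, n in counts.most_common():
    print(f"{code}  {n:3d}  {NAMES.get(code, 'unknown')}")
```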

Naim

I don’t have much to contribute except for my personal Windows Troubleshooting Rule of Thumb: when in doubt, blame antivirus.

Have you tried to narrow down the culprit by running

!for_each_module ".echo @#ModuleName; s-b @#Base @#End D8 0F 00 00"
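
For readers unfamiliar with the byte search in that command: D8 0F 00 00 is the value 0xFD8 stored as a 32-bit little-endian integer, and the search simply reports every offset where those four bytes occur. Here is a rough Python illustration of the same idea on a synthetic buffer. It does not touch a dump or the debugger, and the helper name `find_all` is made up for this sketch.

```python
import struct

# 0xFD8 as a 32-bit little-endian value: D8 0F 00 00
PATTERN = struct.pack("<I", 0xFD8)


def find_all(buf: bytes, pattern: bytes):
    """Return every offset in buf at which pattern starts."""
    offsets = []
    start = 0
    while True:
        index = buf.find(pattern, start)
        if index == -1:
            break
        offsets.append(index)
        start = index + 1  # step by one so overlapping matches are not missed
    return offsets


# Synthetic 64-byte buffer with the pattern planted at offsets 8 and 40.
buf = bytearray(64)
buf[8:12] = PATTERN
buf[40:44] = PATTERN

hits = find_all(bytes(buf), PATTERN)
print([hex(offset) for offset in hits])
```

A hit inside a driver image only shows that the constant exists there; as with any static search, it says nothing about whether that code ever writes the value to the wrong place.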

wrote in message news:xxxxx@ntdev…

I discounted this as being a RAM problem due to the consistency and the
pattern and the bad offset. It really “feels” like a device (or possibly
driver) writing a control/status value where it shouldn’t. That being said,
I’m happy still guessing…Would this type of corruption be consistent with
a RAM issue in your opinion?

Thanks!

-scott
OSR
@OSRDrivers

Have you looked at CPU Microcode updates? We just spent weeks at work diagnosing what should’ve been an “impossible” crash, only to realize it was a microcode bug related to power transitions on a recent CPU.

Have you tried the “scientific method” with these machines? That is, remove more and more hardware and more and more software/drivers until the crashes stop. For example, are there crashes if the users never log in? Are there crashes if you boot into the Windows Recovery Environment? You can even go build a native app that gets launched by SMSS and never returns (or waits on a keystroke) and see if crashes still happen at that point.


Best regards,
Alex Ionescu

Have you checked the return policy from the OEM? It sounds like you got a bunch of broken systems.

While many people like to round on Microsoft, Windows is not expected to crash daily, and neither are drivers from major manufacturers (Intel qualifies), so your most likely root cause is bad hardware (firmware bugs?). The fact that this happens during power transitions reinforces this assertion: as well as being difficult for driver writers to get right (thank you again, KMDF), power transitions also expose problems with non-compliant hardware.

I once had a long conversation with a co-worker about a particular system that he was having a problem with. He said it worked perfectly with Linux but Windows crashed during install every time - what was wrong with the Windows installer? It turned out that the graphics card installed in the system had a nasty bug where a particular change in graphics mode caused it to overwrite random physical memory, and Windows setup just happened to hit this perfect combination while Linux never did.


From: xxxxx@lists.osr.com on behalf of xxxxx@gmail.com
Sent: November 25, 2015 9:42 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Memory Corruption Mystery: Any Ideas?


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

(Happy Holidays everyone :))

Nothing obvious in the physically contiguous page above or below.

One interesting data point is that we never found a case where the
surrounding virtually contiguous pages were actually physically contiguous.
This made it feel like a stray DMA overrun or underrun, though again we
never found the evidence for it.

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

Are there any fancy drivers in the image? Like USB 3.0.

Also, when you see these corrupted pages, is there a pattern in the
beginning of the page or in the end of the previous page?

We searched for the sequence in the “suspect” driver list (NIC, video, etc.)
using IDA Pro, though it was a long shot. We found various instances of it,
though just through static analysis it was impossible to say if it was even
related. Not enough hours in the day to do a complete reversing job on every
driver :stuck_out_tongue:

-scott
OSR
@OSRDrivers

“Andrey Bazhan” wrote in message news:xxxxx@ntdev…

Have you tried to narrow down the culprit by running

!for_each_module ".echo @#ModuleName; s-b @#Base @#End D8 0F 00 00"

wrote in message news:xxxxx@ntdev…

I discounted this as being a RAM problem due to the consistency and the
pattern and the bad offset. It really “feels” like a device (or possibly
driver) writing a control/status value where it shouldn’t. That being said,
I’m happy still guessing…Would this type of corruption be consistent with
a RAM issue in your opinion?

Thanks!

-scott
OSR
@OSRDrivers

> This made it feel like a stray DMA overrun or underrun, though again we
> never found the evidence for it.

DMA verifier?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com