Memory Corruption Mystery: Any Ideas?

Have you looked at CPU Microcode updates? We just spent weeks at work diagnosing what should’ve been an “impossible” crash, only to realize it was a microcode bug related to power transitions on a recent CPU.

Have you tried the “scientific method” with these machines? That is, remove more and more hardware and more and more software/drivers until the crashes stop. For example, are there crashes if the users never log in? Are there crashes if you boot into the Windows Recovery Environment? You can even build a native app that gets launched by SMSS and never returns (or waits on a keystroke) and see if crashes still happen at that point.


Best regards,
Alex Ionescu

Have you checked the return policy from the OEM? It sounds like you got a bunch of broken systems.

While many people like to round on Microsoft, Windows is not expected to crash daily, and neither are drivers from major manufacturers (Intel qualifies), so your most likely root cause is bad hardware (firmware bugs?). The fact that this happens during power transitions reinforces this assertion: power transitions are difficult for driver writers to get right (thank you again, KMDF), and they also expose problems with non-compliant hardware.

I once had a long conversation with a co-worker about a particular system that he was having a problem with. He said it worked perfectly with Linux but Windows crashed during install every time, so what was wrong with the Windows installer? It turned out that the graphics card installed in the system had a nasty bug where a particular change in graphics mode caused it to overwrite random physical memory, and Windows setup just happened to hit this perfect combination while Linux never did.


From: xxxxx@lists.osr.com on behalf of xxxxx@gmail.com
Sent: November 25, 2015 9:42 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Memory Corruption Mystery: Any Ideas?

Contributing some information to the thread, as I work for the “customer”. We are very grateful for the work OSR has done troubleshooting our mystery (although it did take a while for us to get organized, start collecting hundreds of BSOD dumps, and convince OSR that we needed their help).

Network card
- The M93p’s have an Intel I217-LM. They are connected to Cisco 2960s switches on gigabit ports.
- Intel drivers: we had version 12.11.96.1 for most of the year, then updated to 12.12.80.1920, followed by 12.13.17.7 recently. Using default driver settings. Jumbo frames are not enabled.

USB
- The M93p only has physical USB3 ports (the Lenovo hardware maintenance manual lists an optional USB2 port, which we don’t have).

USB info as reported by msinfo32.
Intel(R) USB 3.0 eXtensible Host Controller - 0100 (Microsoft) PCI\VEN_8086&DEV_8C31&SUBSYS_30A317AA&REV_04\3&11583659&0&A0
Intel(R) 8 Series/C220 Series USB EHCI #2 - 8C2D PCI\VEN_8086&DEV_8C2D&SUBSYS_30A317AA&REV_04\3&11583659&0&D0
Intel(R) 8 Series/C220 Series USB EHCI #1 - 8C26 PCI\VEN_8086&DEV_8C26&SUBSYS_30A317AA&REV_04\3&11583659&0&E8

All of our systems have at least 2 USB devices at all times: keyboard and mouse.

USB drivers are the Microsoft Windows 8.1 x64 Enterprise drivers:
USBXHCI.SYS

The one critical piece of information Scott has omitted (as it might lead to conjecture) is that we recently disabled “Turn off Monitor after Idle” in the Windows power profile.

The BSODs all but stopped!

We did this after OSR observed that the memory scribble BSODs were clustered around user logon and power transition events. Our PCs are on 24/7 with weekly reboots. Users will log off or stay logged on at the end of the day, and prior to our recent changes the monitor would power off after 15 min. The user returns in the morning, the monitor wakes up, and the user logs on. We are office workers; a typical person uses the Office 2013 suite (Outlook, Word, Excel) all day. Nothing fancy.

All of our M93p’s are on the High performance profile; we don’t sleep or hibernate. Prior to Scott’s recommendation on our power settings, we had Turn off Monitor after 15 min of idle, USB low power mode, and HD power-off after idle. These settings have now all been disabled.

After making this change about 1 week ago and rebooting all 1000+ systems, the BSODs stopped.

Not knowing at the time which action changed the behavior (we updated the NIC and Intel storage drivers and made the power changes all at the same time), 1 week later we re-enabled the monitor power-off at idle.

Not more than 30 min after making this change we had our first BSOD with scribbled memory. Upon seeing this change in behavior we reversed course and disabled the monitor power-off again, then rebooted all systems. We have only encountered one BSOD since, in about 6 days. The “normal” trend line for these systems prior was about 2-6 BSODs per day, sometimes hitting peaks of 10-15 unique machines BSODing per day.

The other interesting part of this problem is that we have about 100+ Dell E7440 laptops; these machines run the same image as our M93p’s. At image deployment (SCCM OSD), different driver packages are injected into the image. (Lenovo and Dell both provide driver packages for SCCM deployments; these always contain outdated drivers that are supposed to be vetted and tested.)

Suspecting possible bad drivers, we now find the most recent drivers for our hardware on the catalog.update.microsoft.com site and download and deploy these versions.

Post image, we updated the drivers for common hardware (NIC, storage, Intel HD iGPU) using the same driver for both the M93p and the E7440. Not a single “memory scribble” BSOD has been found on the E7440.

Last piece of information regarding our problem: we never knew we had a serious issue with BSODs until we started looking for them and collecting the dumps. But we always knew we had, and still have, a serious issue with applications from A to Z crashing on our systems. About 60-70% of these crashes are access violations (exception code 0xC0000005).

Example (taken from Windows AppCrash events from various PCs): things keep crashing with exception code 0xc0000005.

Although our BSODs have stopped since we made the power change, the general user-mode crashes have not. Are they being caused by the same memory scribble?

Date-Time, ProgramName, Module, Exception code
11/24/2015 08:24:11 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 08:37:40 WINWORD.EXE wwlib.dll c0000005
11/24/2015 08:43:25 CcmExec.exe ntdll.dll c0000005
11/24/2015 08:58:29 AUDIODG.EXE WMALFXGFXDSP.dll c0000005
11/24/2015 08:58:35 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 09:00:27 wfcrun32.exe ntdll.dll c0000005
11/24/2015 09:12:59 xdcla.exe Profiler.dll c0000135
11/24/2015 09:19:16 WINWORD.EXE mfc100u.dll c0000005
11/24/2015 09:22:30 OUTLOOK.EXE combase.dll c0000005
11/24/2015 09:26:06 lync.exe ntdll.dll c0000005
11/24/2015 09:27:09 IEXPLORE.EXE ntdll.dll c0000005
11/24/2015 09:30:40 OUTLOOK.EXE MSVCR100.dll 40000015
11/24/2015 09:33:23 OUTLOOK.EXE mso.dll c0000602
11/24/2015 09:39:08 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 09:45:01 Acrobat.exe ntdll.dll c0000374
11/24/2015 09:47:31 OUTLOOK.EXE mso.dll c0000602
11/24/2015 09:51:32 IEXPLORE.EXE MSHTML.dll c0000005
11/24/2015 09:52:52 IEXPLORE.EXE MSHTML.dll c0000005
11/24/2015 09:53:01 IEXPLORE.EXE MSHTML.dll c0000005
11/24/2015 09:55:01 OUTLOOK.EXE unknown c0000005
11/24/2015 09:58:43 ppscanmg.exe KERNELBASE.dll e06d7363
11/24/2015 10:03:15 IEXPLORE.EXE ntdll.dll c0000005
11/24/2015 10:06:20 Acrobat.exe ntdll.dll c0000374
11/24/2015 10:09:12 IEXPLORE.EXE igd10iumd32.dll c0000005
11/24/2015 10:09:24 IEXPLORE.EXE ntdll.dll c0000409
11/24/2015 10:10:14 splwow64.exe KERNELBASE.dll e06d7363
11/24/2015 10:11:04 IEXPLORE.EXE Flash.ocx c0000005
11/24/2015 10:11:24 svchost.exe_Dnscache ntdll.dll c0000008
11/24/2015 10:21:36 OUTLOOK.EXE ntdll.dll c0000374
11/24/2015 10:24:30 IEXPLORE.EXE ntdll.dll c0000005
11/24/2015 10:24:40 OUTLOOK.EXE mso.dll c0000602
11/24/2015 10:31:21 OUTLOOK.EXE ntdll.dll c0000374
11/24/2015 10:31:30 WINWORD.EXE unknown c0000005
11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c000041d
11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c0000005
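
One way to get a feel for the mix of failures in a listing like the one above is to tally it by exception code. The sketch below is only an illustration. The symbolic names are, as far as I know, the standard NTSTATUS names for these values (0xE06D7363 is the code the Microsoft C++ runtime uses for a thrown C++ exception), the parsing assumes the whitespace-separated layout used in this post, and the three sample rows are copied from the listing.

```python
from collections import Counter

# Assumed mapping from the hex codes in the listing to symbolic names.
STATUS_NAMES = {
    "c0000005": "STATUS_ACCESS_VIOLATION",
    "c0000374": "STATUS_HEAP_CORRUPTION",
    "c0000409": "STATUS_STACK_BUFFER_OVERRUN",
    "c0000602": "STATUS_FAIL_FAST_EXCEPTION",
    "c0000135": "STATUS_DLL_NOT_FOUND",
    "c0000008": "STATUS_INVALID_HANDLE",
    "c000041d": "STATUS_FATAL_USER_CALLBACK_EXCEPTION",
    "40000015": "STATUS_FATAL_APP_EXIT",
    "e06d7363": "C++ exception (MSVC runtime)",
}


def tally_exception_codes(lines):
    """Count rows per exception code; the code is the last field of a row."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # A row needs at least: date, time, program, module, code.
        if len(fields) < 5:
            continue
        counts[fields[-1].lower()] += 1
    return counts


# Three rows copied from the listing above.
sample_rows = [
    "11/24/2015 08:37:40 WINWORD.EXE wwlib.dll c0000005",
    "11/24/2015 09:45:01 Acrobat.exe ntdll.dll c0000374",
    "11/24/2015 10:34:41 Concordance Image.exe Concordance Image.exe c0000005",
]

for code, count in tally_exception_codes(sample_rows).most_common():
    print(code, STATUS_NAMES.get(code, "unknown"), count)
```

Taking the code from the last field keeps program names that contain spaces (such as the Concordance rows) from confusing the parse.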

Naim


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

(Happy Holidays everyone :))

Nothing obvious in the physically contiguous page above or below.

One interesting data point is that we never found a case where the
surrounding virtually contiguous pages were actually physically contiguous.
This made it feel like a stray DMA overrun or underrun, though again we
never found the evidence for it.

-scott
OSR
@OSRDrivers

wrote in message news:xxxxx@ntdev…

Are there any fancy drivers in the image? Like USB 3.0.

Also, when you see these corrupted pages, is there a pattern in the
beginning of the page or in the end of the previous page?

We searched for the sequence in the “suspect” driver list (NIC, video, etc.)
using IDA Pro, though it was a long shot. We found various instances of it,
though just through static analysis it was impossible to say if it was even
related. Not enough hours in the day to do a complete reversing job on every
driver :P

-scott
OSR
@OSRDrivers

“Andrey Bazhan” wrote in message news:xxxxx@ntdev…

Have you tried to narrow down the culprit by running

!for_each_module ".echo @#ModuleName; s-b @#Base @#End D8 0F 00 00"
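
The debugger command above searches each loaded module for the bytes D8 0F 00 00, which is the value 0x00000FD8 stored little-endian. A rough offline equivalent is sketched below, assuming you have a driver's bytes in memory (for example, read from its .sys file); `find_pattern` is a hypothetical helper name, not part of any tool mentioned in this thread. As with the static search, a hit only shows that the constant appears in the module, not that the module is responsible.

```python
import struct

# 0x00000FD8 as a little-endian DWORD: the same bytes as "D8 0F 00 00".
PATTERN = struct.pack("<I", 0x0FD8)


def find_pattern(data, pattern=PATTERN):
    """Return every offset in data where pattern starts (overlaps allowed)."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


# Tiny synthetic buffer standing in for a module image (illustration only).
blob = b"\x00\xd8\x0f\x00\x00\xd8\x0f\x00\x00"
print([hex(off) for off in find_pattern(blob)])
```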

wrote in message news:xxxxx@ntdev…

I discounted this as being a RAM problem due to the consistency and the
pattern and the bad offset. It really “feels” like a device (or possibly
driver) writing a control/status value where it shouldn’t. That being said,
I’m happy still guessing…Would this type of corruption be consistent with
a RAM issue in your opinion?

Thanks!

-scott
OSR
@OSRDrivers

> This made it feel like a stray DMA overrun or underrun, though again we
> never found the evidence for it.

DMA verifier?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

You might read the errata notes on the processor and chipsets - these are
often fairly “concerning” if you read them. Other conspiracy theories in
the firmware regions that might be worth considering: SMM or VMM firmware
that might be enabled and embedded “management engines” (vpro) might be
worth looking at.

t.


FWIW, I’ve encountered a problem that seems somewhat similar:

  1. Very occasional random BSOD. On one crashdump I ran !analyze -v and these 8 bytes were changed in NTFS.sys in memory:

Expected:
ff c3 cc cc 48 89 5c 24

Actual:
04 00 00 04 10 00 00 00

This seems to be a somewhat similar pattern.

  2. On a non-system drive, I decompressed a large archive and one decompressed file differed from the original by 7 bytes:

Expected:
50 ea 40 00 50 ea 40 00

Actual:
04 00 00 04 10 00 00 00

Again, the same pattern. I rebooted and compared the files again and they were identical (both ‘expected’), hopefully ruling out any disk issue.

My system:

Lenovo ThinkServer TS140 (70A4001LUX)
Xeon E3-1225 v3
2x4GB DDR3 ECC memory (which passes memtest86 and Windows Memory Diagnostics)
BIOS A8A (which doesn’t seem to be publicly listed anymore)
Microcode Revision 1C (7/3/2014)
Windows 10 x64

Thanks.

That’s in fact the EXACT same pattern! Just displayed as bytes:

3: kd> db ffffc00059324fd8 L8
ffffc00059324fd8 04 00 00 04 10 00 00 00

Instead of a QWORD:

3: kd> dq ffffc00059324fd8 L1
ffffc00059324fd8 00000010`04000004
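
The equivalence between the two displays is just x64 little-endian byte order. A minimal sketch of the conversion follows; `windbg_qword` is a hypothetical helper that mimics the debugger's backtick-separated formatting.

```python
import struct

# The eight corrupted bytes, in memory order.
corrupt = bytes.fromhex("04 00 00 04 10 00 00 00")

# Interpret them as one little-endian 64-bit value, and as two 32-bit halves.
(qword,) = struct.unpack("<Q", corrupt)
low_dword, high_dword = struct.unpack("<II", corrupt)


def windbg_qword(value):
    """Format a 64-bit value as high`low, the way dq prints it."""
    return f"{value >> 32:08x}`{value & 0xFFFFFFFF:08x}"


print(windbg_qword(qword))
print(hex(low_dword), hex(high_dword))
```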

What’s the address of the corruption?

-scott
OSR
@OSRDrivers

I have a couple of TS140 servers, and one device in common between the TS140 and the machines OSR is looking at is the Intel NIC, I believe an i217. I suppose since they are both Lenovo machines, the BIOS might be from the same origins. I assume an Intel i217 NIC is available on an external card, so a PCIe analyzer could be inserted and perhaps set to trigger on this data pattern. If this data pattern is normally written by the i217 driver, that adds a little clue that the corruption might be coming from this device.

Jan


The address is fffff801`128f2fd8. Incredibly it ends with 0xFD8 as described in your first post.

CHKIMG_EXTENSION: !chkimg -lo 50 -d !NTFS
fffff801128f2fd8-fffff801128f2fdf 8 bytes - NTFS!NtOfsCollateUlongs+48
[ff c3 cc cc 48 89 5c 24:04 00 00 04 10 00 00 00]
8 errors : !NTFS (fffff801128f2fd8-fffff801128f2fdf)
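
The "ends with 0xFD8" observation is a statement about the offset within a 4 KB page. A small sketch of that check, using the two corrupted addresses quoted in this thread (the 4 KB page size is the normal x64 small-page size; `page_offset` is a name made up for this example):

```python
PAGE_SIZE = 0x1000  # 4 KB small pages on x64


def page_offset(virtual_address):
    """Return the byte offset of an address within its 4 KB page."""
    return virtual_address & (PAGE_SIZE - 1)


# The address from the kd output earlier in the thread, and the NTFS.sys one.
addresses = [0xFFFFC00059324FD8, 0xFFFFF801128F2FD8]

for va in addresses:
    off = page_offset(va)
    print(f"{va:016x}: offset {off:#x}, {PAGE_SIZE - off} bytes before page end")
```

Both addresses land at the same offset, so the 8-byte write occupies 0xFD8 through 0xFDF of the page in each case.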

My other BSODs were CRITICAL_STRUCTURE_CORRUPTION and CRITICAL_PROCESS_DIED which seem like they could be symptoms from the same root cause.

It seems that our systems are somewhat similar. Both Lenovo, both Haswell, both Intel 4600 (mine is P4600?), similar NIC (though I have Jumbo frames enabled), both support USB 3 (though I have no USB 3 devices), power settings configured to turn monitor off after idle, etc.

I only have one machine so it takes weeks to repro.

Yeah, sometimes you wish it was 24 * 2 in a day :). By the way, this is a very
interesting case and it would be really cool if you could write a blog post
about it.


xxxxx@gmail.com wrote:

The address is fffff801`128f2fd8. Incredibly it ends with 0xFD8 as described in your first post.

This is like watching an episode of “CSI: Cyber.”


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Maybe it’s caused by system firmware. The BIOS can still get control when an SMI is triggered by some driver, and the BIOS’s SMI handler can modify any memory. Is it possible that the BIOS did something wrong in the OS run-time phase?

Definitely! It might end up being more than one, I think it could
practically be a book at this point :)

-scott
OSR
@OSRDrivers


I just wondered if it is possible for a logic analyser to trap an access
with this data to an address that matches the profile, and generate a system
interrupt so the relevant code can be dumped before proceeding? Maybe just a
custom FPGA?

Mike


I’m definitely not a hardware person, but I’ve brushed up on this sort of
thing in the past. I’m pretty sure that a high end mainframe TLA will do
the first part but only passively.

Also hugely expensive.

I could be wrong about any or all of this.

mm

Oh Great!!! Can’t wait!!!


My DNA is of the wrong type to make intelligent contributions to this thread. I sure am trying to get Lenovo interested in this problem, but a customer with 1000 units is nothing compared to someone who has thousands of units in the field.

Some of my observations:

The latest Lenovo BIOS update for the M93p, FBJYB9USA, includes microcode update 1D. This is the same microcode as in our current BIOS, FBJYB6USA (as reported by the AMI tool MMTOOL).

The CPU ID for the Intel i7-4765T is: 00306C3.

The latest Intel Linux microcode file is dated Oct 11 2015, and it contains the file cpu000306c3_plat00000032_ver0000001e_date20150813.bin: an updated version!
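
For comparing revisions across many files, the fields can be pulled out of the file name. The sketch below assumes the name encodes the CPUID signature, a platform mask, the microcode revision and a date, which is how I read it; `parse_microcode_name` is a made-up helper, not something shipped with the Intel package.

```python
import re
from datetime import date

# Assumed layout: cpu<signature>_plat<mask>_ver<revision>_date<YYYYMMDD>.bin
NAME_RE = re.compile(
    r"cpu([0-9a-f]{8})_plat([0-9a-f]{8})_ver([0-9a-f]{8})_date(\d{4})(\d{2})(\d{2})\.bin"
)


def parse_microcode_name(name):
    """Split a microcode file name into numeric fields, or return None."""
    match = NAME_RE.fullmatch(name.lower())
    if match is None:
        return None
    sig, plat, ver, year, month, day = match.groups()
    return {
        "cpuid": int(sig, 16),
        "platform": int(plat, 16),
        "revision": int(ver, 16),
        "date": date(int(year), int(month), int(day)),
    }


info = parse_microcode_name("cpu000306c3_plat00000032_ver0000001e_date20150813.bin")
bios_revision = 0x1D  # revision reported for the current BIOS in this post
print(f"revision {info['revision']:#x}, newer than BIOS: {info['revision'] > bios_revision}")
```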

My question was: how do we update the microcode manually? (We can’t modify the BIOS, and Microsoft is slow and many versions behind with OS microcode updates.)

My answer: I found a VMware lab utility, https://labs.vmware.com/flings/vmware-cpu-microcode-update-driver, which can update the microcode of a Windows system. This worked on my M93p; I am now running with microcode 1E. (We can’t install this on every M93p, as the microcode has to be injected at every boot, but we’ve asked Lenovo for an updated BIOS with 1E.)

I also discovered that Windows 10 Enterprise (10240) installed on an M93p with the latest updates shows a microcode of 1E. (Interesting that MS has rolled out the update to Win 10 but not Win 8.1.)

Sort of. I keep thinking of the “scientific method”, but the M93p has a form factor similar to a laptop; it has no removable hardware. We have updated our core drivers to the same versions that are running on our Dell E7440’s (the corruption unique to the M93p was never seen on them).

Now that we know the corruption BSODs are linked to monitor power events, we are trying to force a machine to cycle the monitor on and off thousands of times per day. If the BSOD can be triggered, it would certainly accelerate the troubleshooting.

But I have not found any way to emulate a real mouse or keyboard key press that will wake a monitor.

The WOL NIC settings don’t wake a system that is fully on - with only a monitor turned off by the idle setting.

Is there any way to generate a hardware key-press or mouse movement without any specialized equipment? We could set the monitor to power off after 1 second and send a wake key or mouse event every other second, generating 43200 monitor on/off cycles per day.
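
The 43200 figure is simply the number of 2-second periods in a day. A quick sketch of the arithmetic, in case other timeouts are tried (`cycles_per_day` is a name invented for this example):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cycles_per_day(period_seconds):
    """Number of complete monitor off/on cycles that fit in one day."""
    return SECONDS_PER_DAY // period_seconds


# One wake event every other second means one off/on cycle per 2 seconds.
print(cycles_per_day(2))
```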

Naim

— Snip —
But I have not found any way to emulate a real mouse or keyboard key press that will wake a monitor.

Is there any way to generate a hardware key-press or mouse movement without any specialized equipment? We could set the monitor to power off after 1 second and send a wake key or mouse event every other second, generating 43200 monitor on/off cycles per day.

— End Snip —

I don’t know how “specialized” you consider “specialized”, but many hobbyist-type microcontroller demo/prototype boards have USB device capabilities, and just about all of them have HID example code to emulate either a keyboard or a mouse or both. It would be almost trivial to modify one of those examples to inject a mouse movement or keyboard input on a schedule (or randomly). I have used the LPCXpresso boards from NXP (about $20 US) to do very similar things. ST Micro, RasPi, etc. all have similar offerings. I wouldn’t be surprised if OSR has something very similar as well for their own testing.

Greg

Thanks again everyone for the responses! This ended up being a very cool
thread :)

I would absolutely LOVE to throw some hardware at the problem and use
bus/logic analyzers to track the problem down (that’s probably where we
would head if this were a development project). However, that’s likely to be
above and beyond what would be possible for this engagement.

To summarize the current “resolution”:

A while ago we plotted out the time of day that the crashes were happening
and noticed that they seemed to cluster around 7-9AM. Dumping the security
event log from the crash dumps (“!wmitrace.logdump Eventlog-Security”),
there was significant evidence to imply that the crashes were often
happening shortly after the user logged in to the machine. This was even
true for crashes that happened at other times in the day.
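
The time-of-day plot described above can be approximated with a simple per-hour count. This is only a sketch: the timestamps are hypothetical placeholders (not values taken from the dumps), and the timestamp format is an assumption.

```python
from collections import Counter
from datetime import datetime


def crashes_by_hour(timestamps, fmt="%m/%d/%Y %H:%M:%S"):
    """Count crashes per hour of day (0-23)."""
    hours = Counter()
    for stamp in timestamps:
        hours[datetime.strptime(stamp, fmt).hour] += 1
    return hours


def render_histogram(hours):
    """Return one text row per hour that has at least one crash."""
    return "\n".join(f"{hour:02d}:00 {'#' * hours[hour]}" for hour in sorted(hours))


# Hypothetical crash times, for illustration only.
sample = [
    "11/02/2015 07:41:10",
    "11/02/2015 08:05:33",
    "11/03/2015 07:58:02",
    "11/03/2015 14:20:45",
]

print(render_histogram(crashes_by_hour(sample)))
```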

Of course, due to the fact that this is a corruption, it could just be that
the login is what causes someone to notice the corruption and crash. But,
what else happens when someone comes in at 7AM and logs in? Power events! So
we checked the power policy (!popolicy) and noticed that the systems were
not set to go to standby or hibernate. However, they were set to idle the
monitor and disk. Also, checking outstanding power IRPs (!poaction) we could
see that the USB devices in the system were set to idle.

To rule any of this out, we had the client shut all this stuff off and, to
our surprise, the crashes stopped. The monitor idle timer was then turned
back on (because it’s annoying to have your monitors on all the time) and
the crashes started again.

It still sort of feels like this can’t be it, but who are we to argue? We’ve
also reached the end of our engagement, so this might go down as one of the
great mysteries. However, now that we have proof that *another* Lenovo
system has the problem, the client may be able to get some traction on
resolving the issue.

-scott
OSR
@OSRDrivers