Memory Corruption Mystery: Any Ideas?

mm1 · December 2, 2015, 4:26am

I’m definitely not a hardware person, but I’ve brushed up on this sort of
thing in the past. I’m pretty sure that a high end mainframe TLA will do the
first part but only passively.

Also hugely expensive.

I could be wrong about any or all of this.

mm
On Dec 2, 2015 4:10 AM, “Mike Kemp” wrote:

> I just wondered if it is possible for a logic analyser to trap an access
> with this data to an address that matches the profile, and generate a
> system interrupt so the relevant code can be dumped before proceeding?
> Maybe just a custom FPGA?
>
> Mike
>
> ----- Original Message ----- From: Scott Noone
> Newsgroups: ntdev
> To: Windows System Software Devs Interest List
> Sent: Wednesday, December 02, 2015 3:33 AM
> Subject: Re:[ntdev] Memory Corruption Mystery: Any Ideas?
>
> Definitely! It might end up being more than one, I think it could
> practically be a book at this point
>
> -scott
> OSR
> @OSRDrivers
>
> “Andrey Bazhan” wrote in message news:xxxxx@ntdev…
>
> Yeah, sometimes you wish it was 24 * 2 in a day :). By the way, this is
> very interesting case and it would be really cool if you could write a
> blog post about it.
>
> “Scott Noone” wrote in message news:xxxxx@ntdev…
>
> We searched for the sequence in the “suspect” driver list (NIC, video,
> etc.) using IDA Pro, though it was a long shot. We found various instances
> of it, though just through static analysis it was impossible to say if it
> was even related. Not enough hours in the day to do a complete reversing
> job on every driver
>
> -scott
> OSR
> @OSRDrivers
>
> “Andrey Bazhan” wrote in message news:xxxxx@ntdev…
>
> Have you tried to narrow down the culprit by running
>
> !for_each_module ".echo @#ModuleName; s-b @#Base @#End D8 0F 00 00"
>
> wrote in message news:xxxxx@ntdev…
>
> I discounted this as being a RAM problem due to the consistency and the
> pattern and the bad offset. It really “feels” like a device (or possibly
> driver) writing a control/status value where it shouldn’t. That being said,
> I’m happy still guessing…Would this type of corruption be consistent with
> a RAM issue in your opinion?
>
> Thanks!
>
> -scott
> OSR
> @OSRDrivers

Andrey_Bazhan · December 2, 2015, 6:30am

Oh Great!!! Can’t wait!!!

“Scott Noone” wrote in message news:xxxxx@ntdev…

Definitely! It might end up being more than one, I think it could
practically
be a book at this point

-scott
OSR
@OSRDrivers

OSR_Community_User · December 2, 2015, 10:26am

My DNA is of the wrong type to make intelligent contributions to this thread. I sure am trying to get Lenovo interested in this problem, but a customer with 1000 units is nothing compared to someone who has thousands of units in the field.

Some of my observations:

The latest Lenovo BIOS update for the M93p, FBJYB9USA, includes microcode update 1D. This is the same microcode as our current BIOS, FBJYB6USA (as reported by the AMI tool MMTOOL).

The CPU ID for the Intel i7-4765T is: 00306C3.

The latest Intel Linux microcode file is dated Oct 11 2015, and it contains the file cpu000306c3_plat00000032_ver0000001e_date20150813.bin, an updated version!

My question was: how do we update the microcode manually? (We can’t modify the BIOS, and Microsoft is slow and many versions behind with OS microcode updates.)

My answer: I found a VMware Labs utility (https://labs.vmware.com/flings/vmware-cpu-microcode-update-driver) which can update the microcode of a Windows system. This worked on my M93p; I am now running with microcode 1E. (We can’t install this on every M93p, as the microcode has to be injected at every boot, but we’ve asked Lenovo for an updated BIOS with 1E.)

I also discovered that Windows 10 Enterprise (10240) installed on an M93p with the latest updates shows a microcode of 1E. (Interesting that MS has rolled out the update to Win 10 but not Win 8.1.)

Sort of, I keep thinking of the “scientific method”: the M93p has a form factor that is similar to a laptop, and it has no removable hardware. We have updated our core drivers to the same versions that are running in our Dell E7440s (the corruption is unique to the M93p and was never seen on them).

Now that we know the corruption BSODs are linked with monitor power events, we are trying to force a machine to cycle the monitor on and off thousands of times per day. If the BSOD can be triggered, it would certainly accelerate the troubleshooting.

But I have not found any way to emulate a real mouse or keyboard key press that will wake a monitor.

The WOL NIC settings don’t wake a system that is fully on - with only a monitor turned off by the idle setting.

Is there any way to generate a hardware key press or mouse movement without any specialized equipment? We could set the monitor to power off after 1 second and send a wake key or mouse event every other second, generating 43200 monitor on/off cycles per day.
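
As a quick sanity check on that figure, here is a minimal Python sketch. The function name and parameters are hypothetical, chosen only for illustration. It assumes one full cycle is the idle timeout plus the delay before the wake event, and it ignores however long the monitor itself takes to react.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def cycles_per_day(idle_timeout_s: float, wake_delay_s: float) -> int:
    """Number of complete off/on cycles that fit in one day.

    One cycle = wait idle_timeout_s for the monitor to power off,
    then wait wake_delay_s before sending the wake event.
    """
    period_s = idle_timeout_s + wake_delay_s
    return int(SECONDS_PER_DAY // period_s)


# 1 second idle timeout, wake event 1 second later: one cycle every 2 seconds.
print(cycles_per_day(1, 1))  # 43200
```

In practice the monitor needs some time to power down and come back, so the achievable rate is probably lower than this upper bound.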

Naim

Gregory_G_Dyess · December 2, 2015, 10:53am

— Snip —
But I have not found any way to emulate a real mouse or keyboard key press that will wake a monitor.

Is there any way to generate a hardware key press or mouse movement without any specialized equipment? We could set the monitor to power off after 1 second and send a wake key or mouse event every other second, generating 43200 monitor on/off cycles per day.

— End Snip —

I don’t know how “specialized” you consider “specialized”, but many hobbyist-type microcontroller demo/prototype boards have USB device capabilities, and just about all of them have HID example code to emulate either a keyboard or a mouse or both. It would be almost trivial to modify one of those examples to inject a mouse movement or keyboard input on a schedule (or randomly). I have used the LPCXpresso boards from NXP (about $20 US) to do very similar things. ST Micro, RasPi, etc. all have similar offerings. I wouldn’t be surprised if OSR has something very similar as well for their own testing.

Greg

Scott_Noone_OSR · December 2, 2015, 12:34pm

Thanks again everyone for the responses! This ended up being a very cool
thread

I would absolutely LOVE to throw some hardware at the problem and use
bus/logic analyzers to track the problem down (that’s probably where we
would head if this were a development project). However, that’s likely to be
above and beyond what would be possible for this engagement.

To summarize the current “resolution”:

A while ago we plotted out the time of day that the crashes were happening
and noticed that they seemed to cluster around 7-9AM. Dumping the security
event log from the crash dumps (“!wmitrace.logdump Eventlog-Security”),
there was significant evidence to imply that the crashes were often
happening shortly after the user logged in to the machine. This was even
true for crashes that happened at other times in the day.
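
For anyone who wants to do the same kind of time-of-day plot, a minimal Python sketch follows. The timestamps are made up purely for illustration (they are not the client's data), and `crashes_by_hour` is a hypothetical helper name. In practice the times would come from the dump files or the event log.

```python
from collections import Counter
from datetime import datetime

# Hypothetical crash timestamps, invented for this example only.
crash_times = [
    "2015-11-02 07:12",
    "2015-11-03 07:48",
    "2015-11-03 08:05",
    "2015-11-04 08:31",
    "2015-11-05 13:20",
]


def crashes_by_hour(timestamps):
    """Count crashes per hour of day (0-23)."""
    hours = (datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps)
    return Counter(hours)


# Crude text histogram: one '#' per crash in that hour.
for hour, count in sorted(crashes_by_hour(crash_times).items()):
    print(f"{hour:02d}:00 {'#' * count}")
```

A cluster in a narrow window is only a hint, since a corruption may be noticed long after it happens.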

Of course, due to the fact that this is a corruption, it could just be that
the login is what causes someone to notice the corruption and crash. But,
what else happens when someone comes in at 7AM and logs in? Power events! So
we checked the power policy (!popolicy) and noticed that the systems were
not set to go to standby or hibernate. However, they were set to idle the
monitor and disk. Also, checking outstanding power IRPs (!poaction) we could
see that the USB devices in the system were set to idle.

To rule any of this out, we had the client shut all this stuff off and, to
our surprise, the crashes stopped. The monitor idle timer was then turned
back on (because it’s annoying to have your monitors on all the time) and
the crashes started again.

It still sort of feels like this can’t be it, but who are we to argue? We’ve
also reached the end of our engagement, so this might go down as one of the
great mysteries. However, now that we have proof that *another*
Lenovo system has the problem, the client may be able to get some traction on
resolving the issue.

-scott
OSR
@OSRDrivers

Tim_Roberts · December 2, 2015, 12:51pm

Scott Noone wrote:

To rule any of this out, we had the client shut all this stuff off and, to
our surprise, the crashes stopped. The monitor idle timer was then turned
back on (because it’s annoying to have your monitors on all the time) and
the crashes started again.

It still sort of feels like this can’t be it, but who are we to argue?

Since the beginning of time, display driver writers have borne the brunt
of the blame for Windows crashes, often with good justification.

It is interesting that this “turn the monitor off” function is one of
the things we implemented in Windows 3.0 drivers a quarter of a century
ago. You wouldn’t think it would be a high risk area, but all it takes
is someone establishing a new locking rule and forgetting to update the
old crusty code that has worked for years.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Andrey_Bazhan · December 2, 2015, 2:45pm

I just wrote a couple of lines that will turn the monitor on and off.
Hope it will help.
http://www.andreybazhan.com/download/MonWake.zip
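
For readers who cannot fetch the archive, here is a minimal Python sketch of one common way to cycle a monitor from software. This is NOT the MonWake source, and how MonWake works internally is not stated in this thread. The sketch assumes that broadcasting WM_SYSCOMMAND with SC_MONITORPOWER powers the display down and that an injected relative mouse move counts as user activity and wakes it. Whether this exercises the same driver path as the idle timer is unknown. The function names are hypothetical.

```python
import ctypes
import sys
import time

# Win32 constants (values from the Windows SDK headers).
HWND_BROADCAST = 0xFFFF
WM_SYSCOMMAND = 0x0112
SC_MONITORPOWER = 0xF170
MONITOR_OFF = 2            # lParam for SC_MONITORPOWER: 2 = power off
MOUSEEVENTF_MOVE = 0x0001  # relative mouse movement


def monitor_off() -> None:
    """Ask the display to power down by broadcasting SC_MONITORPOWER."""
    # PostMessageW (not SendMessageW) so a hung window cannot block the loop.
    ctypes.windll.user32.PostMessageW(
        HWND_BROADCAST, WM_SYSCOMMAND, SC_MONITORPOWER, MONITOR_OFF)


def monitor_wake() -> None:
    """Inject a one-pixel mouse move and move back again."""
    user32 = ctypes.windll.user32
    user32.mouse_event(MOUSEEVENTF_MOVE, 1, 0, 0, 0)
    user32.mouse_event(MOUSEEVENTF_MOVE, -1, 0, 0, 0)


def cycle(count: int, off_s: float = 1.0, on_s: float = 1.0) -> None:
    """Run `count` off/on cycles with the given dwell times in seconds."""
    for _ in range(count):
        monitor_off()
        time.sleep(off_s)
        monitor_wake()
        time.sleep(on_s)


# Only touch the display when run directly on Windows.
if __name__ == "__main__" and sys.platform == "win32":
    cycle(10)
```

The dwell times probably need to be longer than one second on real hardware, since the panel takes a moment to change state.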

Liviu · December 2, 2015, 10:16pm

“Scott Noone” wrote:

Thanks again everyone for the responses! This ended up being a very cool
thread

Indeed, even when glanced at from 3 rings away.

The monitor idle timer was then turned back on …the crashes started again.

It still sort of feels like this can’t be it, but who are we to argue?

Wonder if downgrading to NT 3.51 would be an option

|| In Windows NT 4.0, the Window Manager, GDI, and Win32 Graphics
|| Device Drivers have been incorporated into the Windows NT Executive.

Liviu

Spencer_Low · December 11, 2015, 9:17pm

FWIW, I encountered this issue again on my Lenovo TS140. The Windows 10 1511 upgrade rebooted (a coincidence, I think), the machine was at the login screen, the monitor power save engaged, a few minutes later I walked up to the machine, moved the mouse and a second or two later it had a BSOD.

!analyze -v of the dump showed:

MEMORY_MANAGEMENT (1a)

Any other values for parameter 1 must be individually examined.

Arguments:
Arg1: 0000000000041792, A corrupt PTE has been detected. Parameter 2 contains the address of
the PTE. Parameters 3/4 contain the low/high parts of the PTE.
Arg2: fffff6bfffbf7fd8
Arg3: 0000001004000004
Arg4: 0000000000000000

Same bit pattern (04 00 00 04 10 00 00 00) and same low bits in the address (fd8).
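
The byte pattern and the offset can be double-checked with a few lines of Python. This is just a sketch of the arithmetic, using Arg2 and Arg3 from the bugcheck above; the variable names are mine. It also shows that the offset, packed as a little-endian DWORD, gives the D8 0F 00 00 sequence that was searched for earlier in the thread.

```python
import struct

pte_address = 0xFFFFF6BFFFBF7FD8  # Arg2 from the bugcheck
pte_value = 0x0000001004000004    # Arg3 from the bugcheck

# x64 is little-endian, so the low-order byte comes first in memory.
in_memory = pte_value.to_bytes(8, "little")
print(in_memory.hex(" "))  # 04 00 00 04 10 00 00 00

# Offset of the corrupted quadword within its 4 KiB page.
page_offset = pte_address & 0xFFF
print(hex(page_offset))  # 0xfd8

# The same offset stored as a little-endian 32-bit value.
print(struct.pack("<I", page_offset).hex(" "))  # d8 0f 00 00
```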

!pte fffff6bfffbf7fd8
VA 00007fff7effb000
PXE at FFFFF6FB7DBED7F8    PPE at FFFFF6FB7DAFFFE8    PDE at FFFFF6FB5FFFDFB8    PTE at FFFFF6BFFFBF7FD8
contains 00A0000000D08867  contains 0C200000015C1867  contains 4240000003A12867  contains 0000001004000004
pfn d08       ---DA--UWEV  pfn 15c1      ---DA--UWEV  pfn 3a12      ---DA--UWEV  not valid
                                                                                 Page has been freed

But I’m guessing that output is wrong due to the memory corruption.
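
A small Python sketch may help separate what in that output is pure address arithmetic from what depends on the corrupted value. It assumes the fixed x64 PTE base of 0xFFFFF68000000000, which applies to Windows builds before the base was randomized in Windows 10 1607 and is consistent with the addresses shown here. The helper names are hypothetical.

```python
PTE_BASE = 0xFFFFF68000000000  # assumption: fixed (non-randomized) PTE base
PTE_SIZE = 8                   # bytes per PTE on x64
PAGE_SHIFT = 12                # 4 KiB pages


def pte_address_to_va(pte_address: int) -> int:
    """Return the virtual address mapped by the PTE at pte_address."""
    index = (pte_address - PTE_BASE) // PTE_SIZE
    va = index << PAGE_SHIFT
    # Sign-extend bit 47 so kernel addresses come out canonical.
    if va & (1 << 47):
        va |= 0xFFFF000000000000
    return va


def set_bits(value: int) -> list:
    """List the bit positions that are set in a 64-bit value."""
    return [bit for bit in range(64) if (value >> bit) & 1]


pte_address = 0xFFFFF6BFFFBF7FD8
pte_value = 0x0000001004000004

print(f"{pte_address_to_va(pte_address):016x}")  # 00007fff7effb000
print(bool(pte_value & 1))                       # False: valid bit is clear
print(set_bits(pte_value))                       # [2, 26, 36]
```

The VA and the PXE/PPE/PDE/PTE addresses follow from the PTE address alone, so they should be trustworthy. The "not valid" and "Page has been freed" parts come from decoding the overwritten 64-bit value, so those are the parts most likely to be meaningless.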

!search 0000001004000004 output is available at https://drive.google.com/open?id=0B3A4jzWKJNSFTUlKVFpyb3NxS3M .

Let me know if there are any useful commands that I should run on the dump. (warning: I’m a user-mode windbg user, but not much of a kernel person.)

A few weeks ago I installed the latest BIOS (B3A) and latest Intel Graphics Driver (15.40.10.64.4300).

Thanks.

Maxim_S_Shatskih · December 13, 2015, 2:02pm

> Arg1: 0000000000041792, A corrupt PTE has been detected.

This is usually due to some “hemorroy” related to MDLs.

–
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Scott_Noone_OSR · December 14, 2015, 1:52pm

Yup, this fits the pattern of “the crash”. I don’t see anything in your
output that immediately begs additional research.

The client made the problem go away by turning off the monitor idle power
off event. This was “good enough” for them and they’re back to doing what
they care about (which is not diagnosing system issues, unfortunately :))

-scott
OSR
@OSRDrivers

Scott_Noone_OSR · December 14, 2015, 1:54pm

This crash fits the pattern of the corruption that we tracked down. In this
case, the bad value just happened to show up where a valid PTE should have
been.

-scott
OSR
@OSRDrivers

“Maxim S. Shatskih” wrote in message news:xxxxx@ntdev…

Arg1: 0000000000041792, A corrupt PTE has been detected.

This is usually due to some “hemorroy” related to MDLs.
