heap lock impossible state

MBond · January 19, 2016, 8:04pm

Recently a very strange UM problem has been bounced to my desk, and as I have seen nothing like it before, I thought I would ask if anyone else has.

The UM process in question occasionally fails on one particular machine (running Server 2012 with the latest patches) in a very strange way. All of the threads in the process eventually become blocked on the process heap lock, but when I break in and take a dump Windbg shows that lock as free and it does not appear to have any memory corruption. And as a test, I added functionality where I could trigger heap allocations externally after the main threads locked, and those allocations worked!

The threads are all blocked in WaitForSingleObject as a result of EnterCriticalSectionContended. This particular application has 200+ threads and runs for several days, processing millions of transactions, before it fails. It fails on only one server (hardware?), and I have never seen or heard of a similar failure mode.
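
For anyone checking the same "lock looks free but threads are waiting" state in a dump, here is a minimal sketch of decoding the raw LockCount field of a critical section. It assumes the bit layout described in the debugger documentation for Windows Server 2003 SP1 and later (bit 0 set means not locked, bit 1 set means no waiter has been woken, and the remaining bits hold the ones-complement of the waiter count). The names `lock_count_info`, `decode_lock_count` and `looks_like_lost_wakeup` are hypothetical, and the "lost wakeup" pattern is only a heuristic for what to look at, not a diagnosis of this case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a CRITICAL_SECTION LockCount value (hypothetical type). */
typedef struct {
    bool     locked;        /* bit 0 clear -> the section is owned        */
    bool     waiter_woken;  /* bit 1 clear -> a waiter has been woken     */
    uint32_t waiters;       /* ones-complement of the remaining 30 bits   */
} lock_count_info;

/* Decode a raw LockCount as read from a dump.
 * Assumes the post-Server-2003-SP1 layout; older systems differ. */
lock_count_info decode_lock_count(int32_t lock_count) {
    uint32_t raw = (uint32_t)lock_count;
    lock_count_info info;
    info.locked       = (raw & 1u) == 0;
    info.waiter_woken = (raw & 2u) == 0;
    info.waiters      = (~raw) >> 2;
    return info;
}

/* Heuristic only: the lock is free, threads are recorded as waiting,
 * and none of them has been woken. */
bool looks_like_lost_wakeup(int32_t lock_count) {
    lock_count_info info = decode_lock_count(lock_count);
    return !info.locked && info.waiters > 0 && !info.waiter_woken;
}
```

With this layout, a LockCount of -1 decodes as free with no waiters, and -22 decodes as locked with five waiters and none woken. A value for which `looks_like_lost_wakeup` returns true would at least be consistent with waiters that were never signalled.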

Any insight would be appreciated, as I have been looking at this for two months with half a dozen crash dumps and have nothing to show for it.


Phil_Barila · January 19, 2016, 10:20pm

That really does sound like a hardware issue. You can attempt to isolate it by swapping/rotating the memory modules. If the problem vanishes or something new starts to happen, it’s probably a stuck bit.

I once changed a memory allocator from a chunk of a driver (or device, can’t remember anymore) extension to a block of pool. A benchmark started hanging. I reverted the change, the hang went away. I hunted through my code for a couple of weeks, and after deciding the code couldn’t be it, I decided that it could possibly be hardware, so I swapped the RIMMs, and the problem disappeared. Swapped them back, and the problem reappeared, so I replaced both RIMMs.

You might also run the memory diagnostics; they might show something.

Phil

—
NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:

Maxim_S_Shatskih · January 20, 2016, 5:01am

Oh yes.

ECC memory is a must for serious validation tests.

I once even considered adding in-memory ECC protection to one absolutely critical data structure to deal with such things.
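
As a rough illustration of software-level protection for a critical structure, here is a minimal sketch. It is not a true ECC code: it swaps in the simpler technique of keeping three copies and taking a bitwise majority vote, which repairs any bit flip confined to one copy. The names `guarded_block`, `guarded_write`, `guarded_read` and the 64-byte `GUARDED_SIZE` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARDED_SIZE 64  /* assumed capacity of the protected payload */

/* Three redundant copies of the same payload (hypothetical type). */
typedef struct {
    uint8_t copy[3][GUARDED_SIZE];
} guarded_block;

/* Store the payload into all three copies, zero-padding the remainder. */
void guarded_write(guarded_block *g, const void *src, size_t len) {
    assert(len <= GUARDED_SIZE);
    for (int i = 0; i < 3; i++) {
        memset(g->copy[i], 0, GUARDED_SIZE);
        memcpy(g->copy[i], src, len);
    }
}

/* Majority-vote every byte, repair any copy that disagrees, then return
 * the voted payload. The result is the number of bytes that needed repair. */
size_t guarded_read(guarded_block *g, void *dst, size_t len) {
    assert(len <= GUARDED_SIZE);
    size_t repaired = 0;
    for (size_t i = 0; i < GUARDED_SIZE; i++) {
        uint8_t a = g->copy[0][i];
        uint8_t b = g->copy[1][i];
        uint8_t c = g->copy[2][i];
        /* Each output bit takes the value held by at least two copies. */
        uint8_t v = (uint8_t)((a & b) | (a & c) | (b & c));
        if (a != v || b != v || c != v) {
            repaired++;
            g->copy[0][i] = v;
            g->copy[1][i] = v;
            g->copy[2][i] = v;
        }
    }
    memcpy(dst, g->copy[0], len);
    return repaired;
}
```

A nonzero return value from `guarded_read` is worth logging in its own right, since it is direct evidence that memory changed underneath the program. The sketch does not help if the same bit is damaged in two copies, and it says nothing about faults in caches or registers.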

–
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com


Alex_Grig · January 20, 2016, 1:09pm

A thread that owned the heap lock crashed or was killed (ugh).

I suspect ever-helpful Windows freed the critical section owned by a terminated thread, but didn’t wake the waiting threads. See if you can log thread termination. Check that you are not killing any thread explicitly. Check whether the application silently swallows fatal exceptions in the threads.

MBond · January 21, 2016, 6:10pm

I was thinking about this possibility too, but I rejected it for two reasons:

1. There are no calls to TerminateThread in the code and no use of 3rd party libraries, so it would have to be an exception.
2. What kind of exception could be raised by the code in HeapAlloc or HeapFree that would consistently leave the heap in a state where further allocations and frees succeed and !heap etc. do not complain? I have at least 5 crash dumps that all show the same pattern, so it is hard to imagine that they were all corrupted in exactly the same way, though it is possible.

It is a PRD (production) machine, so it will be hard to change out the RAM, and I am still stuck. But I’ll keep thinking.

Alex_Grig · January 21, 2016, 6:16pm

See if there is a disk read error in the system log. A page-in error like that would surface as an unhandled exception in the application.
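
One way to find out whether worker threads are hitting exceptions that get swallowed is to record them as they are raised. The sketch below is a hedged example, not a drop-in fix. The portable part classifies exception codes, using the values for access violation, in-page error and stack overflow from the Windows SDK headers. The `_WIN32` part installs a vectored exception handler that writes into a preallocated ring buffer, so it never touches the heap and the records can be read from a dump afterwards. All function and variable names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Exception codes as defined in the Windows SDK headers. */
#define CODE_ACCESS_VIOLATION 0xC0000005u
#define CODE_IN_PAGE_ERROR    0xC0000006u
#define CODE_STACK_OVERFLOW   0xC00000FDu

/* Name the codes most relevant to this kind of hang (hypothetical helper). */
const char *describe_exception_code(uint32_t code) {
    switch (code) {
    case CODE_ACCESS_VIOLATION: return "access violation";
    case CODE_IN_PAGE_ERROR:    return "in-page I/O error (failed page-in)";
    case CODE_STACK_OVERFLOW:   return "stack overflow";
    default:                    return "other";
    }
}

/* Error-severity status codes have both of the top two bits set. */
int is_error_severity(uint32_t code) {
    return (code >> 30) == 3u;
}

#ifdef _WIN32
#include <windows.h>

#define LOG_SLOTS 64  /* assumed ring size; old entries are overwritten */

typedef struct {
    DWORD thread_id;
    DWORD code;
    PVOID address;
} exception_slot;

static exception_slot g_exception_log[LOG_SLOTS];
static volatile LONG  g_next_slot = -1;

/* Runs for every raised exception, including ones a handler later
 * swallows. It only records and never handles the exception itself.
 * C++ exceptions also have error severity, so they are recorded too. */
static LONG CALLBACK record_exception(PEXCEPTION_POINTERS info) {
    DWORD code = info->ExceptionRecord->ExceptionCode;
    if (is_error_severity(code)) {
        LONG n = InterlockedIncrement(&g_next_slot);
        exception_slot *slot = &g_exception_log[(ULONG)n % LOG_SLOTS];
        slot->thread_id = GetCurrentThreadId();
        slot->code      = code;
        slot->address   = info->ExceptionRecord->ExceptionAddress;
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

/* Call once at startup, before the worker threads are created. */
void install_exception_recorder(void) {
    AddVectoredExceptionHandler(1, record_exception);
}
#endif
```

After the next hang, dumping the ring buffer in the debugger would show whether any thread raised 0xC0000006 or another fatal code shortly before the waiters piled up. An empty buffer would count against the swallowed-exception theory.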