Windows custom driver freezes system with 100% CPU

A kernel-level driver is installed on a terminal server. It works fine for a certain period of time, but later the
terminal server itself gets into a frozen state where nobody can connect to it over RDP or the web console. In my case,
the CPU is always at 100% in the frozen state, and I can only recover with a hard reboot using the VM “power off” option. After uninstalling the driver, the terminal server works fine and always responds properly. Even when it is at 100% CPU usage and gets slow, it still responds to RDP and the web console.

The scenario is hard to reproduce, but I managed to capture a complete memory dump from the machine in that state and analyzed it with the Microsoft WinDbg tool. WinDbg reported the faulty driver module name and call stack below:

Module Name: MMTEProxy (Installed Driver)

		0: kd> !analyze -v
		*******************************************************************************
		*                                                                             *
		*                        Bugcheck Analysis                                    *
		*                                                                             *
		*******************************************************************************

		NMI_HARDWARE_FAILURE (80)
		This is typically due to a hardware malfunction.  The hardware supplier should
		be called.
		Arguments:
		Arg1: 00000000004f4454
		Arg2: 0000000000000000
		Arg3: 0000000000000000
		Arg4: 0000000000000000

		Debugging Details:
		------------------
		KEY_VALUES_STRING: 1

		PROCESSES_ANALYSIS: 1

		SERVICE_ANALYSIS: 1

		STACKHASH_ANALYSIS: 1

		TIMELINE_ANALYSIS: 1

		DUMP_CLASS: 1

		DUMP_QUALIFIER: 402

		BUILD_VERSION_STRING:  9600.17415.amd64fre.winblue_r4.141028-1500

		SYSTEM_MANUFACTURER:  VMware, Inc.

		VIRTUAL_MACHINE:  VMware

		SYSTEM_PRODUCT_NAME:  VMware Virtual Platform

		SYSTEM_VERSION:  None

		BIOS_VENDOR:  Phoenix Technologies LTD

		BIOS_VERSION:  6.00

		BIOS_DATE:  04/05/2016

		BASEBOARD_MANUFACTURER:  Intel Corporation

		BASEBOARD_PRODUCT:  440BX Desktop Reference Platform

		BASEBOARD_VERSION:  None

		DUMP_TYPE:  0

		BUGCHECK_P1: 4f4454

		BUGCHECK_P2: 0

		BUGCHECK_P3: 0

		BUGCHECK_P4: 0

		CPU_COUNT: 2

		CPU_MHZ: bb8

		CPU_VENDOR:  GenuineIntel

		CPU_FAMILY: 6

		CPU_MODEL: 3e

		CPU_STEPPING: 4

		CPU_MICROCODE: 6,3e,4,0 (F,M,S,R)  SIG: 42C'00000000 (cache) 42C'00000000 (init)

		DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

		BUGCHECK_STR:  0x80

		PROCESS_NAME:  svchost.exe

		CURRENT_IRQL:  0

		ANALYSIS_SESSION_HOST:  INPN01LAP107

		ANALYSIS_SESSION_TIME:  03-26-2019 16:30:13.0120

		ANALYSIS_VERSION: 10.0.18317.1001 amd64fre

		LAST_CONTROL_TRANSFER:  from fffff8005ae205b2 to fffff8009a6601a7

		STACK_TEXT:  
		nt!KxWaitForLockOwnerShip+0x27
		MMTEProxy!SVSessionLutTranslatePort+0x2c2 [c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\sessionlut.c @ 873] 
		MMTEProxy!PerformProxySocketRedirection+0xba7 [c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\filteralebindredirect.c @ 247] 
		MMTEProxy!TriggerProxyByALERedirectInline+0x244 [c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\filteralebindredirect.c @ 690] 
		MMTEProxy!DDProxyBindRedirectClassify+0x537 [c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\filteralebindredirect.c @ 881] 

		THREAD_SHA1_HASH_MOD_FUNC:  03f7fb5fd041c46c9b4dff8f1685ccff753d3642

		THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  7f4a5e830d38804e610244f134268d53640c97a0

		THREAD_SHA1_HASH_MOD:  2a8f232a3e3c38ad2a6b44b0d2253b97c2ac4b2a

		FOLLOWUP_IP: 
		MMTEProxy!SVSessionLutTranslatePort+2c2 [c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\sessionlut.c @ 873]
		fffff800`5ae205b2 c644244000      mov     byte ptr [rsp+40h],0

		FAULT_INSTR_CODE:  402444c6

		FAULTING_SOURCE_LINE:  c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\sessionlut.c

		FAULTING_SOURCE_FILE:  c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\sessionlut.c

		FAULTING_SOURCE_LINE_NUMBER:  873

		FAULTING_SOURCE_CODE:  
		No source found for 'c:\users\dkelone\git\MMTE\MMTE\MMTEdriver\sessionlut.c'

		SYMBOL_STACK_INDEX:  1

		SYMBOL_NAME:  MMTEProxy!SVSessionLutTranslatePort+2c2

		FOLLOWUP_NAME:  MachineOwner

		MODULE_NAME: MMTEProxy

		IMAGE_NAME:  MMTEProxy.sys

		DEBUG_FLR_IMAGE_TIMESTAMP:  5a60d5f0

		STACK_COMMAND:  .thread ; .cxr ; kb

		BUCKET_ID_FUNC_OFFSET:  2c2

		FAILURE_BUCKET_ID:  0x80_MMTEProxy!SVSessionLutTranslatePort

		BUCKET_ID:  0x80_MMTEProxy!SVSessionLutTranslatePort

		PRIMARY_PROBLEM_CLASS:  0x80_MMTEProxy!SVSessionLutTranslatePort

		TARGET_TIME:  2019-02-26T11:15:36.000Z

		OSBUILD:  9600

		OSSERVICEPACK:  0

		SERVICEPACK_NUMBER: 0

		OS_REVISION: 0

		SUITE_MASK:  16

		PRODUCT_TYPE:  3

		OSPLATFORM_TYPE:  x64

		OSNAME:  Windows 8.1

		OSEDITION:  Windows 8.1 Server TerminalServer

		OS_LOCALE:  

		USER_LCID:  0

		OSBUILD_TIMESTAMP:  2014-10-29 06:08:48

		BUILDDATESTAMP_STR:  141028-1500

		BUILDLAB_STR:  winblue_r4

		BUILDOSVER_STR:  6.3.9600.17415.amd64fre.winblue_r4.141028-1500

		ANALYSIS_SESSION_ELAPSED_TIME:  685

		ANALYSIS_SOURCE:  KM

		FAILURE_ID_HASH_STRING:  km:0x80_MMTEProxy!svsessionluttranslateport

		FAILURE_ID_HASH:  {c64b7e97-0bf3-daf1-ad95-9f39cbf37a9a}

		Followup:     MachineOwner
		---------

I am not an expert in kernel-level driver development, but I tried to research the driver. Internally, it uses the following locks to perform any operation on the process table or session table:

		#Code snippet

	PLIST_ENTRY    processTableListHead = NULL;

	{
		....

		KLOCK_QUEUE_HANDLE processTableLockHandle;
		KLOCK_QUEUE_HANDLE sessionTableLockHandle;
		
		PLIST_ENTRY tempNode = 0;
		....
		...

		KeAcquireInStackQueuedSpinLock(&gProcessTableLock,&processTableLockHandle);

		tempNode = processTableListHead;

		...
		...
		..
		//Releases lock
		KeReleaseInStackQueuedSpinLock(&sessionTableLockHandle);
		KeReleaseInStackQueuedSpinLock(&processTableLockHandle);

	}

With the help of WinDbg, what I observed is that it mostly fails at a source line that assigns a value to a variable, and that variable is defined before the lock is acquired. You can see this in the driver code snippet above. My driver is a WFP ALE filter driver. It inspects traffic, works in a multithreaded environment, and allocates and frees memory in non-paged pool.

I also checked that there is no deadlock condition and no lock currently held by any thread. Still, I do not understand what is causing this issue: whether the lock is not handled properly at the code level, or it is some particular situation.

Can you please help me with a pointer or direction?

Your code sample is useless. It is obviously incorrect (releasing a lock not acquired) but that is likely because what you posted is not the code. Assuming that you just mangled the sample and the sessionTableLockHandle is actually held, there is nothing left to what you posted. Locks are acquired, shared data is accessed, locks are released.

Assuming that you just mangled the sample

What has always amazed me is this way of thinking - how on Earth are we supposed to help the posters with their code if they present only some parts of it???

Anton Bassov

All a big aside:

This is what dismays me about software “development” in the 21st century: Search, find sample, hack sample, have problem with hacked sample, ask someone else to solve problem with hacked sample. Go to next project.

This might work in web development… it is not a reasonable way to undertake a kernel mode project.

It just makes me nuts. But every time I write something like this I wonder if I’m not just overvaluing what I do by thinking it’s more special than, say, writing some app in C# where the above process seems to me to be at worst silly and sloppy, but not actually pernicious.

Peter

But every time I write something like this I wonder if I’m not just overvaluing what I do by thinking it’s more special
than, say, writing some app in C#

Well, I would say there is a huge gap between theory and practice here…

If you look at the whole thing from a purely theoretical standpoint, you may safely assume that some C#/Java/etc. programmers may be more professionally competent than a dozen of us combined, and be writing some truly advanced programs (like, for example, circuit synthesizers that generate netlists from a circuit description in Verilog or VHDL). Their reason for choosing a managed language over C or C++ may be their wish to concentrate on the “exciting” problem itself, without having to worry about the “mundane” things like buffer overflows, pointer validation, prevention of memory leaks, etc. Certainly, a program written in a managed language is going to be slower than one written in C, but tasks of this kind may take many hours to accomplish anyway, so that it does not really matter.

This is theory. However, in practice, would you ever encounter something like that??? I would say that, in practical terms, they are more than likely to be writing some toy web app, and the writers of the above-mentioned advanced programs are more than likely to stick to C, simply because this happens to be the language that they are most comfortable with and think in terms of…

Anton Bassov

bharatgade wrote:

The scenario is hard to reproduce, but I managed to capture a complete memory dump from the machine in that state and analyzed it with the Microsoft WinDbg tool. WinDbg reported the faulty driver module name and call stack below:

Module Name: MMTEProxy (Installed Driver)
NMI_HARDWARE_FAILURE (80)…
Arguments:
Arg1: 00000000004f4454

VMware raises NMI to signal certain fatal error conditions.  Note that
Arg1 contains the ASCII string “ODT” (or “TDO”).  I could not find any
Google results in a quick search (“vmware nmi odt”), but I’ll wager this
is some kind of watchdog timeout.  Have you done any analysis to
determine exactly how long you might wait for those spin locks?  I’m
sure you haven’t, but that’s where you’d want to look.
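
As a small illustration of the Arg1 observation above, here is a sketch in C that reads the non-zero bytes of a bugcheck argument as ASCII in both byte orders. It is not from the driver or from any VMware or Microsoft tooling; `decode_tag` is a hypothetical helper name, and treating Arg1 as a short ASCII tag is an assumption based only on the value in the dump.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: interpret a 64-bit bugcheck argument as a short
 * ASCII tag. Leading zero bytes are skipped; non-printable bytes become '.'.
 *   msb_first: bytes in the order they appear in the hex value
 *   lsb_first: the same bytes reversed (little-endian memory order)
 * Both output buffers must hold at least 9 chars.
 */
static void decode_tag(uint64_t arg, char *msb_first, char *lsb_first)
{
    size_t n = 0;

    for (int shift = 56; shift >= 0; shift -= 8) {
        unsigned char b = (unsigned char)((arg >> shift) & 0xff);
        if (n == 0 && b == 0)
            continue;                       /* skip leading zero bytes */
        msb_first[n++] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
    }
    msb_first[n] = '\0';

    for (size_t i = 0; i < n; i++)          /* reversed copy */
        lsb_first[i] = msb_first[n - 1 - i];
    lsb_first[n] = '\0';
}
```

Calling `decode_tag(0x00000000004f4454ULL, msb, lsb)` fills `msb` with "ODT" and `lsb` with "TDO", which is where the two candidate spellings come from.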

I also checked that there is no deadlock condition and no lock currently held by any thread. Still, I do not understand what is causing this issue: whether the lock is not handled properly at the code level, or it is some particular situation.

Well, that statement is patently false, right?  The code you posted
shows that you hold both the session table lock and the process table lock.