NT Kernel causing DPC_WATCHDOG_VIOLATION (133) when trying to flush write cache?

Hi,

I’m getting a bugcheck on two different servers (one is Server 2016, the other is Server 2022).

Here’s the crash dump analysis from the Server 2022 system:

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
	DISPATCH_LEVEL or above. The offending component can usually be
	identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff8052390f328, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
	additional information regarding the cumulative timeout
Arg4: 0000000000000000

Debugging Details:

*************************************************************************
***    Either you specified an unqualified symbol, or your debugger   ***
***    doesn't have full symbol information.  Unqualified symbol      ***
***    resolution is turned off by default. Please either specify a   ***
***    fully qualified symbol module!symbolname, or enable resolution ***
***    of unqualified symbols by typing ".symopt- 100". Note that     ***
***    enabling unqualified symbol resolution with network symbol     ***
***    server shares in the symbol path may cause the debugger to     ***
***    appear to hang for long periods of time when an incorrect      ***
***    symbol name is typed or the network symbol server is down.     ***
***                                                                   ***
***    For some commands to work properly, your symbol path           ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: TickPeriods                                   ***
***                                                                   ***
*************************************************************************

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 6640

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 6653

    Key  : Analysis.Init.CPU.mSec
    Value: 1062

    Key  : Analysis.Init.Elapsed.mSec
    Value: 30579

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 110

    Key  : WER.OS.Branch
    Value: fe_release_svc_prod2

    Key  : WER.OS.Timestamp
    Value: 2022-07-07T18:32:00Z

    Key  : WER.OS.Version
    Value: 10.0.20348.859

BUGCHECK_CODE:  133

BUGCHECK_P1: 1

BUGCHECK_P2: 1e00

BUGCHECK_P3: fffff8052390f328

BUGCHECK_P4: 0

DPC_TIMEOUT_TYPE:  DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

TRAP_FRAME:  ffffed09e909eba0 -- (.trap 0xffffed09e909eba0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=00000000c0000000 rbx=0000000000000000 rcx=00000000c0000000
rdx=00000000000000ff rsi=0000000000000000 rdi=0000000000000000
rip=fffff80522f29687 rsp=ffffed09e909ed30 rbp=0000000000000002
 r8=0000000000000004  r9=ffff810867fbc930 r10=ffff8108723aded0
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po cy
nt!ExpWaitForSpinLockExclusiveAndAcquire+0xd7:
fffff805`22f29687 0f845dffffff    je      nt!ExpWaitForSpinLockExclusiveAndAcquire+0x3a (fffff805`22f295ea) [br=0]
Resetting default scope

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

PROCESS_NAME:  System

STACK_TEXT:  
ffffdd00`9d532da8 fffff805`22e8ae41     : 00000000`00000133 00000000`00000001 00000000`00001e00 fffff805`2390f328 : nt!KeBugCheckEx
ffffdd00`9d532db0 fffff805`22e88cf4     : 000390bd`4c6cb908 ffffdd00`9d532ee8 00000000`00000246 fffff805`22e85c02 : nt!KeAccumulateTicks+0x541
ffffdd00`9d532e20 fffff805`22e8895a     : 00000000`03d9f248 ffffdd00`9d51f218 00000000`00000000 fffff805`22f2e73f : nt!KiUpdateRunTime+0x64
ffffdd00`9d532eb0 fffff805`22e887e4     : ffff8108`5093d0e0 00000000`00000001 ffff8108`5093d0e0 00000000`00000000 : nt!KeClockInterruptNotify+0x10a
ffffdd00`9d532f40 fffff805`22e41350     : 00000000`00000000 ffffa146`472a0d20 00000000`00000200 fffff805`2385f300 : nt!HalpTimerClockIpiRoutine+0x14
ffffdd00`9d532f70 fffff805`230216ba     : ffffed09`e909ec20 ffff8108`5093d0e0 00000000`00000000 00000000`00000200 : nt!KiCallInterruptServiceRoutine+0xa0
ffffdd00`9d532fb0 fffff805`23021f27     : ffff8108`90dd9530 fffff805`2385f320 00000000`00000000 00000000`00000000 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
ffffed09`e909eba0 fffff805`22f29687     : 00000000`00000000 fffff805`22ee4591 00000000`00000010 00000000`00040246 : nt!KiInterruptDispatchNoLockNoEtw+0x37
ffffed09`e909ed30 fffff805`22ee45c9     : ffff8108`0001c12f fffff805`2385f300 00000000`00000000 fffff805`2385edc0 : nt!ExpWaitForSpinLockExclusiveAndAcquire+0xd7
ffffed09`e909ed60 fffff805`22ee1a8d     : ffff8108`723aded0 00000000`00000000 00000000`00000000 00000000`00000000 : nt!ExAcquireSpinLockExclusiveAtDpcLevel+0x79
ffffed09`e909ed90 fffff805`22ee119e     : ffff8108`723aded0 00000000`00000004 00000000`00000200 00000000`00000000 : nt!MiInsertUnusedSubsection+0x65
ffffed09`e909edd0 fffff805`22f553b3     : ffff8108`723aded0 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiRemoveViewsFromSection+0x82
ffffed09`e909ee00 fffff805`22ee18db     : ffff8108`67fbc930 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiConvertStaticSubsections+0x77
ffffed09`e909ee40 fffff805`22ee0f69     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiInsertUnusedSegment+0x2f
ffffed09`e909ee80 fffff805`22ee0cf2     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiCheckControlArea+0x1b9
ffffed09`e909ef00 fffff805`232e5253     : ffff8108`67fbc930 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiDereferenceControlAreaBySection+0x2a
ffffed09`e909ef30 fffff805`23282ce0     : ffffbb06`1bd135b0 ffff8904`0acdd820 00000000`00000000 ffff8904`0ac28c90 : nt!MiSectionDelete+0x83
ffffed09`e909ef60 fffff805`22e327d3     : 00000000`00000000 00000000`00000000 ffff8904`0acdd820 ffffbb06`1bd135e0 : nt!ObpRemoveObjectRoutine+0x80
ffffed09`e909efc0 fffff805`22e14ab9     : 00000000`00000000 ffffed09`e909f0b0 ffff8904`0acdd820 ffffed09`e909f0b0 : nt!ObfDereferenceObject+0x93
ffffed09`e909f000 fffff805`22e16263     : ffff8108`683a6da0 ffffed09`e909f0c9 ffff8108`93bb7300 ffff8108`6882bd30 : nt!CcDeleteSharedCacheMap+0x16d
ffffed09`e909f050 fffff805`22e15639     : ffff8904`0acdd800 ffffed09`e909f100 00000000`00000000 0000001f`ff9194f9 : nt!CcWriteBehindInternal+0x493
ffffed09`e909f130 fffff805`22e141d4     : ffff8108`93bb7300 ffff8108`591eb040 ffff8904`0ac52cb0 ffff8108`508b3c60 : nt!CcWriteBehind+0xa5
ffffed09`e909f230 fffff805`22edbd01     : ffff8904`0ac06110 fffff805`2393d680 ffff8904`0ac06110 00000000`00000100 : nt!CcCachemapUninitWorkerThread+0x104
ffffed09`e909f320 fffff805`22f69f15     : ffff8108`591eb040 00000000`00000001 ffff8108`591eb040 00000000`00000000 : nt!ExpWorkerThread+0x161
ffffed09`e909f530 fffff805`23024488     : ffffdd00`9d516180 ffff8108`591eb040 fffff805`22f69ec0 00000000`00000246 : nt!PspSystemThreadStartup+0x55
ffffed09`e909f580 00000000`00000000     : ffffed09`e90a0000 ffffed09`e9099000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28

SYMBOL_NAME:  nt!KeAccumulateTicks+541

MODULE_NAME: nt

IMAGE_NAME:  ntkrnlmp.exe

STACK_COMMAND:  .thread ; .cxr ; kb

BUCKET_ID_FUNC_OFFSET:  541

FAILURE_BUCKET_ID:  0x133_ISR_nt!KeAccumulateTicks

OS_VERSION:  10.0.20348.859

BUILDLAB_STR:  fe_release_svc_prod2

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {65350307-c3b9-f4b5-8829-4d27e9ff9b06}

Followup:     MachineOwner

My code that seems to be causing this is a user-mode application that does a lot of file I/O on data and index files. Each file has a header at the beginning that is mapped into memory via CreateFileMapping/MapViewOfFile; the rest of the file is read and written via standard ReadFile/WriteFile API calls. When the program ends, FlushViewOfFile is called before unmapping the views and closing the files. The crash seems to occur after the program has exited, but not consistently: the program can run multiple times over many days without a crash.
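For anyone who wants to reproduce the access pattern, here is a minimal sketch of it. It is Python rather than the actual Win32 code (which is not shown in this thread); on Windows, Python's mmap module is built on CreateFileMapping/MapViewOfFile and its flush() calls FlushViewOfFile. The names HEADER_SIZE, run_mixed_io and read_back, and the 8-byte record counter in the header, are made up for illustration.

```python
import mmap
import os
import struct

# Hypothetical header size; using the allocation granularity keeps the mapping aligned.
HEADER_SIZE = mmap.ALLOCATIONGRANULARITY


def run_mixed_io(path, records):
    """Update a memory-mapped header while appending records with ordinary file writes."""
    # Pre-size the file so the header region exists before it is mapped.
    with open(path, "wb") as f:
        f.truncate(HEADER_SIZE)

    with open(path, "r+b") as f:
        # Map only the header (CreateFileMapping/MapViewOfFile on Windows).
        header = mmap.mmap(f.fileno(), HEADER_SIZE, access=mmap.ACCESS_WRITE)
        try:
            for rec in records:
                # The body goes through ordinary file I/O (WriteFile on Windows).
                f.seek(0, os.SEEK_END)
                f.write(rec)
                # The record counter in the header is updated through the mapped view.
                (count,) = struct.unpack_from("<Q", header, 0)
                struct.pack_into("<Q", header, 0, count + 1)
            f.flush()        # push buffered writes to the OS
            header.flush()   # FlushViewOfFile on Windows, msync elsewhere
        finally:
            header.close()   # unmap the view before the file handle is closed


def read_back(path):
    """Read the counter and the body back using ordinary file reads only."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<Q", f.read(8))
        f.seek(HEADER_SIZE)
        body = f.read()
    return count, body
```

The point of the sketch is only the ordering: mapped header writes and ordinary body writes hit the same file, and the view is flushed and unmapped before the handle is closed. It does not attempt to reproduce the crash.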

The index files are stored on NVMe modules (OS-level RAID1 set up via Disk Management). After the crash, when the system comes back up, the RAID1 mirror is in an error state and has to be “re-activated”, at which point it resynchronizes and then is fine. The data files are stored on a RAID10 spinning-disk array.

I suspect the problem could be that I’m doing both memory-mapped I/O and standard file I/O on the same file, and this may be confusing the Windows cache manager.
I’ve created a version that only uses standard file I/O and am currently running that to see if the crashes go away. This is not ideal, as mapping the header into memory leads to much higher performance. Perhaps I should look into making a version that only uses memory-mapped I/O?

The two servers where this occurred have different hardware and drivers. They both have NVMe for index storage and spinning disks for data storage, but different motherboards, NVMe modules, disk controllers, etc. One is using the standard “NVM Express Controller” driver provided by Microsoft; the other is using the “Micron NVME Storage Controller”.

Questions:
Since the crash dump seems to indicate this is happening within Windows code and not a 3rd party driver, what’s the best way to submit these crash dumps to Microsoft?


Has anyone else encountered this? Is it bad juju to use both memory mapped I/O and standard file I/O on the same file?


Thank you for any input,
Erik

You need to have some sort of relationship with Microsoft in order to open an incident with them and get actual support. Even a simple MSDN subscription (now called a Visual Studio subscription) is sufficient. Once you have that set up they are pretty responsive, but you have to push through the first line of response to get to the tech level that has the knowledge to analyze this sort of issue.

Can you share the dump file somewhere?

I wish I could share that optimism… it took us a year last time to get a single dump analyzed.

Dejan.

“I wish I could share that optimism…” - well, Your Mileage May Vary, but both inside Huge Corporations and at my own one-person consultancy, I’ve had pretty good experiences, most of the time. But you really have to work at getting past the initial contact and getting access to somebody who actually knows something about your problem. It is sort of shitty, as the first line of support gets rated on not letting issues get past them.

I got to support right away… support took a year :slight_smile:

Share the dump please - we can analyze it for you.

“Since the crash dump seems to indicate this is happening within windows code and not a 3rd party driver”

Well, with this type of bug check, WHERE the crash occurs is incidental, isn’t it? I mean, BadDriver.sys could be spending a ton of time at IRQL DISPATCH_LEVEL, and then (later, with BadDriver.sys off the stack) the watchdog fires in Windows OS code.
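To put a rough number on how much accumulated time is involved here: Microsoft's documentation for this bug check describes Arg2 (for Arg1 = 1) as the watchdog period in ticks. The small sketch below converts the 0x1e00 from the dump into seconds. The 15.625 ms tick length is an assumption (the common default clock interval), not something read from this dump, and watchdog_period is just an illustrative helper name.

```python
def watchdog_period(arg2_hex, tick_ms=15.625):
    """Convert the Arg2 hex string of a 0x133 bugcheck to (ticks, seconds).

    tick_ms is an assumed clock tick length; the real value depends on the system.
    """
    ticks = int(arg2_hex, 16)
    return ticks, ticks * tick_ms / 1000.0


# Arg2 as printed in the !analyze output above.
ticks, seconds = watchdog_period("0000000000001e00")
print(f"{ticks} ticks, about {seconds:.0f} s at the assumed tick length")
```

Under that assumption the period works out to about two minutes, which fits the idea that the time was accumulated across many DPCs rather than in the one frame that happened to be on the stack when the watchdog fired.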

If you have a completely stock system, with only MSFT drivers, then I’m sure MSFT would be interested in seeing the dump.

OTOH, if you have a system with third party drivers, I bet they blame the 3rd party driver.

As everyone’s already said… share the crash dump and let us have a look.

Peter

Thank you all for your willingness to look at this.

I’ve zipped the dump file and uploaded it to Google Drive.

I did password protect the zip file in an attempt to be at least somewhat security conscious. Does this forum software have a way to send the password via direct message or am I being too paranoid lol?

https://drive.google.com/file/d/1qPkPop2lknQoOXHM4SbuAvlk89Kqy0tG/view?usp=sharing

Thanks,
Erik

The password for the zip file is: zaza123ybyb