Hi,
I’m getting a bugcheck on two different servers (one is Server 2016, the other is Server 2022).
Here’s the crash dump analysis from the Server 2022 system:
DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL
or above.
Arguments:
Arg1: 0000000000000001, The system cumulatively spent an extended period of time at
DISPATCH_LEVEL or above. The offending component can usually be
identified with a stack trace.
Arg2: 0000000000001e00, The watchdog period.
Arg3: fffff8052390f328, cast to nt!DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK, which contains
additional information regarding the cumulative timeout
Arg4: 0000000000000000
Debugging Details:
*************************************************************************
*** Either you specified an unqualified symbol, or your debugger ***
*** doesn't have full symbol information. Unqualified symbol ***
*** resolution is turned off by default. Please either specify a ***
*** fully qualified symbol module!symbolname, or enable resolution ***
*** of unqualified symbols by typing ".symopt- 100". Note that ***
*** enabling unqualified symbol resolution with network symbol ***
*** server shares in the symbol path may cause the debugger to ***
*** appear to hang for long periods of time when an incorrect ***
*** symbol name is typed or the network symbol server is down. ***
*** ***
*** For some commands to work properly, your symbol path ***
*** must point to .pdb files that have full type information. ***
*** ***
*** Certain .pdb files (such as the public OS symbols) do not ***
*** contain the required information. Contact the group that ***
*** provided you with these symbols if you need this command to ***
*** work. ***
*** ***
*** Type referenced: TickPeriods ***
*** ***
*************************************************************************
KEY_VALUES_STRING: 1
Key : Analysis.CPU.mSec
Value: 6640
Key : Analysis.DebugAnalysisManager
Value: Create
Key : Analysis.Elapsed.mSec
Value: 6653
Key : Analysis.Init.CPU.mSec
Value: 1062
Key : Analysis.Init.Elapsed.mSec
Value: 30579
Key : Analysis.Memory.CommitPeak.Mb
Value: 110
Key : WER.OS.Branch
Value: fe_release_svc_prod2
Key : WER.OS.Timestamp
Value: 2022-07-07T18:32:00Z
Key : WER.OS.Version
Value: 10.0.20348.859
BUGCHECK_CODE: 133
BUGCHECK_P1: 1
BUGCHECK_P2: 1e00
BUGCHECK_P3: fffff8052390f328
BUGCHECK_P4: 0
DPC_TIMEOUT_TYPE: DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED
TRAP_FRAME: ffffed09e909eba0 -- (.trap 0xffffed09e909eba0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=00000000c0000000 rbx=0000000000000000 rcx=00000000c0000000
rdx=00000000000000ff rsi=0000000000000000 rdi=0000000000000000
rip=fffff80522f29687 rsp=ffffed09e909ed30 rbp=0000000000000002
r8=0000000000000004 r9=ffff810867fbc930 r10=ffff8108723aded0
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na po cy
nt!ExpWaitForSpinLockExclusiveAndAcquire+0xd7:
fffff805`22f29687 0f845dffffff je nt!ExpWaitForSpinLockExclusiveAndAcquire+0x3a (fffff805`22f295ea) [br=0]
Resetting default scope
BLACKBOXBSD: 1 (!blackboxbsd)
BLACKBOXNTFS: 1 (!blackboxntfs)
BLACKBOXPNP: 1 (!blackboxpnp)
BLACKBOXWINLOGON: 1
PROCESS_NAME: System
STACK_TEXT:
ffffdd00`9d532da8 fffff805`22e8ae41 : 00000000`00000133 00000000`00000001 00000000`00001e00 fffff805`2390f328 : nt!KeBugCheckEx
ffffdd00`9d532db0 fffff805`22e88cf4 : 000390bd`4c6cb908 ffffdd00`9d532ee8 00000000`00000246 fffff805`22e85c02 : nt!KeAccumulateTicks+0x541
ffffdd00`9d532e20 fffff805`22e8895a : 00000000`03d9f248 ffffdd00`9d51f218 00000000`00000000 fffff805`22f2e73f : nt!KiUpdateRunTime+0x64
ffffdd00`9d532eb0 fffff805`22e887e4 : ffff8108`5093d0e0 00000000`00000001 ffff8108`5093d0e0 00000000`00000000 : nt!KeClockInterruptNotify+0x10a
ffffdd00`9d532f40 fffff805`22e41350 : 00000000`00000000 ffffa146`472a0d20 00000000`00000200 fffff805`2385f300 : nt!HalpTimerClockIpiRoutine+0x14
ffffdd00`9d532f70 fffff805`230216ba : ffffed09`e909ec20 ffff8108`5093d0e0 00000000`00000000 00000000`00000200 : nt!KiCallInterruptServiceRoutine+0xa0
ffffdd00`9d532fb0 fffff805`23021f27 : ffff8108`90dd9530 fffff805`2385f320 00000000`00000000 00000000`00000000 : nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
ffffed09`e909eba0 fffff805`22f29687 : 00000000`00000000 fffff805`22ee4591 00000000`00000010 00000000`00040246 : nt!KiInterruptDispatchNoLockNoEtw+0x37
ffffed09`e909ed30 fffff805`22ee45c9 : ffff8108`0001c12f fffff805`2385f300 00000000`00000000 fffff805`2385edc0 : nt!ExpWaitForSpinLockExclusiveAndAcquire+0xd7
ffffed09`e909ed60 fffff805`22ee1a8d : ffff8108`723aded0 00000000`00000000 00000000`00000000 00000000`00000000 : nt!ExAcquireSpinLockExclusiveAtDpcLevel+0x79
ffffed09`e909ed90 fffff805`22ee119e : ffff8108`723aded0 00000000`00000004 00000000`00000200 00000000`00000000 : nt!MiInsertUnusedSubsection+0x65
ffffed09`e909edd0 fffff805`22f553b3 : ffff8108`723aded0 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiRemoveViewsFromSection+0x82
ffffed09`e909ee00 fffff805`22ee18db : ffff8108`67fbc930 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiConvertStaticSubsections+0x77
ffffed09`e909ee40 fffff805`22ee0f69 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiInsertUnusedSegment+0x2f
ffffed09`e909ee80 fffff805`22ee0cf2 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiCheckControlArea+0x1b9
ffffed09`e909ef00 fffff805`232e5253 : ffff8108`67fbc930 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MiDereferenceControlAreaBySection+0x2a
ffffed09`e909ef30 fffff805`23282ce0 : ffffbb06`1bd135b0 ffff8904`0acdd820 00000000`00000000 ffff8904`0ac28c90 : nt!MiSectionDelete+0x83
ffffed09`e909ef60 fffff805`22e327d3 : 00000000`00000000 00000000`00000000 ffff8904`0acdd820 ffffbb06`1bd135e0 : nt!ObpRemoveObjectRoutine+0x80
ffffed09`e909efc0 fffff805`22e14ab9 : 00000000`00000000 ffffed09`e909f0b0 ffff8904`0acdd820 ffffed09`e909f0b0 : nt!ObfDereferenceObject+0x93
ffffed09`e909f000 fffff805`22e16263 : ffff8108`683a6da0 ffffed09`e909f0c9 ffff8108`93bb7300 ffff8108`6882bd30 : nt!CcDeleteSharedCacheMap+0x16d
ffffed09`e909f050 fffff805`22e15639 : ffff8904`0acdd800 ffffed09`e909f100 00000000`00000000 0000001f`ff9194f9 : nt!CcWriteBehindInternal+0x493
ffffed09`e909f130 fffff805`22e141d4 : ffff8108`93bb7300 ffff8108`591eb040 ffff8904`0ac52cb0 ffff8108`508b3c60 : nt!CcWriteBehind+0xa5
ffffed09`e909f230 fffff805`22edbd01 : ffff8904`0ac06110 fffff805`2393d680 ffff8904`0ac06110 00000000`00000100 : nt!CcCachemapUninitWorkerThread+0x104
ffffed09`e909f320 fffff805`22f69f15 : ffff8108`591eb040 00000000`00000001 ffff8108`591eb040 00000000`00000000 : nt!ExpWorkerThread+0x161
ffffed09`e909f530 fffff805`23024488 : ffffdd00`9d516180 ffff8108`591eb040 fffff805`22f69ec0 00000000`00000246 : nt!PspSystemThreadStartup+0x55
ffffed09`e909f580 00000000`00000000 : ffffed09`e90a0000 ffffed09`e9099000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28
SYMBOL_NAME: nt!KeAccumulateTicks+541
MODULE_NAME: nt
IMAGE_NAME: ntkrnlmp.exe
STACK_COMMAND: .thread ; .cxr ; kb
BUCKET_ID_FUNC_OFFSET: 541
FAILURE_BUCKET_ID: 0x133_ISR_nt!KeAccumulateTicks
OS_VERSION: 10.0.20348.859
BUILDLAB_STR: fe_release_svc_prod2
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {65350307-c3b9-f4b5-8829-4d27e9ff9b06}
Followup: MachineOwner
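For reference, the watchdog period in Arg2 can be converted to wall-clock time. This is only a rough sketch: it assumes Arg2 is expressed in clock ticks and that the tick interval is the common 15.625 ms default, neither of which is confirmed by the dump itself. The function name and constants are illustrative.

```python
# Arg2 from the bugcheck above (BUGCHECK_P2: 1e00).
WATCHDOG_PERIOD_TICKS = 0x1E00

# ASSUMPTION: the clock tick is 15.625 ms (64 ticks per second).
# The actual interval on a given machine may differ.
ASSUMED_TICK_MS = 15.625


def watchdog_period_seconds(ticks, tick_ms=ASSUMED_TICK_MS):
    """Convert a watchdog period in clock ticks to seconds."""
    return ticks * tick_ms / 1000.0
```

Under those assumptions, `watchdog_period_seconds(WATCHDOG_PERIOD_TICKS)` gives 120.0, i.e. the processor would have spent about two minutes cumulatively at DISPATCH_LEVEL or above before the watchdog fired.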
The code that seems to be triggering this is my own user-mode application, which does a lot of file I/O on data and index files. Each file has a header at the beginning that is mapped into memory via CreateFileMapping/MapViewOfFile; the rest of the file is read and written via the standard ReadFile/WriteFile API calls. When the program ends, it calls FlushViewOfFile before unmapping and closing the files. The crash seems to occur after the program has exited, but not consistently: the program can run multiple times over many days without a crash.
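To make the access pattern concrete, here is a minimal sketch of it in Python. It is not the actual application code. It assumes a hypothetical 4096-byte header that holds a single 64-bit record count, followed by length-prefixed records; `run_once`, `read_records`, `HEADER_SIZE` and the formats are illustrative names only. On Windows, Python's `mmap` module is built on CreateFileMapping/MapViewOfFile and its `flush()` calls FlushViewOfFile, so the sequence of calls is roughly comparable.

```python
import mmap
import os
import struct

HEADER_SIZE = 4096   # hypothetical size of the mapped header region
HEADER_FMT = "<Q"    # hypothetical header layout: one 64-bit record count
LEN_FMT = "<I"       # hypothetical per-record length prefix


def run_once(path, records):
    """Map the header, append records with ordinary writes, flush, unmap."""
    if not os.path.exists(path):
        # New file: start with a zeroed header.
        with open(path, "wb") as f:
            f.write(b"\0" * HEADER_SIZE)

    # buffering=0 so each write() goes straight to the OS, like WriteFile.
    with open(path, "r+b", buffering=0) as f:
        # Map only the header region of the file.
        header = mmap.mmap(f.fileno(), HEADER_SIZE, access=mmap.ACCESS_WRITE)
        try:
            (count,) = struct.unpack_from(HEADER_FMT, header, 0)
            # The body of the same file goes through standard file I/O.
            f.seek(0, os.SEEK_END)
            for rec in records:
                f.write(struct.pack(LEN_FMT, len(rec)) + rec)
                count += 1
            # The header is updated through the mapped view.
            struct.pack_into(HEADER_FMT, header, 0, count)
            header.flush()   # flush the view before unmapping
        finally:
            header.close()   # unmap; the file is closed by the with-block
    return count


def read_records(path):
    """Read the header and all records back using ordinary reads only."""
    out = []
    with open(path, "rb") as f:
        (count,) = struct.unpack(HEADER_FMT, f.read(struct.calcsize(HEADER_FMT)))
        f.seek(HEADER_SIZE)
        for _ in range(count):
            (n,) = struct.unpack(LEN_FMT, f.read(struct.calcsize(LEN_FMT)))
            out.append(f.read(n))
    return out
```

Calling `run_once(path, [...])` repeatedly mimics separate program runs: each run maps the header, appends to the body with plain writes, then flushes and unmaps before the file handle is closed.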
The index files are stored on NVMe modules (OS-level RAID 1 set up via Disk Management). After a crash, when the system comes back up, the RAID 1 mirror is in an error state and has to be “re-activated”, at which point it resyncs and is then fine. The data files are stored on a RAID 10 spinning-disk array.
I suspect the problem could be that I’m doing both memory-mapped I/O and standard file I/O on the same file, and this may be confusing the Windows cache manager.
I’ve created a version that only uses standard file I/O and am currently running that to see if the crashes go away. This is not ideal, as mapping the header into memory gives much higher performance. Perhaps I should look into making a version that only uses memory-mapped I/O?
The two servers where this occurred have different hardware and drivers. They both have NVMe for index storage and spinning disks for data storage, but different motherboards, NVMe modules, disk controllers, etc. One is using the standard “NVM Express Controller” driver provided by Microsoft; the other is using the “Micron NVME Storage Controller”.
Questions:
Since the crash dump seems to indicate this is happening within Windows code and not a third-party driver, what’s the best way to submit these crash dumps to Microsoft?
Has anyone else encountered this? Is it bad juju to use both memory mapped I/O and standard file I/O on the same file?
Thank you for any input,
Erik