Hello,
Got a very puzzling BugCheck. Irql 0, PAGE_FAULT_IN_NONPAGED_AREA, but
the memory address in argument 1 is completely valid (!pool sees it as
allocated NP pool and I can read it in the dump).
Anyone seen a similar case?
Regards, Dejan.
Can you provide the !analyze output? Also, what does !pte say about the address?
Yep.
…
Loading User Symbols
PEB is paged out (Peb.Ldr = 0000005d2c131018). Type ".hh dbgerr001" for details
Loading unloaded module list
...............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff805`12bfa110 48894c2408 mov qword ptr [rsp+8],rcx ss:0018:ffffd081`28155920=0000000000000050
3: kd> l+t
Source options are 1:
1/t - Step/trace by source line
3: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffd4831483e0d0, memory referenced.
Arg2: 0000000000000000, X64: bit 0 set if the fault was due to a
not-present PTE.
bit 1 is set if the fault was due to a write, clear if a read.
bit 3 is set if the processor decided the fault was due to a corrupted PTE.
bit 4 is set if the fault was due to attempted execute of a no-execute PTE.
Debugging Details:
Can you provide the faulting instruction and the output of
!pte ffffd4831483e0d0
Ah, it got cut as part of the Quote on the forum…
fffff805`12eb2490 498b0a mov rcx,qword ptr [r10]
kd> !pte ffffd4831483e0d0
VA ffffd4831483e0d0
PXE at FFFF964B2592CD48    PPE at FFFF964B259A9060    PDE at FFFF964B3520C520    PTE at FFFF966A418A41F0
contains 0A00000004F40863  contains 0A00000005043863  contains 0A0000010AE7F863  contains 8A000001186E9A63
pfn 4f40 ---DA--KWEV       pfn 5043 ---DA--KWEV       pfn 10ae7f ---DA--KWEV     pfn 1186e9 C--DA--KW-V
It looks like the PTE for the address has the Copy on Write bit set, which is supremely strange for a kernel address. Is this really just non-paged pool allocated with ExAllocatePool (or some variation)?
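For anyone wanting to double-check the flag string, it can be recomputed from the raw PTE value. This is only a minimal Python sketch: it assumes the usual x64 PTE layout (architectural bits 0-8 and NX in bit 63) and assumes that software bit 9 is the one shown as "C" (copy on write). The names `decode_pte` and `FLAG_BITS` are made up for illustration and are not part of any debugger API.

```python
# Hypothetical helper: rebuild a !pte-style flag string from a raw x64 PTE value.
# Bits 0-8 and 63 are architectural. Treating software bit 9 as "copy on write"
# is an assumption about how Windows uses that bit.
FLAG_BITS = [
    (9, "C"),  # software bit, assumed to mean copy on write
    (8, "G"),  # global
    (7, "L"),  # large page (PAT bit in a 4 KB PTE)
    (6, "D"),  # dirty
    (5, "A"),  # accessed
    (4, "N"),  # cache disabled
    (3, "T"),  # write-through
]


def decode_pte(value):
    """Return (flag_string, pfn) for a 64-bit PTE value."""
    flags = "".join(ch if (value >> bit) & 1 else "-" for bit, ch in FLAG_BITS)
    flags += "U" if (value >> 2) & 1 else "K"   # user vs kernel page
    flags += "W" if (value >> 1) & 1 else "R"   # hardware writable vs read-only
    flags += "-" if (value >> 63) & 1 else "E"  # NX set means not executable
    flags += "V" if value & 1 else "-"          # valid (present)
    pfn = (value >> 12) & ((1 << 36) - 1)       # assumes a 36-bit PFN field
    return flags, pfn


flags, pfn = decode_pte(0x8A000001186E9A63)
print(flags, hex(pfn))  # C--DA--KW-V 0x1186e9
```

With the values from the dump, bit 9 is set only in the final PTE (0x...A63 versus 0x...863 at the upper levels), which is where the "C" comes from.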
It’s a real NPP allocation, and dc/dq show the expected data.
Dejan.
Is this case reproducible, or was it a one-time event? If it was a one-time event, have you considered the possibility of memory corruption, not necessarily because of faulty RAM but possibly because of overheating or a faulty power supply?
It is a one time thing, never saw anything similar in any driver, not just this one. Faulty RAM - maybe. MemCorruption - I sincerely don’t see how.
Of course, faulty RAM also leads to memory corruption. However, when a RAM test is performed afterwards, it often finds nothing because the RAM is no longer at the temperature it was at when the bug occurred.
Most RAM chips are not equipped with thermal sensors and become erratic above a certain temperature. While the system keeps running, overheated RAM can cause all sorts of weirdness (most often in a hot path) before the CPU is shut off or even throttled back, because the CPU itself is still under its thermal trip point.
This can also explain why a physical page can contain wrong data and later turn back to normal, after the temperature has gone down. An unstable or faulty power supply can cause the same behavior. This is not to say that these problems are the cause, but they are things to think about.
It may be left at that
I’d say an NPP allocation having a PTE marked as copy on write is weird enough to blame something like a bit flip. Not very satisfying, for sure… I’d check the various outputs of !sysinfo (smbios in particular) just to categorize/bucket the machine, and move on unless it starts to overlap with other issues.
Consumer-grade hardware is much more susceptible to errors than server-grade hardware. And it need not even be a ‘permanent’ bit flip: it could be a transient error in the memory circuit. Issues of this sort are almost impossible to pin down.