Pf in NonPaged area - but memory is valid

Dejan_Maksimovic · April 14, 2023, 10:03am

Hello,

Got a very puzzling BugCheck. Irql 0, PAGE_FAULT_IN_NONPAGED_AREA, but
the memory address in argument 1 is completely valid (!pool sees it as
allocated NP pool and I can read it in the dump).

Anyone seen a similar case?

Regards, Dejan.

Scott_Noone_OSR · April 24, 2023, 3:18pm

Can you provide the !analyze output? Also, what does !pte say about the address?

Dejan_Maksimovic · April 24, 2023, 3:40pm

Yep.

…
Loading User Symbols
PEB is paged out (Peb.Ldr = 0000005d2c131018). Type ".hh dbgerr001" for details
Loading unloaded module list
...............
For analysis of this file, run !analyze -v
nt!KeBugCheckEx:
fffff80512bfa110 48894c2408 mov qword ptr [rsp+8],rcx ss:0018:ffffd081`28155920=0000000000000050
3: kd> l+t
Source options are 1:
1/t - Step/trace by source line
3: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffd4831483e0d0, memory referenced.
Arg2: 0000000000000000, X64: bit 0 set if the fault was due to a
not-present PTE.
bit 1 is set if the fault was due to a write, clear if a read.
bit 3 is set if the processor decided the fault was due to a corrupted PTE.
bit 4 is set if the fault was due to attempted execute of a no-execute PTE.

ARM64: bit 1 is set if the fault was due to a write, clear if a read.
bit 3 is set if the fault was due to attempted execute of a no-execute PTE.
Arg3: fffff80512eb2490, If non-zero, the instruction address which
referenced the bad memory
address.
Arg4: 0000000000000002, (reserved)

Debugging Details:
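
For readers following along, the Arg2 description quoted above can be turned into a small decoder. This is only a sketch of the bit meanings as the !analyze text states them; the function name and the dictionary keys are made up for illustration and are not part of any debugger API.

```python
def decode_bugcheck50_arg2(arg2: int, arch: str = "x64") -> dict:
    """Decode Arg2 of bugcheck 0x50 using the bit meanings quoted by !analyze.

    Hypothetical helper: the key names are illustrative only.
    """
    if arch == "x64":
        return {
            "not_present_pte": bool(arg2 & (1 << 0)),  # bit 0
            "write": bool(arg2 & (1 << 1)),            # bit 1, clear means read
            "corrupted_pte": bool(arg2 & (1 << 3)),    # bit 3
            "execute_of_nx": bool(arg2 & (1 << 4)),    # bit 4
        }
    if arch == "arm64":
        return {
            "write": bool(arg2 & (1 << 1)),            # bit 1, clear means read
            "execute_of_nx": bool(arg2 & (1 << 3)),    # bit 3
        }
    raise ValueError(f"unsupported architecture: {arch}")


# Arg2 from the dump in this thread.
print(decode_bugcheck50_arg2(0x0))
```

With Arg2 = 0, every flag comes back False: the fault was a read, and none of the other listed conditions are reported.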

Scott_Noone_OSR · April 24, 2023, 7:35pm

Can you provide the faulting instruction and the output of

!pte ffffd4831483e0d0

Dejan_Maksimovic · April 24, 2023, 8:06pm

Ah, it got cut as part of the Quote on the forum…

fffff805`12eb2490 498b0a mov rcx,qword ptr [r10]

kd> !pte ffffd4831483e0d0
VA ffffd4831483e0d0
PXE at FFFF964B2592CD48 PPE at FFFF964B259A9060 PDE at FFFF964B3520C520 PTE at FFFF966A418A41F0
contains 0A00000004F40863 contains 0A00000005043863 contains 0A0000010AE7F863 contains 8A000001186E9A63
pfn 4f40 ---DA--KWEV pfn 5043 ---DA--KWEV pfn 10ae7f ---DA--KWEV pfn 1186e9 C--DA--KW-V

Scott_Noone_OSR · April 25, 2023, 7:38pm

It looks like the PTE for the address has the Copy on Write bit set, which is supremely strange for a kernel address. Is this really just non-paged pool allocated with ExAllocatePool (or some variation)?
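
To make the "Copy on Write bit" observation concrete, here is a minimal sketch that decodes the four raw entry values from the !pte output into flag letters and page frame numbers. The bit positions are an assumption on my part: the x64 hardware page-table bits, plus bit 9 treated as the copy-on-write software bit. Nothing in the thread states the layout, and the function names are hypothetical.

```python
def pte_flags(pte: int) -> str:
    """Render an x64 page-table entry as an 11-letter flag string.

    Assumed layout (not stated in the thread): hardware bits 0-8 and 63,
    with bit 9 interpreted as the copy-on-write software bit.
    """
    def bit(n: int) -> bool:
        return bool((pte >> n) & 1)

    return "".join([
        "C" if bit(9) else "-",   # copy on write (software bit, assumed)
        "G" if bit(8) else "-",   # global
        "L" if bit(7) else "-",   # large page
        "D" if bit(6) else "-",   # dirty
        "A" if bit(5) else "-",   # accessed
        "N" if bit(4) else "-",   # cache disabled
        "T" if bit(3) else "-",   # write through
        "U" if bit(2) else "K",   # user or kernel
        "W" if bit(1) else "R",   # writable or read-only
        "-" if bit(63) else "E",  # no-execute set means not executable
        "V" if bit(0) else "-",   # valid
    ])


def pte_pfn(pte: int) -> int:
    """Extract the page frame number from bits 12..51."""
    return (pte >> 12) & ((1 << 40) - 1)


# Raw values copied from the !pte output posted above.
entries = {
    "PXE": 0x0A00000004F40863,
    "PPE": 0x0A00000005043863,
    "PDE": 0x0A0000010AE7F863,
    "PTE": 0x8A000001186E9A63,
}

for name, value in entries.items():
    print(f"{name}: pfn {pte_pfn(value):x} {pte_flags(value)}")
```

Under those assumptions, the leaf PTE is the only entry that decodes with a leading C, and it is also the only one with the no-execute bit set, which is why its E letter is missing.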

Dejan_Maksimovic · April 25, 2023, 9:18pm

It’s a real NPN, and dc/dq show the expected data.

Scott_Noone_OSR · April 26, 2023, 10:06pm

1. Are you running under a hypervisor?
2. Have you looked for a race condition? It’s possible the address became valid between the invalid reference and the crash dump (VERY small window, but I’ve seen it).

Dejan_Maksimovic · April 27, 2023, 11:35am

No, this is a dump I got from MS from a regular machine.
I am sure the allocation does not change for as long as the driver is
loaded (i.e. not until reboot or reload). It is static: once allocated, it
won’t change.

Dejan.

Daniel_Terhell · April 28, 2023, 8:05pm

Is this case reproducible, or was it a one-time event? If it was a one-time event, have you considered the possibility of memory corruption, not necessarily because of faulty RAM but possibly because of overheating or a faulty power supply?

Dejan_Maksimovic · April 29, 2023, 10:48am

It is a one time thing, never saw anything similar in any driver, not just this one. Faulty RAM - maybe. MemCorruption - I sincerely don’t see how.

Daniel_Terhell · April 29, 2023, 1:00pm

Of course faulty RAM also leads to memory corruption. However, when a RAM test is performed afterwards, it often yields nothing because the RAM is no longer at the temperature it was at when the bug occurred.
Most RAM chips are not equipped with thermal sensors, and they become erratic above a certain temperature. While the system keeps running, overheated RAM can cause all sorts of weirdness (most often in a hot path) before the CPU is shut off or even throttled back, because the CPU itself is still under its thermal trip point.
It can also explain why a physical page can contain wrong data and later turn back to normal, after the temperature has gone down. An unstable or faulty power supply can cause this behavior too. Not to say that these problems are the cause, but they are things to think about.

Dejan_Maksimovic · April 29, 2023, 6:10pm

It may be left at that

Scott_Noone_OSR · April 30, 2023, 8:09pm

I’d say an NPP allocation having a PTE marked as copy on write is weird enough to blame something like a bit flip. Not very satisfying for sure…I’d check various outputs of !sysinfo (smbios in particular) just to categorize/bucket the machine and move on until it started to overlap with other issues.
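
As a rough illustration of the bit-flip idea, the sketch below XORs the low 12 flag bits of the leaf PTE against those of the PDE from the posted !pte output and lists the bit positions that differ. Using the PDE as a stand-in for what the leaf entry "should" have looked like is purely an assumption for illustration, and the helper name is hypothetical.

```python
from typing import List


def differing_bits(a: int, b: int) -> List[int]:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [n for n in range(64) if (diff >> n) & 1]


# Raw values copied from the !pte output earlier in the thread.
pde = 0x0A0000010AE7F863
pte = 0x8A000001186E9A63

# Compare only the low 12 flag bits; the PFN and the no-execute bit are
# expected to differ between a PDE and a pool PTE, so they are masked out.
LOW12 = 0xFFF
print(differing_bits(pde & LOW12, pte & LOW12))  # [9]
```

The flag bits differ in exactly one position, bit 9. That is consistent with a single flipped bit but does not prove one.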

MBond2 · April 30, 2023, 9:24pm

Consumer-grade hardware is much more susceptible to errors than server-grade hardware. And it need not even be a ‘permanent’ bit flip: it could be a transient error in the memory circuit. Issues of this sort are almost impossible to pin down.