Find which CPU core is stuck

tsuzing · May 9, 2025, 3:09pm

I'm not sure whether this is a hardware fault or a driver/kernel bug. The machine is a Windows 11 PC with an AMD 5950X, UDIMM ECC RAM, and a 7900XTX GPU, and I use it to run Hyper-V and LM Studio. For some time now it has been crashing at random under high GPU load: the screen freezes, USB becomes unresponsive, and RDP can't connect, but the network, sound, and most processes keep running. I can still connect over SSH and access the NVMe disk, but any access to the AHCI disk hangs. This state lasts from a few seconds to about two minutes, then the machine restarts with no memory dump, no BSOD code, and no event log entry.

I have tested the RAM with memtest86 and the CPU with CoreCycler, moved the GPU to another computer, and replaced the PSU. None of that solved the problem, so I replaced all of the hardware.

I'm curious: what kind of fault can make some hardware hang while other hardware keeps working normally? I suspect that one core's clock has stopped (as in the Intel Cascade Lake bug), or that a core dies while handling an interrupt.

I can easily set up WinDbg remote debugging over the network, but I don't know much about Windows kernel development. I'm not sure WinDbg will still work if a CPU core is stuck; maybe it would hang too, or disconnect? But I should at least try to find out which CPU core is stuck, and preferably get its stack.

If x86 had an on-chip debugger like an MCU does (maybe on engineering samples?), weird errors like this might be much easier to analyze.

tsuzing · May 9, 2025, 3:25pm

I have measured the voltages of the PSU, CPU, RAM, GPU, PCH, etc. with an oscilloscope, and none of them are abnormal. I don't understand why the crash always coincides with high GPU load when the voltages are normal.

Tim_Roberts · May 9, 2025, 5:09pm

This kind of thing is often caused by bugs in the graphics driver, which has been a key point of failure for decades. Have you ensured your graphics driver is up to date? If you use the kernel debugger, you can query each CPU separately to see where it is executing.
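
For reference, once WinDbg has broken in during a freeze, `~3s` switches to processor 3 and `k` then prints its stack. `!running -ti` lists the running thread and stack on every processor, idle ones included.

Since SSH and the NVMe disk reportedly stay usable during the freeze, a debugger-free cross-check might also help. The following is only a rough sketch, not a tested diagnostic: one small process is pinned to each logical core and appends a timestamp to its own log file on the NVMe disk, and after the restart the logs show which cores stopped beating first. The function names, the `coreNN.log` file naming, and the 0.5 s / 2 s thresholds are all assumptions made for illustration. It also assumes that a stuck core stops scheduling its pinned process, which may not hold for every kind of hang.

```python
import ctypes
import os
import sys
import time
from pathlib import Path


def pin_to_core(core: int) -> None:
    """Restrict the calling thread to a single logical core."""
    if sys.platform == "win32":
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentThread.restype = ctypes.c_void_p
        kernel32.SetThreadAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        kernel32.SetThreadAffinityMask.restype = ctypes.c_size_t
        # SetThreadAffinityMask returns the previous mask, or 0 on failure.
        if not kernel32.SetThreadAffinityMask(kernel32.GetCurrentThread(), 1 << core):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        os.sched_setaffinity(0, {core})  # Linux fallback, handy for dry runs


def heartbeat(core: int, log_dir: Path, interval: float = 0.5) -> None:
    """Run forever on one core, appending one wall-clock timestamp per beat."""
    pin_to_core(core)
    with open(log_dir / f"core{core:02d}.log", "a") as log:
        while True:
            log.write(f"{time.time():.3f}\n")
            log.flush()
            os.fsync(log.fileno())  # push the beat to disk before the hard reset
            time.sleep(interval)


def last_beats(log_dir: Path) -> dict[int, float]:
    """Map each core number to the last complete timestamp in its log."""
    beats = {}
    for path in sorted(log_dir.glob("core*.log")):
        text = path.read_text()
        # Drop a torn final line (no trailing newline) left by the reset.
        complete = text[: text.rfind("\n") + 1].split()
        if complete:
            beats[int(path.stem[4:])] = float(complete[-1])
    return beats


def stalled_cores(beats: dict[int, float], max_gap: float = 2.0) -> list[int]:
    """Cores whose last beat trails the newest beat by more than max_gap seconds."""
    if not beats:
        return []
    newest = max(beats.values())
    return sorted(core for core, t in beats.items() if newest - t > max_gap)
```

To use it, start one process per logical core, each calling `heartbeat(n, Path(...))` with a directory on the NVMe disk (the AHCI disk hangs during the freeze, so logging there would stall the writers). After the machine restarts, `stalled_cores(last_beats(Path(...)))` returns the cores that went silent well before the others. Separate processes are used rather than threads so that one frozen writer cannot block the rest through Python's global interpreter lock.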