I'm not sure whether this is a hardware fault or a driver/kernel bug. The machine is a Windows 11 PC with an AMD 5950X, UDIMM ECC RAM, and a 7900 XTX GPU, and I use it to run Hyper-V and LM Studio. For a while now it has been crashing randomly under high GPU load: the screen freezes, USB becomes unresponsive, and RDP can't connect, but the network, sound, and most processes keep running. I can still connect over SSH and access the NVMe disk, but any access to the AHCI disk hangs. This state lasts anywhere from a few seconds to about two minutes, and then the machine restarts with no memory dump, no BSOD code, and no event log entry.
I have tested the RAM with memtest86, tested the CPU with CoreCycler, moved the GPU to another computer, and replaced the PSU. None of that solved it, so I replaced the entire hardware.
I'm curious: what kind of error can leave some hardware stuck while other hardware keeps working normally? I suspect that the clock of one core has stopped (like the Intel Cascade Lake bug), or that a core dies inside an interrupt?
I can easily set up WinDbg remote debugging over the network, but I don't know much about Windows kernel development. I'm not sure WinDbg will still work if a CPU core is stuck; maybe WinDbg would get stuck too, or disconnect? But I should at least try to find out which CPU core is stuck, and preferably get its stack.
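Separately from the debugger, here is a minimal sketch of a user-mode way to narrow down which logical CPU stops first. It assumes that SSH sessions and NVMe writes keep working during the freeze, as described above. One process is pinned to each logical CPU and appends a timestamp to its own file; after the restart, the files show which CPUs went quiet before the others. All function names, the log layout, and the 0.1 s interval are my own illustrative choices, and this only shows that a pinned process stopped being scheduled. It cannot give a kernel stack, and with Hyper-V enabled the logical CPUs the host sees may not map exactly onto physical cores.

```python
import multiprocessing as mp
import os
import sys
import time


def pin_to_cpu(cpu: int) -> None:
    """Restrict the current process to a single logical CPU."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.GetCurrentProcess.restype = wintypes.HANDLE
        k32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        k32.SetProcessAffinityMask.restype = wintypes.BOOL
        # Bit N of the mask selects logical CPU N (single processor group only).
        if not k32.SetProcessAffinityMask(k32.GetCurrentProcess(), 1 << cpu):
            raise ctypes.WinError(ctypes.get_last_error())
    elif hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})


def heartbeat(cpu: int, log_dir: str, interval: float = 0.1) -> None:
    """Run forever on one CPU, appending a timestamp every `interval` seconds."""
    pin_to_cpu(cpu)
    path = os.path.join(log_dir, f"cpu{cpu:02d}.log")
    with open(path, "a") as f:
        while True:
            f.write(f"{time.time():.3f}\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before the next beat
            time.sleep(interval)


def start_heartbeats(log_dir: str) -> list:
    """Start one pinned heartbeat process per logical CPU."""
    os.makedirs(log_dir, exist_ok=True)
    procs = [
        mp.Process(target=heartbeat, args=(cpu, log_dir), daemon=True)
        for cpu in range(os.cpu_count() or 1)
    ]
    for p in procs:
        p.start()
    return procs


def last_beats(log_dir: str) -> dict:
    """Return {cpu: last complete timestamp} read from the heartbeat files."""
    beats = {}
    for name in os.listdir(log_dir):
        if name.startswith("cpu") and name.endswith(".log"):
            with open(os.path.join(log_dir, name)) as f:
                # Drop a torn final line left behind by the hard reset, if any.
                complete = f.read().split("\n")[:-1]
            if complete:
                beats[int(name[3:-4])] = float(complete[-1])
    return beats


def stalled_first(beats: dict, gap: float = 1.0) -> list:
    """CPUs whose last beat is more than `gap` seconds older than the newest beat."""
    newest = max(beats.values())
    return sorted(cpu for cpu, t in beats.items() if newest - t > gap)
```

To use it, call `start_heartbeats()` with a directory on the NVMe disk from inside an `if __name__ == "__main__":` guard (required for `multiprocessing` on Windows) and keep the parent process alive, for example by joining the returned processes. After the next crash, `stalled_first(last_beats(log_dir))` lists the CPUs that stopped logging noticeably earlier than the rest.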
If x86 had an on-chip debugger like MCUs do (maybe on engineering samples?), weird errors like this might be much easier to analyze.