BSOD After enabling RETPOLINE mitigation

Hello,

I have come across an issue recently while working on an EDR driver and am hoping someone may have some insight as I am at a loss for the cause.

If I enable RETPOLINE mitigations (specifically adding the /d2guardretpoline flag to the compiler and /guard:retpoline to the linker) then I consistently get a bugcheck from PatchGuard.
The bugcheck details are:

*** Fatal System Error: 0x00000109
                       (0xA39FE75C730E0508,0xB3B6F3E2C58A8ED5,0x0000000000000006,0x0000000000000018)

The final parameter, 0x18, is documented as “Kernel notification callout modification”.

The driver is using a lot of different callback mechanisms provided by the kernel, including process, image, thread, container, power and a few internally defined ExCallbacks. This leaves me with a lot of code to inspect, as the third bugcheck parameter is not documented. If anyone has any knowledge of what the 0x6 parameter points to I would be very grateful.

The obvious solution to my problem is to not enable RETPOLINE, and I have reverted the changes, but would still like to try and understand why enabling the mitigation would cause a PatchGuard bugcheck. Has anyone had a similar experience or any recommendations for debugging the problem (ideally short of reversing PatchGuard :wink: )?

One thing worth noting: after enabling the mitigation I did have to ignore a few linker warnings. The warnings related to missing retpoline metadata in object files generated from assembly source. I would have thought this would be harmless, but perhaps not.

Thanks,
Niall

Those are not compiler / linker options that I’m familiar with, and looking at the documentation, I can’t find them either. Which compiler are you using?

As far as I understand, the idea of retpoline is to replace indirect branch instructions with a ret-based sequence, so that the branch prediction logic in the CPU cannot be steered to an attacker-chosen target and any speculation stalls harmlessly, thus avoiding a possible inference of memory contents. Presumably, returning to addresses that aren’t after corresponding call instructions is something that PatchGuard specifically looks for, since it is typical of a buffer overrun.

We have discussed this at least once before, but it is worth mentioning again that when Spectre etc. were ‘discovered’, the possibility of inter-process interference was already well known. Every time a context switch happens between threads of different processes, the CPU caches will, at a minimum, be in an unhelpful state for the new process.

It is also worth noting that the viability of a successful attack based on this approach in the wild is doubtful. A great many things have to be just right in order to find out the bit values, and it is quite slow. Relying on timing differences necessarily means that the specific CPU model, chipset type, RAM details etc. matter. Sure, this can be done in a lab, and probably the NSA can do it when attacking a specific target, but it is not a general purpose kind of attack.

I’m using the MSVC toolchain, but the flags are not formally documented anywhere. I went looking for them after watching a talk by Andrea Allievi at Blue Hat 2018 about the Windows kernel’s implementation of Retpoline. It seems that, for the sake of performance, MS want drivers to explicitly support retpoline but provide no information on how to do so. The closest thing to documentation I have read is a blog article titled “Mitigating Spectre variant 2 with Retpoline on Windows” on the kernel internals blog.

The reason for looking to support this at all is that I found a significant performance penalty when running on older processors with no support for the hardware-based/microcode Spectre mitigations.

But I take your point about the unlikelihood of a practical attack using Spectre. Given the issues I have faced, I think I will likely just drop it entirely.

Thank you for your input!

I’m glad to know I’m not crazy and they are in fact undocumented options.

IMHO the high performance cost and the low probability of an effective exploit are a big part of why very little has been done about this. Combine that with the availability of newer hardware, and I think your decision is sound.