1. Is IRQL 2 (DISPATCH_LEVEL) sufficient to protect FP operations from interrupt interference?
2. I've heard raising CR8 to 15 (maximum) disables the kernel's floating-point state management subsystem - is this correct?
3. Are there architectural implications or caveats when using this approach?
Where did you find this abomination? That's NOT the right way to raise the IRQL. There are well-defined APIs that will work cross-platform, if that was your goal. What, exactly, are you trying to prevent?
You shouldn't have to use assembler at all in a kernel driver. Your code will not work on an ARM64 system, of which there are many.
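For the record, the documented Windows way to do this is KeRaiseIrql/KeLowerIrql, plus KeSaveFloatingPointState/KeRestoreFloatingPointState if you insist on doing FP in kernel mode. A minimal sketch (WDK kernel-mode only, so not something you can compile standalone; the function name is mine):

```c
#include <wdm.h>

/* Sketch: the documented pattern for guarding a short region at
 * DISPATCH_LEVEL. KeSaveFloatingPointState/KeRestoreFloatingPointState
 * is the portable, documented way to claim FP state in kernel mode,
 * instead of poking CR8 directly. */
VOID DoProtectedFpWork(VOID)
{
    KFLOAT_SAVE floatSave;
    KIRQL oldIrql;

    if (NT_SUCCESS(KeSaveFloatingPointState(&floatSave))) {
        KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

        /* ...the short FP sequence goes here... */

        KeLowerIrql(oldIrql);
        KeRestoreFloatingPointState(&floatSave);
    }
}
```

Note that DISPATCH_LEVEL only masks the dispatcher and lower-priority DPCs; device interrupts can still arrive, which is exactly why the OS saves and restores FP state for you on a context switch.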
I appreciate your concern about standard APIs, but I'm working with a specific technical constraint. These floating-point operations (CVTSI2SD, ADDSD, XORPD) require protection from interrupts within an extremely narrow execution window (typically 12-14 cycles) to maintain IEEE-754 compliance, particularly for operations involving denormal handling.
The standard kernel APIs introduce function call overhead that exceeds the actual execution time of these critical FP operations. Direct CR8 manipulation provides the minimal-latency protection window necessary for these timing-sensitive operations.
I understand the cross-platform limitations - this is specifically for x86-64 systems where I'm handling performance-critical floating-point calculations. If you have examples of alternative protection mechanisms for these specific operations that maintain the same execution guarantees, I'd genuinely appreciate seeing them.
What do you think you are going to do with floating-point registers that either can be done with the level of timing precision you are talking about, or where that level of precision matters?
For any computation to be useful, you will have to load some value into at least one register and store a result, probably using memory locations - which involves the cache. Never mind the power state of the CPU (and other parts of the system), or hyper-threading. Or, on modern machines, the P-core/E-core differences and chiplet interconnects.
And if your floating point code contains even one branch, a misprediction could dominate.
If you are trying to reduce the total cost of performing this calculation, and think that a single function call's overhead is significant, then you probably have a design problem. Whatever you are counting or timing should probably use integers, and the floating-point part (average, standard deviation, etc.) should be calculated later in a non-critical path. Otherwise, you are probably not correct that the overhead is significant.
It should also be noted that performance- and numerical-stability-critical arithmetic is usually done with fixed-point rather than floating-point calculations, usually with scaled integers. IEEE floating-point operations can also be done using the CPU's integer registers; avoid division and mismatched scales (the exponent portion) and you can approximate the performance of the dedicated FP hardware.
In any case, if you want to rely on a precise time taken for a floating-point sequence, it won't be reliable, because that's not how the hardware is designed to work.