SIMD operations & KeSaveFloatingPointState

Hello,

I have set MSC_OPTIMIZATIONS to /Ox (all optimizations, favour speed) to see what difference it makes and noticed a very strange behaviour.
Additionally, I have set /arch:SSE2 to make use of SSE/2 instructions.

As expected, the compiler unrolled a few loops, but for some of them it used MMX instructions (namely PXOR & MOVQ) instead of SSE2 ones.

Normally, I wouldn’t care, but according to Microsoft’s article “Using Floating Point or MMX in a WDM Driver”, all MMX (etc.) operations must be wrapped between KeSaveFloatingPointState & KeRestoreFloatingPointState (for x86-32 builds), which is now obviously missing (PREfast doesn’t complain either).

Is this safe?

I am not sure I know what I am talking about, but common sense tells me that
since the compiler decides by itself when and where to use MMX/SSE
instructions, this is outside your control, so you cannot guard against it
either. The compiler has to take care, for instance, not to mix FPU and MMX
instructions without the usual precautions (such as an EMMS instruction).
Since this is beyond your control, I assume it is not your responsibility.
This is separate, though, from the normal rules you have to follow in any
case when you use floating point instructions in your driver.

To me it looks like drivers have to pay an extra price, because they have to
save the FP state at least one extra time. FP state is already saved across
context switches (right?), which is a price that has to be paid anyway.
The FSAVE and FRSTOR instructions, which save and restore the FP state, are
very expensive and take some 300 clock cycles together according to the AMD
optimization guide.

/Daniel

The compiler decides by itself when and where to apply these optimizations.
Since it is not under your control which instructions are used and where, it
is the compiler that has to take responsibility, for example by making sure
FPU and MMX instructions are not mixed without the necessary precautions.

For you I think nothing changes if you were using floating point before. You
were using floating point instructions anyway, so you have to take your own
measures; the rest should be the responsibility of the compiler. You need to
save and restore the floating point state according to the rules, which is
heavy duty (some 300 clock cycles together according to the AMD optimization
guide).


Daniel Terhell wrote:

To me it looks like drivers have to pay an extra price, because they have to
save the FP state at least one extra time. FP state is already saved across
context switches (right?), which is a price that has to be paid anyway.
The FSAVE and FRSTOR instructions, which save and restore the FP state, are
very expensive and take some 300 clock cycles together according to the AMD
optimization guide.

I have used RDTSC to benchmark KeSaveFloatingPointState & KeRestoreFloatingPointState at DISPATCH_LEVEL (of course, I have taken care of the other influences an RDTSC measurement might face).
Together, the two calls cost roughly 800 to 1200 cycles (Pentium M).

It’s a shame that you have to call them in the first place, because there’s actually no reason why floating-point operations should be any harder than integer ones (of course, floating-point stuff has its own quirks).

On architectures where floating-point support was not an afterthought, it just works without any hassles.

This isn’t a problem with the machine architecture; it’s rather an OS
choice.

Think about it. There’s only one set of floating point registers, and there
are lots of them. Saving and restoring them can be a time-consuming chore.
Besides the registers themselves, there are some global floating point
entities that need to be saved and restored at every context switch. As you
found out yourself, saving and restoring the floating point state can be
pretty expensive. Look at your Pentium: you have the original x87 registers,
the 64-bit MMX registers (which alias the x87 ones), and the 128-bit SIMD
set to deal with. Are you surprised that it takes that long to switch
states?

In fact, the OS doesn’t save and restore the floating point state every time
a context switch happens. The OS normally uses what’s called a “lazy”
context switch: the floating point context is only switched when the first
floating point instruction after the context switch executes. That is, in
most cases the FP context is simply not switched, because there’s no need.
That saves a lot of context switch time!

To do that, OSes often rely on temporarily disabling floating point
operations at context switch time, so that the first FP operation after a
context switch raises an “illegal FP operation” kind of exception, which
triggers the OS to perform the FP context switch there and then.
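
The lazy scheme described above can be illustrated with a small toy model.
This is only a sketch under simplifying assumptions: the CPU has a single FP
register, and every name in it (Thread, Cpu, context_switch,
fp_unavailable_trap, fp_add) is made up for illustration. It is not how any
particular OS implements this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are Windows or CPU APIs. */
typedef struct {
    int    id;
    double saved_fp;     /* this thread's FP register image in memory */
} Thread;

typedef struct {
    Thread *running;     /* thread currently on the CPU */
    Thread *fp_owner;    /* thread whose value is live in the FP register */
    bool    fp_disabled; /* models "FP disabled" being set at switch time */
    double  fp_reg;      /* the single FP register of this toy CPU */
    int     fp_swaps;    /* how often FP state was actually swapped */
} Cpu;

/* Context switch: FP state is left alone; FP use is merely disabled
   unless the incoming thread already owns the FP register. */
void context_switch(Cpu *cpu, Thread *next)
{
    cpu->running = next;
    cpu->fp_disabled = (cpu->fp_owner != next);
}

/* Runs on the first FP instruction after a switch: only now is the
   previous owner's state saved and the running thread's state loaded. */
void fp_unavailable_trap(Cpu *cpu)
{
    assert(cpu->running != NULL);
    if (cpu->fp_owner != NULL)
        cpu->fp_owner->saved_fp = cpu->fp_reg;
    cpu->fp_reg = cpu->running->saved_fp;
    cpu->fp_owner = cpu->running;
    cpu->fp_disabled = false;
    cpu->fp_swaps++;
}

/* A stand-in for an FP instruction: trap first if FP is disabled. */
void fp_add(Cpu *cpu, double x)
{
    if (cpu->fp_disabled)
        fp_unavailable_trap(cpu);
    cpu->fp_reg += x;
}
```

In this model, switching from thread A to a thread B that never touches
floating point and then back to A costs no FP save or restore at all; the
swap counter only goes up when a different thread actually executes an FP
operation.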

Now what would happen if that FP operation was issued on the kernel side? If
for any reason interrupts are disabled, disaster ensues. If you have
interrupts on, you still have to handle the FP context switch, but now you’re
running in kernel space, which can make some aspects of the context switch
difficult to handle. Even if interrupts aren’t disabled, something as simple
as running at a high enough IRQL would make things hard to handle.

Sometimes it’s hard not to use FP on the kernel side, but you have to handle
it right lest your driver cause applications to misbehave. When I wrote
OpenGL ICDs for a living, using FP on the kernel side was a requirement, so I
did save and restore the context. To alleviate the overhead, the user-side
OpenGL library batched requests and issued one single IOCTL to the driver.
The driver would save the FP state, perform all the requests, restore the FP
state, and leave.

And even then, we tested it to exhaustion, with lots of FP-intensive
applications, just to make sure.
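
The batching idea can be sketched roughly as follows. This is a user-mode
illustration, not driver code: fp_state_save and fp_state_restore are
hypothetical stand-ins for KeSaveFloatingPointState and
KeRestoreFloatingPointState, and the Request layout and process_batch are
invented for the example. The point is only that the expensive save/restore
pair is paid once per batch rather than once per request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins. A real x86-32 driver would call
   KeSaveFloatingPointState / KeRestoreFloatingPointState with a
   KFLOATING_SAVE here instead. */
typedef struct { int valid; } FpSaveArea;

int g_saves;     /* counts save calls, to show the amortization */
int g_restores;  /* counts restore calls */

int fp_state_save(FpSaveArea *s)
{
    s->valid = 1;
    g_saves++;
    return 0;    /* 0 means success in this sketch */
}

void fp_state_restore(FpSaveArea *s)
{
    assert(s->valid);
    s->valid = 0;
    g_restores++;
}

/* Invented request format: scale a 2D point. */
typedef struct {
    float x, y, scale;   /* inputs */
    float out_x, out_y;  /* results */
} Request;

/* Handle a whole batch under a single save/restore pair. */
int process_batch(Request *reqs, size_t count)
{
    FpSaveArea save;
    size_t i;

    if (fp_state_save(&save) != 0)
        return -1;       /* no FP work may be done if the save fails */

    for (i = 0; i < count; i++) {
        reqs[i].out_x = reqs[i].x * reqs[i].scale;
        reqs[i].out_y = reqs[i].y * reqs[i].scale;
    }

    fp_state_restore(&save);
    return 0;
}
```

With the 800 to 1200 cycles per save/restore pair measured earlier in this
thread, a batch of N requests pays that cost once instead of N times.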

The best advice I have is that Pentium-like architectures are fastest when we
treat them as RISC machines. Less is more - you want to avoid complex
instructions, floating point included, as much as you can! For example, in
some cases using fixed point numbers is faster than using floating point.
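
As an illustration of the fixed point alternative, here is a minimal sketch
of 16.16 fixed point arithmetic. The fix16 type and helper names are made up
for this example. It uses integer instructions only, so no FP state has to be
saved around it; whether it is actually faster depends on the workload.

```c
#include <assert.h>
#include <stdint.h>

/* 16.16 fixed point: upper 16 bits integer part, lower 16 bits fraction.
   Hypothetical helper names, chosen for this sketch only. */
typedef int32_t fix16;

#define FIX16_ONE 65536  /* the value 1.0 in 16.16 format */

/* Convert an integer in the range -32768..32767 to fixed point. */
fix16 fix16_from_int(int32_t n)
{
    return (fix16)(n * FIX16_ONE);
}

/* Truncate towards negative infinity. This assumes an arithmetic right
   shift for negative values, which common compilers provide. */
int32_t fix16_to_int(fix16 a)
{
    return a >> 16;
}

/* Multiply: widen to 64 bits so the intermediate product cannot
   overflow, then drop the extra 16 fraction bits. */
fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Divide: pre-scale the dividend by 1.0 so the quotient keeps
   its 16 fraction bits. */
fix16 fix16_div(fix16 a, fix16 b)
{
    assert(b != 0);
    return (fix16)(((int64_t)a * FIX16_ONE) / b);
}
```

For example, 1.5 is stored as 0x18000 and 2.25 as 0x24000; fix16_mul of the
two gives 0x36000, which is 3.375.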

Alberto.

----- Original Message -----
From:
To: “Windows System Software Devs Interest List”
Sent: Friday, September 28, 2007 6:49 AM
Subject: RE:[ntdev] SIMD operations & KeSaveFloatingPointState
