Use of SSE instructions from an arbitrary thread context

Hi,

I’ve written a driver that uses SSE instructions inside a callback that runs at DISPATCH_LEVEL. This runs perfectly until I run a userspace application that also executes SSE instructions in its own thread, at which point the application’s results are occasionally corrupted while my driver is running. If I stop my driver’s callback from firing, or remove the SSE instructions from the driver, the userspace code runs perfectly without any corruption of its results.

I should point out that there is absolutely no coupling between the userspace application and the kernel driver. They don’t share memory buffers; in fact, the only thing they have in common is that they both execute SSE instructions.

I’m running this on Windows 7 x64, and so according to the DDK documentation I shouldn’t need to perform any steps to save the processor state before use of SSE instructions. The documentation states that SSE operations can be performed at any IRQL. Does this also mean that it is possible to perform SSE operations from an arbitrary thread context?

The only possible explanation that I have is that my driver is pre-empting the userspace application, and not saving and restoring the state of the SSE registers. Assuming this is the case, can anyone offer any advice on how to proceed?

(For those interested, my driver is a WaveCyclic audio driver, and I’m using SSE within my IDmaChannel::CopyFrom() method.)

IIRC SSE uses the floating point unit, so like any other floating point
operation in the kernel it needs to be protected with
KeSaveFloatingPointState / KeRestoreFloatingPointState

Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr

-----Original Message-----
From: xxxxx@aanet.com.au [mailto:xxxxx@aanet.com.au]
Posted At: Monday, July 19, 2010 8:49 AM
Posted To: ntdev
Conversation: Use of SSE instructions from an arbitrary thread context
Subject: Use of SSE instructions from an arbitrary thread context


Hmmmm…

The SSE registers are separate from the x87 FPU registers, so on x64 calling KeSaveFloatingPointState is unnecessary and, IIRC, actually does nothing.

You’re coding these SSE operations yourself, right? And you’re using Win7?? Are you using the XMM registers or the YMM (AVX) registers?? If you’re using the YMM registers, you have to explicitly do a KeSaveExtendedProcessorState around the operations.

I wouldn’t be terribly surprised to find that you have to save/restore the SSE registers in your DPC. If you think about it, to avoid this Windows would have to save/restore the XMM registers around every DPC… and I’d be surprised to find it did that. I definitely don’t remember seeing any code that does it, now that I think of it.

Since you’re on Win7, you can try calling RtlGetEnabledExtendedFeatures during your driver’s initialization, and saving away the returned feature mask. Then call KeSaveExtendedProcessorState using that same feature mask around your SSE/AVX operations – both of these functions are new starting in Win7.

Sorry to provide such a hazy reply… Yours is an interesting question and the WDK docs are particularly bad in their handling of XMM/YMM information. I’m curious to see the outcome.

Peter
OSR

> IIRC SSE uses the floating point unit, so like any other floating point
> operation in the kernel it needs to be protected with
> KeSaveFloatingPointState / KeRestoreFloatingPointState

The doc really says this is not needed on x64.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> I’m running this on Windows 7 x64, and so according to the DDK documentation I shouldn’t need to
> perform any steps to save the processor state before use of SSE instructions. The documentation

Peter is probably correct: even if the x64 OS automatically saves and restores the FPU state on each context switch or the like, it does not do so on DPC delivery. So, when KiRetireDpcList interrupts your thread, the context is not saved, and your DPC dirties it.

And yes, the doc page is vague and bad.

To experiment, try offloading this work from the DPC path to some other path, for instance, to a system thread. Will this stop the corruption issue?

Also, you are probably using some SSE registers that are not covered by this guarantee.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

It’s a strange calling convention, unfortunately:

You can use SSE intrinsics that the compiler provides without fear on x64 Windows in any context (XMM registers are part of the calling convention just like the standard GPRs, so you can treat them as such). Don’t try to use non-default rounding/exception modes by changing the standard value of the MxCSR register.
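
As a minimal sketch of the intrinsics route (an illustration only, not code from the driver discussed in this thread): `CopyWithSse2` is a hypothetical helper name, and the choice of unaligned loads and stores is an assumption, since a real CopyFrom() may know its buffers are 16-byte aligned. The sketch never touches MXCSR.

```c
#include <emmintrin.h>   /* SSE2 intrinsics (XMM registers only) */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes, 16 bytes at a time, through an
 * XMM register. Unaligned loads/stores are used so that no alignment
 * assumption is made about either buffer. */
static void CopyWithSse2(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        len -= 16;
    }

    /* Copy the remaining 0..15 bytes without SSE. */
    if (len != 0) {
        memcpy(d, s, len);
    }
}
```

Because only compiler intrinsics are used, saving and restoring any nonvolatile XMM registers is left to the compiler-generated prolog and epilog.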

If you want to use YMM registers, you have to call KeSaveExtendedProcessorState(AVX_MASK). You can then use the compiler-provided intrinsics to do the processing you need. I don’t think you can call the Save/restore APIs above DISPATCH_LEVEL, but you can use them in DPCs, etc.
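
A rough sketch of that bracketing pattern, under several assumptions: AVX_MASK above is taken to mean the WDK constant XSTATE_MASK_GSSE; `RunWithYmmStateSaved`, `YMM_WORK_ROUTINE`, `ExampleWork` and the `SKETCH_USE_WDK` macro are hypothetical names; and the `#else` branch holds stand-in definitions that only count calls so the pattern can be compiled outside a driver project. The stand-ins save no processor state.

```c
#ifdef SKETCH_USE_WDK              /* hypothetical macro: define it in a driver build */
#include <wdm.h>
#else
/* Stand-ins (assumptions, NOT the real WDK definitions). They only
 * count calls so the pairing of save and restore can be checked. */
#include <stdint.h>
typedef int32_t  NTSTATUS;
typedef uint64_t ULONG64;
typedef struct { int Unused; } XSTATE_SAVE, *PXSTATE_SAVE;
#define NT_SUCCESS(s)     ((NTSTATUS)(s) >= 0)
#define STATUS_SUCCESS    ((NTSTATUS)0)
#define XSTATE_MASK_GSSE  (1ull << 2)
static int g_saves, g_restores;
static NTSTATUS KeSaveExtendedProcessorState(ULONG64 Mask, PXSTATE_SAVE Save)
{
    (void)Mask; (void)Save;
    ++g_saves;
    return STATUS_SUCCESS;
}
static void KeRestoreExtendedProcessorState(PXSTATE_SAVE Save)
{
    (void)Save;
    ++g_restores;
}
#endif

/* Hypothetical callback type for the code that uses YMM registers. */
typedef void (*YMM_WORK_ROUTINE)(void *Context);

/* Save the AVX state, run the work routine, then restore the state.
 * If the save fails, the work routine is not run at all. */
static NTSTATUS RunWithYmmStateSaved(YMM_WORK_ROUTINE Work, void *Context)
{
    XSTATE_SAVE save;
    NTSTATUS status = KeSaveExtendedProcessorState(XSTATE_MASK_GSSE, &save);
    if (!NT_SUCCESS(status)) {
        return status;             /* caller should take a non-AVX path */
    }

    Work(Context);                 /* AVX intrinsics would go in here */

    KeRestoreExtendedProcessorState(&save);
    return STATUS_SUCCESS;
}

/* Placeholder work routine; a real one would use AVX intrinsics. */
static void ExampleWork(void *Context)
{
    *(int *)Context += 1;
}
```

The point of the wrapper is that every successful save has exactly one matching restore, and that a failed save leaves the YMM registers untouched.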

If you want to write assembly code, you have to follow the calling convention, including emitting the proper unwind codes. What this means is that you have to call KeSaveExtendedProcessorState, and then also save/restore the XMM register corresponding to any YMM register you use.

I wouldn’t recommend going beyond compiler intrinsics since it is nontrivial to write proper, unwindable x64 assembly code, and getting it right is important in kernel mode. If it is still considered valuable, I can give an example of proper SSE asm code, however.

-Neeraj
Windows Kernel Core Team

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@osr.com
Sent: Monday, July 19, 2010 7:59 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Use of SSE instructions from an arbitrary thread context


NTDEV is sponsored by OSR


Thanks for all the helpful replies!

After further investigation, I’ve discovered that the x64 compiler is generating an incorrect prolog and epilog for all of my functions that use SSE intrinsics. If I disassemble the code produced by the 3790 DDK compiler, I can see that it writes all callee-saved XMM registers to the stack, but only the low 8 bytes of each register rather than the entire 16 bytes. The prolog that is emitted uses the following instructions to save xmm6 and xmm7:

movsdx QWORD PTR [rsp+48], xmm6
movsdx QWORD PTR [rsp+40], xmm7

If I compile the same code with Visual Studio 2008 (as a userspace application), the prolog saves the entire 16 bytes of each callee-saved XMM register, and the corresponding disassembly contains the instructions:

movaps XMMWORD PTR [rsp-24], xmm6
movaps XMMWORD PTR [rsp-40], xmm7
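
As an illustration only (this is not the compiler's actual prolog), the user-mode sketch below spills a 128-bit value through an 8-byte slot, as movsd does, and through a 16-byte slot, as movaps does. The names `Halves`, `Unpack`, `RoundTrip8` and `RoundTrip16` are hypothetical.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* The two 64-bit halves of one XMM register. */
typedef struct { uint64_t lo, hi; } Halves;

static Halves Unpack(__m128i reg)
{
    Halves h;
    _mm_storeu_si128((__m128i *)&h, reg);   /* low 64 bits land in h.lo */
    return h;
}

/* Spill and reload through an 8-byte slot: only the low half survives. */
static Halves RoundTrip8(uint64_t lo, uint64_t hi)
{
    __m128i reg = _mm_set_epi64x((long long)hi, (long long)lo);
    double slot;                                   /* 8-byte save slot */
    _mm_store_sd(&slot, _mm_castsi128_pd(reg));    /* stores bits 0..63 only */
    reg = _mm_castpd_si128(_mm_load_sd(&slot));    /* reload zeroes bits 64..127 */
    return Unpack(reg);
}

/* Spill and reload through a 16-byte slot: the whole register survives. */
static Halves RoundTrip16(uint64_t lo, uint64_t hi)
{
    __m128i reg = _mm_set_epi64x((long long)hi, (long long)lo);
    __m128i slot;                                  /* 16-byte aligned save slot */
    _mm_store_si128(&slot, reg);
    reg = _mm_load_si128(&slot);
    return Unpack(reg);
}
```

RoundTrip8 comes back with the upper half zeroed regardless of what was passed in, while RoundTrip16 returns both halves unchanged. Losing the upper half of a nonvolatile register in this way would be consistent with the occasional corruption seen by code that keeps live data in the upper lanes of xmm6 or xmm7.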

For now I’ve written my own assembler procedures to save and restore the XMM registers, and this has fixed all of the problems that I mentioned in my original post. I still haven’t ruled out this problem being caused by an incorrect compiler option that I’m setting. If anyone has any suggestions, I’d love to hear them!

Doug

On 20/07/2010 14:07, xxxxx@aanet.com.au wrote:

> 3790 DDK compiler

That’s an awfully old DDK. Perhaps try upgrading to the latest DDK?

MH.

> epilog for all of my functions that use SSE intrinsics. If I disassemble the code produced by the 3790
> DDK

Why are you using this ancient DDK? It has been obsolete for 5 years now; it is like using Turbo C 2.0.

Upgrading to 6001.18002 (if you need w2k support), or even to the latest Win7 WDK, is a better option than writing your own assembly code.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com