The mystery deepens. I have a code snippet and a stack trace and some other
information that simply does not make any sense at all.
To me.
Two of my partners are now telling me “This is beginning to look like a
hardware problem.”
I can’t believe that it is. The system is too stable. The problem is too
reproducible. Except that, as Henry Gabryjelski puts it, it is a
Heisenbug: the bug disappears when I run in a kernel debug
environment.
Here’s the stack trace:
nt!KiTrap0E+0x238
nt!MmGetSystemRoutineAddress
continua!SF_DYNAMIC_FUNCTION_POINTERS::SF_DYNAMIC_FUNCTION_POINTERS+0xc3
continua!ContinuaGetOsVersion::ContinuaGetOsVersion+0x1a
continua!ContinuaVdiskSysDriver::DriverEntry+0x8f
continua!KDriver::DriverInit+0x16b
continua!DriverEntry+0xfa
nt!IopLoadDriver+0x66c
nt!PipCallDriverAddDeviceQueryRoutine+0x235
nt!RtlpCallQueryRegistryRoutine+0x3b1
nt!RtlQueryRegistryValues+0x2a6
nt!PipCallDriverAddDevice+0x261
nt!PipProcessDevNodeTree+0x1a4
nt!PiRestartDevice+0x80
nt!PipDeviceActionWorker+0x168
nt!ExpWorkerThread+0xef
nt!PspSystemThreadStartup+0x34
nt!KiThreadStartup+0x16
Here’s the code snippet (reformatted so that it looks semi-decent with short
lines):
PSF_GET_VERSION SF_DYNAMIC_FUNCTION_POINTERS::GetVersion = 0;

SF_DYNAMIC_FUNCTION_POINTERS::SF_DYNAMIC_FUNCTION_POINTERS()
{
    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
    UNICODE_STRING functionName;
    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
    RtlInitUnicodeString( &functionName, L"RtlGetVersion" );
    ULONG_PTR IoGetRemainingStackSize_ = IoGetRemainingStackSize();
    IoGetRemainingStackSize_ = IoGetRemainingStackSize_;
    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
    GetVersion =
        reinterpret_cast
        <
            long (__stdcall *)(struct _OSVERSIONINFOW *)
        >
        (
            MmGetSystemRoutineAddress( &functionName )
        );
    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
} /* SF_DYNAMIC_FUNCTION_POINTERS::SF_DYNAMIC_FUNCTION_POINTERS() */
Here is the command output from WinDbg:
kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 805bfa33, memory referenced
Arg2: 000000ff, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: 805bfa33, address which referenced memory
Debugging Details:
READ_ADDRESS: 805bfa33
CURRENT_IRQL: ff
FAULTING_IP:
nt!MmGetSystemRoutineAddress+0
805bfa33 8bff mov edi,edi
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0xA
LAST_CONTROL_TRANSFER: from ed75522f to 805bfa33
TRAP_FRAME: ee2e0530 -- (.trap ffffffffee2e0530)
ErrCode = 00000000
eax=00000000 ebx=ee2e07a8 ecx=00000000 edx=ee2e05c0 esi=84bb8078 edi=84bb5698
eip=805bfa33 esp=ee2e05a4 ebp=ee2e05cc iopl=0         nv up di pl zr na po nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010046
nt!MmGetSystemRoutineAddress:
805bfa33 8bff mov edi,edi
Resetting default scope
As we can see, we are crashing in MmGetSystemRoutineAddress.
The value of IoGetRemainingStackSize_ is 9628 (decimal). I don’t think
stack is the problem.
I call MmGetSystemRoutineAddress with a data structure that should be on the
stack or in nonpaged memory.
Everything in my link map is in either a DATA or CODE segment and there is
nothing in my driver that is paged. I did this in order to try to track
down this awful bug.
I am definitely running a checked build of my driver so the
ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
is definitely being executed.
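One cross-check that can be made directly from the trap frame is whether interrupts were enabled at the moment of the fault. KeGetCurrentIrql() says nothing about the processor's interrupt flag, and the flags line in the dump shows "di". The sketch below is ordinary user-mode C++, not driver code, and its helper names are made up for illustration. It decodes the efl value from the dump and confirms that bit 9 (IF) is clear. Whether that is what produces the IRQL value of ff in the bugcheck is an assumption I have not verified.

```cpp
#include <cassert>
#include <cstdint>

// EFLAGS bit 9 is the interrupt-enable flag (IF) on x86.
constexpr std::uint32_t kInterruptFlag = 1u << 9;

// Hypothetical helper (illustration only): true when IF is set,
// i.e. maskable interrupts are enabled.
constexpr bool InterruptsEnabled(std::uint32_t eflags) {
    return (eflags & kInterruptFlag) != 0;
}

// efl value copied from the trap frame in the !analyze -v output.
constexpr std::uint32_t kTrapFrameEflags = 0x00010046;

// The decoded flag agrees with the "di" shown by the debugger.
static_assert(!InterruptsEnabled(kTrapFrameEflags),
              "IF is clear in the trap frame");
```

If IF really was clear when the constructor ran, the ASSERTs on KeGetCurrentIrql() would still pass, so they do not rule this out.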
The memory being referenced (805bfa33) seems to be real. At least I can see
it in the memory dump.
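The bugcheck arguments also say something about what kind of access faulted. Arg1 (memory referenced) and Arg4 (address which referenced memory) are both 805bfa33, the operation is a read, and the faulting instruction, mov edi,edi, has no memory operand. That combination points at the fetch of the first instruction of MmGetSystemRoutineAddress itself rather than at anything passed to it. A minimal sketch of that comparison follows, again in user-mode C++ with hypothetical names, not anything from the driver.

```cpp
#include <cassert>
#include <cstdint>

// The four bugcheck 0xA arguments, labelled as in the !analyze -v output.
struct BugcheckAArgs {
    std::uint32_t memoryReferenced;    // Arg1
    std::uint32_t irql;                // Arg2
    std::uint32_t operation;           // Arg3: 0 = read, 1 = write
    std::uint32_t referencingAddress;  // Arg4
};

// Hypothetical helper (illustration only): a read whose target is the
// referencing address itself is treated here as an instruction fetch.
constexpr bool LooksLikeInstructionFetch(const BugcheckAArgs& a) {
    return a.operation == 0 && a.memoryReferenced == a.referencingAddress;
}

// Values copied from the dump.
constexpr BugcheckAArgs kDumpArgs{0x805bfa33, 0xff, 0, 0x805bfa33};

static_assert(LooksLikeInstructionFetch(kDumpArgs),
              "Arg1 equals Arg4 and the operation is a read");
```

If that reading is right, the state of functionName and of my own driver's sections matters less than whether the code page holding MmGetSystemRoutineAddress was resident at the time of the call.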
So, ladies and gentlemen, how in Hades is MmGetSystemRoutineAddress()
faulting with IRQL_NOT_LESS_OR_EQUAL?
Weird, weird, weird.
Ralph Shnelvar