IRQL = 0xFF mystery II

OSR_Community_User · February 9, 2005, 1:10pm

The mystery deepens. I have a code snippet and a stack trace and some other
information that simply does not make any sense at all.

To me.

Two of my partners are now telling me “This is beginning to look like a
hardware problem.”

I can’t believe that it is. The system is too stable. The problem is too
reproducible. Except that, as Henry Gabryjelski puts it, it is a
Heisenbug. That is, the bug disappears when I run in a kernel debug
environment.

Here’s the stack trace

nt!KiTrap0E+0x238
nt!MmGetSystemRoutineAddress
continua!SF_DYNAMIC_FUNCTION_POINTERS::SF_DYNAMIC_FUNCTION_POINTERS+0xc3
continua!ContinuaGetOsVersion::ContinuaGetOsVersion+0x1a
continua!ContinuaVdiskSysDriver::DriverEntry+0x8f
continua!KDriver::DriverInit+0x16b
continua!DriverEntry+0xfa
nt!IopLoadDriver+0x66c
nt!PipCallDriverAddDeviceQueryRoutine+0x235
nt!RtlpCallQueryRegistryRoutine+0x3b1
nt!RtlQueryRegistryValues+0x2a6
nt!PipCallDriverAddDevice+0x261
nt!PipProcessDevNodeTree+0x1a4
nt!PiRestartDevice+0x80
nt!PipDeviceActionWorker+0x168
nt!ExpWorkerThread+0xef
nt!PspSystemThreadStartup+0x34
nt!KiThreadStartup+0x16

Here’s the code snippet (reformatted so that it looks semi-decent with short
lines)

PSF_GET_VERSION SF_DYNAMIC_FUNCTION_POINTERS::GetVersion = 0;

SF_DYNAMIC_FUNCTION_POINTERS::SF_DYNAMIC_FUNCTION_POINTERS()
{
    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );

    UNICODE_STRING functionName;

    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
    RtlInitUnicodeString( &functionName, L"RtlGetVersion" );

    ULONG_PTR IoGetRemainingStackSize_ = IoGetRemainingStackSize();
    IoGetRemainingStackSize_ = IoGetRemainingStackSize_;

    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
    GetVersion =
        reinterpret_cast
        <
            long (__stdcall *)(struct _OSVERSIONINFOW *)
        >
        (
            MmGetSystemRoutineAddress( &functionName )
        );

    ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
} /* SF_DYNAMIC_FUNCTION_POINTERS::SF_DYNAMIC_FUNCTION_POINTERS() */

Here is the command output from WinDbg

kd> !analyze -v
*******************************************************************************
*
*
* Bugcheck Analysis
*
*
*
*******************************************************************************

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 805bfa33, memory referenced
Arg2: 000000ff, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: 805bfa33, address which referenced memory

Debugging Details:

READ_ADDRESS: 805bfa33

CURRENT_IRQL: ff

FAULTING_IP:
nt!MmGetSystemRoutineAddress+0
805bfa33 8bff mov edi,edi

DEFAULT_BUCKET_ID: DRIVER_FAULT

BUGCHECK_STR: 0xA

LAST_CONTROL_TRANSFER: from ed75522f to 805bfa33

TRAP_FRAME: ee2e0530 -- (.trap ffffffffee2e0530)
ErrCode = 00000000
eax=00000000 ebx=ee2e07a8 ecx=00000000 edx=ee2e05c0 esi=84bb8078 edi=84bb5698
eip=805bfa33 esp=ee2e05a4 ebp=ee2e05cc iopl=0 nv up di pl zr na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010046
nt!MmGetSystemRoutineAddress:
805bfa33 8bff mov edi,edi
Resetting default scope

As we can see, we are crashing in MmGetSystemRoutineAddress.

The value of IoGetRemainingStackSize_ is 9628 (decimal). I don’t think
stack is the problem.

I call MmGetSystemRoutineAddress with a data structure that should be on the
stack or in nonpaged memory.

Everything in my link map is in either a DATA or CODE segment and there is
nothing in my driver that is paged. I did this in order to try to track
down this awful bug.

I am definitely running a checked build of my driver so the
ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
is definitely being executed.

The memory being referenced (805bfa33) seems to be real. At least I can see
it in the memory dump.

So, ladies and gentlemen, how in Hades is MmGetSystemRoutineAddress()
faulting with IRQL_NOT_LESS_OR_EQUAL?

Weird, weird, weird.

Ralph Shnelvar

OSR_Community_User · February 9, 2005, 2:06pm

xxxxx@lists.osr.com wrote on 02/09/2005 11:13:21 AM:

>The mystery deepens. I have a code snippet and a stack trace and some other
>information that simply does not make any sense at all.
>
>To me.
>
>Two of my partners are now telling me “This is beginning to look like a
>hardware problem.”
>
>I can’t believe that it is. The system is too stable. The problem is too
>reproducible. Except that, as Henry Gabryjelski puts it, it is a
>Heisenbug. That is, the bug disappears when I run in a kernel debug
>environment.

I already suggested that you consider that possibility. Let me explain it
a little deeper.

Windows puts its kernel wherever it wants. For the most part, wherever
it goes on a given system is where it stays, until you change kernel
components (drivers) or hardware components, which may induce changes to
the kernel data, causing other things loaded later to move. The XP Pro
and Home versions probably have a lot of stuff in different locations in
memory. When you hook up the debugger, that further moves things around.
It’s quite possible that you have a memory defect somewhere that doesn’t
affect stability until you move some kernel structure there, then boom.
Heisenbug in the flesh.

Just swap your memory modules around and see if that moves the problem. If
it does, you know you have a bogus memory module. If it doesn’t, that
doesn’t eliminate the possibility of hardware, but it makes it a lot
harder to verify, because you’ve done the easy stuff. The next easiest
thing to do is replace the CPU (is there a defect in the cache tag, for
instance, where a specific address tag is corrupted? That is ***HIGHLY***
unlikely, as most defects of that sort will show up at a lot of addresses,
not just one). After that, it just gets harder.

Phil

Philip D. Barila
Seagate Technology LLC
(720) 684-1842

OSR_Community_User · February 9, 2005, 3:36pm

One thing you might try is to unassemble nt!MmGetSystemRoutineAddress. I get,

0: kd> u nt!MmGetSystemRoutineAddress
nt!MmGetSystemRoutineAddress:
806526f0 55 push ebp
806526f1 8bec mov ebp,esp

I thought your symbols might be off, but the address looks pretty close. It
looks like this might just be memory corruption. Is this an MP box, or HT?

CB

From: xxxxx@attotech.com
Reply-To: “Windows System Software Devs Interest List”

>To: “Windows System Software Devs Interest List”
>Subject: Re: [ntdev] IRQL = 0xFF mystery II
>Date: Wed, 9 Feb 2005 14:52:07 -0500
>
>Ok, Ralph, now you’ve got me interested.
>
>I don’t know anything about home edition, but …
>
>1. Note the faulting IP and the instruction there.
>
>nt!MmGetSystemRoutineAddress:
>805bfa33 8bff mov edi,edi
>
>This instruction is a no-op usually inserted by the compiler to get
>functions aligned on a DWORD boundary. BUT the address is not on a DWORD
>boundary.
>
>2. I have no idea in heck why the compiler would start a function off with
>any kind of no-op, so I don’t think this instruction is supposed to be
>there.
>
>3. This instruction doesn’t reference memory. So why would it trap? The
>only thing I can think of is that the read of the instruction from memory
>is what actually caused the trap.
>
>Not too much help, I’m afraid, but just thought I’d throw in my $.02 worth.
>
>Jerry.
>
>—
>Questions? First check the Kernel Driver FAQ at
>http://www.osronline.com/article.cfm?id=256
>
>You are currently subscribed to ntdev as: xxxxx@attotech.com
>To unsubscribe send a blank email to xxxxx@lists.osr.com

OSR_Community_User · February 9, 2005, 8:09pm

Dear Centis:

On Wed, 09 Feb 2005 15:35:26 -0500, you wrote:

>One thing you might try is to unassemble nt!MmGetSystemRoutineAddress. I get,
>
>0: kd> u nt!MmGetSystemRoutineAddress
>nt!MmGetSystemRoutineAddress:
>806526f0 55 push ebp
>806526f1 8bec mov ebp,esp

Now, that is strange. Compare your results with Jerry’s. He gets the same
result that I do.

>I thought your symbols might be off, but the address looks pretty close. It
>looks like this might just be memory corruption.

I say this without sarcasm: Is Jerry’s machine showing memory corruption,
too?

Is your version of MmGetSystemRoutineAddress from XP (Home/Professional)
SP2?

>Is this an MP box, or HT?

Neither. I haven’t exercised this code on an MP or HT box yet because I
want to straighten this problem out first.

>CB

Ralph

OSR_Community_User · February 9, 2005, 8:45pm

Jerry:

On Wed, 9 Feb 2005 17:12:47 -0500, you wrote:

>Well I’ll be …
>
>I just disassembled this function on my XP Pro SP2 kernel versions and sure
>enough, the function does start off with this no-op! Not only that, but in
>the ntoskrnl version, the function is sure enough not aligned on a 4-byte
>boundary!
>
>ntkrnlmp.exe:
>
>004d5bec 8bff mov edi,edi
>004d5bee 55 push ebp
>
>ntoskrnl.exe:
>
>004e8a33 8bff mov edi,edi
>004e8a35 55 push ebp
>
>What’s really surprising is that in BOTH versions the function is preceded
>by a bunch of single-byte “alignment” instructions, nop in ntoskrnl and
>int 3 in ntkrnlmp.
>
>Furthermore, in ntkrnlmp.exe, there are a lot of exports that are not
>aligned on 4-byte boundaries, and in ntoskrnl their alignment is all over
>the place. So I wonder if the linker that Microsoft used to build this is
>broken. Either that or I am mistaken on the purpose of these alignment
>thingeys.
>
>So that leaves only item 3 in my previous email.

So, let’s review item 3.

>This instruction doesn’t reference memory. So why would it trap? The
>only thing I can think of is that the read of the instruction from memory
>is what actually caused the trap.

That doesn’t make sense, either. The code is being called from DriverEntry.
The statement immediately preceding the call to MmGetSystemRoutineAddress is
ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );

One way that MmGetSystemRoutineAddress can page fault is if something raised
IRQL. But the address that is faulting is the first instruction in
MmGetSystemRoutineAddress.

Another - really bizarre - way for this to happen is if the page fault
handling code is hosed.

As Alice said: Curiouser and curiouser.

I raise the probability that this is a hardware problem to 25%.

>Jerry.
>
>(The disassembly was using dumpbin, so these are not run-time addresses.)

Thanks for looking into this, Jerry.

Ralph Shnelvar
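
The alignment observation in this exchange is easy to verify with a mask test on the low two address bits. What follows is a minimal user-mode sketch that uses only the three addresses quoted in the thread; `is_dword_aligned` is an illustrative helper name, not a kernel API.

```cpp
#include <cstdint>

// An address is DWORD (4-byte) aligned when its two low bits are zero.
constexpr bool is_dword_aligned(std::uint32_t address) {
    return (address & 0x3u) == 0;
}

// dumpbin address of MmGetSystemRoutineAddress in ntkrnlmp.exe
static_assert(is_dword_aligned(0x004d5bec), "ntkrnlmp.exe entry is aligned");

// dumpbin address of MmGetSystemRoutineAddress in ntoskrnl.exe
static_assert(!is_dword_aligned(0x004e8a33), "ntoskrnl.exe entry is not aligned");

// run-time faulting address from the crash dump
static_assert(!is_dword_aligned(0x805bfa33), "faulting address is not aligned");

// The on-disk and run-time addresses agree in their low 12 bits.
static_assert((0x004e8a33 & 0xfff) == (0x805bfa33 & 0xfff), "same page offset");
```

The matching page offset (0xa33) is consistent with the faulting address being the ordinary, unmodified entry point of the ntoskrnl.exe image relocated on a page boundary, rather than a stray address.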

OSR_Community_User · February 9, 2005, 8:55pm

The answer is in the !analyze code.

See the value in EFLAGS?

efl=00010046

I had to go check, but it turns out that the interrupt enable flag (IF) is
bit 9 of EFLAGS (Volume 1, Page 3-15). That corresponds to a mask of 0x200.
This bit is NOT set, and on a crash that shows up as an “irql” of 0xff.
That is consistent with what I’ve always observed.
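
The flag test described above can be sketched in a few lines. This is user-mode C++ for illustration only, not driver code; the helper and constant names are made up, and only the `efl` value and the bit position come from this thread.

```cpp
#include <cstdint>

// Interrupt enable flag (IF) is bit 9 of EFLAGS, i.e. mask 0x200.
constexpr std::uint32_t kEflagsInterruptEnable = 1u << 9;

// True when the given EFLAGS value has interrupts enabled.
constexpr bool interrupts_enabled(std::uint32_t eflags) {
    return (eflags & kEflagsInterruptEnable) != 0;
}

// efl value taken from the trap frame in the !analyze -v output.
constexpr std::uint32_t kCrashEflags = 0x00010046;

static_assert(kEflagsInterruptEnable == 0x200, "IF is bit 9");
static_assert(!interrupts_enabled(kCrashEflags), "IF is clear in the crash dump");
```

The `di` token in the register line of the trap frame reports the same thing: the debugger prints `ei` when IF is set and `di` when it is clear.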

The documentation clearly says: This routine can only be called at IRQL
= PASSIVE_LEVEL.

Interrupts are disabled. This is the *equivalent* of running at IRQL
HIGH_LEVEL, without that nasty TPR programming. The page in question is
probably marked as “in transition” (use “!pte 805bfa33”) which means the
data contents are really in memory, but this causes a hardware page
fault anyway. The debugger “helps” you by showing you the contents of
that memory location, even though it generated a page fault.

The use of a two-byte NOP code is odd, but a red herring. The reason
that an innocuous instruction like this causes a fault is because the
*instruction* is at fault.

Now you just need to figure out why interrupts are disabled.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

OSR_Community_User · February 9, 2005, 10:09pm

Dear Tony:

On Wed, 9 Feb 2005 20:54:54 -0500, you wrote:

The answer is in the !analyze code.

See the value in EFLAGS?

efl=00010046

I had to go check, but it turns out that the interrupt enable bit is bit
9 (Volume 1, Page 3-15). That corresponds to 0x200 hex. This bit is
NOT set, and thus on crash this exhibits by showing an “irql” of 0xff.
That is consistent what what I’ve always observed.

The documentation clearly says: This routine can only be called at IRQL
= PASSIVE_LEVEL.

Interrupts are disabled.

Oh my god!

This is the *equivalent* of running at IRQL
HIGH_LEVEL, without that nasty TPR programming. The page in question is
probably marked as “in transition” (use “!pte 805bfa33”) which means the
data contents are really in memory, but this causes a hardware page
fault anyway. The debugger “helps” you by showing you the contents of
that memory location, even though it generated a page fault.

The use of a two-byte NOP code is odd, but a red herring.

Wouldn’t it be more appropriate to say Blue Screen of Death Herring?

The reason
that an innocuous instruction like this causes a fault is because the
*instruction* is at fault.

Let me see if I understand this: The instruction is in a block of memory
that is “in transition” and thus a page fault emerges.

Right?

> Now you just need to figure out why interrupts are disabled.

May the gods bless your house, Tony.

Now I’ve got to put instrumentation into the code to test when/if the code
has the interrupt bit turned off.

This is plain-vanilla code and I simply do not remember doing anything that
would cause interrupts to be turned off.
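
One way the instrumentation mentioned above could look, as a rough sketch only. It assumes a compiler that provides the MSVC __readeflags() intrinsic on x86/x64; on anything else the macro falls back to a fixed value so the sketch still compiles. All SF_* and sf_* names are hypothetical and not part of the driver.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
/* MSVC intrinsic that returns the current EFLAGS/RFLAGS register. */
#define SF_READ_EFLAGS() ((uint32_t)__readeflags())
#else
/* Assumed fallback for other toolchains: pretend IF is set. */
#define SF_READ_EFLAGS() ((uint32_t)0x00000202u)
#endif

#define SF_EFLAGS_IF 0x00000200u /* interrupt enable flag, bit 9 */

/* One record per checkpoint; tag is a caller-chosen label. */
typedef struct SF_IF_SAMPLE {
    const char *tag;
    uint32_t eflags;
} SF_IF_SAMPLE;

/* Returns 1 if the interrupt flag is set in the given EFLAGS image. */
static int sf_if_set(uint32_t eflags)
{
    return (eflags & SF_EFLAGS_IF) != 0;
}

/* Captures EFLAGS at a checkpoint into the caller-supplied slot. */
static void sf_record(SF_IF_SAMPLE *slot, const char *tag)
{
    slot->tag = tag;
    slot->eflags = SF_READ_EFLAGS();
}

/* Returns the index of the first checkpoint with IF clear, or -1. */
static int sf_first_if_clear(const SF_IF_SAMPLE *samples, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (!sf_if_set(samples[i].eflags)) {
            return (int)i;
        }
    }
    return -1;
}
```

Calling sf_record() next to each of the existing ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL ) lines, and earlier in DriverEntry, would narrow down where the flag first reads as clear.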

Stranger and stranger. Curiouser and curiouser.

Best,

Ralph Shnelvar

Loren_Wilton · February 9, 2005, 11:35pm

>> The reason that an innocuous instruction like this causes a fault is
>> because the *instruction* is at fault.
>
> Let me see if I understand this: The instruction is in a block of memory
> that is “in transition” and thus a page fault emerges.

If you look at that dump again, you’ll see that the fault indicates a read from something that just happens to be the instruction address. I kept asking myself “why would something in the kernel be reading the *instruction*?”

Well, the answer is now obvious: the *processor* was reading the instruction because it wanted to execute it, and it got a read fault when trying to read it. The ‘in transition’ page fault sounds like the best explanation here so far.

Is there a chance this code is being called from the ISR and someone has done an illegal KeLowerIrql?
No, wait. This is happening out of DriverEntry. This isn’t a boot-time driver, is it? I’m wondering if there is ever a time that DriverEntry can be called with interrupts disabled.

Loren

OSR_Community_User · February 10, 2005, 12:21am

Ralph:

In fact, driver verifier works by explicitly clearing the “active” bit
in every page of memory that isn’t pinned down. This scenario fits
everything we see in your !analyze command: interrupts disabled
(EFLAGS), a no-op instruction causing a fault at an address that the
debugger shows is present. (I’ve been burned by this before: a bad
address, but the debugger shows the contents of that memory location.
That’s why I suggested using the “!pte” command; that will confirm the
theory.)

The bit about IRQL = 0xFF is my observation of the system over these
many years of debugging - it always seems to report that as the IRQL
when interrupts are disabled.

But, rather than trust to this, I went and looked at the code. This is
slightly annoying to figure out, but not horrible:

lkd> u 804e0e37 804e0ee7
nt!KiTrap0E: ;; on entry CS, EFLAGS, EIP
804e0e37 66c74424020000 mov word ptr [esp+0x2],0x0
804e0e3e 55 push ebp ;; + 4
804e0e3f 53 push ebx ;; + 8
804e0e40 56 push esi ;; + C
804e0e41 57 push edi ;; + 10
804e0e42 0fa0 push fs ;; + 14
804e0e44 bb30000000 mov ebx,0x30
804e0e49 668ee3 mov fs,bx ;; load PCR
804e0e4c 648b1d00000000 mov ebx,fs:[00000000] ;; exception list

804e0e53 53 push ebx ;; + 18
804e0e54 83ec04 sub esp,0x4 ;; + 1C
804e0e57 50 push eax ;; + 20
804e0e58 51 push ecx ;; + 24
804e0e59 52 push edx ;; + 28
804e0e5a 1e push ds ;; + 2C
804e0e5b 06 push es ;; + 30
804e0e5c 0fa8 push gs ;; + 34
804e0e5e 66b82300 mov ax,0x23
804e0e62 83ec30 sub esp,0x30 ;; + 64
804e0e65 668ed8 mov ds,ax
804e0e68 668ec0 mov es,ax
804e0e6b 8bec mov ebp,esp ;; load base pointer
804e0e6d f744247000000200 test dword ptr [esp+0x70],0x20000
804e0e75 7593 jnz nt!V86_kite_a (804e0e0a)
804e0e77 fc cld
804e0e78 8b5d60 mov ebx,[ebp+0x60]
804e0e7b 8b7d68 mov edi,[ebp+0x68]
804e0e7e 89550c mov [ebp+0xc],edx
804e0e81 c74508000ddbba mov dword ptr [ebp+0x8],0xbadb0d00
804e0e88 895d00 mov [ebp],ebx
804e0e8b 897d04 mov [ebp+0x4],edi
804e0e8e 64f60550000000ff test byte ptr fs:[00000050],0xff
804e0e96 0f85eafeffff jne nt!Dr_kite_a (804e0d86)
804e0e9c 64833d5400000000 cmp dword ptr fs:[00000054],0x0
804e0ea4 0f85fc010000 jne nt!KiTrap0E+0x26f (804e10a6)
804e0eaa 8bc4 mov eax,esp
804e0eac 642b0504000000 sub eax,fs:[00000004]
804e0eb3 058c000000 add eax,0x8c
804e0eb8 7415 jz nt!KiTrap0E+0x98 (804e0ecf)
804e0eba 83f890 cmp eax,0xffffff90
804e0ebd 7210 jb nt!KiTrap0E+0x98 (804e0ecf)
804e0ebf 64a118000000 mov eax,fs:[00000018]
804e0ec5 0bc0 or eax,eax
804e0ec7 7e06 jle nt!KiTrap0E+0x98 (804e0ecf)
804e0ec9 e8d6daffff call nt!CommonDispatchException+0x54 (804de9a4)
804e0ece 90 nop
804e0ecf 0f20d7 mov edi,cr2 ;; faulting address
804e0ed2 803d91ca558000 cmp byte ptr [nt!KiI386PentiumLockErrataPresent (8055ca91)],0x0
804e0ed9 0f8570feffff jne nt!KeInvalidAccessAllowed+0x15 (804e0d4f)
804e0edf fb sti
;; look at EFLAGS as captured by CPU at fault
804e0ee0 f7457000020000 test dword ptr [ebp+0x70],0x200
804e0ee7 0f8482010000 je nt!KiTrap0E+0x238 (804e106f)

And this check transfers control (if interrupts weren’t enabled):

;; check for V86 mode (EFLAGS.VM, bit 17), then whether VDM IOPL is allowed
804e106f f7457000000200 test dword ptr [ebp+0x70],0x20000
804e1076 740d jz nt!KiTrap0E+0x24e (804e1085)
804e1078 833d94ca558000 cmp dword ptr [nt!KeI386VdmIoplAllowed (8055ca94)],0x0
804e107f 0f8568feffff jne nt!KiTrap0E+0xb6 (804e0eed)

;; some internal OS state (freeze flag? What a name!)
804e1085 833d2034568000 cmp dword ptr [nt!KiFreezeFlag (80563420)],0x0
804e108c 0f855bfeffff jne nt!KiTrap0E+0xb6 (804e0eed)
804e1092 833da03e568000 cmp dword ptr [nt!KiBugCheckData (80563ea0)],0x0
804e1099 0f854efeffff jne nt!KiTrap0E+0xb6 (804e0eed)
;; hey look - set eax to 0xff
804e109f b8ff000000 mov eax,0xff
804e10a4 eba8 jmp nt!KiTrap0E+0x217 (804e104e)

Unconditional jump to…

804e104e f0ff0500aa5580 lock inc dword ptr [nt!KiHardwareTrigger (8055aa00)]
804e1055 8b4d64 mov ecx,[ebp+0x64] ;; error code from the trap frame
804e1058 d1e9 shr ecx,1
804e105a 230d00b15580 and ecx,[nt!KeErrorMask (8055b100)]
804e1060 8b7568 mov esi,[ebp+0x68] ;; faulting EIP
804e1063 55 push ebp ;; trap frame
804e1064 56 push esi
804e1065 51 push ecx
804e1066 50 push eax ;; <-- That’s the 0xff we set.
804e1067 57 push edi ;; CR2
804e1068 6a0a push 0xa ;; bugcheck code 0xA
804e106a e83b5c0500 call nt!KeBugCheck2 (80536caa)

So I can now rest well tonight knowing that in fact the kernel DOES report
the IRQL as 0xFF when interrupts were disabled at the time of the fault.

Don’t know why interrupts are disabled, of course, but at least you know
what to look for…
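
To summarize the path traced above, here is a rough C rendering of what the disassembly appears to do once it finds the interrupt flag clear. It is a simplified sketch, not the kernel's code: the V86, KiFreezeFlag and KiBugCheckData checks are left out, the error mask is passed in because the value of KeErrorMask is not shown here, and every sk_/SK_ name is made up for illustration.

```c
#include <stdint.h>

#define SK_EFLAGS_IF 0x00000200u        /* interrupt enable flag, bit 9 */
#define SK_BUGCHECK_CODE_A 0x0000000Au  /* the "push 0xa" in the listing */

/* The values pushed before the call to KeBugCheck2, in argument order. */
typedef struct SK_BUGCHECK {
    uint32_t code;    /* 0xA */
    uint32_t address; /* edi: CR2, the faulting address */
    uint32_t irql;    /* eax: the hard-coded 0xff */
    uint32_t access;  /* ecx: (error code >> 1) & mask */
    uint32_t eip;     /* esi: instruction pointer from the trap frame */
} SK_BUGCHECK;

/*
 * Returns 1 and fills *out when the fault was taken with interrupts
 * disabled (the branch to KiTrap0E+0x238); returns 0 otherwise.
 */
static int sk_fault_with_interrupts_off(uint32_t eflags, uint32_t cr2,
                                        uint32_t error_code, uint32_t eip,
                                        uint32_t error_mask, SK_BUGCHECK *out)
{
    if (eflags & SK_EFLAGS_IF) {
        return 0; /* interrupts were enabled: normal page-fault handling */
    }
    out->code = SK_BUGCHECK_CODE_A;
    out->address = cr2;
    out->irql = 0xFFu;
    out->access = (error_code >> 1) & error_mask;
    out->eip = eip;
    return 1;
}
```

With efl=00010046 the function takes the bugcheck path and the reported IRQL is 0xFF regardless of what the real IRQL was.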

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

OSR_Community_User · February 10, 2005, 12:45am

Dearest, Most Awesome Tony:

On Thu, 10 Feb 2005 00:20:32 -0500, you wrote:

> Ralph:
>
> In fact, driver verifier works by explicitly clearing the “active” bit
> in every page of memory that isn’t pinned down. This scenario fits
> everything we see in your !analyze command: interrupts disabled
> (EFLAGS), a no-op instruction causing a fault at an address that the
> debugger shows is present (I’ve been burned by this before - bad
> address, but the debugger shows the contents of that memory location.
> That’s why I suggested to use the “!pte” command - that will confirm the
> theory).

kd> !pte 805bfa33
               VA 805bfa33
PDE at C0300804        PTE at C02016FC
contains 0003D163      contains 005BF860
pfn 3d -G-DA--KWEV     not valid
                       Transition: 5bf
                       Protect: 3 - ExecuteRead
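
The numbers in that output can be reproduced by hand. The sketch below assumes the classic 32-bit non-PAE x86 layout (page tables self-mapped at 0xC0000000, page directory at 0xC0300000) and the usual transition-PTE bit positions (valid bit 0, protection bits 5-9, prototype bit 10, transition bit 11, PFN in bits 12-31). Those layout details are my assumptions, not something stated in this thread, and the function names are illustrative.

```c
#include <stdint.h>

/* Address of the PDE that maps va, assuming the non-PAE self-map. */
static uint32_t pde_address(uint32_t va)
{
    return 0xC0300000u + ((va >> 22) << 2);
}

/* Address of the PTE that maps va, assuming the non-PAE self-map. */
static uint32_t pte_address(uint32_t va)
{
    return 0xC0000000u + ((va >> 12) << 2);
}

typedef struct PTE_VIEW {
    int valid;           /* bit 0: hardware "present" bit */
    int prototype;       /* bit 10 (software PTE formats only) */
    int transition;      /* bit 11 (software PTE formats only) */
    uint32_t protection; /* bits 5-9 (software PTE formats only) */
    uint32_t pfn;        /* bits 12-31: page frame number */
} PTE_VIEW;

/* Splits a raw 32-bit PTE value into the fields shown by !pte. */
static PTE_VIEW decode_pte(uint32_t pte)
{
    PTE_VIEW v;
    v.valid = (int)(pte & 1u);
    v.protection = (pte >> 5) & 0x1Fu;
    v.prototype = (int)((pte >> 10) & 1u);
    v.transition = (int)((pte >> 11) & 1u);
    v.pfn = pte >> 12;
    return v;
}
```

Under those assumptions, decode_pte(0x005BF860) gives valid = 0, transition = 1, protection = 3 and pfn = 0x5bf, which lines up with the "not valid / Transition: 5bf / Protect: 3" lines above.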

> The bit about IRQL = 0xFF is my observation of the system over these
> many years of debugging - it always seems to report that as the IRQL
> when interrupts are disabled.

Now that makes perfect sense. It makes even more sense when one looks at
the KiTrap0E code you walked through.

Oh Most Awesome One, I am humbled. And most grateful.

I now lower the probability that this is a hardware error to 15%.

I hope this is worthy of an article in the NT Insider.

Best,

Ralph Shnelvar

OSR_Community_User · February 10, 2005, 5:45am

Just a guess, but if the function started with a no-op, it *would* be a
very convenient place for a debugger (or person) to replace the no-op
with a debug breakpoint instruction, wouldn’t it?

MH.

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@attotech.com
Sent: 09 February 2005 22:13
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] IRQL = 0xFF mystery II

Furthermore, in ntkrnlmp.exe, there are a lot of exports that are not
aligned on 4-byte boundaries, and in ntoskrnl their alignment is all
over the place. So I wonder if the linker that Microsoft used to build
this is broken. Either that or I am mistaken on the purpose of these
alignment thingeys.


Maxim_S_Shatskih · February 10, 2005, 6:25am

> Stranger and stranger. Curiouser and curiouser.

Try switching off some functionality of your driver. Does the simplified driver
also crash?

Is the crash reproducible on any other machine?

Have you tried the driver under Verifier? Was it OK?

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

OSR_Community_User · February 10, 2005, 6:31am

That would imply that the people coding the compiler thought this was
important for some reason, but I find that difficult to believe. I
honestly can’t profess to understand the code that is emitted, and I’ve
been staring at it for many years now. Sometimes when debugging I’ll
marvel at the need to load and reload a register with the same value
(presumably to make sure those nasty bits haven’t flipped around or
something). That doesn’t mean they aren’t clever sometimes (usually,
that’s when I’m scurrying off to the reference manual to figure out what
that weird sequence really means.)

But perhaps you are right.

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

Peter_Viscarola_OSR · February 10, 2005, 9:31am

Ralph Shnelvar wrote:

I hope this is worthy of an article in the NT Insider.

Sure! The limit is 2400 words. Can you send it by the end of next week?

Peter
OSR

Marc-Antoine_Ruel · February 10, 2005, 11:19am

Sorry for jumping into the discussion this late; I hadn’t looked at it before,
but the mov edi,edi discussion has been going on in the IDA forums for a long time.

Martin C Harvey is almost right. It is used for quick hot-patching because
it is a single 2-byte no-op instruction (instead of two 1-byte no-ops, which
would have been slower).
And 2 bytes is just what is needed for a short jump to a location where
you can put a long jump (5 bytes). You will see many places where there are 5
no-ops; they are *not* there for alignment.

Look at
http://blogs.msdn.com/ishai/archive/2004/06/24/165143.aspx
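
To make the byte arithmetic concrete, here is a small sketch in C that applies such a patch to a byte buffer instead of live code. It assumes the layout described above (5 padding bytes immediately before a function that starts with mov edi,edi, encoded 8B FF). The function and parameter names are made up for illustration, and a real patcher would also have to make the final 2-byte write atomic.

```c
#include <stddef.h>
#include <stdint.h>

enum { PAD_LEN = 5, PROLOGUE_LEN = 2 };

/*
 * image     : buffer holding the padding followed by the function
 * image_len : size of the buffer in bytes
 * func_off  : offset of the function's first byte (the mov edi,edi)
 * image_va  : virtual address that image[0] would have when loaded
 * target_va : virtual address of the replacement function
 * Returns 1 on success, 0 if the expected layout is not found.
 */
static int hotpatch_sketch(uint8_t *image, size_t image_len, size_t func_off,
                           uint32_t image_va, uint32_t target_va)
{
    uint8_t *pad;
    uint32_t jmp_va;
    uint32_t rel32;

    if (func_off < PAD_LEN || func_off + PROLOGUE_LEN > image_len) {
        return 0;
    }
    if (image[func_off] != 0x8B || image[func_off + 1] != 0xFF) {
        return 0; /* function does not start with mov edi,edi */
    }

    /* Step 1: long jump (E9 rel32) in the 5 bytes before the function. */
    pad = image + (func_off - PAD_LEN);
    jmp_va = image_va + (uint32_t)(func_off - PAD_LEN);
    rel32 = target_va - (jmp_va + PAD_LEN); /* relative to next instruction */
    pad[0] = 0xE9;
    pad[1] = (uint8_t)(rel32 & 0xFFu);
    pad[2] = (uint8_t)((rel32 >> 8) & 0xFFu);
    pad[3] = (uint8_t)((rel32 >> 16) & 0xFFu);
    pad[4] = (uint8_t)((rel32 >> 24) & 0xFFu);

    /* Step 2: short jump (EB F9 = jmp -7) over mov edi,edi, landing on pad. */
    image[func_off] = 0xEB;
    image[func_off + 1] = 0xF9;
    return 1;
}
```

The short jump is written last so that, until that point, callers still execute the original, unmodified function.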

OSR_Community_User · February 10, 2005, 11:55am

Marc-Antoine:

I want to say thank you for posting the link and the information.

One mystery after another is being explained.

Does anyone here know what happened to Jimmy Hoffa? Was he buried under a
pile of debug output?

Ralph Shnelvar

OSR_Community_User · February 10, 2005, 12:19pm

Actually, now you come to mention it, I can think of one situation where
a compiler would want to pad code with nops, but it’s pretty arcane.
Some optimising compilers might, in the interests of speed, decide that
it was worth ensuring that a short inner loop didn’t cross an L2 cache
line boundary - it can cause bank conflicts where the caches end up
thrashing (http://www.agner.org/assem/pentopt.pdf), and the instruction
fetch on PII / PIII processors fetches in 16 byte aligned blocks, so
there’s a decode penalty too.
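
As a rough illustration of the alignment arithmetic involved, the sketch below counts how many 16-byte fetch blocks a piece of code touches and how many padding bytes would move its start onto a block boundary. The 16-byte figure is the one quoted above for PII / PIII; the function names are made up for the example.

```c
#include <stdint.h>

#define FETCH_BLOCK 16u /* instruction fetch granularity assumed here */

/* Number of 16-byte fetch blocks touched by code at [start, start+len). */
static uint32_t fetch_blocks_touched(uint32_t start, uint32_t len)
{
    if (len == 0) {
        return 0;
    }
    return ((start + len - 1) / FETCH_BLOCK) - (start / FETCH_BLOCK) + 1;
}

/* Padding bytes needed before start so the code begins on a boundary. */
static uint32_t pad_to_block(uint32_t start)
{
    return (FETCH_BLOCK - (start % FETCH_BLOCK)) % FETCH_BLOCK;
}
```

For a 12-byte loop starting at 0x100A, the loop straddles two fetch blocks; six bytes of padding would move it to 0x1010, where it fits in one.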

However, to be honest, I don’t know of any compilers that deal with
this: normally, they just perform transformations on the intermediate
representation, perform register allocation by colouring and nesting
level, and then generate the ops with a tree matching algorithm, with
possibly some special cases dependent on architecture, and then do a bit
of peephole optimization to remove redundancies.

MH.
