I am trying to develop a basic firewall based on the WFP example in the DDK for Windows Vista. I have been working some extremely long hours and pouring my heart and soul into this project. I am now in the testing phase, and if I see another blue screen of death I think I’ll scream.
I have learned how to analyze crash dumps using windbg, and now I am really up against it. This is possibly the least helpful error message I have seen:
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
The full crash dump is posted below, but I have NO idea where to turn next. Surely there is a way to get more information than this and track the problem down. I will be really heartbroken if I can’t get any further and all that extremely hard work was for nothing.
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: fffe9084, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 81a5ca2e, address which referenced memory
Debugging Details:
WRITE_ADDRESS: GetPointerFromAddress: unable to read from 81b55868
Unable to read MiSystemVaType memory at 81b35420
fffe9084
Special pool is a debugging feature of Driver Verifier that can help catch most pool corruption bugs as they happen instead of when they cause secondary failures. Many “random crashes” in a completely unknown or unrelated section of code are caused by pool corruption.
Verifier.exe is the program that ships with Windows and controls driver verifier settings.
S
-----Original Message-----
From: xxxxx@yahoo.co.uk
Sent: Friday, November 07, 2008 16:13
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] OK What now
> My guess is that you are causing pool corruption which is later resulting in secondary failures.
> Can you try enabling special pool in driver verifier? (verifier.exe)
> - S
Thanks for the feedback. Sadly I don’t understand any of that, sorry.
I did this and did get some more meaningful stuff in my crash dump. I am still not any further forward though. I thought I could embellish on the filter driver in the DDK, and I have been doing well, but now this is starting to run away from me, which is truly heartbreaking when you consider the hours I have put in. The crash dump makes no sense to me:
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: a8674f90, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: a41044e5, address which referenced memory
Debugging Details:
READ_ADDRESS: GetPointerFromAddress: unable to read from 81b53868
Unable to read MiSystemVaType memory at 81b33420
a8674f90
Charlie - take a deep breath… packet is probably a bad value. Verifier found the bad read, where before verifier, RemoveEntryList was making a bad write. Add some code to check the validity of packet, or take a look at the structure of packet when the crash happens.
As sprochniak mentioned, “packet” is probably bogus memory [that has been released to the pool but is still being used].
My guess is that you are freeing the memory pointed to by that variable to the pool, but still keeping ahold of it and thus ending up reusing it after the memory’s released. You should look through your code and see what code paths will result in you using an already-freed memory block in this context.
The packet memory is created in the callout function, and added to the LinkedList for out of band processing. The packet is tested for null before being added to the linked list. The function that is creating the blue screen is the worker function which is fired out of band. Given that null packets cannot be added to the list, and the packet is not freed until after this line I cannot see how this would be bogus memory.
That said, I implemented a null check anyway. So now the code reads
if (packet!=NULL)
{
    if (packet->direction!=NULL)
    {
        if (packet->direction == FWP_DIRECTION_INBOUND)
        {
            RemoveEntryList(&packet->ListEntry);
The line :
if (packet->direction !=NULL)
is now giving the blue screen.
From this I know that the packet is not null as the first if case passed ok. Testing for null on the packet->direction member before using it would surely prevent this blue screen, no?
Invalid doesn’t always mean NULL. The packet value in your dump is
(probably) 0xa8674f80 and it points to invalid memory. You don’t need to
add checks to find out that the value is wrong. This is what BSOD
already said. Instead, you need to find the reason why it is wrong. Ken
already gave you good advice on how to start.
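The point that invalid is not the same as NULL can be shown with a toy example. This is a user-mode sketch only, assuming a made-up fixed-size pool (`toy_alloc`, `toy_free` and `toy_is_live` are hypothetical names, not kernel APIs): after the free, the caller's pointer still holds the old address, so a NULL test passes even though the slot is no longer owned.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_COUNT 4

typedef struct TOY_PACKET { int direction; } TOY_PACKET;

/* Made-up pool: a static array stands in for pool memory so the
 * example never touches genuinely freed memory. */
static TOY_PACKET g_slots[SLOT_COUNT];
static int g_in_use[SLOT_COUNT];

static TOY_PACKET *toy_alloc(void)
{
    for (int i = 0; i < SLOT_COUNT; i++) {
        if (!g_in_use[i]) {
            g_in_use[i] = 1;
            g_slots[i].direction = 0;
            return &g_slots[i];
        }
    }
    return NULL; /* pool exhausted */
}

/* p must have come from toy_alloc(). */
static void toy_free(TOY_PACKET *p)
{
    ptrdiff_t idx = p - g_slots;
    assert(idx >= 0 && idx < SLOT_COUNT);
    assert(g_in_use[idx]);   /* catches a double free in this toy pool */
    g_in_use[idx] = 0;       /* the caller's copy of p is NOT changed */
}

/* p must be NULL or have come from toy_alloc(). */
static int toy_is_live(const TOY_PACKET *p)
{
    if (p == NULL) {
        return 0;
    }
    return g_in_use[p - g_slots];
}
```

A real driver has no `toy_is_live` to ask, which is why the advice is to find the code path that frees the packet rather than to add more checks at the point of use.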
Why don’t you add some kdprints to your code and see if you are attempting
to process one packet in multiple threads on multiple CPUs at the same time.
Display the addresses, processor, and current thread ID. Run with windbg
and see. From what I remember of this thread, it appears you have written
some code (maybe all of it) before any testing and running with windbg,
verifier, and pool tracking. Most of us write and test code incrementally
to avoid such surprises - especially if writing a new driver.
OK, 9 hours later after some diligent debugging I have the smoking gun but still no suspect. I will try to convey the facts as best I can in the hope that someone can point me further in the right direction.
This is based on the network packet fwp inspection example in the DDK don’t forget, so if you are familiar with that it would be helpful.
The driver loads fine and sets up the fwp callback routines OK. The execution path that causes the BSOD starts when the first callback function, TLInspectALEConnectClassify, fires. It creates a packet “object” (sorry, OOP programmer by nature), adds it to the global linked list gConnList with InsertTailList, and then fires the worker thread to pull from gConnList and do some processing on the given packet. The problem is that in the worker thread, the packet object retrieved from gConnList is corrupt, and the first operation performed on it causes the blue screen.
Now arguably the gConnList variable is becoming corrupt, in that the global heap is somehow becoming corrupted, but I do not have any idea how. I did for one fleeting minute suspect that spinlocks were not being released or some such, but the code path is totally clean.
Basically the callback says
Populate packet object
Obtain spinlock
Add packet to global linked list
Release spinlock
Call out of band processing function
The out of band processing function says
Obtain spinlock
Get packet object
Interrogate member of the packet object … whiz bang bsod
This code was working like a charm the other day, and I cannot see any big changes in this code execution path tbh. How can the gConnList get trashed? How can I prove it / investigate it further? What tools should I use? Is there a known gotcha with using global linkedlists I should know about?
I see several problems with your ‘description’ of the problem.
“fires the worker thread” - How, what, when, where, who, etc. Be
specific.
‘InsertTailList’ does provide some specifics, but where and how the head
and other memory is allocated is not mentioned. Many of the drivers in the
network stack run at dispatch level IRQL.
Normally worker threads are passed a work item that includes the data or
pointers upon which they are to act. Using system worker threads is a bad
idea. Read “NT Insider” from many years ago about creating your own worker
threads.
Even if a list is used why use a double link list for this problem?
Sounds like too much OOP programming design.
Well, if packets are flying by you and you are queueing them onto a linked list to be handled by a worker thread, it’s more than likely that the next driver in the stack is taking that packet, having its way with it and then effectively completing it. Any operation you want to perform on these packets, you probably have to do synchronously.
Good for you to have some OO experience. The concept of encapsulation tends to escape people that don’t have any, and still has a lot of relevance in the world of quality WDM C programming.
This is very old. Even NDIS miniports have a lot of the techniques without
the overhead (and help) provided by OOP languages. When you are started
with a piece of hardware, you create a memory block to contain your
information about the hardware and NDIS. For each call you receive you get
a ‘context’ so you can find that data for that device (you might have
several to control in one driver). When interrupted by the hardware it goes
first to NDIS, so the context is available to the driver even then.
In languages not considered OOP such as C, COBOL, and assembler, the
techniques of encapsulation can be used. Beginners won’t get it, but with
OOP they will find ways to write bad code even if the language tries to
avoid it. With any language, bad code can be written and will be written.
I have looked back at code from many years ago and see how much I have
learned. There are many things that must be considered to write good code
and OOP is only one. You have to consider the environment, resource
availability, OS interfaces, hardware limitations, and so on. Remember the
old saying that if the only tool you have is a hammer, everything looks like
a nail. Experience in the area involved is the only way to write good code
and even then you will have bad days where you will go down bad paths. No
one writes excellent code all the time.
OK, so now we have just fired the worker thread (in answer to the how, when, where type questions). This is how it is done in the DDK example, and as I say this has been working just fine up till now.
So now we are in the worker thread.
void
TLInspectWorker(
    IN PVOID StartContext
    )
{
    NTSTATUS status;
    TL_INSPECT_PENDED_PACKET* packet;
    LIST_ENTRY* listEntry;
    KLOCK_QUEUE_HANDLE packetQueueLockHandle;
    KLOCK_QUEUE_HANDLE connListLockHandle;
    UNREFERENCED_PARAMETER(StartContext);
    while (1)
    {
        KeWaitForSingleObject(
            &gWorkerEvent,
            Executive,
            KernelMode,
            FALSE,
            NULL
            );
OK been sitting here for a while, as this was set up in DriverLoad, and now that we have been kicked off from the callback lets get going.
        if (gDriverUnloading)
        {
            break;
        }
        listEntry = NULL;
        KeAcquireInStackQueuedSpinLock(
            &gConnListLock,
            &connListLockHandle
            );
Cheeky spin lock now acquired
        if (!IsListEmpty(&gConnList))
        {
            listEntry = gConnList.Flink;
            packet = CONTAINING_RECORD(
                listEntry,
                TL_INSPECT_PENDED_PACKET,
                listEntry
                );
OK so the list isnt empty and we just got the first record
This next line, 'ere be dragons. Bang fizz wallop blue screen.
            if (packet->direction == FWP_DIRECTION_INBOUND)
            {
                RemoveEntryList(&packet->listEntry);
            }
Now there is exactly how it 'appened m’ludd.
I should just add that I am using workitems to log stuff.
                DbgPrint("A problem occurred in function logPacketCallback. Kernel api call ZwWriteFile returned an error. \n");
#endif
            }
        }
        else
        {
            //DbgPrint the error if Debugging is turned on
#ifdef DEBUG_OUTPUT
            DbgPrint("A problem occurred in function logPacketCallback. Kernel api call RtlStringCbPrintfA returned an error. \n");
#endif
        }
    }
    IoFreeWorkItem(logItemContext->previousWorkItem);
    ExFreePool(logItemContext);
    logItemContext = NULL;
    return;
}
And I have been peppering calls to createLogEntry all over the code in an effort to try and create some sort of running log file. I have been putting these calls inside #ifdef directives so I can turn them off quite easily and this was the first thing I did when I started getting the bsods.
As I say it is beyond my understanding what is going wrong here. As you can see, I set up the lists correctly, get the spinlock, add the packet to the list, release the spinlock. Then in the worker thread get the spinlock, retrieve the first list item, and then BANG as soon as I try and poke the retrieved item with a crooked stick.
Sorry for the rather lengthy post, but I just felt that perhaps people needed to see code to make learned comments.
I wouldn’t think you could make that analogy. Windbg has registers, memory,
source, assembly, stack, locals, globals, etc. display windows so you can
get many views into your code. I do use windbg for dumps, but most of the
time is live debug sessions using 1394a to test changed code or examine the
behavior of suspected failure conditions.
The analogy I used continues with using screws as if they were nails. If you
want to make that analogy with windbg you could add in SoftIce somewhere
along with Vista where there is no real compatibility between the two.
“Pavel A.” wrote in message news:xxxxx@ntdev…
> “David Craig” wrote in message news:xxxxx@ntdev…
>> Remember the old saying that if the only tool you have is a hammer, everything looks like a nail.
> to extend this analogy - if the only tool you have is windbg… everything looks like a dump?
> --PA
So I’m guessing no one knows the answer, and now I should stop checking this thread as we are ending up in analogy talks? I was previously asked “1. “fires the worker thread” - How, what, when, where, who, etc. Be specific.”
I have done that. I was asked to put some DbgPrints in my code and trace the path. I have done that. I was asked to check if anywhere along the path I was freeing memory and then trying to access it once it has been freed. I have done that, and I am not. The fact remains that a global linked list is getting trashed and I can’t for the life of me see where. I have posted all the code in the execution path and am hanging on for some sort of clue from someone more knowledgeable; after all, that’s what this thread is for, right?
I am aware of the golden hammer anti-pattern, and this is quite insulting to me as I have many long years of programming experience under my belt. This is not some rookie coder who thought he’d have a bash at writing a kernel mode driver. I am flat out on this one; it’s not through want of effort, I can tell you. I just need a learned driver developer to cast his eye over the findings I have posted recently and the code I am poring over and give me some sort of clue.
I think you are missing the point here to some extent. This is about
finding the right question. The answer will be obvious once you (we) do.
Mr. Craig suggested a very powerful and useful bit of diagnostics to install
into your list (packet queue) handling code: Log everything with
processor/thread IDs included and watch every operation on the packet
objects and packet queue itself. Have you done that?
Mr. Prochniak suggested an important debugging technique of ‘self
verification’ where you embed a unique type signature into your object as an
aid in validating it is not corrupt. Think PoolTag.
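A minimal sketch of the signature idea, assuming a made-up packet layout and tag values (`SIGNED_PACKET`, `PACKET_SIGNATURE` and `PACKET_RETIRED` are hypothetical and not taken from the DDK sample). The object is stamped when it is set up and overwritten just before it is released, so a stale reference fails the check. This only helps while the freed memory is still readable; with special pool enabled, the stale access faults first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKET_SIGNATURE 0x746B6350u  /* arbitrary tag chosen for this sketch */
#define PACKET_RETIRED   0xDEADBEEFu  /* written just before the memory is freed */

typedef struct SIGNED_PACKET {
    uint32_t signature;  /* first field, so it is cheap to check */
    int      direction;
} SIGNED_PACKET;

/* Stamp the object once its storage has been allocated. */
static void packet_init(SIGNED_PACKET *p, int direction)
{
    assert(p != NULL);
    p->signature = PACKET_SIGNATURE;
    p->direction = direction;
}

/* Call immediately before handing the storage back to the allocator. */
static void packet_retire(SIGNED_PACKET *p)
{
    assert(p != NULL && p->signature == PACKET_SIGNATURE);
    p->signature = PACKET_RETIRED;
}

/* Nonzero only for an object that was initialized and not yet retired. */
static int packet_is_valid(const SIGNED_PACKET *p)
{
    return p != NULL && p->signature == PACKET_SIGNATURE;
}
```

In the worker, a check such as `packet_is_valid(packet)` before the first field access would turn a silent stale read into an explicit assertion failure.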
As I read and re-read the posts in this thread I had the following thoughts
(as a network driver guy):
Are you sure you have claimed ownership of the packet from the callout such
that you even have the right to queue it?
Have you considered all paths (especially external paths) that might access
the lists and or list elements?
Does your design treat the lists purely as queues? By this I mean that you
cannot ‘touch’ the packet in any way when it is on the queue. The only
operations are to enqueue or dequeue it. The lock then would only be used to
synchronize access to the list head (queue) and list entry fields in the
packet *and nothing else*. If that is the case, how are you synchronizing
access to the packet itself and especially its deletion? Does your design
explicitly imply that only a single activity can own a packet reference
(either the creator, the queue, or the consumer)?
I find that encapsulating all of the queue access into FORCEINLINE routines
instead of scattering CONTAINING_RECORD(), IsListEmpty(), and other code
throughout the other routines is a handy way of making an OOP design retain
some OO. It is also easier to insert the instrumentation because you have a
single (source code) point of access to key data structures and operations
(like enqueue, dequeue) where validation of these structures can be done at
access time.
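One way such encapsulated queue routines might look, as a user-mode sketch. The types below (`Q_ENTRY`, `PACKET`, `PACKET_QUEUE`) are stand-ins invented for this example that mimic the Flink/Blink layout of LIST_ENTRY; locking is left out, and in a driver each routine would run with the list's spin lock held. The dequeue routine unlinks the entry before returning it, so the consumer never inspects a packet that is still on the list.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for LIST_ENTRY so the sketch builds without kernel headers. */
typedef struct Q_ENTRY { struct Q_ENTRY *Flink, *Blink; } Q_ENTRY;

typedef struct PACKET { Q_ENTRY link; int direction; } PACKET;

typedef struct PACKET_QUEUE {
    Q_ENTRY head;
    long    count;   /* how many packets we think are queued */
} PACKET_QUEUE;

static inline void queue_init(PACKET_QUEUE *q)
{
    q->head.Flink = q->head.Blink = &q->head;
    q->count = 0;
}

/* Append at the tail; ownership of p passes to the queue. */
static inline void queue_push(PACKET_QUEUE *q, PACKET *p)
{
    Q_ENTRY *tail = q->head.Blink;
    assert(q != NULL && p != NULL);
    p->link.Flink = &q->head;
    p->link.Blink = tail;
    tail->Flink = &p->link;
    q->head.Blink = &p->link;
    q->count++;
}

/* Unlink and return the first packet, or NULL if the queue is empty.
 * Ownership passes to the caller; the link fields are cleared. */
static inline PACKET *queue_pop(PACKET_QUEUE *q)
{
    Q_ENTRY *e = q->head.Flink;
    if (e == &q->head) {
        return NULL;
    }
    q->head.Flink = e->Flink;
    e->Flink->Blink = &q->head;
    e->Flink = e->Blink = NULL;
    q->count--;
    return (PACKET *)((char *)e - offsetof(PACKET, link));
}
```

With this shape there is exactly one place to add logging, signature checks, or counters for every enqueue and dequeue.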
Detecting corruption of a queue based on LIST_ENTRY is pretty straight
forward. It requires that you keep a counter of the number of elements you
*think* are in the queue and simply a routine which walks the list
validating what it finds. If it walks more elements than are supposed to be
in the list, it should ASSERT(). If it does not find enough, it should
ASSERT(). If it finds an element that does not make sense (type check
fails) it should ASSERT(). Obviously if it runs across trash and causes a
bugcheck, you have found trash.
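The walk described above can be sketched as follows. This is an illustrative user-mode version with invented names (`NODE`, `LIST_CHECK`, `list_check`); it returns a status code where a driver would ASSERT(), and it uses back-link consistency as its "does this element make sense" test.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for LIST_ENTRY so the sketch builds without kernel headers. */
typedef struct NODE { struct NODE *Flink, *Blink; } NODE;

typedef enum LIST_CHECK {
    LIST_OK, LIST_TOO_MANY, LIST_TOO_FEW, LIST_BAD_LINK
} LIST_CHECK;

/* Walks the circular list from head and compares what it finds with
 * the number of elements the caller believes are queued. */
static LIST_CHECK list_check(const NODE *head, size_t expected)
{
    size_t seen = 0;
    const NODE *prev = head;
    const NODE *cur;

    assert(head != NULL);
    cur = head->Flink;
    while (cur != head) {
        /* Each element must point back at the one we came from. */
        if (cur == NULL || cur->Blink != prev) {
            return LIST_BAD_LINK;
        }
        /* Stopping here also bounds the walk if the list has a cycle. */
        if (++seen > expected) {
            return LIST_TOO_MANY;
        }
        prev = cur;
        cur = cur->Flink;
    }
    if (head->Blink != prev) {
        return LIST_BAD_LINK;
    }
    return (seen == expected) ? LIST_OK : LIST_TOO_FEW;
}
```

Calling a check like this under the list lock after every insert and remove narrows down which operation first leaves the list inconsistent.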
Interlocked{Increment|Decrement} counters on key operations are very useful
when trying to prove that your async allocation/queue/process/deallocation
routines are working. Count the number of packets allocated. Count how
many are queued. Count how many are dequeued. Count how many are in the
queue. Count how many are processed. Count how many are freed. When the
crash occurs, look at the counts. Do they add up?
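A sketch of such counters, using C11 atomics as a user-mode stand-in for InterlockedIncrement on LONG values. The structure and helper names (`PACKET_STATS`, `stats_in_queue`, `stats_consistent`) are assumptions made for this example, as are the particular invariants it checks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct PACKET_STATS {
    atomic_long allocated;
    atomic_long queued;
    atomic_long dequeued;
    atomic_long processed;
    atomic_long freed;
} PACKET_STATS;

static void stats_init(PACKET_STATS *s)
{
    assert(s != NULL);
    atomic_init(&s->allocated, 0);
    atomic_init(&s->queued, 0);
    atomic_init(&s->dequeued, 0);
    atomic_init(&s->processed, 0);
    atomic_init(&s->freed, 0);
}

/* Packets that should currently be sitting on the queue. */
static long stats_in_queue(PACKET_STATS *s)
{
    return atomic_load(&s->queued) - atomic_load(&s->dequeued);
}

/* Packets that have been allocated and not yet freed. */
static long stats_outstanding(PACKET_STATS *s)
{
    return atomic_load(&s->allocated) - atomic_load(&s->freed);
}

/* Nonzero if the counts add up: nothing dequeued that was never queued,
 * nothing processed that was never dequeued, and every queued packet
 * still allocated. */
static int stats_consistent(PACKET_STATS *s)
{
    return stats_in_queue(s) >= 0
        && atomic_load(&s->processed) <= atomic_load(&s->dequeued)
        && stats_outstanding(s) >= stats_in_queue(s);
}
```

If a packet is freed while it is still on the list, the outstanding count drops below the queued count and `stats_consistent` returns zero, which is the situation suspected in this thread.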
I don’t think anyone here is trying to insult you. IMHO this thread has
been chock full of useful stuff.
Good Luck,
Dave Cattley
Consulting Engineer
Systems Software Development
Please step back for a moment, and take a deep breath.
This is a volunteer mailing list and not a paid support group, and it’s much more difficult to troubleshoot your problem with the limited visibility into the issue that we have versus what you have available to you. Thus, we are relying on you to tell us what we need to know, as we don’t have the source in front of us to make inferences from. In some cases, this means we’re only able to offer general advice.
Now… is the code you posted the *only* place that references those two linked lists?
The thought pattern that you should be using here is that something is freeing the “packet” object before you get to the "if (packet->direction == FWP_DIRECTION_INBOUND)" test. It might, thus, be more productive to start working backwards from all the places where you might free a packet back to the pool and see if there’s any way that a packet could get wrongly freed while it’s still in the linked list.
Based on the limited information that I have available to me here, I’d say the most probable cause is that the packet is being freed to the pool while still being used, given where the crash happened with special pool. At this point, the problem really is in your court, for the most part, as we don’t have the source code, and we can’t thus look through all the logic that might free a packet to make sure it doesn’t have a bug. If I were in your position, that would be my next step.