Looking for an idea to track memory corruption

Hi,

At runtime, one of the drivers in our stack is being corrupted, most
likely by a DMA transaction gone wrong. I say this because it is
instruction memory that is getting corrupted, and Windows would trap any
code that tried to write there. The most likely reason Windows does not
catch the write is that it was done by a bad DMA access. The code that
crashes has no hardware access at all, so it’s 100% not its fault.

Here is how the code ends up:

564 fffff880`2f9507a5 8b842490000000 mov eax,dword ptr [rsp+90h]

564 fffff880`2f9507ac 89440a14 mov dword ptr [rdx+rcx+14h],eax

565 fffff880`2f9507b0 8b442400 mov eax,dword ptr [rsp]

565 fffff880`2f9507b4 0000 add byte ptr [rax],al <============= Panic happened here.

565 fffff880`2f9507b6 0000 add byte ptr [rax],al

565 fffff880`2f9507b8 0000 add byte ptr [rax],al

565 fffff880`2f9507ba 0000 add byte ptr [rax],al

565 fffff880`2f9507bc 0000 add byte ptr [rax],al

565 fffff880`2f9507be 0000 add byte ptr [rax],al

565 fffff880`2f9507c0 0000 add byte ptr [rax],al

565 fffff880`2f9507c2 0000 add byte ptr [rax],al

565 fffff880`2f9507c4 0000 add byte ptr [rax],al

565 fffff880`2f9507c6 0000 add byte ptr [rax],al

565 fffff880`2f9507c8 0000 add byte ptr [rax],al

565 fffff880`2f9507ca 0000 add byte ptr [rax],al

565 fffff880`2f9507cc 0000 add byte ptr [rax],al

565 fffff880`2f9507ce 0000 add byte ptr [rax],al

565 fffff880`2f9507d0 0000 add byte ptr [rax],al

565 fffff880`2f9507d2 00488d add byte ptr [rax-73h],cl

565 fffff880`2f9507d5 4c0110 add qword ptr [rax],r10

565 fffff880`2f9507d8 8b842488000000 mov eax,dword ptr [rsp+88h]

565 fffff880`2f9507df 486bc044 imul rax,rax,44h

565 fffff880`2f9507e3 c744011804000000 mov dword ptr [rcx+rax+18h],4

Any creative ideas on how to look for clues about who might have done this?

Thanks.

There are 32 bytes of zero there, is that a magic number to you for your
driver? What happens up to the crash? Is it 100% reproducible?

Also, do you have a driver in the system that does DMA or is that just a
guess? You can check to see if the code is actually marked read only in this
case with !pte (it’s not necessarily a given, e.g. large pages can’t be read
only).

-scott


Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com


An !analyze -v dump would be nice, with symbols set correctly.

Gary G. Little

H (952) 223-1349

C (952) 454-4629

xxxxx@comcast.net

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@emc.com
Sent: Friday, September 10, 2010 8:43 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Looking for an idea to track memory corruption


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Here it is, but I must say that the assembly code snippet from before is
not from this failure; it’s from a previous one (but they fail the same way).

0: kd> !analyze -v

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)

This is a very common bugcheck. Usually the exception address pinpoints

the driver/function that caused the problem. Always note this address

as well as the link date of the driver/image that contains this address.

Arguments:

Arg1: ffffffffc0000005, The exception code that was not handled

Arg2: fffff88026d497b4, The address that the exception occurred at

Arg3: fffff880246208f8, Exception Record Address

Arg4: fffff88024620150, Context Record Address

Debugging Details:


EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at “0x%08lx”
referenced memory at “0x%08lx”. The memory could not be “%s”.

FAULTING_IP:

NeitPackage!fbe_api_common_send_control_packet+94
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_common\src\fbe_api_common.c @ 187]

fffff880`26d497b4 0000 add byte ptr [rax],al

EXCEPTION_RECORD: fffff880246208f8 – (.exr 0xfffff880246208f8)

ExceptionAddress: fffff88026d497b4
(NeitPackage!fbe_api_common_send_control_packet+0x0000000000000094)

ExceptionCode: c0000005 (Access violation)

ExceptionFlags: 00000000

NumberParameters: 2

Parameter[0]: 0000000000000001

Parameter[1]: 0000000000000001

Attempt to write to address 0000000000000001

CONTEXT: fffff88024620150 – (.cxr 0xfffff88024620150)

rax=0000000000000001 rbx=fffffa8007e00040 rcx=fffffa80091bca20

rdx=0000000000000460 rsi=fffffa80036c5890 rdi=0000000000000001

rip=fffff88026d497b4 rsp=fffff88024620b30 rbp=0000000000000080

r8=0000000000000801 r9=fffff800018652e0 r10=fffff80001864888

r11=fffffa80091bca20 r12=0000000000000000 r13=fffff88026d46150

r14=0000000000000000 r15=fffff880009cfec0

iopl=0 nv up ei ng nz na po nc

cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b
efl=00010286

NeitPackage!fbe_api_common_send_control_packet+0x94:

fffff88026d497b4 0000 add byte ptr [rax],al ds:002b:0000000000000001=??

Resetting default scope

PROCESS_NAME: System

CURRENT_IRQL: f

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at “0x%08lx”
referenced memory at “0x%08lx”. The memory could not be “%s”.

EXCEPTION_PARAMETER1: 0000000000000001

EXCEPTION_PARAMETER2: 0000000000000001

WRITE_ADDRESS: 0000000000000001

FOLLOWUP_IP:

NeitPackage!fbe_api_common_send_control_packet+94
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_common\src\fbe_api_common.c @ 187]

fffff880`26d497b4 0000 add byte ptr [rax],al

BUGCHECK_STR: 0x7E

DEFAULT_BUCKET_ID: CODE_CORRUPTION

LAST_CONTROL_TRANSFER: from fffff88026d474da to fffff88026d497b4

STACK_TEXT:

fffff88024620b30 fffff88026d474da : fffffa80ff030004 fffff88024620c08 fffffa8000000008 fffff88000000835 :
NeitPackage!fbe_api_common_send_control_packet+0x94
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_common\src\fbe_api_common.c @ 187]

fffff88024620bb0 fffff88026d45e2e : 000000000000000b fffff88024620c54 fffffa800000000b 0000000000000000 :
NeitPackage!fbe_api_get_object_type+0x4a
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_physical_package_interface\src\fbe_api_discovery_interface.c @ 274]

fffff88024620c20 fffff88026d46427 : fffffa800000000b fffff88026d4c21d fffffa80036c5890 fffff88026d46150 :
NeitPackage!object_map_interface_add_object+0x2e
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_object_map\src\fbe_api_object_map_interface.c @ 820]

fffff88024620c70 fffff88026d462d4 : fffff8800000000b fffff88000000002 fffff88026d4e880 000000000000000b :
NeitPackage!object_map_interface_change_object_state+0x47
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_object_map\src\fbe_api_object_map_interface.c @ 1026]

fffff88024620cb0 fffff88026d46198 : fffff88026d50808 0000000000000000 fffff88026d4e820 0000000000000008 :
NeitPackage!object_map_interface_dispatch_queue+0xe4
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_object_map\src\fbe_api_object_map_interface.c @ 958]

fffff88024620d00 fffff8000197a166 : 0000000000000000 fffffa80036f2b60 0000000000000000 fffff800016ced87 :
NeitPackage!object_map_interface_thread_func+0x48
[d:\views\d7a70e197744846811e80440a568257c.stg\catmerge\disk\fbe\src\lib\fbe_api\fbe_api_object_map\src\fbe_api_object_map_interface.c @ 904]

fffff88024620d40 fffff800016b5486 : fffff880009cb180 fffffa8007e00040 fffffa80036f1b60 fffff88001c15a90 :
nt!PspSystemThreadStartup+0x5a

fffff88024620d80 0000000000000000 : fffff88024621000 fffff8802461b000 fffff88024620a10 0000000000000000 :
nt!KxStartSystemThread+0x16

CHKIMG_EXTENSION: !chkimg -lo 50 -d !NeitPackage

fffff88026d497b3-fffff88026d497d1 31 bytes -
NeitPackage!fbe_api_common_send_control_packet+93

[68 e8 c7 a2 ff ff 48 89:00 00 00 00 00 00 00 00]

31 errors : !NeitPackage (fffff88026d497b3-fffff88026d497d1)

MODULE_NAME: memory_corruption

IMAGE_NAME: memory_corruption

FOLLOWUP_NAME: memory_corruption

DEBUG_FLR_IMAGE_TIMESTAMP: 0

MEMORY_CORRUPTOR: LARGE_32

STACK_COMMAND: .cxr 0xfffff88024620150 ; kb

FAILURE_BUCKET_ID: X64_MEMORY_CORRUPTION_LARGE_32

BUCKET_ID: X64_MEMORY_CORRUPTION_LARGE_32

Followup: memory_corruption
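
For anyone following the !chkimg lines in the output above, the reported range can be sanity-checked with a little arithmetic. This is only an illustrative sketch: the helper name `corrupted_range` is made up here, the addresses are copied from the output above, and 4 KB pages are assumed. It computes how long the modified range is and where it sits within its page.

```python
PAGE_SIZE = 0x1000  # assumption: ordinary 4 KB x64 pages


def corrupted_range(first: int, last: int) -> dict:
    """Summarize an inclusive [first, last] address range as reported by !chkimg."""
    return {
        "length": last - first + 1,                 # inclusive range
        "page_offset": first % PAGE_SIZE,           # where the damage starts in its page
        "same_page": first // PAGE_SIZE == last // PAGE_SIZE,
    }


# Range taken from the CHKIMG_EXTENSION lines: fffff88026d497b3-fffff88026d497d1
info = corrupted_range(0xFFFFF88026D497B3, 0xFFFFF88026D497D1)
print(info["length"], hex(info["page_offset"]), info["same_page"])
# prints: 31 0x7b3 True
```

So the modified bytes start one byte before the faulting instruction and stay within a single page.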



From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Gary G. Little
Sent: Friday, September 10, 2010 9:57 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption


I’ll leave this a bit free of context, just to note the corruption source needn’t be DMA. That physical page can also be mapped as R/W data to a different VA, and code using those VAs can alter it. If this isn’t happening without your driver, you are probably the corruptor. (And I did say “probably”, not “definitely”). Have you used driver verifier on your driver?

Try using DMA Verifier.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com


There are underlying drivers that do DMA transactions. This is not 100% reproducible; it’s a timing issue. The victim driver tends to be the same one every time, but it gets hit in different places.
The failure does not always land in the code segment; sometimes it shows up as 32 bytes zeroed on the heap.

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Scott Noone
Sent: Friday, September 10, 2010 9:55 AM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] Looking for an idea to track memory corruption


As Max mentioned, DMA Verifier will be a good start on those drivers
if you haven’t run it before. Hopefully the verification won’t just make the
timing issue go away, though…

Memory corruptors suck.

-scott


Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com


How about PREfast?

It could also be some silliness like allocating a buffer sized to a pointer
(4 or 8 bytes) and then shoving 256 bytes into the buffer you just
allocated.

The point is that just because the analysis does not show your driver
anywhere does not mean it isn’t your driver. You could be mishandling an IRP
by completing it twice, or by accessing it or its data buffers AFTER you have
completed it.
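
To make the pointer-sized allocation mistake concrete, here is a small sketch using Python's `ctypes` to show the two sizes involved. The `ControlPacket` structure and its 256-byte payload are hypothetical, chosen only to match the numbers above; in a C driver the same slip is typically writing `sizeof(p)` where `sizeof(*p)` was meant in the allocation call.

```python
import ctypes


class ControlPacket(ctypes.Structure):
    """Hypothetical 256-byte structure standing in for a driver's packet."""
    _fields_ = [("payload", ctypes.c_ubyte * 256)]


# What the buggy allocation asks for: the size of a pointer to the structure.
pointer_size = ctypes.sizeof(ctypes.POINTER(ControlPacket))

# What the code then writes: the whole structure.
struct_size = ctypes.sizeof(ControlPacket)

# Bytes that land beyond the end of the allocation.
overrun = struct_size - pointer_size

print(f"allocated {pointer_size} bytes, wrote {struct_size}, overran by {overrun}")
```

On a 64-bit build the allocation is 8 bytes, so 248 bytes spill into whatever happens to follow it in the pool.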

Gary G. Little

H (952) 223-1349

C (952) 454-4629

xxxxx@comcast.net

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Bob Kjelgaard
Sent: Friday, September 10, 2010 9:39 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption


>There are underlying drivers that do DMA transactions. This is not 100% reproducible, it’s a timing issue. However, the victim driver tends to be the same one every time, but it gets hit in different places. The failure sometimes manifests itself as 32 bytes zeroed on the heap and not all the time in the code segment.

If you suspect that DMA transactions caused this problem, you should start by looking at the hardware. The first step would be to use a PCI bus analyzer and see whether the hardware uses a wrong address for a transaction.

Igor Sharovar

Is it possible you are not seeing corruption; is it possible you are seeing
a branch to an invalid code address? A !analyze stack dump would help.

FIRST, I would try to make sure your basic system is not flaky. Remove any odd
hardware and run something like the WHQL system stress test for a few days
or a week. It might just be that you are debugging on a malfunctioning
motherboard. Many years ago I had a problem that looked similar to this
(although the corruption was aligned), and it turned out to be a flaky
motherboard IDE controller that corrupted memory writes on occasion, without
my device or driver present.

SECOND, I’d run driver verifier with DMA verification enabled, on EVERY
driver. Run some heavy workload that tends to stimulate your problem.

Think hard: who would write 32 bytes of unaligned zeros? Storage tends to
make aligned transfers; networks tend to make unaligned transfers. What does
your device do?

Jan


>this (although the corrupt was aligned), and it turned out to be a flaky motherboard IDE controller

I second this.

2.5 years ago I had a nasty bug which turned out to be faulty RAM. Code that simply checked some invariant twice (with a path in between that does not touch the invariant) was giving different results, and lowering the RAM clock got rid of the bug.

“fc /b” on large (50 GB or so) files is a good general test for such hardware faults.
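
For readers without `fc` to hand, the same kind of byte-for-byte comparison can be sketched in a few lines of Python. This is an illustration only, not a replacement for a proper stress test: the function names, the chunk size, and the tiny self-check files are all choices made here. For a hardware test you would copy a very large file and compare the copy against the original, repeatedly.

```python
import os
import tempfile

CHUNK = 1 << 20  # compare 1 MiB at a time


def first_mismatch(path_a, path_b):
    """Return the offset of the first differing byte, or None if the files match."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(CHUNK)
            b = fb.read(CHUNK)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the common part: one file is shorter.
                return offset + min(len(a), len(b))
            if not a:  # both files ended at the same point
                return None
            offset += len(a)


def self_check():
    """Compare a file with an identical copy and with a copy that has one bad byte."""
    data = bytes(range(256)) * 16      # 4096 bytes of known content
    damaged = bytearray(data)
    damaged[0x7B4] ^= 0xFF             # simulate a single corrupted byte
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.bin")
        copy = os.path.join(tmp, "copy.bin")
        bad = os.path.join(tmp, "bad.bin")
        for path, payload in ((good, data), (copy, data), (bad, damaged)):
            with open(path, "wb") as f:
                f.write(payload)
        return first_mismatch(good, copy), first_mismatch(good, bad)


print(self_check())
```

If two reads of the same unchanged file ever disagree, the RAM, the disk controller, or the bus is a more likely suspect than any driver code.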


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

[Sorry to repeat myself somewhat, but I think this is a better way to state my point]:

“The reason I say it is because the instruction memory is getting corrupted and Windows will trap that if any code will try to do that.”

That is NOT strictly true (otherwise a DMA corruptor could also be caught by Windows). If, e.g., you get sloppy with an MDL somewhere, you can inadvertently map the page containing that code as a writable data page. The writable mapping will have different virtual addresses than the protected one, but it’s the same physical memory, and the changes you make through the writable range will show up in the “protected” range. Physical memory containing your driver code has to be read/write [or the code could never have got there in the first place]; all of the trapping mechanisms you’re saying prevent this work on the virtual addresses mapped to that physical memory through a PTE. A single physical page can be mapped to multiple virtual addresses with differing protections. I know this is true: I deliberately use it in my own work, and I’m certain the memory manager relies upon it as well.
I’m not saying it is not DMA, or that it can’t be any of the other causes suggested here- you’ve been getting reasonable advice. I’m just trying to point out that jumping to DMA as the most likely cause with that particular justification (and simultaneously removing your driver from further consideration as the source of the problem) is an error- one that could have you searching endlessly in the wrong place.
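
A user-mode analogy may make the aliasing point easier to see. The sketch below maps one temporary file twice with Python's `mmap`, once read-only and once writable. It is not kernel code and says nothing about how the driver in question is actually mapped; the function name and the fill byte are made up for the illustration. A write through the read-only view is refused, yet a write through the writable view of the same backing page shows up in the read-only view.

```python
import mmap
import os
import tempfile

OFFSET = 0x7B4  # page offset seen in the crashes, used here only for flavour


def demo_aliased_mapping():
    """Map the same page twice with different protections and write through one view."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\xCC" * mmap.PAGESIZE)          # stand-in for "code" bytes
        ro_view = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        rw_view = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
        try:
            try:
                ro_view[OFFSET] = 0                    # the "protected" view refuses writes
                ro_write_blocked = False
            except TypeError:
                ro_write_blocked = True
            rw_view[OFFSET:OFFSET + 32] = bytes(32)    # zero 32 bytes via the writable alias
            seen = bytes(ro_view[OFFSET:OFFSET + 32])  # read them back via the read-only view
        finally:
            ro_view.close()
            rw_view.close()
    finally:
        os.close(fd)
        os.remove(path)
    return ro_write_blocked, seen


blocked, seen = demo_aliased_mapping()
print(blocked, seen == bytes(32))
```

The protection belongs to each mapping, not to the underlying memory, which is the same reason a read-only PTE on the code's virtual address does not protect the physical page from writes made through another mapping.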

The offset into a page is the same in both of these breaks (0x7b4). Does this corruption always hit the same offset relative to the driver’s base? If yes, you could set a write breakpoint (ba w4) on that virtual address (unlikely to catch anything if the page is mapped read-only), and also try setting a breakpoint on the corresponding PFN’s refcount. This will catch drivers attempting to probe-and-lock the page so they can DMA into it (or map it at a different VA), but it may also generate a lot of noise if the code page is pageable.
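
The page-offset observation is easy to check from the two faulting addresses posted earlier in the thread. A minimal sketch, assuming 4 KB pages (the dictionary labels are just descriptive):

```python
PAGE_SIZE = 0x1000  # assumption: 4 KB pages

fault_addresses = {
    "earlier crash (disassembly)": 0xFFFFF8802F9507B4,
    "this crash (!analyze -v)": 0xFFFFF88026D497B4,
}

# The low 12 bits of an address are its offset within the page.
offsets = {name: addr & (PAGE_SIZE - 1) for name, addr in fault_addresses.items()}

for name, off in offsets.items():
    print(f"{name}: page offset {off:#x}")
```

Both addresses give 0x7b4, even though the driver was loaded at a different base in each crash.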

Running with a checked kernel might also be useful.

Thanks,
Pavel

> During runtime, one of our driver’s stack is being corrupted, most likely by a
> DMA transaction gone wrong. The reason I say it is because the instruction
> memory is getting corrupted and Windows will trap that if any code will try to
> do that. The reason window does not catch that before is most likely a DMA
> access was done wrong. The code that crashes has no hardware access at all so
> it’s 100% not its fault.

The Verifier with DMA checking turned on is the obvious first step. I
assume you’ve already tried it but you didn’t say.

Some random thoughts…

Is it always your stack that gets corrupted?

Is it always a 32 byte transfer that causes corruption? (or are the 32
‘00’ bytes just part of the corruption)

Are all your DMA transfers 32 bytes? You could keep a ring buffer of the
last (say) 10000 transfers and see if any match up with where you are
seeing corruption.
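
A rough sketch of what such a transfer log could look like. All names and fields here are hypothetical, and there is no locking; a real driver would need a spinlock or interlocked index around the record path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_LOG_ENTRIES 10000u   /* "last (say) 10000 transfers" */

/* One logged transfer; field names are hypothetical. */
struct dma_log_entry {
    uint64_t phys_addr;   /* physical/bus address handed to the device */
    uint32_t length;      /* transfer length in bytes */
};

struct dma_log {
    struct dma_log_entry entries[DMA_LOG_ENTRIES];
    size_t next;    /* slot the next record will overwrite */
    size_t count;   /* valid entries, capped at DMA_LOG_ENTRIES */
};

/* Record a transfer, overwriting the oldest entry once the log is full. */
void dma_log_record(struct dma_log *ring, uint64_t phys_addr, uint32_t length)
{
    ring->entries[ring->next].phys_addr = phys_addr;
    ring->entries[ring->next].length = length;
    ring->next = (ring->next + 1) % DMA_LOG_ENTRIES;
    if (ring->count < DMA_LOG_ENTRIES)
        ring->count++;
}

/* Count logged transfers that overlap the range [addr, addr + len). */
size_t dma_log_overlaps(const struct dma_log *ring, uint64_t addr, uint32_t len)
{
    size_t hits = 0;
    for (size_t i = 0; i < ring->count; i++) {
        const struct dma_log_entry *e = &ring->entries[i];
        if (e->phys_addr < addr + len && addr < e->phys_addr + e->length)
            hits++;
    }
    return hits;
}
```

After a crash you would translate the corrupted virtual address to a physical one in the debugger and check whether any logged transfer overlaps it.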

Does the corruption happen on a 32 bit OS?

Do you have more than 4GB of memory in your test system? Does the crash
happen with less than 4GB?

Good luck. I went through something like this with my Xen drivers
recently: I was shutting things down in the wrong order, so Xen was
continuing to write into a shared ring buffer after I’d given it back to
Windows. Lots of fun to find!

James

> on the corresponding PFN’s refcount. This will catch drivers attempting to probe-and-lock the page so they can DMA into it (or map it at a different VA)

This is surely the Advice of the Week.

Great!


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Some answers to the questions below:
* The OS is Windows Server 2008 R2.
* I haven’t tried the Verifier yet.
* The corruption is always the same size; sometimes it shows up as
corrupted instruction memory and sometimes as corrupted heap memory.
* DMA transfers of the drivers below are probably more than 32 bytes.
* The system has more than 4 GB of memory.

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of James Harper
Sent: Friday, September 10, 2010 7:39 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption



NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Great point.

Thanks.


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Bob Kjelgaard
Sent: Friday, September 10, 2010 5:36 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption




I don’t think it’s flaky hardware, since it’s being seen on multiple
systems.

Good point about it not being aligned.


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Jan Bottorff
Sent: Friday, September 10, 2010 4:58 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption

Is it possible you are not seeing corruption; is it possible you are
seeing a branch to an invalid code address? A !analyze stack dump would
help.

FIRST, I would try to make sure your basic system is not flaky. Remove any
odd hardware and run something like the WHQL system stress test for a
few days or a week. It might just be that you are debugging on a
malfunctioning motherboard. Many years ago I had a problem that looked
similar to this (although the corruption was aligned), and it turned out
to be a flaky motherboard IDE controller that corrupted memory writes on
occasion, without my device or driver present.

SECOND, I’d run driver verifier with DMA verification enabled, on EVERY
driver. Run some heavy workload that tends to stimulate your problem.

Think hard: who would write 32 bytes of unaligned zeros? Storage tends
to make aligned transfers; networks tend to make unaligned transfers.
What does your device do?

Jan

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@emc.com
Sent: Friday, September 10, 2010 6:43 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Looking for an idea to track memory corruption



This driver has been around for a while and is a very simple one. It worked
for a long time with no issues and has not been changed for a while (the
possibility of a dormant bug waiting to show up is always there, I know).

Static analysis is run on the code on a regular basis, so while it’s a
good idea, I don’t think that’s the case.


From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Gary G. Little
Sent: Friday, September 10, 2010 2:36 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption

How about prefast?

It could also be some silliness like allocating a buffer sized to a
pointer (4 or 8 bytes) and then shoving 256 bytes into the buffer you
just allocated.
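
A tiny sketch of how that kind of mistake usually creeps in, assuming a made-up 256-byte `struct packet`; the function names are hypothetical too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 256-byte structure the code intends to allocate. */
struct packet {
    unsigned char payload[256];
};

/* BUG pattern: sizeof on the pointer yields 4 or 8, not 256. */
size_t wrong_alloc_size(void)
{
    struct packet *p = NULL;
    return sizeof(p);
}

/* Intended size: sizeof on the pointed-to type (operand is not evaluated). */
size_t right_alloc_size(void)
{
    struct packet *p = NULL;
    return sizeof(*p);
}
```

An allocation made with the first size followed by a copy of the full structure overruns the buffer by roughly 250 bytes, and the damage lands in whatever happens to sit next to it.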

The point is that just because the analysis does not show your driver
anywhere does not mean it isn’t your driver. You could be mishandling
an IRP by completing it twice, or by accessing it or its data buffers AFTER
you have completed it.

Gary G. Little

H (952) 223-1349

C (952) 454-4629

xxxxx@comcast.net

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Bob Kjelgaard
Sent: Friday, September 10, 2010 9:39 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] Looking for an idea to track memory corruption

I’ll leave this a bit free of context, just to note the corruption
source needn’t be DMA. That physical page can also be mapped as R/W
data to a different VA, and code using those VAs can alter it. If this
isn’t happening without your driver, you are probably the corruptor.
(And I did say “probably”, not “definitely”). Have you used driver
verifier on your driver?

