memory corruption at page start

Hello,
I am facing random BSODs on a particular system running Windows XP.
I have analyzed the memory dumps and found a recurring pattern of
overwritten bytes just after a page boundary. This means all the
different BSODs are caused by something writing into the beginning
of a (4K) page.

However, the corruption seems to be related to running our drivers.
The same drivers run fine on a checked MP system with verifier.

The memory corruption seems to have a length of 0xC bytes. I saw
the values 0x3F and 0xCF, which seem to be HDLC flags (from our data
transfer).

By describing the above pattern of memory corruption I hope that
one of you has already seen a similar bug and is willing to give me a
hint on where to look.

Norbert.
“I am Pentium of Borg. Arithmetic is irrelevant. Prepare to be
approximated.”

Not sure if this is of use or not…
A system running Driver Verifier will always allocate its blocks at the
back end of a 4K block. So if the customer's systems are running DV [with
special pool], then that could explain what’s going on.

Other than that, I haven’t got any good ideas.

I thought HDLC flags were 7F… (or is it 0x3F/0xCF alternating, so that
the seventh bit is in the second byte?).

Of course, adding some padding to each allocated block and checking that
the padding has not been overwritten when de-allocating will help track down
overflow problems. This may help in the debug effort: rather than waiting
for the error to occur only when a packet happens to overflow the end of a
page, it bug-checks as soon as the driver overflows its own buffer region.
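
The padding idea can be sketched roughly as follows. This is a user-mode illustration only, using malloc/free as stand-ins; in a driver the same wrapper would sit on top of ExAllocatePoolWithTag/ExFreePool and bug-check instead of returning a status. The names guarded_alloc, guarded_is_intact and guarded_free, the guard length and the fill byte are all made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  16u    /* padding bytes on each side (arbitrary choice) */
#define GUARD_BYTE 0xA5u  /* fill pattern (arbitrary choice) */

/* Header kept in front of the leading guard so the size is known at free
 * time. The union with max_align_t keeps the user pointer well aligned. */
typedef union {
    size_t      size;
    max_align_t align;
} guard_header;

/* Allocate `size` bytes surrounded by two guard regions. */
void *guarded_alloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(guard_header) - 2u * GUARD_LEN)
        return NULL;

    unsigned char *raw = malloc(sizeof(guard_header) + 2u * GUARD_LEN + size);
    if (raw == NULL)
        return NULL;

    ((guard_header *)raw)->size = size;
    unsigned char *user = raw + sizeof(guard_header) + GUARD_LEN;
    memset(user - GUARD_LEN, GUARD_BYTE, GUARD_LEN);  /* leading guard  */
    memset(user + size, GUARD_BYTE, GUARD_LEN);       /* trailing guard */
    return user;
}

/* Returns 1 if both guard regions are untouched, 0 otherwise. */
int guarded_is_intact(const void *ptr)
{
    const unsigned char *user = ptr;
    const guard_header *hdr =
        (const guard_header *)(user - GUARD_LEN - sizeof(guard_header));
    const unsigned char *front = user - GUARD_LEN;
    const unsigned char *back  = user + hdr->size;

    for (size_t i = 0; i < GUARD_LEN; i++) {
        if (front[i] != GUARD_BYTE || back[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}

/* Check the guards, release the block, and report whether it was intact.
 * A driver would bug-check here instead of returning 0. */
int guarded_free(void *ptr)
{
    if (ptr == NULL)
        return 1;
    int intact = guarded_is_intact(ptr);
    free((unsigned char *)ptr - GUARD_LEN - sizeof(guard_header));
    return intact;
}
```

Checking at free time only tells you that an overrun happened at some point during the lifetime of the block; calling the check at other convenient points (for example after each completed transfer) narrows the window.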


Mats


Questions? First check the Kernel Driver FAQ at
http://www.osronline.com/article.cfm?id=256

You are currently subscribed to ntdev as: xxxxx@3dlabs.com
To unsubscribe send a blank email to xxxxx@lists.osr.com


On the target I am not running under verifier.
The overwritten memory also does not corrupt the pool entries. These
pool entries do not belong to me. Running !poolval reports no errors. It
does not seem to be a form of overflow.

Could it be a hardware problem? An isochronous USB pipe on an ALI chipset?

The HDLC flag is a 0 bit, then six consecutive 1 bits, then a 0 bit
(0x7E). Due to bit stuffing you may see bit-shifted versions of 0x7E.
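
To illustrate the shifted flag values: if flags are sent back to back and the byte alignment slips by a few bits, every byte read from the stream is a bit rotation of 0x7E. The small sketch below checks whether a byte is such a rotation. It assumes continuous idle flags that do not share their zero bits, and the helper names rotr8 and hdlc_flag_shift are made up for this example.

```c
#include <stdint.h>

#define HDLC_FLAG 0x7Eu  /* 0 bit, six 1 bits, 0 bit */

/* Rotate an 8-bit value right by n bit positions (n taken modulo 8). */
uint8_t rotr8(uint8_t v, unsigned n)
{
    n &= 7u;
    return (uint8_t)((v >> n) | (v << ((8u - n) & 7u)));
}

/* Returns the right-rotation count (0..7) that turns 0x7E into `byte`,
 * or -1 if `byte` is not a rotated HDLC flag. */
int hdlc_flag_shift(uint8_t byte)
{
    for (unsigned n = 0; n < 8; n++) {
        if (rotr8(HDLC_FLAG, n) == byte)
            return (int)n;
    }
    return -1;
}
```

With this, 0x3F is 0x7E rotated right by one bit and 0xCF is 0x7E rotated right by three bits, so both observed values fit a misaligned flag stream. 0x7F has seven 1 bits and can never be a rotation of the flag.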

Norbert.

“An ulcer is a pain in the neck that dropped.”

See “Special Memory Pool” in the DDK. Verifier can be configured to align
‘special pool’ allocations to either the end (the default) or the start of
a page. Experimenting with this may help isolate the corruption.

=====================
Mark Roddy


xxxxx@lists.osr.com wrote on 10/19/2004 05:05:43 PM:

> On the target I am not running under verifier.
> The overwritten memory also does not corrupt the pool entries. These
> pool entries do not belong to me. Doing a !poolval is ok. It seems not
> to be a form of overflow.

Strange. I would have expected the pool entries to be destroyed. But maybe
it is just luck that the writes miss the pool data structures?

> Can it be a hardware problem ? Isochronous Usb pipe on ALI chipset ?

Possible, but I would not expect the USB pipe to care much about pages at
all. It should care about physical addresses. Of course, it could be
continuing to write past the page it has been given in the scatter/gather
list, i.e. we tell it to write to address 0xaaaa0000-0xaaaa0fff, and it
continues to write at 0xaaaa1000 because of some stupid hardware bug. That
should be fairly independent of the device plugged into the USB port,
though, so if this is the case you would see the failure with just about
any USB device on the market (at least any that uses the same basic
protocol as your product). It may be a good idea to do some googling to see
whether there are a lot of similar products with problems on ALI chipsets.
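
One way to test the "writes past the end of the page" theory against the dumps is to check whether each corrupted address is the first byte of the page right after one of your transfer buffers. A rough sketch, assuming 4K pages and that the buffer and corruption addresses are taken from the same address space (physical, if the controller is the suspect); the function names are made up for this example:

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 0x1000u                /* 4K pages */
#define PAGE_MASK       (PAGE_SIZE_BYTES - 1u)

/* Offset of an address within its page. */
uint32_t page_offset(uint64_t addr)
{
    return (uint32_t)(addr & PAGE_MASK);
}

/* Base address of the page that contains addr. */
uint64_t page_base(uint64_t addr)
{
    return addr & ~(uint64_t)PAGE_MASK;
}

/* Returns 1 if corrupt_addr is the first byte of the page immediately
 * following the page that holds the last byte of the buffer, i.e. where a
 * device would land if it kept writing past the end of the page. */
int is_next_page_overrun(uint64_t buf_start, uint64_t buf_len,
                         uint64_t corrupt_addr)
{
    if (buf_len == 0)
        return 0;
    uint64_t last = buf_start + buf_len - 1;
    return corrupt_addr == page_base(last) + PAGE_SIZE_BYTES;
}
```

For the example above, a buffer at 0xaaaa0000 of length 0x1000 and a corruption at 0xaaaa1000 match, while a corruption at 0xaaaa1004 does not.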

> HDLC flag is a 0 bit, then 6 consecutive 1 bits, then 0 bit. Due to
> bit stuffing you may have shifted values corresponding to 0x7e.

Oh, yes. 0x7F is an “abort” or some such… Sorry… It has been a while since
I did that sort of stuff… ;-)


Mark and Mats,
first of all thanks for your comments and hints.

We decided to plug in a CardBus USB host controller card with an NEC
OHCI controller. With our device plugged into this host controller, the
problems are gone.

This fact and the strange pattern of always hitting the beginning of a
physical page make me believe that the hardware/driver combination of
this notebook is definitely broken.

If this message is read by someone from HP/(Acer Labs M5237) who would be
interested in solving it, please contact me privately.

Norbert.
