READ_REGISTER_ULONG Failing to Read Device on Vista 64

Jeffry_Gummeson · May 19, 2009, 1:39pm

Hi,

I’m having a problem with a PCI Express device: whenever I read data from the device using READ_REGISTER_ULONG, I always get 0xffffffff. Everything works correctly on 32-bit Windows, but on every 64-bit Windows computer I’ve tried so far I have this problem.

From debugging the firmware on my device I know that the device is receiving and correctly handling PCI Express write memory TLPs (PCI Express Transaction Layer Packet), so I know that the addresses I’m using with WRITE_REGISTER_ULONG and READ_REGISTER_ULONG are correctly mapped to my device. Also I know that my device is receiving and handling the read memory TLPs and generating a completion TLP. I also know that the data that my device is trying to return is NOT 0xffffffff.

Does anyone have any ideas about what could cause the function to return 0xffffffff? One thing I’ve considered is that the device might be taking too long to send back a completion, causing the completion to time out. Is it possible that 64-bit platforms have a stricter timeout requirement? I’ve also read that this could be due to a master or target abort.

Is there information somewhere that my driver can read to determine if a read to my device has failed? Otherwise I don’t see how a device could return data of 0xffffffff without confusing it with a failed read. Any advice on how to figure out what’s going wrong here? Thanks.

-Jeff

Tim_Roberts · May 19, 2009, 1:53pm

xxxxx@gmail.com wrote:

I’m having a problem with a PCI Express device: whenever I read data from the device using READ_REGISTER_ULONG, I always get 0xffffffff. Everything works correctly on 32-bit Windows, but on every 64-bit Windows computer I’ve tried so far I have this problem.

From debugging the firmware on my device I know that the device is receiving and correctly handling PCI Express write memory TLPs (PCI Express Transaction Layer Packet), so I know that the addresses I’m using with WRITE_REGISTER_ULONG and READ_REGISTER_ULONG are correctly mapped to my device. Also I know that my device is receiving and handling the read memory TLPs and generating a completion TLP. I also know that the data that my device is trying to return is NOT 0xffffffff.

Just to confirm this: you’re saying that you see evidence in the
firmware that your device is receiving and generating a completion
packet for a request that comes back 0xffffffff?

Hmm, that’s not clear. Let me restate. There’s a difference between “I
know my device correctly responds to some read requests, either through
simulation or because it worked at least once” and “I can see debug
traces for THIS SPECIFIC REQUEST that show that I generated good data
but the data never made it to the host.” Which is it?

Are you being mapped into physical space below 4GB or beyond? Does your
device handle 64-bit physical addresses? Do you understand that your
packets must use 32-bit addressing with 32-bit addresses, even if the
device is capable of 64-bit?
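Tim's addressing rule can be made concrete. The sketch below assumes the PCIe 1.x two-bit Fmt encoding (00b 3DW no data, 01b 4DW no data, 10b 3DW with data, 11b 4DW with data) and the base-spec rule that a memory request to an address below 4 GB must use the 3DW header; the function names are hypothetical and not part of any Xilinx or Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Two-bit Fmt encodings for memory requests (PCIe 1.x). Bit 0 selects
   the 4DW header, bit 1 means the TLP carries data. */
enum tlp_fmt {
    FMT_3DW_NO_DATA   = 0x0,  /* MRd, 32-bit address */
    FMT_4DW_NO_DATA   = 0x1,  /* MRd, 64-bit address */
    FMT_3DW_WITH_DATA = 0x2,  /* MWr, 32-bit address */
    FMT_4DW_WITH_DATA = 0x3   /* MWr, 64-bit address */
};

/* Chooses the Fmt for a memory request. Addresses below 4 GB must use
   the 3DW header even if the requester supports 64-bit addressing. */
static enum tlp_fmt mem_request_fmt(uint64_t addr, int is_write)
{
    int needs_64 = addr > 0xFFFFFFFFull;
    if (is_write)
        return needs_64 ? FMT_4DW_WITH_DATA : FMT_3DW_WITH_DATA;
    return needs_64 ? FMT_4DW_NO_DATA : FMT_3DW_NO_DATA;
}

/* Header length in DWORDs implied by a two-bit Fmt value. */
static unsigned header_dwords(unsigned fmt)
{
    assert(fmt <= 0x3);  /* three-bit Fmt values of later revisions not handled */
    return (fmt & 0x1) ? 4u : 3u;
}
```

If the endpoint also masters DMA, the same rule applies to the packets it generates: a 4DW header carrying an address below 4 GB is malformed.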

Does anyone have any ideas about what could cause the function to return 0xffffffff? One thing I’ve considered is that the device might be taking too long to send back a completion, causing the completion to time out. Is it possible that 64-bit platforms have a stricter timeout requirement? I’ve also read that this could be due to a master or target abort.

Do you happen to have access to one of those preciously expensive
PCIExpress bus analyzers?

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Calvin_Guan-2 · May 19, 2009, 2:18pm

Comments inline.

From debugging the firmware on my device I know that the
device is receiving and correctly handling PCI Express write
memory TLPs (PCI Express Transaction Layer Packet), so I
know that the addresses I’m using with WRITE_REGISTER_ULONG
and READ_REGISTER_ULONG are correctly mapped to my
device. Also I know that my device is receiving and
handling the read memory TLPs and generating a completion
TLP. I also know that the data that my device is
trying to return is NOT 0xffffffff.

See Tim’s reply for this part.

Does anyone have any ideas about what could cause
the function to return 0xffffffff? One thing I’ve
considered is that the device might be taking too long
to send back a completion, causing the completion to
time out.

A completion timeout is very unlikely from what I read in your description; you’d get a SERR#. Depending on how the chipset is configured, a SERR# can raise an NMI. I bet it’s a master abort.

Is there information somewhere that my driver can read to
determine if a read to my device has failed? Otherwise
I don’t see how a device could return data of 0xffffffff
without confusing it with a failed read.

A master abort or a target abort will do that. On most systems a master abort is forgivable, while a target abort is more catastrophic. Watch the PCI command/status registers (CSR). If your endpoint supports PCIe Advanced Error Reporting (AER), dump its state to see whether you have any correctable or uncorrectable errors.
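Calvin's advice to watch the CSR can be sketched as a decoder for the PCI Status register. The bit positions come from the PCI specification; the helper name is hypothetical. For a CPU read that fails, the bit of interest may be set on the root port (which acts as the requester) rather than on the endpoint, so it is worth dumping both; a bridge-type header such as a root port also has a Secondary Status register with largely the same layout (bit 14 means Received System Error there).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Error bits in the PCI Status register (config offset 0x06). A WDM
   driver sees the same value in PCI_COMMON_CONFIG.Status. */
#define ST_MASTER_DATA_PARITY 0x0100u
#define ST_SIGNALED_TGT_ABORT 0x0800u
#define ST_RECEIVED_TGT_ABORT 0x1000u
#define ST_RECEIVED_MST_ABORT 0x2000u
#define ST_SIGNALED_SERR      0x4000u
#define ST_DETECTED_PARITY    0x8000u

static const struct { uint16_t bit; const char *name; } status_errors[] = {
    { ST_MASTER_DATA_PARITY, "Master Data Parity Error" },
    { ST_SIGNALED_TGT_ABORT, "Signaled Target Abort" },
    { ST_RECEIVED_TGT_ABORT, "Received Target Abort" },
    { ST_RECEIVED_MST_ABORT, "Received Master Abort" },
    { ST_SIGNALED_SERR,      "Signaled System Error" },
    { ST_DETECTED_PARITY,    "Detected Parity Error" },
};

/* Prints each error bit set in 'status' and returns their mask. The bits
   are write-1-to-clear, so writing the returned mask back to offset 0x06
   clears them and the next failing read can be attributed cleanly. */
static uint16_t report_pci_status(const char *who, uint16_t status)
{
    uint16_t found = 0;
    assert(who != NULL);
    for (size_t i = 0; i < sizeof status_errors / sizeof status_errors[0]; i++) {
        if (status & status_errors[i].bit) {
            printf("%s: %s\n", who, status_errors[i].name);
            found |= status_errors[i].bit;
        }
    }
    return found;
}
```

Clearing the bits before a test read and dumping them right after makes it easier to tell a master abort (Received Master Abort) from a target abort (Received Target Abort).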

Any advice on how to figure out what’s going wrong here?

A PCIe bus analyzer will tell you exactly what happened and gives concrete, indisputable evidence. Yes, it’s expensive, but a serious developer should have one. An analyzer for a low lane count is not that expensive, though.

Thanks.
You’re welcome.

Calvin Guan
Broadcom Corp.
Connecting Everything(r)

Jeffry_Gummeson · May 19, 2009, 2:48pm

Thanks for your reply, Tim. Yes, what I’m saying is that I can see my device’s firmware generating completions for the very read requests that return 0xffffffff. There is a hardware layer underneath the firmware that transmits the TLP, and I’ve been assuming that it works correctly because everything is fine on 32-bit platforms; maybe I need to revisit that to be absolutely sure my completion is actually sent. But still, shouldn’t there be a status register somewhere that my driver can read to learn that a read returning 0xffffffff actually failed?

My device is mapped below the 4GB mark, and like I said earlier I know that my device is being addressed correctly because I’m getting the read and write TLPs. And in PCI Express, completions don’t contain an address field. There are requester and completer ID fields present in completion headers, and I’ve verified that the requester ID in my header matches the requester ID that was in the request I received, so it should be getting routed back to the root complex.
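Matching the Requester ID is necessary but not sufficient: the requester pairs a completion with its outstanding request by Requester ID and Tag, and uses fields such as Completion Status, Length, Byte Count and Lower Address to validate it. The sketch below builds and checks a successful single-completion CplD header for a memory read using the PCIe 1.x header layout. It assumes all byte enables are set (as for a 4-byte READ_REGISTER_ULONG); the struct and function names are hypothetical. A completion that fails these checks is typically dropped as an Unexpected Completion, and the original read then times out and comes back as all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of a received memory read request (3DW header) that the
   completion has to echo back. */
struct mrd_req {
    uint16_t requester_id;  /* bus[15:8] dev[7:3] fn[2:0] */
    uint8_t  tag;
    uint32_t addr;          /* DWORD-aligned address from header DW2 */
    uint16_t length_dw;     /* decoded length, 1..1024 */
};

#define CPLD_FMT_TYPE 0x4Au /* Fmt=10b, Type=01010b: completion with data */
#define CPL_STATUS_SC 0x0u  /* Successful Completion */

/* Builds DW0..DW2 of a CplD that returns all requested data at once.
   TC, attributes, TD and EP are left at zero. */
static void build_cpld_header(const struct mrd_req *r, uint16_t completer_id,
                              uint32_t hdr[3])
{
    assert(r->length_dw >= 1 && r->length_dw <= 1024);
    uint32_t byte_count = (uint32_t)r->length_dw * 4u;            /* all BEs set */
    hdr[0] = (CPLD_FMT_TYPE << 24) | (r->length_dw & 0x3FFu);     /* 1024 -> 0 */
    hdr[1] = ((uint32_t)completer_id << 16) | (CPL_STATUS_SC << 13)
           | (byte_count & 0xFFFu);                               /* 4096 -> 0 */
    hdr[2] = ((uint32_t)r->requester_id << 16) | ((uint32_t)r->tag << 8)
           | (r->addr & 0x7Fu);                                   /* Lower Address */
}

/* True if 'hdr' is a successful CplD whose routing and length fields
   match request 'r'. */
static bool cpld_matches(const uint32_t hdr[3], const struct mrd_req *r)
{
    return (hdr[0] >> 24) == CPLD_FMT_TYPE
        && (hdr[0] & 0x3FFu) == (r->length_dw & 0x3FFu)
        && ((hdr[1] >> 13) & 0x7u) == CPL_STATUS_SC
        && (hdr[2] >> 16) == r->requester_id
        && ((hdr[2] >> 8) & 0xFFu) == r->tag
        && (hdr[2] & 0x7Fu) == (r->addr & 0x7Fu);
}
```

Capturing the header the firmware hands to the core (for example with ChipScope) and running it through a check like cpld_matches is a cheap way to rule out a stale Tag or a wrong Lower Address before suspecting the platform.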

No, unfortunately I don’t have a bus analyzer. My device is using a Xilinx FPGA to implement PCI Express though, so one option I have is to debug with ChipScope, to see what’s going on with signals inside the FPGA. And I could use this method to verify that my completion is making it all the way down to the bus.

Thanks for the insight about the master abort, Guan. Do you know of anything I can do (either on the device side or the driver side) to determine whether I’m experiencing a master abort?

Thanks again for your help.

-Jeff

Tim_Roberts · May 19, 2009, 3:01pm

xxxxx@gmail.com wrote:

Thanks for your reply, Tim. Yes, what I’m saying is that I can see my device’s firmware generating completions for the very read requests that return 0xffffffff. There is a hardware layer underneath the firmware that transmits the TLP, and I’ve been assuming that it works correctly because everything is fine on 32-bit platforms; maybe I need to revisit that to be absolutely sure my completion is actually sent. But still, shouldn’t there be a status register somewhere that my driver can read to learn that a read returning 0xffffffff actually failed?

No. The processor doesn’t know anything about PCIExpress. The
processor bus is basically just PCI, where there is no concept of
failure. When a read fails, the data lines float, and you read all ones
– ffffffff. The PCIExpress protocol is strictly between you and the
root complex.
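Because the host cannot tell a failed read from real data, a common driver-side workaround is to treat 0xFFFFFFFF as suspect and confirm it by reading a register whose value can never be all ones. The sketch below assumes the device has such a hard-wired ID register; REG_ID_OFFSET, REG_ID_VALUE, the reg_read_fn callback (which in a real driver would wrap READ_REGISTER_ULONG on the mapped BAR) and the in-memory fake device are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor; in a driver this would wrap READ_REGISTER_ULONG. */
typedef uint32_t (*reg_read_fn)(void *ctx, uint32_t offset);

#define REG_ID_OFFSET 0x00u        /* hypothetical ID register */
#define REG_ID_VALUE  0x1234ABCDu  /* hypothetical value fixed in the FPGA */

enum read_result { READ_OK, READ_SUSPECT_FAILED };

/* Reads a register; if it returns all ones, re-reads the ID register to
   tell genuine 0xFFFFFFFF data from a read that never reached the device. */
static enum read_result checked_read(reg_read_fn rd, void *ctx,
                                     uint32_t offset, uint32_t *out)
{
    assert(rd != NULL && out != NULL);
    *out = rd(ctx, offset);
    if (*out != 0xFFFFFFFFu)
        return READ_OK;
    return rd(ctx, REG_ID_OFFSET) == REG_ID_VALUE ? READ_OK
                                                  : READ_SUSPECT_FAILED;
}

/* Simulated device for exercising the logic on a host. 'dead' models a
   link where every read completes with all ones. */
struct fake_dev { bool dead; uint32_t regs[4]; };

static uint32_t fake_read(void *ctx, uint32_t offset)
{
    const struct fake_dev *d = ctx;
    assert(offset / 4 < 4);
    return d->dead ? 0xFFFFFFFFu : d->regs[offset / 4];
}
```

This is only a heuristic: a failure that hits the first read but not the confirming one goes unnoticed. It does cost nothing on the normal path, since the extra read happens only when all ones come back.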

Whose PCIExpress core are you using? Does Xilinx supply one now?

Jeffry_Gummeson · May 19, 2009, 3:41pm

Yes, Xilinx offers the Block Plus IP, which is an implementation of PCI Express for their FPGAs.

Ok, so my CPU knows nothing about PCI Express, but if there’s an error it must at least be reported to my device through the PCI Express error reporting mechanism. I’ll check and see if my device is ever receiving any error messages.

Pavel_A1 · May 20, 2009, 6:09am

xxxxx@gmail.com wrote:

Ok, so my CPU knows nothing about PCI Express, but if there’s an error it must at least be reported to my device through the PCI Express error reporting mechanism. I’ll check and see if my device is ever receiving any error messages.

Yes, the root ports are mapped somewhere in memory; errors can be
seen there.
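If the root port or the endpoint implements the Advanced Error Reporting capability, its status registers say which kind of error was logged. The sketch below walks the extended capability list of a 4 KB configuration-space snapshot (however it was captured) to find AER (capability ID 0x0001) and decodes the Uncorrectable and Correctable Error Status registers at offsets 0x04 and 0x10. Bit positions follow the PCIe base specification; the helper names are hypothetical. For an all-ones read, Completion Timeout, Completer Abort, Unexpected Completion, Malformed TLP and Unsupported Request are the bits to look at first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define EXT_CAP_START    0x100u
#define EXT_CAP_ID_AER   0x0001u
#define AER_UNCOR_STATUS 0x04u
#define AER_COR_STATUS   0x10u

struct bit_name { uint32_t bit; const char *name; };

static const struct bit_name aer_uncor[] = {
    { 1u << 4,  "Data Link Protocol Error" }, { 1u << 5,  "Surprise Down" },
    { 1u << 12, "Poisoned TLP" },             { 1u << 13, "Flow Control Protocol Error" },
    { 1u << 14, "Completion Timeout" },       { 1u << 15, "Completer Abort" },
    { 1u << 16, "Unexpected Completion" },    { 1u << 17, "Receiver Overflow" },
    { 1u << 18, "Malformed TLP" },            { 1u << 19, "ECRC Error" },
    { 1u << 20, "Unsupported Request" },
};
static const struct bit_name aer_cor[] = {
    { 1u << 0, "Receiver Error" }, { 1u << 6, "Bad TLP" }, { 1u << 7, "Bad DLLP" },
    { 1u << 8, "REPLAY_NUM Rollover" }, { 1u << 12, "Replay Timer Timeout" },
};

/* Returns the config offset of extended capability 'id', or 0 if absent.
   'cfg' is a 4 KB snapshot of one function's config space as DWORDs. */
static uint16_t find_ext_cap(const uint32_t cfg[1024], uint16_t id)
{
    uint32_t off = EXT_CAP_START;
    for (int hops = 0; off >= EXT_CAP_START && off < 4096 && hops < 64; hops++) {
        uint32_t hdr = cfg[off / 4];
        if (hdr == 0 || hdr == 0xFFFFFFFFu)
            return 0;                   /* no extended capabilities */
        if ((hdr & 0xFFFFu) == id)
            return (uint16_t)off;
        off = (hdr >> 20) & 0xFFCu;     /* next pointer; 0 ends the list */
    }
    return 0;
}

/* Prints each named bit set in 'value' and returns the mask of those bits. */
static uint32_t decode_bits(const char *reg, uint32_t value,
                            const struct bit_name *tbl, size_t n)
{
    uint32_t found = 0;
    assert(reg != NULL && tbl != NULL);
    for (size_t i = 0; i < n; i++) {
        if (value & tbl[i].bit) {
            printf("%s: %s\n", reg, tbl[i].name);
            found |= tbl[i].bit;
        }
    }
    return found;
}
```

The status bits are write-1-to-clear, so clearing them before a test run and dumping them afterwards ties any logged error to that run.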

– pa

Jeffry_Gummeson · May 20, 2009, 10:25am

Anyone know where the root ports are mapped in the memory space or how I can see errors that are reported there? Or is this something that varies across chipsets? Or is there some interface to the root ports provided by Windows?

Calvin_Guan-2 · May 20, 2009, 4:08pm

Run !pci against the root complex (RC) from kd. The problem is that at the RC level you may not know which device is responsible for the error.

good luck,

Calvin Guan

Pavel_A1 · May 21, 2009, 7:26am

Calvin Guan wrote:

Run !pci against the root complex (RC) from kd. The problem is that at the RC level you may not know which device is responsible for the error.

Device Manager, view devices by connection.
There will be something like “ICHxxx Family PCI Express Root Port N -
xxxx” and your PCIe device sitting on it.
The bus/dev/func are shown in properties of the root port.
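On the earlier question of where the root ports are mapped: on chipsets that support the PCIe enhanced configuration mechanism (ECAM), every function's 4 KB configuration space, the root port's AER registers included, sits at a fixed offset from a base address that firmware reports in the ACPI MCFG table, so the location does vary by system. The sketch assumes the base corresponds to bus 0 of the segment; the 0xE0000000 base in the checks is hypothetical. bdf_to_id shows how the same bus/device/function triple packs into the 16-bit Requester and Completer ID fields.

```c
#include <assert.h>
#include <stdint.h>

/* Memory-mapped (ECAM) address of config register 'reg' of bus/dev/fn.
   Assumes 'ecam_base' is the MCFG base for bus 0 of this segment. */
static uint64_t ecam_address(uint64_t ecam_base, unsigned bus,
                             unsigned dev, unsigned fn, unsigned reg)
{
    assert(bus < 256 && dev < 32 && fn < 8 && reg < 4096);
    return ecam_base + ((uint64_t)bus << 20) + ((uint64_t)dev << 15)
                     + ((uint64_t)fn << 12) + reg;
}

/* 16-bit Requester/Completer ID: bus[15:8], device[7:3], function[2:0]. */
static uint16_t bdf_to_id(unsigned bus, unsigned dev, unsigned fn)
{
    assert(bus < 256 && dev < 32 && fn < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | fn);
}
```

In a Windows driver, config space is normally read through the bus driver (for example the GetBusData routine of BUS_INTERFACE_STANDARD) rather than by mapping ECAM directly; the arithmetic is mainly useful for interpreting debugger dumps.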

– pa

Calvin_Guan-2 · May 21, 2009, 1:20pm

The tree hierarchy can be examined with !pcitree or !devnode 1; that part is not the problem. The problem is that an error reported at the RC can come from any child or grandchild endpoint.
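Root ports that implement AER keep a record aimed at exactly this problem. The Root Error Status register (offset 0x30 in the AER capability) says whether an error message was received, and the Error Source Identification register (offset 0x34) holds the Requester ID of the function that sent it: ERR_COR in bits 15:0, ERR_FATAL/NONFATAL in bits 31:16. Per the PCIe base specification the ID is latched when a message arrives while the matching "received" bit is clear, so it names the first reporter since that bit was last cleared; this is worth confirming against the spec revision in use. The decoder below is a sketch with hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AER_ROOT_ERR_STATUS 0x30u
#define AER_ERR_SRC_ID      0x34u

/* Root Error Status bits */
#define RES_ERR_COR_RCVD       (1u << 0)
#define RES_MULTI_ERR_COR_RCVD (1u << 1)
#define RES_ERR_UNCOR_RCVD     (1u << 2)  /* ERR_FATAL/NONFATAL received */
#define RES_MULTI_UNCOR_RCVD   (1u << 3)

struct bdf { unsigned bus, dev, fn; };

static struct bdf id_to_bdf(uint16_t id)
{
    struct bdf b = { id >> 8, (id >> 3) & 0x1Fu, id & 0x7u };
    return b;
}

/* Fills 'out' with the function that sent the logged uncorrectable
   (uncor = true) or correctable error message; false if none is logged. */
static bool error_source(uint32_t root_status, uint32_t src_id, bool uncor,
                         struct bdf *out)
{
    assert(out != NULL);
    uint32_t rcvd = uncor ? RES_ERR_UNCOR_RCVD : RES_ERR_COR_RCVD;
    if (!(root_status & rcvd))
        return false;
    *out = id_to_bdf((uint16_t)(uncor ? src_id >> 16 : src_id & 0xFFFFu));
    return true;
}
```

When the "multiple" bits are also set, more than one error message arrived and only the first sender is identified, so the status should be cleared and the failing read repeated to attribute it.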

Calvin Guan

Pavel_A1 · May 21, 2009, 3:44pm

Calvin Guan wrote:

The tree hierarchy can be examined with !pcitree or !devnode 1; that part is not the problem. The problem is that an error reported at the RC can come from any child or grandchild endpoint.

Yes. Hope the OP’s device sits directly on the port, with nobody in the
middle.

– pa