Issues with DMA transfer of USB3 bulk-in data

Hello,

My company develops USB3 devices and I am on the driver team. We are developing an xHCI bus driver (in addition to a device driver specifically for our device) in WDF and we are having some odd problems with our DMAing of data, specifically bulk-in data.

Our device driver is pretty solid; it works very well when using any xHCI card we’ve tried with those vendors’ controller drivers installed. Therefore, I’m very certain that this problem is in the bus driver and not the device driver.

What happens is that the data comes in with blocks “missing”. What should be a solid data stream has empty portions in it, as well as slightly misaligned portions. These empty portions are the same size and in the same place each time, and happen intermittently (i.e., not every transfer). In other words, if I have a 100,000 byte data transfer, I might get something like this:

-good transfer-
-good transfer-
-bad transfer- bytes 2400 ~ 5000 “empty”, bytes 26700 ~ 27700 also “empty”
-good transfer-
-bad transfer- bytes 2400 ~ 5000 “empty”, bytes 26700 ~ 27700 also “empty”
-good transfer-
-good transfer-
… (etc)

Also, if I power cycle and restart the device, I’ll get similar behavior, but the pattern will be different. Sometimes there is just a small amount of corruption/empty spots, but sometimes pretty much every single data transfer is spotty and full of empty places (though consistently, the same places are empty).

The kicker is that sometimes this does NOT happen at all, and I’ll start the device and everything will work great for the duration that the device is plugged in:

-good transfer-
-good transfer-
-good transfer-
-good transfer-
… (etc)

There seems to be no specific reproduction case or pattern to getting this to happen. Further, I’ve looked at the sizes of the “empty spots” in a hex editor and they are seemingly random sizes like 592 and 35120.
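
A hex editor works, but scanning for the gaps programmatically makes it easier to compare offsets and sizes across transfers. Below is a minimal sketch in C. It assumes the receive buffer is pre-filled with a known byte before each transfer, so that regions the controller never wrote stand out from genuine data; `gap_t` and `find_gaps` are hypothetical names, not part of any driver or framework.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of "empty" bytes found in a received buffer. */
typedef struct {
    size_t offset;  /* byte offset of the first empty byte */
    size_t length;  /* number of consecutive empty bytes */
} gap_t;

/*
 * Scan buf for runs of at least min_len bytes equal to fill.
 * Stores up to max_gaps runs in out and returns the total number
 * of qualifying runs found (which may exceed max_gaps).
 */
size_t find_gaps(const uint8_t *buf, size_t len, uint8_t fill,
                 size_t min_len, gap_t *out, size_t max_gaps)
{
    size_t found = 0;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        size_t start;

        if (buf[i] != fill) {
            i++;
            continue;
        }

        /* Walk to the end of this run of fill bytes. */
        start = i;
        while (i < len && buf[i] == fill) {
            i++;
        }

        if (i - start >= min_len) {
            if (out != NULL && found < max_gaps) {
                out[found].offset = start;
                out[found].length = i - start;
            }
            found++;
        }
    }
    return found;
}
```

Logging the (offset, length) pairs for every completed transfer would show whether the gaps really land at the same offsets each time, and the fill byte (for example 0xEE) distinguishes "never written" from "written as zeros".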

Can anyone think of any reason this might be happening?
I can provide code if anyone is willing to take a look. The fact that things completely work sometimes and are super spotty other times makes me wonder if this is some sort of synchronization issue or something? It seems like the actual DMA mechanism does work (sometimes :) ).

xxxxx@gmail.com wrote:

> My company develops USB3 devices and I am on the driver team. We are developing an xHCI bus driver…

Why? Isn’t that pretty much a solved problem today? What do you gain?

> What happens is that the data comes in with blocks “missing”. What should be a solid data stream has empty portions in it, as well as slightly misaligned portions. These empty portions are the same size and in the same place each time, and happen intermittently (i.e., not every transfer). In other words, if I have a 100,000 byte data transfer, I might get something like this:
>
> -good transfer-
> -good transfer-
> -bad transfer- bytes 2400 ~ 5000 “empty”, bytes 26700 ~ 27700 also “empty”
> -good transfer-
> -bad transfer- bytes 2400 ~ 5000 “empty”, bytes 26700 ~ 27700 also “empty”
> -good transfer-
> -good transfer-
> … (etc)

Have you considered acquiring a PCIExpress bus analyzer to see if you
can capture these transactions at the bus level?

Have you triple-checked the way you construct the TRBs? Are you quite
sure you’re getting your request lists to the hardware in time for the
frame?

> There seems to be no specific reproduction case or pattern to getting this to happen. Further, I’ve looked at the sizes of the “empty spots” in a hex editor and they are seemingly random sizes like 592 and 35120.

Do they happen to start or end on a page boundary?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for the reply Tim.

> Why? Isn’t that pretty much a solved problem today? What do you gain?

The idea is to have an entire solution available for customers provided by our company; taking control of the entire driver stack is one way we do this.
Depending on the host controller, there are cases where our drivers have worked, and continue to work, better than the vendor drivers, which matters to some specific customers. Over time this has become increasingly rare (luckily), but it still happens.
Furthermore, there is an older semi-complete xHCI driver that a lot of this project is being based upon, so this is not an entirely from-the-ground-up build.
FWIW, the decision was not my own. :)

> Have you considered acquiring a PCIExpress bus analyzer to see if you
> can capture these transactions at the bus level?

Working on it now. We do actually have one at the company, but it’s often in use. This is something I will be doing very soon, however.
I’ll try and figure out if I can see the missing data.

> Have you triple-checked the way you construct the TRBs? Are you quite
> sure you’re getting your request lists to the hardware in time for the
> frame?

I have quadruple-checked the TRBs, and the xHCI spec as well; they are, to the very best of my knowledge, correct.
I’m not entirely sure I understand what is meant by “request lists”. Are you referring to the TRBs themselves? I think this should be correct, since all TRBs are set up before the doorbell is rung.

> Do they happen to start or end on a page boundary?

I do suspect this is the case and am currently looking into this. Does this raise an eyebrow toward some specific issue in some way?
I thought this might be single scatter-gather segments that were somehow being missed, but I am not currently sure.
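
Offsets within the transfer do not show page alignment on their own, because the buffer rarely starts at the beginning of a page. A rough sketch of the check in C follows, assuming 4 KB pages and assuming the byte offset of the buffer start within its first page is known (in a Windows driver, `MmGetMdlByteOffset` on the request's MDL reports that value). `gap_pages_t` and `classify_gap` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u  /* assumption: 4 KB pages, as on x86 */

typedef struct {
    size_t first_page;       /* index of the page holding the first empty byte */
    size_t last_page;        /* index of the page holding the last empty byte */
    int starts_on_boundary;  /* nonzero if the gap begins exactly at a page start */
    int ends_on_boundary;    /* nonzero if the gap ends exactly at a page end */
} gap_pages_t;

/*
 * first_page_offset: offset of the transfer buffer within its first page.
 * gap_offset, gap_len: position and size of the empty run within the transfer.
 * Page indices count from the first page of the buffer (index 0).
 */
gap_pages_t classify_gap(size_t first_page_offset, size_t gap_offset,
                         size_t gap_len)
{
    gap_pages_t r;
    size_t start;
    size_t end;

    assert(first_page_offset < PAGE_SIZE_BYTES);
    assert(gap_len > 0);

    /* Positions measured from the start of the buffer's first page. */
    start = first_page_offset + gap_offset;
    end = start + gap_len;  /* one past the last empty byte */

    r.first_page = start / PAGE_SIZE_BYTES;
    r.last_page = (end - 1) / PAGE_SIZE_BYTES;
    r.starts_on_boundary = (start % PAGE_SIZE_BYTES) == 0;
    r.ends_on_boundary = (end % PAGE_SIZE_BYTES) == 0;
    return r;
}
```

If the gaps consistently start and end on page boundaries, whole scatter/gather elements are the likely unit being dropped; if only one end is aligned, the TRB length or pointer arithmetic for a partial page becomes the more interesting suspect.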

xxxxx@gmail.com wrote:

>> Do they happen to start or end on a page boundary?
> I do suspect this is the case and am currently looking into this. Does this raise an eyebrow toward some specific issue in some way?
> I thought this might be single scatter-gather segments that were somehow being missed, but I am not currently sure.

Yes, that’s where I was going. Does your motherboard have enough
horsepower to keep up with USB 3? Is your memory subsystem fast
enough? What if part of the scatter/gather fetch wasn’t ready by the
time you needed it? That’s the direction I’m thinking.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Yes, that’s where I was going. Does your motherboard have enough
> horsepower to keep up with USB 3? Is your memory subsystem fast
> enough?

Interesting. Well, if I switch from our driver to the vendor driver and keep all else the same, things work properly, which makes me think that the system is powerful enough to handle USB3.

> What if part of the scatter/gather fetch wasn’t ready by the
> time you needed it?

Is there a way to tell if it is not ready? I assumed that the framework passed to me a ready-to-use scatter gather list and I needn’t do any checks on this. Our functions are mostly based off the Microsoft examples here:
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/dma.docx
They don’t do any sort of ready-checks as far as I can tell.

Another thing I was wondering about was if the memory was somehow being prematurely freed before being fetched by the application software, though if this was the case, I don’t know how to check for it or prevent it. Does this sound feasible? To the best of my knowledge, all that I can do is wait for the interrupt/event from the host controller, pull out the data, and complete the request at that point; I don’t know how or why the data could be getting freed.

Something I forgot to mention is that I have recorded a USB bus trace and the device itself is most certainly sending the full data transfer as expected.

Does it happen if you limit the RAM on the SUT to 3GB?

Hi Alex,

I tried limiting it to 3GB and to 2GB as well, but the problem does still happen.
I’m curious what you were wondering about? Some sort of issue with 64-bit pointers?
FWIW, I am testing on an x86 Windows 8 system.

Thanks!

xxxxx@gmail.com wrote:

>> What if part of the scatter/gather fetch wasn’t ready by the
>> time you needed it?
> Is there a way to tell if it is not ready? I assumed that the framework passed to me a ready-to-use scatter gather list and I needn’t do any checks on this.

Well, you have the TRBs themselves, which have to be read by the device,
and you have the buffers described by the TRBs. That’s two different
DMA transactions, all under the control of the controller, not the host
system.

> Another thing I was wondering about was if the memory was somehow being prematurely freed before being fetched by the application software, though if this was the case, I don’t know how to check for it or prevent it. Does this sound feasible? To the best of my knowledge, all that I can do is wait for the interrupt/event from the host controller, pull out the data, and complete the request at that point; I don’t know how or why the data could be getting freed.

If you are using the host application’s buffers directly (rather than
copying), then you shouldn’t have to worry about this.

> I tried limiting it to 3GB and to 2GB as well, but the problem does still happen.
> I’m curious what you were wondering about? Some sort of issue with 64-bit pointers?

Exactly. In a system with more than about 3GB of RAM, some of the
physical addresses may be above the 4GB boundary. If you don’t treat
the addresses as 64-bit quantities, or if the chip doesn’t handle 64-bit
addresses, that causes problems.
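
To illustrate the 64-bit point (which the 2GB/3GB test above makes a less likely cause here), the sketch below shows a buffer address being stored as two 32-bit halves. It assumes a transfer TRB laid out as four 32-bit words with the buffer pointer in the first two; `trb_t` and the helper names are hypothetical, not taken from any real driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 16-byte TRB layout: buffer pointer low/high, status, control. */
typedef struct {
    uint32_t buf_lo;
    uint32_t buf_hi;
    uint32_t status;
    uint32_t control;
} trb_t;

/* Store the full 64-bit physical address; dropping the high half would
 * silently redirect the DMA to a different address below 4GB. */
void trb_set_buffer(trb_t *trb, uint64_t phys)
{
    assert(trb != 0);
    trb->buf_lo = (uint32_t)(phys & 0xFFFFFFFFu);
    trb->buf_hi = (uint32_t)(phys >> 32);
}

/* Reassemble the address, e.g. for a debug check after building a TRB. */
uint64_t trb_get_buffer(const trb_t *trb)
{
    return ((uint64_t)trb->buf_hi << 32) | trb->buf_lo;
}

/* Nonzero if the address cannot be represented in 32 bits. */
int needs_64bit(uint64_t phys)
{
    return (phys >> 32) != 0;
}
```

A debug assertion that `trb_get_buffer` returns exactly the address the scatter/gather list supplied is a cheap way to rule out truncation through a 32-bit intermediate variable.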


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Here is a hack idea. I have seen excessive DMA operations sometimes get killed by the chipset.
Perhaps you can enter BIOS setup on the platform and disable C states?
It’s not a real solution, but who knows.

> I have quadruple-checked the TRBs, and the xHCI spec as well; they are, to the very best of my knowledge, correct.
> I’m not entirely sure I understand what is meant by “request lists”. Are you referring to the TRBs themselves? I think this should be correct, since all TRBs are set up before the doorbell is rung.

A couple things to check on your TRB construction:

* The doorbell ring is not the synchronizing mechanism - once the xHC starts
reading the ring (after a doorbell), it will continue to fetch TRBs based on
the value of the cycle bit. So it could very well fetch TRBs before you
ring the doorbell, particularly in high-traffic situations.
* Make sure you don’t change the cycle bit of the 1st TRB of the TD until
the entire TD is constructed. This was new in xHCI 1.0, so if your software
was originally written for xHCI 0.96, it might not have done this.
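
The two points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's code: `trb_t` and `enqueue_td` are hypothetical, the cycle bit is taken to be bit 0 of the control word, `ring[0]` is taken to be the current enqueue position, the TD is assumed to fit without crossing a Link TRB, and the C11 fence stands in for whatever barrier the platform provides (`KeMemoryBarrier` in a Windows kernel driver).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TRB_CYCLE 0x1u  /* assumed: cycle bit is bit 0 of the control word */

typedef struct {
    uint32_t buf_lo, buf_hi, status, control;
} trb_t;

/*
 * Copy a TD of n TRBs into the ring starting at ring[0].
 * pcs is the producer cycle state (0 or 1). Any cycle bit already set
 * in src[i].control is ignored and replaced.
 */
void enqueue_td(volatile trb_t *ring, const trb_t *src, size_t n, uint32_t pcs)
{
    size_t i;

    assert(ring != NULL && src != NULL);
    assert(n > 0 && pcs <= 1u);

    /* First TRB: fill in everything, but keep the INVERTED cycle bit so
     * the controller still sees this slot as not yet valid. */
    ring[0].buf_lo = src[0].buf_lo;
    ring[0].buf_hi = src[0].buf_hi;
    ring[0].status = src[0].status;
    ring[0].control = (src[0].control & ~TRB_CYCLE) | (pcs ^ 1u);

    /* Remaining TRBs get the real cycle bit; the controller stops at the
     * first TRB, so it does not reach these yet. */
    for (i = 1; i < n; i++) {
        ring[i].buf_lo = src[i].buf_lo;
        ring[i].buf_hi = src[i].buf_hi;
        ring[i].status = src[i].status;
        ring[i].control = (src[i].control & ~TRB_CYCLE) | pcs;
    }

    /* Order all of the stores above before the hand-off store below. */
    atomic_thread_fence(memory_order_release);

    /* Hand-off: flipping the first TRB's cycle bit publishes the whole TD.
     * The doorbell write would follow this. */
    ring[0].control = (src[0].control & ~TRB_CYCLE) | pcs;
}
```

The design choice is that the first TRB's cycle bit, not the doorbell, acts as the publish step: a controller that is already walking the ring never sees a partially built TD.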