Windows 7 USB performance inferior to XP

Hi,

I have debugged further and found convincing proof that Windows 7 USB
performance is inferior to that of XP, and that it is by design! I would
like to have your input on this.
You can also refer to my earlier related post at :
http://groups.google.com/group/microsoft.public.development.device.drivers/browse_thread/thread/66776f7ee7d48741

The first thing I noticed with Win7 is that on the USB bus analyzer we can
see gaps of more than 2 microframes (sometimes 8 microframes) with NO
bulk-IN traffic at all, whether using a continuous reader or manual polling
with more than 32 requests pending. From USB ETW logs as well as prints in
my driver I confirmed that at least one request was pending with the Host
Controller all the time (actually more than 20 requests were always
pending).
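
For reference, this is roughly how the continuous reader is set up (a
simplified sketch of my setup, not the exact driver code; the pipe handle,
callback names, and the 64-byte transfer size are placeholders):

#include <ntddk.h>
#include <wdf.h>
#include <usb.h>
#include <wdfusb.h>

EVT_WDF_USB_READER_COMPLETION_ROUTINE EvtBulkInReadComplete;
EVT_WDF_USB_READERS_FAILED EvtBulkInReadersFailed;

NTSTATUS
StartBulkInContinuousReader(
    WDFUSBPIPE BulkInPipe        // already-selected bulk-IN pipe (placeholder)
    )
{
    WDF_USB_CONTINUOUS_READER_CONFIG config;
    NTSTATUS status;

    // One buffer per read request; 64 bytes is a placeholder size.
    WDF_USB_CONTINUOUS_READER_CONFIG_INIT(&config,
                                          EvtBulkInReadComplete,
                                          NULL,      // completion context
                                          64);       // TransferLength

    // Keep a deep queue of reads pending in the host controller driver
    // (the framework may clamp this to its supported maximum).
    config.NumPendingReads = 32;
    config.EvtUsbTargetPipeReadersFailed = EvtBulkInReadersFailed;

    status = WdfUsbTargetPipeConfigContinuousReader(BulkInPipe, &config);
    if (NT_SUCCESS(status)) {
        // Reads actually start when the pipe's I/O target is started.
        status = WdfIoTargetStart(WdfUsbTargetPipeGetIoTarget(BulkInPipe));
    }
    return status;
}

VOID
EvtBulkInReadComplete(WDFUSBPIPE Pipe, WDFMEMORY Buffer,
                      size_t NumBytesTransferred, WDFCONTEXT Context)
{
    // Consume the data; Buffer holds NumBytesTransferred bytes.
    UNREFERENCED_PARAMETER(Pipe);
    UNREFERENCED_PARAMETER(Buffer);
    UNREFERENCED_PARAMETER(NumBytesTransferred);
    UNREFERENCED_PARAMETER(Context);
}

BOOLEAN
EvtBulkInReadersFailed(WDFUSBPIPE Pipe, NTSTATUS Status, USBD_STATUS UsbdStatus)
{
    UNREFERENCED_PARAMETER(Pipe);
    UNREFERENCED_PARAMETER(Status);
    UNREFERENCED_PARAMETER(UsbdStatus);
    return TRUE;   // TRUE: framework resets the pipe and restarts the reader
}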

I then broke into the machine during a heavy file transfer and walked
through the QTD list of the QH (Queue Head) representing the bulk-IN
endpoint of my device.
On XP, the number of QTDs linked to the QH corresponds to the number of
requests pending with the Host Controller at that point in time.
However, on Win 7 I was surprised to find that there were only 2 QTDs
linked to the QH even though 20 requests were pending with the host
controller!
This probably leads to delays in resubmitting a new request after
completion of the last one.
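
For anyone reading the dumps below, this is roughly how the dwords map onto
the EHCI data structures (field names are mine, layout per the EHCI 1.0
spec, sections 3.5 and 3.6; `!dd [uc]` dumps uncached physical memory, so
each output line is the raw dwords of one of these structures):

#include <stdint.h>

/* EHCI Queue Element Transfer Descriptor (qTD), EHCI spec section 3.5.
 * Each 8-dword block walked below is one of these. */
typedef struct _EHCI_QTD {
    uint32_t NextQtd;        /* dword 0: Next qTD Pointer (bit 0 = Terminate) */
    uint32_t AltNextQtd;     /* dword 1: Alternate Next qTD Pointer (taken on short packet) */
    uint32_t Token;          /* dword 2: status/PID/total bytes; bit 7 = Active */
    uint32_t BufferPage[5];  /* dwords 3-7: up to 5 physical buffer page pointers */
} EHCI_QTD;

/* EHCI Queue Head (QH), EHCI spec section 3.6. */
typedef struct _EHCI_QH {
    uint32_t HorizontalLink; /* dword 0: next QH in the async ring (bits 2:1 = type) */
    uint32_t EndpointChar;   /* dword 1: device address, endpoint number, max packet, H-bit */
    uint32_t EndpointCaps;   /* dword 2: multiplier, s-mask/c-mask (periodic only) */
    uint32_t CurrentQtd;     /* dword 3: physical address of the qTD being worked on */
    EHCI_QTD Overlay;        /* dwords 4-11: transfer overlay of the current qTD */
} EHCI_QH;

#define QTD_TOKEN_ACTIVE 0x00000080u  /* e.g. token 85eb8d80 below has Active set */
#define LINK_TERMINATE   0x00000001u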

Here are the WinDbg dumps for both OSes:

XP

Comment : Host controller BAR is fe02b000
0: kd> !dd [uc] fe02b010
#fe02b010 00010021 00008008 0000003f 00002bfe
#fe02b020 00000000 05707000 05941100 00000000

Comment : Async List Base Address = 05941100
0: kd> !dd [uc] 05941100

5941100 05708002 00406001 40000000 05941200

Comment : First QH seen = 05708002
0: kd> !dd [uc] 05708002

5708000 055a1102 00008000 40000000 00000000

Comment : That was with H-bit set. The QH horizontal pointer points to
another QH = 055a1102
0: kd> !dd [uc] 055a1102

55a1100 05922102 02002201 40000000 055a1200

Comment : That QH was for bulk-out endpoint (EP 2). Get the next one
now.
0: kd> !dd [uc] 05922102

5922100 05941102 02002101 40000000 05922500

Comment : Got bulk IN endpoint. Move to its QTD.
0: kd> !dd [uc] 05922500

5922500 05922600 05922600 85eb8d80 05636a10

Comment : That one was pending with its ACTIVE bit set. Move
horizontal to another QTD.
0: kd> !dd [uc] 05922600

5922600 05922300 05922300 85eb8d80 05862638

Comment : Next QTD
0: kd> !dd [uc] 05922300

5922300 05922400 05922400 85eb8d80 055d7a10

and so on

Windows 7

Comment : Host controller BAR is fe02b000
1: kd> !dd [uc] fe02b010
#fe02b010 00010021 0000a009 0000003f 00000642
#fe02b020 00000000 3af2d000 3a99c080 00000000

Comment : Async List Base Address = 3a99c080
1: kd> !dd [uc] 3a99c080
#3a99c080 3a9b8082 02002201 40000000 3a99c100

Comment : First QH seen = 3a9b8082
1: kd> !dd [uc] 3a9b8082
#3a9b8080 3af30002 02002101 40000000 3a9b8100

Comment : Got bulk IN endpoint. Move to its QTD.
1: kd> !dd [uc] 3a9b8100
#3a9b8100 3a9b8180 3a9b8180 85ae8d00 3ba04050

Comment : Next QTD
1: kd> !dd [uc] 3a9b8180
#3a9b8180 3a9b8100 3a9b8100 05ae8d00 3ba389f8

As can be seen, there are only 2 QTDs, pointing to each other.
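
Just to show the walk in code form, this is roughly what I did by hand
above (using the structure layout from earlier; MapQtd is a made-up helper
standing in for the physical-to-virtual translation that `!dd [uc]` does
for me in the debugger):

/* Hypothetical helper: return a usable pointer for a qTD link pointer
 * (physical address with its low control bits masked off). */
extern EHCI_QTD *MapQtd(uint32_t physLink);

/* Count the qTDs reachable from a QH by following Next qTD pointers,
 * stopping on the Terminate bit or when the chain wraps back on itself. */
static unsigned CountLinkedQtds(const EHCI_QH *qh)
{
    uint32_t first = qh->CurrentQtd & ~0x1Fu;   /* strip low control bits */
    uint32_t link  = first;
    unsigned count = 0;

    while (link != 0 && !(link & LINK_TERMINATE)) {
        EHCI_QTD *qtd = MapQtd(link & ~0x1Fu);
        count++;
        link = qtd->NextQtd;
        if ((link & ~0x1Fu) == first) {
            break;  /* chain wrapped around, as in the Win7 dump (2 qTDs) */
        }
    }
    return count;
}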

Regards
Suresh

The EHCI bulk transfer request queuing logic change happened between Windows XP and Vista, not between Vista and Windows 7, so to be clear, this is not a change in behavior introduced in Windows 7. The behavior has been the same in Vista for the 3+ years that it has been released.

The EHCI bulk transfer request queuing logic was optimized more for the case of a small number of large transfers than for a large number of small transfers. Ideally the optimization parameters would have been configurable somehow, but in the implementation today in Vista and Windows 7 they are fixed, and TDs for at most two transfers will be actively queued on a Queue Head at the same time.
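
For a client driver the practical implication is that submitting many small reads does not keep the hardware queue deep; if the device can stream data, asking for fewer, larger transfers and letting the host controller driver split them into qTDs works with this behavior rather than against it. A rough sketch, reusing the placeholder names from the continuous-reader sketch earlier in this thread (the 64KB read size is illustrative only):

NTSTATUS
ConfigureLargeBulkInReader(
    WDFUSBPIPE BulkInPipe        // placeholder bulk-IN pipe handle
    )
{
    WDF_USB_CONTINUOUS_READER_CONFIG config;

    // Pend a couple of large reads instead of many small ones; the host
    // controller driver breaks each one into qTDs for the hardware.
    WDF_USB_CONTINUOUS_READER_CONFIG_INIT(&config,
                                          EvtBulkInReadComplete,
                                          NULL,
                                          64 * 1024);   // illustrative size
    config.NumPendingReads = 2;   // matches the two transfers EHCI keeps queued

    return WdfUsbTargetPipeConfigContinuousReader(BulkInPipe, &config);
}

Whether that actually helps depends on the device: a bulk-IN transfer completes on the first short packet, so a device that sends small, short-packet-terminated messages will still complete one transfer per message.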

-Glen

Hey Glen…

What’s the advantage that you get by limiting the number of bulk transfer requests per endpoint (QH) in the schedule? How does this favor a small number of large transfers?

In case the intent of my question isn’t clear: I’m not trying to argue, I’m trying to learn from somebody who knows how this stuff really works.

Peter
OSR

If you run high volume traffic, it always made sense to queue a few large
transfers, rather than many single frame requests. You need fewer queue
updates. Live queue update has been the trickiest operation, especially
in UHCI (whose architecture is quite half-brained).

I remember hitting UHCI’s major design flaw: when all bulk requests in the
queue are completed, the hardware keeps re-reading the descriptors in a very
fast loop, consuming much of the system memory bandwidth.


Alexander Grigoriev wrote:

If you run high volume traffic, it always made sense to queue a few large
transfers, rather than many single frame requests. You need fewer queue
updates.

Yes, but that’s a very different situation. You’re talking about
sending large (multi-packet) requests to the host controller driver,
rather than a host of smaller ones. The discussion here is about the
individual packet scheduling at the hardware level. At that point,
large requests will already have been chopped into packet-sized pieces.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Managing endpoints in the host controller driver in response to URB_FUNCTION_SELECT_CONFIGURATION and URB_FUNCTION_SELECT_INTERFACE requests requires allocating and managing common buffer memory that will be used for the host controller Queue Head (QH) and Queue Element Transfer Descriptor (qTD) data structures. There are various ways that this could be implemented, with either static or dynamic schemes.

The EHCI implementation in Vista and Windows 7 is static from the point that an endpoint is configured/opened to the point that an endpoint is unconfigured/closed. The host controller driver allocates a fixed chunk of common buffer memory per endpoint and carves that up into a QH and a set of qTDs. The tuning question is then how much common buffer you allocate per endpoint and how you manage the set of qTDs for that endpoint. The system performance people will always want you to treat common buffer as a scarce resource and to minimize the amount of common buffer that you allocate.
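
As a rough illustration of what "carving up" a per-endpoint common buffer allocation can look like (this is a simplified sketch, not the actual host controller driver code; AllocateCommonBuffer is the standard DMA adapter routine, and EHCI_QH/EHCI_QTD are the layouts sketched earlier in the thread):

#include <ntddk.h>

// Simplified per-endpoint context: one QH plus a fixed pool of qTDs, all
// carved out of a single common buffer allocation.
typedef struct _ENDPOINT_HW_CONTEXT {
    PVOID             CommonBufferVa;    // kernel VA of the allocation
    PHYSICAL_ADDRESS  CommonBufferPa;    // physical base; the QH/qTD link
                                         // pointers are offsets from this
    SIZE_T            CommonBufferSize;
    EHCI_QH          *Qh;
    EHCI_QTD         *QtdPool;           // 32-byte aligned qTD pool
    ULONG             QtdCount;
} ENDPOINT_HW_CONTEXT;

NTSTATUS
AllocateEndpointHwContext(
    PDMA_ADAPTER         DmaAdapter,
    ULONG                QtdCount,       // e.g. 2 columns x N qTDs per column
    ENDPOINT_HW_CONTEXT *Ep
    )
{
    // Room for the QH, alignment slack, and the qTD pool.
    ULONG size = sizeof(EHCI_QH) + 32 + QtdCount * sizeof(EHCI_QTD);
    ULONG_PTR qtdStart;

    // One contiguous, DMA-visible allocation per endpoint.
    Ep->CommonBufferVa = DmaAdapter->DmaOperations->AllocateCommonBuffer(
                             DmaAdapter, size, &Ep->CommonBufferPa,
                             FALSE /* CacheEnabled */);
    if (Ep->CommonBufferVa == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    RtlZeroMemory(Ep->CommonBufferVa, size);
    Ep->CommonBufferSize = size;
    Ep->Qh = (EHCI_QH *)Ep->CommonBufferVa;

    // qTDs must be 32-byte aligned; round up past the QH.
    qtdStart = ((ULONG_PTR)(Ep->Qh + 1) + 31) & ~(ULONG_PTR)31;
    Ep->QtdPool  = (EHCI_QTD *)qtdStart;
    Ep->QtdCount = QtdCount;
    return STATUS_SUCCESS;
}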

An EHCI QH points to the current qTD (the current qTD is also actually contained inside the Transfer Overlay of the QH) and each qTD contains a Next qTD Pointer and an Alternate Next qTD Pointer that is followed if the current qTD is retired due to a short packet.

In order to simplify the management of the qTD Next and Alternate Next qTD Pointers, a scheme was implemented in Vista where an endpoint essentially has a set of columns of qTDs, where each column can only be used for one transfer at a time. In each column of qTDs, the Next qTD Pointer of each qTD points down to the next lower qTD in the column, except for the last qTD of the transfer in that column, which points across to the first qTD of the next column. Also, all qTD Alternate Next qTD Pointers in a column point across to the first qTD of the next column. As long as a short transfer does not occur, qTDs will be consumed down a column for a transfer until the end of the transfer is reached. If a short transfer does occur, an Alternate Next qTD Pointer will be followed straight to the start of the next column of qTDs for the next transfer.
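
A minimal sketch of that wiring, written from the description above rather than from the actual source (Qtd() and QtdPhysicalAddress() are made-up helpers returning the virtual pointer and the physical address of the qTD at a given column/row; token and buffer pointer setup is omitted):

#define COLUMNS          2    // two transfers actively queued at a time
#define QTDS_PER_COLUMN  26   // illustrative; see the sizing note below

EHCI_QTD *Qtd(unsigned column, unsigned row);                 // made-up helper
uint32_t  QtdPhysicalAddress(unsigned column, unsigned row);  // made-up helper

// Done once when the endpoint is opened: every qTD's Alternate Next pointer
// points across to the head of the next column, so a short packet skips the
// rest of the current transfer and lands on the next transfer's first qTD.
void InitAltNextPointers(void)
{
    for (unsigned c = 0; c < COLUMNS; c++) {
        uint32_t nextColumnHead = QtdPhysicalAddress((c + 1) % COLUMNS, 0);
        for (unsigned r = 0; r < QTDS_PER_COLUMN; r++) {
            Qtd(c, r)->AltNextQtd = nextColumnHead;
        }
    }
}

// Done per transfer: chain Next pointers down the chosen column for however
// many qTDs this transfer needs, with the last one pointing across to the
// head of the next column (where the next transfer will start).
void QueueTransferIntoColumn(unsigned c, unsigned qtdsUsed)
{
    uint32_t nextColumnHead = QtdPhysicalAddress((c + 1) % COLUMNS, 0);
    for (unsigned r = 0; r < qtdsUsed; r++) {
        Qtd(c, r)->NextQtd = (r + 1 < qtdsUsed)
                                 ? QtdPhysicalAddress(c, r + 1)  // down the column
                                 : nextColumnHead;               // across
    }
}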

I believe the motivation for implementing this scheme was to allow the Alternate Next qTD Pointer fields of all qTDs for an endpoint to be statically initialized when an endpoint is first opened, and to eliminate the need to update the Alternate Next qTD Pointer fields on the fly as transfers are queued, completed, or aborted/canceled on an endpoint. Historically there might have been a number of bugs in the Alternate Next qTD Pointer management in Windows XP that were difficult to debug and fix, and this was one way to try to address those bugs.

So anyway, given that in this implementation an EHCI Bulk (or Interrupt) endpoint has a fixed amount of common buffer allocated to it that you want to minimize, and the set of qTDs carved out of that common buffer needs to be divided into a fixed number of columns containing a fixed number of qTDs per column/transfer, how do you tune that for transfer size vs. number of transfers? The choice that was made was enough qTDs per column for a 512KB transfer, and enough qTDs for two columns of transfers. For the same amount of common buffer the transfer size could have been reduced to allow more columns of transfers, or the number of columns could have been increased at the cost of allocating more common buffer. The parameters chosen at the time might have seemed reasonable for devices such as USB mass storage devices, scanners, or printers, but don't work so well for USB networking devices, which typically want to use a large number of small transfers.
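
As a rough back-of-the-envelope check on what that choice costs (my arithmetic here, not figures from the original design): an EHCI qTD carries five buffer page pointers, so one qTD can describe up to 20KB of page-aligned data (16KB guaranteed when the buffer is not page-aligned).

// Rough sizing math; assumptions: 32-byte qTDs, 4KB pages.
#define QTD_MAX_ALIGNED    (20 * 1024)   // 5 buffer pages x 4KB, page-aligned
#define QTD_MAX_UNALIGNED  (16 * 1024)   // guaranteed when not page-aligned
#define TRANSFER_SIZE      (512 * 1024)

// 512KB / 20KB -> 26 qTDs per column (32 if buffers are unaligned).
// At 32 bytes per qTD that is roughly 1KB of common buffer per column, so
// two columns plus the QH come to only a few KB per endpoint.
enum {
    QtdsPerColumnAligned   = (TRANSFER_SIZE + QTD_MAX_ALIGNED   - 1) / QTD_MAX_ALIGNED,   // 26
    QtdsPerColumnUnaligned = (TRANSFER_SIZE + QTD_MAX_UNALIGNED - 1) / QTD_MAX_UNALIGNED  // 32
};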

Ideally the implementation would have been able to dynamically allocate qTDs as necessary across whatever size of transfer requests were submitted on an endpoint, or at least have some mechanism to tweak the fixed qTD allocation between transfer size and number of transfers, but that is not what exists today in Vista and Windows 7. (I believe some of the non-Windows OSes handle this better, and criticism here is fair.)

This EHCI code could be rewritten yet again to address this in Windows 8 and/or QFEs, but the reality is that at this point the focus is on getting this right from the start for USB3/XHCI (where the host controller hardware is a lot more DMA friendly and efficient to start with).

-Glen

Thanks, Glen.

I appreciate your taking the time to give such a clear and thorough answer.

Peter
OSR