USB2 high-speed isochronous performance issues

Hi guys,

I have a USB2 function driver that creates a set of URBs for high-speed
isochronous IN transfers. It passes these URBs down to USBD and recycles them
when they come back. There’s nothing fancy going on, just the standard model
for doing isochronous streams into a ring buffer.
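
For readers who don't know the model, here is a rough sketch of the recycling scheme in plain C. It is a user-mode simulation, not driver code: iso_urb_sim, iso_stream_sim, stream_start and stream_on_complete are made-up names, the sizes are arbitrary, and nothing here is the real USBD URB structure or API.

```c
#include <stddef.h>
#include <string.h>

#define NUM_URBS   8                      /* illustrative count of outstanding URBs */
#define URB_BYTES  64                     /* hypothetical payload per URB */
#define RING_BYTES (NUM_URBS * URB_BYTES)

/* Hypothetical stand-in for an isochronous URB (not the USBD structure). */
struct iso_urb_sim {
    size_t ring_offset;   /* slice of the ring buffer this URB fills */
    int    pending;       /* 1 while the lower driver "owns" the URB */
};

struct iso_stream_sim {
    unsigned char      ring[RING_BYTES];
    struct iso_urb_sim urbs[NUM_URBS];
    size_t             completed;         /* total completions seen so far */
};

/* Start-up: give each URB a fixed slice of the ring and submit them all. */
static void stream_start(struct iso_stream_sim *s)
{
    memset(s, 0, sizeof *s);
    for (size_t i = 0; i < NUM_URBS; i++) {
        s->urbs[i].ring_offset = i * URB_BYTES;
        s->urbs[i].pending = 1;
    }
}

/* Completion path: the URB's slice now holds data (simulated with a fill
   byte); count the completion and hand the same URB straight back down. */
static void stream_on_complete(struct iso_stream_sim *s, size_t idx,
                               unsigned char fill)
{
    struct iso_urb_sim *u = &s->urbs[idx];

    memset(&s->ring[u->ring_offset], fill, URB_BYTES);
    u->pending = 0;       /* URB is back with the function driver */
    s->completed++;
    u->pending = 1;       /* recycle: resubmit without reallocating */
}
```

In a real driver the completion path would be an I/O completion routine and the resubmission a call down the stack; the sketch only shows the bookkeeping.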

Currently this pretty much only works with the MS driver; none of the other
drivers I’ve tried would do high-speed isochronous transfers at all.

However, there are some issues even with the MS driver:

  • if I hand it more than 8 URBs at the same time, I get the 9th, 17th, … URB
    back RIGHT AWAY with a status code of 0, but no data transferred. CPU
    utilization goes through the roof, but other than that the other 8 URBs
    transfer data just fine.

  • if I do this on a CardBus USB 2 card with the NEC B1 EHCI chip, I get CPU
    utilization (for the isochronous transfer only) of about 15% on a 2.5 GHz P4.
    Not great, but okay. kernrate tells me that most of the CPU cycles are
    burnt in two functions in usbehci.sys: EHCI_GetPacketForFrame (about 60% of
    the total time spent in usbehci.sys) and EHCI_InternalPollHsIsoEndpoint
    (about 40%). Now if I use the exact same setup on the built-in Intel USB2
    chip, with the same MS driver, the same machine, the same USB2 device, and
    the same build of my driver, CPU utilization goes up to 100% (on a 2.5 GHz
    P4!), of which 80% is spent in the kernel, again in those two functions,
    with roughly the same split. That makes me wonder how these two chips can
    behave so differently, considering that both are EHCI-compliant. A driver
    issue?

All of the above applies to Windows XP as well.

Does anyone have any experience with high-speed isochronous streams? Have you
noticed these behaviors and found a fix for them? Or am I doing something
outrageously stupid that I’m being deservedly slapped for?

Any info is helpful.

Thanks,

Burk.

Burkhard Daniel
Software Technologies Group, Inc.
xxxxx@stg.com * http://www.stg.com
fon: +49-179-5319489 fax: +49-179-335319489

So, here’s some more info.

Using perftest, I found out that on the external, CardBus-connected (NEC) chip,
the interrupt load is about 700/sec. On the internal (Intel) chip, the load is
4500/sec or more. Most of the time was spent in DPC routines, the load of which
apparently depends linearly on the interrupt load (no surprise there).
I wondered why the interrupt load could be so different, and dug a little deeper.

So I looked at the USBCMD register in the EHCI chip and, what did I find: the
driver sets the interrupt threshold to one microframe!

Is there a good reason for this? I mean, we’re potentially looking at one
interrupt every 125 µs, or 8000 interrupts/second! Considering the amount of
DPC time each such interrupt seems to entail when isochronous transfers are
involved, that is a huge number.
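
To make the arithmetic concrete, here is a small sketch that decodes the threshold field and gives the resulting upper bound on the interrupt rate. The field position (USBCMD bits 23:16) and the set of legal values come from the EHCI specification; the function and macro names are just ones I picked for illustration.

```c
#include <stdint.h>

/* EHCI USBCMD, bits 23:16: Interrupt Threshold Control, in microframes. */
#define USBCMD_ITC_SHIFT 16
#define USBCMD_ITC_MASK  0xFFu

#define MICROFRAMES_PER_SECOND 8000u   /* one microframe = 125 us */

/* Extract the interrupt threshold (in microframes) from a raw USBCMD value. */
static unsigned itc_microframes(uint32_t usbcmd)
{
    return (unsigned)((usbcmd >> USBCMD_ITC_SHIFT) & USBCMD_ITC_MASK);
}

/* Upper bound on interrupts per second for a given threshold.
   Returns 0 for values the EHCI spec does not define. */
static unsigned max_interrupts_per_second(unsigned itc)
{
    switch (itc) {
    case 0x01: case 0x02: case 0x04: case 0x08:
    case 0x10: case 0x20: case 0x40:
        return MICROFRAMES_PER_SECOND / itc;
    default:
        return 0;
    }
}
```

With a threshold of 1 the bound is 8000/sec; with the spec's default of 8 microframes (1 ms) it would be 1000/sec. These are ceilings only: the controller interrupts only when there is something to report.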

So, I guess my question to the people at Microsoft is: is there any way to
reduce the interrupt load? Since my isochronous transfers are all served on a
per-URB basis, one interrupt per URB would be enough for me, which in my case
means one every 31 ms. At any rate, getting the Intel chip’s interrupt rate
down to that of the NEC chip would be fine with me, as well.
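
For scale, a quick back-of-the-envelope sketch of what one interrupt per URB would mean. The function names are made up for illustration, and the microframe count only reflects the time span of a URB, not how the driver actually lays out packets.

```c
/* Microframes spanned by a URB that covers urb_ms milliseconds
   (8 microframes of 125 us each per millisecond). */
static unsigned microframes_per_urb(unsigned urb_ms)
{
    return urb_ms * 8u;
}

/* Interrupt rate if the controller interrupted exactly once per URB.
   Returns 0.0 for a zero-length URB to avoid dividing by zero. */
static double interrupts_per_second_per_urb(unsigned urb_ms)
{
    return urb_ms ? 1000.0 / (double)urb_ms : 0.0;
}

/* How many times higher an observed rate is than the per-URB rate. */
static double reduction_factor(double observed_per_second, unsigned urb_ms)
{
    double per_urb = interrupts_per_second_per_urb(urb_ms);
    return per_urb > 0.0 ? observed_per_second / per_urb : 0.0;
}
```

For a 31 ms URB that is 248 microframes and roughly 32 interrupts/sec, so the 4500/sec measured on the Intel chip is about 140 times more than strictly needed.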

Is there ANYTHING I can do short of writing my own EHCI driver???

Thanks,

Burk.
