Is there a limit on the number of asynchronous URBs that can be queued?

Consider a function driver (FD) that sits on top of the USB driver stack (USBD). The driver uses USB bulk mode to asynchronously read large chunks of data (>16 MB) at prescribed time intervals. Once the transfer starts, this has to happen at very high speed (SuperSpeed on USB 3.0, > 350 MB/s) without data loss and with minimal CPU overhead. The buffers are pre-allocated and locked in the driver, so there is virtually no overhead here.

The idea is to queue multiple URBs (around 64 KB each) to avoid the start-completion overhead specific to back-to-back transfers (i.e. when one URB is completed the next one is prepared and started).
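
To make the idea concrete, here is a rough KMDF sketch of what I have in mind (just a sketch: the pipe and the pre-allocated request/memory arrays are assumed to be set up elsewhere, and QueueInitialReads, EvtReadComplete and NUM_PENDING_READS are my own illustrative names, not WDF definitions):

```
#include <ntddk.h>
#include <wdf.h>
#include <wdfusb.h>

#define NUM_PENDING_READS  8            // illustrative queue depth, not a WDF limit
#define READ_BUFFER_SIZE   (64 * 1024)  // size used when the WDFMEMORY buffers were created

EVT_WDF_REQUEST_COMPLETION_ROUTINE EvtReadComplete;   // reformats and resends the request

NTSTATUS
QueueInitialReads(
    _In_ WDFUSBPIPE Pipe,
    _In_reads_(NUM_PENDING_READS) WDFREQUEST Requests[],
    _In_reads_(NUM_PENDING_READS) WDFMEMORY  Buffers[]   // pre-allocated, locked read buffers
    )
{
    NTSTATUS status = STATUS_SUCCESS;

    for (ULONG i = 0; i < NUM_PENDING_READS; i++) {
        // Attach the i-th pre-allocated buffer to the i-th pre-allocated request.
        status = WdfUsbTargetPipeFormatRequestForRead(Pipe, Requests[i],
                                                      Buffers[i], NULL);
        if (!NT_SUCCESS(status)) {
            break;
        }

        WdfRequestSetCompletionRoutine(Requests[i], EvtReadComplete, Pipe);

        // Send asynchronously; all NUM_PENDING_READS reads end up pending on
        // the bulk IN pipe at the same time, so the host controller always
        // has the next transfer ready when the current one completes.
        if (WdfRequestSend(Requests[i], WdfUsbTargetPipeGetIoTarget(Pipe),
                           WDF_NO_SEND_OPTIONS) == FALSE) {
            status = WdfRequestGetStatus(Requests[i]);
            break;
        }
    }
    return status;
}
```

EvtReadComplete would hand the filled buffer to the consumer and immediately re-format and re-send the same request, so the queue depth stays constant.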

Questions:

  1. Is there a maximum number of URBs that can be submitted/queued to an endpoint at any given time? (e.g. submitting/queuing 1024 URBs for 64 KB data blocks to read from an endpoint)

  2. Is there a performance degradation due to the large number of URBs waiting to be completed by USBD?

  3. What would be the optimal number of queued URBs that could guarantee no data loss over a very fast (SuperSpeed) pipe?

  4. What else could go wrong when using a large number of URBs to accomplish the transfer?

Thank you,

Stefan

xxxxx@shaw.ca wrote:

Consider a function driver (FD) that sits on top of the USB driver stack (USBD). The driver uses USB bulk mode to asynchronously read large chunks of data (>16 MB) at prescribed time intervals. Once the transfer starts, this has to happen at very high speed (SuperSpeed on USB 3.0, > 350 MB/s) without data loss and with minimal CPU overhead. The buffers are pre-allocated and locked in the driver, so there is virtually no overhead here.

The idea is to queue multiple URBs (around 64 KB each) to avoid the start-completion overhead specific to back-to-back transfers (i.e. when one URB is completed the next one is prepared and started).

Yes, that’s required to get reasonable performance.

Questions:

  1. Is there a maximum number of URBs that can be submitted/queued to an endpoint at any given time? (e.g. submitting/queuing 1024 URBs for 64 KB data blocks to read from an endpoint)

There’s probably a limit, but it is large.

  2. Is there a performance degradation due to the large number of URBs waiting to be completed by USBD?

Yes. Remember that you don’t get notified as soon as a request ends.
The host controller only fires an interrupt at frame or microframe
boundaries. So, you might as well make your URBs very large – a couple
of megabytes is reasonable.
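
For a rough sense of scale (back-of-the-envelope only, using the ~350 MB/s figure from the original post):

```
/* How often completions arrive at ~350 MB/s for 64 KB vs. 2 MB transfers.
   Just a sanity check on the numbers in this thread, not measured data. */
#include <stdio.h>

int main(void)
{
    const double bytes_per_sec = 350.0 * 1024 * 1024;               /* ~350 MB/s sustained */
    const double urb_sizes[]   = { 64.0 * 1024, 2.0 * 1024 * 1024 };

    for (int i = 0; i < 2; i++) {
        double interval_ms = urb_sizes[i] / bytes_per_sec * 1000.0;
        printf("%10.0f-byte URB: one completion every %.2f ms (about %.0f per second)\n",
               urb_sizes[i], interval_ms, 1000.0 / interval_ms);
    }
    return 0;
    /* Roughly: 64 KB -> ~0.18 ms (~5600/s); 2 MB -> ~5.7 ms (~175/s). */
}
```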

  3. What would be the optimal number of queued URBs that could guarantee no data loss over a very fast (SuperSpeed) pipe?

I don’t think there is enough real-world experience to know this yet. I
would certainly want to be able to survive small integer milliseconds.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thank you, Tim. I appreciate your help. I’m in a phase of a project where some architectural decisions will need to be made and I really want to get everything right (before it’s too late).

Now I’m going to abuse your willingness to keep this discussion going :-)

(using the question numbers from my previous message)

  1. A developer who tried to push the limits found that about 2500 URBs could be queued on Win XP (x86). Yes, the limit is large, and it is probably larger on Win 7 (x64). For my practical purposes this is more than enough.

  2. You are right: I plan to use fairly large URBs. Actually, I have another related question that I posted in another thread regarding the maximum URB size in SuperSpeed mode (for bulk transfers). In my tests with different USB 3.0 host controllers (NEC, ASMedia and TI) I found that the URB can be as large as you want (no restriction except 32-bit representation of the size value). I didn’t notice a serious overhead penalty for URBs as large as 1 GB!

Should I assume that URBs won’t be limited in size for SuperSpeed? Microsoft engineers didn’t respond to this question in my other thread. I would expect the limit will be increased or eliminated for SuperSpeed but I might be wrong as well.

  3. If I look at the continuous reader configuration (WdfUsbTargetPipeConfigContinuousReader and its WDF_USB_CONTINUOUS_READER_CONFIG structure), the number of pending requests is not large (3…4 are mentioned in the documentation); Microsoft claims that increasing this number does not improve data throughput, but may reduce the latency or the chance of missing data on a high-frequency pipe. I agree. However, NumPendingReads is an 8-bit value, which suggests that higher values are not recommended. (See the sketch after question 4 below for how I picture the configuration.)

4 (new). The continuous reader documentation raises another question: how do you track (and order) the received data buffers if they are not delivered in the exact order in which the device sends the data? I have no access to the device to change the way it streams data. The device can only send large blocks of unformatted data that are acquired internally without host control. As a consequence, there are no block headers with sequence numbers to help me rebuild the stream correctly (i.e. re-establish the timeline). Any ideas?
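
To make questions 3 and 4 concrete, here is a rough sketch of how I picture the continuous reader setup, with a sequence number stamped onto each buffer as it completes (a sketch only: it assumes the PIPE_CONTEXT was attached to the pipe object elsewhere, e.g. with WdfObjectAllocateContext, and PIPE_CONTEXT, ConfigureReader and EvtPipeReadComplete are my own names, not WDF definitions):

```
#include <ntddk.h>
#include <wdf.h>
#include <wdfusb.h>

typedef struct _PIPE_CONTEXT {
    LONG64 NextSequenceNumber;      // stamps buffers in completion order
} PIPE_CONTEXT, *PPIPE_CONTEXT;
WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(PIPE_CONTEXT, GetPipeContext);

EVT_WDF_USB_READER_COMPLETION_ROUTINE EvtPipeReadComplete;

NTSTATUS
ConfigureReader(
    _In_ WDFUSBPIPE Pipe
    )
{
    WDF_USB_CONTINUOUS_READER_CONFIG config;
    NTSTATUS status;

    // One reader "transfer" per 4 MB block; the framework keeps
    // NumPendingReads of them outstanding on the pipe at all times.
    WDF_USB_CONTINUOUS_READER_CONFIG_INIT(&config,
                                          EvtPipeReadComplete,
                                          GetPipeContext(Pipe),
                                          4 * 1024 * 1024);
    config.NumPendingReads = 4;     // the 3...4 range mentioned in the docs

    status = WdfUsbTargetPipeConfigContinuousReader(Pipe, &config);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    // Reads actually start when the pipe's I/O target is started.
    return WdfIoTargetStart(WdfUsbTargetPipeGetIoTarget(Pipe));
}

VOID
EvtPipeReadComplete(
    _In_ WDFUSBPIPE Pipe,
    _In_ WDFMEMORY Buffer,
    _In_ size_t NumBytesTransferred,
    _In_ WDFCONTEXT Context
    )
{
    PPIPE_CONTEXT ctx = (PPIPE_CONTEXT)Context;
    LONG64 sequence;

    UNREFERENCED_PARAMETER(Pipe);
    UNREFERENCED_PARAMETER(Buffer);   // framework-owned; copy or reference it before returning

    // Per Tim's note, completions arrive in completion order, so a running
    // counter is enough to re-establish the timeline; the interlocked
    // increment is just belt-and-braces in case callbacks ever overlap.
    sequence = InterlockedIncrement64(&ctx->NextSequenceNumber) - 1;

    KdPrint(("Bulk read #%I64d completed, %Iu bytes\n", sequence, NumBytesTransferred));
}
```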

Thanks again,

Stefan

xxxxx@shaw.ca wrote:

  2. You are right: I plan to use fairly large URBs. Actually, I have another related question that I posted in another thread regarding the maximum URB size in SuperSpeed mode (for bulk transfers). In my tests with different USB 3.0 host controllers (NEC, ASMedia and TI) I found that the URB can be as large as you want (no restriction except 32-bit representation of the size value). I didn’t notice a serious overhead penalty for URBs as large as 1 GB!

Should I assume that URBs won’t be limited in size for SuperSpeed? Microsoft engineers didn’t respond to this question in my other thread. I would expect the limit will be increased or eliminated for SuperSpeed but I might be wrong as well.

The limit for USB 2 is multiple megabytes. I assume the limit will be
at least that large in the USB 3 stack. By the way, this is one area
where we are fortunate not to be working on Linux. The single URB limit
in Linux is a few dozen kilobytes.

4 (new). The continuous reader documentation raises another question: how do you track (and order) the received data buffers if they are not delivered in the exact order in which the device sends the data?

URB completions happen at a raised IRQL. They will arrive at the driver
in the order they were completed. If you are using large enough
buffers, you won’t get multiple completions in a single microframe anyway
– there will be time between them.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

On the Microsoft USB 2.0 stack with a device downstream of an EHCI host controller, things will work best if individual URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER requests are no larger than 512KB (0x80000) each. Things will also work best if you only queue somewhere around 8 or fewer requests per endpoint at a time. The USBEHCI.SYS driver will only program at most two URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER requests per endpoint into the host controller transfer schedule (i.e. into TDs) at a time. URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER requests larger than 512KB on an EHCI host controller will start incurring additional overhead. While there is no fixed upper limit on the number of simultaneously outstanding transfer requests, the USBPORT.SYS driver is not optimized for a large number of them; if the number gets large, it can have a noticeable negative impact on system DPC utilization.

USBPORT.SYS is really optimized for a small number of larger transfers, up to the 512KB threshold on EHCI host controllers, and not optimized for a large number of small transfers.
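
To put that in code form, here is a minimal WDM-level sketch of carving a large bulk read into 512KB URBs (a sketch only: IRP allocation and submission to the USB stack are omitted, and MAX_EHCI_URB_SIZE and BuildBulkReadUrbs are illustrative names, not WDK definitions):

```
#include <ntddk.h>
#include <usb.h>
#include <usbdlib.h>

#define MAX_EHCI_URB_SIZE (512 * 1024)   // 0x80000, the threshold described above

NTSTATUS
BuildBulkReadUrbs(
    _In_ USBD_PIPE_HANDLE PipeHandle,
    _In_ PUCHAR TransferBuffer,          // pre-allocated, locked buffer
    _In_ ULONG TotalLength,              // e.g. 16 MB
    _Out_ PURB **UrbArray,               // caller frees the URBs and the array
    _Out_ PULONG UrbCount
    )
{
    ULONG count = (TotalLength + MAX_EHCI_URB_SIZE - 1) / MAX_EHCI_URB_SIZE;
    PURB *urbs = ExAllocatePoolWithTag(NonPagedPool, count * sizeof(PURB), 'bruS');
    ULONG i;

    if (urbs == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    for (i = 0; i < count; i++) {
        ULONG offset = i * MAX_EHCI_URB_SIZE;
        ULONG length = TotalLength - offset;
        if (length > MAX_EHCI_URB_SIZE) {
            length = MAX_EHCI_URB_SIZE;
        }

        urbs[i] = ExAllocatePoolWithTag(NonPagedPool,
                      sizeof(struct _URB_BULK_OR_INTERRUPT_TRANSFER), 'bruS');
        if (urbs[i] == NULL) {
            while (i-- > 0) {
                ExFreePoolWithTag(urbs[i], 'bruS');
            }
            ExFreePoolWithTag(urbs, 'bruS');
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        // Each URB covers at most 512KB of the locked buffer.
        UsbBuildInterruptOrBulkTransferRequest(
            urbs[i],
            sizeof(struct _URB_BULK_OR_INTERRUPT_TRANSFER),
            PipeHandle,
            TransferBuffer + offset,
            NULL,                                     // no MDL in this sketch
            length,
            USBD_TRANSFER_DIRECTION_IN | USBD_SHORT_TRANSFER_OK,
            NULL);
    }

    *UrbArray = urbs;
    *UrbCount = count;
    return STATUS_SUCCESS;
}
```

Eight or so of these queued per endpoint keeps USBEHCI.SYS fed (it programs two at a time) without piling work onto USBPORT.SYS.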

-Glen

Excellent clarification Glen. Thank you.

If you look at my posts again, you may see what I’m really interested in: SuperSpeed mode, i.e. USB 3.0. My current project aims for the best performance I can achieve on USB 3.0 bulk pipes while remaining compatible with Vista/Win 7 (Win 8 will eventually be included later):

  • transfer speeds of 350 MB/s and higher (close to 400 MB/s if possible);
  • minimum latency;
  • very low overhead (i.e. < 5% CPU load);
  • stable performance even on a loaded system.

Currently, the xHCI drivers provided by the chip manufacturers allow URBs of any size without apparent penalty. I assume that Microsoft will eventually follow the same path, maybe optimizing the performance for URBs no larger than 4…16 MB. But I expect no actual limitation in size.

As for USBPORT.SYS, I expected that the performance would be good for a small number of large transfers (see Tim’s answer as well). What about a high number (>100) of large transfers? In that case, the system DPC utilization shouldn’t be a serious problem, since each transfer completes only every 10…20 ms (considering 4…8 MB URBs). Is that correct?
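
To show my arithmetic (rough figures only, using the target numbers above):

```
/* Rough check of the 10...20 ms figure, and of how many 4...8 MB URBs must be
   queued to ride out a completion-latency hiccup. Targets, not measurements. */
#include <stdio.h>

int main(void)
{
    const double rate_mb_s = 400.0;          /* target throughput               */
    const double urb_mb[]  = { 4.0, 8.0 };   /* URB sizes under discussion      */
    const double stall_ms  = 5.0;            /* DPC/scheduling stall to survive */

    for (int i = 0; i < 2; i++) {
        double interval_ms = urb_mb[i] / rate_mb_s * 1000.0;  /* time to fill one URB */
        double extra_urbs  = stall_ms / interval_ms;          /* queued URBs needed to cover the stall */
        printf("%.0f MB URB: completes every %.0f ms; ~%.2f extra queued URBs cover a %.0f ms stall\n",
               urb_mb[i], interval_ms, extra_urbs, stall_ms);
    }
    return 0;
    /* 4 MB -> ~10 ms, 8 MB -> ~20 ms; a handful of queued URBs is plenty,
       nowhere near the >100 case discussed above. */
}
```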

Thank you,

Stefan

I’ll see if I can get some confirmation on any URB_FUNCTION_BULK_OR_INTERRUPT_TRANSFER TransferBufferLength limitations in the Microsoft USB 3.0 stack for Windows 8.

In the Microsoft USB 2.0 stack you really would not want to queue a large number of transfers, such as >100 in your example. Regardless of how often transfers actually complete on the bus, there is overhead in the USBPORT.SYS driver that grows with the number of queued transfers. There is a single global list per host controller on which all active transfers across all device endpoints are queued, and there are multiple DPC worker routines that walk all transfers queued on that single global list under the protection of a single global list spinlock. It works reasonably well in typical usage scenarios but does not scale well with a large number of transfers. Ideally this implementation would be improved, but the risk/reward tradeoff of churning this code is not in favor of doing so.

-Glen

Thanks again, Glen. Really useful. I’ll definitely take your advice into account in my design.

Stefan