Windows 7 vs. Windows 10 USB Performance

Hi everyone,

I am working on an application that captures images through a USB driver and I experience a 10-15% performance decrease running on Windows 10 versus what I’m seeing on Windows 7.

The application is made up of:

  • a USB camera device (which is Fx2 based)
  • either the open-source libusb0 driver or Windows' own WinUSB driver
  • a Windows application made up of a few DLLs

Unfortunately, the camera firmware is limited to bulk transfer mode (yes, I know it should be synchronous, but that won’t be possible any time soon).

  • Q1: Is there a fundamental difference in how USB transfers are processed or scheduled between Windows 7 and Windows 10? I must admit this is a slightly subjective observation, since I haven’t yet thoroughly instrumented and measured the actual difference, but I do see one.
  • Q2: Is there any reason why Windows 10 feels slightly slower? It would be nice to back up what I am experiencing with facts about Windows 10’s internals.
  • Q3: Is there any possible remedy or tweak that can be applied to Windows 10 so performance could be improved?

I experience a 10-15% performance decrease running on Windows 10 versus what I’m seeing on Windows 7

Same exact (host) hardware being used on Win7 and Win10? I mean exactly the same??

There are almost certainly many differences between the USB Host Controller drivers on Win7 and on Win10. 10% perf difference on a USB link isn’t something that I, personally, would spend any time at all thinking about. IMHO that’s a noise-level difference for USB.

Mr. Roberts, who comments here regularly, seems to do more USB than I do… so he may have additional insights to lend us here.

Peter

The performance difference matters a great deal for this application, as it’s already stretching what can reasonably be done in bulk mode for something that streams images and doesn’t have much buffering.

  • In any case we do have to cope with missing frames (very tiny buffering on the device itself). The real question is: how much loss is too much?
  • This is using a USB 3 port that is not shared with anything else.
  • Although still far from the theoretical limit, we get around 40 MB/s (USB 3) with acceptable losses on Windows 7.
  • On Windows 10, losses are already beyond acceptable at just shy of 35 MB/s.

I did set up a dual-boot machine with both OSes and was experiencing some performance differences, but cannot give an actual % value.

It is worth noting that we did have some other strange issues with earlier builds of Windows 10 (prior to build 18xxxx); then, magically, with build 18362 we started to get something we could work with. So I can say it generally works now, but I’m not seeing the same performance level found on Windows 7.

Your original comment did not define what you meant by “performance decrease,” although I see you have done so in your followup.

It’s very difficult to compare USB 3 in Windows 7 and 10. Remember, Microsoft did not support USB 3 until Windows 8, so if you are using USB 3 in Windows 7, you are not using any Microsoft components at all. You are using a driver stack supplied by the host controller manufacturer, and some of those driver stacks did not follow the rules.

Bulk performance has always been extremely variable. The maximum throughput you get depends strongly on the quality of the host controller and hubs. Have you tried this on several different computers with different chipsets? I’ll wager you get very different results. In our experience, the best host controllers could sustain 45 MB/s, but some models could never break 30 or 35 MB/s. There are simply no guarantees. And of course, all someone has to do is plug in a USB memory stick, and you’re totally hosed. Are you using a hub? Is it USB 2 or 3?

How large are your buffers? Using small buffers increases the overhead of kernel/user transitions. Your buffers should be a megabyte or more.
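To put a rough number on that overhead (my own back-of-the-envelope arithmetic, not from the thread): the rate of read completions, and hence kernel/user round trips, scales inversely with buffer size.

```c
/* Read completions (kernel/user round trips) per second needed to
 * sustain a given throughput with a given read-buffer size. */
static unsigned long completions_per_sec(unsigned long bytes_per_sec,
                                         unsigned long buffer_bytes)
{
    return bytes_per_sec / buffer_bytes;
}

/* At 40 MB/s:
 *   64 KB buffers -> 640 completions per second
 *   1 MB buffers  ->  40 completions per second */
```

So going from 64 KB to 1 MB buffers cuts the transition rate by a factor of 16, which is the kind of saving Tim is describing.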

The short answer is going to be unsatisfying. You can be sure no one in the USB team made any changes specifically designed to reduce USB throughput.

BTW, your original post said “limited to bulk transfer mode (yes, I know it should be synchronous…)”. I assume you meant isochronous. Isochronous pipes are vastly more predictable, but they are capped at 24MB/s, so it’s certainly not going to help you.

Hey @Tim_Roberts,

  • Indeed, I did mean isochronous :slight_smile:
  • No, there is no hub on the port the camera is plugged into. That controller handles only the camera; we have a dedicated port for the camera. Other USB devices (keyboard and mouse) use another, cheaper controller.
  • I do believe the hardware is USB 3; maybe it runs as USB 2 on Windows 7 (you got me on that one).

Your comment is actually useful, in the sense that it might save us from wasting further work on updating the firmware to use isochronous mode :slight_smile:

Could you point out documentation about that maximum bandwidth of 24 MB/s?

The camera device buffer is tiny, nothing in the range of MB :frowning: It then needs to be serviced at a very high rate.
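For a sense of scale (my own arithmetic, assuming the 4 KB on-device buffer mentioned later in the thread): at these data rates the host only has on the order of 100 microseconds to drain the device buffer before it is overwritten.

```c
/* Time (in microseconds) before an on-device buffer of buffer_bytes
 * overflows at a sustained fill rate of bytes_per_sec. */
static double service_window_us(double buffer_bytes, double bytes_per_sec)
{
    return buffer_bytes * 1e6 / bytes_per_sec;
}

/* A 4 KB buffer at 40 MB/s gives a window of roughly 98 us, so the
 * host must keep a bulk IN transfer pending essentially continuously. */
```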

Because we are willing to accept some packet or frame losses, is there any way to disable handshaking and simply get as much data as possible, disabling any lower-level error checking there might be? I admit that when copying storage-device data you do want the data checked and guaranteed; in our case it is not that big of a deal if we lose some data from time to time.

For example, my initial implementation of the client code using WinUSB was really not on par with the libusb0 driver. That was until I found out about the WinUSB pipe policies. When I disabled ALLOW_PARTIAL_READS and enabled RAW_IO (the opposite of the defaults), I gained an extra 5-8 MB/s from the same exact hardware (granted, with occasional but acceptable data losses).
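For anyone landing on this thread later, the policy change described above looks roughly like this (a sketch based on the WinUSB documentation, not the thread's actual code; `hUsb` and `pipeId` are placeholders). Note that with RAW_IO enabled, every buffer passed to `WinUsb_ReadPipe` must be a multiple of the pipe's MaximumPacketSize and no larger than the MAXIMUM_TRANSFER_SIZE policy value.

```c
#include <windows.h>
#include <winusb.h>

/* Sketch: turn off ALLOW_PARTIAL_READS and turn on RAW_IO on a bulk
 * IN pipe. hUsb and pipeId are placeholders for the caller's WinUSB
 * interface handle and endpoint address. */
static BOOL enable_raw_io(WINUSB_INTERFACE_HANDLE hUsb, UCHAR pipeId)
{
    BOOL off = FALSE, on = TRUE;

    if (!WinUsb_SetPipePolicy(hUsb, pipeId, ALLOW_PARTIAL_READS,
                              sizeof(off), &off))
        return FALSE;
    return WinUsb_SetPipePolicy(hUsb, pipeId, RAW_IO,
                                sizeof(on), &on);
}
```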

Could you point out documentation about the maximum bandwidth of 24MB/s?

It’s easy to figure out yourself. A high-speed isochronous endpoint can have a max packet size of 1024 bytes. An endpoint gets at most three transactions per microframe, with 8 microframes per millisecond. 3 x 1024 x 8 x 1000 = 24,576,000 bytes per second.
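Tim's arithmetic above, written out as a one-liner for anyone who wants to reuse it:

```c
/* USB 2.0 high-speed isochronous ceiling for one endpoint:
 * 1024-byte max packet size, up to 3 transactions per microframe,
 * 8 microframes per millisecond, 1000 ms per second. */
static unsigned long iso_hs_max_bytes_per_sec(void)
{
    return 3UL * 1024 * 8 * 1000;   /* 24,576,000 B/s, i.e. ~24 MB/s */
}
```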

The camera device buffer is tiny, nothing in the range of MB

What I meant is to pass large buffers to your WinUSB bulk read. The host controller will chop that up into packets, but the device can keep filling one transfer until the buffer fills. As I said, that saves kernel/user transitions. I’ve done an FX2-based camera with no buffering in the camera; all we had was the 4k buffering in the FX2. I provided both bulk and isochronous endpoints in the FX2 firmware (that’s trivially easy) so we could try both. Isochronous was reliable, if the bandwidth fit. Bulk was always a gamble if the pixel rate was > 30 MB/s.

Is there any way to disable handshaking and simply get as much data as possible, disabling any lower-level error checking … ?

I can’t guess what you think is going on, but the way USB works is all in the spec. A bulk endpoint does get error checking and automatic retries, and that can totally hose your throughput, but that only happens with cheap cables or bad connections. I have seen a USB connector that was slightly too large, and that did cause retries. Isochronous does not do retries; if there’s bad data, the packet is lost. It was specifically designed for video streaming, where you can’t afford to do retries. But, you have to be able to live within the bandwidth.

What I meant is to pass large buffers to your WinUSB bulk read. The host controller will chop that up into packets, but the device can keep filling one transfer until the buffer fills. As I said, that saves kernel/user transitions. I’ve done an FX2-based camera with no buffering in the camera; all we had was the 4k buffering in the FX2. I provided both bulk and isochronous endpoints in the FX2 firmware (that’s trivially easy) so we could try both. Isochronous was reliable, if the bandwidth fit. Bulk was always a gamble if the pixel rate was > 30 MB/s.

Wow, it almost feels like you worked on the same base design I’m working with. It is an FX2-based device with a tiny 4K buffer :slight_smile:
I’m setting up 16 transactions/transfers of 128KB each. Same here: around 25-28 MB/s everything seems acceptable on either Windows 7 or 10; we start losing packets/frames (most likely due to the tiny 4K buffer being overwritten) around 30 MB/s on Windows 10 and maybe around 33-35 MB/s on Windows 7, hence my original question.

How are you measuring performance? Have you captured a usb ETW log on both 7 and 10 to see what the on the wire schedule looks like?

Have you looked at your application which is consuming the data to see if it is behaving differently? Perhaps the application is not sending reads fast enough on win10.

(posted for Doron by the Community Mods)

I’m setting up 16 transactions/transfers of 128KB.

That’s not valid. A high-speed bulk endpoint must have a maximum packet size of 512, and all of your transfers must be a multiple of that size. I would think, for example, that you’d set up 2 or 3 transfers of 64k bytes.

@Doron_Holan said:
How are you measuring performance? Have you captured a usb ETW log on both 7 and 10 to see what the on the wire schedule looks like?

Have you looked at your application which is consuming the data to see if it is behaving differently? Perhaps the application is not sending reads fast enough on win10.

(posted for Doron by the Community Mods)

  • I’m using the application’s own statistics for both throughput and frame rate. I agree that this cannot measure the actual raw USB throughput, and that those statistics are limited by the quality and performance of the application, but the application is needed to decode the frames and get the effective good-frame rate.
  • In any case, I’ve been working on challenging the claims about the lower apparent performance on Windows 10. It’s a bit of logistical work, especially now that my dual-boot test system got updated to build 1903 (I forgot to disable auto-update on this test system). The claims were about the previous 18xx build.
  • I do have access to a Beagle v2 5000 analyzer (which I am not totally familiar with); is it possible to measure throughput with it? I didn’t see anything about that in the main software.

@Tim_Roberts
Why is 128KB not a valid transfer size? The original code uses 16 transfers of 128KB. I was told that these values (16 and 128KB) were the ones offering the best performance. I haven’t had time to challenge that myself.

(slaps my forehead) You wrote 128KB and I read “128B”. You are correct.
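To close the loop on the sizing question (my own sanity check, assuming the usual 512-byte max packet size for a high-speed bulk endpoint): 128 KB is a whole multiple of 512, so the transfer sizes discussed above are indeed well-formed.

```c
/* A high-speed bulk transfer size is well-formed (e.g. for RAW_IO
 * submission) when it is a non-zero multiple of the 512-byte
 * maximum packet size. */
static int valid_hs_bulk_transfer(unsigned long bytes)
{
    return bytes != 0 && bytes % 512 == 0;
}

/* 128 KB = 131072 = 256 x 512, so 16 transfers of 128 KB are fine. */
```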