Question on submitting URBs

Sorry for my bad English

I am writing a driver for a USB 2.0 device. The device is an analog-to-digital converter that produces a byte stream at 8 megabytes per second. The device answers each IN request with a data packet whose size varies (for example 442, 512, 511, 502 bytes). The driver does not know the size of the data that will be returned in the next transaction.

When I submit URBs of size 8192, the CPU goes to 100% and the driver cannot keep up with the incoming bytes from the USB device. Data is lost.

  1. When we run the device in a configuration with a fixed packet size of 511 bytes, I see the same situation: data is lost and the system is fully loaded.

  2. When we run the device in another configuration with a fixed packet size of 512 bytes, everything is fine: the system load is about 3% and no data is lost. (We detect data loss with a lamp on the USB device that turns on when the device’s internal buffers are full.)

It is important for us to support the configuration with variable packet sizes.

I found a thread with the same problem:
http://www.osronline.com/showThread.cfm?link=150397
but I do not understand from that thread how to solve it.

How can I solve this problem?

I would be very grateful for information, books, or articles about the host controller and URB submission (in particular, how a URB is split into transactions).

Best Regards
Kirill Bagrinovsky

I use a BULK endpoint for this transfer.

xxxxx@spiritdsp.com wrote:
> I am writing a driver for a USB 2.0 device. The device is an analog-to-digital
> converter that produces a byte stream at 8 megabytes per second. The device
> answers each IN request with a data packet whose size varies (for example
> 442, 512, 511, 502 bytes). The driver does not know the size of the data that
> will be returned in the next transaction.
> I found a thread with the same problem:
> http://www.osronline.com/showThread.cfm?link=150397
> but I do not understand from that thread how to solve it.

The upshot of Tim Roberts’s response in that thread is that you’ll do better
to ask for exactly 512 bytes in every URB, but you’ll need to know how the
host controller driver schedules operations into microframes to predict
performance.

When you ask for 8192 bytes, the host schedules all the IN transactions
needed to read that much, 512 bytes at a time (assuming your endpoint packet
size is 512). As soon as your device sends less than 512 bytes, that
terminates the 8192-byte transfer. Someone at Microsoft will have to fill in
the gap here to tell us when the host controller will start the next URB’s
worth of IN’s.

While waiting for an exact answer, though, you may be able to run your own
experiment by just changing your URB request size to 512 bytes.

Walter Oney
Consulting and Training
www.oneysoft.com

> While waiting for an exact answer, though, you may be able to run your own
> experiment by just changing your URB request size to 512 bytes.

Thanks, Walter.
I set the URB size to 512 bytes.

A few words about the application and driver: the application uses ReadFile with overlapped I/O to “create” IRPs for my driver. [One ReadFile call with a 512-byte buffer] = [one IRP] = [one URB]. I keep a queue of 4096 IRPs pending in my driver. When GetOverlappedResult reports a completion, I send the next IRP by calling ReadFile again. A minimal sketch of this pattern is below.
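Here is a minimal sketch of that pattern (user mode, Win32). The device path, queue depth, and buffer size are placeholders, not the real names from my project:

#include <windows.h>

#define QUEUE_DEPTH 64      /* placeholder; the thread discusses 16..4096 */
#define READ_SIZE   512

typedef struct _READ_SLOT {
    OVERLAPPED ov;
    BYTE       buffer[READ_SIZE];
} READ_SLOT;

static READ_SLOT slots[QUEUE_DEPTH];

int main(void)
{
    /* Hypothetical device interface name. */
    HANDLE h = CreateFileW(L"\\\\.\\MyAdcDevice0", GENERIC_READ, 0, NULL,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    /* Prime the queue: one pending overlapped ReadFile per slot.  Each call
       normally returns FALSE with GetLastError() == ERROR_IO_PENDING. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        slots[i].ov.hEvent = CreateEventW(NULL, FALSE, FALSE, NULL);
        ReadFile(h, slots[i].buffer, READ_SIZE, NULL, &slots[i].ov);
    }

    /* Service completions round-robin and resubmit immediately, so the
       driver always has IRPs pending. */
    for (int i = 0; ; i = (i + 1) % QUEUE_DEPTH) {
        DWORD got = 0;
        if (GetOverlappedResult(h, &slots[i].ov, &got, TRUE)) {
            /* ... consume got bytes from slots[i].buffer ... */
            ReadFile(h, slots[i].buffer, READ_SIZE, NULL, &slots[i].ov);
        }
    }
}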

My experiment results (when the device answers each IN request with a 511-byte packet):

| Queue size | URB size (bytes) | CPU load |
| 4096       | 512              | 80%      |
| 2048       | 1024             | 85%      |
| 1024       | 2048             | 83%      |
| 512        | 4096             | 82%      |
| 256        | 8192             | 75%      |

My other experiment results (when the device answers each IN request with a 512-byte packet):

| Queue size | URB size (bytes) | CPU load |
| 4096       | 512              | 86%      |
| 2048       | 1024             | 60%      |
| 1024       | 2048             | 45%      |
| 512        | 4096             | 25%      |
| 256        | 8192             | 13%      |

When I stopped the other threads in my application, I stopped losing bytes. But every refresh of Internet Explorer leads to data loss.

Could the high CPU load be the result of 16384 ReadFile calls per second?

8 * 1024 * 1024 [bytes/second] / 512 [bytes per URB] = 16384 [ReadFile calls/second]

If that is the problem, I can call ReadFile with an 8192-byte buffer and then split that request into 16 x 512-byte URBs in my driver.

Or will I face the same problem when sending small URBs to the host controller driver?

>I keep queue of 4096 IRP to my driver.

Why such a huge queue? Why not, say, 16 IRPs or so?

The HC driver will have a really busy time managing the schedules for those 4K IRPs.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

xxxxx@spiritdsp.com wrote:

>> While waiting for an exact answer, though, you may be able to run your own
>> experiment by just changing your URB request size to 512 bytes.

> Thanks, Walter.
> I set the URB size to 512 bytes.
>
> A few words about the application and driver: the application uses ReadFile with overlapped I/O to “create” IRPs for my driver. [One ReadFile call with a 512-byte buffer] = [one IRP] = [one URB]. I keep a queue of 4096 IRPs pending in my driver.

That’s a ridiculously large number. If you can’t keep up using a queue
of 32 IRPs, then no number will ever be large enough, and you need to
change the design.

> When I stopped the other threads in my application, I stopped losing bytes. But every refresh of Internet Explorer leads to data loss.

> Could the high CPU load be the result of 16384 ReadFile calls per second?

Well, of course there is. Take a few moments to think about what this
involves. Your request has to pass down through the user-mode ReadFile
API processing, through validation, switch into kernel mode, go through
additional validation, get converted to an IRP, pass into the top of the
USB stack for your device, eventually find its way to your driver, where
you lock the buffers, create an URB, send it down to the host
controller, which then creates a DMA request and adds it to the
scheduling list for the next available microframe. Then, when the
microframe is finished, the reverse happens: the request gets marked
complete, passed back up to your driver, which does more processing on
it, then percolates back through the I/O system, crosses back into
user-mode, and returns back to your thread. You are hoping to do all of
that in less than 60 microseconds.

The ONLY way to sustain high bandwidth USB performance is to use URBs
with large buffers, and that REQUIRES that your device ship complete
512-byte packets. Your best solution, by far, is to have your hardware
pad the packets to 512 bytes. You can include a “length” field in the
data, if you need to.
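For example, a hypothetical framing for the padded packets might look like the sketch below; the struct and field names are illustrative only, not part of any existing protocol:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PACKET_SIZE 512

#pragma pack(push, 1)
typedef struct _ADC_PACKET {
    uint16_t valid_length;              /* number of real ADC bytes, 0..510 */
    uint8_t  payload[PACKET_SIZE - 2];  /* data, then don't-care padding */
} ADC_PACKET;
#pragma pack(pop)

/* Host side: every packet arrives as a full 512 bytes, so the bulk transfer
   never terminates early; only the valid bytes are copied out. */
static size_t unpack(const ADC_PACKET *p, uint8_t *dst)
{
    size_t n = p->valid_length;
    if (n > sizeof(p->payload))
        n = sizeof(p->payload);
    memcpy(dst, p->payload, n);
    return n;
}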

> If that is the problem, I can call ReadFile with an 8192-byte buffer and then split that request into 16 x 512-byte URBs in my driver.

The problem is only slightly reduced. You still have an enormous amount
of overhead between you and the host controller.

Another alternative, if you really really cannot pad the packets, is to
switch to an isochronous pipe instead of a bulk pipe. With isochronous
data, each packet stands alone. You don’t have this “short packet”
issue. You can send down a single request of 32 packets of 512 bytes
each, and what you get back contains the results from the next 32
intervals. If the device sent 419 bytes, you get 419 bytes in the
packet. If the device skipped an interval, you get back 0 bytes.

Perhaps this fits your model a bit better.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Why such a huge queue? Why not, say, 16 IRPs or so?
Thanks, Maxim.
16 is too small a value (data is lost).
I set the queue size to 64 IRPs instead of 4096. The CPU load decreased from 80% to 65%.

Big thanks to Tim Roberts for the detailed answer!

Three questions:

  1. What happens if the CRC is invalid in an incoming data packet on an isochronous IN pipe? Will that packet still be returned to my driver, or does the host controller driver drop it? It is very important for me to know exactly how much data was transferred.

  2. Do I understand correctly that using an ISOCHRONOUS pipe is always faster than using a BULK pipe?
    Which will work faster:

  • an exchange using an ISOCHRONOUS pipe and URBs with 32 packets of 512 bytes, or
  • an exchange using a BULK pipe and URBs of 32*512 bytes (when the device sends full 512-byte packets)?
  3. Can there be wasted bandwidth because some packets in the URB we send come back empty? Do we have to send twice as many URBs if only half of the packets in a URB are filled with data?

> 16 is too small a value (data is lost).

Why is the bulk data lost? The bulk pipe should simply suspend itself if nobody is reading it.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

My device is an analog-to-digital converter. It produces data at a fixed rate and has its own internal buffer. The device puts the converter output into this buffer, and the host reads from the buffer through a BULK endpoint. If IN requests arrive too slowly, the buffer overflows and we lose data.

> My device is an analog-to-digital converter. It produces data at a fixed rate.

Then maybe isochronous is better.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> Then maybe isochronous is better.
Yes, yes! I am thinking about it.

I have some questions about isochronous transfers in Message 8. Do you know the answers?

> I have some questions about isochronous transfers in Message 8. Do you know the answers?

With isoch, your driver submits both the data buffer (divided into lots of small areas, one per packet) and the packet status array to the USB stack’s pipe read operation.

The stack fills both the data and the packet status array. If some packet was lost, then the status entry for that packet in the array will be marked accordingly.

IIRC (not 100% sure of this right now), with isoch the software sets the maximum packet size, and the data buffer is laid out as same-sized slots, each of this maximum size. The hardware can actually transfer a smaller packet, but regardless, the next packet goes into the next fixed-size slot rather than immediately after the previous packet, so there will be a tail of junk after each short packet.
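As an illustration, here is a minimal sketch (WDM kernel mode, assuming the usual usbdi.h definitions) of building such an isochronous IN URB with one fixed-size slot per packet. The helper name, its parameters, and the pool tag are made up for this sketch:

PURB BuildIsochInUrb(USBD_PIPE_HANDLE PipeHandle, PVOID Buffer,
                     ULONG NumberOfPackets, ULONG MaxPacket)
{
    ULONG i;
    ULONG urbSize = GET_ISO_URB_SIZE(NumberOfPackets);
    PURB  urb = (PURB)ExAllocatePoolWithTag(NonPagedPool, urbSize, 'hcsI');

    if (urb == NULL) return NULL;
    RtlZeroMemory(urb, urbSize);

    urb->UrbIsochronousTransfer.Hdr.Length   = (USHORT)urbSize;
    urb->UrbIsochronousTransfer.Hdr.Function = URB_FUNCTION_ISOCH_TRANSFER;
    urb->UrbIsochronousTransfer.PipeHandle   = PipeHandle;
    urb->UrbIsochronousTransfer.TransferFlags =
        USBD_TRANSFER_DIRECTION_IN | USBD_START_ISO_TRANSFER_ASAP;
    urb->UrbIsochronousTransfer.TransferBuffer       = Buffer;
    urb->UrbIsochronousTransfer.TransferBufferLength = NumberOfPackets * MaxPacket;
    urb->UrbIsochronousTransfer.NumberOfPackets      = NumberOfPackets;

    /* One fixed-size slot per packet.  On completion the stack fills in
       IsoPacket[i].Length (bytes actually received) and IsoPacket[i].Status
       for every packet, including short and missed ones. */
    for (i = 0; i < NumberOfPackets; i++) {
        urb->UrbIsochronousTransfer.IsoPacket[i].Offset = i * MaxPacket;
    }
    return urb;   /* submit to the USB stack with IOCTL_INTERNAL_USB_SUBMIT_URB */
}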

With isoch, the device transmits a fixed number of packets in each USB SOF clock period, which is 8 kHz on high speed (USB 2) or 1 kHz on full speed. On high speed, this fixed number can be, IIRC, 1, 2, or 3; on full speed, IIRC, 1 only.

BTW, I even have doubts that, with bulk, the device is allowed to transmit variable-size packets. IIRC all packets must be the same size except the “logically last” one, which can be smaller or be a ZLP. This “logically last” packet has a special meaning, similar to an EOF marker or a TCP FIN: the end of the logical stream.

So, if the device uses variable-size packets on bulk, then each packet is its own logical stream, consisting of that packet and nothing else. The end of a logical stream probably carries some overhead in the USB stack, which has major perf costs.

At least it looks very much as if, when you have a pending 8K URB on a bulk pipe and the device transmits a non-full-size packet such as 511 bytes, the 8K URB is completed immediately with this small packet (since the small packet is “logically last”) and the tail space in the URB’s buffer is wasted. So the pipe degrades to a 1 packet = 1 URB mode.

But if the bulk pipe works as intended, then most packets are full-sized (size == the endpoint’s constant packet size) and the “logically last” packets are rare, probably one per whole long transfer. In this case, the stack can fill the 8K URB to its very end.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

I will try to interpret these measurements.

> My experiment results (when the device answers each IN request with a 511-byte packet):
>
> | Queue size | URB size (bytes) | CPU load |
> | 4096       | 512              | 80%      |
> | 2048       | 1024             | 85%      |
> | 1024       | 2048             | 83%      |
> | 512        | 4096             | 82%      |
> | 256        | 8192             | 75%      |

In this mode, each packet is “logically last” in its logical stream, and the stack does not allow a single URB to contain packets from two or more logical streams. So it is 1 URB per packet, which means 1 URB per some fixed time period.

The rise to 80% in the top row, compared to 75% in the bottom one, is probably due to the cost of the queue size: the larger the queue, the more costly it is for the CPU to maintain (probably because of list traversal in the UHCD’s DMA chain building code; the chains are longer).

If I’m correct, then the queue of 4096 requests has its own CPU cost of around 5%, and thus is too large.

> My other experiment results (when the device answers each IN request with a 512-byte packet):
>
> | Queue size | URB size (bytes) | CPU load |
> | 4096       | 512              | 86%      |
> | 2048       | 1024             | 60%      |
> | 1024       | 2048             | 45%      |
> | 512        | 4096             | 25%      |
> | 256        | 8192             | 13%      |

In this case, nearly no packets (or maybe none at all) are “logically last” (short), and so the whole pipe data flow is one logical transfer.

The URBs are filled to the very end, so an 8K URB is 16 packets and a 0.5K URB is 1 packet. The number of URBs per second is therefore 16 times smaller in the last row than in the first one, which gives 13% load instead of 86%.
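For concreteness, here is the URB rate implied by the 8 MB/s figure quoted earlier in the thread (a rough check that ignores short packets):

8 * 1024 * 1024 / 512  = 16384 URBs/second  (one 512-byte packet per URB)
8 * 1024 * 1024 / 8192 =  1024 URBs/second  (sixteen packets per 8K URB)

So the bottom row of the table pays the fixed per-URB cost one sixteenth as often as the top row.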

This surely looks like a large per-URB cost (not per-byte or per-packet) in CPU cycles; this cost is a) the IRP submission and completion paths in Windows, and b) the atomic operation used by the UHCD driver to attach the URB’s DMA chain to the global DMA chain.

So you need to reduce the number of URBs, which means many packets per URB, which means, for bulk, that all packets (except the “logically last” one) have the same size as in the endpoint descriptor.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

xxxxx@spiritdsp.com wrote:

> Three questions:
>
> 1. What happens if the CRC is invalid in an incoming data packet on an isochronous IN pipe? Will that packet still be returned to my driver, or does the host controller driver drop it? It is very important for me to know exactly how much data was transferred.

If the CRC is invalid in an isochronous packet, the packet will be
dropped. As I said, this never happens in real life.

> 2. Do I understand correctly that using an ISOCHRONOUS pipe is always faster than using a BULK pipe? Which will work faster:
>   • an exchange using an ISOCHRONOUS pipe and URBs with 32 packets of 512 bytes, or
>   • an exchange using a BULK pipe and URBs of 32*512 bytes (when the device sends full 512-byte packets)?

“Faster” is not the right question to ask. The tradeoffs are more
complicated. A bulk pipe has potentially more raw bandwidth than an
isochronous pipe. A bulk pipe can suck up every unused byte on the bus;
we have sustained 45 MB/s for long periods over a bulk pipe. An
isochronous pipe, on the other hand, is limited to no more than 24
MB/s. However, the isochronous pipe has its time slots reserved only
for it. Even if you plug in 11 more devices, the isochronous pipe will
still get 24 MB/s, whereas the bulk pipe will be competing with the
other devices for the remaining space.

**IF** you are able to change your hardware to send full 512-byte
packets, in my view that is the best option. Easier to handle,
automatic retries on data errors, higher potential bandwidth. But
**IF** you are not able to pad the packets in the hardware, then it is
easier to handle variable-length packets in isochronous than it is in bulk.

> 3. Can there be wasted bandwidth because some packets in the URB we send come back empty? Do we have to send twice as many URBs if only half of the packets in a URB are filled with data?

I don’t understand the question. With a bulk pipe, ANY short packet
will cause the entire transfer (meaning the entire URB) to be finished
immediately. A zero-length packet is considered “short”. However, if
the device simply NAKs the request, saying that it has nothing to send,
that will have no effect. The current URB will remain pending until
there is data.

With an isochronous pipe, the short packets (including zero-length
packets) are simply registered in the ISO_PACKET array that is returned
to you.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@spiritdsp.com wrote:

>> Then maybe isochronous is better.
>>
> Yes, yes! I am thinking about it.

> I have some questions about isochronous transfers in Message 8. Do you know the answers?

Remember that only a fraction of the people here use the “forums”
interface to this group. Many of us read this as a mailing list or as a
newsgroup, and for us the phrase “message 8” has no meaning.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim Roberts wrote:

> we have sustained 45 MB/s for long periods over a bulk pipe.

Every time Tim says he’s sustained 45MB/sec over a bulk pipe, take a drink.

xxxxx@gmail.com wrote:

> Tim Roberts wrote:
>
>> we have sustained 45 MB/s for long periods over a bulk pipe.
>
> Every time Tim says he’s sustained 45MB/sec over a bulk pipe, take a drink.

I know *I* do.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks to Tim Roberts, Maxim S. Shatskih, and Walter Oney for your answers.


>> If that is the problem, I can call ReadFile with an 8192-byte buffer and then
>> split that request into 16 x 512-byte URBs in my driver.
>
> The problem is only slightly reduced. You still have an enormous amount
> of overhead between you and the host controller.

Tim, I tried to implement this method and got a big overhead from exchanging small URBs with the controller. It turned out exactly as you said. Earlier I had a CPU load of 80%, and with this method I got 65%.

Then I tried submitting 16 URBs (of size 512) from the driver and resubmitting each one from its completion routine, with the data written into a circular buffer from the completion routine. This way I stopped sending IRPs from the application into the kernel. Earlier I had a CPU load of 80%, and with this method I got 50%. A sketch of the resubmit pattern is below.
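For reference, the resubmit-from-completion pattern looks roughly like this (WDM). XFER_CONTEXT, CopyToRing, and ResubmitTransfer are made-up names for my own bookkeeping, not a system API:

typedef struct _XFER_CONTEXT {
    struct _DEVICE_EXTENSION *DevExt;   /* driver's own device extension */
    PIRP  Irp;                          /* driver-allocated IRP for this URB */
    PURB  Urb;                          /* bulk IN URB, reused on every pass */
} XFER_CONTEXT, *PXFER_CONTEXT;

VOID CopyToRing(struct _DEVICE_EXTENSION *DevExt, PVOID Data, ULONG Length);
VOID ResubmitTransfer(struct _DEVICE_EXTENSION *DevExt, PXFER_CONTEXT Xfer);

NTSTATUS BulkReadComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PXFER_CONTEXT xfer = (PXFER_CONTEXT)Context;

    UNREFERENCED_PARAMETER(DeviceObject);

    if (NT_SUCCESS(Irp->IoStatus.Status)) {
        /* TransferBufferLength is updated by the stack to the number of
           bytes actually received. */
        CopyToRing(xfer->DevExt,
                   xfer->Urb->UrbBulkOrInterruptTransfer.TransferBuffer,
                   xfer->Urb->UrbBulkOrInterruptTransfer.TransferBufferLength);
    }

    /* Recycle the same IRP/URB and send it straight back down, so a request
       is always pending at the host controller. */
    ResubmitTransfer(xfer->DevExt, xfer);

    /* The driver owns this IRP; stop I/O manager completion processing. */
    return STATUS_MORE_PROCESSING_REQUIRED;
}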

I understand now: bulk with a URB size of 512 IS A VERY BAD IDEA.

> **IF** you are not able to pad the packets in the hardware, then it is
> easier to handle variable-length packets in isochronous than it is in bulk.

I have rewritten the driver and started using an isochronous pipe. Scheme: IRP -> URB (1024 packets x 1024 bytes per packet). Now the CPU load is about 10%. That’s OK.

But I have run into another problem.

If I start any heavy application, I see that sometimes the host does not take the data from the device in time. The device’s internal buffer overflows, and that is unacceptable; the device is an analog-to-digital converter.

I set “realtime” priority on the thread that sends the IRPs. Buffer overflows became rare, but they still happen sometimes.

Then I tried to use the PipeFlags field of the USBD_PIPE_INFORMATION structure:
// optimize transfers for use with 'real time' threads
#define USBD_PF_ENABLE_RT_THREAD_ACCESS 0x00000004
// causes the driver to allocate and map more transfers in the queue.
#define USBD_PF_MAP_ADD_TRANSFERS 0x00000008

But then I read that these flags are not supported in many versions of Windows.

What can I do so that the host constantly takes data from the device, every microframe?
How do engineers solve this problem for high-speed streaming USB devices in real life?


My idea: the device driver passes a pointer to a nonpaged circular buffer, along with its size, to the host controller driver. The host controller driver requests data from the USB device every microframe and puts the received data into the circular buffer.

Is there something like this in Windows?

Best Regards
Kirill Bagrinovsky

xxxxx@spiritdsp.com wrote:

>> **IF** you are not able to pad the packets in the hardware, then it is
>> easier to handle variable-length packets in isochronous than it is in bulk.
>>

> I have rewritten the driver and started using an isochronous pipe. Scheme: IRP -> URB (1024 packets x 1024 bytes per packet). Now the CPU load is about 10%. That’s OK.

Are you submitting one URB, then processing it and resubmitting? That’s
dangerous, because while you are processing the URB, your device is not
given a chance to transmit. There MUST be a request waiting in order
for the host controller to schedule you. And 1024 packets is way too many.

Personally, I’d go with 8 URBs with 32 packets each, or maybe 4 URBs
with 64 packets each.

> But I have run into another problem.
>
> If I start any heavy application, I see that sometimes the host does not take the data from the device in time. The device’s internal buffer overflows, and that is unacceptable; the device is an analog-to-digital converter.

You need to “do the math” to match your device’s needs with the interval
for the endpoint. What is the continuous and the peak data rate? What
do you have the isochronous interval set to? If your peak data rate is
no more than 1 MB/second, for example, then an interval of once every 8
microframes should keep up, but you might set it to every 4 microframes
just in case. You almost have to run a simulation to figure out the
worst case. You get a shot to transmit one packet during your isoch
interval. After that, USB won’t talk to you AT ALL until your next
interval. If your FIFO is going to overflow by that time, then you need
to decrease the interval.
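A rough version of that math for the numbers in this thread (taking the 8 MB/s as 8 * 1024 * 1024 bytes/second, and the 8 kHz microframe rate mentioned earlier):

required rate:                          8 * 1024 * 1024 =  8,388,608 bytes/second
1 x 1024-byte packet per microframe:    8000 * 1024     =  8,192,000 bytes/second
2 x 1024-byte packets per microframe:   8000 * 2 * 1024 = 16,384,000 bytes/second

So a plain 1024-byte isochronous endpoint serviced every microframe falls slightly short of the required rate, while a high-bandwidth endpoint (two transactions per microframe) leaves roughly 2x margin for the device’s FIFO to ride out scheduling jitter.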

> What can I do so that the host constantly takes data from the device, every microframe?
> How do engineers solve this problem for high-speed streaming USB devices in real life?

As long as you have more than one URB circulating, this shouldn’t be an
issue. I work with web cameras that stream 24 megabytes a second, which
fills a maximum bandwidth isochronous pipe. We can shove that into a
DirectShow graph and preview it, all with only about 10% CPU.

> My idea: the device driver passes a pointer to a nonpaged circular buffer, along with its size, to the host controller driver. The host controller driver requests data from the USB device every microframe and puts the received data into the circular buffer.

The host controller can’t do that, but your driver certainly can.
That’s a very common model for streaming USB drivers. Allocate 256k
bytes of buffer, chop it up between 8 URBs of 32 packets each, and fire
them all off. You still have to worry about what happens if your
user-mode app can’t drain the data before the circular buffer wraps, but
if the buffer wraps, you have a more fundamental problem.
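A minimal sketch of that layout (WDM), reusing the BuildIsochInUrb helper sketched earlier in this thread; the DEVICE_EXTENSION fields and SubmitUrb are assumptions for illustration, not a real API:

#define NUM_URBS        8
#define PACKETS_PER_URB 32
#define MAX_PACKET      1024
#define RING_SIZE       (NUM_URBS * PACKETS_PER_URB * MAX_PACKET)   /* 256 KB */

typedef struct _DEVICE_EXTENSION {
    USBD_PIPE_HANDLE IsochPipeHandle;
    PVOID            Ring;              /* nonpaged circular buffer */
    PURB             Urb[NUM_URBS];
    /* ... read/write cursors for the consumer ... */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

PURB BuildIsochInUrb(USBD_PIPE_HANDLE PipeHandle, PVOID Buffer,
                     ULONG NumberOfPackets, ULONG MaxPacket);   /* sketched earlier */
NTSTATUS SubmitUrb(PDEVICE_EXTENSION DevExt, PURB Urb);         /* wraps IOCTL_INTERNAL_USB_SUBMIT_URB */

NTSTATUS StartIsochStream(PDEVICE_EXTENSION devExt)
{
    ULONG i;

    devExt->Ring = ExAllocatePoolWithTag(NonPagedPool, RING_SIZE, 'gniR');
    if (devExt->Ring == NULL) return STATUS_INSUFFICIENT_RESOURCES;

    /* Each URB owns a fixed 32-packet slice of the ring.  Its completion
       routine records how much of the slice was filled (from the IsoPacket
       array), then resubmits the same URB for the same slice, so the host
       controller always has work queued. */
    for (i = 0; i < NUM_URBS; i++) {
        PUCHAR slice = (PUCHAR)devExt->Ring + i * PACKETS_PER_URB * MAX_PACKET;
        devExt->Urb[i] = BuildIsochInUrb(devExt->IsochPipeHandle, slice,
                                         PACKETS_PER_URB, MAX_PACKET);
        if (devExt->Urb[i] == NULL) return STATUS_INSUFFICIENT_RESOURCES;
        SubmitUrb(devExt, devExt->Urb[i]);
    }
    return STATUS_SUCCESS;
}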

Where are you getting the buffers now? Are you waiting for a user-mode
app to send you buffers?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.