WDM USB iso streaming question

Hello all,

I have just started developing a WDM driver for some USB streaming device
(WinXP).
I studied the WDM example of the Windows XP DDK (isousb) and I (hopefully)
understood how the data transfer between the driver and the USB bus is done.

But the transfer of the iso data (requested with
IOCTL_ISOUSB_START_ISO_STREAM) from driver to userspace is still a mystery
to me.
For the read/write IOCTLs everything looks fine, but I do not see how the
application can fetch the data which is continuously received by the
driver. I know that a global buffer is defined in the example (gbuf) but I
don’t see how this is used.

I checked Walter Oney’s “Programming the Microsoft Windows Driver Model”.
There was an example on isochronous USB transfer, but I didn’t find this
case (USB iso streaming).

Can somebody give me a pointer to an example of how to do isochronous
streaming on USB? Especially, how does an application receive streaming data
from the driver?

Thanks in advance.

Kind regards,
Frank


Frank Dawidowsky, xxxxx@sony.de

Here is how I do it:

  • allocate a large buffer in user space
  • pass that buffer to the driver using a proprietary IOCTL
  • probe & lock the buffer into kernel memory
  • create as many MDLs as you have URBs, each MDL describing a small portion
    of the large buffer; making sure that the MDLs fully span the buffer
  • send those MDLs to USBD, each as part of a URB/IRP pair
  • recycle those URB/IRP pairs in your completion routine (a sketch follows
    below)
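
In code, the lock-and-split steps might look roughly like this (a minimal
sketch: NUM_URBS and the function name are invented, the buffer length is
assumed to divide evenly, and error unwinding is abbreviated):

```c
#include <wdm.h>

#define NUM_URBS 8

NTSTATUS LockAndSplitUserBuffer(PVOID UserBuffer, ULONG Length,
                                PMDL *MasterMdl,
                                PMDL PartialMdls[NUM_URBS])
{
    ULONG chunk = Length / NUM_URBS;  /* assumes Length divides evenly */
    ULONG i;
    PMDL  mdl;

    mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
    if (mdl == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* probe & lock the user buffer into kernel memory */
    __try {
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    for (i = 0; i < NUM_URBS; i++) {
        PUCHAR va = (PUCHAR)UserBuffer + i * chunk;

        PartialMdls[i] = IoAllocateMdl(va, chunk, FALSE, FALSE, NULL);
        if (PartialMdls[i] == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;  /* unwind omitted */

        /* each partial MDL describes one chunk; together they span
         * the whole locked buffer */
        IoBuildPartialMdl(mdl, PartialMdls[i], va, chunk);
    }

    *MasterMdl = mdl;
    return STATUS_SUCCESS;
}
```

Each partial MDL then goes into its URB’s TransferBufferMDL field.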

Burk.


Burkhard Daniel * xxxxx@mtronix.de * www.mtronix.de

mtronix Precision Measuring Instruments

> For the read/write IOCTLs everything looks fine, but I do not see how the
> application can fetch the data which is continuously received by the
> driver.

The app must send lots of overlapped IRPs by DeviceIoControl, and the
driver will fill their buffers one by one and complete them one by
one.
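
A minimal user-mode sketch of that pattern (the device name and IOCTL code
are invented placeholders; error checking is omitted):

```c
#include <windows.h>
#include <winioctl.h>

/* Hypothetical IOCTL; real code would use the driver's own header. */
#define IOCTL_MYDEV_GET_STREAM_DATA \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

#define N_REQUESTS 8
#define BUF_SIZE   4096

void StreamLoop(void)
{
    HANDLE      hDev;
    OVERLAPPED  ov[N_REQUESTS];
    HANDLE      ev[N_REQUESTS];
    static BYTE buf[N_REQUESTS][BUF_SIZE];
    int         i;

    hDev = CreateFileA("\\\\.\\MyStreamDev", GENERIC_READ, 0, NULL,
                       OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    /* pend all requests up front so the driver always has a buffer */
    for (i = 0; i < N_REQUESTS; i++) {
        ZeroMemory(&ov[i], sizeof(ov[i]));
        ev[i] = ov[i].hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
        DeviceIoControl(hDev, IOCTL_MYDEV_GET_STREAM_DATA, NULL, 0,
                        buf[i], BUF_SIZE, NULL, &ov[i]);   /* pends */
    }

    for (;;) {
        DWORD got;
        i = (int)(WaitForMultipleObjects(N_REQUESTS, ev, FALSE, INFINITE)
                  - WAIT_OBJECT_0);
        GetOverlappedResult(hDev, &ov[i], &got, FALSE);
        /* ... consume got bytes from buf[i] ... */
        DeviceIoControl(hDev, IOCTL_MYDEV_GET_STREAM_DATA, NULL, 0,
                        buf[i], BUF_SIZE, NULL, &ov[i]);   /* re-issue */
    }
}
```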

> I know that a global buffer is defined in the example (gbuf) but I don’t
> see how this is used.

Bad practice; “lots of IRPs” is better. Note also that DirectShow and
DirectSound use “lots of IRPs”.

Max

Maxim S. Shatskih wrote:

>> I know that a global buffer is defined in the example (gbuf) but I don’t
>> see how this is used.
>
> Bad practice; “lots of IRPs” is better. Note also that DirectShow and
> DirectSound use “lots of IRPs”.

Why is that bad practice? I’d say that really depends on the problem you’re
trying to address.

In my case, I’m the only customer of that data ever. I know what the data
looks like, I have built-in synchronization, and performance is an absolute
must. Therefore, I go with a single large buffer that I have my driver roam
over with many IRPs.

What’s wrong with that?

Burk.

Burkhard Daniel * xxxxx@mtronix.de * www.mtronix.de

mtronix Precision Measuring Instruments

> Why is that bad practice?

  1. Synchronization of accesses to this memory by the driver and by the
    app. Named events cause issues on Remote Desktop/TS, and passing the
    event handle in an IOCTL is not simpler than using the “lots of IRPs”
    approach.
  2. User memory access from the driver. Usually this will require
    building an MDL on the buffer and holding it locked for the whole time
    the driver is loaded. Again, not easier than “lots of IRPs”, and it
    provides no advantages.

> performance is an absolute must. Therefore, I go with a single large buffer
> that I have my driver roam over with many IRPs.

In fact, the shared memory approach, if implemented properly, will
have more limitations and will require more coding, while having no
advantages over the “lots of pending IRPs” model.

Max

Maxim S. Shatskih wrote:

>> Why is that bad practice?
>
> 1. Synchronization of accesses to this memory by the driver and by the
>    app. Named events cause issues on Remote Desktop/TS, and passing the
>    event handle in an IOCTL is not simpler than using the “lots of IRPs”
>    approach.

As I said, that pretty much depends on the problem you’re tackling.
I don’t need to explicitly synchronize my data with the driver, as the
synchronization is *intrinsic* to the data. No events needed, thank you.
This is true for quite a few scenarios I can picture that involve
isochronous streaming.

> 2. User memory access from the driver. Usually this will require
>    building an MDL on the buffer and holding it locked for the whole time
>    the driver is loaded. Again, not easier than “lots of IRPs”, and it
>    provides no advantages.

I actually need more than one MDL, because I have multiple URBs that each
describe a certain section of the buffer. But IMHO, creating those MDLs in
the driver is no more difficult than managing the complexity involved with
sending numerous DeviceIoControls to the driver and handling the data thus
returned.

And of course, the buffer does not have to be locked for the entire life of
the driver. Instead, you hand the driver a pointer to the large buffer in a
proprietary IOCTL, and the driver then blithely holds on to the
corresponding IRP until the application cancels the process by issuing
another IOCTL (or else until the app quits). Quite simple really.
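
A minimal sketch of that park-the-IRP-with-cancel-support pattern (the
device extension field is invented, and the locking around the cancel race
is pared down to the canonical bare steps):

```c
#include <wdm.h>

typedef struct _DEVICE_EXTENSION {
    PIRP RegistrationIrp;            /* the parked IOCTL, if any */
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

VOID MyCancelRegistration(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PDEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

    IoReleaseCancelSpinLock(Irp->CancelIrql);
    ext->RegistrationIrp = NULL;
    /* ... unlock pages and free the MDLs built on the user buffer ... */
    Irp->IoStatus.Status = STATUS_CANCELLED;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}

NTSTATUS PendRegistrationIrp(PDEVICE_EXTENSION ext, PIRP Irp)
{
    IoMarkIrpPending(Irp);
    ext->RegistrationIrp = Irp;
    IoSetCancelRoutine(Irp, MyCancelRegistration);

    if (Irp->Cancel && IoSetCancelRoutine(Irp, NULL) != NULL) {
        /* cancelled before we could park it; complete it ourselves */
        ext->RegistrationIrp = NULL;
        Irp->IoStatus.Status = STATUS_CANCELLED;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }
    return STATUS_PENDING;   /* dispatch must still return pending */
}
```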

> the driver is loaded. Again, not easier than “lots of IRPs”, and it
> provides no advantages.

There is one big big big advantage to using the shared buffer approach:
latency. I have almost *zero* latency between the data arriving in the
driver and the application having access to it. Try to get that with
multiple IRPs floating around, which will always mean that you have (at
least one) context switch before your app sees the data. Not so with a
shared buffer.

Also, I save myself the overhead of initializing a zillion IRPs (by means of
DeviceIoControl), sending them to the driver, and having the system wait on
each of them to complete.

Where performance and latency are important, I’ll take the shared buffer
approach any time over the “many IRPs” variety. And why, exactly, is that
easier? Considering you have to implement cancellation logic and all?

> In fact, the shared memory approach, if implemented properly, will
> have more limitations and will require more coding, while having no
> advantages over the “lots of pending IRPs” model.

I doubt that. What limitations? What extra coding? And I just told you the
advantages.

Burk.

Burkhard Daniel
Software Technologies Group, Inc.
xxxxx@stg.com * http://www.stg.com
fon: +49-179-5319489 fax: +49-179-335319489

> Also, I save myself the overhead of initializing a zillion IRPs (by means
> of DeviceIoControl), sending them to the driver, and having the system
> wait on each of them to complete.

Why zillion? For instance, DV/DirectShow uses an IRP per video frame -
25 or 30 IRPs per second.

> And why, exactly, is that easier? Considering you have to implement
> cancellation logic and all?

Just above in the original message you mentioned the auxiliary IRP
with cancellation support. :-) Why not make it the main data path instead
of auxiliary stuff?
The cancel logic is trivial; I just posted it to this very forum around
6 hours ago.

> I doubt that. What limitations? What extra coding?

Auxiliary IRPs with cancellation support.
As for latency: DirectShow is a realtime environment, and it works
fine with “lots of IRPs”, namely IOCTL_KS_READ_STREAM IRPs.

Max

— “Maxim S. Shatskih” wrote:
> > Also, I save myself the overhead of initializing a zillion IRPs (by means of
> > DeviceIoControl), sending them to the driver, and having the system wait on
> > each of them to complete.
>
> Why zillion? For instance, DV/DirectShow uses an IRP per video frame -
> 25 or 30 IRPs per second.

It depends.
Actually for analog video it is usually one IRP per interlaced field (odd or even), which makes it
60 IRPs/second.
On the other hand, for DTV (DVB or OpenCable), at certain data rates and allocated buffer sizes it
can be as much as 200-300 IRPs/second.

I don’t think there is anything wrong with that, though.
The OS can handle this load with no problem at all.

The issue is not the number of IRPs to handle; the issue is mostly latency.
Consider one of those PCI analog TV cards with a loopback cable for audio. Audio in that case goes
through the wire directly to a sound card, which makes latency virtually nil. Video, however, is
delivered through the PCI bus to system memory. If a 50 ms latency builds up while it
goes through the hoops in DShow down to the video renderer, we will have a 50 ms lip-sync problem.
While it is nothing to an average user, video professionals with a dozen years of experience in the
industry might notice it. And they frequently have a say on whether the design is accepted for
production or not.

– Max.


> goes through the hoops in DShow down to the video renderer, we will have a
> 50 ms lip-sync problem. While it is nothing to an average user, video
> professionals with a dozen years of experience in the industry might notice
> it.

The solution is the following:

  • the driver must support the “pause mode”. In this mode, the DMA pipeline is
    stalled by some control bit in one of the hardware registers. So, no “DMA done”
    interrupts -> no IRP completions.
  • the driver must be able to enter this mode even in the middle of some DMA
    transfer, having the current IRP partially processed.
  • the driver must accept IRPs in this mode; they must be queued and never
    delivered to StartIo and thus to the hardware (a toy sketch of this queueing
    follows below).
  • then do the following: feed both video and audio to the appropriate devices
    while they are in pause mode, and then resume both of them.

This is really so in DirectSound (which also provides a master clock from the
card’s DMA counter), as well as in its Linux descendant OpenSound. IIRC there
were some DirectShow interfaces which could support this for any multimedia
hardware, not only for sound.
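
A toy sketch of the queue-while-paused part, assuming an invented device
extension and leaving out cancel support:

```c
#include <wdm.h>

typedef struct _DEVICE_EXTENSION {
    LIST_ENTRY PendingQueue;   /* IRPs held while paused */
    KSPIN_LOCK QueueLock;
    BOOLEAN    Paused;
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS DispatchStream(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PDEVICE_EXTENSION ext = DeviceObject->DeviceExtension;
    KIRQL irql;

    IoMarkIrpPending(Irp);
    KeAcquireSpinLock(&ext->QueueLock, &irql);
    if (ext->Paused) {
        /* accept but hold: never reaches StartIo, thus never the hardware */
        InsertTailList(&ext->PendingQueue, &Irp->Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&ext->QueueLock, irql);
    } else {
        KeReleaseSpinLock(&ext->QueueLock, irql);
        IoStartPacket(DeviceObject, Irp, NULL, NULL);
    }
    return STATUS_PENDING;
}

VOID Unpause(PDEVICE_OBJECT DeviceObject)
{
    PDEVICE_EXTENSION ext = DeviceObject->DeviceExtension;
    KIRQL irql;

    /* clear the hardware stall bit here, then drain the held IRPs */
    KeAcquireSpinLock(&ext->QueueLock, &irql);
    ext->Paused = FALSE;
    while (!IsListEmpty(&ext->PendingQueue)) {
        PLIST_ENTRY e = RemoveHeadList(&ext->PendingQueue);
        PIRP irp = CONTAINING_RECORD(e, IRP, Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&ext->QueueLock, irql);
        IoStartPacket(DeviceObject, irp, NULL, NULL);
        KeAcquireSpinLock(&ext->QueueLock, &irql);
    }
    KeReleaseSpinLock(&ext->QueueLock, irql);
}
```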

Max

Maxim S. Shatskih wrote:

> The solution is the following: [pause-mode proposal snipped]

Only this doesn’t work for /getting/ data from the device, such as recording
audio or streaming data into the computer, which requires low latency.

Burkhard.

Burkhard Daniel
Software Technologies Group, Inc.
xxxxx@stg.com * http://www.stg.com
fon: +49-179-5319489 fax: +49-179-335319489

> Only this doesn’t work for /getting/ data from the device, such as recording

Why?

  • open all source drivers and feed them with IRPs in pause mode
  • then un-pause them all
  • if each driver timestamps the input data packets (completed IRPs) and
    starts the timestamp counter at zero on un-pause, then merging the
    streams in user mode is a solvable task, possibly with some queueing
    (a sketch of the timestamping follows below)
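
A sketch of that zero-based timestamping (the packet header type is an
invented placeholder; scaling overflow on very long runs is ignored):

```c
#include <wdm.h>

/* Invented header; real code would define its own packet metadata. */
typedef struct _PACKET_HEADER {
    LONGLONG Timestamp100ns;   /* time since un-pause, in 100 ns units */
} PACKET_HEADER, *PPACKET_HEADER;

static LARGE_INTEGER StreamStart;  /* captured at un-pause */

VOID OnUnpause(VOID)
{
    StreamStart = KeQueryPerformanceCounter(NULL);
}

/* called from the completion path for each input packet */
VOID StampPacket(PPACKET_HEADER Header)
{
    LARGE_INTEGER now, freq;

    now = KeQueryPerformanceCounter(&freq);
    /* counts since un-pause, rescaled to 100 ns units */
    Header->Timestamp100ns =
        (now.QuadPart - StreamStart.QuadPart) * 10000000 / freq.QuadPart;
}
```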

The nastiest part of latency is from the analog-to-digital converter to the
PCI DMA engine or the USB interface. This latency can only be approximated by
some hardcoded value; it cannot be measured.

Max

Maxim S. Shatskih wrote:

> if each driver timestamps the input data packets (completed IRPs) and
> starts the timestamp counter at zero on un-pause, then merging the
> streams in user mode is a solvable task [rest snipped]

Well, what about the case where the device is streaming high-bandwidth data
for long periods of time? The problem is not so much *knowing* when the data
originated and synching that to something else; it is the amount of time it
takes for the data to travel through the peripheral device, through the
driver stack, and into the application that matters. IRP queues will only
make that latency worse.

Burk.

> The nastiest part of latency is from the analog-to-digital converter to the
> PCI DMA engine or the USB interface. This latency can only be approximated
> by some hardcoded value; it cannot be measured.

The nastiest part of latency is that it is the time between the event taking
place and the event being processed in the PC. If that time is large, you
will notice a delay, which very often is bad.

Burk.

Burkhard Daniel * xxxxx@mtronix.de * www.mtronix.de

mtronix Precision Measuring Instruments

Yes, there are some interfaces in KS/DShow that allow video capture devices (or any other devices
for that matter) to queue a bit of data for a particular stream upon the transition from the
ACQUIRE state to PAUSE, so that they are a little ahead of the other streams.

With my example I was just referring to the problem the original poster was trying to solve with
his own solution.

I’d rather use the standard “lots of IRPs” approach, as it gives you a whole bunch of services
such as DShow. Otherwise the app would have to use some 3rd-party middleware, which may not be
such a good idea.
A standard API is usually better.

– Max.


Thanks for your very helpful comments on my problem.
I am right now favoring the “lots of IRPs” approach, but the driver has
to support multiple applications, and I am not sure how to solve this.

The Scenario:
One USB streaming device delivers a continuous stream, which is received
by a variety of applications. Every application gets the same data.
I am considering some sort of register/unregister IOCTL to keep track of
the applications connected to the driver. And if I use a fixed data buffer
for each application (the memory handle will be transmitted with the
IOCTL), the driver can copy the data from the completed URBs to the
several buffers.
Is the access control to locked memory already enough to synchronise the
kernel and the application? I.e., am I sure that I only get data which has
already been processed by the driver?

And is it possible to solve such a scenario with “multiple IRPs”?

Thanks again,
Frank Dawidowsky


This is how I would do it.

  1. Have a service build a simple DShow graph that contains two filters.
    One is the source of the data (your capture device), connected and streaming data to your
    custom sink user-mode filter (S).

  2. Have your applications register with the sink filter (S).

  3. (S) allocates shared memory (a memory-mapped file or whatever else) and submits a bunch of
    buffers out of that shared memory to the capture filter to fill out with data. This is your
    “multiple IRPs” technique in action.

  4. Once the capture filter has finished filling out its next buffer, it returns it to KSPROXY,
    which forwards it to (S).

  5. (S) then notifies every registered application that new data is available.
    It could use some sort of COM event interface and have the data wrapped into IMyCustomBuffer.
    (S) AddRefs it before sending it to an application. The application calls Release on it when
    it’s done. When the reference count drops to 0, the buffer is put back into the pool of free
    buffers and submitted again to the capture driver (a sketch of this recycling follows below).
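
The recycling in step 5 might look roughly like this, sketched in plain C
rather than COM (the two helper functions are assumptions, not real APIs):

```c
#include <windows.h>

typedef struct _MY_BUFFER {
    LONG RefCount;
    BYTE Data[4096];
} MY_BUFFER;

/* assumed helpers, not real APIs */
void ReturnToFreePool(MY_BUFFER *b);    /* resubmit to the capture driver */
void NotifyApp(int app, MY_BUFFER *b);  /* e.g. fire a COM event */

void Buffer_AddRef(MY_BUFFER *b)
{
    InterlockedIncrement(&b->RefCount);
}

void Buffer_Release(MY_BUFFER *b)
{
    if (InterlockedDecrement(&b->RefCount) == 0)
        ReturnToFreePool(b);   /* back to the pool, then resubmitted */
}

/* (S) hands one filled buffer to every registered application */
void DeliverToApps(MY_BUFFER *b, int nApps)
{
    int i;
    for (i = 0; i < nApps; i++)
        Buffer_AddRef(b);      /* one reference per consumer, up front */
    for (i = 0; i < nApps; i++)
        NotifyApp(i, b);       /* each app calls Buffer_Release when done */
}
```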

I believe that this is a very straightforward approach, and it completely separates the
participating parties. Applications don’t have to worry (or even know) about the driver. They
register with (S) and get their data when it is available. (S) handles the buffers and the list of
registered applications. The driver in the meantime just supplies the data. It does not have to
worry about where the data goes afterwards, nor does it care. Its job is to retrieve the data from
the hardware and provide it to the consumer, nothing more. No sync issues to worry about, no
custom work to do. In terms of communication with user-mode clients, everything is handled by
DShow.

– Max.


Hi,

I am new to WDM development too, and I have the same feeling as Frank does.
One step further that I tried: when running rwiso -g, the USB data is actually
coming in (I saw that with a USB sniffer), but rwiso doesn’t provide a way
to retrieve the data on the fly. And that is my question: how can I
get the streaming data from the existing kernel driver? I did
manage to create a ring buffer, move the USB data into it, and create a new
IOCTL to try to retrieve it from the ring buffer, but it failed at this new
IOCTL even though I tried using a mutex and a spin lock to synchronize. I know
this may not be hard for the experts in this group.
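
For reference, a minimal sketch of draining such a ring buffer through a
METHOD_BUFFERED IOCTL (all names invented). One common pitfall: a URB
completion routine can run at DISPATCH_LEVEL, where you cannot wait on a
kernel mutex, so both producer and consumer must share the spin lock:

```c
#include <wdm.h>

#define RING_SIZE 65536

typedef struct _RING {
    KSPIN_LOCK Lock;
    ULONG      Head, Tail;   /* Head: write index, Tail: read index */
    UCHAR      Data[RING_SIZE];
} RING;

/* IOCTL handler: copy whatever is available into the caller's buffer.
 * OutLen comes from the IRP stack location's OutputBufferLength. */
NTSTATUS IoctlReadRing(RING *r, PIRP Irp, ULONG OutLen)
{
    KIRQL  irql;
    PUCHAR out = Irp->AssociatedIrp.SystemBuffer;  /* METHOD_BUFFERED */
    ULONG  copied = 0;

    KeAcquireSpinLock(&r->Lock, &irql);
    while (copied < OutLen && r->Tail != r->Head) {
        out[copied++] = r->Data[r->Tail];
        r->Tail = (r->Tail + 1) % RING_SIZE;
    }
    KeReleaseSpinLock(&r->Lock, irql);

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = copied;   /* bytes returned to the app */
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}
```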

Thanks in advance

William
