Thanks for your response, Mark!
> Microsoft, in my opinion, has no motivation to improve this situation
> for 32-bit-crippled PCI devices. They would prefer that such PCI devices
> go away and stop being plugged into their systems.
> Consequently, brain-dead 32-bit PCI devices will continue to suffer from
> a 64KB block of map registers until those devices are as rare as ISA devices.
I hadn’t realized that PCI devices that ONLY have Dma32BitAddresses==TRUE were nowadays considered crippled. I thought most devices were like that. Anyway, since we are working with 1394 adapters, which follow the OHCI 1.1 spec, we are stuck with 32-bit DMA addressing.
> That said, if your driver is well behaved, other than being slow,
> it should not have ‘lots of problems’.
> In fact it should not have any problems at all.
It’s clear to me now that my previous points were not so clear, so I will try to clarify using some real numbers in a LONG topic.
When the device you work with is under your “complete” control, being slow is not a big issue.
However, when your device operates autonomously, shooting data at you at rates up to 800 Mbps, you had better be fast enough or you will miss data. This is exactly what happens with 1394 isochronous transmissions and hi-res 1394 digital cameras that are capable of streaming at very high rates.
Isochronous transmissions are “guaranteed bandwidth” but not “guaranteed delivery”.
Isochronous packets arrive every 125 microseconds.
If one 1394 packet is lost because the 1394 chip is not ready to DMA when the packet is transmitted on the cable, then the whole frame is useless.
Moreover, due to the packet header format, you have no 100% reliable way of detecting that something bad happened. As a bonus, the next frame also gets lost (even if you manage to detect that the previous one was bad). I will be happy to clarify both points to anyone interested.
Now let’s see why max 64KB DMA transfer is a big pain for 1394 cameras.
Instead of speaking theoretically I will perform an example calculation using a medium-resolution format, specifically IIDC Format_1 Mode_3 which is 1024x768 YUV 4:2:2 at 15 fps.
In this setup each frame consists of 512 packets of 3072 bytes each.
If I decide to use all 16 available map registers in one DMA transfer I will be able to map up to 64KB, which gives me 65536/3072=21.33 packets. Let’s forget page alignment and suppose that indeed 21 packets can always be mapped.
This means that the driver will build a DMA context program for the 1394 chip that will be able to receive 21 isochronous packets. After the 21st packet is received an interrupt occurs and the 1394 DMA context is HALTED.
At this point we have exactly 125 microseconds before the 22nd packet flies on the cable.
That is 125 microseconds to get into the ISR, have the DPC queued and executed, walk the DMA context program descriptors to find out what happened, free the DMA resources, prepare the DMA context program for the next 21 packets, and restart the halted DMA context (which is not an instant operation).
Since each frame is 512 packets, I will have to break each frame into 512/21 = 24.38, i.e. 25 DMA transfers.
This means that 24 times for each incoming frame I depend on very delicate timing.
How does this work in practice? Very, very poorly.
So at some point not so long ago we switched to what we call “Double Buffering Isochronous Receive”.
This means that when a single frame cannot fit into one DMA transfer, we split it into pieces that are each HALF the maximum DMA transfer. This way we have two active DMA transfers programmed on the 1394 chip.
The 1394 chip generates an interrupt half-way (for the 1st transfer) and continues receiving isochronous packets into the buffers of the 2nd DMA transfer. Instead of a mere 125 microseconds, the isochronous request completion processing cycle has much more time to do its work, plus the DMA context on the 1394 chip does NOT get halted.
Let’s see how this translates into the 16 map register case. Instead of using all 16 map registers in one DMA transfer, we program 2 DMA transfers using 8 map registers each.
32KB/3072 = 10.66, which results in 10 isochronous packets in each DMA transfer (ignoring page alignment).
So now we have 2 active DMA context programs on the 1394 chip. When the first one completes, the driver has 10*125 = 1250 microseconds to do its job. This sounds much more promising, but in the tests we ran (on fairly recent and decent hardware) the results are not good.
Here are some test results:
YUV422 800x600 7.5 fps -> actual rate reported: 5.6 fps. Image quality “looks” OK.
YUV422 800x600 15 fps -> actual rate reported: 10.9 fps. The image is very often “scrambled” (packets may contain a non-integral number of scan lines, so a lost packet produces a visible rearrangement in the rest of the image).
YUV422 1024x768 7.5 fps -> actual rate reported: 0.xx fps. The image is always “scrambled”.
Btw, YUV422 1024x768 7.5 fps uses isochronous receive double buffering with 18 packets in each DMA transfer and still the system can’t keep up. Going at 15 fps (10 isochronous packets per DMA transfer) simply makes the situation worse (if it really makes any difference to say that something is worse than 0.xx fps).
I gather that we could detect the 16-map-register limitation and fall back to Common Buffer DMA instead, writing a whole new bunch of code just for these fancy machines.
I don’t know if it is worth it. I think that a maximum of 16 map registers is plain silly on today’s hardware under any modern OS configuration.
Since we mainly deal with industrial clients and software houses, our only option is to advise them NOT to use such hardware with our drivers. They won’t crash if they do, but they won’t be able to do much meaningful work with 1394 digital cameras.
Warm Regards,
Dimitris Staikos
Unibrain