A year ago I posted here a description of a problem that causes an OS deadlock on XP SP2 when USB selective suspend (SS) is used and the device is surprise-removed at the "right" moment. We discussed it, but no solution was found, and I postponed it because of other work. Now the updated driver has to be signed again and I'd like to solve the problem before that.
Basically, my questions are:
- How should a USB driver correctly handle SS cancellation?
- How does WDF handle SS cancellation, and how does it behave in the scenario below?
- Is there any solution which would work for both XP SP2 and pre-SP2 USB versions?
Currently, my driver handles SS cancellation as I was advised by a USB core team member:
1. Once your callback is invoked do not cancel your IDLE irp.
2. If your callback is invoked and you want to cancel your IDLE irp, etc, stay in D0, do not go to D2.
3. If you want to cancel your IDLE irp after going to D2 in your callback send a D0 irp to the bus driver.
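The three rules can be read as a small decision table. Below is a minimal user-mode sketch of that reading, not kernel code. The names `idle_cancel_action` and `choose_idle_cancel_action` are hypothetical, and the first branch (plain cancellation while the callback has not been invoked yet) is an assumption inferred from rule 1, not something the advice states explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the advice above; all names are illustrative only. */
typedef enum {
    IDLE_ACTION_CANCEL_IRP,  /* assumed: callback not invoked yet, cancel the IDLE IRP */
    IDLE_ACTION_STAY_IN_D0,  /* rule 2: callback invoked, device not yet moved to D2 */
    IDLE_ACTION_SEND_D0      /* rule 3: device already in D2, request D0 instead */
} idle_cancel_action;

/* Picks what the driver does when it wants to get rid of its IDLE IRP. */
idle_cancel_action choose_idle_cancel_action(bool callback_invoked,
                                             bool device_in_d2)
{
    /* In this model the device only enters D2 from inside the callback. */
    assert(!device_in_d2 || callback_invoked);

    if (!callback_invoked) {
        return IDLE_ACTION_CANCEL_IRP;
    }
    if (!device_in_d2) {
        return IDLE_ACTION_STAY_IN_D0;  /* rule 1 also applies: no cancel */
    }
    return IDLE_ACTION_SEND_D0;
}
```

In a real driver both flags would have to be read and updated under a lock, because the callback, the completion routine and the PnP handlers run concurrently.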
This differs from the DDK samples; I had to change them because there were race conditions leading to a non-functional device on pre-SP2 XP. Also, the DDK samples, which cancel the IDLE IRP, violate what the DDK docs say (basically #1 above). The deadlock I mentioned occurs because the OS never completes the D0 IRP sent to cancel the IDLE IRP (#3) when surprise removal occurs in parallel. I'll cite the scenario from my original mail:
Normal situation:
1. IDLE notification callback is called
2. callback sends D2 IRP and waits until completed
3. IDLE notification callback finishes
4. device is physically disconnected
5. IDLE IRP completion routine is called
6. completion routine sends D0 IRP
7. D0 IRP is completed
8. driver receives a few PnP IRPs and finally surprise removal
9. driver waits until all requests complete, cleans up and passes the IRP down
10. driver receives device removal IRP, finishes processing and everything is OK
Now remove the device a bit sooner:
1. IDLE notification callback is called
2 - 3. callback sends D2 IRP and waits until completed
2 - 3. device is physically disconnected
4. driver receives the first PnP IRP from the removal series (IRP_MN_QUERY_DEVICE_RELATIONS)
5. driver finds the IDLE IRP isn't completed yet and tries to cancel it. Because the callback was already called, it sends a D0 IRP down the stack to achieve it.
6. driver receives the next PnP IRP and finally surprise removal
7. driver waits until both the D0 and IDLE IRPs are completed and waits forever because they aren't.
OK, postpone waiting as late as possible:
7. instead, driver doesn't wait, cleans up and passes the IRP down
8. driver receives device removal IRP, finishes processing and waits for the D0 and IDLE IRPs
9. IDLE IRP is completed
10. driver still waits for D0 IRP completion and waits forever
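Both failing variants come down to the same condition: the removal path waits on two outstanding requests and at least one of them never completes. The following is a simplified user-mode model of that wait condition, assuming the driver tracks one pending flag per IRP. The types `pending_state` and `completion_event` and both functions are hypothetical; the model only replays completions and does not reproduce any OS behavior.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping: which requests removal is still waiting on. */
typedef struct {
    bool idle_irp_pending;
    bool d0_irp_pending;
} pending_state;

typedef enum {
    EV_IDLE_IRP_COMPLETED,
    EV_D0_IRP_COMPLETED
} completion_event;

/* The removal wait can end only when nothing is outstanding. */
bool removal_wait_satisfied(const pending_state *s)
{
    return !s->idle_irp_pending && !s->d0_irp_pending;
}

/* Replays the completions that actually arrive and reports whether
 * a wait on both requests would ever return. */
bool removal_would_finish(pending_state s,
                          const completion_event *events, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (events[i] == EV_IDLE_IRP_COMPLETED) {
            s.idle_irp_pending = false;
        } else if (events[i] == EV_D0_IRP_COMPLETED) {
            s.d0_irp_pending = false;
        }
    }
    return removal_wait_satisfied(&s);
}
```

With both requests pending, an empty event list corresponds to step 7 of the early-removal case, and a list containing only `EV_IDLE_IRP_COMPLETED` corresponds to steps 9 and 10 of the postponed variant. In both cases the function returns false, which in the driver means the wait never ends.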
Some timings I measured:
Typical order of events with boundary timing (correct run). The device disconnect is scheduled circa 700 ms after the idle IRP is submitted:
T + 0: driver submits idle IRP
T + 650 ms: idle callback starts and sends D2 IRP
T + 660 ms: D2 IRP is completed and idle callback finishes
T + 700 ms: device is physically disconnected
T + 900 ms: idle IRP is completed with cancelled status
T + 1400 ms: surprise removal IRP is received
Now the same scenario when the device disconnect is scheduled at T + 600 ms:
T + 0: driver submits idle IRP
T + 600 ms: device is physically disconnected
T + 700 ms: idle callback starts and sends D2 IRP
T + 1200 ms: D2 IRP is completed and idle callback finishes
T + 1600 ms: surprise removal IRP is received
The IDLE IRP isn't completed by the system, so the PnP IRP handler has to cancel it. I follow the DDK docs and the above advice and send a D0 IRP instead of cancelling it. The D0 IRP is never completed in this situation, which finally causes the OS deadlock.
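To make the difference between the two measured traces explicit, here is a small sketch that computes how long the D2 IRP took and whether the disconnect preceded the idle callback. The struct `idle_run` and both helper functions are hypothetical; the numbers in the usage note are the ones listed in the traces above, and the sketch only compares them without claiming which difference causes the hang.

```c
#include <stdbool.h>

/* Hypothetical record of one measured run, times in ms relative to T. */
typedef struct {
    int disconnect_ms;      /* device physically disconnected */
    int callback_start_ms;  /* idle callback starts and sends D2 IRP */
    int d2_complete_ms;     /* D2 IRP completed, callback finishes */
} idle_run;

/* Time the D2 IRP needed to complete. */
int d2_duration_ms(const idle_run *r)
{
    return r->d2_complete_ms - r->callback_start_ms;
}

/* True when the device was already gone when the callback started. */
bool disconnect_before_callback(const idle_run *r)
{
    return r->disconnect_ms < r->callback_start_ms;
}
```

For the correct run (disconnect at 700, callback at 650, D2 completed at 660) the D2 IRP takes 10 ms and the disconnect comes after the callback. For the failing run (600, 700, 1200) the D2 IRP takes 500 ms and the disconnect comes first.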
Best regards,
Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]