Hang in DeviceIoControl

  1. I have a driver that lets a user mode app queue up a WDFREQUEST. Multiple requests can be queued up.
  2. When a certain event happens in the driver, I complete the request.
  3. I have a thread in the user mode app that calls GetQueuedCompletionStatus so that the user mode app is notified when the event in the driver happens.
  4. After GetQueuedCompletionStatus returns, I call DeviceIoControl with a synchronous message on the same handle.
  5. Then I call DeviceIoControl with an asynchronous message to queue up another request so that the user mode app can be notified the next time the event happens in the driver.
  6. At the same time, another thread can call DeviceIoControl, with the same handle, with an asynchronous message. This message uses the same handle, so its completion should arrive on the same thread as mentioned before.

My problem is that sometimes the synchronous message that I send in step 4 never returns. It happens at the same time as step 6 above, and the completion of the message in step 6 is never received by calling GetQueuedCompletionStatus since the thread is stuck waiting for the synchronous message to return.

I have discovered that each of these two changes works on its own:

  • If I use PostQueuedCompletionStatus to schedule the synchronous message in step 6 on the thread that calls GetQueuedCompletionStatus, then nothing gets stuck.
    or:
  • If I start another thread that calls GetQueuedCompletionStatus with the same completion port handle before I send the synchronous message in step 6, then nothing gets stuck either.

Can someone please explain what is happening here?

Thanks!

I think that for any of us to help, more details are needed. Presumably, the HANDLE has been opened with FILE_FLAG_OVERLAPPED, and then has been associated with an IOCP. How are the ‘synchronous’ DeviceIoControl calls made? They will still generate IO completion notifications, but you can also wait for them to complete with GetOverlappedResult etc.

I open the file with:

driverHandle_ = CreateFile(
    path,
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_WRITE | FILE_SHARE_READ,
    nullptr,
    OPEN_EXISTING,
    FILE_FLAG_OVERLAPPED,
    nullptr);

and I associate it with a completion port like this:

completionPortHandle_ = CreateIoCompletionPort(driverHandle_, nullptr, 0, 0);

I only have one thread calling GetQueuedCompletionStatus, but I still pass 0 as the last parameter to CreateIoCompletionPort to set NumberOfConcurrentThreads. Does it matter what I pass if I only have one thread calling GetQueuedCompletionStatus?

Here is a stripped-down version of what I believe are the interesting parts of how I call DeviceIoControl:

typedef void (*AsyncNotificationFunctionPtr)(void* ref);

struct AsyncNotificationContext {
  OVERLAPPED Overlapped{};
  AsyncNotificationFunctionPtr funcPtr;
  void* ref = nullptr;
};

void WindowsAudioClient::AudioCallbackBuffers::consumeDataFromDevice() {
  // Do processing, omitted.

  // Send DeviceIoControl with OVERLAPPED to queue up another WDFREQUEST in the driver
  // so that this buffer can be notified when the driver writes data to it.

  memset(&asyncNotificationContext_->Overlapped, 0, sizeof(OVERLAPPED));
  asyncNotificationContext_->funcPtr = &audioCallbackFunction;
  asyncNotificationContext_->ref = this;

  DeviceIoControl(driverHandle_, IOCTL_AUDIO_ARM_CALLBACK,
                  &index_,                 // Index of this buffer
                  sizeof(index_),
                  nullptr,
                  0,
                  nullptr,                 // BytesReturned
                  &asyncNotificationContext_->Overlapped);  // Ptr to Overlapped structure

  // Send another message to the driver telling it that the user mode app is done processing.
  // This call sometimes never returns!
  DeviceIoControl(driverHandle_, IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED,
                  &index_, sizeof(index_), nullptr, 0, nullptr, nullptr);
}

void audioCallbackFunction(void* const ref) {
  static_cast<AudioCallbackBuffers*>(ref)->consumeDataFromDevice();
}

// Part of the function that calls GetQueuedCompletionStatus:

while (!threadShouldExit_.load()) {
  DWORD byteCount = 0;
  ULONG_PTR compKey = 0;
  OVERLAPPED* overlapped = nullptr;

  // When GetQueuedCompletionStatus returns, it is the IOCTL_AUDIO_ARM_CALLBACK that has completed.
  BOOL worked =
    GetQueuedCompletionStatus(completionPortHandle_, &byteCount, &compKey, &overlapped, 1000);
  if (byteCount == 0 || !worked || overlapped == nullptr) {
    continue;
  }   
  const auto* const wrap = reinterpret_cast<AsyncNotificationContext*>(overlapped);
  wrap->funcPtr(wrap->ref); // Calls audioCallbackFunction
}

// This method is called from some other thread (not the thread that is in the while loop above)
bool WindowsAudioClient::stopAudio() {

  OVERLAPPED overlapped{};
  overlapped.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

  // driverHandle_ is the same handle that is in each AudioCallbackBuffers
  DeviceIoControl(driverHandle_, IOCTL_AUDIO_STOP, nullptr, 0, nullptr, 0, nullptr, &overlapped);
  // Error handling omitted...

  WaitForSingleObject(overlapped.hEvent, INFINITE);
  CloseHandle(overlapped.hEvent);
  return true;
}

---------- Driver code ---------

I have two queues:

  • The default queue is set up to process DeviceIoControl messages with WdfIoQueueDispatchParallel.
  • nonAudioQueue is set up to process with WdfIoQueueDispatchSequential.

VOID USBDriverDefaultEvtIoDeviceControl(_In_ WDFQUEUE Queue, _In_ WDFREQUEST Request,
                                        _In_ size_t OutputBufferLength,
                                        _In_ size_t InputBufferLength,
                                        _In_ ULONG IoControlCode) {
  switch (IoControlCode) {
    case IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED: {
      // Get the index, move one int from one list to the other, then call
      // WdfRequestCompleteWithInformation(...).
      // No mutex or spinlock is taken, and there is no loop that we cannot get out of.
      break;
    }
    case IOCTL_AUDIO_ARM_CALLBACK: {
      // Store the WDFREQUEST on a WDFQUEUE.
      break;
    }
    default: {
      if (const auto status =
            WdfRequestForwardToIoQueue(Request, getDeviceContext(Queue)->nonAudioQueue);
          !NT_SUCCESS(status)) {
          // Log error...
      }
    }
  }
}

The EvtIoDeviceControl for the other queue is not so interesting (I can see in the logging that WdfRequestCompleteWithInformation is called for the WDFREQUEST for IOCTL_AUDIO_STOP).

Well… I noticed that you’re not referencing your overlapped structure in one of those calls. If the handle’s been opened overlapped, I don’t think that’s valid.

Aside from that, you need to debug this and narrow down where the error is. Are you getting the “synchronous” request that you’re sending in your driver? And do you complete it??

The documentation says that every request on a FILE_FLAG_OVERLAPPED file handle must use an OVERLAPPED structure, but since Windows XP, the I/O system does handle it.

As noted by @“Peter_Viscarola_(OSR)” and @Tim_Roberts you need an overlapped object for each io operation if you opened it for asynch io.

However, you can make your busted no overlapped ioctl synchronous by just waiting for it to complete after issuing it (assuming it returns ERROR_IO_PENDING.)

As an aside, why bother with io completion ports if you are only using one thread and one handle?

Thanks everybody!

@“Peter_Viscarola_(OSR)” I can see that the request is delivered to the driver and that it is completed.

Passing an OVERLAPPED when I send IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED seems to fix it, but only if I also create an event and set its low-order bit so that no IO completion packet is queued to the completion port. Like this:

  // Send another message to the driver telling it that the user mode app is done
  // processing.  
  AsyncNotificationContext notificationContext{};
  notificationContext.Overlapped.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
  notificationContext.Overlapped.hEvent = (HANDLE)((DWORD_PTR)notificationContext.Overlapped.hEvent | 1);
  const auto result =
    DeviceIoControl(driverHandle_, IOCTL_OB_AUDIO_MARK_BUFFER_AS_CONSUMED,
                    &message,                          // Ptr to InBuffer
                    sizeof(message),                   // Length of InBuffer
                    &notificationContext.data,         // Ptr to OutBuffer
                    sizeof(notificationContext.data),  // Length of OutBuffer
                    nullptr,                           // BytesReturned
                    &notificationContext.Overlapped);

  DWORD bytesTransfered = 0;
  GetOverlappedResult(driverHandle_, &notificationContext.Overlapped, &bytesTransfered, TRUE);
  CloseHandle((HANDLE)((DWORD_PTR)notificationContext.Overlapped.hEvent & ~1));

I wonder whether an IO completion packet is still queued to the completion port by the time GetOverlappedResult returns. It seems that the completion packet needs to be picked up by calling GetQueuedCompletionStatus before GetOverlappedResult can return.
So maybe the issue is that I send the DeviceIoControl without an OVERLAPPED on the only thread that is calling GetQueuedCompletionStatus?

@Mark_Roddy:
I am not sure what you mean with one thread? There is one thread reading from GetQueuedCompletionStatus, and some other thread that calls stopAudio. Can you give me more details if I can improve something here?

What I meant was that io completion ports are typically used with a pool of threads. It just seems overly complicated to me, particularly as you are using the same handle for both io completion ports and your getoverlappedresult ioctls. But never mind that if you have it working.

You could just use separate handles, right?

This is not the normal usage pattern for IOCP or OVERLAPPED IO. But I can answer some of your questions.

The NumberOfConcurrentThreads parameter in CreateIoCompletionPort matters only when the completion port is created. When associating a handle with an existing completion port, it is ignored. This parameter provides a hint to the thread scheduler about when threads blocking in GetQueuedCompletionStatus should be allowed to run. Normally, many threads will run loops similar to the one you show, usually at least twice as many threads as there are CPUs in the system. If the scheduler allowed the blocking call to complete immediately whenever an IO completed, there could be many threads competing for a limited set of CPU resources, and the ensuing context switches would reduce throughput. But if handling the IO completion results in blocking calls (WaitForSingleObject, another IO etc.), more CPU resources become available and other threads can do useful work. It’s only a hint because the scheduler can’t release threads that aren’t waiting in GetQueuedCompletionStatus, so it might not get enough threads running, and it doesn’t control when threads wake up from other blocking calls, so it might get too many threads running. The scheduler can’t control the waking of those other threads because it does not understand the dependencies between locks that could cause a deadlock.

You do need to supply a valid OVERLAPPED struct to any call involving a handle opened with FILE_FLAG_OVERLAPPED. As Tim points out, most of the time the system will fix it for you by falling back to sync IO, but there are some edge cases. I looked at this about 10 years ago and I don’t remember the exact details.

I don’t know anything about setting the low bit of an event handle suppressing IOCP notification. Handle values are meant to be opaque, so I would be nervous about relying on that behaviour. Microsoft did eventually document the Internal and InternalHigh members of OVERLAPPED, so it is possible that this will be documented too. But there is actually no need. Given that you will pass an OVERLAPPED struct to every call anyway, you can just as easily pass an AsyncNotificationContext struct to every call and set some value to indicate that the completion should be ignored. Maybe as easy as setting funcPtr to NULL.
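The "set funcPtr to NULL" idea can be sketched as a small dispatch helper. This is only an illustration under assumptions: `FakeOverlapped` is a stand-in for the real OVERLAPPED so the sketch compiles anywhere, and `dispatchCompletion` is a hypothetical name that does not appear in the poster's code.

```cpp
#include <cstddef>

// Stand-in for the Win32 OVERLAPPED structure (assumption, for portability).
struct FakeOverlapped {
  void* hEvent = nullptr;
};

using AsyncNotificationFunctionPtr = void (*)(void* ref);

struct AsyncNotificationContext {
  FakeOverlapped Overlapped{};  // Must stay the first member for the cast below.
  AsyncNotificationFunctionPtr funcPtr = nullptr;
  void* ref = nullptr;
};

static_assert(offsetof(AsyncNotificationContext, Overlapped) == 0,
              "Overlapped must be the first member");

// Hypothetical helper: runs the callback for one dequeued completion.
// Returns true if a callback ran, false if the completion was ignored.
bool dispatchCompletion(FakeOverlapped* overlapped) {
  if (overlapped == nullptr) {
    return false;  // Timeout or failure without a dequeued packet.
  }
  auto* context = reinterpret_cast<AsyncNotificationContext*>(overlapped);
  if (context->funcPtr == nullptr) {
    return false;  // Completion of a request nobody wants a callback for.
  }
  context->funcPtr(context->ref);
  return true;
}
```

With this shape, a request that only needs to be waited on elsewhere can still use the same context type; the loop simply drops its completion packet.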

This brings up another point. Exiting the GetQueuedCompletionStatus loop should normally not be done by testing a boolean, but by using PostQueuedCompletionStatus, passing a special value for the completion key that tells your thread to exit. This allows the thread to exit even when there are no other IO completions to process. Otherwise the thread will continue to block in GetQueuedCompletionStatus and never test that bool.
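The sentinel-key shutdown can be modelled in portable C++. This is a sketch of the pattern only, not Win32 code: `CompletionPort`, `post`, and `wait` are hypothetical stand-ins for an IOCP, PostQueuedCompletionStatus, and GetQueuedCompletionStatus with an INFINITE timeout, and `kExitKey` is an arbitrary value chosen for illustration.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>

// Hypothetical stand-in for an I/O completion port: a blocking queue of keys.
class CompletionPort {
 public:
  // Models PostQueuedCompletionStatus: enqueue a packet and wake one waiter.
  void post(std::uintptr_t key) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      packets_.push(key);
    }
    ready_.notify_one();
  }

  // Models GetQueuedCompletionStatus with no timeout.
  std::uintptr_t wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !packets_.empty(); });
    const std::uintptr_t key = packets_.front();
    packets_.pop();
    return key;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<std::uintptr_t> packets_;
};

// Assumed sentinel: a key that no real handle is associated with.
constexpr std::uintptr_t kExitKey = ~static_cast<std::uintptr_t>(0);

// Worker loop: blocks without a timeout and leaves only on the sentinel.
// Returns the number of ordinary packets it handled.
int runWorker(CompletionPort& port) {
  int handled = 0;
  for (;;) {
    const std::uintptr_t key = port.wait();
    if (key == kExitKey) {
      break;  // Shutdown requested; no flag polling needed.
    }
    ++handled;  // Real code would dispatch the completion here.
  }
  return handled;
}
```

Because the exit request travels through the same queue as the completions, the worker wakes immediately for it and no polling timeout is required.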

I know this is not the complete code, but you probably do not want either of the two synch IOs to be synch at all. In the case of indicating that a buffer has been consumed, do you really need the driver to confirm that it now considers the buffer available before checking if there is another buffer filled with data? In fact the IOCTL_AUDIO_ARM_CALLBACK & IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED can be combined into one that gets called after the UM app has consumed the data and makes the buffer available to the driver again. Obviously for this to work, the driver needs a queue of pending requests so that it can fill another buffer while UM is busy consuming the data from this one.

The other synch IO is more dangerous. The stopAudio is coded in a way that implies that it should not return until IO has stopped. But that’s not the same moment as when that IOCTL completes. Even if the driver stops servicing requests before completing this IOCTL, requests that it has already completed may still be running in your worker thread. If the code after this proceeds to deallocate resources, you might get exceptions or memory corruption. Usually, this is handled either by using rundown protection (counting pointer instances and not deallocating until the count reaches 0) or by waiting for the worker thread(s) to exit. Obviously I don’t know what would make sense in your design.
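A minimal sketch of the rundown-protection idea, in portable C++. The class name `Rundown` and its methods are hypothetical and not taken from the poster's code or from any Windows API; the point is only to show the counting scheme described above.

```cpp
#include <condition_variable>
#include <mutex>

// Callbacks acquire before touching shared state and release afterwards.
// Teardown blocks new acquisitions and waits for the in-flight count to hit 0.
class Rundown {
 public:
  // Returns false once teardown has begun; the caller must then do nothing.
  bool tryAcquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (closing_) {
      return false;
    }
    ++active_;
    return true;
  }

  // Must be called exactly once for every successful tryAcquire().
  void release() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--active_ == 0) {
      idle_.notify_all();
    }
  }

  // Called by the stopping thread before it frees anything.
  void waitForRundown() {
    std::unique_lock<std::mutex> lock(mutex_);
    closing_ = true;
    idle_.wait(lock, [this] { return active_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable idle_;
  int active_ = 0;
  bool closing_ = false;
};
```

In this scheme the completion callback would wrap its work in tryAcquire()/release(), and a stop routine would call waitForRundown() after the stop IOCTL completes and before releasing buffers.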

I don’t know anything about setting the low bit of an event handle suppressing IOCP notification.

This one is actually true. Real handle values are always a multiple of 4, so there are two “spare” low bits for out-of-band communication. If the low bit of the event handle is set, the system skips the completion port notification. It’s actually documented in the page for GetQueuedCompletionStatus.
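The bit manipulation involved is small enough to show in isolation. The sketch below uses a `Handle` alias as a stand-in for the Win32 HANDLE so it compiles anywhere, and the helper names are hypothetical. On Windows the same masking is applied to the value stored in the hEvent member of OVERLAPPED, as in the poster's snippet earlier in the thread.

```cpp
#include <cstdint>

// Stand-in for a Win32 HANDLE (assumption, for portability).
using Handle = void*;

// Returns the handle value with its low-order bit set. This is the value to
// store in hEvent when no completion packet should be queued.
inline Handle withLowBitSet(Handle handle) {
  return reinterpret_cast<Handle>(reinterpret_cast<std::uintptr_t>(handle) |
                                  static_cast<std::uintptr_t>(1));
}

// Returns the original handle value, suitable for closing the event.
inline Handle withLowBitCleared(Handle handle) {
  return reinterpret_cast<Handle>(reinterpret_cast<std::uintptr_t>(handle) &
                                  ~static_cast<std::uintptr_t>(1));
}

// Reports whether the low-order bit is currently set.
inline bool hasLowBitSet(Handle handle) {
  return (reinterpret_cast<std::uintptr_t>(handle) & 1u) != 0;
}
```

Keeping the untagged handle in a separate variable and only tagging the copy placed in the OVERLAPPED avoids having to clear the bit again before closing the event.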

@Mark_Roddy

You could just use separate handles, right?

I have some context in the driver that is allocated per handle (multiple user mode apps can do CreateFile), and all of these calls need to access that context.

I need to read up a bit on using the thread pool, but the part in my example marked with // Do processing, omitted. can only handle one thread at a time, so I was thinking that using only one thread is a simple solution.

But there is actually no need. given that you will pass an OVERLAPPED struct to every call anyways, you can just as easily pass an AsyncNotificationContext struct to every call and set some value to indicate that the completion should be ignored. Maybe as easy as setting funcPtr to NULL

Yes, this was my original idea but it seems that the IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED can get stuck if I do not set the lower bit to avoid IOCP completion and call GetOverlappedResult.

In fact the IOCTL_AUDIO_ARM_CALLBACK & IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED can be combined into one that gets called after the UM app has consumed the data and makes the buffer available to the driver again. Obviously for this to work, the driver needs a queue of pending requests so that it can fill another buffer while UM is busy consuming the data from this one

Great idea! I will try this. There is a queue of pending requests already in the driver. The reason for having IOCTL_AUDIO_ARM_CALLBACK and IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED in the first place is that when the user mode app starts, it does so by only calling IOCTL_AUDIO_ARM_CALLBACK to queue up requests, and when it is notified about the completion of one of those it also writes data to a buffer that is shared between the app and the driver. So, the call to IOCTL_AUDIO_MARK_BUFFER_AS_CONSUMED lets the driver know that the shared buffer contains data that should be sent to the USB device (which is not the case the first time the user mode app calls IOCTL_AUDIO_ARM_CALLBACK). But I can solve this by adding a bool to the message that is sent in IOCTL_AUDIO_ARM_CALLBACK (currently it only contains the index of the buffer).

There is an example of how to set the low-order bit to avoid IOCP completion in Windows via C/C++, page 327. In that example they also reset the bit before closing the handle.

Also:
I have realized that my IOCTL_AUDIO_STOP example is wrong too. Since I don’t set the low-order bit to avoid IOCP completion, there will be a completion packet, which will result in a call to funcPtr, which is not set to any valid function pointer.

This brings up another point. Exiting the GetQueuedCompletionStatus loop should normally not be done by testing a boolean, but by using PostQueuedCompletionStatus passing a special value for the completion key that tells your thread to exit. This allows the thread to exit even when there are no other IO completions to process. Otherwise the thread will continue to block in GetQueuedCompletionStatus and never test that bool

Ah, yes, that is a good idea. Currently I pay the price of checking that bool every time GetQueuedCompletionStatus returns, and I still need to wait up to 1000 ms if I want to shut down.

Well, I learn new things all the time. It is actually documented that the low bit of the event handle will suppress completion. Thanks!

Even if your UM algorithm can’t be multi-threaded, there is an advantage to having multiple threads calling GetQueuedCompletionStatus. Other unrelated completions can be handled by those threads.

But beware, the IO completion order is not assured between threads. Even if it were, the thread that dequeues an IO completion can immediately be preempted and another thread scheduled. This means that either your overlapped context, or the data itself, must have a way of identifying the next chunk that you should process if you pend multiple reads on the same handle and the order of processing them matters.
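One way to identify the next chunk is a sequence number carried in the per-request context. The sketch below is a portable illustration under assumptions: the `InOrderDispatcher` class, its method name, and the use of an `int` payload are hypothetical, and the sequence number is assumed to be assigned when each request is issued.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

// Reorders completions that may be dequeued out of order by several threads.
class InOrderDispatcher {
 public:
  // Accepts one completion and returns every payload that is now ready to be
  // processed, in sequence order. Safe to call from multiple threads.
  std::vector<int> onCompletion(std::uint64_t sequence, int payload) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.emplace(sequence, payload);

    std::vector<int> ready;
    // Drain the run of consecutive sequence numbers starting at next_.
    for (auto it = pending_.find(next_); it != pending_.end();
         it = pending_.find(next_)) {
      ready.push_back(it->second);
      pending_.erase(it);
      ++next_;
    }
    return ready;
  }

 private:
  std::mutex mutex_;
  std::uint64_t next_ = 0;               // Next sequence number to release.
  std::map<std::uint64_t, int> pending_; // Completions that arrived early.
};
```

A completion that arrives early is parked until the gap before it is filled, so the consumer always sees chunks in issue order regardless of which thread dequeued them.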

I have posted here about this issue a few times over the years, but it is essentially unfixable now, and probably excusable since the NT kernel was designed at a time when 2 CPU cores were rare and 4 a lot - much less 8!

There is another discussion to be had about how this should be done now that the hardware is no longer SMP or even NUMA, but asymmetric, with performance and efficiency cores. But that’s a much larger conversation.

Yes, in my application it is important that the IOCP events are processed in the same order as they are completed by the driver, so I’ll stick with this for now :)