High CPU Utilization in WaitForSingleObject

I have a driver and a user-mode service that exchange packets. I am seeing unexpectedly high CPU utilization (~7% to 10% of a single core) in WaitForSingleObject, and the profiler shows a lot of spin time there. Can somebody please suggest ways to optimize the following code?

    // **UserLand Code**
    void OpenDevice()
    {
        // Last parameter is hTemplateFile and should be NULL, not INVALID_HANDLE_VALUE
        m_DeviceHandle = CreateFile( DEVICE_NAME, MAXIMUM_ALLOWED, 0, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL, NULL );
    }
    void workerThread()
    { 
        HANDLE hIoEvent = CreateEvent(NULL, FALSE, TRUE, NULL); // auto-reset, initially signaled, unnamed
        OVERLAPPED ol;
        memset(&ol, 0, sizeof(OVERLAPPED));
        ol.hEvent = hIoEvent;
        do { 
            DeviceIoControl(m_DeviceHandle, CONTROL_CODE( MY_ASYNC_EVENT, METHOD_OUT_DIRECT), NULL, 0, outBuf, sizeof(outBuf), &returnedCount, &ol);            
            ret = WaitForSingleObject(hIoEvent, INFINITE); // **High CPU utilization (spin) reported here**
            switch(ret)
            {
            case WAIT_OBJECT_0:
                // data available
                GetOverlappedResult(m_DeviceHandle, &ol, &bytesRead, FALSE); 
                break;
            }
        } while(true);
        CloseHandle(hIoEvent);
    } 

    // **Driver code** 
    static
    NTSTATUS IoctlDeviceIoControl(PIRP  pIrp,   PIO_STACK_LOCATION   pIrpStack, PULONG  pOutBufLen)
    {
        switch (ioFunctionCode) {
            case MY_ASYNC_EVENT: 
                IoMarkIrpPending(pIrp);        
                IoSetCancelRoutine(pIrp, EventCancelIRP);                
                // Get the packet from Queue and update pIrp                
                IoSetCancelRoutine(pIrp, NULL);
                IoCompleteRequest(pIrp, IO_NO_INCREMENT);
            break;
        }
    }

I have also tried using an I/O completion port; the time previously spent in WaitForSingleObject now shows up in GetQueuedCompletionStatus.

You’re sure you’re seeing this time in the WaitForSingleObject code and not something you’re doing as a result of waking? Seems… unlikely.

I’m not sure it matters, but the driver code snippet you provided doesn’t make any sense. At all. What does “get the packet from the queue” mean? Why are you synchronously completing an IRP that you marked pending? Why are you writing a WDM driver?

Peter

I don’t doubt that you’re spending ELAPSED time in WaitForSingleObject and GetQueuedCompletionStatus, but not CPU time. What profiler are you using?
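Peter’s distinction between elapsed time and CPU time can be demonstrated with a small portable sketch (standard C++ rather than the Windows API, purely for illustration): a thread that truly blocks on a synchronization primitive accumulates wall-clock time but essentially no CPU time.

```cpp
#include <chrono>
#include <condition_variable>
#include <ctime>
#include <mutex>
#include <thread>
#include <utility>

// Blocks a worker thread on a condition variable for roughly sleep_ms
// milliseconds, then returns {elapsed wall time, process CPU time},
// both in milliseconds. A true blocking wait should show wall time
// close to sleep_ms but near-zero CPU time.
std::pair<double, double> measure_blocking_wait(int sleep_ms) {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;

    std::clock_t cpu_start = std::clock();               // process CPU time
    auto wall_start = std::chrono::steady_clock::now();  // wall-clock time

    std::thread waiter([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready; });  // true blocking wait: no spinning
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms));
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
    }
    cv.notify_one();
    waiter.join();

    double wall_ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - wall_start).count();
    double cpu_ms = 1000.0 * double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    return {wall_ms, cpu_ms};
}
```

If a profiler attributes significant *CPU* time (not just elapsed time) to a wait call like this, that usually points at the profiler’s accounting or at work done immediately after waking, not at the wait itself.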


I am using Intel VTune Amplifier.
@“Peter_Viscarola_(OSR)” I omitted a lot of driver code from the snippet above to keep my query concise (assuming the omitted code is not causing the CPU issue).
By “get the packet from the queue”: we maintain a packet list of packets captured at the driver level, which are then sent to userland. The IRP is completed asynchronously. (I can add more driver details if that helps to solve the issue.)

That snip from VTune shows the time being spent in CreateFile, not in the wait.

I’m very confused.

Peter

@“Peter_Viscarola_(OSR)” Spin time of WaitForSingleObject is 3.264s (which is high considering the test sample).

From the VTune documentation on Spin Time:

Spin time is Wait Time during which the CPU is busy. This often occurs when a synchronization API causes the CPU to poll while the software thread is waiting. Some Spin Time may be preferable to the alternative of increased thread context switches. Too much Spin Time, however, can reflect lost opportunity for productive work.


Well, I don’t really know what they mean by “synchronization API causes the CPU to poll”, but the above seems to be of zero relevance to KeWaitForSingleObject(). At the kernel level, it may apply to Solaris/Illumos adaptive mutex acquisition, which involves waiter spinning as long as the mutex owner is in a running state. In the Windows kernel world it may, probably, (I am not so sure about it) apply to FAST_MUTEX acquisition, which is conceptually similar to userland critical-section locking: a limited number of spins before calling WaitForSingleObject(). However, I don’t think KeWaitForSingleObject() works this way…

Anton Bassov