Hello.
Actually I’m writing a user-mode application, but we have run into a situation that may well be a kernel problem.
Background: I’m writing a user-mode application that acts as a server for a network redirector driver. The driver maps client requests to our custom protocol and sends them to the server. The server executes the requests and sends the results back. We may say that an NtReadFile API call on the client side maps to a corresponding NtReadFile API call on the server side. Other API calls are mapped in the same manner.
The server has a thread pool based on an I/O completion port. One of two functions may be executed when the NtRemoveIoCompletion routine removes a completion notification from the IOCP queue:
#define DEADLOCK_TIMEOUT 15000

void Func1()
{
    DWORD waitResult = WAIT_FAILED;
    do {
        waitResult = WaitForSingleObjectEx(mutex1, DEADLOCK_TIMEOUT, TRUE);
    } while (WAIT_IO_COMPLETION == waitResult);
    if (WAIT_OBJECT_0 != waitResult)
    {
        // Deadlock, restart Func1
    }
    // Do something useful
    ReleaseMutex(mutex1);
}

void Func2()
{
    HANDLE handlesToWaitFor[] = { mutex1, mutex2 };
    DWORD waitResult = WAIT_FAILED;
    do {
        // bWaitAll = TRUE: wait until both mutexes can be acquired together
        waitResult =
            WaitForMultipleObjectsEx(NUMBER_OF(handlesToWaitFor), handlesToWaitFor, TRUE, DEADLOCK_TIMEOUT, TRUE);
    } while (WAIT_IO_COMPLETION == waitResult);
    if (WAIT_OBJECT_0 != waitResult)
    {
        // Deadlock, restart Func2
    }
    // Do something useful (1)
    ReleaseMutex(mutex2);
    // Do something useful (2)
    ReleaseMutex(mutex1);
}
Note that at the “Do something useful (2)” point, threads may call the NtReadFile/NtWriteFile/NtQueryDirectoryFile routines in asynchronous mode, BUT completion of these operations is reported using IOCP notifications.
The two deadlock guards allow a request to be restarted if a thread is not able to take mutex ownership. If that happens, the failed request is returned to the IOCP queue by means of a PostQueuedCompletionStatus call.
The problem: the ReleaseMutex(mutex2) call in Func2 can take up to DEADLOCK_TIMEOUT milliseconds to execute. I tried several values for DEADLOCK_TIMEOUT. Each time, the ReleaseMutex call took either nanoseconds or DEADLOCK_TIMEOUT milliseconds (i.e. 10-15 seconds!). The situation is rare, but absolutely inexplicable to me.
I can also say that waiting in an alertable state is redundant in this case. I don’t issue any asynchronous calls with an APC callback, and I never get a WAIT_IO_COMPLETION result from any of my wait function calls.
BTW, I was able to reproduce this only under Windows XP Professional on a P4 computer with hyper-threading. Windows 2000 and Windows 2003 work fine for me (on the same computer).
Has anyone encountered such long execution of ReleaseMutex/NtReleaseMutant? What could be the reason for it? Thanks for any help.
Best regards,
Alexey.