Thread switch while a lock is held

Hi guys,
This is kind of a general question on locking methods. In my system I often see spikes in the time my code spends inside a lock. Since in normal circumstances that code just flies (exec time is less than a tick), and all code and data touched inside the lock are NonPaged, I attribute these spikes to a thread switch. Unfortunately, I have a pretty steady stream of events in dozens (sometimes hundreds) of threads that must be protected with that lock. So even a small disruption (like a thread switch) quickly (in seconds) clogs me up, because at some point these thread-switch delays accumulate, and suddenly I have a whole line of threads blocked on the lock, and it takes a considerable amount of time to clear that clog, just to come back to it a couple of seconds later.
So, is there a "common wisdom" on how to handle that scenario? Once again, it's pretty much guaranteed that without a thread switch my code will release the lock in a matter of nanoseconds. I'm thinking of resorting to spinlocks (instead of an ERESOURCE), since a held spinlock prevents the running thread from being switched out, but maybe there is another solution?

TIA,

Vladimir

If you have hundreds of threads, that in and of itself will be a perf bottleneck. Having more threads than processors (let's assume you affinitize properly) does you no good.

d

Sent from a phone with no keyboard


Thanks, Doron!
Yes, I understand that, but I don't control the number of threads that are hitting me: I don't spawn them, I'm just serving their requests, trying to interfere as little as possible. Still, what are the recommendations for the scenario I described?

I assume you’re asking about a mutex or its equivalent.

If a thread tries to acquire an owned mutex, it will be put into a sleeping state. If there is a lot of contention, many threads will be waiting for the mutex.

If you acquire the lock often enough for a thread switch to be of noticeable cost, you may want to change to spinlocks. Make sure all your code is non-paged then.

Normally, it doesn't make sense to have an excessive number of worker threads, i.e., more than the number of processors.

Also, if your average CPU usage is high enough, a thread that was waiting for a mutex may not be woken up immediately after the mutex is released. Or a thread may get preempted while owning the mutex. You can use an auto-reset (synchronization) event as a simple non-reentrant mutex, which lets you boost the priority of the thread that is woken up as a result of the lock release.
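For illustration, a minimal sketch of such an event-based lock (callers must be at IRQL < DISPATCH_LEVEL; the names are mine):

KEVENT g_Lock;   // the "mutex": signaled means free

VOID LockInit(VOID)
{
    // A SynchronizationEvent auto-resets: exactly one waiter gets through.
    KeInitializeEvent(&g_Lock, SynchronizationEvent, TRUE);
}

VOID LockAcquire(VOID)
{
    KeWaitForSingleObject(&g_Lock, Executive, KernelMode, FALSE, NULL);
}

VOID LockRelease(VOID)
{
    // EVENT_INCREMENT gives a priority boost to the thread that wakes,
    // making it more likely to run (and release the lock) promptly.
    KeSetEvent(&g_Lock, EVENT_INCREMENT, FALSE);
}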

A spin lock certainly has a low cost of acquisition when NOT contended. If that's the pattern you expect, you don't mind paying the cost of contended acquisition, your code and data are non-paged, and you don't need reader/writer semantics, you WANT a spin lock… not an ERESOURCE. But the two are very different types of locks.

If you really DO want the ERESOURCE, have you considered something as simple as boosting thread priority while holding the lock? One way to look at the problem is to try to determine WHY the thread holding the lock is being rescheduled. What higher-priority activity is taking place? The solution could lie in simple priority manipulation.

Peter
OSR
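As a sketch of that priority-manipulation idea (assuming PASSIVE_LEVEL callers; the chosen priority value and the helper names are assumptions):

ERESOURCE g_Resource;   // assume ExInitializeResourceLite was called at init

VOID AcquireBoosted(KPRIORITY *OldPriority)
{
    // Raise our priority first, so we are less likely to be preempted
    // while owning the lock; remember the old priority to restore later.
    *OldPriority = KeSetPriorityThread(KeGetCurrentThread(),
                                       LOW_REALTIME_PRIORITY - 1);
    KeEnterCriticalRegion();    // ERESOURCE requires normal APCs disabled
    ExAcquireResourceExclusiveLite(&g_Resource, TRUE);
}

VOID ReleaseBoosted(KPRIORITY OldPriority)
{
    ExReleaseResourceLite(&g_Resource);
    KeLeaveCriticalRegion();
    KeSetPriorityThread(KeGetCurrentThread(), OldPriority);
}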

Sorry, I guess there is something else wrong. Looks like I may be wrong in the way I calculate time. I use KeQueryTickCount to get timestamps, and it seems that sometimes KeQueryTickCount "sleeps" for 156001 100-ns units. I.e., two immediately subsequent calls to KeQueryTickCount give me results that are 15,600 µs apart, even if it's under a spinlock:
…
KSPIN_LOCK spin_locker;
KeInitializeSpinLock(&spin_locker);
KLOCK_QUEUE_HANDLE lock_handle;
KeAcquireInStackQueuedSpinLock(&spin_locker, &lock_handle);
const __int64 start = GetTickCount();
__int64 cur = start;
while ((cur - start) < 300000000) {            // 30 seconds, in 100-ns units
    __int64 next = GetTickCount();
    if ((next - cur) >= 10000) {               // a gap of 1 ms or more
        DbgPrint("Hiccup: %I64d\n", next - cur); // >>> always 156001
        break;
    }
    cur = next;
}
KeReleaseInStackQueuedSpinLock(&lock_handle);
…
__int64 GetTickCount() {
    LARGE_INTEGER tick;
    KeQueryTickCount(&tick);
    return tick.QuadPart * KeQueryTimeIncrement();  // ticks -> 100-ns units
}

A similar thing happens if I use KeQueryPerformanceCounter.

Am I breaking any rules here (apart from trying to hold a spinlock for 30 seconds)? :)

KeQueryTickCount has a granularity of one timer tick, which is 10 to 15.6 ms. KeQueryPerformanceCounter may have 1 µs granularity, or sometimes one CPU clock. But it is only meaningful if you don't get a thread switch in between.

xxxxx@gmail.com wrote:

> I use KeQueryTickCount to get timestamps, and it seems that sometimes KeQueryTickCount "sleeps" for 156001 100-ns units … even if it's under a spinlock:

KeQueryTickCount is only updated at interval timer interrupts, which
only run every 15.6 ms (this varies depending on options and versions).
The difference between two consecutive readings is always going to be
either 0 or 15.6ms. If you need sub-millisecond timing, the tick
counter is a very poor choice…

KeQueryPerformanceCounter should not have that problem.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
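A minimal sketch of interval timing with KeQueryPerformanceCounter instead (converting the difference of the raw counts avoids overflow in the multiply):

VOID MeasureSomething(VOID)
{
    LARGE_INTEGER freq, start, end;
    __int64 elapsedUs;

    start = KeQueryPerformanceCounter(&freq);  // freq = counts per second

    // ... the work being timed ...

    end = KeQueryPerformanceCounter(NULL);     // frequency pointer is optional
    elapsedUs = ((end.QuadPart - start.QuadPart) * 1000000) / freq.QuadPart;
    DbgPrint("Elapsed: %I64d us\n", elapsedUs);
}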

Ooops :) It's shameful! Sorry guys, I knew that KeQueryPerformanceCounter has better granularity, but I was under the impression (from simply reading, or rather misreading, the docs) that KeQueryTickCount has a 100-ns granularity (which would have been enough for my purposes). Dropped and gave twenty. Thanks for your help!

> If you really DO want the ERESOURCE have you considered something as simple as boosting thread priority while holding the lock?

Well, this is something from the field of RTOS (i.e., priority ceiling and priority inheritance protocols), where this manipulation is applied by the dispatcher itself as a part of lock acquisition. However, under Windows it will require at least 2 additional calls to the dispatcher, which may be quite a performance hit on a system-wide basis if the contention for the section is as high as the OP claims it to be. Another problem with this approach is that you are limited strictly to PASSIVE_LEVEL, so you cannot use it if the caller already holds, say, a FAST_MUTEX or has acquired an ERESOURCE…

Anton Bassov

Well, no. It’ll require two SYSTEM SERVICE calls, which (depending on expected previous mode) could have as little overhead as a simple call/return each (for the NtXxxx variant). THAT was my thinking here.

Of course now, given what the OP said of his situation, I’m not sure WHAT’s going on. But I’m sure the OP has to acquire some proper data before we can provide him any additional useful guidance.

Peter
OSR

OK, so my situation is as such.
I have a sorted array of "events" (of a file system nature) that I acquire as they happen. Then I have an async event handler that pulls events from the array and handles them. Essentially, there are 3 operations on the array: "insert", "lookup", and "delete". The first and the last require an exclusive lock, while the second requires a shared lock. Insert is (most of the time) nearly a no-op, as I don't have to do any lookups, but just add an entry at the bottom of the array: array[++num_events] = new_event. Occasionally (and normally very rarely) I have to grow the array buffer, if events are thrown in faster than they are pulled out, but, again, this doesn't happen too often. The lookup operation is essentially a binary search on that sorted array, which usually contains up to 512 entries; i.e., it takes up to 9 comparisons to complete the lookup. Again, virtually a no-op. Deletion is "lookup" plus moving the array buffer to cover the deleted entry. Considering the size of the array, not too bad.
To be quite honest, I was not sure that ERESOURCE is the proper lock in this case. Considering that exclusive locks happen twice as often as shared locks, chances are that the ERESOURCE will almost always be held "virtually exclusive", because shared lockers will be lined up behind exclusive waiters (and I'm reluctant to use "shared, starve exclusive" locking in this case).

As for the original question - again, I was under the wrong impression that the large spikes in time spent "inside the lock" are due to a thread switch. But still, it was interesting to hear opinions on how to deal with thread switching inside a "global" lock.
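For reference, a minimal sketch of the shared/exclusive pattern being weighed here (the array helpers BinarySearchById and AppendToArray are hypothetical placeholders, not the OP's actual code):

ERESOURCE g_ArrayLock;  // ExInitializeResourceLite(&g_ArrayLock) at init time

// Lookup takes the lock shared: concurrent readers can search in parallel.
PVOID LookupEvent(__int64 EventId)
{
    PVOID Event;
    KeEnterCriticalRegion();                 // ERESOURCE requires APCs disabled
    ExAcquireResourceSharedLite(&g_ArrayLock, TRUE);
    Event = BinarySearchById(EventId);       // hypothetical helper
    ExReleaseResourceLite(&g_ArrayLock);
    KeLeaveCriticalRegion();
    return Event;
}

// Insert and delete take the lock exclusive.
VOID InsertEvent(PVOID Event)
{
    KeEnterCriticalRegion();
    ExAcquireResourceExclusiveLite(&g_ArrayLock, TRUE);
    AppendToArray(Event);                    // hypothetical helper
    ExReleaseResourceLite(&g_ArrayLock);
    KeLeaveCriticalRegion();
}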

If you just want to use an exclusive lock, and if you think the priority of the thread taking the lock is lower than that of other threads, then the only option is a priority bump. So if the priorities of the contending threads are the same to begin with, it is probably not a good idea to increase them, assuming those threads do not have an extraneous wait (by way of I/O access or some such) while holding the lock.

If the write/update frequency is twice the non-destructive op frequency, then sure, a shared lock is not a good choice. But you will have to consider the weight of the processing. For example, if the read operation is fast and non-destructive, you might try a shared lock, since the destructive operations would not have to wait long!

If you have multi-core/hyper-threaded hardware, then using a spinlock will save the scheduling time, but, as was already pointed out, too many threads spoil the party.

It is not clear how the events are sorted in the array, and who does it. If the events are from a set of {0, …, 255}, then yes, I can understand that they can be sorted in O(1): basically, drop each event into its correspondingly indexed bucket. But that does not seem to be the case here.

So, IMO, you will have to try different things to understand the behavior of your implementation a bit more before you decide what is best in your case.

-pro


Events are sorted exactly as you described:
The array generates a unique "event ID" by simply incrementing some __int64 next_event_id. This ID is then used to sort events in the array. By the nature of the event ID generator, it is guaranteed that a new event added to the end of the array will not break the order. Here is how "insert" works:

LockArrayExclusive();
new_event->event_id = ++next_event_id;
if (elements_in_array == max_in_array) GrowArray(); // <- this is a very rare case
array[elements_in_array++] = new_event;
UnlockArray();

Can you explain:

  1. Do you have to handle the events in the order they are generated, or in some different order? If not, what is the reason for sorting them?
  2. How big is the event data?

The order in which events are handled is not important. I have a bunch of worker threads that pick up events from the array, and which event gets handled by which worker thread, and when, is at the mercy of the scheduler. That's OK, because the events are independent of each other.
Total event data may vary (again, these are file system events, so a read/write buffer may be quite large). However, the event descriptors stored in my array are in the range of 64-128 bytes.

Why do you need to sort the events, then, and remove them from the middle of the array? You can just have a simple circular FIFO buffer. Or a list.
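A minimal sketch of such a circular FIFO, assuming a fixed capacity and external locking (all names here are illustrative):

#define QCAP 512    // illustrative capacity

typedef struct _EVENT_QUEUE {
    PVOID Slots[QCAP];
    ULONG Head;     // next slot to consume
    ULONG Tail;     // next slot to fill
} EVENT_QUEUE;

// Returns FALSE when full; unsigned wraparound keeps Tail - Head correct.
BOOLEAN Enqueue(EVENT_QUEUE *q, PVOID e)
{
    if ((q->Tail - q->Head) == QCAP) return FALSE;   // full
    q->Slots[q->Tail % QCAP] = e;
    q->Tail++;
    return TRUE;
}

// Returns NULL when empty.
PVOID Dequeue(EVENT_QUEUE *q)
{
    PVOID e;
    if (q->Head == q->Tail) return NULL;             // empty
    e = q->Slots[q->Head % QCAP];
    q->Head++;
    return e;
}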

Ah! There is no particular order in which events are handled; i.e., an event from the middle of the array can be completed before the events that surround it. A list doesn't work for me either, because there is more than one way to complete an event. For instance, an event can be associated with an IRP that gets pended by my driver. So the event should be cancelled when the IRP is cancelled, and at the same time the event should be completed by my handler. There is a race over who owns the event (handler, or IRP cancellation) after the event has been saved in the array. The easiest (and most efficient) way to solve this is to address the event by a unique ID rather than by pointer: whoever was able to pull that event by its ID is now the owner who ought to complete it one way or another.
But I'm not sure how this is relevant to the original question.
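A minimal sketch of that claim-by-ID race resolution (the lock helpers are the OP's names from the insert code above; LookupById and RemoveFromArray are hypothetical): whichever caller successfully unlinks the event owns it; the loser sees NULL and backs off.

PVOID ClaimEventById(__int64 EventId)
{
    PVOID Event;
    LockArrayExclusive();             // same exclusive lock as "insert"
    Event = LookupById(EventId);      // binary search; NULL if already claimed
    if (Event != NULL) {
        RemoveFromArray(Event);       // unlink: we are now the sole owner
    }
    UnlockArray();
    return Event;                     // NULL => the other path won the race
}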

So what you said before, that it would take 9 visits to get to the element, is not quite right. Why? Over a long run your array will be fragmented, in the sense that some elements (pointers to events) will have been served. And your array will grow. I asked whether it is bounded or not; in this case it is not…

Wild guess …

Events will come randomly; you will create a new_event struct, give it an event ID, and put the event pointer into the next available slot. You could have some slots before the current index that have already been served/consumed, so they could be free.

The event server threads are to handle random events, not a specific subset of events… These are all decision points to me.

For example, let's take an approach, though it may not suit your purpose …

Make a fixed array [256] of pointers to events.

As events come in, you create your structure, give it your next ID, then put it into a bucket using [event_id % 256]. Here 256 should not be hard-coded. When you index this way, make sure you check whether the slot is empty or not. If it is empty, put the event there; if not, chain it to the end of the list. So, basically, a bucket. Then searching would be like the one you expected, provided there is not much backlog.

Now you can use some kind of lock partitioning…

Let's say you have a way to find out the number of processors in the system, and whether it is hyper-threaded or not. Based on that, you can find out how many truly concurrent threads can run at any given time. Let's say 4 threads (dual-core, hyper-threaded). Now you can partition the locks into, say, four exclusive locks, and ask the threads to serve accordingly…
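A minimal sketch of that bucketed table with partitioned locks (a hash chain per slot, each slot with its own spin lock; all names and the bucket count are illustrative):

#define NBUCKETS 256    // illustrative; could be derived from load or CPU count

typedef struct _EVENT_ENTRY {
    struct _EVENT_ENTRY *Next;
    __int64 EventId;
    // ... event descriptor ...
} EVENT_ENTRY;

typedef struct _BUCKET {
    KSPIN_LOCK Lock;    // KeInitializeSpinLock on each bucket at init
    EVENT_ENTRY *Chain;
} BUCKET;

BUCKET g_Table[NBUCKETS];

VOID InsertEvent(EVENT_ENTRY *e)
{
    BUCKET *b = &g_Table[e->EventId % NBUCKETS];
    KIRQL irql;
    KeAcquireSpinLock(&b->Lock, &irql);
    e->Next = b->Chain;            // push onto this bucket's chain
    b->Chain = e;
    KeReleaseSpinLock(&b->Lock, irql);
}

// Find-and-unlink: only the caller that gets a non-NULL result owns the event,
// so unrelated IDs contend only within their own bucket.
EVENT_ENTRY *ClaimEvent(__int64 Id)
{
    BUCKET *b = &g_Table[Id % NBUCKETS];
    EVENT_ENTRY **pp, *e = NULL;
    KIRQL irql;
    KeAcquireSpinLock(&b->Lock, &irql);
    for (pp = &b->Chain; *pp != NULL; pp = &(*pp)->Next) {
        if ((*pp)->EventId == Id) {
            e = *pp;
            *pp = e->Next;         // unlink from the chain
            break;
        }
    }
    KeReleaseSpinLock(&b->Lock, irql);
    return e;
}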

But that is just one option, and it definitely needs to be thought through …

So yes, the design point is not clear yet…

If you could spec out the design, I'm sure a lot of people here would point you in the right direction(s).

-pro
