Producer at DPC, Consumer at helper thread, puzzled

The decision is simply ‘put the next (some number of) request(s) on
list M, then start filling list M+1’. It is a set of buckets, one per
processor. The point is to eliminate the thundering herd problem with
waking up N(cpu) threads to all process a single list that has
transitioned from empty to not-empty. One lock per list, no global
lock at all. Only the producer is concerned with the bucketizing.

I wouldn’t even bother with more than one list unless the number of
CPUs on the system is sufficiently large.

Mark Roddy

On Tue, Dec 14, 2010 at 10:23 AM, Maxim S. Shatskih wrote:
>> Yeah, I remember that. That would simplify stuff. All the producer needs
>> to do is a round robin among N consumers.
>
> And?
>
> The producer will need to make a decision on - to what list to put the request.
>
> The decision will require some global data.
>
> Which must be protected by the lock.
>
> And yes, this lock will be the same bottleneck.
>
> –
> Maxim S. Shatskih
> Windows DDK MVP
> xxxxx@storagecraft.com
> http://www.storagecraft.com
>
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer
>

> Your assumption that as soon as you produce an item, and signal, it would be picked up by a consumer - Extremely strong assumption. I cannot dictate the OS to do just that.

No, I don’t assume that, and I don’t have to. What you speak about is just an absolutely ideal scenario in which two workers never contend for a spinlock. I don’t assume that it will always happen this way…

> It is possible that, within the number of NbOfConsumerThreads, more than one work item is queued before anyone gets a chance to serve, meaning a consumer serves right away.

The only thing I want to ensure is that every time an item gets inserted into the queue, one thread gets released (unless the number of outstanding items equals or exceeds the number of worker threads, of course). With such an approach, although you may end up with a situation where two workers contend for a spinlock, you will never end up with one where the released worker eventually discovers that it has nothing to do at the moment - and this is the only thing I want to ensure…

Anton Bassov

consumers are sleeping

producer is producing and signaling from DPC level.

Where is the guarantee that as soon as a producer is signalling, one of the worker threads would be processing?

-pro


> Where is the guarantee that as soon as a producer is signalling, one of the worker threads would be processing?

No guarantee whatsoever - as I told you already, I don’t assume that. The only thing that is needed is moving the sleeping thread from the WAITING state to the READY one, and this is the thing that my approach ensures…

Anton Bassov

Right, all of them would wake up, correct… And my assertion that there could be a leftover work item is wrong…

And we know it has the same thundering herd problem, though that has been pointed out already.

-pro


> Right, all of them would wake up, correct…

Well, probably not all of them - after all, the number of outstanding items may be less than that of the available workers…

> And we know it has the same thundering herd,

It depends on what you mean by “thundering herd”…

What makes it different from Max’s approach is that there are no idle invocations - if a thread gets released it can be 100% sure that it will find at least one item to process (although it may contend for a spinlock). In Max’s implementation, N threads wake up and contend for a spinlock…and N-1 of them may go back to sleep without having done anything useful. This is particularly true if the load is low. This is what I would call “thundering herd”. However, I don’t think this term applies to a situation when a few workers may have to contend for a spinlock once in a while…

Anton Bassov

On Dec 14, 2010, at 5:11 PM, xxxxx@hotmail.com wrote:

>> Right, all of them would wake up, correct…
>
> Well, probably not all of them - after all, the number of outstanding items may be less than that of the available workers…

It is possible that all the consumer threads are waiting for the event (only one event object). Once the event is in the signaled state, the OS will satisfy all of them, meaning now they are runnable…

Example: at the start (no work item), every consumer is waiting. Now let’s say just one work item is posted by the producer, and the event is in the signaled state. Every waiting thread will have its wait satisfied, right? So whoever gets the spinlock first will get the work item; after that, all the other threads will find no work item after getting hold of the spinlock. Now there is no work, and count < NbCustomerThreads, so each one will go back to wait on the event object, and then again one work item - this is “thundering herd”. As I said, it is for scalability. Perhaps I’ve been dealing with too many vCPUs :)

-pro


Man, I’m all wet now… Back to age 5 or below. Read the MSDN document.

A SynchronizationEvent is also called an autoreset or autoclearing event. When such an event is set, a single waiting thread becomes eligible for execution. The kernel automatically resets the event to the not-signaled state each time a wait is satisfied. A driver might use a synchronization event to protect a shared resource that is used in synchronizing the operations of several threads. Synchronization events are rarely used in a typical driver.

If the above statement is true, then I have no idea how there can be no unserviced work item - which is what I was trying to get at.

Once again, the producer can produce multiple consecutive work items before any of the consumers gets a chance to serve. Now, as per the document, it is going to make only one thread runnable. That thread will find the first work item and will compare the counter (that is, the remaining work items not yet served) with NbOfConsumerThread. Since the counter is less than NbOfConsumerThread, it will exit out of processing only to go back to Wait (at the top of the loop). But now there is no more signaling of the event from the producer…

So I’m still holding my claim - you will have work items put into the list by the producer and not served…

Am I wrong? Can anyone please find flaws in my logic???
-pro


> It is possible that all the consumer threads are waiting for the event (only one event object). Once the event is in the signaled state, the OS will satisfy all of them, meaning now they are runnable…

…but only one of them has some actual work to do. All the other ones will find out that they have to go blocking again. What is the point of letting them run in the first place, unless you are just desperate to waste CPU time - you make them contend for the spinlock that guards the list, and for the dispatcher lock when they go blocking…and it is known in advance that all this effort is futile. Funny, huh…

> Man, I’m all wet now… Back to age 5 or below. Read the MSDN document.

As we have established on multiple occasions, when you start speaking like that, it is more than likely that you are about to make a statement that reveals your basic misunderstanding of the subject. This particular case does not seem to be an exception to this rule, as we will see shortly…

> Once again, the producer can produce multiple consecutive work items before any of the consumers gets a chance to serve. Now as per the document, it is going to make only one thread runnable

> Can anyone please find flaws in my logic???

The flaw in your logic is a basic misunderstanding of how the synch event construct works - you just don’t seem to understand what the doc says. The producer will make as many threads runnable as the number of times it calls KeSetEvent() - every time it calls KeSetEvent(), one thread will get into the READY state and be put on its CPU’s ready list (we have decided to have one consumer per CPU). However, it does not imply that this newly-released thread will immediately start running, right? Therefore, by the time the first of them ACTUALLY starts running, a few others may already have been released and placed on their ready lists. As a result, you may end up with a few threads contending for a shared resource when they run. What is so terribly complex here???

Anton Bassov

>> Can anyone please find flaws in my logic???
>
> The flaw in your logic is in basic misunderstanding of how synch event
> construct works - you just don’t seem to understand what the doc says.
> Producer will make as many threads runnable as the number of times it
> calls KeSetEvent() - every time it calls KeSetEvent() one thread will
> get into the READY state and put on its CPU’s ready list

I don’t think KeSetEvent works like that. It’s just an event; it doesn’t
count anything. I think you mean that the producer will make as many
waiting threads runnable as the number of times it calls KeSetEvent.
All of the threads may not be waiting; in fact it is possible (unlikely,
but possible) that every single one of them is just about to call
KeWaitXxx but has been pre-empted. Then you could call KeSetEvent a
billion times - only one thread will become runnable.

There is always a race between ‘are there any items on the queue’ and
KeWaitXxx.

James

I don’t disagree about the fact that I don’t understand a lot of stuff - that is why I said “Perhaps I am included in the group…” Read the archive of this thread…

Now it is clear, right? So do you see the race in your code? If so, please show us code that fixes that race!

Could you explain where it is? And why would there be work items left on the queue that would never be served?

Could you kindly explain why Max’s code does not have that race condition?

-pro


> Now it is clear, right? So do you see the race in your code? If so, please show us code that fixes that race !

If the purpose of this thread was to discover the use of the counting semaphore, then my code has served its purpose well - indeed, there is a theoretical race condition, and this race condition can be eliminated simply by replacing the synch event (which happens to be just a binary semaphore) with a counting one, where the max count equals the number of workers…

Anton Bassov

Who told you that term “theoretical race”? I want to know what it means.

-pro

I’ve seen people (including myself) debugging on a 128-real-processor machine at 2 AM. So there will be someone who will take your code and set NbOfConsumerThread to 128 - and there are systems with even more real processors on the test rigs, for example 384 real processors!

And those machines go to NASA, Boeing, big banks. A transaction of a million dollars just sat there for a while.

Reason - THERE IS A REALLY, REALLY THEORETICAL RACE IN ANTON’S CODE that we found after several months. Oh well, we blew up a couple of flights to Mars. A national bank failed.

And all we have to tell ourselves is: WTF do we know, that is why we took ANTON’s code.

-pro


> Reason - THERE IS A REALLY, REALLY THEORETICAL RACE IN ANTON’S CODE that we found after several months. Oh well, we blew up a couple of flights to Mars. A national bank failed.

Relax, Prokash - you start getting hysterical…

Your arguments make sense only under a hard real-time OS, where missing a deadline may be equal to failure (I don’t know about a national bank’s failure, but a mission to Mars nearly got screwed up because of priority inversion). In our situation we are not bound by any timing constraints, are we?

Therefore, even if once in a VERY long while we have a short delay in pushing the packet up the stack, there is no great problem. I think it is a lesser evil compared to releasing 128 (or 384, if you wish) threads despite the fact that most of them will find nothing to do at the moment - after all, once we speak about a GPOS, we are more concerned about the overall performance than about a fast response on any particular operation…

BTW, originally I planned to use a counting semaphore but then somehow decided that a binary one would suffice…

Anton Bassov

> …released it can be 100% sure that it will find at least one item to process (although it may contend for a spinlock)

Well, in this case, you can replace the spinlock with a lock-free list.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> right? So whoever gets the spinlock first will get the work item

Question to Anton:

  • imagine the following:
    * 1 item is queued, KeSetEvent is called
    * before the thread awakens, the second item is queued, KeSetEvent is called second time, and does not alter the state of the event since the event is already signaled
    * then the thread awaken at step 1 runs, passes your sync event, and un-signals them
    * voila the deadlock. The second item’s KeSetEvent is undone.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> * 1 item is queued, KeSetEvent is called
> * before the thread awakens, the second item is queued, KeSetEvent is called a second time, and does not alter the state of the event since the event is already signaled

How can something like that possibly happen??? KeSetEvent() releases a thread and resets the event atomically - it is not the released thread that resets the event behind the scenes before returning from KeWaitXXX, right?

By the moment KeSetEvent() releases the dispatcher spinlock, the first thread will already be awake (which does not mean it will actually be running) and the event will already be in the unsignaled state again - it may remain signaled only if no thread was released. Therefore, the subsequent invocation of KeSetEvent() will release another thread…

Anton Bassov

> What makes it different from Max’s approach is that there are no idle invocations - if a thread gets released it can be 100% sure that it will find at least one item to process (although it may contend for a spinlock). In Max’s implementation, N threads wake up and contend for a spinlock…and N-1 of them may go back to sleep without having done anything useful. This is particularly true if the load is low. This is what I would call “thundering herd”. However, I don’t think this term applies to a situation when a few workers may have to contend for a spinlock once in a while…

Yes, Max’s code has the thundering herd problem.

In your implementation, try to find the regions of preemption - there are two I noticed.

I have a 128-processor machine with 128 NIC ports (64 cards), each 10 Gbit… Observed capacity is around 7 to 8 Gbit/port, so 1 GByte/s per port - roughly 128 GBytes/sec of actual payload.

Traffic is essentially random; I will take a burst that would occupy 128 processors at a higher level than the worker threads.

Calculate the worst case: what is the maximum I can put in a work item to indicate up?

The synchronization event would be hammered; then there is a pause in the traffic pattern, and 128 worker threads will be scheduled back from preemption. One will win the race, break out of the wait, and start processing - no one else will wake up…

Now I essentially have a one-processor machine to serve traffic.

-pro

Also, one of the preemption regions is accidentally better. The other one is just a classic choke-point to attack…

For those who do not play with synchronization much, or have not seen much of it yet, here is the key point -

Whenever you see some kind of explicit conditional to break out of such a loop, in a code segment that is subject to a probable race, concentrate on that first. That is where the funny smell comes from, and a bad dog like me starts to worry. So the termination condition etc. of an endless loop is very important for code that is subject to synchronization…

Notice that I never tried to get past NbOfConsumerThreads, because that was the conditional exit point.

What I learned over the years is that those areas of the code should be very clean. Introducing code in those regions without care is nothing but sending an invitation to the world to your marriage ceremony…

-pro
