⚙ D7042 When a callout is being run and scheduled at the same time, callout_stop() would only unschedule the scheduled one, but will not drain the running one.

glebius updated this revision to Diff 18049.Jun 30 2016, 4:44 PM

glebius retitled this revision from to When a callout is being run and scheduled at the same time, callout_stop() would only unschedule the scheduled one, but will not drain the running one..

glebius updated this object.

glebius edited the test plan for this revision. (Show Details)

glebius added reviewers: rrs, • hselasky, davide, jhb, kib.

glebius added a subscriber: network.

Herald added a subscriber: imp. · View Herald TranscriptJun 30 2016, 4:44 PM

The changes to the CALLOUT_PROCESS flag are not necessary for the fix. I left them in just for consistency. I used the flag during debugging, so it was important for me that it is set and removed at proper places.

imp added inline comments.Jun 30 2016, 4:53 PM

sys/kern/kern_timeout.c
753 ↗	(On Diff #18049)	Should you KASSERT That it's already set?

Hi,

The title of this patch is a bit confusing to me.

callout_stop() is not allowed to drain, because that will cause a sleep and callout_stop() is a non-blocking function.

Do you mean callout_async_drain() or callout_drain() ?

Instead of introducing another flag, CALLOUT_RUNNING, you should be able to deduce this state from the cc_xxx() variables ???

How would your patch look for hps_head?
https://svnweb.freebsd.org/base/projects/hps_head/sys/kern/kern_timeout.c?annotate=299262

Is this a bug or new feature? Do you need to update the callout manual page?

--HPS

jch subscribed.Jul 1 2016, 8:23 AM

jch added inline comments.

sys/kern/kern_timeout.c
1256 ↗	(On Diff #18049)	Open question can this case still happen now? (c->c_iflags & CALLOUT_RUNNING) && (cc_exec_curr(cc, direct) != c) i.e. callout is running _and_ it is not the current one being run? I would say no, but double checking with you.

I agree with this change spirit:

callout_async_drain()/callout_stop() should not return success when the callout is actually running, even if the scheduled one has be successfully cancelled. Otherwise this is indeed confusing for callout_async_drain()/callout_stop() users (like me).

Would be nice to have:

Make it clear in callout man page

I tried to introduced this behavior (see below) but I prefer your approach (clearer):

callout_stop() should return 0 when the callout is currently being serviced and indeed unstoppable.
https://reviews.freebsd.org/D3078?vs=on&id=8254&whitespace=ignore-most#toc

I am going to test this change especially against TCP timer callouts

callout_stop() should return 0 when the callout is currently being serviced and indeed unstoppable.

Hi,

I think this approach will add extra complications to the code.

I suggest to use:

static void
dummy_drain(void *arg)
{
}

ret = callout_async_drain(co, &dummy_drain, NULL);

Which is basically what you want and allow callout_async_drain() to update the callback function if called multiple times.

Instead of:

ret = callout_stop()

In this case. It returns the correct value for the client.

If callout_stop() always return 0 in the MPSAFE cc_exec_curr == c, it breaks the callout API possibly with out-of-the-tree code which we don't control.

--HPS

In D7042#147452, @hselasky wrote:

callout_stop() should return 0 when the callout is currently being serviced and indeed unstoppable.

I think this approach will add extra complications to the code.

I suggest to use:

static void
dummy_drain(void *arg)
{
}

ret = callout_async_drain(co, &dummy_drain, NULL);

Which is basically what you want and allow callout_async_drain() to update the callback function if called multiple times.

Instead of:

ret = callout_stop()

In this case. It returns the correct value for the client.

If callout_stop() always return 0 in the MPSAFE cc_exec_curr == c, it breaks the callout API possibly with out-of-the-tree code which we don't control.

I also agree callout_stop() is an old API, breaking it has a cost...

Limiting D7042 change spirit to only callout_async_drain() will be already enough for me (let me update my previous comment). And as callout_async_drain() is a new API and currently only used in TCP timer, very low cost.

But still ok to modify both callout_async_drain() and callout_stop() behaviour (less confusion).

I think this might re-introduce the unkillable threads issue in the sleepqueue code, since there is seems to be a fine difference between !PENDING and RUNNING, exactly when callout is simultaneously reset and running. Am I right ?

Either way, just limiting the scope of the change to drain would remove the case I worry about from the scope at all.

I would not agree that there is API change for callout_stop(). Let's look into the manual page:

The function callout_stop() cancels a callout c if it is currently
pending.  If the callout is pending and successfully stopped, then
callout_stop() returns a value of one.  If the callout is not set, or has
already been serviced, then negative one is returned.  If the callout is
currently being serviced and cannot be stopped, then zero will be
returned.  If the callout has an associated lock, then that lock must be
held when this function is called.

The problem is that the whole paradigm of callouts is that it has three consequent states: not scheduled -> scheduled -> running -> not scheduled. The API and the manual page assume that, some comments in the code assume that, and looks like some contributors to the code also did. But this paradigm isn't true. A callout can be scheduled and running at the same time. And the mishandling of this state was the source of our TCP panics. Now let's get back to the documentation and see what should callout_stop() return if callout is scheduled and running at the same time. It should return 0 because callout is currently being serviced and can't be stopped. At the same time it shouldreturn 1, since it has successfully unscheduled a callout. So it must be 0 and 1 at the same time! :) Yep, which is impossible. The API is limited by wrong paradigm and we must select the most safe value between these two. Now, I assert that returning 1 is much less safe in this case. Because any properly designed user of the API will be ready to free resources that are now in use by running callout. This is exactly what happened with TCP. Any other user, use it async version or not, is also prone to panic if we return 1. I assert that returning 0 will make behaviour more safe, and won't break any users, since this is what a sane programmer would expect from API. To finalize, I would notice that this state of running+scheduled in practice is extremely rare, if we noticed it only now at above 50 Gbit/s. So, I'm strongly against limiting the behavior only to async version.

P.S. I should have put the above long text into the summary of the review. But the review was started in a hurry. Drew said to me that guys are right now on a that secret phone call, and are willing to see my patch, so I should put it on phabricator ASAP.

Julien, I think your observation is correct and right now I'm already testing that. If you are right, then we actually can deduce running state from cc_exec_curr, as Hans suggests. I will keep you updated on results.

Kostik, I failed to see how can sleepqueue case be affected by the change. Can you please provide more details?

Here is updated version of the patch. Per input from Julien and Hans I have
checked that cc_exec_curr(cc) == c and c_iflags & CALLOUT_RUNNING is invariant.

First, I've been running the version with flag and was asserting that
cc_exec_curr(cc) == c. That worked and survived for 24h. Then, I redid
my patch to this final version and tested another 48h.

Since this is a serious problem, that prevents to run high loaded TCP
server, I want to commit this ASAP. The patch was floating for around
a week and now I want to leave just 24h for final reviews.

Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

would only unschedule the scheduled one, but will not drain the running
one.

Herald added a reviewer: manpages. · View Herald TranscriptJul 4 2016, 12:02 PM

glebius marked 2 inline comments as done.Jul 4 2016, 12:05 PM

Looks good to me, let me launch our TCP QA on it.

share/man/man9/timeout.9
253 ↗	(On Diff #18124)	Excellent, quite clear.

In D7042#147487, @glebius wrote:

I would not agree that there is API change for callout_stop(). Let's look into the manual page:
...

The problem is that the whole paradigm of callouts is that it has three consequent states: not scheduled -> scheduled -> running -> not scheduled. The API and the manual page assume that, some comments in the code assume that, and looks like some contributors to the code also did. But this paradigm isn't true. A callout can be scheduled and running at the same time. And the mishandling of this state was the source of our TCP panics. Now let's get back to the documentation and see what should callout_stop() return if callout is scheduled and running at the same time. It should return 0 because callout is currently being serviced and can't be stopped. At the same time it should return 1, since it has successfully unscheduled a callout. So it must be 0 and 1 at the same time! :) Yep, which is impossible. The API is limited by wrong paradigm and we must select the most safe value between these two. Now, I assert that returning 1 is much less safe in this case. Because any properly designed user of the API will be ready to free resources that are now in use by running callout. This is exactly what happened with TCP. Any other user, use it async version or not, is also prone to panic if we return 1. I assert that returning 0 will make behaviour more safe, and won't break any users, since this is what a sane programmer would expect from API. To finalize, I would notice that this state of running+scheduled in practice is extremely rare, if we noticed it only now at above 50 Gbit/s. So, I'm strongly against limiting the behavior only to async version.

Thanks for taking time to give more context and I do agree with your conclusion.

• hselasky added inline comments.Jul 4 2016, 12:26 PM

sys/kern/kern_timeout.c
1091 ↗	(On Diff #18124)	This code line could more nicely have been squashed into the "&= ~ CALLOUT_PENDING" below.
1389 ↗	(On Diff #18124)	Same applies here. Move the flag clearing below if (not_on_a_list == 0) and generalize that clearing CALLOUT_PENDING also clears CALLOUT_PROCESSED. One line of code instead of two.

@glebius

What about the return codes of callout_reset_xxx() . Will they be the same like callout_stop()?

In D7042#147489, @glebius wrote:

Kostik, I failed to see how can sleepqueue case be affected by the change. Can you please provide more details?

Sleepqueue code uses the td_slpcallout to manage time-bound sleeps. The callout is reset after the thread is added to the sleepqueue and the bound sleep was requested. When thread becomes runnable after the voluntary release of CPU, it verifies the conditions and checks whether the scheduled callout was executed or not. If the callout was not executed, it is stopped. If stop failed, then the sleepq_check_timeout() function relinguishes CPU to allow the callout to run.

The last action is done to prevent the situation where callout wakes up possible next sleep, causing too early sleep termination.

Now consider a situation when one (previous) callout is running, and another one (current) callout was successfully canceled. Since new callout was scheduled, this means that previous callout already dropped the thread lock, and is on the path to finish execution. New callout which could make us runnable after the relinquish, is cancelled. In the current KPI, this case, at least with CS_MIGRBLOCK flag set, is returning 1. Your proposal makes the case return 0. The result is that the thread goes into the off-CPU state (it is not runnable, but also not sleeping nor blocked on lock), and the only thing that can make it runnable, the new callout, was already cancelled.

There were several cases where this was already broken, most recent were rss changes to callouts. Although previous breakages were limited to the migrating cases, I fixed them in once in r234952, then fixed after rss once more in 296320. Now you intend to make the situation non-fixable at all.

P.S. To clarify next question, no, this cannot be fixed with the hps' magic bullet, that is the callout_init_mtx. The thread lock, which protects the state shared between sleepq code and sleepq_timeout, is not the lock in usual sense of the word, it is a dynamic set of locks, determined by the current runnability container managing the thread (i.e. runqueue, sleepqueue, turnstile etc lock, changing as thread migrates its running state).

P.S. To clarify next question, no, this cannot be fixed with the hps' magic bullet, that is the callout_init_mtx. The thread lock, which protects the state shared between sleepq code and sleepq_timeout, is not the lock in usual sense of the word, it is a dynamic set of locks, determined by the current runnability container managing the thread (i.e. runqueue, sleepqueue, turnstile etc lock, changing as thread migrates its running state).

If you allow callout_init_mtx() to accept spin-mutexes, like done in hps-head, and add a spin-mutex to "struct thread", which you initialize the callout with - yes it will work just fine w/o needing the return values of callout_reset() and callout_stop() !

--HPS

glebius added inline comments.Jul 5 2016, 7:47 AM

sys/kern/kern_timeout.c
1091 ↗	(On Diff #18124)	It could be, but that's makes reading of code more difficult. Down below we may come already with the flag cleared. I'm not sure, may be I should just omit the part of the patch that clears CALLOUT_PROCESSED.

In D7042#148077, @hselasky wrote:

@glebius

What about the return codes of callout_reset_xxx() . Will they be the same like callout_stop()?

Nothing changes for them, since they don't call _callout_stop_safe internally. AFAIU, the described paradigm problem doesn't apply to scheduling functions.

Kostik, thanks for explanation. What I see is that the case that I now created for sleepqueue is very close to a case when callout was running on a different CPU and wasn't stoppable. And you did r296320 exactly to fix this case. Effectively CS_MIGRBLOCK doesn't block migration. It just flips the return value. So, if I use the same return value flip for my case, that will fix sleepqueue. And I'd suggest to rename CS_MIGRBLOCK to a more meaningful name and document it.

What about the return codes of callout_reset_xxx() . Will they be the same like callout_stop()?
Nothing changes for them, since they don't call _callout_stop_safe internally. AFAIU, the described paradigm problem doesn't apply to scheduling functions.

I think currently callout_stop() returns the same like callout_reset_xxx(), because callout_reset_xxx() can be viewed like:

callout_reset_xxx()
{
int retval;
retval = callout_stop();
restart_callout();
return (retval);
}

If you change the return values of callout_stop() you should also change the return values of callout_reset().

--HPS

Rename CS_MIGRBLOCK to CS_EXECUTING and utilize it in new return path, to avoid reintroducing problem with sleepqueue.
Remove the CALLOUT_PROCESSED manipulations, they aren't necessary to fix the problem.
Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

would only unschedule the scheduled one, but will not drain the running
one.

Herald edited edge metadata. · View Herald TranscriptJul 5 2016, 1:00 PM

Remove some local patches sneaked in.

Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

would only unschedule the scheduled one, but will not drain the running
one.

Herald edited edge metadata. · View Herald TranscriptJul 5 2016, 1:01 PM

Gleb:

I am fine with these changes, however it does *not* mean the fix I sent you does
not also need to be applied. I.e. if an async_drain is running (just like when a drain is running)
we need to return failure and not start the timeout again.

This revision is now accepted and ready to land.Jul 5 2016, 1:23 PM

emaste subscribed.Jul 5 2016, 2:18 PM

Closed by commit rS302350: The paradigm of a callout is that it has three consequent states: (authored by glebius). · Explain WhyJul 5 2016, 6:47 PM

This revision was automatically updated to reflect the committed changes.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=210884#c11

When a callout is being run and scheduled at the same time, callout_stop() would only unschedule the scheduled one, but will not drain the running one.
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

Revision Contents
Changeset List

Diff 18156

head/share/man/man9/timeout.9

head/sys/kern/kern_timeout.c

head/sys/kern/subr_sleepqueue.c

head/sys/sys/callout.h

When a callout is being run and scheduled at the same time, callout_stop() would only unschedule the scheduled one, but will not drain the running one.ClosedPublicActions

Details

Diff Detail

Event Timeline

Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

Updating D7042: When a callout is being run and scheduled at the same time, callout_stop()

Revision ContentsChangeset List

Diff 18156

head/share/man/man9/timeout.9

head/sys/kern/kern_timeout.c

head/sys/kern/subr_sleepqueue.c

head/sys/sys/callout.h

When a callout is being run and scheduled at the same time, callout_stop() would only unschedule the scheduled one, but will not drain the running one.
ClosedPublic
Actions

Revision Contents
Changeset List