Fix a callout race condition introduced in TCP timers callouts with r281599.
ClosedPublic
Actions

Authored by jch on Jun 9 2015, 3:57 AM.

Details

Reviewers

hiren
jhb
lstewart
adrian

Commits

rS287304: Put r284245 back in place: If at first this fix was seen as a temporary
rS284245: Fix a callout race condition introduced in TCP timers callouts with r281599.

Summary

Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return value have also to be checked.

First reported by @lstewart in D2079#51293 comment, see this comment
and following to get how this issue has been tracked down.

Test Plan

Have been tested on HEAD and stable/10.

Will be tested by @lstewart and @hiren

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jch updated this revision to Diff 6053.Jun 9 2015, 3:57 AM

jch retitled this revision from to Fix a callout race condition introduced in TCP timers callouts with r281599..

jch updated this object.

jch edited the test plan for this revision. (Show Details)

jch added a reviewer: lstewart.

jch added a parent revision: D2079: Fix TCP timers use-after-free old race conditions.Jun 9 2015, 3:58 AM

jch updated this object.

jch edited the test plan for this revision. (Show Details)

jch added reviewers: hiren, jhb, adrian.

jch added a subscriber: hiren.

jch added subscribers: j-nitrology.com, florent_guiliani.fr, rrs.

jch set the repository for this revision to rS FreeBSD src repository - subversion.Jun 9 2015, 4:00 AM

jch updated this object.

So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1

jch added a child revision: D2599: Decompose TCP INP_INFO lock to increase short-lived connections scalability.Jun 10 2015, 5:32 PM

jch mentioned this in D2599: Decompose TCP INP_INFO lock to increase short-lived connections scalability.Jun 10 2015, 5:54 PM

Hi @nitroboost-gmail.com,

In D2763#52995, @nitroboost-gmail.com wrote:

So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1

Thanks a lot for your time testing it. This patch works perfectly fine with our traffic, but it is always great to get a second validation in a different context. I will wait a bit for @hiren or @lstewart feedbacks and push it tomorrow (and mark it for MFC).

In D2763#53029, @jch wrote:

Hi @nitroboost-gmail.com,

In D2763#52995, @nitroboost-gmail.com wrote:

So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1

Thanks a lot for your time testing it. This patch works perfectly fine with our traffic, but it is always great to get a second validation in a different context. I will wait a bit for @hiren or @lstewart feedbacks and push it tomorrow (and mark it for MFC).

@nitroboost-gmail.com aka Jason wolfe and I work at limelight networks. So yes, looks good from our side :-)

Thanks a lot for your work and quick responses.

hiren accepted this revision.Jun 10 2015, 7:10 PM

hiren edited edge metadata.

This revision is now accepted and ready to land.Jun 10 2015, 7:10 PM

Closed by commit rS284245: Fix a callout race condition introduced in TCP timers callouts with r281599. (authored by jch). · Explain WhyJun 10 2015, 8:43 PM

This revision was automatically updated to reflect the committed changes.

Patch pushed in both HEAD and 10-STABLE. And it is not too late for comments on this review, it is never too late for improvements. Thanks all for your time.

jch mentioned this in D2079: Fix TCP timers use-after-free old race conditions.Jun 13 2015, 2:33 PM

The fact that callout_stop() can return 1 (i.e. callout successfully stopped) where this exact callout is just about to be ran can be seen as bug (/feature). Marc proposed me a fix for this callout bug(/feature) and will ask @rrs if it deserves to be fixed(/documented). Thanks again for your inputs/review and testing.

jch mentioned this in D3078: callout_stop() should return 0 when the callout is currently being serviced and indeed unstoppable..Jul 14 2015, 12:23 PM

jch mentioned this in rS286880: callout_stop() should return 0 (fail) when the callout is currently.Aug 18 2015, 10:15 AM

As I wrote before, this race condition was driven by an issue in callout_stop() that it is fixed in rS286880: callout_stop() should return 0 (fail) when the callout is currently (D3078). I will revert rS284245 as it is not more needed now.

This change has been reverted with rS287101, as it is not need anymore with rS286880/D3078.

Change put back with rS287304: Put r284245 back in place: If at first this fix was seen as a temporary, as it turns out this change is not a workaround for a callout_stop() bug, but the right to use callout_stop()/callout_reset() in mpsafe mode. Should be the last change in the task to remove TCP timers race conditions started with D2079: Fix TCP timers use-after-free old race conditions.

For the record, with rS287304, the same logic is used in release/10.2.0, stable/10 and 11-current.

jch mentioned this in D3521: Implement callout_drain_async().Sep 15 2015, 3:10 PM