Page MenuHomeFreeBSD

Fix a callout race condition introduced in TCP timers callouts with r281599.
ClosedPublic

Authored by jch on Jun 9 2015, 3:57 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Nov 25, 4:42 AM
Unknown Object (File)
Mon, Nov 18, 1:56 PM
Unknown Object (File)
Mon, Nov 18, 1:54 PM
Unknown Object (File)
Thu, Nov 14, 7:57 AM
Unknown Object (File)
Thu, Nov 14, 12:43 AM
Unknown Object (File)
Nov 13 2024, 2:37 AM
Unknown Object (File)
Nov 11 2024, 6:52 AM
Unknown Object (File)
Nov 9 2024, 9:20 PM

Details

Summary

Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return value have also to be checked.

First reported by @lstewart in D2079#51293 comment, see this comment
and following to get how this issue has been tracked down.

Test Plan

Have been tested on HEAD and stable/10.

Will be tested by @lstewart and @hiren

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jch retitled this revision from to Fix a callout race condition introduced in TCP timers callouts with r281599..
jch updated this object.
jch edited the test plan for this revision. (Show Details)
jch added a reviewer: lstewart.
jch updated this object.
jch edited the test plan for this revision. (Show Details)
jch added reviewers: hiren, jhb, adrian.
jch added a subscriber: hiren.
jch updated this object.
jch updated this object.

So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1

Hi @nitroboost-gmail.com,

In D2763#52995, @nitroboost-gmail.com wrote:

So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1

Thanks a lot for your time testing it. This patch works perfectly fine with our traffic, but it is always great to get a second validation in a different context. I will wait a bit for @hiren or @lstewart feedbacks and push it tomorrow (and mark it for MFC).

In D2763#53029, @jch wrote:

Hi @nitroboost-gmail.com,

In D2763#52995, @nitroboost-gmail.com wrote:

So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1

Thanks a lot for your time testing it. This patch works perfectly fine with our traffic, but it is always great to get a second validation in a different context. I will wait a bit for @hiren or @lstewart feedbacks and push it tomorrow (and mark it for MFC).

@nitroboost-gmail.com aka Jason wolfe and I work at limelight networks. So yes, looks good from our side :-)

Thanks a lot for your work and quick responses.

hiren edited edge metadata.
This revision is now accepted and ready to land.Jun 10 2015, 7:10 PM
This revision was automatically updated to reflect the committed changes.

Patch pushed in both HEAD and 10-STABLE. And it is not too late for comments on this review, it is never too late for improvements. Thanks all for your time.

The fact that callout_stop() can return 1 (i.e. callout successfully stopped) where this exact callout is just about to be ran can be seen as bug (/feature). Marc proposed me a fix for this callout bug(/feature) and will ask @rrs if it deserves to be fixed(/documented). Thanks again for your inputs/review and testing.

As I wrote before, this race condition was driven by an issue in callout_stop() that it is fixed in rS286880: callout_stop() should return 0 (fail) when the callout is currently (D3078). I will revert rS284245 as it is not more needed now.

This change has been reverted with rS287101, as it is not need anymore with rS286880/D3078.

Change put back with rS287304: Put r284245 back in place: If at first this fix was seen as a temporary, as it turns out this change is not a workaround for a callout_stop() bug, but the right to use callout_stop()/callout_reset() in mpsafe mode. Should be the last change in the task to remove TCP timers race conditions started with D2079: Fix TCP timers use-after-free old race conditions.

For the record, with rS287304, the same logic is used in release/10.2.0, stable/10 and 11-current.