convert inpcbhash rlock to epoch
ClosedPublic

Authored by mmacy on Jun 7 2018, 2:23 AM.

Details

Summary

The core piece of deferred in_pcbfree has already been discussed in D15510.

This change now consists of 2 pieces:

  • Convert the inpcbinfo info & hash locks to epoch for readers and a mutex for writers (sketched below)
  • Garbage-collect code that handled INP_INFO_TRY_RLOCK failures, since INP_INFO_RLOCK can no longer fail
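
A minimal sketch of the resulting lock-macro shape, for orientation only: the macro names follow the existing INP_INFO convention, but the exact epoch(9) entry points and the ipi_lock field spelling are assumptions for illustration, not the literal contents of this diff.

/*
 * Sketch only -- not the literal diff.  Readers enter the preemptible
 * network epoch instead of taking a read lock; writers take a plain mutex.
 */
#define	INP_INFO_RLOCK(ipi)		epoch_enter_preempt(net_epoch_preempt)
#define	INP_INFO_RUNLOCK(ipi)		epoch_exit_preempt(net_epoch_preempt)

#define	INP_INFO_WLOCK(ipi)		mtx_lock(&(ipi)->ipi_lock)
#define	INP_INFO_WUNLOCK(ipi)		mtx_unlock(&(ipi)->ipi_lock)
#define	INP_INFO_WLOCK_ASSERT(ipi)	mtx_assert(&(ipi)->ipi_lock, MA_OWNED)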

This branch has been tested by pho@ and LLNW.

UDP RX on master prior to the entropy fix:
https://people.freebsd.org/~mmacy/2018.05.11/udprx_master.svg

UDP RX with the entropy fix and the inpcb lock change:
https://people.freebsd.org/~mmacy/2018.05.11/udprx_norandom.svg

UDP TX before:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

UDP TX after:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender6.svg

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I added a few basic comments while I ponder the rest...

sys/netinet/in_pcb.h
328 ↗(On Diff #43398)

While here, please update/fix the locking annotations on these lines. (For example, the [w] and [r] are clearly backwards.)

637 ↗(On Diff #43398)

Just off the top of my head, this section of lock macros doesn't look right.

For example, INP_INFO_TRY_UPGRADE can't try to upgrade an rw lock that it won't actually own in read mode (since INP_INFO_RLOCK doesn't actually acquire a read lock).

Also, now that we will only hold a lock when writing, it seems like this should all switch from rw locks to mutexes.

646 ↗(On Diff #43398)

Do the INP_LIST macros need similar treatment to the INP_INFO macros?

  • Update inpcb locking comments
  • Convert rwlocks to mutexes
  • Update INP_INFO_TRY_RLOCK callers to reflect the fact that they will always succeed (see the sketch below)
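
The caller-side cleanup is of this general shape (a condensed, hypothetical call site, not code copied from the diff):

/*
 * Hypothetical, condensed call site -- not code copied from the diff.
 *
 * Before: the try-lock could fail, so callers carried a fallback path.
 */
if (INP_INFO_TRY_RLOCK(&V_tcbinfo) == 0) {
	/* Couldn't get the read lock; back off and retry later. */
	return;
}
/* ... lookup/processing under the read lock ... */
INP_INFO_RUNLOCK(&V_tcbinfo);

/*
 * After: entering the epoch cannot fail, so the fallback path is
 * garbage-collected.
 */
INP_INFO_RLOCK(&V_tcbinfo);
/* ... lookup/processing within the epoch section ... */
INP_INFO_RUNLOCK(&V_tcbinfo);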
sys/netinet/in_pcb.h
646 ↗(On Diff #43398)

No, I'm keeping them unchanged. I don't think the code they protect is performance critical.

This has been tested under moderate load in our lab at LLNW over a few days with no stability issues. We haven't gotten to more in-depth profiling at higher loads, but I also saw no performance concerns.

  • Update WITNESS for pcbinfo lock change
  • Merge Kevin's gist

I have this running on a prod canary with RACK 🚢

This revision is now accepted and ready to land. Jun 11 2018, 2:06 AM

I'll try to look at this review this week.

This is a significant change; however, I would suggest we all try to get @mmacy a first round of comments by next Monday.

@jtl Thanks. The core of the change was already approved and the CK macro conversion is mechanical. Nonetheless, I can wait until the end of next week to flip the switch on converting the rlock to epoch.

We were planning to commit this today, and we're running it with both your stacks as well, so to be honest it's a little annoying to wait idly if nobody has a specific concern. Most of the line delta is a mechanical conversion; the new locking contract can be inspected by looking at just the macro changes.

I've not had a chance to read this patch in any detail as yet, as I've just been added as a reviewer. On the whole this seems a structurally safe change as long as exclusive acquisition of pcbinfo is still used to serialise multi-inpcb lock acquires, and particular care is taken around the listen/accept context, where there's TCP- and socket-stack recursion. I guess there is more generally a concern about the throughput of unreleased inpcbs in high connections/second turnover environments due to shifting to an epoch-based model for deferred frees -- e.g., high-volume TCP-based DNS service -- that it might be good to test in, if that's not already been done, as that's quite a different workload from long-lived and fairly fat CDN workloads.

mmacy edited the summary of this revision.

Update to reflect latest changes to master

This revision now requires review to proceed. Jun 12 2018, 11:35 PM

remove unused try_upgrade

I've not had a chance to read this patch in any detail as yet, as I've just been added as a reviewer. On the whole this seems a structurally safe change as long as exclusive acquisition of pcbinfo is still used to serialise multi-inpcb lock acquires, and particular care is taken around the listen/accept context, where there's TCP- and socket-stack recursion.

Care mostly has to be taken around deletion unless there are undocumented locking dependencies on the info lock. Nonetheless, I will take a second look.

I guess there is more generally a concern about the throughput of unreleased inpcbs in high connections/second turnover environments due to shifting to an epoch-based model for deferred frees -- e.g., high-volume TCP-based DNS service -- that it might be good to test in, if that's not already been done, as that's quite a different workload from long-lived and fairly fat CDN workloads.

Garbage collection happens in a pcpu task that gets scheduled by hardclock if there is work pending. Thus, under normal conditions there won't be much more than 1ms between logical deletion and free. The one pathological case where this could back up is if something is preempted in the middle of the same period in an epoch and is prevented from running again by a very high-priority, CPU-bound task. However, it's not difficult to come up with worse pathological scenarios with the existing stack.
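
For context, the deferred-free path being described is roughly the shape below. The callback name, the inp_epoch_ctx field, and the epoch_call() argument order are assumptions for illustration rather than the literal code from D15510.

/*
 * Rough sketch of the deferred free described above.  The inpcb is
 * unhashed and marked freed immediately; its memory is only reclaimed
 * once the current epoch has expired, which the per-CPU GC task gets to
 * within a hardclock tick or so under normal conditions.
 */
static void
in_pcbfree_deferred(epoch_context_t ctx)
{
	struct inpcb *inp = __containerof(ctx, struct inpcb, inp_epoch_ctx);

	/* Final teardown and uma_zfree() of the inpcb would go here. */
	(void)inp;
}

void
in_pcbfree(struct inpcb *inp)
{
	/* Unhash, unlink from the pcb list, and mark logically freed. */
	inp->inp_flags2 |= INP_FREED;
	/* Defer the actual free until all current epoch readers have left. */
	epoch_call(net_epoch_preempt, &inp->inp_epoch_ctx, in_pcbfree_deferred);
}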

  • be more aggressive about GC in tcp_hpts.c
mmacy edited the summary of this revision.

Update to master

Latest HEAD and latest patch in synthetic and prod testing at LLNW

I guess there is more generally a concern about the throughput of unreleased inpcbs in high connections/second turnover environments due to shifting to an epoch-based model for deferred frees -- e.g., high-volume TCP-based DNS service -- that it might be good to test in, if that's not already been done, as that's quite a different workload from long-lived and fairly fat CDN workloads.

Garbage collection happens in a pcpu task that gets scheduled by hardclock if there is work pending. Thus, under normal conditions there won't be much more than 1ms between logical deletion and free. The one pathological case where this could back up is if something is preempted in the middle of the same period in an epoch and is prevented from running again by a very high-priority, CPU-bound task. However, it's not difficult to come up with worse pathological scenarios with the existing stack.

I think you slightly misread my comment. It was not so much that I am concerned that there is a pathological case in theory (clearly those exist all over the place); it's that in practical use this could shift the watermark on allocated inpcbs substantially in some real-world workloads, and that it would be good to seek some testing from folks who work with such workloads, which differ substantially from the target workload for this work (as I understand it). In the past, work on high connections/sec support has been primarily driven by the folks at Verisign, who do top-level domain hosting and have a good test setup for that. I wonder if they'd be willing to give this a spin and in particular look for evidence of impact on connections/sec, to make sure that this doesn't lead to a substantive decrease (and ideally instead leads to an increase) in connection throughput.

It seems unlikely that a 2ms backlog of inpcbs could alter the connection rate, compared with new connections no longer blocking on callouts and all the various points that acquire an info read lock. Verisign has not been responsive of late, but if they're willing to test, that would be an interesting data point.

We can definitely run our performance tests on this change.

More details: Our tests are quite simple but match DNS TCP traffic behavior:

  1. Open connection
  2. Send small request (one packet)
  3. Receive small response (one packet)
  4. Close connection
  5. Repeat
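
For reproducibility, the per-connection loop boils down to something like the sketch below (minimal C, with a hypothetical server address, port, and payload; this is not the actual benchmark).

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	char req[64] = "small query";	/* hypothetical one-packet request */
	char resp[512];
	int s;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5300);			/* hypothetical port */
	inet_pton(AF_INET, "192.0.2.1", &sin.sin_addr);	/* hypothetical server */

	for (;;) {
		s = socket(AF_INET, SOCK_STREAM, 0);	/* 1. open connection */
		if (s < 0)
			break;
		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) != 0) {
			close(s);
			break;
		}
		(void)write(s, req, sizeof(req));	/* 2. send small request */
		(void)read(s, resp, sizeof(resp));	/* 3. receive small response */
		close(s);				/* 4. close connection */
	}						/* 5. repeat */
	return (0);
}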

FreeBSD 11 results using a 10G NIC and 4 receive queues are:

  • 110k queries per second (~550k TCP packets per second), vs. 200k qps (~1M TCP pps) on Linux on the same hardware

Will see how it goes with this change. Don't hold your breath though as we are currently lagging behind HEAD.

@jch sounds simple enough to replicate, nonetheless can you share any benchmark source?

Will see how it goes with this change. Don't hold your breath though as we are currently lagging behind HEAD.

@jch what do you mean by this? Do you have to wait until you update to run this? All we need is first pass benchmarking in the lab if you don't have any way to roll this out on to a canary.

Agreed -- I think all we really want is a high connections/sec benchmark in a test environment that (a) helps us understand the CPU impact of the changes; and (b) helps us understand what happens with inpcb watermarks. I would hope for a performance improvement on the former, unless the latter leads to a lower connection throughput due to either greater cache footprint or inpcb starvation from a larger set being out of action. I'd like to think we'll see a win, but I think it's important to establish the impact on this other key extreme load point in the TCP usage spectrum if we can, to make sure that there are no surprises. (It would of course also be good to improve performance at that point, which further adoption of Concurrency Kit should also help with.)

If @jch cannot run his real workload, what about running a synthetic test that creates/tears down many connections per second? E.g., something like many copies of netperf -t TCP_CC?

I think any highly parallel connections/sec test that sends and receives a small amount of data synchronously, without leaving TCP latency-bound, would do the trick. I'd be quite happy with a synthetic test environment (and in some ways would prefer it for this purpose, although I'm obviously interested in production as well).

The problem with synthetic testing is that you're overweighting the cost of the deferred free. In a real-world situation you'll have RTTs orders of magnitude higher than in the lab. Let's say for the sake of discussion that the average RTT is 50ms and the client only ever sends one data packet:

-> SYN
<- SYN/ACK
-> ACK -> connection is open
-> DATA
<- RESPONSE (25ms)
-> FIN (25ms)

So the inpcb is active for at least the RTT. This means that the ratio of active to INP_FREED inpcbs is going to be at least RTT/(defer latency); if the RTT is 50ms and the defer latency is ~2ms, that's roughly 25:1. That said, I did run this on two systems connected back to back.

With my change I'm seeing connections per second scale essentially linearly with the number of netperf processes running -t TCP_CC. The numbers get somewhat skewed as the process count gets into the thousands, because many netperfs will have exited before some have even gotten to run, which may artificially inflate the end result. One system is a 32-way (2x8x2) and the other a 48-way (2x12x2); both have a clock of 3.2GHz. I've seen numbers as high as 63Mconn/s (which I don't vouch for) with 65536 netperf processes. I think I can vouch for 2.56Mconn/s with 256 processes. I'll test a kernel prior to my changes as soon as I get back to my desk. Nonetheless, these numbers are large enough to make the concern seem barely relevant, as the connection rate is still latency-bound (as seen by the need to have more processes than CPUs).

This is what the profile looks like with only 32 processes. There's a lot of overhead there in timewait.
https://people.freebsd.org/~mmacy/tcp_cc.svg

I see the exact same scaling of connections per second (10k/s per netperf process) with a kernel prior to the inpcb deferred free change.

@jch sounds simple enough to replicate, nonetheless can you share any benchmark source?

Good question. Let me check if I can replicate this simple test logic using warp17/Nginx; this would be useful to make our results easier to reproduce anyway. It should not take long to get the configuration right.

@jch what do you mean by this? Do you have to wait until you update to run this? All we need is first pass benchmarking in the lab if you don't have any way to roll this out on to a canary.

Right, we need to repave our test lab with a newer FreeBSD first, which can take time. Thus, don't wait on our results; push your change as soon as you have enough approvals. We will run this test at some point anyway, and report our findings as usual.

I've spent some time thinking about this, and I have the following comments. (Some may seem contradictory, but please bear with me. :-) )

  • I am supportive of the idea of changing the INP list and hash lookups to use RCU. I think this is a very good use case for RCU. I also think the (non-)locking semantics on epoch section entry make the code much cleaner, as we can avoid the "lock dance" (drop INP, acquire INP_INFO, acquire INP) when we only need a read lock on INP_INFO. So, I am very supportive of this direction and hope to see it fully implemented.
  • This looks to be a faithful (mostly) mechanical conversion of the INP_INFO lock into an epoch (read) + mutex (write) combination. As such, I cannot see any obvious "problems" with the code change, per se.
  • My main concern is what sorts of related things this will expose. For example, there could be something else (undocumented) that is relying on the INP_INFO lock for synchronization. Or, there could be race conditions this will expose/exacerbate. (For an example of this, I believe @jch's change several years ago to avoid taking the INP_INFO write lock in tcp_input exposed a race condition in lookups for TCP sessions transitioning to timewait.) I took a look through the code, but couldn't see anything else in TCP-land which would be impacted by the changed synchronization guarantees proposed by this change. OTOH, I could have missed something subtle. And, there are other protocols which use this lock and may be relying on the existing synchronization guarantees. For example, it would be worth getting a SCTP reviewer (e.g. @tuexen or @rrs, for example) to see whether the changed synchronization guarantees will impact them.
  • I would like to understand why this shares the same epoch section with many other network things. I am concerned about this because we have a tendency to hold the INP_INFO read lock for long portions of packet processing, so we have it just in case we will need it. This review doesn't change that. So, we will now have much higher usage of the net_epoch_preempt epoch tracker, and some of the threads using it may be in epoch sections for a "long" (in computer terms :-) ) time. It is not clear to me whether this will be better or worse than using a separate epoch tracker for INP things.
  • (Related to the previous item, I am slightly concerned about the DoS attack possibility of someone sending a number of packets that will make us "acquire the INP_INFO read lock" [now, "enter a net_epoch_preempt epoch section"], delaying the release of INPs. Yes, I know they could already do something similar; however, because RW locks prefer writers, the actual behavior would differ.)
  • To some extent, my big criticism of this change is that it doesn't go far enough. For example, we speculatively acquire the INP_INFO read lock early in the TCP input process, precisely because we think we may need it later. Instead, I think we should really enter the epoch section much closer to when we need it and exit the epoch section when we are done. I think that would be more efficient, take more advantage of the conversion from locks to RCU, and alleviate the concerns I've just expressed. But, I do understand the desire to be incremental and don't propose delaying this code in favor of something larger.
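
As a rough illustration of the narrower scoping being suggested (hypothetical shape, not part of this diff; the in_pcblookup() argument list here is an approximation):

/*
 * Hypothetical illustration only: enter the network epoch around the
 * lookup itself rather than holding it speculatively across all of
 * tcp_input().
 */
INP_INFO_RLOCK(&V_tcbinfo);			/* epoch entry */
inp = in_pcblookup(&V_tcbinfo, ip->ip_src, th->th_sport,
    ip->ip_dst, th->th_dport, INPLOOKUP_WLOCKPCB, m->m_pkthdr.rcvif);
INP_INFO_RUNLOCK(&V_tcbinfo);			/* exit once the lookup is done */
if (inp != NULL) {
	/*
	 * The inpcb lock is held from the lookup; a check that it was not
	 * concurrently marked freed would be needed before using it.
	 */
	/* ... process the segment with only the inpcb lock held ... */
	INP_WUNLOCK(inp);
}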
sys/netinet/in_pcb.h
451 ↗(On Diff #43689)

It's not really "inpcb modification", but rather "inpcb list modification", right?

  • I would like to understand why this shares the same epoch section with many other network things. I am concerned about this because we have a tendency to hold the INP_INFO read lock for long portions of packet processing, so we have it just in case we will need it. This review doesn't change that. So, we will now have much higher usage of the net_epoch_preempt epoch tracker, and some of the threads using it may be in epoch sections for a "long" (in computer terms :-) ) time. It is not clear to me whether this will be better or worse than using a separate epoch tracker for INP things.

With the current API a thread cannot be in more than one preemptible epoch section, because the section state lives in an entry in the thread structure itself. I would very much like to avoid the API that rmlocks use, as I feel that epochs are lightweight enough that one can simply expand the scope they protect. An epoch could in principle safely cover much larger sections of code still, if we are comfortable with scheduler guarantees on the maximum amount of time a thread in a section can be preempted before being rescheduled to exit the section. However, this is why we need to use epoch_call rather than epoch_wait in general: a periodic 100ms delay in GC is probably tolerable (albeit something very much to avoid), but a 100ms delay in forward progress is not.
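
To spell out the epoch_call/epoch_wait distinction being referred to (a conceptual sketch; the exact epoch(9) signatures at this point in time, and the object and callback names, are assumptions):

/*
 * Synchronous reclamation: block the caller until every thread currently
 * in the epoch section has left it.  A reader preempted for 100ms stalls
 * this thread for 100ms, which is why it is avoided on hot paths.
 */
epoch_wait_preempt(net_epoch_preempt);
obj_free(obj);

/*
 * Asynchronous reclamation: queue the free and return immediately; the
 * callback runs from the per-CPU GC task once the epoch has expired, so a
 * slow reader only delays reclamation, not the caller's forward progress.
 */
epoch_call(net_epoch_preempt, &obj->epoch_ctx, obj_free_cb);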

  • (Related to the previous item, I am slightly concerned about the DoS attack possibility of someone sending a number of packets that will make us "acquire the INP_INFO read lock" [now, "enter a net_epoch_preempt epoch section"], delaying the release of INPs. Yes, I know they could already do something similar; however, because RW locks prefer writers, the actual behavior would differ.)

Thanks to the epoch_section in the thread it doesn't actually work that way. We don't need to wait until all threads exit an epoch for any inpcbs to be freed.
The epoch can advance even if there are threads in the section 100% of the time -- so long as they are _different_ threads. The issue you're concerned with can only come up if a given thread spends a long time (for some definition of long time) in an epoch, or if a thread in an epoch is preempted and is prevented from ever leaving it by a higher-priority, compute-bound task. The former would -- I believe -- just be a bug. The latter is something that is better addressed in the scheduler, something I've discussed with Jeff, and he is quite amenable to addressing it.

  • To some extent, my big criticism of this change is that it doesn't go far enough. For example, we speculatively acquire the INP_INFO read lock early in the TCP input process, precisely because we think we may need it later. Instead, I think we should really enter the epoch section much closer to when we need it and exit the epoch section when we are done. I think that would be more efficient, take more advantage of the conversion from locks to RCU, and alleviate the concerns I've just expressed. But, I do understand the desire to be incremental and don't propose delaying this code in favor of something larger.

I think the John Gall quote sums up the right approach: "A complex system that works is invariably found to have evolved from a simple system that worked."

sys/netinet/in_pcb.h
451 ↗(On Diff #43689)

Correct.

jtl added inline comments.
sys/netinet/in_pcb.h
451 ↗(On Diff #43689)

OK. Then it seems like this comment should be updated when you commit the change.

This revision is now accepted and ready to land. Jun 18 2018, 10:13 PM
This revision was automatically updated to reflect the committed changes.