
if_wg: use proper barriers around pkt->p_state
ClosedPublic

Authored by kevans on Mar 9 2024, 1:33 AM.

Details

Summary

The hardware is allowed to reorder these loads in wg_deliver_out() and
wg_deliver_in() such that we end up with a garbage mbuf that we try to
pass on, for lack of load synchronization to pair with the store
barriers in wg_encrypt() and wg_decrypt(). The issue is particularly
prevalent on !x86 platforms with their weaker memory models.

With the patch, my dual-iperf3 reproducer is dramatically more stable
than it is without.

PR: 264115
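For reference, a minimal sketch of the release/acquire pairing the patch is after (illustrative only, not the literal diff; the acquire side uses the atomic_load_acq_int discussed below, and the matching atomic_store_rel_int on the store side is assumed here):

	/* Producer side, wg_encrypt()/wg_decrypt(): publish the mbuf, then the state. */
	pkt->p_mbuf = m;
	atomic_store_rel_int(&pkt->p_state, state);	/* release: earlier stores ordered before it */

	/* Consumer side, wg_deliver_out()/wg_deliver_in(): observe the state, then the mbuf. */
	if (atomic_load_acq_int(&pkt->p_state) != WG_PACKET_CRYPTED)	/* acquire: later loads ordered after it */
		goto error;
	m = pkt->p_mbuf;	/* now guaranteed to see the producer's store */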

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kevans requested review of this revision. Mar 9 2024, 1:33 AM
sys/dev/wg/if_wg.c
1519–1520

There's a little more context in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264115#c23, but all of the non-debug reports look like a bogus fib pulled from the mbuf, which I suspected lined up with the pkt->p_mbuf load in wg_deliver_out() ultimately happening before this specific store is visible... the panic I was finally able to reproduce seems consistent with that.

sys/dev/wg/if_wg.c
1519–1520

That's kind of a classic symptom... but it's hard to know for sure: it could also be random corruption. If it is a race here, I think the release on the store and the acquire on the read are sufficient for this variable... but I often forget key details of the semantics, so don't just commit on my say-so.

zlei added inline comments.
sys/dev/wg/if_wg.c
1648

Will an rmb() before dereferencing pkt->p_mbuf be sufficient for load synchronization?

andrew added a subscriber: andrew.
andrew added inline comments.
sys/dev/wg/if_wg.c
1648

Is there a reason to prefer rmb over atomic_load_acq_int? rmb looks heavier than a load acquire on many architectures.

This revision is now accepted and ready to land. Mar 11 2024, 12:47 PM
sys/dev/wg/if_wg.c
1648

Yeah, rmb should be fine as well, but it doesn't seem like there's much point in penalizing x86 with an lfence when we can handle it more optimally for everyone and more clearly outline our expectations at the same time (i.e., that the remaining pkt field updates are visible when we observe p_state != WG_PACKET_CRYPTED).
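To make the cost tradeoff concrete, a rough sketch of the two consumer-side options (the per-arch lowering noted in the comments is approximate; the per-arch atomic(9) implementations are authoritative):

	/* Option A: plain load followed by a full read barrier. */
	state = pkt->p_state;
	rmb();				/* amd64: lfence; arm64: dmb ld -- orders all earlier loads */
	m = pkt->p_mbuf;

	/* Option B: acquire load, ordering only what this consumer needs. */
	state = atomic_load_acq_int(&pkt->p_state);	/* amd64: plain mov (TSO); arm64: roughly an ldar */
	m = pkt->p_mbuf;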

sys/dev/wg/if_wg.c
1648

@andrew Is there a reason to prefer rmb over atomic_load_acq_int? rmb looks heavier than a load acquire on many architectures.

That was my first thought on seeing the wmb(). Normally rmb() and wmb() should be used as a pair.
I have not tested it yet, but I suspect the wmb() in this driver (if_wg) is even redundant on x86.

@kevans Yeah, rmb should be fine as well, but it doesn't seem like there's much point in penalizing x86 with an lfence when we can handle it more optimally for everyone and more clearly outline our expectations at the same time (i.e., that the remaining pkt field updates are visible when we observe p_state != WG_PACKET_CRYPTED).

Last night my head was full of weak memory models. I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

sys/dev/wg/if_wg.c
1648

That was my first thought on seeing the wmb(). Normally rmb() and wmb() should be used as a pair.
I have not tested it yet, but I suspect the wmb() in this driver (if_wg) is even redundant on x86.

Yes, I think this is the case.

Last night my head was full of weak memory models. I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

Because we don't really need that heavy a hammer, and I'm pretty sure it's not sufficient anyway. Just because we GROUPTASK_ENQUEUE() here doesn't mean a future run of the task is the one that peels the packet off the serial queue -- it could actually be a concurrent run pulling it off pretty much as soon as we're done with it. We don't really have a reason to prevent that, either; we just need to make sure that the pkt state is consistent when we actually go to process it.
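Put differently, the synchronization belongs to the packet, not to the task hand-off. A simplified sketch of the consumer loop (illustrative; wg_queue_dequeue_serial() here is a stand-in for however the real loop pulls from peer->p_encrypt_serial):

	/* Whichever task run dequeues this packet -- including one that was
	 * already executing when the producer published it -- the acquire on
	 * p_state is what makes the rest of *pkt (p_mbuf, ...) visible. */
	while ((pkt = wg_queue_dequeue_serial(&peer->p_encrypt_serial)) != NULL) {
		if (atomic_load_acq_int(&pkt->p_state) != WG_PACKET_CRYPTED)
			goto error;
		m = pkt->p_mbuf;	/* safe: ordered after the acquire load */
		/* ... deliver m ... */
	}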

Looks good to me.

sys/dev/wg/if_wg.c
1648

That was my first thought on seeing the wmb(). Normally rmb() and wmb() should be used as a pair.
I have not tested it yet, but I suspect the wmb() in this driver (if_wg) is even redundant on x86.

Yes, I think this is the case.

Last night my head was full of weak memory models. I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

Because we don't really need that heavy a hammer, and I'm pretty sure it's not sufficient anyway. Just because we GROUPTASK_ENQUEUE() here doesn't mean a future run of the task is the one that peels the packet off the serial queue -- it could actually be a concurrent run pulling it off pretty much as soon as we're done with it. We don't really have a reason to prevent that, either; we just need to make sure that the pkt state is consistent when we actually go to process it.

I think your analysis is right.

-----
wg_xmit()
	wg_peer_send_staged()
		wg_queue_both()
			enqueue(&peer->p_encrypt_serial)
			enqueue(&sc->sc_encrypt_parallel)
		wg_encrypt_dispatch()
			GROUPTASK_ENQUEUE(&sc->sc_encrypt[cpu]);  // signal one cpu to do wg_softc_encrypt()

-----
wg_softc_encrypt()
	while ((pkt = dequeue(&sc->sc_encrypt_parallel)) != NULL)
		wg_encrypt(sc, pkt)
			noise_keypair_encrypt()	// do the encryption -- the heavy lifting
			pkt->p_mbuf = m;
			wmb();
			pkt->p_state = state;
			GROUPTASK_ENQUEUE(&peer->p_send);	// signal one cpu to do wg_deliver_out()

-----
wg_deliver_out()
	while ((pkt = dequeue(&peer->p_encrypt_serial)) != NULL)
		assert(pkt->p_state == WG_PACKET_CRYPTED)
		m = pkt->p_mbuf;
		wg_send(sc, &endpoint, m);
			sosend()

I'm now wondering: why not establish a happens-before order between GROUPTASK_ENQUEUE() and the next run of the tasks (wg_deliver_out() and wg_deliver_in())?

Actually there is a happens-before order (the GROUPTASK_ENQUEUE() happens before the task run it triggers), BUT wg_deliver_out() dequeues multiple packets, and some of them may have been produced by other threads running wg_softc_encrypt().
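A self-contained model of that interleaving (a hypothetical stand-alone C11 program, not driver code): the consumer thread is already running before the producer publishes, so no enqueue-time happens-before edge exists between them, and it is the per-packet release/acquire pair that carries the payload across:

	#include <assert.h>
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define	CRYPTED	1

	struct pkt {
		int payload;		/* stands in for pkt->p_mbuf */
		atomic_int state;	/* stands in for pkt->p_state */
	};

	static struct pkt pkt;		/* zero-initialized: state starts "uncrypted" */

	static void *
	consumer(void *arg)
	{
		/* Already running before the producer publishes, like a
		 * concurrent wg_deliver_out() run. */
		while (atomic_load_explicit(&pkt.state, memory_order_acquire) != CRYPTED)
			;		/* spin; a real task would sleep */
		/* The acquire pairs with the producer's release below, so the
		 * payload written before the release is guaranteed visible. */
		assert(pkt.payload == 42);
		printf("payload %d\n", pkt.payload);
		return (NULL);
	}

	int
	main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, consumer, NULL);
		pkt.payload = 42;	/* like pkt->p_mbuf = m */
		atomic_store_explicit(&pkt.state, CRYPTED, memory_order_release);
		pthread_join(t, NULL);
		return (0);
	}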