Page MenuHomeFreeBSD

amd64 atomic.h: minor codegen optimization in flag access
ClosedPublic

Authored by rlibby on Feb 28 2020, 5:50 AM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 21, 9:43 AM
Unknown Object (File)
Wed, Nov 20, 7:33 PM
Unknown Object (File)
Tue, Nov 12, 10:29 AM
Unknown Object (File)
Sun, Nov 10, 3:08 PM
Unknown Object (File)
Fri, Nov 8, 3:27 PM
Unknown Object (File)
Thu, Nov 7, 5:46 PM
Unknown Object (File)
Thu, Nov 7, 5:40 PM
Unknown Object (File)
Thu, Nov 7, 5:34 PM

Details

Summary

Previously the pattern to extract status flags from inline assembly
blocks was to use setcc in the block to write the flag to a register.
This was suboptimal in a few ways:

  • It would lead to code like: sete %cl; test %cl; jne, i.e. a flag would just be loaded into a register and then reloaded to a flag.
  • The setcc would force the block to use an additional register.
  • If the client code didn't care for the flag value then the setcc would be entirely pointless but could not be eliminated by the optimizer.

A more modern inline asm construct (since gcc 6, and recent clang)
allows for "flag output operands", where a C variable can be written
directly from a flag. The optimizer can then use this to produce direct
code where the flag does not take a trip through a register.

In practice this makes each affected operation sequence shorter by five
bytes of instructions. It's unlikely this has a measurable performance
impact.

Test Plan

kyua test -k /usr/tests/sys/Kyuafile

Manual inspection of generated code. E.g.

generic_stop_cpus:

Before:

ffffffff80c23fff: b8 ff ff ff ff        movl    $4294967295, %eax
ffffffff80c24004: f0                    lock
ffffffff80c24005: 0f b1 0d 88 17 cf 00  cmpxchgl        %ecx, 13571976(%rip)
ffffffff80c2400c: 0f 94 c1              sete    %cl
ffffffff80c2400f: 84 c9                 testb   %cl, %cl
ffffffff80c24011: 0f 84 0b 01 00 00     je      267 <generic_stop_cpus+0x1a2>

After:

ffffffff80c23907: b8 ff ff ff ff        movl    $4294967295, %eax
ffffffff80c2390c: f0                    lock
ffffffff80c2390d: 0f b1 0d 80 1e cf 00  cmpxchgl        %ecx, 13573760(%rip)
ffffffff80c23914: 74 2e                 je      46 <generic_stop_cpus+0xc4>

smr_lazy_advance:

Before:

ffffffff80c25ebc: 8b 46 0c              movl    12(%rsi), %eax
ffffffff80c25ebf: 89 c2                 movl    %eax, %edx
ffffffff80c25ec1: 29 da                 subl    %ebx, %edx
ffffffff80c25ec3: 85 d2                 testl   %edx, %edx
ffffffff80c25ec5: 7e 08                 jle     8 <smr_advance+0x10f>
ffffffff80c25ec7: f0                    lock
ffffffff80c25ec8: 0f b1 5e 0c           cmpxchgl        %ebx, 12(%rsi)
ffffffff80c25ecc: 0f 94 c2              sete    %dl
ffffffff80c25ecf: 89 c8                 movl    %ecx, %eax
ffffffff80c25ed1: f0                    lock
ffffffff80c25ed2: 0f b1 5e 08           cmpxchgl        %ebx, 8(%rsi)
ffffffff80c25ed6: 0f 94 c1              sete    %cl

After:

ffffffff80c257bc: 8b 46 0c              movl    12(%rsi), %eax
ffffffff80c257bf: 89 c2                 movl    %eax, %edx
ffffffff80c257c1: 29 da                 subl    %ebx, %edx
ffffffff80c257c3: 85 d2                 testl   %edx, %edx
ffffffff80c257c5: 7e 05                 jle     5 <smr_advance+0x10c>
ffffffff80c257c7: f0                    lock
ffffffff80c257c8: 0f b1 5e 0c           cmpxchgl        %ebx, 12(%rsi)
ffffffff80c257cc: 89 c8                 movl    %ecx, %eax
ffffffff80c257ce: f0                    lock
ffffffff80c257cf: 0f b1 5e 08           cmpxchgl        %ebx, 8(%rsi)

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 29661
Build 27518: arc lint + arc unit

Event Timeline

kib added a reviewer: mjg.

Thank you for this.

This revision is now accepted and ready to land.Feb 28 2020, 12:36 PM

I only have a minor comment that the feature is somewhat new in clang, I don't know if we care [that is, if we should version check]. I also happened to have a complete rewrite here D23661, but I think this should go in regardless of what happens with that review.

In D23869#524905, @mjg wrote:

I only have a minor comment that the feature is somewhat new in clang, I don't know if we care [that is, if we should version check]. I also happened to have a complete rewrite here D23661, but I think this should go in regardless of what happens with that review.

Thanks for the pointer.

Yes, clang 9, it looks like. I'll update the commit log with that version number.

Given that this is not MFC'd, it should be enough that the in-tree clang version supports it, I think? I'd rather not add #ifdef __GCC_ASM_FLAG_OUTPUTS__. That would be a possibility, but maybe worse than just doing nothing.

Yea, just go for it. Should anyone complain we can revisit.

How does this compare to using the __atomic compiler builtins?

I've changed CheriBSD RISCV to use them (https://github.com/CTSRD-CHERI/cheribsd/blob/master/sys/riscv/include/atomic.h) and am considering submitting a patch to use them for all architectures.

How does this compare to using the __atomic compiler builtins?

I've changed CheriBSD RISCV to use them (https://github.com/CTSRD-CHERI/cheribsd/blob/master/sys/riscv/include/atomic.h) and am considering submitting a patch to use them for all architectures.

Please see D23661

This has been supported since GCC 6/clang 9 (https://reviews.llvm.org/D57394). It would be better if the logic were conditioned on __GCC_ASM_FLAG_OUTPUTS__.

Now, on FreeBSD 12.1 (cc is clang 8.0.1), I cannot build world with WITHOUT_CLANG_BOOTSTRAP=yes...