Page MenuHomeFreeBSD

Make hwpmc work for userspace binaries again
ClosedPublic

Authored by gallatin on Nov 18 2021, 7:06 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Nov 12, 3:13 PM
Unknown Object (File)
Tue, Nov 12, 1:20 PM
Unknown Object (File)
Tue, Nov 12, 6:00 AM
Unknown Object (File)
Mon, Nov 11, 1:22 PM
Unknown Object (File)
Mon, Nov 11, 12:11 AM
Unknown Object (File)
Sun, Nov 10, 11:48 PM
Unknown Object (File)
Sun, Nov 10, 1:25 PM
Unknown Object (File)
Sun, Nov 10, 1:14 PM

Details

Summary

hwpmc has been utterly broken for userspace binaries, and has been
labeling all samples from userspace binaries as dubious frames. The
issues are that:

  • The check for ph.p_offset & (-ph.p_align) == 0 was mostly bogus. The intent was to ignore all executable segments other than the first, which when using BFD appeared in the first page, but with current LLD a read-only data segment appears before the executable segment, pushing the latter into the second page or later. This meant no executable segment was ever found, and thus pi_vaddr remained 0. Instead of relying on BFD's layout, track whether we've seen an executable segment explicitly with a local bool.
  • Shared libraries were not parsing the segments to calculate pi_vaddr, resulting in it always being 0. Again, when using BFD, the executable segment started at the first page, and so pi_vaddr was genuinely meant to be 0, but not with LLD's current layout. This meant that pmcstat_image_link's offset calculation gave the base address of the segment in memory, rather than the base address of the whole library in memory, and so when adding that to pi_start/pi_end to get the range of the executable sections in memory it double-counted the offset of the first executable segment within the library. Thus we need to do the exact same parsing for ET_DYN as we do for ET_EXEC, which is simpler to write as special-casing ET_REL to not look for segments. Note that, whilst PT_INTERP isn't needed for shared libraries, it will be for PIEs, which pmcstat still fails to handle due to not knowing the base address of the PIE; we get the base address for libraries by MAP_IN events, and for rtld by virtue of the process's entry address being rtld's, but have no equivalent for the executable.

Fixes courtesy of jrtc27@.

Test Plan
  • collect a sample of your system. similar to: pmcstat -n 1000000 -S resource_stalls.any -S cpu_clk_unhalted.thread_p -O out.pmclog_cpi01 sleep 10
  • process the log file and produce a graph: pmcstat -R /d/fg/out.pmclog_cpi01 -G foo.graph
  • Ensure there are a low number of "dubious frames", and that dynamically linked userspace process have stacks recorded in the profile. Note it may be necessary to increase the number of pmc buffers (kern.hwpmc.nbuffers_pcpu=128 in loader.conf), as map-in records may be lost (hence preventing resolution of some executable sections, since they are missing from the logs)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I generally trust @jrtc27 on this.

@brett_gutste.in Does the fix to ignore offsets for !ET_REL help with PIE binaries (e.g. does it reduce the scope of some of your patch over in CheriBSD to fix hwpmc on PIE?)

This revision is now accepted and ready to land.Nov 30 2021, 7:05 PM
  • the check for ph.p_offset & (-ph.p_align) == 0 seemed entirely bogus, and was preventing any binary's exec. section from being considred. Replace this with a check for the current vaddr == 0 (meaning we're only taking the 1st exec. section in a binary)

The intent of this check was to find the first executable segment, under the assumption that it would be in the first page of the file (the check is "page == 0", since -align just gets you a mask for the upper bits). This perhaps used to be the case with BFD (looking at /bin/sh on a random Linux box it does have such a layout), but is not the case with LLD. So the commit message should probably be clarified to explain that.

(Also typo on "considred")

  • due to how map-ins are recorded, we were double-counting the offsets for shared libraries. Jessica came up with a brilliant workaround, which is to simply not set the vaddr for relocatable objects.

Again, could be clearer. The double-counting was specifically for the offset of the executable segment within the library (which, again, used to not matter because that would be 0 with BFD due to being the first page, but matters with LLD), not the offset (base address) of the entire library in memory; currently it's a bit ambiguous which offset you're talking about (or even if it's both).

As for the change here, it's a fix not a workaround, and "for relocatable objects" isn't the important part. Previously it was eh.e_type == ET_EXEC, which excluded ET_DYN (shared libraries) and ET_REL (.o files, which are actually what .ko files are on amd64 and mips, hence why they're handled by pmcstat at all), but should have been eh.e_type == ET_EXEC || eh.e_type == ET_DYN, which eh.e_type != ET_REL is a shorter way of writing due to there only being three possibilities (and we've already checked that it's one of the three above).

Though, now I revisit this, I realise this isn't actually quite right when the two diffs are composed like this; if a shared library has more than one executable segment and the first one is within the first page then pi_vaddr will be set to the second one's, since I used pi_vaddr == 0 as a proxy for "not yet found", which works for executables (as it was when I originally wrote that line) but not shared libraries, for which I wrote the patch second. We should probably instead do:

diff --git a/lib/libpmcstat/libpmcstat_image.c b/lib/libpmcstat/libpmcstat_image.c
index 9ee7097e95ec..798590044759 100644
--- a/lib/libpmcstat/libpmcstat_image.c
+++ b/lib/libpmcstat/libpmcstat_image.c
@@ -295,6 +295,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
 	size_t i, nph, nsh;
 	const char *path, *elfbase;
 	char *p, *endp;
+	bool first_exec_segment;
 	uintfptr_t minva, maxva;
 	Elf *e;
 	Elf_Scn *scn;
@@ -384,7 +385,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
 	 * loaded.  Additionally, for dynamically linked executables,
 	 * save the pathname to the runtime linker.
 	 */
-	if (eh.e_type == ET_EXEC) {
+	if (eh.e_type != ET_REL) {
 		if (elf_getphnum(e, &nph) == 0) {
 			warnx(
 "WARNING: Could not determine the number of program headers in \"%s\": %s.",
@@ -392,6 +393,7 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
 			    elf_errmsg(-1));
 			goto done;
 		}
+		first_exec_segment = true;
 		for (i = 0; i < eh.e_phnum; i++) {
 			if (gelf_getphdr(e, i, &ph) != &ph) {
 				warnx(
@@ -416,8 +418,10 @@ pmcstat_image_get_elf_params(struct pmcstat_image *image,
 				break;
 			case PT_LOAD:
 				if ((ph.p_flags & PF_X) != 0 &&
-				    (ph.p_offset & (-ph.p_align)) == 0)
+				    first_exec_segment) {
 					image->pi_vaddr = ph.p_vaddr & (-ph.p_align);
+					first_exec_segment = false;
+				}
 				break;
 			}
 		}

(you could instead mess with initialising pi_vaddr to -1 since that's not a valid value for pi_vaddr, but it's needlessly hard to follow)

In D33055#750301, @jhb wrote:

I generally trust @jrtc27 on this.

@brett_gutste.in Does the fix to ignore offsets for !ET_REL help with PIE binaries (e.g. does it reduce the scope of some of your patch over in CheriBSD to fix hwpmc on PIE?)

I believe it should fix everything for PIEs except the lack of knowledge about what the base load address was, since won't get a MAP_IN event for it like shared libraries?

In D33055#750301, @jhb wrote:

I generally trust @jrtc27 on this.

@brett_gutste.in Does the fix to ignore offsets for !ET_REL help with PIE binaries (e.g. does it reduce the scope of some of your patch over in CheriBSD to fix hwpmc on PIE?)

I believe it should fix everything for PIEs except the lack of knowledge about what the base load address was, since won't get a MAP_IN event for it like shared libraries?

Yes this was the primary change besides plumbing information about the loaded entry address through.

While making these changes we may also want to set image->pi_entry to eh.e_entry in pmcstat_image_get_elf_params(), as this currently isn't done. Adding this and an image->pi_ispositionindependent field were the only other non-plumbing changes I made in my patch.

I've updated the patch to

  • use first_exec_segment as suggested by jrtc27
  • set image->pi_entry = eh.e_entry as suggested by brett_gutste.in

I've retested, and the updated patch still works for me..

This revision now requires review to proceed.Dec 2 2021, 4:55 PM
lib/libpmcstat/libpmcstat_image.c
459

I'd prefer to keep this as a separate change. I believe this is almost never read, with the sole exception being for creating rtld's image map.

I need to think more about what exactly this is doing to check this is ok, but unless you're profiling rtld (or its mapping happens to be miscalculated as aliasing another library) this shouldn't really be needed.

  • keep image->pi_entry = eh.e_entry as requested by jrt27

OK, I'm happy to keep this as a separate change. I mostly just want to get this committed so I can bring it down into our (netflix) tree and not carry a diff going forward..

Any objection to me committing this change ?

Any objection to me committing this change ?

Mind if I reword the description?

Any objection to me committing this change ?

Mind if I reword the description?

Please feel free!

Any objection to me committing this change ?

Mind if I reword the description?

Please feel free!

Should be correct now

This revision is now accepted and ready to land.Dec 6 2021, 4:35 PM

Do you intend to commit this soon? If not I'm happy to commit it as Reported+Tested by you instead.

Do you intend to commit this soon? If not I'm happy to commit it as Reported+Tested by you instead.

Sorry for the delay, I just pushed it.

Do you intend to commit this soon? If not I'm happy to commit it as Reported+Tested by you instead.

Sorry for the delay, I just pushed it.

Thanks. Will you be MFC'ing it (there's no MFC after: tag in your commit message)?

Do you intend to commit this soon? If not I'm happy to commit it as Reported+Tested by you instead.

Sorry for the delay, I just pushed it.

Thanks. Will you be MFC'ing it (there's no MFC after: tag in your commit message)?

I was not planning to MFC it. I don't have -stable test machines, and I'm a bit paranoid about breaking things in -stable.