Page MenuHomeFreeBSD

RISC-V pmap: remove incorrect assertions in pmap_demote_l2_locked
AbandonedPublic

Authored by freebsdphab-AX9_cmx.ietfng.org on May 30 2021, 2:12 PM.
Tags
Referenced Files
Unknown Object (File)
Mon, Nov 25, 4:49 AM
Unknown Object (File)
Mon, Nov 25, 4:48 AM
Unknown Object (File)
Tue, Nov 12, 7:41 AM
Unknown Object (File)
Sun, Nov 10, 3:37 AM
Unknown Object (File)
Sun, Nov 10, 12:13 AM
Unknown Object (File)
Sat, Nov 9, 5:40 PM
Unknown Object (File)
Sat, Nov 9, 4:52 PM
Unknown Object (File)
Sat, Nov 9, 3:19 PM

Details

Reviewers
markj
Summary

pmap_demote_l2_locked contains two bad assertions:

  1. The check for PTE_A can never fire: the series of conditionals immediately prior will certainly return from this function if PTE_A is not asserted.
  2. The check for PTE_D can fire in legitimate circumstances; see the attached test program for a simple reproducer which creates fully materialized and writeable yet clean large pages.

Just remove both of these assertions.

Test Plan

Run this program under RISC-V QEMU both before and after this change. Before, the kernel will panic while executing the mmap after "Now causing trouble"; after, the program reaches its end.

#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

int
main()
{
  int res;

  int nps = getpagesizes(NULL, 0);

  assert(nps > 1);

  size_t ps[nps];
  res = getpagesizes(ps, nps);
  assert(res == nps);

  fprintf(stderr, "Have %d page sizes: %zd %zd ...\n", nps, ps[0], ps[1]);

  /* Map two consecutive "size 1" large pages */
  void *p = mmap(NULL, 2 * ps[1], PROT_READ | PROT_WRITE,
                 MAP_ALIGNED_SUPER | MAP_PRIVATE | MAP_ANON, -1, 0);

  assert(p != MAP_FAILED);

  fprintf(stderr, "Mapping at %p; now reading a fair few zeros...\n", p);

  uint64_t sum = 0;
  for (size_t rix = 0; rix < (2 * ps[1])/ps[0]; rix++)
  {
    sum += ((uint64_t*)p)[rix * ps[0] / sizeof(uint64_t)];
  }
  assert(sum == 0);

  /* Ensure that the two pages are large pages */
  char mcc0, mcc1;

  res = mincore(p, ps[0], &mcc0);
  assert(res == 0);

  res = mincore(p + ps[1], ps[0], &mcc1);
  assert(res == 0);

  fprintf(stderr, "Mincores %x %x\n", mcc0, mcc1);

  /*
   * Ensure that we have two materialized, clean superpages.
   *
   * Having read every byte of these pages, 
   */
  static const char magic =
      MINCORE_INCORE            /* Fully materialized in physmem and pmap */
    | MINCORE_REFERENCED

                                /* Lack of MINCORE_MODIFIED: pages clean */

    | MINCORE_REFERENCED_OTHER

    | MINCORE_PSIND(1);        /* Is superpage */

  assert(mcc0 == magic);
  assert(mcc1 == magic);

  fprintf(stderr, "Now causing trouble\n");

  /* Cause ephemeral entry clipping and downgrade of a large page */
  void *q = mmap(p + ps[1] / 2, ps[1], PROT_READ | PROT_WRITE,
                 MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1, 0);

  fprintf(stderr, "Remapped at %p\n", q);

  assert(q != MAP_FAILED);

  return 0;
}

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Can we include the test program as a regression test in tests/sys? No need to rewrite it in ATF, keeping it as a plain C executable seems fine to me.

Have you tried running this on arm64? Its pmap_demote_l2_locked has a similar:

KASSERT((oldl2 & (ATTR_S1_AP_RW_BIT | ATTR_SW_DBM)) !=
    (ATTR_S1_AP(ATTR_S1_AP_RO) | ATTR_SW_DBM),
    ("pmap_demote_l2: L2 entry is writeable but not dirty"));

and arm's pmap_demote_pte1 has:

KASSERT((opte1 & (PTE1_NM | PTE1_RO)) != PTE1_NM,
    ("%s: opte1 has PTE1_NM", __func__));

and amd64's pmap_demote_pde_locked has:

KASSERT((oldpde & (PG_M | PG_RW)) != PG_RW,
    ("pmap_demote_pde: oldpde is missing PG_M"));

so I'm concerned that this assertion is there for a reason and that riscv is doing something wrong instead.

Well, for whatever it's worth, the reproducer seems not to panic on amd64, which is a little surprising as I don't see any special handling for setting PG_M on !VPO_UNMANAGED, PG_RW, psind=1 pages in pmap_enter(), and I'd have thought the MI layers were doing the same thing on both.

The essential difference here is that riscv's pmap_promote_l2() assumes that PTE_D is always set by software, i.e., no hardware management of the dirty bit. We should possibly fix that instead.

To explain further, pmap_promote_l2() checks all 512 PTEs to verify that each 4KB mapping carries the same attributes. As a matter of policy, if all of the PTEs are writeable and some but not all are dirty, we do not carry out the promotion. (After the promotion, we would have to treat the entire 2MB page as dirty, which may in principle lead to unnecessary I/O during a pageout.) Of course, if the hardware can set PTE_D, these checks are racy. So, the amd64 and arm64 pmaps downgrade clean, writable mappings to read-only mappings, forcing a page fault if a dirtying write occurs after a PTE is compared. So after a successful promotion, the 2MB mapping must either be writable and dirty, or read-only. That is what the removed assertion is checking, and it is wrong on riscv since pmap_promote_l2() doesn't handle this problem.

The riscv pmap does not consistently handle the possibility that the accessed and dirty bits are hardware-managed. Is that something we should aim to address in the near future? I am not sure if any of the newer riscv boards implement hardware PTE management. From a look through the specs it doesn't look like there's any facility to query the hardware to see if it's implemented, but maybe there's a standard DTB property or something like that? If we don't consider it to be an important, then removing the assertion is the way to go. I would suggest adding a comment to pmap_promote_l2() to the effect of, "assumes that PTE_D is software-managed."

So after a successful promotion, the 2MB mapping must either be writable and dirty, or read-only. That is what the removed assertion is checking, and it is wrong on riscv since pmap_promote_l2() doesn't handle this problem.

(I am assuming also that the riscv pmap does not set PTE_D without the pmap lock held, but I believe that's still true.)

Yes, there is no way to detect what hardware does short of trying and seeing (the spec mandates all harts must always do dirty tracking, or all harts must never, so it's an easy test if you really want to). QEMU currently implements hardware dirty tracking, with no knob to turn it off. Depending on how it's configured, an FPGA implementation we use also does.

Thanks to @markj for the explanation; this is an incorrect fix.

Thanks to @markj for the explanation; this is an incorrect fix.

I've posted a tentative patch here: https://reviews.freebsd.org/D30644