As a reminder, please add an XXX comment here noting that the implementation is still incomplete: when !can_fault, we don't need to perform a broadcast TLB invalidation, only a local one.
Suppose pmap_qenter() adds a PTE, and then an unrelated pmap_enter() into the kernel map results in promotion of the PTE's L3 table. pmap_qremove() will just clear it, which seems wrong. Is there anything preventing this?
For the same situation on amd64, I tried to come up with an argument for why it is impossible, but failed. Of course, it is highly unlikely, but still.
It does not seem possible to play a trick with some unused PTE bit, since there are no unused bits. Perhaps some combination of attributes would work, e.g. wired but not managed.
It seems that the best approach (for transient I/O mappings only, not for general qenter/qremove) is to allocate three pages of KVA for each single-page mapping, leaving the frames before and after unmapped as guards.
Yes. It stems from the way that we currently manage the kernel virtual address space. Ignoring the faultable submaps within the kernel virtual address space for the moment, the rest of the pmap_enter() calls within the kernel virtual address space happen on virtual addresses from subarenas that import superpage-sized and -aligned address ranges. So, the virtual address that we allocate here can't be close enough to one of those pmap_enter() calls for it to be caught up in a promotion. That said, rather than relying on the good behavior of the rest of the kernel, it would be better engineering to allocate here from a subarena, not the kernel_arena, that makes this guarantee explicit.
But then there must be a gap between the subarenas that use pmap_enter() and our vmem allocations that are served by pmap_qenter(). We do not provide such guards, I believe.
No, if the import into the subarena is superpage aligned and a multiple of the superpage size, which they are. The pmap_qenter() here can't fall within a superpage-sized and -aligned address range on which we do a pmap_enter().
Ah, so it would be a bug to "optimize" KVA allocation by collapsing vm_dom[0].vmd_kernel_arena into kernel_arena on systems with only a single NUMA domain, which is a thought I've had once or twice.
While there would be no harm in having pmap_qenter() set ATTR_SW_NO_PROMOTE, this issue is a concern on most if not all architectures that implement superpage promotion, so ultimately I'd rather see it handled uniformly at the machine-independent layer. pmap_pte_exists(), which is eventually called by pmap_qremove(), already has an assertion similar to the one @kib added to amd64 reporting an unexpected superpage.
@markj, yes. That is why I think we would be wise to make this function allocate from a different arena, one that is itself set up to do superpage-sized and -aligned imports. Could we do this with the existing transient I/O arena?
A bit of modification would be needed in order to use transient_arena here:
we need to make sure that the KVA region imported into the arena is superpage-aligned and -sized;
we need to initialize bio_transient_maxcnt unconditionally; currently it is only set up when unmapped I/O is enabled, but unmapped I/O can be disabled administratively (and we disable it by default in sanitizer kernels).
I think that it's probably still the right direction to go. I'll work on a patch to implement that.