Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).
fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI.
The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.
fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd can be
SPACECTL_DEALLOC, and flags has to be 0 for SPACECTL_DEALLOC
operation.
fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.
Submitted by: Ka Ho Ng <khng@freebsdfoundation.org>
Sponsored by: The FreeBSD Foundation
I think it's worth commenting here that the reason we are CTASSERT'ing the size was because we wanted to make sure that for both 32-bit and 64-bit variants, the struct have the same layout (and thus the compat32 call can just use the same struct); having it in the same block of asserting size_t was confusing for me in the first glance.
Is this intentional? If so, could you please add a comment explaining why we are ignoring ERESTART/EINTR/EWOULDBLOCK? (I think this is different from short read/write and these shouldn't be ignored?)
I believe that was already discussed, and now you returned back the same errant behavior. Unix convention is that error return is only acceptable if the file content was not changed at all. If syscall did modified the file, regardless was it completed WRT requested op, or just a partial execution, userspace must know exactly what was done.
I think that this requires translation for compat32, because spacectl_range contains off_t. Look how freebsd32_lseek() handles that, in particular the big- vs. little-endian versions of PAIR32TO64.
I wonder if it make sense to change the fspacectl(2) interface while not too late and split in and out spacectl_range. Basically keep original spacectl_range as is, and copy out optional processed length:
fspacectl(int fd, int cmd, const struct spacectl_range *sr, int flags, off_t *length);
This was copied from shm_dotruncate_locked(), right? It misses the vm_pager_has_page() case. Basically, the page that you need to partially zero might be not resident but swap contains the user data. So we need to read the page from swap.
Perhaps you should refactor code to instantiate the page from shm_dotruncate_locked() and reuse it. Perhaps the code should take some control which region of the page should be cleared, to be passed to pmap_zero_page_area().
BTW, I suggest to remove the error != 0 check at all. Copy *rqsr into *rmsr at the very beginning of kern_fspacectl() if rmsr != NULL, and it should be fine.
Minor change in kern_fspacectl(2): No need to copy rqsr into *rmsrp in the very beginning since things in *rmsrp is useless if error != 0. This reduces the lines further.
I still insist on copying out rmsr always, not only on error == 0. For instance, if errno == ENOSPC, you still might have modified parts of the file, and it is important for usermode to know.