Page MenuHomeFreeBSD

tty: fix improper backspace behaviour for UTF8 characters when in canonical mode
ClosedPublic

Authored by bnovkov on Oct 3 2023, 5:12 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Dec 2, 4:35 PM
Unknown Object (File)
Sat, Nov 30, 11:51 AM
Unknown Object (File)
Nov 7 2024, 8:50 PM
Unknown Object (File)
Nov 7 2024, 5:30 PM
Unknown Object (File)
Nov 7 2024, 5:23 PM
Unknown Object (File)
Nov 7 2024, 5:02 PM
Unknown Object (File)
Nov 7 2024, 2:25 PM
Unknown Object (File)
Nov 7 2024, 1:54 PM
Subscribers

Details

Summary

This patch adds additional logic in ttydisc_rubchar to properly handle backspace behaviour for UTF-8 characters.

Currently, typing in a backspace after a UTF8 character will delete only one byte from the byte sequence, leaving garbled output in the tty's output queue.
With this change all of the character's bytes are deleted.
This change is only active when the IUTF8 flag is set, which can be set using the changes from the first patch D42066.

The code uses the teken_wcwidth function to properly handle character column widths for different code points, and adds the teken_utf8_bytes_to_codepoint function
that converts a UTF-8 byte sequence to a codepoint, as specified in RFC3629.

Test Plan

I've tested the patch with characters encoded byte sequences of varying lengths (1-4 bytes), including some malformed byte sequences.
After applying D42066, follow the steps listed below for quick way to test this change:

$ stty iutf8
$ cat
㹼㹼(backspace)(enter)

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/kern/tty_ttydisc.c
878

are all "glyphs" either single or double wide?

christos added inline comments.
sys/kern/tty_ttydisc.c
873

I'm not familiar at all with the teken code, but teken_wcwidth() seems to be returning -1, 0, 1 and 2, but here I think you're handling only the cases where it returns either 1 or 2. Will this work properly for the other return values?

Address @christos 's comments - properly handle zero width characters.

sys/kern/tty_ttydisc.c
873

Thank you for catching this, the code handling the 0 case properly as there are some legitimate "zero width" UTF8 sequences.
As for the -1 case, I think that it cannot occur due to the conditions checked prior to invoking wcwidth(). From what I see, -1 is returned when an ascii character is detected.

878

So, as far as teken_wcwidth is concerned, yes, with the exception of "zero-width" characters that the patch wasn't handling properly (fixed now).
It's hard to find concrete information on this, there is one Unicode technical report that defines full-width and half-width CJK characters (the actual list of characters can be found here). These definitions are already present in teken_wcwidth().

This comment was removed by christos.

Tested both patches and they seem to run without problems. Is there a reason we don't want the IUTF8 flag to be set by default? At least in my opinion, backspacing UTF-8 characters is common enough that this should be a "builtin" feature, instead of having to run stty iutf8 in a startup script or do it manually. That being said, I am not exactly aware of the side effects (if any) this could have.

Tested both patches and they seem to run without problems. Is there a reason we don't want the IUTF8 flag to be set by default? At least in my opinion, backspacing UTF-8 characters is common enough that this should be a "builtin" feature, instead of having to run stty iutf8 in a startup script or do it manually. That being said, I am not exactly aware of the side effects (if any) this could have.

I agree, but I generally tend to avoid the "on by default" policy for new changes until they've been around for some time. I guess that this is still up for discussion, but I'd personally keep it off by default for now.

In D42067#960854, @bojan.novkovic_fer.hr wrote:

Tested both patches and they seem to run without problems. Is there a reason we don't want the IUTF8 flag to be set by default? At least in my opinion, backspacing UTF-8 characters is common enough that this should be a "builtin" feature, instead of having to run stty iutf8 in a startup script or do it manually. That being said, I am not exactly aware of the side effects (if any) this could have.

I agree, but I generally tend to avoid the "on by default" policy for new changes until they've been around for some time. I guess that this is still up for discussion, but I'd personally keep it off by default for now.

Fair enough. I would still like to keep this in mind though and perhaps enable it by default in the future.

This revision is now accepted and ready to land.Oct 7 2023, 4:37 PM

The default locale is UTF-8 so on by default isn't terrible.