tty: fix improper backspace behaviour for UTF8 characters when in canonical mode
ClosedPublic

Authored by bnovkov on Oct 3 2023, 5:12 PM.

Details

Reviewers

christos

Group Reviewers

Contributor Reviews (src)

Commits

rGdbe9ba41bbd7: tty: fix improper backspace behaviour for UTF8 characters when in canonical mode
rGf859afb57fd8: tty: fix improper backspace behaviour for UTF8 characters when in canonical mode
rG817701123233: tty: fix improper backspace behaviour for UTF8 characters when in canonical mode
rG9e589b093857: tty: fix improper backspace behaviour for UTF8 characters when in canonical mode

Summary

This patch adds additional logic in ttydisc_rubchar to properly handle backspace behaviour for UTF-8 characters.

Currently, typing a backspace after a UTF-8 character deletes only one byte of its byte sequence, leaving garbled output in the tty's output queue.
With this change all of the character's bytes are deleted.
This change is only active when the IUTF8 flag is set, which can be set using the changes from the first patch D42066.
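
To illustrate what "deleting all of the character's bytes" involves, here is a minimal userland sketch, not the code from this patch. It assumes a plain byte buffer rather than the tty input queue, and the names `IS_UTF8_CONT` and `last_char_len` are hypothetical. It walks back over trailing continuation bytes (`10xxxxxx`) to the lead byte and, as an assumption about how to treat malformed input, falls back to deleting a single byte when the sequence is inconsistent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for UTF-8 continuation bytes (10xxxxxx). */
#define IS_UTF8_CONT(b) (((b) & 0xC0) == 0x80)

/*
 * Return how many bytes at the end of buf[0..len) a backspace should
 * remove. Assumption: malformed sequences lose one byte only.
 */
size_t
last_char_len(const uint8_t *buf, size_t len)
{
	size_t n = 0, expect;
	uint8_t lead;

	if (len == 0)
		return (0);
	/* Skip up to three trailing continuation bytes. */
	while (n < 3 && n + 1 < len && IS_UTF8_CONT(buf[len - 1 - n]))
		n++;
	/* The byte before them should be the lead byte. */
	lead = buf[len - 1 - n];
	if ((lead & 0x80) == 0x00)
		expect = 1;		/* 0xxxxxxx: ASCII */
	else if ((lead & 0xE0) == 0xC0)
		expect = 2;		/* 110xxxxx */
	else if ((lead & 0xF0) == 0xE0)
		expect = 3;		/* 1110xxxx */
	else if ((lead & 0xF8) == 0xF0)
		expect = 4;		/* 11110xxx */
	else
		expect = 0;		/* stray continuation or invalid byte */
	/* Length announced by the lead byte must match what was found. */
	return (expect == n + 1 ? expect : 1);
}
```

For the three-byte character used in the test plan below, a buffer ending in `E3 B9 BC` would yield 3, so the whole character is removed in one step.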

The code uses the teken_wcwidth function to determine the column width of each code point, and adds the teken_utf8_bytes_to_codepoint function,
which converts a UTF-8 byte sequence to a code point as specified in RFC 3629.
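
For readers unfamiliar with the RFC 3629 layout, the conversion can be sketched roughly as follows. This is an illustrative userland decoder, not the body of teken_utf8_bytes_to_codepoint; the name `utf8_decode`, its signature, and the `UTF8_INVALID` marker are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID 0xFFFFFFFFu	/* hypothetical error marker */

/*
 * Decode exactly one UTF-8 sequence of n bytes (1..4) into a code point.
 * Returns UTF8_INVALID for malformed, overlong, surrogate, or
 * out-of-range input.
 */
uint32_t
utf8_decode(const uint8_t *b, size_t n)
{
	uint32_t cp, min;
	size_t i;

	if (n == 1 && b[0] < 0x80)
		return (b[0]);			/* 0xxxxxxx */
	if (n == 2 && (b[0] & 0xE0) == 0xC0) {
		cp = b[0] & 0x1F;		/* 110xxxxx */
		min = 0x80;
	} else if (n == 3 && (b[0] & 0xF0) == 0xE0) {
		cp = b[0] & 0x0F;		/* 1110xxxx */
		min = 0x800;
	} else if (n == 4 && (b[0] & 0xF8) == 0xF0) {
		cp = b[0] & 0x07;		/* 11110xxx */
		min = 0x10000;
	} else
		return (UTF8_INVALID);

	/* Each continuation byte contributes its low six bits. */
	for (i = 1; i < n; i++) {
		if ((b[i] & 0xC0) != 0x80)
			return (UTF8_INVALID);
		cp = (cp << 6) | (b[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and values past U+10FFFF. */
	if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return (UTF8_INVALID);
	return (cp);
}
```

With this sketch, the bytes `E2 82 AC` decode to U+20AC, while the overlong pair `C0 80` is rejected.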

Test Plan

I've tested the patch with characters encoded as byte sequences of varying lengths (1-4 bytes), including some malformed byte sequences.
After applying D42066, follow the steps listed below for a quick way to test this change:

$ stty iutf8
$ cat
㹼㹼(backspace)(enter)
㹼
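
The `stty iutf8` step can presumably also be done from a program through the standard termios interface. The sketch below is an assumption about usage rather than part of the patch: `enable_iutf8` is a hypothetical helper, and it only compiles the flag in when the system headers define `IUTF8` (which on FreeBSD would require D42066).

```c
#include <termios.h>
#include <unistd.h>

/*
 * Set IUTF8 on the terminal referred to by fd.
 * Returns 0 on success, -1 on failure or when IUTF8 is unavailable.
 */
int
enable_iutf8(int fd)
{
#ifdef IUTF8
	struct termios t;

	if (tcgetattr(fd, &t) == -1)
		return (-1);
	t.c_iflag |= IUTF8;
	return (tcsetattr(fd, TCSANOW, &t));
#else
	(void)fd;	/* flag not defined by these headers */
	return (-1);
#endif
}
```

Calling `enable_iutf8(STDIN_FILENO)` before reading input in canonical mode would stand in for the `stty iutf8` line above.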

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

bnovkov created this revision.Oct 3 2023, 5:12 PM

Herald added a subscriber: imp.Oct 3 2023, 5:12 PM

bnovkov requested review of this revision.Oct 3 2023, 5:12 PM

imp added inline comments.Oct 3 2023, 9:10 PM

sys/kern/tty_ttydisc.c
878	are all "glyphs" either single or double wide?

christos added a subscriber: christos.Oct 4 2023, 5:44 PM

christos added inline comments.

sys/kern/tty_ttydisc.c
873	I'm not familiar at all with the teken code, but teken_wcwidth() seems to be returning -1, 0, 1 and 2, but here I think you're handling only the cases where it returns either 1 or 2. Will this work properly for the other return values?

christos added a reviewer: christos.Oct 4 2023, 5:47 PM

Address @christos's comments: properly handle zero-width characters.

bnovkov added inline comments.Oct 5 2023, 7:14 PM

sys/kern/tty_ttydisc.c
873	Thank you for catching this. The code now handles the 0 case properly, as there are some legitimate "zero-width" UTF-8 sequences. As for the -1 case, I think that it cannot occur due to the conditions checked prior to invoking wcwidth(). From what I see, -1 is returned when an ASCII character is detected.
878	So, as far as `teken_wcwidth` is concerned, yes, with the exception of "zero-width" characters, which the patch wasn't handling properly (fixed now). It's hard to find concrete information on this, but there is one Unicode technical report that defines full-width and half-width CJK characters (the actual list of characters can be found here). These definitions are already present in teken_wcwidth().
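
To make the width cases discussed above concrete, here is a small sketch of how a column width could be mapped to the bytes echoed to the terminal when a character is rubbed out. It is not the code under review: `rub_sequence` is a hypothetical name, and treating every width other than 1 or 2 (including 0 and -1) as "echo nothing" is an assumption made for the example.

```c
#include <stddef.h>

/*
 * Return the echo sequence that erases a character occupying `width`
 * columns: move back, overwrite with spaces, move back again.
 */
const char *
rub_sequence(int width)
{
	switch (width) {
	case 2:
		return ("\b\b  \b\b");	/* double-wide, e.g. full-width CJK */
	case 1:
		return ("\b \b");	/* ordinary single-column glyph */
	default:
		/* Assumption: zero-width or unknown width moves no columns. */
		return ("");
	}
}
```

The bytes of the character are removed from the input buffer in every case; only the visible erase differs by width.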

christos added a comment.Oct 6 2023, 10:37 PM

This comment was removed by christos.

Tested both patches and they seem to run without problems. Is there a reason we don't want the IUTF8 flag to be set by default? At least in my opinion, backspacing UTF-8 characters is common enough that this should be a "builtin" feature, instead of having to run stty iutf8 in a startup script or do it manually. That being said, I am not exactly aware of the side effects (if any) this could have.

In D42067#960711, @christos wrote:

Tested both patches and they seem to run without problems. Is there a reason we don't want the IUTF8 flag to be set by default? At least in my opinion, backspacing UTF-8 characters is common enough that this should be a "builtin" feature, instead of having to run stty iutf8 in a startup script or do it manually. That being said, I am not exactly aware of the side effects (if any) this could have.

I agree, but I generally tend to avoid the "on by default" policy for new changes until they've been around for some time. I guess that this is still up for discussion, but I'd personally keep it off by default for now.

In D42067#960854, @bojan.novkovic_fer.hr wrote:

In D42067#960711, @christos wrote:

Tested both patches and they seem to run without problems. Is there a reason we don't want the IUTF8 flag to be set by default? At least in my opinion, backspacing UTF-8 characters is common enough that this should be a "builtin" feature, instead of having to run stty iutf8 in a startup script or do it manually. That being said, I am not exactly aware of the side effects (if any) this could have.

I agree, but I generally tend to avoid the "on by default" policy for new changes until they've been around for some time. I guess that this is still up for discussion, but I'd personally keep it off by default for now.