On Sat, Aug 15 2020, Teemu Likonen wrote:

> The following Unicode's bidirectional control chars are modal so that
> they push a new bidirectional rendering mode to a stack:
>
>     U+202A LEFT-TO-RIGHT EMBEDDING
>     U+202B RIGHT-TO-LEFT EMBEDDING
>     U+202D LEFT-TO-RIGHT OVERRIDE
>     U+202E RIGHT-TO-LEFT OVERRIDE

Good stuff. The implementation looks like a port of the PHP code in
https://www.iamcal.com/understanding-bidirectional-text to Emacs Lisp.
Anyway, nice implementation; it took a bit of time for me to
understand it.

Thoughts:

- Is it slow to always execute it, being a pure Lisp implementation?
  A (string-match "[\u202a-\u202e]" str) check could be done before
  it. (If it were executed often, could a loop with `looking-at`,
  moving point based on `match-end`, be faster?)

- *But* adding U+202C's in `notmuch-sanitize` is doing it too early,
  as some functions truncate the strings afterwards if they are too
  long (e.g. `notmuch-search-insert-authors`), so the added pop
  characters get lost.

- What about https://en.wikipedia.org/wiki/Bidirectional_text#Isolates
  (this was documented in more detail on some page that I cannot find
  anymore)?

(What I noticed when looking at `notmuch-search-insert-authors` is
that it uses `length` to check the length of a string, but that also
counts these bidi mode changing "characters" (as one char each).
`string-width` would be better there, and probably in many other
places.)

(I tried quite a few things, something that could "reset" the stack
with e.g. one invisible tab, but no go (or that was filtered as I
added it to `notmuch-sanitize` ;),

As a final step I did

 (defun notmuch-sanitize (str)
 ...
-  (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str))
+  (replace-regexp-in-string
+   "[\u202A-\u202E\u2066-\u2069]" ""
+   (replace-regexp-in-string "[[:cntrl:]\x7f\u2028\u2029]+" " " str)))

just to test-drop those chars. Probably not good enough ;/)

Tomi

>
> Every mode must be terminated with character U+202C POP
> DIRECTIONAL FORMATTING which pops the mode from the stack. The stack
> is per paragraph. A new text paragraph resets the rendering mode
> changed by these control characters.
>
> This change adds a new function "notmuch-balance-bidi-ctrl-chars"
> which reads its STRING argument and ensures that all push
> characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> pair (U+202C). The function may add more U+202C characters at the end
> of the returned string, or it may remove some U+202C characters. The
> returned string is safe in the sense that it won't change the
> surrounding bidirectional rendering mode. This function should be used
> when sanitizing arbitrary input.
> ---
>  emacs/notmuch-lib.el | 54 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/emacs/notmuch-lib.el b/emacs/notmuch-lib.el
> index 118faf1e..e6252c6c 100644
> --- a/emacs/notmuch-lib.el
> +++ b/emacs/notmuch-lib.el
> @@ -469,6 +469,60 @@ be displayed."
>        "[No Subject]"
>        subject)))
>
> +
> +(defun notmuch-balance-bidi-ctrl-chars (string)
> +  "Balance bidirectional control chars in STRING.
> +
> +The following Unicode's bidirectional control chars are modal so
> +that they push a new bidirectional rendering mode to a stack:
> +U+202A LEFT-TO-RIGHT EMBEDDING, U+202B RIGHT-TO-LEFT EMBEDDING,
> +U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE.
> +Every mode must be terminated with character U+202C POP
> +DIRECTIONAL FORMATTING which pops the mode from the stack. The
> +stack is per paragraph. A new text paragraph resets the rendering
> +mode changed by these control characters.
> +
> +This function reads the STRING argument and ensures that all push
> +characters (U+202A, U+202B, U+202D, U+202E) have a pop character
> +pair (U+202C). The function may add more U+202C characters at the
> +end of the returned string, or it may remove some U+202C
> +characters. The returned string is safe in the sense that it
> +won't change the surrounding bidirectional rendering mode. This
> +function should be used when sanitizing arbitrary input."
> +
> +  (let ((new-string nil)
> +        (stack-count 0))
> +
> +    (cl-flet ((push-char-p (c)
> +                ;; U+202A LEFT-TO-RIGHT EMBEDDING
> +                ;; U+202B RIGHT-TO-LEFT EMBEDDING
> +                ;; U+202D LEFT-TO-RIGHT OVERRIDE
> +                ;; U+202E RIGHT-TO-LEFT OVERRIDE
> +                (cl-find c '(?\u202a ?\u202b ?\u202d ?\u202e)))
> +              (pop-char-p (c)
> +                ;; U+202C POP DIRECTIONAL FORMATTING
> +                (eql c ?\u202c)))
> +
> +      (cl-loop for char across string
> +               do (cond ((push-char-p char)
> +                         (cl-incf stack-count)
> +                         (push char new-string))
> +                        ((and (pop-char-p char)
> +                              (cl-plusp stack-count))
> +                         (cl-decf stack-count)
> +                         (push char new-string))
> +                        ((and (pop-char-p char)
> +                              (not (cl-plusp stack-count)))
> +                         ;; The stack is empty. Ignore this pop character.
> +                         )
> +                        (t (push char new-string)))))
> +
> +    ;; Add possible missing pop characters.
> +    (cl-loop repeat stack-count
> +             do (push ?\x202c new-string))
> +
> +    (seq-into (nreverse new-string) 'string)))
> +
>  (defun notmuch-sanitize (str)
>    "Sanitize control character in STR.
>
> --
> 2.20.1
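
To make the first two thoughts above concrete, here is a rough, untested-in-notmuch sketch of what I mean: a cheap regexp check first so that strings without any of U+202A..U+202E are returned untouched, and balancing done only *after* truncation so the appended U+202C's cannot be cut off. The `my-` names are made up for illustration and are not part of the patch or of notmuch; `truncate-string-to-width` is the stock Emacs function, used here only as a stand-in for whatever truncation the callers really do. Isolates (U+2066..U+2069) are not handled.

```elisp
(require 'cl-lib)

;; Hypothetical name, not in notmuch.  The range covers the four push
;; chars plus U+202C POP DIRECTIONAL FORMATTING.
(defconst my-bidi-ctrl-regexp "[\u202a-\u202e]")

(defun my-balance-bidi-ctrl-chars (string)
  "Return STRING with U+202A..U+202E balanced.
Pop chars without a matching push are dropped; missing pop chars
are appended at the end."
  (if (not (string-match-p my-bidi-ctrl-regexp string))
      string                            ; fast path: nothing to balance
    (let ((depth 0) (start 0) (pieces nil))
      (while (string-match my-bidi-ctrl-regexp string start)
        (let* ((pos (match-beginning 0))
               (char (aref string pos)))
          ;; Copy the ordinary text in front of the control char.
          (push (substring string start pos) pieces)
          (cond ((/= char #x202c)       ; one of the four push chars
                 (cl-incf depth)
                 (push (string char) pieces))
                ((> depth 0)            ; pop with a matching push
                 (cl-decf depth)
                 (push (string char) pieces)))
          ;; A pop char seen with an empty stack falls through the
          ;; `cond' and is dropped.
          (setq start (1+ pos))))
      (push (substring string start) pieces)
      ;; Close whatever is still open.
      (push (make-string depth #x202c) pieces)
      (apply #'concat (nreverse pieces)))))

(defun my-truncate-and-balance (string width)
  "Truncate STRING to WIDTH display columns, then balance bidi chars.
Balancing comes last so the added pop chars survive truncation."
  (my-balance-bidi-ctrl-chars
   (truncate-string-to-width string width)))
```

Something like (my-truncate-and-balance authors 20) at the point of insertion would then replace balancing inside `notmuch-sanitize`. Handling isolates the same way would presumably need a second counter, since U+2069 POP DIRECTIONAL ISOLATE closes U+2066..U+2068 independently of U+202C.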