Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

Subject: Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

Date: Sat, 24 Aug 2019 16:39:10 +0200 (CEST)

To: David Bremner

Cc: Notmuch

From: yury.t


Although this thread may be off-topic by now, let me send a follow-up.
By searching with C-related terms, I found some articles about this issue.  It seems to be a common problem with regex and multibyte characters in C (e.g. https://stackoverflow.com/a/15895746).

On Wed, Aug 21, 2019 at 12:58:04PM +0000, tptlab@tuta.io wrote:
> - [1] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

Actually, it becomes [\xef\xbc\x91], i.e. a bracket expression over the three UTF-8 bytes of U+FF11.  That's why it matches U+Fxxx (which starts with \xef in UTF-8).  And without ^, it matches a single byte inside other characters, for example U+4444 (\xe4\x91\x84) or U+5C11 (\xe5\xb0\x91), which both contain \x91.
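
To make the byte-level behaviour concrete, here is a small sketch in Python rather than C.  It is only an illustration of the effect described above, not the code notmuch runs: it assumes that a byte-oriented regex engine sees the bracket expression as the set of raw UTF-8 bytes, and the helper names (byte_class, byte_match, char_match) are made up for this example.

```python
import re

FULLWIDTH_ONE = "\uff11"  # U+FF11, encoded in UTF-8 as ef bc 91


def byte_class(ch: str) -> "re.Pattern[bytes]":
    """Build a bracket expression from the individual UTF-8 bytes of ch.

    For U+FF11 this yields the bytes pattern [\\xef\\xbc\\x91], which is
    the assumed view of a byte-oriented engine (hypothetical helper).
    """
    body = b"".join(b"\\x%02x" % b for b in ch.encode("utf-8"))
    return re.compile(b"[" + body + b"]")


def byte_match(ch: str, text: str, anchored: bool = False) -> bool:
    """Match text byte-wise; anchored=True behaves like a leading ^."""
    pattern = byte_class(ch)
    data = text.encode("utf-8")
    found = pattern.match(data) if anchored else pattern.search(data)
    return found is not None


def char_match(ch: str, text: str) -> bool:
    """Match text character-wise, as a Unicode-aware engine would."""
    return re.search("[" + ch + "]", text) is not None


if __name__ == "__main__":
    for sample in ("\uf000", "\u4444", "\u5c11"):
        print(
            "U+%04X" % ord(sample),
            sample.encode("utf-8").hex(),
            "anchored:", byte_match(FULLWIDTH_ONE, sample, anchored=True),
            "unanchored:", byte_match(FULLWIDTH_ONE, sample),
            "char-wise:", char_match(FULLWIDTH_ONE, sample),
        )
```

With the anchor, only input whose first byte is \xef, \xbc or \x91 can match, which is why U+Fxxx shows up.  Without it, any character containing one of those three bytes matches, while the character-wise comparison matches none of the three samples.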

I'm not familiar with C and don't know whether pcre or \k would solve this issue, but it might be hard to fix if the root cause is how C handles multibyte strings.
_______________________________________________
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch
