Re: Unicode Paths

Subject: Re: Unicode Paths

Date: Fri, 16 Sep 2011 12:58:49 +0200

To: Austin Clements, Martin Owens

Cc: Notmuch developer list

From: Sebastian Spaeth


On Thu, 15 Sep 2011 13:52:12 -0400, Austin Clements <amdragon@mit.edu> wrote:
> On Tue, Sep 13, 2011 at 11:55 PM, Martin Owens <doctormo@gmail.com> wrote:
> > Hello Again,
> >
> > I notice in the lib code notmuch_database_open(),
> > notmuch_database_create() these functions use const char *path for the
> > directory path input. Is this unicode safe?
> >
> > The python bindings (and ctype docs) seem to suggest using something
> > called 'wchar_t *' for accepting unicode but that's for C not C++.
> >
> > Is this something that should be patched?
> 
> char* is the correct type for paths on POSIX systems.  The *meaning*
> of those bytes is a more complicated matter and depends on your locale
> settings.  On old systems it was generally ASCII, on modern systems
> it's generally UTF-8, and it can be many other things.  However, as a
> consequence of UNIX's C heritage, it is *always* terminated with a
> NULL byte and cannot contain embedded NULL's.

Right, that's what we are doing, passing in utf-8 encoded unicode
strings to char*, which should be just fine if that is what the
underlying OS uses.

> wchar_t is another matter entirely.  wchar_t is the type used by C to
> represent wide strings internally, which generally (but not
> necessarily!) means it stores a Unicode code point.  However, this
> isn't an encoding, and different compilers can give wchar_t different
> meanings, so wchar_t strings aren't generally appropriate for storing
> or sharing between processes or with the kernel.

Mmh, I remember I attempted to user wchar_t to pass in unicode objects
directly and it had failed miserably.

Sebastian
part-000.sig (application/pgp-signature)

Thread: