Re: problems with nmbug and empty prefix (UnicodeWarning and broken pipe)

Date: Sun, 14 Feb 2016 14:33:51 -0800

On Sun, Feb 14, 2016 at 08:22:24AM -0400, David Bremner wrote:
> W. Trevor King writes:
> >   for tag in tags:
> >       _LOG.debug('building a quoted path for {!r} / {!r}'.format(id, tag))
> >       path = 'tags/{id}/{tag}'.format(
> >           id=_hex_quote(string=id), tag=_hex_quote(string=tag))
> >       yield '{mode} {hash}\t{path}\n'.format(mode=mode, hash=hash, path=path)
> >
> 
> I think the problem is not a bad tag, but a bad message-id. The last
> line of output before the UnicodeWarning and the broken pipe is
> 
> building a quoted path for u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca' / u'unread'

  $ ln -s nmbug nmbug.py
  $ python2 -W error -c "import nmbug; nmbug._hex_quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca')"
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "nmbug.py", line 106, in _hex_quote
      uppercase_escapes = _quote(string, safe)
    File "/usr/lib64/python2.7/urllib.py", line 1303, in quote
      return ''.join(map(quoter, s))
  UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

The problem seems to be having Unicode characters in either quote argument:

  $ python2 -W error -c "import urllib; urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca')"
  …
  UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  $ python2 -W error -c "import urllib; urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca', u'+@=:,')"
  …
  UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  $ python2 -W error -c "import urllib; urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca'.encode('utf-8'), u'+@=:,')"
  …
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33: ordinal not in range(128)
  $ python2 -W error -c "import urllib; print(urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca'.encode('utf-8'), u'+@=:,'.encode('utf-8')))"
  D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@%C3%91%C3%A5%C3%B0%C3%A3%C3%A5%C3%A9-%C3%8F%C3%8A

Related Python issues: [1,2,3,4,5].  [2] led to the currently working
Python 3 implementation, which encodes to UTF-8 by default and has an
‘encoding’ option [6].  There's some useful background in [7].  For
compatibility with Python 3, I suggest patching _hex_quote to take an
encoding option, defaulting to UTF-8, and encoding both strings that
are passed to _quote.  We should probably raise a ValueError if the
length of the encoded safe characters doesn't match the length of the
Unicode safe characters, because the caller will probably not expect
the byte-level quoting that would cause.  Python 3 covers that by
restricting the safe characters to ASCII [6], although passing
non-ASCII characters with safe doesn't seem to raise an exception:

  $ python3 -c "from urllib.parse import quote; print(quote('\u0091', '\u0091'))"
  %C2%91
  $ python3 -c "from urllib.parse import quote; print(quote('\u203b', '\u203b'))"
  %E2%80%BB
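
For what it's worth, both results above are what you would get with an
empty safe argument, which is consistent with Python 3 quietly ignoring
non-ASCII safe characters rather than honoring them.  A small check
(Python 3 only; the variable names are just for illustration):

```python
from urllib.parse import quote

reference_mark = '\u203b'

# A non-ASCII "safe" character is still percent-encoded ...
with_safe = quote(reference_mark, safe=reference_mark)
# ... exactly as if it had not been listed as safe at all.
without_safe = quote(reference_mark, safe='')
print(with_safe)
print(with_safe == without_safe)

# An ASCII safe character, by contrast, is left unquoted.
print(quote('a@b', safe='@'))
print(quote('a@b', safe=''))
```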

Anyhow, I'll file a patch adding UTF-8 encoding so Python 2 works like
Python 3.
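
A rough sketch of the approach described above, not the actual patch.
hex_quote_sketch and _ESCAPE are hypothetical names, and the
lowercasing step is an assumption based on the uppercase_escapes
variable in the traceback.  It assumes both arguments are Unicode
strings:

```python
import re

try:
    from urllib.parse import quote as _quote  # Python 3
except ImportError:
    from urllib import quote as _quote  # Python 2

# Hypothetical helper: matches the uppercase escapes emitted by quote().
_ESCAPE = re.compile('%[0-9A-F]{2}')


def hex_quote_sketch(string, safe=u'+@=:,', encoding='utf-8'):
    """Quote a Unicode string, encoding both arguments to bytes first."""
    string_bytes = string.encode(encoding)
    safe_bytes = safe.encode(encoding)
    # A multi-byte safe character would turn into byte-level safe
    # values, which the caller probably does not expect.
    if len(safe_bytes) != len(safe):
        raise ValueError(
            'safe characters must encode to one byte each: {!r}'.format(safe))
    quoted = _quote(string_bytes, safe_bytes)
    # Assumption: the escapes are lowercased after quoting.
    return _ESCAPE.sub(lambda match: match.group(0).lower(), quoted)


message_id = u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca'
print(hex_quote_sketch(message_id))
```

With the message-id from David's log, this should give the same
escapes as the last python2 command above, only lowercased.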

Cheers,
Trevor

[1]: http://bugs.python.org/issue2637
[2]: http://bugs.python.org/issue3300
[3]: http://bugs.python.org/issue22231
[4]: http://bugs.python.org/issue23885
[5]: http://bugs.python.org/issue1712522
[6]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote
[7]: https://mail.python.org/pipermail/python-dev/2006-July/067335.html

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy
