On Sun, Feb 14, 2016 at 08:22:24AM -0400, David Bremner wrote:
> W. Trevor King writes:
> >     for tag in tags:
> >         _LOG.debug('building a quoted path for {!r} / {!r}'.format(id, tag))
> >         path = 'tags/{id}/{tag}'.format(
> >             id=_hex_quote(string=id), tag=_hex_quote(string=tag))
> >         yield '{mode} {hash}\t{path}\n'.format(mode=mode, hash=hash, path=path)
> >
>
> I think the problem is not a bad tag, but a bad message-id. The last
> line of output before the UnicodeWarning and the broken pipe is
>
> building a quoted path for u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca' / u'unread'

  $ ln -s nmbug nmbug.py
  $ python2 -W error -c "import nmbug; nmbug._hex_quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca')"
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "nmbug.py", line 106, in _hex_quote
      uppercase_escapes = _quote(string, safe)
    File "/usr/lib64/python2.7/urllib.py", line 1303, in quote
      return ''.join(map(quoter, s))
  UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

The problem seems to be having Unicode characters in either quote
argument:

  $ python2 -W error -c "import urllib; urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca')"
  …
  UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  $ python2 -W error -c "import urllib; urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca', u'+@=:,')"
  …
  UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  $ python2 -W error -c "import urllib; urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca'.encode('utf-8'), u'+@=:,')"
  …
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33: ordinal not in range(128)
  $ python2 -W error -c "import urllib; print(urllib.quote(u'D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@\xd1\xe5\xf0\xe3\xe5\xe9-\xcf\xca'.encode('utf-8'), u'+@=:,'.encode('utf-8')))"
  D1B4DEBCAFFC4A05A4D4349A6EC5C9D8@%C3%91%C3%A5%C3%B0%C3%A3%C3%A5%C3%A9-%C3%8F%C3%8A

So quoting only works when both the string and the safe characters are
encoded to bytes first.  There are related Python issues [1,2,3,4,5].
[2] led to the currently working Python 3 implementation, which
encodes to UTF-8 by default and has an ‘encoding’ option [6].  There's
some useful background in [7].

For compatibility with Python 3, I suggest patching _hex_quote to take
an encoding option, defaulting to UTF-8, and encoding both strings
that are passed to _quote.  We should probably raise a ValueError if
the length of the encoded safe characters doesn't match the length of
the Unicode safe characters, because the caller will probably not
expect the byte-level quoting that a multi-byte safe character would
cause.  Python 3 covers that by restricting the safe characters to
ASCII [6], although passing non-ASCII characters with safe doesn't
seem to raise an exception:

  $ python3 -c "from urllib.parse import quote; print(quote('\u0091', '\u0091'))"
  %C2%91
  $ python3 -c "from urllib.parse import quote; print(quote('\u203b', '\u203b'))"
  %E2%80%BB

Anyhow, I'll file a patch adding UTF-8 encoding so Python 2 works like
Python 3.

Cheers,
Trevor

[1]: http://bugs.python.org/issue2637
[2]: http://bugs.python.org/issue3300
[3]: http://bugs.python.org/issue22231
[4]: http://bugs.python.org/issue23885
[5]: http://bugs.python.org/issue1712522
[6]: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.quote
[7]: https://mail.python.org/pipermail/python-dev/2006-July/067335.html

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy
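
P.S. For reference, here is a rough sketch of the _hex_quote change
suggested above.  It is not the actual patch: the name
hex_quote_sketch is hypothetical, the '+@=:,' default for safe is
assumed from the commands above, and any post-processing the real
_hex_quote does on the quoted result is left out.  It assumes both
arguments are Unicode strings.

```python
try:
    from urllib.parse import quote as _quote  # Python 3
except ImportError:
    from urllib import quote as _quote  # Python 2


def hex_quote_sketch(string, safe=u'+@=:,', encoding='utf-8'):
    """Quote a Unicode string after encoding both arguments to bytes.

    Hypothetical stand-in for the proposed _hex_quote change; only the
    encoding step and the safe-length check are shown.
    """
    encoded_string = string.encode(encoding)
    encoded_safe = safe.encode(encoding)
    # A safe character that needs more than one byte would be matched
    # byte-by-byte, which the caller probably does not expect.
    if len(encoded_safe) != len(safe):
        raise ValueError(
            'safe characters {!r} are not single bytes in {}'.format(
                safe, encoding))
    # Passing bytes for both arguments avoids the implicit ASCII
    # conversions that trigger the UnicodeWarning on Python 2.
    return _quote(encoded_string, encoded_safe)
```

With the message-id from David's log this should give the same
%C3%91... result as the last python2 command above, and a safe
argument like u'\u203b' should raise the ValueError.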