>>> s = 'auszuschließen'
>>> print(s.encode('ascii', errors='xmlcharrefreplace'))
b'auszuschlie&#223;en'
>>> print(str(s.encode('ascii', errors='xmlcharrefreplace'), 'ascii'))
auszuschlie&#223;en
Is there a prettier way to print any string without the b''?
EDIT:
I'm just trying to print escaped characters from Python, and my only gripe is that Python adds "b''" when I do that.
If I wanted to see the actual character in a dumb terminal like Windows 7's, then I get this:
Traceback (most recent call last):
  File "Mailgen.py", line 378, in <module>
    marked_copy = mark_markup(language_column, item_row)
  File "Mailgen.py", line 210, in mark_markup
    print("TP: %r" % "".join(to_print))
  File "c:\python32\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 29: character maps to <undefined>
>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschlie&#223;en&#8230;'
>>> print(b)
b'auszuschlie&#223;en&#8230;'
>>> b.decode()
'auszuschlie&#223;en&#8230;'
>>> print(b.decode())
auszuschlie&#223;en&#8230;
You start out with a Unicode string. Encoding it to ascii creates a bytes object with the characters you want. Python won't print it without converting it back into a string and the default conversion puts in the b and quotes. Using decode explicitly converts it back to a string; the default encoding is utf-8, and since your bytes only consist of ascii which is a subset of utf-8 it is guaranteed to work.
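The round trip described above can be wrapped in a small helper. This is a minimal sketch for Python 3; the name xml_escape_non_ascii is made up for illustration and is not part of any library.

```python
def xml_escape_non_ascii(text):
    """Return text with each non-ASCII character replaced by an XML character reference."""
    # encode() yields bytes that contain only ASCII; decode() turns them back
    # into a str, so print() shows the text without the b'' wrapper.
    return text.encode('ascii', errors='xmlcharrefreplace').decode('ascii')


print(xml_escape_non_ascii('auszuschließen…'))  # -> auszuschlie&#223;en&#8230;
```

Decoding with 'ascii' rather than the default makes the intent explicit; either works here because the bytes are pure ASCII.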
To see ascii representation (like repr() on Python 2) for debugging:
print(ascii('auszuschließen…'))
# -> 'auszuschlie\xdfen\u2026'
To print bytes:
import sys
sys.stdout.buffer.write('auszuschließen…'.encode('ascii', 'xmlcharrefreplace'))
# -> auszuschlie&#223;en&#8230;
It's true that not all terminals can handle more than some 8-bit character set. But they won't handle more than that no matter what you do.
Printing a Unicode string will, assuming that your OS sets up the terminal properly, give the best result possible, which means that the characters the terminal cannot print will be replaced with some other character, such as a question mark. Doing that translation yourself will not really improve things.
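If the terminal does not do that replacement for you (the traceback in the question shows print raising instead), the substitution can be done explicitly. The following is a sketch only, for Python 3; degrade_for_encoding is a hypothetical helper name, and falling back to ASCII when stdout reports no encoding is an assumption of this example.

```python
import sys


def degrade_for_encoding(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    # Assumption: fall back to ASCII when stdout has no known encoding.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)


print(degrade_for_encoding('auszuschließen…', 'ascii'))  # -> auszuschlie?en?
```

Passing 'cp437' instead of 'ascii' would keep every character that code page can represent and replace only the rest.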
Update:
Since you want to know what characters are in the string, you actually want to know the Unicode codes for them, or the XML equivalent in this case. That's more inspecting than printing, and then usually the b'' part isn't a problem per se.
But you can get rid of it easily and hackily like so:
print(repr(s.encode('ascii', errors='xmlcharrefreplace'))[2:-1])
Since you're using Python 3, you're afforded the ability to write print(s) to the console.
I can agree that, depending on the console, it may not be able to print properly, but I would imagine that most modern OSes since 2006 can handle Unicode strings without too much of an issue. I'd encourage you to give it a try and see if it works.
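Alternatively, you can declare the source encoding by placing this on the first or second line of the file (right after the shebang, if there is one):
# -*- coding: utf-8 -*-
This tells the interpreter to read the source file as UTF-8. It affects how non-ASCII literals in the file are decoded, not how the console encodes output.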
Related
I get a string which includes Unicode characters. But the backslashes are escaped. I want to remove one backslash so python can treat the Unicode in the right way.
Using replace I am only able to remove and add two backslashes at a time.
my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\', '')
'\\uD83D\\uDE01\\n\\uD83D\\uDE01' should be '\uD83D\uDE01\n\uD83D\uDE01'
edit:
Thank you for the many responses. You are right, my example was wrong. Here are other things I have tried:
my_str = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
my_str2 = my_str.replace('\\\\', '\\') # no unicode
my_str2 = my_str.replace('\\', '')
That's… probably not going to work. Escape sequences are handled during lexical analysis (parsing). What you have in your string is already a single backslash; the doubled form is just the escaped representation of that single backslash:
>>> r'\u3d5f'
'\\u3d5f'
What you need to do is encode the string to be "python source" then re-decode it while applying unicode escapes:
>>> my_str.encode('utf-8').decode('unicode_escape')
'\ud83d\ude01\n\ud83d\ude01'
However, note that these code points are surrogates, so your string is not well-formed. You won't be able to print it, for example, because the UTF-8 encoder will reject it:
>>> print(my_str.encode('utf-8').decode('unicode_escape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
To fix that, you need a second fixup pass: encode to UTF-16 letting the surrogates pass through directly (using the "surrogatepass" mode) then do proper UTF-16 decoding back to an actual well-formed string:
>>> print(my_str.encode('utf-8').decode('unicode_escape').encode('utf-16', 'surrogatepass').decode('utf-16'))
😁
😁
You may really want to analyse where your data comes from, though. It is not logically valid to get a (unicode) string with unicode escapes in it; it might be the result of loading JSON data incorrectly, or something similar. If it's an option (I realise that's not always the case), fixing that would be much better than applying hacky fixups afterwards.
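The two fixup passes can be combined into one function, and compared with what a JSON parser does with the same text. This is a sketch for Python 3; unescape_with_surrogates is a made-up name, and treating the data as a JSON string fragment is only a guess about its origin.

```python
import json


def unescape_with_surrogates(text):
    """Interpret literal \\uXXXX escapes, then merge surrogate pairs into real characters."""
    # Pass 1: turn the backslash escapes into code points (surrogates included).
    unescaped = text.encode('utf-8').decode('unicode_escape')
    # Pass 2: let the surrogates through to UTF-16, then decode them properly.
    return unescaped.encode('utf-16', 'surrogatepass').decode('utf-16')


raw = '\\uD83D\\uDE01\\n\\uD83D\\uDE01'
fixed = unescape_with_surrogates(raw)
print(ascii(fixed))  # -> '\U0001f601\n\U0001f601'

# If the text is really a fragment of JSON, the json module handles the escapes
# (including surrogate pairs) by itself.
print(ascii(json.loads('"' + raw + '"')))  # -> '\U0001f601\n\U0001f601'
```

Note that the first pass assumes the input is ASCII apart from the escapes; non-ASCII characters already present in the string would be mangled by unicode_escape.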
A simple example:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import traceback
e_u = u'abc'
c_u = u'中国'
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()
reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()
output:
ascii
abc
Traceback (most recent call last):
  File "test_codec.py", line 15, in <module>
    print c_u.decode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
utf-8
abc
中国
A few problems have troubled me for days while trying to thoroughly understand codecs in Python, and I want to check that my understanding is right:
Under the ascii default encoding, u'abc'.decode('utf-8') raises no error, but u'中国'.decode('utf-8') does.
I think that when I call u'中国'.decode('utf-8'), Python checks and finds that u'中国' is unicode, so it first tries u'中国'.encode(sys.getdefaultencoding()). That is the step that fails, which is why the exception is a UnicodeEncodeError rather than a decode error.
But u'abc' has the same code points as 'abc' (all < 128), so there is no error.
In Python 2.x, how does Python store a variable's value internally? If all characters in a string are < 128, is it treated as ascii, and if any are > 128, as utf-8?
In [4]: chardet.detect('abc')
Out[4]: {'confidence': 1.0, 'encoding': 'ascii'}
In [5]: chardet.detect('abc中国')
Out[5]: {'confidence': 0.7525, 'encoding': 'utf-8'}
In [6]: chardet.detect('中国')
Out[6]: {'confidence': 0.7525, 'encoding': 'utf-8'}
Short answer
You have to use encode(), or leave it out. Don't use decode() with unicode strings, that makes no sense. Also, sys.getdefaultencoding() doesn't help here in any way.
Long answer, part 1: How to do it correctly?
If you define:
c_u = u'中国'
then c_u is already a unicode string, that is, it has already been decoded from byte string (of your source file) to a unicode string by the Python interpreter, using your -*- coding: utf-8 -*- declaration.
If you execute:
print c_u.encode('utf-8')
your string will be encoded back to UTF-8 and that byte string is sent to the standard output. Note that this usually happens automatically for you, so you can simplify this to:
print c_u
Long answer, part 2: What's wrong with c_u.decode()?
If you execute c_u.decode(), Python will:
1. Try to convert your object (i.e. your unicode string) to a byte string
2. Try to decode that byte string to a unicode string
Note that this doesn't make any sense if your object is a unicode string in the first place: you just convert it back and forth. But why does that fail? Well, it is a strange quirk of Python 2 that the first step (1.), like any implicit conversion from unicode strings to byte strings, usually uses sys.getdefaultencoding(), which in turn defaults to ASCII. In other words,
c_u.decode()
translates roughly to:
c_u.encode(sys.getdefaultencoding()).decode()
which is why it fails.
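To see the hidden step in isolation, here is a sketch that spells it out. It is written for Python 3 (where str no longer has a decode method), so it imitates the Python 2 behaviour described above rather than reproducing it; py2_style_decode is a hypothetical name.

```python
def py2_style_decode(text, default_encoding='ascii'):
    """Mimic Python 2's unicode.decode('utf-8'): encode implicitly, then decode."""
    as_bytes = text.encode(default_encoding)  # the hidden step that fails for non-ASCII
    return as_bytes.decode('utf-8')


print(py2_style_decode('abc'))  # -> abc

try:
    py2_style_decode('\u4e2d\u56fd')  # the two characters from the question
except UnicodeEncodeError as exc:
    print(type(exc).__name__, exc.reason)  # -> UnicodeEncodeError ordinal not in range(128)
```

Passing 'utf-8' as default_encoding makes the second call succeed, which mirrors what happened in the question after sys.setdefaultencoding('utf-8').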
Note that while you may be tempted to change that default encoding, don't forget that other third-party libraries may contain similar issues, and might break if the default encoding is different from ASCII.
Having said that, I strongly believe that Python would be better off if unicode.decode() had never been defined in the first place. Unicode strings are already decoded; there's no point in decoding them once more, especially in the way Python does.
I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time when I don't add it I get errors.
Is there any way to tell python - what I thought was an up-to-date & modern language to just use unicode for strings and not make me have to fart around with .encode('utf-8') stuff?
I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...
For example:
url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Update
If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python then I get the following, same as if I didn't add the # -*- coding: utf-8 -*- at all.
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "classes.py", line 583, in <module>
    wiki.getPage(title)
  File "classes.py", line 146, in getPage
    url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xf1'
I'm not manually typing in any string; I am parsing HTML and JSON from websites. So the scripts/bytestreams/whatever they are, are all created by python.
Update 2 I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but it looks like after 3 days of no luck I'll just try a different language. It's a shame, since python is preinstalled on OS X. I've marked as correct the answer that fixed the one instance of the error I posted.
This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:
from __future__ import unicode_literals
This will at least ensure that your own literal strings default to unicode rather than str.
There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
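One way to picture "decode immediately, use unicode everywhere" is a helper that normalizes input at the boundary, with encoding left to the last step. The sketch below uses Python 3 for brevity (the question is about 2.7, where the same idea applies with unicode and str); to_text, build_search_url and the example.org URL are all hypothetical.

```python
from urllib.parse import quote


def to_text(value, encoding='utf-8'):
    """Decode bytes on the way in; pass text through untouched."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def build_search_url(title):
    # Hypothetical endpoint, for illustration only.
    base = 'http://example.org/api?srsearch='
    # quote() percent-encodes the text as UTF-8 by default, so encoding happens
    # exactly once, at the boundary where bytes are needed.
    return base + quote(to_text(title))


print(build_search_url(b'ni\xc3\xb1o'))  # -> http://example.org/api?srsearch=ni%C3%B1o
print(build_search_url('ni\u00f1o'))     # -> http://example.org/api?srsearch=ni%C3%B1o
```

Both calls give the same URL because the byte string and the text string are brought to the same type before anything else touches them.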
Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, there are a lot of python coders and python libraries out there that have no idea what encodings they're designed to work with, but are nonetheless designed to deal with some particular encoding since the str type is designed to let the programmer manage the encoding themselves. And you have to think about the encoding each time you use these libraries since they don't support the unicode type themselves.
In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:
python -Werror -municodenazi myprog.py
This will give you a traceback right at the point where unicode leaks into your non-unicode strings, instead of leaving you to troubleshoot this exception way down the road from the actual problem. See my answer on this related question for details.
Yes, define your unicode data as unicode literals:
>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'
You usually want to use '\uxxxx' unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:
# -*- coding: utf-8 -*-
Read the Python Unicode HOWTO for the details, such as default encodings and such (the default source code encoding, for example, is ASCII).
As for your specific example, your title is not a Unicode literal but a python byte string, and python is trying to decode it to unicode for you just so you can encode it again. This fails because the default codec for that implicit decoding is ASCII:
>>> 'å'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:
>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
If you are used to Python 3, then unicode literals in Python 2 (u'') are the new default string type in Python 3, while regular (byte) strings in Python 2 ('') are the same as bytes objects in Python 3 (b'').
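For comparison, this is how the same two types look in Python 3. It is a small illustration only; the variable names are arbitrary.

```python
text = '\u00e5'               # str: a sequence of Unicode code points ('å')
data = text.encode('utf-8')   # bytes: the encoded form

print(type(text).__name__, type(data).__name__)          # -> str bytes
print(data)                                              # -> b'\xc3\xa5'
# Python 3 removes the confusing directions: bytes cannot be encoded,
# and str cannot be decoded.
print(hasattr(data, 'encode'), hasattr(text, 'decode'))  # -> False False
```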
If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:
if isinstance(title, unicode):
    title = title.encode('utf-8')
You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.
Make sure that title in your title.encode("utf-8") is of type unicode, and don't use str("İŞşĞğÖöÜü");
use unicode("ĞğıIİiÖöŞşcçÇ") in your stringifiers instead.
Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.
Unfortunately, not many libraries have been written for Python 3, and there are some basic differences in coding and keyword use. That's the problem I have: the libraries I need are only available for Python 2.7, and I don't know enough to convert them to Python 3. :(
I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings.
I send the string u'XüYß' encoded using UTF-8, thus becoming X\u00fcY\u00df (equal to X\xc3\xbcY\xc3\x9f).
The server should simply echo what I sent it, yet it returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a ... unicode string containing the original string encoded using UTF-8.
But Python won't let me decode a unicode string without re-encoding it first, which fails for some reason that escapes me:
>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
How do I persuade Python to re-decode the string? And/or is there any (practical) way of debugging what's actually in the strings, without passing them through all the implicit conversion that print uses?
(And yes, I have reported this behaviour to the developers of the server side.)
ret.decode() tries implicitly to encode ret with the system encoding - in your case ascii.
If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:
>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
In case you run into this sort of mixed data, you can use the codec again, to normalize everything:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:
def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
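To check that Latin-1 really is the missing link, the suspected server behaviour can be reproduced and then reversed. This is a sketch for Python 3; the function names are made up, and the idea that the server misreads UTF-8 bytes as Latin-1 is an assumption that happens to match the bytes shown in the question.

```python
def double_encode(text):
    """Reproduce the suspected bug: UTF-8 bytes misread as Latin-1, then re-encoded."""
    return text.encode('utf-8').decode('latin-1').encode('utf-8')


def undo_double_encode(data):
    """Reverse it: each code point 0-255 maps back to the byte with the same value."""
    return data.decode('utf-8').encode('latin-1').decode('utf-8')


mangled = double_encode('X\u00fcY\u00df')
print(mangled)  # -> b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'
print(undo_double_encode(mangled) == 'X\u00fcY\u00df')  # -> True
```

The mangled bytes are identical to what the server returned, which supports the Latin-1 explanation for this particular string.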
Don't use this! Use @hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)
def double_decode_unicode(s, encoding='utf-8'):
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Here's a little script that might help you, doubledecode.py --
https://gist.github.com/1282752
When I run my program from within the Eclipse IDE, the following piece of code works perfectly:
address_name = self.text_ctrl_address.GetValue().encode('utf-8')
self.address_list = [i for i in data if address_name.upper() in i[5].upper().encode('utf-8')]
But when I run the same piece of code directly with python, I get a "UnicodeDecodeError".
What does the IDE do differently so that it doesn't hit this error?
PS: I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
Edit:
Sorry, I should have given more details. This piece of code belongs to a dialog built with WxPython. The GetValue() function gets text from a line edit widget, and the code tries to match this text against a database. The program runs on Windows (and because of this, Michael Shopsin above may be right: "Win-1252 to UTF-8 is a serious nuisance"). I've read many times that I should always work with unicode and avoid encoding, but if I don't encode, certain string methods don't seem to work very well, depending on the characters in a word (I am in Spain, so there are lots of non-ASCII characters). By "directly" I meant double-clicking the file itself, not running it from within the IDE.
UnicodeDecodeError indicates that the error happens during decoding of a bytestring into Unicode.
In particular, it may happen if you try to encode a bytestring instead of Unicode string on Python 2:
>>> u"\N{EM DASH}".encode('utf-8').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
u"\N{EM DASH}".encode('utf-8') is a bytestring and invoking .encode('utf-8') the 2nd time leads to implicit .decode(sys.getdefaultencoding()) that leads to the UnicodeDecodeError.
What does the IDE do differently so that it doesn't hit this error?
It probably works in the IDE because the IDE changes sys.getdefaultencoding() to utf-8, which is something you should not do. It may hide bugs, as your question demonstrates. In general, it may also break third-party libraries that do not expect a non-ascii sys.getdefaultencoding() on Python 2.
I encode both unicode strings because it is the only way to test one string against another containing letters like ñ or ç.
You should use unicodedata.normalize() instead:
>>> import unicodedata
>>> a, b = u'\xf1', u'n\u0303'
>>> print(a)
ñ
>>> print(b)
ñ
>>> a == unicodedata.normalize('NFC', b)
True
Note: the code in your question may produce surprising results:
#XXX BROKEN, DON'T DO IT
...address_name.upper() in i[5].upper().encode('utf-8')...
address_name.upper() calls the bytes.upper method while i[5].upper() calls the unicode.upper method. The former does not support Unicode and may depend on the current locale; the latter is better, but to perform a case-insensitive comparison, use the .casefold() method instead (available on str in Python 3.3 and later):
key = unicode_address_name.casefold()
... if key == i[5].casefold()...
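Normalization and case folding can be combined into one comparison helper. The sketch below targets Python 3.3 or later and is a simplification of full Unicode caseless matching; fold and contains_caseless are hypothetical names, and the sample strings are invented.

```python
import unicodedata


def fold(text):
    """Case-fold, then compose, so equivalent spellings compare equal."""
    return unicodedata.normalize('NFC', text.casefold())


def contains_caseless(needle, haystack):
    """Substring test on text (not bytes) that ignores case and composition form."""
    return fold(needle) in fold(haystack)


print(contains_caseless('MUN\u0303OZ', 'Calle Mu\u00f1oz 5'))  # -> True
print(contains_caseless('STRASSE', 'Hauptstra\u00dfe'))        # -> True
```

Because both sides stay as text, no encode() call is needed for the comparison at all.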
In general, if you need to sort unicode strings, you could use icu.Collator. Compare the default lexicographical sort:
>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted(L)
[u'angel delight', u'custard', u'gl\xfchwein', u'sandwiches', u'\xe9clairs']
with the order in en_GB locale:
>>> import icu # PyICU
>>> collator = icu.Collator.createInstance(icu.Locale('en_GB'))
>>> sorted(L, key=collator.getSortKey)
[u'angel delight', u'custard', u'\xe9clairs', u'gl\xfchwein', u'sandwiches']
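Where PyICU is not available, a rough approximation is possible with the standard library alone. To be clear, this is not real collation: it simply strips combining marks before comparing, which happens to give the same order for this list but will be wrong for languages where accented letters sort separately. rough_sort_key is a made-up name, and the sketch assumes Python 3.

```python
import unicodedata


def rough_sort_key(text):
    """Crude stand-in for collation: drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ['sandwiches', 'angel delight', 'custard', '\u00e9clairs', 'gl\u00fchwein']
ordered = sorted(words, key=rough_sort_key)
print(ascii(ordered))
# -> ['angel delight', 'custard', '\xe9clairs', 'gl\xfchwein', 'sandwiches']
```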
I could solve the problem changing the encoding from UTF-8 to cp1252 (Windows western europe). Apparently UTF-8 could not encode some Windows characters. Thanks to Michael Shopsin above for the insight.
The program runs on Windows and uses a WxPython dialog, getting values from a line edit widget and matching the string against a database.
Thank you all for the attention, and I hope this post can help people in the future with a similar problem.