Python character handling in terminal

I'm in an interactive Python 2.7 terminal (the terminal's default encoding is UTF-8). I have a string from the internet; let's call it a:
>>> a
u'M\xfcssen'
>>> a[1]
u'\xfc'
I wonder why its value is not shown as ü, so I try:
>>> print(a)
Müssen
>>> print(a[1])
ü
which works as intended.
So my first question is: what does print a do that is missing when I just type a?
And out of curiosity: why do I get different output for the following in the same Python terminal session?
>>> "ü"
'\xc3\xbc'
>>> print "ü"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> print u"ü"
ü

You have to understand how Python stores the various data types and which functions expect which input. It's all quite confusing and also depends on the locale settings of your terminal.
The following link might help to reduce the confusion: https://pythonhosted.org/kitchen/unicode-frustrations.html
All str objects like "My String" are stored with 8 bits per character. In your case '\xc3\xbc' is the UTF-8 representation of the UMLAUT-U as a str object.
For unicode objects, Python uses a 16-bit or 32-bit integer per character (depending on how the interpreter was built).
The print statement sends str objects to the terminal as raw bytes, and a UTF-8 terminal decodes them itself. That's why the following works:
>>> print '\xc3\xbc'
ü
To turn the UMLAUT-U from a str into a unicode object, you have to tell Python that the str is in UTF-8 representation when you decode it (an extra unicode() call around this is unnecessary, since decode() already returns a unicode object):
>>> '\xc3\xbc'.decode('utf8')
u'\xfc'

what does print a do, which is missing if i just type a?
The interactive >>> prompt outputs values using the Python source code representation of the value, as returned by the repr() function. That's why you get not just \xFC for the ü character but also quote marks around the string. The prompt is trying to show you what you would need to type into a Python program to get the string value you have.
The print statement outputs the raw string conversion of the value, as returned by the str() function.
For some types repr() and str() generate the same output, but this is not the case for strings.
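The same repr()/str() distinction can be reproduced in Python 3, where the interactive prompt behaves identically (a minimal sketch):

```python
s = 'Müssen'
# The >>> prompt displays repr(s): the source-code form, including quotes.
# print uses str(s): just the raw characters.
assert repr(s) == "'Müssen'"
assert str(s) == 'Müssen'
```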

Related

Python to convert special unicode characters (like ♡) to their hexadecimal literals in a string (like 0x2661)

I'm writing a program that pulls screen names and tweets from Twitter into a txt file. Some screen names contain special Unicode characters like ♡. In my Bash terminal these characters show up as an empty box. My SQL insert fails when I try to store this character and tells me it contains an untranslatable character. Is there a way to convert only special characters in Python to their hexadecimal form? I would also be happy just replacing these special characters with a placeholder.
Ideally "screenName♡" would convert to "screenName0x2661", or special characters would be replaced with something like "screenName#REPLACE#".
You can achieve this using the encode method, explained here. From the docs:
Another important method is .encode([encoding], [errors='strict']),
which returns an 8-bit string version of the Unicode string, encoded
in the requested encoding. The errors parameter is the same as the
parameter of the unicode() constructor, with one additional
possibility; as well as ‘strict’, ‘ignore’, and ‘replace’, you can
also pass ‘xmlcharrefreplace’ which uses XML’s character references.
The following example shows the different results:
>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
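Note that xmlcharrefreplace produces XML references like &#9825;, not the 0x2661 form the question asks for. One hedged alternative is a regex substitution over the non-ASCII range; this Python 3 sketch uses a made-up helper name, escape_non_ascii:

```python
import re

def escape_non_ascii(s):
    # Replace every non-ASCII character with its hex code point, e.g. 0x2661.
    return re.sub(r'[^\x00-\x7f]', lambda m: '0x%04x' % ord(m.group()), s)

print(escape_non_ascii('screenName\u2661'))  # screenName0x2661
```

Swapping the lambda for a constant string gives the "#REPLACE#" variant instead.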

Unicode and overriding '__str__'

I'm getting a unicode error only when overriding my class' __str__ method. What's going on?
In Test.py:
class Obj(object):
    def __init__(self):
        self.title = u'\u2018'
    def __str__(self):
        return self.title

print "1: ", Obj().title
print "2: ", str(Obj())
Running this I get:
$ python Test.py
1: ‘
2:
Traceback (most recent call last):
File "Test.py", line 11, in <module>
print "2: ", str(Obj())
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
EDIT: Please don't just say that str(u'\u2018') also raises an Error! (while that may be related). This circumvents the entire purpose of built-in method overloading --- at no point should this code call str(u'\u2018')!!
You're using Python 2.x. str() calls __str__ and expects you to return a string—that is, a str. But you're not; you're returning a unicode object. So str() helpfully tries to convert that to a str since it's what str() is supposed to return.
Now, in Python 2.x strings are sequences of bytes, not codepoints, so Python is trying to convert your Unicode object to a sequence of bytes. Since you didn't (and can't, in this scenario) specify what encoding to use when making the string, Python uses the default encoding of ASCII. This fails because ASCII can't represent the character.
Possible solutions:
Use Python 3, where all strings are Unicode. This will provide you with an entertainingly different set of things to wrap your head around, but this won't be one of them.
Override __unicode__() instead of __str__() and use unicode() instead of str() when converting your object to a string. You still have the problem (shared with Python 3) of how to get that converted into a sequence of bytes that will output correctly.
Figure out what encoding your terminal is using (i.e. sys.stdout.encoding) and have __str__() convert the Unicode object to that encoding before returning it. Note that there's still no guarantee that the character is representable in that encoding; you can't convert your example string to the default Windows terminal encoding, for example. In this case you could fall back to e.g. unicode-escape encoding if you get an exception trying to convert to the output encoding.
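For comparison, here is the Python 3 version of the class from the question, where the problem simply disappears because str is Unicode (a sketch):

```python
# In Python 3, str *is* Unicode, so __str__ can return the title directly.
class Obj:
    def __init__(self):
        self.title = '\u2018'
    def __str__(self):
        return self.title

assert str(Obj()) == '\u2018'  # no encoding happens, so no UnicodeEncodeError
```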
The problem is that str() cannot handle u'\u2018' (a unicode object), since it tries to convert it to ASCII and there is no ASCII character for that code point.
>>> str(u'\u2018')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
>>>

Python Unicode Ascii, Ordinal not In range, frustrating error

Here is my problem...
Database stores everything in unicode.
hashlib.sha256().digest() accepts str and returns str.
When I try to stuff hash function with the data, I get the famous error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 1: ordinal not in range(128)
This is my data
>>> db_digest
u"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>>
>>> hashlib.sha256(db_digest)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x90' in position 1: ordinal not in range(128)
>>>
>>> asc_db_digest
"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> hashlib.sha256(asc_db_digest)
<sha256 HASH object # 0x7f7da0f04300>
So all I am asking for is a way to turn db_digest into asc_db_digest
Edit
I have rephrased the question, as it seems I hadn't recognized the problem correctly in the first place.
If you have a unicode string that only contains code points from 0 to 255 (bytes) you can convert it to a Python str using the raw_unicode_escape encoding:
>>> db_digest = u"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> hash_digest = "'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> db_digest.encode('raw_unicode_escape')
"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> db_digest.encode('raw_unicode_escape') == hash_digest
True
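In Python 3 the same codec still maps code points 0-255 one-to-one onto byte values; a minimal sketch with a short stand-in for the digest string:

```python
s = "'\x90\x017~"                    # text whose code points are all < 256
b = s.encode('raw_unicode_escape')   # each byte equals the character's code point
assert b == b"'\x90\x017~"
assert b.decode('latin-1') == s      # latin-1 inverts the mapping for this range
```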
Hashes operate on bytes (str in Python 2.x, bytes in 3.x), not text strings (unicode in 2.x, str in 3.x). Therefore, you must supply bytes. Try:
hashlib.sha1(salt.encode('utf-8') + data).digest()
The hash will contain "characters" that are in the range 0-255. These are all valid Unicode characters, but it's not a Unicode string. You need to convert it somehow. The best solution would be to encode it into something like base64.
There's also a hacky solution to convert the bytes returned directly into a pseudo-Unicode string, exactly as your database appears to be doing it:
hash_unicode = u''.join([unichr(ord(c)) for c in hash_digest])
You can also go the other way, but this is more dangerous as the "string" will contain characters outside of the ASCII range of 0-127 and might throw errors when you try to use it.
asc_db_digest = ''.join([chr(ord(c)) for c in db_digest])
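The base64 suggestion above can be sketched in Python 3 (the input literal is just an example, not the question's data):

```python
import base64
import hashlib

digest = hashlib.sha256(b'example data').digest()        # raw bytes, values 0-255
text = base64.b64encode(digest).decode('ascii')          # safe to store as text
assert base64.b64decode(text.encode('ascii')) == digest  # round-trips losslessly
```

Storing the base64 text avoids the lossy unicode/bytes conversions entirely.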

Printing escaped Unicode in Python

>>> s = 'auszuschließen'
>>> print(s.encode('ascii', errors='xmlcharrefreplace'))
b'auszuschließen'
>>> print(str(s.encode('ascii', errors='xmlcharrefreplace'), 'ascii'))
auszuschließen
Is there a prettier way to print any string without the b''?
EDIT:
I'm just trying to print escaped characters from Python, and my only gripe is that Python adds "b''" when I do that.
If I wanted to see the actual character in a dumb terminal like Windows 7's, then I get this:
Traceback (most recent call last):
File "Mailgen.py", line 378, in <module>
marked_copy = mark_markup(language_column, item_row)
File "Mailgen.py", line 210, in mark_markup
print("TP: %r" % "".join(to_print))
File "c:\python32\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 29: character maps to <undefined>
>>> s='auszuschließen…'
>>> s
'auszuschließen…'
>>> print(s)
auszuschließen…
>>> b=s.encode('ascii','xmlcharrefreplace')
>>> b
b'auszuschließen…'
>>> print(b)
b'auszuschließen…'
>>> b.decode()
'auszuschließen…'
>>> print(b.decode())
auszuschließen…
You start out with a Unicode string. Encoding it to ascii creates a bytes object with the characters you want. Python won't print it without converting it back into a string and the default conversion puts in the b and quotes. Using decode explicitly converts it back to a string; the default encoding is utf-8, and since your bytes only consist of ascii which is a subset of utf-8 it is guaranteed to work.
To see ascii representation (like repr() on Python 2) for debugging:
print(ascii('auszuschließen…'))
# -> 'auszuschlie\xdfen\u2026'
To print bytes:
sys.stdout.buffer.write('auszuschließen…'.encode('ascii', 'xmlcharrefreplace'))
# -> auszuschließen…
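The decode approach from the transcript above, condensed into a runnable Python 3 sketch:

```python
s = 'auszuschließen\u2026'
b = s.encode('ascii', 'xmlcharrefreplace')  # bytes holding XML character references
assert b == b'auszuschlie&#223;en&#8230;'   # U+00DF -> &#223;, U+2026 -> &#8230;
print(b.decode('ascii'))                    # decoding back to str drops the b''
```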
Not all terminals can handle more than some sort of 8-bit character set, that's true. But then they won't display those characters no matter what you do, really.
Printing a Unicode string will, assuming that your OS sets up the terminal properly, give the best result possible, which means that characters the terminal cannot print are replaced with some character, like a question mark or similar. Doing that translation yourself will not really improve things.
Update:
Since you want to know what characters are in the string, you actually want to know the Unicode codes for them, or the XML equivalent in this case. That's more inspecting than printing, and then usually the b'' part isn't a problem per se.
But you can get rid of it easily and hackily like so:
print(repr(s.encode('ascii', errors='xmlcharrefreplace'))[2:-1])
Since you're using Python 3, you can simply write print(s) to the console.
I can agree that, depending on the console, it may not be able to print properly, but I would imagine that most modern OSes since 2006 can handle Unicode strings without too much of an issue. I'd encourage you to give it a try and see if it works.
Alternatively, you can declare the source file's encoding by placing this before any code lines (similar to a shebang):
# -*- coding: utf-8 -*-
This tells the interpreter that the source file itself is encoded as UTF-8; it does not change how the terminal renders output.

Double-decoding unicode in python

I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings.
I send the string u'XüYß' (i.e. u'X\u00fcY\u00df') encoded using UTF-8, so the bytes on the wire are X\xc3\xbcY\xc3\x9f.
The server should simply echo what I sent it, yet returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8') becomes u'X\xc3\xbcY\xc3\x9f', which looks like a ... unicode-string, containing the original string encoded using UTF-8.
But Python won't let me decode a unicode string without re-encoding it first, which fails for a reason that escapes me:
>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
How do I persuade Python to re-decode the string? - and/or is there any (practical) way of debugging what's actually in the strings, without passing it though all the implicit conversion print uses?
(And yes, I have reported this behaviour with the developers of the server-side.)
ret.decode() tries implicitly to encode ret with the system encoding - in your case ascii.
If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:
>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
In case you run into this sort of mixed data, you can use the codec again, to normalize everything:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points in the range 0-255 you have this in the latin-1 encoding:

def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
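This same latin-1 round trip runs unchanged in Python 3, applied to the exact bytes from the question (a sketch):

```python
def double_decode(bstr):
    # Undo one spurious UTF-8 round trip: bytes -> text -> latin-1 bytes -> text.
    return bstr.decode('utf-8').encode('latin-1').decode('utf-8')

assert double_decode(b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f') == 'X\xfcY\xdf'  # 'XüYß'
```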
Don't use this! Use #hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)

def double_decode_unicode(s, encoding='utf-8'):
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Here's a little script that might help you, doubledecode.py --
https://gist.github.com/1282752
