Python Unicode ASCII, "ordinal not in range", frustrating error - python

Here is my problem...
Database stores everything in unicode.
hashlib.sha256().digest() accepts str and returns str.
When I try to stuff hash function with the data, I get the famous error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 1: ordinal not in range(128)
This is my data
>>> db_digest
u"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>>
>>> hashlib.sha256(db_digest)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x90' in position 1: ordinal not in range(128)
>>>
>>> asc_db_digest
"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> hashlib.sha256(asc_db_digest)
<sha256 HASH object @ 0x7f7da0f04300>
So all I am asking for is a way to turn db_digest into asc_db_digest
Edit
I have rephrased the question, as it seems I hadn't identified the problem correctly in the first place.

If you have a unicode string that only contains code points from 0 to 255 (bytes) you can convert it to a Python str using the raw_unicode_escape encoding:
>>> db_digest = u"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> hash_digest = "'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> db_digest.encode('raw_unicode_escape')
"'\x90\x017~1\xe0\xaf4\xf2\xec\xd5]:j\xef\xe6\x80\x88\x89\xfe\xf7\x99,c\xff\xb7\x06hXR\x99\xad\x91\x93lM:\xafT\xc9j\xec\xc3\xb7\xea[\x80\xe0e\xd6\\\xd8\x16'\xcb6\xc8\xaa\xdf\xc9 :\xff"
>>> db_digest.encode('raw_unicode_escape') == hash_digest
True
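In Python 3 terms, a str whose code points are all below 256 maps one-for-one onto bytes; raw_unicode_escape and latin-1 agree on that range. A minimal sketch with made-up byte values:

```python
# A text string whose code points are all < 256 round-trips to bytes
# one-for-one; raw_unicode_escape and latin-1 agree on this range.
s = '\x27\x90\x01\xff'
assert s.encode('raw_unicode_escape') == b'\x27\x90\x01\xff'
assert s.encode('latin-1') == s.encode('raw_unicode_escape')
```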

Hashes operate on bytes (str in Python 2.x, bytes in 3.x), not on text strings (unicode in 2.x, str in 3.x). Therefore, you must supply bytes in any case. Try:
hashlib.sha1(salt.encode('utf-8') + data).digest()
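In Python 3 the same pattern reads as follows; the salt and data values here are made up for illustration:

```python
import hashlib

salt = 'pepper'          # text: must be encoded before hashing
data = b'payload'        # already bytes
digest = hashlib.sha1(salt.encode('utf-8') + data).digest()
assert isinstance(digest, bytes) and len(digest) == 20  # SHA-1 yields 20 bytes
```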

The hash will contain "characters" that are in the range 0-255. These are all valid Unicode characters, but it's not a Unicode string. You need to convert it somehow. The best solution would be to encode it into something like base64.
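A sketch of the base64 route in Python 3 (the sample input is made up): the raw digest stays bytes, and only the ASCII-safe text goes into the database.

```python
import base64
import hashlib

digest = hashlib.sha256(b'some data').digest()    # raw bytes, values 0-255
safe = base64.b64encode(digest).decode('ascii')   # ASCII-safe text for storage
assert base64.b64decode(safe) == digest           # round-trips exactly
```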
There's also a hacky solution to convert the bytes returned directly into a pseudo-Unicode string, exactly as your database appears to be doing it:
hash_unicode = u''.join([unichr(ord(c)) for c in hash_digest])
You can also go the other way, but this is more dangerous as the "string" will contain characters outside of the ASCII range of 0-127 and might throw errors when you try to use it.
asc_db_digest = ''.join([chr(ord(c)) for c in db_digest])

Related

Why do a string and a list behave differently with Unicode (UTF-8)? How can '-' cause an error?

>>> final=[]
>>> for a in range(65535):
	final.append([a, chr(a)])
>>> file=open('1.txt','w',encoding='utf-8')
>>> file.write(str(final))
960881
>>> file.close()
>>> final=''
>>> for a in range(65535):
	final += '%d -------- %s' % (a, chr(a))
>>> file=open('2.txt','w',encoding='utf-8')
>>> file.write(final)
Traceback (most recent call last):
File "<pyshell#29>", line 1, in <module>
file.write(final)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 873642: surrogates not allowed
As you can see, 1.txt is saved. Why does saving the second file (the string version) raise an error?
From Wikibooks:
Unicode and ISO/IEC 10646 do not assign actual characters to any of the code points in the D800–DFFF range — these code points only have meaning when used in surrogate pairs. Hence an individual code point from a surrogate pair does not represent a character and is invalid unless used in a surrogate pair.
So I'd say chr(0xd800) already is invalid and I guess Python just doesn't check it for speed reasons. But the UTF-8 encoder does check it and complains.
The reason it works for the first file is that wrapping the string in a list and using str on that list leads to repr-ing the string:
>>> str( chr(0xd800) )
'\ud800'
>>> str([chr(0xd800)])
"['\\ud800']"
Note the double backslash in the list version. Instead of one "invalid character" \ud800 it's the six valid characters \, u, d, 8, 0 and 0. And those can be encoded.
The codepoints U+D800 through U+DFFF are reserved for surrogate pairs, as the error message already says:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 873642: surrogates not allowed
You can't write characters in that range. It's only used for UTF-16 to encode codepoints outside the BMP (i.e. > 65535).
Note that Unicode is not a 16-bit charset, so going up to 65535 is not enough. To print all the Unicode characters you need to go all the way up to U+10FFFF, except the surrogate range. It's also easier to use UTF-32 for this.
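Under that reading, the fix for the original loop is simply to skip the surrogate block before encoding; a Python 3 sketch:

```python
# Build one string containing every encodable code point by skipping
# the surrogate block U+D800-U+DFFF, which UTF-8 refuses to encode.
chars = [chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]
text = ''.join(chars)
encoded = text.encode('utf-8')   # succeeds once surrogates are excluded
assert len(chars) == 0x110000 - 0x800
```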
I am not aware of a good way to get from UTF-16 to UTF-8, but you could apply this method to read the file if you do NOT require a 100% accurate representation:
f = open(filename, encoding='utf-8', errors='replace')

Why is Python trying to automatically encode my Unicode string? [duplicate]

This question already has an answer here:
Unicode error Ordinal not in range
(1 answer)
Closed 6 years ago.
I'm trying to read a bunch of e-mail messages from files that are encoded in ISO-8859-1, then write (parts of) them out to a JSON file with UTF-8 encoding. I've currently got a program that reads them and produces objects with str type properties containing the various fields of the message. I want to convert these str strings (encoded bitstrings) to unicode strings (abstract Unicode objects) so that I can later re-encode them with UTF-8 when I write out the file. So I use the decode method of str, like this:
msg_dict = {u'Id' : message.message_id.decode('iso-8859-1'),
            u'Subject' : message.subject.decode('iso-8859-1'),
            u'SenderEmail' : message.sender_email.decode('iso-8859-1'),
            u'SenderName' : message.sender_name.decode('iso-8859-1'),
            u'Date': message.date.isoformat()}
According to the documentation I've read, decode should take the str object, interpret its bytes according to the given encoding, and return a unicode object representing those characters. But when I run my code, I get this error:
File "/home/edward/long/path/omitted/dumpMails.py", line 38, in <module>
u'Subject' : message.subject.decode('iso-8859-1'),
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
How could I be getting an encode error when I call decode? My best guess is that Python has decided to automatically convert the returned unicode back to a str, using the default encoding. But why is it trying to do this? Is it something to do with putting unicodes in a dictionary?
Python will automatically try to encode a value if it is not yet a byte string. You cannot decode a Unicode string, after all, so Python tries to be helpful and makes it a byte string first.
In other words, the string is already decoded to unicode:
>>> decoded = u'åüøî'
>>> decoded.decode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Either test whether it is already a Unicode string, or, if it is always a Unicode string, simply don't try to decode it.
Incidentally, you'll see the inverse problem if you have a byte string that you are trying to encode; Python will implicitly decode such a value first, so that it has a unicode object to encode for you:
>>> encoded = u'åüøî'.encode('utf8')
>>> encoded.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Note the decode keyword in that error message.
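For comparison, Python 3 removes both implicit conversions: text (str) has no decode method and bytes has no encode method, so these crossed errors cannot occur. A quick check:

```python
raw = 'åüøî'.encode('iso-8859-1')     # bytes
text = raw.decode('iso-8859-1')       # back to text
assert text == 'åüøî'
assert not hasattr(text, 'decode')    # str cannot be decoded again
assert not hasattr(raw, 'encode')     # bytes cannot be encoded again
```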

Strange behavior of string format in python 2.7

Working with svn logs in xml format i've accidentally got an error in my script.
Error message is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
By debugging input data i have found what was wrong. Here is an example:
a=u'\u0440\u0435\u044c\u0434\u0437\u0444\u043a\u044b\u0443\u043a \u043c\u0443\u043a\u044b\u0448\u0449\u0442 \u0430\u0448\u0447'
>>> print a
реьдзфкыук мукышщт ашч
>>> print '{}'.format(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
Can you please explain what is wrong with format?
It seems like it sees the u before the string bytes and tries to decode it from UTF-8.
However in Python 3 above example works without error.
You are mixing Unicode and byte string values. Use a unicode format:
print u'{}'.format(a)
Demo:
>>> a=u'\u0440\u0435\u044c\u0434\u0437\u0444\u043a\u044b\u0443\u043a \u043c\u0443\u043a\u044b\u0448\u0449\u0442 \u0430\u0448\u0447'
>>> print u'{}'.format(a)
реьдзфкыук мукышщт ашч
In Python 3, strings are unicode values by default; in Python 2, u"..." indicates a unicode value and regular strings ("...") are byte strings.
Mixing byte string and unicode values results in automatic encoding or decoding with the default codec (ASCII), and that's what happens here. The str.format() method has to encode the Unicode value to a byte string to interpolate it.
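In Python 3 the problem disappears because the format string and the value are both text, so no implicit codec is involved:

```python
a = '\u0440\u0435\u044c\u0434'   # text (str) in Python 3
s = '{}'.format(a)               # no implicit encoding step
assert s == a
```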

Double-decoding unicode in python

I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings.
I send the string u'XüYß' encoded using UTF-8, thus becoming X\u00fcY\u00df (equal to X\xc3\xbcY\xc3\x9f).
The server should simply echo what I sent it, yet returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a unicode string containing the original string encoded as UTF-8.
But Python won't let me decode a unicode string without re-encoding it first - which fails for some reason that escapes me:
>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
How do I persuade Python to re-decode the string? - and/or is there any (practical) way of debugging what's actually in the strings, without passing it through all the implicit conversions print uses?
(And yes, I have reported this behaviour with the developers of the server-side.)
ret.decode() implicitly tries to encode ret with the system encoding - in your case ASCII.
If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:
>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
Really, .encode('latin1') (or cp1252) would be OK, because that's almost certainly what the server is using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
In case you run into this sort of mixed data, you can use the codec again, to normalize everything:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:
def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
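The same chain works unchanged on Python 3 bytes; checking it against the bytes from the question:

```python
def double_decode(bstr):
    # Undo one accidental extra UTF-8 encode:
    # mojibake bytes -> mojibake text -> original UTF-8 bytes -> text
    return bstr.decode('utf-8').encode('latin-1').decode('utf-8')

assert double_decode(b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f') == 'X\u00fcY\u00df'
```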
Don't use this! Use #hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)
def double_decode_unicode(s, encoding='utf-8'):
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Here's a little script that might help you, doubledecode.py --
https://gist.github.com/1282752

Python: Sanitize a string for unicode? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?
I have a string that I'm trying to make safe for the unicode() function:
>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')
Traceback (most recent call last):
File "<pyshell#8>", line 1, in <module>
s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)
Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?
Somewhat related to this question, although I was unable to solve my problem from it.
This also fails:
>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
s.decode('utf-8')
File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte
Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings," they're byte arrays. So your string, where did it come from and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. I try to paste it into a Python interpreter, or type it on OS X with Option-[, and it doesn't come through.
Looking at your second example though, you have a character of hex 93. That can't be UTF-8, because in UTF-8, any byte higher than 127 is part of a multibyte sequence. So I'm guessing it's supposed to be Latin-1. The problem is, x93 isn't a character in the Latin-1 character set. There's this "invalid" range in Latin-1 from x80 to x9f that's considered illegal. However, Microsoft saw that unused range and decided to put "curly quotes" in there. In doing so they created the similar encoding "windows-1252", which is like Latin-1 with stuff in that invalid range.
So, let's assume it is windows-1252. What now? String.decode converts bytes into Unicode, so that's the one you want. Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:
>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>
That's correct, because the opening curly quote is Unicode U+201C. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite function, str.encode.
>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar \xe2\x80\x9d weasel'
Curly quotes take 3 bytes to encode in UTF-8. You could use UTF-16 and they'd only be two bytes. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
EDIT. Looks like your string is encoded in such a way that “ (LEFT DOUBLE QUOTATION MARK) becomes \x93 and ” (RIGHT DOUBLE QUOTATION MARK) becomes \x94. There are a number of codepages with such a mapping, CP1250 being one of them, so you may use this:
s = s.decode('cp1250')
For all the codepages which map “ to \x93 see here (all of them also map ” to \x94, which can be verified here).
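In Python 3 the whole recipe fits in a few lines (the bytes value is taken from the question):

```python
raw = b' foo \x93bar bar \x94 weasel'   # bytes as stored
text = raw.decode('windows-1252')       # \x93 -> U+201C, \x94 -> U+201D
assert text == ' foo \u201cbar bar \u201d weasel'
utf8 = text.encode('utf-8')             # curly quotes become 3 bytes each
assert b'\xe2\x80\x9c' in utf8
```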
