I retrieved some exif info from an image and got the following:
{ ...
37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'
...}
I expected it to be
{ ...
37510: u'D2\nArbeitsamt\nÄnderungsbescheid'
... }
I need to convert the value to a str, but I couldn't get it to work. I always get something like this (using Python 2.7):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)
Any ideas how I can handle this?
UPDATE:
I tried it with Python 3 and no error is thrown, but the result is now
{ ...
37510: 'D2\nArbeitsamt\nÃ\x84nderungsbescheid',
... }
which is still not what I expected.
It seems to be UTF-8 that was incorrectly decoded as Latin-1 and then placed in a unicode string. You can use .encode('iso8859-1') to reverse the incorrect decoding.
>>> my_dictionary = {37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
>>> print(my_dictionary[37510].encode('iso8859-1'))
D2
Arbeitsamt
Änderungsbescheid
It prints fine now, but you may also want to decode it as UTF-8, so it ends up with the correct type for further processing:
>>> type(my_dictionary[37510].encode('iso8859-1'))
<type 'str'>
>>> print(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
D2
Arbeitsamt
Änderungsbescheid
>>> type(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
<type 'unicode'>
I'm creating my RSA Signature like this.
transactionStr = json.dumps(GenesisTransaction())
signature = rsa.sign(transactionStr.encode(), client.privateKey, 'SHA-1')
But I'm unable to convert it to a string so I can save it.
I have tried decoding it using UTF-8:
signature.decode("utf8")
but I get the error "'utf-8' codec can't decode byte 0xe3 in position 2".
Is there any way I can do this?
An RSA signature looks like this:
b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F\x1c\x07\x90[/\x9dy\xce\xa9IV\x89\xe0\xcd9\\_3\x1e\xaa\x80\xdea\xd1\xbem/\x8e\x91\xbd\x13\x12o\x8c\xed\xf6\x89\xb5\x0b'
.decode('utf8') is used to decode text encoded in UTF8, not arbitrary bytes. Convert the byte string to a hexadecimal string instead:
>>> sig = b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F\x1c\x07\x90[/\x9dy\xce\xa9IV\x89\xe0\xcd9\\_3\x1e\xaa\x80\xdea\xd1\xbem/\x8e\x91\xbd\x13\x12o\x8c\xed\xf6\x89\xb5\x0b'
>>> s = sig.hex()
>>> s
'614ce3f4be454dc49e0a9ef44d60ba852a13d53278d95ce8461c07905b2f9d79cea9495689e0cd395c5f331eaa80de61d1be6d2f8e91bd13126f8cedf689b50b'
To convert back, if needed:
>>> b = bytes.fromhex(s)
>>> b
b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F\x1c\x07\x90[/\x9dy\xce\xa9IV\x89\xe0\xcd9\\_3\x1e\xaa\x80\xdea\xd1\xbem/\x8e\x91\xbd\x13\x12o\x8c\xed\xf6\x89\xb5\x0b'
>>> b==sig
True
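If a more compact text form is preferred, Base64 is another common way to serialize raw bytes. This is an alternative sketch, not part of the answer above; the signature bytes here are a shortened sample:

```python
import base64

# Base64 gives a shorter text representation than hex for raw bytes.
sig = b'aL\xe3\xf4\xbeEM\xc4\x9e\n\x9e\xf4M`\xba\x85*\x13\xd52x\xd9\\\xe8F'
s = base64.b64encode(sig).decode('ascii')  # bytes -> ASCII str, safe to save
back = base64.b64decode(s)                 # str -> original bytes
assert back == sig
```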
I have this variable here
reload(sys)
sys.setdefaultencoding('utf8')
foo = u'"Esp\xc3\xadrito"'
which translates to "Espírito". But when I pass my variable to urlencode like this
urllib.urlencode({"q": foo}) # q=%22Esp%C3%83%C2%ADrito%22'
The special character is being "represented" wrongly in the URL.
How should I fix this?
You got the wrong encoding of "Espírito"; I don't know where you got it, but this is the right one:
>>> s = u'"Espírito"'
>>>
>>> s
u'"Esp\xedrito"'
Then encoding your query:
>>> urllib.urlencode({'q': s.encode('utf-8')})
'q=%22Esp%C3%ADrito%22'
This should give you back the right encoding of your string.
EDIT: This is regarding right encoding of your query string, demo:
>>> s = u'"Espírito"'
>>> print s
"Espírito"
>>> s.encode('utf-8')
'"Esp\xc3\xadrito"'
>>> s.encode('latin-1')
'"Esp\xedrito"'
>>>
>>> print "Esp\xc3\xadrito"
EspÃrito
>>> print "Esp\xedrito"
Espírito
This clearly shows that the right encoding for your string is most probably latin-1 (cp1252 works as well). As far as I understand, urlparse.parse_qs assumes either the default encoding utf-8 or your system default encoding, which, per your post, you also set to utf-8.
Interestingly, I was playing with the query you provided in your comment, I got this:
>>> q = "q=Esp%C3%ADrito"
>>>
>>> p = urlparse.parse_qs(q)
>>> p['q'][0].decode('utf-8')
u'Esp\xedrito'
>>>
>>> p['q'][0].decode('latin-1')
u'Esp\xc3\xadrito'
#Clearly not ASCII encoding.
>>> p['q'][0].decode()
Traceback (most recent call last):
File "<pyshell#320>", line 1, in <module>
p['q'][0].decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>>
>>> p['q'][0]
'Esp\xc3\xadrito'
>>> print p['q'][0]
EspÃrito
>>> print p['q'][0].decode('utf-8')
Espírito
urllib and urlparse work with byte strings in Python 2. To get unicode strings, encode and decode using UTF-8.
Here's an example of a round-trip:
data = { 'q': u'Espírito'}
# to query string:
bdata = {k: v.encode('utf-8') for k, v in data.iteritems()}
qs = urllib.urlencode(bdata)
# qs = 'q=Esp%C3%ADrito'
# to dict:
bdata = urlparse.parse_qs(qs)
data = { k: map(lambda s: s.decode('utf-8'), v)
for k, v in bdata.iteritems() }
# data = {'q': [u'Espírito']}
Note the different meaning of escape sequences: in 'Esp\xc3\xadrito' (a string), they represent bytes, while in u'"Esp\xedrito"' (a unicode object) they represent Unicode code points.
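For reference, Python 3's urllib.parse accepts text directly and percent-encodes it as UTF-8 by default, so the manual encode/decode steps disappear. A sketch of the same round-trip:

```python
from urllib.parse import urlencode, parse_qs

# str in, percent-encoded UTF-8 out -- no manual .encode('utf-8') needed.
qs = urlencode({'q': '"Espírito"'})
print(qs)    # q=%22Esp%C3%ADrito%22

# parse_qs decodes back to text (UTF-8 by default).
data = parse_qs(qs)
print(data)  # {'q': ['"Espírito"']}
```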
I am trying to do the same thing in python as the java code below.
String decoded = new String("ä¸".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);
The output is a Chinese String "中".
In Python I tried the encode/decode/bytearray approaches but I always got an unreadable string. I think my problem is that I don't really understand how the Java/Python encoding mechanisms work. Also, I cannot find a solution among the existing answers.
#coding=utf-8
def p(s):
    print s + ' -- ' + str(type(s))

ch1 = 'ä¸-'
p(ch1)
chu1 = ch1.decode('ISO8859_1')
p(chu1.encode('utf-8'))
utf_8 = bytearray(chu1, 'utf-8')
p(utf_8)
p(utf_8.decode('utf-8').encode('utf-8'))
#utfstr = utf_8.decode('utf-8').decode('utf-8')
#p(utfstr)
p(ch1.decode('iso-8859-1').encode('utf8'))
ä¸- -- <type 'str'>
ä¸Â- -- <type 'str'>
ä¸Â- -- <type 'bytearray'>
ä¸Â- -- <type 'str'>
ä¸Â- -- <type 'str'>
Daniel Roseman's answer is really close. Thank you. But when it comes to my real case:
ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'
print ch.decode('utf-8').encode('iso-8859-1')
I got
Traceback (most recent call last):
File "", line 1, in
File "/apps/Python/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte
Java code:
String decoded = new String("masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);
The output is masanori harigae のパーソナル会�-�室
You are doing this the wrong way round. You have a bytestring that is wrongly encoded as utf-8 and you want it to be interpreted as iso-8859-1:
>>> ch = "ä¸"
>>> print ch.decode('utf-8').encode('iso-8859-1')
中
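In Python 3 the Java round-trip new String(s.getBytes("ISO8859_1"), "UTF-8") maps directly onto encode/decode. A sketch, writing the mojibake with explicit escapes because its last character is an invisible soft hyphen:

```python
# '\xe4\xb8\xad' is 'ä', '¸', soft hyphen: UTF-8 bytes of 中 mis-decoded
# as Latin-1. Encoding back to Latin-1 recovers the bytes; decoding those
# as UTF-8 recovers the Chinese character.
mojibake = '\xe4\xb8\xad'
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed)  # 中
```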
def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv', 'wb') as csvfile:
        csvwriter = csv.DictWriter(csvfile, fieldnames=ordered_fieldnames, extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow({k: str(x[k]).encode('utf-8') for k in x})
    #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]

if __name__ == '__main__':
    main()
Basically the issue is that the tweets contain a bunch of characters in Portuguese. I tried to correct for this by encoding everything into unicode values before putting them in the dictionary that was to be added to the row. However this doesn't work. Any other ideas for formatting these values so that csv reader and dictreader can read them?
str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to a byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
...     def __str__(self):
...         return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce booleans to a string, coerce it to a Unicode string instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.
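In Python 3, the whole problem disappears because the csv module works with text streams directly: open the file with an explicit encoding instead of encoding each value by hand. A sketch with hypothetical field names standing in for the tweet keys:

```python
import csv
import io

# Hypothetical rows standing in for the MongoDB tweet documents.
rows = [{'user': 'maría', 'text': 'Olá, São Paulo!', 'retweeted': True}]

# io.StringIO stands in for
# open('brazil_tweets.csv', 'w', encoding='utf-8', newline='')
# -- Python 3 csv wants a text stream, not a binary one.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['user', 'text', 'retweeted'],
                        extrasaction='ignore')
writer.writeheader()
for row in rows:
    # str() is safe here: it produces text, not bytes, so no implicit
    # ASCII encode/decode step can fail on accented characters.
    writer.writerow({k: str(v) for k, v in row.items()})

print(buf.getvalue())
```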
I am getting an mp3 tag (ID V1) with eyeD3 and would like to understand its encoding. Here is what I try:
>>> print(type(mp3artist_v1))
<type 'unicode'>
>>> print(type(mp3artist_v1.encode('utf-8')))
<type 'str'>
>>> print(mp3artist_v1)
Zåìôèðà
>>> print(mp3artist_v1.encode('utf-8').decode('cp1252'))
ZåìôèðÃ
>>> print(u'Zемфира'.encode('utf-8').decode('cp1252'))
Zемфира
If I use an online tool to decode the value, it says that the value Zемфира could be converted to the correct value Zемфира by changing encodings CP1252 → UTF-8, and the value Zåìôèðà by changing encodings CP1252 → CP1251.
What should I do to get Zемфира from mp3artist_v1? .encode('cp1252').decode('cp1251') works well, but how can I detect the right encoding automatically (only 3 encodings are possible: cp1251, cp1252, utf-8)? I was planning to use the following code:
def forceDecode(string, codecs=['utf-8', 'cp1251', 'cp1252']):
    for i in codecs:
        try:
            print(i)
            return string.decode(i)
        except:
            pass
    print "cannot decode url %s" % ([string])
but it does not help, since I need to encode with one charset first and then decode with another.
This works:
s = u'Zåìôèðà'
print s.encode('latin1').decode('cp1251')
# Zемфира
Explanation: Zåìôèðà is mistakenly treated as a unicode string, while it's actually a sequence of bytes that means Zемфира in cp1251. By applying encode('latin1') we convert this "unicode" string back to bytes, using code point numbers as byte values, and then convert these bytes back to unicode, telling decode that we're using cp1251.
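The same repair in Python 3, where the mis-decoded tag is an ordinary str; a minimal sketch:

```python
# cp1252-decoded cp1251 bytes: encode back to Latin-1/cp1252 to get the
# raw bytes, then decode them with the correct Cyrillic codec.
s = 'Zåìôèðà'
fixed = s.encode('latin-1').decode('cp1251')
print(fixed)  # Zемфира
```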
As to automatic decoding, the following brute force approach seems to work with your examples:
import re, itertools

def guess_decode(s):
    encodings = ['cp1251', 'cp1252', 'utf8']
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError) as e:
                continue
            if re.match(ur'^[\w\sа-яА-Я]+$', r):
                print 'debug', encs, r
                return r

print guess_decode(u'Zемфира')
print guess_decode(u'Zåìôèðà')
print guess_decode(u'ZåìôèðÃ\xA0')
Results:
debug ('cp1252', 'utf8') Zемфира
Zемфира
debug ('cp1252', 'cp1251') Zемфира
Zемфира
debug ('cp1252', 'utf8', 'cp1252', 'cp1251') Zемфира
Zемфира
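The brute-force search above is Python 2 code (unicode type, print statements, ur'' literals). A possible Python 3 port, assuming the same three candidate encodings and mirroring the original's ASCII-plus-Cyrillic acceptance test explicitly (in Python 3, \w alone would also match the mojibake letters):

```python
import itertools
import re

def guess_decode(s, encodings=('cp1251', 'cp1252', 'utf-8'), max_steps=8):
    """Try alternating encode/decode chains until the result looks sane."""
    for steps in range(2, max_steps + 2, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # str -> bytes on even steps, bytes -> str on odd ones.
                    r = r.encode(enc) if isinstance(r, str) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            # Accept only ASCII word chars, whitespace, and Cyrillic letters.
            if isinstance(r, str) and re.fullmatch(r'[0-9a-zA-Z_\sа-яА-Я]+', r):
                return encs, r
    return None, s

print(guess_decode('Zåìôèðà'))  # (('cp1252', 'cp1251'), 'Zемфира')
```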