Convert Python string between iso8859_1 and utf-8 - python

I am trying to do the same thing in python as the java code below.
String decoded = new String("中".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);
The output is a Chinese String "中".
In Python I tried various combinations of encode, decode and bytearray, but I always got an unreadable string. I think my problem is that I don't really understand how the Java/Python encoding mechanisms work. I also could not find a solution in the existing answers.
#coding=utf-8
def p(s):
    print s + ' -- ' + str(type(s))
ch1 = 'ä¸-'
p(ch1)
chu1 = ch1.decode('ISO8859_1')
p(chu1.encode('utf-8'))
utf_8 = bytearray(chu1, 'utf-8')
p(utf_8)
p(utf_8.decode('utf-8').encode('utf-8'))
#utfstr = utf_8.decode('utf-8').decode('utf-8')
#p(utfstr)
p(ch1.decode('iso-8859-1').encode('utf8'))
ä¸- -- <type 'str'>
ä¸Â- -- <type 'str'>
ä¸Â- -- <type 'bytearray'>
ä¸Â- -- <type 'str'>
ä¸Â- -- <type 'str'>
Daniel Roseman's answer is really close. Thank you. But when it comes to my real case:
ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'
print ch.decode('utf-8').encode('iso-8859-1')
I got
Traceback (most recent call last):
File "", line 1, in
File "/apps/Python/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte
Java code:
String decoded = new String("masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);
The output is masanori harigae のパーソナル会�-�室

You are doing this the wrong way round. You have a bytestring that is wrongly encoded as utf-8 and you want it to be interpreted as iso-8859-1:
>>> ch = "中"
>>> print ch.decode('utf-8').encode('iso-8859-1')
中
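For readers on Python 3, where text (str) and bytes are separate types, the same repair can be sketched roughly as below. This is only an illustration, not the answerer's code: the helper name fix_mojibake is made up for this example, and it assumes the garbled text came from UTF-8 bytes that were decoded as ISO-8859-1.

```python
def fix_mojibake(garbled):
    """Undo a UTF-8-read-as-ISO-8859-1 mistake (hypothetical helper)."""
    # Encoding as ISO-8859-1 maps each character back to the single byte
    # it came from; those bytes are then decoded with the intended codec.
    return garbled.encode("iso-8859-1").decode("utf-8")


# Reproduce the damage: the UTF-8 bytes of one character read as ISO-8859-1.
garbled = "中".encode("utf-8").decode("iso-8859-1")
repaired = fix_mojibake(garbled)

# ascii() keeps the output independent of the console encoding.
print(ascii(garbled), "->", ascii(repaired))
```

The garbled value is three characters long (one per UTF-8 byte), which is why it shows up as a short run of accented Latin characters instead of a single Chinese character.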

Related

unprintable python unicode string

I retrieved some exif info from an image and got the following:
{ ...
37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'
...}
I expected it to be
{ ...
37510: u'D2\nArbeitsamt\nÄnderungsbescheid'
... }
I need to convert the value to a str, but I couldn't get it to work. I always get something like this (using Python 2.7):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)
Any ideas how I can handle this?
UPDATE:
I tried it with Python 3 and no error is thrown now, but the result is:
{ ...
37510: 'D2\nArbeitsamt\nÃ\x84nderungsbescheid',
... }
which is still not the expected.
It seems to be utf8 which was incorrectly decoded as latin1 and then placed in a unicode string. You can use .encode('iso8859-1') to reverse the incorrect decoding.
>>> my_dictionary = {37510: u'D2\nArbeitsamt\n\xc3\x84nderungsbescheid'}
>>> print(my_dictionary[37510].encode('iso8859-1'))
D2
Arbeitsamt
Änderungsbescheid
You can print it out just fine now, but you might then also decode it as unicode, so it ends up with the correct type for further processing:
>>> type(my_dictionary[37510].encode('iso8859-1'))
<type 'str'>
>>> print(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
D2
Arbeitsamt
Änderungsbescheid
>>> type(my_dictionary[37510].encode('iso8859-1').decode('utf8'))
<type 'unicode'>
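Since the question also mentions Python 3, here is a hedged sketch of the same reversal there, with a guard for values that are not damaged in this way. The function name repair_latin1_mojibake is an assumption made for this example, not part of any EXIF library.

```python
def repair_latin1_mojibake(text):
    """Reverse a UTF-8-decoded-as-Latin-1 error, or return text unchanged."""
    try:
        return text.encode("iso8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8: it was not this kind of damage.
        return text


value = "D2\nArbeitsamt\n\xc3\x84nderungsbescheid"
fixed = repair_latin1_mojibake(value)

# ascii() avoids depending on the terminal's encoding.
print(ascii(fixed))
```

The try/except is a design choice for mixed data: a correctly decoded value such as "Änderung" fails the UTF-8 step and is passed through untouched. Plain ASCII text round-trips unchanged either way.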

Python urlencode special character

I have this variable here
reload(sys)
sys.setdefaultencoding('utf8')
foo = u'"Esp\xc3\xadrito"'
which translates to "Espírito". But when I pass my variable to urlencode like this
urllib.urlencode({"q": foo}) # q=%22Esp%C3%83%C2%ADrito%22'
The special character is being "represented" wrongly in the URL.
How should I fix this?
You have the wrong encoding of "Espírito". I don't know where you got it from, but this is the right one:
>>> s = u'"Espírito"'
>>>
>>> s
u'"Esp\xedrito"'
Then encoding your query:
>>> urllib.urlencode({'q':s.encode('utf-8')})
'q=%22Esp%C3%ADrito%22'
This should give you back the right encoding of your string.
EDIT: This is regarding right encoding of your query string, demo:
>>> s = u'"Espírito"'
>>> print s
"Espírito"
>>> s.encode('utf-8')
'"Esp\xc3\xadrito"'
>>> s.encode('latin-1')
'"Esp\xedrito"'
>>>
>>> print "Esp\xc3\xadrito"
Espí­rito
>>> print "Esp\xedrito"
Espírito
This suggests that the right encoding for your string is most probably latin-1 (cp1252 works as well). As far as I understand, urlparse.parse_qs assumes either utf-8 or your system default encoding, which, as per your post, you have also set to utf-8.
Interestingly, I was playing with the query you provided in your comment, I got this:
>>> q = "q=Esp%C3%ADrito"
>>>
>>> p = urlparse.parse_qs(q)
>>> p['q'][0].decode('utf-8')
u'Esp\xedrito'
>>>
>>> p['q'][0].decode('latin-1')
u'Esp\xc3\xadrito'
#Clearly not ASCII encoding.
>>> p['q'][0].decode()
Traceback (most recent call last):
File "<pyshell#320>", line 1, in <module>
p['q'][0].decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>>
>>> p['q'][0]
'Esp\xc3\xadrito'
>>> print p['q'][0]
Espírito
>>> print p['q'][0].decode('utf-8')
Espírito
urllib and urlparse appear to work with byte strings in Python 2. To get unicode strings, encode and decode using utf-8.
Here's an example of a round-trip:
import urllib
import urlparse

data = {'q': u'Espírito'}
# to query string: encode every unicode value to UTF-8 bytes first
bdata = {k: v.encode('utf-8') for k, v in data.iteritems()}
qs = urllib.urlencode(bdata)
# qs = 'q=Esp%C3%ADrito'
# to dict: parse_qs returns byte strings, so decode each value
bdata = urlparse.parse_qs(qs)
data = {k: map(lambda s: s.decode('utf-8'), v)
        for k, v in bdata.iteritems()}
# data = {'q': [u'Espírito']}
Note the different meaning of escape sequences: in 'Esp\xc3\xadrito' (a string), they represent bytes, while in u'"Esp\xedrito"' (a unicode object) they represent Unicode code points.
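For comparison, a minimal Python 3 sketch of the same round trip. It assumes the standard urllib.parse module, where urlencode and parse_qs accept and return text and use UTF-8 by default, so the manual encode/decode steps are not needed.

```python
from urllib.parse import parse_qs, urlencode

data = {"q": "Esp\xedrito"}  # the text "Espírito", with \xed as a code point

# To query string: the value is percent-encoded as UTF-8.
qs = urlencode(data)  # 'q=Esp%C3%ADrito'

# Back to a dict: each key maps to a list of decoded text values.
parsed = parse_qs(qs)

print(qs)
print(ascii(parsed))
```

Note that parse_qs returns a list per key, because a query string may repeat a parameter.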

how to show the right word in my code, my code is : os.urandom(64)

My code is:
print os.urandom(64)
which outputs:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
\xd0\xc8=<\xdbD'
\xdf\xf0\xb3>\xfc\xf2\x99\x93
=S\xb2\xcd'\xdbD\x8d\xd0\\xbc{&YkD[\xdd\x8b\xbd\x82\x9e\xad\xd5\x90\x90\xdcD9\xbf9.\xeb\x9b>\xef#n\x84
which isn't readable, so I tried this:
print os.urandom(64).decode("utf-8")
but then I get:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
Traceback (most recent call last):
File "D:\zjm_code\a.py", line 17, in <module>
print os.urandom(64).decode("utf-8")
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-3: invalid data
What should I do to get human-readable output?
No shortage of choices. Here's a couple:
>>> os.urandom(64).encode('hex')
'0bf760072ea10140d57261d2cd16bf7af1747e964c2e117700bd84b7acee331ee39fae5cff6f3f3fc3ee3f9501c9fa38ecda4385d40f10faeb75eb3a8f557909'
>>> os.urandom(64).encode('base64')
'ZuYDN1BiB0ln73+9P8eoQ3qn3Q74QzCXSViu8lqueKAOUYchMXYgmz6WDmgJm1DyTX598zE2lClX\n4iEXXYZfRA==\n'
os.urandom is giving you a 64-byte string. Encoding it in hex is probably the best way to make it "human readable" to some extent. E.g.:
>>> s = os.urandom(64)
>>> s.encode('hex')
'4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e'
Of course this gives you 128 characters in the result, which may be too long a line to read comfortably; it's easy to split it up, though -- e.g.:
>>> print s[:32].encode('hex')
4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef
>>> print s[32:].encode('hex')
4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e
two chunks of 64 characters each shown on separate lines may be easier on the eye.
Random bytes are unlikely to form valid UTF-8 text, so I'm not surprised that you get decoding errors. Instead you need to convert them somehow. If all you're trying to do is see what they are, then something like this will do:
print [ord(o) for o in os.urandom(64)]
Or, if you'd prefer to have it as hex 0-9a-f (zero-padded to two digits per byte, so the output is unambiguous):
print ''.join(['%02x' % ord(o) for o in os.urandom(64)])
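The str.encode('hex') and str.encode('base64') calls shown above are specific to Python 2. A rough Python 3 equivalent, using only the standard library, might look like this:

```python
import base64
import os

raw = os.urandom(64)  # 64 random bytes, not text

# Two hex digits per byte, so 128 characters in total.
hex_text = raw.hex()

# Base64 is more compact; b64encode returns bytes, so decode to text.
b64_text = base64.b64encode(raw).decode("ascii")

print(hex_text)
print(b64_text)
```

Both forms are reversible (bytes.fromhex and base64.b64decode), so they are useful for logging or storing the value as well as for reading it.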

UnicodeEncodeError with csvwriter

I have one more error to fix.
row = OpenThisLink + titleTag + JD
try:
    csvwriter.writerow([row])
except (UnicodeEncodeError, UnicodeDecodeError):
    pass
This gives the error (for this character: "ń")
row = OpenThisLink + str(titleTag) + JD
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 51: ordinal not in range(128)
I tried to fix this by using the method here. But,
>>> title = "hello Giliciński"
Unsupported characters in input
u = unicode(title, "latin1")
Traceback (most recent call last):
File "<pyshell#56>", line 1, in <module>
u = unicode(title, "latin1")
NameError: name 'title' is not defined
>>> title = "ń" Unsupported characters in input
According to documentation:
Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided.
And indeed, my exception doesn't seem to work. Any suggestions?
Thanks!
"And indeed, my exception doesn't seem to work. Any suggestions?"
row = OpenThisLink + titleTag + JD is outside the try/except block and so any exceptions raised while that statement is running will not be caught. This, however, will catch the exception:
try:
    row = OpenThisLink + titleTag + JD
    csvwriter.writerow([row])
except (UnicodeEncodeError, UnicodeDecodeError):
    print "Caught unicode error"
But, in the code that you posted, row = OpenThisLink + titleTag + JD will not raise UnicodeEncodeError if titleTag contains a unicode string; the result of the string concatenation will be of type unicode.
Now, the csv module doesn't support unicode, so when you call writerow() with unicode data this will raise UnicodeEncodeError. You need to encode your unicode strings into a suitable encoding (UTF8 would be best) and then pass that to writerow(), for example:
>>> titleTag = "hello Giliciński"
>>> titleTag
'hello Gilici\xc5\x84ski'
>>> type(titleTag)
<type 'str'>
>>>
>>> titleTag = titleTag.decode('utf8')
>>> titleTag
u'hello Gilici\u0144ski'
>>> type(titleTag)
<type 'unicode'>
>>>
>>> csvwriter.writerow([titleTag])
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 12: ordinal not in range(128)
>>>
>>> # but this will work...
>>> csvwriter.writerow([titleTag.encode('utf8')])
The relevant Python documentation is here. Be sure to look at the examples, in particular the last one.
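The advice above applies to the Python 2 csv module. As a hedged sketch of how the same write looks in Python 3, where csv works with text and the encoding is chosen when the file is opened (the file name titles.csv and the sample rows are made up for this example):

```python
import csv
import os
import tempfile

rows = [["title"], ["hello Gilici\u0144ski"]]  # \u0144 is the character "ń"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titles.csv")

    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

    # Read the file back with the same encoding to check the round trip.
    with open(path, encoding="utf-8", newline="") as f:
        read_back = list(csv.reader(f))

print(ascii(read_back))
```

Passing encoding="utf-8" explicitly avoids depending on the platform's default encoding, which is where 'ascii' codec errors of this kind usually come from.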
BTW, pyshell doesn't seem to accept non-ASCII characters as input, so use the normal Python interpreter.
For IDLE, according to the solution linked here, open the file $python/Lib/idlelib/IOBinding.py and force the setting
encoding = "utf-8"
after the try/except/pass block that sets the locale. Save the file (this may require administrative privileges), close IDLE, and open it again. At least it works for me. My IDLE version is 1.2, with Python 2.5.

purpose of '"sss".decode("base64").decode("zlib")'

ACTIVATE_THIS = """
eJx1UsGOnDAMvecrIlYriDRlKvU20h5aaY+teuilGo1QALO4CwlKAjP8fe1QGGalRoLEefbzs+Mk
Sb7NcvRo3iTcoGqwgyy06As+HWSNVciKaBTFywYoJWc7yit2ndBVwEkHkIzKCV0YdQdmkvShs6YH
E3IhfjFaaSNLoHxQy2sLJrL0ow98JQmEG/rAYn7OobVGogngBgf0P0hjgwgt7HOUaI5DdBVJkggR
3HwSktaqWcCtgiHIH7qHV+esW2CnkRJ+9R5cQGsikkWEV/J7leVGs9TV4TvcO5QOOrTHYI+xeCjY
JR/m9GPDHv2oSZunUokS2A/WBelnvx6tF6LUJO2FjjlH5zU6Q+Kz/9m69LxvSZVSwiOlGnT1rt/A
77j+WDQZ8x9k2mFJetOle88+lc8sJJ/AeerI+fTlQigTfVqJUiXoKaaC3AqmI+KOnivjMLbvBVFU
1JDruuadNGcPmkgiBTnQXUGUDd6IK9JEQ9yPdM96xZP8bieeMRqTuqbxIbbey2DjVUNzRs1rosFS
TsLAdS/0fBGNdTGKhuqD7mUmsFlgGjN2eSj1tM3GnjfXwwCmzjhMbR4rLZXXk+Z/6Hp7Pn2+kJ49
jfgLHgI4Jg==
""".decode("base64").decode("zlib")
my code:
import zlib
print 'dsss'.decode('base64').decode('zlib')#error
Traceback (most recent call last):
File "D:\zjm_code\b.py", line 4, in <module>
print 'dsss'.decode('base64').decode('zlib')
File "D:\Python25\lib\encodings\zlib_codec.py", line 43, in zlib_decode
output = zlib.decompress(input)
zlib.error: Error -3 while decompressing data: unknown compression method
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 7, in <module>
a.decode('base64')
File "D:\Python25\lib\encodings\base64_codec.py", line 42, in base64_decode
output = base64.decodestring(input)
File "D:\Python25\lib\base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
a='dsss'
a=a.encode('zlib')
#print a
a=a.decode('zlib')
print a#its ok
i think the 'print a' encode the a with 'uhf-8'.
so:
#encoding:utf-8
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('utf-8')#but error.
a=a.decode('zlib')
print a#
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 5, in <module>
a=a.decode('utf-8')
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 1: unexpected code byte
The data in the string is encoded and compressed binary data. The .decode("base64").decode("zlib") call decodes and decompresses it.
The error you got was because 'dsss' decoded from base64 is not valid zlib compressed data.
What is the purpose of x.decode("base64").decode("zlib") for x in ("sss", "dsss", random_garbage)? Excuse me, you should know; you are the one who is doing it!
Edit after OP's addition of various puzzles
Puzzle 1
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
Resolution: all 3 statements of the form
a.XXcode('encoding')
should be
a = a.XXcode('encoding')
Puzzle 2
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
But it does print 'dsss':
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x£K)..♠ ♦F☺¥
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
Puzzle 3
"""i think the 'print a' encode the a with 'uhf-8'."""
Resolution: You think extremely incorrectly. What follows the print is an expression. There are no such side effects. What do you imagine happens when you do this:
print 'start text ' + a + 'end text'
?
What do you imagine happens if you do print a twice? Encoding the already-encoded text again? Why don't you stop imagining and try it out?
In any case, note that the output of str.encode('zlib') is an str object, not a unicode object:
>>> print repr('dsss'.encode('zlib'))
'x\x9cK)..\x06\x00\x04F\x01\xbe'
Getting from that to UTF-8 is going to be somewhat difficult ... it would have to be decoded into unicode first -- with what codec? ascii and utf8 are going to have trouble with the '\x9c' and the '\xbe' ...
It is the reverse of:
original_message.encode('zlib').encode('base64')
zlib is a binary compression algorithm. base64 is a text encoding of binary data, which is useful to send binary message through text protocols like SMTP.
After 'dsss' was decoded from base64 (the three bytes 76h, CBh, 2Ch), the result was not valid zlib compressed data so it couldn't be decoded.
Try printing ACTIVATE_THIS to see the result of the decoding. It turns out to be some Python code.
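The string codecs 'zlib' and 'base64' used above are Python 2 features. A minimal Python 3 sketch of the same pack/unpack pair, using the zlib and base64 modules directly, is shown below; the payload is an arbitrary stand-in, not the real ACTIVATE_THIS content.

```python
import base64
import zlib

original = b"print('hello')"  # stand-in payload for this example

# Pack: compress first, then make the binary result text-safe.
packed = base64.b64encode(zlib.compress(original))

# Unpack: reverse the steps in the opposite order.
unpacked = zlib.decompress(base64.b64decode(packed))

# 'dsss' is valid base64, but the three bytes it decodes to
# are not a zlib stream, so decompression is rejected.
decoded = base64.b64decode(b"dsss")
try:
    zlib.decompress(decoded)
    failed = False
except zlib.error:
    failed = True

print(unpacked, decoded, failed)
```

The order matters: decoding must undo the outermost layer (base64) before the inner one (zlib).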
.decode('base64') can be called only on a string that's encoded as base64, in order to retrieve the byte sequence that was encoded there. Presumably that byte sequence, in the example you give, was zlib-compressed, and so the .decode('zlib') part decompresses it.
Now, for your case:
>>> 'dsss'.decode('base64')
'v\xcb,'
But 'v\xcb,' is not a zlib-compressed string! And so of course you cannot ask zlib to "decompress" it. Fortunately zlib recognizes the fact (that 'v\xcb,' could not possibly have been produced by applying any of the compression algorithms zlib knows about to any input whatsoever) and so gives you a helpful error message (instead of a random-ish string of bytes, which you might well have gotten if you had randomly supplied a different but equally crazy input string!-)
Edit: the error in
a.encode('base64')
print a
a.decode('base64')#error
is obviously due to the fact that strings are immutable: just calling a.encode (or any other method) does not alter a, it produces a new string object (and here you're just printing it).
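The immutability point can be sketched in a few lines. This version is written for Python 3 and uses the base64 module instead of the Python 2 string codec, but the behaviour it illustrates is the same: the call returns a new object and leaves the original untouched.

```python
import base64

a = b"dsss"

base64.b64encode(a)  # the result is discarded; a is unchanged
unchanged = a

a = base64.b64encode(a)  # rebind the name to keep the encoded value
encoded = a

a = base64.b64decode(a)  # and rebind again to get the original back

print(unchanged, encoded, a)
```

Only the assignments change what the name a refers to, which is why the fix for the first puzzle is to write a = a.encode(...) and a = a.decode(...).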
In the next snippet, the error is only in the OP's mind:
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x?K)..F?
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
that "why can't print" question is truly peculiar, applied to code that does print 'dsss'. Finally,
"i think the 'print a' encode the a with 'uhf-8'."
You think wrongly: there's no such thing as "uhf-8" (you mean "utf-8" maybe?), and anyway print a does not alter a, any more than just calling a.encode does.
