I have one more error to fix.
row = OpenThisLink + titleTag + JD
try:
    csvwriter.writerow([row])
except (UnicodeEncodeError, UnicodeDecodeError):
    pass
This gives the error (for this character: "ń")
row = OpenThisLink + str(titleTag) + JD
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 51: ordinal not in range(128)
I tried to fix this by using the method here. But,
>>> title = "hello Giliciński"
Unsupported characters in input
u = unicode(title, "latin1")
Traceback (most recent call last):
File "<pyshell#56>", line 1, in <module>
u = unicode(title, "latin1")
NameError: name 'title' is not defined
>>> title = "ń" Unsupported characters in input
According to documentation:
Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided.
And indeed, my exception doesn't seem to work. Any suggestions?
Thanks!
"And indeed, my exception doesn't seem to work. Any suggestions?"
row = OpenThisLink + titleTag + JD is outside the try/except block and so any exceptions raised while that statement is running will not be caught. This, however, will catch the exception:
try:
    row = OpenThisLink + titleTag + JD
    csvwriter.writerow([row])
except (UnicodeEncodeError, UnicodeDecodeError):
    print "Caught unicode error"
But, in the code that you posted, row = OpenThisLink + titleTag + JD will not raise UnicodeEncodeError if titleTag contains a unicode string; the result of the string concatenation will be of type unicode.
Now, the csv module in Python 2 doesn't support unicode input, so calling writerow() with unicode data raises UnicodeEncodeError. You need to encode your unicode strings into a suitable encoding (UTF-8 would be best) and pass the encoded byte strings to writerow(), for example:
>>> titleTag = "hello Giliciński"
>>> titleTag
'hello Gilici\xc5\x84ski'
>>> type(titleTag)
<type 'str'>
>>>
>>> titleTag = titleTag.decode('utf8')
>>> titleTag
u'hello Gilici\u0144ski'
>>> type(titleTag)
<type 'unicode'>
>>>
>>> csvwriter.writerow([titleTag])
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 12: ordinal not in range(128)
>>>
>>> # but this will work...
>>> csvwriter.writerow([titleTag.encode('utf8')])
The relevant Python documentation is here. Be sure to look at the examples, in particular the last one.
BTW, pyshell doesn't seem to accept non-ASCII characters as input, so use the normal Python interpreter.
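For anyone reading this on Python 3, where the csv module works with text directly, a minimal sketch of the same idea might look like the following. The file name, the sample row, and the helper names write_rows_utf8 and read_rows_utf8 are made up for illustration; the point is only that the file is opened with an explicit encoding, so no manual .encode() calls are needed.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows):
    """Write rows of text to a CSV file encoded as UTF-8 (Python 3)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)


def read_rows_utf8(path):
    """Read the rows back, decoding the file as UTF-8."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Hypothetical sample data standing in for OpenThisLink, titleTag and JD.
rows = [["http://example.com/1", "hello Giliciński", "JD text"]]
csv_path = os.path.join(tempfile.mkdtemp(), "titles.csv")

write_rows_utf8(csv_path, rows)
round_tripped = read_rows_utf8(csv_path)
```

Each value is passed as its own list element here, so it lands in its own column rather than being concatenated into a single cell.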
For IDLE, according to the solution here (link), open the file $python/Lib/idlelib/IOBinding.py and add the line
encoding = "utf-8"
after the try/except/pass block that sets the locale. Save the file (this may require administrator privileges), then close IDLE and open it again. At least it works for me. My IDLE version is 1.2, Python 2.5.
When I run my Python code and print(item), I get the following errors:
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 61-61: Non-BMP character not supported in Tk
Here is my code:
def getUserFollowers(self, usernameId, maxid = ''):
    if maxid == '':
        return self.SendRequest('friendships/'+ str(usernameId) +'/followers/?rank_token='+ self.rank_token,l=2)
    else:
        return self.SendRequest('friendships/'+ str(usernameId) +'/followers/?rank_token='+ self.rank_token + '&max_id='+ str(maxid))

def getTotalFollowers(self,usernameId):
    followers = []
    next_max_id = ''
    while 1:
        self.getUserFollowers(usernameId,next_max_id)
        temp = self.LastJson
        for item in temp["users"]:
            print(item)
            followers.append(item)
        if temp["big_list"] == False:
            return followers
        next_max_id = temp["next_max_id"]
How can I fix this?
It's hard to guess without knowing the content of temp["users"], but the error indicates that it contains non-BMP Unicode characters, for example emoji.
If you try to display that in IDLE, you immediately get that kind of error. Simple example to reproduce (on IDLE for Python 3.5):
>>> t = "ab \U0001F600 cd"
>>> print(t)
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
print(t)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 3-3: Non-BMP character not supported in Tk
(\U0001F600 represents the unicode character U+1F600 grinning face)
The error is indeed caused by Tk not supporting Unicode characters with code points above U+FFFF. A simple workaround is to filter them out of your string:
def BMP(s):
    # The BMP ends at U+FFFF, so the limit must be written in hex, not as decimal 10000.
    return "".join((i if ord(i) <= 0xFFFF else '\ufffd' for i in s))
'\ufffd' is the Python representation for the unicode U+FFFD REPLACEMENT CHARACTER.
My example becomes:
>>> t = "ab \U0001F600 cd"
>>> print(BMP(t))
ab � cd
So your code would become:
for item in temp["users"]:
    # str() is a no-op for strings and makes this safe if item is a dict.
    print(BMP(str(item)))
    followers.append(item)
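A self-contained sketch of the same filter for Python 3, in case it helps to check the cutoff. The function name replace_non_bmp and the sample strings are only for illustration. The Basic Multilingual Plane ends at U+FFFF (hexadecimal), so ordinary BMP characters such as "中" (U+4E2D) or "ń" should pass through untouched and only characters beyond it get replaced.

```python
def replace_non_bmp(text, replacement="\ufffd"):
    """Replace every character outside the BMP (code point > U+FFFF)."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


with_emoji = "ab \U0001F600 cd"   # contains U+1F600, outside the BMP
bmp_only = "中 ń"                  # both characters are inside the BMP

filtered_emoji = replace_non_bmp(with_emoji)
filtered_bmp = replace_non_bmp(bmp_only)
```

This only changes what gets displayed; the original strings are left as they are.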
I am trying to do the same thing in python as the java code below.
String decoded = new String("ä¸".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);
The output is a Chinese String "中".
In Python I tried the encode/decode/bytearray thing but I always got unreadable string. I think my problem is that I don't really understand how the java/python encoding mechanism works. Also I cannot find a solution from the existing answers.
#coding=utf-8
def p(s):
    print s + ' -- ' + str(type(s))
ch1 = 'ä¸-'
p(ch1)
chu1 = ch1.decode('ISO8859_1')
p(chu1.encode('utf-8'))
utf_8 = bytearray(chu1, 'utf-8')
p(utf_8)
p(utf_8.decode('utf-8').encode('utf-8'))
#utfstr = utf_8.decode('utf-8').decode('utf-8')
#p(utfstr)
p(ch1.decode('iso-8859-1').encode('utf8'))
ä¸- -- <type 'str'>
ä¸Â- -- <type 'str'>
ä¸Â- -- <type 'bytearray'>
ä¸Â- -- <type 'str'>
ä¸Â- -- <type 'str'>
Daniel Roseman's answer is really close. Thank you. But when it comes to my real case:
ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'
print ch.decode('utf-8').encode('iso-8859-1')
I got
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/Python/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte
Java code:
String decoded = new String("masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);
The output is masanori harigae のパーソナル会�-�室
You are doing this the wrong way round. You have a bytestring that is wrongly encoded as utf-8 and you want it to be interpreted as iso-8859-1:
>>> ch = "ä¸"
>>> print ch.decode('utf-8').encode('iso-8859-1')
中
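On Python 3 the direction flips, because a str is already text rather than bytes. A rough equivalent of the Java line, assuming the garbled text really is UTF-8 bytes that were misread as ISO-8859-1, could be sketched like this. The helper name fix_mojibake is made up, and the sample is written with escapes because the third character (U+00AD, a soft hyphen) is usually invisible.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 bytes read as ISO-8859-1' by reversing the wrong decode."""
    # Step 1: get back the original bytes. Step 2: decode them as UTF-8.
    return text.encode("iso-8859-1").decode("utf-8")


# "ä", "¸" and a soft hyphen: the bytes E4 B8 AD seen through ISO-8859-1.
garbled = "\u00e4\u00b8\u00ad"
repaired = fix_mojibake(garbled)
```

If the input contains characters that ISO-8859-1 cannot represent, the encode step raises UnicodeEncodeError, which is a hint that the text was garbled some other way.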
def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv','wb') as csvfile:
        csvwriter = csv.DictWriter(csvfile,fieldnames = ordered_fieldnames,extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow( {k: str(x[k]).encode('utf-8') for k in x})
        #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]

if __name__ == '__main__':
    main()
Basically the issue is that the tweets contain a bunch of characters in Portuguese. I tried to correct for this by encoding everything into unicode values before putting them in the dictionary that was to be added to the row. However this doesn't work. Any other ideas for formatting these values so that csv reader and dictreader can read them?
str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to a byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
...     def __str__(self):
...         return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce booleans to a string, coerce it to a Unicode string instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.
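To illustrate the P.S., here is a small Python 3 sketch (the variable names are arbitrary). Byte strings have no .encode() method at all, and mixing text with bytes raises TypeError immediately instead of going through a hidden ascii decode.

```python
raw = "résumé".encode("utf-8")      # bytes: b'r\xc3\xa9sum\xc3\xa9'

# bytes objects cannot be "encoded" again; the method simply does not exist.
bytes_have_encode = hasattr(raw, "encode")

# Concatenating text and bytes fails loudly rather than decoding implicitly.
try:
    "title: " + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The conversion has to be spelled out, with the codec named explicitly.
decoded = raw.decode("utf-8")
```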
My code is:
print os.urandom(64)
which outputs:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
\xd0\xc8=<\xdbD'
\xdf\xf0\xb3>\xfc\xf2\x99\x93
=S\xb2\xcd'\xdbD\x8d\xd0\\xbc{&YkD[\xdd\x8b\xbd\x82\x9e\xad\xd5\x90\x90\xdcD9\xbf9.\xeb\x9b>\xef#n\x84
which isn't readable, so I tried this:
print os.urandom(64).decode("utf-8")
but then I get:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
Traceback (most recent call last):
File "D:\zjm_code\a.py", line 17, in <module>
print os.urandom(64).decode("utf-8")
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-3: invalid data
What should I do to get human-readable output?
No shortage of choices. Here's a couple:
>>> os.urandom(64).encode('hex')
'0bf760072ea10140d57261d2cd16bf7af1747e964c2e117700bd84b7acee331ee39fae5cff6f3f3fc3ee3f9501c9fa38ecda4385d40f10faeb75eb3a8f557909'
>>> os.urandom(64).encode('base64')
'ZuYDN1BiB0ln73+9P8eoQ3qn3Q74QzCXSViu8lqueKAOUYchMXYgmz6WDmgJm1DyTX598zE2lClX\n4iEXXYZfRA==\n'
os.urandom is giving you a 64-byte string. Encoding it in hex is probably the best way to make it "human readable" to some extent. E.g.:
>>> s = os.urandom(64)
>>> s.encode('hex')
'4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e'
Of course this gives you 128 characters in the result, which may be too long a line to read comfortably; it's easy to split it up, though -- e.g.:
>>> print s[:32].encode('hex')
4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef
>>> print s[32:].encode('hex')
4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e
two chunks of 64 characters each shown on separate lines may be easier on the eye.
Random bytes are unlikely to form valid UTF-8, so I'm not surprised that you get decoding errors. Instead you need to convert them somehow. If all you're trying to do is see what they are, then something like:
print [ord(o) for o in os.urandom(64)]
Or, if you'd prefer to have it as hex 0-9a-f:
print ''.join( ['%02x' % ord(o) for o in os.urandom(64)] )
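The 'hex' and 'base64' string codecs used above are Python 2 only. On Python 3, a rough equivalent (a sketch, with arbitrary variable names) would use bytes.hex() and the base64 module:

```python
import base64
import os

raw = os.urandom(64)                      # 64 random bytes

hex_text = raw.hex()                      # two hex digits per byte
b64_text = base64.b64encode(raw).decode("ascii")

# Both representations are reversible, unlike decoding as UTF-8.
from_hex = bytes.fromhex(hex_text)
from_b64 = base64.b64decode(b64_text)
```

Hex always gives exactly twice as many characters as bytes; base64 is more compact but less pleasant to read by eye.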
ACTIVATE_THIS = """
eJx1UsGOnDAMvecrIlYriDRlKvU20h5aaY+teuilGo1QALO4CwlKAjP8fe1QGGalRoLEefbzs+Mk
Sb7NcvRo3iTcoGqwgyy06As+HWSNVciKaBTFywYoJWc7yit2ndBVwEkHkIzKCV0YdQdmkvShs6YH
E3IhfjFaaSNLoHxQy2sLJrL0ow98JQmEG/rAYn7OobVGogngBgf0P0hjgwgt7HOUaI5DdBVJkggR
3HwSktaqWcCtgiHIH7qHV+esW2CnkRJ+9R5cQGsikkWEV/J7leVGs9TV4TvcO5QOOrTHYI+xeCjY
JR/m9GPDHv2oSZunUokS2A/WBelnvx6tF6LUJO2FjjlH5zU6Q+Kz/9m69LxvSZVSwiOlGnT1rt/A
77j+WDQZ8x9k2mFJetOle88+lc8sJJ/AeerI+fTlQigTfVqJUiXoKaaC3AqmI+KOnivjMLbvBVFU
1JDruuadNGcPmkgiBTnQXUGUDd6IK9JEQ9yPdM96xZP8bieeMRqTuqbxIbbey2DjVUNzRs1rosFS
TsLAdS/0fBGNdTGKhuqD7mUmsFlgGjN2eSj1tM3GnjfXwwCmzjhMbR4rLZXXk+Z/6Hp7Pn2+kJ49
jfgLHgI4Jg==
""".decode("base64").decode("zlib")
my code:
import zlib
print 'dsss'.decode('base64').decode('zlib')#error
Traceback (most recent call last):
File "D:\zjm_code\b.py", line 4, in <module>
print 'dsss'.decode('base64').decode('zlib')
File "D:\Python25\lib\encodings\zlib_codec.py", line 43, in zlib_decode
output = zlib.decompress(input)
zlib.error: Error -3 while decompressing data: unknown compression method
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 7, in <module>
a.decode('base64')
File "D:\Python25\lib\encodings\base64_codec.py", line 42, in base64_decode
output = base64.decodestring(input)
File "D:\Python25\lib\base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
a='dsss'
a=a.encode('zlib')
#print a
a=a.decode('zlib')
print a#its ok
i think the 'print a' encode the a with 'uhf-8'.
so:
#encoding:utf-8
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('utf-8')#but error.
a=a.decode('zlib')
print a#
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 5, in <module>
a=a.decode('utf-8')
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 1: unexpected code byte
The data in the strings is encoded and compressed binary data. The .decode("base64").decode("zlib") unencodes and decompresses it.
The error you got was because 'dsss' decoded from base64 is not valid zlib compressed data.
What is the purpose of x.decode("base64").decode("zlib") for x in ("sss", "dsss", random_garbage)? Excuse me, you should know; you are the one who is doing it!
Edit after OP's addition of various puzzles
Puzzle 1
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
Resolution: all 3 statements of the form
a.XXcode('encoding')
should be
a = a.XXcode('encoding')
Puzzle 2
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
But it does print 'dsss':
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x£K)..♠ ♦F☺¥
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
Puzzle 3
"""i think the 'print a' encode the a with 'uhf-8'."""
Resolution: You think extremely incorrectly. What follows the print is an expression. There are no such side effects. What do you imagine happens when you do this:
print 'start text ' + a + 'end text'
?
What do you imagine happens if you do print a twice? Encoding the already-encoded text again? Why don't you stop imagining and try it out?
In any case, note that the output of str.encode('zlib') is an str object, not a unicode object:
>>> print repr('dsss'.encode('zlib'))
'x\x9cK)..\x06\x00\x04F\x01\xbe'
Getting from that to UTF-8 is going to be somewhat difficult ... it would have to be decoded into unicode first -- with what codec? ascii and utf8 are going to have trouble with the '\x9c' and the '\xbe' ...
It is the reverse of:
original_message.encode('zlib').encode('base64')
zlib is a binary compression algorithm. base64 is a text encoding of binary data, which is useful to send binary message through text protocols like SMTP.
After 'dsss' was decoded from base64 (the three bytes 76h, CBh, 2Ch), the result was not valid zlib compressed data so it couldn't be decoded.
Try printing ACTIVATE_THIS to see the result of the decoding. It turns out to be some Python code.
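For reference, the same round trip can be sketched on Python 3 with the zlib and base64 modules, since the string codecs above are Python 2 only. The helper names pack and unpack are made up for this example. It also shows why 'dsss' fails: it is valid base64, but the three bytes it decodes to are not zlib data.

```python
import base64
import zlib


def pack(data):
    """Compress bytes with zlib, then base64-encode the result."""
    return base64.b64encode(zlib.compress(data))


def unpack(encoded):
    """Reverse of pack(): base64-decode first, then zlib-decompress."""
    return zlib.decompress(base64.b64decode(encoded))


packed = pack(b"dsss")
restored = unpack(packed)

# 'dsss' on its own decodes from base64 to three arbitrary bytes...
not_zlib = base64.b64decode(b"dsss")

# ...which zlib rejects, because they were never produced by zlib.compress().
try:
    zlib.decompress(not_zlib)
    decompress_failed = False
except zlib.error:
    decompress_failed = True
```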
.decode('base64') can be called only on a string that's encoded as base64, in order to retrieve the byte sequence that was encoded there. Presumably that byte sequence, in the example you bring, was zlib-compressed, and so the .decode('zlib') part decompresses it.
Now, for your case:
>>> 'dsss'.decode('base64')
'v\xcb,'
But 'v\xcb,' is not a zlib-compressed string! And so of course you cannot ask zlib to "decompress" it. Fortunately zlib recognizes the fact (that 'v\xcb,' could not possibly have been produced by applying any of the compression algorithms zlib knows about to any input whatsoever) and so gives you a helpful error message (instead of a random-ish string of bytes, which you might well have gotten if you had randomly supplied a different but equally crazy input string!-)
Edit: the error in
a.encode('base64')
print a
a.decode('base64')#error
is obviously due to the fact that strings are immutable: just calling a.encode (or any other method) does not alter a, it produces a new string object (and here you're just printing it).
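A tiny Python 3 sketch of that point, using zlib.compress in place of the Python 2 'zlib' codec (variable names are arbitrary): calling the function without assigning the result leaves the original object exactly as it was.

```python
import zlib

data = b"dsss"

zlib.compress(data)          # result is thrown away; data is not modified
still_original = data

compressed = zlib.compress(data)         # keep the result by assigning it
restored = zlib.decompress(compressed)   # and again for the reverse step
```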
In the next snippet, the error is only in the OP's mind:
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x?K)..F?
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
that "why can't print" question is truly peculiar, applied to code that does print 'dsss'. Finally,
i think the 'print a' encode the a
with 'uhf-8'.
You think wrongly: there's no such thing as "uhf-8" (you mean "utf-8" maybe?), and anyway print a does not alter a, any more than just calling a.encode does.