UnicodeDecodeError -Decoding and saving in Excel - python

I'm trying to save some data I scraped in an excel sheet, and I'm having unicode decode problems with one particular piece, that has the following form:
work_info['title'] = Darimān-i afsaradgī : rāhnamā-yi kāmil bira-yi hamah-ʼi khānvādahʹhā
The code that is causing the error is:
data.write(b + book + accumulated_books+ 2, 43, work_info['title'])
wb.save('/Users/apple/Downloads/WC Scrape_trialfortwo.csv')
And the error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5: ordinal not in range(128)
I've tried several different encoding/decoding techniques, but nothing have worked so far. Any suggestion would be extremely appreciated.
Thanks!

It looks like you are using python2, and python2's unicode/bytes handling is causing the problem.
>>> s = 'Darimān-i afsaradgī : rāhnamā-yi kāmil bira-yi hamah-ʼi khānvādahʹhā'
>>> wb = Workbook()
>>> ws = wb.add_sheet('test')
>>> ws.write(1, 0, s)
>>> wb.save('test.xls')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5: ordinal not in range(128)
xlwt assumes that s is an ascii-encoded string and tries to decode it to unicode, but fails:
>>> s.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5: ordinal not in range(128)
In fact, s is encoded as utf-8:
>>> s.decode('utf-8')
u'Darim\u0101n-i afsaradg\u012b : r\u0101hnam\u0101-yi k\u0101mil bira-yi hamah-\u02bci kh\u0101nv\u0101dah\u02b9h\u0101'
The simplest solution may be to encode your workbook as utf-8:
>>> wb = Workbook(encoding='utf-8')
>>> ws = wb.add_sheet('test')
>>> ws.write(1, 0, s)
>>> wb.save('test.xls')
If you need a finer-grained approach, you could explicitly decode the string to unicode before writing it to the worksheet:
>>> wb = Workbook()
>>> ws = wb.add_sheet('test')
>>> ws.write(1, 0, s.decode('utf-8'))
>>> wb.save('test.xls')

Related

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 42: ordinal not in range(128)

def main():
client = ##client_here
db = client.brazil
rio_bus = client.tweets
result_cursor = db.tweets.find()
first = result_cursor[0]
ordered_fieldnames = first.keys()
with open('brazil_tweets.csv','wb') as csvfile:
csvwriter = csv.DictWriter(csvfile,fieldnames = ordered_fieldnames,extrasaction='ignore')
csvwriter.writeheader()
for x in result_cursor:
print x
csvwriter.writerow( {k: str(x[k]).encode('utf-8') for k in x})
#[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]
if __name__ == '__main__':
main()
Basically the issue is that the tweets contain a bunch of characters in Portuguese. I tried to correct for this by encoding everything into unicode values before putting them in the dictionary that was to be added to the row. However this doesn't work. Any other ideas for formatting these values so that csv reader and dictreader can read them?
str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to an byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
... def __str__(self):
... return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce booleans to a string, coerce it to a Unicode string instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.

Python and Pandas: UnicodeDecodeError: 'ascii' codec can't decode byte

After using Pandas to read a json object into a Pandas.DataFrame, we only want to print the first year in each pandas row. Eg: if we have 2013-2014(2015), we want to print 2013
Full code (here)
x = '{"0":"1985\\u2013present","1":"1985\\u2013present",......}'
a = pd.read_json(x, typ='series')
for i, row in a.iteritems():
print row.split('-')[0].split('—')[0].split('(')[0]
the following error occurs:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1333-d8ef23860c53> in <module>()
1 for i, row in a.iteritems():
----> 2 print row.split('-')[0].split('—')[0].split('(')[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Why is this happening? How can we fix the problem?
Your json data strings are unicode string, which you can see for example by just printing one of the values:
In: a[0]
Out: u'1985\u2013present'
Now you try to split the string at the unicode \u2031 (EN DASH), but the string you give to split is no unicode string (therefore the error 'ascii' codec can't decode byte 0xe2 - the EN DASH is no ASCII character).
To make your example working, you could use:
for i, row in a.iteritems():
print row.split('-')[0].split(u'—')[0].split('(')[0]
Notice the u in front of the uncode dash. You could also write u'\u2013' to split the string.
For details on unicode in Python, see https://docs.python.org/2/howto/unicode.html

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 8: ordinal not in range(128)

I'm writing data, fetched from jobs API, to the Google spreadsheet. Following encoding for 'latin-1' encodes till page# 93 but when reaches 94, it goes in exception. I've used different following techniques, but 'latin-1' did max pagination. Else have been commented(as they die on page #65). Could you please tell me how to modify non-commented(i-e .encode('latin-1')) to get 199 pages safely written on spreadsheet?
Code is given as below:
Any guideline in this regard is appreciated in advance.
def append_data(self,worksheet,row,start_row, start_col,end_col):
r = start_row #last_empty_row(worksheet)
j = 0
i = start_col
while (i <= end_col):
try:
worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1','ignore'))
#worksheet.update_cell(r,i,unicode(row[j]).decode('latin-1').encode("utf-
16"))
#worksheet.update_cell(r,i,unicode(row[j]).encode('iso-8859-1'))
#worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1').decode("utf-
8"))
#worksheet.update_cell(r,i,unicode(row[j]).decode('utf-8'))
#worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1', 'replace'))
#worksheet.update_cell(r,i,unicode(row[j]).encode(sys.stdout.encoding,
'replace'))
#worksheet.update_cell(r,i,row[j].encode('utf8'))
#worksheet.update_cell(r,i,filter(self.onlyascii(str(row[j]))))
except Exception as e:
self.ehandling_obj.error_handler(self.ehandling_obj.SPREADSHEET_ERROR,[1])
try:
worksheet.update_cell(r,i,'N/A')
except Exception as ee:
y = 23
j = j + 1
i = i + 1
You are calling unicode() on a byte string value, which means Python will have to decode to Unicode first:
>>> unicode('\xea')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
It is this decoding that fails, not the encoding from Unicode back to byte strings.
You either already have Latin-1 input data, or you should decode using the appropriate codec:
unicode(row[j], 'utf8').encode('latin1')
or using str.decode():
row[j].decode('utf8').encode('latin1')
I picked UTF-8 as an example here, you didn't provide any detail about the input data or its possible encodings. You need to pick the right codec yourself here.

Decode an ENCODED unicode string in Python

I need to decode a "UNICODE" encoded string:
>>> id = u'abcdß'
>>> encoded_id = id.encode('utf-8')
>>> encoded_id
'abcd\xc3\x9f'
The problem I have is:
Using Pylons routing, I get the encoded_id variable as a unicode string u'abcd\xc3\x9f' instead of a just a regular string 'abcd\xc3\x9f':
Using python, how can I decode my encoded_id variable which is a unicode string?
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/test/vng/lib64/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
You have UTF-8 encoded data (there is no such thing as UNICODE encoded data).
Encode the unicode value to Latin-1, then decode from UTF8:
encoded_id.encode('latin1').decode('utf8')
Latin 1 maps the first 255 unicode points one-on-one to bytes.
Demo:
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.encode('latin1').decode('utf8')
u'abcd\xdf'
>>> print encoded_id.encode('latin1').decode('utf8')
abcdß

UnicodeEncodeError with csvwriter

I have one more error to fix.
row = OpenThisLink + titleTag + JD
try:
csvwriter.writerow([row])
except (UnicodeEncodeError, UnicodeDecodeError):
pass
This gives the error (for this character: "ń")
row = OpenThisLink + str(titleTag) + JD
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 51: ordinal not in range(128)
I tried to fix this by using the method here. But,
>>> title = "hello Giliciński"
Unsupported characters in input
u = unicode(title, "latin1")
Traceback (most recent call last):
File "<pyshell#56>", line 1, in <module>
u = unicode(title, "latin1")
NameError: name 'title' is not defined
>>> title = "ń" Unsupported characters in input
According to documentation:
Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided.
And indeed, my exception doesn't seem to work. Any suggestions?
Thanks!
And indeed, my exception doesn't seem
to work. Any suggestions?
row = OpenThisLink + titleTag + JD is outside the try/except block and so any exceptions raised while that statement is running will not be caught. This, however, will catch the exception:
try:
row = OpenThisLink + titleTag + JD
csvwriter.writerow([row])
except (UnicodeEncodeError, UnicodeDecodeError):
print "Caught unicode error"
But, in the code that you posted, row = OpenThisLink + titleTag + JD will not raise UnicodeEncodeError if titleTag contains a unicode string; the result of the string concatenation will be of type unicode.
Now, the csv module doesn't support unicode, so when you call writerow() with unicode data this will raise UnicodeEncodeError. You need to encode your unicode strings into a suitable encoding (UTF8 would be best) and then pass that to writerow(), for example:
>>> titleTag = "hello Giliciński"
>>> titleTag
'hello Gilici\xc5\x84ski'
>>> type(titleTag)
<type 'str'>
>>>
>>> titleTag = titleTag.decode('utf8')
>>> titleTag
u'hello Gilici\u0144ski'
>>> type(titleTag)
<type 'unicode'>
>>>
>>> csvwriter.writerow([titleTag])
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 12: ordinal not in range(128)
>>>
>>> # but this will work...
>>> csvwriter.writerow([titleTag.encode('utf8')])
The relevant Python documentation is here. Be sure to look at the examples, in particular the last one.
BTW, pyshell doesn't seem to accept non-ascii characters as input so use the normal Python interpretter.
For IDLE, according to the solution here(link), open file $python/Lib/idellib/IOBinding.py, forcefully put
encoding = "utf-8"
after the try-except-pass module for setting locale. Close IDLE and save the file(perhaps requires administrative priority) and open IDLE again. At least it works for me. My IDLE version is 1.2, python: 2.5.

Categories