Python and Pandas: UnicodeDecodeError: 'ascii' codec can't decode byte

After using Pandas to read a JSON object into a pandas Series, we want to print only the first year in each row. E.g., if a value is 2013-2014(2015), we want to print 2013.
Full code:
x = '{"0":"1985\\u2013present","1":"1985\\u2013present",......}'
a = pd.read_json(x, typ='series')
for i, row in a.iteritems():
    print row.split('-')[0].split('—')[0].split('(')[0]
the following error occurs:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1333-d8ef23860c53> in <module>()
1 for i, row in a.iteritems():
----> 2 print row.split('-')[0].split('—')[0].split('(')[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Why is this happening? How can we fix the problem?

Your JSON data strings are unicode strings, which you can see, for example, by inspecting one of the values:
In: a[0]
Out: u'1985\u2013present'
Now you try to split the string at the dash, but the separator you pass to split() is not a unicode string; Python therefore has to decode it with the default ascii codec, which fails ('ascii' codec can't decode byte 0xe2) because the UTF-8 encoding of the dash starts with the non-ASCII byte 0xe2. Note that the data contains u'\u2013' (EN DASH), which is not an ASCII character.
To make your example work, you could use:
for i, row in a.iteritems():
    print row.split('-')[0].split(u'—')[0].split('(')[0]
Notice the u in front of the unicode dash. Since the values actually contain an EN DASH, you could also write u'\u2013' to split the string.
For details on unicode in Python, see https://docs.python.org/2/howto/unicode.html
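An alternative (a minimal sketch, assuming Python 2 and the series a from the question) is to do all the splitting in one pass with a unicode regular expression, which avoids mixing byte strings and unicode strings entirely:
import re

# Split on hyphen, EN DASH (\u2013), EM DASH (\u2014) or '(' in one go;
# the pattern is a unicode string, so no implicit ascii decoding happens.
for i, row in a.iteritems():
    print re.split(u'[-\u2013\u2014(]', row)[0].strip()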

Related

spark python unicode error

I am trying to filter my RDD on one column for one specific value and then take a count, but if I read the column as-is, the count comes back as 0. Reading the column as str gives a unicode error.
bfcrdd = bfcfile.map(lambda l: l.split(",")).filter(lambda l:l[13] == 'Covered')
bfcrdd.count()
gives a count of 0, even though many rows have the value 'Covered' in column 13.
Running it as
bfcrdd = bfcfile.map(lambda l: l.split(",")).filter(lambda l:str(l[13]) == 'Covered')
bfcrdd.count()
gives error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 19: ordinal not in range(128)
The issue occurs not only with count() but also with collect() and take().
I have tried
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
and
bfcrdd = bfcrdd.map(lambda p: (p[13].encode("ascii", "ignore"))).collect()
but nothing works :(
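A minimal sketch of one plausible fix, assuming bfcfile comes from sc.textFile (which yields unicode lines by default) and that the zero count is caused by stray whitespace around the field: keep everything unicode and strip before comparing, instead of forcing the value through str().
# Hypothetical sketch: compare against a unicode literal and strip whitespace;
# str() would try to ascii-encode the unicode field and fail on u'\x92'.
bfcrdd = (bfcfile.map(lambda l: l.split(u','))
                 .filter(lambda l: len(l) > 13 and l[13].strip() == u'Covered'))
print bfcrdd.count()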

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128) [duplicate]

This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
(34 answers)
Closed 6 years ago.
I got the encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)
in the following Python (PySpark) code, where row is a data frame row:
def rowToLine(row):
    line = str(row[0]).strip()
    columnNum = 44
    for k in xrange(1, columnNum):
        line = line + "\t"
        line = line + str(row[k]).strip()  # encoding error here
    return line
I also tried the join below:
def rowToLine(row):
    s = "\t"
    return s.join(row)
but some of the row values are ints, so I got an error:
TypeError: sequence item 19: expected string or Unicode, int found
Does anyone know how to fix this? Thanks!
Thanks for everyone's suggestions!
I basically took Padraic Cunningham's idea and made some modifications to handle the int case. The code below works.
def rowToLine(row):
    s = "\t"
    return s.join(x.encode("utf-8") if isinstance(x, basestring) else str(x).encode("utf-8") for x in row)
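A quick check with a mixed-type row (the values here are invented for illustration):
>>> rowToLine([u'r\xe9sum\xe9', 42, u'ok'])
'r\xc3\xa9sum\xc3\xa9\t42\tok'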

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 42: ordinal not in range(128)

def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv', 'wb') as csvfile:
        csvwriter = csv.DictWriter(csvfile, fieldnames=ordered_fieldnames, extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow({k: str(x[k]).encode('utf-8') for k in x})
    #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]

if __name__ == '__main__':
    main()
Basically the issue is that the tweets contain a bunch of characters in Portuguese. I tried to correct for this by encoding everything into unicode values before putting them in the dictionary that was to be added to the row. However, this doesn't work. Any other ideas for formatting these values so that csv reader and DictReader can read them?
str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to a byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
...     def __str__(self):
...         return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce booleans to a string, coerce it to a Unicode string instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
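Putting this back into the question's writerow call gives (sketch):
csvwriter.writerow({k: unicode(x[k]).encode('utf-8') for k in x})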
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 8: ordinal not in range(128)

I'm writing data fetched from a jobs API to a Google spreadsheet. Encoding with 'latin-1' gets as far as page 93, but on page 94 it raises an exception. I've tried the other variants shown commented out below, but they die on page 65; 'latin-1' made it furthest through the pagination. Could you please tell me how to modify the non-commented line (i.e. .encode('latin-1')) so that all 199 pages are written to the spreadsheet safely?
The code is given below; any guidance is appreciated in advance.
def append_data(self, worksheet, row, start_row, start_col, end_col):
    r = start_row  # last_empty_row(worksheet)
    j = 0
    i = start_col
    while (i <= end_col):
        try:
            worksheet.update_cell(r, i, unicode(row[j]).encode('latin-1', 'ignore'))
            #worksheet.update_cell(r,i,unicode(row[j]).decode('latin-1').encode("utf-16"))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('iso-8859-1'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1').decode("utf-8"))
            #worksheet.update_cell(r,i,unicode(row[j]).decode('utf-8'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1', 'replace'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode(sys.stdout.encoding, 'replace'))
            #worksheet.update_cell(r,i,row[j].encode('utf8'))
            #worksheet.update_cell(r,i,filter(self.onlyascii(str(row[j]))))
        except Exception as e:
            self.ehandling_obj.error_handler(self.ehandling_obj.SPREADSHEET_ERROR, [1])
            try:
                worksheet.update_cell(r, i, 'N/A')
            except Exception as ee:
                y = 23
        j = j + 1
        i = i + 1
You are calling unicode() on a byte string value, which means Python first has to decode it to Unicode using the default ascii codec:
>>> unicode('\xea')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
It is this decoding that fails, not the encoding from Unicode back to byte strings.
You either already have Latin-1 input data, or you should decode using the appropriate codec:
unicode(row[j], 'utf8').encode('latin1')
or using str.decode():
row[j].decode('utf8').encode('latin1')
I picked UTF-8 as an example here; you didn't provide any detail about the input data or its possible encoding. You need to pick the right codec yourself.
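Folded back into the question's loop, the fix might look like this (a sketch; it assumes the API hands back UTF-8 byte strings, so swap in the real codec if it differs):
# Hypothetical fix: decode byte strings from their actual encoding (assumed
# UTF-8 here) instead of letting unicode() attempt ascii; non-string values
# (ints, None, ...) are coerced with unicode() as before.
value = row[j].decode('utf-8') if isinstance(row[j], str) else unicode(row[j])
worksheet.update_cell(r, i, value.encode('latin-1', 'ignore'))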

Decode an ENCODED unicode string in Python

I need to decode a "UNICODE" encoded string:
>>> id = u'abcdß'
>>> encoded_id = id.encode('utf-8')
>>> encoded_id
'abcd\xc3\x9f'
The problem I have is:
Using Pylons routing, I get the encoded_id variable as a unicode string u'abcd\xc3\x9f' instead of just a regular byte string 'abcd\xc3\x9f'.
Using python, how can I decode my encoded_id variable which is a unicode string?
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/test/vng/lib64/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
You have UTF-8 encoded data (there is no such thing as UNICODE encoded data).
Encode the unicode value to Latin-1, then decode from UTF8:
encoded_id.encode('latin1').decode('utf8')
Latin-1 maps the first 256 Unicode code points one-to-one to bytes.
Demo:
>>> encoded_id = u'abcd\xc3\x9f'
>>> encoded_id.encode('latin1').decode('utf8')
u'abcd\xdf'
>>> print encoded_id.encode('latin1').decode('utf8')
abcdß
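Looking at the intermediate step makes it clearer: encoding to Latin-1 first recovers the original UTF-8 bytes unchanged, which the final decode then interprets:
>>> encoded_id.encode('latin1')
'abcd\xc3\x9f'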
