Spark Python unicode error

I am trying to filter my RDD on one column for a specific value and then take a count, but if I compare the column as-is, the count comes back as 0.
Converting the column with str() instead gives a unicode error.
bfcrdd = bfcfile.map(lambda l: l.split(",")).filter(lambda l:l[13] == 'Covered')
bfcrdd.count()
returns a count of 0, even though many rows have the value 'Covered' in column 13.
Running it as
bfcrdd = bfcfile.map(lambda l: l.split(",")).filter(lambda l:str(l[13]) == 'Covered')
bfcrdd.count()
gives this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 19: ordinal not in range(128)
This happens not only with count() but with collect() and take() as well.
I have tried
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
and
bfcrdd = bfcrdd.map(lambda p: (p[13].encode("ascii", "ignore"))).collect()
but nothing works :(
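No answer is recorded in this thread, but a minimal sketch of the usual Python 2 approach follows. It assumes the lines are unicode and that the zero count comes from stray whitespace or quoting around the field (an assumption, since the data isn't shown); the key point, which the error itself confirms, is that str() on a unicode value containing u'\x92' triggers an implicit ASCII encode, so compare against a unicode literal and skip the conversion:

bfcrdd = (bfcfile.map(lambda l: l.split(","))
                 .filter(lambda l: l[13].strip() == u'Covered'))  # no str(): avoids the implicit ascii encode
bfcrdd.count()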

Related

'UCS-2' codec can't encode characters in position 61-61

When I run my Python code and print(item), I get the following error:
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 61-61: Non-BMP character not supported in Tk
Here is my code:
def getUserFollowers(self, usernameId, maxid=''):
    if maxid == '':
        return self.SendRequest('friendships/' + str(usernameId) + '/followers/?rank_token=' + self.rank_token, l=2)
    else:
        return self.SendRequest('friendships/' + str(usernameId) + '/followers/?rank_token=' + self.rank_token + '&max_id=' + str(maxid))

def getTotalFollowers(self, usernameId):
    followers = []
    next_max_id = ''
    while 1:
        self.getUserFollowers(usernameId, next_max_id)
        temp = self.LastJson
        for item in temp["users"]:
            print(item)
            followers.append(item)
        if temp["big_list"] == False:
            return followers
        next_max_id = temp["next_max_id"]
How can I fix this?
Hard to guess without knowing the content of temp["users"], but the error indicates that it contains non-BMP Unicode characters, for example emoji.
If you try to display those in IDLE, you immediately get that kind of error. A simple example to reproduce it (in IDLE for Python 3.5):
>>> t = "ab \U0001F600 cd"
>>> print(t)
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
print(t)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 3-3: Non-BMP character not supported in Tk
(\U0001F600 represents the Unicode character U+1F600 GRINNING FACE)
The error is indeed caused by Tk not supporting Unicode characters with a code point greater than FFFF. A simple workaround is to filter them out of your string:
def BMP(s):
    return "".join(i if ord(i) < 0x10000 else '\ufffd' for i in s)
'\ufffd' is the Python representation of the Unicode U+FFFD REPLACEMENT CHARACTER.
My example becomes:
>>> t = "ab \U0001F600 cd"
>>> print(BMP(t))
ab � cd
So your code would become:
for item in temp["users"]:
    print(BMP(item))
    followers.append(item)

Dictionary keys cannot be encoded as utf-8

I am using the Twitter streaming API (tweepy) to capture several tweets. I do this in Python 2.7.
After I have collected a corpus of tweets, I break each tweet into words and add each word to a dictionary as a key, where the values record each word's participation in positive or negative sentences.
When I retrieve the words as keys of the dictionary and try to process them for the next iteration, I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
errors.
The weird thing is that before I place them as dictionary keys, I encode them without errors. Here is a sample of the code:
pos = {}
neg = {}
for status in corpus:
    p = s.analyze(status).polarity
    words = []
    # gather real words
    for w in status.split(' '):
        try:
            words.append(w.encode('utf-8'))
        except UnicodeDecodeError as e:
            print(e)
    # assign sentiment of the sentence to the words
    for w in words:
        if w not in pos:
            pos[w] = 0
            neg[w] = 0
        if p >= 0:
            pos[w] += 1
        else:
            neg[w] += 1

k = pos.keys()
k = [i.encode('utf-8') for i in k]  # <-- for this line I get an error
p = [v for i, v in pos.items()]
n = [v for i, v in neg.items()]
So this piece of code catches no errors during the splitting of the words, but it throws an error when trying to encode the keys again. I should note that normally I wouldn't try to encode the keys again, as I would think they are already properly encoded; I added this extra encoding only to narrow down the source of the error.
Am I missing something? Do you see anything wrong with my code?
To avoid confusion, here is a sample of code closer to the original that does not try to encode the keys again:
k = ['happy']
for i in range(3):
    print('sampling twitter --> {}'.format(i))
    myStream.filter(track=k)  # <-- this is where I receive the error in the second iteration
    for status in corpus:
        p = s.analyze(status).polarity
        words = []
        # gather real words
        for w in status.split(' '):
            try:
                words.append(w.encode('utf-8'))
            except UnicodeDecodeError as e:
                print(e)
        # assign sentiment of the sentence to the words
        for w in words:
            if w not in pos:
                pos[w] = 0
                neg[w] = 0
            if p >= 0:
                pos[w] += 1
            else:
                neg[w] += 1
    k = pos.keys()
(please suggest a better title for the question)
You get a decode error while you are trying to encode a string. This seems weird, but it is due to Python's implicit decode/encode mechanism.
Python lets you encode strings to obtain bytes and decode bytes to obtain strings. This means that Python can encode only strings and decode only bytes.
So when you try to encode bytes, Python (which does not know how to encode bytes) implicitly decodes the bytes to obtain a string, using its default encoding, and then encodes that string.
This is why you get a decode error while trying to encode something: the implicit decoding.
It means you are probably trying to encode something which is already encoded.
Note that the error message says "'ascii' codec can't decode ...". That's because when you call encode on something that is already a byte string in Python 2, it first tries to decode it to Unicode using the default codec.
I'm not sure why you thought encoding again would be a good idea. Don't do it; the values are already byte strings, so leave them as they are.
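To see the implicit decoding in action (a Python 2 session; the euro sign is just an illustrative non-ASCII value, not taken from the question's data):

>>> s = u'\u20ac'.encode('utf-8')   # s is now the byte string '\xe2\x82\xac'
>>> s.encode('utf-8')               # Python 2 first decodes s with the ascii codec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)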

Python and Pandas: UnicodeDecodeError: 'ascii' codec can't decode byte

After using Pandas to read a JSON object into a pandas DataFrame, we want to print only the first year in each row. E.g., if we have 2013-2014(2015), we want to print 2013.
Full code (here)
x = '{"0":"1985\\u2013present","1":"1985\\u2013present",......}'
a = pd.read_json(x, typ='series')
for i, row in a.iteritems():
    print row.split('-')[0].split('—')[0].split('(')[0]
the following error occurs:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1333-d8ef23860c53> in <module>()
1 for i, row in a.iteritems():
----> 2 print row.split('-')[0].split('—')[0].split('(')[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Why is this happening? How can we fix the problem?
Your JSON data strings are unicode strings, which you can see for example by just printing one of the values:
In: a[0]
Out: u'1985\u2013present'
Now you try to split the string at the unicode character u'\u2013' (EN DASH), but the separator you pass to split is not a unicode string; it is a byte string (hence the error 'ascii' codec can't decode byte 0xe2: the UTF-8 encoded dash is not an ASCII character).
To make your example working, you could use:
for i, row in a.iteritems():
    print row.split('-')[0].split(u'—')[0].split('(')[0]
Notice the u in front of the unicode dash. You could also write u'\u2013' to split the string.
For details on unicode in Python, see https://docs.python.org/2/howto/unicode.html
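To see the mismatch in isolation (Python 2; '\xe2\x80\x93' is the UTF-8 encoding of the EN DASH, shown here only to make the byte/unicode mix explicit):

>>> u'1985\u2013present'.split('\xe2\x80\x93')   # byte-string separator: implicit ascii decode fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
>>> u'1985\u2013present'.split(u'\u2013')        # unicode separator works
[u'1985', u'present']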

Python: Unicode and Str

What's the best way to convert unicode to str? When I run the code, I receive the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 13: ordinal not in range(128).
My code:
import json
import urllib2

def locu_data(city, api_key):
    local_business_data = 'https://api.locu.com/v1_0/venue/search/?locality=' + city + '&api_key=' + api_key
    open_local_business_data = urllib2.urlopen(local_business_data)
    data_load = json.load(open_local_business_data)
    Business = [x for x in data_load['objects']]
    Event = []
    for item in Business:
        if (item != None):
            Event.append(('Name:{},Categories:{},City:{},Longitutde:{},Latitude:{},Website:{}\n').format(item['name'], item['categories'], item['locality'], item['long'], item['lat'], item['website_url']))
        else:
            Event.append('None')
    print Event

locu_data(city=raw_input("Please enter the city you would like to analyze:"), api_key=raw_input("Please enter your locu API key."))
Just use string.decode(), where string is the variable you are manipulating.
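A hedged aside, not part of the answer above: for this particular UnicodeEncodeError, the more common Python 2 idiom is to encode the unicode values explicitly, so the ascii codec is never invoked implicitly. A minimal sketch, assuming item['name'] is a unicode value such as u'Caf\xe9':

name = item['name']            # unicode, e.g. u'Caf\xe9'
print name.encode('utf-8')     # explicit UTF-8 encode; no implicit ascii conversion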

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 8: ordinal not in range(128)

I'm writing data, fetched from a jobs API, to a Google spreadsheet. The 'latin-1' encoding below works up to page 93, but raises an exception when it reaches page 94. I've tried several other techniques, but 'latin-1' got the furthest; the alternatives are commented out (they die on page 65). Could you please tell me how to modify the non-commented line (i.e. .encode('latin-1')) so that all 199 pages are written to the spreadsheet safely?
The code is given below.
Any guidance in this regard is appreciated in advance.
def append_data(self, worksheet, row, start_row, start_col, end_col):
    r = start_row  # last_empty_row(worksheet)
    j = 0
    i = start_col
    while (i <= end_col):
        try:
            worksheet.update_cell(r, i, unicode(row[j]).encode('latin-1', 'ignore'))
            #worksheet.update_cell(r, i, unicode(row[j]).decode('latin-1').encode("utf-16"))
            #worksheet.update_cell(r, i, unicode(row[j]).encode('iso-8859-1'))
            #worksheet.update_cell(r, i, unicode(row[j]).encode('latin-1').decode("utf-8"))
            #worksheet.update_cell(r, i, unicode(row[j]).decode('utf-8'))
            #worksheet.update_cell(r, i, unicode(row[j]).encode('latin-1', 'replace'))
            #worksheet.update_cell(r, i, unicode(row[j]).encode(sys.stdout.encoding, 'replace'))
            #worksheet.update_cell(r, i, row[j].encode('utf8'))
            #worksheet.update_cell(r, i, filter(self.onlyascii(str(row[j]))))
        except Exception as e:
            self.ehandling_obj.error_handler(self.ehandling_obj.SPREADSHEET_ERROR, [1])
            try:
                worksheet.update_cell(r, i, 'N/A')
            except Exception as ee:
                y = 23
        j = j + 1
        i = i + 1
You are calling unicode() on a byte string value, which means Python will have to decode it to Unicode first:
>>> unicode('\xea')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
It is this decoding that fails, not the encoding from Unicode back to byte strings.
You either already have Latin-1 input data, or you should decode using the appropriate codec:
unicode(row[j], 'utf8').encode('latin1')
or using str.decode():
row[j].decode('utf8').encode('latin1')
I picked UTF-8 as an example here; you didn't provide any detail about the input data or its possible encodings, so you need to pick the right codec yourself.
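Applied to the non-commented line in the loop above, a sketch under the same UTF-8 assumption:

worksheet.update_cell(r, i, row[j].decode('utf8').encode('latin-1', 'ignore'))  # decode first, then encode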
