What's the best way to convert a unicode string to str? When I run the code below, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 13: ordinal not in range(128).
My code:
import json
import urllib2

def locu_data(city, api_key):
    local_business_data = 'https://api.locu.com/v1_0/venue/search/?locality=' + city + '&api_key=' + api_key
    open_local_business_data = urllib2.urlopen(local_business_data)
    data_load = json.load(open_local_business_data)
    Business = [x for x in data_load['objects']]
    Event = []
    for item in Business:
        if item is not None:
            Event.append('Name:{},Categories:{},City:{},Longitude:{},Latitude:{},Website:{}\n'.format(
                item['name'], item['categories'], item['locality'], item['long'], item['lat'], item['website_url']))
        else:
            Event.append('None')
    print Event

locu_data(city=raw_input("Please enter the city you would like to analyze:"),
          api_key=raw_input("Please enter your locu API key."))
Just use string.decode() where string is the variable you are manipulating.
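For example (a minimal sketch, assuming the byte string holds UTF-8 data; substitute the real source encoding if yours differs):

raw = 'caf\xc3\xa9'          # a Python 2 byte string containing UTF-8 bytes
text = raw.decode('utf-8')   # -> u'caf\xe9', a proper unicode string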
I want to read several .txt documents, but I get an error on this line:
lyrics = "".join(f.readlines())
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1148: character maps to <undefined>
How can I fix this?
My code function is:
import os
import re

def read_lyrics():
    reg1 = re.compile("\.txt$")
    reg2 = re.compile("([0-9]+)\.txt")
    reg3 = re.compile(".*_([0-9])\.txt")
    reg4 = re.compile("\[.+\]")
    reg5 = re.compile("info\.txt")
    lyrics_dictionary = {}
    # iterate over every directory and load each song (txt file)
    for i in os.listdir():
        if os.path.isdir(i):
            for path, sub, items in os.walk(i):
                if any([reg1.findall(item) for item in items]):
                    for item in items:
                        if reg5.findall(item):
                            continue
                        if reg3.findall(item):
                            num = ["0" + reg3.findall(item)[0]]
                            name = "_".join(path.split("/") + num)
                        else:
                            name = "_".join(path.split("/") + reg2.findall(item))
                        print("The path is: ", path)
                        print("The item is: ", item)
                        with open(os.path.join(path, item), "r") as f:
                            print("The file path is: ", f)
                            lyrics = "".join(f.readlines())
                            lyrics = reg4.subn("", lyrics)[0]
                            lyrics_dictionary[name] = lyrics
    return lyrics_dictionary
When you call open() without an encoding argument, Python uses the platform's default encoding, which most likely doesn't match your files. Try something like:
with open(os.path.join(path, item), "r", encoding='utf8') as f:
Or, if you can, find out which encoding was actually used for the file.
Also try checking the answers to this post; one of them might help you.
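If you can't determine the encoding, a forgiving fallback is to replace undecodable bytes instead of crashing (a sketch reusing the question's path and item; 'utf-8' is an assumption here, and errors='replace' turns any invalid bytes into U+FFFD rather than raising):

with open(os.path.join(path, item), "r", encoding="utf-8", errors="replace") as f:
    lyrics = "".join(f.readlines())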
When I run my Python code and print(item), I get the following error:
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 61-61: Non-BMP character not supported in Tk
Here is my code:
def getUserFollowers(self, usernameId, maxid=''):
    if maxid == '':
        return self.SendRequest('friendships/' + str(usernameId) + '/followers/?rank_token=' + self.rank_token, l=2)
    else:
        return self.SendRequest('friendships/' + str(usernameId) + '/followers/?rank_token=' + self.rank_token + '&max_id=' + str(maxid))

def getTotalFollowers(self, usernameId):
    followers = []
    next_max_id = ''
    while True:
        self.getUserFollowers(usernameId, next_max_id)
        temp = self.LastJson
        for item in temp["users"]:
            print(item)
            followers.append(item)
        if temp["big_list"] == False:
            return followers
        next_max_id = temp["next_max_id"]
How can I fix this?
Hard to guess without knowing the content of temp["users"], but the error indicates that it contains non-BMP Unicode characters, for example emoji.
If you try to display those in IDLE, you immediately get that kind of error. A simple example to reproduce (in IDLE on Python 3.5):
>>> t = "ab \U0001F600 cd"
>>> print(t)
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
print(t)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 3-3: Non-BMP character not supported in Tk
(\U0001F600 is the escape for the Unicode character U+1F600 GRINNING FACE.)
The error is indeed caused by Tk not supporting Unicode characters with code points above U+FFFF. A simple workaround is to filter them out of your string:
def BMP(s):
    return "".join(i if ord(i) < 0x10000 else '\ufffd' for i in s)
'\ufffd' is the Python representation for the unicode U+FFFD REPLACEMENT CHARACTER.
My example becomes:
>>> t = "ab \U0001F600 cd"
>>> print(BMP(t))
ab � cd
So your code would become:
for item in temp["users"]:
    print(BMP(str(item)))
    followers.append(item)
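Equivalently (an alternative sketch, not from the original answer), the non-BMP characters can be stripped with a regular expression in one pass:

import re

# Hypothetical variant: replace every character above U+FFFF with U+FFFD.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def BMP(s):
    return NON_BMP.sub('\ufffd', s)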
I got the encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)
in the following Python (PySpark) code, where row is a DataFrame row:
def rowToLine(row):
    line = str(row[0]).strip()
    columnNum = 44
    for k in xrange(1, columnNum):
        line = line + "\t"
        line = line + str(row[k]).strip()  # encoding error here
    return line
I also tried the join below:
def rowToLine(row):
    s = "\t"
    return s.join(row)
but some values in the row are ints, so I got this error:
TypeError: sequence item 19: expected string or Unicode, int found
Does anyone know how to fix this? Thanks!
Thanks for everyone's suggestions!
I basically took Padraic Cunningham's idea and made a small modification to handle the int case. The code below works.
def rowToLine(row):
    s = "\t"
    return s.join(x.encode("utf-8") if isinstance(x, basestring) else str(x).encode("utf-8") for x in row)
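A quick sanity check of that function (illustrative values only, mixing unicode text, an int, and a bool):

row = (u'r\xe9sum\xe9', 42, True)
line = rowToLine(row)
# line == 'r\xc3\xa9sum\xc3\xa9\t42\tTrue', a tab-separated UTF-8 byte string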
import csv

def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv', 'wb') as csvfile:
        csvwriter = csv.DictWriter(csvfile, fieldnames=ordered_fieldnames, extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow({k: str(x[k]).encode('utf-8') for k in x})
            #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]

if __name__ == '__main__':
    main()
Basically the issue is that the tweets contain a bunch of Portuguese characters. I tried to correct for this by encoding everything into Unicode values before putting them in the dictionary that was to be added to the row. However, this doesn't work. Any other ideas for formatting these values so that csv.reader and csv.DictReader can read them?
str(x[k]).encode('utf-8') is the problem.
str(x[k]) will convert a Unicode string to a byte string using the default ascii codec in Python 2:
>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
Non-Unicode values, like booleans, will be converted to byte strings, but then Python will implicitly decode the byte string to a Unicode string before calling .encode(), because you can only encode Unicode strings. This usually won't cause an error because most non-Unicode objects have an ASCII representation. Here's an example where a custom object returns a non-ASCII str() representation:
>>> class Test(object):
...     def __str__(self):
...         return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Note the above was a decode error instead of an encode error.
If str() is only there to coerce booleans to a string, coerce it to a Unicode string instead:
unicode(x[k]).encode('utf-8')
Non-Unicode values will be converted to Unicode strings, which can then be correctly encoded, but Unicode strings will remain unchanged, so they will also be encoded correctly.
>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'
P.S. Python 3 does not do implicit encode/decode between byte and Unicode strings and makes these errors easier to spot.
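For illustration (a Python 3 session, not from the original answer): str() of bytes never implicitly decodes, so the mistake is obvious at a glance:
>>> x = b'r\xc3\xa9sum\xc3\xa9'
>>> str(x)            # no implicit decode: you get the repr, an obvious bug
"b'r\\xc3\\xa9sum\\xc3\\xa9'"
>>> x.decode('utf8')  # the explicit, correct spelling
'résumé'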
I'm writing data fetched from a jobs API to a Google spreadsheet. The 'latin-1' encoding below works up to page 93, but on page 94 it raises an exception. I've tried the other techniques shown commented out below, but they die by page 65, so 'latin-1' got the furthest. Could you please tell me how to modify the uncommented call (i.e. .encode('latin-1')) so that all 199 pages are safely written to the spreadsheet? Any guidance is appreciated in advance.
My code is given below:
def append_data(self, worksheet, row, start_row, start_col, end_col):
    r = start_row  # last_empty_row(worksheet)
    j = 0
    i = start_col
    while i <= end_col:
        try:
            worksheet.update_cell(r, i, unicode(row[j]).encode('latin-1', 'ignore'))
            #worksheet.update_cell(r,i,unicode(row[j]).decode('latin-1').encode("utf-16"))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('iso-8859-1'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1').decode("utf-8"))
            #worksheet.update_cell(r,i,unicode(row[j]).decode('utf-8'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1', 'replace'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode(sys.stdout.encoding, 'replace'))
            #worksheet.update_cell(r,i,row[j].encode('utf8'))
            #worksheet.update_cell(r,i,filter(self.onlyascii(str(row[j]))))
        except Exception as e:
            self.ehandling_obj.error_handler(self.ehandling_obj.SPREADSHEET_ERROR, [1])
            try:
                worksheet.update_cell(r, i, 'N/A')
            except Exception as ee:
                y = 23  # dummy no-op; the fallback write failed too
        j = j + 1
        i = i + 1
You are calling unicode() on a byte string value, which means Python will have to decode to Unicode first:
>>> unicode('\xea')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
It is this decoding that fails, not the encoding from Unicode back to byte strings.
You either already have Latin-1 input data, or you should decode using the appropriate codec:
unicode(row[j], 'utf8').encode('latin1')
or using str.decode():
row[j].decode('utf8').encode('latin1')
I picked UTF-8 as an example here; you didn't provide any detail about the input data or its possible encodings, so you need to pick the right codec yourself.
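If some of your row values aren't strings at all, a defensive helper could look like this (a hypothetical sketch, Python 2, under the same assumption that byte-string input is UTF-8):

def to_latin1(value):
    # Hypothetical helper: normalize any value to a Latin-1 byte string.
    # Assumes byte-string input is UTF-8; adjust to your real source encoding.
    if isinstance(value, str):
        value = value.decode('utf8')
    elif not isinstance(value, unicode):
        value = unicode(value)  # ints, bools, None, ...
    return value.encode('latin-1', 'ignore')

worksheet.update_cell(r, i, to_latin1(row[j]))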