This question already has answers here:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
(12 answers)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7240: character maps to <undefined>
(3 answers)
Closed 2 years ago.
I want to read several .text documents but got some error on the line
lyrics = "".join(f.readlines())
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1148: character maps to <undefined>
How can I fix it. It would be helpful if anyone fixes it.
My code function is:
def read_lyrics():
reg1 = re.compile("\.txt$")
reg2 = re.compile("([0-9]+)\.txt")
reg3 = re.compile(".*_([0-9])\.txt")
reg4 = re.compile("\[.+\]")
reg5 = re.compile("info\.txt")
lyrics_dictionary = {}
#iter all directory and load all song(txt file)
for i in os.listdir():
if os.path.isdir(i):
for path,sub,items in os.walk(i):
if any([reg1.findall(item) for item in items]):
for item in items:
if reg5.findall(item):
continue
if reg3.findall(item):
num = ["0"+reg3.findall(item)[0]]
name = "_".join(path.split("/") + num)
else:
name = "_".join(path.split("/") + reg2.findall(item))
print("The path is: ", path)
print("The item is: ", item)
with open(os.path.join(path,item),"r") as f:
print("The file path is: ", f)
lyrics = "".join(f.readlines())
lyrics = reg4.subn("",lyrics)[0]
lyrics_dictionary[name] = lyrics
return lyrics_dictionary
When you use open(), you also use a default encoding. It most likely didn't fit you. Try using something like -
with open(os.path.join(path,item),"r",encoding='utf8')
Or, if you can, check what is the enryption which was used on this file.
Try to check the answers this post, one of them might help you.
Related
I want to print '\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad' which is a Chinese character.
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('utf-8')]
print(a[0])
But it raises this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte. I also tried deocde('latin-1'). But the result aren't Chinese characters.
Try with:
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('gb2312').encode('utf-8')]
print(a[0])
output:
中国黑龙江
Update: as Mark's advice, use l[0].decode('gb2312') will be sufficient.
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('gb2312')]
print(a[0])
This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
(34 answers)
Closed 6 years ago.
I got the encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)
at the following python (pyspark) code, where row is the data frame row:
def rowToLine(row):
line = str(row[0]).strip()
columnNum = 44
for k in xrange(1, columnNum):
line = line + "\t"
line = line + str(row[k]).strip() # encoding error here
return line
I also tried the join below:
def rowToLine(row):
s = "\t"
return s.join(row)
but some values of the row is int, so I got errors:
TypeError: sequence item 19: expected string or Unicode, int found
Does anyone know how to fix this? Thanks!
Thanks for everyone's suggestions!
I basically took Padraic Cunningham's idea and made some modification to handle the int case. The code below works.
def rowToLine(row):
s = "\t"
return s.join( x.encode("utf-8") if isinstance(x, basestring) else str(x).encode("utf-8") for x in row)
What's the best way to convert to str from unicode? When I run the code, I receive the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 13: ordinal not in range(128).
My code:
import json
import urllib2
def locu_data(city,api_key):
local_business_data='https://api.locu.com/v1_0/venue/search/?locality='+city+'&api_key='+api_key
open_local_business_data=urllib2.urlopen(local_business_data)
data_load=json.load(open_local_business_data)
Business = [x for x in data_load['objects']]
Event=[]
for item in Business:
if (item != None):
Event.append(('Name:{},Categories:{},City:{},Longitutde:{},Latitude:{},Website:{}\n').format(item['name'],item['categories'],item['locality'],item['long'],item['lat'],item['website_url']))
else:
Event.append('None')
print Event
locu_data(city=raw_input("Please enter the city you would like to analyze:"),api_key=raw_input("Please enter your locu API key."))
Just use string.decode() where string is the variable you are manipulating.
I'm writing data, fetched from jobs API, to the Google spreadsheet. Following encoding for 'latin-1' encodes till page# 93 but when reaches 94, it goes in exception. I've used different following techniques, but 'latin-1' did max pagination. Else have been commented(as they die on page #65). Could you please tell me how to modify non-commented(i-e .encode('latin-1')) to get 199 pages safely written on spreadsheet?
Code is given as below:
Any guideline in this regard is appreciated in advance.
def append_data(self,worksheet,row,start_row, start_col,end_col):
r = start_row #last_empty_row(worksheet)
j = 0
i = start_col
while (i <= end_col):
try:
worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1','ignore'))
#worksheet.update_cell(r,i,unicode(row[j]).decode('latin-1').encode("utf-
16"))
#worksheet.update_cell(r,i,unicode(row[j]).encode('iso-8859-1'))
#worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1').decode("utf-
8"))
#worksheet.update_cell(r,i,unicode(row[j]).decode('utf-8'))
#worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1', 'replace'))
#worksheet.update_cell(r,i,unicode(row[j]).encode(sys.stdout.encoding,
'replace'))
#worksheet.update_cell(r,i,row[j].encode('utf8'))
#worksheet.update_cell(r,i,filter(self.onlyascii(str(row[j]))))
except Exception as e:
self.ehandling_obj.error_handler(self.ehandling_obj.SPREADSHEET_ERROR,[1])
try:
worksheet.update_cell(r,i,'N/A')
except Exception as ee:
y = 23
j = j + 1
i = i + 1
You are calling unicode() on a byte string value, which means Python will have to decode to Unicode first:
>>> unicode('\xea')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
It is this decoding that fails, not the encoding from Unicode back to byte strings.
You either already have Latin-1 input data, or you should decode using the appropriate codec:
unicode(row[j], 'utf8').encode('latin1')
or using str.decode():
row[j].decode('utf8').encode('latin1')
I picked UTF-8 as an example here, you didn't provide any detail about the input data or its possible encodings. You need to pick the right codec yourself here.
This question already has answers here:
Best way to convert string to bytes in Python 3?
(5 answers)
Closed 11 days ago.
I have the following function to parse a utf-8 string from a sequence of bytes
Note -- 'length_size' is the number of bytes it take to represent the length of the utf-8 string
def parse_utf8(self, bytes, length_size):
length = bytes2int(bytes[0:length_size])
value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
return value
def bytes2int(raw_bytes, signed=False):
"""
Convert a string of bytes to an integer (assumes little-endian byte order)
"""
if len(raw_bytes) == 0:
return None
fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
if signed:
fmt = fmt.lower()
return struct.unpack('<'+fmt, raw_bytes)[0]
I'd like to write the function in reverse -- i.e. a function that will take a utf-8 encoded string and return it's representation as a byte string.
So far, I have the following:
def create_utf8(self, utf8_string):
return utf8_string.encode('utf-8')
I run into the following error when attempting to test it:
File "writer.py", line 229, in create_utf8
return utf8_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)
If possible, I'd like to adopt a structure for the code similar to the parse_utf8 example. What am I doing wrong?
Thank you for your help!
UPDATE: test driver, now correct
def random_utf8_seq(self, length):
# from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
utf8_seq = u""
for i in range(length):
utf8_seq += random.choice(test_charset)
return utf8_seq
I get the following error:
input_str = self.random_utf8_seq(200)
File "writer.py", line 226, in random_utf8_seq
print unicode(utf8_seq, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
If utf-8 => bytestring conversion is what do you want then you may use str.encode, but first you need to properly mark the type of source string in your example - prefix with u for unicode:
# coding: utf-8
import random
def random_utf8_seq(length):
# from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
utf8_seq = u''
for i in range(length):
utf8_seq += random.choice(test_charset)
print utf8_seq.encode('utf-8')
return utf8_seq.encode('utf-8')
print( type(random_utf8_seq(200)) )
-- output --
õ3×sÔP{Ć.s(Ë°˙ě÷xÓ#bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä?
óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇnל¦Ì˝+•ì›
ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył"MťÆ‹ČČ4þ!»šåŮ#Öhň-
ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>