#'charmap' codec can't decode byte 0x8d in position 1148 [duplicate] - python

This question already has answers here:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
(12 answers)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7240: character maps to <undefined>
(3 answers)
Closed 2 years ago.
I want to read several .txt documents, but I get an error on this line:
lyrics = "".join(f.readlines())
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1148: character maps to <undefined>
How can I fix it? Any help would be appreciated.
My function is:
def read_lyrics():
    reg1 = re.compile("\.txt$")
    reg2 = re.compile("([0-9]+)\.txt")
    reg3 = re.compile(".*_([0-9])\.txt")
    reg4 = re.compile("\[.+\]")
    reg5 = re.compile("info\.txt")
    lyrics_dictionary = {}
    #iter all directory and load all song(txt file)
    for i in os.listdir():
        if os.path.isdir(i):
            for path,sub,items in os.walk(i):
                if any([reg1.findall(item) for item in items]):
                    for item in items:
                        if reg5.findall(item):
                            continue
                        if reg3.findall(item):
                            num = ["0"+reg3.findall(item)[0]]
                            name = "_".join(path.split("/") + num)
                        else:
                            name = "_".join(path.split("/") + reg2.findall(item))
                        print("The path is: ", path)
                        print("The item is: ", item)
                        with open(os.path.join(path,item),"r") as f:
                            print("The file path is: ", f)
                            lyrics = "".join(f.readlines())
                            lyrics = reg4.subn("",lyrics)[0]
                            lyrics_dictionary[name] = lyrics
    return lyrics_dictionary

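When you call open() without an encoding argument, Python uses a platform-dependent default encoding. The 'charmap' in the error message suggests a Windows code page such as cp1252, which most likely does not match your files. Try passing the encoding explicitly, for example:
with open(os.path.join(path,item),"r",encoding='utf8') as f:
Or, if you can, find out which encoding was used for this file.
The answers to the questions this one was marked as a duplicate of may also help.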
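If the encoding of the files is not known in advance, one possible approach is to try a few candidates in order. The sketch below is only an illustration: the helper name read_text_with_fallback and the candidate list are assumptions, not part of the original code. Note that latin-1 accepts every byte, so it never fails but may produce the wrong characters.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used) for the first encoding that decodes cleanly.

    Hypothetical helper: the candidate list is an assumption and should be
    adjusted to whatever encodings the files are likely to use.
    """
    raw = Path(path).read_bytes()  # read the bytes once, decode in memory
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError(f"none of {encodings} could decode {path}")
```

In read_lyrics(), the text returned by this helper could replace the "".join(f.readlines()) call.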

Related

Python: How to print this special string?

I want to print '\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad', which is a byte string that encodes Chinese characters.
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('utf-8')]
print(a[0])
But it raises this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte. I also tried decode('latin-1'), but the result isn't Chinese characters.
Try with:
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('gb2312').encode('utf-8')]
print(a[0])
output:
中国黑龙江
Update: following Mark's advice, l[0].decode('gb2312') alone is sufficient.
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('gb2312')]
print(a[0])
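The snippets above are Python 2, where a plain '...' literal is a byte string. A minimal Python 3 sketch of the same idea, assuming the data arrives as a bytes object (the variable names are illustrative), might look like this:

```python
# Python 3: bytes and str are distinct types, so the literal needs a b prefix.
raw = b"\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad"

# Decoding with the codec the bytes were written in yields the Chinese text.
text = raw.decode("gb2312")
print(text)

# Decoding with the wrong codec either fails outright ...
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# ... or silently produces unrelated characters (one per byte for latin-1).
mojibake = raw.decode("latin-1")
print(len(mojibake))
```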

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128) [duplicate]

This question already has answers here:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
(34 answers)
Closed 6 years ago.
I got the encoding error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 3: ordinal not in range(128)
in the following Python (PySpark) code, where row is a DataFrame row:
def rowToLine(row):
    line = str(row[0]).strip()
    columnNum = 44
    for k in xrange(1, columnNum):
        line = line + "\t"
        line = line + str(row[k]).strip() # encoding error here
    return line
I also tried the join below:
def rowToLine(row):
    s = "\t"
    return s.join(row)
but some values in the row are ints, so I got this error:
TypeError: sequence item 19: expected string or Unicode, int found
Does anyone know how to fix this? Thanks!
Thanks for everyone's suggestions!
I basically took Padraic Cunningham's idea and made some modifications to handle the int case. The code below works.
def rowToLine(row):
    s = "\t"
    return s.join( x.encode("utf-8") if isinstance(x, basestring) else str(x).encode("utf-8") for x in row)
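The code above targets Python 2 (basestring, xrange). As a hedged sketch for Python 3, where str is already Unicode, the same join can be written without per-value encoding. The name row_to_line and the sample row are made up for illustration.

```python
def row_to_line(row, sep="\t"):
    """Join all values of a row into one line, converting non-strings with str()."""
    return sep.join(str(value).strip() for value in row)


# Sample row mixing text (including U+FFFD), an int, a float and None.
line = row_to_line(["a\ufffd", 19, 2.5, None])

# Encode once, at the point where bytes are actually needed.
encoded = line.encode("utf-8")
```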

Python: Unicode and Str

What's the best way to convert to str from unicode? When I run the code, I receive the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 13: ordinal not in range(128).
My code:
import json
import urllib2

def locu_data(city,api_key):
    local_business_data='https://api.locu.com/v1_0/venue/search/?locality='+city+'&api_key='+api_key
    open_local_business_data=urllib2.urlopen(local_business_data)
    data_load=json.load(open_local_business_data)
    Business = [x for x in data_load['objects']]
    Event=[]
    for item in Business:
        if (item != None):
            Event.append(('Name:{},Categories:{},City:{},Longitutde:{},Latitude:{},Website:{}\n').format(item['name'],item['categories'],item['locality'],item['long'],item['lat'],item['website_url']))
        else:
            Event.append('None')
    print Event

locu_data(city=raw_input("Please enter the city you would like to analyze:"),api_key=raw_input("Please enter your locu API key."))
Just use string.decode() where string is the variable you are manipulating.
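As a hedged aside: in Python 2, the likely trigger for this error is that a byte-string template ('...'.format(...)) receives unicode values from the JSON response and implicitly encodes them as ASCII. In Python 3 all text is Unicode and the formatting step cannot fail this way. The sketch below assumes Python 3; the function name format_venue and the sample venue are invented, and only a few of the original fields are shown.

```python
def format_venue(item):
    """Format a venue dict as one text line; no encoding happens here."""
    template = "Name:{name},City:{locality},Website:{website_url}\n"
    return template.format(**item)


# Invented sample data standing in for one entry of data_load['objects'].
venue = {"name": "Caf\u00e9 Nord", "locality": "Montr\u00e9al", "website_url": None}
line = format_venue(venue)

# Convert to bytes explicitly, and only where bytes are required (file, socket).
data = line.encode("utf-8")
```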

UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 8: ordinal not in range(128)

I'm writing data fetched from a jobs API to a Google spreadsheet. Encoding with 'latin-1' works up to page 93, but when it reaches page 94 it raises an exception. I have tried the other techniques shown below, but 'latin-1' got the furthest; the others are commented out (they fail on page 65). Could you please tell me how to modify the uncommented call (i.e. .encode('latin-1')) so that all 199 pages are written to the spreadsheet safely?
The code is given below. Any guidance is appreciated.
def append_data(self,worksheet,row,start_row, start_col,end_col):
    r = start_row #last_empty_row(worksheet)
    j = 0
    i = start_col
    while (i <= end_col):
        try:
            worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1','ignore'))
            #worksheet.update_cell(r,i,unicode(row[j]).decode('latin-1').encode("utf-16"))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('iso-8859-1'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1').decode("utf-8"))
            #worksheet.update_cell(r,i,unicode(row[j]).decode('utf-8'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode('latin-1', 'replace'))
            #worksheet.update_cell(r,i,unicode(row[j]).encode(sys.stdout.encoding, 'replace'))
            #worksheet.update_cell(r,i,row[j].encode('utf8'))
            #worksheet.update_cell(r,i,filter(self.onlyascii(str(row[j]))))
        except Exception as e:
            self.ehandling_obj.error_handler(self.ehandling_obj.SPREADSHEET_ERROR,[1])
            try:
                worksheet.update_cell(r,i,'N/A')
            except Exception as ee:
                y = 23
        j = j + 1
        i = i + 1
You are calling unicode() on a byte string value, which means Python will have to decode to Unicode first:
>>> unicode('\xea')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
It is this decoding that fails, not the encoding from Unicode back to byte strings.
You either already have Latin-1 input data, or you should decode using the appropriate codec:
unicode(row[j], 'utf8').encode('latin1')
or using str.decode():
row[j].decode('utf8').encode('latin1')
I picked UTF-8 as an example here; you didn't provide any detail about the input data or its possible encodings, so you need to pick the right codec yourself.
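The same decode-then-encode order carries over to Python 3, where the two steps are explicit because bytes and str are separate types. The following is a minimal sketch under the same assumption as above (UTF-8 input, chosen only as an example); the helper name transcode and its defaults are hypothetical.

```python
def transcode(raw, source_encoding, target_encoding="latin-1", errors="replace"):
    """Decode bytes with the codec they were written in, then re-encode.

    errors applies to the encode step only: characters the target codec
    cannot represent are substituted instead of raising.
    """
    text = raw.decode(source_encoding)          # bytes -> str
    return text.encode(target_encoding, errors)  # str -> bytes


utf8_bytes = "\u00ea".encode("utf-8")        # the two bytes b'\xc3\xaa'
latin1_bytes = transcode(utf8_bytes, "utf-8")
```

With errors="replace", a character such as the euro sign, which Latin-1 cannot represent, comes out as a question mark rather than stopping the whole run.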

Python: Convert utf-8 string to byte string [duplicate]

This question already has answers here:
Best way to convert string to bytes in Python 3?
(5 answers)
Closed 11 days ago.
I have the following function to parse a utf-8 string from a sequence of bytes.
Note -- 'length_size' is the number of bytes it takes to represent the length of the utf-8 string.
def parse_utf8(self, bytes, length_size):
    length = bytes2int(bytes[0:length_size])
    value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
    return value

def bytes2int(raw_bytes, signed=False):
    """
    Convert a string of bytes to an integer (assumes little-endian byte order)
    """
    if len(raw_bytes) == 0:
        return None
    fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
    if signed:
        fmt = fmt.lower()
    return struct.unpack('<'+fmt, raw_bytes)[0]
I'd like to write the function in reverse, i.e. a function that will take a utf-8 encoded string and return its representation as a byte string.
So far, I have the following:
def create_utf8(self, utf8_string):
    return utf8_string.encode('utf-8')
I run into the following error when attempting to test it:
File "writer.py", line 229, in create_utf8
return utf8_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)
If possible, I'd like to adopt a structure for the code similar to the parse_utf8 example. What am I doing wrong?
Thank you for your help!
UPDATE: test driver, now correct
def random_utf8_seq(self, length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
    utf8_seq = u""
    for i in range(length):
        utf8_seq += random.choice(test_charset)
    return utf8_seq
I get the following error:
input_str = self.random_utf8_seq(200)
File "writer.py", line 226, in random_utf8_seq
print unicode(utf8_seq, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
If converting a Unicode string to a UTF-8 byte string is what you want, you can use encode(), but first you need to mark the source string in your example as Unicode by prefixing it with u:
# coding: utf-8
import random

def random_utf8_seq(length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂ㥹ĆćČčĎďĐđĘęĚěĹ弾ŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
    utf8_seq = u''
    for i in range(length):
        utf8_seq += random.choice(test_charset)
    print utf8_seq.encode('utf-8')
    return utf8_seq.encode('utf-8')

print( type(random_utf8_seq(200)) )
-- output --
­
õ3×sÔP{Ć.s(Ë°˙ě÷xÓ#bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä?
óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇnל¦Ì˝+•ì›
ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył­"MťÆ‹ČČ4þ!»šåŮ#Öhň-
ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>
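For a writer that mirrors the structure of parse_utf8 (a little-endian length prefix followed by the UTF-8 payload), a Python 3 sketch could look like the following. It is only an illustration: the functions are written as plain module-level functions rather than methods, the prefix is assumed to count bytes rather than characters, and the _FMT table simply reuses the format codes from bytes2int.

```python
import struct

# Same size-to-format mapping as bytes2int in the question (unsigned only).
_FMT = {1: "B", 2: "H", 4: "I", 8: "Q"}


def create_utf8(text, length_size):
    """Encode text as UTF-8 and prepend its byte length, little-endian."""
    payload = text.encode("utf-8")
    header = struct.pack("<" + _FMT[length_size], len(payload))
    return header + payload


def parse_utf8(data, length_size):
    """Inverse of create_utf8: read the length prefix, then decode the payload."""
    (length,) = struct.unpack("<" + _FMT[length_size], data[:length_size])
    return data[length_size:length_size + length].decode("utf-8")


framed = create_utf8("\u00e9", 2)   # 2-byte prefix, then the 2 UTF-8 bytes of é
round_trip = parse_utf8(framed, 2)
```

struct.pack raises struct.error if the payload is too long for the chosen prefix size, so an oversized string fails loudly instead of being truncated.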
