UnicodeDecodeError: 'utf8' codec can't decode byte 0x8e in position 1 - python

This is my first post here, so excuse me if I miss anything.
I have some data in a CSV file and am trying to import it into production, but I get a UnicodeDecodeError. The CSV file contains some French words.
Code:
import csv
open_csv = csv.DictReader(open('filename', 'rb'))
for i in open_csv:
    x = find(where={})  # MongoDB query
    x.something = i.get(row_header)
    x.save()
I am getting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x8e in position 1" while saving the data.

I would suggest trying the following code:
import codecs
import csv
open_csv = csv.DictReader(codecs.open('filename', 'rb'))
for i in open_csv:
    x = find(where={})
    x.something = i.get(row_header)
    x.save()
I work in Python 3.x, but this should work in 2.x too if that is what you are using.
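Note that codecs.open only decodes when it is given an encoding argument, so the file's real encoding still has to be named somewhere. A minimal Python 3 sketch of decoding the CSV bytes explicitly before parsing is shown here. The helper name read_rows, the sample bytes, and the choice of Mac OS Roman are assumptions for illustration only: byte 0x8e is "é" in that codec, which would fit French text, but the actual encoding of the asker's file is not known.

```python
import csv
import io


def read_rows(raw_bytes, encoding):
    """Decode raw CSV bytes with an explicit codec and return the rows as dicts."""
    text = raw_bytes.decode(encoding)
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical sample: "début" with "é" stored as 0x8e (Mac OS Roman), at position 1 of the value.
sample_value = b"d\x8ebut"
sample_csv = b"word,lang\n" + sample_value + b",fr\n"

rows = read_rows(sample_csv, "mac_roman")
print(rows[0]["word"])
```

For a real file in Python 3, the same idea would be open('filename', newline='', encoding=...) with whichever codec the file was actually saved in.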

Related

unicode decode error while importing Medical Data on pandas

I tried importing some medical data and ran into this Unicode error. Here is my code:
import glob
import os
import pandas as pd

output_path = r"C:/Users/muham/Desktop/AI projects/cancer doc classification"
my_file = glob.glob(os.path.join(output_path, '*.csv'))
for files in my_file:
    data = pd.read_csv(files)
    print(data)
My error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3314: invalid start byte
Try another encoding; the default is utf-8. For example:
import pandas
pandas.read_csv(path, encoding="cp1252")
or latin1, etc. (ascii will not help here, since it cannot decode any byte above 0x7f.)
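One way to try several encodings in order is sketched below, using only the standard library. The helper name first_working_encoding, the candidate list, and the sample bytes are assumptions for illustration. Keep in mind that latin-1 accepts every byte, so a successful decode only shows that the codec did not fail, not that it is the right one.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first codec in `candidates` that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample: 0x93 and 0x94 are curly double quotes in cp1252.
sample = b"He said \x93hello\x94"
encoding = first_working_encoding(sample)
print(encoding)
print(sample.decode(encoding))
```

The name returned could then be passed on, for example as the encoding argument of pandas.read_csv.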

Encoding 'UTF-8' throws an exception in macOS

I'm trying to read a GloVe file: glove.twitter.27B.200d.txt. I have the following function to read the file:
def glove_reader(glove_file):
    glove_dict = {}
    with open(glove_file, 'rt', encoding='utf-8') as glove_reader:
        for line in glove_reader:
            tokens = line.rstrip().split()
            vect = [float(token) for token in tokens[1:]]
            glove_dict[tokens[0]] = vect
    return glove_dict
The problem is that I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 0: invalid continuation byte
I tried latin-1, but that didn't work either. It throws the following error:
ValueError: could not convert string to float: 'Ù\x86'
I also tried changing 'rt' to 'r' and 'rb'. I think it is a macOS problem, because on Windows it didn't throw this error. Can someone please help me understand why I can't read this file?

Tabula-py windows- UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position N: invalid start byte

I'm trying to read a PDF file like this:
di = read_pdf('test.pdf', output_format='json', encoding='utf-8', guess=False)
It's working fine on Linux, but on Windows I get the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position N: invalid start byte". Please help me understand the issue.

Can't find the directory no idea why

import requests
test = requests.get("https://www.hipstercode.com/")
outfile = open("./settings.txt", "w")
test.encoding = 'ISO-8859-1'
outfile.write(str(test.text))
The error that I'm getting is:
  File "C:/Users/Bamba/PycharmProjects/Requests/Requests/Requests.py", line 8, in <module>
    outfile.write(str(test.text))
  File "C:\Users\Bamba\AppData\Local\Programs\Python\Python35\lib\encodings\cp1255.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xef' in position 0: character maps to <undefined>
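So, it looks like the response contains something you can't encode in cp1255, the codec named in the traceback.
If choosing the output encoding yourself is OK for you, open the file in binary mode and encode the text explicitly: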
import requests
test = requests.get("https://www.hipstercode.com/")
outfile = open("./settings.txt", "wb")
outfile.write(test.text.encode('ISO-8859-1'))
If you get an error while encoding, the text simply cannot be encoded losslessly in that codec. The options you have are described in the str.encode docs: https://docs.python.org/3/library/stdtypes.html#str.encode
For example, you can use
outfile.write(test.text.encode('ISO-8859-1', 'replace'))
to handle the errors without losing most of the sense of a text that contains something which doesn't fit ISO-8859-1.
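To make the difference between the error handlers concrete, here is a small sketch. The sample string and the helper name encode_with are made up for illustration: "ï" fits ISO-8859-1, while the Hebrew letter does not.

```python
# Hypothetical sample: "naïve" followed by a Hebrew letter (U+05E9).
text = "na\u00efve \u05e9"


def encode_with(handler):
    """Encode the sample as ISO-8859-1 using the given error handler."""
    return text.encode("iso-8859-1", handler)


for handler in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace"):
    print(handler, encode_with(handler))

# The default handler ("strict") raises instead of substituting anything.
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict failed:", strict_failed)
```

"replace" and "ignore" lose the character for good, while "xmlcharrefreplace" and "backslashreplace" keep enough information to recover it later.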

urllib2 opener providing wrong charset

When I open the URL and read it, I can't recognize the content. But when I check the content header, it says it is encoded as utf-8. So I tried to convert it to unicode using unicode(), and it complained: UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128).
.encode("utf-8") produces
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
.decode("utf-8") produced
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.
I have tried everything I can come up with (I'm not that good at encodings).
I would be happy if I could get this to work. Thanks.
This is a common mistake. The server sends a gzipped stream.
You should unpack it first:
import gzip
import StringIO  # Python 2 module

response = opener.open(self.__url, data)
if response.info().get('Content-Encoding') == 'gzip':
    # Wrap the compressed bytes in a file-like object for GzipFile
    buf = StringIO.StringIO(response.read())
    gzip_f = gzip.GzipFile(fileobj=buf)
    content = gzip_f.read()
else:
    content = response.read()
The header is probably wrong. Check out chardet.
EDIT: Thinking more about it -- my money is on the contents being gzipped. I believe some of Python's various URL-opening modules/classes/etc will ungzip, while others won't.
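The byte in the error message supports the gzip guess: every gzip stream starts with the two bytes 0x1f 0x8b, so 0x8b in position 1 is exactly what a compressed body looks like. A minimal Python 3 sketch that checks for those bytes before decoding is shown here. The helper name maybe_gunzip and the sample text are assumptions for illustration, not part of urllib2.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body):
    """Return `body` decompressed if it starts with the gzip magic number."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Hypothetical response body: UTF-8 text that the server compressed.
original = "h\u00e9llo".encode("utf-8")
compressed = gzip.compress(original)

print(hex(compressed[1]))
print(maybe_gunzip(compressed).decode("utf-8"))
```

Checking the Content-Encoding header, as in the answer above, is the more conventional test; sniffing the magic bytes is just a fallback for when the header is missing.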
