I tried importing some medical data and ran into this Unicode error. Here is my code:
import glob
import os
import pandas as pd

output_path = r"C:/Users/muham/Desktop/AI projects/cancer doc classification"
my_file = glob.glob(os.path.join(output_path, '*.csv'))
for files in my_file:
    data = pd.read_csv(files)
    print(data)
My error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 3314: invalid start byte
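Try another encoding; the default is utf-8. Byte 0x93 is a left curly double quote in cp1252, so that is a good first guess:
import pandas
pandas.read_csv(path, encoding="cp1252")
If that does not fit, try latin1, which accepts any byte. Plain ascii will not help, because it cannot represent 0x93 either.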
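Trying encodings one at a time can be automated. The following is a minimal sketch that works on raw bytes; `decode_with_fallback` and the candidate list are illustrative choices, not part of pandas. Because latin-1 accepts every byte, it goes last as a catch-all.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0x93 and 0x94 are the curly double quotes in cp1252 but invalid in UTF-8.
sample = b"diagnosis,\x93stage II\x94\n"
text, used = decode_with_fallback(sample)
print(used, repr(text))
```

The encoding name that succeeds can then be passed as `encoding=` to `read_csv`. A successful decode only means the bytes were accepted, so it is still worth checking that the text looks right.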
I have a CSV file that seems to be encoded in UTF-8 based on the BOM present at the start of the file. However, when I try to open it, I get an error:
import io
import chardet
from zipfile import ZipFile
from pandas import read_csv

filename = './sample.zip'
objs = []
frames = []
with ZipFile(filename) as zf:
    zipinfo_objs = [zi for zi in zf.infolist()
                    if zi.filename.endswith(".csv")]
    for zipinfo_obj in zipinfo_objs:
        obj = zf.read(zipinfo_obj.filename)
        objs.append(obj)
print("Bytes Objects:", [type(obj) for obj in objs])
print("Encoding:", chardet.detect(objs[0]))
print("BOM:", objs[0][:4])
buffer = io.BytesIO(objs[0])
frame = read_csv(buffer)
frames.append(frame)
yields
Bytes Objects: [<class 'bytes'>]
Encoding: {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
BOM: b'\xef\xbb\xbf"'
...
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 12: invalid continuation byte
However, if I specify the encoding as latin-1 when decoding the buffer, like:
frame = read_csv(buffer, encoding="latin-1")
it succeeds and pandas is able to read in the dataframe.
This file was generated from Adobe Analytics, and apparently there was no option to specify the format of the export beyond choosing between a CSV and a Tableau file.
My questions are:
Is it a typical occurrence for CSV files to be encoded in latin-1 and include the UTF-8-SIG BOM at the beginning of the file?
Should I be checking for encoding differently / extracting data differently?
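One possible way to check more defensively is to trust the BOM only when the whole payload is valid UTF-8. This is a sketch under that assumption; `sniff_encoding` and `decode_export` are hypothetical helper names, and the sample bytes are made up to reproduce the situation described (a UTF-8 BOM followed by latin-1 data).

```python
import codecs


def sniff_encoding(raw):
    """Guess an encoding: trust the BOM only if the bytes are valid UTF-8."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # BOM or not, the payload is not UTF-8; latin-1 maps every byte.
        return "latin-1", has_bom
    return ("utf-8-sig" if has_bom else "utf-8"), has_bom


def decode_export(raw):
    """Decode raw bytes, dropping a misleading BOM before a latin-1 decode."""
    encoding, has_bom = sniff_encoding(raw)
    if has_bom and encoding == "latin-1":
        # Otherwise the 3 BOM bytes would show up as junk in the first header.
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode(encoding), encoding


# A UTF-8 BOM followed by latin-1 data (0xf3 is "ó" in latin-1).
sample = codecs.BOM_UTF8 + b'"Regi\xf3n","Visits"\n'
text, encoding = decode_export(sample)
print(encoding, repr(text))
```

The decoded text could then be wrapped in `io.StringIO` and handed to `read_csv`. The latin-1 fallback is a guess; cp1252 may be a better fit if the file came from a Windows tool.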
I am trying to load a utf-8 encoded json file using python's json module. The file contains several right quotation marks, encoded as E2 80 9D. When I call
json.load(f, encoding='utf-8')
I receive the message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 212068: character maps to <undefined>
How can I convince the json module to decode this properly?
EDIT: Here's a minimal example:
[
    {
        "aQuote": "“A quote”"
    }
]
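There is no encoding parameter in the signature of json.load. The 'charmap' error comes from open(), which falls back to the platform default encoding when none is given. The solution should be simply:
with open(filename, encoding='utf-8') as f:
    x = json.load(f)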
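To see why the error mentions 'charmap' and byte 0x9d, here is a small self-contained sketch. It writes the minimal example to a temporary file, shows that cp1252 (assumed here to be the platform default, as is common on Windows) cannot decode it, and then loads it with an explicit encoding.

```python
import json
import os
import tempfile

# Write the minimal example as UTF-8 bytes; the right quote is E2 80 9D.
payload = '[{"aQuote": "\u201cA quote\u201d"}]'.encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # cp1252 has no character assigned to byte 0x9d, so decoding fails.
    try:
        payload.decode("cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding happens in open(), so that is where the encoding belongs.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
finally:
    os.remove(path)

print(cp1252_failed, data[0]["aQuote"])
```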
This is my first post here, so excuse me if I miss anything.
I have some data in a CSV file and am trying to import it into my prod, but I am getting a UnicodeDecodeError. The CSV file contains some French words.
Code:
open_csv = csv.DictReader(open('filename', 'rb'))
for i in open_csv:
    x = find(where={})  # mongodb query
    x.something = i.get(row_header)
    x.save()
I am getting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x8e in position 1" while saving the data.
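I would suggest opening the file with an explicit encoding, so the text is decoded before it reaches your save call:
import csv
import codecs
# 'mac_roman' is a guess (0x8e is "é" there); use the file's actual encoding.
open_csv = csv.DictReader(codecs.open('filename', 'r', encoding='mac_roman'))
for i in open_csv:
    x = find(where={})
    x.something = i.get(row_header)
    x.save()
This is for Python 3.x. In Python 2.x the csv module does not accept Unicode input, so there you would keep reading bytes and decode each value yourself before saving.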
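For Python 3, where the csv module works on text, the decoding step can be separated from the parsing step. The sketch below assumes the file is Mac Roman, which is only a guess based on byte 0x8e mapping to "é" in that encoding; `read_rows` and the sample bytes are hypothetical.

```python
import csv
import io

# Assumed sample: Mac Roman bytes, where 0x8e is "é" and 0x8f is "è".
raw = b"name,city\n\x8et\x8e,Li\x8fge\n"


def read_rows(raw, encoding):
    """Decode raw CSV bytes and return the rows as dictionaries."""
    # newline="" leaves line-ending handling to the csv module.
    text = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text))


rows = read_rows(raw, "mac_roman")
print(rows)
```

With a real file, `open('filename', newline='', encoding=...)` plays the role of the `io.StringIO` wrapper. If the French words come out wrong, the encoding guess is wrong and another one (for example cp1252) should be tried.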
I want to open a JSON file in Python and I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 64864: ordinal not in range(128)
My code is quite simple:
# -*- coding: utf-8 -*-
import json
with open('birdw3l2.json') as data_file:
    data = json.load(data_file)
print(data)
Can someone help me? Thanks!
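Try the following code (Python 3). The error is raised while the file is being decoded, so pass the encoding to open(). Calling .decode() on the result of json.load does not work, because json.load returns already-decoded Python objects.
import json
with open('birdw3l2.json', encoding='utf-8') as data_file:
    data = json.load(data_file)
print(data)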
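In Python 2, you can specify the encoding when you load your JSON file, like this:
data = json.load(data_file, encoding='utf-8')
In Python 3 this argument is ignored, and it was removed in Python 3.9, so pass encoding to open() instead. The right value depends on how your file is encoded.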
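Another option in Python 3.6 and later is to open the file in binary mode and let the json module detect the encoding itself; it accepts bytes in UTF-8, UTF-16 or UTF-32. This is a minimal sketch with a made-up payload standing in for the real file, chosen so that it contains a 0xe2 byte like the one in the error message.

```python
import json
import os
import tempfile

# Stand-in content; the en dash (U+2013) is encoded as E2 80 93 in UTF-8.
payload = '{"species": "Zo\u00f6logy \u2013 birds"}'.encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: open() does no locale-dependent decoding at all.
    with open(path, "rb") as data_file:
        data = json.load(data_file)
finally:
    os.remove(path)

print(data)
```

This sidesteps the locale default entirely, at the cost of only working for the Unicode encodings that the JSON specification allows.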
I have a GIF file (or any image format) in unicode form:
>>> data
u'GIF89a,\x000\x00\ufffd\ufffd\x00\x00\x00\x00\ufffd\ufffd\ufff...
I want to write this to file:
>>> f = open('file.gif', 'wb')
>>> f.write(data)
But I get an error:
UnicodeEncodeError at /image
'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)
How do I do this?
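Try this:
utf8data = data.encode('UTF-8')
open('file.gif', 'wb').write(utf8data)
You must encode the unicode string to bytes explicitly:
f.write(data.encode('utf-8'))
Note that the \ufffd characters in data are U+FFFD replacement characters, so the original bytes at those positions were already lost when the image was decoded to unicode. Encoding will not restore them; the real fix is to keep the image as bytes from the start.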
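Whether encoding gets the original image back depends on how the bytes were turned into text in the first place. The sketch below uses a few made-up bytes as a stand-in for a GIF header (not a complete image) and compares a lossy UTF-8 round trip with a lossless latin-1 round trip.

```python
# Stand-in for the first bytes of an image; not a complete GIF.
raw = b"GIF89a,\x000\x00\xf7\xff"

# Lossy path: invalid UTF-8 bytes become U+FFFD and cannot be recovered.
lossy_text = raw.decode("utf-8", errors="replace")
lossy_roundtrip = lossy_text.encode("utf-8")

# Lossless path: latin-1 maps every byte to exactly one code point.
safe_text = raw.decode("latin-1")
safe_roundtrip = safe_text.encode("latin-1")

print(lossy_roundtrip == raw, safe_roundtrip == raw)
```

If binary data has to pass through a text-only channel, base64 is usually a more robust choice than any character encoding; writing the bytes straight to a file opened with 'wb' avoids the problem altogether.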