While running code to merge (basically an inner join) two CSV files, I am getting an error while reading a CSV file. My code:
import csv
import pandas as pd

s1 = pd.read_csv(".../noun.csv")
s2 = pd.read_csv(".../verb.csv")
merged = s1.merge(s2, on=["userID", "sentID"], how="inner")
merged.to_excel(".../merge1.xlsx", index=False)
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte
An example of my content:
verb file
userID sentID verb
['3477' 1 ['am', 'were', 'having', 'attended', 'stopped']
['3477' 2 ['felt', 'thrusting']
noun file
userID sentID Sentences
['3477' 1 Thursday,
['3477' 1 November
You can use a library that attempts to detect the encoding, for example cchardet:
pip install cchardet
If you use Python 2.x, you also need a backport of the csv library; it supports Unicode natively, while Python 2's csv does not:
pip install backports.csv
Then in your code you can do something like this:
import cchardet
import io
from backports import csv

# detect encoding
with io.open(filename, mode="rb") as f:
    data = f.read()
detect = cchardet.detect(data)
encoding_ = detect['encoding']

# retrieve data
with io.open(filename, encoding=encoding_) as csvfile:
    reader = csv.reader(csvfile, ...)
    ...
I don't know pandas, but you can do something like this:
# retrieve data
s1 = pd.read_csv(".../noun.csv", encoding=encoding_)
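For completeness, a minimal sketch that plugs the detected encoding into the original merge (the detect_encoding helper is mine, not part of either snippet, and the "..." paths are the question's placeholders):

import cchardet
import pandas as pd

def detect_encoding(path):
    # read the raw bytes and let cchardet guess the encoding
    with open(path, "rb") as f:
        return cchardet.detect(f.read())["encoding"]

s1 = pd.read_csv(".../noun.csv", encoding=detect_encoding(".../noun.csv"))
s2 = pd.read_csv(".../verb.csv", encoding=detect_encoding(".../verb.csv"))
merged = s1.merge(s2, on=["userID", "sentID"], how="inner")
merged.to_excel(".../merge1.xlsx", index=False)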
Related
I need to convert multiple CSV files (with different encodings) into UTF-8.
Here is my code:
# find encoding and if not in UTF-8 convert it
import os
import sys
import glob
import chardet
import codecs

myFiles = glob.glob('/mypath/*.csv')
csv_encoding = []

for file in myFiles:
    with open(file, 'rb') as opened_file:
        bytes_file = opened_file.read()
        result = chardet.detect(bytes_file)
        my_encoding = result['encoding']
        csv_encoding.append(my_encoding)
print(csv_encoding)

for file in myFiles:
    if csv_encoding in ['utf-8', 'ascii']:
        print(file + ' in utf-8 encoding')
    else:
        with codecs.open(file, 'r') as file_for_conversion:
            read_file_for_conversion = file_for_conversion.read()
        with codecs.open(file, 'w', 'utf-8') as converted_file:
            converted_file.write(read_file_for_conversion)
        print(file + ' converted to utf-8')
When I try to run this code I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 5057: invalid continuation byte
Can someone help me? Thanks!!!
You need to zip the lists myFiles and csv_encoding to get their values aligned:
for file, encoding in zip(myFiles, csv_encoding):
...
And you need to specify that value in the open() call:
...
with codecs.open(file, 'r', encoding=encoding) as file_for_conversion:
Note: in Python 3 there's no need to use the codecs module for opening files.
Just use the built-in open function and specify the encoding with the encoding parameter.
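Putting both fixes together, the conversion loop might look roughly like this (a sketch based on the question's variable names, assuming Python 3):

for file, encoding in zip(myFiles, csv_encoding):
    if encoding in ['utf-8', 'ascii']:
        print(file + ' in utf-8 encoding')
        continue
    # read with the detected encoding, then rewrite the file as UTF-8
    with open(file, 'r', encoding=encoding) as file_for_conversion:
        read_file_for_conversion = file_for_conversion.read()
    with open(file, 'w', encoding='utf-8') as converted_file:
        converted_file.write(read_file_for_conversion)
    print(file + ' converted to utf-8')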
I have a CSV file that seems to be encoded in UTF-8, based on the BOM present at the start of the file. However, when I try to open it, I get an error:
import io
import chardet
from zipfile import ZipFile
from pandas import read_csv

filename = './sample.zip'
objs = []
frames = []

with ZipFile(filename) as zf:
    zipinfo_objs = [zi for zi in zf.infolist()
                    if zi.filename.endswith(".csv")]
    for zipinfo_obj in zipinfo_objs:
        obj = zf.read(zipinfo_obj.filename)
        objs.append(obj)

print("Bytes Objects:", [type(obj) for obj in objs])
print("Encoding:", chardet.detect(objs[0]))
print("BOM:", objs[0][:4])

buffer = io.BytesIO(objs[0])
frame = read_csv(buffer)
frames.append(frame)
yields
Bytes Objects: [<class 'bytes'>]
Encoding: {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
BOM: b'\xef\xbb\xbf"'
...
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 12: invalid continuation byte
However, if I specify the encoding as latin-1 when decoding the buffer, like:
frame = read_csv(buffer, encoding="latin-1")
it succeeds and pandas is able to read in the dataframe.
This file was generated by Adobe Analytics, and apparently there was no option to specify the format of the export beyond choosing between a CSV and a Tableau file.
My questions are:
Is it a typical occurrence for CSV files to be encoded in latin-1 and include the UTF-8-SIG BOM at the beginning of the file?
Should I be checking for encoding differently / extracting data differently?
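One way to sanity-check such a file yourself is to strip the BOM and test whether the remaining bytes decode as UTF-8, falling back to latin-1 (which accepts any byte) when they do not. A sketch, reusing objs[0] from the code above:

import io
from pandas import read_csv

payload = objs[0]
# drop the UTF-8 BOM if present so it cannot leak into the header row
if payload.startswith(b'\xef\xbb\xbf'):
    payload = payload[3:]
try:
    text = payload.decode('utf-8')
except UnicodeDecodeError:
    # the BOM claims UTF-8, but the body is not; latin-1 decodes any byte
    text = payload.decode('latin-1')
frame = read_csv(io.StringIO(text))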
I'm trying to import data from a CSV file into a Django model. I'm using the manage.py shell for it, with the following code:
>>> import csv
>>> import os
>>> path = "C:\\Users\Lia Love\Downloads"
>>> os.chdir(path)
>>> from catalog.models import ProductosBase
>>> with open('FarmaciasGob.csv') as csvfile:
... reader = csv.DictReader(csvfile)
... for row in reader:
... p = Country(country=row['Country'], continent=row['Continent'])
... p.save()
...
>>>
>>> exit()
I get the following error message at a given point of the dataset:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7823: character maps to <undefined>
From what I could find, it seems to be a problem with the "latin" encoding of the CSV file.
Inspecting the CSV, I don't see anything special about the specific row where it gets the error. I'm able to import about 2200 rows before this one, all with Latin characters.
Any clues?
Assuming you are on Python 3, this is an issue with the character encoding of your file. Most likely the encoding is 'utf-8', but it could also be 'utf-16', 'utf-16le', 'cp1252', or 'cp437', all of which are also commonly used. In Python 3 you can specify the encoding of the file on the open:
with open('FarmaciasGob.csv', encoding='utf-8') as csvfile:
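If you are not sure which one applies, a quick heuristic sketch (not part of the original answer) is to try each candidate until one decodes the whole file without an error; note a clean decode is only a hint, since utf-16 can silently decode the wrong bytes:

for enc in ('utf-8', 'utf-16', 'utf-16le', 'cp1252', 'cp437'):
    try:
        with open('FarmaciasGob.csv', encoding=enc) as csvfile:
            csvfile.read()
        print('file decodes cleanly as', enc)
        break
    except UnicodeDecodeError:
        continue  # cp437 maps all 256 bytes, so the loop always finds a match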
I've been trying to dump a full MySQL table to CSV with Python, and it worked well until the table gained a "description" column; now it's a piece of hell because there are encoding issues everywhere.
After trying tons and tons of things read in other posts, I now give up on those characters; I want to skip them directly and avoid those errors.
test.py:
import MySQLdb, csv, codecs

dbConn = MySQLdb.connect(dbServer, dbUser, dbPass, dbName, charset='utf8')
cur = dbConn.cursor()

def createFile(self):
    SQLview = 'SELECT fruit_type, fruit_qty, fruit_price, fruit_description FROM fruits'
    cur.execute(SQLview)
    with codecs.open('fruits.csv', 'wb', encoding='utf8', errors='ignore') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerows(cur)
Still getting that error from the function:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 27: ordinal not in range(128)
Any idea how I can skip that error and still write the rest of the data from the DB query?
PS: The line where the function crashes is:
csv_writer.writerows(cur)
Don't know if that's useful info for someone.
Finally solved it. I changed:
import csv
to:
import unicodecsv as csv
and changed:
csv_writer = csv.writer(csv_file)
to:
csv_writer = csv.writer(csv_file, encoding='utf-8')
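With both changes applied, the writing part of the function looks roughly like this (a sketch; unicodecsv does the encoding itself, so codecs and errors='ignore' are no longer needed):

import unicodecsv as csv

def createFile(self):
    SQLview = 'SELECT fruit_type, fruit_qty, fruit_price, fruit_description FROM fruits'
    cur.execute(SQLview)
    # unicodecsv encodes each cell to UTF-8 itself, so open the file in binary mode
    with open('fruits.csv', 'wb') as csv_file:
        csv_writer = csv.writer(csv_file, encoding='utf-8')
        csv_writer.writerows(cur)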
I'm just trying to load this JSON file (with non-ASCII characters) as a Python dictionary with Unicode encoding, but I still get this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content:
"tooltip": {
    "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
}
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8', 'replace')))
You have several problems, as near as I can tell. First is the file encoding. When you open a file without specifying an encoding, it is opened with your platform's default encoding (locale.getpreferredencoding(False)). Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Because of your error message, I suspect the file was opened with an ASCII encoding.
Next, the file is decoded from utf-8 into Python strings as it is read by the file object, so each line is already a string, ready for json to read. When you do line.encode('utf-8','replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") can't handle.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid JSON by itself, but it does look like one line of a pretty-printed JSON file containing a single JSON object. My guess is that you should read the entire file as one JSON object.
Putting it all together you get:
import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)
From its name, it's possible that this file is encoded with a Windows Portuguese code page. If so, the "cp860" encoding may work better.
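If that's the case, a small sketch that tries UTF-8 first and falls back to cp860 on failure:

import json

path = '/Users/myvb/Desktop/Automation/pt-PT.json'
try:
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
except UnicodeDecodeError:
    # not valid UTF-8 after all; retry with the Portuguese DOS code page
    with open(path, encoding='cp860') as f:
        data = json.load(f)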
I had the same problem; what worked for me was creating a regular expression and parsing every line of the JSON file:

import re

REGEXP = r"[^A-Za-z0-9':.;\-?!]+"  # whitelist of characters to keep
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours, I can read it in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.