how to NOT read_csv if csv is empty - python

Using Python 2.7 and Pandas
I have to parse through my directory and plot a bunch of CSVs. If the CSV is empty, the script breaks and produces the error message:
pandas.io.common.EmptyDataError: No columns to parse from file
If I have my file paths stored in
file_paths=[]
how do I read through each one and only plot the non-empty CSVs? With an empty dataframe defined as df=[], I attempted the following code:
for i in range(0, len(file_paths)):
    if pd.read_csv(file_paths[i] == ""):
        print "empty"
    else:
        df.append(pd.read_csv(file_paths[i], header=None))

I would just catch the appropriate exception, as a catch-all is not recommended in Python:
import pandas.errors

for i in range(0, len(file_paths)):
    try:
        pd.read_csv(file_paths[i])
    except pandas.errors.EmptyDataError:
        print file_paths[i], " is empty"

Note: as of pandas 0.22.0 (the earliest version I can confirm), the exception raised for an empty CSV is pandas.errors.EmptyDataError. If you're importing pandas as import pandas as pd, then use pd instead of pandas.
If your csv filenames are in an array manyfiles, then
import pandas as pd

for filename in manyfiles:
    try:
        df = pd.read_csv(filename)
    except pd.errors.EmptyDataError:
        print('Note: {} was empty. Skipping.'.format(filename))
        continue  # will skip the rest of the block and move to the next file
    # operations on df
I'm not sure whether pandas.io.common.EmptyDataError is still valid; I can't find it in the reference docs. I would also advise against a catch-all except:, as you won't be able to tell whether something else is causing the issue.

You can use the built-in try/except syntax to skip over files that raise an error, as follows:
Described here: Try/Except in Python: How do you properly ignore Exceptions?
for i in range(0, len(file_paths)):
    try:
        pd.read_csv(file_paths[i])
        ### Do Some Stuff
    except:
        continue
        # or pass
This will attempt to read each file, and if unsuccessful continue to the next file.

Related

Check if Excel is empty

I have a bunch of excel files automatically generated by a process. However, some of them are empty because the process stopped before actually writing anything. These excels do not even contain any columns, so they are just an empty sheet.
I'm now running some scripts on each of the Excel files, so I would like to check if the file is empty, and if so, skip it.
I have tried:
pandas.DataFrame.empty
But I still get the message: EmptyDataError: No columns to parse from file
How can I perform this check?
Why not use a try/except:
try:
    # try reading the excel file
    df = pd.read_excel(…)  # or pd.read_csv(…)
except pd.errors.EmptyDataError:
    # do something else if this fails
    df = pd.DataFrame()
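If the read succeeds but the sheet simply has no rows, the DataFrame.empty attribute mentioned in the question can also be used to skip the file. A minimal sketch, with "report.xlsx" standing in as a hypothetical filename for one of your generated files:

import pandas as pd

try:
    df = pd.read_excel("report.xlsx")  # "report.xlsx" is a hypothetical filename
except pd.errors.EmptyDataError:
    df = pd.DataFrame()

if df.empty:
    print("report.xlsx has no data, skipping")
else:
    # run the rest of the script on df
    ...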

Python: ignore filenotfound and other errors in pandas/python code

I have written code which basically has the following structure:
Step 1: import libraries
Step 2: read multiple input files
>>>os.getcwd()
'C:\\Users\\User\\Downloads\\Input\\Daily'
>>>df_delhi = pd.read_excel("Delhi.xlsx")
>>>df_mea = pd.read_excel("MEA.xlsx",sheet_name='MTD NSV_Volumes',usecols ='B:D', skiprows=3)
Step 3: calculations on each file
Step 4: output in excel file.
I have a condition here: out of 20-odd files, if any one file is missing, I need to ignore it and move ahead with the code. I also need to skip the calculations for that file, which come later on in the code.
How do I achieve this?
Not sure if any code is required to be pasted here since it's just basic file reads and calculations.
Look into Exceptions (official Python tutorial: https://docs.python.org/3/tutorial/errors.html). By expecting and catching a FileNotFoundError you could neatly ignore that file, and only work on files which you find.
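A minimal sketch of that idea, reusing the MEA.xlsx read from the question; df_mea stays None when the file is missing, so the calculations later in the code can check for it before running:

import pandas as pd

df_mea = None
try:
    df_mea = pd.read_excel("MEA.xlsx", sheet_name='MTD NSV_Volumes',
                           usecols='B:D', skiprows=3)
except FileNotFoundError:
    print("MEA.xlsx not found, skipping its calculations")

if df_mea is not None:
    # calculations that depend on MEA.xlsx go here
    ...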
Looks like you need to loop over a directory and read the files available in it.
import os
import pandas as pd

# each file gets different arguments while reading
file_params = {
    "Delhi.xlsx": {},
    "MEA.xlsx": {"sheet_name": 'MTD NSV_Volumes', 'usecols': 'B:D'},
}

folder = 'C:\\Users\\User\\Downloads\\Input\\Daily'
data = {}
for file_ in os.listdir(folder):
    data[file_.replace(".xlsx", "")] = pd.read_excel(os.path.join(folder, file_),
                                                     **file_params.get(file_, {}))

# dataframes are contained inside the dict:
# data['Delhi'], data['MEA'], ...

Adding a single column of only one value (e.g. experiment condition name) to multiple csv files in a single folder

I have a folder with multiple csv files, and I want to add one column with the name of the experiment condition to all the csv files in the folder.
import os, pandas as pd
import csv

file_list = list()
os.chdir('.....name.../')
g = []
for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file)
        df['Condition'] = "After food"
        file_list.append(df)
g.append(file_list)
Assuming you're already in the correct directory, we can try something like this:
First we gather all the csv files into a list:
import glob

files = glob.glob('*.csv')
We can then iterate through this list and check for your column with a handy try/except statement.
for file in files:
    try:
        df = pd.read_csv(file)
        df['Condition'] == "After food"
        # do something.
        df.to_csv(file, index=False)
        print(f'{file} has been altered')
    except KeyError:
        print(f'{file} has not met the condition, therefore has not been changed.')
    except pd.errors.EmptyDataError:
        print(f"this {file} has no data to parse")
The output will be something like this:
PayCodes.csv has been altered
birmingham.csv has not met the condition
CF44.csv has not met the condition
DE11.csv has not met the condition
Dublin.csv has not met the condition
DY8.csv has not met the condition
If I understood you correctly, your problem is saving back to the csv files.
If you want to do that, you can use the to_csv function inside your loop:
df.to_csv(file)
Tell me if that wasn't what you meant and I'll try to help you.
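For completeness, a minimal sketch combining the question's loop with that to_csv call, so each file is written back with the new column (index=False keeps pandas from adding an extra index column):

import os
import pandas as pd

for file in os.listdir():
    if file.endswith('.csv'):
        df = pd.read_csv(file)
        df['Condition'] = "After food"
        df.to_csv(file, index=False)  # overwrite the original csv with the new column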

Error while reading a big JSON file with Python: "json.decoder.JSONDecodeError: Expecting ',' delimiter"

I am trying to read a big JSON file (about 3 GB) with Python. The file actually contains about 7 million JSON objects (one per line).
I have tried quite a few different solutions but I keep running into the same error:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 25 (char 24)
The code I am using is here:
import json
import pandas as pd

with open('mydata.json') as json_file:
    data = json_file.readlines()

# this line below may take at least 8-10 minutes of processing for 4-5
# million rows. It converts all strings in list to actual json objects.
data = list(map(json.loads, data))

pd.DataFrame(data)
Any ideas as to why I am getting this error? It seems to be related to the format of the file but in principle it is in correct json format (I have checked a few lines with https://jsonformatter.curiousconcept.com/).
I have also tried reading a much shorter version of the file (with only about 30 lines) and this operation is successful.
A slightly cleaned-up Python 3 version of BoboDarph's code:
import json
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def iter_good_json_lines(lines):
    for lineno, line in enumerate(lines, 1):
        try:
            yield json.loads(line.strip())
        except json.JSONDecodeError as err:
            logger.warning(f"lineno {lineno}:{err.colno} {err.msg}: {err.doc}")

with open('mydata.json') as fd:
    data = pd.DataFrame(iter_good_json_lines(fd))
data
this changes:
iterating an open file gives you an iterator that yields lines
use the logging module so errors don't end up on stdout
Pandas >=0.13 allows a generator to be passed to the DataFrame constructor
f-strings!
Elaborating on the comment above:
One or more of the lines in your data file is most likely not valid JSON, so Python errors out when it tries to load the string into a JSON object.
Depending on your needs, you could either let your code fail because you rely on every line of the file being JSON and want to know when one is not (as it does now), or you could skip non-JSON lines entirely and have your code issue a warning whenever one is encountered.
To implement the second solution, wrap the string-to-JSON loading in a try block to weed out the offending lines. If you do this, all the lines that are not JSON will be ignored and your code will continue to parse all the other lines.
Here's how I would implement this:
import json
from json import JSONDecodeError
import pandas as pd

data = []
with open('mydata.json') as json_file:
    for line in json_file.readlines():
        js = None
        try:
            js = json.loads(line)
        except JSONDecodeError:
            print('Skipping line %s' % (line))
        if js:
            # You don't want None values in your dataframe
            data.append(js)

test = pd.DataFrame(data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(test)

Validate and format JSON files

I have around 2000 JSON files which I'm trying to run through a Python program. A problem occurs when a JSON file is not in the correct format. (Error: ValueError: No JSON object could be decoded) In turn, I can't read it into my program.
I am currently doing something like the below:
for files in folder:
    with open(files) as f:
        data = json.load(f)  # It causes an error at this part
I know there's offline methods to validating and formatting JSON files but is there a programmatic way to check and format these files? If not, is there a free/cheap alternative to fixing all of these files offline i.e. I just run the program on the folder containing all the JSON files and it formats them as required?
SOLVED using #reece's comment:
import os
import simplejson

invalid_json_files = []
read_json_files = []

def parse():
    for files in os.listdir(os.getcwd()):
        with open(files) as json_file:
            try:
                simplejson.load(json_file)
                read_json_files.append(files)
            except ValueError, e:
                print("JSON object issue: %s" % e)
                invalid_json_files.append(files)
    print invalid_json_files, len(read_json_files)
Turns out that I was saving a file which is not in JSON format in my working directory which was the same place I was reading data from. Thanks for the helpful suggestions.
The built-in JSON module can be used as a validator:
import json

def parse(text):
    try:
        return json.loads(text)
    except ValueError as e:
        print('invalid json: %s' % e)
        return None  # or: raise
You can make it work with files by using:
with open(filename) as f:
    return json.load(f)
instead of json.loads and you can include the filename as well in the error message.
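Putting those two pieces together, a minimal sketch of the file-based variant (the name parse_file is just for illustration):

import json

def parse_file(filename):
    try:
        with open(filename) as f:
            return json.load(f)
    except ValueError as e:
        # include the filename in the error message
        print('invalid json in %s: %s' % (filename, e))
        return None  # or: raise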
On Python 3.3.5, for {test: "foo"}, I get:
invalid json: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
and on 2.7.6:
invalid json: Expecting property name: line 1 column 2 (char 1)
This is because the correct json is {"test": "foo"}.
When handling the invalid files, it is best to not process them any further. You can build a skipped.txt file listing the files with the error, so they can be checked and fixed by hand.
If possible, you should check the site/program that generated the invalid json files, fix that and then re-generate the json file. Otherwise, you are going to keep having new files that are invalid JSON.
Failing that, you will need to write a custom json parser that fixes common errors. With that, you should be putting the original under source control (or archived), so you can see and check the differences that the automated tool fixes (as a sanity check). Ambiguous cases should be fixed by hand.
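As a rough sketch of the skipped.txt idea above, assuming the JSON files live in the current directory:

import json
import os

skipped = []
for name in os.listdir('.'):
    if name.endswith('.json'):
        try:
            with open(name) as f:
                json.load(f)
        except ValueError:
            skipped.append(name)

# list the invalid files in skipped.txt so they can be checked and fixed by hand
with open('skipped.txt', 'w') as out:
    out.write('\n'.join(skipped))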
Yes, there are ways to validate that a JSON file is valid. One way is to use a JSON parsing library that will throw exceptions if the input you provide is not well-formatted.
try:
    load_json_file(filename)
except InvalidDataException:  # or something
    # oops guess it's not valid
    pass
Of course, if you want to fix it, you naturally cannot use a JSON loader since, well, it's not valid JSON in the first place. Unless the library you're using will automatically fix things for you, in which case you probably wouldn't even have this question.
One way is to load the file manually, tokenize it, and attempt to detect and fix errors as you go, but I'm sure there are cases where the error simply cannot be fixed automatically, and it would be better to throw an error and ask the user to fix their files.
I have not written a JSON fixer myself so I can't provide any details on how you might go about actually fixing errors.
However, I am not sure whether it would be a good idea to fix all errors, since then you'd have to assume your fixes are what the user actually wants. If it's a missing comma or an extra trailing comma, that might be OK, but there may be cases where it is ambiguous what the user wants.
Here is a full python3 example for the next novice python programmer that stumbles upon this answer. I was exporting 16000 records as json files. I had to restart the process several times so I needed to verify that all of the json files were indeed valid before I started importing into a new system.
I am no python programmer so when I tried the answers above as written, nothing happened. Seems like a few lines of code were missing. The example below handles files in the current folder or a specific folder.
verify.py
import json
import os
import sys
from os.path import isfile, join

# check if a folder name was specified
if len(sys.argv) > 1:
    folder = sys.argv[1]
else:
    folder = os.getcwd()

# arrays to hold invalid and valid files
invalid_json_files = []
read_json_files = []

def parse():
    # loop through the folder
    for files in os.listdir(folder):
        # check if the combined path and filename is a file
        if isfile(join(folder, files)):
            # open the file
            with open(join(folder, files)) as json_file:
                # try reading the json file using the json interpreter
                try:
                    json.load(json_file)
                    read_json_files.append(files)
                except ValueError as e:
                    # if the file is not valid, print the error
                    # and add the file to the list of invalid files
                    print("JSON object issue: %s" % e)
                    invalid_json_files.append(files)
    print(invalid_json_files)
    print(len(read_json_files))

parse()
Example:
python3 verify.py
or
python3 verify.py somefolder
tested with python 3.7.3
It was not clear to me how to provide the path to the file folder, so I'd like to provide an answer with that option.
import glob
import pandas as pd

path = r'C:\Users\altz7\Desktop\your_folder_name'  # use your path
all_files = glob.glob(path + "/*.json")

data_list = []
invalid_json_files = []

for filename in all_files:
    try:
        df = pd.read_json(filename)
        data_list.append(df)
    except ValueError:
        invalid_json_files.append(filename)

print("Files in correct format: {}".format(len(data_list)))
print("Not readable files: {}".format(len(invalid_json_files)))

# df = pd.concat(data_list, axis=0, ignore_index=True)
# will create a pandas dataframe from the readable files, if you like
