I have a bunch of excel files automatically generated by a process. However, some of them are empty because the process stopped before actually writing anything. These excels do not even contain any columns, so they are just an empty sheet.
I'm now running some scripts on each of the Excel files, so I would like to check whether a file is empty and, if so, skip it.
I have tried:
pandas.DataFrame.empty
But I still get the message: EmptyDataError: No columns to parse from file
How can I perform this check?
Why not use a try/except:
try:
    # try reading the Excel file
    df = pd.read_excel(…)  # or pd.read_csv(…)
except pd.errors.EmptyDataError:
    # fall back to an empty DataFrame if there is nothing to parse
    df = pd.DataFrame()
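Note that DataFrame.empty is a property of a DataFrame that has already been loaded, so it can only be checked after the read succeeds. A minimal sketch that combines it with the try/except above; read_or_none is a hypothetical helper name, and the reader parameter (defaulting to pd.read_excel) is an assumption made so that the same check also works with pd.read_csv:

```python
import pandas as pd


def read_or_none(path, reader=pd.read_excel, **kwargs):
    """Return a DataFrame, or None when the file holds no data (hypothetical helper)."""
    try:
        df = reader(path, **kwargs)
    except pd.errors.EmptyDataError:
        # Raised when the parser finds no columns at all.
        return None
    # A file that parses but has no rows or no columns also counts as empty.
    return None if df.empty else df
```

In the processing loop, a None result would then mean "skip this file and move on to the next one".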
I have a process to read csv files and do some processing in pyspark. At times I might get a zero byte empty file. In such cases when I use the below code
df = spark.read.csv('/path/empty.txt', header = False)
It is failing with error:
py4j.protocol.Py4JJavaError: An error occurred while calling o139.csv.
: java.lang.UnsupportedOperationException: empty collection
Since it is an empty file, I tried reading it as JSON, and that worked fine:
df = spark.read.json('/path/empty.txt')
When I manually add a header to the empty CSV, this code reads it fine:
df = spark.read.csv('/path/empty.txt', header = True)
In a few places I read that I should use the Databricks CSV package, but I don't have that option because those jars are not available in my environment.
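One way to sidestep the error is to filter out zero-byte files before handing them to Spark. A minimal sketch, assuming the files sit on a local filesystem that os.path.getsize can see (it would not work as written for HDFS or object-store paths); non_empty_paths is a hypothetical helper name:

```python
import os


def non_empty_paths(paths):
    """Keep only paths that are regular files containing at least one byte."""
    return [p for p in paths if os.path.isfile(p) and os.path.getsize(p) > 0]


# Hypothetical usage with an existing SparkSession named `spark`:
# for path in non_empty_paths(['/path/empty.txt']):
#     df = spark.read.csv(path, header=False)
```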
I am attempting to create an upload tool that takes an .xls file and then converts it to a pandas dataframe before finally saving it as a csv file to be processed and analyzed. After the file comes out of this code:
def xls_to_csv(data):
    #Formats into pandas dataframe. Index removes first column of .xls file.
    formatted_file = pd.read_excel(data, index_col=0)
    #Converts the formatted file into a csv file and saves it.
    final_file = formatted_file.to_csv('out.csv')
It saves properly and in the right location; however, when I attempt to plug the resulting file into other functions that contain loops, it raises
TypeError: 'NoneType' object is not iterable.
The file is saved as 'out.csv' and I am able to open it manually, however the open command won't even work without this error being raised.
I'm using Python 3.6.
to_csv returns None, which is why you got that error.
To keep working with formatted_file, you could try this:
final_file=formatted_file.copy()
or
final_file=pd.read_csv('out.csv')
The pandas to_csv function saves the file but does not return anything. To loop through the CSV file later, change the code to look like this:
formatted_file.to_csv('out.csv')
final_file = open('out.csv', 'r')
def xls_to_csv(data):
    # Read the Excel file as a dataframe object
    formatted_dataframe = pd.read_excel(data, index_col=0)
    # Save the dataframe to a csv file on disk. The method returns None.
    formatted_dataframe.to_csv('out.csv')
    # The dataframe object is still here
    final_dataframe = formatted_dataframe
    # The final file NAME
    final_filename = 'out.csv'
Your variable names are misleading:
Your formatted_file is in fact a dataframe object.
Your final_file: it is unclear to me whether you want the filename or the dataframe.
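To make the point concrete, here is a small sketch that hands both the dataframe and the filename back to the caller, since to_csv itself returns None when it is given a path. The function name save_as_csv and its default filename are illustrative assumptions, not part of the original code:

```python
import pandas as pd


def save_as_csv(dataframe, filename='out.csv'):
    """Write the dataframe to disk and return (dataframe, filename)."""
    # With a path argument, to_csv writes the file and returns None,
    # so return the objects the caller actually needs instead.
    dataframe.to_csv(filename)
    return dataframe, filename
```

The caller can then loop over the returned dataframe directly, or reopen the returned filename with open() or pd.read_csv.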
Using Python 2.7 and Pandas
I have to parse through my directory and plot a bunch of CSVs. If the CSV is empty, the script breaks and produces the error message:
pandas.io.common.EmptyDataError: No columns to parse from file
If I have my file paths stored in
file_paths=[]
how do I read through each one and only plot the non empty CSVs? If I have an empty dataframe defined as df=[] I attempt the following code
for i in range(0,len(file_paths)):
    if pd.read_csv(file_paths[i] == ""):
        print "empty"
    else df.append(pd.read_csv(file_paths[i],header=None))
I would just catch the appropriate exception, as a catch-all is not recommended in Python:
import pandas as pd
for i in range(0, len(file_paths)):
    try:
        pd.read_csv(file_paths[i])
    except pd.errors.EmptyDataError:
        # %-formatting prints the same text under Python 2 and Python 3
        print("%s is empty" % file_paths[i])
Note: as of pandas 0.22.0 (the version I can be sure of), the exception raised for an empty CSV is pandas.errors.EmptyDataError. If you import pandas as pd, use pd instead of pandas.
If your CSV filenames are in a list manyfiles, then
import pandas as pd
for filename in manyfiles:
    try:
        df = pd.read_csv(filename)
    except pd.errors.EmptyDataError:
        print('Note: {} was empty. Skipping.'.format(filename))
        continue  # will skip the rest of the block and move to next file
    # operations on df
I'm not sure if pandas.io.common.EmptyDataError is still valid or not. Can't find it in reference docs. And I also would advise against the catch-all except: as you won't be able to know if it's something else causing the issue.
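A minimal sketch of that advice: catching only EmptyDataError means empty files are skipped, while unrelated problems such as a missing file still surface as their own exceptions. load_non_empty is a hypothetical name, and header=None is carried over from the question:

```python
import pandas as pd


def load_non_empty(file_paths):
    """Return (frames, skipped): parsed DataFrames and the paths that were empty."""
    frames, skipped = [], []
    for path in file_paths:
        try:
            frames.append(pd.read_csv(path, header=None))
        except pd.errors.EmptyDataError:
            skipped.append(path)  # empty file: record it and move on
    return frames, skipped
```

With a bare except, a typo in a path would be silently skipped in the same way as an empty file; here it raises FileNotFoundError instead.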
You can use the built-in try and except syntax to skip over files that raise an error, as follows:
Described here: Try/Except in Python: How do you properly ignore Exceptions?
for i in range(0,len(file_paths)):
    try:
        pd.read_csv(file_paths[i])
        ### Do Some Stuff
    except:
        continue
        # or pass
This will attempt to read each file, and if unsuccessful continue to the next file.
So I've got about 5008 rows in a CSV file, a total of 5009 with the headers. I'm creating and writing this file all within the same script. But when I read it at the end, with either pandas pd.read_csv or Python 3's csv module, and print the len, it outputs 4967. I checked the file for any weird characters that may be confusing Python but don't see any. All the data is delimited by commas.
I also opened it in Sublime and it shows 5009 rows, not 4967.
I could try other methods from pandas like merge or concat, but if Python won't read the CSV correctly, that's no use.
This is one method I tried.
df1=pd.read_csv('out.csv',quoting=csv.QUOTE_NONE, error_bad_lines=False)
df2=pd.read_excel(xlsfile)
print (len(df1))#4967
print (len(df2))#5008
df2['Location']=df1['Location']
df2['Sublocation']=df1['Sublocation']
df2['Zone']=df1['Zone']
df2['Subnet Type']=df1['Subnet Type']
df2['Description']=df1['Description']
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df2.to_csv(newfile, index=False)
print('Done.')
target.close()
Another way I tried is
dfcsv = pd.read_csv('out.csv')
wb = xlrd.open_workbook(xlsfile)
ws = wb.sheet_by_index(0)
xlsdata = []
for rx in range(ws.nrows):
    xlsdata.append(ws.row_values(rx))
print (len(dfcsv))#4967
print (len(xlsdata))#5009
df1 = pd.DataFrame(data=dfcsv)
df2 = pd.DataFrame(data=xlsdata)
df3 = pd.concat([df2,df1], axis=1)
newfile = input("Enter a name for the combined csv file: ")
print('Saving to new csv file...')
df3.to_csv(newfile, index=False)
print('Done.')
target.close()
But no matter which way I try, the CSV file is the actual issue: Python is writing it correctly but not reading it correctly.
Edit: Weirdest part is that I'm getting absolutely no encoding errors or any other errors when running the code...
Edit2: Tried testing it with the nrows param in the first code example; it works up to 4000 rows. As soon as I specify 5000 rows, it reads only 4967.
Edit3: I manually saved a CSV file with my data instead of using the one written by the program, and it read 5008 rows. Why is Python not writing the CSV file correctly?
I ran into this issue also. I realized that some of my lines had open-ended quotes, which was for some reason interfering with the reader.
So for example, some rows were written as:
GO:0000026 molecular_function "alpha-1
GO:0000027 biological_process ribosomal large subunit assembly
GO:0000033 molecular_function "alpha-1
and this led to rows being read incorrectly. (Unfortunately I don't know enough about how csvreader works to tell you why. Hopefully someone can clarify the quote behavior!)
I just removed the quotes and it worked out.
Edited: This option works too, if you want to maintain the quotes:
quotechar=None
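On the quote behaviour: an opening quote character starts a quoted field, and the reader keeps consuming text, newlines included, until it meets the next quote, so the lines in between are folded into a single field. A minimal sketch using the standard library csv module rather than pandas; the tab delimiter for the sample rows is an assumption, and count_rows is a hypothetical helper:

```python
import csv
import io

# Sample rows from above, tab-separated for this illustration (assumed delimiter).
sample = (
    'GO:0000026\tmolecular_function\t"alpha-1\n'
    'GO:0000027\tbiological_process\tribosomal large subunit assembly\n'
    'GO:0000033\tmolecular_function\t"alpha-1\n'
)


def count_rows(text, **reader_options):
    """Count the rows csv.reader yields for the given text."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', **reader_options)
    return len(list(reader))


# Default settings: '"' opens a quoted field that may span several lines.
rows_default = count_rows(sample)
# QUOTE_NONE: quote characters are treated as ordinary text.
rows_no_quoting = count_rows(sample, quoting=csv.QUOTE_NONE)
print(rows_default, rows_no_quoting)
```

With quoting disabled, every physical line comes back as its own row; with the defaults, fewer rows come back than there are lines.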
My best guess without seeing the file is that you have some lines with too many or not enough commas, maybe due to values like foo,bar.
Please try setting error_bad_lines=True to see if it catches lines with errors in them; my guess is that there will be 41 such lines (5008 - 4967). From the pandas documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)
When writing, the csv.QUOTE_NONE option seems to leave fields unquoted and to replace the current delimiter with escapechar + delimiter, but you didn't paste your writing code. On read, it's unclear to me what this option does. https://docs.python.org/3/library/csv.html#csv.Dialect
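To test the "too many or not enough commas" guess directly, one option is to count the fields on every line and compare against the header. A minimal sketch using the standard library csv module; lines_with_wrong_field_count is a hypothetical helper, and it deliberately disables quoting so that each physical line is counted on its own (newer pandas versions replace error_bad_lines with on_bad_lines, so a check that does not depend on either flag can be handy):

```python
import csv
import io


def lines_with_wrong_field_count(text, delimiter=','):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []  # completely empty input
    bad = []
    for row in reader:
        if len(row) != len(header):
            bad.append((reader.line_num, len(row)))
    return bad
```

For a file on disk, the text could come from open(path, newline='').read(); a value such as foo,bar written without quotes would show up here as a line with one field too many.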
I know this type of question is asked all the time. But I am having trouble figuring out the best way to do this.
I wrote a script that reformats a single excel file using pandas.
It works great.
Now I want to loop through multiple Excel files, perform the same reformat, and place the newly reformatted data from each Excel sheet at the bottom, one after another.
I believe the first step is to make a list of all excel files in the directory.
There are so many different ways to do this so I am having trouble finding the best way.
Below is the code I am currently using to import multiple .xlsx files and create a list.
import os
import glob
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
I am not sure if the previous glob code actually created the list that I need.
Then I have trouble understanding where to go from there.
The code below fails at pd.ExcelFile(File).
I believe I am missing something...
# create for loop
for File in FileList:
    for x in File:
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(File)
        xlsx_file
        # View the excel files sheet names
        xlsx_file.sheet_names
        # Load the xlsx files Data sheet as a dataframe
        df = xlsx_file.parse('Data',header= None)
        # select important rows,
        df_NoHeader = df[4:]
        #then It does some more reformatting.
Any help is greatly appreciated
I solved my problem. Instead of using glob, I used os.listdir to list all my Excel files, looped through each file, reformatted it, and appended the final data to the end of the table.
# first create an empty appended_data list to store the info.
appended_data = []
folder = r'C:\ExcelFiles'
for WorkingFile in os.listdir(folder):
    # os.listdir returns bare names, so build the full path before using it
    WorkingPath = os.path.join(folder, WorkingFile)
    if os.path.isfile(WorkingPath):
        # Import the excel file and call it xlsx_file
        xlsx_file = pd.ExcelFile(WorkingPath)
        # View the excel files sheet names
        xlsx_file.sheet_names
        # Load the xlsx files Data sheet as a dataframe
        df = xlsx_file.parse('sheet1', header=None)
        # .... do some reformatting, call finished sheet reformatedDataSheet
        reformatedDataSheet
        appended_data.append(reformatedDataSheet)
appended_data = pd.concat(appended_data)
And that's it, it does everything I wanted.
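Because os.listdir returns bare file names rather than full paths, it can help to collect full paths up front. A minimal sketch of that step using glob; list_workbooks is a hypothetical helper, and the reformat function mentioned in the usage comment is a placeholder for the reformatting described above:

```python
import glob
import os


def list_workbooks(folder, pattern='*.xlsx'):
    """Return the sorted full paths of regular files in `folder` matching `pattern`."""
    paths = glob.glob(os.path.join(folder, pattern))
    return sorted(p for p in paths if os.path.isfile(p))


# Hypothetical usage, assuming pandas is imported as pd and reformat() exists:
# frames = [reformat(pd.read_excel(path, header=None))
#           for path in list_workbooks(r'C:\ExcelFiles')]
# combined = pd.concat(frames, ignore_index=True)
```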
You need to change
os.chdir('C:\ExcelWorkbooksFolder')
for FileList in glob.glob('*.xlsx'):
    print(FileList)
to just
os.chdir('C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)
Why does this fix it? glob returns a single list. Since you put for FileList in glob.glob(...), you're going to walk that list one by one and put the result into FileList. At the end of your loop, FileList is a single filename - a single string.
When you do this code:
for File in FileList:
    for x in File:
the first line assigns File to each character of the last filename in turn (because FileList is a string). The second line assigns x to the one and only character of File. A single character is not likely to be a valid filename, so pd.ExcelFile(File) throws an error.
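The effect is easy to reproduce without any Excel files. A minimal sketch, using two made-up filenames in place of the real glob.glob('*.xlsx') result:

```python
# Hypothetical filenames standing in for the result of glob.glob('*.xlsx').
matches = ['first.xlsx', 'second.xlsx']

# Buggy pattern: after the loop, FileList holds only the last filename.
for FileList in matches:
    pass
characters = [File for File in FileList]  # iterating a string yields characters

# Fixed pattern: keep the whole list, then iterate over the filenames.
FileList = matches
filenames = [File for File in FileList]

print(characters[:3])
print(filenames)
```

In the buggy version each File is a single character such as 's', which is what ends up being passed to pd.ExcelFile.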