I'm looping through a directory to read a series of CSV files into a single pandas dataframe. One of the CSVs is throwing an error. I could work through them one by one to figure out which file is causing it, but I assume there must be a way to build in error handling that prints the name of the offending file. I'm not sure how to implement something like that.
Any advice appreciated, code below:
import os
import glob
import pandas as pd
path = r'C:\Users\PATH\TestCSVs'
all_files = glob.glob(os.path.join(path, "*.csv"))
master_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
print(master_df.shape)
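One possible way to do this, as a minimal sketch: swap the generator expression for an explicit loop and wrap each pd.read_csv call in try/except, so the path that fails is printed and remembered. The function name read_all_csvs and the failures list are illustrative names of my own, not part of pandas, and catching a bare Exception is an assumption you may want to narrow (for example to pd.errors.ParserError).

```python
import glob
import os

import pandas as pd


def read_all_csvs(folder):
    """Read every *.csv in `folder`; return (combined_df, failures)."""
    frames = []
    failures = []  # (path, exception) pairs for files that could not be read
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        try:
            frames.append(pd.read_csv(csv_path))
        except Exception as exc:  # broad on purpose: we only want to report the file
            print(f"Could not read {csv_path}: {exc!r}")
            failures.append((csv_path, exc))
    # pd.concat raises on an empty list, so fall back to an empty frame
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failures
```

Calling master_df, failures = read_all_csvs(path) then gives you the combined frame plus a list of the files that were skipped, instead of stopping at the first bad file.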
I've searched for about an hour for an answer to this and none of the solutions I've found are working. I'm trying to get a folder full of CSVs into a single dataframe, to output to one big csv. Here's my current code:
import os
sourceLoc = "SOURCE"
destLoc = sourceLoc + "MasterData.csv"
masterDF = pd.DataFrame([])
for file in os.listdir(sourceLoc):
    workingDF = pd.read_csv(sourceLoc + file)
    print(workingDF)
    masterDF.append(workingDF)
print(masterDF)
The SOURCE is a folder path but I've had to remove it as it's a work network path. The loop is reading the CSVs to the workingDF variable as when I run it it prints the data into the console, but it's also finding 349 rows for each file. None of them have that many rows of data in them.
When I print masterDF it prints Empty DataFrame Columns: [] Index: []
My code is from this solution but that example is using xlsx files and I'm not sure what changes, if any, are needed to get it to work with CSVs. The Pandas documentation on .append and read_csv is quite limited and doesn't indicate anything specific I'm doing wrong.
Any help would be appreciated.
There are a couple of things wrong with your code, but the main one is that DataFrame.append returns a new dataframe instead of modifying the original in place, so its result has to be assigned back:
masterDF = masterDF.append(workingDF)
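Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same "assign the result back" pattern has to be written with pd.concat. A minimal sketch, where the two small frames are made-up stand-ins for the contents of your CSV files:

```python
import pandas as pd

# Made-up stand-ins for two CSV files that were already read
first = pd.DataFrame({"a": [1, 2]})
second = pd.DataFrame({"a": [3]})

master_df = first

# pd.concat returns a new frame; without an assignment the result is lost
pd.concat([master_df, second])
rows_without_assignment = len(master_df)  # still only the rows of `first`

# Assigning the result back is what actually grows master_df
master_df = pd.concat([master_df, second], ignore_index=True)
print(rows_without_assignment, len(master_df))
```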
I also like the approach taken by I_Al-thamary - concat will probably be faster.
One last thing I would suggest, is instead of using glob, check out pathlib.
import pandas as pd
from pathlib import Path
path = Path("your path")
df = pd.concat(map(pd.read_csv, path.rglob("*.csv")))  # rglob also searches subfolders
You can use glob:
import glob
import pandas as pd
import os
path = "your path"
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(path,'*.csv'))))
print(df)
You can store them all in a list and pd.concat them at the end.
import os
import pandas as pd
dfs = [
    pd.read_csv(os.path.join(sourceLoc, file))
    for file in os.listdir(sourceLoc)
]
masterDF = pd.concat(dfs)
I am trying to open multiple xls files in a folder from a particular directory. I wish to read into these files and open all of them in one data frame. So far I am able to access the directory and put all the xls files into a list like this
import os
import pandas as pd
path = ('D:\Anaconda Hub\ARK analysis\data\year2021\\february')
files = os.listdir(path)
files
# outputting the variable files which appears to be a list.
Output:
['ARK_Trade_02012021_0619PM_EST_601875e069e08.xls',
'ARK_Trade_02022021_0645PM_EST_6019df308ae5e.xls',
'ARK_Trade_02032021_0829PM_EST_601b2da2185c6.xls',
'ARK_Trade_02042021_0637PM_EST_601c72b88257f.xls',
'ARK_Trade_02052021_0646PM_EST_601dd4dc308c5.xls',
'ARK_Trade_02082021_0629PM_EST_6021c739595b0.xls',
'ARK_Trade_02092021_0642PM_EST_602304eebdd43.xls',
'ARK_Trade_02102021_0809PM_EST_6024834cc5c8d.xls',
'ARK_Trade_02112021_0639PM_EST_6025bf548f5e7.xls',
'ARK_Trade_02122021_0705PM_EST_60270e4792d9e.xls',
'ARK_Trade_02162021_0748PM_EST_602c58957b6a8.xls']
I am now trying to get it into one dataframe like this:
frame = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1')
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
However, when doing this I sometimes obtain a blank data frame or it throws an error like this:
FileNotFoundError: [Errno 2] No such file or directory: 'ARK_Trade_02012021_0619PM_EST_601875e069e08.xls'
Help would truly be appreciated with this task.
Thanks in advance.
The issue happens because a bare file name is resolved against the current working directory, not against the folder you listed. You therefore need to use the os module to build the full path:
import os
import pandas as pd
path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'  # raw string keeps the backslashes literal
files = os.listdir(path)
#frame = pd.DataFrame() ...This will not work!
frame = [] # Do this instead
for f in files:
    data = pd.read_excel(os.path.join(path, f), sheet_name='Sheet1') # Here join filename with folder location
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
The other issue is that frame should be a list (or another iterable) of dataframes. pd.concat expects a sequence of objects, whereas DataFrame.append returns a new frame rather than growing frame in place, which is why you sometimes ended up with a blank dataframe.
You can add an intermediate step to check whether the path exists. I suspect this is an isolated issue with your server: from memory, when working on older Windows servers (namely 2012), I would sometimes hit cases where a path couldn't be found even though it definitely existed.
import pandas as pd
from pathlib import Path
# assuming you want xls and xlsx.
files = Path('folder_location').glob('*.xls*')
dfs = []
for file in files:
    if file.is_file():
        df = pd.read_excel(file, sheet_name='sheet')
        dfs.append(df)
final_df = pd.concat(dfs)
Hi guys,
I'm trying to import many csv files into a DataFrame.
I get an error: ValueError: No objects to concatenate
This is my code:
from glob import iglob
import numpy as np
import pandas as pd
# read datas from github repository
path = r'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports'
df1 = pd.concat((pd.read_csv(f) for f in iglob(path+"/*.csv", recursive=True)), ignore_index=True)
Thanks for your help. I think it is due to the path definition?
The error indicates that the collection of dataframes passed to pd.concat is empty, which is why the call failed. So I'm guessing the .csv files are not where they are expected to be.
If your data folder structure is unusual it may still be loadable, but it's hard for me to tell without seeing it.
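The empty input can be reproduced directly. glob and iglob only match paths on the local filesystem, so a pattern built from a URL finds nothing and pd.concat is handed zero frames. The sketch below uses the URL from the question and touches neither the network nor any real file; the variable names are my own.

```python
from glob import iglob

import pandas as pd

# URL taken from the question; glob treats it as a local path pattern
url_pattern = (
    "https://github.com/CSSEGISandData/COVID-19/tree/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/*.csv"
)

matches = list(iglob(url_pattern, recursive=True))
print(matches)  # nothing on the local disk matches a URL

try:
    pd.concat((pd.read_csv(f) for f in matches), ignore_index=True)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

pd.read_csv itself does accept a URL to a single raw file, so one option is to build the list of file URLs yourself; another is to clone the repository and glob the local copy.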
Try this construction:
import glob
import pandas as pd
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
df1 = pd.concat(dfs, ignore_index=True)
It works fine for simple files but not with more complex ones.
My files are not corrupted and they are in the right directory.
I tried it with simple generated files (1,2,3,4... a,b,c,d...).
I will put it on GitHub tonight so you can run the code and see the files.
import os
import glob
import pandas as pd
def concatenate(indir='./files/', outfile='./all.csv'):
    os.chdir(indir)
    fileList = glob.glob('*.CSV')
    dfList = []
    '''colnames = ['Time', 'Number', 'Reaction', 'Code', 'Message', 'date']'''
    print(len(fileList))
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=0)
        dfList.append(df)
    '''print(dfList)'''
    concatDf = pd.concat(dfList, axis=0)
    '''concatDf.columns = colnames'''
    concatDf.to_csv(outfile, index=None)
concatenate()
Error
Unable to open parsers.pyx: Unable to read file (Error: File not found
(/Users/alf4/Documents/vs_code/files/pandas/_libs/parsers.pyx)).
But this only happens with more than two files.
Complex ones? Do you mean bigger csv files?
Instead of appending data to an empty list and then concatenating it into a dataframe, we can do it in a single step: take an empty dataframe (df1) and keep appending df to df1 in the loop. Note that this relies on DataFrame.append, which exists only in pandas versions before 2.0.
df1=df1.append(df)
and then write it out at the end
df1.to_csv(outfile, index=None)
I am sorry for this question and the wrong topic, because it seems not to be a code problem.
It seems that the pandas installation is broken. I put the code on repl.it to share it here, and there it works. At the moment I am trying to repair the Python and pandas installation.
Many thanks to everyone in the comments for helping.
I have several files with the same format, but with different values.
With the help of StackOverflow users I got the code running, but now I am trying to optimize it, and I need some help to do it.
This is the full code:
import pandas as pd
# filenames
excel_names = ["file-JAN_2019.xlsx", "example-JAN_2019.xlsx", "stuff-JAN_2019.xlsx"]
# read them in
excels = [pd.ExcelFile(name) for name in excel_names]
# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in
excels]
#frames = [df.iloc[20:, :] for df in frames]
frames_2 = [df.iloc[21:, :] for df in frames[1:]]
#And combine them separately
combined = pd.concat([frames[0], *frames_2])
# concatenate them..
#combined = pd.concat(frames)
combined = combined[~combined[4].isin(['-'])]
combined.dropna(subset=[4], inplace=True)
# write it out
combined.to_excel("c.xlsx", header=False, index=False)
the code that I am trying to use is as follows:
from glob import glob
excel_names = glob.glob('*JAN_2019-jan.xlsx')
files = []
for names in (excel_names):
    files.extend(names)
print(files)
at this moment i am getting the following error:
Traceback (most recent call last):
File "finaltwek.py", line 4, in <module>
excel_names = glob.glob('*JAN_2019-jan.xlsx')
AttributeError: 'function' object has no attribute 'glob'
but while I was tweaking with the code I also made the code run, but it found all the files in the folder, and I need only the ones that have the same designation in the end, including the extension
I am trying to make the code more dynamic by making it find all the files that end in the same way and are located in the same folder, but for some reason, I can't make it work, can anyone help?
Thanks
glob.glob("*JAN_2019-jan.xlsx") resolves the relative pattern against the current working directory, which is not necessarily the folder that contains your script.
You can easily construct a file path by using os.path.join(...) and os.path.dirname(__file__) to point to your script's directory:
import os
import glob
excel_names = glob.glob(os.path.join(os.path.dirname(__file__), '*JAN_2019-jan.xlsx'))
print(excel_names)
Prints for me:
['/tmp/ex-JAN_2019-jan.xlsx']
If you want to use glob.glob() then you should call
import glob
#then use
file_names = glob.glob('*.xlsx')
In your code, you are importing the glob function from the glob module, so the name glob refers to the function itself and glob.glob() does not exist. For your code:
from glob import glob
excel_names = glob('*JAN_2019-jan.xlsx')
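To see both points in one place, here is a small sketch. It checks that the imported function has no glob attribute (the cause of the AttributeError) and then applies the suffix pattern to a throwaway folder. The three file names are hypothetical and are created empty just for the demonstration.

```python
import os
import tempfile
from glob import glob  # binds the *function* to the name `glob`

# The function has no attribute called `glob`, hence the AttributeError
has_nested_glob = hasattr(glob, "glob")

# Hypothetical, empty files in a temporary folder to show the suffix filter
folder = tempfile.mkdtemp()
for name in ["file-JAN_2019-jan.xlsx", "stuff-JAN_2019-jan.xlsx", "other-FEB_2019.xlsx"]:
    with open(os.path.join(folder, name), "w"):
        pass

# Only names ending in JAN_2019-jan.xlsx are returned
matches = sorted(
    os.path.basename(p) for p in glob(os.path.join(folder, "*JAN_2019-jan.xlsx"))
)
print(has_nested_glob, matches)
```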