Reading multiple CSV files and adding them to a pandas DataFrame - Python

Hi guys,
I'm trying to import many CSV files into a DataFrame.
I get an error: ValueError: No objects to concatenate
This is my code:
from glob import iglob
import numpy as np
import pandas as pd
# read data from the GitHub repository
path = r'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports'
df1 = pd.concat((pd.read_csv(f) for f in iglob(path+"/*.csv", recursive=True)), ignore_index=True)
Thanks for your help. I think it is due to the path definition?

The error indicates that the iterable passed to pd.concat is empty, so the glob matched no files. glob only searches the local filesystem; it cannot list files behind an HTTPS URL like a GitHub page, so the .csv files are not where the code expects them.
If you have an unusual data folder structure it should still load, but it is hard for me to know without seeing your folder structure.
Try this construction with a local folder:
import glob
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
df1 = pd.concat(dfs, ignore_index=True)
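If you do want to pull the daily reports straight from that GitHub repository, note that glob cannot list files over HTTPS, but pd.read_csv can read a single raw-file URL directly. A minimal sketch, assuming the repo's MM-DD-YYYY.csv naming and the raw.githubusercontent.com mirror (both assumptions worth verifying against the repository):
import pandas as pd

# assumed raw-content base URL for the CSSE daily reports; verify before relying on it
base = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
        "master/csse_covid_19_data/csse_covid_19_daily_reports")
dates = pd.date_range("2020-03-01", "2020-03-05")

# read_csv accepts URLs, so each daily report can be fetched directly
df1 = pd.concat(
    (pd.read_csv(f"{base}/{d:%m-%d-%Y}.csv") for d in dates),
    ignore_index=True,
)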

Related

Opening multiple `.xls` files from a folder in a different directory and creating one dataframe using Pandas

I am trying to open multiple xls files from a particular directory. I wish to read these files and combine all of them into one data frame. So far I am able to access the directory and put all the xls files into a list like this:
import os
import pandas as pd
path = ('D:\Anaconda Hub\ARK analysis\data\year2021\\february')
files = os.listdir(path)
files
# outputting the variable files which appears to be a list.
Output:
['ARK_Trade_02012021_0619PM_EST_601875e069e08.xls',
'ARK_Trade_02022021_0645PM_EST_6019df308ae5e.xls',
'ARK_Trade_02032021_0829PM_EST_601b2da2185c6.xls',
'ARK_Trade_02042021_0637PM_EST_601c72b88257f.xls',
'ARK_Trade_02052021_0646PM_EST_601dd4dc308c5.xls',
'ARK_Trade_02082021_0629PM_EST_6021c739595b0.xls',
'ARK_Trade_02092021_0642PM_EST_602304eebdd43.xls',
'ARK_Trade_02102021_0809PM_EST_6024834cc5c8d.xls',
'ARK_Trade_02112021_0639PM_EST_6025bf548f5e7.xls',
'ARK_Trade_02122021_0705PM_EST_60270e4792d9e.xls',
'ARK_Trade_02162021_0748PM_EST_602c58957b6a8.xls']
I am now trying to get it into one dataframe like this:
frame = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1')
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
However, when doing this I sometimes obtain a blank data frame or it throws an error like this:
FileNotFoundError: [Errno 2] No such file or directory: 'ARK_Trade_02012021_0619PM_EST_601875e069e08.xls'
Help would truly be appreciated with this task.
Thanks in advance.
The issue happens because if you pass only the file name, it is resolved relative to the current working directory, so you need the os module to join the file name with its folder:
import os
import pandas as pd
path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'
files = os.listdir(path)
#frame = pd.DataFrame() ...This will not work!
frame = [] # Do this instead
for f in files:
    data = pd.read_excel(os.path.join(path, f), 'Sheet1') # Here join filename with folder location
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
The other issue is that frame should be a list or some other iterable of DataFrames. DataFrame.append returns a new DataFrame rather than modifying in place, so appending to an empty DataFrame in a loop without reassigning discards the data; pd.concat expects a list (or other iterable) of DataFrames instead.
You can add a middle step to check whether the path exists or not. I suspect this is an isolated issue with your server; from memory, when working on older Windows servers (namely 2012), I would sometimes have issues where the path couldn't be found even though it definitely existed.
import pandas as pd
from pathlib import Path
# assuming you want xls and xlsx.
files = Path('folder_location').glob('*.xls*')
dfs = []
for file in files:
    if file.is_file():
        df = pd.read_excel(file, sheet_name='Sheet1')
        dfs.append(df)
final_df = pd.concat(dfs)

Return File Name Causing Issue

I'm looping through a directory to read a series of csv files into a single pandas dataframe. One of the csvs is throwing an error. I could work through them one by one and figure out which one is causing the error, but I assume there must be some way to build in error handling that would print out the file causing the issue; however, I'm not sure how to implement something like that.
Any advice appreciated, code below:
import os
import glob
import pandas as pd
path = r'C:\Users\PATH\TestCSVs'
all_files = glob.glob(os.path.join(path, "*.csv"))
master_df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
print(master_df.shape)
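One common pattern (not from an accepted answer, just a sketch) is to read each file inside a try/except so the failing file name is reported before the error propagates:
import os
import glob
import pandas as pd

path = r'C:\Users\PATH\TestCSVs'
all_files = glob.glob(os.path.join(path, "*.csv"))

dfs = []
for f in all_files:
    try:
        dfs.append(pd.read_csv(f))
    except Exception as e:
        # report which file failed, then keep the original traceback
        print(f"Failed to read {f}: {e}")
        raise

master_df = pd.concat(dfs, ignore_index=True)
print(master_df.shape)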

how to combine multiple CSV files from multiple folders in Python?

I have several CSV files, each representing data for a day, with no header. The layout is like month-1/day-1.csv ... month-1/day-30.csv, month-2/day-1.csv, etc.
how can I combine all of these CSV files into one big CSV file that contains all of them?
Hi quant and welcome to SO!
You can use the following code to do this:
import os
import glob
import pandas as pd
path = '/your_directory_containing the files'
os.chdir(path)
all_filenames = glob.glob('*.csv')
# header=None because the daily files have no header row
combined_csv = pd.concat([pd.read_csv(f, header=None) for f in all_filenames])
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Please note that this code will combine all .csv files in the specified directory only; it will not descend into subfolders.
I hope the code works for you :)
This assumes that your input is in an in_data folder, the output goes into a folder called out_data, and both folders are in the directory of your notebook.
import pandas as pd
import glob
dfs = pd.concat([pd.read_csv(f, header=None) for f in glob.glob("./in_data/month*/day*")])
dfs.to_csv("./out_data/df_combined.csv", index=False)
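One caveat: to_csv will not create the out_data folder for you; if it does not exist the write fails, so create it first (a small sketch, assuming the same layout as above):
import os

# create the output folder if it is missing; no-op if it already exists
os.makedirs("./out_data", exist_ok=True)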

How to Include Source File in Column in Pandas Dataframe

I have a list of 50+ Excel files that I loop through and consolidate into one dataframe. However, I need to know the source of the data, since the data will repeat across these files.
Each file name is the date of the report. Since this data is time series data, I need to pull this date into the dataframe to do further manipulations.
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], ignore_index=True)
I get the expected dataframe. I just do not know how to include the source file name as a third column. I thought there was an argument to do this in pd.read_excel(), but I couldn't find it.
For example, I have a list of the following files:
02-2019.xlsx
03-2011.xls
04-2014.xls
etc
I want to include those file names next to the data that comes from that file in the combined dataframe.
Maybe use the keys= parameter in pd.concat()?
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# remove ignore_index=True otherwise keys parameter won't work
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], keys=[fp.split('.')[0] for fp in files])
You can then reset_index() and convert the resulting key column with to_datetime():
files_df.reset_index(inplace=True)
files_df['level_0'] = pd.to_datetime(files_df['level_0'])
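An alternative, if you would rather keep ignore_index=True, is to attach the file name as an ordinary column while reading. A sketch of that pattern (the source column name is just an illustration, not from the original answer):
files_df = pd.concat(
    [pd.read_excel(fp, usecols=[0, 15], header=None).assign(source=fp)
     for fp in files],
    ignore_index=True,
)
# turn the file-name stem (e.g. '02-2019') into a proper date
files_df['source'] = pd.to_datetime(files_df['source'].str.split('.').str[0])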

Extract data from multiple excel files in multiple directories in python pandas

I am new to Python and I am posting a question on Stack Overflow for the first time. Please help me solve this problem.
My main directory is 'E:\Data Science\Macros\ZBILL_Dump'; it contains month-wise folders, and each folder contains date-wise Excel data.
I was able to extract data from a single folder:
import os
import pandas as pd
import numpy as np
# Find file names in the specified directory
loc = 'E:\Data Science\Macros\ZBILL_Dump\Apr17\\'
files = os.listdir(loc)
# Find the ONLY Excel files
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# Create empty dataframe and read in new data
zbill = pd.DataFrame()
for f in files_xlsx:
    New_data = pd.read_excel(os.path.normpath(loc + f), 'Sheet1')
    zbill = zbill.append(New_data)
zbill.head()
I am trying to extract data from my main directory, i.e. "ZBILL_Dump", which contains many subfolders, but I could not do it. Could somebody please help me?
Thanks a lot.
You can use glob.
import glob
import pandas as pd
# grab excel files only
pattern = r'E:\Data Science\Macros\ZBILL_Dump\Apr17\*.xlsx'
# Save all file matches: xlsx_files
xlsx_files = glob.glob(pattern)
# Create an empty list: frames
frames = []
# Iterate over xlsx_files
for file in xlsx_files:
    # Read each xlsx file into a DataFrame
    df = pd.read_excel(file)
    # Append df to frames
    frames.append(df)
# Concatenate frames into dataframe
zbill = pd.concat(frames)
You can use glob wildcards (not regular expressions) if you want to look in different sub-directories. Use 'filepath/*/*.xlsx' to search the next level down. More info here: https://docs.python.org/3/library/glob.html
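Applied to the folder structure above, that one-level pattern would look something like this (a sketch, with one * per directory level):
import glob
import pandas as pd

# the single * matches every month folder directly under ZBILL_Dump
pattern = r'E:\Data Science\Macros\ZBILL_Dump\*\*.xlsx'
zbill = pd.concat([pd.read_excel(f) for f in glob.glob(pattern)])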
Use glob with its recursive feature for searching sub-directories:
import glob
files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
Docs: https://docs.python.org/3/library/glob.html
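To go from that recursive file list to a single DataFrame, the same concat pattern as the earlier answers applies; a brief sketch:
import glob
import pandas as pd

files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
# read every workbook found at any depth and stack them
zbill = pd.concat([pd.read_excel(f, 'Sheet1') for f in files], ignore_index=True)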
