How to combine multiple CSV files from multiple folders in Python?

I have several CSV files, each holding one day's data, with no header row. The layout is like month-1/day-1.csv ... month-1/day-30.csv, month-2/day-1.csv, etc.
How can I combine all of these CSV files into one big CSV file that contains all of them?

Hi quant, and welcome to SO!
You can use the following code to do this:
import os
import glob
import pandas as pd
path = '/your_directory_containing_the_files'
os.chdir(path)
all_filenames = glob.glob('*.csv')
# header=None because the daily files have no header row
combined_csv = pd.concat([pd.read_csv(f, header=None) for f in all_filenames])
combined_csv.to_csv("combined_csv.csv", index=False, header=False, encoding='utf-8-sig')
Please note that this code combines all .csv files in the specified directory only; it does not recurse into the month subfolders (see the next answer for that).
I hope the code works for you :)

This assumes that your input is in an in_data folder, the output goes into a folder called out_data, and both folders are in the directory of your notebook.
import pandas as pd
import glob
dfs = pd.concat([pd.read_csv(f, header=None) for f in glob.glob("./in_data/month*/day*")])
# header=False keeps the auto-generated 0,1,2,... column labels out of the output file
dfs.to_csv("./out_data/df_combined.csv", index=False, header=False)
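If you later need to know which day each row came from, you can tag rows with their source path before concatenating. A minimal sketch, assuming the same in_data/month*/day* layout as above:
import glob
import pandas as pd
frames = []
for f in glob.glob("./in_data/month*/day*"):
    df = pd.read_csv(f, header=None)
    df["source_file"] = f  # remember which file each row came from
    frames.append(df)
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("./out_data/df_combined.csv", index=False)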

Related

Read Linux Path and append all data

I want to read all CSV files present in a Linux path and store them in a single data frame using Python.
I am able to read the files, but each one ends up stored under a dictionary-style key, e.g. df['file1'], df['file2'], and so on.
Please let me know how I can read each CSV file into its own data frame dynamically and then combine them into a single data frame.
Thanks in advance.
from pathlib import Path
import pandas as pd
dataframes = []
for p in Path("path/to/data").iterdir():
    if p.suffix == ".csv":
        dataframes.append(pd.read_csv(p))
df = pd.concat(dataframes)
Or, if you want to include subdirectories:
from pathlib import Path
import pandas as pd
path = Path("path/to/data")
df = pd.concat([pd.read_csv(f) for f in path.glob("**/*.csv")], ignore_index=True)
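If you also want each file available as its own data frame (the df['file1'], df['file2'] behaviour you describe), a dict keyed by file name does that cleanly, and pd.concat accepts the pieces directly. A sketch, assuming a flat path/to/data folder:
from pathlib import Path
import pandas as pd
path = Path("path/to/data")
# one DataFrame per file, keyed by the file's stem (name without .csv)
frames = {p.stem: pd.read_csv(p) for p in path.glob("*.csv")}
# combine them, keeping the originating file name as an index level
df = pd.concat(frames.values(), keys=frames.keys(), names=["source_file", "row"])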

Opening multiple .xls files from a folder in a different directory and creating one dataframe using Pandas

I am trying to open multiple .xls files in a folder from a particular directory, read them, and load them all into one data frame. So far I am able to access the directory and put all the .xls file names into a list like this:
import os
import pandas as pd
path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'
files = os.listdir(path)
files
# outputting the variable files which appears to be a list.
Output:
['ARK_Trade_02012021_0619PM_EST_601875e069e08.xls',
'ARK_Trade_02022021_0645PM_EST_6019df308ae5e.xls',
'ARK_Trade_02032021_0829PM_EST_601b2da2185c6.xls',
'ARK_Trade_02042021_0637PM_EST_601c72b88257f.xls',
'ARK_Trade_02052021_0646PM_EST_601dd4dc308c5.xls',
'ARK_Trade_02082021_0629PM_EST_6021c739595b0.xls',
'ARK_Trade_02092021_0642PM_EST_602304eebdd43.xls',
'ARK_Trade_02102021_0809PM_EST_6024834cc5c8d.xls',
'ARK_Trade_02112021_0639PM_EST_6025bf548f5e7.xls',
'ARK_Trade_02122021_0705PM_EST_60270e4792d9e.xls',
'ARK_Trade_02162021_0748PM_EST_602c58957b6a8.xls']
I am now trying to get it into one dataframe like this:
frame = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1')
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
However, when doing this I sometimes obtain a blank data frame or it throws an error like this:
FileNotFoundError: [Errno 2] No such file or directory: 'ARK_Trade_02012021_0619PM_EST_601875e069e08.xls'
Help would truly be appreciated with this task.
Thanks in advance.
The issue happens because, if you pass just the file name, the interpreter assumes it is in the current working directory; you therefore need to use the os module to build the proper location:
import os
import pandas as pd
path = r'D:\Anaconda Hub\ARK analysis\data\year2021\february'
files = os.listdir(path)
#frame = pd.DataFrame() ...This will not work!
frame = []  # Do this instead
for f in files:
    data = pd.read_excel(os.path.join(path, f), 'Sheet1')  # join folder location with file name
    frame.append(data)
df = pd.concat(frame, axis=0, ignore_index=True)
The other issue is that frame should be a list or some other iterable of DataFrames: pd.concat needs a collection to work on. DataFrame.append looked similar, but it returned a new frame instead of modifying in place (which is why you sometimes saw a blank result), and it was removed entirely in pandas 2.0.
You can also add a middle step to check whether the path exists. I suspect this is an isolated issue with your server; from memory, when working on older Windows servers (namely 2012), I would sometimes have issues where a path couldn't be found even though it 100% existed.
import pandas as pd
from pathlib import Path
# assuming you want both .xls and .xlsx
files = Path('folder_location').glob('*.xls*')
dfs = []
for file in files:
    if file.is_file():
        df = pd.read_excel(file, sheet_name='sheet')
        dfs.append(df)
final_df = pd.concat(dfs)
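Since the file names embed the trade date (ARK_Trade_02012021_...), it may be worth carrying that date into the combined frame. A sketch, assuming the name format shown above (date as the third underscore-separated token, in MMDDYYYY form):
import pandas as pd
from pathlib import Path
dfs = []
for file in Path('folder_location').glob('*.xls*'):
    df = pd.read_excel(file, sheet_name=0)  # first sheet, whatever its name
    # assumption: names look like ARK_Trade_02012021_0619PM_EST_<hash>.xls
    df['trade_date'] = pd.to_datetime(file.stem.split('_')[2], format='%m%d%Y')
    dfs.append(df)
final_df = pd.concat(dfs, ignore_index=True)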

Open - Edit - Save - Loop CSV files in a folder with Python

I will receive a folder with 100+ .csv files and I will need to edit them in the same way. Files have the same structure.
Folder looks like this:
df1.csv
df2.csv
df3.csv
...
df100.csv
I need to open each file, edit it, and then save it as "df1-edited", "df2-edited", and so on.
The code runs perfectly on each individual df; I am just not sure how to run it automatically over every file and save the outputs accordingly.
Here is my code:
import pandas as pd
df = pd.read_csv('df1.csv')
[Edit steps here]
df.to_csv("df1-edited.csv", index=None, encoding='utf-8-sig', decimal=',')
Thanks!
For this you can use pathlib from the standard library, which works with paths on any operating system.
Essentially, you need to find all the .csv files in your folder and iterate over them.
This is not tested, but something like this should work:
from pathlib import Path
import pandas as pd
csv_folder = Path('path/to/csvs')
for file in csv_folder.glob('*.csv'):  # iterate over every .csv in the folder
    df = pd.read_csv(file)
    # do stuff
    new_file_name = file.parent.joinpath(f"{file.stem}-edited.csv")
    df.to_csv(new_file_name, index=None, encoding='utf-8-sig', decimal=',')
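If you'd rather keep the originals untouched and collect the edited copies separately, the same loop works with an output directory created up front. A sketch; the edited/ subfolder is a hypothetical destination:
from pathlib import Path
import pandas as pd
csv_folder = Path('path/to/csvs')
out_folder = csv_folder / 'edited'
out_folder.mkdir(exist_ok=True)  # create the destination once
for file in csv_folder.glob('*.csv'):
    df = pd.read_csv(file)
    # [Edit steps here]
    df.to_csv(out_folder / f"{file.stem}-edited.csv", index=None,
              encoding='utf-8-sig', decimal=',')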

Reading multiple CSV files and adding them to a pandas dataframe

Hi guys,
I'm trying to import many CSV files into a single DataFrame.
I get an error: ValueError: No objects to concatenate
This is my code:
from glob import iglob
import numpy as np
import pandas as pd
# read datas from github repository
path = r'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports'
df1 = pd.concat((pd.read_csv(f) for f in iglob(path+"/*.csv", recursive=True)), ignore_index=True)
Thanks for your help. I think it is due to the path definition?
The error indicates that the sequence handed to pd.concat is empty, i.e. no .csv files were found where glob looked. In this case the path itself is the problem: glob only searches the local filesystem, so a GitHub URL (and a /tree/ page, which is HTML, at that) will never match any files. Clone or download the repository first, then point the path at the local folder.
Try this construction:
import glob
import pandas as pd
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
df1 = pd.concat(dfs, ignore_index=True)
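Alternatively, since pd.read_csv can read straight from a URL, you can skip the download by pointing at raw.githubusercontent.com instead of the github.com/.../tree/... page. A sketch; it assumes the daily report files in that repository are named MM-DD-YYYY.csv:
import pandas as pd
# raw.githubusercontent.com serves file contents; the /tree/ URL is an HTML page
base = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
        "master/csse_covid_19_data/csse_covid_19_daily_reports/")
# assumption: one file per day, named like 03-01-2020.csv
dates = pd.date_range("2020-03-01", "2020-03-05").strftime("%m-%d-%Y")
df1 = pd.concat((pd.read_csv(base + d + ".csv") for d in dates),
                ignore_index=True)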

Extract data from multiple Excel files in multiple directories in Python pandas

I am new to Python and I am posting a question on Stack Overflow for the first time. Please help me solve this problem.
My main directory is 'E:\Data Science\Macros\ZBILL_Dump', which contains month-wise folders, and each folder contains date-wise Excel data.
I was able to extract data from a single folder:
import os
import pandas as pd
import numpy as np
# Find file names in the specified directory
loc = r'E:\Data Science\Macros\ZBILL_Dump\Apr17'
files = os.listdir(loc)
# Find the ONLY Excel files
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
# Create empty dataframe and read in new data
zbill = pd.DataFrame()
for f in files_xlsx:
    New_data = pd.read_excel(os.path.join(loc, f), 'Sheet1')
    zbill = zbill.append(New_data)
zbill.head()
I am trying to extract data from my main directory, i.e. ZBILL_Dump, which contains many subfolders, but I could not do it. Please, somebody, help me.
Thanks a lot.
You can use glob.
import glob
import pandas as pd
# grab excel files only
pattern = r'E:\Data Science\Macros\ZBILL_Dump\Apr17\*.xlsx'
# Save all file matches: xlsx_files
xlsx_files = glob.glob(pattern)
# Create an empty list: frames
frames = []
# Iterate over xlsx_files
for file in xlsx_files:
    # Read each xlsx into a DataFrame (the function is read_excel; pd.read_xlsx does not exist)
    df = pd.read_excel(file)
    # Append df to frames
    frames.append(df)
# Concatenate frames into one dataframe
zbill = pd.concat(frames)
You can use glob wildcards (not regex) to look in different sub-directories: 'filepath/*/*.xlsx' searches one level down. More info here: https://docs.python.org/3/library/glob.html
Use glob with its recursive feature for searching sub-directories:
import glob
files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
Docs: https://docs.python.org/3/library/glob.html
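Putting the recursive search together with the read-and-concat step (a sketch, assuming every workbook has a 'Sheet1'):
import glob
import pandas as pd
# ** matches any depth of month subfolders when recursive=True
files = glob.glob(r'E:\Data Science\Macros\ZBILL_Dump\**\*.xlsx', recursive=True)
zbill = pd.concat((pd.read_excel(f, sheet_name='Sheet1') for f in files),
                  ignore_index=True)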
