I am trying to read multiple files from a folder with specific names (1.car.csv, 2.car.csv, and so on), add a new label at the right-most position of each dataset after each iteration, and merge all the CSV files into one CSV file. Since the ".car.csv" part is constant, I think I can use a for loop with .format(index) to iterate over the CSV files. All of the CSV files have the same attributes.
Kindly help me!
glob is used to get all files in the folder that match the pattern *.csv.
pd.read_csv is used to read each file as a DataFrame.
With index_col=None you are telling pandas not to use any of the columns as the index, and instead to create a default index for the DataFrame.
With header=0 you are telling pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame merged_df.
axis=0 means that the concatenation should happen along the rows (vertically).
With ignore_index=True the original indices of the individual DataFrames are discarded and a new default index is created for the resulting DataFrame.
import glob
import pandas as pd
path = r'<path to folder containing csv files>'
all_files = glob.glob(path + "/*.csv")
lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)
merged_df = pd.concat(lst, axis=0, ignore_index=True)
This can be easily done with a CSV tool like miller:
mlr --csv cat --filename bla1.csv *.car.csv
This will concatenate the files (without repeating the header) and prepend the filename as the first column.
You can use the pandas library this way:
import pandas as pd
import os
# path to folder where the csv files are stored
path = '/path/to/folder'
result = pd.DataFrame()
for i in range(1, n+1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    df['new_label'] = i
    result = pd.concat([result, df], ignore_index=True)
result.to_csv('final_result.csv', index=False)
Replace the n in the code above with the number of CSV files you have in the folder.
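If you'd rather not hard-code n, one small sketch (assuming all the files sit directly in path and follow the N.car.csv naming) is to count the matching files with glob first:

import glob
import os

# Count the "<number>.car.csv" files so n doesn't have to be hard-coded
n = len(glob.glob(os.path.join(path, "*.car.csv")))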
If you need any explanation of the code (in case you're new to python or dataframes) just comment below.
Using pathlib and pandas, you can use .assign() to add the new column and finally pd.concat() to concatenate all the files into one.
from pathlib import Path
import pandas as pd
input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"
pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path),
    ignore_index=True,
).to_csv(f"{output_path}/final.csv", index=False)
I am attempting to combine numerous excel files in the same folder into a single, new file using Python. All of the excel files contain time series data, each only has one worksheet, and all have the same layout:
Timestamp ----- Data Tag
TS1 ----------------- ###
TS2 ----------------- ###
etc.
The new combined sheet would look something like this:
Timestamp ----- Data Tag 1-----Data Tag 2
TS1 ----------------- ###----------------###
TS2 ----------------- ###----------------###
etc.
I have tried a couple of different methods.
Method 1 only returns the columns from the last file in the list:
import glob
import pandas as pd
all_data = pd.DataFrame()
for f in glob.glob("*.xls"):
    print(f)
    df = pd.read_excel(f)
    all_data = pd.concat([df])
all_data.to_excel('destination.xlsx')
print("done")
Whereas Method 2 only concatenates the first two files in the folder. It also includes the timestamp column from the second excel file, but I was going to sort out that issue when I got it working for the whole lot.
import glob
import pandas as pd
all_data = pd.DataFrame()
for f in glob.glob("*.xls"):
    print(f)
    df = pd.read_excel(f)
    all_data = df.merge(df, left_index=True, right_index=True)
all_data.to_excel('destination.xlsx')
print("done")
I searched around, both on SO and elsewhere, but I am struggling to combine these files in the method specified above. Appending is simple, merging for some reason isn't.
If anyone could point me in the right direction I would greatly appreciate it!
Thank you.
EDIT:
Per AMC's comment, I edited the code to iterate through the files. Now, however, a KeyError is raised because "Timestamp" is not recognized, despite suffixes being set to None.
Error:
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1706, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'Timestamp'
Code causing error:
import os
import pandas as pd
rootdir = os.path.dirname(os.path.realpath(__file__))
files = os.listdir(rootdir)
df = pd.DataFrame()
for f in files:
    if f.endswith(".xls"):
        print(f)
        df = df.merge(pd.read_excel(f), suffixes=None, left_on='Timestamp', right_on='Timestamp')
df.to_excel('concat.xlsx')
print("done")
EDIT (v2.0): Added an f-string (as suggested by user #NissesSenap) at the beginning of the script that allows me to define a text ("snippet"?) that names the new files and header automatically.
I am new to Python (but very proud of my little script!). I have CSV files with data that I subject to some math and reorganizing (e.g. renaming and removing columns). I then export these as new CSV files. Several CSV files in a common folder are then combined into one CSV file that I use for data analysis.
Step-by-step, what my script(s) aim to do:
1. Subject the data in a CSV file to operations and reorganizing
2. Export it as a new CSV in a common folder
3. Combine several CSV files in the folder based on a common index
My issue is that I have to do steps 1 and 2 on each sample file one at a time by going into the script and defining the filename with an f-string. This is manageable, but time consuming when I deal with a lot of files.
I would like to take all files in a folder, subject them to steps 1 and 2 automatically, and then export them into another folder for step 3 (see the sketch after the two scripts below).
Any and all feedback greatly appreciated!
---> Link to sample data
My code so far...
Script for step 1. and 2.
# Import modules and libraries
import pandas as pd
import numpy as np
from glob import glob
# Define filename
FILENAME = "FILENAME"
# Open CSV file
df = pd.read_csv(f"data/{FILENAME}_data.csv")
# Adds new columns and calculates length in µm, length score and splaying score
df['Length (um)'] = df.loc[: , 'Length (pm)'] / 1000000
df['Length score'] = df.loc[: , 'Length (um)'] / df['Curvature points'] * 2
df['Splaying score'] = df.loc[: , 'Length score'] * df['Splaying profile']
# Saves safecopy as data and creates and opens new CSV as results
df.to_csv(f"data/{FILENAME}_data.csv", index=False)
df.to_csv(f"results/{FILENAME}_results.csv", index=False)
df_results = pd.read_csv(f"results/{FILENAME}_results.csv")
# Removes unnecessary columns from results
df_results = df_results.drop(['Splaying profile', 'Curvature points', 'Length (pm)', 'Length score'],axis=1)
# Rename columns
df_results.rename(columns={'Length (um)' : f"{FILENAME} Average Length (um)", 'Splaying score' : f"{FILENAME} Average Splaying score"}, inplace=True)
# Group by objects and calculate average for each object then save to new CSV
df_results.groupby("Object").mean().to_csv(f"results/{FILENAME}_results.csv")
df_results = pd.read_csv(f"results/{FILENAME}_results.csv")
Script for step 3.
# Import modules and libraries
import pandas as pd
import numpy as np
from glob import glob
# Use glob to list all files in folder. * includes variable part of filename
stock_files = sorted(glob('results/*_results.CSV'))
# Turn the list into dataframes, setting the "Object" column as the index
dataframes = [pd.read_csv(f, index_col="Object") for f in stock_files]
# Use concat to join based on index. axis=1 stacks columns and join=outer includes all data
pd.concat(dataframes, axis=1, sort=False, join='outer').to_csv('results/summary_results.csv', index=True)
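As a rough sketch of the automation described above (assuming the raw files all end in _data.csv, live in data/, and contain the same columns as in the script), steps 1 and 2 can be wrapped in a function and run for every file that pathlib's glob finds:

from pathlib import Path
import pandas as pd

def process_file(csv_path):
    name = csv_path.stem.replace("_data", "")  # e.g. "sample1_data" -> "sample1"
    df = pd.read_csv(csv_path)
    df['Length (um)'] = df['Length (pm)'] / 1000000
    df['Length score'] = df['Length (um)'] / df['Curvature points'] * 2
    df['Splaying score'] = df['Length score'] * df['Splaying profile']
    results = df.drop(['Splaying profile', 'Curvature points', 'Length (pm)', 'Length score'], axis=1)
    results = results.rename(columns={'Length (um)': f"{name} Average Length (um)",
                                      'Splaying score': f"{name} Average Splaying score"})
    results.groupby("Object").mean().to_csv(f"results/{name}_results.csv")

for csv_file in sorted(Path("data").glob("*_data.csv")):
    process_file(csv_file)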
I used pathlib.Path() together with its .glob() method to get all files from an input directory; this works fine. Afterwards you can iterate over all the files found and read each CSV into a pandas dataframe with pd.read_csv(file). After that you can use an if-condition, utilizing the .stem property of pathlib (see docs) to get just the file name, and check whether the name of a given input CSV file matches the pattern you'd expect (I just guessed it could be B1xxxx).
#!/usr/bin/env python3
import pandas as pd
from pathlib import Path

def main():
    path_to_input = Path('C:/Test/Data')
    input_files = sorted(path_to_input.glob('*.csv'))
    for file in input_files:
        df = pd.read_csv(file)
        if file.stem.startswith('B1'):
            # Processing for file type 1 (B1xx)
            pass
        elif file.stem.startswith('B2'):
            # Processing for file type 2 (B2xx)
            pass
        else:
            print(f"File \"{file.parts[-1]}\" name is not recognized")

if __name__ == '__main__':
    main()
I'm using Python to merge some Excel files into a single CSV file, but when doing so the datetimes get turned into integers. So when I read it back with pandas to work on my unified database, I need to convert them back to datetime, which is possible but seems unnecessary. The code for reading and compiling the files:
import csv
import os
from pathlib import Path

import xlrd

folder = Path('myPath')
os.chdir(folder)
files = sorted(os.listdir(os.getcwd()), key=os.path.getctime)

with open('Unified database.csv', 'w', newline='', encoding='utf-8') as f:
    c = csv.writer(f)
    for file in files:
        with xlrd.open_workbook(folder / file) as wb:
            sh = wb.sheet_by_index(0)
            for r in range(sh.nrows):
                c.writerow(sh.row_values(r))
Is there a way to take less steps into solving this problem, and just write the datetime columns as strings, which pandas has a much easier time automatically identifying as dates? Even if I have to pass the datetime columns manually.
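If you'd rather keep the xlrd/csv route, one option is to convert the date cells yourself before writing: xlrd stores dates as floats, and xlrd.xldate_as_datetime() turns them back into datetime objects, which .isoformat() renders as strings pandas parses without trouble. A small, untested sketch to use in place of the inner writing loop above, where DATE_COLS is a made-up placeholder for the column indices that hold dates:

from xlrd import xldate_as_datetime

DATE_COLS = {0}  # hypothetical: indices of the columns containing dates

for r in range(sh.nrows):
    row = sh.row_values(r)
    for col in DATE_COLS:
        if r > 0 and isinstance(row[col], float):  # skip the header row
            row[col] = xldate_as_datetime(row[col], wb.datemode).isoformat()
    c.writerow(row)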
Have you tried reading all of the Excel files directly into a pandas dataframe? The code below is adapted from this answer on how to import multiple CSV files into pandas and concatenate into one DataFrame. I have added parse_dates so you can specify which columns should be parsed as datetimes.
import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(path + "/*.xlsx")

li = []
for filename in all_files:
    df = pd.read_excel(filename, index_col=None, header=0, parse_dates=['a'])
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
import glob
import pandas as pd
# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop?
See pandas: IO tools for all of the available .read_ methods.
Try the following code if all of the CSV files have the same columns.
I have added header=0, so that after reading the CSV file's first row, it can be assigned as the column names.
import pandas as pd
import glob
import os
path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
Or, with attribution to a comment from Sid.
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
It's often necessary to identify each sample of data, which can be accomplished by adding a new column to the dataframe.
pathlib from the standard library will be used for this example. It treats paths as objects with methods, instead of strings to be sliced.
Imports and Setup
from pathlib import Path
import pandas as pd
import numpy as np
path = r'C:\DRO\DCL_rawdata_files' # or unix / linux / mac path
# Get the files from the path provided in the OP
files = Path(path).glob('*.csv') # .rglob to get subdirectories
Option 1:
Add a new column with the file name
dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that gives the file name without the extension
    data['file'] = f.stem
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 2:
Add a new column with a generic name using enumerate
dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
Option 3:
Create the dataframes with a list comprehension, and then use np.repeat to add a new column.
[f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe.
[len(df) for df in dfs] creates a list of lengths
Attribution for this option goes to this plotting answer.
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]
# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)
# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])
Option 4:
One liners using .assign to create the new column, with attribution to a comment from C8H10N4O2
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)
or
df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
An alternative to darindaCoder's answer:
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional third-party libraries. You can do this in two lines using everything Pandas and Python (all versions) already have built in.
For a few files - one-liner
df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))
For many files
import os
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
For No Headers
If you have specific things you want to change with pd.read_csv (i.e., no headers) you can make a separate function and call that with your map:
def f(i):
    return pd.read_csv(i, header=None)
df = pd.concat(map(f, filepaths))
This pandas line, which sets df, utilizes three things:
Python's map(function, iterable) sends to the function (pd.read_csv()) the iterable (our list), which is every CSV element in filepaths.
pandas' read_csv() reads in each CSV file as normal.
pandas' concat() brings all of these together under one df variable.
Easy and Fast
Import two or more CSV files without having to make a list of names.
import glob
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)
The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.
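For example, a short sketch of pulling a Dask result back into regular pandas (reusing the data*.csv pattern from above):

import dask.dataframe as dd

ddf = dd.read_csv('data*.csv')  # lazy dataframe spanning every matching file
pdf = ddf.compute()             # materialize as a single pandas DataFrame (data must fit in memory)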
I googled my way into Gaurav Singh's answer.
However, as of late I am finding it faster to do any manipulation using NumPy and then assign it to a dataframe once, rather than manipulating the dataframe itself iteratively, and that seems to work in this solution too.
I sincerely want anyone hitting this page to consider this approach, but I don't want to attach this huge piece of code as a comment and make it less readable.
You can leverage NumPy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np
path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))
np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    np_array_list.append(df.to_numpy())  # df.as_matrix() was removed in pandas 1.0
comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1", "col2"....]
Timing statistics:
total files: 192
avg lines per file: 8492
-- approach 1, without NumPy: 8.248656988143921 seconds --
total records: 1630571
-- approach 2, with NumPy: 2.289292573928833 seconds --
A one-liner using map, but if you'd like to specify additional arguments, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional arguments.
If you want to search recursively (Python 3.5 or above), you can do the following:
from glob import iglob
import pandas as pd
path = r'C:\user\your\path\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
Note that the three last lines can be expressed in one single line:
df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
EDIT: Multiplatform recursive function:
You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:
df = read_df_rec(r'C:\user\your\path', '*.csv')
Here is the function:
from glob import iglob
from os.path import join
import pandas as pd
def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
Inspired from MrFun's answer:
import glob
import pandas as pd
list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()
df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)
Notes:
By default, the list of files generated by glob.glob is not sorted. In many scenarios, however, a sorted order is required, e.g. when analyzing the number of sensor frame drops versus timestamp.
In the pd.concat command, if ignore_index=True is not specified then it preserves the original indices from each dataframe (i.e. from each individual CSV file in the list) and the main dataframe looks like:
   timestamp   id   valid_frame
0
1
2
...
0
1
2
...
With ignore_index=True, it looks like:
   timestamp   id   valid_frame
0
1
2
...
108
109
...
IMO, this is helpful when one wants to manually create a histogram of the number of frame drops per one-minute (or other duration) bin and base the calculation on the very first timestamp, e.g.
begin_timestamp = df['timestamp'][0]
Without ignore_index=True, df['timestamp'][0] returns a Series containing the first timestamp from each individual dataframe, rather than just a single value.
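If you prefer not to rely on the index at all, positional access returns a single value either way; a minimal example:

# .iloc is position-based, so this gives one scalar regardless of how the index was built
begin_timestamp = df['timestamp'].iloc[0]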
Another one-liner, with a list comprehension, which allows passing arguments to read_csv.
df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
Alternative using the pathlib library (often preferred over os.path).
This method avoids iterative use of pandas concat()/append().
From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
import pandas as pd
from pathlib import Path
dir = Path("../relevant_directory")
df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)
If multiple CSV files are zipped, you may use zipfile to read all and concatenate as below:
import zipfile
import pandas as pd
ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')
train = [pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist()]
df = pd.concat(train)
Based on Sid's good answer.
To identify issues of missing or unaligned columns
Before concatenating, you can load CSV files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.
Import modules and locate file paths:
import os
import glob
import pandas
from collections import OrderedDict
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
Note: OrderedDict is not necessary, but it'll keep the order of files which might be useful for analysis.
Load CSV files into a dictionary. Then concatenate:
dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)
Keys are file names f and values are the data frame content of CSV files.
Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
import os
os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")
Here NR and FNR are awk's built-in line counters: NR is the overall line number across all input files, while FNR is the line number within the current file.
NR == 1 keeps the first line of the first file (the header), while FNR > 1 skips the first line of each subsequent file.
In case of an unnamed-column issue, use this code to merge multiple CSV files along the row axis (axis=0).
import glob
import os
import pandas as pd
merged_df = pd.concat(
    [pd.read_csv(csv_file, index_col=0, header=0)
     for csv_file in glob.glob(os.path.join("data/", "*.csv"))],
    axis=0, ignore_index=True)
merged_df.to_csv("merged.csv")
You can also do it this way:
import pandas as pd
import os

dfs = []
for r, d, f in os.walk(csv_folder_path):
    for file in f:
        if file.endswith('.csv'):
            complete_file_path = os.path.join(r, file)  # join the walked directory with the file name
            dfs.append(pd.read_csv(complete_file_path))

# DataFrame.append() was removed in pandas 2.0, so collect the frames and concatenate once
new_df = pd.concat(dfs, ignore_index=True)
new_df.shape
Consider using the convtools library, which provides lots of data processing primitives and generates simple ad hoc code under the hood.
It is not supposed to be faster than pandas/polars, but sometimes it can be.
For example, you could concatenate CSV files into one for further reuse; here's the code:
import glob
from convtools import conversion as c
from convtools.contrib.tables import Table
import pandas as pd
def test_pandas():
    df = pd.concat(
        (
            pd.read_csv(filename, index_col=None, header=0)
            for filename in glob.glob("tmp/*.csv")
        ),
        axis=0,
        ignore_index=True,
    )
    df.to_csv("out.csv", index=False)
    # took 20.9 s

def test_convtools():
    table = None
    for filename in glob.glob("tmp/*.csv"):
        table_ = Table.from_csv(filename, header=False)
        if table is None:
            table = table_
        else:
            table = table.chain(table_)
    table.into_csv("out_convtools.csv", include_header=False)
    # took 15.8 s
Of course, if you just want to obtain a dataframe without writing a concatenated file, it takes 4.63 s and 10.9 s respectively (pandas is faster here because it doesn't need to zip columns in order to write the result back).
import pandas as pd
import glob
path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")
file_iter = iter(file_path_list)
list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))
for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))

df = pd.concat(list_df_csv, ignore_index=True)
This is how you can do it using Colaboratory on Google Drive:
import pandas as pd
import glob
path = r'/content/drive/My Drive/data/actual/comments_only' # Use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=True)
frame.to_csv('/content/drive/onefile.csv')