I have a pandas script that reads multiple CSV files in a given folder. All the CSV files have a similar format and the same columns. For a given column (Area), I want to sum all the rows, then save this data into a new CSV file.
This is the code so far:
import pandas as pd
import glob

path = r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000'  # path
all_files = glob.glob(path + "/*.csv")

for filename in all_files:
    df = pd.read_csv(filename)
    area_sum = df['Area'].sum()
    print(area_sum)
I could figure this out with an Excel write function, but I want to use 'to_csv', and also with mode='a' (append), as I have a bunch of folders with the same filenames.
The CSV file format I am looking for is as follows:
filename1, filename2, filename3,.....
area_sum1, area_sum2, area_sum3,.....
You could try this:
import pandas as pd
import glob

path = r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000'  # path
all_files = glob.glob(path + "/*.csv")

# Create an empty dict
results = {"filename": [], "sum": []}

# Iterate over the files and populate the dict
for filename in all_files:
    results["filename"].append(filename)
    df = pd.read_csv(filename)
    results["sum"].append(df['Area'].sum())

# Save to a csv file (to_csv takes the output path as its first argument, not a `path` keyword)
results = pd.DataFrame(results)
results.to_csv("path_to_file.csv", index=False)
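That writes one row per file. To get the layout from the question (filenames across the top, sums in the second row) and to append across folders with mode='a', here is a minimal sketch, assuming a hypothetical output file summary.csv:

import os

# reshape to one row: filenames as columns, their sums as the single data row
row = results.set_index("filename").T

out = "summary.csv"  # hypothetical output path
# append on each run; write the header only if the file does not exist yet
row.to_csv(out, mode="a", index=False, header=not os.path.exists(out))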
I want to concatenate all CSV files that have the specific word 'tables' in the filename.
The code below loads every CSV file without filtering for the specific word I want.
# importing the required modules
import glob
import pandas as pd

# specifying the path to csv files
#path = "csvfoldergfg"
path = "folder_directory"

# csv files in the path
files = glob.glob(path + "/*.csv")

# defining an empty list to store content
content = []

# checking all the csv files in the specified path
for filename in files:
    # reading content of csv file
    df = pd.read_csv(filename, index_col=None)
    content.append(df)

# converting content to data frame
data_frame = pd.concat(content)
print(data_frame)
Example filenames are:
abcd-tables.csv
abcd-text.csv
abcd-forms.csv
defg-tables.csv
defg-text.csv
defg-forms.csv
From the example filenames, the expected output is to concatenate
abcd-tables.csv
defg-tables.csv
into a single dataframe, assuming the headers are the same.
I'd really appreciate it if you could solve this.
You can use:
import pandas as pd
import pathlib

path = 'folder_directory'
content = []

for filename in pathlib.Path(path).glob('*-tables.csv'):
    df = pd.read_csv(filename, index_col=None)
    content.append(df)

df = pd.concat(content, ignore_index=True)
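If the word can appear anywhere in the name rather than in the fixed *-tables.csv position, a plain substring filter over the original glob call works too (a sketch under the same folder assumption):

import glob
import os
import pandas as pd

path = 'folder_directory'
# keep only the files whose basename contains the word 'tables'
files = [f for f in glob.glob(path + "/*.csv") if 'tables' in os.path.basename(f)]
df = pd.concat((pd.read_csv(f, index_col=None) for f in files), ignore_index=True)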
I have a lot of CSV files in a folder, say file1.csv to file9.csv. What I want is to import each of these files into a separate dataframe. Importing them into one dataframe is not an option, since the files have different columns. The code below loops through all the CSV files in this filepath and is supposed to import them into different dataframes. However, I don't see 9 dataframes, only one called df. Why isn't this working? I thought df.name would do the trick of creating the separate dataframes, but it doesn't. Does anyone know what I should change to make this work?
import pandas as pd
import os
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")

for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    df = pd.read_csv(filename)
    df.name = name_df
You only see one dataframe called df because every iteration of the loop overwrites the previous one. What you can do is keep a list of dataframes or a dictionary of dataframes.
Dictionary Approach
This is useful if you want to access dataframes by name.
import os
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")

df_dict = dict()
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    df_dict[name_df] = pd.read_csv(filename)
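Each dataframe can then be looked up by the name of the file it came from, for example (using the question's file1.csv):

print(df_dict['file1'])  # the dataframe read from file1.csv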
List Approach
This is useful if you want to access dataframes by index.
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")

df_list = []
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)
Append them to a list of data frames and access by list index e.g. df_list[0]:
import pandas as pd
import glob
filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
df_list = []
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)
You are overwriting the df object each time you loop. I would suggest using a dict of dataframes in this case.
import os
import pandas as pd
import glob
filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
# create the empty dict to be filled in the loop
dfs = {}
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    # add the df to the dict with the filename as its key
    dfs[name_df] = pd.read_csv(filename)

# then use it like this
print(dfs['file9'])
I tried the examples with the dictionary and with the list as well. Both work. Thanks a lot, everyone, for your help.
I have a folder with about 500 .txt files. I would like to store the content in a csv file, with 2 columns, column 1 being the name of the file and column 2 being the file content in string. So I'd end up with a CSV file with 501 rows.
I've snooped around SO and tried to find similar questions, and came up with the following code:
import pandas as pd
from pandas.io.common import EmptyDataError
import os

def Aggregate_txt_csv(path):
    for files in os.listdir(path):
        with open(files, 'r') as file:
            try:
                df = pd.read_csv(file, header=None, delim_whitespace=True)
            except EmptyDataError:
                df = pd.DataFrame()
    return df.to_csv('file.csv', index=False)
However it returns an empty .csv file. Am I doing something wrong?
There are several problems in your code. One is that pd.read_csv is not opening the files, because you pass only the filename, not the full path to it. Another is that df is overwritten on every iteration, so only the last (or an empty) dataframe ever reaches to_csv. I think you should start from this code:
import os
import pandas as pd
from pandas.errors import EmptyDataError  # pandas.io.common.EmptyDataError moved here in newer pandas

def Aggregate_txt_csv(path):
    files = os.listdir(path)
    df = []
    for file in files:
        try:
            d = pd.read_csv(os.path.join(path, file), header=None, delim_whitespace=True)
            d["file"] = file
        except EmptyDataError:
            d = pd.DataFrame({"file": [file]})
        df.append(d)
    df = pd.concat(df, ignore_index=True)
    df.to_csv('file.csv', index=False)
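A usage sketch, assuming a hypothetical folder name:

Aggregate_txt_csv('path_to_txt_folder')  # writes the combined rows to file.csv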
Use pathlib:
Path.glob() finds all the files.
When using path objects, file.stem returns the file name from the path, without its extension.
Use pandas.concat to combine the dataframes in df_list.
from pathlib import Path
import pandas as pd

p = Path('e:/PythonProjects/stack_overflow')  # path to files
files = p.glob('*.txt')  # get all txt files

df_list = list()  # create an empty list for the dataframes
for file in files:  # iterate through each file
    with file.open('r') as f:
        # join all rows in the list as a single string separated with \n
        text = '\n'.join([line.strip() for line in f.readlines()])
    # create and append a one-row dataframe
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]}))

df_all = pd.concat(df_list)  # concat all the dataframes
df_all.to_csv('files.csv', index=False)  # save to csv
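As a design note, pathlib can read a whole file in one call, so the open/readlines block could shrink to this (a sketch that keeps the raw newlines instead of stripping each line):

for file in files:  # one row per file, same dataframe shape as above
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [file.read_text()]}))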
I noticed there's already an answer, but I've gotten it to work with a relatively simple piece of code. I've only edited the file read-in a little bit, and the dataframe outputs successfully.
import pandas as pd
import os

def Aggregate_txt_csv(path):
    result = []
    print(os.listdir(path))
    for files in os.listdir(path):
        fullpath = os.path.join(path, files)
        if not os.path.isfile(fullpath):
            continue
        with open(fullpath, 'r', errors='replace') as file:
            content = '\n'.join(file.readlines())
            result.append({'title': files, 'body': content})
    df = pd.DataFrame(result)
    return df

df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv')
Most importantly here, I am appending to a list, so as not to run pandas' concatenate function inside the loop, which would be bad for performance. Additionally, reading in the file should not need read_csv, as there isn't a set format for the file, so '\n'.join(file.readlines()) reads the file plainly and turns all its lines into a single string.
At the end, I convert the list of dictionaries into a final dataframe and return the result.
EDIT: for paths that aren't the current directory, I updated the code to join the path so that it can find the necessary files. Apologies for the confusion.
Say I have 200 CSV files. I want to read them all at once and store each file in a different dataframe, like df1 for the first file and so on up to df200. Doing this manually, like df1 = pd.read_csv(...), takes a lot of time for 200 files. How do I do this using pandas?
I have tried using a for loop, but I couldn't work out the approach and got stuck.
import pandas as pd
import glob

all_files = glob.glob("file_path" + "/*.csv")

dfs_dict = {}
for idx, filename in enumerate(all_files):
    df = pd.read_csv(filename, index_col=None, header=0)
    dfs_dict["df" + str(idx)] = df
Try using this:
import pandas as pd
import glob

path = r'path of the folder where all csv exists'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
li will hold all the CSVs as dataframes. You can further preprocess them into separate files, or, if all the CSVs have the same columns and you want to concatenate them into a single dataframe, you can use pandas' concat function over li to return that single dataframe.
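For the single-dataframe case, that last step is one call:

df = pd.concat(li, ignore_index=True)  # stack all the frames and renumber the rows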
import pandas as pd
import os

dfs = []  # empty list of dataframes
dirname = 'path_to_folder'  # hypothetical placeholder: the folder where your files are

# os.walk also descends into subfolders, unlike a non-recursive glob
for root, folders, files in os.walk(dirname):
    for file in files:
        fp = os.path.join(root, file)
        df = pd.read_csv(fp)
        dfs.append(df)

df = pd.concat(dfs)
I have multiple csv files in the same folder with all the same data columns,
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the CSV files into pandas and add a column with the file name to each line, so I can track where each row came from later. There seem to be similar threads, but I haven't been able to adapt any of their solutions. This is what I have so far. Merging the data into one data frame works, but I'm stuck on adding the file name column:
import os
import glob
import pandas as pd

path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in glob.glob(path + '\*.csv')]

list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None))

df = pd.concat(list_)
Instead of using a list, just use DataFrame's append. (Note: DataFrame.append was removed in pandas 2.0; for current versions see the concat-based sketch below.)

df = pd.DataFrame()
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None)
    file_df['file_name'] = file_  # tag every row with the file it came from
    df = df.append(file_df)
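On pandas 2.0 and later, the same idea is written with the question's original list plus pd.concat; a sketch under the same read settings (infer_datetime_format is deprecated in pandas 2.x, so it is omitted, and the basename is used since the question's names list suggests that is what is wanted):

import os
import glob
import pandas as pd

path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))

frames = []
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], header=None)
    file_df['file_name'] = os.path.basename(file_)  # tag every row with its source file
    frames.append(file_df)

df = pd.concat(frames, ignore_index=True)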