I have a lot of CSV files in a folder, say file1.csv to file9.csv. I want to import each of these files into a separate dataframe. Importing them into one dataframe is not an option, since the files all have different columns. The code below loops through all the CSV files in the folder and is supposed to import them into different dataframes, but I end up with only one dataframe, called df. Why isn't this working? I thought df.name would do the trick of creating the separate dataframes, but it doesn't. What should I change to make this work?
import pandas as pd
import os
import glob
filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    df = pd.read_csv(filename)
    df.name = name_df
You only see one dataframe called df because every iteration of the loop overwrites the previous one; assigning df.name just sets an attribute on that single object, it does not create a new variable. What you can do instead is keep a list of dataframes or a dictionary of dataframes.
Dictionary Approach
This is useful if you want to access dataframes by name.
import os
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")

df_dict = dict()
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    df_dict[name_df] = pd.read_csv(filename)
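You can then pull any dataframe out by its file name, for example (assuming the folder contains a file1.csv):
print(df_dict['file1'].head())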
List Approach
This is useful if you want to access dataframes by index.
import os
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")

df_list = []
names = []
for filename in all_files:
    # keep the stripped file name in a parallel list so you can
    # still tell which dataframe came from which file
    names.append(os.path.basename(filename).replace('.csv', ''))
    df_list.append(pd.read_csv(filename))
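A quick sketch of how you might pair the two lists to see which dataframe came from which file:
for name, df in zip(names, df_list):
    print(name, df.shape)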
Append them to a list of dataframes and access them by list index, e.g. df_list[0]:
import pandas as pd
import glob
filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
df_list = []
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)
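For example, to peek at the first file's dataframe (assuming at least one CSV was found):
print(df_list[0].head())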
You are overwriting the df object each time you loop. I would suggest using a dict of dataframes in this case.
import os
import pandas as pd
import glob
filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
# create the empty dict to be filled in the loop
dfs = {}
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    # add the df to the dict with the filename as its key
    dfs[name_df] = pd.read_csv(filename)
# then use it like this
print(dfs['file9'])
I tried the examples with the dictionary and with the list. Both work. Thanks a lot, everyone, for your help.
Related
I have a pandas script, shown below, that reads multiple CSV files in a given folder. All the CSV files have a similar format and the same columns.
For a given column (Area), I want to sum all the rows, then save this data into a new CSV file.
This is the code so far.
import pandas as pd
import glob
path = r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000' # path
all_files = glob.glob(path + "/*.csv")
for filename in all_files:
    df = pd.read_csv(filename)
    area_sum = df['Area'].sum()
    print(area_sum)
I could figure this out with an Excel writer, but I want to use 'to_csv', ideally with mode='a' (append), as I have a bunch of folders with the same file names.
The CSV file format I am looking for is as follows:
filename1, filename2, filename3,.....
area_sum1, area_sum2, area_sum3,.....
You could try this:
import pandas as pd
import glob
path = r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000' # path
all_files = glob.glob(path + "/*.csv")
# Create an empty dict
results = {"filename": [], "sum": []}
# Iterate on files and populate the newly created dict
for filename in all_files:
    results["filename"].append(filename)
    df = pd.read_csv(filename)
    results["sum"].append(df['Area'].sum())
# Save to csv file
results = pd.DataFrame(results)
results.to_csv("path_to_file.csv", index=False)
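Since you also want mode='append' across a bunch of folders, here is a hedged sketch that writes one row per folder in the file-names-as-header layout you describe; the folders list and output file name are assumptions:
import os
import glob
import pandas as pd

folders = [r'C:\Users\kundemj\Desktop\Post_Processing\HEA517_2000']  # hypothetical list of folders
out_path = "all_area_sums.csv"  # hypothetical output file
for path in folders:
    sums = {}
    for filename in glob.glob(path + "/*.csv"):
        sums[os.path.basename(filename)] = pd.read_csv(filename)['Area'].sum()
    # one row per folder: file names as the header, sums as the values
    # (assumes each folder contains the same file names)
    row = pd.DataFrame([sums])
    row.to_csv(out_path, mode='a', index=False, header=not os.path.exists(out_path))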
Say I have 200 CSV files. I want to read them all in one go and store each file in a different dataframe, like df1 for the first file and so on up to df200. Doing this manually, with df1 = pd.read_csv(...) repeated 200 times, takes a lot of time. How do I do this using pandas?
I have tried using a for loop, but I'm stuck on the approach.
import pandas as pd
import glob
all_files = glob.glob("file_path" + "/*.csv")
dfs_dict = {}
for idx, filename in enumerate(all_files):
    df = pd.read_csv(filename, index_col=None, header=0)
    dfs_dict["df" + str(idx)] = df
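You can then look up any of them by its generated key, for example (assuming at least one file matched):
print(dfs_dict["df0"].head())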
Try using this:
import pandas as pd
import glob
path = r'path of the folder where all csv exists'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
li will have all the CSVs as dataframes. You can further preprocess them to separate them into different files, or, if all the CSVs have the same columns and you want to concatenate them into a single dataframe, you can use the concat function in pandas on li, as sketched below.
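A minimal sketch of that concatenation, assuming the frames really do share the same columns:
df = pd.concat(li, ignore_index=True)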
import pandas as pd
import os

dfs = []  # empty list of dataframes
dirname = r'path/to/your/files'  # where your files are
for root, folders, files in os.walk(dirname):
    for file in files:
        fp = os.path.join(root, file)
        df = pd.read_csv(fp)
        dfs.append(df)
df = pd.concat(dfs)
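One caveat with the os.walk version above: it reads every file in the tree, so if anything other than CSVs lives there you may want a filter, for example:
for root, folders, files in os.walk(dirname):
    for file in files:
        if not file.endswith('.csv'):
            continue  # skip non-CSV files
        dfs.append(pd.read_csv(os.path.join(root, file)))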
Scenario: I have a list of files in a folder (including the file paths). I am trying to get the content of each of those files into a dataframe (one per file), then perform some operations and later merge these dataframes.
From various other questions on SO, I found multiple ways to iterate over the files in a folder and get the data, but all of those I found read the files in a loop and concatenate them into a single dataframe automatically, which does not work for me.
For example:
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f.endswith(('.xls', '.xlsx'))]
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
or
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xls*"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
The only piece of code I could put together from what I found is:
import os
from os.path import isfile, join
import glob
import pandas as pd

mypath = "/DGMS/Destop/uploaded"
listoffiles = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = (pd.read_excel(f) for f in listoffiles)
These lines run without error, but they appear not to do anything: no variables are created or changed.
Question: What am I doing wrong here? Is there a better way to do this?
You are really close; you need to join all the data by passing the generator to concat. (A generator expression is lazy: it does not read a single file until something consumes it, which is why your lines appeared to do nothing.)
contentdataframes = (pd.read_excel(f) for f in listoffiles)
df = pd.concat(contentdataframes, ignore_index=True)
If you need a list of DataFrames:
contentdataframes = [pd.read_excel(f) for f in listoffiles]
I have multiple CSV files in the same folder, all with the same data columns:
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the CSV files into pandas and add a column with the file name to each line so I can track where it came from later. There seem to be similar threads, but I haven't been able to adapt any of the solutions. This is what I have so far. Merging the data into one dataframe works, but I'm stuck on adding the file name column:
import os
import glob
import pandas as pd
path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in glob.glob(os.path.join(path, '*.csv'))]
list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None))
df = pd.concat(list_)
Instead of using a list, just use DataFrame's append:
df = pd.DataFrame()
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None)
    file_df['file_name'] = file_
    df = df.append(file_df)
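Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat and assign (a sketch, storing just the base name instead of the full path):
import os
df = pd.concat(
    (pd.read_csv(f, sep=';', parse_dates=[0], header=None).assign(file_name=os.path.basename(f))
     for f in all_files),
    ignore_index=True,
)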
I'm trying to extract data from a directory with 12 .txt files. Each file contains 3 columns of data (X, Y, Z) that I want to extract. I want to collect all the data in one dataframe (InfoDF), but so far I have only succeeded in creating a dataframe with all of the X, Y and Z data in the same column. This is my code:
import pandas as pd
import numpy as np
import os
import fnmatch
path = os.getcwd()
file_list = os.listdir(path)
InfoDF = pd.DataFrame()
for file in file_list:
    try:
        if fnmatch.fnmatch(file, '*.txt'):
            filedata = open(file, 'r')
            df = pd.read_table(filedata, delim_whitespace=True, names={'X','Y','Z'})
    except Exception as e:
        print(e)
What am I doing wrong?
df = pd.read_table(filedata, delim_whitespace=True, names={'X','Y','Z'})
This line replaces df at each iteration of the loop; that's why you only have the last one at the end of your program.
What you can do is save all your dataframes in a list and concatenate them at the end:
df_list = []
for file in file_list:
    try:
        if fnmatch.fnmatch(file, '*.txt'):
            # pass the path directly; a list (not a set) keeps the column order stable
            df_list.append(pd.read_table(file, delim_whitespace=True, names=['X', 'Y', 'Z']))
    except Exception as e:
        print(e)
df = pd.concat(df_list)
Alternatively, you can write it as:
df = pd.concat([pd.read_table(file, delim_whitespace=True, names=['X', 'Y', 'Z']) for file in file_list if fnmatch.fnmatch(file, '*.txt')])
I think you need glob to select all the files, create a list of DataFrames (dfs) in a list comprehension, and then use concat:
import glob
import pandas as pd

files = glob.glob('*.txt')
dfs = [pd.read_csv(fp, delim_whitespace=True, names=['X', 'Y', 'Z']) for fp in files]
df = pd.concat(dfs, ignore_index=True)
As camilleri mentions above, you are overwriting df in your loop. Also, there is no point in catching a general exception.
Solution: Create an empty dataframe InfoDF before the loop and then use append or concat to populate it with the smaller dfs:
import pandas as pd
import os
import fnmatch

path = os.getcwd()
file_list = os.listdir(path)

InfoDF = pd.DataFrame(columns=['X', 'Y', 'Z'])  # create empty dataframe (a list keeps column order stable)
for file in file_list:
    if fnmatch.fnmatch(file, '*.txt'):
        df = pd.read_table(file, delim_whitespace=True, names=['X', 'Y', 'Z'])
        InfoDF = InfoDF.append(df, ignore_index=True)  # append returns a new frame, so assign it back
print(InfoDF)
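For what it's worth, since DataFrame.append is gone in pandas 2.0+, a concat-based sketch of the same fix:
import fnmatch
import os
import pandas as pd

path = os.getcwd()
frames = [pd.read_table(f, delim_whitespace=True, names=['X', 'Y', 'Z'])
          for f in os.listdir(path) if fnmatch.fnmatch(f, '*.txt')]
InfoDF = pd.concat(frames, ignore_index=True)
print(InfoDF)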