Read and save multiple csv files from a for-loop - python

I am trying to read multiple csv files from a list of file paths and save them all as separate pandas dataframes.
I feel like there should be a way to do this; however, I cannot find a succinct explanation.
import pandas as pd

data_list = [['df_1', 'filepath1.csv'],
             ['df_2', 'filepath2.csv'],
             ['df_3', 'filepath3.csv']]

for name, filepath in data_list:
    name = pd.read_csv(filepath)
I have also tried:
data_list = [[df_1, 'filepath1.csv'], [df_2, 'filepath2.csv'],
             [df_3, 'filepath3.csv']]

for name, filepath in data_list:
    name = pd.read_csv(filepath)
I would like to be able to call each dataframe by its assigned name.
For example:
df_1.head()

df_dct = {name:pd.read_csv(filepath) for name, filepath in data_list}
would create a dictionary of DataFrames. This may help you organize your data.
You may also want to look into glob.glob to create your list of files. For example, to get all CSV files in a directory:
import glob
file_paths = glob.glob(my_file_dir + "/*.csv")
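A minimal sketch combining the two ideas (my_file_dir is a placeholder for your directory of csv files, and keying each DataFrame by its file stem is just one possible choice):

import glob
import os
import pandas as pd

# my_file_dir is a placeholder; point it at your directory of csv files
file_paths = glob.glob(os.path.join("my_file_dir", "*.csv"))

# key each DataFrame by the file name without its .csv extension
df_dct = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p)
          for p in file_paths}

You can then access any single frame by its stem, e.g. df_dct['filepath1'].head() for a file named filepath1.csv.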

I recommend numpy. Read the csv files with numpy:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
You will get ndarrays. After that you can wrap them in pandas DataFrames.
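A minimal sketch of that last step, assuming my_file.csv is numeric with a header row (names=True tells genfromtxt to use the first row as field names):

from numpy import genfromtxt
import pandas as pd

# read the csv as a structured array; the first row supplies the field names
my_data = genfromtxt('my_file.csv', delimiter=',', names=True)

# pandas accepts structured arrays directly; field names become columns
df = pd.DataFrame(my_data)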

You can make use of a dictionary for this...
import pandas as pd

data_list = ['filepath1.csv', 'filepath2.csv', 'filepath3.csv']

d = {}
for i, filepath in enumerate(data_list):
    file_name = "df" + str(i)
    d[file_name] = pd.read_csv(filepath)
Here d is the dictionary which contains all your dataframes.
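As a usage example (the keys follow the numbering in the loop above, so they are df0, df1, df2):

d["df0"].head()  # first rows of the DataFrame read from filepath1.csv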

Related

Python - for all csv files in folder make new csv file with header names and the file they came from

I have code that writes to five csv files, and after all of the csv files are created, I would like to run a function that puts all of the headers into a csv or xlsx file, where each row represents a header in a file.
So in a folder called "Example" there are 5 csv files, called "1.csv", "2.csv", ..., "5.csv"; the code I would like to have would create a new file called "Headers of files in Example", where the first column is the name of the csv file the header came from, and the second column contains the headers. Ultimately it would look like the contents of "Headers of files in Example", where the headers of 1.csv are a, b, c and so on.
My python coding is fairly basic at this point, but I definitely think what I would like to do is possible. Any suggestions to help would be greatly appreciated!
After some more digging I was able to find some code that did what I wanted it to, after some slight modifications:
import csv
import glob
import pandas as pd

def headers():
    path = r'path to folder containing csv files/'
    all_files = glob.glob(path + "*.csv")
    files = all_files
    myheaders = ['filename', 'header']
    with open("Headers of foldername.csv", "w", newline='') as fw:
        cw = csv.writer(fw, delimiter=",")
        for filename in files:
            with open(filename, 'r') as f:
                cr = csv.reader(f)
                # write one row per header: [source file, column name]
                for column_name in (x.strip() for x in next(cr)):
                    cw.writerow([filename, column_name])
    # re-read without a header row, then write it back with proper column names
    file = pd.read_csv("Headers of foldername.csv", header=None)
    file.to_csv("Headers of foldername.csv", header=myheaders, index=False)
Given that you have the DataFrames in memory, you just need to build a new summary DataFrame. I like to start from a dictionary of lists; then, for each file/DataFrame, you extract the columns and append them to that dictionary.
Later you can save the new DataFrame to a file.
import pandas as pd

# list_of_files holds the DataFrames already in memory,
# list_of_names their corresponding file names
summary_df = {
    'file_name': list(),
    'headers': list()}

for file, filename in zip(list_of_files, list_of_names):
    aux_headers = file.columns.to_list()
    summary_df['headers'] += aux_headers
    summary_df['file_name'] += [filename] * len(aux_headers)

summary_df = pd.DataFrame(summary_df)
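To complete the saving step mentioned above (the output file name here is only an example):

summary_df.to_csv('headers_summary.csv', index=False)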
I hope this piece of code helps. Essentially it iterates over all the files you want (their names are in file_names) and reads each one with pandas. Once a csv is loaded, it extracts the headers with df.columns and stores them in a list, which is then saved as a new csv by pandas.
import pandas as pd

header_names = []
file_names = ['1.csv', '2.csv']

for file_name in file_names:
    df = pd.read_csv(file_name)
    header_names.extend(list(df.columns))

new_df = pd.DataFrame(header_names)
new_df.to_csv("headers.csv")
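To also keep track of which file each header came from, as the question asks, a small extension of the same idea (a sketch, reusing the file_names list above):

import pandas as pd

rows = []
file_names = ['1.csv', '2.csv']
for file_name in file_names:
    df = pd.read_csv(file_name)
    # one row per header: [source file, header name]
    rows.extend([file_name, col] for col in df.columns)

pd.DataFrame(rows, columns=['filename', 'header']).to_csv('headers.csv', index=False)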

Sort out columns of multiple csv files at once in Python

I really appreciate your help.
I have around 200 csv files with the same header.
E.g. the headers are x, y, z, time, id, type.
I would like to sort all the csv files by their time column and save them again.
This is so far I have tried. But it doesn't work.
Could you please help me ??
Thank you
import csv
import operator
import glob
import pandas as pd

data = dict()  # filename : lists
path = "./*.csv"
files = glob.glob(path)

for filename in files:
    # process each file
    with open(filename, 'r') as f:
        # read file to a list of lists
        lists = [row for row in csv.reader(f, delimiter=',')]
        # sort and save into a dict
        sorted_df = lists.sort_values(by=["time"], ascending=True)
        sorted_df.to_csv('%.csv', index=False)
I don't have much knowledge about the csv module, but you're already using pandas, and it supports reading csv files with pd.read_csv, so why not utilize that:
for filename in files:
    df = pd.read_csv(filename)
    df.sort_values('time', inplace=True)
    df.to_csv(filename, index=False)
This would overwrite all the files with the same data, sorted by time.
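If you would rather not overwrite the originals, a sketch that writes the sorted copies to a separate directory instead (the sorted_out directory name is just an assumption for illustration):

import os
import glob
import pandas as pd

os.makedirs('sorted_out', exist_ok=True)
for filename in glob.glob('./*.csv'):
    df = pd.read_csv(filename)
    df.sort_values('time', inplace=True)
    # keep the original file untouched; write the sorted copy elsewhere
    df.to_csv(os.path.join('sorted_out', os.path.basename(filename)), index=False)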

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd

file_names = ['pl_09_10.csv', 'pl_10_11.csv']
names = ['premier10', 'premier11']

for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note, here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, objects (variables) may not be created in that way: names[i] = pd.read_csv(...) only replaces the string 'premier10' inside the list with a DataFrame.
It does not create a variable named premier10; 'premier10' is just a str inside the list.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
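As a usage example, the keys of df_dict are the file stems, so the frame for pl_09_10.csv (the file name from the question) is reachable as:

premier10 = df_dict['pl_09_10']
premier10.head()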
names = ['premier10','premier11'] does not create a dictionary but a list. Simply replace it with names = dict() and assign each dataframe to a key, e.g. names['premier10'] = pd.read_csv(...); see the sketch below.
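A minimal sketch of that suggestion, reusing the file_names list from the question and zipping it with the intended names:

import pandas as pd

file_names = ['pl_09_10.csv', 'pl_10_11.csv']
keys = ['premier10', 'premier11']

names = dict()
for key, file_name in zip(keys, file_names):
    # store each dataframe under its intended name as a dict key
    names[key] = pd.read_csv('./premier_league/{}'.format(file_name))

# access by name, e.g. names['premier10'].head()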
This is what you want:
import os
import pandas as pd

# create a variable and look through contents of the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]

# initialize an empty data frame
all_data = pd.DataFrame()

# iterate through files and their contents, then concatenate their data into the data frame initialized above
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])

# call the new data frame and verify that contents were transferred
all_data.head()
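A note on the design: concatenating inside the loop copies all_data on every iteration, so for many files it is usually faster to collect the frames in a list and concatenate once (a sketch under the same ./your_directory assumption):

import os
import pandas as pd

files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]
# read every file first, then concatenate a single time
all_data = pd.concat([pd.read_csv(os.path.join('./your_directory', f)) for f in files])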

Read in multiple csv into separate dataframes in Pandas

I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all csv under a directory using os.listdir(dirname) and combine it with os.path.basename to parse the file name.
import os
import pandas as pd

# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]

# stats.csv -> stats
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]

d = {}
for i in range(len(fns)):
    d[fns[i]] = pd.read_csv(csvs[i])
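Equivalently, once fns and csvs line up, the loop collapses to a dict comprehension with zip (no indexing needed):

d = {fn: pd.read_csv(csv_path) for fn, csv_path in zip(fns, csvs)}

Then, for example, d['status'].head() gives the dataframe read from status.csv.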
you could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)

Using Pandas to concatenate CSV files in directory, recursively

Here is a link from a previous post. I am citing P.R.'s response below.
import pandas as pd
import glob

interesting_files = glob.glob("*.csv")

df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
I am wondering how to modify the above, using pandas. Specifically, I am attempting to recursively move through a directory and concatenate all of the CSV headers and their respective row values, and then write it all out in one file. Using P.R.'s approach results in all of the headers and their corresponding values being stacked upon each other. My constraints are:
Writing out the headers and their corresponding values (without "stacking"), essentially concatenated one after the other
If the column headers in one file match another file's, there should be no repetition; only the values should be appended as they are written to the one CSV file.
Since each file has different column headers and a different number of column headers, these should all be added. Nothing should be deleted.
I have tried the following as well:
import pandas as pd
import csv
import glob
import os

path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('csv')]

for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    df.to_csv('out.csv', mode='a')
Here are two sample CSV:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors
12821767,Query,,,,,,,,,,,
and
Type,ID,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,ISO,MID,Pass,TID,CID,Errors
UMember,12822909,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,,,,,,
Based on the above two exemplars, the output should be something along the lines of:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
12821767,Query ,,,,,,,,,,,,,,,,,,,,,,,,, etc.
(all of the header columns in the second sample should be filled in with the delimiter ',' for the second row where there is no corresponding header in the first sample)
As one can see, the second sample has more column headers. Moreover, some of the headers are the same (but in a different order). I am trying to combine all of these, along with their values, following the above requirements. I am wondering whether the best method is to merge, or to perform a customizable function on top of a built-in pandas method?
A non-pandas approach that uses an OrderedDict and the csv module.
from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
header = OrderedDict()
data = []

for filename in files:
    with open(filename, 'r', newline='') as fin:
        csvin = csv.DictReader(fin)
        try:
            # record every field name once, preserving first-seen order
            header.update(OrderedDict.fromkeys(csvin.fieldnames))
            data.append(next(csvin))
        except TypeError:
            print(filename, 'was empty')
        except StopIteration:
            print(filename, "didn't contain a row")

with open('output_filename.csv', 'w', newline='') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    csvout.writerows(data)
Given your example input, this gives you:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12821767,Query,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
In pandas, you can both append column names and reorder the data frame easily. See this article on merging frames.
To append frames and re-order them you could use the following. Re-indexing is as simple as using a list. There are more solutions here.
import os
import pandas as pd

# directory and path are placeholders pointing at the folder of csv files
dfList = []
for filename in [directory + x for x in os.listdir(path)]:
    dfList.append(pd.read_csv(filename))

df = pd.concat(dfList)
df.to_csv('out.csv', mode='w')
With a list comprehension, this would be:
import os
import pandas as pd

pd.concat([pd.read_csv(directory + x)
           for x in os.listdir(path) if x.endswith("csv")]).to_csv('out.csv', mode='w')
If you want to reindex anything just use a list.
cols = sorted(list(df.columns.values))
df = df[cols]
# or
df = df[sorted(list(df.columns.values))]
