Using Pandas to concatenate CSV files in directory, recursively - python

Here is a link from a previous post. I am citing P.R.'s response below.
import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
I am wondering how to modify the above, using pandas. Specifically, I am attempting to recursively move through a directory and concatenate all of the CSV headers and their respective row values, then write everything out to one file. Using P.R.'s approach results in all of the headers and their corresponding values being stacked upon each other. My constraints are:
Writing out the headers and their corresponding values (without "stacking"), essentially concatenated one after the other.
If the column headers in one file match another file's, there should be no repetition; only the values should be appended as they are written to the one CSV file.
Since each file has different column headers and a different number of them, all of these should be added. Nothing should be deleted.
I have tried the following as well:
import pandas as pd
import csv
import glob
import os

path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('csv')]
for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    df.to_csv('out.csv', mode='a')
Here are two sample CSVs:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors
12821767,Query,,,,,,,,,,,
and
Type,ID,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,ISO,MID,Pass,TID,CID,Errors
UMember,12822909,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,,,,,,
Based on the above two exemplars, the output should be something along the lines of:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
12821767,Query,,,,,,,,,,,,,,,,,,,,,,,,,, etc.
(all of the header columns in the second sample should be filled in with the delimiter ',' for the second row where there is no corresponding header in the first sample)
As one can see, the second sample has more column headers, and some of the headers are the same (but in a different order). I am trying to combine all of these, along with their values, following the above requirements. I am wondering whether the best method is a merge, or a custom function built on top of one of pandas' built-in methods?

A non-pandas approach that uses an OrderedDict and the csv module:
from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
header = OrderedDict()
data = []
for filename in files:
    with open(filename, 'r', newline='') as fin:
        csvin = csv.DictReader(fin)
        try:
            # register any new fieldnames, preserving first-seen order
            header.update(OrderedDict.fromkeys(csvin.fieldnames))
            data.append(next(csvin))
        except TypeError:
            print(filename, 'was empty')
        except StopIteration:
            print(filename, "didn't contain a row")

with open('output_filename.csv', 'w', newline='') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    csvout.writerows(data)
Given your example input, this gives you:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12821767,Query,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
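A side note on the above: on Python 3.7+ a plain dict also preserves insertion order, so the OrderedDict is optional. A minimal sketch of that variant (the fieldname lists below are stand-ins for csvin.fieldnames):
# Python 3.7+ dicts preserve insertion order, so a plain dict works too
header = {}
for fieldnames in (['ID', 'Type'], ['Type', 'ID', 'CC']):  # stand-ins for csvin.fieldnames
    header.update(dict.fromkeys(fieldnames))
print(list(header))  # ['ID', 'Type', 'CC'], first-seen order kept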

In pandas, you can easily both append column names and reorder the data frame. See the pandas documentation on merging frames. To append frames and re-order them you could use the following; re-indexing is as simple as using a list of column names (more on that below).
import os
import pandas as pd

directory = './'  # folder containing the csv files
dfList = []
for filename in [directory + x for x in os.listdir(directory)]:
    dfList.append(pd.read_csv(filename))
df = pd.concat(dfList)
df.to_csv('out.csv', mode='w')
With a list comprehension, this would be:
import os
import pandas as pd

directory = './'
pd.concat(
    [pd.read_csv(directory + x) for x in os.listdir(directory) if x.endswith("csv")]
).to_csv('out.csv', mode='w')
If you want to reindex anything, just use a list.
cols = sorted(df.columns)
df = df[cols]
# or
df = df[sorted(df.columns)]
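Since the question also asks for a recursive walk, here is a hedged sketch of the same idea using glob's recursive pattern (Python 3.5+). pd.concat aligns the union of columns by name and fills the gaps with NaN, which to_csv writes as empty fields, matching the desired output above:
import glob
import pandas as pd

# '**' with recursive=True descends into subdirectories (Python 3.5+)
files = sorted(glob.glob('**/*.csv', recursive=True))
frames = [pd.read_csv(f) for f in files]

# concat aligns columns by name; sort=False keeps first-seen column order
full_df = pd.concat(frames, ignore_index=True, sort=False)
full_df.to_csv('out.csv', index=False)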

Related

Sort out columns of multiple csv files at once in Python

I'd really appreciate your help. I have around 200 csv files with the same header, e.g. x, y, z, time, id, type. I would like to sort the 'time' column of all the csv files and save them again. This is what I have tried so far, but it doesn't work. Could you please help me? Thank you.
import csv
import operator
import glob
import pandas as pd

data = dict()  # filename : lists
path = "./*.csv"
files = glob.glob(path)
for filename in files:
    # process each file
    with open(filename, 'r') as f:
        # read file to a list of lists
        lists = [row for row in csv.reader(f, delimiter=',')]
    # sort and save into a dict
    sorted_df = lists.sort_values(by=["time"], ascending=True)
    sorted_df.to_csv('%.csv', index=False)
I don't have much knowledge about the csv module, but you're already using pandas and it supports reading csv files with pd.read_csv, so why not utilize that?
for filename in files:
    df = pd.read_csv(filename)
    df.sort_values('time', inplace=True)
    df.to_csv(filename, index=False)
This would overwrite all the files with the same data sorted by time.
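If overwriting in place feels risky, a hedged variant that writes the sorted copies into a separate folder instead; the 'sorted' directory name is just an illustration:
import os
import glob
import pandas as pd

os.makedirs('sorted', exist_ok=True)  # illustrative output directory
for filename in glob.glob('./*.csv'):
    df = pd.read_csv(filename)
    df = df.sort_values('time')
    # keep the original file name, but write it under sorted/
    df.to_csv(os.path.join('sorted', os.path.basename(filename)), index=False)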

Pandas - Trying to store multiple .txt files in a .csv

I have a folder with about 500 .txt files. I would like to store the content in a csv file, with 2 columns, column 1 being the name of the file and column 2 being the file content in string. So I'd end up with a CSV file with 501 rows.
I've snooped around SO and tried to find similar questions, and came up with the following code:
import pandas as pd
from pandas.io.common import EmptyDataError
import os

def Aggregate_txt_csv(path):
    for files in os.listdir(path):
        with open(files, 'r') as file:
            try:
                df = pd.read_csv(file, header=None, delim_whitespace=True)
            except EmptyDataError:
                df = pd.DataFrame()
    return df.to_csv('file.csv', index=False)
However it returns an empty .csv file. Am I doing something wrong?
There are several problems in your code. One of them is that pd.read_csv is not opening the file because you're not passing the path to the given file. I think you should start from this code:
import os
import pandas as pd
from pandas.errors import EmptyDataError  # lived in pandas.io.common in very old versions

def Aggregate_txt_csv(path):
    files = os.listdir(path)
    df = []
    for file in files:
        try:
            d = pd.read_csv(os.path.join(path, file), header=None, delim_whitespace=True)
            d["file"] = file
        except EmptyDataError:
            d = pd.DataFrame({"file": [file]})
        df.append(d)
    df = pd.concat(df, ignore_index=True)
    df.to_csv('file.csv', index=False)
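For completeness, a hedged usage sketch; 'my_txt_folder' is just a placeholder path:
# call the function on a folder of whitespace-delimited .txt files (path is illustrative)
Aggregate_txt_csv('my_txt_folder')
print(pd.read_csv('file.csv').head())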
Use pathlib's Path.glob() to find all the files.
When using path objects, file.stem returns the file name from the path.
Use pandas.concat to combine the dataframes in df_list.
from pathlib import Path
import pandas as pd

p = Path('e:/PythonProjects/stack_overflow')  # path to files
files = p.glob('*.txt')  # get all txt files

df_list = list()  # create an empty list for the dataframes
for file in files:  # iterate through each file
    with file.open('r') as f:
        # join all rows in the list as a single string separated with \n
        text = '\n'.join([line.strip() for line in f.readlines()])
    # create and append a dataframe
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]}))

df_all = pd.concat(df_list)  # concat all the dataframes
df_all.to_csv('files.csv', index=False)  # save to csv
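Since the stored contents contain embedded newlines, pandas quotes those fields on write; a quick hedged check that the csv round-trips cleanly:
# read the aggregated file back; quoted multi-line 'contents' fields survive the round trip
check = pd.read_csv('files.csv')
print(check.shape)    # (number_of_txt_files, 2)
print(check.columns)  # Index(['filename', 'contents'], dtype='object')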
I noticed there's already an answer, but I've gotten it to work with a relatively simple piece of code. I've only edited the file read-in a little bit, and the dataframe is outputting successfully.
import pandas as pd
import os

def Aggregate_txt_csv(path):
    result = []
    print(os.listdir(path))
    for files in os.listdir(path):
        fullpath = os.path.join(path, files)
        if not os.path.isfile(fullpath):
            continue
        with open(fullpath, 'r', errors='replace') as file:
            # an empty file simply yields an empty string here
            content = '\n'.join(file.readlines())
            result.append({'title': files, 'body': content})
    df = pd.DataFrame(result)
    return df

df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv')
Most importantly here, I am appending to a list so as not to run pandas' concatenate function too often, as that would be pretty bad for performance. Additionally, reading in the file should not need read_csv, as there isn't a set format for the file; using '\n'.join(file.readlines()) reads the file plainly and pulls all of its lines into one string. At the end, I convert the list of dictionaries into a final dataframe and return the result.
EDIT: for paths that aren't the current directory, I updated the code to join the path so that it can find the necessary files; apologies for the confusion.

Read and save multiple csv files from a for-loop

I am trying to read multiple csv files from a list of file paths and save them all as separate pandas dataframes.
I feel like there should be a way to do this, however I cannot find a succinct explanation.
import pandas as pd

data_list = [['df_1', 'filepath1.csv'],
             ['df_2', 'filepath2.csv'],
             ['df_3', 'filepath3.csv']]

for name, filepath in data_list:
    name = pd.read_csv(filepath)
I have also tried:
data_list = [[df_1, 'filepath1.csv'], [df_2, 'filepath2.csv'],
             [df_3, 'filepath3.csv']]

for name, filepath in data_list:
    name = pd.read_csv(filepath)
I would like to be able to call each dataframe by its assigned name.
Ex):
df_1.head()
df_dct = {name:pd.read_csv(filepath) for name, filepath in data_list}
would create a dictionary of DataFrames. This may help you organize your data.
You may also want to look into glob.glob to create your list of files. For example, to get all CSV files in a directory:
file_paths = glob.glob(my_file_dir+"/*.csv")
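Putting the two together, a hedged usage sketch; 'data' stands in for your directory and the dictionary keys come from the file names:
import glob
import os
import pandas as pd

my_file_dir = 'data'  # illustrative directory name
data_list = [(os.path.splitext(os.path.basename(p))[0], p)
             for p in glob.glob(os.path.join(my_file_dir, '*.csv'))]

df_dct = {name: pd.read_csv(filepath) for name, filepath in data_list}
print(df_dct['filepath1'].head())  # access a frame by its file's base name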
I recommend numpy. Read the csv files with numpy:
from numpy import genfromtxt

my_data = genfromtxt('my_file.csv', delimiter=',')
You will get ndarrays. After that you can wrap them in pandas.
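A hedged sketch of that wrapping step; note that genfromtxt yields NaN for non-numeric fields unless you configure dtypes, so this suits numeric csvs:
import pandas as pd
from numpy import genfromtxt

# names=True takes field names from the csv header row (numeric data assumed)
arr = genfromtxt('my_file.csv', delimiter=',', names=True)
df = pd.DataFrame(arr)  # structured array -> one DataFrame column per field
print(df.head())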
You can make use of a dictionary for this...
import pandas as pd

data_list = ['filepath1.csv', 'filepath2.csv', 'filepath3.csv']

d = {}
for idx, filepath in enumerate(data_list):
    file_name = "df" + str(idx)
    d[file_name] = pd.read_csv(filepath)
Here d is the dictionary which contains all your dataframes.
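A quick hedged usage note, assuming the file paths above exist; the frames are then addressed by their generated keys:
print(list(d))          # ['df0', 'df1', 'df2']
print(d["df0"].head())  # the first file's dataframe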

How to select unique values from named column in multiple .csv files?

I am trying to create a list of unique ID's from multiple csvs.
I have around 80 csvs containing data, all in the same format and in the same directory. The files contain time series data from around 1500 sites, but not all sites are in all files. The column with the data I need is called 'Site Id'.
I can get unique values from the first csv by creating a dataframe, but I can't see how to loop through all the remaining files.
If it's not obvious by now I am a complete beginner and my tutors are on vacation!
I've tried creating a df for a single file, but I can't figure out the next step.
df = pd.read_csv(r'C:filepathhere.csv')
ids = df['Site Id'].unique().tolist()
You can do something like this. I used the os.listdir function to get all of the files, and then list.extend to merge the site IDs I came across into my siteIDs list. Finally, turning a list into a set and then back into a list removes any duplicate entries.
import os
import pandas as pd

siteIDs = []
directoryToCSVs = r'c:\...'
for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        df = pd.read_csv(os.path.join(directoryToCSVs, filename))
        siteIDs.extend(df['Site Id'].tolist())

# remove duplicate site IDs
siteIDs = list(set(siteIDs))
# siteIDs now contains a list of the unique site IDs across all of your CSV files.
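As an alternative, a hedged pandas-only sketch of the same idea: concatenate just the 'Site Id' columns and let unique() do the de-duplication (the directory name is illustrative):
import os
import glob
import pandas as pd

directoryToCSVs = 'data'  # illustrative path
ids = pd.concat(
    pd.read_csv(f, usecols=['Site Id'])['Site Id']
    for f in glob.glob(os.path.join(directoryToCSVs, '*.csv'))
).unique().tolist()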
You could do something like this to iterate over all your CSVs and load them into dataframes:
from os import walk, path
from glob import glob
import pandas as pd

csv_dir = 'Path to CSV dir'  # named so it doesn't shadow os.path
csv_paths = []
for root, dirs, files in walk(csv_dir):
    for c in glob(path.join(root, '*.csv')):
        csv_paths.append(c)

for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # do something with df (append, export, etc.)
First you need to gather the files into a list that you will be getting data out of. There are many ways to do this, assuming you know the directory they are all in, see this answer for many options.
from os import walk

f = []
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
Then within that list you'll need to gather those unique values that you need. Without using Pandas, since it doesn't seem like you actually need your information in a dataframe:
import csv

unique_data = {}
for file in f:
    with open(file, 'r', newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # use the 'Site Id' value as a dictionary key; duplicates collapse
            unique_data[row['Site Id']] = 0

# unique_data.keys() now holds your unique values; if you want a true list
unique_data_list = list(unique_data.keys())

Copy column,add some text and write in a new csv file

I want to make a script that would copy the 2nd column from multiple csv files in a folder, add some text, and save it all to a single csv file.
Here is what I want to do:
1.) Grab the data in the 2nd column from all csv files
2.) Append the text "Hello" and "welcome" to each row, at the start and the end
3.) Write the data into a single file
I tried creating it using pandas
import os
import pandas as pd

dataframes = [pd.read_csv(p, index_col=2, header=None) for p in ('1.csv', '2.csv', '3.csv')]
merged_dataframe = pd.concat(dataframes, axis=0)
merged_dataframe.to_csv("all.csv", index=False)
The problems are:
In the above code I am forced to mention the file names manually, which is very difficult; as a solution I need to include all csv files (*.csv).
I need to use something like writer.writerow("Hello" + r[1] + "welcome").
As there are multiple csv files with many rows (around 100k) in each file, I need it to be fast as well.
Here is a sample of the csv files:
"1.csv" "2.csv" "3.csv"
a,Jac b,William c,James
And here is how I would like the output (all.csv) to look:
Hello Jac welcome
Hello William welcome
Hello James welcome
Any solution using .merge() .append() or .concat() ??
How can I achieve this using python ?
You don't need pandas for this. Here's a really simple way of doing it with the csv module:
import csv
import glob

with open("path/to/output", 'w') as outfile:
    for fpath in glob.glob('path/to/directory/*.csv'):
        with open(fpath) as infile:
            for row in csv.reader(infile):
                outfile.write("Hello {} welcome\n".format(row[1]))
1) If you would like to import all .csv files in a folder, you can just use
for i in [a for a in os.listdir() if a[-4:] == '.csv']:
    # code to read in the .csv file and concatenate it to an existing dataframe
2) To append the text and write to a file, you can map a function to each element of the dataframe's column 2 to add the text.
#existing dataframe called df
df[df.columns[1]].map(lambda x: "Hello {} welcome".format(x)).to_csv(<targetpath>)
#replace <targetpath> with your target path
See http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.to_csv.html for all the various parameters you can pass in to to_csv.
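Putting the pieces together, a hedged end-to-end sketch using .concat() as the question requested; the glob pattern and column position are assumptions based on the samples above:
import glob
import pandas as pd

# read every csv in the folder; header=None because the samples have no header row
frames = [pd.read_csv(p, header=None) for p in glob.glob('*.csv')]
merged = pd.concat(frames, ignore_index=True)

# column 1 holds the names in the samples (a,Jac -> Jac)
out = merged[1].map("Hello {} welcome".format)
out.to_csv('all.csv', index=False, header=False)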
Here is a non-pandas solution using the built-in csv module. Not sure about speed.
import os
import csv

path_to_files = "path to files"
all_csv = os.path.join(path_to_files, "all.csv")
file_list = os.listdir(path_to_files)

names = []
for file in file_list:
    if file.endswith(".csv"):
        path_to_current_file = os.path.join(path_to_files, file)
        with open(path_to_current_file, "r") as current_csv:
            reader = csv.reader(current_csv, delimiter=',')
            for row in reader:
                names.append(row[1])

with open(all_csv, "w", newline="") as out_csv:
    writer = csv.writer(out_csv, delimiter=',')
    for name in names:
        writer.writerow(["Hello {} welcome".format(name)])
