Read multiple csv files into separate dataframes using pandas - python

I'd like to read two csv files from a particular folder into two separate dataframes.
The two file names are: 23314621_MACI_NAV.CSV and 23314623_MACI_Holding.CSV
The second part of each file name is fixed (MACI_NAV.CSV and MACI_Holding.CSV), but the first part, which is a number, changes every day.
I tried to read them into two different dataframes like this:
import pandas as pd
import glob
msci_folder = 'N:/Operation/Daily CDS E_Report/CDS/MACI/'
mscifile = glob.glob(msci_folder + "*.csv")
for file in mscifile:
    df_nav = pd.read_csv(file)
    df_holding = pd.read_csv(file)
It seems like both lines are reading the same file. How do I make them read different files (the second file as well)?

If you want to create a list of DataFrames:
dfs = []
for file in mscifile:
    df = pd.read_csv(file)
    dfs.append(df)
Or use a list comprehension:
dfs = [pd.read_csv(file) for file in mscifile]
print(dfs[0])
print(dfs[1])
Another solution is to create a dictionary of DataFrames, keyed by the last substring after _ in each filename:
from os.path import splitext, basename
dfs = {splitext(basename(fp))[0].split('_')[-1] : pd.read_csv(fp) for fp in mscifile}
print(dfs)
print(dfs['NAV'])
print(dfs['Holding'])
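For the original two-file case specifically, a more direct route is to glob each fixed suffix on its own. A minimal sketch, assuming exactly one file matches each pattern on a given day:
# pick up each day's file by its fixed suffix, whatever the numeric prefix is
nav_file = glob.glob(msci_folder + "*_MACI_NAV.CSV")[0]
holding_file = glob.glob(msci_folder + "*_MACI_Holding.CSV")[0]
df_nav = pd.read_csv(nav_file)
df_holding = pd.read_csv(holding_file)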

Related

Merge csv files based on file names and suffix in Python

First time poster and fairly new to Python here. I have a collection of 1,700+ csv files with 2 columns each. The number and labels of the rows are the same in every file. The files are named with a specific format. For example:
Species_1_OrderA_1.csv
Species_1_OrderA_2.csv
Species_1_OrderA_3.csv
Species_10_OrderB_1.csv
Species_10_OrderB_2.csv
Each imported dataframe is formatted like so:
TreeID Species_1_OrderA_2
0 Bu2_1201_1992 0
1 Bu3_1201_1998 0
2 Bu4_1201_2000 0
3 Bu5_1201_2002 0
4 Bu6_1201_2004 0
.. ... ...
307 Fi141_16101_2004 0
308 Fi142_16101_2006 0
309 Fi143_16101_2008 0
310 Fi144_16101_2010 0
311 Fi147_16101_2015 0
I would like to join the files that correspond to the same species, based on the first column. So, in the end, I would get the files Species_1_OrderA.csv and Species_10_OrderB.csv. Please note that not all species necessarily have the same number of files.
This is what I have tried so far.
import os
import glob
import pandas as pd
# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}
for file in files:
    key = file
    df = pd.read_csv(file)
    file_dict[key] = df
# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat
keys_list = list(file_dict.keys())
group = ''
for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
print(group)
However, the last bit is not even working, and at this point, I am not sure this is the best way to deal with my problem. Any pointers on how to solve this will be really appreciated.
--- EDIT:
This is the expected output for the files per species. Ideally, I would remove the rows that have zeros in them, but that can easily be done with awk.
TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
Try it like this:
import os
import pandas as pd
path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]
dfs = dict()
for file in files:
    # everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]
    # read the csv into a dataframe
    df = pd.read_csv(os.path.join(path, file))
    # if you don't have a df for this species yet, create a new key
    if species not in dfs:
        dfs[species] = df
    # else, merge the current df into the existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")
# write all dfs to their own csv files
for key in dfs:
    dfs[key].to_csv(f"{key}.csv")
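One small follow-up: to_csv writes the pandas index as an extra leading column by default. If the output files should contain only the original columns (an assumption about the desired format), pass index=False:
# write each species dataframe without the pandas index column
for key in dfs:
    dfs[key].to_csv(f"{key}.csv", index=False)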
If your goal is to concatenate all the CSVs for each species-order into one consolidated csv, this is one approach. I haven't tested it, so there might be a few errors. The idea is to first use glob, as you're doing, to build a dict of file paths so that all the paths for the same species-order are grouped together. Then, for each species-order, read all the data into a single table in memory and write it out to a consolidated file.
import pandas as pd
import glob
#Create a dictionary keyed by species_order, valued by a list of files
#i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']
file_paths_by_species_order = {}
for file_path in glob.glob('*.csv'):
    # join the first three underscore-separated parts into a string so it can be used as a dict key
    species_order = "_".join(file_path.split("_")[:3])
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)
#For each species_order, concat all files and save the info into a new csv
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))
There are definitely improvements that can be made, such as using collections.defaultdict (see the sketch below) and writing one file at a time out to the consolidated file instead of reading them all into memory.
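For instance, a minimal sketch of the defaultdict variant, which drops the if/else branch from the grouping loop:
from collections import defaultdict
import glob
# group file paths by their species-order prefix; missing keys start as empty lists
file_paths_by_species_order = defaultdict(list)
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])
    file_paths_by_species_order[species_order].append(file_path)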

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv','pl_10_11.csv']
names = ['premier10','premier11']
for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note: here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, objects (variables) may not be created in that way (e.g. names[i]).
This is equivalent to 'premier10' = pd.read_csv(...), where 'premier10' is a str type.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
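For example, to pull out a single season's dataframe by its file stem (key name taken from the question's example filenames):
# access one dataframe from the dict by its file stem
premier10 = df_dict['pl_09_10']
print(premier10.head())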
names = ['premier10','premier11'] creates a list, not a dictionary, so names[i] = pd.read_csv(...) only overwrites list elements instead of creating new variables. Replace it with names = dict() and assign each dataframe to a key, e.g. names['premier10'] = pd.read_csv(...).
This is what you want:
import os
import pandas as pd
#create a variable and loop through the contents of the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]
#initialize an empty data frame
all_data = pd.DataFrame()
#iterate through the files and concatenate their contents into the data frame initialized above
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])
#call the new data frame and verify that contents were transferred
all_data.head()
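As a side note, concatenating inside the loop copies all_data on every pass. A sketch of an equivalent version that collects the frames first and concatenates once:
# read every csv, then concatenate in a single call
frames = [pd.read_csv('./your_directory/' + f) for f in files]
all_data = pd.concat(frames, ignore_index=True)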

Read multiple CSV files in pairs of two them compare each two pair

I have a folder with multiple CSV files named like this
CINinfo_2019-08-08_rev1, CINinfo_2019-08-08_rev2, CINinfo_2019-08-08_rev3, CINinfo_2019-08-08_rev4, and so on. I have about 70 files in one folder. My intention is to automate this process so that I can read them automatically in pairs of two, compare each pair for differences, and get the result as one combined table. Currently, I am reading them manually and comparing differences. This is the code:
import pandas as pd
df1 = pd.read_csv("CINinfo_2019-08-08_rev1.csv")
df2 = pd.read_csv("CINinfo_2019-08-08_rev2.csv")
import numpy as np
# the snippet references comparison_values without defining it; presumably an element-wise comparison:
comparison_values = df1.values == df2.values
rows, cols = np.where(comparison_values == False)
for item in zip(rows, cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]], df2.iloc[item[0], item[1]])
This process is tedious, especially since I have other folders with CSV files that I need to read. Note how the CSV files are named: all have the same prefix (CINinfo_2019-08-08_), but the suffix (rev) carries an incremental number from 1 to 70. I need the files read in overlapping pairs: 1 and 2, 2 and 3, 3 and 4, and so on. That is, compare CINinfo_2019-08-08_rev1 with CINinfo_2019-08-08_rev2, then CINinfo_2019-08-08_rev2 with CINinfo_2019-08-08_rev3, and so forth. How can I automate reading these files in pairs, compare each pair for differences, and get one joined table?
You could try something like this:
import os, re
import pandas as pd
import numpy as np
# your directory path here
path = r'path'
# get all csv files (os.walk searches subdirectories too)
file_, pat = [], re.compile(r'\.csv$')
for root, dirs, files in os.walk(path):
    file_.extend(os.path.join(root, f) for f in files if pat.search(f))
# you may want to filter here, this line is just an example
# filter for all csv files containing 'rev'
file_ = [f for f in file_ if 'rev' in f]
# loop through the files of interest, pairing each file with the previous one
for (idx, ff) in enumerate(file_[1:]):
    df1 = pd.read_csv(ff)
    df2 = pd.read_csv(file_[idx])
    # element-wise comparison, as in the question
    comparison_values = df1.values == df2.values
    rows, cols = np.where(comparison_values == False)
    for item in zip(rows, cols):
        # do calculation
        pass
This answer is not all-inclusive, but hopefully it will give you a possible approach. You may need to adjust the filtering, or possibly sort the files (a sketch of a numeric sort is below). I have not shown how to add the results to a final table, but the best approach is to create a temporary DataFrame, assign the values from each file pair to it, and then use pd.concat to add it to a final DataFrame that will contain all results.
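Because the rev numbers run from 1 to 70, a plain lexicographic sort would put rev10 before rev2. A minimal sketch of sorting by the numeric suffix, assuming every matched filename contains revN:
import re
# order CINinfo_2019-08-08_revN files by their numeric rev suffix
file_.sort(key=lambda f: int(re.search(r'rev(\d+)', f).group(1)))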

Read in multiple csv into separate dataframes in Pandas

I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all csv under a directory using os.listdir(dirname) and combine it with os.path.basename to parse the file name.
import os
import pandas as pd
# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]
# stats.csv -> stats
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]
d = {}
for i in range(len(fns)):
    d[fns[i]] = pd.read_csv(csvs[i])
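The same dict can be built in one pass with zip and a dict comprehension, a small stylistic alternative:
# map each file stem to its dataframe
d = {name: pd.read_csv(path) for name, path in zip(fns, csvs)}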
You could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)

How to append multiple CSV files and add an additional column indicating file name in Python?

I have over 20 CSV files in a single folder. All files have the same structure, they just represent different days.
Example:
Day01.csv
Day02.csv
Day03.csv
Day04.csv (and so on...)
The files contain just two numeric columns: x and y. I would like to append all of these csv files together into one large file and add a column for the file name (day). I have explored similar examples to generate the following code, but it adds each y to a separate column (Y1, Y2, Y3, Y4, and so on). I would simply like the appended file to have three columns: x, y, file name. How can I modify the code to do the proper append?
I have tried the code from this example: Read multiple csv files and Add filename as new column in pandas
import pandas as pd
import os
os.chdir('C:....path to my folder')
files = os.listdir()
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files])
However, this code does not append all Y values under one column (all other aspects seem to work, however). Can someone help with the code so that all Y values are under a single column?
The following should work by creating the filename column before appending the dataframe to your list.
import os
import pandas as pd
file_list = []
for file in os.listdir():
    if file.endswith('.csv'):
        # note: sep=";" assumes semicolon-separated files; drop it for comma-separated csv
        df = pd.read_csv(file, sep=";")
        df['filename'] = file
        file_list.append(df)
all_days = pd.concat(file_list, ignore_index=True)
all_days.to_csv("all.txt")
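A quick sanity check after the concat, assuming the columns are named x and y as in the question:
# expect exactly three columns: x, y, filename
print(all_days.columns.tolist())
print(all_days.head())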
Python is great at these simple tasks, almost too good to be true…
fake_files = lambda n: '\n'.join('%d\t%d' % (i, i+1) for i in range(n, n+3))
file_name = 'fake_me%s.csv'
with open('my_new.csv', 'wt') as new:
    for number in range(3):  # os.listdir()
        # with open(number) as to_add:
        #     rows = to_add.readlines()
        rows_fake = fake_files(number*2).split('\n')
        adjusted_rows = [file_name % number + '\t' + row for row in rows_fake]
        new.write('\n'.join(adjusted_rows) + '\n')
With adjustments to your specific I/O and naming, this is all you need. You can just copy the code, run it, and study how it works.
