Read multiple csv files into separate dataframes using pandas - python

I'd like to read two csv files from a particular folder into two separate dataframes.
The two file names are: 23314621_MACI_NAV.CSV and 23314623_MACI_Holding.CSV
The second part of each file name is fixed (MACI_NAV.CSV and MACI_Holding.CSV), but the first part, which is a number, changes every day.
I tried to read them into two different dataframes like this:
import pandas as pd
import glob
msci_folder = 'N:/Operation/Daily CDS E_Report/CDS/MACI/'
mscifile = glob.glob(msci_folder + "*.csv")
for file in mscifile:
    df_nav = pd.read_csv(file)
    df_holding = pd.read_csv(file)
It seems like both lines are reading the same file. How do I make them read different files (the second file as well)?

If you want to create a list of DataFrames:
dfs = []
for file in mscifile:
    df = pd.read_csv(file)
    dfs.append(df)
Or use a list comprehension:
dfs = [pd.read_csv(file) for file in mscifile]
print(dfs[0])
print(dfs[1])
Another solution is to create a dictionary of DataFrames, keyed by the last substring after _ in each filename:
from os.path import splitext, basename
dfs = {splitext(basename(fp))[0].split('_')[-1] : pd.read_csv(fp) for fp in mscifile}
print(dfs)
print(dfs['NAV'])
print(dfs['Holding'])
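For the original two-file case specifically, a more direct route is to glob each fixed suffix on its own. A minimal sketch, assuming exactly one file matches each pattern on a given day:
# pick up each day's file by its fixed suffix, whatever the numeric prefix is
nav_file = glob.glob(msci_folder + "*_MACI_NAV.CSV")[0]
holding_file = glob.glob(msci_folder + "*_MACI_Holding.CSV")[0]
df_nav = pd.read_csv(nav_file)
df_holding = pd.read_csv(holding_file)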

Related

Merge csv files based on file names and suffix in Python

First time poster and fairly new to Python here. I have a collection of 1,700+ csv files with 2 columns each. The number and labels of the rows are the same in every file. The files are named with a specific format. For example:
Species_1_OrderA_1.csv
Species_1_OrderA_2.csv
Species_1_OrderA_3.csv
Species_10_OrderB_1.csv
Species_10_OrderB_2.csv
Each imported dataframe is formatted like so:
TreeID Species_1_OrderA_2
0 Bu2_1201_1992 0
1 Bu3_1201_1998 0
2 Bu4_1201_2000 0
3 Bu5_1201_2002 0
4 Bu6_1201_2004 0
.. ... ...
307 Fi141_16101_2004 0
308 Fi142_16101_2006 0
309 Fi143_16101_2008 0
310 Fi144_16101_2010 0
311 Fi147_16101_2015 0
I would like to join the files that correspond to the same species, based on the first column. So, in the end, I would get the files Species_1_OrderA.csv and Species_10_OrderB.csv. Please note that not all species necessarily have the same number of files.
This is what I have tried so far.
import os
import glob
import pandas as pd
# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}
for file in files:
    key = file
    df = pd.read_csv(file)
    file_dict[key] = df
# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat
keys_list = list(file_dict.keys())
group = ''
for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
print(group)
However, the last bit is not even working, and at this point, I am not sure this is the best way to deal with my problem. Any pointers on how to solve this will be really appreciated.
--- EDIT:
This is the expected output for the files per species. Ideally, I would remove the rows that have zeros in them, but that can easily be done with awk.
TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
Try it like this:
import os
import pandas as pd
path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]
dfs = dict()
for file in files:
    # everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]
    # read the csv into a dataframe
    df = pd.read_csv(os.path.join(path, file))
    # if you don't have a df for this species yet, create a new key
    if species not in dfs:
        dfs[species] = df
    # else, merge the current df into the existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")
# write all dfs to their own csv files
for key in dfs:
    dfs[key].to_csv(f"{key}.csv")
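One small follow-up: to_csv writes the pandas index as an extra leading column by default. If the output files should contain only the original columns (an assumption about the desired format), pass index=False:
# write each species dataframe without the pandas index column
for key in dfs:
    dfs[key].to_csv(f"{key}.csv", index=False)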
If your goal is to concatenate all the CSVs for each species-order into one consolidated csv, this is one approach. I haven't tested it, so there might be a few errors. The idea is to first use glob, as you're doing, to build a dict of file paths so that all the paths for the same species-order are grouped together. Then, for each species-order, read all the data into a single table in memory and write it out to a consolidated file.
import pandas as pd
import glob
#Create a dictionary keyed by species_order, valued by a list of files
#i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']
file_paths_by_species_order = {}
for file_path in glob.glob('*.csv'):
    # join the first three underscore-separated parts into a string so it can be used as a dict key
    species_order = "_".join(file_path.split("_")[:3])
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)
#For each species_order, concat all files and save the info into a new csv
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))
There are definitely improvements that can be made, such as using collections.defaultdict (see the sketch below) and writing one file at a time out to the consolidated file instead of reading them all into memory.
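For instance, a minimal sketch of the defaultdict variant, which drops the if/else branch from the grouping loop:
from collections import defaultdict
import glob
# group file paths by their species-order prefix; missing keys start as empty lists
file_paths_by_species_order = defaultdict(list)
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])
    file_paths_by_species_order[species_order].append(file_path)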

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv','pl_10_11.csv']
names = ['premier10','premier11']
for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note: here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, objects (variables) may not be created in that way (e.g. names[i]).
This is equivalent to 'premier10' = pd.read_csv(...), where 'premier10' is a str type.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
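For example, to pull out a single season's dataframe by its file stem (key name taken from the question's example filenames):
# access one dataframe from the dict by its file stem
premier10 = df_dict['pl_09_10']
print(premier10.head())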
names = ['premier10','premier11'] creates a list, not a dictionary, so names[i] = pd.read_csv(...) only overwrites list elements instead of creating new variables. Replace it with names = dict() and assign each dataframe to a key, e.g. names['premier10'] = pd.read_csv(...).
This is what you want:
import os
import pandas as pd
#create a variable and loop through the contents of the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]
#initialize an empty data frame
all_data = pd.DataFrame()
#iterate through the files and concatenate their contents into the data frame initialized above
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])
#call the new data frame and verify that contents were transferred
all_data.head()
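As a side note, concatenating inside the loop copies all_data on every pass. A sketch of an equivalent version that collects the frames first and concatenates once:
# read every csv, then concatenate in a single call
frames = [pd.read_csv('./your_directory/' + f) for f in files]
all_data = pd.concat(frames, ignore_index=True)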

Read multiple CSV files in pairs of two them compare each two pair

I have a folder with multiple CSV files named like this
CINinfo_2019-08-08_rev1, CINinfo_2019-08-08_rev2, CINinfo_2019-08-08_rev3, CINinfo_2019-08-08_rev4, and so on. I have about 70 files in one folder. My intention is to automate this process so that I can read them automatically in pairs of two, compare each pair for differences, and get the result as one combined table. Currently, I am reading them manually and comparing differences. This is the code:
import pandas as pd
df1 = pd.read_csv("CINinfo_2019-08-08_rev1.csv")
df2 = pd.read_csv("CINinfo_2019-08-08_rev2.csv")
import numpy as np
# the snippet references comparison_values without defining it; presumably an element-wise comparison:
comparison_values = df1.values == df2.values
rows, cols = np.where(comparison_values == False)
for item in zip(rows, cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]], df2.iloc[item[0], item[1]])
This process is tedious, especially since I have other folders with CSV files that I need to read. Note how the CSV files are named: all have the same prefix (CINinfo_2019-08-08_), but the suffix (rev) carries an incremental number from 1 to 70. I need the files read in overlapping pairs: 1 and 2, 2 and 3, 3 and 4, and so on. That is, compare CINinfo_2019-08-08_rev1 with CINinfo_2019-08-08_rev2, then CINinfo_2019-08-08_rev2 with CINinfo_2019-08-08_rev3, and so forth. How can I automate reading these files in pairs, compare each pair for differences, and get one joined table?
You could try something like this:
import os, re
import pandas as pd
import numpy as np
# your directory path here
path = r'path'
# get all csv files (os.walk searches subdirectories too)
file_, pat = [], re.compile(r'\.csv$')
for root, dirs, files in os.walk(path):
    file_.extend(os.path.join(root, f) for f in files if pat.search(f))
# you may want to filter here, this line is just an example
# filter for all csv files containing 'rev'
file_ = [f for f in file_ if 'rev' in f]
# loop through the files of interest, pairing each file with the previous one
for (idx, ff) in enumerate(file_[1:]):
    df1 = pd.read_csv(ff)
    df2 = pd.read_csv(file_[idx])
    # element-wise comparison, as in the question
    comparison_values = df1.values == df2.values
    rows, cols = np.where(comparison_values == False)
    for item in zip(rows, cols):
        # do calculation
        pass
This answer is not all-inclusive, but hopefully it will give you a possible approach. You may need to adjust the filtering, or possibly sort the files (a sketch of a numeric sort is below). I have not shown how to add the results to a final table, but the best approach is to create a temporary DataFrame, assign the values from each file pair to it, and then use pd.concat to add it to a final DataFrame that will contain all results.
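Because the rev numbers run from 1 to 70, a plain lexicographic sort would put rev10 before rev2. A minimal sketch of sorting by the numeric suffix, assuming every matched filename contains revN:
import re
# order CINinfo_2019-08-08_revN files by their numeric rev suffix
file_.sort(key=lambda f: int(re.search(r'rev(\d+)', f).group(1)))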

Read in multiple csv into separate dataframes in Pandas

I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, that allows me to read multiple csv into one dataframe instead of many.
You can list all csv under a directory using os.listdir(dirname) and combine it with os.path.basename to parse the file name.
import os
import pandas as pd
# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]
# stats.csv -> stats
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]
d = {}
for i in range(len(fns)):
    d[fns[i]] = pd.read_csv(csvs[i])
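The same dict can be built in one pass with zip and a dict comprehension, a small stylistic alternative:
# map each file stem to its dataframe
d = {name: pd.read_csv(path) for name, path in zip(fns, csvs)}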
You could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)

How to append multiple CSV files and add an additional column indicating file name in Python?

I have over 20 CSV files in a single folder. All files have the same structure, they just represent different days.
Example:
Day01.csv
Day02.csv
Day03.csv
Day04.csv (and so on...)
The files contain just two numeric columns: x and y. I would like to append all of these csv files together into one large file and add a column for the file name (day). I have explored similar examples to generate the following code, but it adds each y to a separate column (Y1, Y2, Y3, Y4, and so on). I would simply like the appended file to have three columns: x, y, file name. How can I modify the code to do the proper append?
I have tried the code from this example: Read multiple csv files and Add filename as new column in pandas
import pandas as pd
import os
os.chdir('C:....path to my folder')
files = os.listdir()
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files])
However, this code does not append all Y values under one column (all other aspects seem to work, however). Can someone help with the code so that all Y values are under a single column?
The following should work by creating the filename column before appending the dataframe to your list.
import os
import pandas as pd
file_list = []
for file in os.listdir():
    if file.endswith('.csv'):
        # note: sep=";" assumes semicolon-separated files; drop it for comma-separated csv
        df = pd.read_csv(file, sep=";")
        df['filename'] = file
        file_list.append(df)
all_days = pd.concat(file_list, ignore_index=True)
all_days.to_csv("all.txt")
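A quick sanity check after the concat, assuming the columns are named x and y as in the question:
# expect exactly three columns: x, y, filename
print(all_days.columns.tolist())
print(all_days.head())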
Python is great at these simple tasks, almost too good to be true…
fake_files = lambda n: '\n'.join('%d\t%d' % (i, i+1) for i in range(n, n+3))
file_name = 'fake_me%s.csv'
with open('my_new.csv', 'wt') as new:
    for number in range(3):  # os.listdir()
        # with open(number) as to_add:
        #     rows = to_add.readlines()
        rows_fake = fake_files(number*2).split('\n')
        adjusted_rows = [file_name % number + '\t' + row for row in rows_fake]
        new.write('\n'.join(adjusted_rows) + '\n')
With adjustments to your specific I/O and naming, this is all you need. You can just copy the code, run it, and study how it works.
