Pandas reading multiple files from different folders - python

I have the same file with quarterly data saved in different folders, one per quarter: a quarter 1 folder, quarter 2, quarter 3 and quarter 4. The quarter is the only difference in the file path. I am looking to read all four files in and concatenate them into one DataFrame. I can do this manually using a version of the simplified code below, changing the period each time.
period = 'Q1'
filepath = 'filepath/' + period
file = filepath + '/file.xls'
df = pd.read_excel(file)
I would like to automate it with some form of for loop (I assume) that loops through the 4 periods, reads each file into a DataFrame and then concatenates them. I have read other answers on how this can be done with files in the same folder, but I am struggling to do it where the file path changes. Manually putting the files into the same folder is not a desirable solution.
I tried making period a tuple and a list containing all 4 periods then a simple for loop but this didn’t work. I got the following error message.
TypeError: Can't convert 'list' object to str implicitly
Greatly appreciate any advice.
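The error message suggests that a string was being joined to the whole list rather than to one period at a time. The question does not show the failing loop, so the following is only a minimal sketch of that assumed failure, next to the per-item alternative:

```python
periods = ["Q1", "Q2", "Q3", "Q4"]

# Joining a str to the whole list fails, because + is not defined for str and list.
try:
    filepath = "filepath/" + periods
except TypeError as exc:
    error = exc

# Building one path per period works, because each p is a str.
files = ["filepath/" + p + "/file.xls" for p in periods]
print(files)
```

The exact wording of the TypeError differs between Python versions, but the cause is the same.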

How about you first use list comprehension to get a list of all files:
periods = ["Q1", "Q2", "Q3", "Q4"]
files = ["filepath/"+ p + "/file.xls" for p in periods]
and then load them all into a list of data frames with
dfs = []
for f in files:
    df = pd.read_excel(f)
    dfs.append(df)

You can use these loops to create full file paths and to iterate over them to create one DataFrame containing all the files.
filepath = 'path/'
file = 'file.xlsx'
periods = ['Q1', 'Q2', 'Q3', 'Q4']
files = []
for p in periods:
    files.append(filepath + p + '/' + file)
files
data = []
for f in files:
    data.append(pd.read_excel(f))
df = pd.concat(data)

You probably want something like this:
periods = ['Q1', 'Q2', 'Q3', 'Q4']
df = None
for period in periods:
    filepath = 'filepath/' + period
    file = filepath + '/file.xls'
    if df is None:
        df = pd.read_excel(file)
    else:
        # DataFrame.append was removed in pandas 2.0 and never modified df in place
        df = pd.concat([df, pd.read_excel(file)])

You could try something like this:
complete_df = pd.DataFrame()
for i in range(1, 5):
    quarter = 'Q' + str(i)
    filepath = 'filepath/' + quarter
    file = filepath + '/file.xls'
    df = pd.read_excel(file)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    complete_df = pd.concat([complete_df, df])
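The answers above can be folded into one helper that also records which quarter each row came from. This is only a sketch: the function name read_periods and the reader parameter are hypothetical, and the "period" column is an addition that the question did not ask for.

```python
from pathlib import Path

import pandas as pd


def read_periods(base_dir, filename, periods, reader=pd.read_excel):
    """Read base_dir/<period>/filename for each period and stack the results."""
    frames = {}
    for period in periods:
        path = Path(base_dir) / period / filename
        frames[period] = reader(path)
    # Passing a dict labels each block with its key; names= calls that index level "period".
    combined = pd.concat(frames, names=["period"])
    # Move the period label out of the index into an ordinary column and renumber the rows.
    return combined.reset_index(level="period").reset_index(drop=True)
```

With the layout from the question, a call might look like read_periods("filepath", "file.xls", ["Q1", "Q2", "Q3", "Q4"]). Collecting the frames first and concatenating once avoids copying the growing result on every iteration.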

Related

Compare multiple CSV files by row and delete files not needed

I am comparing multiple CSV files against a master file by the values in a selected column, and want to keep only the file that has the most matches with the master file.
The code I have written gives me the result for each file, but I don't know how to compare the files with each other and keep only the one with the highest sum at the end.
I know how to delete files via os.remove() and so on, but need help with selecting the maximum value.
data0 = pd.read_csv('input_path/master_file.csv', sep=',')
csv_files = glob.glob(fr'path_to_files_in_comparison\**\*.csv', recursive=True)
for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    # named summe rather than sum, so the built-in sum() is not overwritten
    summe = str(match1.sum())
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + summe)
The print gives me (paths and directories names appear correct):
Matches between ... & ...: 332215
Matches between ... & ...: 273239
Had the idea to try it via sub-lists, but just did not get anywhere.
You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:
def match_score(csv_file):
    df_base = os.path.basename(csv_file)
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()
Then,
csv_files = glob.glob(fr'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
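One caveat with counting duplicates after pd.concat is that values repeated within a single file are counted too. If only values shared with the master file should count, Series.isin is one alternative. The sketch below uses made-up data and a hypothetical helper name, count_matches; the column name "values" is taken from the question.

```python
import pandas as pd


def count_matches(master, other, column="values"):
    """Count rows of `other` whose value in `column` also occurs in `master`."""
    return int(other[column].isin(master[column]).sum())


# Made-up frames standing in for the master file and two candidate files.
master = pd.DataFrame({"values": [1, 2, 3, 4]})
candidate_a = pd.DataFrame({"values": [1, 2, 9]})
candidate_b = pd.DataFrame({"values": [3, 3, 4, 1]})

scores = {
    "a": count_matches(master, candidate_a),
    "b": count_matches(master, candidate_b),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Whether repeated rows in a candidate should each count as a match depends on the data, so treat the counting rule as an assumption to check.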
You can simplify your code a lot using pathlib.
Addressing your question, you can store the duplicates sum in a dictionary, and after comparing all files, choose the one with most matches. Something like this:
import pandas as pd
from pathlib import Path
main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)
other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')
matches_per_file = {}
for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')
# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)
The code above will populate matches_per_file with pairs filename: matches. That will make it easy for you to find the max(matches) and the corresponding filename, and then decide which files you will keep and which ones you will delete. The variable most_matches will be set with that filename.
Use the code snippet as a starting point, since I don't have the data files to test it properly.
Thank you for your support. I have built a solution using list and sub-list. I added the following to my code and it works. Probably not the nicest solution, but it's my turn to improve my python skills.
    liste1.append(df)
    liste2.append(summe)
liste_overall = list(zip(liste1, liste2))
# compare the sums as numbers; as strings they would be ordered character by character
max_liste = max(liste_overall, key=lambda sublist: int(sublist[1]))
for df2 in liste_overall:
    #print(max_liste)
    print(df2)
    if df2[1] == max_liste[1]:
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])
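The same keep-one, delete-the-rest step can be written more compactly once the scores are held in a dict as numbers. This is a sketch on throwaway files; keep_best is a hypothetical name, and the third score is made up (the first two are the sums printed in the question).

```python
import os
import tempfile


def keep_best(scores):
    """Delete every file in `scores` except the one with the highest score.

    `scores` maps file path -> numeric match count. Returns the kept path.
    """
    best = max(scores, key=scores.get)
    for path in scores:
        if path != best:
            os.remove(path)
    return best


# Demonstration on temporary, empty files.
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("a.csv", "b.csv", "c.csv")]
    for p in paths:
        open(p, "w").close()
    kept = keep_best(dict(zip(paths, [332215, 273239, 1200])))
    remaining = os.listdir(tmp)
```

If two files tie for the highest score, max returns the first one it sees, so only that file is kept.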

Import multiple excel files into pandas and create a column based on name of file

I have multiple Excel files in one folder which I want to read and concatenate, but while concatenating I want to add a column based on the name of each file.
'D:\\156667_Report.xls',
'D:\\192059_Report.xls',
'D:\\254787_Report.xls',
'D:\\263421_Report.xls',
'D:\\273554_Report.xls',
'D:\\280163_Report.xls',
'D:\\307928_Report.xls'
I can read these files in pandas with following script
path = 'D:\\'  # use your path (a raw string cannot end in a single backslash)
allFiles = glob.glob(path + "*.xls")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
I want to add a column called Code to every file I read. Code will be the number from the file name, e.g. 156667, 192059.
Why not just match
foo = re.search(r'(\d+)_Report', file_)
num = foo.group(1)
df['Code'] = num
inside your loop?
One way you could do this is by using join and isdigit inside a generator expression.
isdigit keeps only the digit characters from the file path, and join puts them back together into one string. Note that this picks up digits from folder names as well, if the path contains any.
To be clear, you could change your for loop to this:
for file_ in allFiles:
    df = pd.read_excel(file_, index_col=None, header=0)
    df['Code'] = ''.join(str(i) for i in file_ if i.isdigit())
    list_.append(df)
which will add a column called Code in each df.
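If the folder path may contain digits, restricting the match to the file name is safer. The following sketch uses only the standard library; report_code is a hypothetical helper name, and the second and third paths are made-up examples.

```python
import re
from pathlib import PureWindowsPath


def report_code(path):
    """Return the digits before '_Report' in the file name, or None if absent."""
    name = PureWindowsPath(path).name
    match = re.match(r"(\d+)_Report", name)
    return match.group(1) if match else None


print(report_code("D:\\156667_Report.xls"))        # 156667
print(report_code("D:\\2019\\192059_Report.xls"))  # 192059, the folder digits are ignored
print(report_code("D:\\notes.xls"))                # None
```

Inside the loop this would be used as df['Code'] = report_code(file_).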

Function gives list, I need DataFrame so I can concatenate

I started off pulling all files in the folder and concatenating them, this one works:
warranty_list = []
warranty_files = glob.glob(os.path.join(qms, '*.csv'))
for file_ in warranty_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    warranty_list.append(df)
warranty = pd.concat(warranty_list)
Then I had to write a function so I would only grab certain files and concatenate them, but this one is not working. I do not get an error but the last line is not being used, so I am not concatenating the files.
def get_warranty(years=5):
    warranty_list = [] #list for glob.glob()
    current_year = datetime.datetime.today().year #current year
    last_n_years = [str(current_year-i) for i in range(0,years+1)]
    for year in last_n_years:
        warranty = glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year))
        if warranty:
            for file_ in warranty:
                df = pd.read_csv(file_,index_col=None, header=0)
                warranty_list.append(df)
    warranty_df = pd.concat(warranty_list)
The last line isn't working, presumably because pd.concat() is getting a list as an input and it won't do anything with that. I don't understand why it worked in the first set of code and not this one.
I don't know how to change the function to get a data frame or how to change what I get at the end into a data frame.
Any suggestions?
I would suggest building the DataFrame directly while reading the files. DataFrame.append used to do the same thing as concat here, but it was removed in pandas 2.0, so the equivalent now is pd.concat([warranty_df, df]).
So basically you start with an empty DataFrame
warranty_df = pd.DataFrame()
and then concatenate each new DataFrame onto it while reading the files.
Your function stays largely the same, but you need to delete the following line
warranty_df = pd.concat(warranty_list)
and, after the loop, return warranty_df. The missing return is why the original function appeared to do nothing.
def get_warranty(years=5):
    warranty_df = pd.DataFrame()
    current_year = datetime.datetime.today().year #current year
    last_n_years = [str(current_year-i) for i in range(0,years+1)]
    for year in last_n_years:
        warranty = glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year))
        if warranty:
            for file_ in warranty:
                df = pd.read_csv(file_,index_col=None, header=0)
                warranty_df = pd.concat([warranty_df, df])
    return warranty_df
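The list-based version from the question also works once the function returns its result. The sketch below shows that variant on throwaway files. The folder and current_year parameters are assumptions added so the example runs on its own; they stand in for the qms variable and the datetime lookup in the original.

```python
import glob
import os
import tempfile

import pandas as pd


def get_warranty(folder, current_year, years=5):
    """Concatenate 'Warranty Detail<year>.csv' for current_year and the `years` before it."""
    frames = []
    for year in range(current_year, current_year - years - 1, -1):
        pattern = os.path.join(folder, "Warranty Detail%s.csv" % year)
        for file_ in glob.glob(pattern):
            frames.append(pd.read_csv(file_, index_col=None, header=0))
    if not frames:
        # pd.concat raises ValueError on an empty list, so return an empty frame instead
        return pd.DataFrame()
    # the return statement is what makes the result available to the caller
    return pd.concat(frames, ignore_index=True)


# Demonstration with made-up data; the 2010 file falls outside the requested range.
with tempfile.TemporaryDirectory() as tmp:
    for year, claims in [(2023, 4), (2021, 7), (2010, 99)]:
        path = os.path.join(tmp, "Warranty Detail%s.csv" % year)
        pd.DataFrame({"year": [year], "claims": [claims]}).to_csv(path, index=False)
    warranty_df = get_warranty(tmp, current_year=2023, years=5)
```

The caller has to assign the result, as in the last line, because names created inside the function are not visible outside it.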

How to read multiple files from different folder in python

I have yearly data files in different folders. Each file contains daily data ranging from Jan 1 to Dec 31. A data file name looks like AS060419.67, where the last four digits represent the year (i.e. 1967) and 0604 is the folder name.
I tried to read these files using the code below, but it reads only the last year's data in the last folder.
def date_parser(doy, year):
    return dt.datetime.strptime(doy.zfill(3)+year, '%j%Y')
files = glob.glob('????/AS*')
files.sort()
files
STNS = {}
for f in files:
    stn_id, info = f.split('/')
    year = "".join(info[-5:].split('.'))
    #print (f,stn_id)
    with open(f) as fo:
        data = fo.readlines()[:-1]
        data = [d.strip() for d in data]
        data = '\n'.join(data)
    with open('data.dump', 'w') as dump:
        dump.write(data)
    parser = lambda date: date_parser(date, year=year)
    df = pd.read_table('data.dump', delim_whitespace=True,names=['date','prec'],
                       na_values='DNA', parse_dates=[0], date_parser=parser, index_col='date' )
    df.replace({'T': 0})
    df = df.apply(pd.to_numeric, args=('coerce',))
    df.name = stn_name
    df.sid = stn_id
    if stn_id not in STNS.keys():
        STNS[stn_name] = df
    else:
        STNS[stn_id] = STNS[stn_id].append(df)
        STNS[stn_id].name = df.name
        STNS[stn_id].sid = df.sid
    #outfile.write(line)
For making plot
for stn in STNS:
    STNS[stn_id].plot()
    plt.title('Precipitation for {0}'.format(STNS[stn].name))
The problem is that it reads only the last year's data in the last folder. Can anyone help me figure out this problem? Your help will be highly appreciated.
You can do it like this:
import os
import glob
import pandas as pd
import matplotlib.pyplot as plt
# file mask
fmask = r'./data/????/AS*.??'
# all RegEx replacements
replacements = {
    r'T': 0
}
# list of data files
flist = glob.glob(fmask)
def read_data(flist, date_col='date', **kwargs):
    dfs = []
    for f in flist:
        # parse year from the file name
        y = os.path.basename(f).replace('.', '')[-4:]
        df = pd.read_table(f, **kwargs)
        # replace day of year with a date
        df[date_col] = pd.to_datetime(y + df[date_col].astype(str).str.zfill(3), format='%Y%j')
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
df = read_data(flist,
               date_col='date',
               sep=r'\s+',
               header=None,
               names=['date','prec'],
               engine='python',
               skipfooter=1,
              ) \
    .replace(replacements, regex=True) \
    .set_index('date') \
    .apply(pd.to_numeric, args=('coerce',))
df.plot()
plt.show()
I've downloaded only four files, so the corresponding data you can see on the plot...
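The two string manipulations this answer relies on, taking the year from the file name and turning a day-of-year number into a date, can be checked in isolation with the standard library. A small sketch, with hypothetical helper names:

```python
import datetime as dt
import os


def year_from_name(path):
    """'0604/AS060419.67' -> '1967' (drop the dot, keep the last four characters)."""
    return os.path.basename(path).replace(".", "")[-4:]


def doy_to_date(doy, year):
    """Combine a day-of-year number and a four-digit year string into a date."""
    return dt.datetime.strptime(year + str(doy).zfill(3), "%Y%j").date()


year = year_from_name("0604/AS060419.67")
print(year)                    # 1967
print(doy_to_date(1, year))    # 1967-01-01
print(doy_to_date(32, year))   # 1967-02-01
print(doy_to_date(365, year))  # 1967-12-31
```

The zfill(3) step matters because %j expects a zero-padded, three-digit day of the year.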
You overwrite the same file over and over again.
Derive your target file name from your source file name.
Or use the append mode if you want it all in the same file.
How do you append to a file?
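Appending is a matter of the mode passed to open(). The sketch below contrasts "w" and "a" on a temporary file; the file contents are made up, and data.dump is the file name from the question.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    dump_path = os.path.join(tmp, "data.dump")

    # "w" truncates the file on every open, so only the last write survives.
    for chunk in ["1967 data\n", "1968 data\n"]:
        with open(dump_path, "w") as dump:
            dump.write(chunk)
    with open(dump_path) as dump:
        overwritten = dump.read()

    os.remove(dump_path)

    # "a" appends to the end of the file (and creates it if it is missing).
    for chunk in ["1967 data\n", "1968 data\n"]:
        with open(dump_path, "a") as dump:
            dump.write(chunk)
    with open(dump_path) as dump:
        appended = dump.read()

print(repr(overwritten))
print(repr(appended))
```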

imported csv to dataframe objects not recognized

I have imported multiple csv files from a folder. First I created a list of all the csv files in the folder and then I provide the length of the list to my function.
The csv files have rows with different column lengths so that is why I think I have to use readlines.
The problem is that when I try to filter the DataFrame the values are not recognized.
I saved it to a sqlite table and pulled it into R, and a value that looks like "H" appears like this in R: "\"H\""
How can I prevent those extra characters from being added to my object "H"?
Or do I have another problem?
x = []
count = 0
while (count < len(filelist) ):
    for file in filelist:
        filename = open(filelist[count])
        count = count + 1
        for line in filename.readlines():
            x.append(line.split(','))
df = pd.DataFrame(x)
For example I am just trying to create a mask. But I am getting all False. The DataFrame appears to contain "H"?
data['V1'] == "H"
Try this
df_list = []
file_list = []  # fill with the names of your CSV files
path = 'file_path/'
for file in file_list:
    # read each file and keep the DataFrame itself, not a variable name
    df_list.append(pd.read_csv(path + file))
new_df = pd.concat(df_list)
Answer: This code fixed the problem by removing the quotes throughout. Now the mask works.
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
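The quotes end up in the data because line.split(',') treats them as ordinary characters. The standard csv module interprets them while reading and also accepts rows of different lengths, so the clean-up step may not be needed. A small sketch on made-up input:

```python
import csv
import io

raw = '"H",1,2\n"X",3\n'

# Plain split keeps the quote characters (and the trailing newline) in each field.
split_rows = [line.split(",") for line in io.StringIO(raw)]

# csv.reader interprets the quotes and tolerates rows of different lengths.
csv_rows = list(csv.reader(io.StringIO(raw)))

print(split_rows[0][0] == "H")  # False
print(csv_rows[0][0] == "H")    # True
```

For real files, pass the open file object to csv.reader in place of io.StringIO(raw), then build the DataFrame from the resulting list of rows as before.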
