I am comparing multiple CSV files against a master file on the values of a selected column, and want to keep only the file that has the most matches with the master file.
The code I created gives me the results for each file, but I don't know how to compare the files against each other and keep only the one with the highest match count at the end.
I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.
import glob
import os

import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    total = sum(match1)  # renamed from `sum`, which shadowed the built-in
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + str(total))
The print gives me (paths and directory names appear correct):
Matches between ... & ...: 332215
Matches between ... & ...: 273239
I had the idea to try it via sub-lists, but just did not get anywhere.
You could write a function that calculates the "match score" for each file, and use that function as the key argument for the max() function:
def match_score(csv_file):
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()
Then,
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
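If the goal is then to keep only that file, a possible follow-up (assuming every other discovered CSV really should be deleted) could look like this:

# keep only the best-matching file, remove the rest
for f in csv_files:
    if f != max_match_file:
        os.remove(f)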
You can simplify your code a lot using pathlib.
Addressing your question: you can store the duplicate sums in a dictionary and, after comparing all files, choose the one with the most matches. Something like this:
import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}
for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with the most matches
most_matches = max(matches_per_file, key=matches_per_file.get)
The code above populates matches_per_file with filename: matches pairs. That makes it easy to find the maximum match count and the corresponding filename, and then decide which files you will keep and which ones you will delete. The variable most_matches will hold that filename.
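For example, assuming the non-matching files really should be deleted, Path.unlink can do the removal:

# delete every file except the best match (the dict keys are Path objects)
for other_file in matches_per_file:
    if other_file != most_matches:
        other_file.unlink()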
Use the code snippet as a starting point, since I don't have the data files to test it properly.
Thank you for your support. I have built a solution using a list and sub-lists. I added the following to my code and it works. Probably not the nicest solution, but it pushes me to improve my Python skills.
# inside the comparison loop: collect each file path and its match count
liste1.append(df)
liste2.append(summe)

# after the loop: pair them up and find the maximum
liste_overall = list(zip(liste1, liste2))
max_liste = max(liste_overall, key=lambda sublist: sublist[1])
for df2 in liste_overall:
    print(df2)
    if df2[1] == max_liste[1]:  # == rather than `in`, which would do a substring test
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])
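A slightly tighter variant of the same idea, assuming liste_overall is built as above, unpacks the tuples directly:

# keep the path with the highest count, delete the rest
keep_file, _ = max(liste_overall, key=lambda pair: pair[1])
for file_path, _ in liste_overall:
    if file_path != keep_file:
        os.remove(file_path)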
Hi, I have a DataFrame that contains two columns, one with invoice numbers and another with client names.
Is there a way to find all files (by name) that contain "Banana" AND "A5000" at the same time?
I tried to work on the code yesterday and a very nice guy helped me get through with one criterion, but I'm stuck again when trying to add another one.
Maybe I can't use "and" on this line: files = list(path.rglob(f'*{v and s}*')). I tried "&" but it didn't work.
import pandas as pd
from pathlib import Path

data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)
path = Path('D:/Pyfilesearch')
dest = Path('D:/Dest')

for v, s in zip(df.Invoice, df.Client):
    files = list(path.rglob(f'*{v and s}*'))  # `v and s` evaluates to just s, so only one term is matched
    files = [f for f in files if f.is_file()]
    for f in files:
        print(f)
thanks.
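(A quick aside on why the original pattern matches only one term: and on two non-empty strings is ordinary boolean logic and evaluates to the second operand, so the glob pattern only ever contains the client name.)

>>> 'A5000' and 'Banana'
'Banana'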
I would use sets and intersection, as in the following example:
import pandas as pd
from pathlib import Path

p = Path("files")
data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)

for i, c in zip(df.Invoice, df.Client):
    s1 = set(p.rglob(f"*{i}*"))  # files whose names contain the invoice number
    s2 = set(p.rglob(f"*{c}*"))  # files whose names contain the client name
    i_c_files = s1.intersection(s2)
    print(i_c_files)
    if i_c_files:
        first_file = list(i_c_files)[0]
        print("the first file is " + str(first_file))
Use substring checks on file.stem:
import pandas as pd
from pathlib import Path

data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)
path = Path('D:/Pyfilesearch')

for v, s in zip(df.Invoice, df.Client):
    # note: iterdir() is not recursive; use path.rglob('*') if subfolders should be searched too
    files = [f for f in path.iterdir()
             if f.is_file() and v in f.stem and s in f.stem]
    for f in files:
        print(str(f))

"""
D:\Pyfilesearch\A5000 Banana.txt
D:\Pyfilesearch\B8000 Orange.txt
D:\Pyfilesearch\C3000 Lemon.txt
"""
I was wondering if there is a way to print out file names conditionally, based on multiple imported CSV files. My procedure is:
1. Set my path.
2. Grab all the csv files in this path.
3. Import all these csv files, grabbing only the numbers of each file name and storing them in 'new_column'.
4. Check the number of columns of each file and exclude the files that do not have 10 columns (achieved using shape[1]).
5. Print out the actual file names that don't have 10 columns -> I am stuck here.
I have no problems up to step 4. However, I am stuck on step 5. How do I achieve it?
import glob
import os
import re

import pandas as pd

# setting my path
path = r'my\path'

# grab all csv files in my path
all_files = glob.glob(path + "/*.csv")

# grab the numeric part of each file name
def get_numbers_from_filename(filename):
    return re.search(r'\d+', filename).group(0)

# import all the actual csv files and add a 'new_column' column
# based on the get_numbers_from_filename function
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df['new_column'] = get_numbers_from_filename(filename)
    li.append(df)

# check the frequency of column counts across files using a frequency table
result = []
for lis in li:
    result.append(lis.shape[1])

# make this a dataframe
result = pd.DataFrame(result, columns=['shape'])

# actual checking step
result['shape'].value_counts()

# grab only shape == 10 files to correctly concatenate
result = []
for lis in li:
    if lis.shape[1] == 10:
        result.append(lis)

## my attempt at part 5:
# print and save all the paths of my directory
path = os.listdir(path)

# grab file names if column counts are not 10
result3 = []
for paths in path:
    for list in li:  # `list` shadows the built-in and is never used
        if lis.shape[1] != 10:  # bug: `lis` is the stale loop variable left over from above
            result3.append(paths)
My solution gives an empty list [].
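One way to get step 5, sketched under the assumption that all_files and li line up one-to-one (they do, since li was built by iterating over all_files):

# pair each filename with its DataFrame and keep the names whose column count is not 10
bad_files = [filename for filename, frame in zip(all_files, li)
             if frame.shape[1] != 10]
print(bad_files)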
I have an Excel file (df2) and I have used a for loop to produce multiple outputs from it; now I want to append all of those outputs so that I end up with one single Excel file. Please find my code below and suggest some ideas so I can complete it.
import os
from os import path

import pandas as pd

src = "C:\\ASPAIN-ORANGE\\test_wind3\\udc\\folder\\"
df = pd.read_excel('check_.xlsx', sheet_name='Align_pivot')
files = [i for i in os.listdir(src)
         if i.startswith("_Verification_") and path.isfile(path.join(src, i))]
for f in files:
    slice1 = 19
    file_slice = f[slice1:].replace(".csv", "")
    df1 = pd.read_csv(path.join(src, f))  # join with src so the file is found outside the cwd
    total_rows_df1 = len(df1.axes[0])
    df2 = df[df['MO'] == file_slice]
    total_rows_df2 = sum(df2.To_Align)
    print("filename : " + str(file_slice))
    print("Number of Rows_df1: " + str(total_rows_df1))
    print("Number of Rows_df2: " + str(total_rows_df2))
    if total_rows_df1 == total_rows_df2:
        print('True')
    else:
        print('False')
    df2.to_excel('output.xlsx', index=False, na_rep='NA', header=True)  # overwritten on every iteration
(Screenshots omitted: 1st iteration output, 2nd iteration output, 3rd iteration output, and so on, plus the final appended output.)
Your kind help would really be appreciated.
You can use the DataFrame.append method (append rows of other to the end of caller, returning a new object):

df = df.append(sheet, ignore_index=True)

Once all rows are added, you can call the to_excel method to write the Excel file.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
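Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so with recent pandas the usual pattern is to collect the per-iteration frames in a list and concatenate once. A minimal sketch with made-up frames:

import pandas as pd

frames = []
for mo in ['MO1', 'MO2', 'MO3']:  # hypothetical iteration values
    part = pd.DataFrame({'MO': [mo], 'To_Align': [1]})
    frames.append(part)

# one concat, then a single write at the end
pd.concat(frames, ignore_index=True).to_excel('output.xlsx', index=False, na_rep='NA')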
I have the same file with quarterly data saved in different folders corresponding to the quarter: a quarter 1 folder, quarter 2, quarter 3, and quarter 4. This is the only difference in the file path. I am looking to read all four files in and concatenate them into one DataFrame. I can do this manually, using a version of the simplified code below and changing the period each time.
period = 'Q1'
filepath = 'filepath/' + period
file = filepath + '/file.xls'
df = pd.read_excel(file)
I would like to automate it with some form of for loop (I assume) that loops through the 4 periods, reads each file into a DataFrame, and then concatenates. I have read other answers covering how this can be done with files in the same folder, but am struggling to do it where the file path changes. Manually putting the files into the same folder is not a desirable solution.
I tried making period a tuple, and then a list containing all 4 periods with a simple for loop, but this didn't work. I got the following error message:
TypeError: Can't convert 'list' object to str implicitly
Greatly appreciate any advice.
How about you first use a list comprehension to get a list of all the files:
periods = ["Q1", "Q2", "Q3", "Q4"]
files = ["filepath/" + p + "/file.xls" for p in periods]
and then load them all into a list of data frames with
dfs = []
for f in files:
    df = pd.read_excel(f)
    dfs.append(df)
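From there, assuming the four frames share columns, a single concat stitches them together:

df = pd.concat(dfs, ignore_index=True)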
You can use these loops to create the full file paths and to iterate over them, creating one DataFrame containing all the files.
filepath = 'path/'
file = 'file.xlsx'
periods = ['Q1', 'Q2', 'Q3', 'Q4']

files = []
for p in periods:
    files.append(filepath + p + '/' + file)
files  # inspect the generated paths

data = []
for f in files:
    data.append(pd.read_excel(f))
df = pd.concat(data)
You probably want something like this:
periods = ['Q1', 'Q2', 'Q3', 'Q4']
df = None
for period in periods:
    filepath = 'filepath/' + period
    file = filepath + '/file.xls'
    if df is None:
        df = pd.read_excel(file)
    else:
        # append returns a new DataFrame, so the result must be reassigned
        df = df.append(pd.read_excel(file))
You could try something like this:
complete_df = pd.DataFrame()
for i in range(1, 5):
    quarter = 'Q' + str(i)
    filepath = 'filepath/' + quarter
    file = filepath + '/file.xls'
    df = pd.read_excel(file)
    complete_df = complete_df.append(df)
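Note that DataFrame.append was removed in pandas 2.0, so a version of the last two answers that still runs on current pandas collects the frames and concatenates once:

# pandas >= 2.0 equivalent
frames = [pd.read_excel('filepath/Q' + str(i) + '/file.xls') for i in range(1, 5)]
complete_df = pd.concat(frames, ignore_index=True)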
I have imported multiple csv files from a folder. First I created a list of all the csv files in the folder, and then I provide the length of the list to my function.
The csv files have rows with different column lengths, which is why I think I have to use readlines.
The problem is that when I try to filter the DataFrame, the values are not recognized.
I saved it to a sqlite table and pulled it into R, and a value that looks like "H" appears in R as "\"H\"".
How can I prevent those extra characters from being added to my value "H"? Or do I have another problem?
x = []
count = 0
while count < len(filelist):
    for file in filelist:
        filename = open(filelist[count])
        count = count + 1
        for line in filename.readlines():
            # naive split: quote characters from quoted CSV fields stay in the values
            x.append(line.split(','))
df = pd.DataFrame(x)
For example, I am just trying to create a mask, but I am getting all False. The DataFrame appears to contain "H"?
data['V1'] == "H"
Try this
df_list = []
file_list = ['file1.csv', 'file2.csv']  # names of the csv files to read
path = 'file_path/'
for file in file_list:
    # read each file into a DataFrame and collect the frames themselves,
    # not their names (a string cannot be used as an assignment target)
    df_list.append(pd.read_csv(path + file))
new_df = pd.concat(df_list)
Answer: This code fixed the problem by removing the quotes throughout. Now the mask works.
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
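For reference, the stray quotes appear because line.split(',') does no CSV quote handling; the csv module (or pd.read_csv) strips them on read. A minimal sketch, assuming the same filelist of comma-separated files:

import csv

import pandas as pd

filelist = ['a.csv', 'b.csv']  # hypothetical file names

rows = []
for name in filelist:
    with open(name, newline='') as fh:
        # csv.reader handles quoted fields, so '"H"' comes back as 'H'
        rows.extend(csv.reader(fh))

# rows of unequal length are padded with None
df = pd.DataFrame(rows)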