Find files and copy with multiple criteria - python

Hi, I have a DataFrame that contains two columns, one with the invoice number and another with the client name.
Is there a way to find all files whose names contain both "Banana" AND "A5000" at the same time?
I worked on some code yesterday and a very nice guy helped me get through with one criterion, but I'm stuck again trying to add another one.
Maybe I can't use "and" on this line: files = list(path.rglob(f'*{v and s}*')). I tried "&" but it didn't work.
import pandas as pd
from pathlib import Path

data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)
path = Path('D:/Pyfilesearch')
dest = Path('D:/Dest')

for v, s in zip(df.Invoice, df.Client):
    files = list(path.rglob(f'*{v and s}*'))
    files = [f for f in files if f.is_file()]
    for f in files:
        print(f)
Thanks.

I would use sets and intersection, as in the following example:
import pandas as pd
from pathlib import Path

p = Path("files")
data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)

for i, c in zip(df.Invoice, df.Client):
    s1 = set(p.rglob(f"*{i}*"))      # every file whose name contains the invoice number
    s2 = set(p.rglob(f"*{c}*"))      # every file whose name contains the client name
    i_c_files = s1.intersection(s2)  # files matching both criteria
    print(i_c_files)
    if i_c_files:
        first_file = list(i_c_files)[0]
        print("the first file is " + str(first_file))

Check for both substrings in file.stem:
import pandas as pd
from pathlib import Path

data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)
path = Path('D:/Pyfilesearch')

for v, s in zip(df.Invoice, df.Client):
    files = [f for f in path.iterdir()
             if f.is_file() and v in f.stem and s in f.stem]
    for f in files:
        print(str(f))
"""
D:\Pyfilesearch\A5000 Banana.txt
D:\Pyfilesearch\B8000 Orange.txt
D:\Pyfilesearch\C3000 Lemon.txt
"""

Related

Compare multiple CSV files by row and delete files not needed

I am comparing multiple CSV files against a master file by the values of a selected column, and I want to keep only the file that has the most matches with the master file.
The code I created gives me the results for each file, but I don't know how to compare the files against each other and keep only the one with the highest sum of matches at the end.
I know how to delete files via os.remove() and so on, but I need help with selecting the maximum value.
import os
import glob
import pandas as pd

data0 = pd.read_csv('input_path/master_file.csv', sep=',')
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)

for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    match_sum = str(sum(match1))  # renamed from `sum`, which shadowed the built-in
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + match_sum)
The print gives me (paths and directory names appear correct):
Matches between ... & ...: 332215
Matches between ... & ...: 273239
I had the idea to try it via sub-lists, but did not get anywhere.
You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:
def match_score(csv_file):
    df_base = os.path.basename(csv_file)
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()
Then,
csv_files = glob.glob(r'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
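From there, removing every file except the best match is one loop; a sketch, assuming you really do want the losing files deleted on the spot:
for csv_file in csv_files:
    if csv_file != max_match_file:
        os.remove(csv_file)  # keep only the file with the highest match score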
You can simplify your code a lot using pathlib.
Addressing your question, you can store the duplicates sum in a dictionary, and after comparing all files, choose the one with most matches. Something like this:
import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}
for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with the most matches
most_matches = max(matches_per_file, key=matches_per_file.get)
The code above populates matches_per_file with filename: matches pairs. That makes it easy to find the maximum number of matches and the corresponding filename, and then decide which files to keep and which ones to delete. The variable most_matches will be set to that filename.
Use the code snippet as a starting point, since I don't have the data files to test it properly.
Thank you for your support. I have built a solution using a list and sub-lists. I added the following to my code and it works. Probably not the nicest solution, but it's a chance for me to improve my Python skills.
liste1.append(df)
liste2.append(summe)
liste_overall = list(zip(liste1, liste2))
max_liste = max(liste_overall, key=lambda sublist: sublist[1])

for df2 in liste_overall:
    # print(max_liste)
    print(df2)
    if df2[1] == max_liste[1]:  # compare the sums directly; `in` is not a comparison here
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])

Import and append pickle files

How could I import and append all files in a directory?
import os
import pandas as pd

files = os.listdir(r"C:\Users\arv\Desktop\pickle_files")
data = []
for i in files:
    data.append(pd.read_pickle(i))
df = pd.concat(['data'])
Almost like you tried to do it yourself, but note that os.listdir() returns bare filenames, so join them back onto the directory:
path = r"C:\Users\arv\Desktop\pickle_files"
df = pd.concat([pd.read_pickle(os.path.join(path, f)) for f in files])
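Equivalently, pathlib's glob yields full paths directly; a variant assuming the pickles carry a .pkl extension:
from pathlib import Path
import pandas as pd

pickle_dir = Path(r"C:\Users\arv\Desktop\pickle_files")
df = pd.concat(pd.read_pickle(f) for f in pickle_dir.glob('*.pkl'))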

Pandas reading multiple files from different folders

I have the same file with quarterly data saved in different folders corresponding to the quarter: a quarter 1 folder, quarter 2, quarter 3, and quarter 4. This is the only difference in the file path. I am looking to read all four files in and concatenate them into one DataFrame. I can do this manually, using a version of the simplified code below and changing the period each time.
period = 'Q1'
filepath = 'filepath/' + period
file = filepath + '/file.xls'
df = pd.read_excel(file)
I would like to automate it with some form of for loop (I assume) that loops through the four periods, reads each file into a DataFrame, and then concatenates them. I have read other answers on how this can be done with files in the same folder, but I am struggling to do it where the file path changes. Manually putting the files into the same folder is not a desirable solution.
I tried making period a tuple, and then a list containing all four periods with a simple for loop, but this didn't work. I got the following error message:
TypeError: Can't convert 'list' object to str implicitly
I'd greatly appreciate any advice.
How about first using a list comprehension to build the list of all the files:
periods = ["Q1", "Q2", "Q3", "Q4"]
files = ["filepath/" + p + "/file.xls" for p in periods]
and then load them all into a list of data frames with
dfs = []
for f in files:
    df = pd.read_excel(f)
    dfs.append(df)
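Finally, collapse the list into the single DataFrame you asked for:
df = pd.concat(dfs, ignore_index=True)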
You can use the loops below to build the full file paths and then iterate over them to create one DataFrame containing all the files.
filepath = 'path/'
file = 'file.xlsx'
periods = ['Q1', 'Q2', 'Q3', 'Q4']

files = []
for p in periods:
    files.append(filepath + p + '/' + file)

data = []
for f in files:
    data.append(pd.read_excel(f))

df = pd.concat(data)
You probably want something like this:
periods = ['Q1', 'Q2', 'Q3', 'Q4']
df = None
for period in periods:
    filepath = 'filepath/' + period
    file = filepath + '/file.xls'
    if df is None:
        df = pd.read_excel(file)
    else:
        # append returned a new frame (and was removed in pandas 2.0), so reassign via concat
        df = pd.concat([df, pd.read_excel(file)])
You could try something like this:
complete_df = pd.DataFrame()
for i in range(1, 5):
    quarter = 'Q' + str(i)
    filepath = 'filepath/' + quarter
    file = filepath + '/file.xls'
    df = pd.read_excel(file)
    complete_df = pd.concat([complete_df, df])  # DataFrame.append was removed in pandas 2.0

Retrieving data from multiple files into multiple dataframes

Scenario: I have a list of files in a folder (including the file paths). I am trying to get the content of each of those files into a dataframe (one for each file), then further perform some operations and later merge these dataframes.
From various other questions on SO, I found multiple ways to iterate over the files in a folder and get the data, but all of those I found read the files in a loop and concatenate them into a single DataFrame automatically, which does not work for me.
For example:
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)
files_xls = [f for f in files if f[-3:] == 'xls']

df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
or
import pandas as pd
import glob

all_data = pd.DataFrame()
for f in glob.glob("*.xls*"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
The only piece of code I could put together from what I found is:
import os
import glob
import pandas as pd

mypath = "/DGMS/Destop/uploaded"
listoffiles = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = (pd.read_excel(f) for f in listoffiles)
These lines run without error, but they appear not to do anything: no variables are created or changed.
Question: What am I doing wrong here? Is there a better way to do this?
You are really close. A generator expression is lazy: nothing is read until something consumes it, which is why your code appeared to do nothing. You just need to join all the data with concat, which consumes the generator:
contentdataframes = (pd.read_excel(f) for f in listoffiles)
df = pd.concat(contentdataframes, ignore_index=True)
If you need a list of DataFrames instead:
contentdataframes = [pd.read_excel(f) for f in listoffiles]
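Since you want one DataFrame per file so you can run per-file operations before merging, a dict keyed by filename is another option. A sketch, where the dropna() call is only a placeholder for your real operations:
import glob
import os
import pandas as pd

mypath = "/DGMS/Destop/uploaded"
frames = {os.path.basename(f): pd.read_excel(f)
          for f in glob.glob(os.path.join(mypath, "*.xls*"))}

for name, frame in frames.items():
    frames[name] = frame.dropna()  # placeholder: your per-file operations go here

merged = pd.concat(frames.values(), ignore_index=True)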

Get filenames using glob

I am reading several tsd files using pandas and combining them into a big frame. I am using glob to iterate through all the files in my directory and sub-directories. Every single frame gets a unique key. Now I want to create a reference table that stores the filename for each key. But since I don't really understand glob, I don't know how to get only the names of the files.
import pandas as pd
from pathlib import Path

p = Path('myPath')
data = []
reference_table = {}
number_of_files = 0
for tsd_files in p.glob('**/*.tsd'):
    data.append(pd.read_csv(str(tsd_files), delim_whitespace=True, header=None))
    number_of_files = number_of_files + 1
whole_data = pd.concat(data, keys=list(range(number_of_files)))
Just use os.path.basename() to get only the filename from the path.
import os
import pandas as pd
from pathlib import Path

p = Path('myPath')
data = []
reference_table = {}
number_of_files = 0
file_names = []
for tsd_files in p.glob('**/*.tsd'):
    data.append(pd.read_csv(str(tsd_files), delim_whitespace=True, header=None))
    number_of_files = number_of_files + 1
    file_names.append(os.path.basename(tsd_files))
whole_data = pd.concat(data, keys=list(range(number_of_files)))
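To get the reference table you described, you can then map each positional key (the same integers passed to concat) to its filename:
reference_table = dict(enumerate(file_names))  # {0: 'first.tsd', 1: 'second.tsd', ...}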
Let's use Path in a pythonic way.
from pathlib import Path
p = Path('dir')
filenames = [i.stem for i in p.glob('**/*.ext')]
p.glob('**/*.ext') returns a generator object, which needs to be iterated to get its values out; that is what the list comprehension [i.stem for i in ...] does.
i.stem gives the filename without its extension; use i.name if you want the extension included.
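A quick demonstration of the difference, using a made-up path:
from pathlib import Path

f = Path('myPath/sub/data.tsd')
print(f.name)  # data.tsd - filename with extension
print(f.stem)  # data     - filename without extension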
