Python, Pandas: Faster File Search than os.path?

I have a pandas df with file names that need to be searched/matched in a directory tree.
I've been using the following, but it crashes with larger directory structures. I record whether or not each file is present in two lists.
found = []
missed = []
for target_file in df_files['Filename']:
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath, target_file))
        else:
            missed.append(target_file)
print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
I've read that scandir is quicker and will handle larger directory trees. If true, how might this be rewritten?
My attempt:
found = []
missed = []
for target_file in df_files['Filename']:
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath, target_file))
        else:
            missed.append(target_file)
print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
This runs (fast), but everything ends up in the "missed" list.
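For what it's worth, everything lands in "missed" because os.scandir only lists a single directory level (and DirEntry.name is an attribute, not a method, so item.name() would also fail). A minimal recursive sketch, demonstrated on a throwaway directory:

```python
import os
import tempfile
from pathlib import Path

def scantree(root):
    """Recursively yield DirEntry objects; os.scandir by itself only
    lists one directory level, which is why the attempt above misses
    files in subdirectories."""
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

# Demo on a throwaway tree with one nested file
tmp = Path(tempfile.mkdtemp())
(tmp / 'sub').mkdir()
(tmp / 'sub' / 'target.txt').write_text('x')

names = {entry.name for entry in scantree(tmp)}  # .name is an attribute, not a call
print('target.txt' in names)  # True
```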

Scan your directories only once and convert the result to a dataframe.
Example on my venv directory:
import pandas as pd
import pathlib
DIRECTORY_TREE = pathlib.Path('./venv').resolve()
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])
df_files = pd.DataFrame({'Filename': ['__init__.py']})
Now you can use df_path to lookup filenames from df_files with merge:
out = (df_files.merge(df_path, on='Filename', how='left')
       .value_counts('Filename').to_frame('Found'))
out['Missed'] = len(df_path) - out['Found']
print(out.reset_index())
# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
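If you also want the actual found/missed lists from the original loop, merge with indicator=True makes the split explicit. A minimal sketch, with made-up paths standing in for the scanned df_path:

```python
import pandas as pd

# Stand-ins for the scanned directory inventory and the lookup list
df_path = pd.DataFrame({'Directory': ['/tmp/a', '/tmp/b'],
                        'Filename': ['report.csv', 'notes.txt']})
df_files = pd.DataFrame({'Filename': ['report.csv', 'missing.bin']})

# indicator=True adds a '_merge' column saying which side each row came from
merged = df_files.merge(df_path, on='Filename', how='left', indicator=True)
found = merged.loc[merged['_merge'] == 'both', ['Directory', 'Filename']]
missed = merged.loc[merged['_merge'] == 'left_only', 'Filename'].tolist()
print('Found:', len(found), 'Missed:', len(missed))  # Found: 1 Missed: 1
```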

Related

Read multiple csv files from multiple folders in Python

I have a folder that contains subfolders, and these subfolders contain many csv files. I want to import and concatenate all of them in Python.
Let's say main folder: /main
subfolders: /main/main_1
csv: /main/main_1/first.csv
path = '/main'
df_list = []
for file in os.listdir(path):
    df = pd.read_csv(file)
    df_list.append(df)
final_df = df.append(df for df in df_list)
What about this:
import pandas as pd
from pathlib import Path

directory = "path/to/root_dir"

# Read each CSV file in dir "path/to/root_dir"
dfs = []
for file in Path(directory).glob("**/*.csv"):
    dfs.append(pd.read_csv(file))

# Put the dataframes to a single dataframe
df = pd.concat(dfs)
Change the path/to/root_dir to wherever your CSV files are.
I found a way to concat all of them, but it doesn't satisfy me as it takes too much time due to computational complexity.
path = "/main"
folders = []
directory = os.path.join(path)
for root, dirs, files in os.walk(directory):
    folders.append(root)
del folders[0]

final = []
for folder in folders:
    df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(folder + "/*.csv"))))
    final.append(df)
Remember to add back main to the path:
df = pd.read_csv(path + "/" + file)

Search (in folders and subfolders ) and read files to a list of dataframes, using Python

I have a code
df1 = pd.read_excel('DIRECTORY\\file.xlsm', sheetname='Resume', header=1, usecols='A:I')
#some operations
bf1 = pd.read_excel('DIRECTORY\\file.xlsm', sheetname='Resume', header=1, usecols='K:P')
#some operations
Final_file = pd.concat([df1,bf1], ignore_index=True)
Note that df and bf are reading the same file, the difference is the columns being read.
I have a lot of files.
Is it possible to go through folders and subfolders, search for a filename pattern and create a list of dataframes to read, instead of writing each path I have?
You can use a recursive method with both pathlib and glob.
Note that parent_path should be the top-level folder you want to search.
from pathlib import Path

files = [file for file in Path(parent_path).rglob('*filename*.xls')]
This will return a list of files that match your condition. You can then concat them with a list comprehension.
dfs = [pd.read_excel(file, sheet_name='Resume', header=1, usecols='A:I') for file in files]
df1 = pd.concat(dfs)
Edit: latest file by modified time.
We can use the following function, which takes a path and returns the most recently modified file per group of versioned names. We get a unique base name by splitting the stem on a delimiter, so sales_v1, sales_v2 and sales_v3 all become sales; we then keep the latest modified file among them.
import pandas as pd
from pathlib import Path

def get_latest_files(path):
    files = {
        f: pd.Timestamp(f.stat().st_mtime, unit="s") for f in Path(path).rglob("*.csv")
    }
    df = (
        pd.DataFrame.from_dict(files, orient="index")
        .reset_index()
        .rename(columns={"index": "path", 0: "seconds"})
    )
    df["dupe_files"] = df["path"].apply(lambda x: x.stem).str.split("_", expand=True)[0]
    max_files = (
        df.groupby(["dupe_files", "path"])
        .max()
        .groupby(level=0)["seconds"]
        .nlargest(1)
        .to_frame()
        .reset_index(-1)["path"]
        .tolist()
    )
    return max_files
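The same "latest file per base name" idea can also be sketched with groupby(...).idxmax(). This self-contained demo builds a throwaway tree with made-up names and forced modification times:

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway tree with versioned files and known modification times.
tmp = Path(tempfile.mkdtemp())
for name, mtime in [("sales_v1.csv", 100), ("sales_v2.csv", 200), ("costs_v1.csv", 150)]:
    f = tmp / name
    f.write_text("a,b\n1,2\n")
    os.utime(f, (mtime, mtime))  # force a distinct, known mtime

df = pd.DataFrame({"path": list(tmp.rglob("*.csv"))})
df["seconds"] = df["path"].apply(lambda p: p.stat().st_mtime)
df["base"] = df["path"].apply(lambda p: p.stem.split("_")[0])  # sales_v2 -> sales

# Keep the row with the largest mtime within each base-name group.
latest = df.loc[df.groupby("base")["seconds"].idxmax(), "path"].tolist()
print(sorted(p.name for p in latest))  # ['costs_v1.csv', 'sales_v2.csv']
```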
Here is a code snippet that might help your cause:
source = r'C:\Mypath\SubFolder'

for root, dirs, files in os.walk(source):
    for name in files:
        if name.endswith((".xls", ".xlsx", ".xlsm")):
            filetoprocess = os.path.join(root, name)
            df = pd.read_excel(filetoprocess, sheet_name='Resume', header=1, usecols='A:I')
Hope that helps.
You can use the glob library to do this:
from glob import glob

filenames = glob('./Folder/pattern*.xlsx')  # pattern is the common pattern in the filenames
dataframes = [pd.read_excel(f) for f in filenames]  # sequentially read all the files, one dataframe per file
master_df = pd.concat(dataframes)  # master dataframe after concatenating all the dataframes

How to merge multiple text files into one csv file in Python

I am trying to convert 200 text files into one csv file. I am using the below code; it runs but does not produce the csv file. Could anyone suggest an easy and fast way to do this? Many thanks.
dirpath = 'C:\Files\Code\Analysis\Input\qobs_RR1\\'
output = 'C:\Files\Code\Analysis\output\qobs_CSV.csv'

csvout = pd.DataFrame()
files = os.listdir(dirpath)
for filename in files:
    data = pd.read_csv(filename, sep=':', index_col=0, header=None)
    csvout = csvout.append(data)
csvout.to_csv(output)
The problem is that your os.listdir gives you the list of filenames inside dirpath, not the full path to these files. You can get the full path by prepending the dirpath to filenames with os.path.join function.
import os
import pandas as pd

dirpath = 'C:\Files\Code\Analysis\Input\qobs_RR1\\'
output = 'C:\Files\Code\Analysis\output\qobs_CSV.csv'

csvout_lst = []
files = [os.path.join(dirpath, fname) for fname in os.listdir(dirpath)]
for filename in sorted(files):
    data = pd.read_csv(filename, sep=':', index_col=0, header=None)
    csvout_lst.append(data)
pd.concat(csvout_lst).to_csv(output)
Edit: this can be done with a one-liner:
pd.concat(
    pd.read_csv(os.path.join(dirpath, fname), sep=':', index_col=0, header=None)
    for fname in sorted(os.listdir(dirpath))
).to_csv(output)
Edit 2: updated the answer, so the list of files is sorted alphabetically.

Pandas raising an error when setting the output directory

I get an OSError when I try to set the output directory and write a prefix in front of "i", such as 'cal_' or 'edit_'. If I add a postfix like df.to_csv(i + '_edit.csv'), the result is "filename.csv_edit".
So the files were saved to the input directory and I couldn't add any prefix or postfix. How can I fix this error?
import pandas as pd
import glob

PathIn = r'C:\Users\input'
PathOut = r'C:\Users\output'
filenames = glob.glob(PathIn + "/*.csv")

file_list = []
for i in filenames:
    df = pd.read_csv(i)
    file_list.append(df)
    df.columns = df.columns.str.replace('[','')
    df.columns = df.columns.str.replace(']','')
    df.to_csv(i + '.csv')
Try this one. This should work. It has the full code you want.
import os
import pandas as pd

PathIn = r'C:\Users\input'
PathOut = r'C:\Users\output'

file_list = []
for name in os.listdir(PathIn):
    if name.endswith(".csv"):
        # print(name)
        df = pd.read_csv(os.path.join(PathIn, name))
        file_list.append(df)
        df.columns = df.columns.str.replace('[', '', regex=False)
        df.columns = df.columns.str.replace(']', '', regex=False)
        df.to_csv(os.path.join(PathOut, name))
The value of i in filenames is the absolute path of the csv file you are reading.
So if you have 3 csv files in your input directory, your filenames list will look like this:
['C:\Users\input\file1.csv',
'C:\Users\input\file2.csv',
'C:\Users\input\file3.csv']
Now you are trying to add a prefix in front of the elements of above list which would not be a valid path.
You need to fetch the filename of input file and append it with PathOut so that a valid path exists.
You can fetch the filenames in any directory as below :
filenames = []
for entry in os.listdir(PathIn):
    if os.path.isfile(os.path.join(PathIn, entry)) and ".csv" in entry:
        filenames.append(entry)
Now you can iterate over this list and do operations you were doing. For saving the final df to file in output directory, append the filenames with PathOut.
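A minimal sketch of that last step, joining PathOut with the bare filename (the 'edit_' prefix and the filenames here are just stand-ins):

```python
import os

PathOut = r'C:\Users\output'  # output directory from the question

# A valid output path joins PathOut with the bare filename (plus any
# prefix), instead of prefixing the absolute input path.
out_paths = [os.path.join(PathOut, 'edit_' + name)
             for name in ['file1.csv', 'file2.csv']]  # stand-ins for the fetched filenames
for p in out_paths:
    print(p)
```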

Get filenames using glob

I am reading several tsd files using pandas and combining them into a big frame. I am using glob to iterate through all the files in my directory and sub-directories. Every single frame gets a unique key. Now I want to create a reference table where the file name for each key is stored. But since I don't really understand glob, I don't know how to get only the names of the files.
p = Path('myPath')
data = []
reference_table = {}
number_of_files = 0

for tsd_files in p.glob('**/*.tsd'):
    data.append(pd.read_csv(str(tsd_files), delim_whitespace=True, header=None))
    number_of_files = number_of_files + 1

whole_data = pd.concat(data, keys=list(range(number_of_files)))
Just use os.path.basename() to get only filename from path.
import os

p = Path('myPath')
data = []
reference_table = {}
number_of_files = 0
file_names = []

for tsd_files in p.glob('**/*.tsd'):
    data.append(pd.read_csv(str(tsd_files), delim_whitespace=True, header=None))
    number_of_files = number_of_files + 1
    file_names.append(os.path.basename(tsd_files))

whole_data = pd.concat(data, keys=list(range(number_of_files)))
Let's use Path in a pythonic way.
from pathlib import Path
p = Path('dir')
filenames = [i.stem for i in p.glob('**/*.ext')]
p.glob('**/*.ext') returns a generator object, which needs to be iterated to get its values out; that is what the list comprehension [i for i in ..] does.
i.stem gives the filename without its extension (use i.name if you want the extension kept).
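A quick illustration of the difference, on a hypothetical path:

```python
from pathlib import Path

p = Path('data/measurements/run01.tsd')  # hypothetical file path
print(p.name)    # run01.tsd  (filename with extension)
print(p.stem)    # run01      (filename without extension)
print(p.suffix)  # .tsd       (the extension itself)
```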
