Get filenames using glob - python

I am reading several .tsd files using pandas and combining them into one big frame. I am using glob to iterate through all the files in my directory and sub-directories. Every single frame gets a unique key. Now I want to create a reference table where the file name for each key is stored. But since I don't really understand glob, I don't know how to get only the names of the files.
from pathlib import Path
import pandas as pd

p = Path('myPath')
data = []
reference_table = {}
number_of_files = 0
for tsd_files in p.glob('**/*.tsd'):
    data.append(pd.read_csv(str(tsd_files), delim_whitespace=True, header=None))
    number_of_files = number_of_files + 1
whole_data = pd.concat(data, keys=list(range(number_of_files)))

Just use os.path.basename() to get only the filename from a path.
import os
from pathlib import Path
import pandas as pd

p = Path('myPath')
data = []
reference_table = {}
number_of_files = 0
file_names = []
for tsd_files in p.glob('**/*.tsd'):
    data.append(pd.read_csv(str(tsd_files), delim_whitespace=True, header=None))
    number_of_files = number_of_files + 1
    file_names.append(os.path.basename(tsd_files))
whole_data = pd.concat(data, keys=list(range(number_of_files)))
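Since the question also asks for a reference table from key to file name, here is a minimal sketch of that (an illustration building on the code above, using enumerate so the keys and the table stay in sync):

import pandas as pd
from pathlib import Path

p = Path('myPath')
data = []
reference_table = {}  # key -> file name
for key, tsd_file in enumerate(p.glob('**/*.tsd')):
    data.append(pd.read_csv(str(tsd_file), delim_whitespace=True, header=None))
    reference_table[key] = tsd_file.name  # pathlib already exposes the bare name
whole_data = pd.concat(data, keys=list(reference_table))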

Let's use Path in a pythonic way.
from pathlib import Path
p = Path('dir')
filenames = [i.stem for i in p.glob('**/*.ext')]
p.glob('**/*.ext') returns a generator object, which needs to be iterated to get its values out; that is what the comprehension [i for i in ..] does.
i.stem is the file name without its extension (use i.name if you want to keep the extension).
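For illustration, with a hypothetical path:

from pathlib import Path

f = Path('dir/sub/report.tsd')  # hypothetical file
print(f.stem)  # report      (no extension)
print(f.name)  # report.tsd  (with extension)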


Python, Pandas: Faster File Search than os.path?

I have a pandas df with file names that need to be searched/matched in a directory tree.
I've been using the following, but it crashes with larger directory structures. I record whether or not they are present in two lists.
found = []
missed = []
for target_file in df_files['Filename']:
    for (dirpath, dirnames, filenames) in os.walk(DIRECTORY_TREE):
        if target_file in filenames:
            found.append(os.path.join(dirpath, target_file))
        else:
            missed.append(target_file)
print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
I've read that scandir is quicker and will handle larger directory trees. If true, how might this be rewritten?
My attempt:
found = []
missed = []
for target_file in df_files['Filename']:
    for item in os.scandir(DIRECTORY_TREE):
        if item.is_file() and item.name() == target_file:
            found.append(os.path.join(dirpath, target_file))
        else:
            missed.append(target_file)
print('Found: ', len(found), 'Missed: ', len(missed))
print(missed)
This runs (fast), but everything ends up in the "missed" list.
Note that os.scandir lists only the top level of DIRECTORY_TREE (it does not recurse), and item.name is an attribute, not a method, so the attempt above cannot find nested files. Instead, scan your directories only once and convert the result to a dataframe.
Example on my venv directory:
import pandas as pd
import pathlib
DIRECTORY_TREE = pathlib.Path('./venv').resolve()
data = [(str(pth.parent), pth.name) for pth in DIRECTORY_TREE.glob('**/*') if pth.is_file()]
df_path = pd.DataFrame(data, columns=['Directory', 'Filename'])
df_files = pd.DataFrame({'Filename': ['__init__.py']})
Now you can use df_path to look up filenames from df_files with merge:
out = (df_files.merge(df_path, on='Filename', how='left')
               .value_counts('Filename').to_frame('Found'))
out['Missed'] = len(df_path) - out['Found']
print(out.reset_index())
# Output
      Filename  Found  Missed
0  __init__.py   5837  105418
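If you only need the found/missed lists from the question, a plain set of file names (built in a single walk, with O(1) membership tests) is a lighter alternative; a sketch, assuming df_files and DIRECTORY_TREE as above (note it records names, not full paths):

import os

# collect every file name in the tree exactly once
all_names = {name
             for _, _, filenames in os.walk(DIRECTORY_TREE)
             for name in filenames}

found = [f for f in df_files['Filename'] if f in all_names]
missed = [f for f in df_files['Filename'] if f not in all_names]
print('Found:', len(found), 'Missed:', len(missed))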

Import and append pickle files

How could I import and append all files in a directory?
files = os.listdir(r"C:\Users\arv\Desktop\pickle_files")
data = []
for i in files:
    data.append(pd.read_pickle(i))
df = pd.concat(['data'])
Almost like you tried to do it yourself:
df = pd.concat([pd.read_pickle(f) for f in files])
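One caveat worth noting (an assumption about how the script is run): os.listdir returns bare file names, so pd.read_pickle only finds them when the working directory is the pickle folder. Joining the directory path is safer:

import os
import pandas as pd

folder = r"C:\Users\arv\Desktop\pickle_files"  # path from the question
df = pd.concat([pd.read_pickle(os.path.join(folder, f))
                for f in os.listdir(folder)])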

Find files and copy with multiple criteria

Hi, I have a DataFrame that contains two columns, one with an invoice number and another with a client name.
Is there a way to find all files (by name) that contain "Banana" AND "A5000" at the same time?
I tried to work on the code yesterday and a very nice guy helped me get through with one criterion, but I'm stuck again when trying to add another one.
Maybe I can't use "and" on this line: files = list(path.rglob(f'*{v and s}*')). I tried "&" but it didn't work.
from pathlib import Path
import pandas as pd

data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)
path = Path('D:/Pyfilesearch')
dest = Path('D:/Dest')
for v, s in zip(df.Invoice, df.Client):
    files = list(path.rglob(f'*{v and s}*'))
    files = [f for f in files if f.is_file()]
    for f in files:
        print(f)
thanks.
I would use sets and intersection, as in the following example:
p = Path("files")
data = {'Invoice':['A5000','B8000','C3000'],'Client':['Banana','Orange','Lemon']}
df = pd.DataFrame(data=data)
for i, c in zip(df.Invoice, df.Client):
s1 = set(p.rglob(f"*{i}*"))
s2 = set(p.rglob(f"*{c}*"))
i_c_files = s1.intersection(s2)
print(i_c_files)
if i_c_files:
first_file = list(i_c_files)[0]
print("the first file is " + str(first_file))
Use a substring test on file.stem (note that iterdir, unlike rglob, does not recurse into subdirectories):
import pandas as pd
from pathlib import Path

data = {'Invoice': ['A5000', 'B8000', 'C3000'], 'Client': ['Banana', 'Orange', 'Lemon']}
df = pd.DataFrame(data=data)
path = Path('D:/Pyfilesearch')
for v, s in zip(df.Invoice, df.Client):
    files = [f for f in path.iterdir()
             if f.is_file() and v in f.stem and s in f.stem]
    for f in files:
        print(str(f))
"""
D:\Pyfilesearch\A5000 Banana.txt
D:\Pyfilesearch\B8000 Orange.txt
D:\Pyfilesearch\C3000 Lemon.txt
"""

Pandas raising an error when setting the output directory

An OSError is raised when I try to set the output directory and write a prefix such as 'cal_' or 'edit_' in front of "i". If I add a postfix like df.to_csv(i + '_edit.csv'), the result is "filename.csv_edit".
So the files are saved in the input directory and I can't add any prefix or postfix. How can I fix this error?
import pandas as pd
import glob

PathIn = r'C:\Users\input'
PathOut = r'C:\Users\output'
filenames = glob.glob(PathIn + "/*.csv")
file_list = []
for i in filenames:
    df = pd.read_csv(i)
    file_list.append(df)
    df.columns = df.columns.str.replace('[', '')
    df.columns = df.columns.str.replace(']', '')
    df.to_csv(i + '.csv')
Try this one; it should work. It has the full code you want.
import os
import pandas as pd

PathIn = r'C:\Users\input'
PathOut = r'C:\Users\output'
file_list = []
for name in os.listdir(PathIn):
    if name.endswith(".csv"):
        df = pd.read_csv(os.path.join(PathIn, name))  # os.path.join instead of manual backslash concatenation
        file_list.append(df)
        df.columns = df.columns.str.replace('[', '', regex=False)
        df.columns = df.columns.str.replace(']', '', regex=False)
        df.to_csv(os.path.join(PathOut, name))  # join with the output directory
The value of i in filenames is the absolute path of the CSV file you are reading.
So if you have 3 CSV files in your input directory, your filenames list will look like this:
['C:\\Users\\input\\file1.csv',
 'C:\\Users\\input\\file2.csv',
 'C:\\Users\\input\\file3.csv']
Now you are trying to add a prefix in front of the elements of the above list, which does not produce a valid path.
You need to take just the filename of the input file and join it with PathOut so that a valid path results.
You can fetch the filenames in any directory as below:
filenames = []
for entry in os.listdir(PathIn):
    if os.path.isfile(os.path.join(PathIn, entry)) and ".csv" in entry:
        filenames.append(entry)
Now you can iterate over this list and do the operations you were doing. For saving the final df in the output directory, join the filename with PathOut, as in the sketch below.
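A minimal sketch of that final step (the 'edit_' prefix is just the one mentioned in the question):

import os
import pandas as pd

for name in filenames:  # bare names collected above
    df = pd.read_csv(os.path.join(PathIn, name))
    df.to_csv(os.path.join(PathOut, 'edit_' + name), index=False)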

Reading in a list of files into a list of DataFrames

I'm trying to read a list of files into a list of Pandas DataFrames in Python. However, the code below doesn't work.
files = [file1, file2, file3]
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
dfs = [df1, df2, df3]

# Read in data files
for file, df in zip(files, dfs):
    if file_exists(file):
        with open(file, 'rb') as in_file:
            df = pd.read_csv(in_file, low_memory=False)
            print(df)  # the file is getting read properly

print(df1)  # empty
print(df2)  # empty
print(df3)  # empty
How do I get the original DataFrames to update if I pass them into a for-loop as a list of DataFrames?
Try this:
dfs = [pd.read_csv(f, low_memory=False) for f in files]
If you want to check whether the file exists:
import os
dfs = [pd.read_csv(f, low_memory=False) for f in files if os.path.isfile(f)]
and if you want to concatenate all of them into one data frame:
df = pd.concat([pd.read_csv(f, low_memory=False)
                for f in files if os.path.isfile(f)],
               ignore_index=True)
Reassigning the loop variable df inside the loop only rebinds that name to a new object; you are never operating on the list itself.
You need to set the elements in the list (or append them). One possibility:
files = [file1, file2, file3]
dfs = [None] * 3  # Just a placeholder

# Read in data files
for i, file in enumerate(files):  # Enumeration instead of zip
    if file_exists(file):
        with open(file, 'rb') as in_file:
            dfs[i] = pd.read_csv(in_file, low_memory=False)  # Setting the list element
            print(dfs[i])  # the file is getting read properly
This updates the list elements and should work.
Your code seems overcomplicated; you can just do:
files = [file1, file2, file3]
dfs = []

# Read in data files
for file in files:
    if file_exists(file):
        dfs.append(pd.read_csv(file, low_memory=False))
You will end up with a list of DataFrames, as desired.
You can try a list comprehension:
files = [file1, file2, file3]
dfs = [pd.read_csv(x, low_memory=False) for x in files if file_exists(x)]
Here is a custom-written Python function that appropriately handles both CSV and JSON files.
def generate_list_of_dfs(incoming_files):
    """
    Accepts a list of csv and json file/path names.
    Returns a list of DataFrames.
    """
    outgoing_files = []
    for filename in incoming_files:
        file_extension = filename.split('.')[-1]  # [-1] rather than [1], so extra dots don't break it
        if file_extension == 'json':
            with open(filename, mode='r') as incoming_file:
                outgoing_json = pd.DataFrame(json.load(incoming_file))
                outgoing_files.append(outgoing_json)
        if file_extension == 'csv':
            outgoing_csv = pd.read_csv(filename)
            outgoing_files.append(outgoing_csv)
    return outgoing_files
How to Call this Function
import pandas as pd
import json
files_to_be_read = ['filename1.json', 'filename2.csv', 'filename3.json', 'filename4.csv']
dataframes_list = generate_list_of_dfs(files_to_be_read)
Here is a simple solution that avoids holding all the data frames in a list, if you don't actually need them in one.
import fnmatch
import os

# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files
Output, which is now a list of the names:
['Feedback Form Submissions 1.21-1.25.22.csv',
'Feedback Form Submissions 1.21.22.csv',
'Feedback Form Submissions 1.25-1.31.22.csv']
Now create a simple list of new names to make working with them easier:
# use a simple format
names = []
for i in range(len(files)):
    names.append('data' + str(i))
names
['data0', 'data1', 'data2']
You can use any list of names that you want. The next step takes the file names and the list of new names and assigns each dataframe to its name.
# i indexes into the list of names
i = 0
# iterate through the file names
for file in files:
    # load the file into a dataframe
    df = pd.read_csv(file, low_memory=False)
    # get the matching name from the list; this is a string
    new_name = names[i]
    # assign the dataframe to a variable with that name
    locals()[new_name] = df.copy()
    # move to the next name
    i = i + 1
You now have three separate dataframes named data0, data1, and data2, and can run commands like
data2.info()
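Note that writing into locals() is fragile; inside a function it is not guaranteed to create real variables. A dictionary keyed by the same names is a safer variant of the idea (a sketch, assuming the files list from above):

import fnmatch
import os
import pandas as pd

files = fnmatch.filter(os.listdir('.'), '*.csv')
# map 'data0', 'data1', ... to their dataframes
frames = {f'data{i}': pd.read_csv(f, low_memory=False)
          for i, f in enumerate(files)}
frames['data2'].info()  # same lookup as data2.info(), assuming at least three files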
