Loop through a folder and find missing files - Python

I am trying to loop through a folder for a given date. There will be 3-4 files in the folder every day, and I have to check whether each expected file is present for that date. I cannot hardcode the file names because they contain a sequence number and can vary.
I got the snippet below from Stack Overflow, and it works if my filename is consistent.
import glob

source = './'
date = "20221102"
paths = ["file1_" + date + "*.json", "file2_" + date + "*.json", "file3_" + date + "*.json"]

for path in paths:
    print(f"looking for {path} with {source + '**/' + path}")
    print(glob.glob(source + "**/" + path, recursive=True))

mp = [path for path in paths if not glob.glob(source + "**/" + path, recursive=True)]
for nl in mp:
    print(f'{nl}... is missing')
I would just like to loop over all the files in a folder and raise an exception if any one of them is missing.
Can anyone please help?

Here is how I handle the issue. I grab a list of files from the path that contain the date and file type. I then split on any possible delimiter that isn't alphanumeric to create a list of lists, and flatten that into a single list.
Finally, I compare the lists for matches and return a boolean.
import re
from pathlib import Path

def flatten_nested(nested) -> list:
    """Flatten arbitrarily nested lists/tuples/sets into a single flat list."""
    flattened = []
    for x in nested:
        if isinstance(x, (list, tuple, set)):
            flattened.extend(flatten_nested(x))
        else:
            flattened.append(x)
    return flattened

def find_files(file_path: str, file_type: str, value: str) -> list:
    # collect the name tokens of every file whose name contains `value` and has the given extension
    files = Path(file_path).rglob(f"*{value}*.{file_type}")
    delimiter = r"(?<=[^\W_])[\W_]+(?=[^\W_])"
    return flatten_nested([re.sub(delimiter, " ", x.stem).split(" ") for x in files if x.is_file()])
source = "./"
to_check = ["file1", "file2", "file3"]
check = all(x in find_files(source, "json", "20221102") for x in to_check)
print(check)
True
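To address the original ask directly (raise an exception when any expected file for the date is missing), a minimal sketch along the lines of the question's glob snippet could look like the following; the prefixes, the date, and the JSON extension are assumptions taken from the question:

import glob

source = "./"
date = "20221102"
expected_prefixes = ["file1_", "file2_", "file3_"]

missing = [prefix for prefix in expected_prefixes
           if not glob.glob(f"{source}**/{prefix}{date}*.json", recursive=True)]

if missing:
    # raise so that a scheduler or calling job can flag the failed day
    raise FileNotFoundError(f"Missing files for {date}: {', '.join(missing)}")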

Related

Save part of a file name in separate pandas df columns

I would like to save different positions of a file name in different pandas DataFrame columns.
For example my file names look like this:
001015io.png
position 0-2 in column 'y position' in this case '001'
position 3-5 in column 'x position' in this case '015'
position 6-7 in column 'status' in this case 'io'
My folder contains about 400 of these picture files. I'm a beginner in programming, so I don't know how I should start to solve this.
If the parts of the file names that you need are consistent (same position and length in all files), you can use string slicing to create new columns from the pieces of the file name like this:
import pandas as pd
df = pd.DataFrame({'file_name': ['001015io.png']})
df['y position'] = df['file_name'].str[0:3]
df['x position'] = df['file_name'].str[3:6]
df['status'] = df['file_name'].str[6:8]
This results in the dataframe:
file_name y position x position status
0 001015io.png 001 015 io
Note that when you slice a string you give a start position and a stop position like [0:3]. The start position is inclusive, but the stop position is not, so [0:3] gives you the substring from 0-2.
You can do this with slicing. A string is basically a sequence of characters, so you can slice the string into the parts you need. See the example below.
filename = '001015io.png'
y = filename[0:3]        # positions 0-2 -> 'y position'
x = filename[3:6]        # positions 3-5 -> 'x position'
status = filename[6:8]   # positions 6-7 -> 'status'
print(y, x, status)
output
001 015 io
As for getting the list of files, there's an absurdly complete answer for that here.
I have this function below in my personal library which I reuse whenever I need to generate a list of files.
import os

def get_files_from_path(path: str = ".", ext=None) -> list:
    """Find files in path and return them as a list.

    Gets all files in folders and subfolders; note that it also
    goes into subdirs of the path. See the answer on the link below
    for a ridiculously complete answer for this. I tend to use this one.
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on. Defaults to '.'.
        ext (str/list, optional): Optional file extension. Defaults to None.

    Returns:
        list: list of full file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = f"{subdir}{os.sep}{fname}"
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result
There's one thing you need to take into account here: this function returns the full filepath, e.g. path/to/file/001015io.png
You can use the code below to get just the filename:
import os
print(os.path.basename('path/to/file/001015io.png'))
output
001015io.png
Use what Bill the Lizard said to turn it into a df
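Putting the two answers together, a rough sketch of building the DataFrame from the discovered file names (using the get_files_from_path function above; the folder path is hypothetical) might look like this:

import os
import pandas as pd

# file_paths = get_files_from_path("path/to/pictures", ext=".png")  # hypothetical folder, function from above
file_paths = ["path/to/file/001015io.png"]  # stand-in list so this sketch runs on its own

df = pd.DataFrame({"file_name": [os.path.basename(p) for p in file_paths]})
df["y position"] = df["file_name"].str[0:3]
df["x position"] = df["file_name"].str[3:6]
df["status"] = df["file_name"].str[6:8]
print(df)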

Finding all subfolders that contain two files that end with certain strings

So I have a folder, say D:\Tree, that contains only subfolders (names may contain spaces). These subfolders contain a few files - and they may contain files of the form "D:\Tree\SubfolderName\SubfolderName_One.txt" and "D:\Tree\SubfolderName\SubfolderName_Two.txt" (in other words, a subfolder may contain both of them, one, or neither). I need to find every occurrence where a subfolder contains both of these files, and send their absolute paths to a text file (in a format explained in the following example). Consider these three subfolders in D:\Tree:
D:\Tree\Grass contains Grass_One.txt and Grass_Two.txt
D:\Tree\Leaf contains Leaf_One.txt
D:\Tree\Branch contains Branch_One.txt and Branch_Two.txt
Given this structure and the problem mentioned above, I'd like to be able to write the following lines in myfile.txt:
D:\Tree\Grass\Grass_One.txt D:\Tree\Grass\Grass_Two.txt
D:\Tree\Branch\Branch_One.txt D:\Tree\Branch\Branch_Two.txt
How might this be done? Thanks in advance for any help!
Note: It is very important that "file_One.txt" comes before "file_Two.txt" in myfile.txt
import os

folderPath = r'Your Folder Path'
for (dirPath, allDirNames, allFileNames) in os.walk(folderPath):
    for fileName in allFileNames:
        if fileName.endswith("One.txt") or fileName.endswith("Two.txt"):
            print(os.path.join(dirPath, fileName))
            # Or write the path to a file here, as your task requires
Hope this helps....
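The snippet above only prints matches individually. A minimal sketch of one way to pair the *_One.txt and *_Two.txt files per subfolder and write them to myfile.txt on one line, with One before Two as required, could be:

import os

folderPath = r'D:\Tree'  # the folder from the question
with open('myfile.txt', 'w') as out:
    for dirPath, dirNames, fileNames in os.walk(folderPath):
        ones = [f for f in fileNames if f.endswith('_One.txt')]
        twos = [f for f in fileNames if f.endswith('_Two.txt')]
        if ones and twos:  # only subfolders that contain both files
            out.write(os.path.join(dirPath, ones[0]) + ' ' + os.path.join(dirPath, twos[0]) + '\n')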
Here is a recursive solution
import os
import sys

def findFiles(writable, current_path, ending1, ending2):
    '''
    :param writable: file to write output to
    :param current_path: current path of the recursive traversal of subfolders
    :param ending1: first ending that a file name must match
    :param ending2: second ending that a file name must match
    :return: None
    '''
    # check if current path is a folder or not
    try:
        flist = os.listdir(current_path)
    except NotADirectoryError:
        return
    # stores files which match the given endings
    ending1_files = []
    ending2_files = []
    for dirname in flist:
        if dirname.endswith(ending1):
            ending1_files.append(dirname)
        elif dirname.endswith(ending2):
            ending2_files.append(dirname)
        findFiles(writable, current_path + '/' + dirname, ending1, ending2)
    # see if exactly one file matches each ending
    if len(ending1_files) == 1 and len(ending2_files) == 1:
        writable.write(current_path + '/' + ending1_files[0] + ' ')
        writable.write(current_path + '/' + ending2_files[0] + '\n')

findFiles(sys.stdout, 'G:/testf', 'one.txt', 'two.txt')

Python: Search for Substring Inside List Using Unique Values

I have two lists, both containing file paths to PDFs. The first list contains PDFs that have unique file names. The second list contains file names built from those same unique names, which need to be matched up to the first list; there may be multiple PDFs in the second list that match a single PDF in the first. It is a one-to-many relationship from List A to List B. Below is an example.
List A: C:\FolderA\A.pdf, C:\FolderA\B.pdf, C:\FolderA\C.pdf
List B: C:\FolderB\A_1.pdf, C:\FolderB\B_1.pdf, C:\FolderB\C_1.pdf, C:\FolderB\C_2.pdf
I need to find a way to iterate through both lists and combine the PDFs by matching the unique filename. If I can find a way to iterate and match the files, then I think I can combine the PDFs on my own. Below is the code I have so far.
import os

folderA = r'C:\FolderA'
ListA = []
for root, dirs, filenames in os.walk(folderA):
    for filename in filenames:
        ListA.append(str(filename))
        filepath = os.path.join(root, filename)
        ListA.append(str(filepath))

folderB = r'C:\FolderB'
ListB = []
for root, dirs, filenames in os.walk(folderB):
    for filename in filenames:
        filepath = os.path.join(root, filename)
        ListB.append(str(filepath))

# Split ListB to the file name only, without the "_#", so it can be matched to the PDFs in ListA.
for pdfValue in ListB:
    pdfsplit = pdfValue.split(".")[0]
    pdfsplit1 = pdfsplit.split("\\")[-1]
    pdfsplit2 = pdfsplit1.rsplit("_", 1)[0]
    for pdfValue2 in ListA:
        if pdfsplit2 in ListA:
            pass  # combine PDF code
I have verified that everything works up to the last if statement. From there I am not sure how to proceed. I know how to search for a substring within a string, but I cannot get it to work correctly with a list. No matter how I code it, I either end up in an endless loop or it does not successfully match.
Any ideas on how to make this work, if it is possible?
It would be better to gather all the information together in one data structure, rather than in separate lists. That should allow you to reduce your code to a single function.
Completely untested, but something like this should work.
import os
from collections import defaultdict

pdfs = defaultdict(lambda: defaultdict(list))

def find_pdfs(pdfs, folder, split=False):
    for root, dirs, filenames in os.walk(folder):
        for filename in filenames:
            basename, ext = os.path.splitext(filename)
            if ext == '.pdf':
                if split:
                    basename = basename.partition('_')[0]
                pdfs[basename][root].append(filename)

find_pdfs(pdfs, folderA)
find_pdfs(pdfs, folderB, True)
This should produce a data structure like this:
pdfs = {
    'A': {'C:\FolderA': ['A.pdf'],
          'C:\FolderB': ['A_1.pdf']},
    'B': {'C:\FolderA': ['B.pdf'],
          'C:\FolderB': ['B_1.pdf']},
    'C': {'C:\FolderA': ['C.pdf'],
          'C:\FolderB': ['C_1.pdf', 'C_2.pdf']},
}
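A short sketch of how one might walk that nested structure to pair each folder-A PDF with its folder-B matches (the folder paths are assumptions from the example):

for basename, folders in pdfs.items():
    originals = folders.get(r'C:\FolderA', [])  # assumed folder names from the example above
    matches = folders.get(r'C:\FolderB', [])
    if originals and matches:
        print(originals[0], '->', matches)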
I think what you want to do is create a collections.defaultdict and set it up to hold lists of matching names.
import collections
matching_files = collections.defaultdict(list)
You can then strip the filenames in folder B down to base names, and put the paths into the dict:
matching_files[pdfsplit2].append(pdfValue)
Now you have a list of pdf files from folder B, grouped by base name. Go back to folder A and do the same thing (split off the path and extension, use that for the key, add the full path to the list). You'll end up with lists of files that share a common base name.
for key, file_list in matching_files.items():  # use .iteritems() for py-2.x
    print("Files with base name '%s':" % key)
    print(' ', '\n '.join(file_list))
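A fuller sketch of that idea, filling in the folder-A pass described above (the folder paths and the "_#" splitting rule are assumptions taken from the question):

import os
import collections

folderA = r'C:\FolderA'   # paths from the question's example
folderB = r'C:\FolderB'

matching_files = collections.defaultdict(list)

# folder A: key on the bare file name without the extension
for root, dirs, filenames in os.walk(folderA):
    for filename in filenames:
        base = os.path.splitext(filename)[0]
        matching_files[base].append(os.path.join(root, filename))

# folder B: also strip the trailing "_#" before using the name as the key
for root, dirs, filenames in os.walk(folderB):
    for filename in filenames:
        base = os.path.splitext(filename)[0].rsplit("_", 1)[0]
        matching_files[base].append(os.path.join(root, filename))

for key, file_list in matching_files.items():
    print(f"Files with base name '{key}':")
    print('  ' + '\n  '.join(file_list))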
To compare the two file names, rather than splitting on the '_', you could try the str.startswith() method:
A.startswith(B) returns True if string A begins with string B.
In your case, your code would be:
match = {}  # the dictionary where you will store the matching names
for pdfValue in ListA:
    match[pdfValue] = []  # create an entry in the dictionary keyed on the folder-A file
    A = pdfValue.split("\\")[-1].rsplit(".", 1)[0]  # just the filename part, without the extension
    for pdfValue2 in ListB:
        B = pdfValue2.split("\\")[-1]
        if B.startswith(A):  # then B shares the same unique file name as A
            match[pdfValue].append(pdfValue2)  # so associate it with A in the dictionary
I hope it works for you
One more solution
lista = ['C:\FolderA\A.pdf', 'C:\FolderA\B.pdf', 'C:\FolderA\C.pdf']
listb = ['C:\FolderB\A_1.pdf', 'C:\FolderB\B_1.pdf', 'C:\FolderB\C_1.pdf', 'C:\FolderB\C_2.pdf']
# get the filenames for folder a and folder b
lista_filenames = [l.split('\\')[-1].split('.')[0] for l in lista]
listb_filenames = [l.split('\\')[-1].split('.')[0] for l in listb]
# create a dictionary to store lists of mappings
from collections import defaultdict
data_structure = defaultdict(list)
for i in lista_filenames:
    for j in listb_filenames:
        if i in j:
            data_structure['C:\\FolderA\\' + i + '.pdf'].append('C:\\FolderB\\' + j + '.pdf')
# this is how the mapping dictionary looks like
print data_structure
results in :
defaultdict(<type 'list'>, {'C:\\FolderA\\C.pdf': ['C:\\FolderB\\C_1.pdf', 'C:\\FolderB\\C_2.pdf'], 'C:\\FolderA\\A.pdf': ['C:\\FolderB\\A_1.pdf'], 'C:\\FolderA\\B.pdf': ['C:\\FolderB\\B_1.pdf']})

I have a list of partial filenames; for each one I want to go through the files in a directory, find the file name that matches that part, and return it

So, let's say I have a directory with a bunch of filenames.
for example:
Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a
AP Physics C - Dot Product-Wvhn_lVPiw0.m4a
An Introduction to the Dot Product-X5DifJW0zek.m4a
Now let's say I have a list of only the keys, which are at the end of the file names:
['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
How can I iterate through my list to go into that directory and search for a file name containing that key, and then return me the filename?
Any help is greatly appreciated!
I thought about it, I think I was making it harder than I had to with regex. Sorry about not trying it first. I have done it this way:
audio = ['Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a',
'An Introduction to the Dot Product-X5DifJW0zek.m4a',
'AP Physics C - Dot Product-Wvhn_lVPiw0.m4a']
keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
file_names = []
for Id in keys:
    for name in audio:
        if Id in name:
            file_names.append(name)
combined = zip(keys,file_names)
combined
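To run the same matching against an actual directory rather than a hardcoded list, os.listdir can supply the names; this is a sketch, and the directory path is an assumption:

import os

keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
directory = '/path/to/audio'  # hypothetical folder containing the .m4a files

file_names = {}
for name in os.listdir(directory):
    for key in keys:
        if key in name:
            file_names[key] = name  # map each key to the filename that contains it

print(file_names)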
Here is an example:
ls: list of files in a given directory
names: list of strings to search for
import os

ls = os.listdir("/any/folder")
names = ['Py', 'sql']
for file in ls:
    for name in names:
        if name in file:
            print(file)
Results:
.PyCharm50
.mysql_history
zabbix2.sql
.mysql
PycharmProjects
zabbix.sql
Assuming you know which directory that you will be looking in, you could try something like this:
import os

to_find = ['word 1', 'word 2']  # list containing words that you are searching for
all_files = os.listdir('/path/to/file')  # creates a list with files from the given directory

for file in all_files:  # loops through all files in the directory
    for word in to_find:  # loops through target words
        if word in file:
            print(file)  # prints the file name if the target word is found
I tested this in my directory which contained these files:
Helper_File.py
forms.py
runserver.py
static
app.py
templates
... and I set to_find to ['runserver', 'static']...
and when I ran this code it returned:
runserver.py
static
For future reference, you should make at least some sort of attempt at solving a problem prior to posting a question on Stackoverflow. It's not common for people to assist you like this if you can't provide proof of an attempt.
Here's a way to do it that allows you to select whether to match based on the placement of the text.
import os

def scan(dir, match_key, bias=2):
    '''
    bias selects where match_key must appear in the file name:
      0: startswith
      1: contains
      2: endswith
    '''
    matches = []
    if not isinstance(match_key, (tuple, list)):
        match_key = [match_key]
    if os.path.exists(dir):
        for file in os.listdir(dir):
            for match in match_key:
                if file.startswith(match) and bias == 0 or file.endswith(match) and bias == 2 or match in file and bias == 1:
                    matches.append(file)
                    continue
    return matches

print(scan(os.curdir, '.py'))

Extracting "unsigned files" from a directory

I have a directory with xml files associated with encrypted P7M files, meaning that for every name.xml there should be a name.P7M. But there are some exceptions (the P7M file is absent), and my goal is to detect them using Python.
I'm thinking of the code below. Can you help me make it more elegant?
import re
import glob

# functions to eliminate the extension name
def is_xml(x):
    a = re.search(r"(\s)(.xml)", x)
    if a:
        return a.group(0)
    else:
        return False

def is_P7M(x):
    a = re.search(r"(\s)(.P7M)", x)
    if a:
        return a.group(0)
    else:
        return False

# putting xml files and P7M files in two sets
setA = set(glob.glob('directory/*.xml'))
setB = set(glob.glob('directory/*.P7M'))

# eliminating extension names
for elt in setA:
    elt = is_xml(elt)
for elt in setB:
    elt = is_P7M(elt)

# difference between the two sets. setB is always the larger set
print("unsigned files are:", setB.difference(setA))
A simpler way is to glob for the .xml files, then check using os.path.exists for a .P7M file:
import os, glob

for xmlfile in glob.glob('*.xml'):
    if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M"):
        print(xmlfile, "is unsigned")
This code:
Uses glob.glob to get all the xml files.
Uses str.rsplit to split the filename up into name and extension (e.g. "name.xml" to ("name", ".xml")). The second argument stops str.rsplit splitting more than once.
Takes the name of the file and adds the .P7M extension.
Uses os.path.exists to see if the .P7M file is there. If it isn't, the xmlfile is unsigned, so print it out.
If you need them in a list, you can do:
unsigned = [xmlfile for xmlfile in glob.glob('*.xml') if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M")]
Or a set:
unsigned = {xmlfile for xmlfile in glob.glob('*.xml') if not os.path.exists(xmlfile.rsplit(".", 1)[0] + ".P7M")}
My solution would be:
import glob
import os

get_name = lambda fname: os.path.splitext(fname)[0]

xml_names = {get_name(fname) for fname in glob.glob('directory/*.xml')}
p7m_names = {get_name(fname) for fname in glob.glob('directory/*.p7m')}

unsigned = [xml_name + ".xml" for xml_name in xml_names.difference(p7m_names)]
print(unsigned)
Get all the xml files into a dict, removing the extension and using the name as the key, with the value initially set to False. If we find a matching P7M, set the value to True; finally, print all keys whose value is still False.
import glob

xmls = glob.glob('directory/*.xml')
p7ms = glob.glob('directory/*.P7M')

# use xml file names as keys by removing the extension
d = {k.rsplit(".", 1)[0]: False for k in xmls}

# go over every .P7M, again removing the extension,
# setting the value to True for every match
for k in p7ms:
    name = k.rsplit(".", 1)[0]
    if name in d:
        d[name] = True

# any value that is still False means there is no .P7M match for the xml file
for k, v in d.items():
    if not v:
        print(k)
Or create a set of each and find the difference:
xmls = {x.rsplit(".", 1)[0] for x in glob.glob('directory/*.xml')}
p7ms = {x.rsplit(".", 1)[0] for x in glob.glob('directory/*.P7M')}
print(xmls - p7ms)
Iterate over glob once and populate a dict of filenames by extension. Finally, compute the difference between 'xml' and 'P7M' sets.
import os, glob, collections

fnames = collections.defaultdict(set)
for fname in glob.glob('*'):
    f, e = os.path.splitext(fname)
    fnames[e].add(f)

print(fnames['.xml'] - fnames['.P7M'])
Note that unlike other suggestions, this makes one single request to the filesystem, which might be important if the FS is slow (e.g. a network mount).
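For completeness, here is a pathlib-based sketch of the same single-pass idea (not from the original answers; 'directory' is the folder name used in the question):

from collections import defaultdict
from pathlib import Path

stems_by_ext = defaultdict(set)
for p in Path('directory').iterdir():
    stems_by_ext[p.suffix].add(p.stem)  # one directory listing, grouped by extension

print(stems_by_ext['.xml'] - stems_by_ext['.P7M'])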
