I have nested for loops which are causing the execution of my operation to be incredibly slow. I wanted to know if there is another way to do this.
The operation is basically going through files in 6 different directories and seeing if there is a file in each directory that is the same before opening each file up and then displaying them.
My code is:
original_images = os.listdir(original_folder)
ground_truth_images = os.listdir(ground_truth_folder)
randomforest_images = os.listdir(randomforest)
ilastik_images = os.listdir(ilastik)
kmeans_images = os.listdir(kmeans)
logreg_multi_images = os.listdir(logreg_multi)
random_forest_multi_images = os.listdir(randomforest_multi)
for x in original_images:
    for y in ground_truth_images:
        for z in randomforest_images:
            for i in ilastik_images:
                for j in kmeans_images:
                    for t in logreg_multi_images:
                        for w in random_forest_multi_images:
                            if x == y == z == i == j == w == t:
                                *** rest of code operation ***
If the condition is that the same file must be present in all seven directories to run the rest of the code operation, then it's not necessary to search for the same file in all directories. As soon as the file is not in one of the directories, you can forget about it and move to the next file. So you can build a for loop looping through the files in the first directory and then build a chain of nested if statements: If the file exists in the next directory, you move forward to the directory after that and search there. If it doesn't, you move back to the first directory and pick the next file in it.
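The early-exit idea described above can be sketched roughly as follows. This is only an illustration: the helper name files_in_every_folder and the throwaway directories and file names are assumptions, not part of the original code. It relies on all() stopping at the first folder where the file is missing.

```python
import os
import tempfile


def files_in_every_folder(first_folder, other_folders):
    """Yield names from first_folder that also exist in every other folder."""
    for name in sorted(os.listdir(first_folder)):
        # all() short-circuits: it stops at the first folder lacking the file
        if all(os.path.isfile(os.path.join(folder, name)) for folder in other_folders):
            yield name


# Small self-check using temporary directories (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    folders = [os.path.join(root, d) for d in ("a", "b", "c")]
    for folder in folders:
        os.mkdir(folder)
    contents = (["1.png", "2.png"], ["1.png"], ["1.png", "3.png"])
    for folder, names in zip(folders, contents):
        for name in names:
            open(os.path.join(folder, name), "w").close()
    common = list(files_in_every_folder(folders[0], folders[1:]))
```

Here only 1.png is present in all three folders, so it is the only name that survives.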
Convert all of them to sets and iterate through the last one, checking membership for all of the others:
original_images = os.listdir(original_folder)
ground_truth_images = os.listdir(ground_truth_folder)
randomforest_images = os.listdir(randomforest)
ilastik_images = os.listdir(ilastik)
kmeans_images = os.listdir(kmeans)
logreg_multi_images = os.listdir(logreg_multi)
# start with the names in the last folder
common = set(os.listdir(randomforest_multi))
# keep only the names that also appear in every other folder
for folder in [original_images, ground_truth_images, randomforest_images, ilastik_images, kmeans_images, logreg_multi_images]:
    common.intersection_update(folder)
# iterate over the files common to all folders
for file in common:
    # rest of code
The reason this works is that you are only interested in the intersection of all sets, so you only need to iterate over one set and check for membership in the rest.
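As a minimal sketch of the intersection idea, assuming the directory listings are already available as plain lists (the helper name common_names and the sample names are made up for illustration):

```python
def common_names(*listings):
    """Return the sorted names that appear in every listing."""
    if not listings:
        return []
    first, *rest = listings
    # set.intersection accepts any number of iterables
    return sorted(set(first).intersection(*rest))


listings = [
    ["img1.png", "img2.png", "img3.png"],
    ["img2.png", "img1.png"],
    ["img1.png", "img4.png"],
]
shared = common_names(*listings)
```

With real folders, each listing would come from os.listdir(folder).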
You should check x == y before going into the next loop, then y == z, and so on. As written, you run each inner loop far more often than necessary.
There is also another approach:
You can create a set of all your images and create an intersection over each set so the only elements which will remain are the ones that are equal. If you are sure that the files are the same you can skip that step.
If x is present in every other list, you can build the paths on the fly:
import os
import pathlib
original_images = os.listdir(original_folder)
original_path = pathlib.Path(original_folder)
ground_truth_images = pathlib.Path(ground_truth_folder)  # this is a folder
randomforest_images = pathlib.Path(randomforest)
for name in original_images:
    x = original_path / name
    y = ground_truth_images / name
    i = randomforest_images / name
    # and so on for j, t, w in the remaining folders
    # skip this name unless the file exists in every folder
    if not all(p.exists() for p in [x, y, i, j, t, w]):
        continue  # go to the next name
    # REST OF YOUR CODE USING x, y, i, j, t, w
    # x, y, i, j, t, w are now pathlib objects; str(y), str(i), etc. give the path as a string
Folks,
I'm trying to optimize this to help speed up the process...
What I am doing is creating a dictionary of scandir entries...
e.g.
fs_data = {}
for item in Path(fqpn).iterdir():
# snipped out a bunch of normalization code
fs_data[item.name.title().strip()] = item
{'file1': <file1 scandisk data>, etc}
and then later using a function to gather the count of files, and directories in the data.
Now I suspect that the new code, using map, could be optimized to be faster than the old code. I suspect the cost comes from having to run the comprehension twice, once for files and once for directories.
But I can't think of a way to optimize it to only have to run once.
Can anyone suggest a way to sum the files, and directories at the same time in the new version? (I could fall back to the old code, if necessary)
But I might be over optimizing at this point?
Any feedback would be welcome.
def new_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a dict of scandir entries
    Used primarily by sync_database_disk to count a path's files & directories
    Parameters
    ----------
    fs_entries (dict) - mapping of name to scandir entry
    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    def counter(fs_entry):
        return (fs_entry.is_file(), not fs_entry.is_file())
    mapdata = list(map(counter, fs_entries.values()))
    files = sum(files for files, _ in mapdata)
    dirs = sum(dirs for _, dirs in mapdata)
    return (files, dirs)
vs
def old_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a dict of scandir entries
    Used primarily by sync_database_disk to count a path's files & directories
    Parameters
    ----------
    fs_entries (dict) - mapping of name to scandir entry
    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    files = 0
    dirs = 0
    for fs_item in fs_entries:
        is_file = fs_entries[fs_item].is_file()
        files += is_file
        dirs += not is_file
    return (files, dirs)
map is fast here if you map the is_file function directly:
files = sum(map(os.DirEntry.is_file, fs_entries.values()))
dirs = len(fs_entries) - files
(Something with filter might be even faster, at least if most entries aren't files. Or filter with is_dir if that works for you and most entries aren't directories. Or itertools.filterfalse with is_file. Or using itertools.compress. Also, counting True with list.count or operator.countOf instead of summing bools might be faster. But all of these ideas take more code (and some also memory). I'd prefer my above way.)
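For reference, here is a self-contained sketch of the direct-map approach, run against a temporary directory. The function name count_files_and_dirs and the directory contents are assumptions made for the example; like the snippet above, it treats every entry that is not a file as a directory.

```python
import os
import tempfile


def count_files_and_dirs(fs_entries):
    """Count files and non-files in a dict of name -> os.DirEntry."""
    # os.DirEntry.is_file is mapped directly; True counts as 1 in sum()
    files = sum(map(os.DirEntry.is_file, fs_entries.values()))
    return files, len(fs_entries) - files


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for name in ("a.txt", "b.txt"):
        open(os.path.join(root, name), "w").close()
    with os.scandir(root) as entries:
        fs_data = {entry.name: entry for entry in entries}
    counts = count_files_and_dirs(fs_data)
```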
Okay, map is definitely not the right answer here.
This morning I got up and created a test using timeit...
and it was a bit of a splash of reality to the face.
Without optimizations, new vs old, the new map code was roughly 2x the time.
New : 0.023185124970041215
old : 0.011841499945148826
I really ended up falling for a bit of click bait, and thought that rewriting with MAP would gain some better efficiency.
For the sake of completeness.
from timeit import timeit
import os
new = '''
def counter(fs_entry):
    files = fs_entry.is_file()
    return (files, not files)
mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
#dirs = len(fs_entries)-files
'''
#dirs = sum(dirs for _, dirs in mapdata)
old = '''
files = 0
dirs = 0
for fs_item in fs_entries:
    is_file = fs_entries[fs_item].is_file()
    files += is_file
    dirs += not is_file
'''
fs_location = '/Volumes/4TB_Drive/gallery/albums/collection1'
fs_data = {}
for item in os.scandir(fs_location):
    fs_data[item.name] = item
print("New : ", timeit(stmt=new, number=1000, globals={'fs_entries':fs_data}))
print("old : ", timeit(stmt=old, number=1000, globals={'fs_entries':fs_data}))
And while I was able to close the gap with some optimizations... (Thank you, Lee, for your suggestion)
New : 0.10864979098550975
old : 0.08246175001841038
It is clear that the for loop solution is easier to read, faster, and just simpler.
The speed difference between new and old doesn't seem to come from map specifically.
The duplicate sum statement added .021, and the biggest slowdown came from the second fs_entry.is_file call, which added .06x to the timings...
I have the following code, and it never stops. It never evaluates the next condition once it has finished with exporting the file. What am I doing wrong?
I am working on Python 3.x and Windows 10.
for maindir, subdirs, shpfiles in os.walk(by_numSegments): # check in the whole folder
if "poly1000numSeg" in maindir: # check only in the input folder (segment_img)
if "compactness_1" in maindir:
for s, ishp in enumerate(shpfiles):
input_list = list(filter(lambda mpoly: mpoly.endswith('.shp'), os.listdir(maindir)))
# list with the first uploaded polygon. In the loop will the following polygons added
auto_inter = gpd.GeoDataFrame.from_file(os.path.join(maindir, input_list[0]))
# add the rest of the polygons one by one
for i in range(len(input_list)-1):
mp = gpd.GeoDataFrame.from_file(os.path.join(maindir, input_list[i+1]))
auto_inter = gpd.overlay(auto_inter, mp, how='intersection')
# export
auto_inter.to_file(os.path.join(src, "compactness_1/numSeg1000_c1.shp"))
if "compactness10" in maindir:
for s, ishp in enumerate(shpfiles):
input_list = list(filter(lambda mpoly: mpoly.endswith('.shp'), os.listdir(maindir)))
# list with the first uploaded polygon. In the loop will the following polygons added
auto_inter = gpd.GeoDataFrame.from_file(os.path.join(maindir, input_list[0]))
# add the rest of the polygons one by one
for i in range(len(input_list)-1):
mp = gpd.GeoDataFrame.from_file(os.path.join(maindir, input_list[i+1]))
auto_inter = gpd.overlay(auto_inter, mp, how='intersection')
# export
auto_inter.to_file(os.path.join(src, "compactness10/numSeg1000_c10.shp"))
I suspect src is the same folder you are iterating. You are adding files while iterating the file list.
for maindir, subdirs, shpfiles in os.walk(by_numSegments): # check in the whole folder
if "poly1000numSeg" in maindir: # check only in the input folder (segment_img)
if "compactness_1" in maindir:
for s, ishp in enumerate(shpfiles):
input_list = list(filter(lambda mpoly: mpoly.endswith('.shp'), os.listdir(maindir))) # get file list
......
for i in range(len(input_list)-1): # loop through list
.......
auto_inter.to_file(os.path.join(src, "compactness_1/numSeg1000_c1.shp")) # create new file
Try setting input_list before the loop:
for maindir, subdirs, shpfiles in os.walk(by_numSegments): # check in the whole folder
if "poly1000numSeg" in maindir: # check only in the input folder (segment_img)
if "compactness_1" in maindir:
input_list = list(filter(lambda mpoly: mpoly.endswith('.shp'), os.listdir(maindir))) # get file list
for s, ishp in enumerate(shpfiles):
......
for i in range(len(input_list)-1): # loop through list
.......
auto_inter.to_file(os.path.join(src, "compactness_1/numSeg1000_c1.shp")) # create new file
I have the following scenario when traversing a directory structure:
"Build the complete dir tree with files, but if the files in a single dir have similar names, list them as a single entity"
Example tree (let's assume the entries are not sorted):
- rootDir
    -dirA
        fileA_01
        fileA_03
        fileA_05
        fileA_06
        fileA_04
        fileA_02
        fileA_...
        fileAB
        fileAC
    -dirB
        fileBA
        fileBB
        fileBC
Expected output:
- rootDir
    -dirA
        fileA_01 - fileA_06 ...
        fileAB
        fileAC
    -dirB
        fileBA
        fileBB
        fileBC
I already wrote a simple function, findSimilarNames, that for fileA_01 (or any fileA_ file) returns the list [fileA_01...fileA_06].
Now I'm inside os.walk, looping over the files, so every file gets checked for similar filenames. For example, for fileA_03 I get the rest of them, [fileA_01 - fileA_06]. I now want to modify the list I'm iterating over so that it skips the items returned by findSimilarNames, without needing another loop or ifs inside.
I searched here and people suggest not modifying the list being iterated over, but doing so would let me avoid iterating over every file.
Pseudo code:
for root,dirs,files in os.walk( path ):
    for file in files:
        similarList = findSimilarNames( file )
        #OVERWRITE ITERATION LIST SOMEHOW
        files = (set(files)-set(similarList))
        #DEAL WITH ELEMENT
What I'm trying to avoid is below - checking each file because maybe it's already found by findSimilarNames.
for root,dirs,files in os.walk( path ):
    filteredbysimilar = files[:]
    for file in files:
        similar = findSimilarNames( file )
        filteredbysimilar = list(set(filteredbysimilar)-set(similar))
    #--
    for filteredFile in filteredbysimilar:
        #DEAL WITH ELEMENT
        #OVERWRITE ITERATION LIST SOMEHOW
You can get this effect by using a while-loop style iteration. Since you want to do set subtraction to remove the similar groups anyway, the natural approach is to start with a set of all the filenames, and repeatedly remove groups until nothing is left. Thus:
unprocessed = set(files)
while unprocessed:
    f = unprocessed.pop() # removes and returns an arbitrary element
    group = findSimilarNames(f)
    # difference_update also accepts a list; it is not an error that `f` has already been removed.
    unprocessed.difference_update(group)
    doSomethingWith(group) # i.e., "DEAL WITH ELEMENT" :)
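To make the loop above concrete, here is a runnable sketch. The question does not show findSimilarNames, so find_similar_names below is a hypothetical stand-in that groups names sharing the prefix before a trailing _<digits>; group_similar and the sample file names are likewise assumptions.

```python
import re


def find_similar_names(name, names):
    """Hypothetical stand-in: names sharing the prefix before a trailing _<digits>."""
    match = re.fullmatch(r"(.+)_\d+", name)
    if match is None:
        return {name}  # no numeric suffix: the name forms its own group
    pattern = re.compile(re.escape(match.group(1)) + r"_\d+")
    return {n for n in names if pattern.fullmatch(n)}


def group_similar(files):
    """Pop names until none are left, removing each whole group at once."""
    unprocessed = set(files)
    groups = []
    while unprocessed:
        name = unprocessed.pop()
        group = find_similar_names(name, files)
        unprocessed.difference_update(group)
        groups.append(sorted(group))
    return sorted(groups)


files = ["fileA_01", "fileA_03", "fileA_02", "fileAB", "fileAC"]
groups = group_similar(files)
```

Each file is looked up at most once per group, so find_similar_names runs three times here instead of five.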
How about building up a list of files that aren't similar?
unsimilar = set()
for f in files:
    if len(findSimilarNames(f).intersection(unsimilar))==0:
        unsimilar.add(f)
This assumes findSimilarNames yields a set.
What is the correct way to filter data inside a function? Should I compress everything as much as possible (search_query), or should I filter the list again every time there is a new argument that needs to be included (search_query2)? The more arguments I have, the more confused I become about how to deal with this problem. Example:
import os
query = ""
my_path = os.getcwd()
def search_query(query, path, extensions_only=False, case_sensitive=False):
    results = []
    if extensions_only is True:
        for f in os.listdir(path):
            if case_sensitive:
                if f.endswith(query):
                    results.append(os.path.join(path, f))
            else:
                if f.endswith(query):
                    results.append(os.path.join(path, f).lower())
    elif case_sensitive is not True:
        for f in os.listdir(path):
            if query.lower() in f.lower():
                results.append(os.path.join(path, f))
    return results
results = search_query("_c", my_path)
print(results)
# Alternative way to deal with this
def search_query2(query, path, extensions_only=False, case_sensitive=False):
    results = []
    for f in os.listdir(path):
        results.append(os.path.join(path, f))
    if extensions_only:
        filtered_lst = []
        for part in results:
            if part.endswith(query):
                filtered_lst.append(part)
        results = filtered_lst
    if case_sensitive:
        filtered_lst = []
        for part in results:
            if query in part:
                filtered_lst.append(part)
        results = filtered_lst
    elif not case_sensitive:
        filtered_lst = []
        for part in results:
            if query.lower() in part.lower():
                filtered_lst.append(part)
        results = filtered_lst
    print(results)
    return results
search_query2("pyc", my_path, case_sensitive=True)
There isn't a one-size-fits-all "correct" way to do things like this. Another option is making separate functions, or private sub-functions called by this one as a wrapper.
In your specific case there are ways of optimising what you want to do in order to make it more clear.
You do a lot of
x = []
for i in y:
    if cond(i):
        x.append(i)
y = x
This is known as a filter, and Python has a couple of ways of doing this in one line:
y = list(filter(cond, y)) # the 'functional' style
or
y = [i for i in y if cond(i)] # comprehension
which make things a lot clearer. There are similar things for mappings where you write:
x = []
for i in y:
    x.append(func(i))
y = x
# instead do:
y = list(map(func, y)) # functional
# or
y = [func(i) for i in y] # comprehension
We can also combine maps and filters:
x = list(map(func, filter(cond, y)))
x = [func(i) for i in y if cond(i)]
Using these, we can chain many filters and maps in a row whilst remaining very clear about what we are doing. This is one of the advantages of functional programming.
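A tiny runnable illustration of the equivalence, with made-up cond and func stand-ins (is_even and square are just placeholders):

```python
def is_even(n):
    """Placeholder for cond."""
    return n % 2 == 0


def square(n):
    """Placeholder for func."""
    return n * n


numbers = [1, 2, 3, 4, 5, 6]
# functional style: filter first, then map
functional = list(map(square, filter(is_even, numbers)))
# comprehension style: same filter and map in one expression
comprehension = [square(n) for n in numbers if is_even(n)]
```

Both forms produce the same list; which one reads better is largely a matter of taste.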
I've modified your code to use generator expressions which will only evaluate right at the end when we call list(results) saving a lot of wasted time making new lists each time:
def search_query2(query, path, extensions_only=False, case_sensitive=False):
    results = (os.path.join(path, f) for f in os.listdir(path))
    if extensions_only:
        results = (part for part in results if part.endswith(query))
    elif case_sensitive: # I'm pretty sure this is actually the logic you want
        results = (part for part in results if query in part)
    else:
        results = (part for part in results if query.lower() in part.lower())
    return list(results)
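To see that stacked generator expressions really do defer all work until they are consumed, here is a small sketch. The traced helper and the sample names are invented for the demonstration; it simply records which items have been pulled from the source.

```python
def traced(items, log):
    """Yield items one by one, recording each one as it is consumed."""
    for item in items:
        log.append(item)
        yield item


log = []
names = traced(["a.py", "b.txt", "c.PY"], log)
# two stacked generator expressions: nothing runs yet
stage1 = (n for n in names if n.lower().endswith(".py"))
stage2 = (n.upper() for n in stage1)
consumed_before = list(log)  # snapshot taken before anything is evaluated
result = list(stage2)        # the whole pipeline runs here, in a single pass
```

No intermediate lists are built; each name flows through both stages before the next one is read.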
Do you want to filter all files of the same type? You can do this with the glob module.
For example:
import glob
# Gets all the .jpg images one directory level below E:/Picture.
print(glob.glob(r"E:/Picture/*/*.jpg"))
# Gets all the .py files from the parent directory.
print(glob.glob(r'../*.py'))
I like to "prepare" my conditions at the start to keep things tidy and make them slightly easier to handle later on. Identify what effect the different arguments have on the code. In this instance, "case_sensitive" defines whether you are using f.lower() or not, and "extensions_only" defines your comparison method.
In this instance I would write something like this.
def search_query(query, path, extensions_only=False, case_sensitive=False):
    results = []
    for f in os.listdir(path):
        # prepare the strings to compare: fold case unless the search is case-sensitive
        if case_sensitive:
            fCase = f
            queryCase = query
        else:
            fCase = f.lower()
            queryCase = query.lower()
        # pick the comparison method
        if extensions_only:
            if fCase.endswith(queryCase):
                results.append(os.path.join(path, f))
        else:
            if queryCase in fCase:
                results.append(os.path.join(path, f))
    return results
results = search_query("_c", my_path)
print(results)
This allows me to define the impact which each result has on the function at a different level, without having them nested and a bit of a headache to keep track of!
Another possibility: You could just use a single conditional list comprehension:
def search_query2(query, path, extensions_only=False, case_sensitive=False):
    files = [os.path.join(path, f) for f in os.listdir(path)]
    results = [part for part in files
               if (not extensions_only or part.endswith(query)) and
               (query in part if case_sensitive
                else query.lower() in part.lower())]
    print(results)
    return results
This may seem very "dense" and incomprehensible at first, but IMHO it makes it very clear (even clearer than your variable names) that all those conditions are merely filtering and never e.g. changing the actual elements of the result list.
Also, as noted in comments, you could just use a default-parameter extension="" instead of extensions_only (everything ends with "", and you could even pass a tuple of valid extensions). Either way, it is not entirely clear how the endswith and in constraints should play together, or whether the extension should also match case_sensitive or not. Further, files could probably be simplified as glob.glob(path + "/*").
But those points do not change the argument for using a single list comprehension for filtering the list of results.
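The default-parameter idea mentioned above (extension="" instead of extensions_only) could look roughly like this. It is only a sketch operating on a plain list of names; filter_names and its exact semantics (the extension is matched case-sensitively, the query according to case_sensitive) are assumptions, since the original leaves those interactions open.

```python
def filter_names(names, query="", extension="", case_sensitive=False):
    """Keep names that end with extension and contain query."""
    if not case_sensitive:
        query = query.lower()
    return [
        name for name in names
        # str.endswith accepts a single suffix or a tuple of suffixes;
        # every string ends with "", so the default filters nothing out
        if name.endswith(extension)
        and query in (name if case_sensitive else name.lower())
    ]


names = ["Report.txt", "report.py", "notes.py", "image.png"]
all_names = filter_names(names)
py_reports = filter_names(names, query="report", extension=".py")
txt_or_png = filter_names(names, extension=(".txt", ".png"))
```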
I have all filenames of a directory in a list named files. And I want to filter it so only the files with the .php extension remain.
for x in files:
    if x.find(".php") == -1:
        files.remove(x)
But this seems to skip filenames. What can I do about this?
How about a simple list comprehension?
files = [f for f in files if f.endswith('.php')]
Or if you prefer a generator as a result:
files = (f for f in files if f.endswith('.php'))
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php']
>>> [f for f in files if f.endswith('.php')]
['a.php', 'd.php']
Most of the answers provided give list / generator comprehensions, which are probably the way you want to go 90% of the time, especially if you don't want to modify the original list.
However, for those situations where (say for size reasons) you want to modify the original list in place, I generally use the following snippet:
idx = 0
while idx < len(files):
    if files[idx].find(".php") == -1:
        del files[idx]
    else:
        idx += 1
As to why your original code wasn't working: it changes the list as you iterate over it. The "for x in files" implicitly creates an iterator, just as if you'd written "for x in iter(files)", and deleting elements from the list confuses the iterator about what position it is at. For such situations, I generally use the above code, or, if it happens a lot in a project, factor it out into a function, e.g.:
def filter_in_place(func, target):
    # keep the elements for which func is true; delete the others in place
    idx = 0
    while idx < len(target):
        if func(target[idx]):
            idx += 1
        else:
            del target[idx]
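The skipping behaviour and the in-place fix can be compared side by side in a short sketch (the sample file names are made up):

```python
def filter_in_place(func, target):
    """Delete, in place, every element for which func is false."""
    idx = 0
    while idx < len(target):
        if func(target[idx]):
            idx += 1
        else:
            del target[idx]


# Removing while iterating: after "a.txt" is removed, "b.txt" slides into
# index 0 and the iterator, already at index 1, never looks at it.
buggy = ["a.txt", "b.txt", "c.php"]
for name in buggy:
    if not name.endswith(".php"):
        buggy.remove(name)

# The index-based loop only advances when an element is kept.
files = ["a.txt", "b.txt", "c.php"]
filter_in_place(lambda name: name.endswith(".php"), files)
```

After running this, buggy still contains b.txt, while files holds only c.php.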
Just stumbled across this old question. Many of the solutions here will do the job, but they ignore the case where the filename is just ".php". I suspect that the question was about how to filter PHP scripts, and a file named ".php" may not be a PHP script. The solution I propose is as follows:
>>> import os.path
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php', '.php']
>>> [f for f in files if os.path.splitext(f)[1] == ".php"]
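The same check as a standalone script. The extra name archive.tar.php is an assumption added to show that only the last suffix counts; os.path.splitext treats a leading dot as part of the base name, so ".php" has an empty extension.

```python
import os.path

files = ["a.php", "b.txt", "c.html", "d.php", ".php", "archive.tar.php"]
# splitext returns (root, ext); ext is "" for a name that is only ".php"
php_files = [f for f in files if os.path.splitext(f)[1] == ".php"]
dotfile_parts = os.path.splitext(".php")
```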