I have the following scenario for traversing a directory structure:
"Build a complete directory tree with files, but if the files in a single directory have similar names, list only a single entry for them."
Example tree (assume the entries are not sorted):
- rootDir
  - dirA
      fileA_01
      fileA_03
      fileA_05
      fileA_06
      fileA_04
      fileA_02
      fileA_...
      fileAB
      fileAC
  - dirB
      fileBA
      fileBB
      fileBC
Expected output:
- rootDir
  - dirA
      fileA_01 - fileA_06 ...
      fileAB
      fileAC
  - dirB
      fileBA
      fileBB
      fileBC
I already wrote a simple function, findSimilarNames, that for fileA_01 (or any fileA_ name) returns the list [fileA_01 ... fileA_06].
Now I'm inside os.walk, looping over files, so every file gets checked for similar filenames. For example, for fileA_03 I get the rest of the group [fileA_01 - fileA_06], and at that point I want to modify the list I'm iterating over so that the items returned by findSimilarNames are skipped, without needing another loop or ifs inside.
I searched here and people suggest not modifying the list being iterated over, but doing so would let me avoid visiting every file.
Pseudo code:
for root, dirs, files in os.walk(path):
    for file in files:
        similarList = findSimilarNames(file)
        #OVERWRITE ITERATION LIST SOMEHOW
        files = (set(files) - set(similarList))
        #DEAL WITH ELEMENT
What I'm trying to avoid is the version below, which checks every file even though it may already have been found by findSimilarNames for an earlier file.
for root, dirs, files in os.walk(path):
    filteredbysimilar = files[:]
    for file in files:
        similar = findSimilarNames(file)
        filteredbysimilar = list(set(filteredbysimilar) - set(similar))
    #--
    for filteredFile in filteredbysimilar:
        pass  #DEAL WITH ELEMENT
#OVERWRITE ITERATION LIST SOMEHOW
You can get this effect by using a while-loop style iteration. Since you want to do set subtraction to remove the similar groups anyway, the natural approach is to start with a set of all the filenames, and repeatedly remove groups until nothing is left. Thus:
unprocessed = set(files)
while unprocessed:
    f = unprocessed.pop()  # removes and returns an arbitrary element
    group = findSimilarNames(f)
    # difference_update accepts any iterable, so this works whether
    # findSimilarNames returns a list or a set; it is not an error
    # that `f` has already been removed.
    unprocessed.difference_update(group)
    doSomethingWith(group)  # i.e., "DEAL WITH ELEMENT" :)
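As a minimal, self-contained sketch of the same idea: findSimilarNames is not shown in the question, so `find_similar_names` below is a hypothetical stand-in that treats names sharing a prefix and ending in `_<digits>` as similar (it also takes the full name list as a second argument, which the original may not). `group_files` is likewise just an illustrative name.

```python
import re


def find_similar_names(name, names):
    """Hypothetical stand-in for findSimilarNames.

    Names of the form <prefix>_<digits> are considered similar to every
    other name with the same prefix; any other name is only similar to itself.
    """
    match = re.fullmatch(r"(.*_)\d+", name)
    if not match:
        return {name}
    pattern = re.escape(match.group(1)) + r"\d+"
    return {n for n in names if re.fullmatch(pattern, n)}


def group_files(files):
    """Group similar names, calling find_similar_names once per group."""
    unprocessed = set(files)
    groups = []
    while unprocessed:
        f = unprocessed.pop()  # arbitrary element, removed from the set
        group = find_similar_names(f, files)
        unprocessed.difference_update(group)  # drop the rest of the group
        groups.append(sorted(group))
    return sorted(groups)


if __name__ == "__main__":
    for group in group_files(["fileA_01", "fileA_03", "fileA_02", "fileAB", "fileAC"]):
        print(group)
```

Because each group is removed from `unprocessed` as soon as one of its members is seen, the similarity function runs once per group rather than once per file.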
How about building up a list of files that aren't similar?
unsimilar = set()
for f in files:
    if len(findSimilarNames(f).intersection(unsimilar)) == 0:
        unsimilar.add(f)
This assumes findSimilarNames yields a set.
Folks,
I'm trying to optimize this to help speed up the process...
What I am doing is creating a dictionary of scandir entries...
e.g.
fs_data = {}
for item in Path(fqpn).iterdir():
    # snipped out a bunch of normalization code
    fs_data[item.name.title().strip()] = item
{'file1': <file1 scandisk data>, etc}
Later, I use a function to count the files and directories in that data.
I suspect that the new code, which uses map, could be optimized to be faster than the old code. I also suspect the cost comes from having to run the comprehension twice, once for files and once for directories.
But I can't think of a way to optimize it so that it only has to run once.
Can anyone suggest a way to sum the files and directories at the same time in the new version? (I could fall back to the old code if necessary.)
But might I be over-optimizing at this point?
Any feedback would be welcome.
def new_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a dict of scandir entries
    Used primarily by sync_database_disk to count a path's files & directories

    Parameters
    ----------
    fs_entries (dict) - mapping of name to scandir entry

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    def counter(fs_entry):
        return (fs_entry.is_file(), not fs_entry.is_file())

    mapdata = list(map(counter, fs_entries.values()))
    files = sum(files for files, _ in mapdata)
    dirs = sum(dirs for _, dirs in mapdata)
    return (files, dirs)
vs
def old_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a dict of scandir entries
    Used primarily by sync_database_disk to count a path's files & directories

    Parameters
    ----------
    fs_entries (dict) - mapping of name to scandir entry

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    files = 0
    dirs = 0
    for fs_item in fs_entries:
        is_file = fs_entries[fs_item].is_file()
        files += is_file
        dirs += not is_file
    return (files, dirs)
map is fast here if you map the is_file function directly:
files = sum(map(os.DirEntry.is_file, fs_entries.values()))
dirs = len(fs_entries) - files
(Something with filter might be even faster, at least if most entries aren't files. Or filter with is_dir if that works for you and most entries aren't directories. Or itertools.filterfalse with is_file. Or using itertools.compress. Also, counting True with list.count or operator.countOf instead of summing bools might be faster. But all of these ideas take more code (and some also memory). I'd prefer my above way.)
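For reference, here is a small sketch of that approach wrapped in functions. The names `scan_directory` and `count_entries` are made up for illustration, and the sketch assumes the dict values are `os.DirEntry` objects from `os.scandir` (as in the timing script), not `pathlib.Path` objects.

```python
import os


def scan_directory(path):
    """Return a dict mapping entry name to os.DirEntry for one directory."""
    with os.scandir(path) as entries:
        return {entry.name: entry for entry in entries}


def count_entries(fs_entries):
    """Return (files, dirs) for a dict of name -> os.DirEntry.

    is_file() is called exactly once per entry. Everything that is not a
    regular file is counted as a directory, which mirrors the
    `not is_file` logic of the functions in the question.
    """
    files = sum(map(os.DirEntry.is_file, fs_entries.values()))
    return files, len(fs_entries) - files
```

A call such as `count_entries(scan_directory("."))` returns the two counts as a tuple. Whether this beats the plain for loop on a given machine is something only a timeit run on real data can tell.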
Okay, map is definitely not the right answer here.
This morning I got up and created a test using timeit...
and it was a bit of a splash of reality to the face.
Without optimizations, new vs old, the new map code was roughly 2x the time.
New : 0.023185124970041215
old : 0.011841499945148826
I really ended up falling for a bit of click bait, and thought that rewriting with map would gain some efficiency.
For the sake of completeness, here is the test:
from timeit import timeit
import os
new = '''
def counter(fs_entry):
    files = fs_entry.is_file()
    return (files, not files)

mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
#dirs = len(fs_entries)-files
'''
#dirs = sum(dirs for _, dirs in mapdata)
old = '''
files = 0
dirs = 0
for fs_item in fs_entries:
    is_file = fs_entries[fs_item].is_file()
    files += is_file
    dirs += not is_file
'''
fs_location = '/Volumes/4TB_Drive/gallery/albums/collection1'
fs_data = {}
for item in os.scandir(fs_location):
    fs_data[item.name] = item
print("New : ", timeit(stmt=new, number=1000, globals={'fs_entries':fs_data}))
print("old : ", timeit(stmt=old, number=1000, globals={'fs_entries':fs_data}))
I was able to close the gap with some optimizations (thank you, Lee, for your suggestion):
New : 0.10864979098550975
old : 0.08246175001841038
It is clear that the for loop solution is easier to read, faster, and just simpler.
The speed difference between new and old doesn't seem to come from map specifically.
The duplicate sum statement added about .021, and the biggest slowdown came from the second fs_entry.is_file call, which added .06x to the timings...
I'm trying to output the files of some subdirectories while ignoring another one. Everything else seems to work correctly, but I noticed that an empty list is also being printed.
/usr/lib/python3.9/os.py(407)_walk()->('/home/presto.../pB/media/370', ['images5', 'images'], [])
'images' is removed later on in the code but that final empty list remains.
/usr/lib/python3.9/os.py(407)_walk()->('/home/presto.../pB/media/370', ['images5'], [])
for root, subdirectories, files in os.walk(wk):
    for subgals in subdirectories:
        if subgals == primary_gallery:
            subdirectories.remove(subgals)
        else:
            subgal_path.append(os.path.join(root, subgals))
    for file in sorted(files):
        raw_subgal_file_paths.append(os.path.join(root, file))
    for x in raw_subgal_file_paths:
        split_root = raw_prim_path_length - len(x)
        compiled_subgal_file_paths.append(x[split_root:])
    print(compiled_subgal_file_paths)
The final output looks like this.
[]
['/images5/10.gif', '/images5/11.gif', '/images5/20.gif', '/images5/21.gif']
How do I fix this?
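One possible reading, offered as a sketch and not a confirmed diagnosis: the trace shows that the top-level directory has an empty file list, so if the print runs once per directory visited, the first pass prints [] before any files have been collected. Printing once after the walk would avoid that. The function name `collect_relative_file_paths` and its parameters are hypothetical, and the slicing assumes `top` has no trailing separator.

```python
import os


def collect_relative_file_paths(top, skip_name):
    """Walk `top`, skip directories named `skip_name`, and return the
    file paths with the `top` prefix stripped."""
    relative_paths = []
    for root, subdirectories, files in os.walk(top):
        # Slice assignment prunes the walk in place without mutating the
        # list while iterating over it (list.remove inside a for loop
        # can make the loop skip the entry after the removed one).
        subdirectories[:] = [d for d in subdirectories if d != skip_name]
        for name in sorted(files):
            full_path = os.path.join(root, name)
            # Keep everything after the `top` prefix, e.g. "/images5/10.gif".
            relative_paths.append(full_path[len(top):])
    return relative_paths
```

With the names from the question, this would be called as `print(collect_relative_file_paths(wk, primary_gallery))`, a single print after the walk has finished.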
I have 2 directories with 2 different file types (e.g. .csv, .png) but with the same basename (e.g. 1001_12_15.csv, 1001_12_15.png). I have many thousands of files in each directory.
What I want to do is get the full paths of the files after matching the basenames, and then DO something with the full paths of both files.
I am asking for help with speeding up the procedure.
My approach is:
csvList = [...]  # a list with the full path of each .csv file
pngList = [...]  # a list with the full path of each .png file
for i in range(0, len(csvList)):
    csv_base = os.path.basename(csvList[i])
    # e.g. 1001
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    for j in range(0, len(pngList)):
        png_base = os.path.basename(pngList[j])
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # DO SOMETHING
Moreover, I tried fnmatch, something like:
for csv_file in csvList:
    csv_base = os.path.basename(csv_file)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    rel_path = "/path/to/file"
    pattern = "*" + csv_id + "*.png"
    reg_match = fnmatch.filter(pngList, pattern)
    reg_match = " ".join(str(x) for x in reg_match)
    if reg_match:
        pass  # DO something
It seems that using the nested for loops is faster, but I want it to be even faster. Are there any other approaches I could use to speed up my code?
First of all, simplify the syntax of your existing loop like this:
for csv in csvlist:
    csv_base = os.path.basename(csv)
    csv_id = os.path.splitext(csv_base)[0].split("_")[0]
    for png in pnglist:
        png_base = os.path.basename(png)
        png_id = os.path.splitext(png_base)[0].split("_")[0]
        if float(png_id) == float(csv_id):
            pass  # do something here
Nested loops are slow here because the body of the png loop runs len(csvlist) times len(pnglist) times, i.e. about n^2 times for lists of length n.
Then you can use list comprehensions and list indexing to speed it up more:
## create lists of processed values
## so you don't have to keep calling os.path
csv_base_list = [os.path.basename(csv) for csv in csvlist]
csv_id_list = [os.path.splitext(csv_base)[0].split("_")[0] for csv_base in csv_base_list]
png_base_list = [os.path.basename(png) for png in pnglist]
png_id_list = [os.path.splitext(png_base)[0].split("_")[0] for png_base in png_base_list]
## run a single loop with list.index to find each matching pair and record the base names
csv_png_base = [(csv_base_list[csv_id_list.index(png_id)], png_base)
                for png_id, png_base in zip(png_id_list, png_base_list)
                if png_id in csv_id_list]
## csv_png_base contains tuples of (csv_base, png_base)
This logic using list.index avoids the repeated os.path calls. Note that list.index and `in` on a list are themselves linear scans, so in the worst case the matching step still does work proportional to len(csvlist) times len(pnglist).
A list comprehension is slightly faster than a normal loop.
You can then loop through the list and do something with the values, e.g.
for csv_base, png_base in csv_png_base:
    pass  # do something
pandas may do the job faster still, because it runs the loop in compiled code.
You can build up a search index in O(n), then seek items in it in O(1) each. If you have exact matches as your question implies, a flat lookup dict suffices:
from os.path import basename, splitext
png_lookup = {
    splitext(basename(png_path))[0]: png_path
    for png_path in pngList
}
This allows you to directly look up the png file corresponding to each csv file:
for csv_file in csvList:
    csv_id = splitext(basename(csv_file))[0]
    try:
        png_file = png_lookup[csv_id]
    except KeyError:
        pass
    else:
        pass  # do something with csv_file and png_file
In the end, you have an O(n) lookup construction and a separate O(n) iteration with a nested O(1) lookup. The total complexity is O(n) compared to your initial O(n^2).
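If the files should instead match only on the leading ID (the part before the first underscore, as in the question's nested loops), several png files can share one key, so a flat dict is not enough. A rough sketch of that variant follows; `leading_id` and `pair_by_id` are hypothetical names. It compares the IDs as strings, whereas the question compares them as floats, so "1001" and "01001" would not match here.

```python
from collections import defaultdict
from os.path import basename, splitext


def leading_id(path):
    """Return the part of the file name before the first underscore."""
    return splitext(basename(path))[0].split("_")[0]


def pair_by_id(csv_paths, png_paths):
    """Return (csv_path, png_path) tuples whose leading IDs match."""
    # O(n) index construction: ID -> list of png paths with that ID.
    pngs_by_id = defaultdict(list)
    for png_path in png_paths:
        pngs_by_id[leading_id(png_path)].append(png_path)

    # One dict lookup per csv file; .get avoids creating empty entries.
    pairs = []
    for csv_path in csv_paths:
        for png_path in pngs_by_id.get(leading_id(csv_path), []):
            pairs.append((csv_path, png_path))
    return pairs
```

The index is still built in a single pass and each csv file costs one lookup, so the total work grows with the number of files plus the number of matching pairs.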
I am using Python 2.7, btw..
Let's say I have a couple directories that I want to create dictionaries for. The files in each of the directories are named YYYYMMDD.hhmmss and are all different, and the size of each directory is different:
path1 = '/path/to/folders/to/make/dictionaries'
dir1 = os.listdir(path1)
I also have another static directory that will have some files to compare
gpath1 = '/path/to/static/files'
gdir1 = os.listdir(gpath1)
dir1_file_list = [datetime.strptime(g, '%Y%m%d.%H%M%S') for g in gdir1]
So I have a static directory of files in gdir1, and I now want to loop through each directory in dir1 and create a unique dictionary. This is the code:
for i in range(0, len(dir1)):
    path2 = path1 + "/" + dir1[i]
    dir2 = os.listdir(path2)
    dir2_file_list = [datetime.strptime(r, '%Y%m%d.%H%M%S') for r in dir2]
    # Define a dictionary, and initialize comparisons
    dict_gr = dict()
    for dir1_file in dir1_file_list:
        dict_gr[str(dir1_file)] = []
        # Look for instances within the last 5 minutes
        for dir2_file in dir2_file_list:
            if 0 <= (dir1_file - dir2_file).total_seconds() <= 300:
                dict_gr[str(dir1_file)].append(str(dir2_file))
    # Sort the dictionaries
    for key, value in sorted(dict_gr.iteritems()):
        dir2_lib.append(key)
        dir1_lib.append(sorted(value))
The issue is that path2 and dir2 both properly go to the different folders and grab the necessary filenames, and creating dict_gr will all work well. However, when I go to the part of the script where I sort the dictionaries, the 2nd directory that has been looped over will contain the contents of the first directory. The 3rd looped dictionary will contain the contents of the 1st and 2nd, etc. In other words, they are not matching uniquely with each directory.
Any thoughts?
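I had overlooked that I was appending to dir2_lib and dir1_lib without re-initializing them. They need to be reset to empty lists at the start of each pass of the outer loop; otherwise each directory's results pile up on top of the previous ones.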
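To illustrate the fix, here is a sketch in Python 3 (the question uses 2.7, where `dict.iteritems` would replace `dict.items`). It works on lists of file names instead of calling os.listdir, and the names `match_recent` and `match_all` are invented for the example. The point is only that every directory gets fresh containers.

```python
from datetime import datetime

FMT = "%Y%m%d.%H%M%S"


def match_recent(reference_names, candidate_names, window_seconds=300):
    """Map each reference timestamp to the candidates at most
    `window_seconds` older than it."""
    refs = [datetime.strptime(n, FMT) for n in reference_names]
    cands = [datetime.strptime(n, FMT) for n in candidate_names]
    result = {}
    for ref in refs:
        result[str(ref)] = sorted(
            str(c) for c in cands
            if 0 <= (ref - c).total_seconds() <= window_seconds
        )
    return result


def match_all(reference_names, names_by_directory):
    """Run match_recent per directory, keeping the results separate."""
    per_directory = {}
    for dir_name, names in names_by_directory.items():
        # Fresh lists on every pass, so nothing carries over from the
        # previously processed directory.
        keys, values = [], []
        for key, value in sorted(match_recent(reference_names, names).items()):
            keys.append(key)
            values.append(value)
        per_directory[dir_name] = (keys, values)
    return per_directory
```

Creating `keys` and `values` inside the loop plays the role of re-initializing dir2_lib and dir1_lib for each directory.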
Hey guys I'm a rookie in python and need some help.
My problem is, that I have a folder full of text files (with lists in it), where two belong to each other and need to be read and compared.
Folder with many files: File1_in.xlo, File1_out.xlo, File2_in.xlo, File2_out.xlo, ...
--> so File1_in.xlo and File1_out.xlo belong together and need to be compared.
I already can append the lists of the 'in-Files' (or 'out-Files') and then compare them, but since there are many Files the lists become really long (thousands and thousands of entries), so the idea is to compare the files or respectively the lists pairwise.
My first try looks like:
import os
for filename in sorted(os.listdir('path')):
    if filename.endswith('in.xlo'):
        with open(os.path.join('path', filename)) as inn:
            lines = inn.readlines()
            for x in lines:
                temperatureIn = x.split()[4]
    if filename.endswith('out.xlo'):
        with open(os.path.join('path', filename)) as outt:
            lines = outt.readlines()
            for x in lines:
                temperatureOut = x.split()[4]  # column at index 4 in list
So the problem is, as you can see, the 'temperatureIn's are always overwritten before I can compare them with the 'temperatureOut's. I think/ hope there must be a way to open both files at once to compare the list entries.
I hope you can understand my problem and someone can help me.
Thanks
Use zip to access the in-files and out-files in pairs. This relies on every in-file having a matching out-file, so that the two sorted lists line up.
files = sorted(os.listdir('path'))
in_files = [fname for fname in files if fname.endswith('in.xlo')]
out_files = [fname for fname in files if fname.endswith('out.xlo')]
for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, open(os.path.join('path', out_file)) as outt:
        pass  # Do whatever you want
Add them to a list created just before your for loop:
temps_in = []
for x in lines:
    temperatureIn = x.split()[4]
    temps_in.append(temperatureIn)
Do the same thing for the out temperatures, then compare your two lists.
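Putting the two suggestions together, a rough sketch might look like the following. It derives each out-file name from its in-file name instead of zipping two sorted lists, which is a swapped-in way of pairing that tolerates a missing partner. The names `read_column` and `compare_pairs` are hypothetical, and the sketch assumes whitespace-separated columns with the value of interest at index 4, as in the question.

```python
import os

IN_SUFFIX = "_in.xlo"
OUT_SUFFIX = "_out.xlo"


def read_column(path, index=4):
    """Return the whitespace-separated field at `index` from every
    non-empty line of the file."""
    with open(path) as handle:
        return [line.split()[index] for line in handle if line.strip()]


def compare_pairs(folder):
    """Map each file stem (e.g. "File1") to a list of
    (temperature_in, temperature_out) tuples, one per line."""
    results = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(IN_SUFFIX):
            continue
        stem = name[:-len(IN_SUFFIX)]
        out_path = os.path.join(folder, stem + OUT_SUFFIX)
        if not os.path.exists(out_path):
            continue  # no partner file, nothing to compare
        temps_in = read_column(os.path.join(folder, name))
        temps_out = read_column(out_path)
        # Only one pair of files is held in memory at a time.
        results[stem] = list(zip(temps_in, temps_out))
    return results
```

The values stay as strings here; converting them with float() before comparing would be the natural next step if the columns are numeric.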