Count number of files in end directories using Python

I have an image dataset archived in a tree structure, where the names of the folders at each level are the labels of the images, for example:
/Label1
    /Label2
        /image1
        /image2
    /Label3
        /image3
/Label4
    /image4
    /image5
How could I count the number of images in each end folder? In the above example, I want to know how many images there are in /Label1/Label2, /Label1/Label3 and /Label4.
I checked out the function os.walk(), but it seems that there is no easy way to count the number of files in each individual end folder.
Could anyone help me? Thank you in advance.

You can do this with os.walk():
import os

c = 0
print(os.getcwd())
for root, directories, filenames in os.walk('C:/Windows'):
    for f in filenames:
        c = c + 1
print(c)
Output:
125765
If you have multiple file formats in the sub-directories, you can check each file's extension (e.g. .jpeg or .png) before incrementing the counter.
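For instance, a minimal sketch of that extension check (the particular extension tuple here is an assumption, adjust it to your formats):

```python
import os

# Example set of image extensions; extend as needed.
IMAGE_EXTS = ('.jpg', '.jpeg', '.png')

def count_images(path):
    count = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            # Lower-case the name so .PNG, .Jpg etc. also match.
            if name.lower().endswith(IMAGE_EXTS):
                count += 1
    return count
```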

I checked out the function os.walk(), but it seems that there is no easy way to count number of files in each individual end folder.
Sure there is. It's just len(files).
I'm not 100% sure what you meant by "end directory", but if you wanted to skip directories that aren't leaves—that is, that have subdirectories—that's just as easy: if not dirs.
Finally, I don't know whether you wanted to print the counts out, store them in a dict with the paths as keys, total them up, or what, but those are all pretty trivial, so here's some code that does all three as examples:
import os

total = 0
counts = {}
for root, dirs, files in os.walk(path):
    if not dirs:
        print(f'{root}: {len(files)}')
        counts[root] = len(files)
        total += len(files)

Related

How do I reset an incremental variable inside a for loop (python)?

python newb here so apologies for the extremely basic question, but I have tried to figure this out via generic google searches and can't seem to get the answer I'm looking for. So, I beseech the stackoverflow gods to help me...
Here's my scenario:
I have a directory, with multiple subdirectories and multiple files within those subdirectories. Each subdirectory represents one digital object (a physical book), and the files inside are .tif files that correspond to the object's "pages." I want to iterate over these subdirectories and count the number of .tif files within them, but for each subdirectory I want the count to begin back at 1, thus representing the "page numbers" of that "book."
All I can seem to figure out is how to count the files, in a linear progression. Here's the code I've been using:
import fnmatch
import os

Label = 0
for rootDir, subdir, filenames in os.walk('/Users/kaylaheslin/Desktop/mets_test'):
    for filename in fnmatch.filter(filenames, "*.tif"):
        Label += 1
        print(Label)
Of course, this walks through the files and adds one each time a .tif file is found, but I need to begin at 1 for each subdirectory. Someone, please help! I want to know what I'm doing wrong.
Would swapping the first two lines do what you want?
for rootDir, subdir, filenames in os.walk('/Users/kaylaheslin/Desktop/mets_test'):
    Label = 0
    for filename in fnmatch.filter(filenames, "*.tif"):
        Label += 1
        print(Label)
This way you reset Label on every pass through the outer loop. In other words, each time you drop into a new subdirectory you start the count from 0 again.
Also, if you want to associate a number with each file, perhaps you could avoid a counter variable in the first place:
for rootDir, subdir, filenames in os.walk('/Users/kaylaheslin/Desktop/mets_test'):
    tif_files = fnmatch.filter(filenames, "*.tif")
    for i, file in enumerate(tif_files, start=1):  # start=1 so numbering begins at page 1
        print(i)
You can create a dictionary mapping each folder name to its file count:
import os

dd = {}
for rootDir, subdir, filenames in os.walk('/Users/kaylaheslin/Desktop/mets_test'):
    for filename in filenames:
        if filename.endswith('.tif'):
            if rootDir in dd:
                dd[rootDir] += 1
            else:
                dd[rootDir] = 1
for k in dd:
    print(k, dd[k])
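As a design note, collections.Counter can replace the manual if/else bookkeeping; a small sketch of the same per-directory .tif count:

```python
import os
from collections import Counter

def count_tifs_per_dir(root):
    counts = Counter()
    for rootDir, subdirs, filenames in os.walk(root):
        # Count .tif files in this directory in one pass.
        n = sum(1 for f in filenames if f.endswith('.tif'))
        if n:  # only record directories that actually contain .tif files
            counts[rootDir] = n
    return counts
```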

Return number of folders in directory and subdirectory

I have a directory similar to the example below, which contains the following folders:
C:\Users\xx\Desktop\New folder\New folder\New folder\QGIS
C:\Users\xx\Desktop\New folder\New folder\New folder (2)\1- QGIS
C:\Users\xx\Desktop\New folder\New folder\New folder (4)\1.0 QGIS
C:\Users\xx\Desktop\New folder\New folder\QGIS
I wish to find how many folders have names ending in QGIS, and their paths.
My current script is down below. It successfully gives me the path of every folder whose name ends in QGIS, but it only counts the folders named exactly "QGIS" and doesn't count "1.0 QGIS" or "1- QGIS". What am I missing?
import os

rootfolder = r'C:\Users\xx\Desktop\New folder'
isfile = os.path.isfile
join = os.path.join
i = 0
with open("folderpath.txt", 'w') as f:
    for root, dirs, files in os.walk(rootfolder, topdown=False):
        i += dirs.count('*QGIS')
        for name in dirs:
            if name.endswith("QGIS"):
                f.write(os.path.join(root, name) + '\n')
    f.write(str(sum(dirs.count('QGIS') for _, dirs, _ in os.walk(rootfolder))))
The list.count method does not support any concept of a wildcard -- it just looks for how many elements are equal to the value that is given as an argument. So your line
i += dirs.count('*QGIS')
is looking for directories which are literally called *QGIS, rather than directories that end with QGIS.
The fix here should be easy because the code is already successfully printing out the correct paths; it is just not counting them correctly. So all that you need to do is to remove the above statement, and instead just add 1 in the place where you print out each path, which is already subject to the correct if condition inside the loop over directory names.
for root, dirs, files in os.walk(rootfolder, topdown=False):
    for name in dirs:
        if name.endswith("QGIS"):
            f.write(os.path.join(root, name) + '\n')
            i += 1
You already correctly initialise i=0 before the start of the loop.
At the end, just do:
print(i)
and get rid of that expression involving sum where you walk through all the directories a second time.
import os

print(len(list(filter(None, map(lambda x: x[0] if x[0].endswith('QGIS') else None, os.walk('.'))))))
A shorter form, but not too readable ;)
The "map" goes through the results of os.walk and returns the folder path if it ends with 'QGIS', or None if not.
The "filter" keeps every value from map's results that differs from None.
The "list" is needed because both map and filter return iterator objects, which have no length, whereas a list does.
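The same count can also be written as a generator expression with sum(), which many readers find easier to follow than the map/filter pipeline:

```python
import os

def count_qgis_dirs(root):
    # Count every walked directory whose path ends with 'QGIS'.
    return sum(1 for dirpath, dirs, files in os.walk(root)
               if dirpath.endswith('QGIS'))
```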

Remove images in multiple folders (Python)

I want to write a python script to randomly keep only some images in multiple folders.
I am new to python, and I am trying to find the solution. However, I could not find a good one to start with yet.
I would appreciate it if anyone could help me. Thank you.
This might help you. It first retrieves the list of all directories, and afterwards removes random files until only n remain. Note: path_to_all_images_folder has to be declared.
import os
import random

def keep_n_dir(directory, n):
    files = os.listdir(directory)  # Retrieve the list of file names
    if len(files) > n:  # If you already have fewer than n files, do nothing
        diff = len(files) - n
        files_to_delete = random.sample(files, k=diff)  # Randomly sample files to delete
        for file in files_to_delete:
            os.remove(os.path.join(directory, file))  # Delete the extra files

directories = os.listdir(path_to_all_images_folder)
directories = [os.path.join(path_to_all_images_folder, folder) for folder in directories]
for directory in directories:
    if os.path.isdir(directory):
        keep_n_dir(directory, n)
ATTENTION! This code removes the other files from each directory. It keeps only n of them.

Faster searching for specific dirs all over the drive in python

I have a network disk with data: many directories, many files. On the disk I have some directories with logs named LOGS_XXX; inside those are various files and folders, including the folders I'm interested in, named YYYYFinal, where YYYY is the year of creation. I want to create a list of paths to those directories, but only if YYYY > 2017. One LOGS folder could contain more than one YYYYFinal, or nothing of interest at all.
So here is the part of the code that searches for directories matching those conditions and builds the list:
import os

path = path_to_network_drive

def findAllOutDirs(path):
    finalPathList = []
    for root, subdirs, files in os.walk(path):
        for d in subdirs:
            if d == "FINAL" or d == "Final":
                outPath = root + r"\{}".format(d)
                if ("LOGS" in outPath) and ("2018" in outPath or "2019" in outPath or "2020" in outPath):
                    finalPathList.append(outPath)
    return finalPathList
And this code works: I get the final list, but it takes a long time. So, does anyone see any mistakes or bad usage here, or know a better way to do this in Python?
Thanks!
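One common speedup with os.walk is to prune branches in place so the walk never descends into directories you no longer need. A sketch of the same search with pruning, under the assumption that the Final folders do not themselves contain further Final folders:

```python
import os

def find_final_dirs(path, years=('2018', '2019', '2020')):
    final_paths = []
    for root, subdirs, files in os.walk(path):
        for d in list(subdirs):  # copy, since we mutate subdirs while iterating
            if d in ('FINAL', 'Final'):
                out_path = os.path.join(root, d)
                if 'LOGS' in out_path and any(y in out_path for y in years):
                    final_paths.append(out_path)
                # We only need the path, not the contents, so stop
                # os.walk from descending into this folder at all.
                subdirs.remove(d)
    return final_paths
```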

Python os.walk complex directory criteria

I need to scan a directory with hundreds of GB of data, which has structured parts (which I want to scan) and unstructured parts (which I don't want to scan).
Reading up on the os.walk function, I see that I can use a set of criteria to exclude or include certain directory names or patterns.
For this particular scan I would need to add specific include/exclude criteria per level in a directory, for example:
In a root directory, imagine there are two useful directories, 'Dir A' and 'Dir B', and a non-useful trash directory 'Trash'. In Dir A there are two useful subdirectories, 'Subdir A1' and 'Subdir A2', and a non-useful 'SubdirA Trash' directory; in Dir B there are two useful subdirectories, 'Subdir B1' and 'Subdir B2', plus a non-useful 'SubdirB Trash' subdirectory. It would look something like this:
/Dir A
    /Subdir A1
    /Subdir A2
    /SubdirA Trash
/Dir B
    /Subdir B1
    /Subdir B2
    /SubdirB Trash
/Trash
I need to have a specific criteria list for each level, something like this:
level1DirectoryCriteria = {"Dir A", "Dir B"}
level2DirectoryCriteria = {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"}
The only ways I can think to do this are quite obviously non-Pythonic, using complex and lengthy code with a lot of variables and a high risk of instability. Does anyone have any ideas for how to solve this problem? If successful it could save the code several hours of running time at a stretch.
You could try something like this:
to_scan = {'set', 'of', 'good', 'directories'}
for dirpath, dirnames, filenames in os.walk(root):
    dirnames[:] = [d for d in dirnames if d in to_scan]
    # whatever you wanted to do in this directory
This solution is simple, and fails if you want to scan directories with a certain name if they appear in one directory and not another. Another option would be a dictionary that maps directory names to lists or sets of whitelisted or blacklisted directories.
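That dictionary-based option might be sketched as follows; the whitelist contents here are hypothetical placeholders keyed by parent-directory name:

```python
import os

# Hypothetical whitelist: map each parent directory name to the set of
# subdirectory names that should be scanned beneath it.
whitelist = {
    'Dir A': {'Subdir A1', 'Subdir A2'},
    'Dir B': {'Subdir B1', 'Subdir B2'},
}

def walk_whitelisted(root):
    for dirpath, dirnames, filenames in os.walk(root):
        parent = os.path.basename(dirpath)
        allowed = whitelist.get(parent)
        if allowed is not None:
            # Prune in place so os.walk skips non-whitelisted branches.
            dirnames[:] = [d for d in dirnames if d in allowed]
        yield dirpath, dirnames, filenames
```

Directories whose names are absent from the dict are left unfiltered, so the mapping only needs entries for levels that actually need pruning.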
Edit: We can use dirpath.count(os.path.sep) to determine depth.
root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
sets_by_level = [{'root', 'level'}, {'one', 'deep'}]
for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    dirnames[:] = [d for d in dirnames if d in sets_by_level[depth]]
    # process this directory
Not a direct answer concerning os.walk but just a suggestion: Since you're scanning the directories anyways, and you obviously know the trash directories from the other directories, you could also place a dummy file in the trash directories skip_this_dir or something. When you iterate over directories and create the file list, you check for the presence of the skip_this_dir file, something like if 'skip_this_dir' in filenames: continue; and continue to the next iteration.
This may not involve using os.walk parameters, but it does make the programming task a little easier to manage, without the requirement of writing a lot of 'messy' code with tons of conditionals and lists of include/excludes. It also makes the script easier to reuse since you don't need to change any code, you just place the dummy file in the directories you need to skip.
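A sketch of that marker-file idea (the file name skip_this_dir is arbitrary; this version also prunes dirnames so os.walk never descends below a marked directory, which goes slightly beyond a bare continue):

```python
import os

MARKER = 'skip_this_dir'  # dummy file name placed in directories to skip

def walk_skipping_marked(root):
    for dirpath, dirnames, filenames in os.walk(root):
        if MARKER in filenames:
            # Marked directory: skip it and everything beneath it.
            dirnames[:] = []
            continue
        yield dirpath, dirnames, filenames
```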
By using the root.count(os.path.sep) I was able to create specific instructions on what to include/exclude on each level in the structure. Looks something like this:
import os
root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
directoriesToIncludedByLevel = [
    {"criteriaString", "criteriaString", "criteriaString", "criteriaString"},  # Level 0
    {"criteriaString", "criteriaString", "criteriaString"},  # Level 1
    set(),  # Level 2
]
directoriesToExcludedByLevel = [
    set(),  # Level 0
    set(),  # Level 1
    {"criteriaString"},  # Level 2
]
for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    # Prune dirnames depending on whether this level uses include or exclude criteria
    if depth == 2:  # levels where we define which directories to EXclude
        dirnames[:] = [d for d in dirnames if d not in directoriesToExcludedByLevel[depth]]
    elif depth < 2:  # levels where we define which directories to INclude
        dirnames[:] = [d for d in dirnames if d in directoriesToIncludedByLevel[depth]]
I was looking for a solution similar to OP. I needed to scan the subfolders and needed to exclude any folder that had folders labeled 'trash'.
My solution was to use the string find() method. Here's how I used it:
for dirpath, dirnames, filenames in os.walk(your_path):
    if dirpath.find('trash') != -1:
        pass
    else:
        do_stuff
If 'trash' is found, find() returns its index, which can be 0, so compare against -1 rather than 0. Otherwise find() will return -1.
You can find more information on the find() method here:
https://www.tutorialspoint.com/python/string_find.htm
