I need to scan a directory holding hundreds of GB of data, which has structured parts (which I want to scan) and non-structured parts (which I don't want to scan).
Reading up on the os.walk function, I see that I can use a set of criteria to exclude or include certain directory names or patterns.
For this particular scan I need specific include/exclude criteria for each level of the directory tree. For example:
In a root directory, imagine there are two useful directories, 'Dir A' and 'Dir B', and a non-useful trash directory, 'Trash'. Dir A contains two useful subdirectories, 'Subdir A1' and 'Subdir A2', and a non-useful 'SubdirA Trash' directory; Dir B contains two useful subdirectories, 'Subdir B1' and 'Subdir B2', plus a non-useful 'SubdirB Trash' subdirectory. It would look something like this:
root
    Dir A
        Subdir A1
        Subdir A2
        SubdirA Trash
    Dir B
        Subdir B1
        Subdir B2
        SubdirB Trash
    Trash
I need to have a specific criteria list for each level, something like this:
level1DirectoryCriteria = {"Dir A", "Dir B"}
level2DirectoryCriteria = {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"}
The only ways I can think of to do this are quite obviously non-Pythonic: complex, lengthy code with a lot of variables and a high risk of instability. Does anyone have any ideas for how to solve this problem? If successful, it could cut the code's running time by several hours per run.
You could try something like this:
import os

to_scan = {'set', 'of', 'good', 'directories'}
for dirpath, dirnames, filenames in os.walk(root):
    # slice assignment edits the list in place, so os.walk skips everything else
    dirnames[:] = [d for d in dirnames if d in to_scan]
    # whatever you wanted to do in this directory
This solution is simple, but it fails if you want to scan directories with a certain name when they appear in one directory and not in another. Another option would be a dictionary that maps directory names to lists or sets of whitelisted or blacklisted directories.
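The dictionary idea could be sketched roughly as follows. The names ALLOWED_CHILDREN and walk_whitelisted are hypothetical, the directory names are taken from the question's example, and treating a parent that is missing from the mapping as unrestricted is an assumption rather than anything the question specifies.

```python
import os

# Hypothetical mapping: parent directory name -> child names worth descending into.
ALLOWED_CHILDREN = {
    "Dir A": {"Subdir A1", "Subdir A2"},
    "Dir B": {"Subdir B1", "Subdir B2"},
}

def walk_whitelisted(root, top_level, allowed_children):
    """Yield (dirpath, filenames), pruning subdirectories that are not whitelisted."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            allowed = top_level
        else:
            # Assumption: parents absent from the mapping are not restricted.
            allowed = allowed_children.get(os.path.basename(dirpath))
        if allowed is not None:
            # In-place edit, so os.walk never enters the removed directories.
            dirnames[:] = [d for d in dirnames if d in allowed]
        yield dirpath, filenames
```

Calling walk_whitelisted(root, {"Dir A", "Dir B"}, ALLOWED_CHILDREN) would then visit the useful directories only. Because the lookup is keyed on the parent's name, the same child name can be allowed under one parent and rejected under another.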
Edit: We can use dirpath.count(os.path.sep) to determine depth.
root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
sets_by_level = [{'root', 'level'}, {'one', 'deep'}]
for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    dirnames[:] = [d for d in dirnames if d in sets_by_level[depth]]
    # process this directory
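One thing to watch with the depth-indexed list is that sets_by_level[depth] raises IndexError as soon as the walk goes deeper than the list is long. A minimal sketch that guards against this is shown here; the function name walk_by_level is hypothetical, and leaving deeper levels unfiltered is an assumption (you might prefer to stop descending instead).

```python
import os

def walk_by_level(root, sets_by_level):
    """Walk root, keeping at each depth only the directory names listed for that depth."""
    root = os.path.normpath(root)  # drop any trailing separator before counting
    root_depth = root.count(os.path.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.count(os.path.sep) - root_depth
        if depth < len(sets_by_level):
            dirnames[:] = [d for d in dirnames if d in sets_by_level[depth]]
        # Assumption: levels deeper than the list are walked without filtering.
        yield dirpath, dirnames, filenames
```

Normalizing root first matters because a trailing separator (for example 'data/') would otherwise throw the depth count off by one.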
Not a direct answer concerning os.walk, just a suggestion: since you're scanning the directories anyway, and you obviously can tell the trash directories from the others, you could also place a dummy file in each trash directory, named skip_this_dir or something similar. When you iterate over the directories and build the file list, check for the presence of that file, something like if 'skip_this_dir' in filenames: continue, and move on to the next iteration. To keep os.walk from descending below a marked directory as well, empty dirnames in place (dirnames[:] = []) before continuing.
This may not involve using os.walk parameters, but it does make the programming task a little easier to manage, without writing a lot of 'messy' code with tons of conditionals and lists of includes/excludes. It also makes the script easier to reuse: you don't need to change any code, you just place the dummy file in the directories you need to skip.
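A minimal sketch of the marker-file approach might look like this. The marker name comes from the suggestion above; the function name walk_without_marked is made up for illustration.

```python
import os

MARKER = "skip_this_dir"  # marker file name suggested above; any agreed name works

def walk_without_marked(root, marker=MARKER):
    """Yield (dirpath, filenames) for directories that do not contain the marker file."""
    for dirpath, dirnames, filenames in os.walk(root):
        if marker in filenames:
            dirnames[:] = []  # prune: do not descend below a marked directory
            continue
        yield dirpath, filenames
```

The trade-off is that os.walk still has to list each marked directory once to find the marker, which is usually cheap compared with walking everything beneath it.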
By using root.count(os.path.sep) I was able to create specific instructions on what to include or exclude at each level of the structure. It looks something like this:
import os

root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
directoriesToIncludedByLevel = [
    {"criteriaString", "criteriaString", "criteriaString", "criteriaString"},  # Level 0
    {"criteriaString", "criteriaString", "criteriaString"},  # Level 1
    set(),  # Level 2 (note: {} would create an empty dict, not a set)
]
directoriesToExcludedByLevel = [
    set(),  # Level 0
    set(),  # Level 1
    {"criteriaString"},  # Level 2
]
for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    # Prune dirnames with the include sets at levels 0 and 1 and the exclude set at level 2
    if depth == 2:  # where we define which directories to exclude
        dirnames[:] = [d for d in dirnames if d not in directoriesToExcludedByLevel[depth]]
    elif depth < 2:  # where we define which directories to include
        dirnames[:] = [d for d in dirnames if d in directoriesToIncludedByLevel[depth]]
I was looking for a solution similar to OP. I needed to scan the subfolders and needed to exclude any folder that had folders labeled 'trash'.
My solution was to use the string find() method. Here's how I used it:
for (dirpath, dirnames, filenames) in os.walk(your_path):
    if dirpath.find('trash') >= 0:
        continue  # 'trash' occurs somewhere in the path, so skip this directory
    # do stuff
If 'trash' is found, find() returns the index at which it starts, which can be 0, so the test has to be >= 0. Otherwise find() returns -1.
You can find more information on the find() method here:
https://www.tutorialspoint.com/python/string_find.htm
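A substring search also matches names such as 'trashcan', and it still makes os.walk descend into the trash folders before skipping them. As a hedged alternative, the sketch below compares whole path components and prunes the walk; has_component and walk_skipping are hypothetical helper names.

```python
import os

def has_component(dirpath, name):
    """Return True if `name` is a whole path component of dirpath."""
    return name in os.path.normpath(dirpath).split(os.sep)

def walk_skipping(root, name="trash"):
    """Yield (dirpath, filenames), never entering directories called `name`."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Removing the entry in place stops os.walk from descending into it.
        dirnames[:] = [d for d in dirnames if d != name]
        yield dirpath, filenames
```

With pruning in place the has_component check is only needed when the paths come from somewhere other than the walk itself.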
I've been porting (very simply) a Python script from Windows to Linux (directory changes mostly), and I want to add a few new features to it.
The script is used to update mods on a game server. All mods are located in ShooterGame/Content/Mods/. Some mods are included by default (TheCenter and 11111111) - every other mod is located in the same folder as the default ones, but the names consist of random numbers.
I've been trying to exclude the 2 default directories and then build a list of contents of the ShooterGame/Content/Mods/ folder, but I've failed to do so.
This is the code that I've tried to use to exclude just the TheCenter folder:
def build_list_of_mods(self):
    """
    Build a list of all installed mods by grabbing all directory names from the mod folder
    :return:
    """
    exclude = ["TheCenter"]
    if not os.path.isdir(os.path.join(self.working_dir, "ShooterGame/Content/Mods/")):
        return
    for curdir, dirs, files in os.walk(os.path.join(self.working_dir, "ShooterGame/Content/Mods/")):
        for d in dirs:
            dirs[:] = [d for d in dirs if d not in exclude]
            self.installed_mods.append(d)
        break
It doesn't work, sadly. Have I missed something or just done everything wrong?
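Try passing topdown=True to os.walk() explicitly, like this (it is already the default, so this mainly documents the intent):
for curdir, dirs, files in os.walk(os.path.join(self.working_dir, "ShooterGame/Content/Mods/"), topdown=True):
I can't test it here, but the dirs[:] assignment should probably sit outside the inner for loop, since it replaces the contents of the very list that loop is iterating over. As the documentation says: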
When topdown is true, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames;
I'm assuming you want self.installed_mods to contain the values of dirs without the values of exclude.
You could simply call dirs.remove() for each value in exclude and then append the contents of dirs to self.installed_mods.
Or, more briefly: self.installed_mods.extend([d for d in dirs if d not in exclude]).
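Since only the immediate subdirectories of the mods folder are needed, os.walk is arguably more machinery than the task requires. Here is a minimal sketch using os.scandir instead; the function name list_mod_dirs is hypothetical, and the default exclusions are the two mod names mentioned in the question.

```python
import os

def list_mod_dirs(mods_path, exclude=("TheCenter", "11111111")):
    """Return sorted names of the immediate subdirectories of mods_path, minus `exclude`."""
    if not os.path.isdir(mods_path):
        return []
    with os.scandir(mods_path) as entries:
        # e.is_dir() filters out plain files that live next to the mod folders.
        return sorted(e.name for e in entries if e.is_dir() and e.name not in exclude)
```

In the method from the question this could be used as self.installed_mods.extend(list_mod_dirs(os.path.join(self.working_dir, "ShooterGame/Content/Mods/"))).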
I need to list all files with the containing directory path inside a folder. I tried to use os.walk, which obviously would be the perfect solution.
However, it also lists hidden folders and files. I'd like my application not to list any hidden folders or files. Is there any flag you can use to make it not yield any hidden files?
Cross-platform is not really important to me, it's ok if it only works for linux (.* pattern)
No, there is no option to os.walk() that'll skip those. You'll need to do so yourself (which is easy enough):
for root, dirs, files in os.walk(path):
    files = [f for f in files if not f[0] == '.']
    dirs[:] = [d for d in dirs if not d[0] == '.']
    # use files and dirs
Note the dirs[:] = slice assignment; os.walk recursively traverses the subdirectories listed in dirs. By replacing the elements of dirs with only those that satisfy a criterion (e.g., directories whose names don't begin with .), you ensure that os.walk() will not visit directories that fail to meet it.
This only works if you leave the topdown keyword argument set to True (the default). From the documentation of os.walk():
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
I realize it wasn't asked in the question, but I had a similar problem where I wanted to exclude both hidden files and files beginning with __, specifically __pycache__ directories. I landed on this question because I was trying to figure out why my list comprehension was not doing what I expected. I was not modifying the list in place with dirnames[:].
I created a list of prefixes I wanted to exclude and modified the dirnames in place like so:
exclude_prefixes = ('__', '.')  # exclusion prefixes
for dirpath, dirnames, filenames in os.walk(node):
    # exclude all dirs starting with exclude_prefixes
    dirnames[:] = [dirname
                   for dirname in dirnames
                   if not dirname.startswith(exclude_prefixes)]
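To make the difference between rebinding the name and modifying the list in place concrete, here is a small experiment; the helper name visited and its in_place flag are purely illustrative.

```python
import os

def visited(root, in_place):
    """Return the names of the directories os.walk visits below root."""
    names = set()
    for dirpath, dirnames, filenames in os.walk(root):
        kept = [d for d in dirnames if not d.startswith((".", "__"))]
        if in_place:
            dirnames[:] = kept  # mutates the list os.walk holds, so pruning works
        else:
            dirnames = kept     # rebinds a local name only; os.walk never sees it
        if dirpath != root:
            names.add(os.path.basename(dirpath))
    return names
```

With in_place=False the hidden and dunder directories are still walked, which is exactly the behaviour that makes a plain list comprehension appear to be "not working".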
My use-case was similar to that of OP, except I wanted to return a count of the total number of sub-directories inside a certain folder. In my case I wanted to omit any sub-directories named .git (as well as any folders that may be nested inside these .git folders).
In Python 3.6.7, I found that the accepted answer's approach didn't work: it counted all .git folders and their sub-folders. Here's what did work for me:
num_local_subdir = 0
for root, dirs, files in os.walk(local_folder_path):
    if '.git' in dirs:
        dirs.remove('.git')  # in-place removal, so os.walk skips .git entirely
    num_local_subdir += len(dirs)
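Another option uses the any and map functions to detect hidden folders. Note that this skips the processing of any directory that contains a hidden subdirectory, and it does not stop os.walk from descending into the hidden folders themselves; use the dirs[:] slice assignment if you need to prune them.
for root, dirs, files in os.walk(path):
    if any(map(lambda p: p[0] == '.', dirs)):
        continue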
I am aware that I can remove dirs from os.walk using something along the lines of
for root, dirs, files in os.walk('/path/to/dir'):
    ignore = ['dir1', 'dir2']
    dirs[:] = [d for d in dirs if d not in ignore]
I want to do the opposite of this, so that only the dirs in the list are kept. I've tried a few variations, but to no avail. Any pointers would be appreciated.
The dirs I am interested in are 2 levels down, so I have taken the comments on board, created global variables for the sub-levels, and am using the following code.
Expected Functionality
for root, dirs, files in os.walk(global_subdir):
    keep = ['dir1', 'dir2']
    dirs[:] = [d for d in dirs if d in keep]
    for filename in files:
        print(os.path.join(root, filename))
As said in the comments of a deleted answer -
As mentioned already, this doesn't work. The dirs in keep are 2 levels below root. I'm guessing this is causing the problem
The issue is that the directory one level above your required directory is never traversed, since it is not in your keep list, so the program never reaches the required directories.
The best way to solve this would be to start os.walk at the directory that is just one level above your required directory.
But this may not be possible, for example if the directories one level above the required ones are not known before traversing, or if the required directories sit under different parents. In that case, what you really want is to avoid looping through the files of directories that are not in the keep list.
A solution would be to traverse all directories, but loop through the files only when the current directory's name is in the keep list (or a set, for better performance). Example -
keep = {'required directory1', 'required directory2'}
for root, dirs, files in os.walk(global_subdir):
    # root is a full path, so compare its last component;
    # use `root in keep` instead if keep holds full paths
    if os.path.basename(root) in keep:
        for filename in files:
            print(os.path.join(root, filename))
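If the wanted directories always sit at a known depth, another option is to prune only at that depth, so that their parents are still traversed. The following is a sketch under that assumption; files_under_kept and its level parameter are hypothetical names, with level=2 meaning "two levels below the starting directory" as in the question.

```python
import os

def files_under_kept(root, keep, level=2):
    """Yield file paths at or below directories named in `keep`, `level` levels below root."""
    root = os.path.normpath(root)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.count(os.sep) - base_depth
        if depth == level - 1:
            # The children of this directory sit at `level`; keep only the wanted ones.
            dirnames[:] = [d for d in dirnames if d in keep]
        if depth >= level:
            for filename in filenames:
                yield os.path.join(dirpath, filename)
```

Unlike the traverse-everything approach, this never enters the unwanted siblings at all, which can matter on large trees.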
I do atomistic modelling, and use Python to analyze simulation results. To simplify work with a whole bunch of Python scripts used for different tasks, I decided to write simple GUI to run scripts from it.
I have a (rather complex) directory structure beginning from some root (say ~/calc), and I want to populate a wx.TreeCtrl control with the directories containing calculation results, preserving their structure. A folder contains results if it contains a file with the .EXT extension. What I try to do is walk through the dirs from the root and check in each dir whether it contains a .EXT file. When such a dir is reached, I add it and its ancestors to the tree:
def buildTree(self, rootdir):
    root = rootdir
    r = len(rootdir.split('/'))
    ids = {root : self.CalcTree.AddRoot(root)}
    for (dirpath, dirnames, filenames) in os.walk(root):
        for dirname in dirnames:
            fullpath = os.path.join(dirpath, dirname)
            if sum([s.find('.EXT') for s in filenames]) > -1 * len(filenames):
                ancdirs = fullpath.split('/')[r:]
                ad = rootdir
                for ancdir in ancdirs:
                    d = os.path.join(ad, ancdir)
                    ids[d] = self.CalcTree.AppendItem(ids[ad], ancdir)
                    ad = d
But this code ends up with many second-level nodes with the same name, and that's definitely not what I want. So I somehow need to check whether a node has already been added to the tree and, if so, attach the new node to the existing one, but I do not understand how this could be done. Could you please give me a hint?
Besides, the code contains 2 dirty hacks I'd like to get rid of:
I get the list of ancestor dirs by splitting the full path at '/' positions, which is Linux-specific;
I find out whether a .EXT file is in the directory by searching for the extension in each string of the filenames list, relying on the fact that s.find returns -1 if the substring is not found.
Is there a way to make these chunks of code more readable?
First of all the hacks:
To get the path separator for whatever OS you're using, you can use os.sep.
Use str.endswith(), together with the fact that in Python the empty list [] evaluates to False:
if [file for file in filenames if file.endswith('.EXT')]:
In terms of getting them all nicely nested, you're best off doing it recursively. The code would look something like the following. Please note that this is only a sketch to give you the idea: it has not been tested against your tree control, and it adds every directory entry rather than only the folders holding .EXT files.
def buildTree(self, rootdir):
    rootId = self.CalcTree.AddRoot(rootdir)
    self.buildTreeRecursion(rootdir, rootId)

def buildTreeRecursion(self, dirpath, parentId):
    # Iterate over the entries in dirpath
    for name in sorted(os.listdir(dirpath)):
        fullpath = os.path.join(dirpath, name)
        itemId = self.CalcTree.AppendItem(parentId, name)
        if os.path.isdir(fullpath):
            # Recurse so that children are attached to this entry's node
            self.buildTreeRecursion(fullpath, itemId)
Hope this helps!
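For the duplicate-node problem from the question, one possible approach is to first collect, once each, every directory that holds a .EXT file together with its ancestors, and only then insert them into the tree. The sketch below leaves wx out entirely so that it can be run on its own; result_dirs is a hypothetical helper name.

```python
import os

def result_dirs(rootdir, ext=".EXT"):
    """Return relative paths of directories holding a file ending in `ext`, plus their ancestors."""
    wanted = set()
    for dirpath, dirnames, filenames in os.walk(rootdir):
        if any(name.endswith(ext) for name in filenames):
            rel = os.path.relpath(dirpath, rootdir)
            # Add the directory and every ancestor up to (not including) the root.
            while rel not in ("", os.curdir):
                wanted.add(rel)  # a set stores each path once, so no duplicate nodes
                rel = os.path.dirname(rel)
    # Sorting puts each parent before its children, which suits tree insertion.
    return sorted(wanted)
```

Each returned path could then be attached under the node of os.path.dirname(path), looked up in a dictionary of already created item ids like the ids dict in the question, with the empty string mapped to the root item. Using os.path.relpath and os.path.dirname also avoids splitting on a hard-coded '/'.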