Python os.walk topdown true with regular expression

I am confused as to why the following only works with topdown=False and returns nothing when set to True.
The reason I want to use topdown=True is that traversing the directories takes a very long time. I believe that going top-down will reduce the time taken to produce the list.
for root, dirs, files in os.walk(mypath, topdown=False):  # Why doesn't this work with True?
    dirs[:] = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)

Your code prunes dirs down to the matching names ([dmp]\d{8}), so os.walk only descends into directories that match. Instead, you should descend into the non-matching directories and add the matching names to a global list.
I modified your code and this works:
import os
import re
all_dirs = []
for root, dirs, files in os.walk("root", topdown=True):
    subset = []
    for d in dirs:
        if not re.match(r'[dmp]\d{8}$', d):
            # no match: step inside
            subset.append(d)
        else:
            # match: add to list, do not descend
            all_dirs.append(os.path.join(root, d))
    # in-place assignment tells os.walk which directories to visit next
    dirs[:] = subset
print(all_dirs)
This returns:
['root/temp1/myfiles/d12345678',
'root/temp1/myfiles/m11111111',
'root/temp2/mydirs/moredirs/m22222222',
'root/temp2/mydirs/moredirs/p00000001']

The problem is that you're modifying the contents of dirs while traversing. When using topdown=True this will impact what directories are traversed next.
Look at this code that shows you what is happening:
import os, re
for root, dirs, files in os.walk("./", topdown=False):
    print("Walked down {}, dirs={}".format(root, dirs))
    dirs[:] = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    print("After filtering dirs is now: " + str(dirs))
    for dir in dirs:
        print(dir)
I've just got one directory to traverse - Temp/MyFiles/D12345678 (I'm on Linux). With topdown=False the above produces this output:
Walked down ./Temp/MyFiles/D12345678, dirs=[]
After filtering dirs is now: []
Walked down ./Temp/MyFiles, dirs=['D12345678']
After filtering dirs is now: ['D12345678']
D12345678
Walked down ./Temp, dirs=['MyFiles']
After filtering dirs is now: []
Walked down ./, dirs=['Temp']
After filtering dirs is now: []
But with topdown=True we get this:
Walked down ./, dirs=['Temp']
After filtering dirs is now: []
Since you're removing all subdirectories from dirs, you're telling os.walk that you don't want to traverse further into any subdirectories, so iteration stops. With topdown=False the modified value of dirs isn't used to determine what to traverse next, which is why it works.
To fix it, replace dirs[:] = with dirs =, which rebinds the name and leaves os.walk's own list untouched:
import os, re
for root, dirs, files in os.walk("./", topdown=True):
    dirs = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)
This gives us:
D12345678
Update:
If you're absolutely certain that a directory will not contain any subdirectories of interest, you can remove its subdirectories from dirs before traversing any further. If, for example, you know that "./Temp/MyDirs2" will never contain any subdirectories of interest, you can empty dirs when you get there to speed things up:
import os, re
uninteresting_roots = { "./Temp/MyDirs2" }
for root, dirs, files in os.walk("./", topdown=True):
    if root in uninteresting_roots:
        # Empty dirs in place and end this iteration
        del dirs[:]
        continue
    dirs = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)
Other than that, there is no way to know in advance which directories you can skip, because finding out whether they contain interesting subdirectories requires traversing them.

That is because none of the directories directly under your root directory match the regex, so after the first iteration dirs is set to empty and the walk stops.
If what you want is to find all subdirectories which match the pattern, you should either:
use topdown=False, or
not prune the directories.
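The second option can be sketched as follows. This is only an illustration: the function name find_matching_dirs and the PATTERN constant are made up for the example, and the regex is the one from the question.

```python
import os
import re

# Pattern taken from the question: one of D, M, P followed by exactly 8 digits.
PATTERN = re.compile(r'[DMP]\d{8}$')

def find_matching_dirs(top):
    """Return the paths of all directories under top whose name matches PATTERN."""
    matches = []
    for root, dirs, files in os.walk(top, topdown=True):
        # dirs is only read, never modified, so os.walk still descends everywhere.
        for d in dirs:
            if PATTERN.match(d):
                matches.append(os.path.join(root, d))
    return matches
```

Because nothing is pruned, this visits the whole tree, so it is no faster than topdown=False; it only avoids the empty result.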

Related

Python - Print all the directories except one

I have a Python script that prints all the directories under a main directory. What I want is to print all the directories except the old one (which I include in an exclude list).
For that I am using the following script:
include = 'C://Data//'
exclude = ['C:/Data//00_Old']
for root, dirs, files in os.walk(include, topdown=False):
    dirs[:] = [d for d in dirs if d not in exclude]
    for name in dirs:
        directory = os.path.join(root, name)
        print(directory)
The problem is that it prints all the directories, even the excluded one. What am I doing wrong?

To simplify it even further, you can do:
from pathlib import Path
# I'm assuming this is where all the sub-folders you want to filter live.
include = 'C://Data//'
# You don't need the parent 'C://Data//' prefix because you are looping through the parent folder.
exclude = ['00_Old']
root_folder = Path(include)
for folder in root_folder.iterdir():
    # folder is a Path object, so compare its name against the excluded names
    if folder.is_dir() and folder.name not in exclude:
        print(folder)  # do work

It is better to use the pathlib module for file-system tasks. I would suggest trying something like this:
from pathlib import Path
files = list(Path('C:/Data/').glob('**/*'))  # recursively get all file and directory paths
# as_posix() gives forward slashes, so the substring test also works on Windows
print([x for x in files if 'C:/Data/00_Old' not in x.as_posix()])
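For comparison, here is a sketch of how the os.walk approach from the question could be made to work. It relies on two facts: dirs holds bare directory names rather than full paths, and pruning dirs only affects the walk when topdown=True. The function name list_dirs_except is hypothetical.

```python
import os

def list_dirs_except(top, excluded_names):
    """Return every directory path under top, skipping excluded names and their subtrees."""
    found = []
    for root, dirs, files in os.walk(top, topdown=True):
        # In-place pruning: names removed here are never visited by os.walk.
        dirs[:] = [d for d in dirs if d not in excluded_names]
        for name in dirs:
            found.append(os.path.join(root, name))
    return found
```

With the question's layout this would be called as list_dirs_except('C:/Data', {'00_Old'}), passing the bare name 00_Old rather than its full path.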

Navigating specific dirs in filter with os.walk

I am aware that I can remove dirs from os.walk using something along the lines of
for root, dirs, files in os.walk('/path/to/dir'):
    ignore = ['dir1', 'dir2']
    dirs[:] = [d for d in dirs if d not in ignore]
I want to do the opposite of this, so only keep the dirs in the list. I've tried a few variations but to no avail. Any pointers would be appreciated.
The dirs I am interested in are 2 levels down, so I have taken the comments on board, created global variables for the sub-levels, and am using the following code.
Expected Functionality
for root, dirs, files in os.walk(global_subdir):
    keep = ['dir1', 'dir2']
    dirs[:] = [d for d in dirs if d in keep]
    for filename in files:
        print(os.path.join(root, filename))

As said in the comments of a deleted answer:
As mentioned already, this doesn't work. The dirs in keep are 2 levels below the root. I'm guessing this is causing the problem.
The issue is that the directory one level above your required directory is never traversed, since it's not in your keep list, so the program never reaches your required directories.
The best way to solve this would be to start os.walk at the directory just one level above your required directory.
That may not be possible, for example when the directories one level above the required ones are not known before traversing, or when the required directories have different parents. In that case, what you really want is to skip the files of directories that are not in the keep list.
A solution is to traverse all directories but loop through the files only when root is in the keep list (or set, for better performance). Example:
keep = {'required directory1', 'required directory2'}
for root, dirs, files in os.walk(global_subdir):
    # root is the path of the current directory as built from global_subdir,
    # so keep must hold paths in that same form
    if root in keep:
        for filename in files:
            print(os.path.join(root, filename))
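Another possibility, sketched here under the assumption that the wanted directories all sit at a known depth below the starting point, is to prune only at that depth and leave the levels above it alone. The function walk_keeping and its depth parameter are hypothetical names for this illustration.

```python
import os

def walk_keeping(top, keep, depth):
    """Yield (root, files) for directories at or below `depth`, descending
    only into directories at `depth` whose name is in `keep`."""
    top = os.path.normpath(top)
    for root, dirs, files in os.walk(top):
        rel = os.path.relpath(root, top)
        # top itself is level 0, its children are level 1, and so on.
        level = 0 if rel == os.curdir else rel.count(os.sep) + 1
        if level == depth - 1:
            # The entries of dirs sit at the target depth: keep only the wanted names.
            dirs[:] = [d for d in dirs if d in keep]
        if level >= depth:
            yield root, files
```

For directories two levels down, a call such as walk_keeping(global_subdir, {'dir1', 'dir2'}, 2) would yield only the kept directories and anything nested inside them.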

clean way to os.walk once python

I want to walk a few directories once, and just grab the info for one dir. Currently I use:
i = 0
for root, dirs, files in os.walk(home_path):
    if i >= 1:
        return 1
    i += 1
    for this_dir in dirs:
        pass  # do stuff
This is horribly tedious of course. When I want to walk the subdir under it, I do the same 5 lines, using j, etc...
What is the shortest way to grab all dirs and files underneath a single directory in Python?

You can empty the dirs list and os.walk() won't recurse:
for root, dirs, files in os.walk(home_path):
    for dir in dirs:
        pass  # do something with each directory
    dirs[:] = []  # clear directories.
Note the dirs[:] = slice assignment; we are replacing the elements in dirs (and not the list referred to by dirs) so that os.walk() will not process deleted directories.
This only works if you keep the topdown keyword argument to True, from the documentation of os.walk():
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.
Alternatively, use os.listdir() and sort the names into directories and files yourself:
dirs = []
files = []
for name in os.listdir(home_path):
    path = os.path.join(home_path, name)
    # isdir lives in os.path, not in os
    if os.path.isdir(path):
        dirs.append(name)
    else:
        files.append(name)
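Since os.walk() is a generator, a third option is to take only its first item with next(). The sketch below wraps that in a helper; the name list_one_level is made up for this example.

```python
import os

def list_one_level(path):
    """Return (dirs, files) found directly inside path, without descending."""
    # The first item os.walk yields describes path itself.
    # If path does not exist, os.walk yields nothing and next() raises StopIteration.
    root, dirs, files = next(os.walk(path))
    return dirs, files
```

To inspect a subdirectory afterwards, the same helper can be called again with os.path.join(path, name) instead of repeating the counter loop.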

python walk directory tree with excluding certain directories

I am trying to walk a directory tree and exclude certain directories. Now, according to "os.walk exclude .svn folders", for example, I should be able to modify the 'dirs' list, which would then let me prune the tree. I tried the following:
import sys
import os
if __name__ == "__main__":
    for root, dirs, files in os.walk("/usr/lib"):
        print(root)
        dirs = []
I would have expected not to enter ANY subdirectories, but I do:
/usr/lib
/usr/lib/akonadi
/usr/lib/akonadi/contact
/usr/lib/akonadi/contact/editorpageplugins
/usr/lib/os-prober
/usr/lib/gnome-settings-daemon-3.0
/usr/lib/gnome-settings-daemon-3.0/gtk-modules
/usr/lib/git-core
/usr/lib/git-core/mergetools
/usr/lib/gold-ld
/usr/lib/webkitgtk-3.0-0
/usr/lib/webkitgtk-3.0-0/libexec
What am I missing?

dirs = []
rebinds the local name dirs to a new list; the list that os.walk holds is unchanged. You can modify the contents of the list instead, e.g. like this:
dirs[:] = []

Try one of the following:
dirs[:] = []
OR
del dirs[:]
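The difference between rebinding a name and changing a list in place can be seen without touching the file system. In this sketch, remaining_after is a hypothetical stand-in for os.walk: it hands a list to a callback and then reports what it still sees.

```python
def remaining_after(callback):
    """Hypothetical stand-in for os.walk: pass a list out, then report what is left in it."""
    dirnames = ['akonadi', 'git-core']
    callback(dirnames)
    return dirnames

def rebind(dirs):
    # Points the local name at a new list; the caller's list is untouched.
    dirs = []

def clear_in_place(dirs):
    # Slice assignment empties the very list object the caller holds.
    dirs[:] = []

print(remaining_after(rebind))          # ['akonadi', 'git-core']
print(remaining_after(clear_in_place))  # []
```

The loop body of an os.walk loop is in the position of the callback here, which is why dirs = [] has no effect on the traversal.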

root gives the full path of the directory currently being visited, not just the root from where you started.
The names used in the docs make it a bit clearer what it's doing:
for dirpath, dirnames, filenames in os.walk('/usr/lib'):
    print(dirpath)

Efficiently removing subdirectories in dirnames from os.walk

On a mac in python 2.7 when walking through directories using os.walk my script goes through 'apps' i.e. appname.app, since those are really just directories of themselves. Well later on in processing I am hitting errors when going through them. I don't want to go through them anyways so for my purposes it would be best just to ignore those types of 'directories'.
So this is my current solution:
for root, subdirs, files in os.walk(directory, True):
    for subdir in subdirs:
        if '.' in subdir:
            subdirs.remove(subdir)
    #do more stuff
As you can see, the second for loop will run for every iteration of subdirs, which is unnecessary since the first pass removes everything I want to remove anyways.
There must be a more efficient way to do this. Any ideas?

You can do something like this (assuming you want to ignore directories containing '.'):
subdirs[:] = [d for d in subdirs if '.' not in d]
The slice assignment (rather than just subdirs = ...) is necessary because you need to modify the same list that os.walk is using, not create a new one.
Note that your original code is incorrect because you modify the list while iterating over it: each removal shifts the remaining items, so the loop skips the element that follows a removed one.
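The effect of removing items during iteration is easy to reproduce on a plain list. The sketch below uses made-up names (Mail.app, Safari.app, Documents) and two hypothetical helper functions to compare the two approaches.

```python
def remove_while_iterating(names):
    """Buggy variant: removing during iteration makes the loop skip items."""
    names = list(names)
    for name in names:
        if '.' in name:
            names.remove(name)
    return names

def filter_with_slice(names):
    """Builds the filtered list first, then replaces the contents in place."""
    names = list(names)
    names[:] = [n for n in names if '.' not in n]
    return names

sample = ['Mail.app', 'Safari.app', 'Documents']
print(remove_while_iterating(sample))  # ['Safari.app', 'Documents']
print(filter_with_slice(sample))       # ['Documents']
```

In the first variant Safari.app survives because it slides into the slot the loop has already passed.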

Perhaps this example from the Python docs for os.walk will be helpful. It works from the bottom up (deleting).
# Delete everything reachable from the directory named in "top",
# assuming there are no symbolic links.
# CAUTION: This is dangerous! For example, if top == '/', it
# could delete all your disk files.
import os
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        os.remove(os.path.join(root, name))
    for name in dirs:
        os.rmdir(os.path.join(root, name))
I am a bit confused about your goal, are you trying to remove a directory subtree and are encountering errors, or are you trying to walk a tree and just trying to list simple file names (excluding directory names)?

I think all that is required is to remove the directory before iterating over it:
for root, subdirs, files in os.walk(directory, True):
    if '.' in subdirs:
        subdirs.remove('.')
    for subdir in subdirs:
        pass  # do more stuff
