Python: walk a directory tree while excluding certain directories

I am trying to walk a directory tree and exclude certain directories. According to the question "os.walk exclude .svn folders", for example, I should be able to modify the 'dirs' list, which would then let me prune the tree. I tried the following:
import sys
import os

if __name__ == "__main__":
    for root, dirs, files in os.walk("/usr/lib"):
        print(root)
        dirs = []
I would have expected it not to enter ANY subdirectories, but it does:
/usr/lib
/usr/lib/akonadi
/usr/lib/akonadi/contact
/usr/lib/akonadi/contact/editorpageplugins
/usr/lib/os-prober
/usr/lib/gnome-settings-daemon-3.0
/usr/lib/gnome-settings-daemon-3.0/gtk-modules
/usr/lib/git-core
/usr/lib/git-core/mergetools
/usr/lib/gold-ld
/usr/lib/webkitgtk-3.0-0
/usr/lib/webkitgtk-3.0-0/libexec
What am I missing?

dirs = []
rebinds the local name dirs; the list that os.walk itself holds is left untouched, so the walk continues as before. Modify the contents of the list in place instead, e.g. like this:
dirs[:] = []
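The difference between rebinding and in-place modification can be checked with a small experiment. The following is a minimal sketch, assuming a throwaway tree built in a temporary directory; the helper name walk_roots and the folder names are made up for illustration.

```python
import os
import tempfile

def walk_roots(top, prune_in_place):
    """Return every directory that os.walk visits under `top`."""
    visited = []
    for root, dirs, files in os.walk(top):
        visited.append(root)
        if prune_in_place:
            dirs[:] = []  # empties the very list os.walk consults next
        else:
            dirs = []     # only rebinds the local name; os.walk is unaffected
    return visited

# Hypothetical tree: <top>/a/b
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "a", "b"))
    rebound = walk_roots(top, prune_in_place=False)
    pruned = walk_roots(top, prune_in_place=True)

print(len(rebound))  # 3: top, top/a and top/a/b are all visited
print(len(pruned))   # 1: only top is visited
```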

Try one of the following:
dirs[:] = []
OR
del dirs[:]

root is the path of the directory currently being visited, not just the top directory you started from.
The docs make it a bit clearer what it is doing by naming the variables differently:
for dirpath, dirnames, filenames in os.walk('/usr/lib'):
    print(dirpath)
See the docs here

Related

Python - Print all the directories except one

I have a Python script that prints all the directories under a main directory. What I want is to print all the directories except the old one (which I put in the exclude list).
For that I am using the following script:
import os

include = 'C://Data//'
exclude = ['C:/Data//00_Old']
for root, dirs, files in os.walk(include, topdown=False):
    dirs[:] = [d for d in dirs if d not in exclude]
    for name in dirs:
        directory = os.path.join(root, name)
        print(directory)
The problem is that it prints all the directories, even the excluded one. What am I doing wrong?
To simplify it even further, you can do:
from pathlib import Path

# I'm assuming this is where all the sub-folders you want to filter are.
include = 'C://Data//'
# You don't need the parent 'C://Data//' here because you are looping through the parent folder.
exclude = ['00_Old']
root_folder = Path(include)
for folder in root_folder.iterdir():
    # Compare the folder's name: a Path object never equals a plain string.
    if folder.is_dir() and folder.name not in exclude:
        print(folder)  # do work
It is better to use the pathlib module for file-system-related tasks. I would suggest trying something like this.
from pathlib import Path
paths = list(Path('C:/Data/').glob('**/*'))  # recursively get all files and directories
# Compare path components: on Windows str(x) uses backslashes, so a 'C:/Data/00_Old' substring test would never match.
print([x for x in paths if '00_Old' not in x.parts])
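For completeness, the os.walk approach from the question can also be made to work. Two things stand in its way: dirs contains bare names such as '00_Old' rather than full paths, and pruning only has an effect with topdown=True (the default). The following is a minimal sketch under those assumptions; the function name list_dirs and the sample folder names are hypothetical.

```python
import os
import tempfile

def list_dirs(top, exclude_names):
    """Collect every directory under `top`, skipping excluded names and their subtrees."""
    found = []
    # topdown=True (the default) is required for the pruning below to take effect.
    for root, dirs, files in os.walk(top, topdown=True):
        # dirs holds bare directory names, so compare against names, not full paths.
        dirs[:] = [d for d in dirs if d not in exclude_names]
        found.extend(os.path.join(root, d) for d in dirs)
    return found

# Hypothetical layout, built in a temporary directory for illustration.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "00_Old", "archive"))
    os.makedirs(os.path.join(top, "2024", "reports"))
    result = sorted(os.path.relpath(p, top) for p in list_dirs(top, {"00_Old"}))

print(result)  # neither 00_Old nor 00_Old/archive appears
```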

Python os.walk Include only specific folders

I am writing a Python script that takes user input in the form of a date, e.g. 20180829, which is the name of a subdirectory. The script uses os.walk to walk through a specific directory; once it reaches the directory that was passed in, it should look at all the directories within it and recreate that directory structure in a different location.
My directory structure will look something like this:
|dir1
|-----|dir2|
|-----------|dir3
|-----------|20180829
|-----------|20180828
|-----------|20180827
|-----------|20180826
So dir3 will have a number of subfolders, all named in the format of a date. I need to be able to copy the directory structure of just the directory that is passed in at the start, e.g. 20180829, and skip the rest of the directories.
I have been looking online for a way to do this, but all I can find are ways of excluding directories from os.walk, as in the thread below:
Filtering os.walk() dirs and files
I also found a thread that lets me print out the directory paths I want but does not let me create the directories I want:
Python 3.5 OS.Walk for selected folders and include their subfolders.
The following is the code I have. It prints out the correct directory structure but creates the entire directory structure in the new location, which I don't want it to do.
includes = '20180828'
inputpath = Desktop
outputpath = Documents
for startFilePath, dirnames, filenames in os.walk(inputpath, topdown=True):
    endFilePath = os.path.join(outputpath, startFilePath)
    if not os.path.isdir(endFilePath):
        os.mkdir(endFilePath)
    for filename in filenames:
        if (includes in startFilePath):
            print(includes, "+++", startFilePath)
            break
I am not sure I understand what you need, but I think you are overcomplicating a few things. If the code below doesn't help you, let me know and we will think about other approaches.
I ran this to create an example structure like yours.
# setup example project structure
import os
import sys

PLATFORM = 'windows' if sys.platform.startswith('win') else 'linux'
DESKTOP_DIR = \
    os.path.join(os.path.join(os.path.expanduser('~')), 'Desktop') \
    if PLATFORM == 'linux' \
    else os.path.join(os.path.join(os.environ['USERPROFILE']), 'Desktop')
example_dirs = ['20180829', '20180828', '20180827', '20180826']
for _dir in example_dirs:
    path = os.path.join(DESKTOP_DIR, 'dir_from', 'dir_1', 'dir_2', 'dir_3', _dir)
    os.makedirs(path, exist_ok=True)
And here's what you need.
# do what you want to do
dir_from = os.path.join(DESKTOP_DIR, 'dir_from')
dir_to = os.path.join(DESKTOP_DIR, 'dir_to')
target = '20180828'
for root, dirs, files in os.walk(dir_from, topdown=True):
    for _dir in dirs:
        if _dir == target:
            path = os.path.join(root, _dir).replace(dir_from, dir_to)
            os.makedirs(path, exist_ok=True)
            continue
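If the folders nested inside the target directory should be recreated as well, which is how I read "copy the directory structure" in the question, a second os.walk limited to the target subtree is one possible approach. This is only a sketch: the function name copy_dir_structure, the temporary directory, and the sample subfolders logs/raw are assumptions made for illustration.

```python
import os
import tempfile

def copy_dir_structure(src_root, dst_root, target):
    """Recreate under dst_root the directory tree found below each `target` folder."""
    created = []
    for root, dirs, files in os.walk(src_root, topdown=True):
        if target in dirs:
            target_path = os.path.join(root, target)
            # Walk only the target subtree and mirror each directory in it.
            for sub_root, _, _ in os.walk(target_path):
                rel = os.path.relpath(sub_root, src_root)
                os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
                created.append(rel)
            # The subtree is already handled, so the outer walk can skip it.
            dirs.remove(target)
    return created

# Hypothetical source tree with two dated folders; only one is copied.
with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "dir_from")
    dst = os.path.join(base, "dir_to")
    os.makedirs(os.path.join(src, "dir1", "dir2", "dir3", "20180828", "logs", "raw"))
    os.makedirs(os.path.join(src, "dir1", "dir2", "dir3", "20180829", "logs"))
    created = copy_dir_structure(src, dst, "20180828")
    nested_copied = os.path.isdir(
        os.path.join(dst, "dir1", "dir2", "dir3", "20180828", "logs", "raw"))
    sibling_copied = os.path.isdir(
        os.path.join(dst, "dir1", "dir2", "dir3", "20180829"))

print(len(created), nested_copied, sibling_copied)  # 3 True False
```

The destination is kept outside the source tree here; if dst_root lived inside src_root, the outer walk could end up visiting the freshly created copies.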

Python os.walk topdown true with regular expression

I am confused as to why the following ONLY works with topdown=False and returns nothing when it is set to True.
The reason I want to use topdown=True is that traversing the directories takes a very long time. I believe that going top-down will reduce the time taken to produce the list.
for root, dirs, files in os.walk(mypath, topdown=False):  # Why doesn't this work with True?
    dirs[:] = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)
In your code you were looking for matching names ([dmp]\d{8}) to traverse into, whereas you should traverse into the non-matching directories and add the matching names to a global list.
I modified your code and this works:
import os
import re

all_dirs = []
for root, dirs, files in os.walk("root", topdown=True):
    subset = []
    for d in dirs:
        if not re.match(r'[dmp]\d{8}$', d):
            # step inside
            subset.append(d)
        else:
            # add to list
            all_dirs.append(os.path.join(root, d))
    dirs[:] = subset
print(all_dirs)
This returns:
['root/temp1/myfiles/d12345678',
'root/temp1/myfiles/m11111111',
'root/temp2/mydirs/moredirs/m22222222',
'root/temp2/mydirs/moredirs/p00000001']
The problem is that you're modifying the contents of dirs while traversing. When using topdown=True this will impact what directories are traversed next.
Look at this code that shows you what is happening:
import os, re
for root, dirs, files in os.walk("./", topdown=False):
    print("Walked down {}, dirs={}".format(root, dirs))
    dirs[:] = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    print("After filtering dirs is now: " + str(dirs))
    for dir in dirs:
        print(dir)
I've just got one directory to traverse - Temp/MyFiles/D12345678 (I'm on Linux). With topdown=False the above produces this output:
Walked down ./Temp/MyFiles/D12345678, dirs=[]
After filtering dirs is now: []
Walked down ./Temp/MyFiles, dirs=['D12345678']
After filtering dirs is now: ['D12345678']
D12345678
Walked down ./Temp, dirs=['MyFiles']
After filtering dirs is now: []
Walked down ./, dirs=['Temp']
After filtering dirs is now: []
But with topdown=True we get this:
Walked down ./, dirs=['Temp']
After filtering dirs is now: []
Since you're removing all subdirectories from dirs, you're telling os.walk that you don't want to traverse further into any subdirectories, and therefore iteration stops. When using topdown=False, the modified value of dirs isn't used to determine what to traverse next, so it works.
To fix it, replace dirs[:] = with dirs =, so that the filtered names go into a new list and the list os.walk uses for traversal stays intact:
import os, re
for root, dirs, files in os.walk("./", topdown=True):
    dirs = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)
This gives us:
D12345678
Update:
If you're absolutely certain that a directory will not contain any subdirectories of interest to you, you can stop os.walk from descending into it. If, for example, you know that "./Temp/MyDirs2" will never contain any subdirectories of interest, you can empty dirs when the walk gets there to speed things up:
import os, re
uninteresting_roots = { "./Temp/MyDirs2" }
for root, dirs, files in os.walk("./", topdown=True):
    if root in uninteresting_roots:
        # Empty dirs and end this iteration
        del dirs[:]
        continue
    dirs = [d for d in dirs if re.match('[DMP]\\d{8}$', d)]
    for dir in dirs:
        print(dir)
Other than that, there is no way to know which directories you don't need to traverse, because to find out whether they contain interesting subdirectories you have to traverse them.
That is because the subdirectories of your root directory don't match the regex, so after the first iteration dirs is empty and the walk stops.
If what you want is to find all subdirectories which match the pattern, you should either:
use topdown=False, or
not prune the directories
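The effect of pruning on this search can be reproduced in isolation. The sketch below assumes a temporary tree shaped like the Temp/MyFiles/D12345678 example above; the helper name matching_dirs and its prune flag are made up for illustration.

```python
import os
import re
import tempfile

PATTERN = re.compile(r'[DMP]\d{8}$')

def matching_dirs(top, prune):
    """Return the names of directories under `top` that match PATTERN."""
    matches = []
    for root, dirs, files in os.walk(top, topdown=True):
        wanted = [d for d in dirs if PATTERN.match(d)]
        matches.extend(wanted)
        if prune:
            # In-place assignment: os.walk now descends only into matching names.
            dirs[:] = wanted
    return matches

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "Temp", "MyFiles", "D12345678"))
    pruned = matching_dirs(top, prune=True)
    unpruned = matching_dirs(top, prune=False)

print(pruned)    # []: 'Temp' does not match, so the walk never gets past the top level
print(unpruned)  # ['D12345678']
```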

List files ONLY in the current directory

In Python, I want to list only the files in the current directory. I do not want files listed from any subdirectory or parent.
There do seem to be similar solutions out there, but they don't seem to work for me. Here's my code snippet:
import os
for subdir, dirs, files in os.walk('./'):
    for file in files:
        # do some stuff
        print(file)
Let's suppose I have 2 files, holygrail.py and Tim inside my current directory. I have a folder as well and it contains two files - let's call them Arthur and Lancelot - inside it. When I run the script, this is what I get:
holygrail.py
Tim
Arthur
Lancelot
I am happy with holygrail.py and Tim. But the two files, Arthur and Lancelot, I do not want listed.
Just use os.listdir and os.path.isfile instead of os.walk.
Example:
import os
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    pass  # do something
But be careful when applying this to another directory, like
files = [f for f in os.listdir(somedir) if os.path.isfile(f)]
which would not work because f is not a full path but a bare name, which os.path.isfile resolves relative to the current directory.
Therefore, to filter another directory, use os.path.isfile(os.path.join(somedir, f)).
(Thanks Causality for the hint)
You can use os.listdir for this purpose. If you only want files and not directories, you can filter the results using os.path.isfile.
example:
files = os.listdir(os.curdir)  # files and directories
or
files = filter(os.path.isfile, os.listdir(os.curdir))  # files only (a lazy iterator in Python 3; wrap it in list() if you need a list)
files = [f for f in os.listdir(os.curdir) if os.path.isfile(f)]  # list comprehension version
import os
destdir = '/var/tmp/testdir'
files = [ f for f in os.listdir(destdir) if os.path.isfile(os.path.join(destdir,f)) ]
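You can use os.scandir(), a function added to the standard library in Python 3.5.
import os
for entry in os.scandir('.'):
    if entry.is_file():
        print(entry.name)
This is faster than os.listdir() followed by os.path.isfile(), because the directory entries usually carry the file-type information already. Since Python 3.5, os.walk() is itself implemented with os.scandir().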
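Wrapped in a function, the os.scandir() approach works for any directory without the relative-path pitfall of os.listdir. The following is a minimal sketch, assuming Python 3.6 or later (where os.scandir supports the with statement); the function name files_in and the sample file names are hypothetical.

```python
import os
import tempfile

def files_in(directory):
    """Return the names of regular files directly inside `directory`."""
    # The context manager closes the scandir iterator promptly (Python 3.6+).
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries if entry.is_file())

# Hypothetical layout mirroring the question: two files plus a folder with files.
with tempfile.TemporaryDirectory() as top:
    for name in ("holygrail.py", "Tim"):
        open(os.path.join(top, name), "w").close()
    os.mkdir(os.path.join(top, "knights"))
    open(os.path.join(top, "knights", "Arthur"), "w").close()
    listed = files_in(top)

print(listed)  # ['Tim', 'holygrail.py']
```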
You can use the pathlib module.
from pathlib import Path
x = Path('./')
print(list(filter(lambda y:y.is_file(), x.iterdir())))
This can also be done with os.walk() (tested with Python 3.5.2):
import os
for root, dirs, files in os.walk('.', topdown=True):
    dirs.clear()  # with topdown=True, this prevents walk from going into subdirectories
    for file in files:
        # do some stuff
        print(file)
Remove the dirs.clear() line and the files in subfolders are included again.
Update with references:
os.walk is documented here; the documentation describes the (dirpath, dirnames, filenames) triple that is generated and the effect of topdown.
.clear() is documented here; it empties a list.
So by clearing the relevant list from os.walk you can tailor its result to your needs.
import os
for subdir, dirs, files in os.walk('./'):
    for file in files:
        # do some stuff
        print(file)
You can improve this code with del dirs[:], as follows.
import os
for subdir, dirs, files in os.walk('./'):
    del dirs[:]
    for file in files:
        # do some stuff
        print(file)
Or, even better, point os.walk at the current working directory.
import os
cwd = os.getcwd()
for subdir, dirs, files in os.walk(cwd, topdown=True):
    del dirs[:]  # remove the subdirectories
    for file in files:
        # do some stuff
        print(file)
Instead of os.walk, just use os.listdir.
To list the files in a specific folder, excluding files in its subfolders, with os.walk use:
_, _, file_list = next(os.walk(data_folder))
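One caveat with the next(os.walk(...)) idiom: os.walk ignores errors by default, so for a folder that does not exist it yields nothing and next() raises StopIteration. Passing a default triple to next() avoids that. A minimal sketch, where the function name top_level_files and the sample names are assumptions for illustration:

```python
import os
import tempfile

def top_level_files(folder):
    """Return the file names directly inside `folder` using a single os.walk step."""
    # os.walk yields nothing for a missing or unreadable folder (errors are
    # ignored by default), so the default triple prevents StopIteration.
    _, _, file_list = next(os.walk(folder), (folder, [], []))
    return sorted(file_list)

with tempfile.TemporaryDirectory() as top:
    open(os.path.join(top, "a.txt"), "w").close()
    os.mkdir(os.path.join(top, "sub"))
    open(os.path.join(top, "sub", "b.txt"), "w").close()
    present = top_level_files(top)
    missing = top_level_files(os.path.join(top, "does_not_exist"))

print(present)  # ['a.txt']
print(missing)  # []
```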
Following up on Pygirl's and Flimm's use of pathlib (a really helpful reference, by the way): their solution included the full path in the result, so here is a solution that outputs just the file names:
from pathlib import Path
p = Path(destination_dir) # destination_dir = './' in original post
files = [x.name for x in p.iterdir() if x.is_file()]
print(files)

Efficiently removing subdirectories in dirnames from os.walk

On a Mac with Python 2.7, when walking through directories using os.walk, my script goes through 'apps', i.e. appname.app, since those are really just directories themselves. Later on in processing I hit errors when going through them. I don't want to go through them anyway, so for my purposes it would be best just to ignore those types of 'directories'.
So this is my current solution:
for root, subdirs, files in os.walk(directory, True):
    for subdir in subdirs:
        if '.' in subdir:
            subdirs.remove(subdir)
    #do more stuff
As you can see, the second for loop will run for every iteration of subdirs, which is unnecessary since the first pass removes everything I want to remove anyways.
There must be a more efficient way to do this. Any ideas?
You can do something like this (assuming you want to ignore directories containing '.'):
subdirs[:] = [d for d in subdirs if '.' not in d]
The slice assignment (rather than just subdirs = ...) is necessary because you need to modify the same list that os.walk is using, not create a new one.
Note that your original code is incorrect because it modifies the list while iterating over it, which causes the loop to skip the element that follows each removed one.
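The skipping is easy to reproduce without touching the file system. The sketch below uses a plain list of made-up names; both helper functions are hypothetical and exist only to contrast the two patterns.

```python
def remove_while_iterating(names):
    """Buggy pattern: remove items from the list that is being iterated."""
    names = list(names)
    for name in names:
        if '.' in name:
            names.remove(name)  # shifts later items left, so the next one is skipped
    return names

def filter_in_place(names):
    """Slice assignment keeps the same list object and filters every item."""
    names = list(names)
    names[:] = [n for n in names if '.' not in n]
    return names

sample = ['a.app', 'b.app', 'docs', 'c.app']
buggy = remove_while_iterating(sample)
fixed = filter_in_place(sample)

print(buggy)  # ['b.app', 'docs']: 'b.app' was skipped after 'a.app' was removed
print(fixed)  # ['docs']
```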
Perhaps this example from the Python docs for os.walk will be helpful. It works from the bottom up (deleting).
# Delete everything reachable from the directory named in "top",
# assuming there are no symbolic links.
# CAUTION: This is dangerous! For example, if top == '/', it
# could delete all your disk files.
import os
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        os.remove(os.path.join(root, name))
    for name in dirs:
        os.rmdir(os.path.join(root, name))
I am a bit confused about your goal: are you trying to remove a directory subtree and encountering errors, or are you trying to walk a tree and list simple file names (excluding directory names)?
I think all that is required is to remove the directory before iterating over it:
for root, subdirs, files in os.walk(directory, True):
    # Note: this only removes an entry named exactly '.', not names that merely contain a dot.
    if '.' in subdirs:
        subdirs.remove('.')
    for subdir in subdirs:
        pass  # do more stuff
