I have some simple code to print out the structure of a directory.
My example directory ABC contains subdirectory A containing A.txt, a subdirectory Z containing Z.txt, and a file info.txt. In real use, this will be big collection of many files and nested directories.
import os
topdir = 'ABC/'
for dirpath, dirnames, files in os.walk(topdir):
print(os.path.join(dirpath))
for name in files:
print(os.path.join(dirpath, name))
The output is:
ABC/
ABC/info.txt
ABC/A
ABC/A/A.txt
ABC/Z
ABC/Z/Z.txt
How can I make it so directories are processed/printed on the top?
I want the output to replicate what I see in Windows Explorer, which displays directories first, and files after.
The output I want:
ABC/
ABC/A
ABC/A/A.txt
ABC/Z
ABC/Z/Z.txt
ABC/info.txt
Without storing all the files in a list and sorting that list in one way or the other, you could make a recursive function and first recurse to the next level of the directory structure before printing the files on the current level:
def print_dirs(directories):
try:
dirpath, dirnames, files = next(directories)
print(dirpath) # print current path; no need for join here
for _ in dirnames: # once for each sub-directory...
print_dirs(directories) # ... recursively call generator
for name in files: # now, print files in current directory
print(os.path.join(dirpath, name))
except StopIteration:
pass
print_dirs(os.walk(topdir))
The same could also be done with a stack, but I think this way it's a little bit clearer. And yes, this will also store some directories in a list/on a stack, but not all the files but just as many as there are levels of nested directories.
Edit: This had a problem of printing any next directory on the generator, even if that's not a sub-directory but a sibling (or "uncle" or whatever). The for _ in dirnames loop should fix that, making the recursive call once for each of the subdirectories, if any. The directory itself does not have to be passed as a parameter as it will be gotten from the generator.
Related
I’m looking (in Python 3) for a cross-platform way to get a list of all the file and folder paths within a folder, similar to what I would get with pexpect.run(“find /media/elon/SuperDrive/*”).splitlines() on Linux. Is there already a function to do this, say, somewhere in shutil or glob? I could write my own function, but I figured there might be something pre-built that could possibly do it quicker than my code could.
The walk function in the native module os does this nicely.
Help on function walk in module os:
walk(top, topdown=True, onerror=None, followlinks=False)
Directory tree generator.
For each directory in the directory tree rooted at top (including top
itself, but excluding '.' and '..'), yields a 3-tuple
dirpath, dirnames, filenames
dirpath is a string, the path to the directory. dirnames is a list of
the names of the subdirectories in dirpath (excluding '.' and '..').
filenames is a list of the names of the non-directory files in dirpath.
Note that the names in the lists are just names, with no path components.
To get a full path (which begins with top) to a file or directory in
dirpath, do os.path.join(dirpath, name).
If optional arg 'topdown' is true or not specified, the triple for a
directory is generated before the triples for any of its subdirectories
(directories are generated top down). If topdown is false, the triple
for a directory is generated after the triples for all of its
subdirectories (directories are generated bottom up).
When topdown is true, the caller can modify the dirnames list in-place
(e.g., via del or slice assignment), and walk will only recurse into the
subdirectories whose names remain in dirnames; this can be used to prune the
search, or to impose a specific order of visiting. Modifying dirnames when
topdown is false has no effect on the behavior of os.walk(), since the
directories in dirnames have already been generated by the time dirnames
itself is generated. No matter the value of topdown, the list of
subdirectories is retrieved before the tuples for the directory and its
subdirectories are generated.
By default errors from the os.scandir() call are ignored. If
optional arg 'onerror' is specified, it should be a function; it
will be called with one argument, an OSError instance. It can
report the error to continue with the walk, or raise the exception
to abort the walk. Note that the filename is available as the
filename attribute of the exception object.
By default, os.walk does not follow symbolic links to subdirectories on
systems that support them. In order to get this functionality, set the
optional argument 'followlinks' to true.
Caution: if you pass a relative pathname for top, don't change the
current working directory between resumptions of walk. walk never
changes the current directory, and assumes that the client doesn't
either.
Example:
import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
print(root, "consumes", end="")
print(sum(getsize(join(root, name)) for name in files), end="")
print("bytes in", len(files), "non-directory files")
if 'CVS' in dirs:
dirs.remove('CVS') # don't visit CVS directories
I am trying to make a small program that looks through a directory (as I want to find recursively all the files in the sub directories I use os.walk()).
Here is my code:
import os
import os.path
filesList=[]
path = "C:\\Users\Robin\Documents"
for(root,dirs,files) in os.walk(path):
for file in files:
filesList+=file
Then I try to use the os.path.getsize() method to elements of filesList, but it doesn't work.
Indeed, I realize that the this code fills the list filesList with characters. I don't know what to do, I have tried several other things, such as :
for(root,dirs,files) in os.walk(path):
filesList+=[file for file in os.listdir(root) if os.path.isfile(file)]
This does give me files, but only one, which isn't even visible when looking in the directory.
Can someone explain me how to obtain files with which we can work (that is to say, get their size, hash them, or modify them...) on with os.walk ?
I am new to Python, and I don't really understand how to use os.walk().
The issue I suspect you're running into is that file contains only the filename itself, not any directories you have to navigate through from your starting folder. You should use os.path.join to combine the file name with the folder it is in, which is the root value yielded by os.walk:
for(root,dirs,files) in os.walk(path):
for file in files:
filesList.append(os.path.join(root, file))
Now all the filenames in filesList will be acceptable to os.path.getsize and other functions (like open).
I also fixed a secondary issue, which is that your use of += to extend a list wouldn't work the way you intended. You'd need to wrap the new file path in a list for that to work. Using append is more appropriate for adding a single value to the end of a list.
If you want to get a list of files including path use:
for(root, dirs, files) in os.walk(path):
fullpaths = [os.path.join(root, fil) for fil in files]
filesList+=fullpaths
I know there are several posts that touch on this, but I haven't found one that works for me yet. I need to create a list of files with an .mxd extension by searching an entire mapped directory. I used this code and it works:
import os
file_list = []
for (paths, dirs, files) in os.walk(folder):
for file in files:
if file.endswith(".mxd"):
file_list.append(os.path.join(paths, file))
However, it only works on the C drive. I need to be able to search for these files on a mapped drive of my choosing. Here's the script I'm using, but it doesn't work. I know there is an mxd file in the subdirectory of this drive, but it isn't being reported in the file list. In fact, the file list is totally empty, and it shouldn't be.
import os
path = r"U:/TEST/"
filenamelist = []
for files in os.walk(path):
if file.endswith(".mxd"):
filenamelist.append(files)
Does someone see anything wrong in my second block of code that would provent it from iterated through subdirectories at the given path and reporting back files with an .mxd extension?
os.walk(path) yields a 3-tuple which is commonly unpacked as root, dirs, files. So instead of
for files in os.walk(path):
...
use
for root, dirs, files in os.walk(path):
for filename in files:
if filename.endswith(".mxd"):
filenamelist.append(filename)
Does someone see anything wrong in my second block of code that would provent it from iterated through subdirectories at the given path and reporting back files with an .mxd extension?
Yes. Compare and contrast your two loops:
for (paths, dirs, files) in os.walk(folder):
for file in files:
if file.endswith(".mxd"):
file_list.append(os.path.join(paths, file))
for files in os.walk(path):
if file.endswith(".mxd"):
filenamelist.append(files)
You're trying to do the exact same thing, but not using even remotely the same code.
In the first one, you loop over the walk, storing each tuple in (paths, dirs, files), then loop over files.
In the second one, you loop over the walk, storing each tuple in files, don't loop over anything, and then just use some variable named file left over from some earlier code.
And then, even if that part worked, you end up appending files (which, remember, is a tuple of three lists) rather than file—or, probably better, os.path.join(paths, file)—to the list.
Just make the second one look like the first. Or, better, put it in a function and call it twice, instead of copying and pasting it.
Here's the final script that worked for crawling on a drive root other than C:
import os
file_list = []
path = r'U:\\'
for (dirpath, subdirs, files) in os.walk(path):
for file in files:
if file.endswith(".mxd"):
file_list.append(os.path.join(dirpath, file))
I have this script, which I have no doubt is flawed:
import fnmatch, os, sys
def findit (rootdir, find, pattern):
for folder, dirs, files in os.walk(rootdir):
print (folder)
for filename in fnmatch.filter(files,pattern):
with open(filename) as f:
s = f.read()
f.close()
if find in s :
print(filename)
findit(sys.argv[1], sys.argv[2], sys.argv[3])
when I run it I get Errno2, no such file or directory. BUT the file exists. For instance if I execute it by going: findit.py c:\python "folder" *.py it will work just fine, listing all the *.py files which contain the word "folder". BUT if I go findit.py c:\php\projects1 "include" *.php
as an example I get [Errno2] no such file or directory: 'About.php' (for example). But About.php exists. I don't understand what it's doing, or what I'm doing wrong.
If you look at any of the examples for os.walk, you'll see that they all do os.path.join(root, name). You need to do that too.
Why? Quoting from the docs:
filenames is a list of the names of the non-directory files in dirpath. Note that the names in the lists contain no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name).
If you just use the filename as a path, it's going to look for a file of the same name in the current working directory. If there's no such file, you'll get a FileNotFoundError. If there is such a file, you'll open and read the wrong file. Only if you happen to be looking inside the current working directory will it work.
There's also another major problem in your code: os.walk walks a directory tree recursively, finding all files in the given top directory, or any subdirectory of top, or any subdirectory of… and so on, yielding once for each directory. But you're not doing anything useful with that (except printing out the folders). Instead, you wait until it finishes, and then use the files from whichever directory it happened to reach last.
If you just want to get a flat listing of the files directly in a directory, use os.listdir, not os.walk. (Or maybe use glob.glob instead of explicitly listing everything then filtering with fnmatch.)
On the other hand, if you want to walk the tree, you have to move your second for loop inside the first one.
You've also got a minor problem: You call f.close() inside a with open(…) as f:, which leads to f being closed twice. This is guaranteed to be completely harmless (at least in 2.5+, including 3.x), but it's still a bad idea.
Putting it together, here's a working version of your code:
def findit (rootdir, find, pattern):
for folder, dirs, files in os.walk(rootdir):
print (folder)
for filename in fnmatch.filter(files,pattern):
pathname = os.path.join(folder, filename)
with open(pathname) as f:
s = f.read()
if find in s:
print(pathname)
You are using a relative filename. But your current directory does not contain the file. And you don't want to search there anyway. Use os.path.join(folder, filename) to make an absolute path.
On a mac in python 2.7 when walking through directories using os.walk my script goes through 'apps' i.e. appname.app, since those are really just directories of themselves. Well later on in processing I am hitting errors when going through them. I don't want to go through them anyways so for my purposes it would be best just to ignore those types of 'directories'.
So this is my current solution:
for root, subdirs, files in os.walk(directory, True):
for subdir in subdirs:
if '.' in subdir:
subdirs.remove(subdir)
#do more stuff
As you can see, the second for loop will run for every iteration of subdirs, which is unnecessary since the first pass removes everything I want to remove anyways.
There must be a more efficient way to do this. Any ideas?
You can do something like this (assuming you want to ignore directories containing '.'):
subdirs[:] = [d for d in subdirs if '.' not in d]
The slice assignment (rather than just subdirs = ...) is necessary because you need to modify the same list that os.walk is using, not create a new one.
Note that your original code is incorrect because you modify the list while iterating over it, which is not allowed.
Perhaps this example from the Python docs for os.walk will be helpful. It works from the bottom up (deleting).
# Delete everything reachable from the directory named in "top",
# assuming there are no symbolic links.
# CAUTION: This is dangerous! For example, if top == '/', it
# could delete all your disk files.
import os
for root, dirs, files in os.walk(top, topdown=False):
for name in files:
os.remove(os.path.join(root, name))
for name in dirs:
os.rmdir(os.path.join(root, name))
I am a bit confused about your goal, are you trying to remove a directory subtree and are encountering errors, or are you trying to walk a tree and just trying to list simple file names (excluding directory names)?
I think all that is required is to remove the directory before iterating over it:
for root, subdirs, files in os.walk(directory, True):
if '.' in subdirs:
subdirs.remove('.')
for subdir in subdirs:
#do more stuff