Get a file tree: equivalent of Unix “find” command for Python

I’m looking (in Python 3) for a cross-platform way to get a list of all the file and folder paths within a folder, similar to what I would get with pexpect.run("find /media/elon/SuperDrive/*").splitlines() on Linux. Is there already a function that does this, say, somewhere in shutil or glob? I could write my own, but I figured something pre-built might do it faster than my code could.

The walk function in the standard library's os module does this nicely.
Help on function walk in module os:

walk(top, topdown=True, onerror=None, followlinks=False)
    Directory tree generator.

    For each directory in the directory tree rooted at top (including top
    itself, but excluding '.' and '..'), yields a 3-tuple

        dirpath, dirnames, filenames

    dirpath is a string, the path to the directory. dirnames is a list of
    the names of the subdirectories in dirpath (excluding '.' and '..').
    filenames is a list of the names of the non-directory files in dirpath.
    Note that the names in the lists are just names, with no path components.
    To get a full path (which begins with top) to a file or directory in
    dirpath, do os.path.join(dirpath, name).

    If optional arg 'topdown' is true or not specified, the triple for a
    directory is generated before the triples for any of its subdirectories
    (directories are generated top down). If topdown is false, the triple
    for a directory is generated after the triples for all of its
    subdirectories (directories are generated bottom up).

    When topdown is true, the caller can modify the dirnames list in-place
    (e.g., via del or slice assignment), and walk will only recurse into the
    subdirectories whose names remain in dirnames; this can be used to prune the
    search, or to impose a specific order of visiting. Modifying dirnames when
    topdown is false has no effect on the behavior of os.walk(), since the
    directories in dirnames have already been generated by the time dirnames
    itself is generated. No matter the value of topdown, the list of
    subdirectories is retrieved before the tuples for the directory and its
    subdirectories are generated.

    By default errors from the os.scandir() call are ignored. If
    optional arg 'onerror' is specified, it should be a function; it
    will be called with one argument, an OSError instance. It can
    report the error to continue with the walk, or raise the exception
    to abort the walk. Note that the filename is available as the
    filename attribute of the exception object.

    By default, os.walk does not follow symbolic links to subdirectories on
    systems that support them. In order to get this functionality, set the
    optional argument 'followlinks' to true.

    Caution: if you pass a relative pathname for top, don't change the
    current working directory between resumptions of walk. walk never
    changes the current directory, and assumes that the client doesn't
    either.

    Example:

    import os
    from os.path import join, getsize
    for root, dirs, files in os.walk('python/Lib/email'):
        print(root, "consumes", end="")
        print(sum(getsize(join(root, name)) for name in files), end="")
        print("bytes in", len(files), "non-directory files")
        if 'CVS' in dirs:
            dirs.remove('CVS')  # don't visit CVS directories
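To connect the help text back to the question, here is a minimal sketch of a find-like listing built on os.walk. The function name find_paths is made up for illustration, and the order of the results follows os.walk's traversal, which is not guaranteed to match the order that find prints.

```python
import os


def find_paths(top):
    """Return top plus every directory and file path beneath it."""
    paths = [top]
    for dirpath, dirnames, filenames in os.walk(top):
        # The names in dirnames/filenames carry no path components,
        # so join each one onto dirpath to get a usable path.
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))
    return paths
```

Calling find_paths on a folder returns a plain list of strings, roughly what splitlines() on the find output would give, except that the starting folder itself is included as the first entry.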

Related

os.walk but with directories on top?

I have some simple code to print out the structure of a directory.
My example directory ABC contains a subdirectory A containing A.txt, a subdirectory Z containing Z.txt, and a file info.txt. In real use, this will be a big collection of many files and nested directories.
    import os

    topdir = 'ABC/'
    for dirpath, dirnames, files in os.walk(topdir):
        print(os.path.join(dirpath))
        for name in files:
            print(os.path.join(dirpath, name))
The output is:
ABC/
ABC/info.txt
ABC/A
ABC/A/A.txt
ABC/Z
ABC/Z/Z.txt
How can I make it so directories are processed/printed on the top?
I want the output to replicate what I see in Windows Explorer, which displays directories first, and files after.
The output I want:
ABC/
ABC/A
ABC/A/A.txt
ABC/Z
ABC/Z/Z.txt
ABC/info.txt

Without storing all the files in a list and sorting that list in one way or the other, you could make a recursive function and first recurse to the next level of the directory structure before printing the files on the current level:
    def print_dirs(directories):
        try:
            dirpath, dirnames, files = next(directories)
            print(dirpath)                   # print current path; no need for join here
            for _ in dirnames:               # once for each sub-directory...
                print_dirs(directories)      # ... recursively call generator
            for name in files:               # now, print files in current directory
                print(os.path.join(dirpath, name))
        except StopIteration:
            pass

    print_dirs(os.walk(topdir))
The same could also be done with an explicit stack, but I think this version is a little clearer. It does still keep some state on the call stack, but only the entries for the directories on the current path (one per level of nesting), not every file in the tree.
Edit: An earlier version printed whatever directory came next from the generator, even when that was not a sub-directory but a sibling (or "uncle" or whatever). The for _ in dirnames loop fixes that by making the recursive call once for each of the subdirectories, if any. The directory itself does not have to be passed as a parameter, because it comes from the generator.
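One caveat with the generator-based version: it assumes os.walk yields exactly one tuple per name in dirnames. That does not hold for symlinked directories (listed in dirnames but not entered unless followlinks is true) or for directories that cannot be read. As an alternative sketch, not the answer's method, the same directories-first order can be produced with os.scandir directly. The name iter_dirs_first and the choice to sort entries by name are assumptions made for this example.

```python
import os


def iter_dirs_first(top):
    """Yield top, then each subdirectory's subtree, then top's own files."""
    yield top
    with os.scandir(top) as it:
        entries = sorted(it, key=lambda entry: entry.name)
    files = []
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Descend immediately so the whole subtree comes before the files.
            yield from iter_dirs_first(entry.path)
        else:
            # Hold back non-directories (including symlinks) until the end.
            files.append(entry.path)
    yield from files
```

Usage would be a loop such as for path in iter_dirs_first(topdir): print(path). Only the file names of the directories on the current path are held in memory at any time.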

How to rename sub directory and file names recursively in script python3?

I have a nested directory tree. Both subdirectory and file names contain illegal characters. I have a function that cleans up the names, for example by replacing spaces with underscores. There must be an easier way, but I couldn't find a way to rename both folders and files, so I want to rename the folders first.
    for path, subdirs, files in os.walk(root):
        for name in subdirs:
            new_name = clean_names(name)
            name = os.path.join(path, name)
            new_name = os.path.join(path, new_name)
            os.chdir(path)
            os.rename(name, new_name)
When I check my real folder and its contents, I see that only the first subfolder name is corrected. I can see the reason: os.chdir(path) changes the cwd, and it is not changed back before the for loop moves on to the second path. I thought I could change the cwd back after the os.rename, but I am sure there is a more elegant way to do this. If I remove the os.chdir line, I get a FileNotFoundError.
I see that renaming subdirectories has been asked about before, but those answers are for the command line.

You should use os.walk(root, topdown=False) instead; otherwise once the top folder gets renamed, os.walk won't have access to the subfolders because it can no longer find their parent folders.
Excerpt from the documentation:
If optional argument topdown is True or not specified, the triple for
a directory is generated before the triples for any of its
subdirectories (directories are generated top-down). If topdown is
False, the triple for a directory is generated after the triples for
all of its subdirectories (directories are generated bottom-up). No
matter the value of topdown, the list of subdirectories is retrieved
before the tuples for the directory and its subdirectories are
generated.
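Note that you do not need to call os.chdir at all: the paths passed to os.rename are built with os.path.join(path, name), so they already include the directory os.walk is visiting. If root is a relative path, changing the working directory in the middle of the walk is exactly what breaks them.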
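Putting the two points together (walk bottom-up, no os.chdir), a minimal sketch might look like the following. clean_name here is a hypothetical stand-in for the asker's clean_names function, which is not shown in the question, and rename_tree is a name chosen for this example. The sketch does not handle the case where the cleaned name already exists.

```python
import os


def clean_name(name):
    # Hypothetical stand-in for the asker's clean_names():
    # replace spaces with underscores.
    return name.replace(" ", "_")


def rename_tree(root):
    """Rename files and subdirectories under root, deepest entries first."""
    # topdown=False yields a directory only after all of its subdirectories,
    # so a parent is renamed only once nothing below it is left to visit.
    for path, subdirs, files in os.walk(root, topdown=False):
        for name in files + subdirs:
            new_name = clean_name(name)
            if new_name != name:
                os.rename(os.path.join(path, name),
                          os.path.join(path, new_name))
```

The root directory itself is left untouched, since it never appears in any subdirs list.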

Stop os.walk going down further at a specific directory name

I need to stop os.walk from going down further if the path contains both "release" and "arm-linux". I have a bunch of these at different levels of the directory tree, so I can't simply dictate the level. So far I have the following, and it unnecessarily dives past the 'arm-linux' directories.
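    def main(argv):
        for root, dirs, files in os.walk("."):
            path = root.split(os.sep)
            # Note: this parses as "release" and ("arm-linux" in path),
            # so only "arm-linux" is actually tested.
            if "release" and "arm-linux" in path:
                print(os.path.abspath(root))
                getSharedLib(argv)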
[update] This is my solution:

    def main(argv):
        for root, dirs, files in os.walk("."):
            path = root.split(os.sep)
            if "release" in path and "arm-linux" in path:
                print(os.path.abspath(root))
                getSharedLib(argv)
                del dirs[:]

From the documentation
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames;
Note that topdown is True by default.
Edit
To delete all the elements of dirs, you will need something like del dirs[:]. That will delete all the elements of the list object that is referred to as dirs in your code, but is referred to by another name in the os.walk code.
Just using del dirs will stop dirs in your code from referring to the list, but won't do anything to the os.walk reference. Similarly dirs = [] will replace what dirs in your code refers to, but won't affect os.walk code.
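The difference between emptying the list and rebinding the name can be checked with a small experiment. This is only an illustrative sketch: visited_dirs and its in_place flag are invented for the demonstration and are not part of os.walk.

```python
import os


def visited_dirs(top, stop_name, in_place=True):
    """Return the directories os.walk visits, trying to stop below stop_name."""
    visited = []
    for root, dirs, files in os.walk(top):
        visited.append(root)
        if os.path.basename(root) == stop_name:
            if in_place:
                # Empties the very list object that os.walk is holding.
                del dirs[:]
            else:
                # Only rebinds the local name; os.walk's list is unchanged,
                # so the walk still descends.
                dirs = []
    return visited
```

With in_place=True, nothing below a directory named stop_name shows up in the result; with in_place=False, the subdirectories are still visited.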

How to skip directories in os walk Python 2.7

I have written an image carving script to assist with my work. The tool carves images by a specified extension and compares them to a hash database.
The tool is used to search across mounted drives, some of which have operating systems installed.
The problem I am having is that when a drive is mounted with an OS, the tool searches across the 'All Users' directory, and so it includes images from my local disk.
I can't figure out how to skip the 'All Users' directory and just stick to the mounted drive.
My section for os.walk is as follows:
    for path, subdirs, files in os.walk(root):
        for name in files:
            if re.match(pattern, name.lower()):
                appendfile.write(os.path.join(path, name))
                appendfile.write('\n')
                log(name)
                i = i + 1
Any help is much appreciated

Assuming All Users is the name of the directory, you can remove the directory from your subdirs list, so that os.walk() does not iterate over it.
Example -
    for path, subdirs, files in os.walk(root):
        if 'All Users' in subdirs:
            subdirs.remove('All Users')
        for name in files:
            if re.match(pattern, name.lower()):
                appendfile.write(os.path.join(path, name))
                appendfile.write('\n')
                log(name)
                i = i + 1
If you only want to skip All Users inside a particular parent directory, you can add a check for that parent to the if condition above.
From os.walk documentation -
os.walk(top, topdown=True, onerror=None, followlinks=False)
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False is ineffective, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.
topdown is normally true, unless specified otherwise.

If you have more than one directory to remove, you can use slice assignment to drop all the excluded directories from subdirs at once:

    excl_dirs = {'All Users', 'some other dir'}
    for path, dirnames, files in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in excl_dirs]
        ...
as the documentation states:
When topdown is True, the caller can modify the dirnames list in-place
(perhaps using del or slice assignment), and walk() will only recurse
into the subdirectories whose names remain in dirnames; ..
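For reference, the exclusion filter and the file matching from the question can be combined into one self-contained function. This is a sketch under assumptions: find_matching_files is a made-up name, and it returns the matches as a list instead of writing to the asker's appendfile and log, which are not shown in full.

```python
import os
import re


def find_matching_files(root, pattern, excluded_dirs):
    """Return paths under root whose lowercased file name matches pattern,
    without descending into any directory named in excluded_dirs."""
    matches = []
    for path, subdirs, files in os.walk(root):
        # Slice assignment edits the list os.walk holds, so the excluded
        # directories are never entered.
        subdirs[:] = [d for d in subdirs if d not in excluded_dirs]
        for name in files:
            if re.match(pattern, name.lower()):
                matches.append(os.path.join(path, name))
    return matches
```

A call could look like find_matching_files(root, r'.*\.jpg$', {'All Users'}), where the pattern is just an example.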

Limit the number of nested directories traversed by os.walk

I'm using Python to parse a WordPress site downloaded via wget. All the HTML files are nested inside a complicated folder structure (thanks to WordPress and its long URLs), like site_dump/2010/03/11/post-title/index.html.
However, within the post-title directory there are other directories for the feed and for Google News-esque number-based indexes:
site_dump/2010/03/11/post-title/index.html # I want this
site_dump/2010/03/11/post-title/feed/index.html # Not these
site_dump/2010/03/11/post-title/115232/site.com/2010/03/11/post-title/index.html
I only want to access the index.html files that are at the 5th nested level (site_dump/2010/03/11/post-title/index.html), and not beyond. Right now I split the root variable by a slash (/) in the os.walk loop and only deal with the file if it is inside 5 levels of folders:
    import os

    for root, dirs, files in os.walk('site_dump'):
        nested_levels = root.split('/')
        if len(nested_levels) == 5:
            print(nested_levels)  # Eventually do stuff with the file here
However, this seems kind of inefficient, since os.walk is still traversing those really deep folders. Is there a way to limit how deep os.walk goes when traversing a directory tree?

You can modify dirs in place to prevent further traversal into the directory structure.
    for root, dirs, files in os.walk('site_dump'):
        nested_levels = root.split('/')
        if len(nested_levels) == 5:
            del dirs[:]
            # Eventually do stuff with the file here
del dirs[:] will remove the contents of the list, rather than replace dirs with a reference to a new list. When doing this it is important to modify the list in-place.
From the docs, with topdown referring to an optional parameter for os.walk that you omitted and defaults to True:
When topdown is True, the caller can modify the dirnames list in-place
(perhaps using del or slice assignment), and walk() will only recurse
into the subdirectories whose names remain in dirnames; this can be
used to prune the search, impose a specific order of visiting, or even
to inform walk() about directories the caller creates or renames
before it resumes walk() again. Modifying dirnames when topdown is
False is ineffective, because in bottom-up mode the directories in
dirnames are generated before dirpath itself is generated.
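Splitting on '/' only works where that is the path separator. A more portable sketch of the same pruning idea measures depth relative to the starting directory. The wrapper name walk_max_depth and its max_depth parameter are assumptions for this example, not an existing os.walk option.

```python
import os


def walk_max_depth(top, max_depth):
    """Like os.walk(top), but descend at most max_depth levels below top."""
    for root, dirs, files in os.walk(top):
        rel = os.path.relpath(root, top)
        # top itself is depth 0; each path component below it adds one level.
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        yield root, dirs, files
        if depth >= max_depth:
            # Prune in place so os.walk does not go any deeper from here.
            del dirs[:]
```

For the layout in the question, walk_max_depth('site_dump', 4) would still yield the post-title directories (four levels below site_dump) but would not enter feed or the numbered directories beneath them.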
