Recursive searching with pathlib - Python

Say I have these files
/home/user/one/two/abc.txt
/home/user/one/three/def.txt
/home/user/one/four/ghi.txt
I'm trying to find ghi.txt recursively using the pathlib module. I tried:
from pathlib import Path

p = '/home/user/'
f = Path(p).rglob('*i.txt')
but the only way I can get the filename is by using a list comprehension:
file = [str(i) for i in f]
which actually only works once. Re-running the command above returns an empty list.
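For example, a minimal sketch of the behaviour I'm describing (assuming the files above exist):

from pathlib import Path

f = Path('/home/user/').rglob('*i.txt')
print([str(i) for i in f])  # e.g. ['/home/user/one/four/ghi.txt']
print([str(i) for i in f])  # [] -- the generator was exhausted by the first pass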
I decided to learn pathlib because apparently it's what is recommended by the community, but isn't:
file = glob.glob(os.path.join(p, '**/*i.txt'), recursive=True)
much more straightforward?

You already have a solution.
I'm not sure if I read the requirements correctly, but I have posted this answer under the following assumptions:
PathListPrefix is the starting path beneath which you want to search all files. In your case it might be '/home/user'.
FileName is the name of the file that you are looking for. In your case it is 'ghi.txt'.
You are not expecting more than one match.
Something other than the pathlib module has to be tried.
As far as a straightforward solution is concerned, I am not sure about that either. However, the solution below is what I could come up with using the os module.
import os

PathListPrefix = '/home/user/'
FileName = 'ghi.txt'

def Search(StartingPath, FileNameToSearch):
    root, ext = os.path.splitext(StartingPath)
    FileName = os.path.split(root)[1]
    if FileName + ext == FileNameToSearch:
        return True
    return False

for root, dirs, files in os.walk(PathListPrefix):
    Path = root
    for eachfile in files:
        SearchPath = os.path.join(Path, eachfile)
        Found = Search(SearchPath, FileName)
        if Found:
            print(Path, FileName)
            break
This code definitely has many more lines than yours, so it is not as compact.
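For comparison, a minimal sketch of the compact pathlib route (assuming the same directory layout as in the question):

from pathlib import Path

# rglob returns a generator; build the list once and keep it
matches = [str(p) for p in Path('/home/user/').rglob('ghi.txt')]
print(matches)  # e.g. ['/home/user/one/four/ghi.txt']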

Related

In Python how to find a specific directory by name and walk through the files? [duplicate]

I feel that assigning files and folders and doing the += [item] part is a bit hackish. Any suggestions? I'm using Python 3.2.
from os import *
from os.path import *

def dir_contents(path):
    contents = listdir(path)
    files = []
    folders = []
    for i, item in enumerate(contents):
        if isfile(contents[i]):
            files += [item]
        elif isdir(contents[i]):
            folders += [item]
    return files, folders
os.walk and os.scandir are great options, however, I've been using pathlib more and more, and with pathlib you can use the .glob() or .rglob() (recursive glob) methods:
from pathlib import Path

root_directory = Path(".")
for path_object in root_directory.rglob('*'):
    if path_object.is_file():
        print(f"hi, I'm a file: {path_object}")
    elif path_object.is_dir():
        print(f"hi, I'm a dir: {path_object}")
Take a look at the os.walk function which returns the path along with the directories and files it contains. That should considerably shorten your solution.
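For instance, a minimal sketch of what os.walk yields (the starting directory '.' is an assumption):

import os

# each iteration yields one directory plus its immediate subdirectories and files
for dirpath, dirnames, filenames in os.walk('.'):
    print(dirpath, dirnames, filenames)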
For anyone looking for a solution using pathlib (python >= 3.4)
from pathlib import Path

def walk(path):
    for p in Path(path).iterdir():
        if p.is_dir():
            yield from walk(p)
            continue
        yield p.resolve()

# recursively traverse all files from current directory
for p in walk(Path('.')):
    print(p)

# the function returns a generator so if you need a list you need to build one
all_files = list(walk(Path('.')))
However, as mentioned above, this does not preserve the top-down ordering given by os.walk.
Since Python 3.4 there exists the generator method Path.rglob.
So, to process all paths under some/starting/path just do something such as
from pathlib import Path

path = Path('some/starting/path')
for subpath in path.rglob('*'):
    # do something with subpath
    print(subpath)
To get all subpaths in a list do list(path.rglob('*')).
To get just the files with sql extension, do list(path.rglob('*.sql')).
If you want to recursively iterate through all the files, including all files in the subfolders, I believe this is the best way.
import os

def get_files(path):
    for fd, subfds, fns in os.walk(path):
        for fn in fns:
            yield os.path.join(fd, fn)

## now this will print all full paths
for fn in get_files('.'):
    print(fn)
Since Python 3.4 there is a new module, pathlib. So to get all dirs and files one can do:
from pathlib import Path
dirs = [str(item) for item in Path(path).iterdir() if item.is_dir()]
files = [str(item) for item in Path(path).iterdir() if item.is_file()]
Another solution for walking a directory tree, using the pathlib module:
from pathlib import Path

for directory in Path('.').glob('**'):
    for item in directory.iterdir():
        print(item)
The pattern ** matches the current directory and all subdirectories, recursively, and the method iterdir then iterates over each directory's contents. This is useful when you need more control while traversing the directory tree.
from os import listdir
from os.path import isfile

def dir_contents(path):
    files, folders = [], []
    for p in listdir(path):
        if isfile(p): files.append(p)
        else: folders.append(p)
    return files, folders
Indeed using
items += [item]
is bad for many reasons...
The append method has been made exactly for that (appending one element to the end of a list).
You are creating a temporary list of one element just to throw it away. While raw speed should not be your first concern when using Python (otherwise you're using the wrong language), wasting speed for no reason still doesn't seem the right thing.
You are using a little asymmetry of the Python language... for list objects, writing a += b is not the same as writing a = a + b, because the former modifies the object in place, while the latter allocates a new list. The two can have different semantics if the object a is also reachable by other ways. In your specific code this doesn't seem to be the case, but it could become a problem later when someone else (or yourself in a few years, which is the same) has to modify the code. Python even has a method extend, with a less subtle syntax, that is specifically made to handle the case in which you want to modify a list object in place by adding the elements of another list at the end.
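A minimal sketch of that aliasing difference:

a = [1, 2]
b = a          # b is another name for the same list object
a += [3]       # in place: b sees the change
print(b)       # [1, 2, 3]
a = a + [4]    # allocates a new list and rebinds a; b is unaffected
print(a)       # [1, 2, 3, 4]
print(b)       # [1, 2, 3]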
Also, as others have noted, it seems that your code is trying to do what os.walk already does...
Instead of the built-in os.walk and os.path.walk, I use something derived from this piece of code I found suggested elsewhere which I had originally linked to but have replaced with inlined source:
import os
import stat

class DirectoryStatWalker:
    # a forward iterator that traverses a directory tree, and
    # returns the filename and additional file information

    def __init__(self, directory):
        self.stack = [directory]
        self.files = []
        self.index = 0

    def __getitem__(self, index):
        while 1:
            try:
                file = self.files[self.index]
                self.index = self.index + 1
            except IndexError:
                # pop next directory from stack
                self.directory = self.stack.pop()
                self.files = os.listdir(self.directory)
                self.index = 0
            else:
                # got a filename
                fullname = os.path.join(self.directory, file)
                st = os.stat(fullname)
                mode = st[stat.ST_MODE]
                if stat.S_ISDIR(mode) and not stat.S_ISLNK(mode):
                    self.stack.append(fullname)
                return fullname, st

if __name__ == '__main__':
    for file, st in DirectoryStatWalker("/usr/include"):
        print(file, st[stat.ST_SIZE])
It walks the directories recursively and is quite efficient and easy to read.
Try using the append method.
While googling for the same info, I found this question.
I am posting here the smallest, clearest code which I found at http://www.pythoncentral.io/how-to-traverse-a-directory-tree-in-python-guide-to-os-walk/ (rather than just posting the URL, in case of link rot).
The page has some useful info and also points to a few other relevant pages.
# Import the os module, for the os.walk function
import os

# Set the directory you want to start from
rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
I've not tested this extensively yet, but I believe this will expand the os.walk generator, join dirnames to all the file paths, and flatten the resulting list, to give a straight-up list of concrete files in your search path.
import itertools
import os

def find(input_path):
    return itertools.chain(
        *list(
            list(os.path.join(dirname, fname) for fname in files)
            for dirname, _, files in os.walk(input_path)
        )
    )
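A hypothetical usage, forcing the chain into a list:

all_files = list(find('.'))
print(all_files[:5])  # the first few full paths under the search root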

Find/Remove oldest file in directory

I am trying to delete the oldest file in a directory when the number of files reaches a threshold.
import os

list_of_files = os.listdir('log')
if len([name for name in list_of_files]) == 25:
    oldest_file = min(list_of_files, key=os.path.getctime)
    os.remove('log/' + oldest_file)
Problem: The issue is in the min() call. list_of_files does not contain full paths, so it is trying to find each file in the current directory and failing. How can I pass the directory name ('log') to min()?
import os

list_of_files = os.listdir('log')
full_path = ["log/{0}".format(x) for x in list_of_files]
if len(list_of_files) == 25:
    oldest_file = min(full_path, key=os.path.getctime)
    os.remove(oldest_file)
os.listdir will return relative paths - ones that are relative to your current/present working directory, i.e. the context your Python script was executed in (you can see it via os.getcwd()).
Now, the os.remove function expects a full/absolute path - shells/command line interfaces infer this and do it on your behalf, but Python doesn't. You can get one using os.path.abspath, so you can change your code to the following (and since os.listdir returns a list anyway, we don't need a list comprehension over it just to check its length):
import os

list_of_files = os.listdir('log')
if len(list_of_files) >= 25:
    oldest_file = min(list_of_files, key=os.path.getctime)
    os.remove(os.path.abspath(oldest_file))
That keeps it generic as to where the name came from - i.e., whatever was produced in the result of os.listdir - so you don't need to worry about prepending suitable file paths yourself.
I was trying to achieve the same thing and was facing a similar issue related to os.path.abspath().
I am using a Windows system with Python 3.7, and the issue is that os.path.abspath() gives a location one folder up.
Replace "Yourpath" with the path of the folder your file is in, and the code should work fine:
import os

oldest_file = sorted(["Yourpath" + f for f in os.listdir("Yourpath")], key=os.path.getctime)[0]
print(oldest_file)
os.remove(oldest_file)
print("{0} has been deleted".format(oldest_file))
There must be some cleaner method to do the same.
I'll update this when I find it.
The glob library gives full paths on the one hand and allows filtering for file patterns on the other. The solution above reported the directory itself as the oldest file, which is not what I wanted. For me, the following is suitable (a blend of glob and the solution of @Ivan Motin):
import glob
import os

oldest_file = sorted(glob.glob("/home/pi/Pictures/*.jpg"), key=os.path.getctime)[0]
Using a comprehension (sorry, couldn't resist):
oldest_file = sorted([os.path.abspath(f) for f in os.listdir('log') ], key=os.path.getctime)[0]

how can I save the output of a search for files matching *.txt to a variable?

I'm fairly new to Python. I'd like to save the text that is printed by this script in a variable. (The variable is meant to be written to a file later, if that matters.) How can I do that?
import fnmatch
import os

for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
You can store it in a variable like this:
import fnmatch
import os

for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
        my_var = file
        # do your stuff
Or you can store it in a list for later use:
import fnmatch
import os

my_match = []
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
        my_match.append(file)  # append inserts the value at the end of the list

# do stuff with my_match list
You can store it in a list:
import fnmatch
import os

matches = []
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        matches.append(file)
Both answers already provided are correct, but Python provides a nice alternative. Since iterating through an array and appending to a list is such a common pattern, the list comprehension was created as a one-stop shop for the process.
import fnmatch
import os
matches = [filename for filename in os.listdir("/Users/x/y") if fnmatch.fnmatch(filename, "*.txt")]
While NSU's answer and the others are all perfectly good, there may be a simpler way to get what you want.
Just as fnmatch tests whether a certain file matches a shell-style wildcard, glob lists all files matching a shell-style wildcard. In fact:
This is done by using the os.listdir() and fnmatch.fnmatch() functions in concert…
So, you can do this:
import glob
matches = glob.glob("/Users/x/y/*.txt")
But notice that in this case, you're going to get full pathnames like '/Users/x/y/spam.txt' rather than just 'spam.txt', which may not be what you want. Often, it's easier to keep the full pathnames around and os.path.basename them when you want to display them, than to keep just the base names around and os.path.join them when you want to open them… but "often" isn't "always".
Also notice that I had to manually paste the "/Users/x/y/" and "*.txt" together into a single string, the way you would at the command line. That's fine here, but if, say, the first one came from a variable, rather than hardcoded into the source, you'd have to use os.path.join(basepath, "*.txt"), which isn't quite as nice.
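A minimal sketch of that variant (basepath is a hypothetical variable standing in for wherever the directory comes from):

import glob
import os

basepath = "/Users/x/y"  # hypothetical: could come from config or user input
matches = glob.glob(os.path.join(basepath, "*.txt"))
display_names = [os.path.basename(m) for m in matches]  # basenames, for display only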
By the way, if you're using Python 3.4 or later, you can get the same thing out of the higher-level pathlib library:
import pathlib
matches = list(pathlib.Path("/Users/x/y/").glob("*.txt"))
Maybe defining a utility function is the right path to follow...
def list_ext_in_dir(e, d):
    """e=extension, d=directory => list of matching filenames.
    If the directory d cannot be listed, returns None."""
    from fnmatch import fnmatch
    from os import listdir
    try:
        dirlist = listdir(d)
    except OSError:
        return None
    return [fname for fname in dirlist if fnmatch(fname, e)]
I have put the listdir call inside a try/except clause to catch the possibility that we cannot list the directory (non-existent, no read permission, etc.). The treatment of errors is a bit simplistic, but...
The list of matching filenames is built using a so-called list comprehension, which is something you should investigate as soon as possible if you're going to use Python for your programs.
To close my post, a usage example:
l_txtfiles = list_ext_in_dir('*.txt', '/Users/x/y')

How to ignore hidden files using os.listdir()?

My python script executes an os.listdir(path) where the path is a queue containing archives that I need to treat one by one.
The problem is that I'm getting the list in an array and then I just do a simple array.pop(0). It was working fine until I put the project in subversion. Now I get the .svn folder in my array and of course it makes my application crash.
So here is my question: is there a function that ignores hidden files when executing an os.listdir() and if not what would be the best way?
You can write one yourself:
import os

def listdir_nohidden(path):
    for f in os.listdir(path):
        if not f.startswith('.'):
            yield f
Or you can use a glob:
import glob
import os

def listdir_nohidden(path):
    return glob.glob(os.path.join(path, '*'))
Either of these will ignore all filenames beginning with '.'.
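A hypothetical usage of either version, printing the visible entries:

for f in listdir_nohidden('.'):
    print(f)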
This is an old question, but seems like it is missing the obvious answer of using list comprehension, so I'm adding it here for completeness:
[f for f in os.listdir(path) if not f.startswith('.')]
As a side note, the docs state listdir will return results in 'arbitrary order' but a common use case is to have them sorted alphabetically. If you want the directory contents alphabetically sorted without regards to capitalization, you can use:
sorted((f for f in os.listdir() if not f.startswith(".")), key=str.lower)
(Edited to use key=str.lower instead of a lambda)
On Windows, Linux and OS X:
import os

if os.name == 'nt':
    import win32api, win32con

def folder_is_hidden(p):
    if os.name == 'nt':
        attribute = win32api.GetFileAttributes(p)
        return attribute & (win32con.FILE_ATTRIBUTE_HIDDEN | win32con.FILE_ATTRIBUTE_SYSTEM)
    else:
        return p.startswith('.')  # linux-osx
Joshmaker has the right solution to your question.
How to ignore hidden files using os.listdir()?
In Python 3 however, it is recommended to use pathlib instead of os.
from pathlib import Path

visible_files = [
    file for file in Path(".").iterdir() if not file.name.startswith(".")
]
glob:
>>> import glob
>>> glob.glob('*')
(glob claims to use listdir and fnmatch under the hood, but it also checks for a leading '.', not by using fnmatch.)
I think it is too much work to go through all of the items in a loop. I would prefer something simpler, like this:
lst = os.listdir(path)
if '.DS_Store' in lst:
    lst.remove('.DS_Store')
If the directory contains more than one hidden file, then this can help:
import os

all_files = os.popen('ls -1').read()
lst = all_files.split('\n')
For platform independence, as @Josh mentioned, glob works well:
import glob
glob.glob('*')
import os

filenames = (f.name for f in os.scandir() if not f.name.startswith('.'))
You can just use a simple for loop that excludes any file or directory whose name starts with ".".
Code for professionals:
import os

directory_things = [i for i in os.listdir() if i[0] != "."]  # exclude everything whose name starts with "."
Code for noobs
items_in_directory = os.listdir()
final_items_in_directory = []
for i in items_in_directory:
    if i[0] != ".":  # if the item's name doesn't start with "."
        final_items_in_directory.append(i)

Delete multiple files matching a pattern

I have made an online gallery using Python and Django. I've just started to add editing functionality, starting with a rotation. I use sorl.thumbnail to auto-generate thumbnails on demand.
When I edit the original file, I need to clean up all the thumbnails so new ones are generated. There are three or four of them per image (I have different ones for different occasions).
I could hard-code in the file variants... But that's messy, and if I change the way I do things, I'll need to revisit the code.
Ideally I'd like to do a regex-delete. In regex terms, all my originals are named like so:
^(?P<photo_id>\d+)\.jpg$
So I want to delete:
^(?P<photo_id>\d+)[^\d].*jpg$
(Where I replace photo_id with the ID I want to clean.)
Using the glob module:
import glob, os

for f in glob.glob("P*.jpg"):
    os.remove(f)
Alternatively, using pathlib:
from pathlib import Path

for p in Path(".").glob("P*.jpg"):
    p.unlink()
Try something like this:
import os, re

def purge(dir, pattern):
    for f in os.listdir(dir):
        if re.search(pattern, f):
            os.remove(os.path.join(dir, f))
Then you would pass the directory containing the files and the pattern you wish to match.
If you need recursion into several subdirectories, you can use this method:
import os, re, os.path

pattern = r"^(?P<photo_id>\d+)[^\d].*jpg$"
mypath = "Photos"
for root, dirs, files in os.walk(mypath):
    for file in filter(lambda x: re.match(pattern, x), files):
        os.remove(os.path.join(root, file))
You can safely remove subdirectories on the fly from dirs, which contains the list of the subdirectories to visit at each node.
Note that if you are in a directory, you can also get the files matching a simple pattern expression with glob.glob(pattern). In that case you would have to subtract the set of files to keep from the whole set, so the code above is more efficient.
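A minimal sketch of that subtraction idea (both glob patterns are hypothetical):

import glob
import os

all_jpgs = set(glob.glob("*.jpg"))
to_keep = set(glob.glob("123.jpg"))  # hypothetical: the originals to keep
for f in all_jpgs - to_keep:
    os.remove(f)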
How about this?
import glob, os, multiprocessing
p = multiprocessing.Pool(4)
p.map(os.remove, glob.glob("P*.jpg"))
Mind you this does not do recursion and uses wildcards (not regex).
UPDATE
In Python 3 the map() function returns an iterator, not a list. This is useful since you will probably want to do some kind of processing on the items anyway, and an iterator will always be more memory-efficient to that end.
If, however, a list is what you really need, just do this:
...
list(p.map(os.remove, glob.glob("P*.jpg")))
I agree it's not the most functional way, but it's concise and does the job.
It's not clear to me that you actually want to do any named-group matching -- in the use you describe, the photoid is an input to the deletion function, and named groups' purpose is "output", i.e., extracting certain substrings from the matched string (and accessing them by name in the match object). So, I would recommend a simpler approach:
import re
import os

def delete_thumbnails(photoid, photodirroot):
    matcher = re.compile(r'^%s\d+\D.*jpg$' % photoid)
    numdeleted = 0
    for rootdir, subdirs, filenames in os.walk(photodirroot):
        for name in filenames:
            if not matcher.match(name):
                continue
            path = os.path.join(rootdir, name)
            os.remove(path)
            numdeleted += 1
    return "Deleted %d thumbnails for %r" % (numdeleted, photoid)
You can pass the photoid as a normal string, or as a RE pattern piece if you need to remove several matchable IDs at once (e.g., r'abc[def]' to remove abcd, abce, and abcf in a single call) -- that's the reason I'm inserting it literally in the RE pattern, rather than inserting the string re.escape(photoid) as would be normal practice. Certain parts, such as counting the number of deletions and returning an informative message at the end, are obviously frills which you should remove if they give you no added value in your use case.
Others, such as the "if not ...: continue" pattern, are highly recommended practice in Python (flat is better than nested: bailing out to the next leg of the loop as soon as you determine there is nothing to do on this one is better than nesting the actions to be done within an if), although of course other arrangements of the code would work too.
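A hypothetical call, removing the thumbnails for abcd, abce and abcf in one pass (the photo directory is an assumption):

print(delete_thumbnails(r'abc[def]', '/path/to/photos'))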
My recommendation:
import os
import re

def purge(dir, pattern, inclusive=True):
    regexObj = re.compile(pattern)
    for root, dirs, files in os.walk(dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if bool(regexObj.search(path)) == bool(inclusive):
                os.remove(path)
        for name in dirs:
            path = os.path.join(root, name)
            if len(os.listdir(path)) == 0:
                os.rmdir(path)
This will recursively remove every file that matches the pattern when inclusive is true, and every file that doesn't match when inclusive is false. It will then remove any empty folders from the directory tree.
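A hypothetical call, deleting every .jpg under Photos and then pruning empty folders (the directory name is an assumption):

purge("Photos", r"\.jpg$")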
import os, sys, glob, re

def main():
    mypath = "<Path to Root Folder to work within>"
    for root, dirs, files in os.walk(mypath):
        for file in files:
            p = os.path.join(root, file)
            if os.path.isfile(p):
                if p[-4:] == ".jpg":  # or any pattern you want
                    os.remove(p)
I find Popen(["rm " + file_name + "*.ext"], shell=True, stdout=PIPE).communicate() to be a much simpler solution to this problem. Although this is prone to injection attacks, I don't see any issues if your program is using this internally.
import os, re

def recursive_purge(dir, pattern):
    for f in os.listdir(dir):
        if os.path.isdir(os.path.join(dir, f)):
            recursive_purge(os.path.join(dir, f), pattern)
        elif re.search(pattern, os.path.join(dir, f)):
            os.remove(os.path.join(dir, f))
