Delete multiple files matching a pattern - python

I have made an online gallery using Python and Django. I've just started to add editing functionality, starting with a rotation. I use sorl.thumbnail to auto-generate thumbnails on demand.
When I edit the original file, I need to clean up all the thumbnails so new ones are generated. There are three or four of them per image (I have different ones for different occasions).
I could hard-code the file variants, but that's messy, and if I change the way I do things I'll need to revisit the code.
Ideally I'd like to do a regex-delete. In regex terms, all my originals are named like so:
^(?P<photo_id>\d+)\.jpg$
So I want to delete:
^(?P<photo_id>\d+)[^\d].*jpg$
(Where I replace photo_id with the ID I want to clean.)

Using the glob module:
import glob, os
for f in glob.glob("P*.jpg"):
    os.remove(f)
Alternatively, using pathlib:
from pathlib import Path
for p in Path(".").glob("P*.jpg"):
    p.unlink()

Try something like this:
import os, re
def purge(dir, pattern):
    for f in os.listdir(dir):
        if re.search(pattern, f):
            os.remove(os.path.join(dir, f))
Then you would pass the directory containing the files and the pattern you wish to match.
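For the gallery case in the question, a call could look roughly like the sketch below. It assumes the thumbnails sit in the same directory as the original and start with its numeric ID; purge_thumbnails and the file names in the test are made up for illustration. Note that the pattern from the question would also match 12.jpg itself, because "." is a non-digit, so this sketch excludes a dot directly after the ID.

```python
import os
import re

def purge_thumbnails(directory, photo_id):
    """Delete files named <photo_id><non-digit, non-dot>....jpg; return the names removed."""
    # re.escape guards against IDs that contain regex metacharacters.
    # [^\d.] keeps "12.jpg" (the original) and "123_x.jpg" (another photo) safe.
    pattern = re.compile(r"^%s[^\d.].*\.jpg$" % re.escape(str(photo_id)))
    removed = []
    for name in sorted(os.listdir(directory)):
        if pattern.match(name):
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed
```

Returning the removed names makes it easy to log or check what was deleted before wiring this into an edit view.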

If you need recursion into several subdirectories, you can use this method:
import os, re
# raw string, so that \d reaches the regex engine unchanged
pattern = r"^(?P<photo_id>\d+)[^\d].*jpg$"
mypath = "Photos"
for root, dirs, files in os.walk(mypath):
    for file in filter(lambda x: re.match(pattern, x), files):
        os.remove(os.path.join(root, file))
You can safely remove subdirectories on the fly from dirs, which contains the list of the subdirectories to visit at each node.
Note that if you are in a directory, you can also get files corresponding to a simple pattern expression with glob.glob(pattern). In this case you would have to subtract the set of files to keep from the whole set, so the code above is more efficient.

How about this?
import glob, os, multiprocessing
p = multiprocessing.Pool(4)
p.map(os.remove, glob.glob("P*.jpg"))
Mind you this does not do recursion and uses wildcards (not regex).
UPDATE
In Python 3 the built-in map() function returns an iterator, not a list (Pool.map, as used above, still returns a list). This is useful since you will probably want to do some kind of processing on the items anyway, and an iterator is generally more memory-efficient to that end.
If however, a list is what you really need, just do this:
...
list(p.map(os.remove, glob.glob("P*.jpg")))
I agree it's not the most functional way, but it's concise and does the job.

It's not clear to me that you actually want to do any named-group matching -- in the use you describe, the photoid is an input to the deletion function, and named groups' purpose is "output", i.e., extracting certain substrings from the matched string (and accessing them by name in the match object). So, I would recommend a simpler approach:
import re
import os
def delete_thumbnails(photoid, photodirroot):
    # Same shape as the question's pattern: the ID, one non-digit, then anything ending in jpg.
    # Caution: \D also matches '.', so '<photoid>.jpg' itself matches this pattern.
    matcher = re.compile(r'^%s\D.*jpg$' % photoid)
    numdeleted = 0
    for rootdir, subdirs, filenames in os.walk(photodirroot):
        for name in filenames:
            if not matcher.match(name):
                continue
            path = os.path.join(rootdir, name)
            os.remove(path)
            numdeleted += 1
    return "Deleted %d thumbnails for %r" % (numdeleted, photoid)
You can pass the photoid as a normal string, or as a RE pattern piece if you need to remove several matchable IDs at once (e.g., r'abc[def]' to remove abcd, abce, and abcf in a single call) -- that's the reason I'm inserting it literally in the RE pattern, rather than inserting the string re.escape(photoid) as would be normal practice. Certain parts such as counting the number of deletions and returning an informative message at the end are obviously frills which you should remove if they give you no added value in your use case.
Others, such as the "if not ... // continue" pattern, are highly recommended practice in Python (flat is better than nested: bailing out to the next leg of the loop as soon as you determine there is nothing to do on this one is better than nesting the actions to be done within an if), although of course other arrangements of the code would work too.

My recommendation:
import os, re
def purge(dir, pattern, inclusive=True):
    regexObj = re.compile(pattern)
    for root, dirs, files in os.walk(dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if bool(regexObj.search(path)) == bool(inclusive):
                os.remove(path)
        for name in dirs:
            path = os.path.join(root, name)
            if len(os.listdir(path)) == 0:
                os.rmdir(path)
This will recursively remove every file that matches the pattern by default, and every file that doesn't if inclusive is false. It will then remove any empty folders from the directory tree.

import os, sys, glob, re
def main():
    mypath = "<Path to Root Folder to work within>"
    for root, dirs, files in os.walk(mypath):
        for file in files:
            p = os.path.join(root, file)
            if os.path.isfile(p):
                if p[-4:] == ".jpg": #Or any pattern you want
                    os.remove(p)

I find Popen(["rm " + file_name + "*.ext"], shell=True, stdout=PIPE).communicate() to be a much simpler solution to this problem. Although this is prone to injection attacks, I don't see any issues if your program is using this internally.

def recursive_purge(dir, pattern):
    for f in os.listdir(dir):
        if os.path.isdir(os.path.join(dir, f)):
            recursive_purge(os.path.join(dir, f), pattern)
        elif re.search(pattern, os.path.join(dir, f)):
            os.remove(os.path.join(dir, f))

Related

RegEx to find specific file path

I am trying to find the existence of a file testing.txt
The first file exists in: sub/hbc_cube/college/
The second file exists in: sub/hbc/college
However, when searching for where the file exists, I CANNOT assume the string 'hbc' because the name may be different depending on the user. So I am trying to find a way to
PASS if the path is
sub/*_cube/college/
FAIL if the path is
sub/*/college
But I cannot use a glob character (*) because the (*) will count _cube as failing. I am trying to figure out a regular expression that will only detect a string and not a string with an underscore (hbc_cube for example).
I have tried using the python regex dictionary but I have not been able to figure out the correct regex to use
file_list = lookupfiles(['testing.txt'], dirlist = ['sub/'])
for file in file_list:
    if str(file).find('_cube/college/'): #hbc_cube/college
        print("pass")
    if str(file).find('*/college/'): #hbc/college
        print("fail")
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
The glob module is your friend. You don't even need to match against multiple directories, glob will do it for you:
from glob import glob
testfiles = glob("sub/*/testing.txt")
if len(testfiles) > 0 and all("_cube/" in path for path in testfiles):
    print("Pass")
else:
    print("Fail")
In case it is not obvious, the test all("_cube/" in path for path in testfiles) will take care of this requirement:
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
If some of the paths that matched do not contain _cube, the test fails. Since you want to know about files that cause the test to fail, you cannot search solely for files in a path containing *_cube -- you must retrieve both good and bad paths, and inspect them as shown.
Of course you can shorten the above code, or generalize it to construct the globbed path by combining options from a list of folders and a list of files, etc., depending on the particulars of your case.
Note that there are "full regular expressions", provided by the re module, and the simpler "globs" used by the glob module. If you go check the documentation, don't confuse them.
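To make the difference concrete, here is a small sketch that applies a glob-style pattern and a regular expression to the same directory names. The names are taken from the question; the choice of `[^_]+` as "a name without an underscore" is an assumption about what the asker wants.

```python
import fnmatch
import re

names = ["hbc", "hbc_cube", "xyz_cube"]

# Glob syntax: "*" means "any run of characters".
glob_hits = fnmatch.filter(names, "*_cube")

# Regex syntax: "*" and "+" are quantifiers; [^_]+ means "one or more non-underscores".
regex_hits = [n for n in names if re.fullmatch(r"[^_]+", n)]

print(glob_hits)   # names ending in _cube
print(regex_hits)  # names containing no underscore
```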
Use pathlib to parse your path. From the path object, get the parent, which discards the /college part, and check whether the resulting path string ends with _cube:
from pathlib import Path
file_list = lookupfiles(['testing.txt'], dirlist = ['sub/'])
for file in file_list:
    path = Path(file)
    if str(path.parent).endswith('_cube'):
        print('pass')
    else:
        print('Fail')
Edit:
If the file variable in the for loop contains the file name (sub/*_cube/college/testing.txt), just call parent twice on the path: path.parent.parent.
Another approach would be to filter the files inside lookupfiles(), if you have access to that function and can edit it.
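As a rough sketch of the parent.parent variant, assuming each entry is a full path that ends in college/testing.txt (in_cube_dir is a hypothetical helper name, and PurePosixPath is used only so the example behaves the same on every platform):

```python
from pathlib import PurePosixPath

def in_cube_dir(file_path):
    """True if the directory two levels above the file has a name ending in _cube."""
    # .parent drops the file name, the second .parent drops "college".
    return PurePosixPath(file_path).parent.parent.name.endswith("_cube")

print(in_cube_dir("sub/hbc_cube/college/testing.txt"))
print(in_cube_dir("sub/hbc/college/testing.txt"))
```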
The os module is well suited for this:
import os
# This assumes your current working directory has sub in it
for root, dirs, files in os.walk('sub'):
    for file in files:
        if file=='testing.txt':
            # print the file and the directory it's in
            print(os.path.join(root, file))
os.walk will return a three-element tuple as it iterates: a root dir, directories in that current folder, and files in that current folder. To print the directory, you combine the root (cwd) and the file name.
For example, on my machine:
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith('ipynb'):
            os.path.join(root, file)
# returns
/Users/mm92400/Salesforce_Repos/DataExplorationClustersAndTime.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled1.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationExploratory.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled3.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled4.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled2.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationClusterAnalysis.ipynb

Python: Removing Leading Zeros in the Filename of Every File in a Folder and Subfolders

I would think that this is a basic task but most of my searches refer to adding zeros. I just want to strip the leading zeros from every file. I have files like "01.jpg, 02.jpg... 09.jpg".
Chapter 9 of Automate the Boring Stuff talks specifically about using python for this task but does not go over any examples of this.
import os, sys
for filename in os.listdir(os.path.dirname(os.path.abspath(__file__))):
Well that is the start of what I have so far.
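import os
folder = "path/to/folder"  # placeholder: the folder that holds the files
for filename in os.listdir(folder):
    # os.listdir returns bare names, so join them with the folder before renaming
    os.rename(os.path.join(folder, filename), os.path.join(folder, filename[1:]))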
It removes the first char from all file names (in your case, the '0' that you want to get rid of). I recommend using it with caution because it's irreversible.
Use this to get all files under your root folder:
def list_files(path):  # function name chosen here for illustration
    files = []
    for root, directories, file_names in os.walk(path):
        for filename in file_names:
            files.append((filename, root))
    return files
then simply iterate over the files and rename:
for f, p in list_files(path):
    if f.startswith('0'):
        os.rename(os.path.join(p, f), os.path.join(p, f[1:]))
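Both snippets above strip exactly one character. A slightly more defensive sketch is shown below: it removes every leading zero from the part of the name before the extension, walks subfolders, and skips a rename when the target already exists. The function name strip_leading_zeros and the skip-on-collision policy are assumptions, not something the answers specify.

```python
import os

def strip_leading_zeros(root_dir):
    """Rename e.g. 01.jpg -> 1.jpg under root_dir; return (old, new) name pairs."""
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            # Keep a single "0" if the stem consists only of zeros.
            new_name = (stem.lstrip("0") or "0") + ext
            target = os.path.join(dirpath, new_name)
            if new_name == name or os.path.exists(target):
                continue  # nothing to strip, or renaming would overwrite a file
            os.rename(os.path.join(dirpath, name), target)
            renamed.append((name, new_name))
    return renamed
```

Since renames are not reversible, it may be worth printing the returned pairs from a dry run on a copy of the folder first.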

how can I save the output of a search for files matching *.txt to a variable?

I'm fairly new to python. I'd like to save the text that is printed by this script as a variable. (The variable is meant to be written to a file later, if that matters.) How can I do that?
import fnmatch
import os
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
You can store it in a variable like this:
import fnmatch
import os
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
        my_var = file
        # do your stuff
Or you can store it in a list for later use:
import fnmatch
import os
my_match = []
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        print(file)
        my_match.append(file) # append adds the value to the end of the list
# do stuff with my_match list
You can store it in a list:
import fnmatch
import os
matches = []
for file in os.listdir("/Users/x/y"):
    if fnmatch.fnmatch(file, '*.txt'):
        matches.append(file)
Both answers already provided are correct, but Python provides a nice alternative. Since iterating through an array and appending to a list is such a common pattern, the list comprehension was created as a one-stop shop for the process.
import fnmatch
import os
matches = [filename for filename in os.listdir("/Users/x/y") if fnmatch.fnmatch(filename, "*.txt")]
While NSU's answer and the others are all perfectly good, there may be a simpler way to get what you want.
Just as fnmatch tests whether a certain file matches a shell-style wildcard, glob lists all files matching a shell-style wildcard. In fact:
This is done by using the os.listdir() and fnmatch.fnmatch() functions in concert…
So, you can do this:
import glob
matches = glob.glob("/Users/x/y/*.txt")
But notice that in this case, you're going to get full pathnames like '/Users/x/y/spam.txt' rather than just 'spam.txt', which may not be what you want. Often, it's easier to keep the full pathnames around and os.path.basename them when you want to display them, than to keep just the base names around and os.path.join them when you want to open them… but "often" isn't "always".
Also notice that I had to manually paste the "/Users/x/y/" and "*.txt" together into a single string, the way you would at the command line. That's fine here, but if, say, the first one came from a variable, rather than hardcoded into the source, you'd have to use os.path.join(basepath, "*.txt"), which isn't quite as nice.
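A minimal sketch of that variable-directory case, assuming you want bare file names back rather than full paths (txt_basenames is a hypothetical helper name). glob.escape is applied to the directory part so that characters such as "[" in the path are not interpreted as wildcards.

```python
import glob
import os

def txt_basenames(basepath):
    """Return the sorted base names of all *.txt files directly inside basepath."""
    # Only the "*.txt" part should act as a pattern; the directory is taken literally.
    pattern = os.path.join(glob.escape(basepath), "*.txt")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))
```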
By the way, if you're using Python 3.4 or later, you can get the same thing out of the higher-level pathlib library:
import pathlib
matches = list(pathlib.Path("/Users/x/y/").glob("*.txt"))
Maybe defining a utility function is the right path to follow...
def list_ext_in_dir(e,d):
    """e=extension, d= directory => list of matching filenames.
    If the directory d cannot be listed returns None."""
    from fnmatch import fnmatch
    from os import listdir
    try:
        dirlist = listdir(d)
    except OSError:
        return None
    return [fname for fname in dirlist if fnmatch(fname,e)]
I have put the dirlist inside a try/except clause to catch the possibility that we cannot list the directory (non-existent, read permission, etc.). The treatment of errors is a bit simplistic, but...
The list of matching filenames is built using a so-called list comprehension, which is something you should investigate as soon as possible if you're going to use Python for your programs.
To close my post, a usage example:
l_txtfiles = list_ext_in_dir('*.txt','/Users/x/y')

Python: rename all files in a folder using numbers that file contain

I want to write a little script for managing a bunch of files I got. Those files have complex and different name but they all contain a number somewhere in their name. I want to take that number, place it in front of the file name so they can be listed logically in my filesystem.
I got a list of all those files using os.listdir but I'm struggling to find a way to locate the numbers in those files. I've checked regular expression but I'm unsure if it's the right way to do this!
example:
import os
files = os.listdir('c:\\folder')
files
['xyz3.txt' , '2xyz.txt', 'x1yz.txt']
So basically, what I ultimately want is:
1xyz.txt
2xyz.txt
3xyz.txt
where I am stuck so far is to find those numbers (1,2,3) in the list files
This (untested) snippet should show the regexp approach. The search method of compiled patterns is used to look for the number. If found, the number is moved to the front of the file name.
import os, re
NUM_RE = re.compile(r'\d+')
for name in os.listdir('.'):
    match = NUM_RE.search(name)
    if match is None or match.start() == 0:
        continue # no number or number already at start
    newname = match.group(0) + name[:match.start()] + name[match.end():]
    print('renaming', name, 'to', newname)
    #os.rename(name, newname)
If this code is used in production and not as homework assignment, a useful improvement would be to parse match.group(0) as an integer and format it to include a number of leading zeros. That way foo2.txt would become 02foo.txt and get sorted before 12bar.txt. Implementing this is left as an exercise to the reader.
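One possible take on that exercise is sketched below. It only computes the new name and leaves the actual os.rename to the caller; the function name number_first and the default width of 2 are assumptions.

```python
import re

NUM_RE = re.compile(r"\d+")

def number_first(name, width=2):
    """Move the first number in name to the front, zero-padded to width digits."""
    match = NUM_RE.search(name)
    if match is None:
        return name  # no number in this name, leave it alone
    number = int(match.group(0))
    rest = name[:match.start()] + name[match.end():]
    # Nested format spec: pad the integer with zeros up to `width` digits.
    return f"{number:0{width}d}{rest}"

print(number_first("foo2.txt"))
print(number_first("12bar.txt"))
```

Numbers longer than the width are kept whole, so the width only needs to cover the largest number you expect to sort correctly.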
Assuming that the numbers in your file names are integers (untested code):
import os
def rename(dirpath, filename):
    inds = [i for i, char in enumerate(filename) if char in '1234567890']
    if not inds:
        return  # no digits in this name, nothing to do
    ints = filename[min(inds):max(inds)+1]
    newname = ints + filename[:min(inds)] + filename[max(inds)+1:]
    os.rename(os.path.join(dirpath, filename), os.path.join(dirpath, newname))
def renameFilesInDir(dirpath):
    """ Apply your renaming scheme to all files in the directory specified by dirpath """
    # next() takes only the top-level entry; subdirectories are handled by the recursion below
    dirpath, dirnames, filenames = next(os.walk(dirpath))
    for filename in filenames:
        rename(dirpath, filename)
    for dirname in dirnames:
        renameFilesInDir(os.path.join(dirpath, dirname))
Hope this helps

How to filter files (with known type) from os.walk?

I have a list from os.walk, but I want to exclude some directories and files. I know how to do it with directories:
for root, dirs, files in os.walk('C:/My_files/test'):
    if "Update" in dirs:
        dirs.remove("Update")
But how can I do it with files whose type I know? This doesn't work:
if "*.dat" in files:
    files.remove("*.dat")
files = [ fi for fi in files if not fi.endswith(".dat") ]
Exclude multiple extensions.
files = [ file for file in files if not file.endswith( ('.dat','.tar') ) ]
And in one more way, because I just wrote this, and then stumbled upon this question:
files = filter(lambda file: not file.endswith('.txt'), files)
Note that in Python 3 filter returns an iterator, not a list, and the list comprehension is "preferred".
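Putting the directory pruning from the question together with the list-comprehension filter could look like the sketch below. The function name walk_filtered and its default arguments are assumptions; the extension check is lowercased, which goes slightly beyond what the answers above do.

```python
import os

def walk_filtered(top, skip_dirs=("Update",), skip_exts=(".dat",)):
    """Like os.walk, but without the named directories and without files of the given types."""
    for root, dirs, files in os.walk(top):
        # Slice assignment edits the list os.walk holds, so pruned directories are never visited.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        kept = [f for f in files if not f.lower().endswith(tuple(skip_exts))]
        yield root, dirs, kept
```

The slice assignment matters: rebinding dirs to a new list would leave os.walk's own list untouched and the excluded directories would still be walked.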
A concise way of writing it, if you do this a lot:
def exclude_ext(ext):
    def compare(fn): return os.path.splitext(fn)[1] != ext
    return compare
files = filter(exclude_ext(".dat"), files)
Of course, exclude_ext goes in your appropriate utility package.
files = [file for file in files if os.path.splitext(file)[1] != '.dat']
Should be exactly what you need:
if thisFile.endswith(".txt"):
Try this:
import os
skippingWalk = lambda targetDirectory, excludedExtentions: (
    (root, dirs, [F for F in files if os.path.splitext(F)[1] not in excludedExtentions])
    for (root, dirs, files) in os.walk(targetDirectory)
)
for line in skippingWalk("C:/My_files/test", [".dat"]):
    print(line)
This is a lambda function that returns a generator expression. You pass it a path and some extensions, and it invokes os.walk with the path, filters out the files with extensions in the list of unwanted extensions using a list comprehension, and returns the result.
(edit: removed the .upper() statement because there might be an actual difference between extensions of different case - if you want this to be case insensitive, add .upper() after os.path.splitext(F)[1] and pass extensions in in capital letters.)
The easiest way to filter files with a known type with os.walk() is to tell the path and get all the files filtered by the extension with an if statement.
for base, dirs, files in os.walk(path):
    for name in files:  # files is a list of names, so test each name
        if name.endswith('.type'):
            #Here you will go through all the files with the particular extension '.type'
            .....
            .....
Another solution would be to use the functions from fnmatch module:
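import fnmatch
def MatchesExtensions(name,extensions=["*.dat", "*.txt", "*.whatever"]):
    for pattern in extensions:
        if fnmatch.fnmatch(name, pattern):
            return True
    return False
On Windows this avoids the hassle with upper/lower case extensions, because fnmatch.fnmatch normalizes case with os.path.normcase: you don't need to convert to lower/upper to match *.JPEG, *.jpeg, *.JPeg, *.Jpeg. On POSIX systems the comparison stays case-sensitive, so lowercase the name and the patterns first.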
All of the above answers work. I just wanted to add a note for anyone whose files come from heterogeneous sources, e.g. images downloaded in archives from the Internet. In this case, because Unix-like systems are case sensitive, you may end up with extensions like '.PNG' and '.png'. These are treated as different strings by the endswith method, i.e. '.PNG'.endswith('png') will return False. To avoid this problem, use the lower() method.
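A short sketch of that idea, assuming extensions are passed with their leading dot (has_extension is a hypothetical helper name):

```python
import os

def has_extension(name, extensions):
    """Case-insensitive check of a file name's extension against a collection of extensions."""
    wanted = {ext.lower() for ext in extensions}
    return os.path.splitext(name)[1].lower() in wanted

print(has_extension("IMG_0001.PNG", [".png"]))
print(has_extension("notes.txt", [".png", ".jpg"]))
```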
Here is how to find all files in a directory ending with a specific extension:
import glob, os
path=os.path.expanduser('C:\\Users\\A')
for filename in [item for item in os.listdir(path) if item.endswith(".ipynb") ]:
    print(filename)
In these two ways I can select the files by the file type:
from os import listdir
from os.path import isfile, join
source_path = './data'
excelfiles = [f for f in listdir(source_path) if f.endswith(('.xlsx')) and isfile(join(source_path, f))]
from os import walk
excelfiles2 = []
for (dirpath, dirnames, filenames) in walk(source_path):
    excelfiles2.extend(filename for filename in filenames if filename.endswith('.xlsx'))
    break
