Python: if file name contains "blah" AND "blah"

I'm trying to figure out if my file name contains two values.
I've done this to find a certain file name extension:
import os

for root, dirs, files in os.walk("myDir"):
    for file in files:
        if file.endswith(".jpg"):
            print(os.path.join(root, file))
Now I want to do something like this:
import os

for root, dirs, files in os.walk("myDir"):
    for file in files:
        if file.contains("string1" & "string2"):
            print(os.path.join(root, file))
Obviously the above won't work, because there is no contains() string method in Python. I saw find(), but I couldn't figure out how to make it work with two values.
The idea is that if a file name contains both "state" and "california", it should print. In this example, "iliveinstateofcalifornia.jpg" would print.

You can make use of the all() function. That way you can expand the list without having to change the if condition.
values = ['string1', 'string2']
if all(value in file for value in values):
    # Do something
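For instance, plugged into the os.walk loop from the question (a minimal sketch; "myDir" and the substrings are the question's own examples):

import os

substrings = ["state", "california"]
for root, dirs, files in os.walk("myDir"):
    for file in files:
        # all() is True only if every substring occurs in the file name
        if all(s in file for s in substrings):
            print(os.path.join(root, file))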

There may not be a contains() method, but there is the keyword in. For example,

if 'state' in file and 'california' in file:
    print('Do things with stuff')
That said, there are almost certainly more efficient ways of checking string containment. If the filenames share a format and you will be repeatedly checking many of them, it will probably be faster to use a compiled regular expression. For more information, refer to the documentation of Python's re module.
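For illustration, a minimal sketch of the compiled-regex approach, assuming the two substrings from the question may appear in any order (the lookaheads enforce that both are present):

import re

# Compile once, reuse across many filenames; the two lookaheads require
# both substrings to appear somewhere in the name, in any order.
matcher = re.compile(r"(?=.*state)(?=.*california)")

for name in ["iliveinstateofcalifornia.jpg", "ilivencalifornia.jpg"]:
    if matcher.search(name):
        print(name)  # only the first name matches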

Related

Using regex to move multiple files

I'm really new to Python, looking to organize hundreds of files, and I want to use regex to move them to the correct folders.
Example: I would like to move 4 files into different folders.
File A has "USA" in the name
File B has "Europe" in the name
File C has both "USA" and "Europe" in the name
File D has "World" in the name
Here is what I am thinking, but I don't think this is correct:
shutil.move('Z:\local 1\[.*USA.*]', 'Z:\local 1\USA')
shutil.move('Z:\local 1\[.*\(Europe\).*]', 'Z:\local 1\Europe')
shutil.move('Z:\local 1\[.*World.*]', 'Z:\local 1\World')
You can list all the files in a directory and move them to a new folder if their names match a given regular expression, as follows:
import os
import re
import shutil

src = 'path/to/some/directory'
for filename in os.listdir(src):
    # os.listdir yields bare file names, so match the name, not a full path
    if re.match(r'.*USA.*', filename):
        shutil.move(os.path.join(src, filename), r'Z:\local 1\USA')
    elif re.match(r'.*\(Europe\).*', filename):
        shutil.move(os.path.join(src, filename), r'Z:\local 1\Europe')
    # and so forth
However, os.listdir only shows the immediate subfolders and files; it does not iterate deeper. If you want to analyze all the files in a folder recursively, use the os.walk method instead.
According to the definition of shutil.move, it needs two things:
src, which is the path of a source file;
dst, which is the path to the destination folder.
It says that src and dst should be paths, not regular expressions.
What you have is os.listdir(), which lists the files in a directory.
So what you need to do is list the files, then try to match each file name against the regular expressions. If you get a match, then you know where the file should go.
That said, you still need to decide what to do with option C that matches both 'USA' and 'Europe'.
For added style points you can put pairs of (regex, destination_path) into a list, tuple or dict; that way you can add any number of rules without changing or duplicating the logic.
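A minimal sketch of that idea, assuming the question's source folder and hypothetical destinations (file C, which matches both "USA" and "Europe", gets its own rule checked first):

import os
import re
import shutil

src = r"Z:\local 1"  # source folder from the question

# (pattern, destination) pairs; order matters, so the combined rule
# comes first and claims files that mention both regions.
rules = [
    (re.compile(r"(?=.*USA)(?=.*Europe)"), r"Z:\local 1\Both"),  # hypothetical folder for file C
    (re.compile(r"USA"), r"Z:\local 1\USA"),
    (re.compile(r"Europe"), r"Z:\local 1\Europe"),
    (re.compile(r"World"), r"Z:\local 1\World"),
]

for filename in os.listdir(src):
    for pattern, destination in rules:
        if pattern.search(filename):
            shutil.move(os.path.join(src, filename), destination)
            break  # first matching rule wins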

RegEx to find specific file path

I am trying to detect the existence of a file, testing.txt.
The first file exists in: sub/hbc_cube/college/
The second file exists in: sub/hbc/college
However, when searching for where the file exists, I CANNOT assume the string 'hbc' because the name may be different depending on the user. So I am trying to find a way to
PASS if the path is
sub/*_cube/college/
FAIL if the path is
sub/*/college
But I cannot use a glob character (*) on its own, because the * will also match *_cube directories and count them as failing. I am trying to figure out a regular expression that will only detect a plain string and not a string with an underscore suffix (hbc_cube, for example).
I have tried using the Python regex documentation, but I have not been able to figure out the correct regex to use:
file_list = lookupfiles(['testing.txt'], dirlist=['sub/'])
for file in file_list:
    if str(file).find('_cube/college/'):  # hbc_cube/college
        print("pass")
    if str(file).find('*/college/'):  # hbc/college
        print("fail")
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
The glob module is your friend. You don't even need to match against multiple directories, glob will do it for you:
from glob import glob

testfiles = glob("sub/*/testing.txt")
if len(testfiles) > 0 and all("_cube/" in path for path in testfiles):
    print("Pass")
else:
    print("Fail")
In case it is not obvious, the test all("_cube/" in path for path in testfiles) will take care of this requirement:
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
If some of the paths that matched do not contain _cube, the test fails. Since you want to know about files that cause the test to fail, you cannot search solely for files in a path containing *_cube -- you must retrieve both good and bad paths, and inspect them as shown.
Of course you can shorten the above code, or generalize it to construct the globbed path by combining options from a list of folders and a list of files, etc., depending on the particulars of your case.
Note that there are "full regular expressions", provided by the re module, and the simpler "globs" used by the glob module. If you go check the documentation, don't confuse them.
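If you want to see the correspondence between the two, the standard-library fnmatch.translate converts a glob-style pattern into the equivalent regular expression:

import fnmatch

# Convert a glob-style pattern to its regular-expression equivalent;
# the exact output varies slightly between Python versions.
print(fnmatch.translate("*_cube/college"))
# prints something like: (?s:.*_cube/college)\Z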
Use pathlib to parse your path: getting the parent from the path object discards the /college part, then check whether the path string ends with _cube.
from pathlib import Path

file_list = lookupfiles(['testing.txt'], dirlist=['sub/'])
for file in file_list:
    path = Path(file)
    if str(path.parent).endswith('_cube'):
        print('pass')
    else:
        print('Fail')
Edit:
If the file variable in the for loop contains the file name (sub/hbc_cube/college/testing.txt), just call parent twice on the path: path.parent.parent
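A quick illustration of the parent chain, using a path shaped like the ones in the question:

from pathlib import Path

p = Path("sub/hbc_cube/college/testing.txt")
print(p.parent)         # sub/hbc_cube/college
print(p.parent.parent)  # sub/hbc_cube
print(str(p.parent.parent).endswith("_cube"))  # True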
The os module is well suited for this:
import os

# This assumes your current working directory has sub in it
for root, dirs, files in os.walk('sub'):
    for file in files:
        if file == 'testing.txt':
            # print the file and the directory it's in
            print(os.path.join(root, file))
os.walk yields a three-element tuple as it iterates: the current root directory, the directories in that folder, and the files in that folder. To print a file's full location, you join the root with the file name.
For example, on my machine:
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith('ipynb'):
            print(os.path.join(root, file))

# prints:
/Users/mm92400/Salesforce_Repos/DataExplorationClustersAndTime.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled1.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationExploratory.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled3.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled4.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled2.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationClusterAnalysis.ipynb

Unable to use getsize method with os.walk() returned files

I am trying to make a small program that looks through a directory (since I want to find all the files in the subdirectories recursively, I use os.walk()).
Here is my code:
import os
import os.path

filesList = []
path = "C:\\Users\\Robin\\Documents"
for (root, dirs, files) in os.walk(path):
    for file in files:
        filesList += file
Then I try to use the os.path.getsize() method on the elements of filesList, but it doesn't work. Indeed, I realized that this code fills the list filesList with single characters. I don't know what to do; I have tried several other things, such as:
for (root, dirs, files) in os.walk(path):
    filesList += [file for file in os.listdir(root) if os.path.isfile(file)]
This does give me files, but only one, which isn't even visible when looking in the directory.
Can someone explain how to obtain, with os.walk, files I can actually work with (that is to say, get their size, hash them, or modify them)?
I am new to Python, and I don't really understand how to use os.walk().
The issue I suspect you're running into is that file contains only the filename itself, not any directories you have to navigate through from your starting folder. You should use os.path.join to combine the file name with the folder it is in, which is the root value yielded by os.walk:
for (root, dirs, files) in os.walk(path):
    for file in files:
        filesList.append(os.path.join(root, file))
Now all the filenames in filesList will be acceptable to os.path.getsize and other functions (like open).
I also fixed a secondary issue, which is that your use of += to extend a list wouldn't work the way you intended. You'd need to wrap the new file path in a list for that to work. Using append is more appropriate for adding a single value to the end of a list.
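To see the difference concretely:

filesList = []
filesList += "photo.jpg"        # a string is an iterable of characters
print(filesList)                # ['p', 'h', 'o', 't', 'o', '.', 'j', 'p', 'g']

filesList = []
filesList.append("photo.jpg")   # append adds the whole value
print(filesList)                # ['photo.jpg']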
If you want to get a list of files including their paths, use:

for (root, dirs, files) in os.walk(path):
    fullpaths = [os.path.join(root, fil) for fil in files]
    filesList += fullpaths
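Either way, once the list holds full paths, os.path.getsize (and open, hashing, and so on) work as intended. A minimal sketch, reusing the path from the question:

import os

path = "C:\\Users\\Robin\\Documents"
filesList = []
for (root, dirs, files) in os.walk(path):
    for file in files:
        filesList.append(os.path.join(root, file))

# Full paths make the rest of the os.path API usable:
for name in filesList:
    print(name, os.path.getsize(name), "bytes")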

filetype os.walk search speeding up the code

I have a function that traverses a directory tree searching for files of a designated filetype. It works just fine; the only problem is that it can be quite slow. Can anyone offer more Pythonic suggestions to potentially speed it up?
import os

def findbyfiletype(filetype, directory):
    """
    findbyfiletype allows the user to search by two parameters, filetype and directory.

    Example:
    If the user wishes to locate all pdf files within a directory, including subdirectories,
    then the function would be called as follows:

        findbyfiletype(".pdf", "D:\\")

    This will return a dictionary of strings where the filename is the key and the file
    path is the value, e.g. {'file.pdf': 'c:\\folder\\file.pdf'}.

    Note that both parameters filetype and directory must be enclosed in double or single
    quotes, and the directory parameter must use the backslash escape \\ as opposed to \
    as Python will throw a string literal error.
    """
    indexlist = []              # holds all files in the given directory, including subfolders
    FiletypeFilenameList = []   # holds the filenames of the defined filetype found in indexlist
    FiletypePathList = []       # holds the paths to individual files of the defined filetype
    for root, dirs, files in os.walk(directory):
        for name in files:
            indexlist.append(os.path.join(root, name))
            if filetype in name[-5:]:
                FiletypeFilenameList.append(name)
    for files in indexlist:
        if filetype in files[-5:]:
            FiletypePathList.append(files)
    FileDictionary = dict(zip(FiletypeFilenameList, FiletypePathList))
    del indexlist, FiletypePathList, FiletypeFilenameList
    return FileDictionary
OK, so this is what I ended up with, using a combination of @Ulrich Eckhardt, @Anton and @Cox:
import os
import scandir

def findbyfiletype(filetype, directory):
    FileDictionary = {}
    for root, dirs, files in scandir.walk(directory):
        for name in files:
            if filetype in name and name.endswith(filetype):
                FileDictionary.update({name: os.path.join(root, name)})
    return FileDictionary
As you can see, it's been refactored, getting rid of the unnecessary lists and creating the dictionary in one step. @Anton, your suggestion of the scandir module helped greatly, reducing the time in one instance by about 97 percent, which nearly blew my mind.
I'm listing @Anton as the accepted answer as it sums up all that I actually achieved with the refactoring, but @Ulrich Eckhardt and @Cox both get upvotes as you were both very helpful.
Regards
Instead of os.walk(), you can use the faster scandir module (PEP-471).
Also, a few other tips:
Don't use the arbitrary [-5:]. Use the endswith() string method or os.path.splitext().
Don't build up two long lists and then make a dict. Build the dict directly.
If escaping backslashes bothers you, use forward slashes like 'c:/folder/file.pdf'. They just work.
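Putting those tips together, the refactored function might look like this (a sketch; note that on Python 3.5+ the scandir speedup is built in, since os.walk uses os.scandir internally per PEP 471):

import os

def findbyfiletype(filetype, directory):
    # Build the dict in a single pass; os.path.splitext replaces
    # the arbitrary [-5:] slice.
    found = {}
    for root, dirs, files in os.walk(directory):
        for name in files:
            if os.path.splitext(name)[1] == filetype:
                found[name] = os.path.join(root, name)
    return found

# Forward slashes work fine in paths:
print(findbyfiletype(".pdf", "c:/folder"))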
walk() can be slow because it tries to cover a lot of cases. I use a simple variant:
def walk(self, path):
    try:
        l = (os.path.join(path, x) for x in os.listdir(path))
        for x in l:
            if os.path.isdir(x):
                self.walk(x)
            elif x.endswith(("jpg", "png", "jpeg")):
                self.lf.append(x)
    except PermissionError:
        pass
That's fast, and the operating system keeps a filesystem cache, so a second invocation is even faster.
PS: the walk function is a member of a class, obviously; that's why "self" is there.
EDIT: on NTFS, don't bother with islink. Updated with try/except.
But this just ignores dirs where you don't have permissions. You have to run the script as admin if you want them listed.

Delete multiple files matching a pattern

I have made an online gallery using Python and Django. I've just started to add editing functionality, starting with a rotation. I use sorl.thumbnail to auto-generate thumbnails on demand.
When I edit the original file, I need to clean up all the thumbnails so new ones are generated. There are three or four of them per image (I have different ones for different occasions).
I could hard-code the file variants... but that's messy, and if I change the way I do things, I'll need to revisit the code.
Ideally I'd like to do a regex-delete. In regex terms, all my originals are named like so:
^(?P<photo_id>\d+)\.jpg$
So I want to delete:
^(?P<photo_id>\d+)[^\d].*jpg$
(Where I replace photo_id with the ID I want to clean.)
Using the glob module:

import glob, os

for f in glob.glob("P*.jpg"):
    os.remove(f)

Alternatively, using pathlib:

from pathlib import Path

for p in Path(".").glob("P*.jpg"):
    p.unlink()
Try something like this:
import os, re

def purge(dir, pattern):
    for f in os.listdir(dir):
        if re.search(pattern, f):
            os.remove(os.path.join(dir, f))
Then you would pass the directory containing the files and the pattern you wish to match.
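For instance, a call for the question's thumbnails might look like this (the directory name and photo ID are hypothetical):

# Remove generated variants of photo 1234, e.g. "1234_small.jpg",
# following the question's pattern with the photo_id filled in.
purge("thumbnails", r"^1234\D.*jpg$")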
If you need recursion into several subdirectories, you can use this method:
import os, re, os.path

pattern = r"^(?P<photo_id>\d+)[^\d].*jpg$"
mypath = "Photos"
for root, dirs, files in os.walk(mypath):
    for file in filter(lambda x: re.match(pattern, x), files):
        os.remove(os.path.join(root, file))
You can safely remove subdirectories on the fly from dirs, which contains the list of the subdirectories to visit at each node.
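For example, extending the loop above, a slice assignment prunes the walk in place (the folder name to skip is hypothetical):

for root, dirs, files in os.walk(mypath):
    # Slice assignment mutates the very list os.walk is iterating over,
    # so the walk never descends into the pruned entries.
    dirs[:] = [d for d in dirs if d != "originals"]  # "originals" is a made-up folder to skip
    for file in filter(lambda x: re.match(pattern, x), files):
        os.remove(os.path.join(root, file))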
Note that if you are in a directory, you can also get files corresponding to a simple pattern expression with glob.glob(pattern). In this case you would have to subtract the set of files to keep from the whole set, so the code above is more efficient.
How about this?
import glob, os, multiprocessing

p = multiprocessing.Pool(4)
p.map(os.remove, glob.glob("P*.jpg"))
Mind you this does not do recursion and uses wildcards (not regex).
UPDATE
In Python 3 the built-in map() function returns an iterator, not a list (note that multiprocessing.Pool.map, used above, still returns a list). An iterator is useful since you will probably want to do some kind of processing on the items anyway, and it is always more memory-efficient to that end.
If, however, a list is what you really need, just do this:
...
list(p.map(os.remove, glob.glob("P*.jpg")))
I agree it's not the most functional way, but it's concise and does the job.
It's not clear to me that you actually want to do any named-group matching -- in the use you describe, the photoid is an input to the deletion function, and named groups' purpose is "output", i.e., extracting certain substrings from the matched string (and accessing them by name in the match object). So, I would recommend a simpler approach:
import re
import os

def delete_thumbnails(photoid, photodirroot):
    matcher = re.compile(r'^%s\d+\D.*jpg$' % photoid)
    numdeleted = 0
    for rootdir, subdirs, filenames in os.walk(photodirroot):
        for name in filenames:
            if not matcher.match(name):
                continue
            path = os.path.join(rootdir, name)
            os.remove(path)
            numdeleted += 1
    return "Deleted %d thumbnails for %r" % (numdeleted, photoid)
You can pass the photoid as a normal string, or as an RE pattern piece if you need to remove several matchable IDs at once (e.g., r'abc[def]' to remove abcd, abce, and abcf in a single call) -- that's the reason I'm inserting it literally in the RE pattern, rather than inserting the string re.escape(photoid) as would be normal practice. Certain parts, such as counting the number of deletions and returning an informative message at the end, are obviously frills which you should remove if they give you no added value in your use case.
Others, such as the "if not ... // continue" pattern, are highly recommended practice in Python (flat is better than nested: bailing out to the next leg of the loop as soon as you determine there is nothing to do on this one is better than nesting the actions to be done within an if), although of course other arrangements of the code would work too.
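Hypothetical example calls, matching the two usages described above (the photo IDs and root directory are made up):

# A single photo ID:
print(delete_thumbnails("1234", "/srv/gallery/photos"))

# Several IDs at once, passing a regex fragment instead of a plain ID:
print(delete_thumbnails(r"abc[def]", "/srv/gallery/photos"))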
My recommendation:
import os, re

def purge(dir, pattern, inclusive=True):
    regexObj = re.compile(pattern)
    for root, dirs, files in os.walk(dir, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if bool(regexObj.search(path)) == bool(inclusive):
                os.remove(path)
        for name in dirs:
            path = os.path.join(root, name)
            if len(os.listdir(path)) == 0:
                os.rmdir(path)
This will recursively remove every file that matches the pattern by default, and every file that doesn't if inclusive is false. It will then remove any empty folders from the directory tree.
import os, sys, glob, re

def main():
    mypath = "<Path to Root Folder to work within>"
    for root, dirs, files in os.walk(mypath):
        for file in files:
            p = os.path.join(root, file)
            if os.path.isfile(p):
                if p[-4:] == ".jpg":  # or any pattern you want
                    os.remove(p)
I find Popen(["rm " + file_name + "*.ext"], shell=True, stdout=PIPE).communicate() (with Popen and PIPE imported from the subprocess module) to be a much simpler solution to this problem. Although it is prone to injection attacks, I don't see any issue if your program is using it internally.
import os, re

def recursive_purge(dir, pattern):
    for f in os.listdir(dir):
        if os.path.isdir(os.path.join(dir, f)):
            recursive_purge(os.path.join(dir, f), pattern)
        elif re.search(pattern, os.path.join(dir, f)):
            os.remove(os.path.join(dir, f))
