I'm really new to Python and looking to organize hundreds of files, and I want to use regex to move them to the correct folders.
Example: I would like to move 4 files into different folders.
File A has "USA" in the name
File B has "Europe" in the name
File C has both "USA" and "Europe" in the name
File D has "World" in the name
Here is what I am thinking, but I don't think this is correct:
shutil.move('Z:\local 1\[.*USA.*]', 'Z:\local 1\USA')
shutil.move('Z:\local 1\[.*\(Europe\).*]', 'Z:\local 1\Europe')
shutil.move('Z:\local 1\[.*World.*]', 'Z:\local 1\World')
You can list all the files in a directory and move them into a new folder if their names match a given regular expression, as follows:
import os
import re
import shutil
for filename in os.listdir('path/to/some/directory'):
    # match against the file name only, not the full path; destinations are raw strings
    if re.match(r'.*USA.*', filename):
        shutil.move(os.path.join('path/to/some/directory', filename), r'Z:\local 1\USA')
    elif re.match(r'.*\(Europe\).*', filename):
        shutil.move(os.path.join('path/to/some/directory', filename), r'Z:\local 1\Europe')
    # and so forth
However, os.listdir shows only the direct subfolders and files; it does not iterate any deeper. If you want to process all the files recursively under a given folder, use the os.walk method instead.
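A minimal sketch of that recursive variant, using the same placeholder paths and the folder names from the question (adjust the source and destination directories to your own layout):
import os
import re
import shutil

source_root = r'Z:\local 1'  # placeholder source directory

for root, dirs, files in os.walk(source_root):
    # Prune the destination folders so already-moved files are not processed again.
    dirs[:] = [d for d in dirs if d not in ('USA', 'Europe', 'World')]
    for filename in files:
        if re.search(r'USA', filename):
            shutil.move(os.path.join(root, filename), os.path.join(source_root, 'USA'))
        elif re.search(r'Europe', filename):
            shutil.move(os.path.join(root, filename), os.path.join(source_root, 'Europe'))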
According to the definition of shutil.move, it needs two things:
src, which is a path of a source file
dst, which is a path to the destination folder.
It says that src and dst should be paths, not regular expressions.
What you have is os.listdir(), which lists the files in a directory.
So what you need to do is to list files, then try to match file names against regular expressions. If you get a match, then you know where the file should go.
That said, you still need to decide what to do with file C, which matches both 'USA' and 'Europe'.
For added style points you can put the (regex, destination_path) pairs into a list, tuple, or dict; that way you can add any number of rules without changing or duplicating the logic.
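For example, a minimal sketch of that table-driven approach (the patterns and destination paths are placeholders taken from the question):
import os
import re
import shutil

source_dir = r'Z:\local 1'
rules = [  # (compiled regex, destination folder) pairs; order decides ties
    (re.compile(r'USA'), r'Z:\local 1\USA'),
    (re.compile(r'Europe'), r'Z:\local 1\Europe'),
    (re.compile(r'World'), r'Z:\local 1\World'),
]

for filename in os.listdir(source_dir):
    for pattern, destination in rules:
        if pattern.search(filename):
            shutil.move(os.path.join(source_dir, filename), destination)
            break  # first matching rule wins, which settles the USA-and-Europe case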
I am trying to look only in three specific subfolders and then recursively create a list of all zip files within those folders. I can easily do this with just one folder, recursively looking through all the subfolders within inputpath, but there are other folders that get created which we cannot use, and we do not know what those folder names will be. This is where I am at, and I am not sure how to pass three subfolders to glob correctly.
# using glob, create a list of all the zip files in specified sub directories COMM, NMR, and NMH inside of input_path
zip_file = glob.glob(os.path.join(inputpath, "/comm/*.zip,/nmr/*.zip,/nmh/*.zip"), recursive=True)
#print(zip_file)
print(f"Found {len(zip_file)} zip files")
The string with commas in it is ... just a string. If you want to perform three globs, you need something like
zip_file = []
for dir in {"comm", "nmr", "nmh"}:
zip_file.extend(glob.glob(os.path.join(inputpath, dir, "*.zip"), recursive=True)
As noted by @Barmar in the comments, if you want to look for zip files anywhere within these folders, the pattern needs to be os.path.join(inputpath, dir, "**/*.zip"), as sketched below. If not, perhaps edit your question to provide an example of the structure you want to traverse.
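For reference, a minimal sketch of that recursive version, assuming inputpath is defined as in the question and the zip files may sit at any depth below those three folders:
import glob
import os

zip_file = []
for dir in ("comm", "nmr", "nmh"):
    # "**" only recurses when recursive=True is passed
    zip_file.extend(glob.glob(os.path.join(inputpath, dir, "**", "*.zip"), recursive=True))
print(f"Found {len(zip_file)} zip files")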
I am trying to check for the existence of a file, testing.txt.
The first file exists in: sub/hbc_cube/college/
The second file exists in: sub/hbc/college
However, when searching for where the file exists, I CANNOT assume the string 'hbc' because the name may be different depending on the user. So I am trying to find a way to
PASS if the path is
sub/*_cube/college/
FAIL if the path is
sub/*/college
But I cannot use a glob character (*) because the (*) will count *_cube as failing. I am trying to figure out a regular expression that will only detect a string and not a string with an underscore (hbc_cube for example).
I have tried using the python regex dictionary but I have not been able to figure out the correct regex to use
file_list = lookupfiles(['testing.txt'], dirlist = ['sub/'])
for file in file_list:
    if str(file).find('_cube/college/'):  # hbc_cube/college
        print("pass")
    if str(file).find('*/college/'):  # hbc/college
        print("fail")
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
The glob module is your friend. You don't even need to match against multiple directories, glob will do it for you:
from glob import glob
testfiles = glob("sub/*/testing.txt")
if len(testfiles) > 0 and all("_cube/" in path for path in testfiles):
    print("Pass")
else:
    print("Fail")
In case it is not obvious, the test all("_cube/" in path for path in testfiles) will take care of this requirement:
If the file exists in both locations I want only "fail" to print. The problem is the * character is counting hbc_cube.
If some of the paths that matched do not contain _cube, the test fails. Since you want to know about files that cause the test to fail, you cannot search solely for files in a path containing *_cube -- you must retrieve both good and bad paths, and inspect them as shown.
Of course you can shorten the above code, or generalize it to construct the globbed path by combining options from a list of folders and a list of files, etc., depending on the particulars of your case.
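As a rough illustration of that generalization (the folder and file lists here are made up; adapt them to your own layout):
from glob import glob

top_dirs = ["sub"]               # hypothetical list of top-level folders
file_names = ["testing.txt"]     # hypothetical list of file names to look for

testfiles = []
for top in top_dirs:
    for name in file_names:
        testfiles.extend(glob(f"{top}/*/{name}"))

if testfiles and all("_cube/" in path for path in testfiles):
    print("Pass")
else:
    print("Fail")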
Note that there are "full regular expressions", provided by the re module, and the simpler "globs" used by the glob module. If you go check the documentation, don't confuse them.
Use pathlib to parse your path: from the path object get the parent (this discards the /college part), and check whether the path string ends with _cube.
from pathlib import Path
file_list = lookupfiles(['testing.txt'], dirlist = ['sub/'])
for file in file_list:
    path = Path(file)
    if str(path.parent).endswith('_cube'):
        print('pass')
    else:
        print('Fail')
Edit:
If the file variable in the for loop contains the file name (sub/hbc_cube/college/testing.txt, for example), just call parent twice on the path: path.parent.parent.
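For example, a minimal sketch using the path from the question:
from pathlib import Path

p = Path('sub/hbc_cube/college/testing.txt')
print(p.parent)                                # sub/hbc_cube/college
print(p.parent.parent)                         # sub/hbc_cube
print(str(p.parent.parent).endswith('_cube'))  # True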
Another approach would be to filter the files inside lookupfiles() that is if you have access to that function and can edit it
The os module is well suited for this:
import os
# This assumes your current working directory has sub in it
for root, dirs, files in os.walk('sub'):
    for file in files:
        if file == 'testing.txt':
            # print the file and the directory it's in
            print(os.path.join(root, file))
os.walk returns a three-element tuple as it iterates: the directory it is currently walking (root), the directories inside that folder, and the files inside that folder. To print a file's full path, you join the root with the file name.
For example, on my machine:
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith('ipynb'):
            print(os.path.join(root, file))
# prints:
/Users/mm92400/Salesforce_Repos/DataExplorationClustersAndTime.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled1.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationExploratory.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled3.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled4.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationUntitled2.ipynb
/Users/mm92400/Salesforce_Repos/DataExplorationClusterAnalysis.ipynb
I'm trying to figure out if my file name contains two values.
I've done this to find a certain file name extension:
import os
for root, dirs, files in os.walk("myDir"):
    for file in files:
        if file.endswith(".jpg"):
            print(os.path.join(root, file))
Now I want to do something like this:
import os
for root, dirs, files in os.walk("myDir"):
    for file in files:
        if file.contains("string1" & "string2"):
            print(os.path.join(root, file))
Obviously the above won't work because there's no "contains" method on Python strings. I saw find() but couldn't figure out how to use it with two values.
The idea is that if a file name has "state" & "california" then it would print. In this example "iliveinstateofcalifornia.jpg" would print.
You can make use of the all function. That way you can expand the list without having to change the if conditions.
values = ['string1', 'string2']
if all(value in file for value in values):
    # Do something
There may not be a contains() method, but there is the keyword 'in'. For example,
if 'state' in file and 'california' in file:
    print('Do things with stuff')
That said, there are almost certainly more efficient ways of checking string containment. If the filenames share the same format and you will be checking many filenames repeatedly, it will probably be more performant to use a compiled regular expression; see the documentation of Python's re module for details.
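For instance, a small sketch of the compiled-regex idea (the lookahead pattern here is just one way to require both substrings; it is not taken from the question):
import re

# Compile once, then reuse the pattern for every file name.
# The two lookaheads require both substrings, in any order.
pattern = re.compile(r'(?=.*state)(?=.*california)')

print(bool(pattern.search('iliveinstateofcalifornia.jpg')))  # True
print(bool(pattern.search('iliveinstateofnevada.jpg')))      # False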
I would like to rename images based on part of the name of the folder the images are in, iterating through the images. I am using os.walk and was able to rename all the images in the folders, but I could not figure out how to use the letters to the left of the first hyphen in the folder name as part of the image name.
Folder name: ABCDEF - THIS IS - MY FOLDER - NAME
Current image names in folder:
dsc_001.jpg
dsc_234.jpg
dsc_123.jpg
Want to change to show like this:
ABCDEF_1.jpg
ABCDEF_2.jpg
ABCDEF_3.jpg
What I have is this, but I am not sure why I am unable to split the filename by the hyphen:
import os
from os.path import join
path = r'C:\folderPath'
i = 1
for root, dirs, files in os.walk(path):
    for image in files:
        prefix = files.split(' - ')[0]
        os.rename(os.path.join(path, image),
                  os.path.join(path, prefix + '_' + str(i) + '.jpg'))
        i = i + 1
Okay, I've re-read your question and I think I know what's wrong.
1.) os.walk() is recursive, i.e. if you use os.walk('C:\\'), it will loop through all the folders and find all the files under the C drive. I'm not sure whether your C:\folderPath has any sub-folders in it. If it does, and any of those folders or files don't follow the same convention as C:\folderPath, your code is going to have a bad time.
2.) When you iterate through files, you are split()ing the wrong object. Your question states you want to split the folder name, but your code is splitting the files iterable, which is the list of all the files under the current iteration directory. That doesn't accomplish what you want. Depending on whether your ABCDEF folder is C:\folderPath itself or a sub folder within it, you'll need to code differently.
3.) You have imported join from os.path but still call the full name os.path.join(), which is redundant. Either just import os and call os.path.join(), or, with your current imports, just call join().
Having said all of that, here are my edits:
Answer 1:
If your ABCDEF is the assigned folder
import os
from os.path import join
path = r'C:\ABCDEF - THIS - IS - MY - FOLDER - NAME'
for root, dirs, files in os.walk(path):
    folder = root.split("\\")[-1]  # This gets you the current folder's name
    for i, image in enumerate(files):
        new_image = "{0}_{1}.jpg".format(folder.split(' - ')[0], i + 1)
        os.rename(join(root, image), join(root, new_image))
    break  # if you have sub folders that follow the SAME structure, then remove this break. Otherwise, keep it here so your code stops after all the files in the parent folder are updated.
Answer 2:
Assuming your ABCDEF's are all sub folders under the assigned directory, and all of them follow the same naming convention.
import os
from os.path import join
path = r'C:\parentFolder' # The folder that has all the sub folders that are named ABCDEF...
for i, (root, dirs, files) in enumerate(os.walk(path)):
    if i == 0:
        continue  # skip the parentFolder as it doesn't follow the same naming convention
    folder = root.split("\\")[-1]  # This gets you the current folder's name
    for i, image in enumerate(files):
        new_image = "{0}_{1}.jpg".format(folder.split(' - ')[0], i + 1)
        os.rename(join(root, image), join(root, new_image))
Note:
If your scenario doesn't fall under either of these, please make it clear what your folder structure is (a sample including all sub folders and sub files). Remember, consistency is key in determining how your code should work. If it's inconsistent, your best bet is use Answer 1 on each target folder separately.
Changes:
1.) You can get an incremental index without doing i += 1 yourself. enumerate() is a great tool for iterables that also gives you the iteration number.
2.) Your split() should operate on the folder name instead of files (an iterable). In your case, image is the actual file name, and files is the list of files in the current iteration directory.
3.) Use of the str.format() function makes the new file name easier to read.
4.) You'll note the use of split("\\") instead of split(r"\"); that's because a raw string cannot end with a single backslash, so r"\" is not valid syntax.
This should now work. I ended up doing a lot more research than expected such as how to handle the os.walk() properly in both scenarios. For future reference, a little google search goes a long way. I hope this finally answers your question. Remember, doing your own research and clarity in demonstrating your problem will get you more efficient answers.
Bonus: if you have Python 3.6+, you can even use f-strings for your new file name, which ends up looking really cool:
new_image = f"{folder.split(' - ')[0]}_{i + 1}.jpg"
I am using Python 3.5 to analyze data contained in csv files. These files are contained in a "figs" directory, which is contained in a case directory, which is contained in an overall data directory, e.g.:
/strm1/serino/DATA/06052009/figs
Or more generally:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs
The directory I am starting in is '/strm1/serino/DATA/,' and each subdirectory is the month, day, and year of a case I am working with. Each subdirectory contains another subdirectory named 'figs,' and that is the location of each case's csv file. To be exact:
/strm1/serino/DATA/case_date_in_MMDDYYYY/figs/case_date_in_MMDDYYYY.csv
So, I would like to start in my DATA directory and go through its subdirectories to find those that have the MMDDYYYY naming. However, some of the case directories may be named with a state abbreviation at the end, like: '06052009_TX.' Therefore, instead of matching the MMDDYYYY naming exactly, it could be something as simple as verifying that the directory name contains any number 1 through 9.
Once I am in the first subdirectory (the case directory) I would like to move into the 'figs' subdirectory. Once there, I want to access the csv file with the same naming convention as the first subdirectory (the case directory). I will fill existing arrays with the data contained in each csv file.
Basically, my question concerns navigating through multiple subdirectories that match a certain naming convention and ultimately accessing the data file at the "end." I was naively playing around with glob, fnmatch, os.listdir, and os.walk, but I could not get anything close enough to working that I feel would be helpful to include. I am not very familiar with those modules. What I can include is what I am going for:
for dirs in data_dir that contain a number:
go into this directory
go into 'figs' directory
read data from the csv file whose name matches its case directory name (or whose name format matches the case directory name format)
I have come across related questions, but I have not been able to apply their answers in the way that I would like, especially with nested directories. I really appreciate the help, and let me know if I need to clarify anything.
The following should get you going. It uses the datetime.strptime() function to attempt to convert each folder name into a valid datetime object. If the conversion fails, then you know that the folder name is not in the correct format and can be skipped. It then attempts to parse any CSV file found in the corresponding figs folder:
from datetime import datetime
import glob
import csv
import os
dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))
for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.strptime(dirname[:8], '%m%d%Y')
            print(dt, dirname)
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file in glob.glob(os.path.join(csv_folder, 'figs', '*.csv')):
                with open(csv_file, newline='') as f_input:
                    csv_input = csv.reader(f_input)
                    for row in csv_input:
                        print(row)
        except ValueError:
            pass
You listed several problems above. Which one are you stuck on? It seems like you already know how to navigate the file storage system using os.path. You may not know of the function os.path.join() which allows you to manually specify a file path relative to a file as such:
os.path.abspath(os.path.join(os.path.dirname(__file__), '../..', 'Data/TrailShelters/'))
To break down the above:
os.path.dirname(__file__) returns the path of the current file, '../..' means go up two levels in the folder hierarchy, and Data/TrailShelters/ is the directory I wish to navigate to.
How does this apply to your particular case? You will need to make some adaptations, but you can store the os.path of the parent directory in a variable and then loop while there are subdirectories left to visit. For every subdirectory, examine its path and extract the part you are interested in; then you can simply use something like if 'TN' in subdirectory_name to determine whether it is a subdirectory you care about. If so, update the saved parent path by appending the path to the subdirectory. Does that make sense?
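If it helps, here is a rough sketch of that idea for the layout described above; the leading-digits check is only an assumption about how the case folders are named (MMDDYYYY, optionally followed by a state suffix such as _TX):
import glob
import os

data_dir = '/strm1/serino/DATA'

for name in os.listdir(data_dir):
    case_dir = os.path.join(data_dir, name)
    # Keep only directories whose names start with eight digits.
    if not os.path.isdir(case_dir) or len(name) < 8 or not name[:8].isdigit():
        continue
    for csv_path in glob.glob(os.path.join(case_dir, 'figs', '*.csv')):
        print(csv_path)  # open and read the csv data here instead of printing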
How does this apply to your particular case? Well, you will need to make some adaptations but you can store the os.path of the parent directory in a variable. Then you can essentially use a while sub_dir is not null loop to iterate through subdirectories. For every subdirectory you will want to examine its os.path and extract the particular part of the path you are interested in. Then you can simply use something like: if 'TN' in subdirectory_name to determine if it is a subdirectory you are interested in. If so; then update the saved os.path of the parent directory by appending the path to the subdirectory. Does that make any sense?