Finding all subfolders that contain two files that end with certain strings

So I have a folder, say D:\Tree, that contains only subfolders (names may contain spaces). These subfolders contain a few files - and they may contain files of the form "D:\Tree\SubfolderName\SubfolderName_One.txt" and "D:\Tree\SubfolderName\SubfolderName_Two.txt" (in other words, the subfolder may contain both of them, one, or neither). I need to find every occurrence where a subfolder contains both of these files, and send their absolute paths to a text file (in a format explained in the following example). Consider these three subfolders in D:\Tree:
D:\Tree\Grass contains Grass_One.txt and Grass_Two.txt
D:\Tree\Leaf contains Leaf_One.txt
D:\Tree\Branch contains Branch_One.txt and Branch_Two.txt
Given this structure and the problem mentioned above, I'd like to be able to write the following lines in myfile.txt:
D:\Tree\Grass\Grass_One.txt D:\Tree\Grass\Grass_Two.txt
D:\Tree\Branch\Branch_One.txt D:\Tree\Branch\Branch_Two.txt
How might this be done? Thanks in advance for any help!
Note: It is very important that "file_One.txt" comes before "file_Two.txt" in myfile.txt

import os
folderPath = r'Your Folder Path'
for (dirPath, allDirNames, allFileNames) in os.walk(folderPath):
    for fileName in allFileNames:
        if fileName.endswith("One.txt") or fileName.endswith("Two.txt"):
            print(os.path.join(dirPath, fileName))
            # Or do your task as writing in file as per your need
Hope this helps....
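The loop above lists every file ending in One.txt or Two.txt, but it does not check that both are present in the same subfolder. Here is a minimal sketch of that pairing step, assuming the files are named after their subfolder as in the question. The names find_pairs and write_pairs are hypothetical and chosen only for illustration.

```python
import os


def find_pairs(root, suffix_one="_One.txt", suffix_two="_Two.txt"):
    """Return (one_path, two_path) for each direct subfolder of root holding both files."""
    pairs = []
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue
        # Assumption: files are named <SubfolderName>_One.txt and <SubfolderName>_Two.txt
        one = os.path.join(folder, name + suffix_one)
        two = os.path.join(folder, name + suffix_two)
        if os.path.isfile(one) and os.path.isfile(two):
            pairs.append((one, two))
    return pairs


def write_pairs(root, out_path):
    """Write one line per pair, with the _One file before the _Two file."""
    with open(out_path, "w") as out:
        for one, two in find_pairs(root):
            out.write(one + " " + two + "\n")
```

A call such as write_pairs(r'D:\Tree', 'myfile.txt') would then produce one line per complete subfolder. Since folder names may contain spaces, a different separator (a tab, for instance) might be easier to parse later.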

Here is a recursive solution
import os
import sys
def findFiles(writable, current_path, ending1, ending2):
    '''
    :param writable: file to write output to
    :param current_path: current path of recursive traversal of sub folders
    :param ending1: ending of the file name that is written first
    :param ending2: ending of the file name that is written second
    :return: None
    '''
    # check if current path is a folder or not
    try:
        flist = os.listdir(current_path)
    except NotADirectoryError:
        return
    # stores files which match given endings
    ending1_files = []
    ending2_files = []
    for dirname in flist:
        if dirname.endswith(ending1):
            ending1_files.append(dirname)
        elif dirname.endswith(ending2):
            ending2_files.append(dirname)
        # recurse into every entry; plain files return immediately
        findFiles(writable, current_path + '/' + dirname, ending1, ending2)
    # see if exactly one file matches each ending
    if len(ending1_files) == 1 and len(ending2_files) == 1:
        writable.write(current_path + '/' + ending1_files[0] + ' ')
        writable.write(current_path + '/' + ending2_files[0] + '\n')
findFiles(sys.stdout, 'G:/testf', 'one.txt', 'two.txt')

Related

How to get the latest folder in a directory using Python

I need to retrieve the directory of the most recently created folder. I am using a program that will output a new run## folder each time it is executed (i.e. run01, run02, run03 and so on). Within any one run## folder resides a data file that I want to analyze (file-i-want.txt).
run_numb = 'run01'
dir = os.path.dirname(__file__)
filepath = os.path.join(dir, r'..\data\directory', run_numb, 'file-i-want.txt')
In short I want to skip having to hardcode in run## and just get the directory of a file within the most recently created run## folder.

You can get the creation date with os.stat. Note that st_birthtime is not available on every platform (it is missing on Linux, for example); st_mtime is a common fallback there.
import os
path = '/a/b/c'
# newest
newest = max(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime)
# all files sorted, newest first
sorted_files = sorted(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime, reverse=True)
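As a rough sketch of how this could be narrowed to the run## folders from the question: the helper below only considers directories whose name starts with a given prefix and compares them by modification time. The name newest_subdir and the prefix parameter are hypothetical, and st_mtime is used as an assumption so that the code also runs where st_birthtime does not exist.

```python
import os


def newest_subdir(path, prefix="run"):
    """Return the name of the most recently modified subdirectory starting with prefix."""
    candidates = [
        entry.name
        for entry in os.scandir(path)
        if entry.is_dir() and entry.name.startswith(prefix)
    ]
    if not candidates:
        raise FileNotFoundError("no '%s*' directories in %s" % (prefix, path))
    # Assumption: modification time is a good enough stand-in for creation time
    return max(candidates, key=lambda name: os.stat(os.path.join(path, name)).st_mtime)
```

The result can then be joined with the data directory and 'file-i-want.txt' instead of the hardcoded run## name.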

pathlib is generally recommended over os for filesystem-related tasks.
You can try:
from pathlib import Path
filepath = Path(__file__).parent / 'data/directory'
fnames = sorted(filepath.rglob('file-i-want.txt'), key=lambda x: x.stat().st_mtime, reverse=True)
filepath = str(fnames[0])
filepath

glob.glob('run*') returns the list of files/directories that match the pattern. The order of the results is not guaranteed, so sort them by name before picking the last one.
So if you want the latest run, your code will be:
import glob
print(sorted(glob.glob('run*'))[-1]) # raises IndexError if there are no runs
IMPORTANT: the names are compared as strings, so, for example, 'run21' will come AFTER 'run100'. You will need to use a high enough number of digits to avoid this problem, or just count the number of matched files and recreate the name of the folder from that number.
You can use glob to count the files with the same name pattern:
import glob
n = len(glob.glob('run*')) # number of files whose name starts with 'run'
new_run_name = 'run' + str(n)
Note: with this code the file names start from 0; if you want to start from 1, just add 1 to n.
If you always want a double-digit run number (00, 01, 02), use 'str(n).zfill(2)' instead of 'str(n)'.
Example:
import glob
n = len(glob.glob('run*')) # number of files whose name starts with 'run'
new_run_name = 'run' + str(n + 1).zfill(2)
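If the number of digits cannot be fixed in advance, one possible alternative is to compare the numeric part of each name as an integer instead of comparing the names as strings. This is only a sketch: run_number, latest_run and next_run_name are hypothetical helpers, and they assume folder names of the exact form 'run' followed by digits.

```python
import re


def run_number(name):
    """Return the integer in a name like 'run07', or None if the name has another shape."""
    match = re.fullmatch(r"run(\d+)", name)
    return int(match.group(1)) if match else None


def latest_run(names):
    """Return the name with the highest run number, compared numerically."""
    runs = [name for name in names if run_number(name) is not None]
    if not runs:
        raise ValueError("no run folders found")
    return max(runs, key=run_number)


def next_run_name(names, width=2):
    """Return the name for the next run, zero-padded to at least `width` digits."""
    numbers = [run_number(name) for name in names]
    highest = max((n for n in numbers if n is not None), default=0)
    return "run" + str(highest + 1).zfill(width)
```

The names argument would typically come from os.listdir or glob.glob('run*'). Unlike counting the matches, this keeps working when an older run folder has been deleted.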

I have a list of a part of a filename, for each one I want to go through the files in a directory that matches that part and return the filename

So, let's say I have a directory with a bunch of filenames.
for example:
Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a
AP Physics C - Dot Product-Wvhn_lVPiw0.m4a
An Introduction to the Dot Product-X5DifJW0zek.m4a
Now let's say I have a list, of only the keys, which are at the end of the file names:
['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
How can I iterate through my list to go into that directory and search for a file name containing that key, and then return me the filename?
Any help is greatly appreciated!

I thought about it, I think I was making it harder than I had to with regex. Sorry about not trying it first. I have done it this way:
audio = ['Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a',
         'An Introduction to the Dot Product-X5DifJW0zek.m4a',
         'AP Physics C - Dot Product-Wvhn_lVPiw0.m4a']
keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
file_names = []
for Id in keys:
    for name in audio:
        if Id in name:
            file_names.append(name)
combined = zip(keys, file_names)
combined
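One caveat with zipping the two lists: the pairs only line up if every key matches exactly one file. A small sketch that avoids this by building a dictionary keyed on the ID is shown below. The function name match_keys is hypothetical, and in practice file_names would probably come from os.listdir on the directory.

```python
def match_keys(keys, file_names):
    """Map each key to the first file name containing it, or None if nothing matches."""
    return {
        key: next((name for name in file_names if key in name), None)
        for key in keys
    }


audio = ['Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a',
         'An Introduction to the Dot Product-X5DifJW0zek.m4a',
         'AP Physics C - Dot Product-Wvhn_lVPiw0.m4a']
keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek', 'missingKey00']
by_key = match_keys(keys, audio)
print(by_key['Wvhn_lVPiw0'])   # AP Physics C - Dot Product-Wvhn_lVPiw0.m4a
print(by_key['missingKey00'])  # None
```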

Here is an example:
ls: list of files in a given directory
names: list of strings to search for
import os
ls = os.listdir("/any/folder")
names = ['Py', 'sql']
for file in ls:
    for name in names:
        if name in file:
            print(file)
Results :
.PyCharm50
.mysql_history
zabbix2.sql
.mysql
PycharmProjects
zabbix.sql

Assuming you know which directory that you will be looking in, you could try something like this:
import os
to_find = ['word 1', 'word 2'] # list containing words that you are searching for
all_files = os.listdir('/path/to/file') # creates list with files from given directory
for file in all_files: # loops through all files in directory
    for word in to_find: # loops through target words
        if word in file:
            print(file) # prints file name if the target word is found
I tested this in my directory which contained these files:
Helper_File.py
forms.py
runserver.py
static
app.py
templates
... and I set to_find to ['runserver', 'static']...
and when I ran this code it returned:
runserver.py
static
For future reference, you should make at least some sort of attempt at solving a problem prior to posting a question on Stack Overflow. It's not common for people to assist you like this if you can't provide proof of an attempt.

Here's a way to do it that lets you choose whether to match based on the position of the text.
import os
def scan(dir, match_key, bias=2):
    '''
    bias selects how match_key is compared with each file name:
    :0 startswith
    :1 contains
    :2 endswith
    '''
    matches = []
    if not isinstance(match_key, (tuple, list)):
        match_key = [match_key]
    if os.path.exists(dir):
        for file in os.listdir(dir):
            for match in match_key:
                if (file.startswith(match) and bias == 0) or (file.endswith(match) and bias == 2) or (match in file and bias == 1):
                    matches.append(file)
                    break # list a file only once, even if several keys match
    return matches
print(scan(os.curdir, '.py'))

Trying to exclude a substring within a list in python

I have a list phplist containing the following strings (example below), there are many more, this is a snippet of the entire list
/home/comradec/public_html/moodle/config.php
/home/comradec/public_html/moodle/cache/classes/config.php
/home/comradec/public_html/moodle/theme/sky_high/config.php
/home/comradec/public_html/moodle/theme/brick/config.php
/home/comradec/public_html/moodle/theme/serenity/config.php
/home/comradec/public_html/moodle/theme/binarius/config.php
/home/comradec/public_html/moodle/theme/anomaly/config.php
/home/comradec/public_html/moodle/theme/standard/config.php
What I am trying to do is only keep the subdir/config.php file and exclude all other config.php files (eg cache/classes/config.php).
Full code is
for folder, subs, files in os.walk(path):
    for filename in files:
        if filename.endswith('.php'):
            phplist.append(abspath(join(folder, filename)))
for i in phplist:
    if i.endswith("/config.php"):
        cmsconfig.append(i)
    if i.endswith("/mdeploy.php"):
        cmslist.append(cms1[18])
So the intended outcome is to add only the top-level /config.php file path to the list cmsconfig, but what is happening is that I am getting all the config.php files, as in the example at the top.
I have been using code like not i.endswith("/theme/brick/config.php"), but I want a way to exclude the whole theme directory from the list.
The reason I am placing the output into a list is that I use that output in another area of the code.

Change your if-condition to if i.endswith("moodle/config.php").
If you want to do this for a different folder, build the ending from the folder name:
path_ending = '%s/config.php' % folder_name
Now change the if-condition to if i.endswith(path_ending).
This will show paths that end with config.php within the folder that you passed.

I think this is what you want. You may want to change the variable names, as they are not in PEP 8 style.
First I sort all entries so that the shortest path comes first, then I remember which directories have already been checked.
url1 = '/home/comradec/public_html/moodle/theme/binarius/config.php'
url2 = '/home/comradec/public_html/moodle/config.php'
url3 = '/home/comradec/public_html/othername/theme/binarius/config.php'
url4 = '/home/comradec/public_html/othername/config.php'
urls = []
urls.append(url1)
urls.append(url2)
urls.append(url3)
urls.append(url4)
moodleUrls = []
checkedDirs = []
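# sort by length so that a directory's own config.php is seen before anything nested in it
for i in sorted(urls, key=len):
    if str(i).endswith('config.php'):
        alreadyChecked = False
        for checkedDir in checkedDirs:
            # the trailing '/' keeps e.g. 'moodle' from matching a sibling such as 'moodle2'
            if str(i).startswith(checkedDir + '/'):
                alreadyChecked = True
                break
        if not alreadyChecked:
            moodleUrls.append(i)
            checkedDirs.append(str(i).replace('/config.php', ''))
print(checkedDirs)
print(moodleUrls)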
Output:
['/home/comradec/public_html/moodle', '/home/comradec/public_html/othername']
['/home/comradec/public_html/moodle/config.php', '/home/comradec/public_html/othername/config.php']
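Another way to express "only the config.php directly inside a site folder" is to look at how deep the file sits below a known root. The sketch below assumes that every site lives exactly one level below a common root such as /home/comradec/public_html; that layout is an assumption taken from the example paths, and top_level_configs is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def top_level_configs(paths, root, filename="config.php"):
    """Keep only paths of the form <root>/<one directory>/<filename>."""
    root = PurePosixPath(root)
    kept = []
    for p in paths:
        try:
            relative = PurePosixPath(p).relative_to(root)
        except ValueError:
            continue  # path is not below root
        # exactly two parts: the site directory and the file itself
        if len(relative.parts) == 2 and relative.name == filename:
            kept.append(p)
    return kept


phplist = [
    '/home/comradec/public_html/moodle/config.php',
    '/home/comradec/public_html/moodle/cache/classes/config.php',
    '/home/comradec/public_html/moodle/theme/brick/config.php',
]
print(top_level_configs(phplist, '/home/comradec/public_html'))
# ['/home/comradec/public_html/moodle/config.php']
```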

The way I resolved my question. It provides the output I am looking for.
import os
from os.path import abspath, join
path = "/home/comradec"
phplist = []
cmsconfig = []
config = "config.php"
for folder, subs, files in os.walk(path):
    for filename in files:
        if filename.endswith('.php'):
            phplist.append(abspath(join(folder, filename)))
for i in phplist:
    if i.endswith("/mdeploy.php"):
        newurl = i
        newurl = newurl[:-11] # strip "mdeploy.php", keep the trailing slash
        newurl = newurl + config
        for i in phplist:
            if i.endswith("/config.php"):
                confirmurl = i
                if confirmurl == newurl:
                    cmsconfig.append(newurl)
print('\n'.join(cmsconfig))

Python - Truncate unknown file names

Let's say I have the following files in a directory:
snackbox_1a.dat
zebrabar_3z.dat
cornrows_00.dat
meatpack_z2.dat
I have SEVERAL of these directories, in which all of the files are of the same format, ie:
snackbox_xx.dat
zebrabar_xx.dat
cornrows_xx.dat
meatpack_xx.dat
So what I KNOW about these files is the first bit (snackbox, zebrabar, cornrows, meatpack). What I don't know is the bit before the file extension (the 'xx'). This changes both within the directory across the files, and across the directories (so another directory might have different xx values, like 12, yy, 2m, 0t, whatever).
Is there a way for me to rename all of these files, or truncate them all (since the xx.dat will always be the same length), for ease of use when attempting to call them? For instance, I'd like to rename them so that I can, in another script, use a simple index to step through and find the file I want (instead of having to go into each directory and pull the file out manually).
In other words, I'd like to change the file names to:
snackbox.dat
zebrabar.dat
cornrows.dat
meatpack.dat
Thanks!

You can use shutil.move to move files. To calculate the new filename, you can use Python's string split method:
original_name = "snackbox_12.dat"
truncated_name = original_name.split("_")[0] + ".dat"
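To show how the name calculation and the rename might fit together for one directory, here is a minimal sketch. It assumes the part to drop is whatever follows the last underscore, and it skips a file when the shortened name is already taken. Both truncated_name (as a function) and rename_in_dir are hypothetical helpers, and os.rename is used instead of shutil.move since the file stays in the same directory.

```python
import os


def truncated_name(filename):
    """'snackbox_12.dat' -> 'snackbox.dat'; names without '_' are returned unchanged."""
    stem, ext = os.path.splitext(filename)
    prefix, sep, _ = stem.rpartition("_")
    return (prefix if sep else stem) + ext


def rename_in_dir(directory, extension=".dat"):
    """Rename matching files in directory to their truncated names, skipping collisions."""
    for filename in os.listdir(directory):
        if not filename.endswith(extension):
            continue
        new_name = truncated_name(filename)
        target = os.path.join(directory, new_name)
        # Assumption: an existing target is left alone rather than overwritten
        if new_name != filename and not os.path.exists(target):
            os.rename(os.path.join(directory, filename), target)
```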

Try re.sub:
import re
filename = 'snackbox_xx.dat'
filename_new = re.sub(r'_[A-Za-z0-9]{2}', '', filename)
You should get 'snackbox.dat' for filename_new
This assumes the two characters after the "_" are either a number or lowercase/uppercase letter, but you could choose to expand the classes included in the regular expression.
EDIT: including moving and recursive search:
import shutil, re, os, fnmatch
directory = 'your_path'
for root, dirnames, filenames in os.walk(directory):
    for filename in fnmatch.filter(filenames, '*.dat'):
        filename_new = re.sub(r'_[A-Za-z0-9]{2}', '', filename)
        shutil.move(os.path.join(root, filename), os.path.join(root, filename_new))

This solution renames all files in a directory that match the pattern in the function call.
What the function does:
snackbox_5R.txt >>> snackbox.txt
snackbox_6y.txt >>> snackbox_0.txt
snackbox_a2.txt >>> snackbox_1.txt
snackbox_Tm.txt >>> snackbox_2.txt
Let's look at the function's inputs and some examples.
list_of_files_names: a list of strings, where each string is the filename without the _?? part.
Examples:
['snackbox.txt', 'zebrabar.txt', 'cornrows.txt', 'meatpack.txt', 'calc.txt']
['text.dat']
upper_bound=1000: an integer. When the ideal filename is already taken, e.g. snackbox.dat already exists, it will create snackbox_0.dat, all the way up to snackbox_999.dat if need be. You shouldn't have to change the default.
The Code
import re
import os
import os.path
def find_and_rename(dir, list_of_files_names, upper_bound=1000):
    """
    :param list_of_files_names: List. A list of strings: filename (without the _??) + extension, e.g. snackbox.txt
    Renames snackbox_R5.dat to snackbox.dat, etc.
    """
    # split each item in list_of_files_names into two parts, filename and extension: "snackbox.dat" -> "snackbox", "dat"
    list_of_files_names = [(prefix.split('.')[0], prefix.split('.')[1]) for prefix in list_of_files_names]
    # store the content of the dir in a list
    list_of_files_in_dir = os.listdir(dir)
    for file_in_dir in list_of_files_in_dir: # list all files and folders in current dir
        file_in_dir_full_path = os.path.join(dir, file_in_dir) # we need the full path to rename and to use .isfile()
        print() # DEBUG
        print('Is "{}" a file?: '.format(file_in_dir), end='') # DEBUG
        print(os.path.isfile(file_in_dir_full_path)) # DEBUG
        if os.path.isfile(file_in_dir_full_path): # filters out the folders, only files are needed
            # file_name is a tuple containing the prefix filename and the extension
            for file_name in list_of_files_names: # check if the file matches one of our renaming prefixes
                # match both the file name (e.g. "snackbox") and the extension (e.g. "dat")
                # It finds "snackbox_5R.dat" by matching "snackbox" in the front and "dat" in the rear
                if re.match(r'{}_\w+\.{}'.format(file_name[0], file_name[1]), file_in_dir):
                    print('\nOriginal File: ' + file_in_dir) # printing this is not necessary
                    print('.'.join(file_name))
                    ideal_new_file_name = '.'.join(file_name) # name might already be taken
                    # print(ideal_new_file_name)
                    if os.path.isfile(os.path.join(dir, ideal_new_file_name)): # file already exists
                        # go up a name, e.g. "snackbox.dat" --> "snackbox_0.dat" --> "snackbox_1.dat"
                        for index in range(upper_bound):
                            # check if this new name already exists as well
                            next_best_name = file_name[0] + '_' + str(index) + '.' + file_name[1]
                            # file does not already exist
                            if not os.path.isfile(os.path.join(dir, next_best_name)):
                                print('Renaming with next best name')
                                os.rename(file_in_dir_full_path, os.path.join(dir, next_best_name))
                                break
                            # this file exists as well, keep increasing the index
                            else:
                                pass
                    # file with ideal name does not already exist, rename with the ideal name (no _##)
                    else:
                        print('Renaming with ideal name')
                        os.rename(file_in_dir_full_path, os.path.join(dir, ideal_new_file_name))
def find_and_rename_include_sub_dirs(master_dir, list_of_files_names, upper_bound=1000):
    for path, subdirs, files in os.walk(master_dir):
        print(path) # DEBUG
        find_and_rename(path, list_of_files_names, upper_bound)
find_and_rename_include_sub_dirs('C:/Users/Oxen/Documents/test_folder', ['snackbox.txt', 'zebrabar.txt', 'cornrows.txt', 'meatpack.txt', 'calc.txt'])

Use of regular expression to exclude characters in file rename,Python?

I am trying to rename files so that they contain an ID followed by a -(int). The files generally come to me in this way but sometimes they come as 1234567-1(crop to bottom).jpg.
I have been trying to use the following code but my regular expression doesn't seem to be having any effect. The reason for the walk is that we have to handle large directory trees with many images.
def fix_length():
    for root, dirs, files in os.walk(path):
        for fn in files:
            path2 = os.path.join(root, fn)
            filename_zero, extension = os.path.splitext(fn)
            re.sub("[^0-9][-]", "", filename_zero)
            os.rename(path2, filename_zero + extension)
fix_length()
I have inserted print statements for filename_zero before and after the re.sub line and I am getting the same result (i.e. 1234567-1(crop to bottom) not what I wanted)
This raises an exception as the rename is trying to create a file that already exists.
I thought perhaps adding the [-] in the regex was the issue but removing it and running again I would then expect 12345671.jpg but this doesn't work either. My regex is failing me or I have failed the regex.
Any insight would be greatly appreciated.
As a follow up, I have taken all the wonderful help and settled on a solution to my specific problem.
import os
import re
import shutil
path = r'C:\Archive'
errors = r'C:\Test\errors'
num_files = []
def best_sol():
    num_files = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            filename_zero, extension = os.path.splitext(fn)
            path2 = os.path.join(root, fn)
            ID = re.match(r'^\d{1,10}', fn).group()
            if len(ID) <= 7:
                if ID not in num_files:
                    num_files = []
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
                else:
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
            else:
                shutil.copy(path2, errors)
                os.remove(path2)
This code creates an ID based upon (up to) the first 10 numeric characters in the filename. I then use a list that stores the instances of this ID and use the length of the list to append a suffix. The first file will have a -1, the second a -2, etc.
I am only interested in IDs with a length of 7 (or they should only be that long), but I allow reading up to 10 digits to allow for human error in labelling. All files with an ID longer than 7 are moved to a folder where we can investigate.
Thanks for pointing me in the right direction.
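The per-ID numbering described above could also be sketched with a counter keyed on the ID, which does not depend on files with the same ID being visited one after another. This is only an illustration of the idea, not the code used in the post: numbered_names is a hypothetical helper that returns a rename plan and leaves the actual os.rename / shutil calls to the caller.

```python
import os
import re
from collections import Counter


def numbered_names(file_names, max_id_length=7):
    """Return (old_name, new_name) pairs; new_name is '<ID>-<n><ext>' with n counted per ID.

    new_name is None when the name has no leading digits or the ID is too long,
    so those files can be set aside for manual review.
    """
    seen = Counter()
    plan = []
    for fn in file_names:
        extension = os.path.splitext(fn)[1]
        match = re.match(r"\d{1,10}", fn)
        if match is None or len(match.group()) > max_id_length:
            plan.append((fn, None))
            continue
        file_id = match.group()
        seen[file_id] += 1
        plan.append((fn, "%s-%d%s" % (file_id, seen[file_id], extension)))
    return plan
```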

re.sub() returns the altered string, but you ignore the return value.
You want to re-assign the result to filename_zero:
filename_zero = re.sub(r"[^\d-]", "", filename_zero)
I've corrected your regular expression as well; this removes anything that is not a digit or a dash from the base filename:
>>> re.sub(r'[^\d-]', '', '1234567-1(crop to bottom)')
'1234567-1'
Remember, strings are immutable, you cannot alter them in-place.
If all you want is the leading digits, plus optional dash-digit suffix, select the characters to be kept, rather than removing what you don't want:
filename_zero = re.match(r'^\d+(?:-\d)?', filename_zero).group()
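Putting this together with the walk from the question, a possible sketch looks like the following. Besides using the return value of the regex call, it joins the new name with root; in the question the destination passed to os.rename is a bare file name, which is resolved relative to the current working directory rather than the folder the file is in. The helpers clean_stem and fix_names are hypothetical, and the pattern assumes names that start with digits, a dash and more digits.

```python
import os
import re


def clean_stem(stem):
    """Return the leading 'digits-digits' part of stem, or None if it is absent."""
    match = re.match(r"\d+-\d+", stem)
    return match.group() if match else None


def fix_names(path):
    """Rename files under path so that only '<ID>-<n>' remains before the extension."""
    for root, dirs, files in os.walk(path):
        for fn in files:
            stem, extension = os.path.splitext(fn)
            cleaned = clean_stem(stem)
            if cleaned is None or cleaned == stem:
                continue  # no ID to keep, or nothing to strip
            target = os.path.join(root, cleaned + extension)
            if not os.path.exists(target):  # avoid clobbering an existing file
                os.rename(os.path.join(root, fn), target)
```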

new_filename = re.sub(r'^([0-9]+)-([0-9]+).*$', r'\g<1>-\g<2>', filename_zero)
Try using this regular expression instead (I don't use Python often). In Python, numbered groups are referenced in the replacement as \g<1> (or \1), and the trailing .*$ makes the substitution drop everything after the ID-number pair. You also appear to have forgotten to assign the value returned by the re.sub call to the filename_zero variable.
