How can I avoid getting duplicate paths from os.walk - python

I have the following directory structure
/mnt/type/split/v2/doc/RESOURCE_ID/YYYY/FY/DOCUMENT_ID
for example, one path might be
/mnt/type/split/v2/doc/100045/2008/FY/28
where
RESOURCE_ID = 100045
YYYY = 2008
DOCUMENT_ID = 28
Note, DOCUMENT_ID is the last directory in the path - there will be files in the DOCUMENT_ID directory
I was trying to take inventory of this structure using the following code
def survey():
    magic_paths = []
    for (resource_id, dirname, filename) in os.walk('/mnt/type/split/v2/doc'):
        if resource_id:
            for (magic_path, dirname2, filename2) in os.walk(resource_id):
                if len(magic_path.split(os.sep)) == 10:
                    magic_paths.append(magic_path + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
I am getting five copies of each path in my magic_paths list. I have 1,500,000 paths, so I am getting 7,500,000 items in my list.
The first 1,500,000 are the unique values. The next 6,000,000 consist of groups that are rooted on the RESOURCE_ID, repeated 4 times
/mnt/type/split/v2/doc/100045/2008/FY/28 #obs_1
/mnt/type/split/v2/doc/100045/2008/FY/29 #obs_2
/mnt/type/split/v2/doc/100045/2008/FY/30 #obs_3
/mnt/type/split/v2/doc/100045/2008/FY/31 #obs_4
/mnt/type/split/v2/doc/1028/2008/FY/28 #obs_5 # see the new RESOURCE_ID
.
. 1,499,995 more unique values
.
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of first repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of second repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of third repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/100045/2008/FY/28 #begin of fourth repetition
/mnt/type/split/v2/doc/100045/2008/FY/29
/mnt/type/split/v2/doc/100045/2008/FY/30
/mnt/type/split/v2/doc/100045/2008/FY/31
/mnt/type/split/v2/doc/1028/2008/FY/28 #series of 4 repetitions based on RESOURCE ID 1028
There are various files in the directories and subs at each level, I just need to inventory the paths to the DOCUMENT_IDs.
I do not understand why the results are patterned as they are. I believed that I was starting at RESOURCE_ID and finding only the directories that were 9 deep since splitting on os.sep gives me a list with ten items.
'/mnt/type/split/v2/doc/100045/2008/FY/31'.split(os.sep) = ['','mnt','type','split','v2','doc','100045','2008','FY','31']
In response to the questions in the comments
I believed that I was getting each RESOURCE_ID directory and then walking it. That the other items returned from the first os.walk (dirnames and filenames) would be ignored
I did not think os.listdir would work, I can make this work with glob but am worried about it eating my memory

os.walk() already walks a directory structure recursively. For each directory the outer walk yields, your code starts another recursive walk, so every directory is walked again along with everything nested beneath it. A DOCUMENT_ID directory such as /mnt/type/split/v2/doc/100045/2008/FY/28 is therefore found by the walks rooted at /mnt/type/split/v2/doc, /mnt/type/split/v2/doc/100045, /mnt/type/split/v2/doc/100045/2008, /mnt/type/split/v2/doc/100045/2008/FY and at the DOCUMENT_ID directory itself, which gives 5 matches per document ID.
Call os.walk() just once:
def survey():
    magic_paths = []
    for (resource_id, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        if len(resource_id.split(os.sep)) == 10:
            magic_paths.append(resource_id + os.linesep)
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
You may want to prune the search after finding a match; there is no point in searching through further subdirectories once you find a DOCUMENT_ID directory:
def survey():
    magic_paths = []
    for (resource_id, dirnames, filenames) in os.walk('/mnt/type/split/v2/doc'):
        if len(resource_id.split(os.sep)) == 10:
            magic_paths.append(resource_id + os.linesep)
            dirnames[:] = []  # clear the subdirs list to stop further recursion here
    write_survey(magic_paths)
    x = len(magic_paths)
    return x
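To see the effect on a small scale, here is a minimal sketch that builds a throwaway tree in a temporary directory and compares the nested walk with a single pruned walk. The tree layout, the helper names rel_depth, nested_walk and single_walk, and the relative-depth check are assumptions made for illustration; the code above compares the absolute path depth instead.

```python
import os
import tempfile

DEPTH = 4  # RESOURCE_ID/YYYY/FY/DOCUMENT_ID below the root (assumed layout)

def rel_depth(root, path):
    """Number of path components of `path` below `root`."""
    rel = os.path.relpath(path, root)
    return 0 if rel == os.curdir else len(rel.split(os.sep))

def nested_walk(root):
    """The approach from the question: start a new walk from every directory."""
    found = []
    for outer, _dirs, _files in os.walk(root):
        for inner, _dirs2, _files2 in os.walk(outer):
            if rel_depth(root, inner) == DEPTH:
                found.append(inner)
    return found

def single_walk(root):
    """One walk, pruned at the DOCUMENT_ID level."""
    found = []
    for current, dirnames, _files in os.walk(root):
        if rel_depth(root, current) == DEPTH:
            found.append(current)
            dirnames[:] = []  # do not descend below DOCUMENT_ID
    return found

with tempfile.TemporaryDirectory() as root:
    # two DOCUMENT_ID directories, each with one extra subdirectory
    for doc_id in ("28", "29"):
        os.makedirs(os.path.join(root, "100045", "2008", "FY", doc_id, "extra"))
    nested = nested_walk(root)
    single = single_walk(root)
    print(len(nested), len(single))  # 10 2
```

Each of the two document directories is reported five times by the nested version and once by the single walk.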

Related

building a dictionary of my directories and file paths to select all files whose name contains a specific string

I have a rootdirectory called 'IC'. 'IC' contains a bunch of subdirectories which contain subsubdirectories which contain subsubsubdirectories and so on. Is there an easy way to move all the sub...directory files into their parent subdirectory and then delete the empty sub...directories.
So far I've made this monstrosity of nested loops to build a dictionary of file paths and subdirectories as dictionaries containing file paths etc. I was gonna then make something to go through the dictionary and pick all files containing 'IC' and the subdirectory they are in. I need to know which directories contain an 'IC' file or not. I also need to move all the files containing 'IC' to the top level subdirectories(see hashtag in code)
import os, shutil
rootdir = 'data/ICs'
def dir_tree(rootdir):
    IC_details = {}
    # This first loop is what I'm calling the top level subdirectories. They are the three
    # subdirectories inside the directory 'data/ICs'
    for i in os.scandir(rootdir):
        if os.path.isdir(i):
            IC_details[i.path] = {}
    for i in IC_details:
        for j in os.scandir(i):
            if os.path.isdir(j.path):
                IC_details[i][j.name] = {}
            elif os.path.isfile(j.path):
                IC_details[i][j.name] = [j.path]
        for j in IC_details[i]:
            if os.path.isdir(os.path.join(i,j)):
                for k in os.scandir(os.path.join(i,j)):
                    if os.path.isdir(k.path):
                        IC_details[i][j][k.name] = {}
                    elif os.path.isfile(k.path):
                        IC_details[i][j][k.name] = [k.path]
                for k in IC_details[i][j]:
                    if os.path.isdir(os.path.join(i,j,k)):
                        for l in os.scandir(os.path.join(i,j,k)):
                            if os.path.isdir(l.path):
                                IC_details[i][j][k][l.name] = {}
                            elif os.path.isfile(l.path):
                                IC_details[i][j][k][l.name] = [l.path]
                        for l in IC_details[i][j][k]:
                            if os.path.isdir(os.path.join(i,j,k,l)):
                                for m in os.scandir(os.path.join(i,j,k,l)):
                                    if os.path.isfile(m.path):
                                        IC_details[i][j][k][l][m.name] = [m.path]
    return IC_details
IC_tree = dir_tree(rootdir)
You should have a look at the 'glob' module (glob: Unix style pathname pattern expansion).
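As a rough illustration of that suggestion, the sketch below uses glob.glob with recursive=True to collect every file whose name contains 'IC', grouped by the top-level subdirectory it sits under. The function name find_ic_files and the demo tree are hypothetical, and moving the files (for example with shutil.move) is deliberately left out.

```python
import glob
import os
import tempfile
from collections import defaultdict

def find_ic_files(rootdir, needle="IC"):
    """Map each top-level subdirectory name to the matching files beneath it."""
    matches = defaultdict(list)
    # '**' matches zero or more nested directories when recursive=True
    pattern = os.path.join(glob.escape(rootdir), "*", "**", "*" + needle + "*")
    for path in glob.glob(pattern, recursive=True):
        if os.path.isfile(path):
            top = os.path.relpath(path, rootdir).split(os.sep)[0]
            matches[top].append(path)
    return dict(matches)

with tempfile.TemporaryDirectory() as root:
    # hypothetical demo tree: 'a' holds two IC files, 'b' holds none
    os.makedirs(os.path.join(root, "a", "deep", "deeper"))
    os.makedirs(os.path.join(root, "b"))
    for rel in (("a", "IC_1.txt"), ("a", "deep", "deeper", "x_IC.dat"), ("b", "notes.txt")):
        open(os.path.join(root, *rel), "w").close()
    found = find_ic_files(root)
    print(sorted(found))  # top-level subdirectories that contain an 'IC' file
```

Any top-level subdirectory missing from the returned dictionary contains no 'IC' file at any depth.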

Comparing incrementing filenames in Python and check which one is missing

I'm iterating through a folder with files and adding each of the file's path to a list. The folder contains files with incrementing file names, such as 00-0.txt, 00-1.txt, 00-2.txt, 01-0.txt, 01-1.txt, 01-2.txt and so on.
The number of files is not fixed and always varies. Also, sometimes a file could be missing. This means that I will sometimes get this list instead:
00-0.txt, 00-1.txt, 01-0.txt, 01-1.txt, 01-2.txt.
However, in my final list, I should always have groups of 9 (so 00-0, 00-1, 00-2 and so on until 00-8 is one group). If a file is missing, then I will append 'is missing' string text in the new list instead.
What I was thinking to do is the following:
Get the last character of the filename (for ex. '3')
Check if its value is the same as the previous index + 1.
If it's not, then append the 'is missing' string
If it's the same, then append the file name
In pseudo-code (please don't mind the syntax errors, I'm mainly looking for high level advice), it would be something like this:
empty_list = []
list_with_paths = glob.glob("/path/to/dir*.txt")
for index, item in enumerate(list_with_paths):
    basename = os.path.basename(item)
    filename = os.path.splitext(basename)[0]
    if index == 0 and int(filename[-1]) != 0:
        empty_list.append('is missing')
    elif filename[-1] != empty_list[index - 1] + 1:
        empty_list.append('is missing')
    else:
        empty_list.append(filename)
I'm sure there is a more optimal solution in order to achieve this.
Once you have the set of actual paths, just iterate over the expected paths until you have accounted for all of the actual paths.
import glob
from itertools import count

list_with_paths = set(glob.glob("/path/to/dir/*.txt"))
groups = count()
results = []
for g in groups:
    if not list_with_paths:
        break
    for i in range(0, 9):
        expected = "{:02}-{}.txt".format(g, i)
        full_path = "/path/to/dir/" + expected
        if full_path in list_with_paths:
            # the set stores full paths, so remove the full path
            list_with_paths.remove(full_path)
        else:
            expected = "is missing"
        results.append(expected)
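The same idea can be tried without touching the file system. The sketch below is a variant, not the loop above: it bounds the iteration by the highest group number found in the names, so it also terminates when the input contains a name that never matches an expected one. The function name fill_missing and the group_size parameter are made up for the example, and it assumes every name follows the NN-N.txt pattern.

```python
def fill_missing(present_names, group_size=9):
    """Return the expected names in order, with 'is missing' for absent files."""
    present = set(present_names)
    if not present:
        return []
    # assumes every name looks like 'NN-N.txt'
    last_group = max(int(name.split("-")[0]) for name in present)
    results = []
    for group in range(last_group + 1):
        for i in range(group_size):
            expected = "{:02}-{}.txt".format(group, i)
            results.append(expected if expected in present else "is missing")
    return results

names = ["00-0.txt", "00-1.txt", "01-0.txt", "01-2.txt"]
result = fill_missing(names, group_size=3)  # small groups to keep the demo short
print(result)
# ['00-0.txt', '00-1.txt', 'is missing', '01-0.txt', 'is missing', '01-2.txt']
```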

Finding all subfolders that contain two files that end with certain strings

So I have a folder, say D:\Tree, that contains only subfolders (names may contain spaces). These subfolders contain a few files - and they may contain files of the form "D:\Tree\SubfolderName\SubfolderName_One.txt" and "D:\Tree\SubfolderName\SubfolderName_Two.txt" (in other words, the subfolder may contain both of them, one, or neither). I need to find every occurrence where a subfolder contains both of these files, and send their absolute paths to a text file (in a format explained in the following example). Consider these three subfolders in D:\Tree:
D:\Tree\Grass contains Grass_One.txt and Grass_Two.txt
D:\Tree\Leaf contains Leaf_One.txt
D:\Tree\Branch contains Branch_One.txt and Branch_Two.txt
Given this structure and the problem mentioned above, I'd like to be able to write the following lines in myfile.txt:
D:\Tree\Grass\Grass_One.txt D:\Tree\Grass\Grass_Two.txt
D:\Tree\Branch\Branch_One.txt D:\Tree\Branch\Branch_Two.txt
How might this be done? Thanks in advance for any help!
Note: It is very important that "file_One.txt" comes before "file_Two.txt" in myfile.txt
import os
folderPath = r'Your Folder Path'
for (dirPath, allDirNames, allFileNames) in os.walk(folderPath):
    for fileName in allFileNames:
        if fileName.endswith("One.txt") or fileName.endswith("Two.txt"):
            print(os.path.join(dirPath, fileName))
            # Or do your task as writing in file as per your need
Hope this helps....
Here is a recursive solution
import os
import sys

def findFiles(writable, current_path, ending1, ending2):
    '''
    :param writable: file to write output to
    :param current_path: current path of recursive traversal of sub folders
    :param ending1: ending of the first file name to match
    :param ending2: ending of the second file name to match
    :return: None
    '''
    # check if current path is a folder or not
    try:
        flist = os.listdir(current_path)
    except NotADirectoryError:
        return
    # stores files which match given endings
    ending1_files = []
    ending2_files = []
    for dirname in flist:
        if dirname.endswith(ending1):
            ending1_files.append(dirname)
        elif dirname.endswith(ending2):
            ending2_files.append(dirname)
        findFiles(writable, current_path + '/' + dirname, ending1, ending2)
    # see if exactly one file matches each of the endings
    if len(ending1_files) == 1 and len(ending2_files) == 1:
        writable.write(current_path + '/' + ending1_files[0] + ' ')
        writable.write(current_path + '/' + ending2_files[0] + '\n')

findFiles(sys.stdout, 'G:/testf', 'one.txt', 'two.txt')
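Since the question says the tree is only one level deep and the file names start with the subfolder name, a non-recursive check may be enough. The following is a minimal sketch under those assumptions; find_pairs and the temporary demo tree are hypothetical names, and the result is printed instead of being written to myfile.txt.

```python
import os
import tempfile

def find_pairs(tree_root, first="_One.txt", second="_Two.txt"):
    """Return 'path_One path_Two' lines for subfolders that contain both files."""
    lines = []
    for entry in sorted(os.scandir(tree_root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        one = os.path.join(entry.path, entry.name + first)
        two = os.path.join(entry.path, entry.name + second)
        if os.path.isfile(one) and os.path.isfile(two):
            lines.append(one + " " + two)  # the _One file always comes first
    return lines

with tempfile.TemporaryDirectory() as tree:
    # same layout as the example in the question
    layout = {"Grass": ("One", "Two"), "Leaf": ("One",), "Branch": ("One", "Two")}
    for folder, suffixes in layout.items():
        os.makedirs(os.path.join(tree, folder))
        for suffix in suffixes:
            open(os.path.join(tree, folder, "{}_{}.txt".format(folder, suffix)), "w").close()
    lines = find_pairs(tree)
    print("\n".join(lines))
```

Writing the lines to a file is then a matter of opening it and writing "\n".join(lines).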

I have a list of a part of a filename, for each one I want to go through the files in a directory that matches that part and return the filename

So, let's say I have a directory with a bunch of filenames.
for example:
Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a
AP Physics C - Dot Product-Wvhn_lVPiw0.m4a
An Introduction to the Dot Product-X5DifJW0zek.m4a
Now let's say I have a list, of only the keys, which are at the end of the file names:
['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
How can I iterate through my list to go into that directory and search for a file name containing that key, and then return me the filename?
Any help is greatly appreciated!
I thought about it, I think I was making it harder than I had to with regex. Sorry about not trying it first. I have done it this way:
audio = ['Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a',
         'An Introduction to the Dot Product-X5DifJW0zek.m4a',
         'AP Physics C - Dot Product-Wvhn_lVPiw0.m4a']
keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek']
file_names = []
for Id in keys:
    for name in audio:
        if Id in name:
            file_names.append(name)
combined = zip(keys, file_names)
combined
Here is an example:
ls: list of files in a given directory
names: list of strings to search for
import os
ls = os.listdir("/any/folder")
names = ['Py', 'sql']
for file in ls:
    for name in names:
        if name in file:
            print(file)
Results :
.PyCharm50
.mysql_history
zabbix2.sql
.mysql
PycharmProjects
zabbix.sql
Assuming you know which directory that you will be looking in, you could try something like this:
import os
to_find = ['word 1', 'word 2']  # list containing words that you are searching for
all_files = os.listdir('/path/to/file')  # creates list with files from given directory
for file in all_files:  # loops through all files in directory
    for word in to_find:  # loops through target words
        if word in file:
            print(file)  # prints file name if the target word is found
I tested this in my directory which contained these files:
Helper_File.py
forms.py
runserver.py
static
app.py
templates
... and i set to_find to ['runserver', 'static']...
and when I ran this code it returned:
runserver.py
static
For future reference, you should make at least some sort of attempt at solving a problem prior to posting a question on Stackoverflow. It's not common for people to assist you like this if you can't provide proof of an attempt.
Here's a way to do it that lets you choose whether to match at the start, anywhere in, or at the end of the file name.
import os

def scan(dir, match_key, bias=2):
    '''
    :0 startswith
    :1 contains
    :2 endswith
    '''
    matches = []
    if not isinstance(match_key, (tuple, list)):
        match_key = [match_key]
    if os.path.exists(dir):
        for file in os.listdir(dir):
            for match in match_key:
                if (file.startswith(match) and bias == 0
                        or file.endswith(match) and bias == 2
                        or match in file and bias == 1):
                    matches.append(file)
                    break  # stop at the first matching key so a file is listed once
    return matches

print(scan(os.curdir, '.py'))
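If the goal is to get the file name back for each key, a dictionary keyed by the search string may be more convenient than a flat list of matches. This is a small sketch on the file names from the question; match_keys is a hypothetical helper, and it keeps only the first file that contains a given key.

```python
def match_keys(keys, filenames):
    """Map each key to the first file name that contains it (None if absent)."""
    return {key: next((name for name in filenames if key in name), None)
            for key in keys}

audio = ['Scalar Product or Dot Product (Hindi)-fodZTqRhC24.m4a',
         'An Introduction to the Dot Product-X5DifJW0zek.m4a',
         'AP Physics C - Dot Product-Wvhn_lVPiw0.m4a']
keys = ['fodZTqRhC24', 'Wvhn_lVPiw0', 'X5DifJW0zek', 'not_present']

by_key = match_keys(keys, audio)
for key, name in by_key.items():
    print(key, '->', name)
```

In practice the list of file names would come from os.listdir on the directory in question.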

Use of regular expression to exclude characters in file rename,Python?

I am trying to rename files so that they contain an ID followed by a -(int). The files generally come to me in this way but sometimes they come as 1234567-1(crop to bottom).jpg.
I have been trying to use the following code but my regular expression doesn't seem to be having any effect. The reason for the walk is that we have to handle large directory trees with many images.
def fix_length():
    for root, dirs, files in os.walk(path):
        for fn in files:
            path2 = os.path.join(root, fn)
            filename_zero, extension = os.path.splitext(fn)
            re.sub("[^0-9][-]", "", filename_zero)
            os.rename(path2, filename_zero + extension)
fix_length()
I have inserted print statements for filename_zero before and after the re.sub line and I am getting the same result (i.e. 1234567-1(crop to bottom) not what I wanted)
This raises an exception as the rename is trying to create a file that already exists.
I thought perhaps adding the [-] in the regex was the issue but removing it and running again I would then expect 12345671.jpg but this doesn't work either. My regex is failing me or I have failed the regex.
Any insight would be greatly appreciated.
As a follow up, I have taken all the wonderful help and settled on a solution to my specific problem.
import os
import re
import shutil

path = r'C:\Archive'
errors = r'C:\Test\errors'
num_files = []
def best_sol():
    num_files = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            filename_zero, extension = os.path.splitext(fn)
            path2 = os.path.join(root, fn)
            ID = re.match(r'^\d{1,10}', fn).group()
            if len(ID) <= 7:
                if ID not in num_files:
                    num_files = []
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
                else:
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
            else:
                shutil.copy(path2, errors)
                os.remove(path2)
This code creates an ID based upon (up to) the first 10 numeric characters in the filename. I then use a list that stores the instances of this ID and use the length of the list to append a suffix. The first file will have a -1, the second a -2, etc.
I am only interested (or they should only be) in ID's with a length of 7 but allow to read up to 10 to allow for human error in labelling. All files with ID longer than 7 are moved to a folder where we can investigate.
Thanks for pointing me in the right direction.
re.sub() returns the altered string, but you ignore the return value.
You want to re-assign the result to filename_zero:
filename_zero = re.sub(r"[^\d-]", "", filename_zero)
I've corrected your regular expression as well; this removes anything that is not a digit or a dash from the base filename:
>>> re.sub(r'[^\d-]', '', '1234567-1(crop to bottom)')
'1234567-1'
Remember, strings are immutable, you cannot alter them in-place.
If all you want is the leading digits, plus optional dash-digit suffix, select the characters to be kept, rather than removing what you don't want:
filename_zero = re.match(r'^\d+(?:-\d)?', filename_zero).group()
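A short sketch of both points, using the file name from the question. The variable names cleaned and kept are illustrative only; the None check is an added precaution for names that do not start with a digit, in which case re.match returns None.

```python
import re

name = "1234567-1(crop to bottom)"

# The return value is discarded here, so `name` is unchanged.
re.sub(r"[^\d-]", "", name)

# Assigning the result gives the cleaned string.
cleaned = re.sub(r"[^\d-]", "", name)

# Alternative: keep only the leading digits plus an optional dash-digit suffix.
m = re.match(r"\d+(?:-\d)?", name)
kept = m.group() if m else None

print(name)     # 1234567-1(crop to bottom)
print(cleaned)  # 1234567-1
print(kept)     # 1234567-1
```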
new_filename = re.sub(r'^([0-9]+)-([0-9]+).*$', r'\g<1>-\g<2>', filename_zero)
Try using this regular expression instead, I hope this is how regular expressions work in Python, I don't use it often. You also appear to have forgotten to assign the value returned by the re.sub call to the filename_zero variable.
