Faster searching for specific dirs all over the drive in Python

I have a network disk with data: many directories, many files. On the disk there are some log directories named LOGS_XXX; inside them are various files and folders, including the folders I'm interested in, named YYYYFinal, where YYYY is the year of creation. I want to build a list of paths to those directories, but only if YYYY > 2017. One LOGS directory can contain more than one YYYYFinal folder, or none at all.
Here is the part of my code that searches for directories matching those conditions and builds the list:
path = path_to_network_drive

def findAllOutDirs(path):
    finalPathList = []
    for root, subdirs, files in os.walk(path):
        for d in subdirs:
            if d == "FINAL" or d == "Final":
                outPath = root + r"\{}".format(d)
                if ("LOGS" in outPath) and ("2018" in outPath or "2019" in outPath or "2020" in outPath):
                    finalPathList.append(outPath)
    return finalPathList
This code works: I do get the final list, but it takes a long time. Does anyone see a mistake or bad usage here, or know a better way to do this in Python?
Thanks!
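One possible speed-up, sketched under the assumption that the target folders really are named like 2018Final and always sit somewhere under a LOGS_* directory: match the year in the folder name with a regex, build paths with os.path.join, and prune matched folders from the walk so os.walk never lists their contents over the network.

import os
import re

# Assumed naming: a four-digit year followed by "Final", e.g. 2018Final.
YEAR_FINAL = re.compile(r"^(\d{4})Final$", re.IGNORECASE)

def findAllOutDirs(path, min_year=2018):
    finalPathList = []
    for root, subdirs, files in os.walk(path):
        in_logs = "LOGS" in root              # the interesting folders live under LOGS_*
        kept = []
        for d in subdirs:
            m = YEAR_FINAL.match(d)
            if m and in_logs and int(m.group(1)) >= min_year:
                finalPathList.append(os.path.join(root, d))
                continue                      # no need to descend into a matched folder
            kept.append(d)
        subdirs[:] = kept                     # prune the walk in place
    return finalPathList

On a network share most of the time goes into remote directory listings, so every branch the walk can skip saves round trips. If matched Final folders can themselves contain further Final folders, keep them in kept instead of pruning them.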

Related

Count number of files in end directories using Python

I have an image dataset archived in a tree structure, where the names of the folders at the different levels are the labels of these images, for example:
/Label1
    /Label2
        /image1
        /image2
    /Label3
        /image3
/Label4
    /image4
    /image5
How can I count the number of images in each end folder? In the above example, I want to know how many images there are in /Label1/Label2, /Label1/Label3 and /Label4.
I checked out the function os.walk(), but it seems there is no easy way to count the number of files in each individual end folder.
Could anyone help me? Thank you in advance.
You can do this with os.walk():
import os

c = 0
print(os.getcwd())
for root, directories, filenames in os.walk('C:/Windows'):
    for files in filenames:
        c = c + 1
print(c)
output:
125765
>>>
If you have multiple file formats in the sub-directories, you can use an or operator to check for jpeg and png extensions and only then increment the counter.
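A minimal sketch of that idea, assuming .jpg/.jpeg/.png as the extensions of interest and reusing the example path from above:

import os

IMAGE_EXTS = ('.jpg', '.jpeg', '.png')        # assumed image formats

c = 0
for root, directories, filenames in os.walk('C:/Windows'):
    for name in filenames:
        if name.lower().endswith(IMAGE_EXTS): # str.endswith accepts a tuple of suffixes
            c += 1
print(c)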
I checked out the function os.walk(), but it seems there is no easy way to count the number of files in each individual end folder.
Sure there is. It's just len(files).
I'm not 100% sure what you meant by "end directory", but if you wanted to skip directories that aren't leaves—that is, that have subdirectories—that's just as easy: if not dirs.
Finally, I don't know whether you wanted to print the counts out, store them in a dict with the paths as keys, total them up, or what, but those are all pretty trivial, so here's some code that does all three as examples:
total = 0
counts = {}
for root, dirs, files in os.walk(path):
    if not dirs:
        print(f'{root}: {len(files)}')
        counts[root] = len(files)
        total += len(files)

Getting data from different folders

I have two sets of json files stored in two separate folders (named firstdata and seconddata). I am trying to read all the files in those two folders and put them into two separate arrays. Here is the code I wrote:
directory = os.path.normpath("D:\Python\project")
for subdir, dir, file in os.walk(directory):
    if subdir == 'D:\Python\project\firstdata':
        for f in file:
            if f.endswith(".json"):
                fread = open(os.path.join(subdir, f), 'r')
                a = fread.next().replace('\n', '').split(',')
                for line in a:
                    b = line.replace('.', '').replace('\n', '').replace('"', '').split(': ')
                print "___________________________________________________________________"
                fread.close()
However, it skips the check if subdir == 'D:\Python\project\firstdata': and I get nothing at the end. Can anyone help?
You are interpreting things wrong. See the docs for os.walk.
The three variables in your for loop should be root, dirs, and files, in that order.
dirs and files are lists of the directories and files in the current directory, respectively. root is the current directory you are in.
Your subdir check never succeeds because you are using os.walk incorrectly: the first loop variable holds the path of the directory currently being walked, not a subdirectory name.
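A sketch of what the corrected loop might look like (Python 3 style; raw strings are used for the Windows paths because \f in a plain string literal is a form-feed escape, which is an additional reason the original comparison could never match):

import os

directory = os.path.normpath(r"D:\Python\project")
target = os.path.join(directory, "firstdata")

for root, dirs, files in os.walk(directory):
    if os.path.normpath(root) == target:      # compare the walk's root path, not a subdir name
        for name in files:
            if name.endswith(".json"):
                with open(os.path.join(root, name)) as fread:
                    first_line = fread.readline().strip()
                    fields = first_line.split(',')
                    # ...parse fields as in the original code...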

Navigating specific dirs in filter with os.walk

I am aware that I can remove dirs from os.walk using something along the lines of
for root, dirs, files in os.walk('/path/to/dir'):
    ignore = ['dir1', 'dir2']
    dirs[:] = [d for d in dirs if d not in ignore]
I want to do the opposite of this, so only keep the dirs in the list. I've tried a few variations but to no avail. Any pointers would be appreciated.
The dirs I am interested in are two levels down, so I have taken on the comments, created global variables for the sub-levels, and am using the following code.
Expected Functionality
for root, dirs, files in os.walk(global_subdir):
    keep = ['dir1', 'dir2']
    dirs[:] = [d for d in dirs if d in keep]
    for filename in files:
        print os.path.join(root, filename)
As said in the comments of a deleted answer:
As mentioned already, this doesn't work. The dirs in keep are two levels below the root; I'm guessing this is causing the problem.
The issue is that the directory one level above your required directory is not traversed, since it's not in your keep list, so the program never reaches your required directories.
The best way to solve this would be to start os.walk at the directory that is just one level above your required directory.
But maybe that is not possible (for example, the directories one level above the required ones are not known before traversing, or the required directories have different parents), and what you really want is just to avoid looping through the files of directories that are not in the keep list.
A solution would be to traverse all directories, but loop through the files only when root is in the keep list (or set, for better performance). Example:
keep = set(['required directory1', 'required directory2'])   # full root paths as reported by os.walk
for root, dirs, files in os.walk(global_subdir):
    if root in keep:
        for filename in files:
            print os.path.join(root, filename)
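If only the directory names are known rather than their full paths, a variation of the same idea is to compare the last path component instead (a sketch; the names and path here are placeholders):

import os

global_subdir = '/path/to/dir'                # placeholder for the question's variable
keep = {'dir1', 'dir2'}                       # directory names, not full paths

for root, dirs, files in os.walk(global_subdir):
    if os.path.basename(root) in keep:
        for filename in files:
            print(os.path.join(root, filename))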

Searching recursively for files to add to list, but if one type of file is found ignore other type

How can I search recursively for files to add to a list, but if one type of file is found ignore another type?
Here is my current code:
import os
import fnmatch
rootDir = "//path/to/top/level/directory"
ignore = ['ignoreThisDir','ignoreThisToo']
fileList = []
for dirpath, dirnames, files in os.walk(rootDir):
    for idir in ignore:
        if idir in dirnames:
            dirnames.remove(idir)
    for name in files:
        if fnmatch.fnmatch(name, 'A.csv') or fnmatch.fnmatch(name, 'B.csv'):
            fileList.append(os.path.join(dirpath, name))
Currently this code is partially working for me. It takes a top level directory and searches down recursively through the directory tree creating a list of directories and files within, removing the directories that I don't want the code to os.walk through.
But there is one extra step I can't work out.
If B.csv exists in a directory, I only want to append it, and not A.csv. But if B.csv is not found then I do want to append A.csv to my list of files.
My current code appends both.
If B.csv exists in a directory, I only want to append it and not A.csv. But if B.csv is not found then I do want to append A.csv to my list of files.
There are two ways to do this.
First, you can make two passes through the directory: first search for B.csv, then, only if it wasn't found, search for A.csv. Like this:
for name in files:
    if fnmatch.fnmatch(name, 'B.csv'):
        fileList.append(os.path.join(dirpath, name))
        break
else:
    for name in files:
        if fnmatch.fnmatch(name, 'A.csv'):
            fileList.append(os.path.join(dirpath, name))
            break
(If you've never seen a for…else before, the else part triggers if you finished the for loop without hitting a break—in other words, if you didn't find B.csv anywhere.)
Alternatively, you can remember that you found A.csv, but not add it until you know that you haven't found B.csv:
a = b = None
for name in files:
    if fnmatch.fnmatch(name, 'A.csv'):
        a = name
    elif fnmatch.fnmatch(name, 'B.csv'):
        b = name
        fileList.append(os.path.join(dirpath, name))
if a is not None and b is None:
    fileList.append(os.path.join(dirpath, a))
You can also combine the two approaches—break as soon as you find B.csv, and use a for…else followed by just if a is not None:.
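A sketch of that combination, meant to sit inside the os.walk loop and reuse the asker's fileList and dirpath:

a = None
for name in files:
    if fnmatch.fnmatch(name, 'B.csv'):
        fileList.append(os.path.join(dirpath, name))
        break                                 # B wins; stop looking
    elif fnmatch.fnmatch(name, 'A.csv'):
        a = name                              # remember A in case B never turns up
else:
    if a is not None:                         # loop finished without a break, so no B.csv
        fileList.append(os.path.join(dirpath, a))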
As a side note, you don't need fnmatch if all you're doing is checking for an exact match. It's only necessary when you're matching glob patterns, like '*.csv' or the like. So you can simplify this quite a bit:
files = set(files)
if 'B.csv' in files:
    fileList.append(os.path.join(dirpath, 'B.csv'))
elif 'A.csv' in files:
    fileList.append(os.path.join(dirpath, 'A.csv'))

Efficiently removing subdirectories in dirnames from os.walk

On a Mac, in Python 2.7, when walking through directories using os.walk, my script descends into 'apps', i.e. appname.app, since those are really just directories themselves. Later on in processing I am hitting errors when going through them. I don't want to go through them anyway, so for my purposes it would be best just to ignore those kinds of 'directories'.
So this is my current solution:
for root, subdirs, files in os.walk(directory, True):
    for subdir in subdirs:
        if '.' in subdir:
            subdirs.remove(subdir)
    # do more stuff
As you can see, the second for loop will run for every iteration of subdirs, which is unnecessary since the first pass removes everything I want to remove anyways.
There must be a more efficient way to do this. Any ideas?
You can do something like this (assuming you want to ignore directories containing '.'):
subdirs[:] = [d for d in subdirs if '.' not in d]
The slice assignment (rather than just subdirs = ...) is necessary because you need to modify the same list that os.walk is using, not create a new one.
Note that your original code is also buggy because you modify the list while iterating over it, which causes elements to be skipped.
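A small sketch to illustrate the difference the slice assignment makes (the placeholder path and the '.' filter are just for demonstration):

import os

directory = '/path/to/dir'                    # placeholder
for root, subdirs, files in os.walk(directory):
    # Rebinding the name has no effect on the walk; os.walk would still
    # descend into every subdirectory:
    #     subdirs = [d for d in subdirs if '.' not in d]
    # Mutating the list in place is what actually prunes the walk:
    subdirs[:] = [d for d in subdirs if '.' not in d]
    # ... do more stuff with root and files ...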
Perhaps this example from the Python docs for os.walk will be helpful. It works from the bottom up (deleting).
# Delete everything reachable from the directory named in "top",
# assuming there are no symbolic links.
# CAUTION: This is dangerous! For example, if top == '/', it
# could delete all your disk files.
import os
for root, dirs, files in os.walk(top, topdown=False):
    for name in files:
        os.remove(os.path.join(root, name))
    for name in dirs:
        os.rmdir(os.path.join(root, name))
I am a bit confused about your goal: are you trying to remove a directory subtree and encountering errors, or are you trying to walk a tree and just list plain file names (excluding directory names)?
I think all that is required is to remove the directory before iterating over it:
for root, subdirs, files in os.walk(directory, True):
    if '.' in subdirs:
        subdirs.remove('.')
    for subdir in subdirs:
        pass  # do more stuff
