I do atomistic modelling and use Python to analyze simulation results. To simplify work with a whole bunch of Python scripts used for different tasks, I decided to write a simple GUI to run the scripts from.
I have a (rather complex) directory structure beginning from some root (say ~/calc), and I want to populate a wx.TreeCtrl control with the directories containing calculation results, preserving their structure. A folder contains results if it contains a file with the .EXT extension. What I try to do is walk through the dirs from the root and check each dir for a .EXT file. When such a dir is reached, I add it and its ancestors to the tree:
def buildTree(self, rootdir):
    root = rootdir
    r = len(rootdir.split('/'))
    ids = {root: self.CalcTree.AddRoot(root)}
    for (dirpath, dirnames, filenames) in os.walk(root):
        for dirname in dirnames:
            fullpath = os.path.join(dirpath, dirname)
            if sum([s.find('.EXT') for s in filenames]) > -1 * len(filenames):
                ancdirs = fullpath.split('/')[r:]
                ad = rootdir
                for ancdir in ancdirs:
                    d = os.path.join(ad, ancdir)
                    ids[d] = self.CalcTree.AppendItem(ids[ad], ancdir)
                    ad = d
But this code ends up with many second-level nodes sharing the same name, and that's definitely not what I want. So I somehow need to check whether a node has already been added to the tree and, if it has, append the new node under the existing one, but I do not understand how this could be done. Could you please give me a hint?
Besides, the code contains 2 dirty hacks I'd like to get rid of:
I get the list of ancestor dirs by splitting the full path at '/' characters, and this is Linux-specific;
I detect whether a .EXT file is in the directory by searching for the extension in the strings of the filenames list, taking into account that s.find() returns -1 if the substring is not found.
Is there a way to make these chunks of code more readable?
First of all the hacks:
To get the path separator for whatever OS you're using, there is os.sep.
Use str.endswith() and the fact that in Python the empty list [] evaluates to False:
if [file for file in filenames if file.endswith('.EXT')]:
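For illustration, both fixes could be wrapped up as helpers; a minimal sketch (has_ext_file and ancestor_dirs are names I made up here, and any() is equivalent to the truthy-list test above):

import os

def has_ext_file(filenames):
    # True if any file in this directory ends with '.EXT'
    return any(f.endswith('.EXT') for f in filenames)

def ancestor_dirs(fullpath, rootdir):
    # portable replacement for fullpath.split('/')[r:]
    r = len(rootdir.split(os.sep))
    return fullpath.split(os.sep)[r:]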
In terms of getting them all nicely nested, you're best off doing it recursively. The skeleton would look something like the following. Please note this is untested and just provided to give you an idea of how to do it, so don't expect it to work exactly as is!
def buildTree(self, rootdir):
    rootId = self.CalcTree.AddRoot(rootdir)
    self.buildTreeRecursion(rootdir, rootId)

def buildTreeRecursion(self, dir, parentId):
    # Iterate over the entries in dir
    for entry in os.listdir(dir):
        fullpath = os.path.join(dir, entry)
        id = self.CalcTree.AppendItem(parentId, entry)
        if os.path.isdir(fullpath):
            self.buildTreeRecursion(fullpath, id)
Hope this helps!
The problem is to get all the file names under a particular directory that meet a particular condition, collected in a list.
We have a directory named "test_dir".
There, we have subdirectories "sub_dir_1", "sub_dir_2", "sub_dir_3",
and inside each subdirectory, we have some files.
sub_dir_1 has files ['test.txt', 'test.wav']
sub_dir_2 has files ['test_2.txt', 'test.wav']
sub_dir_3 has files ['test_3.txt', 'test_3.tsv']
What I want to get at the end of the day is a list of the "test.wav" paths that exist under "directory": ['sub_dir_1/test.wav', 'sub_dir_2/test.wav']. As you can see, the condition is to get every path of 'test.wav' under the mother directory.
mother_dir_name = "directory"
get_test_wav(mother_dir_name)
returns --> ['sub_dir_1/test.wav', 'sub_dir_2/test.wav']
EDITED
I have changed the direction of the problem.
We first have this list of file names
["sub_dir_1/test.wav","sub_dir_2/test.wav","abc.csv","abc.json","sub_dir_3/test.json"]
From this list I would like to get a new list that excludes every path containing "test.wav", like below:
["abc.csv","abc.json","sub_dir_3/test.json"]
You can use glob patterns for this. Using pathlib,
from pathlib import Path
mother_dir = Path("directory")
list(mother_dir.glob("sub_dir_*/*.wav"))
Notice that I was fairly specific about which subdirectories to check: anything starting with "sub_dir_". You can change that pattern as needed to fit your environment. Note that glob() yields Path objects; call str() on them if you need plain strings.
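For the edited version of the question (filtering an existing list of names rather than scanning the disk), a plain list comprehension would do; a minimal sketch using the list from the edit:

names = ["sub_dir_1/test.wav", "sub_dir_2/test.wav",
         "abc.csv", "abc.json", "sub_dir_3/test.json"]
filtered = [n for n in names if not n.endswith("test.wav")]
print(filtered)  # ['abc.csv', 'abc.json', 'sub_dir_3/test.json']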
Use os.walk():
import os

def get_test_wav(folder):
    found = []
    for root, folders, files in os.walk(folder):
        for file in files:
            if file == "test.wav":
                found.append(os.path.join(root, file))
    return found
Or a list comprehension approach, using os.path.join instead of a hard-coded backslash so it works on any OS:

import os

def get_test_wav(folder):
    return [os.path.join(root, "test.wav")
            for root, dirs, files in os.walk(folder)
            if "test.wav" in files]
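Either version can be called the same way; for the example tree above (note that os.walk() prefixes each result with whatever you pass in, and the separator depends on the OS):

print(get_test_wav("test_dir"))
# e.g. ['test_dir/sub_dir_1/test.wav', 'test_dir/sub_dir_2/test.wav']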
I think this might help you: How can I search sub-folders using glob.glob module?
The main way to make a list of files in a folder (so you can work with it later) is:

file_path = os.path.join(motherdirectory, 'subdirectory')
list_files = glob.glob(file_path + "/*.wav")

Just check that link to see how you can join all subdirectories in a folder.
This will also give you all the files in subdirectories that end in .wav:

os.chdir(motherdirectory)
glob.glob('**/*.wav', recursive=True)
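If you'd rather not change the working directory with os.chdir(), the same recursive pattern can be rooted explicitly; a sketch, assuming Python 3.5+ where glob supports recursive=True:

import glob
import os

motherdirectory = '.'  # your top-level directory
wav_files = glob.glob(os.path.join(motherdirectory, '**', '*.wav'), recursive=True)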
I need to scan a directory with hundreds of GB of data, which has structured parts (which I want to scan) and non-structured parts (which I don't want to scan).
Reading up on the os.walk function, I see that I can use a set of criteria to exclude or include certain directory names or patterns.
For this particular scan I would need specific include/exclude criteria per level in the directory tree. For example:
In a root directory, imagine there are two useful directories, 'Dir A' and 'Dir B', and a non-useful trash directory 'Trash'. In Dir A there are two useful subdirectories, 'Subdir A1' and 'Subdir A2', and a non-useful 'SubdirA Trash' directory; in Dir B there are two useful subdirectories, 'Subdir B1' and 'Subdir B2', plus a non-useful 'SubdirB Trash' subdirectory. It would look something like this:
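root
├── Dir A
│   ├── Subdir A1
│   ├── Subdir A2
│   └── SubdirA Trash
├── Dir B
│   ├── Subdir B1
│   ├── Subdir B2
│   └── SubdirB Trash
└── Trash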
I need to have a specific criteria list for each level, something like this:
level1DirectoryCriteria = {"Dir A", "Dir B"}
level2DirectoryCriteria = {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"}
The only ways I can think of to do this are quite obviously non-Pythonic, using complex and lengthy code with a lot of variables and a high risk of instability. Does anyone have any ideas for how to solve this problem? If successful, it could save the code several hours of running time.
You could try something like this:
to_scan = {'set', 'of', 'good', 'directories'}
for dirpath, dirnames, filenames in os.walk(root):
    dirnames[:] = [d for d in dirnames if d in to_scan]
    # whatever you wanted to do in this directory
This solution is simple, but it fails if you want to scan directories with a certain name when they appear under one parent directory and not another. Another option would be a dictionary that maps directory names to lists or sets of whitelisted or blacklisted subdirectories, as sketched below.
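A rough sketch of the dictionary idea, keyed on the parent directory's basename (the keys are just the example directories from the question; directories whose parent has no entry are pruned entirely):

import os

root = 'root'  # the top of the example tree above

# hypothetical whitelist: parent directory name -> subdirectories worth entering
allowed = {
    'root':  {'Dir A', 'Dir B'},
    'Dir A': {'Subdir A1', 'Subdir A2'},
    'Dir B': {'Subdir B1', 'Subdir B2'},
}

for dirpath, dirnames, filenames in os.walk(root):
    parent = os.path.basename(dirpath)
    dirnames[:] = [d for d in dirnames if d in allowed.get(parent, set())]
    # process this directory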
Edit: We can use dirpath.count(os.path.sep) to determine depth.

root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
sets_by_level = [{'root', 'level'}, {'one', 'deep'}]

for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    if depth < len(sets_by_level):  # levels beyond the list are left unfiltered
        dirnames[:] = [d for d in dirnames if d in sets_by_level[depth]]
    # process this directory
Not a direct answer concerning os.walk, but just a suggestion: since you're scanning the directories anyway, and you obviously know the trash directories from the other directories, you could also place a dummy file in each trash directory, skip_this_dir or something. When you iterate over directories and create the file list, you check for the presence of the skip_this_dir file, something like if 'skip_this_dir' in filenames: continue, and move on to the next iteration.
This may not involve using os.walk parameters, but it does make the programming task a little easier to manage, without requiring a lot of 'messy' code with tons of conditionals and lists of includes/excludes. It also makes the script easier to reuse, since you don't need to change any code; you just place the dummy file in the directories you need to skip. A minimal sketch follows.
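A minimal sketch of the marker-file idea (skip_this_dir is an arbitrary name; anything works as long as it matches the files you create):

import os

root = '.'  # wherever the scan starts

for dirpath, dirnames, filenames in os.walk(root):
    if 'skip_this_dir' in filenames:
        dirnames[:] = []  # don't descend any further into a trash subtree
        continue          # and skip the trash directory itself
    # process this directory's files here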
By using root.count(os.path.sep) I was able to create specific instructions on what to include/exclude at each level in the structure. It looks something like this:
import os

root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
directoriesToIncludedByLevel = [{"criteriaString", "criteriaString", "criteriaString", "criteriaString"},  # Level 0
                                {"criteriaString", "criteriaString", "criteriaString"},  # Level 1
                                set(),  # Level 2 (set(), since {} would be an empty dict)
                                ]
directoriesToExcludedByLevel = [set(),  # Level 0
                                set(),  # Level 1
                                {"criteriaString"},  # Level 2
                                ]

for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    # Filter dirnames in place, depending on whether this level uses include or exclude criteria
    if depth == 2:  # the level where we define which directories to EXclude
        dirnames[:] = [d for d in dirnames if d not in directoriesToExcludedByLevel[depth]]
    elif depth < 2:  # the levels where we define which directories to INclude
        dirnames[:] = [d for d in dirnames if d in directoriesToIncludedByLevel[depth]]
I was looking for a solution similar to the OP's. I needed to scan the subfolders and to exclude any folder whose path contained 'trash'.
My solution was to use the string find() method. Here's how I used it:
for (dirpath, dirnames, filenames) in os.walk(your_path):
    if dirpath.find('trash') != -1:
        continue  # 'trash' appears somewhere in this path, so skip it
    do_stuff()
If 'trash' is found, find() returns its index (which can be 0); otherwise it returns -1. That is why the test compares against -1 rather than 0.
You can find more information on the find() method here:
https://www.tutorialspoint.com/python/string_find.htm
I am new to Python and, although I have been reading and enjoying it so far, have ∂ experience, where ∂ → 0.
I have a folder tree, and each folder at the bottom of the tree's branches contains many files. For me, this whole tree is the input.
I would like to perform several steps of analysis (I believe these are irrelevant to this question), the results of which I would like returned in a tree identical to that of the input, called output.
I have two ideas:
Read through each folder recursively using os.walk() and perform the analysis on each file, or
Use a function such as shutil.copytree() and perform the analysis somewhere along the way. So actually, I do not want to COPY the tree at all, but rather replicate its structure with new files. I thought this might be a kind of 'hack', as I do actually want to use each input file to create the output file; so instead of a copy command, I need an analyse command. The rest should remain unchanged, as far as my imagination allows me to understand.
I have little experience with option 1 and zero experience with option 2.
For smaller trees I have until now been hard-coding the paths, which has become too time-consuming.
I have also seen more mundane ways, such as using glob to first find all the files I want and work on them, but I don't see how that would give me a shortcut for recreating the input tree for my output.
My attempt at option 1 looks like this:
import os

for root, dirs, files in os.walk('/Volumes/Mac OS Drive/Data/input/'):
    # I have no actual need to print these, it just helps me see what is happening
    print root, "\n"
    print dirs, "\n"
    # This is my actual work going on
    [analysis_function(name) for name in files]
However, I fear this is going to be very slow. I would also like to do some kind of filtering on the files; for example, the .DS_Store files created in Mac trees are included in the results above. I would attempt to use the fnmatch module to filter only the files I want.
I have seen in the copytree function that it is possible to ignore files according to a pattern, which would be helpful; however, I do not understand from the documentation where I could plug my analysis function in on each file.
You can use both options: you could provide a custom copy_function that performs analysis instead of the default shutil.copy2 to shutil.copytree() (it is more of a hack), or you could use os.walk() to have finer control over the process.
You don't need to create parent directories manually either way: copytree() creates them for you, and os.makedirs(output_dir) can create them if you use os.walk():
#!/usr/bin/env python2
import fnmatch
import itertools
import os

ignore_dir = lambda d: d in ('.git', '.svn', '.hg')

src_dir = '/Volumes/Mac OS Drive/Data/input/'  # source directory
dst_dir = '/path/to/destination/'  # destination directory
for root, dirs, files in os.walk(src_dir):
    for input_file in fnmatch.filter(files, "*.input"):  # for each input file
        output_file = os.path.splitext(input_file)[0] + '.output'
        output_dir = os.path.join(dst_dir, root[len(src_dir):])
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)  # create destination directories
        analyze(os.path.join(root, input_file),  # perform analysis
                os.path.join(output_dir, output_file))
    # don't visit ignored subtrees
    dirs[:] = itertools.ifilterfalse(ignore_dir, dirs)
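For completeness, the copytree() route might look something like the sketch below. Note this assumes Python 3.2+, where shutil.copytree() accepts a copy_function argument, plus the same hypothetical analyze(src, dst) as above; also, the destination file keeps the source name, so any renaming to .output would have to happen inside the callback:

import shutil

def analyze_copy(src, dst):
    # hypothetical: run the analysis on src and write the result to dst;
    # copytree() calls this once for every file it encounters
    analyze(src, dst)

shutil.copytree(src_dir, dst_dir, copy_function=analyze_copy)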
I've got two tasks:
I've set up my digital library in the format of the Dewey Decimal Classification, so I've got a 3-deep hierarchy of 10 + 100 + 1000 folders, with directories sometimes going a little deeper. This library structure contains my "books", which I would like to list in a catalog (perhaps a searchable text document). It would be preferable, though not absolutely necessary, to view the parent directory name in a separate column next to each "book".
The problem is that some of the "books" in my library are folders that stand alone as items. I planned ahead when I devised this system and made it so that each item in my library contains a tag in []s holding, for instance, the author name; so the idea is to perform a recursive listing of all of this, but end each recursion branch when it encounters anything, directory or file, with a [ in its name.
How might I go about this? I know a bit of Python (which is what I originally used to create the library structure), and since this is on an external hard drive, I can do this in either Windows or Linux. My rough idea was to perform some sort of recursive listing that checks the name of each directory or file for a [, and if found, stops and adds it (along with the name of the parent directory) to a list. I don't have any idea where to start.
The answer is based on this, where:
dirName: The next directory it found.
subdirList: A list of sub-directories in the current directory.
fileList: A list of files in the current directory.
We have to modify subdirList in place, since a plain reassignment would not affect the walk. One way is a slice assignment with a list comprehension; here we instead delete by index, iterating with enumerate over a copy of the list in reverse so that deletions don't shift the indices of the elements we have yet to visit (see the one-line alternative after the code).
I haven't tried it so don't trust this 100%.
# Import the os module, for the os.walk function
import os

# Set the directory you want to start from
rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fname in fileList:
        print('\t%s' % fname)
    for i, elem in reversed(list(enumerate(subdirList[:]))):
        if "[" in elem:
            del subdirList[i]
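In fact, a list comprehension can prune in place too, via slice assignment; this one-liner is equivalent to the reversed-enumerate loop above:

subdirList[:] = [d for d in subdirList if "[" not in d]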
I am writing and testing code on XP SP3 with Python 2.7 and running it on Server 2003 with Python 2.7. My dir structure will look something like this:
d:\ssptemp
d:\ssptemp\ssp9-1
d:\ssptemp\ssp9-2
d:\ssptemp\ssp9-3
d:\ssptemp\ssp9-4
d:\ssptemp\ssp10-1
d:\ssptemp\ssp10-2
d:\ssptemp\ssp10-3
d:\ssptemp\ssp10-4
Inside each directory there are one or more files with "IWPCPatch" as part of the filename.
Inside one of these files (one in each dir) there will be the line 'IWPCPatchFinal_a.wsf'.
What I do is:
1) os.walk across all dirs under d:\ssptemp
2) find all files with 'IWPCPatch' in the filename
3) check the contents of each such file for 'IWPCPatchFinal_a.wsf'
4) if found, add the path of that file to a list
My problem is that on my XP machine it works fine: if I print out the list, I get several items in the order listed above.
When I move it to the Server 2003 machine, I get the same contents in a different order: ssp10-X first, then ssp9-X. And this is causing me issues in a different area of the program.
I can see from my output that os.walk begins in the wrong order, but I don't know why that is occurring.
import os
import fileinput

print "--createChain--"
listOfFiles = []
for path, dirs, files in os.walk(r'd:\ssptemp'):  # raw string, so the backslash is not an escape
    print "parsing dir(s)"
    for file in files:
        newFile = os.path.join(path, file)
        if newFile.find('IWPCPatch') >= 0:
            for line in fileinput.FileInput(newFile):
                if "IWPCPatchFinal_a.wsf" in line:
                    listOfFiles.append(newFile)
                    print "Added", newFile

for item in listOfFiles:
    print "list item", item
for path, dirs, files in os.walk(r'd:\ssptemp'):
    # sort dirs and files in place so the walk proceeds alphabetically
    dirs.sort()
    files.sort()
    print "parsing dir(s)"
    # ...
The order of directories within os.walk is not necessarily alphabetical (I believe it actually depends on how the entries are stored in the directory structure on the filesystem). It will likely be stable on the same exact directory (on the same filesystem) if you don't change the directory contents (i.e., repeated calls will return the same order), but the order is not guaranteed to be alphabetical.
If you want an ordered list of filenames, you will have to build the list and then sort it yourself, for example as below.
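With the listOfFiles built in the question's code, a final sort makes the result deterministic on both machines:

listOfFiles.sort()  # alphabetical, platform-independent order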