Python: Identifying numerically named folders in a folder structure - python

I have the below function, which walks the root of a given directory, grabs all subdirectories and places them into a list. This part works, sort of.
The objective is to determine the highest (largest number) numerically named folder.
Assuming that the folder contains only numerically named folders, and does not contain alphanumeric folders or files, I'm good. However, if a file or folder is present that is not numerically named, I encounter issues because the script seems to collect all subdirectories and files, and loads everything into the list.
I need to just find those folders whose naming is numeric, and ignore anything else.
Example folder structure for c:\Test
\20200202\
\20200109\
\20190308\
\Apples\
\Oranges\
New Document.txt
This works to walk the directory but puts everything in the list, not just the numeric subfolders.
#Example code
import os
from pprint import pprint
files=[]
MAX_DEPTH = 1
folders = ['C:\\Test']
for stuff in folders:
    for root, dirs, files in os.walk(stuff, topdown=True):
        for subdirname in dirs:
            files.append(os.path.join(subdirname))
            #files.append(os.path.join(root, subdirname)) will give full directory
            #print("there are", len(files), "files in", root) will show counts of files per directory
        if root.count(os.sep) - stuff.count(os.sep) == MAX_DEPTH - 1:
            del dirs[:]
pprint(max(files))
Current Result of max(files):
New Document.txt
Desired Output:
20200202
What I have tried so far:
I've tried catching each element before I add it to the list, seeing if the string of the subdirname can be converted to int, and then adding it to the list. This fails to convert the numeric subdirnames to an int, and somehow (I don't know how) the New Document.txt file gets added to the list.
files=[]
MAX_DEPTH = 1
folders = ['C:\\Test']
for stuff in folders:
    for root, dirs, files in os.walk(stuff, topdown=True):
        for subdirname in dirs:
            try:
                subdirname = int(subdirname)
                print("Found subdir named " + subdirname + " type: " + type(subdirname))
                files.append(os.path.join(subdirname))
            except:
                print("Error converting " + str(subdirname) + " to integer")
                pass
            #files.append(os.path.join(root, subdirname)) will give full directory
            #print("there are", len(files), "files in", root) will show counts of files per directory
        if root.count(os.sep) - stuff.count(os.sep) == MAX_DEPTH - 1:
            del dirs[:]
return (input + "/" + max(files))
I've also tried appending everything to the list and then creating a second list (ie, without the try/except) using the below, but I wind up with an empty list. I'm not sure why, and I'm not sure where/how to start looking. Using 'type' on the list before applying the following shows that everything in the list is a str type.
list2 = [x for x in files if isinstance(x,int) and not isinstance(x,bool)]

I'm going to go ahead and answer my own question here:
Changing the method entirely helped, and made it significantly faster, and simpler.
#the find_newest_date function looks for a folder with the largest number and assumes that is the newest data
def find_newest_date(input):
    intlistfolders = []
    list_subfolders_with_paths = [f.name for f in os.scandir(input) if f.is_dir()]
    for x in list_subfolders_with_paths:
        try:
            intval = int(x)
            intlistfolders.append(intval)
        except:
            pass
    return (input + "/" + str(max(intlistfolders)))
Explanation:
scandir is about 3x faster than walk.
scandir also allows the use of f.name to pull out just the folder names, or f.path to get paths.
So, use scandir to load up the list with all the subdirs.
Iterate over the list, and try to convert each value to an integer. I don't know why it wouldn't work in the earlier example, but it works in this case.
The first part of the try statement converts to an integer. If conversion fails, the except clause is run, and 'pass' is essentially a null statement. It does nothing.
Then, finally, join the input directory with the string representation of the maximum numeric value (i.e. the most recently dated folder in this case).
The function is called with:
folder_named_path = find_newest_date("C:\\Test") or something similar.
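A likely reason the earlier attempts misbehaved, judging only from the code in the question: the loop header for root, dirs, files in os.walk(...) rebinds the name files to os.walk's own list of file names on every iteration, so New Document.txt is already in the list being appended to. In the second attempt, concatenating a str with the converted int raises TypeError, which the bare except swallows, and isinstance(x, int) is always False for strings, hence the empty list. The following is a minimal sketch of the scandir approach that keeps the names as strings and compares them numerically; the function name find_newest_numeric_folder is made up for illustration.

```python
import os


def find_newest_numeric_folder(parent):
    """Return the path of the subfolder with the largest all-digit name, or None."""
    numeric_names = [
        entry.name
        for entry in os.scandir(parent)
        # keep directories only, and only names made up entirely of decimal digits
        if entry.is_dir() and entry.name.isdecimal()
    ]
    if not numeric_names:
        return None  # nothing numeric found; max() would raise on an empty list
    # compare by integer value, but keep the original name (leading zeros survive)
    return os.path.join(parent, max(numeric_names, key=int))
```

Returning None for the empty case is a design choice of this sketch; the original function would raise ValueError from max() instead.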

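Try matching dirs with a regular expression. num = r"[0-9]+" is your regular expression. Something like re.findall(num, subdirname) returns a list of the substrings that consist of one or more digits. To accept only names that are digits from start to end, re.fullmatch(num, subdirname) is the stricter check.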
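As a rough sketch of the regular-expression idea, the snippet below filters a list of names with re.fullmatch so that partially numeric names such as 2020a are rejected. The helper name numeric_names and the sample list are assumptions for illustration; in practice the names would come from os.scandir or os.listdir.

```python
import re

# one or more ASCII digits, and nothing else (enforced by fullmatch below)
NUMERIC_NAME = re.compile(r"[0-9]+")


def numeric_names(names):
    """Keep only the names that consist entirely of digits."""
    return [name for name in names if NUMERIC_NAME.fullmatch(name)]


# hypothetical directory listing, mirroring the example folder structure
candidates = ["20200202", "20200109", "20190308", "Apples", "Oranges",
              "New Document.txt", "2020a"]
newest = max(numeric_names(candidates), key=int)
print(newest)  # prints 20200202
```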

Related

Return number of folders in directory and subdirectory

I have a directory similar to the example below, which contains the following folders:
C:\Users\xx\Desktop\New folder\New folder\New folder\QGIS
C:\Users\xx\Desktop\New folder\New folder\New folder (2)\1- QGIS
C:\Users\xx\Desktop\New folder\New folder\New folder (4)\1.0 QGIS
C:\Users\xx\Desktop\New folder\New folder\QGIS
I wish to find how many folders have names ending in QGIS, along with their paths.
My current script is below. It successfully gives me the path of every folder whose name ends in QGIS, but it counts only the folders named exactly "QGIS" and doesn't count "1.0 QGIS" or "1- QGIS". What am I missing?
import os
rootfolder = r'C:\Users\xx\Desktop\New folder'
isfile = os.path.isfile
join = os.path.join
i=0
with open("folderpath.txt", 'w') as f:
    for root, dirs, files in os.walk(rootfolder, topdown=False):
        i+= dirs.count('*QGIS')
        for name in dirs:
            if name.endswith("QGIS"):
                f.write(os.path.join(root, name)+'\n')
    f.write(str((sum(dirs.count('QGIS') for _, dirs, _ in os.walk(rootfolder)))))
The list.count method does not support any concept of a wildcard -- it just looks for how many elements are equal to the value that is given as an argument. So your line
i+= dirs.count('*QGIS')
is looking for directories which are literally called *QGIS, rather than directories that end with QGIS.
The fix here should be easy because the code is already successfully printing out the correct paths; it is just not counting them correctly. So all that you need to do is to remove the above statement, and instead just add 1 in the place where you print out each path, which is already subject to the correct if condition inside the loop over directory names.
for root, dirs, files in os.walk(rootfolder, topdown=False):
    for name in dirs:
        if name.endswith("QGIS"):
            f.write(os.path.join(root, name)+'\n')
            i += 1
You already correctly initialise i=0 before the start of the loop.
At the end, just do:
print(i)
and get rid of that expression involving sum where you walk through all the directories a second time.
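Putting those pieces together, a self-contained version might look like the sketch below. The function names find_dirs_with_suffix and write_report are invented for this example; the logic is the same single walk with an endswith test, with the count taken from the number of matches.

```python
import os


def find_dirs_with_suffix(rootfolder, suffix):
    """Return full paths of all directories under rootfolder whose name ends with suffix."""
    matches = []
    for root, dirs, _files in os.walk(rootfolder):
        for name in dirs:
            if name.endswith(suffix):
                matches.append(os.path.join(root, name))
    return matches


def write_report(rootfolder, suffix, report_path):
    """Write one matching path per line, then the total count on the last line."""
    matches = find_dirs_with_suffix(rootfolder, suffix)
    with open(report_path, "w") as f:
        for path in matches:
            f.write(path + "\n")
        f.write(str(len(matches)))
    return len(matches)
```

A call such as write_report(r'C:\Users\xx\Desktop\New folder', "QGIS", "folderpath.txt") would reproduce the output file from the question, with the count no longer limited to folders named exactly QGIS.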
import os
print( len( list( filter(None, map(lambda x: x[0] if x[0].endswith('QGIS') else None,os.walk('.'))))))
A shorter form, but not too readable ;)
The "map" goes through the results of os.walk, returns the folder name if it ends with 'QGIS' and None if not.
The "filter", called with None as its first argument, keeps only the truthy values from map's results, which drops every None.
The "list" is needed, because both map and filter are returning an iterator object, which has no length, but the "list" has.
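For comparison, the same count can be written with a generator expression, which some readers may find easier to follow. This is only a sketch with a made-up function name; like the one-liner, it tests the dirpath of every directory visited by os.walk, so the starting directory itself is counted if its path ends with the suffix.

```python
import os


def count_walked_dirs_ending_with(top, suffix):
    """Count the directories visited by os.walk(top) whose path ends with suffix."""
    return sum(
        1
        for dirpath, _dirs, _files in os.walk(top)
        if dirpath.endswith(suffix)
    )
```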

Count the number of folders in a directory and subdirectories

I've got a script that will accurately tell me how many files are in a directory and the subdirectories within. However, I'm also looking to identify how many folders there are within the same directory and its subdirectories...
My current script:
import os, getpass
from os.path import join, getsize
user = 'Copy of ' + getpass.getuser()
path = "C://Documents and Settings//" + user + "./"
folder_counter = sum([len(folder) for r, d, folder in os.walk(path)])
file_counter = sum([len(files) for r, d, files in os.walk(path)])
print ' [*] ' + str(file_counter) + ' Files were found and ' + str(folder_counter) + ' folders'
This code gives me the print out of: [*] 147 Files were found and 147 folders.
Meaning that the folder_counter isn't counting the right elements. How can I correct this so the folder_counter is correct?
Python 2.7 solution
For a single directory you can also do:
import os
print len(os.walk('dir_name').next()[1])
which does not walk the whole tree and returns the number of directories directly inside the 'dir_name' directory.
Python 3.x solution
Since many people just want an easy and fast solution, without actually understanding the solution, I edit my answer to include the exact working code for Python 3.x.
So, in Python 3.x we have the built-in next() function instead of the .next() method. Thus, the above snippet becomes:
import os
print(len(next(os.walk('dir_name'))[1]))
where dir_name is the directory whose subdirectories you want to count.
I think you want something like:
import os
files = folders = 0
for _, dirnames, filenames in os.walk(path):
  # ^ this idiom means "we won't be using this value"
    files += len(filenames)
    folders += len(dirnames)
print "{:,} files, {:,} folders".format(files, folders)
Note that this only iterates over os.walk once, which will make it much quicker on paths containing lots of files and directories. Running it on my Python directory gives me:
30,183 files, 2,074 folders
which exactly matches what the Windows folder properties view tells me.
Note that your current code calculates the same number twice because the only change is renaming one of the returned values from the call to os.walk:
folder_counter = sum([len(folder) for r, d, folder in os.walk(path)])
#                         ^ here            ^ and here
file_counter = sum([len(files) for r, d, files in os.walk(path)])
#                       ^ vs. here       ^ and here
Despite that name change, you're counting the same value (i.e. in both it's the third of the three returned values that you're using)! Python functions do not know what names (if any at all; you could do print list(os.walk(path)), for example) the values they return will be assigned to, and their behaviour certainly won't change because of it. Per the documentation, os.walk returns a three-tuple (dirpath, dirnames, filenames), and the names you use for that, e.g. whether:
for foo, bar, baz in os.walk(...):
or:
for all_three in os.walk(..):
won't change that.
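For readers on Python 3, the single-pass count can be wrapped in a small function. This is a sketch, and the name count_files_and_folders is an assumption of the example; the point it illustrates is that the position in the tuple (dirpath, dirnames, filenames), not the variable name, decides what is counted.

```python
import os


def count_files_and_folders(path):
    """Walk path once and return (number of files, number of folders) below it."""
    files = folders = 0
    for _dirpath, dirnames, filenames in os.walk(path):
        files += len(filenames)    # third element: file names in this directory
        folders += len(dirnames)   # second element: subdirectory names in this directory
    return files, folders
```

It could be used as files, folders = count_files_and_folders(path) followed by print("{:,} files, {:,} folders".format(files, folders)).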
If interested only in the number of folders in /input/dir (and not in the subdirectories):
import os
folder_count = 0  # type: int
input_path = "/path/to/your/input/dir"  # type: str
for folders in os.listdir(input_path):  # loop over all entries (files and folders)
    if os.path.isdir(os.path.join(input_path, folders)):  # if it's a directory
        folder_count += 1  # increment counter
print("There are {} folders".format(folder_count))
>>> import os
>>> len(list(os.walk('folder_name')))
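According to the os.walk documentation, the first element of each yielded tuple, dirpath, runs through every directory in the tree, so this count includes folder_name itself as well as all of its subdirectories.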

Python: rename all files in a folder using numbers that file contain

I want to write a little script for managing a bunch of files I got. Those files have complex and different name but they all contain a number somewhere in their name. I want to take that number, place it in front of the file name so they can be listed logically in my filesystem.
I got a list of all those files using os.listdir, but I'm struggling to find a way to locate the numbers in those file names. I've looked at regular expressions, but I'm unsure if that's the right way to do this!
example:
import os
files = os.listdir('c:\\folder')
files
['xyz3.txt' , '2xyz.txt', 'x1yz.txt']
So basically, what I ultimately want is:
1xyz.txt
2xyz.txt
3xyz.txt
where I am stuck so far is to find those numbers (1,2,3) in the list files
This (untested) snippet should show the regexp approach. The search method of compiled patterns is used to look for the number. If found, the number is moved to the front of the file name.
import os, re
NUM_RE = re.compile(r'\d+')
for name in os.listdir('.'):
    match = NUM_RE.search(name)
    if match is None or match.start() == 0:
        continue # no number or number already at start
    newname = match.group(0) + name[:match.start()] + name[match.end():]
    print 'renaming', name, 'to', newname
    #os.rename(name, newname)
If this code is used in production and not as homework assignment, a useful improvement would be to parse match.group(0) as an integer and format it to include a number of leading zeros. That way foo2.txt would become 02foo.txt and get sorted before 12bar.txt. Implementing this is left as an exercise to the reader.
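One possible take on that exercise is sketched below. The function name padded_name and the default width of 2 are assumptions of this example, not part of the answer. Unlike the loop above, it also re-pads names whose number is already at the front, so that 2xyz.txt becomes 02xyz.txt and sorts consistently with the rest.

```python
import re

NUM_RE = re.compile(r'\d+')


def padded_name(name, width=2):
    """Move the first number in name to the front, zero-padded to at least width digits."""
    match = NUM_RE.search(name)
    if match is None:
        return name  # no number: leave the name unchanged
    number = int(match.group(0))
    rest = name[:match.start()] + name[match.end():]
    # nested format field: pad the integer with leading zeros up to `width`
    return f"{number:0{width}d}{rest}"
```

The result would then be passed to os.rename(name, padded_name(name)) once the printed names look right.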
Assuming that the numbers in your file names are integers (untested code):
import os

def rename(dirpath, filename):
    # positions of every digit character in the file name
    inds = [i for i, char in enumerate(filename) if char in '1234567890']
    if not inds:
        return  # no number in this name, nothing to rename
    ints = filename[min(inds):max(inds)+1]
    newname = ints + filename[:min(inds)] + filename[max(inds)+1:]
    os.rename(os.path.join(dirpath, filename), os.path.join(dirpath, newname))

def renameFilesInDir(dirpath):
    """ Apply your renaming scheme to all files in the directory specified by dirpath """
    # take only the top level of the walk; the recursion below handles subdirectories
    dirpath, dirnames, filenames = next(os.walk(dirpath))
    for filename in filenames:
        rename(dirpath, filename)
    for dirname in dirnames:
        renameFilesInDir(os.path.join(dirpath, dirname))
Hope this helps

Python automated file names

I want to automate the file name used when saving a spreadsheet using xlwt. Say there is a sub directory named Data in the folder the python program is running. I want the program to count the number of files in that folder (# = n). Then the filename must end in (n+1). If there are 0 files in the folder, the filename must be Trial_1.xls. This file must be saved in that sub directory.
I know the following:
import xlwt, os, os.path
n = len([name for name in os.listdir('.') if os.path.isfile(name)])
counts the number of files in the same folder.
a = n + 1
filename = "Trial_" + str(a) + ".xls"
book.save(filename)
this will save the file properly named in to the same folder.
My question is how do I extend this in to a sub directory? Thanks.
In os.listdir('.'), the . refers to the current working directory, which is normally the directory the script is run from. Change the . to point to the subdirectory you are interested in.
You should give it the full path name from the root of your file system; otherwise it will be relative to the current working directory. This might not be what you want, especially if you need to refer to the subdirectory from another program.
You also need to include the subdirectory in the path stored in the filename variable.
To make life easier, just set the full path to a variable and refer to it when needed.
TARGET_DIR = '/home/me/projects/data/'
n = sum(1 for f in os.listdir(TARGET_DIR) if os.path.isfile(os.path.join(TARGET_DIR, f)))
new_name = "{}Trial_{}.xls".format(TARGET_DIR,n+1)
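The same idea can be packaged as a helper that builds the next file name with os.path.join, which avoids depending on a trailing slash in the directory string. This is a sketch: the function name next_trial_path and its prefix and ext parameters are made up here, and the commented usage assumes the Data folder sits next to the script, as the question describes.

```python
import os


def next_trial_path(target_dir, prefix="Trial_", ext=".xls"):
    """Build the path for the next file, numbered one past the current file count."""
    n = sum(
        1
        for name in os.listdir(target_dir)
        # listdir returns bare names, so join them with the directory before testing
        if os.path.isfile(os.path.join(target_dir, name))
    )
    return os.path.join(target_dir, "{}{}{}".format(prefix, n + 1, ext))


# Assumed usage, with `book` being the xlwt workbook from the question:
# data_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "Data")
# book.save(next_trial_path(data_dir))
```

Note that counting files only gives a free name as long as nothing in the folder has been deleted or named differently; the sketch does not check whether the target already exists.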
You actually want glob:
from glob import glob
DIR = 'some/where/'
existing_files = glob(DIR + '*.xls')
filename = DIR + 'stuff--%d--stuff.xls' % (len(existing_files) + 1)
Since you said Burhan Khalid's answer "Works perfectly!" you should accept it.
I just wanted to point out a different way to compute the number. The way you are doing it works, but if we imagine you were counting grains of sand or something, it would use way too much memory. Here is a more direct way to get the count:
n = sum(1 for name in os.listdir('.') if os.path.isfile(name))
For every qualifying name, we get a 1, and all these 1's get fed into sum() and you get your count.
Note that this code uses a "generator expression" instead of a list comprehension. Instead of building a list, taking its length, and then discarding the list, the above code just makes an iterator that sum() iterates to compute the count.
It's a bit sleazy, but there is a shortcut we can use: sum() will accept boolean values, and will treat True as a 1, and False as a 0. We can sum these.
# sum will treat Boolean True as a 1, False as a 0
n = sum(os.path.isfile(name) for name in os.listdir('.'))
This is sufficiently tricky that I probably would not use this without putting a comment. But I believe this is the fastest, most efficient way to count things in Python.

WxPython - building a directory tree based on file availability

I do atomistic modelling, and use Python to analyze simulation results. To simplify work with a whole bunch of Python scripts used for different tasks, I decided to write simple GUI to run scripts from it.
I have a (rather complex) directory structure beginning from some root (say ~/calc), and I want to populate wx.TreeCtrl control with directories containing calculation results preserving their structure. The folder contains the results if it contains a file with .EXT extension. What i try to do is walk through dirs from root and in each dir check whether it contains .EXT file. When such dir is reached, add it and its ancestors to the tree:
def buildTree(self, rootdir):
    root = rootdir
    r = len(rootdir.split('/'))
    ids = {root : self.CalcTree.AddRoot(root)}
    for (dirpath, dirnames, filenames) in os.walk(root):
        for dirname in dirnames:
            fullpath = os.path.join(dirpath, dirname)
            if sum([s.find('.EXT') for s in filenames]) > -1 * len(filenames):
                ancdirs = fullpath.split('/')[r:]
                ad = rootdir
                for ancdir in ancdirs:
                    d = os.path.join(ad, ancdir)
                    ids[d] = self.CalcTree.AppendItem(ids[ad], ancdir)
                    ad = d
But this code ends up with many second-level nodes with the same name, and that's definitely not what I want. So I somehow need to see if the node is already added to the tree, and in positive case add new node to the existing one, but I do not understand how this could be done. Could you please give me a hint?
Besides, the code contains 2 dirty hacks I'd like to get rid of:
I get the list of ancestor dirs by splitting the full path at / positions, and this is Linux-specific;
I find whether a .EXT file is in the directory by searching for the extension in the strings from the filenames list, taking into account that s.find returns -1 if the substring is not found.
Is there a way to make these chunks of code more readable?
First of all the hacks:
To get the path separator for whatever OS you're using, you can use os.sep.
Use str.endswith() and use the fact that in Python the empty list [] evaluates to False:
if [ file for file in filenames if file.endswith('.EXT') ]:
In terms of getting them all nicely nested you're best off doing it recursively. So the pseudocode would look something like the following. Please note this is just provided to give you an idea of how to do it, don't expect it to work as it is!
def buildTree(self, rootdir):
    rootId = self.CalcTree.AddRoot(rootdir)
    self.buildTreeRecursion(rootdir, rootId)

def buildTreeRecursion(self, dir, parentId):
    # Iterate over the entries in dir
    for name in os.listdir(dir):
        path = os.path.join(dir, name)
        id = self.CalcTree.AppendItem(parentId, name)
        if os.path.isdir(path):
            # descend, attaching children to the node just created
            self.buildTreeRecursion(path, id)
Hope this helps!
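To show the recursion together with the .EXT filter, here is a GUI-free sketch that builds a nested dict instead of a wx.TreeCtrl. It is not wxPython code: the function names has_result_file and build_result_tree are invented for illustration, and each dict key stands in for one AppendItem call under its parent. Because every directory is visited exactly once, no node can be added twice.

```python
import os


def has_result_file(dirpath, ext=".EXT"):
    """True if dirpath directly contains at least one file ending with ext."""
    return any(
        entry.is_file() and entry.name.endswith(ext)
        for entry in os.scandir(dirpath)
    )


def build_result_tree(dirpath, ext=".EXT"):
    """Return a nested dict {subdir_name: subtree} keeping only branches that lead to results."""
    tree = {}
    for entry in sorted(os.scandir(dirpath), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        subtree = build_result_tree(entry.path, ext)
        # keep a directory if it holds results itself or any descendant does
        if subtree or has_result_file(entry.path, ext):
            tree[entry.name] = subtree
    return tree
```

In a wx version, the place where tree[entry.name] is assigned would be where the item is appended to the parent's tree id, with the recursion run first so that empty branches can be skipped.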
