How to count total number of files in each subfolder - python

My file structure looks like this:
- Outer folder
    - Inner folder 1
        - Files...
    - Inner folder 2
        - Files...
    - …
I'm trying to count the total number of files in the whole of Outer folder. os.walk doesn't return any files when I pass it the Outer folder, and as I've only got two layers I've written it manually:
import os

total = 0
folders = [name for name in os.listdir(Outer_folder)
           if os.path.isdir(os.path.join(Outer_folder, name))]
for folder in folders:
    contents = os.listdir(os.path.join(Outer_folder, folder))
    total += len(contents)
print(total)
Is there a better way to do this? And can I find the number of files in an arbitrarily nested set of folders? I can't see any examples of deeply nested folders on Stack Overflow.
By 'better', I mean some kind of built in function, rather than manually writing something to iterate - e.g. an os.walk that walks the whole tree.
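For reference, os.walk does walk the whole tree; a minimal sketch of that built-in approach, with 'Outer_folder' as a placeholder for the actual path:

import os

# os.walk yields (dirpath, dirnames, filenames) for every directory
# under the root, however deeply nested.
total = sum(len(filenames) for _, _, filenames in os.walk('Outer_folder'))
print(total)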

Use pathlib:
Return total number of files in directory and subdirectories shows how to get just the total number.
pathlib is part of the standard library, and should be used instead of os because it treats paths as objects with methods, not strings to be sliced.
Python 3's pathlib Module: Taming the File System
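A quick illustration of that object-based style (hypothetical path):

from pathlib import Path

p = Path('/data/logs/app.log')  # hypothetical path
print(p.name)    # app.log - no string slicing needed
print(p.suffix)  # .log
print(p.parent)  # /data/logs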
Use a condition to select only files:
[x.parent for x in f if x.is_file()]
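For just the total, as in the linked question, a minimal sketch ('Outer_folder' is a placeholder path):

from pathlib import Path

# Count every file (directories excluded) anywhere under the tree.
total = sum(1 for x in Path('Outer_folder').rglob('*') if x.is_file())
print(total)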
File and subdirectory count in each directory:
from pathlib import Path
import numpy as np

p = Path.cwd()  # if you're running in the current dir
# p = Path('path to dir')  # otherwise, specify a path

# creates a generator of all the files matching the pattern
f = p.rglob('*')
# optionally, use list(...) to unpack the generator
# f = list(p.rglob('*'))

# count entries grouped by their parent directory
paths, counts = np.unique([x.parent for x in f], return_counts=True)
path_counts = list(zip(paths, counts))
Output:
List of tuples with path and count
[(WindowsPath('E:/PythonProjects/stack_overflow'), 8),
(WindowsPath('E:/PythonProjects/stack_overflow/.ipynb_checkpoints'), 7),
(WindowsPath('E:/PythonProjects/stack_overflow/complete_solutions/data'), 6),
(WindowsPath('E:/PythonProjects/stack_overflow/csv_files'), 3),
(WindowsPath('E:/PythonProjects/stack_overflow/csv_files/.ipynb_checkpoints'), 1),
(WindowsPath('E:/PythonProjects/stack_overflow/data'), 5)]
f = list(p.rglob('*')) unpacks the generator and produces a list of all the files.
One-liner:
Use Path.cwd().rglob('*') or Path('some path').rglob('*')
path_counts = list(zip(*np.unique([x.parent for x in Path.cwd().rglob('*')], return_counts=True)))

I suggest you use recursion, as in the function below:
import os

def get_folder_count(path):
    # keep only the subdirectories of path
    folders = os.listdir(path)
    folders = list(filter(lambda a: os.path.isdir(os.path.join(path, a)), folders))
    count = len(folders)
    # add the counts from each subdirectory, recursively
    for i in range(count):
        count += get_folder_count(os.path.join(path, folders[i]))
    return count
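A hedged variant of the same recursive pattern, adapted to count files rather than folders (an assumption that file totals are what's wanted, per the question):

import os

def get_file_count(path):
    count = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            count += get_file_count(full)  # recurse into subfolders
        else:
            count += 1
    return count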

Related

How to get the list of csv files in a directory sorted by creation date in Python

I need to get the list of ".csv" files in a directory, sorted by creation date.
I use this function:
from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = filter(lambda x: isfile(join(path, x)), listdir(path))
    list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
    list_of_files = [file for file in list_of_files if file.endswith(file_extension)]  # keep only csv files
    return list_of_files
It works fine when I use it in directories that contain a small number of csv files (e.g. 500), but it's very slow when I use it in directories that contain 50000 csv files: it takes about 50 seconds to return.
How can I modify it? Or can I use a better alternative function?
EDIT1:
The bottleneck is the sorted function, so I must find an alternative to sort the files by creation date without using it
EDIT2:
I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?
You should start by only examining the creation time on relevant files. You can do this by using glob() to return the files of interest.
Build a list of 2-tuples - i.e., (creation time, file name)
A sort of that list will implicitly be performed on the first item in each tuple (the creation date).
Then you can return a list of files in the required order.
from glob import glob
from os.path import join, getctime

def get_sort_files(path, extension):
    list_of_files = []
    for file in glob(join(path, f'*{extension}')):
        list_of_files.append((getctime(file), file))
    return [file for _, file in sorted(list_of_files)]

print(get_sort_files('some directory', 'csv'))
Edit:
I created a directory with 50,000 dummy CSV files and timed the code shown in this answer. It took 0.24s
Edit 2:
OP only wants the oldest file. In which case:
def get_oldest_file(path, extension):
    ctime = float('inf')
    old_file = None
    for file in glob(join(path, f'*{extension}')):
        if (ctime_ := getctime(file)) < ctime:
            ctime = ctime_
            old_file = file
    return old_file
You could try using os.scandir:
from os import scandir

def get_sort_files(path, file_extension):
    """Return the oldest file in path with the correct file extension"""
    list_of_files = [(d.stat().st_ctime, d.path) for d in scandir(path)
                     if d.is_file() and d.path.endswith(file_extension)]
    return min(list_of_files)[1]  # min() compares ctime first; [1] extracts the path
os.scandir seems to use fewer calls to stat. See this post for details.
I saw much better performance on a sample folder with 5000 csv files.
You could try the following code:
def get_sort_files(path, file_extension):
    list_of_files = [file for file in listdir(path)
                     if isfile(join(path, file)) and file.endswith(file_extension)]
    list_of_files.sort(key=lambda x: getctime(join(path, x)))
    return list_of_files
This version could have better performance, especially on big folders: a single list comprehension discards irrelevant files right at the start, and the sort happens in place.
This way, the code uses only one list. In your code, you create multiple lists in memory and the data has to be copied each time:
- listdir(path) returns the initial list of filenames
- sorted(...) returns a filtered and sorted copy of the initial list
- the list comprehension before the return statement creates yet another new list
You can try this method:
def get_sort_files(path, extension):
    # Relative path generator
    sort_paths = (join(path, i)
                  for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=getctime)
    return sort_paths

# Include the . char to be explicit
>>> get_sort_files("dir", ".csv")
['dir/new.csv', 'dir/test.csv']
However, all file names are returned as relative paths (folder/file.csv). A slightly less efficient workaround is to use a lambda key again:
def get_sort_files(path, extension):
    # File name generator
    sort_paths = (i for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=lambda x: getctime(join(path, x)))
    return sort_paths

>>> get_sort_files("dir", ".csv")
['new.csv', 'test.csv']
Edit for avoiding sorted():
Using min():
This is the fastest method of all listed in this answer:
def get_sort_files(path, extension):
    # Relative path generator
    sort_paths = (join(path, i) for i in listdir(path) if i.endswith(extension))
    return min(sort_paths, key=getctime)
Manually:
def get_sort_files(path, extension):
    # Relative path list
    sort_paths = [join(path, i) for i in listdir(path) if i.endswith(extension)]
    oldest = (getctime(sort_paths[0]), sort_paths[0])
    for i in sort_paths[1:]:
        t = getctime(i)
        if t < oldest[0]:
            oldest = (t, i)
    return oldest[1]

How to rename files in reverse order in Python?

I have a scanner that creates a folder of images named like this:
A1.jpg A2.jpg A3.jpg...A24.jpg -> B1.jpg B2.jpg B3.jpg...B24.jpg
There are 16 rows and 24 images per letter row i.e A1 to P24, 384 images total.
I would like to rename them by reversing the order: the first file should take the name of the last, and vice versa. Consider the first file to be A1 (which is also the first created during scanning).
The closest example I can find is in shell but that is not really what I want:
for i in {1..50}; do
    mv "$i.txt" "renamed/$(( 50 - $i + 1 )).txt"
done
Perhaps I need to save the filenames into a list (natsort maybe?) then use those names somehow?
I also thought I could use the image creation time, as the scanner always creates the files in the same order with the same names. That said, a solution relying on that may not be so useful for others with the same challenge.
What is a sensible approach to this problem?
I don't know if this is the most optimal way of doing that, but here it is:
import os

folder_name = "test"
new_folder_name = folder_name + "_new"

file_names = os.listdir(folder_name)
file_names_new = file_names[::-1]
print(file_names)
print(file_names_new)

os.mkdir(new_folder_name)
for name, new_name in zip(file_names, file_names_new):
    os.rename(folder_name + "/" + name, new_folder_name + "/" + new_name)
os.rmdir(folder_name)
os.rename(new_folder_name, folder_name)
This assumes that you have files saved in the directory "test"
I would store the original list. Then rename all files in the same order (e.g. 1.jpg, 2.jpg etc.). Then I'd rename all of those files into the reverse of the original list.
In that way you will not encounter duplicate file names during the renaming.
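A minimal sketch of that two-pass idea (assuming the scans live in a folder named 'scans' and that sorted() reproduces the original order; names like A10.jpg may need a natural-sort key, e.g. via natsort):

import os

folder = 'scans'  # placeholder folder name
original = sorted(os.listdir(folder))

# Pass 1: move everything to unique temporary names.
for i, name in enumerate(original):
    os.rename(os.path.join(folder, name), os.path.join(folder, 'tmp_%d' % i))

# Pass 2: give each temporary file the name from the reversed list.
for i, name in enumerate(reversed(original)):
    os.rename(os.path.join(folder, 'tmp_%d' % i), os.path.join(folder, name))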
You can make use of the pathlib functions rename and iterdir for this. I think it's straightforward how to put that together.
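For instance, a rough sketch (hypothetical folder name; the key sorts A1.jpg through P24.jpg by row letter and then number; note Path.rename returns the new path on Python 3.8+):

from pathlib import Path

folder = Path('scans')
files = sorted(folder.iterdir(), key=lambda p: (p.stem[0], int(p.stem[1:])))

# Rename in two passes so the reversed names never collide.
temps = [p.rename(p.with_name(f'tmp_{i}{p.suffix}')) for i, p in enumerate(files)]
for tmp, target in zip(temps, reversed(files)):
    tmp.rename(tmp.with_name(target.name))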
A solution based on the shutil package (the os package sometimes has permission problems), working "in place" so as not to waste memory if the folder is huge:
import wizzi_utils as wu
import os

def reverse_names(dir_path: str, temp_file_suffix: str = '.temp_unique_suffix') -> None:
    """
    "in place" solution:
    go over the list from both directions and swap names
    swap needs a temp variable so move first file to target name with 'temp_file_suffix'
    """
    files_full_paths = wu.find_files_in_folder(dir_path=dir_path, file_suffix='', ack=True, tabs=0)
    files_num = len(files_full_paths)
    for i in range(files_num):  # works for even and odd files_num
        j = files_num - i - 1
        if i >= j:  # crossed the middle - done
            break
        file_a, file_b = files_full_paths[i], files_full_paths[j]
        print('replacing {}(idx in dir {}) with {}(idx in dir {}):'.format(
            os.path.basename(file_a), i, os.path.basename(file_b), j))
        temp_file_name = '{}{}'.format(file_b, temp_file_suffix)
        wu.move_file(file_src=file_a, file_dst=temp_file_name, ack=True, tabs=1)
        wu.move_file(file_src=file_b, file_dst=file_a, ack=True, tabs=1)
        wu.move_file(file_src=temp_file_name, file_dst=file_b, ack=True, tabs=1)
    return

def main():
    reverse_names(dir_path='./scanner_files', temp_file_suffix='.temp_unique_suffix')
    return

if __name__ == '__main__':
    main()
found 6 files that ends with in folder "D:\workspace\2021wizzi_utils\temp\StackOverFlow\scanner_files":
['A1.jpg', 'A2.jpg', 'A3.jpg', 'B1.jpg', 'B2.jpg', 'B3.jpg']
replacing A1.jpg(idx in dir 0) with B3.jpg(idx in dir 5):
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/A1.jpg Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B3.jpg.temp_unique_suffix(0B)
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B3.jpg Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/A1.jpg(0B)
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B3.jpg.temp_unique_suffix Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B3.jpg(0B)
replacing A2.jpg(idx in dir 1) with B2.jpg(idx in dir 4):
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/A2.jpg Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B2.jpg.temp_unique_suffix(0B)
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B2.jpg Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/A2.jpg(0B)
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B2.jpg.temp_unique_suffix Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B2.jpg(0B)
replacing A3.jpg(idx in dir 2) with B1.jpg(idx in dir 3):
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/A3.jpg Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B1.jpg.temp_unique_suffix(0B)
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B1.jpg Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/A3.jpg(0B)
D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B1.jpg.temp_unique_suffix Moved to D:/workspace/2021wizzi_utils/temp/StackOverFlow/scanner_files/B1.jpg(0B)

How can I make a list of files in a directory where the files in that directory are first before files in subdirectories in python?

I'm trying to make a list of files, using recursion, in a given directory. I have been able to build the list, but its order is not correct. I need the files at the surface level of the directory to appear first in the list, with the files from subdirectories following in lexicographical order.
Here is the code I have to do what I've discussed above.
import os

important = []

def search_directory(folder):
    hold = os.listdir(folder)
    for i in hold:
        test = os.path.join(folder, i)
        if os.path.isfile(test) and test not in important:
            important.append(test)
        else:
            search_directory(test)
    return important
Seems like you need BFS traversal of your directory tree.
import collections
import os

def extract_tree(root):
    q = collections.deque()
    q.append(root)
    tree = []
    while q:
        root = q.popleft()
        contents = sorted(os.listdir(root))
        for f in contents:
            path = os.path.join(root, f)
            if os.path.isfile(path):
                tree.append(path)
            else:
                q.append(path)
    return tree
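Hypothetical usage; files directly under the root come out first, then files from each subdirectory, level by level:

for path in extract_tree('project'):  # 'project' is a placeholder root
    print(path)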

Looping over different python dictionaries - wrong results?

I am using Python 2.7, by the way.
Let's say I have a couple directories that I want to create dictionaries for. The files in each of the directories are named YYYYMMDD.hhmmss and are all different, and the size of each directory is different:
path1 = '/path/to/folders/to/make/dictionaries'
dir1 = os.listdir(path1)
I also have another static directory that will have some files to compare
gpath1 = '/path/to/static/files'
gdir1 = os.listdir(gpath1)
dir1_file_list = [datetime.strptime(g, '%Y%m%d.%H%M%S') for g in gdir1]
So I have a static directory of files in gdir1, and I now want to loop through each directory in dir1 and create a unique dictionary. This is the code:
for i in range(0, len(dir1)):
    path2 = path1 + "/" + dir1[i]
    dir2 = os.listdir(path2)
    dir2_file_list = [datetime.strptime(r, '%Y%m%d.%H%M%S') for r in dir2]

    # Define a dictionary, and initialize comparisons
    dict_gr = []
    dict_gr = dict()
    for dir1_file in dir1_file_list:
        dict_gr[str(dir1_file)] = []
        # Look for instances within the last 5 minutes
        for dir2_file in dir2_file_list:
            if 0 <= (dir1_file - dir2_file).total_seconds() <= 300:
                dict_gr[str(dir1_file)].append(str(dir2_file))

    # Sort the dictionaries
    for key, value in sorted(dict_gr.iteritems()):
        dir2_lib.append(key)
        dir1_lib.append(sorted(value))
The issue is that path2 and dir2 both properly go to the different folders and grab the necessary filenames, and creating dict_gr will all work well. However, when I go to the part of the script where I sort the dictionaries, the 2nd directory that has been looped over will contain the contents of the first directory. The 3rd looped dictionary will contain the contents of the 1st and 2nd, etc. In other words, they are not matching uniquely with each directory.
Any thoughts?
I overlooked the appending to dir2_lib and dir1_lib; these needed to be initialized inside the loop, once per directory.
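A minimal sketch of the pattern (illustrative names and data, not the originals):

# Accumulators initialized once, outside the loop: results pile up.
dir2_lib = []
for batch in (['a'], ['b'], ['c']):
    dir2_lib.extend(batch)
    print(dir2_lib)  # ['a'], then ['a', 'b'], then ['a', 'b', 'c']

# The fix: re-initialize inside the loop so each pass is independent.
for batch in (['a'], ['b'], ['c']):
    dir2_lib = []
    dir2_lib.extend(batch)
    print(dir2_lib)  # ['a'], then ['b'], then ['c']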

removing file names from a list python

I have all filenames of a directory in a list named files. And I want to filter it so only the files with the .php extension remain.
for x in files:
    if x.find(".php") == -1:
        files.remove(x)
But this seems to skip filenames. What can I do about this?
How about a simple list comprehension?
files = [f for f in files if f.endswith('.php')]
Or if you prefer a generator as a result:
files = (f for f in files if f.endswith('.php'))
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php']
>>> [f for f in files if f.endswith('.php')]
['a.php', 'd.php']
Most of the answers provided give list / generator comprehensions, which are probably the way you want to go 90% of the time, especially if you don't want to modify the original list.
However, for those situations where (say for size reasons) you want to modify the original list in place, I generally use the following snippet:
idx = 0
while idx < len(files):
    if files[idx].find(".php") == -1:
        del files[idx]
    else:
        idx += 1
As to why your original code wasn't working: it changes the list as you iterate over it. The "for x in files" implicitly creates an iterator, just like if you'd done "for x in iter(files)", and deleting elements from the list confuses the iterator about what position it is at. For such situations I generally use the above code, or, if it happens a lot in a project, factor it out into a function, e.g.:
def filter_in_place(func, target):
    idx = 0
    while idx < len(target):
        if func(target[idx]):
            idx += 1
        else:
            del target[idx]
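For example (hypothetical list):

files = ['a.php', 'b.txt', 'c.php']
filter_in_place(lambda f: f.endswith('.php'), files)
print(files)  # ['a.php', 'c.php']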
Just stumbled across this old question. Many solutions here will do the job, but they ignore the case where a filename could be just ".php". I suspect the question was about how to filter PHP scripts, and ".php" may not be a PHP script. The solution I propose is as follows:
>>> import os.path
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php', '.php']
>>> [f for f in files if os.path.splitext(f)[1] == ".php"]
['a.php', 'd.php']
