Count all files in all folders/subfolders with Python

Which is the most efficient way to count all files in all folders and subfolders in Python? I want to use this on Linux systems.
Example output:
(Path files)
/ 2
/bin 100
/boot 20
/boot/efi/EFI/redhat 1
....
/root 34
....
Paths without a file should be ignored.
Thanks.

You can do it with os.walk():
import os

for root, dirs, files in os.walk('/some/path'):
    if files:
        print('{0} {1}'.format(root, len(files)))
Note that this will also include hidden files, i.e. those that begin with a dot (.).

import os

print([(item[0], len(item[2])) for item in os.walk('/path') if item[2]])
It returns a list of (folder, file count) tuples for /path.
OR
import os

for item in os.walk('/path'):
    if item[2]:
        print(item[0], len(item[2]))
It prints each folder/subfolder under /path with its file count.
If you want a faster solution, you could try combining:
os.scandir()  # available since Python 3.5
iterating recursively, and using:
from itertools import count

counter = count()
next(counter)  # returns 0 first, then 1, 2, 3 ...
if next(counter) > 1000:
    print('dir with file count over 1000')  # and use continue in the for loop
Maybe that will be faster, because I think the os.walk function does unnecessary work for your case.


How to get the latest folder in a directory using Python

I need to retrieve the directory of the most recently created folder. I am using a program that will output a new run## folder each time it is executed (i.e. run01, run02, run03 and so on). Within any one run## folder resides a data file that I want to analyze (file-i-want.txt).
run_numb = 'run01'
dir = os.path.dirname(__file__)
filepath = os.path.join(dir, '..\data\directory', run_numb, 'file-i-want.txt')
In short, I want to skip having to hardcode run## and just get the directory of a file within the most recently created run## folder.
You can get the creation date with os.stat (note that st_birthtime is only available on macOS and some BSDs; on Linux use st_mtime or st_ctime instead):
path = '/a/b/c'

# newest
newest = max(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime)

# all files, sorted newest first
sorted_files = sorted(os.listdir(path), key=lambda x: os.stat(os.path.join(path, x)).st_birthtime, reverse=True)
pathlib is recommended over os for filesystem-related tasks.
You can try:
from pathlib import Path

filepath = Path(__file__).parent / 'data/directory'
fnames = sorted(Path(filepath).rglob('file-i-want.txt'), key=lambda x: x.stat().st_mtime, reverse=True)
filepath = str(fnames[0])
glob.glob('run*') will return the list of files/directories that match the pattern, ordered by name.
So if you want the latest run, your code will be:
import glob

print(glob.glob('run*')[-1])  # raises IndexError if there are no runs
IMPORTANT: the names are ordered alphabetically, so, for example, 'run21' will come AFTER 'run100'. You will need to pad the run numbers with enough digits to avoid this error, or just count the number of matched folders and reconstruct the name of the latest folder from that count.
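To be robust against that, you could compare the numeric suffixes directly instead of relying on string order. A sketch (latest_run is a hypothetical helper, not part of glob):

```python
import re

def latest_run(names):
    """Return the 'runNN' name with the highest numeric suffix,
    or None if no name matches the pattern."""
    runs = [n for n in names if re.fullmatch(r'run\d+', n)]
    return max(runs, key=lambda n: int(n[3:])) if runs else None
```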
You can use glob to check the number of folders with the same name pattern:
import glob

n = len(glob.glob('run*'))  # number of entries whose name starts with 'run'
new_run_name = 'run' + str(n)
Note: with this code the numbering starts from 0; if you want to start from 1, just add 1 to n.
If you always want a two-digit run number (00, 01, 02), use 'str(n).zfill(2)' instead of 'str(n)'.
Example:
import glob

n = len(glob.glob('run*'))  # number of entries whose name starts with 'run'
new_run_name = 'run' + str(n + 1).zfill(2)

Python use count in list

I want to make a list of all of my saved wifi files, with a number in front of each file but my output is not what I what from my code.
import os

path = '/etc/NetworkManager/system-connections/'
dirs = os.listdir(path)
count = sum([len(files) for r, d, files in os.walk(path)])
for file in dirs:
    for item in range(count):
        print(item, file)
Expected output:
1 wifi-test
2 androidAP
3 androidAPtest
output now:
0 wifi-test
1 androidAP
2 androidAPtest
0 wifi-test
1 androidAP
2 androidAPtest
and then it starts over
How a Loop inside a Loop works
I think there you have a misunderstanding in what happens when you put a loop inside a loop, so let me explain that first.
If you have, for example
for item_a in ['a', 'b']:
    for item_b in ['1', '2']:
        print(item_a + item_b)
then your output would be:
a1
a2
b1
b2
The code would start in the a loop first, and then it would go over both items in the inner loop. Once finished, the next item in the outer loop is b, and then it will go over both items in the inner loop again.
If you want to keep track of how many items you've gone over in your loop, you could do so with this type of pattern:
count = 0
for item_a in ['a', 'b']:
    count = count + 1
    print(str(count) + item_a)
This results in
1a
2b
But there is a shortcut. You can use a nifty function called enumerate to get the count of each item in the for loop (it counts from 0 by default; pass start=1 to begin at 1).
for count, item_a in enumerate(['a', 'b'], start=1):
    print(str(count) + item_a)
Which will also give you
1a
2b
Solution to your problem
With all this said, you can create your list of files like so:
import os

# First we loop over os.walk to get the files in the current directory and all sub-directories
for root, dirs, files in os.walk(path):
    # enumerate numbers the files within each directory (note it restarts at 0 for every directory)
    for item, file in enumerate(files):
        print(item, os.path.join(root, file))
And if you don't care about sub-directories, you can just do
for item, file in enumerate(os.listdir(path)):
    print(item, file)
It's not quite clear what you want with your code. What's that count for?
Maybe this is what you want:
import os

path = '/etc/NetworkManager/system-connections/'
dirs = os.listdir(path)
for num, file in enumerate(dirs):
    print(num + 1, file)
I'm not sure what count is supposed to do here, but if you want the files in the directory (not subdirectories) you just need os.listdir.
import os

path = '/etc/NetworkManager/system-connections/'
dirs = os.listdir(path)
for i in range(len(dirs)):
    print(i + 1, dirs[i])
This is exactly what a nested loop will do.
Check the outputs of the two loops independently: the inner loop prints the whole numbered list once for every file the outer loop visits, which is why the list repeats.
So this is the first thing you want to look at: if the list is supposed to appear once, the outer loop is iterating over something it shouldn't. What is it counting? Debug it by printing each loop's output separately.
Also, the numbering starting at 0 can be solved by hardcoding an innocent +1:
print(item + 1, file)

How do you count subdirectories in a folder?

I figured out how to count directories in a folder, but not sure how I could edit my code to recursively count subdirectories. Any help would be appreciated.
This is my code so far.
import os

def nestingLevel(path):
    count = 0
    for item in os.listdir(path):
        if item[0] != '.':
            n = os.path.join(path, item)
            if os.path.isdir(n):
                count += 1 + nestingLevel(n)
    return count
I think you may want to use os.walk:
import os

def fcount(path):
    count1 = 0
    for root, dirs, files in os.walk(path):
        count1 += len(dirs)
    return count1

path = "/home/"
print(fcount(path))
You can use a glob here - the ** pattern indicates a recursive glob. The trailing slash matches on directories, excluding other types of files.
from pathlib import Path

def recursive_subdir_count(path):
    dirs = Path(path).glob('**/')
    result = sum(1 for dir in dirs)
    result -= 1  # discount `path` itself
    return result
Using / works on Windows, macOS, and Linux, so don't worry about substituting os.sep.
Beware of a weird edge case: shell globs typically exclude hidden directories, i.e. those which begin with a ., but pathlib includes those (it's a feature, not a bug: see issue26096). If you care about discounting hidden directories, filter them out in the expression when calling sum. Or, use the older module glob which excludes them by default.
If you want to count them all without the root, this will do it:
len([root for root, dirs, files in os.walk('.')]) - 1

How to get sequence number of the file in the folder?

I have a Windows PC. My script should identify the sequence number, within its folder, of the file passed on the command line, i.e.
myscript.py \\network-drive\files\Long-long.file.name.with.numbers.txt
Folder content is the following:
\\network-drive\files\
folder1
folder2
file1
file2
Long.long.file.name.with.numbers.txt
file3
file4
My script should identify sequence number of the file given in the command line, i.e. should return 5 (folders are also to be counted; assumption is that files are sorted by their names).
Upd. I've stopped with the following:
import sys
import os.path

if sys.argv[1]:  # regardless of this check, an exception happens if the argument is not passed
    head, tail = os.path.split(sys.argv[1])
    print(head)
    print(os.listdir(head))
The list returned by listdir doesn't let me identify what is a folder and what is a file, so I cannot sort them properly.
There are a couple of problems you are trying to solve, and a couple of options for the solutions.
1st - are you looking for something that is naturally sorted i.e.:
/path/to/folder/
subfolder01/
test1.png
test2.png
test3.png
test10.png
test11.png
If so...you'll need to create a natural sort method. If you are happy with alpha-numeric sorting:
/path/to/folder/
subfolder01/
test1.png
test10.png
test11.png
test2.png
test3.png
Then the standard sort will work. Depending on how you sort your files, the index of your result will vary.
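If you don't want a third-party dependency, a minimal natural-sort key using only the standard library looks something like this (a sketch, not the projex implementation):

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 'test10'
    sorts after 'test2' instead of between 'test1' and 'test2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]
```

Use it as the key argument: sorted(files, key=natural_key).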
To get the directory and files from the system, you can do it one of two ways - not 100% sure which is faster, so test them both out. I'm going to break the answer into chunks so you can piece it together how best seems fit:
Part 01: Initialization
import os
import sys

try:
    searchpath = sys.argv[1]
except IndexError:
    print('No searchpath supplied')
    sys.exit(0)

basepath, searchname = os.path.split(searchpath)
Part 02: Collecting folders and files
Option #1: os.listdir + os.path.isfile
files = []
folders = []
for name in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, name)):
        files.append(name)
    else:
        folders.append(name)
Option #2: os.walk
# we only want the top level list of folders and files,
# so break out of the loop after the first result
for basepath, folders, files in os.walk(basepath):
    break
Part 03: Calculating the Index
Option #1: no sorting - what you get from the system is what you get
# no sorting
try:
    index = len(folders) + files.index(searchname)
except ValueError:
    index = -1
Option #2: alphanumeric sorting
# sort alpha-numerically (only the files need sorting)
try:
    index = len(folders) + sorted(files).index(searchname)
except ValueError:
    index = -1
Option #3: natural sorting
# natural sort using the projex.sorting.natural method
import projex.sorting

sorted_files = sorted(files, projex.sorting.natural)
try:
    index = len(folders) + sorted_files.index(searchname)
except ValueError:
    index = -1
Part 04: Logging the result
# if wanting a 1-based answer
index += 1
print(index)
I'm not going to go into detail about natural sorting since that wasn't a part of the question - I think there are other forums on here you can find with advice on that. The projex.sorting module is one that I've written and is available here: http://dev.projexsoftware.com/projects/projex if you want to see the exact implementation of it.
Suffice to say this would be the difference in results:
>>> import pprint, projex.sorting
>>> files = ['test2.png', 'test1.png', 'test10.png', 'test5.png', 'test11.png']
>>> print files.index('test10.png')
2
>>> print sorted(files).index('test10.png')
1
>>> print sorted(files, projex.sorting.natural).index('test10.png')
3
>>> print files
['test2.png', 'test1.png', 'test10.png', 'test5.png', 'test11.png']
>>> print sorted(files)
['test1.png', 'test10.png', 'test11.png', 'test2.png', 'test5.png']
>>> print sorted(files, projex.sorting.natural)
['test1.png', 'test2.png', 'test5.png', 'test10.png', 'test11.png']
So just keep that in mind when you're working with it.
Cheers!
It looks like something like this should work:
import os
import os.path as path
import sys

try:
    directory, file = path.split(sys.argv[1])

    def sort_func(fname):
        """
        Russian directories, English directories, Russian files, then English files -
        although, honestly, I don't know how Russian files will actually be sorted ...
        """
        fullname = path.join(directory, fname)
        isRussian = any(ord(x) > 127 for x in fullname)
        isDirectory = path.isdir(fullname)
        return (not isDirectory, not isRussian, fullname)

    files = sorted(os.listdir(directory), key=sort_func)
    print(files.index(file) + 1)
except IndexError:
    print("oops, no commandline arguments")
from os import listdir
from sys import argv
from os.path import dirname, basename

print(listdir(dirname(argv[1])).index(basename(argv[1])))
but it really means nothing; I can't even imagine a use case where you would need it. See os.path for details.

Keep latest file and delete all other

In my folder there are many PDF files with a date-timestamp naming format, as shown below.
I would like to keep only the latest file for each day and delete the rest for that day. How can I do this in Python?
2012-07-13-15-13-27_1342167207.pdf
2012-07-13-15-18-22_1342167502.pdf
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-14-45-12_1343029512.pdf
2012-07-23-14-56-48_1343030208.pdf
2012-07-23-16-03-45_1343034225.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-07-27-19_1343262439.pdf
2012-07-26-07-33-27_1343262807.pdf
2012-07-26-07-51-59_1343263919.pdf
2012-07-26-22-38-30_1343317110.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-10-43-27_1343360607.pdf
2012-07-27-10-58-40_1343361520.pdf
2012-07-27-11-03-19_1343361799.pdf
2012-07-27-11-04-14_1343361854.pdf
Should I fill a list and sort it out, then? Desired output is:
2012-07-13-15-18-33_1342167513.pdf
2012-07-23-16-04-23_1343034263.pdf
2012-07-26-22-38-54_1343317134.pdf
2012-07-27-11-04-14_1343361854.pdf
Thanks
Your desired list can also be achieved using groupby (note that groupby needs sorted input):
import os
from itertools import groupby

filtered_list = []
names = sorted(os.listdir('.'))
for key, group in groupby(names, lambda x: x[:10]):  # groups on the first 10 characters (the date) of the file name
    filtered_list.append(list(group)[-1])  # picks the last (latest) file from each group
print(filtered_list)
Sort the list and delete files if the next file in the list is on the same day:
import glob
import os

files = glob.glob("*.pdf")
files.sort()
for ifl, fl in enumerate(files[:-1]):
    if files[ifl + 1].startswith(fl[:10]):  # check if the next file is from the same day
        os.unlink(fl)  # it is - delete the current file
Edit:
As the OPs question became clearer it became evident that not just the last file of the list is required, but the latest file of each day - to achieve this I included a "same day" conditioned unlinking.
You could do it that way. The following code is untested, but may work (note that it keeps only the single newest file overall, not the newest file per day):
import os

names = sorted(os.listdir('.'))
for f in names[:-1]:
    os.unlink(f)
Fortunately your file names use ISO8601 date format so the textual sort achieves the desired result with no need to parse the dates.
The following snippet works with the test case given.
import os

files = os.listdir(".")
days = set(fname[:10] for fname in files)  # the first 10 characters are the YYYY-MM-DD date
for d in days:
    f = [i for i in files if i[:10] == d]
    for x in sorted(f)[:-1]:
        os.remove(x)
Using a dictionary you can keep one value per key. This can be a dirty and quick solution - maybe not the best.
#!/usr/bin/env python
import os
import shutil

# collect all pdf names, sorted so that the latest file for each day wins below
lst = sorted(f for f in os.listdir(".") if f.endswith(".pdf"))

# key by the date part (dashes stripped); later files overwrite earlier ones,
# so each date ends up mapped to its latest file
dc = {}
for x in lst:
    dc[int(x[0:10].replace("-", ""))] = x
flist = list(dc.values())

# copy the keepers to a temporary directory
tmpdir = "tmpdir"
if not os.path.exists(tmpdir):
    os.makedirs(tmpdir)
for x in flist:
    shutil.copyfile(x, os.path.join(tmpdir, x))

# delete every pdf in the current directory ...
for f in os.listdir("."):
    if f.endswith(".pdf"):
        os.unlink(f)

# ... then restore the keepers and remove the temporary directory
for f in os.listdir(tmpdir):
    if f.endswith(".pdf"):
        shutil.copyfile(os.path.join(tmpdir, f), f)
shutil.rmtree(tmpdir)
