Directory sizes and extensions - python

I'd like to write a Python command-line script that prints a directory tree with the size of every subdirectory (starting from a given directory) and the most frequent extensions. Here is an example of the output:
root_dir (5 GB, jpg (65 %): avi ( 30 %) : pdf (5 %))
-- aa (3 GB, jpg (100 %) )
-- bb (2 GB, avi (20 %) : pdf (2 %) )
--- bbb (1 GB, ...)
--- bb2 (1 GB, ...)
-- cc (1 GB, pdf (100 %) )
The format is:
nesting level, directory name (size of the directory including all files and subdirectories, most frequent extensions with their size percentages in this directory).
I have this code snippet so far. The problem is that it counts only the file sizes in a directory, so the resulting size is smaller than the real size of the directory. The other problem is how to put it all together to print the tree I described above without redundant computations.

Calculating directory sizes really isn't Python's strong suit, as explained in this post: very quickly getting total size of folder. If you have access to du and find, by all means use them. You can easily display the size of each directory with the following line:
find . -type d -exec du -hs "{}" \;
If you insist on doing this in Python, you may prefer post-order traversal over os.walk, as suggested by PableG. But using os.walk can be visually cleaner, if efficiency is not the utmost factor for you:
import os, sys
from collections import defaultdict
def walkIt(folder):
    for (path, dirs, files) in os.walk(folder):
        # note: getDirSize walks the whole subtree again for every directory
        size = getDirSize(path)
        stats = getExtensionStats(files)
        # only get the top 3 extensions
        print('%s (%s, %s)' % (path, size, stats[:3]))
def getExtensionStats(files):
    # get all file extensions
    extensions = [f.rsplit(os.extsep, 1)[-1]
                  for f in files if len(f.rsplit(os.extsep, 1)) > 1]
    # count the extensions
    exCounter = defaultdict(int)
    for e in extensions:
        exCounter[e] += 1
    # convert count to percentage
    percentPairs = [(e, 100 * ct // len(extensions)) for e, ct in exCounter.items()]
    # sort them, most frequent first, so that [:3] really is the top 3
    percentPairs.sort(key=lambda i: i[1], reverse=True)
    return percentPairs
def getDirSize(root):
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            size += os.path.getsize(os.path.join(path, f))
    return size
if __name__ == '__main__':
    path = sys.argv[1] if len(sys.argv) > 1 else '.'
    walkIt(path)
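The post-order idea mentioned above can also be done with os.walk itself by passing topdown=False, so that each directory is visited after its children and every file is measured only once. What follows is a minimal sketch, not a definitive implementation: the names scan_tree and print_tree are made up for illustration, percentages are by size as in the question, and sizes are what os.path.getsize reports, so they will still differ from du, which counts allocated disk blocks.

```python
import os
from collections import Counter


def scan_tree(root):
    """Return {dir_path: (total_bytes, Counter of bytes per extension)}."""
    totals = {}
    # topdown=False yields children before their parents (post-order)
    for path, dirs, files in os.walk(root, topdown=False):
        size = 0
        by_ext = Counter()
        for name in files:
            full = os.path.join(path, name)
            if os.path.islink(full):
                continue  # assumption: symlinks are skipped, they may dangle
            file_size = os.path.getsize(full)
            size += file_size
            by_ext[os.path.splitext(name)[1].lower() or "(none)"] += file_size
        for d in dirs:
            child = totals.get(os.path.join(path, d))
            if child is not None:  # missing for symlinked or unreadable dirs
                size += child[0]
                by_ext.update(child[1])
        totals[path] = (size, by_ext)
    return totals


def print_tree(root, totals, top=3):
    """Print one line per directory, nested with dashes as in the question."""
    # sorting by path components keeps every directory right above its children
    for path in sorted(totals, key=lambda p: p.split(os.sep)):
        size, by_ext = totals[path]
        rel = os.path.relpath(path, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        shares = " : ".join(
            "%s (%d %%)" % (ext, 100 * n // max(1, size))
            for ext, n in by_ext.most_common(top)
        )
        name = os.path.basename(path) or path
        print("%s %s (%d bytes, %s)" % ("-" * (depth + 1), name, size, shares))
```

Calling totals = scan_tree(".") followed by print_tree(".", totals) prints the whole tree; the sizes are left in bytes, so converting them to GB is up to the caller.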

I personally find os.listdir plus a recursive function better suited to this task than os.walk:
import os, copy
from os.path import join, getsize, isdir, splitext
frequent_ext = {".jpg": 0, ".pdf": 0}  # Frequent extensions
def list_dir(base_dir):
    dir_sz = 0  # directory size
    files = os.listdir(base_dir)
    ext_size = copy.copy(frequent_ext)
    for file_ in files:
        file_ = join(base_dir, file_)
        if isdir(file_):
            # recurse first, then fold the subdirectory totals into this one
            ret = list_dir(file_)
            dir_sz += ret[0]
            for k, v in frequent_ext.items():  # Add to freq.ext.sizes
                ext_size[k] += ret[1][k]
        else:
            file_sz = getsize(file_)
            dir_sz += file_sz
            ext = splitext(file_)[1].lower()  # Frequent extension?
            if ext in frequent_ext:
                ext_size[ext] += file_sz
    # one line per directory: path, total size, share of each frequent extension
    shares = " ".join("%s: %5.2f%%" % (k, float(v) / max(1, dir_sz) * 100.)
                      for k, v in ext_size.items())
    print("%s %s %s" % (base_dir, dir_sz, shares))
    return (dir_sz, ext_size)
base_dir = "e:/test_dir/"
base_dir = os.path.abspath(base_dir)
list_dir(base_dir)

@Cldy is right: use os.path.
For example, os.path.walk (Python 2 only; os.walk is the Python 3 equivalent) will walk depth first through every directory below the argument and hand you the files and folders in each directory.
Use os.path.getsize to get the sizes and os.path.splitext (or a plain split) to get the extensions. Store the extensions in a list or dict and count them after going through each directory.
If you are on Linux, I would suggest looking at du instead.

That's the module you need. And also this.

Related

Python progressbar inside a for loop which read files number

Using the progressbar library :
import progressbar
.
.
bar = progressbar.ProgressBar(maxval=len(files_total)).start()
This is my base for loop, which reads and stores all .txt files in path4safe (a local test folder with 200 .txt files) and also counts, in "d", how many text files there are in the folder:
for root, dirnames, files in os.walk(path4safe):
    for x in files:
        if x.endswith(tuple(ext)):
            d += 1
            files_total.append(root + "/" + x)
So I tried this:
for root, dirnames, files in os.walk(path4safe):
    for idx, files in enumerate(files_total):
        for x in files:
            if x.endswith(tuple(ext)):
                d += 1
                files_total.append(root + "/" + x)
                bar.update(idx)
But I only get an empty progress bar; I feel like I'm mixing up one of my variables. Basically, I'm trying to use "d" to drive the progress bar.
[]0% | |
Total files:
0
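One possible reading of the attempt above: the inner loop reuses the name files for items of files_total, and the bar is created from len(files_total), which may still be zero at that point. A hedged way around both is to split the work into two passes, first collecting the matching paths and then iterating over them with a known total. The sketch below deliberately avoids the progressbar library; report is a hypothetical callback standing in for whatever updates the bar (for instance bar.update from the question), and collect_files and process_with_progress are made-up names.

```python
import os


def collect_files(top, extensions):
    """First pass: gather matching file paths so the total is known."""
    matches = []
    for root, dirnames, files in os.walk(top):
        for name in files:
            if name.endswith(tuple(extensions)):
                matches.append(os.path.join(root, name))
    return matches


def process_with_progress(paths, report):
    """Second pass: handle each file and call report(done, total) after it."""
    total = len(paths)
    for idx, path in enumerate(paths, 1):
        # ... read or store the file here ...
        report(idx, total)
    return total
```

With this split, the progress bar would be created after collect_files returns, so its maximum is the real number of files rather than the length of a list that has not been filled yet.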

Python: Recursively count all file types and sizes in folders and subfolders

I am attempting to count all files in a directory and any subdirectories by type and overall size.
The output should be a table that looks something like:
Directory A
Number of subdirectories: 12
|Type| Count| TotalSize/kb |FirstSeen |LastSeen |
|----|-------|------------------|-----------|----------|
|.pdf| 8 |80767 |1/1/2020 |2/20/2020 |
|.ppt| 9 |2345 |1/5/2020 |2/25/2020 |
|.mov| 2 |234563 |1/10/2020 |3/1/2020 |
|.jpg| 14 |117639 |1/15/2020 |3/5/2020 |
|.doc| 5 |891 |1/20/2020 |3/10/2020 |
Sorry, I was trying to get this into a table format for readability. Each record starts with a file type found in the directory.
This should do exactly what you want, count the size mapped by extensions. Tidy it up, pretty print the way you like, and you are done.
import os
def scan_dir(root_dir):
    # maps extension -> total size in bytes
    result = {}
    # walk the directory
    for root, dirs, files in os.walk(root_dir):
        # add up the file sizes per extension
        for file in files:
            path = os.path.join(root, file)
            ext = os.path.splitext(file)[1].upper()
            size = os.stat(path).st_size
            result[ext] = result.get(ext, 0) + size
    return result
print(scan_dir("."))
Edit: This doesn't collect the min/max timestamps for you, nor counts the files, but this should really put you on the right track.
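For the parts the answer leaves out, here is a rough sketch that also keeps a count and a timestamp range per extension. It rests on assumptions the question does not confirm: FirstSeen and LastSeen are taken to mean the earliest and latest modification time (st_mtime), extensions are lower-cased to match the table in the question, and scan_dir_stats and print_table are names invented for this example.

```python
import os
from datetime import datetime


def scan_dir_stats(root_dir):
    """Map each extension to its file count, total size and mtime range."""
    stats = {}
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            info = os.stat(os.path.join(root, name))
            ext = os.path.splitext(name)[1].lower()
            entry = stats.setdefault(
                ext,
                {"count": 0, "size": 0,
                 "first": info.st_mtime, "last": info.st_mtime},
            )
            entry["count"] += 1
            entry["size"] += info.st_size
            # assumption: "seen" means last modification time
            entry["first"] = min(entry["first"], info.st_mtime)
            entry["last"] = max(entry["last"], info.st_mtime)
    return stats


def print_table(stats):
    """Print the statistics in the pipe-separated layout from the question."""
    print("|Type|Count|TotalSize/kb|FirstSeen|LastSeen|")
    print("|----|-----|------------|---------|--------|")
    for ext, e in sorted(stats.items()):
        first = datetime.fromtimestamp(e["first"]).strftime("%m/%d/%Y")
        last = datetime.fromtimestamp(e["last"]).strftime("%m/%d/%Y")
        print("|%s|%d|%d|%s|%s|"
              % (ext or "(none)", e["count"], e["size"] // 1024, first, last))
```

The number of subdirectories is not tracked here; summing len(dirs) inside the same os.walk loop would be one way to add it.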

ValueError: scandir: path too long for Windows

I am writing a simple Python script to tell me file size for a set of documents which I am importing from a CSV. I verified that none of the entries are over 100 characters, so this error "ValueError: scandir: path too long for Windows" does not make sense to me.
Here is my code:
# determine size of a given folder in MBytes
import os, subprocess, json, csv, platform
# Function to check if a Drive Letter exists
def hasdrive(letter):
    return "Windows" in platform.system() and os.system("vol %s: 2>nul>nul" % (letter)) == 0
# Define Drive to check for
letter = 'S'
# Check if Drive doesnt exist, if not then map drive
if not hasdrive(letter):
    subprocess.call(r'net use s: /del /Y', shell=True)
    subprocess.call(r'net use s: \\path_to_files', shell=True)
list1 = []
# Import spreadsheet to calculate size
# (raw string, so that \f is not read as a form feed)
with open(r'c:\Temp\files_to_delete_subset.csv') as f:
    reader = csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in reader:
        list1.extend(row)
# Define variables
folder = "S:"
folder_size = 0
# Exporting outcome
for list1 in list1:
    folder = folder + str(list1)
    for root, dirs, files in os.walk(folder):
        for name in files:
            folder_size += os.path.getsize(os.path.join(root, name))
    print(folder)
    # print(os.path.join(root, name) + " " + chr(os.path.getsize(os.path.join(root, name))))
    print(folder_size)
From my understanding the max path size in Windows is 260 characters, so a drive letter plus a 100-character path should NOT exceed the Windows max.
Here is an example of a path: '/Document/8669/CORRESP/1722165.doc'
The folder string you're trying to walk is growing forever. Simplifying the code to the problem area:
folder = "S:"
# Exporting outcome
for list1 in list1:
    folder = folder + str(list1)
You never set folder otherwise, so it starts out as S:<firstpath>, then on the next loop it's S:<firstpath><secondpath>, then S:<firstpath><secondpath><thirdpath>, etc. Simple fix: Separate drive from folder:
drive = "S:"
# Exporting outcome
for path in list1:
    folder = drive + path
Now folder is constructed from scratch on each loop, throwing away the previous path, rather than concatenating them.
I also gave the iteration value a useful name (and removed the str call, because the values should all be str already).
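Putting the fix together with the size calculation might look like the sketch below. It is only an illustration under stated assumptions: folder_size and sizes_for are hypothetical helper names, and unlike the original script the size is reset for every folder instead of accumulating across all of them.

```python
import os


def folder_size(folder):
    """Sum the sizes of all files below folder, in bytes."""
    total = 0
    for root, dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def sizes_for(drive, paths):
    """Return {path: bytes}, rebuilding each folder from the drive afresh."""
    return {path: folder_size(drive + path) for path in paths}
```

For example, sizes_for("S:", list1) gives one size per CSV entry, and sum(sizes.values()) over the result recovers the grand total the original script was printing.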

How can i adjust my code to start with a new number?

import os
src = "/home/user/Desktop/images/"
ext = ".jpg"
for i, filename in enumerate(os.listdir(src)):
    # print(i,filename)
    if filename.endswith(ext):
        os.rename(src + filename, src + str(i) + ext)
        print(filename, src + str(i) + ext)
    else:
        os.remove(src + filename)
This code renames all the images in a folder to 0.jpg, 1.jpg, etc., and removes non-jpg files. But what if I already had some images in that folder? Let's say I had images 0.jpg, 1.jpg, 2.jpg, then added a few others called im5.jpg and someImage.jpg.
What I want to do is adjust the code to read the number of the last image, in this case 2, and start counting from 3.
In other words, I'll ignore the already labeled images and proceed with the new ones, counting from 3.
Terse and semi-tested version:
import os
import glob
offset = sorted(int(os.path.splitext(os.path.basename(filename))[0])
                for filename in glob.glob(os.path.join(src, '*' + ext)))[-1] + 1
for i, filename in enumerate(os.listdir(src), start=offset):
    ...
This works provided all *.jpg file names consist of only a number before their extension. Otherwise you will get a ValueError.
And if there happens to be a gap in the numbering, that gap will not be filled with new files. E.g., 1.jpg, 2.jpg, 3.jpg, 123.jpg will continue with 124.jpg (which is safer anyway).
If you need to filter out filenames such as im5.jpg or someImage.jpg, you could add an if-clause to the list comprehension, with a regular expression:
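import os
import glob
import re
# match against the base name from its start, so that im5.jpg is not accepted because of its "5.jpg" tail
offset = sorted(int(os.path.splitext(os.path.basename(filename))[0])
                for filename in glob.glob(os.path.join(src, '*' + ext))
                if re.match(r'\d+' + re.escape(ext) + '$', os.path.basename(filename)))[-1] + 1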
Of course, by now the three lines are pretty unreadable, and may not win the code beauty contest.
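A more readable variant of the same idea, written as two small functions, could look like this. It is a sketch under assumptions: next_index and rename_unnumbered are names invented here, an empty folder starts at 0, and unlike the code in the question it does not delete non-jpg files.

```python
import os
import re


def next_index(src, ext):
    """Return one more than the highest purely numeric file name in src."""
    pattern = re.compile(r"(\d+)" + re.escape(ext))
    numbers = []
    for name in os.listdir(src):
        m = pattern.fullmatch(name)  # im5.jpg and someImage.jpg do not match
        if m:
            numbers.append(int(m.group(1)))
    return max(numbers) + 1 if numbers else 0


def rename_unnumbered(src, ext):
    """Give every not-yet-numbered file with ext the next free number."""
    numbered = re.compile(r"\d+" + re.escape(ext))
    i = next_index(src, ext)
    for name in sorted(os.listdir(src)):
        if name.endswith(ext) and not numbered.fullmatch(name):
            os.rename(os.path.join(src, name),
                      os.path.join(src, str(i) + ext))
            i += 1
```

With 0.jpg, 1.jpg, 2.jpg, im5.jpg and someImage.jpg in the folder, rename_unnumbered(src, ".jpg") would leave the first three alone and turn the other two into 3.jpg and 4.jpg.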

renaming files to incremental values with correct digits

I would like to rename a list of pictures using the directory name as the root name (picture in this example), padding the existing number with the appropriate number of zeros based on the total number of files and the increment. I was thinking of using PowerShell or Python. Recommendations?
current 'C:\picture' directory contents
pic 1.jpg
...
pic 101.jpg
Result
picture 001.jpg
...
picture 101.jpg
Assuming you already know how to:
  traverse your directory
  access the file names in your script
  rename the files
A couple of things to understand:
  Your file name has a format in which the number is padded with '0's if it has fewer than a certain number of digits (three, in your example). str.format provides an elaborate format specifier to achieve this.
  You need to know how to extract the relevant portions of your file name so they can be reformatted as required.
  The formatting ultimately depends on the number of files.
Demo
>>> import math
>>> files = ['pic 1.jpg', 'pic 2.jpg', 'pic 10.jpg', 'pic 100.jpg']  # sample names, inferred from the output below
>>> no_of_files = 100
>>> no_of_digits = int(math.log10(no_of_files)) + 1
>>> format_exp = "pictures {{:>0{}}}.{{}}".format(no_of_digits)
>>> for fname in files:
...     # Discard the irrelevant portion
...     fname = fname.rsplit()[-1]
...     print(format_exp.format(*fname.split('.')))
...
pictures 001.jpg
pictures 002.jpg
pictures 010.jpg
pictures 100.jpg
Here's a Python solution:
import glob
import os
dirpath = r'c:\picture'
dirname = os.path.basename(dirpath)
filepath_list = glob.glob(os.path.join(dirpath, 'pic *.jpg'))
# number of digits needed for the file count
pad = len(str(len(filepath_list)))
for n, filepath in enumerate(filepath_list, 1):
    os.rename(
        filepath,
        os.path.join(dirpath, 'picture {:>0{}}.jpg'.format(n, pad))
    )
pad is calculated using file count len(filepath_list):
>>> len(str(100)) # if file count is 100
3
'picture {:>0{}}.jpg'.format(99, 3) is equivalent to 'picture {:>03}.jpg'.format(99). The format spec {:>03} zero-pads (0) and right-aligns (>) the input value (99 in the following example) to a width of 3.
>>> 'picture {:>0{}}.jpg'.format(99, 3)
'picture 099.jpg'
>>> 'picture {:>03}.jpg'.format(99)
'picture 099.jpg'
Documentation for the functions used:
enumerate
glob.glob
os.path.basename
os.path.join
str.format
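Note that the Python solution above numbers the files in whatever order glob returns them, so a file's new number need not match its old one. If the existing numbers should be kept (pic 1.jpg becoming picture 001.jpg, as in the question), one option is to parse the number out of each name. The sketch below assumes names of the form "pic N.jpg"; padded_names and rename_all are hypothetical names, and the pad width is taken from the largest number found rather than from the file count.

```python
import os
import re


def padded_names(dirpath, prefix="pic"):
    """Map each '<prefix> N.jpg' name to '<dirname> <zero-padded N>.jpg'."""
    dirname = os.path.basename(os.path.normpath(dirpath))
    pattern = re.compile(re.escape(prefix) + r" (\d+)\.jpg", re.IGNORECASE)
    found = {}
    for name in os.listdir(dirpath):
        m = pattern.fullmatch(name)
        if m:
            found[name] = int(m.group(1))
    if not found:
        return {}
    # assumption: pad to the width of the largest existing number
    pad = len(str(max(found.values())))
    return {
        name: "{} {:0{}d}.jpg".format(dirname, n, pad)
        for name, n in found.items()
    }


def rename_all(dirpath, prefix="pic"):
    """Apply the mapping from padded_names with os.rename."""
    for old, new in padded_names(dirpath, prefix).items():
        os.rename(os.path.join(dirpath, old), os.path.join(dirpath, new))
```

Printing the result of padded_names(r'c:\picture') before calling rename_all gives a preview of the renames, similar in spirit to the -WhatIf switch in PowerShell.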
Here's a PowerShell solution:
$jpgs = Get-ChildItem C:\Picture\*.jpg
$numDigits = "$($jpgs.Length)".Length
$formatStr = "{0:$('0' * $numDigits)}"
$jpgs | Where {$_.BaseName -match '(\d+)'} |
    Rename-Item -NewName {$_.DirectoryName + '\' + $_.Directory.Name + ($formatStr -f [int]$matches[1]) + $_.Extension} -WhatIf
Remove the -WhatIf parameter to actually execute the rename if the preview you get with -WhatIf looks good.
