I have been trying to figure out how to retrieve (quickly) the number of files on a given HFS+ drive with python.
I have been playing with os.statvfs and such, but can't quite get anything (that seems helpful to me).
Any ideas?
Edit: Let me be a bit more specific. =]
I am writing a Time Machine-like wrapper around rsync for various reasons, and would like a very fast estimate (it does not have to be perfect) of the number of files on the drive rsync is going to scan. That way I can watch the progress from rsync (if you call it like rsync -ax --progress, or with the -P option) as it builds its initial file list, and report a percentage and/or ETA back to the user.
This is completely separate from the actual backup, whose progress is no problem to track. But with the drives I am working on, which hold several million files, the user ends up watching a counter of the number of files climb with no upper bound for a few minutes.
I have tried playing with os.statvfs with exactly the method described in one of the answers so far, but the results do not make sense to me.
>>> import os
>>> os.statvfs('/').f_files - os.statvfs('/').f_ffree
64171205L
The more portable way gives me around 1.1 million, which matches every other indicator I have seen for this machine, including rsync running its preparations:
>>> sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
1084224
Note that the first method is instantaneous, while the second one made me come back 15 minutes later to update because it took just that long to run.
Does anyone know of a similar way to get this number, or what is wrong with how I am treating/interpreting the os.statvfs numbers?
The right answer for your purpose is to live without a progress bar once, store the number rsync came up with, and assume you have the same number of files as last time for each successive backup.
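A minimal sketch of that approach, assuming the previous run was logged with --stats (the log path, and the regex over rsync's summary wording, are assumptions on my part):
import re

def estimate_from_last_run(log_path='last_backup.log', default=1000000):
    # Pull the "Number of files" line out of the saved --stats output of the
    # previous backup; fall back to a default if no log exists yet.
    try:
        with open(log_path) as f:
            match = re.search(r'Number of files:\s*([\d,]+)', f.read())
        return int(match.group(1).replace(',', '')) if match else default
    except OSError:
        return default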
I didn't believe it, but this seems to work on Linux:
os.statvfs('/').f_files - os.statvfs('/').f_ffree
This computes the total number of file nodes (inodes) minus the free file nodes. It seems to show results for the whole filesystem even if you point it at another directory. os.statvfs is implemented on Unix only.
OK, I admit, I didn't actually let the 'slow, correct' way finish before marveling at the fast method. Just a few drawbacks: I suspect .f_files would also count directories, and the result is probably totally wrong. It might work to count the files the slow way, once, and adjust the result from the 'fast' way?
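A sketch of that calibration idea (all names are made up): pay for the slow walk once, remember how far off the statvfs figure was, and scale the instant figure on later runs.
import os

def used_inodes(path='/'):
    # Instantaneous, but counts directories and other non-files too.
    st = os.statvfs(path)
    return st.f_files - st.f_ffree

# One-time calibration against the slow, correct count.
actual_files = sum(len(filenames) for _, _, filenames in os.walk('/'))
correction = actual_files / used_inodes('/')

# Later runs: instant, and roughly right as long as the mix of files vs.
# directories on the drive has not shifted much.
estimate = int(used_inodes('/') * correction)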
The portable way:
import os
files = sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
os.walk returns a 3-tuple (dirpath, dirnames, filenames) for each directory in the filesystem starting at the given path. This will probably take a long time for "/", but you knew that already.
The easy way:
Let's face it, nobody knows or cares how many files they really have, it's a humdrum and nugatory statistic. You can add this cool 'number of files' feature to your program with this code:
import random
num_files = random.randint(69000, 4000000)
Let us know if any of these methods works for you.
See also How do I prevent Python's os.walk from walking across mount points?
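For reference, the usual trick from that question, sketched here with a made-up mount point: prune any subdirectory that is itself a mount point so the walk stays on a single volume.
import os

mount_point = '/Volumes/Backup'  # hypothetical mount point of the drive

count = 0
for path, dirnames, filenames in os.walk(mount_point):
    # Editing dirnames in place tells os.walk not to descend into them.
    dirnames[:] = [d for d in dirnames
                   if not os.path.ismount(os.path.join(path, d))]
    count += len(filenames)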
You could use a number from a previous rsync run. It is quick, portable, and for 10**6 files and any reasonable backup strategy it will give you 1% or better precision.
If traversing the directory tree is an option (it will be slower than querying the drive directly):
import os
dirs = 0
files = 0
for r, d, f in os.walk('/path/to/drive'):
    dirs += len(d)
    files += len(f)
Edit: Spotlight does not track every file, so its metadata will not suffice.
Related
I have a directory with 3 million+ files in it (which I should have avoided creating in the first place). Using os.scandir() to simply print out the names,
for f in os.scandir():
    print(f)
takes .004 seconds per item for the first ~200,000 files, but then drastically slows down to .3 seconds per item. Trying it again, it did the same thing: fast for the first ~200,000, then slowed way down.
After waiting an hour and running it again, this time it was fast for the first ~400,000 files but then slowed down in the same way.
The files all start with a year between 1908 and 1963, so I've tried reorganizing the files using bash commands like
for i in {1908..1963}; do
    mkdir ../test-folders/$i;
    mv $i* ../test-folders/$i/;
done
But it ends up getting hung up and never making it anywhere...
Any advice on how to reorganize this huge folder or more efficiently list the files in the directory?
It sounds like using an iterator, a function that only returns one item at a time instead of putting everything in memory, would be best.
The glob library has the function iglob:
for infile in glob.iglob(os.path.join(rootdir, '*.*')):
    …
Documentation: https://docs.python.org/3/library/glob.html#glob.iglob
Related question and answer: https://stackoverflow.com/a/17020892/7838574
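For example, to lazily iterate over just the files whose names start with one year (the directory name here is assumed), something like:
import glob
import os

rootdir = '.'  # the directory holding the ~3 million files

# iglob yields matching paths one at a time; the full multi-million-entry
# listing is never materialized as a list.
for infile in glob.iglob(os.path.join(rootdir, '1908*')):
    print(infile)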
Oof. That's a lot of files. I'm not sure why Python starts slowing down; that is interesting. But there are many reasons why you're having problems. For one, a directory can be thought of as a special type of file that just holds the filenames/data pointers of all the files in it (grossly simplified). Listing it, like accessing any file, can be faster at times when the OS is caching some of that information in memory in order to speed up disk access across the system as a whole.
It seems strange that Python gets slower; maybe you're hitting an internal memory limit or some other mechanism in Python.
But let's fix the root of the problem. Your bash script is problematic because every time you use a * character you force the shell to read the entire directory (and likely sort it alphabetically) too. It might be wiser to get the list once and then operate on sections of it. Maybe something like:
/bin/ls -1 > /tmp/allfiles
for i in {1908..1963}; do
    echo "moving files starting with $i"
    mkdir ../test-folders/$i
    mv $(egrep "^$i" /tmp/allfiles) ../test-folders/$i/
done
This will read the directory only once (sort of) and will keep you informed about how fast it's going.
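The same idea also fits in Python, if you would rather stay in one language. A sketch (destination path as in the question, the rest assumed): read the directory listing once, then move each file into a bucket named after its four-digit year prefix.
import os
import shutil

src = '.'                      # the huge directory
dest_root = '../test-folders'  # as in the question

for year in range(1908, 1964):
    os.makedirs(os.path.join(dest_root, str(year)), exist_ok=True)

# Read the listing once, up front, instead of re-globbing for every year.
names = [entry.name for entry in os.scandir(src) if entry.is_file()]

for name in names:
    prefix = name[:4]
    if prefix.isdigit() and 1908 <= int(prefix) <= 1963:
        shutil.move(os.path.join(src, name),
                    os.path.join(dest_root, prefix, name))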
I have a box of disk drives that contains backups of work and personal files over many years. Most of the files and directories are duplicates of other backups on other disks, or even on the same disk.
To consolidate this mess, I have created a CSV file holding the checksum, size, and full path of each file. I then wrote a simple Python program using the pandas library to compute a checksum and size for each directory, which is simply the sum of the checksums and sizes of all files contained in that directory. The idea is to find all directories with identical content and delete all of them but one.
Unfortunately (though I expected this) the code runs for a few hours even on my test data set, which has about 1 million rows. The actual data set has about 10 million rows.
Here is the Python code fragment:
import pandas as pd

# data is the DataFrame loaded from the CSV (cksum, len, path per file);
# directories is the list of directory paths to aggregate.

# for all directories, compute their checksum and total content size
df = pd.DataFrame(columns=['cksum', 'len', 'path'])
i = 0
for path in directories:
    # create new dataframe having all files in this directory
    items = data[data['path'].str.startswith(path)]
    # sum all checksums
    cksum = pd.to_numeric(items['cksum']).sum()
    # sum all file sizes
    size = pd.to_numeric(items['len']).sum()
    # store result
    df.loc[i] = [cksum, size, path]
    i += 1
Obviously, the problem is that for each directory I have to find the directories and files it contains, and to identify those I do a startswith(path) string comparison, which is slow and has to be run against all 1 (or 10) million rows for every directory. So we have an O(n^2) type of problem here.
I understand that my current algorithm is naive and I could come up with a much better one, but before investing time here I would like to learn whether a different approach might be more worthwhile:
Should I rather use a SQL database here? Think of a statement similar to SELECT cksum, len, path FROM files, directories WHERE leftstr(files.path, n) == directories.path;. But maybe this statement is just as expensive as its Python equivalent?
Is a different database or tool more suitable for this kind of text search? I was thinking of Apache Lucene, Elasticsearch, MongoDB, or another NoSQL store, but I have no experience with these to decide which product to try.
Maybe somebody else has already solved this deduplication problem? I found a few commercial PC software products but I am not sure whether they can handle 10 million files.
Please advise.
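A hedged sketch of one way to sidestep the repeated startswith() scan (column names as in the fragment above, the CSV file name is a placeholder): tag every row with its parent directory once, then let a single groupby pass aggregate all directories at the same time. Rolling the per-directory sums up into ancestor directories can then be done in one extra pass over the paths, keeping the whole job roughly O(n).
import os
import pandas as pd

# 'checksums.csv' stands in for the CSV described above, with columns
# cksum, len, and path.
data = pd.read_csv('checksums.csv')

# One vectorized pass instead of a startswith() scan per directory.
data['dir'] = data['path'].map(os.path.dirname)
data['cksum'] = pd.to_numeric(data['cksum'])
data['len'] = pd.to_numeric(data['len'])

# Sums over each directory's immediate files, for every directory at once.
per_dir = data.groupby('dir')[['cksum', 'len']].sum()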
I am working on a program which needs to do several searches through a folder that may contain well over 20,000 files, to see whether a certain file exists. Does os.path.isfile iterate through every file in the directory, or does it use a more efficient method? And would dividing these 20,000 files between different sub-directories speed up the lookup that isfile has to do?
Note: I am using Python 3.
Internally it uses the stat system call and works at whatever speed the filesystem and OS provide.
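Roughly speaking, os.path.isfile(path) boils down to a single stat() call on that exact path, with no directory listing involved; a sketch of the equivalent:
import os
import stat

def is_regular_file(path):
    # One stat() on the path; the directory's 20,000 entries are never read.
    try:
        return stat.S_ISREG(os.stat(path).st_mode)
    except OSError:
        return False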
Whether splitting a huge directory into multiple subdirectories speeds things up depends very much on the OS and the filesystem implementation. But usually yes: the fewer files in a directory, the better.
I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
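A small sketch in the spirit of that gen-find example, using only the standard library: a generator that yields matching paths as the walk proceeds, so you can start processing JPGs before the scan finishes.
import fnmatch
import os

def gen_find(pattern, top):
    # Yield matching file paths one at a time while walking the tree.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

# Usage: images start arriving immediately instead of after a full scan.
for jpg in gen_find('*.JPG', '/library/Modified'):
    print(jpg)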
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give all files numeric names from 1 to N (where N is the number of files in the directory) and keep a special file ".count" that holds N for each directory. Then access them directly, with names produced by a random-number generator.
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directory/file listing into a text file first using a batch file, and then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.
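A rough sketch of that combination, assuming a Windows host (the paths, and dir /b /s standing in for whatever the batch file does, are placeholders):
import subprocess

listing = r'C:\temp\allfiles.txt'

# Let the shell dump one full path per line, then read the text file back.
with open(listing, 'w') as out:
    subprocess.run(['cmd', '/c', 'dir', '/b', '/s', r'D:\data'],
                   stdout=out, check=True)

with open(listing) as f:
    paths = [line.rstrip('\n') for line in f]

print(len(paths), 'entries listed')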
Is it possible to check if a file is done copying or whether it is complete, using Python?
Or even on the command line.
I manipulate files programmatically in a specific folder on Mac OS X, but I need to check that a file is complete before running the code that does the manipulation.
There's no notion of "file completeness" in the Unix/Mac OS X filesystem. You could either try locking the file with flock or, simpler, copy the files to a subdir of the destination directory temporarily, then move them out once they're fully copied (assuming you have control over the program that does the copying). Moving is an atomic operation; you'll know the file is completely copied once it appears at the expected path.
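A sketch of that staging pattern (directory names are made up; note that os.rename is only atomic when source and destination are on the same filesystem):
import os
import shutil

watched = '/path/to/watched/folder'           # where the manipulation code looks
staging = os.path.join(watched, '.incoming')  # temp subdir on the same volume
os.makedirs(staging, exist_ok=True)

def deliver(src_path):
    name = os.path.basename(src_path)
    tmp = os.path.join(staging, name)
    shutil.copy2(src_path, tmp)                   # slow part: may be seen half-written
    os.rename(tmp, os.path.join(watched, name))   # atomic: appears only when complete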
Take the MD5 of the file before you copy and then again whenever you think you are done copying; when they match, you are good to go. Use md5 from the hashlib module for this.
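A sketch of that check with hashlib, reading in chunks so large files do not have to fit in memory:
import hashlib

def md5sum(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# The copy is complete once md5sum(source_path) == md5sum(copy_path).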
If you know where the files are being copied from, you can check to see whether the size of the copy has reached the size of the original.
Alternatively, if a file's size doesn't change for a couple of seconds, it is probably done being copied, which may be good enough. (May not work well for slow network connections, however.)
It seems like you have control of the (Python?) program doing the copying. What commands are you using to copy? I would think writing your code so that it blocks until the copy operation is complete would be sufficient.
Is this program multi-threaded or multi-process? If so, you could add file paths to a queue when they are complete and then have the other thread only act on items in the queue.
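A sketch of that hand-off with the standard queue module (copier, consumer, and process are made-up stand-ins for your copying and manipulation code):
import queue
import shutil
import threading

done = queue.Queue()

def copier(src_paths, dest_dir):
    for src in src_paths:
        dest = shutil.copy2(src, dest_dir)  # returns only after the copy finishes
        done.put(dest)                      # announce completed files only
    done.put(None)                          # sentinel: nothing more is coming

def consumer(process):
    while True:
        path = done.get()
        if path is None:
            break
        process(path)  # the manipulation code never sees a partial file

# Hypothetical usage:
threading.Thread(target=copier, args=(['a.dat', 'b.dat'], '/dest'), daemon=True).start()
consumer(print)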
You can use lsof and parse the list of open handles. If some process still has an open handle on the file (i.e. is still writing it), you will find it there.
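A sketch of that check (it assumes lsof is installed, as it is by default on Mac OS X; lsof exits 0 when at least one process has the file open and 1 when none does):
import subprocess

def still_being_written(path):
    result = subprocess.run(['lsof', path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0  # 0 means some process still holds it open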
You can do this:
import os
import time

# Get the file size two times, now and after 3 seconds.
size_1 = os.path.getsize(file_path)
time.sleep(3)
size_2 = os.path.getsize(file_path)

# Compare the sizes.
if size_1 == size_2:
    # Do something.
    pass
else:
    # Do something else.
    pass
You can change the time to whatever suits your need.