How is os.path.isfile implemented? - python

I am working on a program which needs to do several searches through a folder that may contain well over 20,000 files to see if a certain file exists. Does os.path.isfile iterate through every file in a directory, or does it use a more efficient method? And would dividing these 20,000 files between different sub-directories speed up that lookup that isfile has to do?
Note: I am using python 3

Internally it uses the stat system call, so it runs as fast as the filesystem and OS allow; Python itself does not iterate over the directory.
Whether splitting a huge directory into multiple subdirectories speeds things up depends very much on the OS and the filesystem implementation. But usually yes: the fewer files in a directory, the better.
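A minimal sketch of what the answer describes, not CPython's actual source: os.path.isfile amounts roughly to one os.stat call on the full path plus a check that the result is a regular file. The helper names isfile_via_stat and build_name_index are made up for illustration. The second helper is an alternative for many lookups in the same folder, and it assumes the folder does not change between checks.

```python
import os
import stat


def isfile_via_stat(path):
    """Roughly what os.path.isfile does: one stat call, no directory scan in Python."""
    try:
        st = os.stat(path)
    except (OSError, ValueError):
        return False
    return stat.S_ISREG(st.st_mode)


def build_name_index(folder):
    """Read the directory once and keep the names of its regular files in a set."""
    with os.scandir(folder) as entries:
        return {entry.name for entry in entries if entry.is_file()}
```

With the index built once, each later check is a set membership test such as "myfile.txt" in names, at the cost of holding all names in memory and possibly going stale.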

Related

Is file reading faster in nested folders or doesn't matter?

My question concerns purely file-path reading and Disk Drives...I think.
I have Python code that needs to open a specific file whose exact path I know. I can either store that file in one large folder with thousands of other files, or split the files into sub-folders. Which choice gives faster reads?
My worry, given my lack of knowledge, is that opening a file in a big folder with thousands of other files is much more work than going through a folder with a few sub-folders. Or am I wrong, and is it all effectively instant when I supply the exact file-path?
Again, I don't have to scan the files or folders, as I know the exact file-path, but I don't know what happens at the lower level with disk drives.
EDIT: Which of the two would be faster given standard HDD on Windows 7?
C://Folder_with_millions_of_files/myfile.txt
or
C://small_folder/small_folder254/small_folder323/myfile.txt
NOTE: What I need this for is not to scan thousands of files but to pull up just that one file as quickly as possible. Sort of a lookup table I think this is.
From some reading on maximum scalability, best practice appears to be splitting the folder into subfolders, although nesting multiple folders is not recommended, and it is better to use a few larger folders than thousands of smaller ones:
Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.
From looking at these articles I would draw the following conclusion,
< 65,534 files (One folder should suffice)
> 65,534 files (Split into folders)
To allow for scalability in the future it would be advisable to split the data across folders, potentially creating a new folder per 65,534 items or per day, category, etc., depending on the file system and observed performance.
Based on,
single folder or many folders for storing 8 million images of hundreds of stores?
https://lwn.net/Articles/400629/
https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32
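One way to split files across folders without choosing categories by hand is to derive the subfolder names from a hash of the file name. This is only a sketch of that idea, not something the linked articles prescribe; the function name sharded_path and the defaults (two levels, two hex characters each) are assumptions.

```python
import hashlib
import os


def sharded_path(root, filename, levels=2, width=2):
    """Return root/ab/cd/filename, where 'abcd...' starts the SHA-256 hex digest of the name."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters as folder names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

With the defaults each level has at most 256 folders, so files spread over up to 65,536 leaf folders, and the path for a known file name can be recomputed without any directory listing.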

Python: Os Filesize returning Garbage for /dev/core

I've written a python script that goes through my filesystem (LINUX) and collects the sizes of files.
Here is the relevant bit:
for name in files:
    file_name = os.path.join(root, name)
    file_size = os.stat(file_name).st_size
    # Report anything larger than 1 TB.
    if file_size > 1000000000000:
        print(file_size, file_name)
So the script returns the size and path of any files greater than a terabyte in size. (There are no such files on my system --- my SSD space is 120GB.) It prints the following output:
140737486266368 /dev/core
140737486266368 /proc/kcore
But I know that these files are not this large. Why am I getting these erroneous values?
I should note that I have run the script as root. I have permission to access these files. What's going wrong here?
The problem is that files in /dev and /proc are not ordinary files but views into devices and, e.g., the kernel. If you check the size of that file (it is actually the same file, just symlinked), you will notice that even ls -l reports an insanely large size.
The best approach is to skip at least /dev, /proc, /sys, and /run folders (thanks, user3553031). Another possibility would be to check the file attributes - they'll reveal these are special files. However, it might be easier to just ignore the special folders.
Unfortunately, this is highly OS specific, and the above instructions are for Linux. Even different distributions may have different special files, and BSD, Windows &c. may act differently.
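A hedged sketch of both suggestions combined, assuming a typical Linux layout: the walk prunes the special folders and also keeps only regular files. The name large_regular_files and the SKIP_DIRS set are illustrative. Note that the regular-file check filters out device nodes, sockets and symlinks, but some entries under /proc can still report themselves as regular files, which is why the folder pruning is kept as well.

```python
import os
import stat

# Assumed list of pseudo-filesystem mount points on a typical Linux system.
SKIP_DIRS = {"/dev", "/proc", "/sys", "/run"}


def large_regular_files(top, min_size, skip_dirs=SKIP_DIRS):
    """Yield (size, path) for regular files under `top` of at least `min_size` bytes."""
    for root, dirs, files in os.walk(top):
        # Prune in place so os.walk never descends into the skipped folders.
        dirs[:] = [d for d in dirs if os.path.join(root, d) not in skip_dirs]
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or cannot be examined
            if stat.S_ISREG(st.st_mode) and st.st_size >= min_size:
                yield st.st_size, path
```

A call such as large_regular_files("/", 10**12) would then report only ordinary files of a terabyte or more.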

Performance of walking directory structure vs reading file with contents of ls (or similar command) for performing search

Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog? Or are there other methods which are better suited which I haven't hit upon?
I have a 3.5TB external HDD with thousands of files.
I have a set of files which list the contents of a directory. These listing files hold a folder name, filenames and file sizes.
I want to search the external HDD for the files in these listing files. If a file is found I then want to check and see if the file size of the actual file matches that in the listing file.
This process will cover about 1000 listing files and probably 10s of thousands of actual files.
A listing file would have contents like
folder: SummerPhotos
name: IMG0096.jpg, length: 6589
name: IMG0097.jpg, length: 6489
name: IMG0098.jpg, length: 6500
name: IMG0099.jpg, length: 6589
name: BeachPhotos/IMG0100.jpg, length, 34892
name: BeachPhotos/IMG0101.jpg, length, 34896
I like the offline processing of the listing files with a file which lists the contents of the external HDD because then I can perform this operation on a faster computer (as the hard drive is on an old computer acting as a server) or split the listing files over several computers and split up the work. Plus I think that continually walking the directory structure is about as inefficient as you can get and putting unnecessary wear on the hardware.
Walk pseudo code:
for each listing file
    get base_foldername,filelist
    for root,subfolder,files in os.walk(/path/to/3.5TBdrive)
        if base_foldername in subfolder
            for file in filelist
                if file in files
                    if file.size == os.path.getsize(file)
                        dosomething
                    else
                        somethingelse
                else
                    not_found
For the catalog file method I'm thinking of dumping a recursive 'ls' to file and then pretty much doing a string search on that file. I'll extract the filesize and perform a match there.
My 'ls -RlQ' dump file is 11MB in size with ~150k lines. If there is a better way to get the required data I'm open to suggestions. I'm thinking of using the os.walk() to compile a list and create my own file in a format I like vs trying to parse my ls command.
I feel like I should be doing something to make my college professors proud, like building a hashtable or balanced tree, but I suspect the effort to implement that will take longer than simply brute-forcing the solution with CPU cycles.
OS: Linux
preferred programming language: Python 2/3
Thanks!
Is it better to walk a directory structure when performing multiple
searches or is it a good idea to catalog the directory structure (in a
file or memory) and then operate on that catalog?
If you just want to check whether a file exists, or the directory structure is not too complex, I suggest you just use your filesystem. Otherwise you're duplicating work that it already does, and the added complexity will lead to problems in the future.
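As a rough sketch of "just use your filesystem" for the listing files in the question: each entry is checked with a single os.path.getsize call instead of a walk. It assumes the folder named in the listing sits directly under the drive root, which the question does not state; the names check_listing and ENTRY_RE are made up.

```python
import os
import re

# Matches lines such as "name: IMG0096.jpg, length: 6589" (also "length, 6589").
ENTRY_RE = re.compile(r"^name:\s*(?P<name>.+?),\s*length[:,]\s*(?P<size>\d+)\s*$")


def check_listing(drive_root, listing_lines):
    """Yield (relative_path, status); status is 'ok', 'size_mismatch' or 'not_found'."""
    folder = ""
    for line in listing_lines:
        line = line.strip()
        if line.startswith("folder:"):
            folder = line[len("folder:"):].strip()
            continue
        match = ENTRY_RE.match(line)
        if not match:
            continue  # ignore blank or unrecognised lines
        rel_path = os.path.join(folder, match.group("name"))
        try:
            actual = os.path.getsize(os.path.join(drive_root, rel_path))
        except OSError:
            yield rel_path, "not_found"
            continue
        expected = int(match.group("size"))
        status = "ok" if actual == expected else "size_mismatch"
        yield rel_path, status
```

If the folder can be anywhere on the drive, a one-time walk that maps folder names to paths would be needed first.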
I don't see any point in using hashtables or balanced trees as in-program data structures; this is what your filesystem already does. What you should do instead to speed up lookups is design a deeper directory structure rather than a few directories that each contain thousands of files. There are filesystems that choke while trying to list directories with tens of thousands of files, so it is better to limit yourself to a few thousand per directory and create a new level of directory depth should you exceed that.
For example, if you want to keep logs of your internet-wide scanning research, if you use a single file for each host you scanned, you don't want to create a directory scanning-logs with files such as 1.1.1.1.xml, 1.1.1.2.xml and so on. Instead, naming such as scanning-logs/1/1/1.1.1.1.xml is a better idea.
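The naming scheme in that example can be sketched as a small helper. The function name log_path is hypothetical, and using the first two octets as directory levels is read off the example path rather than stated as a rule.

```python
import os


def log_path(base_dir, ip_address):
    """Map '1.1.1.2' to base_dir/1/1/1.1.1.2.xml using the first two octets."""
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address, got %r" % ip_address)
    return os.path.join(base_dir, octets[0], octets[1], ip_address + ".xml")
```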
Also, watch out for the inode limit! I was once building a large file-based database on EXT4 filesystem. One day I started getting error messages like "no space left on device" even though I clearly had quite a lot of space left. The real reason was that I created too many inodes - the limit can be manually set while creating a volume.
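Inode usage can be checked before it becomes a problem. A small sketch using os.statvfs, which exists on Unix only; the helper name inode_usage is made up, and some filesystems report zero for these fields, in which case the numbers are not meaningful.

```python
import os


def inode_usage(path):
    """Return (used, total) inode counts for the filesystem containing `path`."""
    vfs = os.statvfs(path)  # Unix only
    total = vfs.f_files          # total number of inodes
    used = total - vfs.f_ffree   # f_ffree is the number of free inodes
    return used, total
```

On the command line, df -i shows the same figures.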

How can I efficiently select 100 random JPG files from a directory (including subdirs) in Python?

I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
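A sketch in the spirit of that gen-find example, written from memory rather than copied from the slides, paired with reservoir sampling, which is a technique swapped in here and not part of the answer. It still walks the whole tree, but it never holds the full list of JPGs in memory. The names gen_find and reservoir_sample are illustrative.

```python
import fnmatch
import os
import random


def gen_find(pattern, top):
    """Lazily yield paths under `top` whose file name matches `pattern` (case-insensitive)."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if fnmatch.fnmatch(name.lower(), pattern.lower()):
                yield os.path.join(root, name)


def reservoir_sample(iterable, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    sample = []
    for index, item in enumerate(iterable):
        if index < k:
            sample.append(item)
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

For the layout in the question this could be called as reservoir_sample(gen_find("*.jpg", "/library/Modified"), 100).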
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in predictable order. Say, give all files numeric name from 1 to N (where N is the number of files in directory) and have a special file ".count" which will hold the N for each directory. Then access them directly with their names generated by random generator.
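The renaming scheme could look roughly like this once the files are already numbered. The ".count" file comes from the answer; the function name, the ".jpg" extension and the exact file-name pattern are assumptions.

```python
import os
import random


def random_numbered_files(directory, k, extension=".jpg", rng=random):
    """Pick k distinct files named 1<ext>..N<ext>, where N is stored in '.count'."""
    with open(os.path.join(directory, ".count")) as fh:
        total = int(fh.read().strip())
    # Draw distinct numbers; no directory listing is needed at all.
    numbers = rng.sample(range(1, total + 1), min(k, total))
    return [os.path.join(directory, "%d%s" % (n, extension)) for n in numbers]
```

The price is that every add or delete has to keep the numbering dense and the ".count" file up to date.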
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directories/files into a text file first using a batch file, then get Python to read the file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.

How to determine number of files on a drive with Python?

I have been trying to figure out how to retrieve (quickly) the number of files on a given HFS+ drive with python.
I have been playing with os.statvfs and such, but can't quite get anything (that seems helpful to me).
Any ideas?
Edit: Let me be a bit more specific. =]
I am writing a timemachine-like wrapper around rsync for various reasons, and would like a very fast estimate (does not have to be perfect) of the number of files on the drive rsync is going to scan. This way I can watch the progress from rsync (if you call it like rsync -ax --progress, or with the -P option) as it builds its initial file list, and report a percentage and/or ETA back to the user.
This is completely separate from the actual backup, which is no problem tracking progress. But with the drives I am working on with several million files, it means the user is watching a counter of the number of files go up with no upper bound for a few minutes.
I have tried playing with os.statvfs with exactly the method described in one of the answers so far, but the results do not make sense to me.
>>> import os
>>> os.statvfs('/').f_files - os.statvfs('/').f_ffree
64171205L
The more portable way gives me around 1.1 million on this machine, which is the same as every other indicator I have seen on this machine, including rsync running its preparations:
>>> sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
1084224
Note that the first method is instantaneous, while the second one made me come back 15 minutes later to update because it took just that long to run.
Does anyone know of a similar way to get this number, or what is wrong with how I am treating/interpreting the os.statvfs numbers?
The right answer for your purpose is to live without a progress bar once, store the number rsync came up with and assume you have the same number of files as last time for each successive backup.
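A sketch of that approach, assuming the wrapper can keep a small state file between runs. The JSON layout, the key "file_count" and all function names are hypothetical.

```python
import json


def load_previous_count(state_file, default=None):
    """Return the file count saved by the last run, or `default` if there is none."""
    try:
        with open(state_file) as fh:
            return int(json.load(fh)["file_count"])
    except (OSError, ValueError, KeyError, TypeError):
        return default


def save_count(state_file, file_count):
    """Remember the count rsync reported so the next run can use it as the total."""
    with open(state_file, "w") as fh:
        json.dump({"file_count": int(file_count)}, fh)


def progress_percent(seen, previous_total):
    """Estimate progress; None on the first run, capped at 100 if the drive grew."""
    if not previous_total:
        return None
    return min(100.0, 100.0 * seen / previous_total)
```

On the very first run progress_percent returns None, which the wrapper can show as a plain counter without an upper bound.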
I didn't believe it, but this seems to work on Linux:
os.statvfs('/').f_files - os.statvfs('/').f_ffree
This computes the total number of file nodes (inodes) minus the free file nodes. It seems to show results for the whole filesystem even if you point it at another directory. os.statvfs is implemented on Unix only.
OK, I admit, I didn't actually let the 'slow, correct' way finish before marveling at the fast method. Just a few drawbacks: I suspect .f_files would also count directories, and the result is probably totally wrong. It might work to count the files the slow way, once, and adjust the result from the 'fast' way?
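The adjustment idea could be sketched as a simple ratio. This assumes the ratio between the fast statvfs figure and the real file count stays roughly stable on a given drive, which is untested here; the function name calibrated_estimate is made up.

```python
def calibrated_estimate(fast_now, fast_then, slow_then):
    """Scale the current fast count by the ratio seen in one earlier slow count."""
    if fast_then <= 0:
        return slow_then  # no usable calibration point; fall back to the old count
    return int(round(fast_now * slow_then / fast_then))
```

With the figures from the question, the calibration point would be fast_then=64171205 and slow_then=1084224.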
The portable way:
import os
files = sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
os.walk returns a 3-tuple (dirpath, dirnames, filenames) for each directory in the filesystem starting at the given path. This will probably take a long time for "/", but you knew that already.
The easy way:
Let's face it, nobody knows or cares how many files they really have, it's a humdrum and nugatory statistic. You can add this cool 'number of files' feature to your program with this code:
import random
num_files = random.randint(69000, 4000000)
Let us know if any of these methods works for you.
See also How do I prevent Python's os.walk from walking across mount points?
You could use a number from a previous rsync run. It is quick, portable, and for 10**6 files and any reasonable backup strategy it will give you 1% or better precision.
If traversing the directory tree is an option (would be slower than querying the drive directly):
import os
dirs = 0
files = 0
for r, d, f in os.walk('/path/to/drive'):
    dirs += len(d)
    files += len(f)
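Since rsync is run with -x in the question, the count probably should not cross mount points either. A variant of the loop above that prunes them, as a sketch; the function name count_on_one_filesystem is illustrative.

```python
import os


def count_on_one_filesystem(top):
    """Count directories and files under `top` without descending into mount points."""
    dirs_seen = 0
    files_seen = 0
    for root, dirs, files in os.walk(top):
        # Drop subdirectories that are mount points so os.walk skips them.
        dirs[:] = [d for d in dirs if not os.path.ismount(os.path.join(root, d))]
        dirs_seen += len(dirs)
        files_seen += len(files)
    return dirs_seen, files_seen
```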
Edit: Spotlight does not track every file, so its metadata will not suffice.
