Is file reading faster in nested folders, or does it not matter? - python

My question concerns purely file-path lookup and disk drives... I think.
I have Python code that needs to pull up a specific file whose exact file path I already know. I have a choice: either store that file in a large folder with thousands of other files, or segment them all into sub-folders. Which choice would give better reading speed?
My concern (and lack of knowledge) suggests that when code enters a big folder with thousands of other files, that is much more of a struggle than entering a folder with a few sub-folders. Or am I wrong, and it is all effectively instant if I supply the exact file path?
Again, I don't have to scan the files or folders, since I know the exact file path, but I don't know what happens at the lower level with disk drives.
EDIT: Which of the two would be faster, given a standard HDD on Windows 7?
C://Folder_with_millions_of_files/myfile.txt
or
C://small_folder/small_folder254/small_folder323/myfile.txt
NOTE: What I need this for is not to scan thousands of files but to pull up just that one file as quickly as possible. Sort of a lookup table, I think.

Doing some reading, it appears that for maximum scalability the best practice is to split the folder into subfolders, although nesting folders many levels deep is not recommended; it is better to use a few larger folders than thousands of smaller ones. As one of the articles linked below puts it:
Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.
From looking at these articles I would draw the following conclusion:
< 65,534 files: one folder should suffice
> 65,534 files: split into folders
To allow for scalability in the future it would be advisable to split the data across folders, creating a new folder per 65,534 items, or per day, per category, etc., depending on the file system and the performance you observe; a small sketch of such a bucketing scheme follows the links below.
Based on:
single folder or many folders for storing 8 million images of hundreds of stores?
https://lwn.net/Articles/400629/
https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32
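To make the bucketing idea concrete, here is a minimal sketch of one deterministic scheme, assuming you are free to choose where the files live; the hash-based two-level layout, the C:/data root and myfile.txt are illustrative choices rather than anything prescribed by the linked articles:

import hashlib
import os

def bucketed_path(root, filename, levels=2, width=2):
    # Hash the filename so files spread evenly; two levels of two hex
    # characters give 256 * 256 = 65,536 buckets, so no single directory
    # ever has to hold millions of entries.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)

# Writing: create the bucket folders once, then write normally.
path = bucketed_path("C:/data", "myfile.txt")
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    f.write("hello")

# Reading later needs no scanning at all: recompute the same path.
with open(bucketed_path("C:/data", "myfile.txt")) as f:
    print(f.read())

Because the sub-folder names are derived from the filename itself, the lookup stays a single open() on an exact path, which is the case the question cares about.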

Related

Fast substring search in a large text database?

I have a box of disk drives that contains backups of work and personal files over many years. Most of the files and directories are duplicates of other backups on other disks, or even on the same disk.
To consolidate this mess, I have created a csv file having the checksum, size and full path of each file. I then wrote a simple Python program using the pandas library to compute a checksum and size for each directory, which is simply the sum of the checksums and sizes of all files contained in the directory. The idea is to find all directories with identical content and delete all of them but one.
Unfortunately (but I expected this) the code runs for a few hours even on my test data set, which has about 1 million rows. The actual data set has about 10 million rows.
Here is the Python code fragment:
# for all directories, compute their checksum and total content size
df = pd.DataFrame(columns=['cksum', 'len', 'path'])
for i, path in enumerate(directories):
    # create new dataframe having all files in (or below) this directory
    items = data[data['path'].str.startswith(path)]
    # sum all checksums
    cksum = pd.to_numeric(items['cksum']).sum()
    # sum all file sizes ('total_len' avoids shadowing the built-in len)
    total_len = pd.to_numeric(items['len']).sum()
    # store result
    df.loc[i] = [cksum, total_len, path]
Obviously, the problem is that for each directory I have to find the directories and files it contains, and to identify those I do a startswith(path) string comparison, which is slow and has to run over all 1 (or 10) million rows for each directory. So we have an O(n^2) type of problem here.
I understand that my current algorithm is naive and I could come up with a much better one, but before investing time here I would like to learn whether a different approach might be more worthwhile:
Should I rather use a SQL database here? Think of a statement similar to SELECT cksum, len, path FROM files, directories WHERE leftstr(files.path, n) == directories.path;. But maybe this statement is just as expensive as its Python equivalent?
Is a different database or tool more suitable for this kind of text search? I was thinking of Apache Lucene, ElasticSearch, MongoDB, NoSQL, but I have no experience with these to decide which product to try.
Maybe somebody else has solved this deduplication problem already? I found a few commercial PC software products, but I am not sure if they can handle 10 million files.
Please advise.
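Not to pre-empt the database questions, but as a point of comparison, a minimal sketch of staying in pandas and replacing the per-directory startswith() scan with one grouped aggregation over each file's immediate parent directory; the cksum/len/path columns are taken from the question, while files.csv and everything else is an assumption:

import os
import pandas as pd

# Assumed input: one row per file with columns 'cksum', 'len', 'path'.
data = pd.read_csv("files.csv")
data[["cksum", "len"]] = data[["cksum", "len"]].apply(pd.to_numeric)

# Each file's immediate parent directory, computed once per row.
data["dir"] = data["path"].map(os.path.dirname)

# One grouped aggregation instead of one startswith() scan per directory.
per_dir = data.groupby("dir")[["cksum", "len"]].sum().reset_index()

# Directories sharing the same (cksum, len) totals are duplicate candidates.
# This only covers immediate contents; totals for ancestor directories can
# be rolled up afterwards from the much smaller per_dir frame.
dupes = per_dir[per_dir.duplicated(subset=["cksum", "len"], keep=False)]
print(dupes.sort_values(["cksum", "len"]))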

Medium datasets under source control

This is more of a general question about how feasible it is to store data sets under source control.
I have 20,000 CSV files with numeric data that I update every day. The overall size of the directory is 100 MB or so, stored on a local disk on an ext4 partition.
Each day's changes should be diffs of about 1 kB.
I may have to issue corrections to the data, so I am considering versioning the whole directory: one top-level dir contains 10 level-1 dirs, each containing 10 level-2 dirs, each containing 200 CSV files.
The data is written to the files by Python processes (pandas data frames).
The question is about the performance of writes when the deltas are this small compared to the entire data set.
svn and git come to mind, and both have Python modules for using them.
What works best?
Other solutions are, I am sure, possible, but I would prefer to stick with keeping the data in files as is...
If you're asking whether it would be efficient to put your datasets under version control, based on your description of the data, I believe the answer is yes. Both Mercurial and Git are very good at handling thousands of text files. Mercurial might be a better choice for you, since it is written in python and is easier to learn than Git. (As far as I know, there is no good reason to adopt Subversion for a new project now that better tools are available.)
If you're asking whether there's a way to speed up your application's writes by borrowing code from a version control system, I think it would be a lot easier to make your application modify existing files in place. (Maybe that's what you're doing already? It's not clear from what you wrote.)
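On the write-performance part of the question, a minimal sketch of what the daily step could look like if the CSV tree is versioned with Git and driven from the same Python process, assuming git init has already been run on the directory; the repository path and commit message are placeholders:

import datetime
import subprocess

REPO = "/path/to/csv-data"   # placeholder: root of the versioned CSV tree

def commit_daily_update(repo=REPO):
    # Stage every modified and newly created file under the repository.
    subprocess.run(["git", "-C", repo, "add", "--all"], check=True)
    # One commit per day; if nothing changed, git exits non-zero and we
    # simply report that instead of raising.
    message = "data update {:%Y-%m-%d}".format(datetime.date.today())
    result = subprocess.run(["git", "-C", repo, "commit", "-m", message])
    return result.returncode == 0

if __name__ == "__main__":
    commit_daily_update()

The application still modifies the CSV files in place; version control is only a cheap snapshot layered on top, which is why small daily deltas stay small in the repository as well.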

Python - Fast search for modified files in multiple folders

I am looking for an efficient implementation for finding modified files in multiple directories.
I do know that I can just recursively go through all those directories and check the modification dates of the files I want to check. This is quite trivial.
But what if I have a complex folder structure with many files in it? The above approach won't really scale and might take several minutes.
Is there a better approach to probe for modifications? For example, is there something like a checksum on folders that I could use to narrow down the number of folders I have to check, or anything like that?
A second part of my problem is also finding newly created files.
I am looking for a Python-based solution that is Windows-compatible.
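As a baseline for comparison, a minimal sketch of the snapshot-and-diff approach using only the standard library (so it runs on Windows), assuming it is acceptable to keep a small state file between runs; it finds both modified and newly created files in one os.walk pass, and every name and path below is a placeholder:

import json
import os

SNAPSHOT = "snapshot.json"   # placeholder: state file kept between runs

def scan(root):
    # One pass over the tree: map every file path to its mtime.
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                state[full] = os.path.getmtime(full)
            except OSError:
                pass  # file disappeared between listing and stat
    return state

def changed_files(root):
    # Compare the current tree against the previous snapshot (if any).
    try:
        with open(SNAPSHOT) as f:
            old = json.load(f)
    except (OSError, ValueError):
        old = {}
    new = scan(root)
    created = [p for p in new if p not in old]
    modified = [p for p in new if p in old and new[p] != old[p]]
    with open(SNAPSHOT, "w") as f:
        json.dump(new, f)
    return created, modified

created, modified = changed_files(r"C:\some\folder")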

Performance of walking directory structure vs reading file with contents of ls (or similar command) for performing search

Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog? Or are there other methods which are better suited which I haven't hit upon?
I have a 3.5TB external HDD with thousands of files.
I have a set of files which list the contents of a directory. These listing files hold a folder name, filenames and file sizes.
I want to search the external HDD for the files in these listing files. If a file is found I then want to check and see if the file size of the actual file matches that in the listing file.
This process will cover about 1,000 listing files and probably tens of thousands of actual files.
A listing file would have contents like
folder: SummerPhotos
name: IMG0096.jpg, length: 6589
name: IMG0097.jpg, length: 6489
name: IMG0098.jpg, length: 6500
name: IMG0099.jpg, length: 6589
name: BeachPhotos/IMG0100.jpg, length, 34892
name: BeachPhotos/IMG0101.jpg, length, 34896
I like the idea of processing the listing files offline against a file that lists the contents of the external HDD, because then I can perform this operation on a faster computer (the hard drive is attached to an old computer acting as a server), or split the listing files over several computers and divide up the work. Plus, I think that continually walking the directory structure is about as inefficient as you can get, and it puts unnecessary wear on the hardware.
Walk pseudo code:
for listing_path in listing_files:
    # parse_listing() is a stand-in for whatever reads one listing file and
    # returns the base folder name plus a list of (name, expected_size) pairs
    base_foldername, filelist = parse_listing(listing_path)
    for root, subfolders, files in os.walk('/path/to/3.5TBdrive'):
        if base_foldername not in subfolders:
            continue
        folder = os.path.join(root, base_foldername)
        for name, expected_size in filelist:
            full = os.path.join(folder, name)
            if not os.path.isfile(full):
                not_found(full)
            elif os.path.getsize(full) == expected_size:
                dosomething(full)
            else:
                somethingelse(full)
For the catalog file method I'm thinking of dumping a recursive 'ls' to a file and then pretty much doing a string search on that file. I'll extract the file size and perform a match there.
My 'ls -RlQ' dump file is 11 MB in size with ~150k lines. If there is a better way to get the required data I'm open to suggestions. I'm thinking of using os.walk() to compile the list and create my own file in a format I like, rather than trying to parse my ls output.
I feel like I should be doing something to make my college professors proud, like building a hashtable or a balanced tree, but I suspect the effort to implement that would take longer than simply brute-forcing the solution with CPU cycles.
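On the hashtable point, a plain dict keyed by relative path already is one, and it falls out of the catalog idea almost for free; a minimal sketch, assuming the catalog is built from a single walk of the drive and the listing files are parsed elsewhere (parse_listing in the comment is a hypothetical helper for your own listing format):

import os

def build_catalog(root):
    # One walk of the drive: map relative path -> size for O(1) lookups.
    catalog = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            catalog[os.path.relpath(full, root)] = os.path.getsize(full)
    return catalog

def check_listing(folder, entries, catalog):
    # entries: (name, expected_size) pairs parsed from one listing file.
    for name, expected_size in entries:
        rel = os.path.join(folder, name)     # e.g. SummerPhotos/IMG0096.jpg
        if rel not in catalog:
            print("not found:", rel)
        elif catalog[rel] != expected_size:
            print("size mismatch:", rel, catalog[rel], "!=", expected_size)

# Example wiring (parse_listing() is hypothetical):
# catalog = build_catalog("/mnt/3.5TBdrive")
# check_listing("SummerPhotos", parse_listing("listing1.txt"), catalog)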
OS: Linux
preferred programming language: Python 2/3
Thanks!
Is it better to walk a directory structure when performing multiple searches or is it a good idea to catalog the directory structure (in a file or memory) and then operate on that catalog?
If you just want to check whether a file exists, or the directory structure is not too complex, I suggest you simply use your filesystem. You're basically duplicating work that it already does, and this will lead to problems in the future, as complexity always does.
I don't see any point in using hashtables or balanced trees as in-program data structures - this is also what your filesystem already does. What you should do instead to speed up lookups is design a deep directory structure rather than a few single directories that contain thousands of files. There are filesystems that choke while trying to list directories with tens of thousands of files, so it is a better idea to limit yourself to a few thousand and create a new level of directory depth should you exceed that.
For example, if you keep logs of internet-wide scanning research with a single file for each host you scanned, you don't want one directory scanning-logs with files such as 1.1.1.1.xml, 1.1.1.2.xml and so on. Instead, a naming scheme such as scanning-logs/1/1/1.1.1.1.xml is a better idea.
Also, watch out for the inode limit! I was once building a large file-based database on an ext4 filesystem. One day I started getting error messages like "no space left on device" even though I clearly had quite a lot of space left. The real reason was that I had created too many inodes - the limit can be set manually when creating a volume.

How can I efficiently select 100 random JPG files from a directory (including subdirs) in Python?

I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
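To make the generator pipeline concrete, a minimal sketch in that spirit, pairing a lazy os.walk generator with a reservoir sample so the full list of JPGs is never built in memory; the /library/Modified root is taken from the question and everything else is an assumption:

import fnmatch
import os
import random

def iter_jpgs(root):
    # Lazily yield every *.jpg path below root, matching case-insensitively.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if fnmatch.fnmatch(name.lower(), "*.jpg"):
                yield os.path.join(dirpath, name)

def reservoir_sample(iterable, k=100):
    # Uniform sample of k items from a stream of unknown length.
    sample = []
    for i, item in enumerate(iterable):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

picks = reservoir_sample(iter_jpgs("/library/Modified"), k=100)

It still has to visit every directory once, but memory use stays flat; skipping the walk entirely requires extra assumptions about the layout, along the lines of the next answer.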
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to perform the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give all files numeric names from 1 to N (where N is the number of files in the directory) and keep a special file ".count" that holds N for each directory. Then access files directly, with their names generated by a random number generator.
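A minimal sketch of what reading such a directory back could look like, assuming the one-off renaming pass has already happened and each directory holds files named 1.jpg ... N.jpg plus the ".count" file described above; the directory path is a placeholder:

import os
import random

def random_picks(directory, k=100, ext=".jpg"):
    # Read N from ".count", then build k file names directly,
    # so the directory never has to be listed at all.
    with open(os.path.join(directory, ".count")) as f:
        n = int(f.read().strip())
    indices = random.sample(range(1, n + 1), min(k, n))
    return [os.path.join(directory, "{}{}".format(i, ext)) for i in indices]

picks = random_picks("/library/Modified/2012/FolderName")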
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directory/file listing into a text file first using a batch file, and then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.
