I am looking for an efficient implementation for finding modified files in multiple directories.
I do know that I can just recursively go through all those directories and check the modification date of the files I want to check. This is quite trivial.
But what if I have a complex folder structure with many files in it? The approach above won't really scale and might take several minutes.
Is there a better way to probe for modifications? For example, is there something like a checksum on folders that I could use to narrow down the number of folders I have to check, or anything along those lines?
A second part of my problem is finding newly created files as well.
I am looking for a Python-based solution that is Windows-compatible.
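For reference, the trivial approach I mean is roughly the following (the root and baseline_ts names are just placeholders for my own paths and last-scan time):

import os

def files_modified_since(root, baseline_ts):
    """Yield paths under `root` whose modification time is newer than
    `baseline_ts` (e.g. the timestamp of the previous scan)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > baseline_ts:
                    yield path
            except OSError:
                # The file may have disappeared between listing and stat.
                continue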
My question concerns purely file-path reading and disk drives... I think.
I have Python code that needs to pull up a specific file whose file path I know exactly. And I have a choice: either I store that file in a large folder with thousands of other files, or I segment them all into sub-folders. Which choice would give faster reads?
My concern, and lack of knowledge, suggests that when code enters a big folder with thousands of other files, that is much more of a struggle than entering a folder with a few sub-folders. Or am I wrong, and it is all instant if I provide the exact file path?
Again, I don't have to scan the files or folders, since I know the exact file path, but I don't know what happens at the lower level with disk drives.
EDIT: Which of the two would be faster, given a standard HDD on Windows 7?
C:/Folder_with_millions_of_files/myfile.txt
or
C:/small_folder/small_folder254/small_folder323/myfile.txt
NOTE: What I need this for is not to scan thousands of files, but to pull up just that one file as quickly as possible. Sort of like a lookup table, I think.
Doing some reading on maximum scalability, it appears best practice is to split the folder into subfolders, although deeply nesting folders is not recommended; it is better to use a few larger folders than thousands of smaller ones.
Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.
From looking at these articles I would draw the following conclusion:
< 65,534 files (One folder should suffice)
> 65,534 files (Split into folders)
To allow for scalability in the future, it would be advisable to split the data across folders, but depending on the file system and observed performance, potentially creating a new folder per 65,534 items, or per day, category, etc. (a rough sketch of one such scheme follows the links below).
Based on:
single folder or many folders for storing 8 million images of hundreds of stores?
https://lwn.net/Articles/400629/
https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32
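To make the conclusion concrete, here is a rough sketch (not a definitive recipe) of one way to bucket a flat collection into subfolders by hashing each filename; the bucket count and directory layout are arbitrary choices:

import hashlib
import os
import shutil

def bucket_path(root, filename, buckets=1024):
    """Derive a subfolder from a hash of the filename, keeping every
    folder far below the ~65,534-file mark."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return os.path.join(root, "%04d" % bucket, filename)

def rebucket(flat_dir, root):
    """One-off migration: move files from a flat folder into buckets."""
    for name in os.listdir(flat_dir):
        dest = bucket_path(root, name)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(os.path.join(flat_dir, name), dest)

Reading a file later is then just open(bucket_path(root, name)), so the lookup cost does not depend on how many files exist in total.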
This is more of a general question about how feasible it is to store data sets under source control.
I have 20,000 CSV files with numeric data that I update every day. The overall size of the directory is 100 MB or so, stored on a local disk on an ext4 partition.
Each day's changes should amount to diffs of about 1 KB.
I may have to issue corrections to the data, so I am considering versioning the whole directory: one top-level dir contains 10 level-1 dirs, each containing 10 level-2 dirs, each containing 200 CSV files.
The data is written to the files by Python processes (pandas DataFrames).
The question is about the performance of writes when the deltas are this small compared to the entire data set.
SVN and Git come to mind, and both have Python modules to drive them.
What works best?
Other solutions are, I am sure, possible, but I would prefer to keep the data in files as-is...
If you're asking whether it would be efficient to put your datasets under version control, based on your description of the data, I believe the answer is yes. Both Mercurial and Git are very good at handling thousands of text files. Mercurial might be a better choice for you, since it is written in Python and is easier to learn than Git. (As far as I know, there is no good reason to adopt Subversion for a new project now that better tools are available.)
If you're asking whether there's a way to speed up your application's writes by borrowing code from a version control system, I think it would be a lot easier to make your application modify existing files in place. (Maybe that's what you're doing already? It's not clear from what you wrote.)
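If you go the Git route, a minimal sketch of the daily commit might look like the following, simply shelling out to the git command line from Python (the repository directory is assumed to already be initialised; GitPython or Mercurial's hglib would work along the same lines):

import subprocess

def commit_daily_update(repo_dir, message):
    """Stage everything that changed under the data directory and commit it."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # Only commit if something actually changed today.
    status = subprocess.run(["git", "status", "--porcelain"],
                            cwd=repo_dir, check=True,
                            capture_output=True, text=True)
    if status.stdout.strip():
        subprocess.run(["git", "commit", "-m", message],
                       cwd=repo_dir, check=True)

Calling this once per day after the pandas writes finish would give you the per-day history described above.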
I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
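For reference, the gen-find idea boils down to roughly the pattern below: a generator over os.walk, so matching paths are produced lazily instead of all being collected into one big list up front.

import fnmatch
import os

def gen_find(filepat, top):
    """Yield file paths under `top` that match a shell wildcard pattern."""
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Process JPGs one at a time instead of building a huge list first.
for jpg in gen_find("*.JPG", "/library/Modified"):
    print(jpg)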
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give all files numeric names from 1 to N (where N is the number of files in the directory) and keep a special file ".count" that holds the N for each directory. Then access them directly by the names produced by your random-number generator.
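A rough sketch of that scheme, assuming the files can safely be renamed and that no existing file already has a bare numeric name:

import os
import random

def rename_sequentially(directory):
    """One-off pass: rename every file to 1..N and record N in ".count"."""
    names = sorted(n for n in os.listdir(directory) if n != ".count")
    for i, name in enumerate(names, start=1):
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, str(i)))
    with open(os.path.join(directory, ".count"), "w") as f:
        f.write(str(len(names)))

def random_file(directory):
    """Pick a file by name without listing the directory at all."""
    with open(os.path.join(directory, ".count")) as f:
        n = int(f.read())
    return os.path.join(directory, str(random.randint(1, n)))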
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directories/files into a text file first using a batch file and then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.
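On Windows, that idea can be sketched like this: shell out to dir /s /b (or a batch file wrapping it), dump the listing to a text file, and read it back in Python.

import subprocess

def dump_and_read_listing(root, listing_path):
    """Dump every file path under `root` to a text file via the shell,
    then read the paths back into a Python list."""
    with open(listing_path, "w") as out:
        subprocess.run('dir /s /b "%s"' % root, shell=True,
                       stdout=out, check=True)
    with open(listing_path) as f:
        return [line.strip() for line in f if line.strip()]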
I have a directory that contains files of records. I just got access to a new directory that has the same records but additional files as well; the additional files are buried deep inside other folders and I can't find them.
So my solution would be to have a Python program run and delete all files that are duplicated between the two directories (and subdirectories), leaving the others intact, which will give me the "new files" I'm looking for.
I have seen a couple of programs that find duplicates, but I'm unsure how they really work, and they haven't been helpful.
Is there any way I can accomplish what I'm looking for?
Thanks!
Possible approach:
Create a set of MD5 hashes from your original folder.
Recursively MD5 hash the files in your new folder, deleting any files that generate hashes already present in your set.
A caveat to the above is that there is a chance two different files can generate the same hash. How different are the files?
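A rough sketch of that approach; to guard against the (unlikely) hash-collision caveat, you could additionally compare file sizes or the actual bytes before deleting anything.

import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks so large files are fine."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def delete_known_duplicates(original_dir, new_dir):
    """Hash everything under `original_dir`, then delete any file under
    `new_dir` whose contents hash to a value already seen."""
    known = set()
    for dirpath, _, filenames in os.walk(original_dir):
        for name in filenames:
            known.add(file_md5(os.path.join(dirpath, name)))
    for dirpath, _, filenames in os.walk(new_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if file_md5(path) in known:
                os.remove(path)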
Use fslint or some similar software. Fslint can, for example, give you a list of the duplicate files and hardlink the copies together, or delete the duplicates. Another option is to just use a diff-like program to diff the directories, if their internal structure is the same.
Do the duplicate files in both directories have the same name/path? If I understand correctly, you want to find the duplicate filenames rather than file contents? If so, a 'synchronised' walk with os.walk over both trees might be helpful.
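If it really is names/paths you care about, one simple version of that idea is to build a set of relative paths from each tree and take the difference (old_dir and new_dir are placeholders):

import os

def relative_paths(root):
    """All file paths under `root`, relative to `root`."""
    paths = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.add(os.path.relpath(os.path.join(dirpath, name), root))
    return paths

# Files that exist in the new tree but not in the original one.
new_only = relative_paths("new_dir") - relative_paths("old_dir")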
I'm trying to write a project that will have some autonomous components. One of these is the need to diff two folders and spit out the differing files into an array of strings. dircmp does part of this: it spits out the differing files. But it would appear it doesn't actually go into the remaining files to see which are different when compared against the same file in the other folder.
Currently I've played with difflib and filecmp, and unless I'm doing something entirely wrong, I can't find a way to achieve what I'm looking for without writing it all from scratch. The reason I need this is that the Python script will be deployed on Windows boxes where the standard Linux diff tools will not be available.
My only other thought would be to just call diff and such from the command line, but that doesn't solve either of my problems (getting the files in an array AND not requiring GNU tools).
Can anyone help me? I'm still a total scrub at Python and would really appreciate the expert advice. Thank you!
It seems that filecmp.dircmp does what you want already. If you compare two directories, diff_files will be a list of files which are in both directories, but whose contents differ:
>>> import filecmp
>>> dc = filecmp.dircmp('dir1', 'dir2')
>>> dc.diff_files
['foo']
As pointed out by Jonathanb, if you want actual diffs, it's easy to use difflib at this point to do so.
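For example, a small sketch combining the two (note that diff_files only covers the top level; nested directories would need a walk over dc.subdirs):

import difflib
import filecmp
import os

def diff_report(dir1, dir2):
    """Return a unified diff string for each common file whose contents differ."""
    dc = filecmp.dircmp(dir1, dir2)
    reports = []
    for name in dc.diff_files:
        left, right = os.path.join(dir1, name), os.path.join(dir2, name)
        with open(left) as f1, open(right) as f2:
            diff = "".join(difflib.unified_diff(
                f1.readlines(), f2.readlines(), fromfile=left, tofile=right))
        reports.append(diff)
    return reports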