Fast substring search in a large text database? - python

I have a box of disk drives that contains backups of work and personal files over many years. Most of the files and directories are duplicates of other backups on other disks, or even on the same disk.
To consolidate this mess, I created a CSV file containing the checksum, size, and full path of each file. I then wrote a simple Python program using the pandas library to compute a checksum and size for each directory, which is simply the sum of the checksums and sizes of all files contained in that directory. The idea is to find all directories with identical content and delete all but one of them.
Unfortunately (though I expected this), the code runs for a few hours even on my test data set, which has about 1 million rows. The actual data set has about 10 million rows.
Here is the Python code fragment:
import pandas as pd

# for all directories, compute their checksum and total content size
df = pd.DataFrame(columns=['cksum', 'len', 'path'])
i = 0
for path in directories:
    # create a new dataframe having all files in this directory
    items = data[data['path'].str.startswith(path)]
    # sum all checksums
    cksum = pd.to_numeric(items['cksum']).sum()
    # sum all file sizes (renamed so we do not shadow the built-in len)
    total_len = pd.to_numeric(items['len']).sum()
    # store result
    df.loc[i] = [cksum, total_len, path]
    i += 1
Obviously, the problem is that for each directory I have to find the directories and files it contains, and to identify those I do a startswith(path) string comparison, which is slow, and I need to run it against 1 (or 10) million rows for every directory. So we have an O(n^2) type of problem here (a rough vectorized alternative is sketched after this question).
I understand that my current algorithm is naive and I could come up with a much better one, but before investing time here I would like to learn whether a different approach might be more worthwhile:
Should I rather use a SQL database here? Think of a statement similar to SELECT cksum, len, path FROM files, directories WHERE leftstr(files.path, n) == directories.path;. But maybe such a statement is just as expensive as its Python equivalent?
Is a different database or tool more suitable for this kind of text search? I was thinking of Apache Lucene, ElasticSearch, MongoDB, or another NoSQL store, but I have no experience with any of these to decide which product to try.
Maybe somebody else has already solved this deduplication problem? I found a few commercial PC software products, but I am not sure whether they can handle 10 million files.
Please advise.
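A minimal sketch of one vectorized direction, assuming the paths in the CSV use '/' separators and that summing per immediate parent directory and then rolling totals up to every ancestor is what the loop above is meant to compute; `data` and the column names come from the question, everything else is hypothetical:

import os
import pandas as pd

data['cksum'] = pd.to_numeric(data['cksum'])
data['len'] = pd.to_numeric(data['len'])
# group once by the immediate parent directory instead of scanning all rows per directory
data['parent'] = data['path'].map(os.path.dirname)
per_dir = data.groupby('parent')[['cksum', 'len']].sum()

# roll each directory's totals up to all of its ancestors
totals = {}
for directory, row in per_dir.iterrows():
    d = directory
    while d:
        cks, ln = totals.get(d, (0, 0))
        totals[d] = (cks + row['cksum'], ln + row['len'])
        parent = os.path.dirname(d)
        if parent == d:
            break
        d = parent

df = pd.DataFrame(
    [(cks, ln, p) for p, (cks, ln) in totals.items()],
    columns=['cksum', 'len', 'path'],
)

This replaces the per-directory substring scan with a single groupby plus one pass over the parent paths, so the cost grows roughly linearly with the number of rows instead of quadratically.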

Related

Is file reading faster in nested folders or doesn't matter?

My question concerns purely file-path reading and disk drives... I think.
I have Python code that needs to pull up a specific file whose file path I know exactly. I have a choice: either I store that file in a large folder with thousands of other files, or I segment them all into sub-folders. Which choice would give more reading speed?
My concern (and lack of knowledge) suggests that when code enters a big folder with thousands of other files, that is much more of a struggle than entering a folder with a few sub-folders. Or am I wrong, and is it all instant if I provide the exact file path?
Again, I don't have to scan the files or folders, since I know the exact file path, but I don't know what happens at the lower level with disk drives.
EDIT: Which of the two would be faster, given a standard HDD on Windows 7?
C://Folder_with_millions_of_files/myfile.txt
or
C://small_folder/small_folder254/small_folder323/myfile.txt
NOTE: What I need this for is not to scan thousands of files but to pull up just that one file as quickly as possible. Sort of like a lookup table, I think.
Doing some reading, it appears that for maximum scalability the best practice is to split the folder into subfolders, although deeply nesting folders is not recommended; it is better to use multiple larger folders than thousands of smaller ones:
Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.
From looking at these articles I would draw the following conclusion:
< 65,534 files (One folder should suffice)
> 65,534 files (Split into folders)
To allow for future scalability it would be advisable to split the data across folders and, based on the file system and observed performance, potentially create a new folder per 65,534 items, or per day, category, etc. (a rough bucketing sketch follows the links below).
Based on:
single folder or many folders for storing 8 million images of hundreds of stores?
https://lwn.net/Articles/400629/
https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32
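A rough sketch of one way to bucket files into capped subfolders so the full path stays computable from the file name alone (the lookup-table behaviour asked about); the root path, the 256-bucket choice, and the hashing scheme are illustrative assumptions, not recommendations from the linked articles:

import hashlib
import os

def bucketed_path(root, filename):
    # derive a stable two-hex-digit bucket (00..ff) from the file name,
    # giving 256 subfolders, far below the 65,534 mark noted above
    bucket = hashlib.md5(filename.encode('utf-8')).hexdigest()[:2]
    folder = os.path.join(root, bucket)
    os.makedirs(folder, exist_ok=True)   # exist_ok requires Python 3.2+
    return os.path.join(folder, filename)

# usage: the path is reproducible from the name alone, no directory scan needed
path = bucketed_path(r'C:\library', 'myfile.txt')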

Best approach to find the last n files created in a folder in Linux, based on a timestamp comparison

I need a list of files which were created after a certain point in time.
I used the following piece of code to achieve this:
import os
import datetime

for p, ds, fs in os.walk(idirpath):
    for fn in fs:
        filepath = os.path.join(p, fn)
        if datetime.datetime.fromtimestamp(os.path.getmtime(filepath)) >= datetime.datetime.strptime(lasfilecredt, '%Y-%m-%d %H:%M:%S'):
            filelist.append((filepath, os.path.getmtime(filepath)))
Where:
idirpath holds the path under which the files to be selected are present.
lasfilecredt holds the timestamp value used for identifying the required files.
filelist is used to store the identified files along with their modification time (os.path.getmtime returns modification time, not creation time).
Is there a better way to achieve this? I had tried subprocess.popen in the past but wanted to go with this option for ease of maintaining my code in the future.
Are there any specific scenarios where the above piece of code might not correctly identify the required files?
The above piece of code is being used to make sure that previously processed files are not selected again in the next run (the '>=' in the if condition is there so we err on the side of re-selecting boundary files rather than missing any; otherwise there would be no reason not to use '>').
This code, when deployed in production, seems to fail randomly, at least twice in three months.
Need some suggestions/help to identify the scenarios when this could possibly fail.
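A minimal sketch of one variant of the same loop, parsing the cutoff once and comparing raw epoch seconds from os.stat; the variable names follow the question, and the comments only flag assumptions about why runs might appear to fail, not a confirmed diagnosis:

import os
import datetime

# parse the cutoff string once instead of once per file (.timestamp() requires Python 3.3+)
cutoff = datetime.datetime.strptime(lasfilecredt, '%Y-%m-%d %H:%M:%S').timestamp()

filelist = []
for p, ds, fs in os.walk(idirpath):
    for fn in fs:
        filepath = os.path.join(p, fn)
        st = os.stat(filepath)   # one stat call per file
        # st_mtime is the last *modification* time, not the creation time; files
        # modified after creation, clock skew between machines, or truncation of
        # sub-second precision in the stored cutoff can all make runs appear to
        # "randomly" miss or re-select files
        if st.st_mtime >= cutoff:
            filelist.append((filepath, st.st_mtime))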

How can I efficiently select 100 random JPG files from a directory (including subdirs) in Python?

I have a very large directory of files and folders. Currently, I scan the entire directory for JPGs and store them in a list. This is really slow due to the size of the directory. Is there a faster, more efficient way to do this? Perhaps without scanning everything?
My directory looks like this:
/library/Modified/2000/[FolderName]/Images.JPG
/library/Modified/2001/[FolderName]/Images.JPG
/library/Modified/2002/[FolderName]/Images.JPG
/library/Modified/2003/[FolderName]/Images.JPG
/library/Modified/2004/[FolderName]/Images.JPG
...
/library/Modified/2012/FolderName/Images.JPG
Thanks
See Generator Tricks for System Programmers for a bunch of neat stuff. But specifically, see the gen-find example. This is as efficient as you are going to get, without making a bunch of assumptions about your file structure layout.
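For reference, a minimal sketch in the spirit of the gen-find recipe (an os.walk generator filtered with fnmatch); the reservoir-sampling step at the end is an added assumption about how the 100 random files could be drawn without materialising the full list, not part of the referenced example:

import os
import random
from fnmatch import fnmatch

def gen_find(pattern, top):
    # lazily yield matching paths instead of building one big list up front
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if fnmatch(name, pattern):
                yield os.path.join(dirpath, name)

# keep 100 uniformly random JPGs while streaming (reservoir sampling)
sample = []
for i, path in enumerate(gen_find('*.JPG', '/library/Modified')):
    if i < 100:
        sample.append(path)
    else:
        j = random.randint(0, i)
        if j < 100:
            sample[j] = path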
Assuming that your application is the only one changing the directory, that you have control over the directory names/structure, and that you have to do the operation described in your question more than once:
Rename all the files once so you can access them in a predictable order. Say, give all files numeric names from 1 to N (where N is the number of files in the directory) and keep a special file ".count" which holds N for each directory. Then access them directly by names produced with a random number generator.
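A rough sketch of that scheme, assuming one flat folder whose ".count" file simply stores N as text; the numeric ".jpg" naming is an illustrative assumption:

import os
import random

def pick_random_files(directory, k=100):
    # read N from the ".count" file maintained by the one-off renaming step
    with open(os.path.join(directory, '.count')) as fh:
        n = int(fh.read().strip())
    # choose k distinct indices without listing the directory at all
    indices = random.sample(range(1, n + 1), min(k, n))
    return [os.path.join(directory, '%d.jpg' % i) for i in indices]

# usage (hypothetical layout)
paths = pick_random_files('/library/Modified/2012/FolderName')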
I don't know where the slowness occurs, but to scan directories and files I found it much faster to dump the directory/file listing into a text file first using a batch file and then have Python read that file. This worked well on our server system with 7 servers and many thousands of directories.
Python could, of course, run the batch file.

Recursive Searching and MySql Comparison

Good evening. I am looking at developing some code that will collect EXIF data from JPEG images and store it in a MySQL database using Python v2.x. The stumbling block is that the JPEGs are scattered across a number of subdirectories, and further subdirectories, within a root folder; so for example 200 JPEGs may be stored in root > subroot > subsubroot1, and a further 100 in root > subroot > subroot2. Once all images are identified, they will be scanned and their respective EXIF data extracted before being added to a MySQL table.
At the moment I am just at the planning stage, but I am wondering: what would be the most efficient and Pythonic way to carry out the recursive search? I am looking to scan the root directory, append any newly identified subdirectories to a list, and then scan all subdirectory paths in the list for further subdirectories until I have a total list of all directories. This just seems a clumsy and rather repetitive way to do it, though, IMHO, so I assume there may be a more OOP manner of carrying out this function.
Similarly, I am only looking to add new info to my MySQL table, so what would be the most efficient way to establish whether an entry already exists? The filename, both in the table and on disk, will be the image's MD5 hash value. I was considering scanning through the table at the beginning of the run and placing all filenames in a set, so that before scanning a new JPEG, if an entry already exists in the set, there would be no need to extract the EXIF and I could move on to the next picture. Is this an efficient method, however, or would it be better to query the MySQL table whenever a new image is encountered? I anticipate the set method may be the most efficient; however, the table may eventually contain tens of millions of entries, so loading the filenames for all those entries into a set (volatile memory) may not be the best idea.
Thanks folks.
I would just write a function that scanned a directory for all files; if it's a jpeg, add the full path name of the jpeg to the list of results. If it's a directory, then immediately call the function with the newly discovered directory as an argument. If it's another type of file, do nothing. This is a classic recursive divide-and-conquer strategy. It will break if there are loops in your directory path, for instance with symlinks -- if this is a danger for you, then you'll have to make sure you don't traverse the same directory twice by finding the "real" non-symlinked path of each directory and recording it.
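A minimal sketch of that recursive scan, assuming plain os.listdir/os.path primitives; os.path.realpath is used here as one way to record the "real" non-symlinked path mentioned above:

import os

def find_jpegs(directory, seen=None):
    # classic divide-and-conquer: collect JPEGs, recurse into subdirectories
    if seen is None:
        seen = set()
    real = os.path.realpath(directory)
    if real in seen:                 # guard against symlink loops
        return []
    seen.add(real)

    results = []
    for name in os.listdir(directory):
        full = os.path.join(directory, name)
        if os.path.isdir(full):
            results.extend(find_jpegs(full, seen))
        elif name.lower().endswith(('.jpg', '.jpeg')):
            results.append(full)
    return results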
How to avoid duplicate entries is a trickier problem and you have to consider whether you are tolerant of two differently-named files with the exact same contents (and also consider the edge cases of symlinked or multiply-hard-linked files), how new files appear in the directories you are scanning and whether you have any control over that process. One idea to speed it up would be to use os.path.getmtime(). Record the moment you start the directory traversal process. Next time around, have your recursive traversal process ignore any jpeg files with an mtime older than your recorded time. This can't be your only method of keeping track because files modified between the start and end times of your process may or may not be recorded, so you will still have to check the database for those records (for instance using the full path, a hash on the file info or a hash on the data itself, depending on what kind of duplication you're intolerant of) but used as a heuristic it should speed up the process greatly.
You could theoretically load all filenames (probably paths and not filenames) from the database into memory to speed up comparison, but if there's any danger of the table becoming very large it would be better to leave that info in the database. For instance, you could create a hash from the filename, and then simply add that to the database with a UNIQUE constraint -- the database will reject any duplicate entries, you can catch the exception and go on your way. This won't be slow if you use the aforementioned heuristic checking file mtime.
Be sure you account for the possibility of files that may be only modified and not newly created, if that's important for your application.
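A rough sketch of the UNIQUE-constraint idea, assuming the MySQLdb driver and a table named images with a UNIQUE constraint on its md5 column; the table layout and column names are made up for illustration:

import hashlib
import MySQLdb   # assumed driver; any DB-API module with IntegrityError works similarly

def store_if_new(conn, filepath, exif_blob):
    # hash the file contents (or the path, depending on what counts as a duplicate)
    with open(filepath, 'rb') as fh:
        digest = hashlib.md5(fh.read()).hexdigest()
    cur = conn.cursor()
    try:
        # the UNIQUE constraint on md5 makes the database reject duplicates for us
        cur.execute(
            "INSERT INTO images (md5, path, exif) VALUES (%s, %s, %s)",
            (digest, filepath, exif_blob),
        )
        conn.commit()
        return True        # new image, EXIF stored
    except MySQLdb.IntegrityError:
        conn.rollback()    # duplicate key: nothing to do
        return False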

How to determine number of files on a drive with Python?

I have been trying to figure out how to retrieve (quickly) the number of files on a given HFS+ drive with Python.
I have been playing with os.statvfs and such, but can't quite get anything (that seems helpful to me).
Any ideas?
Edit: Let me be a bit more specific. =]
I am writing a timemachine-like wrapper around rsync for various reasons, and would like a very fast estimate (does not have to be perfect) of the number of files on the drive rsync is going to scan. This way I can watch the progress from rsync (if you call it like rsync -ax --progress, or with the -P option) as it builds its initial file list, and report a percentage and/or ETA back to the user.
This is completely separate from the actual backup, for which tracking progress is no problem. But with the drives I am working on, which hold several million files, the user ends up watching a counter of the number of files climb with no upper bound for a few minutes.
I have tried playing with os.statvfs with exactly the method described in one of the answers so far, but the results do not make sense to me.
>>> import os
>>> os.statvfs('/').f_files - os.statvfs('/').f_ffree
64171205L
The more portable way gives me around 1.1 million on this machine, which matches every other indicator I have seen, including rsync running its preparations:
>>> sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
1084224
Note that the first method is instantaneous, while the second one made me come back 15 minutes later to update because it took just that long to run.
Does anyone know of a similar way to get this number, or what is wrong with how I am treating/interpreting the os.statvfs numbers?
The right answer for your purpose is to live without a progress bar once, store the number rsync came up with, and assume you have the same number of files as last time for each successive backup.
I didn't believe it, but this seems to work on Linux:
os.statvfs('/').f_files - os.statvfs('/').f_ffree
This computes the total number of file nodes (inodes) minus the number of free file nodes. It seems to show results for the whole filesystem even if you point it at another directory. os.statvfs is implemented on Unix only.
OK, I admit, I didn't actually let the 'slow, correct' way finish before marveling at the fast method. Just a few drawbacks: I suspect .f_files would also count directories, and the result is probably totally wrong. It might work to count the files the slow way, once, and adjust the result from the 'fast' way?
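A small sketch of that calibration idea, assuming the ratio between the statvfs figure and a real walk-based count stays roughly stable between runs; that stability is an assumption, not a verified property of the filesystem:

import os

def fast_count(path='/'):
    st = os.statvfs(path)
    return st.f_files - st.f_ffree        # used inodes: files, directories, and more

def slow_count(path='/'):
    return sum(len(fs) for _, _, fs in os.walk(path))

# calibrate once with the slow walk, then reuse the ratio on later runs
ratio = slow_count('/') / float(fast_count('/'))

def estimated_count(path='/'):
    return int(fast_count(path) * ratio)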
The portable way:
import os
files = sum(len(filenames) for path, dirnames, filenames in os.walk("/"))
os.walk returns a 3-tuple (dirpath, dirnames, filenames) for each directory in the filesystem starting at the given path. This will probably take a long time for "/", but you knew that already.
The easy way:
Let's face it, nobody knows or cares how many files they really have, it's a humdrum and nugatory statistic. You can add this cool 'number of files' feature to your program with this code:
import random
num_files = random.randint(69000, 4000000)
Let us know if any of these methods works for you.
See also How do I prevent Python's os.walk from walking across mount points?
You could use a number from a previous rsync run. It is quick, portable, and for 10**6 files and any reasonable backup strategy it will give you 1% or better precision.
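A minimal sketch of that approach, assuming you are willing to persist the previous run's count in a small state file and can obtain the count from rsync's output or your own tally; the file location and key names are placeholders:

import json
import os

STATE_FILE = os.path.expanduser('~/.backup_filecount.json')   # hypothetical location

def load_previous_count(default=0):
    try:
        with open(STATE_FILE) as fh:
            return json.load(fh)['files']
    except (OSError, IOError, ValueError, KeyError):
        return default          # first run: no estimate available yet

def save_count(count):
    with open(STATE_FILE, 'w') as fh:
        json.dump({'files': count}, fh)

# usage: size the progress estimate with last run's total, then record this run's total
expected = load_previous_count()
# ... run rsync, tally the files it reports, then:
# save_count(files_seen_this_run)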
If traversing the directory tree is an option (would be slower than querying the drive directly):
import os
dirs = 0
files = 0
for r, d, f in os.walk('/path/to/drive'):
    dirs += len(d)
    files += len(f)
Edit: Spotlight does not track every file, so its metadata will not suffice.
