Hashing 1000 Image Files as Quickly as Possible (2000x2000+ Resolution) (Python)

I have a folder with several thousand RGB 8-bit-per-channel image files on my computer, anywhere between 2000x2000 and 8000x8000 in resolution (so most of them are extremely large).
I would like to store some small value, such as a hash, for each image so that I have a value to easily compare to in the future to see if any image files have changed. There are three primary requirements in the calculation of this value:
The calculation of this value needs to be fast
The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
Collisions should basically never happen.
There are a lot of ways I could go about this, such as sha1, md5, etc, but the real goal here is speed, and really just any extremely quick way to identify if ANY change at all has been made to an image.
How would you achieve this in Python? Is there a particular hash algorithm you recommend for speed? Or can you devise a different way to achieve my three goals altogether?

The calculation of this value needs to be fast
The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes.
(The hash should not take filename into account).
Collisions should basically never happen.
Hashing large files takes time (how long depends on the hashing algorithm), so if it needs to be fast, choose an efficient algorithm for your task; comparisons of the common algorithms are easy to find. But before computing any hash, you can optimize your approach by checking something cheaper first.
If you decide to use hashing, this requirement is covered: the hash value will change even if only a small part of the image has changed.
Collisions can happen (very rarely, but never say never); this is the nature of hash algorithms.
Example for the first point (optimizing the algorithm):
Check file size.
If sizes are equal, check CRC
If the CRCs are equal, then calculate and check the hash (both require a full pass over the file).
Optionally, before checking full hashes, you can calculate and compare hashes of only part of each file instead of the whole thing.
If most of your files are likely to differ, then checking the cheaper things before calculating the hash will probably be faster.
But if most of your files will be identical, the steps before hashing just consume more time, because you will end up calculating the hash for most of the files anyway.
So try to implement the most efficient approach for your context; a minimal sketch of this size/CRC/hash cascade follows.
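Here is a minimal sketch of that cascade, assuming you have stored the previous size, a CRC of the first chunk, and a full hash for each file; the function names, the 1 MB partial-CRC size, and the choice of blake2b are illustrative, not requirements:

import hashlib
import os
import zlib

def quick_fingerprint(path, crc_bytes=1 << 20):
    # Cheap first pass: file size plus a CRC of the first ~1 MB
    # (the partial check mentioned above).
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        crc = zlib.crc32(f.read(crc_bytes))
    return size, crc

def full_hash(path, block_size=65536):
    # Full-content hash; blake2b in the standard library is fast,
    # but md5 or sha256 would be used the same way.
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def file_changed(path, old_size, old_crc, old_hash):
    # Run the cheap checks first; only hash the whole file if they match.
    size, crc = quick_fingerprint(path)
    if (size, crc) != (old_size, old_crc):
        return True
    return full_hash(path) != old_hash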

Related

More efficient data structure/algorithm to find similar imagehashes in database

I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that would allow me to search a lot more efficiently, but I wasn't able to find much that matches my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
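As a rough illustration of the dictionary variant, here is a sketch that assumes each hash has already been converted to a 64-bit integer (for imagehash you can do int(str(h), 16)); the names and the four-way 16-bit split are illustrative:

from collections import defaultdict

CHUNKS = 4           # split a 64-bit hash into four 16-bit parts
BITS_PER_CHUNK = 16

def chunks(h):
    # Yield (position, 16-bit part) pairs for a 64-bit integer hash.
    for pos in range(CHUNKS):
        yield pos, (h >> (pos * BITS_PER_CHUNK)) & 0xFFFF

def hamming(a, b):
    return bin(a ^ b).count("1")

index = defaultdict(list)   # (position, part) -> list of (hash, filename)

def add_image(h, filename):
    for key in chunks(h):
        index[key].append((h, filename))

def find_similar(h, max_distance=3):
    # Any hash within distance 3 must share at least one of the four parts,
    # so only those buckets need a full hamming-distance check.
    seen, matches = set(), []
    for key in chunks(h):
        for other_hash, filename in index[key]:
            if filename in seen:
                continue
            seen.add(filename)
            if hamming(h, other_hash) <= max_distance:
                matches.append((filename, other_hash))
    return matches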

What is the fastest method to loop through a very large number of accounts on a login page?

I was wondering: if I had 1 million users, how much time would it take to loop over every account to check whether the username and email are already in use on the registration page, or whether the username and password are correct on the login page? Won't it take ages if I do it with a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken goes up by a factor of roughly ten, rather than one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
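A minimal sketch of this lookup in Python, using the standard-library bisect module on a list kept sorted by username (the sample data is illustrative):

from bisect import bisect_left

def username_exists(sorted_usernames, target):
    # Binary search over a list that is kept sorted.
    i = bisect_left(sorted_usernames, target)
    return i < len(sorted_usernames) and sorted_usernames[i] == target

users = sorted(["alice", "bob", "carol", "dave"])
print(username_exists(users, "carol"))    # True
print(username_exists(users, "mallory"))  # False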
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutiae of implementing these ideas in a production system. However, real-world systems exist that implement these ideas (and many more) for precisely the kind of situation presented.
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature that causes it to produce different hashes for different executions of the program. Otherwise, you can import hashlib and use some hash function there and convert the output to an integer using e.g. int.from_bytes. Once you get your number, you can take the modulus (or remainder after division, using the % operator) w.r.t. the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend reading up on recommended load factors, but intuitively, creating it larger than necessary by some constant multiplicative factor is generally the best bet. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions, a costly operation, but if you increase capacity multiplicatively this should not happen too often).
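As a minimal sketch of the first (chaining) strategy, with a fixed capacity and string keys (the class and method names are illustrative):

import hashlib

class ChainedHashTable:
    # hashlib digest -> integer -> modulus by capacity, with a per-slot
    # list ("chaining") to handle collisions, as described above.
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.slots = [[] for _ in range(capacity)]

    def _index(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest, "big") % self.capacity

    def put(self, key, value):
        slot = self.slots[self._index(key)]
        for i, (k, _) in enumerate(slot):
            if k == key:
                slot[i] = (key, value)   # overwrite an existing entry
                return
        slot.append((key, value))

    def get(self, key, default=None):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        return default

table = ChainedHashTable()
table.put("alice", "stored-password-hash")   # illustrative values
print(table.get("alice"))                    # stored-password-hash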

How to calculate the same hash for two similar files?

I want to apply a hashing algorithm where the hash is the same if two files are similar. If one bit is lost, the hash of the files changes. Is there any algorithm I can apply in Python to tackle this problem?
Thank you
I heard block hashing does this, but I don't know how to apply it.
I applied the following algorithm, but it does not help:
import hashlib

file = "Annotation 2020-04-09 163448.png"  # Location of the file (can be set a different way)
BLOCK_SIZE = 65536  # The size of each read from the file

file_hash = hashlib.sha256()  # Create the hash object, can use something other than `.sha256()` if you wish
with open(file, 'rb') as f:  # Open the file to read its bytes
    fb = f.read(BLOCK_SIZE)  # Read from the file. Take in the amount declared above
    while len(fb) > 0:  # While there is still data being read from the file
        file_hash.update(fb)  # Update the hash
        fb = f.read(BLOCK_SIZE)  # Read the next block from the file

print(file_hash.hexdigest())  # Get the hexadecimal digest of the hash
The entire point of typical hashing algorithms is that the hash becomes completely different if even one bit of the source file differs, to ensure that generating hash collisions is hard. Here are some workarounds:
The only robust way to find "similar" but not identical files is to compare the file contents of every pair and compute a similarity score. This is rather inefficient, however, since it is an O(n^2) algorithm with frequent hard-drive round trips.
Another way is to hash only a part of each file. This has the same problem: if even one bit of that part differs, the hashes differ. However, you may be able to ignore spaces, markup, or capitalization, hash only the file headers, or drop the last few bits of every color value; there are plenty of options for removing less relevant data to create less precise hashes. You could use block hashing here as a small optimization to avoid repeatedly loading big files, first checking whether enough blocks are similar.
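As a tiny illustration of such a "less precise" hash for text-like files, here is a sketch that strips whitespace and capitalization before hashing (the function name is illustrative, and this is only one of the normalization options mentioned above):

import hashlib

def loose_text_hash(path):
    # Ignore whitespace and ASCII capitalization before hashing, so trivially
    # reformatted versions of the same text file produce the same digest.
    with open(path, "rb") as f:
        data = f.read()
    normalized = b"".join(data.lower().split())
    return hashlib.sha256(normalized).hexdigest()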
You can also combine these techniques: use a hash to check quickly whether the basic file metadata matches, and only run a slower algorithm to compute a similarity score when the hashes match. This combines some of the accuracy of approach 1 with some of the speed of approach 2, though neither the accuracy nor the speed will be great.
The final option is to use a very weak hashing algorithm. If you just use sum(file) % 2**32, similar files will give somewhat similar hashes in some cases, but it is really hard to judge actual similarity from the final hash: a difference of one byte anywhere in the file can still change the hash noticeably, and if you accept all hashes within 256 of each other as "similar", many unrelated files will match while you miss any file that differs by two or more bytes.
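A minimal sketch of that weak checksum (the function name is illustrative, and for huge files you would want to sum in blocks rather than read everything at once):

def weak_sum_hash(path):
    # The weak checksum described above: the byte sum modulo 2**32.
    with open(path, "rb") as f:
        return sum(f.read()) % (2 ** 32)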
It depends on your use case which of these techniques work for you, but beware that this is not an easy task. Good luck!

Python fast duplicate detection, can I store only the hash but not the value

I have a method for creating an image "hash" which is useful for duplicate frame detection. (Doesn't really matter for the question)
Currently I put each frame of a video in a set, and can do things like find videos that contain intersections by comparing the sets. (I have billions of hashes)
Since I have my own "hash" I don't need the values of the set, only the ability to detect duplicate items.
This would reduce my memory footprint by like half (since I would only have the hashes).
I know that internally a set actually stores hash/value pairs. There must be a way to make a "SparseSet" or a "hashonly" set.
Something like
2 in sparset(1,2,3)
True
but where
for s in sparset(1,2,3)
would return nothing, or hashes rather than values.
That's not quite how sets work. Both the hash value and the value are required, because the values must be checked for equality in case of a hash collision.
If you don't care about collisions, you can use a Bloom filter instead of a set. These are very memory efficient, but give probabilistic answers (either definitely not in the set, or maybe in the set). There's no Bloom filter in the standard library, but there are several implementations on PyPI.
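For illustration, here is a minimal, untuned Bloom filter sketch built on hashlib; the bit count, number of hash functions, salting scheme, and class name are all illustrative, and a PyPI implementation will be better engineered:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8_000_000, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one item by salting a sha256 digest.
        data = repr(item).encode()
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))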
If you care more about optimizing space than time, you could just keep the hashes in a list and then when you need to check for an element, sort it in place and do a binary search. Python's Timsort is very efficient when the list is mostly sorted already, so subsequent sorts will be relatively fast. Python lists have a sort() method and you can implement a binary search fairly easily using the standard library bisect module.
You can combine both techniques, i.e. don't bother sorting if the Bloom filter indicates the element is not in the set. And of course, don't bother sorting again if you haven't added any elements since last time.
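A minimal sketch of the sorted-list idea, including the "only re-sort when something new was added" optimization (the class name is illustrative):

import bisect

class SortedHashes:
    # Stores only the integer hashes; re-sorts lazily before lookups.
    def __init__(self):
        self._hashes = []
        self._dirty = False

    def add(self, h):
        self._hashes.append(h)
        self._dirty = True

    def __contains__(self, h):
        if self._dirty:
            self._hashes.sort()   # Timsort is cheap when the list is mostly sorted
            self._dirty = False
        i = bisect.bisect_left(self._hashes, h)
        return i < len(self._hashes) and self._hashes[i] == h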

Generating variations of checksum functions to minimize collisions

My goal is to efficiently perform an exhaustive diff of a directory tree. Given a set of files F, I must compute the equivalence classes EQV = [F₁, F₂, F₃, ... Fₙ] such that fⱼ, fₖ ∈ EQV[i] iff (fⱼ is identical to fₖ) for all i, j, k.
My general approach is to start with just one big class containing all initial files, EQV₀ = [[f₁, f₂, ..., fₙ]], and repeatedly split it into more refined classes EQV₁, EQV₂... EQVₘ₋₁, based on some m heuristics, for example, file size, checksum function 1, checksum 2. After all m heuristics have been applied (EQVₘ₋₁), a pairwise diff of all the files within each class in EQVₘ₋₁ must be made. Because this last step is quadratic for each of the classes in EQVₘ₋₁, ie
O(sum(n² for n in map(len, EQVₘ₋₁)))
and will probably be the bottleneck of my algorithm if each of the m splits is done in linear time, my goal is to make EQVₘ₋₁ as flat as possible.
I would like to have access to a variety of good hash functions that I can apply to minimize collisions on EQVₘ₋₁. My current idea is to use some library provided checksum function, such as adler, and to generate variations of it by simply applying it to different starting bytes within the file. Another one is to first apply fast hash functions, such as adler, and then more expensive ones such as md5 on only the classes which are still too large.
Considering that I can compute all the hashes for a given file in just one read of that file, how could I compute a variety of hashes that will help me discriminate among non-identical files?
Alternatively, what is a good list of hash functions available in python that aren't cryptographically secure?
Edit:
Another idea seems to be to use a "Rabin fingerprint" based on a fixed set of randomly generated inputs. Would this make sense for this purpose?
http://en.wikipedia.org/wiki/Rabin_fingerprint
I would recommend first using adler32, then crc32. There may be many very short files that have the same adler32's, but different crc32's. In fact, you could consider a single step using crc32 on files below a certain size on the first pass. That size could be about 1K. For longer files, adler32 and crc32 will have close to the same collision probability.
Depending on how many files you have, you could consider a subsequent step with larger hashes, such as md5 or sha1. See this answer for the probability and count of expected collisions for a 32-bit check value. Roughly, if you have millions of files, that step may be worth doing.
You will get no benefit by going to even longer hash values. The 128 bits from md5 is plenty to distinguish all of the files on every computer in the world.
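Since the question mentions computing all the hashes in a single read of each file, here is a minimal sketch that accumulates adler32, crc32, and md5 in one pass (the block size is illustrative):

import hashlib
import zlib

def checksums_in_one_pass(path, block_size=65536):
    # Accumulate adler32, crc32, and md5 together so the cheap checks and the
    # larger hash all come from the same read of the file.
    adler, crc, md5 = 1, 0, hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            adler = zlib.adler32(block, adler)
            crc = zlib.crc32(block, crc)
            md5.update(block)
    return adler, crc, md5.hexdigest()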
