I want to apply a hashing algorithm where the hash is the same if two files are similar. If one bit is lost, the hash of a file changes. Is there any algorithm I can apply in Python to tackle this problem?
Thank you
I heard block hashing does this, but I don't know how to apply it.
I tried the following algorithm, but it does not help:
import hashlib

file = "Annotation 2020-04-09 163448.png"  # Location of the file (can be set a different way)
BLOCK_SIZE = 65536  # The size of each read from the file

file_hash = hashlib.sha256()  # Create the hash object; can use something other than `.sha256()` if you wish
with open(file, 'rb') as f:  # Open the file to read its bytes
    fb = f.read(BLOCK_SIZE)  # Read from the file. Take in the amount declared above
    while len(fb) > 0:  # While there is still data being read from the file
        file_hash.update(fb)  # Update the hash
        fb = f.read(BLOCK_SIZE)  # Read the next block from the file

print(file_hash.hexdigest())  # Get the hexadecimal digest of the hash
The entire point of cryptographic hashing algorithms is that the hash becomes completely different if even one bit of the source file differs; this is what makes generating hash collisions hard. Here are some workarounds:
The only robust way to find "similar" but not identical files is to compare the entire file contents pairwise and compute a similarity score. This is rather inefficient, however, since it is an O(n²) algorithm with frequent hard drive round trips.
Another way is to hash only part of each file. This has the same problem: if even one bit of that part is different, the hash will be different. However, you may be able to ignore whitespace, markup, or capitalization, hash only the file headers, or ignore the last few bits of every color value; there are plenty of options for removing less relevant data to create less precise hashes. You could use block hashing here as a small optimization, to avoid repeatedly loading big files by first checking whether enough blocks are similar.
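As an illustration of that second idea, here is a minimal sketch, assuming the files are plain text, that strips capitalization and collapses whitespace before hashing, so that trivially different files deliberately produce the same hash (the normalization rules and the function name are only examples):

import hashlib

def normalized_hash(path):
    # Hash a normalized version of a text file so that differences in
    # capitalization and whitespace do not change the result.
    with open(path, 'rb') as f:
        data = f.read()
    normalized = b' '.join(data.lower().split())  # lowercase, collapse whitespace
    return hashlib.sha256(normalized).hexdigest()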
You can also combine these techniques: use a hash to quickly check that at least the basic file metadata is correct, and then run a slower algorithm to compute a similarity score only when the hashes match. This combines some of the accuracy of the first approach with some of the speed of the second, though neither the accuracy nor the speed will be great.
The final option is to use a very weak hashing algorithm. If you just use sum(file) % 2**32, similar files will give somewhat similar hashes in some cases. However, it is hard to determine actual similarity from the final hash: a difference of one byte anywhere in the file will still shift the hash noticeably, and if you treat all hashes within 256 of each other as matches, many files will be considered similar even when they are not, while files that differ in two or more bytes may be missed entirely.
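For completeness, a sketch of that weak hash (note that in Python the exponent is written 2**32, since ^ is bitwise XOR):

def weak_sum_hash(path):
    # Sum of all byte values modulo 2**32; nearby files get nearby hashes,
    # but see the caveats above before relying on this.
    with open(path, 'rb') as f:
        return sum(f.read()) % 2**32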
Which of these techniques works for you depends on your use case, but be aware that this is not an easy task. Good luck!
Related
I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently, but I wasn't able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less any matching image must have a hash where at least one of these parts is identical. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in dictionaries rather than using the database, you could use the parts as keys, each pointing to a list of images. Use either a single dictionary for simplicity or, for more accurate matching, one dictionary per part "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
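A minimal sketch of that dictionary variant, assuming 64-bit hashes written as 16 hex characters (as imagehash produces by default) and a maximum hamming distance of 3; the function and variable names are illustrative:

from collections import defaultdict

MAX_DISTANCE = 3
NUM_PARTS = MAX_DISTANCE + 1  # pigeonhole: at least one part must match exactly

def split_hash(hex_hash):
    # Split a 16-character hex hash into NUM_PARTS equal substrings.
    step = len(hex_hash) // NUM_PARTS
    return [hex_hash[i * step:(i + 1) * step] for i in range(NUM_PARTS)]

def hamming(hex_a, hex_b):
    # Bitwise hamming distance between two equal-length hex strings.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

index = defaultdict(list)  # (part position, part value) -> list of (hash, filename)

def add_image(hex_hash, filename):
    for pos, part in enumerate(split_hash(hex_hash)):
        index[(pos, part)].append((hex_hash, filename))

def find_similar(hex_hash):
    # Collect candidates sharing at least one part, then verify the distance.
    candidates = set()
    for pos, part in enumerate(split_hash(hex_hash)):
        candidates.update(index.get((pos, part), []))
    return [(h, name) for h, name in candidates if hamming(h, hex_hash) <= MAX_DISTANCE]

add_image("3a6c6565498da525", "known.png")
print(find_similar("3a6c6565498da524"))  # differs by one bit, still found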
I have a folder with several thousand RGB, 8-bit-per-channel image files on my computer that are anywhere between 2000x2000 and 8000x8000 in resolution (so most of them are extremely large).
I would like to store some small value, such as a hash, for each image so that I have a value to easily compare to in the future to see if any image files have changed. There are three primary requirements in the calculation of this value:
The calculation of this value needs to be fast
The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
Collisions should basically never happen.
There are a lot of ways I could go about this, such as sha1, md5, etc., but the real goal here is speed; I just want an extremely quick way to identify whether ANY change at all has been made to an image.
How would you achieve this in Python? Is there a particular hash algorithm you recommend for speed? Or can you devise a different way to achieve my three goals altogether?
Taking the three requirements in turn:
The calculation needs to be fast: how long the hash takes to compute depends on the hashing algorithm and the file size, so choose an efficient algorithm for your task (published benchmarks compare their speed). You can also optimize by checking something cheaper before computing the hash at all.
The result needs to be different if ANY part of the image file changes (ignoring the filename): if you hash the file contents, this is exactly what you get; the hash value changes even if only a small part of the image has changed, and the filename plays no part.
Collisions should basically never happen: collisions can happen, very rarely, but not never. That is the nature of hash algorithms.
As an example of optimizing the first point (a sketch follows below):
Check the file size.
If the sizes are equal, check the CRC.
If the CRCs are equal, calculate and compare the hashes (both the CRC and the hash require a pass over the file).
Optionally, before comparing full hashes, you can calculate and compare hashes of only part of each file rather than the whole thing.
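Here is a minimal sketch of that pipeline for comparing two files, under the assumption that SHA-256 is the final hash; the helper names are illustrative:

import hashlib
import os
import zlib

def crc32_of(path, chunk_size=65536):
    # CRC32 of a file, read in chunks (one pass over the file).
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            crc = zlib.crc32(chunk, crc)
    return crc

def sha256_of(path, chunk_size=65536):
    # Full cryptographic hash of a file, read in chunks (another pass).
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def probably_identical(path_a, path_b):
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # cheap metadata check, no file read needed
    if crc32_of(path_a) != crc32_of(path_b):
        return False  # fast checksum catches most differences
    return sha256_of(path_a) == sha256_of(path_b)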
If most of your files are likely to be different, checking these other things before calculating the hash will probably be faster. But if most of your files are identical, the steps before hashing will just consume more time, because you will still have to calculate the hash for most of the files anyway. So implement the most efficient pipeline for your particular context.
I'm kind of new to Python and this is the first time I'm trying to write a script.
I need to go over a given directory and print out all the duplicated files (files with the same contents). If there are two sets of duplicated files, then I need to print them on different lines.
Anyone have an idea?
Yes, creating a dictionary with the file size as the key and a list of all the files with that size as the value, and then only comparing files of the same size is a good strategy. However, there's another step you can take to improve the efficiency.
Once you've identified your lists of files of the same size, rather than laboriously comparing each pair of files in the list byte by byte (and returning as soon as you find a mis-match), you can compare their digital fingerprints.
A fingerprinting algorithm (also known as a message digest) takes a string of data and returns a short string of bytes that is (hopefully) unique to that input string. So to find duplicate files you just need to generate the fingerprint of each file and then see if any of the fingerprints are duplicates. This is generally a lot faster than actually comparing file contents, since if you have a list of fingerprints you can easily sort it, so all the files with the same fingerprint will be next to each other in the sorted list.
With the usual functions used for fingerprinting there is a tiny probability that two files with the same fingerprint aren't actually identical. So once you've identified files with matching fingerprints you still do need to compare their byte contents. However, in those very rare cases where two non-identical files have matching fingerprints the file contents will generally differ quite radically, so such false positives can be quickly eliminated.
The odds of two non-identical files having matching fingerprints are so tiny that, unless you have many thousands of files to de-dupe, you can probably skip the first step of grouping files by size and go straight to fingerprinting.
The Wikipedia article on fingerprinting I linked to above mentions the Rabin fingerprint as a very fast algorithm. There is a 3rd-party Python module available for it on PyPI - PyRabin 0.5, but I've never used it.
However, the standard Python hashlib module provides various secure hash and message digest algorithms, eg MD5 and the SHA family, which are reasonably fast, and quite familiar to most seasoned Python programmers. For this application, I'd suggest using an algorithm that returns a fairly short fingerprint, like sha1() or md5(), since they tend to be faster, although the shorter the fingerprint, the higher the rate of collision. Although MD5 isn't as secure as once thought it's still ok for this application, unless you need to deal with files created by a malicious person who's creating non-identical files with the same MD5 fingerprint on purpose.
Another option, if you expect to get lots of duplicate files, is to compute two different fingerprints (e.g. both SHA1 and MD5); the odds of two non-identical files having matching SHA1 and MD5 fingerprints are microscopically tiny.
FWIW, here's a link to a simple Python program I wrote last year (in an answer on the Unix & Linux Stack Exchange site) that computes the MD5 and SHA256 hashes of a single file. It's quite efficient on large files. You may find it helpful as an example of using hashlib.
Simultaneously calculate multiple digests (md5, sha256)
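Separately, here is a minimal sketch of the overall approach described above (not the linked program): group files by size, fingerprint only the candidate groups with MD5, and print each set of duplicates on its own line. The final byte-by-byte confirmation step is omitted for brevity, and the names are illustrative:

import hashlib
import os
import sys
from collections import defaultdict

def md5_of(path, chunk_size=65536):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(directory):
    by_size = defaultdict(list)
    for root, _, names in os.walk(directory):  # walks subdirectories too
        for name in names:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:  # only fingerprint groups with matching sizes
            for path in paths:
                by_digest[md5_of(path)].append(path)

    for paths in by_digest.values():
        if len(paths) > 1:
            print(' '.join(paths))  # one set of duplicates per line

if __name__ == '__main__':
    find_duplicates(sys.argv[1] if len(sys.argv) > 1 else '.')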
My goal is to efficiently perform an exhaustive diff of a directory tree. Given a set of files F, I must compute the equivalence classes EQV = [F₁, F₂, F₃, ... Fₙ] such that fⱼ, fₖ ∈ EQV[i] iff (fⱼ is identical to fₖ) for all i, j, k.
My general approach is to start with just one big class containing all initial files, EQV₀ = [[f₁, f₂, ..., fₙ]], and repeatedly split it into more refined classes EQV₁, EQV₂, ..., EQVₘ₋₁ based on some m heuristics, for example file size, checksum function 1, checksum function 2. After all m heuristics have been applied (EQVₘ₋₁), a pairwise diff of all the files within each class of EQVₘ₋₁ must be made. Because this last step is quadratic for each of the classes in EQVₘ₋₁, i.e.
O(sum(n² for n in map(len, EQVₘ₋₁)))
and will probably be the bottleneck of my algorithm if each of the m splits is done in linear time, my goal is to make EQVₘ₋₁ as flat as possible.
I would like to have access to a variety of good hash functions that I can apply to minimize collisions on EQVₘ₋₁. My current idea is to use some library provided checksum function, such as adler, and to generate variations of it by simply applying it to different starting bytes within the file. Another one is to first apply fast hash functions, such as adler, and then more expensive ones such as md5 on only the classes which are still too large.
Considering that I can compute all the hashes for a given file in just one read of that file, how could I compute a variety of hashes that will help me discriminate among non-identical files?
Alternatively, what is a good list of hash functions available in python that aren't cryptographically secure?
Edit:
Another idea would be to use a "Rabin fingerprint" based on a fixed set of randomly generated inputs. Would this make sense for this purpose?
http://en.wikipedia.org/wiki/Rabin_fingerprint
I would recommend first using adler32, then crc32. There may be many very short files that have the same adler32's, but different crc32's. In fact, you could consider a single step using crc32 on files below a certain size on the first pass. That size could be about 1K. For longer files, adler32 and crc32 will have close to the same collision probability.
Depending on how many files you have, you could consider a subsequent step with larger hashes, such as md5 or sha1. See this answer for the probability and count of expected collisions for a 32-bit check value. Roughly, if you have millions of files, that step may be worth doing.
You will get no benefit by going to even longer hash values. The 128 bits from md5 is plenty to distinguish all of the files on every computer in the world.
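A sketch of computing several of these checksums in a single read of each file, as the question proposes, with MD5 kept in reserve for classes that remain too large; the function name is illustrative:

import hashlib
import zlib

def multi_checksum(path, chunk_size=65536):
    # adler32 and crc32 start from zlib's documented initial values (1 and 0),
    # and md5 is accumulated alongside them in the same single read.
    adler, crc = 1, 0
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            adler = zlib.adler32(chunk, adler)
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
    return adler, crc, md5.hexdigest()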
There's a great deal of information I can find on hashing strings for obfuscation or lookup tables, where collision avoidance is a primary concern. I'm trying to put together a hashing function for the purpose of load balancing, where I want to fit an unknown set of strings into an arbitrarily small number of buckets with a relatively even distribution. Collisions are expected (desired, even).
My immediate use case is load distribution in an application, where I want each instance of the application to fire at a different time of the half-hour, without needing any state information about other instances. So I'm trying to hash strings into integer values from 0 to 29. However, the general approach has wider application with different int ranges for different purposes.
Can anyone make suggestions, or point me to docs that would cover this little corner of hash generation?
My language of choice for this is Python, but I can read most common languages, so anything should be applicable.
You might consider something simple, like the adler32() algorithm, and just take the result modulo the bucket size.
import zlib

buf = b'arbitrary and unknown string'  # zlib.adler32() expects bytes, so encode any str input first
bucket = zlib.adler32(buf) % 30
# at this point bucket is in the range 0 - 29
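As a hypothetical usage for the half-hour case described above, each instance could hash its own hostname (an assumption; any stable per-instance string works) to pick its slot:

import socket
import zlib

# Each instance derives a stable minute-of-the-half-hour from its own hostname,
# with no shared state between instances.
minute_offset = zlib.adler32(socket.gethostname().encode()) % 30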