Generating variations of checksum functions to minimize collisions - python

My goal is to efficiently perform an exhaustive diff of a directory tree. Given a set of files F, I must compute the equivalence classes EQV = [F₁, F₂, F₃, ... Fₙ] such that fⱼ, fₖ ∈ EQV[i] iff (fⱼ is identical to fₖ) for all i, j, k.
My general approach is to start with just one big class containing all the initial files, EQV₀ = [[f₁, f₂, ..., fₙ]], and repeatedly split it into more refined classes EQV₁, EQV₂, ..., EQVₘ₋₁ based on some m heuristics, for example file size, checksum function 1, checksum function 2. After all m heuristics have been applied (EQVₘ₋₁), a pairwise diff of all the files within each class in EQVₘ₋₁ must be made. Because this last step is quadratic in the size of each class in EQVₘ₋₁, i.e.
O(sum(n² for n in map(len, EQVₘ₋₁)))
and will probably be the bottleneck of my algorithm if each of the m splits is done in linear time, my goal is to make EQVₘ₋₁ as flat as possible.
I would like to have access to a variety of good hash functions that I can apply to minimize collisions on EQVₘ₋₁. My current idea is to use some library-provided checksum function, such as adler32, and to generate variations of it by simply applying it starting at different byte offsets within the file. Another idea is to first apply fast hash functions, such as adler32, and then more expensive ones such as md5 only on the classes which are still too large.
Considering that I can compute all the hashes for a given file in just one read of that file, how could I compute a variety of hashes that will help me discriminate among non-identical files?
Alternatively, what is a good list of hash functions available in python that aren't cryptographically secure?
Edit:
Another idea seems to be to use a "Rabin fingerprint" based on a fixed set of randomly generated inputs. Would this make sense for this purpose?
http://en.wikipedia.org/wiki/Rabin_fingerprint

I would recommend first using adler32, then crc32. There may be many very short files that have the same adler32's, but different crc32's. In fact, you could consider a single step using crc32 on files below a certain size on the first pass. That size could be about 1K. For longer files, adler32 and crc32 will have close to the same collision probability.
Depending on how many files you have, you could consider a subsequent step with larger hashes, such as md5 or sha1. See this answer for the probability and count of expected collisions for a 32-bit check value. Roughly, if you have millions of files, that step may be worth doing.
You will get no benefit by going to even longer hash values. The 128 bits from md5 is plenty to distinguish all of the files on every computer in the world.
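To illustrate the question's one-read constraint together with this answer's adler32/crc32/md5 progression, here is a minimal sketch; the chunk size and the function name are my own choices, not from either post:
import hashlib
import zlib

def multi_checksum(path, chunk_size=65536):
    """Compute adler32, crc32 and md5 of a file in one pass over its bytes."""
    adler = 1                      # adler32 starting value
    crc = 0                        # crc32 starting value
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            adler = zlib.adler32(chunk, adler)
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
    return adler, crc, md5.hexdigest()
Each component of the returned tuple can then be used as a successively more expensive splitting key for the equivalence classes.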

Related

More efficient data structure/algorithm to find similar imagehashes in database

I am writing a small python program that tries to find images similar enough to some already in a database (to detect duplicates that have been resized/recompressed/etc). I am using the imagehash library and average hashing, and want to know if there is a hash in a known database that has a hamming distance lower than, say, 3 or 4.
I am currently just using a dictionary that matches hashes to filenames and use brute force for every new image. However, with tens or hundreds of thousands of images to compare to, performance is starting to suffer.
I believe there must be data structures and algorithms that can allow me to search a lot more efficiently but wasn’t able to find much that would match my particular use case. Would anyone be able to suggest where to look?
Thanks!
Here's a suggestion. You mention a database, so initially I will assume we can use that (and don't have to read it all into memory first). If your new image has a hash of 3a6c6565498da525, think of it as 4 parts: 3a6c 6565 498d a525. For a hamming distance of 3 or less, any matching image must have a hash where at least one of these parts is identical, since at most 3 differing bits cannot touch all 4 parts. So you can start with a database query to find all images whose hash contains the substring 3a6c or 6565 or 498d or a525. This should be a tiny subset of the full dataset, so you can then run your comparison on that.
To improve further you could pre-compute all the parts and store them separately as additional columns in the database. This will allow more efficient queries.
For a bigger hamming distance you would need to split the hash into more parts (either smaller, or you could even use parts that overlap).
If you want to do it all in a dictionary, rather than using the database you could use the parts as keys that each point to a list of images. Either a single dictionary for simplicity, or for more accurate matching, a dictionary for each "position".
Again, this would be used to get a much smaller set of candidate matches on which to run the full comparison.
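Here is a minimal in-memory sketch of the dictionary variant described above, assuming 64-bit hashes written as 16 hex characters and a maximum hamming distance of 3; the helper names are made up for illustration:
from collections import defaultdict

def split_parts(hex_hash):
    """Split a 16-hex-char hash into four 4-char parts, keyed by position."""
    return [(i, hex_hash[i:i + 4]) for i in range(0, 16, 4)]

def hamming(a, b):
    """Hamming distance between two 64-bit hashes given as hex strings."""
    return bin(int(a, 16) ^ int(b, 16)).count('1')

index = defaultdict(list)   # (position, part) -> list of (hash, filename)

def add_image(hex_hash, filename):
    for key in split_parts(hex_hash):
        index[key].append((hex_hash, filename))

def find_similar(hex_hash, max_distance=3):
    candidates = set()
    for key in split_parts(hex_hash):
        candidates.update(index.get(key, ()))   # only images sharing at least one part
    return [(h, name) for h, name in candidates if hamming(h, hex_hash) <= max_distance]
Keying each part by its position matches the "dictionary for each position" idea; dropping the position from the key gives the simpler, less selective variant.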

What is the fastest method to loop through a very large number of accounts on a login page?

I was wondering: if I had 1 million users, how much time would it take to loop over every account to check whether the username and email are already in use on the registration page, or whether the username and password are correct on the login page? Won't it take ages if I do it with a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
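A minimal sketch of that procedure in Python, assuming the accounts are stored as a list of (username, password_hash) tuples kept sorted by username (the tuple layout is an assumption for illustration):
def find_user(sorted_accounts, username):
    """Binary search over a list of (username, password_hash) tuples sorted by username."""
    lo, hi = 0, len(sorted_accounts) - 1
    while lo <= hi:
        mid = (lo + hi) // 2                 # check the middle element
        mid_name = sorted_accounts[mid][0]
        if mid_name == username:
            return sorted_accounts[mid]      # found a match
        elif username < mid_name:
            hi = mid - 1                     # repeat on the left half
        else:
            lo = mid + 1                     # repeat on the right half
    return None                              # ran out of list: username not present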
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken goes up by a factor of roughly ten, rather than one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutiae of implementing these ideas in a production system. However, real-world systems exist that implement these ideas (and many more) for precisely the kind of situation presented.
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature (hash randomization, controlled by PYTHONHASHSEED) that causes it to produce different hashes across executions of the program. Otherwise, you can import hashlib, use one of the hash functions there, and convert the output to an integer using e.g. int.from_bytes. Once you have your number, take the modulus (the remainder after division, using the % operator) with respect to the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend looking at established guidance on load factors, but intuitively, making it larger than necessary by some constant multiplicative factor seems like the best bet generally speaking. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions - a costly operation, but if you increase capacity in a multiplicative fashion then I imagine this will not be too frequent an issue).
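To make these notes concrete, here is a minimal sketch of a chained hash table (collision strategy 1) with multiplicative growth; the class name, initial capacity, and 0.75 load-factor threshold are illustrative choices, and in practice Python's built-in dict already does all of this for you:
class ChainedHashTable:
    """Minimal hash table using separate chaining and multiplicative resizing."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        return hash(key) % self.capacity          # built-in hash(), reduced modulo capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)          # overwrite an existing entry
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / self.capacity > 0.75:      # table filling up: grow and rehash
            self._resize(self.capacity * 2)

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _resize(self, new_capacity):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.capacity = new_capacity
        self.buckets = [[] for _ in range(new_capacity)]
        for k, v in old_items:
            self.buckets[self._index(k)].append((k, v))
Looking up a user is then table.get(username), which touches only one bucket no matter how many users are stored.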

How to calculate same hash for two similar files?

I want to apply a hashing algorithm where the hash is the same if two files are similar. As it is, if even one bit is lost, the hash of the file changes. Is there any algorithm I can apply in Python to tackle this problem?
Thank you
I heard block hashing does this, but I don't know how to apply it.
I applied the following algorithm, but it does not help:
import hashlib

file = "Annotation 2020-04-09 163448.png"  # Location of the file (can be set a different way)
BLOCK_SIZE = 65536  # The size of each read from the file

file_hash = hashlib.sha256()  # Create the hash object, can use something other than `.sha256()` if you wish
with open(file, 'rb') as f:  # Open the file to read its bytes
    fb = f.read(BLOCK_SIZE)  # Read from the file. Take in the amount declared above
    while len(fb) > 0:  # While there is still data being read from the file
        file_hash.update(fb)  # Update the hash
        fb = f.read(BLOCK_SIZE)  # Read the next block from the file

print(file_hash.hexdigest())  # Get the hexadecimal digest of the hash
The entire point of standard hashing algorithms is that the hash becomes completely different if even one bit of the source file differs, to ensure that deliberately generating hash collisions is hard. Here are some workarounds:
The only robust way to find "similar" but not identical files is to compare the entire file contents for every pair and compute a similarity score. This is rather inefficient, however, since it is an O(n^2) algorithm over the files, with frequent hard-drive round trips.
Another way is to hash only a part of each file. This has the same problem: if even one bit of that part differs, the file will hash differently. However, you may be able to ignore spaces, markup, or capitalization, hash only the file headers, or ignore the last few bits of every color value; there are plenty of options for removing less relevant data to create less precise hashes. You could also use block hashing here as a small optimization to avoid repeatedly loading big files, by first checking whether enough blocks are similar (a sketch of this follows below).
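A rough sketch of what that block hashing could look like: hash each fixed-size block and count how many positions have identical blocks. The block size, the use of md5 per block, and the idea of a similarity fraction are assumptions for illustration, not a standard recipe:
import hashlib

def block_hashes(path, block_size=4096):
    """Return a list of per-block hashes for a file."""
    hashes = []
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.append(hashlib.md5(block).digest())
    return hashes

def block_similarity(path_a, path_b, block_size=4096):
    """Fraction of positions where the two files have identical blocks."""
    a = block_hashes(path_a, block_size)
    b = block_hashes(path_b, block_size)
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))
Note that a single inserted or deleted byte shifts every later block, so this only catches in-place changes.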
You can also combine these techniques: use a hash to check quickly that at least the basic file metadata is correct, and then use a slower algorithm to compute a similarity score only if the hashes match. This combines some of the accuracy of approach one with some of the speed of approach two, though both the accuracy and the speed will still not be great.
The final option is to use a very weak hashing algorithm. If you just use sum(file) % 2**32, similar files will give somewhat similar hashes in some cases, but it is really hard to determine actual similarity from the final hash: a difference of a byte anywhere in the file still makes a noticeable difference in the hash, and if you accept all hashes within 256 of each other, many files will be considered similar even though they are not, while you miss all files that differ by two or more bytes.
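A tiny sketch of that weak hash, just to make it concrete (it assumes reading the whole file into memory is acceptable):
def weak_sum_hash(path):
    """Sum of all byte values modulo 2**32; similar files often land near each other."""
    with open(path, 'rb') as f:
        return sum(f.read()) % (2 ** 32)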
It depends on your use case which of these techniques work for you, but beware that this is not an easy task. Good luck!

Hashing 1000 Image Files Quick as Possible (2000x2000 plus resolution) (Python)

I have a folder with several thousand RGB 8-bit-per-channel image files on my computer that are anywhere between 2000x2000 and 8000x8000 in resolution (so most of them are extremely large).
I would like to store some small value, such as a hash, for each image so that I have a value to easily compare to in the future to see if any image files have changed. There are three primary requirements in the calculation of this value:
The calculation of this value needs to be fast
The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes. (The hash should not take filename into account).
Collisions should basically never happen.
There are a lot of ways I could go about this, such as sha1, md5, etc, but the real goal here is speed, and really just any extremely quick way to identify if ANY change at all has been made to an image.
How would you achieve this in Python? Is there a particular hash algorithm you recommend for speed? Or can you devise a different way to achieve my three goals altogether?
The calculation of this value needs to be fast
The result needs to be different if ANY part of the image file changes, even in the slightest amount, even if just one pixel changes.
(The hash should not take filename into account).
Collisions should basically never happen.
Hash calculation for large files takes time (how much may differ according to the hashing algorithm); if it needs to be fast, try to choose an efficient hashing algorithm for your task (you can find comparisons of how they perform). But before checking the hash, you can optimize your algorithm by checking something else first.
If you decide to use hashing, this is already the case: the hash value will change even if only a small part of the image has changed.
Collisions may happen (very rarely, but not never); this is the nature of hash algorithms.
An example for the first point (optimizing the algorithm); a code sketch follows the steps:
Check file size.
If sizes are equal, check CRC
If the CRCs are equal, then calculate and check the hash (both require a full pass over the file).
Optionally, before checking full hashes, you can calculate and compare partial hashes over only part of each file.
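A minimal sketch of that staged comparison for a pair of files, assuming zlib.crc32 for the CRC step and sha256 for the full hash; the chunk size and function names are my own choices:
import hashlib
import os
import zlib

def crc32_of(path, chunk_size=65536):
    crc = 0
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc

def sha256_of(path, chunk_size=65536):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def probably_identical(path_a, path_b):
    """Cheap checks first; only fall back to the full hash when earlier steps agree."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):   # 1. sizes differ -> different
        return False
    if crc32_of(path_a) != crc32_of(path_b):                 # 2. CRCs differ -> different
        return False
    return sha256_of(path_a) == sha256_of(path_b)            # 3. final check via full hash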
If most of your files are likely to be different, then checking other things before calculating the hash will probably be faster.
But if most of your files are identical, then the steps before the hashing will just consume more time, because you will end up calculating the hash for most of the files anyway.
So try to implement the most efficient algorithm for your context.

Would it be worth creating an OpenCL version of my abstract algebra library?

I'm making an abstract algebra library in Python, and one of the things it does is take a Cayley table (think of it as an abstract "multiplication" table, which doesn't have to obey the standard rules for multiplication or addition) and use it to prove whether or not certain identities or properties hold for the binary operator the table defines.
The computations for any of these procedures basically boil down to:
Use the values in the Cayley table to get the result of some abstract binary operation
Compare the resulting values to see if they are equal
Do all of the above multiple times in a loop.
This can get very CPU intensive given the sheer number of operations that need to be done, especially if you want to do this for all permutations of Cayley tables of size n*n (O((n^2)!) for that, I think). The good thing is that this is highly parallelizable: I have algorithms that can calculate the mth permutation of an n*n Cayley table in about O(n) time, so the set of all Cayley table permutations can be split nearly evenly and multiple processes can each work on a different subset of the problem in parallel.
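As an illustration of the kind of check being described (this is not code from the library itself), here is a minimal sketch that tests commutativity and associativity for an operation given as an n*n Cayley table stored as a list of lists of element indices:
def is_commutative(table):
    """Check a*b == b*a for every pair, where table[a][b] gives a*b."""
    n = len(table)
    return all(table[a][b] == table[b][a] for a in range(n) for b in range(n))

def is_associative(table):
    """Check (a*b)*c == a*(b*c) for every triple of elements."""
    n = len(table)
    return all(
        table[table[a][b]][c] == table[a][table[b][c]]
        for a in range(n) for b in range(n) for c in range(n)
    )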
Is a process like this (replacing values from lookup tables, comparing results) suited to OpenCL, or any other GPU library for that matter?
If you could change the
Compare the resulting values to see if they are equal
part to something like
subtract one from the other, save the result, and check for a zero from the host side,
or, if the result only decides whether an addition (+) is performed, even something like
subtract one from the other, multiply (1 - signum(|result|)) by the value to be added, then add,
it can be faster than the CPU, because GPU memory is generally faster than host memory when read/written non-randomly.
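The following is only a CPU-side NumPy analogue of that branchless pattern, not actual OpenCL, and the array contents and adder value are made up; the point is that the equality test turns into arithmetic applied uniformly to every element, which is the kind of computation a GPU handles well (note the absolute value, which keeps the signum trick correct for negative differences):
import numpy as np

# results computed two different ways from a Cayley table, element-wise
lhs = np.array([3, 1, 4, 1, 5], dtype=np.int64)
rhs = np.array([3, 1, 4, 2, 5], dtype=np.int64)
adder = 1                                   # value to accumulate when a pair matches

diff = lhs - rhs                            # subtract one from the other
match = 1 - np.sign(np.abs(diff))           # 1 where equal, 0 where different
count = int(np.sum(match * adder))          # accumulate without any branching
print(count)                                # 4 of the 5 pairs agree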
