There's a great deal of information I can find on hashing strings for obfuscation or lookup tables, where collision avoidance is a primary concern. I'm trying to put together a hashing function for the purpose of load balancing, where I want to fit an unknown set of strings into an arbitrarily small number of buckets with a relatively even distribution. Collisions are expected (desired, even).
My immediate use case is load distribution in an application, where I want each instance of the application to fire at a different time of the half-hour, without needing any state information about other instances. So I'm trying to hash strings into integer values from 0 to 29. However, the general approach has wider application with different int ranges for different purposes.
Can anyone make suggestions, or point me to docs that would cover this little corner of hash generation?
My language of choice for this is Python, but I can read most common languages, so anything should be applicable.
You might consider something simple, like the adler32() algorithm, and just take the result modulo the bucket count.
import zlib
buf = 'arbitrary and unknown string'
# zlib.adler32() needs bytes on Python 3, so encode the string first
bucket = zlib.adler32(buf.encode('utf-8')) % 30
# at this point bucket is in the range 0 - 29
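A variant of the same idea, in case it helps: adler32 is known to spread very short inputs poorly, so a minimal sketch using hashlib from the standard library is shown here. The helper name `bucket_for` and the example hostname are made up for illustration. Python's built-in hash() is avoided because it is randomized per process for strings, so different instances would not agree on the bucket.

```python
import hashlib

def bucket_for(name: str, n_buckets: int) -> int:
    """Map a string to a stable bucket index in range(n_buckets)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % n_buckets

# Hypothetical instance name; the result is a minute offset from 0 to 29.
minute = bucket_for("instance-a.example.com", 30)
```

The same function covers other ranges by changing `n_buckets`. The modulo step introduces a tiny bias when the range does not divide 2**64, which is negligible for small bucket counts.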
I want to apply a hashing algorithm where the hash stays the same if two files are similar. With the approach I tried, if even one bit is lost, the hash of the file changes. Is there any algorithm I can apply in Python to tackle this problem?
Thank you
I heard block hashing does this, but I don't know how to apply it.
I applied the following algorithm, but it does not help:
import hashlib
file = "Annotation 2020-04-09 163448.png" # Location of the file (can be set a different way)
BLOCK_SIZE = 65536 # The size of each read from the file
file_hash = hashlib.sha256() # Create the hash object, can use something other than `.sha256()` if you wish
with open(file, 'rb') as f: # Open the file to read its bytes
    fb = f.read(BLOCK_SIZE) # Read from the file. Take in the amount declared above
    while len(fb) > 0: # While there is still data being read from the file
        file_hash.update(fb) # Update the hash
        fb = f.read(BLOCK_SIZE) # Read the next block from the file
print(file_hash.hexdigest()) # Get the hexadecimal digest of the hash
The entire point of cryptographic hashing algorithms such as SHA-256 is that the output becomes completely different if any one bit of the source file is different, so that generating hash collisions is hard. Here are some workarounds:
The only robust way to find "similar" but not identical files is to compare the entire content of every pair of files and compute a similarity score. This is rather inefficient, however, since it is an O(n^2) algorithm in the number of files, with frequent hard drive round trips.
Another way is to hash only a part of each file. This has the same problem: if only one bit of that part is different, the hash will be different. However, you may be able to ignore spaces, markup or capitalization, hash only the file headers, or ignore the last few bits of every color value. There are plenty of options for removing less relevant data to create less precise hashes. You could use block hashing here as a small optimization to avoid repeatedly loading big files, by first checking whether enough blocks are identical.
You can also combine these techniques: use a hash to check quickly whether at least the basic file metadata matches, and then use a slower algorithm to compute a similarity score only if the hashes match. This combines some of the accuracy of the first approach with some of the speed of the second, though neither the accuracy nor the speed will be great.
The final option is to use a very weak hashing algorithm. If you just use sum(file) % 2**32, similar files will give sort of similar hashes in some cases, but it's really hard to determine actual similarity from the final hash. A difference of a single byte anywhere in the file can still shift the hash by up to 255, and if you treat all hashes within 256 of each other as similar, many files will be considered similar even though they are not, while you still miss files that differ in two bytes or more.
It depends on your use case which of these techniques work for you, but beware that this is not an easy task. Good luck!
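To make the block hashing idea from the workarounds above concrete, here is a minimal sketch. It assumes both files fit in memory and that similar files differ by in-place changes rather than insertions, since an inserted byte shifts every later block. The function names and the 4096-byte block size are illustrative choices, not part of any library.

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4096) -> list:
    """Hash each fixed-size block of data separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def block_similarity(a: bytes, b: bytes, block_size: int = 4096) -> float:
    """Fraction of block positions whose hashes match (0.0 to 1.0)."""
    ha, hb = block_hashes(a, block_size), block_hashes(b, block_size)
    longest = max(len(ha), len(hb))
    if longest == 0:
        return 1.0  # two empty inputs are treated as identical
    matches = sum(1 for x, y in zip(ha, hb) if x == y)
    return matches / longest

# Flip one bit in the second of four blocks: three of four blocks still match.
original = bytes(range(256)) * 64          # 16384 bytes, i.e. 4 blocks
modified = bytearray(original)
modified[5000] ^= 1
print(block_similarity(original, bytes(modified)))  # 0.75
```

The per-block hashes can be stored once per file, so later comparisons do not need to reload the file contents.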
So I was wondering about YouTube's URLs, especially the video ID in watch?v=XZmGGAbHqa0. The same goes for similar TinyURL-style services.
I came across Tom's video, Will YouTube Ever Run Out Of Video IDs? That was quite interesting. But generating a seemingly random number and then checking it for duplicates looked quite expensive.
In case you haven't watched the video, YouTube generates a large unique number and base64-encodes it. I'm not sure the whole process is that simple inside YouTube, but I'm trying to write something similar.
I came across a mathematical technique based on the Modular Multiplicative Inverse. Using this along with a bit of the Extended Euclidean algorithm will generate a unique number from a given input. The inverse is possible too, but someone won't easily be able to invert or brute-force it without knowing the seeds. So we can easily get a large, random-looking number from the sequential ID in our database, and maybe even base64 it.
But I have some confusion.
As Tom mentioned in the video, syncing a sequential ID across multiple servers can lead to problems and overhead. But if we don't make it sequential, won't it be a tedious and expensive task to find a unique ID? We would have to check whether the ID is available in the DB.
Should I avoid exposing sequential IDs in URLs? Or is masking them with something like the multiplicative inverse good enough?
Why would I want to base64 it? To save storage or to save URL space? I would still need to generate something seemingly random and unique, and be able to generate it quickly without duplicates and look it up.
Are there any better ways to achieve what I'm trying to do? I did check all the similar SO questions and a few blog posts, one of which is this
Sorry If I couldn't explain well, I'm quite confused what to ask.
This is why you bypass the issue of ensuring uniqueness of randomly generated numbers entirely and use a pseudorandom permutation (PRP) family to transform a pre-coordinated counter.
A PRP family is like a PRF family in that it is a cipher that takes a key and a plaintext and produces a ciphertext, except that for each key it is a one-to-one map between plaintexts and ciphertexts.
This lets one use a Twitter Snowflake-like design and then simply encode the internal sequential identifiers into non-sequential external identifiers.
See Algorithm to turn numeric IDs in to short, different alphanumeric codes for an implementation in Python.
See also:
M. Luby and C. Rackoff. How to construct pseudorandom permutations from pseudorandom functions. SIAM J. Comput., 17(2):373–386, Apr. 1988.
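For illustration, here is a minimal sketch of the construction the Luby and Rackoff paper describes, a Feistel network that turns a pseudorandom function into a permutation. The 32-bit ID width, the four rounds, the use of HMAC-SHA256 as the round function and all names below are assumptions made for this example, not a vetted cipher. For anything security-sensitive, use a reviewed small-block or format-preserving cipher instead.

```python
import hashlib
import hmac

HALF_BITS = 16                      # two 16-bit halves give a 32-bit ID space
HALF_MASK = (1 << HALF_BITS) - 1
ROUNDS = 4

def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed pseudorandom round function, truncated to 16 bits."""
    msg = bytes([round_no]) + half.to_bytes(2, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")

def encode_id(key: bytes, n: int) -> int:
    """Map a sequential ID to a non-sequential one (a bijection on 32 bits)."""
    if not 0 <= n < 1 << (2 * HALF_BITS):
        raise ValueError("ID out of range for this sketch")
    left, right = n >> HALF_BITS, n & HALF_MASK
    for r in range(ROUNDS):
        left, right = right, left ^ _round(key, r, right)
    return (left << HALF_BITS) | right

def decode_id(key: bytes, c: int) -> int:
    """Invert encode_id by running the rounds backwards."""
    left, right = c >> HALF_BITS, c & HALF_MASK
    for r in reversed(range(ROUNDS)):
        left, right = right ^ _round(key, r, left), left
    return (left << HALF_BITS) | right

key = b"hypothetical secret key"
external = encode_id(key, 12345)
assert decode_id(key, external) == 12345
```

Because the mapping is one-to-one, no database lookup is needed to rule out duplicates: distinct counter values always give distinct external IDs, and the result can then be base64-encoded for the URL.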
This is py2neo 1.6.
My question is how to generate the unique_identifier for each idea (see commented lines) in order to have a distinct filename for the image.
For the moment we are using Python's uuid module.
I wonder if there is some utility in Neo4j that can associate a distinct number with each node when the node is added to the index, so that we can use this number as our unique_identifier.
def create_idea_node(idea_text):
    #basepath = 'http://www.example.com/ideas/img/'
    #filename= str(unique_identifier)+'.png'
    #idea_image_url = basepath + filename
    new_idea_node, = getGraph().create({"idea": idea_text, "idea_image_url": idea_image_url})
    _getIdeasIndex().add("idea", idea_text, new_idea_node)
    return OK

def _getIdeasIndex():
    return getGraph().get_or_create_index(neo4j.Node, "Ideas")
Neo4j nodes have integer IDs, and id(n) is the node n's ID. However, if a node is deleted, its integer may be reused for a node created later. Is there something wrong with the UUID? Integer solutions can become problematic when you are multi-threading or distributing your computing project across multiple servers as you scale. So unless there is something wrong with the UUID solution, I'd just stick with that.
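As a minimal sketch of sticking with UUIDs, the commented-out lines in the question could be filled in roughly like this. The helper name `new_image_url` is made up for this example, and the base path is the one from the question.

```python
import uuid

BASEPATH = "http://www.example.com/ideas/img/"

def new_image_url() -> str:
    """Build an image URL with a random (version 4) UUID as the filename."""
    unique_identifier = uuid.uuid4()
    filename = str(unique_identifier) + ".png"
    return BASEPATH + filename

idea_image_url = new_image_url()
```

No coordination with the database or with other servers is needed, since each call draws a fresh random identifier.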
In spite of being hard to read, and perhaps requiring slightly more storage, UUIDs have many advantages over trying to enforce uniqueness with integers (in general). I encourage you to read up on the nature of UUIDs on Wikipedia.
Integer uniqueness has many pitfalls when trying to scale across independent systems (for fault tolerance and performance reasons). If you start out working with UUIDs, your solution can grow for the long term with far fewer headaches down the road.
FWIW, if you end up storing UUIDs in PostgreSQL sometime down the road, be sure to take advantage of the 'uuid' datatype. It will make storing and indexing those values almost as efficient as plain integers. (It will be hard to tell the difference.)
My goal is to efficiently perform an exhaustive diff of a directory tree. Given a set of files F, I must compute the equivalence classes EQV = [F₁, F₂, F₃, ... Fₙ] such that fⱼ, fₖ ∈ EQV[i] iff (fⱼ is identical to fₖ) for all i, j, k.
My general approach is to start with just one big class containing all initial files, EQV₀ = [[f₁, f₂, ..., fₙ]], and repeatedly split it into more refined classes EQV₁, EQV₂... EQVₘ₋₁, based on some m heuristics, for example, file size, checksum function 1, checksum 2. After all m heuristics have been applied (EQVₘ₋₁), a pairwise diff of all the files within each class in EQVₘ₋₁ must be made. Because this last step is quadratic for each of the classes in EQVₘ₋₁, i.e.
O(sum(n² for n in map(len, EQVₘ₋₁)) )
and will probably be the bottleneck of my algorithm if each of the m splits is done in linear time, my goal is to make EQVₘ₋₁ as flat as possible.
I would like to have access to a variety of good hash functions that I can apply to minimize collisions on EQVₘ₋₁. My current idea is to use some library provided checksum function, such as adler, and to generate variations of it by simply applying it to different starting bytes within the file. Another one is to first apply fast hash functions, such as adler, and then more expensive ones such as md5 on only the classes which are still too large.
Considering that I can compute all the hashes for a given file in just one read of that file, how could I compute a variety of hashes that will help me discriminate among non-identical files?
Alternatively, what is a good list of hash functions available in python that aren't cryptographically secure?
Edit:
Another idea would be to use a "Rabin fingerprint" based on a fixed set of randomly generated inputs. Would this make sense for this purpose?
http://en.wikipedia.org/wiki/Rabin_fingerprint
I would recommend first using adler32, then crc32. There may be many very short files that have the same adler32 values but different crc32 values. In fact, you could consider a single step using crc32 on files below a certain size on the first pass. That size could be about 1K. For longer files, adler32 and crc32 will have close to the same collision probability.
Depending on how many files you have, you could consider a subsequent step with larger hashes, such as md5 or sha1. See this answer for the probability and count of expected collisions for a 32-bit check value. Roughly, if you have millions of files, that step may be worth doing.
You will get no benefit from going to even longer hash values. The 128 bits from md5 are plenty to distinguish all of the files on every computer in the world.
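The staged refinement (size, then adler32, then crc32, then md5) could be sketched roughly as follows. This is a simplified illustration with made-up function names: it reads each file fully into memory and re-reads it at every stage, whereas a real implementation would stream the file once and compute all the checksums in that single pass.

```python
import hashlib
import os
import zlib
from collections import defaultdict

def _read(path):
    """Return the full contents of a file as bytes."""
    with open(path, "rb") as f:
        return f.read()

def refine(classes, key):
    """Split each class by key(path); drop classes left with a single file."""
    refined = []
    for group in classes:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        refined.extend(g for g in buckets.values() if len(g) > 1)
    return refined

def candidate_duplicates(paths):
    """Apply cheap checks first, more expensive ones on what remains."""
    classes = [list(paths)]
    for key in (
        os.path.getsize,
        lambda p: zlib.adler32(_read(p)),
        lambda p: zlib.crc32(_read(p)),
        lambda p: hashlib.md5(_read(p)).digest(),
    ):
        classes = refine(classes, key)
    return classes
```

Each returned class holds files that agree on size and on all three checksums; the final pairwise diff only has to run inside those classes.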
I have a large static binary (10GB) that doesn't change.
I want to be able to take small strings as input (15 bytes or fewer each) and then determine which string is the least frequent.
I understand that without actually searching the whole binary I won't be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it would require about 256^15 bytes, which is a lot.
I have about 100GB of disk space and 8GB of RAM which will be dedicated to this task, but I can't seem to find any way to accomplish it without actually going over the file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many, many times.
Any ideas?
Thanks!
Daniel.
(BTW: if it matters, I'm using Python)
Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the entries for tuples that never appear. I wouldn't call the result an "approximation"; it is more of an "upper bound", with the assurance of detecting strings that don't appear at all.
So, say you can build all 4-tuples.
Then to count occurrences for "ABCDEF" you'd have the minimum of count(ABCD), count(BCDE), count(CDEF). If that is zero for any of those, the string is guaranteed to not appear. If it is one, it will appear at most once (but maybe not at all).
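A minimal in-memory sketch of this bound, using 4-byte tuples as in the example above. The function names are illustrative, and for the real 10GB file the counts would have to be built in a streaming fashion and kept on disk.

```python
from collections import Counter

def build_ngram_counts(data: bytes, n: int = 4) -> Counter:
    """Count every n-byte window in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def upper_bound(counts: Counter, query: bytes, n: int = 4) -> int:
    """Upper bound on occurrences of query: the rarest of its n-byte windows."""
    if len(query) < n:
        raise ValueError("query must be at least n bytes long")
    return min(counts[query[i:i + n]] for i in range(len(query) - n + 1))

data = b"ABCDEFABCDXX"
counts = build_ngram_counts(data)
print(upper_bound(counts, b"ABCDEF"))  # 1
print(upper_bound(counts, b"ABCDXY"))  # 0, so it is guaranteed not to appear
```

To pick the least frequent of several query strings, compare their bounds; a bound of zero is exact, larger values may overestimate.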
Because you have a large static string that does not change, you can separate the one-time work of preprocessing this string, which never has to be repeated, from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage, you could build a suffix array: an array of offsets into the stream, sorted in the order of the suffixes starting at each offset. This could be stored in external storage for queries, and you could use it with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two will give you the number of occurrences. A binary search needs about 34 binary chops to cover 16 Gbytes (assuming 16 Gbytes is 2^34 bytes), so with two searches each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for the one-time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by a 5-byte offset number, then sort these 20-byte records with an external sort. This will give you 50 Gbytes of indexes into the string in sorted order for you to put into external storage to answer queries with.
If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an Aho-Corasick tree (http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) from them. This takes time linear in the total size of the queries. Then you can stream the 10GB of data past the tree in time proportional to the sum of the size of that data and the number of times any string finds a match.
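The sorted-offsets idea can be tried out on a small in-memory example. In this sketch the offsets are sorted by the 15 bytes that follow them, and two binary searches locate the first and last matching positions. The function names are made up, and the real thing would keep the index on disk and replace the list lookups with seeks.

```python
def build_sorted_offsets(data: bytes, max_len: int = 15) -> list:
    """Offsets into data, sorted by the (at most max_len) bytes starting there."""
    return sorted(range(len(data)), key=lambda i: data[i:i + max_len])

def count_occurrences(data: bytes, offsets: list, query: bytes,
                      max_len: int = 15) -> int:
    """Count occurrences of query using two binary searches over offsets."""
    m = len(query)
    if not 0 < m <= max_len:
        raise ValueError("query length must be between 1 and max_len")
    # First search: leftmost position whose m-byte prefix is >= query.
    lo, hi = 0, len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[offsets[mid]:offsets[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Second search: leftmost position whose m-byte prefix is > query.
    hi = len(offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[offsets[mid]:offsets[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return lo - first

data = b"abracadabra"
offsets = build_sorted_offsets(data)
print(count_occurrences(data, offsets, b"abra"))  # 2
```

Unlike the tuple-count bound, this gives exact counts (including overlapping matches) for any query of up to 15 bytes.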
Since you are looking for which string is least frequent, and are willing to accept an approximate solution, you could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make sub-strings of each size out of the data. For example, if the queries will be between 3 and 100 bytes long, then it would cost (N * (sum of (i) from i = 3 to i = 100)). Then, one by one, add each sub-string to the first Bloom filter that does not already contain it, creating a new Bloom filter with the same hash functions if needed. You obtain the count for a query by going through each filter and checking whether the query exists within it; every filter that contains it adds 1 to the count.
You'll need to balance the false positive rate against the number of filters. If the false positive rate gets too high on one of the filters, that filter isn't useful; likewise it's bad if you have trillions of Bloom filters (quite possible if you use one filter per sub-string). There are a couple of ways these issues can be dealt with.
To reduce the number of filters:
Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
Randomly merge filters until there are only so many left. Ideally, avoid merging any one filter too often, as that increases its false positive rate. Practically speaking, you probably have too many filters to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
It also may not be a bad idea to avoid a greedy approach when adding to a Bloom filter. Be rather selective about which filter something is added to.
You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.
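A rough sketch of the basic scheme, before any deleting or merging of filters: each occurrence of a sub-string goes into the first filter that does not already report it, and the estimated count is the number of filters that report it. The class names, the filter size and the use of SHA-256 with double hashing to derive bit positions are all assumptions made for this example.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive n_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

class FilterSeries:
    """Series of Bloom filters; the k-th occurrence lands in a k-th filter."""
    def __init__(self):
        self.filters = []

    def add(self, item: bytes) -> None:
        for f in self.filters:
            if item not in f:
                f.add(item)
                return
        new_filter = BloomFilter()
        new_filter.add(item)
        self.filters.append(new_filter)

    def estimate(self, item: bytes) -> int:
        # Never below the true count; false positives can inflate it.
        return sum(1 for f in self.filters if item in f)

series = FilterSeries()
for chunk in (b"abc", b"xyz", b"abc", b"abc"):
    series.add(chunk)
```

In this form the number of filters grows with the count of the most frequent sub-string, which is where the pruning, merging or scalable variants discussed above would come in.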