I need to generate a 32 bit random int but depending of some arguments. The idea is generate a unique ID for each message to send through a own P2P network. To generate it, I would like as arguments: my IP and the time stamp. My question is, how can I generate this 32 bit random int from these arguments?
Thanks again!
here's a list of options with their associated problems:
use a random number. you will get a collision (non-unique value) in about half the bits (this is the "birthday collision"). so for 32 bits you get a collision after 2*16 messages. if you are sending less than 65,000 messages this is not a problem, but 65,000 is not such a big number.
use a sequential counter from some service. this is what twitter's snowflake does (see another answer here). the trouble is supplying these across the net. typically with distributed systems you give each agent a set of numbers (so A might get 0-9, B get's 10-19, etc) and they use those numbers then request a new block. that reduces network traffic and load on the service providing numbers. but this is complex.
generate a hash from some values that will be unique. this sounds useful but is really no better than (1), because your hashes are going to collide (i explain why below). so you can hash IP address and timestamp, but all you're doing is generating 32 bit random numbers, in effect (the difference is that you can reproduce these values, but it doesn't seem like you need that functionality anyway), and so again you'll have a collisions after 65,000 messages or so, which is not much.
be smarter about generating ids to guarantee uniqueness. the problem in (3) is that you are hashing more than 32 bits, so you are compressing information and getting overlaps. instead, you could explicitly manage the bits to avoid collisions. for example, number each client for 16 bits (allows up to 65,000 clients) and then have each client user a 16 bit counter (allows up to 65,000 messages per client which is a big improvement on (3)). those won't collide because each is guaranteed unique, but you have a lot of limits in your system and things are starting to get complex (need to number clients and store counter state per client).
use a bigger field. if you used 64 bit ids then you could just use random numbers because collisions would be once every 2**32 messages, which is practically never (1 in 4,000,000,000). or you could join ip address (32 bits) with a 32 bit timestamp (but be careful - that probably means no more than 1 message per second from a client). the only drawback is slightly larger bandwidth, but in most cases ids are much smaller than payloads.
personally, i would use a larger field and random numbers - it's simple and works (although good random numbers are an issue in, say, embedded systems).
finally, if you need the value to be "really" random (because, for example, ids are used to decide priority and you want things to be fair) then you can take one of the solutions above with deterministic values and re-arrange the bits to be pseudo-random. for example, reversing the bits in a counter may well be good enough (compare lsb first).
I would suggest using some sort of hash. There are many possible hashes, the FNV hash comes in a variety of sizes and is fast. If you want something cryptographically secure it will be a lot slower. You may need to add a counter: 1, 2, 3, 4... to ensure that you do not get duplicate hashes within the same time stamp.
Did you try looking into Twitter's Snowflake? There is a Python wrapper for it.
Related
I was wonder what if i had 1 million users, how much time will take it to loop every account to check for the username and email if the registration page to check if they are already used or not, or check if the username and password and correct in the log in page, won't it take ages if i do it in a traditional for loop?
Rather than give a detailed technical answer, I will try to give a theoretical illustration of how to address your concerns. What you seem to be saying is this:
Linear search might be too slow when logging users in.
I imagine what you have in mind is this: a user types in a username and password, clicks a button, and then you loop through a list of username/password combinations and see if there is a matching combination. This takes time that is linear in terms of the number of users in the system; if you have a million users in the system, the loop will take about a thousand times as long as when you just had a thousand users... and if you get a billion users, it will take a thousand times longer over again.
Whether this is a performance problem in practice can only be determined through testing and requirements analysis. However, if it is determined to be a problem, then there is a place for theory to come to the rescue.
Imagine one small improvement to our original scheme: rather than storing the username/password combinations in arbitrary order and looking through the whole list each time, imagine storing these combinations in alphabetic order by username. This enables us to use binary search, rather than linear search, to determine whether there exists a matching username:
check the middle element in the list
if the target element is equal to the middle element, you found a match
otherwise, if the target element comes before the middle element, repeat binary search on the left half of the list
otherwise, if the target element comes after the middle element, repeat binary search on the right half of the list
if you run out of list without finding the target, it's not in the list
The time complexity of this is logarithmic in terms of the number of users in the system: if you go from a thousand users to a million users, the time taken goes up by a factor of roughly ten, rather than one thousand as was the case for linear search. This is already a vast improvement over linear search and for any realistic number of users is probably going to be efficient enough. However, if additional performance testing and requirements analysis determine that it's still too slow, there are other possibilities.
Imagine now creating a large array of username/password pairs and whenever a pair is added to the collection, a function is used to transform the username into a numeric index. The pair is then inserted at that index in the array. Later, when you want to find whether that entry exists, you use the same function to calculate the index, and then check just that index to see if your element is there. If the function that maps the username to indices (called a hash function; the index is called a hash) is perfect - different strings don't map to the same index - then this unambiguously tells you whether your element exists. Notably, under somewhat reasonable assumptions, the time to make this determination is mostly independent from the number of users currently in the system: you can get (amortized) constant time behavior from this scheme, or something reasonably close to it. That means the performance hit from going from a thousand to a million users might be negligible.
This answer does not delve into the ugly real-world minutia of implementing these ideas in a production system. However, real world systems to implement these ideas (and many more) for precisely the kind of situation presented.
EDIT: comments asked for some pointers on actually implementing a hash table in Python. Here are some thoughts on that.
So there is a built-in hash() function that can be made to work if you disable the security feature that causes it to produce different hashes for different executions of the program. Otherwise, you can import hashlib and use some hash function there and convert the output to an integer using e.g. int.from_bytes. Once you get your number, you can take the modulus (or remainder after division, using the % operator) w.r.t. the capacity of your hash table. This gives you the index in the hash table where the item gets put. If you find there's already an item there - i.e. the assumption we made in theory that the hash function is perfect turns out to be incorrect - then you need a strategy. Two strategies for handling collisions like this are:
Instead of putting items at each index in the table, put a linked list of items. Add items to the linked list at the index computed by the hash function, and look for them there when doing the search.
Modify the index using some deterministic method (e.g., squaring and taking the modulus) up to some fixed number of times, to see if a backup spot can easily be found. Then, when searching, if you do not find the value you expected at the index computed by the hash method, check the next backup, and so on. Ultimately, you must fall back to something like method 1 in the worst case, though, since this process could fail indefinitely.
As for how large to make the capacity of the table: I'd recommend studying recommendations but intuitively it seems like creating it larger than necessary by some constant multiplicative factor is the best bet generally speaking. Once the hash table begins to fill up, this can be detected and the capacity expanded (all hashes will have to be recomputed and items re-inserted at their new positions - a costly operation, but if you increase capacity in a multiplicative fashion then I imagine this will not be too frequent an issue).
I would like to shuffle a relatively long array (length ~400). While I am not a cryptography expert, I understand that using a random number generator with a period of less than 400! will limit the space of the possible permutations that can be generated.
I am trying to use Python's random.SystemRandom number generator class, which, in Windows, uses CryptGenRandom as its RNG.
Does anyone smarter than me know what the period of this number generator is? Will it be possible for this implementation to reach the entire space of possible permutations?
You are almost correct: you need a generator not with a period of 400!, but with an internal state of more than log2(400!) bits (which will also have a period larger than 400!, but the latter condition is not sufficient). So you need at least 361 bytes of internal state. CryptGenRandom doesn't qualify, but it ought to be sufficient to generate 361 or more bytes with which to seed a better generator.
I think Marsaglia has versions of MWC with 1024 and 4096 bytes of state.
There's a great deal of information I can find on hashing strings for obfuscation or lookup tables, where collision avoidance is a primary concern. I'm trying to put together a hashing function for the purpose of load balancing, where I want to fit an unknown set of strings into an arbitrarily small number of buckets with a relatively even distribution. Collisions are expected (desired, even).
My immediate use case is load distribution in an application, where I want each instance of the application to fire at a different time of the half-hour, without needing any state information about other instances. So I'm trying to hash strings into integer values from 0 to 29. However, the general approach has wider application with different int ranges for different purposes.
Can anyone make suggestions, or point me to docs that would cover this little corner of hash generation?
My language of choice for this is python, but I can read most common langues so anything should be applicable.
Your might consider something simple, like the adler32() algo, and just mod for bucket size.
import zlib
buf = 'arbitrary and unknown string'
bucket = zlib.adler32(buf) % 30
# at this point bucket is in the range 0 - 29
I have a large static binary (10GB) that doesn't change.
I want to be able to take as input small strings (15 bytes or lower each) and then to determine which string is the least frequent.
I understand that without actually searching the whole binary I wont be able to determine this exactly, so I know it will be an approximation.
Building a tree/hash table isn't feasible since it will require about 256^15 bytes which is ALOT.
I have about 100GB of disk space and 8GB RAM which will be dedicated into this task, but I can't seem to find any way to accomplish this task without actually going over the file.
I have as much time as I want to prepare the big binary, and after that I'll need to decide which is the least frequent string many many times.
Any ideas?
Thanks!
Daniel.
(BTW: if it matters, I'm using Python)
Maybe build a hashtable with the counts for as many n-tuples as you can afford storage for? You can prune the trees that don't appear anymore. I wouldn't call it "approximation", but could be "upper bounds", with assurance to detect strings that don't appear.
So, say you can build all 4-tuples.
Then to count occurrences for "ABCDEF" you'd have the minimum of count(ABCD), count(BCDE), count(CDEF). If that is zero for any of those, the string is guaranteed to not appear. If it is one, it will appear at most once (but maybe not at all).
Because you have a large static string that does not change you could distinguish one-time work preprocessing this string which never has to be repeated from the work of answering queries. It might be convenient to do the one-time work on a more powerful machine.
If you can find a machine with an order of magnitude or so more internal storage you could build a suffix array - an array of offsets into the stream in sorted order of the suffixes starting at the offset. This could be stored in external storage for queries, and you could use this with binary search to find the first and last positions in sorted order where your query string appears. Obviously the distance between the two will give you the number of occurrences, and a binary search will need about 34 binary chops to do 16 Gbyte assuming 16Gbytes is 2^34 bytes so each query should cost about 68 disk seeks.
It may not be reasonable to expect you to find that amount of internal storage, but I just bought a 1TB USB hard drive for about 50 pounds, so I think you could increase external storage for one time work. There are algorithms for suffix array construction in external memory, but because your query strings are limited to 15 bytes you don't need anything that complicated. Just create 200GB of data by writing out the 15-byte string found at every offset followed by an 5-byte offset number, then sort these 20-byte records with an external sort. This will give you 50Gbytes of indexes into the string in sorted order for you to put into external storage to answer queries with.
If you know all of the queries in advance, or are prepared to batch them up, another approach would be to build an http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm tree from them. This takes time linear in the total size of the queries. Then you can stream the 10GB data past them in time proportional to the sum of the size of that data and the number of times any string finds a match.
Since you are looking for which is least frequent, and are willing to accept approximate solution. You could use a series of Bloom filters instead of a hash table. If you use sufficiently large ones, you shouldn't need to worry about the query size, as you can probably keep the false positive rate low.
The idea would be to go through all of the possible query sizes and make sub-strings out of them. For example, if the queries will be between 3 and 100, then it would cost (N * (sum of (i) from i = 3 to i = 100)). Then one by one add the subsets to one of the bloom filters, such that the query doesn't exist within the filter, creating a new one Bloom filter with the same hash functions if needed. You obtain the count by going through each filter and checking if the query exists within it. Each query then simply goes through each of the filter and checks if it's there, if it is, it adds 1 to a count.
You'll need to try to balance the false positive rate as well as the number of filters. If the false positive rate gets too high on one of the filters it isn't useful, likewise it's bad if you have trillions of bloom filters (quite possible if you one filter per sub-string). There are a couple of ways these issues can be dealt with.
To reduce the number of filters:
Randomly delete filters until there are only so many left. This will likely increase the false negative rate, which probably means it's better to simply delete the filters with the highest expected false positive rates.
Randomly merge filters until there are only so many left. Ideally avoiding merging a filter too often as it increases the false positive rate. Practically speaking, you probably have too many to do this without making use of the scalable version (see below), as it'll probably be hard enough to manage the false positive rate.
It also may not be a bad to avoid a greedy approach when adding to a bloom filter. Be rather selective in which filter something is added to.
You might end up having to implement scalable bloom filters to keep things manageable, which sounds similar to what I'm suggesting anyway, so should work well.
I'm implementing a social stream and a notification system for my web application by using redis. I'm new to redis and I have some doubts about hashes and their efficiency.
I've read this awesome Instagram post
and I planned to implement their similar solution for minimal storage.
As mentioned in their blog, they did like this
To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key within the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):
HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"
So Instead of having 1000 seperate keys they are storing it in one hash with thousand lookup keys. And my doubt is why can't we increase the lookup key values to even more larger.
For eg: Media ID of 1155315 will fall into mediabucket:115 by dividing it by 10000
or even greater than that.
Why are they settling with one hash bucket with 1000 lookup keys. Why can't they have one hash bucket with 100000 lookup keys. Is that related to efficiency?
I need your suggestion for implementing the efficient method in my web application.
P.S. Please! don't say that stackoverflow is not for asking suggestions and I don't know where to find help.
Thanks!
Yes, it's related to efficiency.
We asked the always-helpful Pieter Noordhuis, one of Redis’ core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that are can be encoded in memory very efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.
Small hashes are encoded in a special way (zipmaps), that is memory efficient, but makes operations O(N) instead of O(1). So, with one zipmap with 100k fields instead of 100 zipmaps with 1k fields you gain no memory benefits, but all your operations get 100 times slower.
Basically, they want the number of values stored in a single hash to not exceed 1000. Probably, they set up their Redis instance configuration to work nicely with this number (thy set hash-zipmap-max-entries).
Every time an hash will exceed the number of elements or element size specified it will be converted into a real hash table, and the memory saving will be lost.
-- http://redis.io/topics/memory-optimization
As I understand, your question is "why exactly 1000 and not more?" Well, it's because they had to choose between space efficiency and speed. Space-efficient representation has operation complexity O(N), not O(1) as normal hashes - it is N times slower, but takes less memory.
They tested different values and found that 1000 is a good compromise solution - takes not much space, yet still fast enough.