I would like to shuffle a relatively long array (length ~400). While I am not a cryptography expert, I understand that using a random number generator with a period of less than 400! will limit the space of the possible permutations that can be generated.
I am trying to use Python's random.SystemRandom number generator class, which, on Windows, uses CryptGenRandom as its underlying RNG.
Does anyone smarter than me know what the period of this number generator is? Will it be possible for this implementation to reach the entire space of possible permutations?
You are almost correct: you need a generator not with a period of 400!, but with an internal state of more than log2(400!) bits (which will also have a period larger than 400!, but the latter condition is not sufficient). So you need at least 361 bytes of internal state. CryptGenRandom doesn't qualify, but it ought to be sufficient to generate 361 or more bytes with which to seed a better generator.
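For anyone who wants to verify that figure, here is a quick back-of-the-envelope check in Python (this is just arithmetic, not tied to any particular generator):
import math
bits_needed = math.log2(math.factorial(400))   # bits required to index every permutation of 400 items
print(bits_needed)                             # about 2886 bits
print(math.ceil(bits_needed / 8))              # about 361 bytes of internal state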
I think Marsaglia has versions of MWC with 1024 and 4096 bytes of state.
Related
I made a binary search function and I'm curious what would happen if I used it on a list of 4 billion numbers, but I get a MemoryError every time I try to build such a list. Is there a way to store the list without this issue?
Yes, but it's gonna be expensive.
Assuming the numbers are 1-4,000,000,000, each number would take up either 28 or 32 bytes. Going with 32 bytes, storing 4,000,000,000 numbers would take ~128GB. The list itself also needs memory for its element pointers, 8 bytes per entry, which adds another ~32GB, so it would require roughly 160GB of memory altogether.
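If you want to check those per-integer sizes on your own machine, sys.getsizeof will tell you (the exact numbers vary by Python version and platform):
import sys
print(sys.getsizeof(1))              # typically 28 bytes on a 64-bit build
print(sys.getsizeof(4_000_000_000))  # typically 32 bytes, since it needs an extra internal digit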
If you're wanting to emulate a list, however, things become a lot more reasonable.
If your numbers follow some formula, you can make a class that "pretends" to be a list by following the answers here. This creates a custom class that works like a list (e.g. you can do foo[3] on it and it will return a number) but doesn't take up all that memory, because it isn't actually storing 4,000,000,000 numbers. In fact, this is pretty similar to how range() works, and is the reason why range(1, 4_000_000_000) doesn't take up 160GB of RAM.
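For instance, here is a minimal sketch of such a class (the name VirtualRange is made up for illustration): it behaves like a read-only sorted list of the integers from start to stop-1 without storing them, which is all a binary search via bisect needs:
import bisect
class VirtualRange:
    """Pretends to be a sorted list of the integers start .. stop-1."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop
    def __len__(self):
        return self.stop - self.start
    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self.start + index        # computed on the fly, nothing stored
numbers = VirtualRange(1, 4_000_000_001)           # "contains" 1..4,000,000,000
print(bisect.bisect_left(numbers, 2_500_000_000))  # works without ~160GB of RAM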
Input: A list of positive integers where one entry occurs exactly once, and all other entries occur exactly twice (for example [1,3,2,5,3,4,1,2,4])
Output: The unique entry (5 in the above example)
The following algorithm is supposed to be O(m) time and O(1) space where m is the size of the list.
def get_unique(intlist):
    unique_val = 0
    for int in intlist:
        unique_val ^= int
    return unique_val
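For reference, running it on the example input from the question:
print(get_unique([1, 3, 2, 5, 3, 4, 1, 2, 4]))   # prints 5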
My analysis: Given a list of length m, there will be (m + 1)/2 distinct positive integers in the input list, so the smallest possible maximum value in the list is (m + 1)/2. If we assume this best case, then when taking the XOR sum the variable unique_val will require ceil(log2((m + 1)/2)) bits of memory, so I thought the space complexity should be at least O(log(m)).
Your analysis is certainly one correct answer, particularly in a language like Python which gracefully handles arbitrarily large numbers.
It's important to be clear about what you're trying to measure when thinking about space and time complexity. A reasonable assumption might be that the size of an integer is constant (e.g. you're using 64-bit integers). In that case, the space complexity is certainly O(1), but the time complexity is still O(m).
Now, you could also argue that using a fixed-size integer means you have a constant upper-bound on the size of m, so perhaps the time complexity is also O(1). But in most cases where you need to analyze the running time of this sort of algorithm, you're probably very interested in the difference between a list of length 10 and one of length 1 billion.
I'd say it's important to clarify and state your assumptions when analyzing space- and time-complexity. In this case, I would assume we have a fixed size integer and a value of m much smaller than the maximum integer value. In that case, O(1) space and O(m) time are probably the best answers.
EDIT (based on discussion in other answers)
Since all m gives you is a lower bound on the maximum value in the list, you really can't provide a worst-case estimate of the space: a number in the list can be arbitrarily large. To give any reasonable answer about the space complexity of this algorithm, you need to make some assumption about the maximum size of the input values.
Space/time complexity analysis is usually applied to algorithms at a higher level. While you can drop down to the level of a specific language implementation, that isn't useful in all cases.
Your analysis is both right and possibly wrong. It's right for the current CPython implementation, where integers have no maximum value. The O(1) claim is fine only if all your integers are relatively small and fit into the implementation-specific small-number case.
But it doesn't have to be valid for all other implementations of Python. For example, you could have an optimizing implementation which figures out that intlist is not used again and, instead of using unique_val, reuses the space of the consumed list elements (basically transforming this function into a space-optimized reduce call).
Then again, can we even talk about space complexity in a garbage-collected language with heap-allocated integers? In that sense your analysis of the complexity is wrong, because a ^= b will allocate new memory for a big result, and the size of that allocation depends on the system, architecture, Python version, and luck.
Your original question, however, is "Why is the following algorithm O(1) space?" If you look at the algorithm itself and assume some arbitrary maximum integer limit, or that your language can represent any number in a bounded amount of space, then the answer is that it is: under those conditions the algorithm itself uses constant space.
The complexity of an algorithm is always dependent on the machine model (= platform) you use. E.g. we often say that multiplying and dividing IEEE floating point numbers is of run-time complexity O(1) - which is not always the case (e.g. on an 8086 processor without FPU).
For the above algorithm, the space complexity of O(1) only holds as long as your input list has no element greater than 2147483647 (sys.maxint on a 32-bit build of Python 2). Up to that limit, Python stores integers as native signed 32-bit values. For that datatype, your processor already implements all the relevant operations in hardware, it generally takes only a constant number of clock cycles (in most cases just one) to perform them (run-time complexity O(1)), and only a constant amount of memory (a single machine word) is occupied to store the result (space complexity O(1)).
However, if your input exceeds 2147483647, Python switches to a software-implemented arbitrary-precision type to store these big integers. Operations on those are no longer O(1), and they require more than constant space.
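A quick way to see that growth in current CPython (a sketch; the exact byte counts depend on version and platform):
import sys
print(sys.getsizeof(1))         # typically 28 bytes: one internal 30-bit digit
print(sys.getsizeof(2**40))     # typically 32 bytes: two digits
print(sys.getsizeof(2**1000))   # roughly 160 bytes: the object grows with the value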
I need to generate a 32-bit random int, but one that depends on some arguments. The idea is to generate a unique ID for each message sent through my own P2P network. As arguments I would like to use my IP address and a timestamp. My question is: how can I generate this 32-bit random int from these arguments?
Here's a list of options with their associated problems:
1. Use a random number. You will get a collision (non-unique value) after roughly 2**(bits/2) messages (this is the "birthday collision"), so for 32 bits you get a collision after about 2**16 messages. If you are sending fewer than 65,000 messages this is not a problem, but 65,000 is not such a big number.
2. Use a sequential counter from some service. This is what Twitter's Snowflake does (see another answer here). The trouble is supplying these across the net. Typically, with distributed systems, you give each agent a block of numbers (so A might get 0-9, B gets 10-19, etc.) and they use those numbers, then request a new block. That reduces network traffic and load on the service providing the numbers, but it is complex.
3. Generate a hash from some values that will be unique. This sounds useful but is really no better than (1), because your hashes are going to collide (I explain why below). So you can hash IP address and timestamp, but all you're doing is generating 32-bit random numbers, in effect (the difference is that you can reproduce these values, but it doesn't seem like you need that functionality anyway), and so again you'll have collisions after 65,000 messages or so, which is not much.
4. Be smarter about generating IDs, to guarantee uniqueness. The problem in (3) is that you are hashing more than 32 bits, so you are compressing information and getting overlaps. Instead, you could explicitly manage the bits to avoid collisions. For example, number each client with 16 bits (allows up to 65,000 clients) and then have each client use a 16-bit counter (allows up to 65,000 messages per client, which is a big improvement on (3)); see the sketch at the end of this answer. Those IDs won't collide because each is guaranteed unique, but you have a lot of limits in your system and things start to get complex (you need to number the clients and store counter state per client).
5. Use a bigger field. If you used 64-bit IDs then you could just use random numbers, because collisions would occur about once every 2**32 messages, which is practically never (1 in 4,000,000,000). Or you could join the IP address (32 bits) with a 32-bit timestamp (but be careful: that probably means no more than one message per second per client). The only drawback is slightly larger bandwidth, but in most cases IDs are much smaller than payloads.
Personally, I would use a larger field and random numbers; it's simple and works (although good random numbers are an issue in, say, embedded systems).
Finally, if you need the value to be "really" random (because, for example, IDs are used to decide priority and you want things to be fair), then you can take one of the solutions above with deterministic values and re-arrange the bits to be pseudo-random. For example, reversing the bits in a counter may well be good enough (compare LSB first).
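To make option 4 concrete, here is a minimal sketch (the function name and the field layout are just an illustration): a pre-assigned 16-bit client id and a per-client 16-bit counter packed into one 32-bit message id.
def make_message_id(client_id, counter):
    # both fields must fit in 16 bits; uniqueness holds while each client has
    # sent fewer than 65,536 messages and client ids are assigned uniquely
    assert 0 <= client_id < 2**16 and 0 <= counter < 2**16
    return (client_id << 16) | counter
print(hex(make_message_id(42, 7)))   # 0x2a0007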
I would suggest using some sort of hash. There are many possible hashes; the FNV hash comes in a variety of sizes and is fast. If you want something cryptographically secure, it will be a lot slower. You may need to add a counter (1, 2, 3, 4, ...) to the input to ensure that you do not get duplicate hashes within the same timestamp.
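As a sketch of what that might look like with a 32-bit FNV-1a hash (the constants are the standard FNV-1a parameters; the input layout of IP, timestamp, and counter is just one possibility):
def fnv1a_32(data):
    h = 0x811c9dc5                          # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xffffffff   # FNV prime, truncated to 32 bits
    return h
print(fnv1a_32(b"192.168.0.5|1700000000|1"))   # ip | timestamp | counter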
Did you try looking into Twitter's Snowflake? There is a Python wrapper for it.
Is there a boundary on lists and dictionaries in Python?
If there is, what is the limit?
I think by boundary you mean whether there is an upper bound on the number of elements in a list or dict. Python does not define any limits on them, so they can be as big as the memory available on your machine permits.
Actually, currently the hash implementation for built-in Python objects uses 32-bit hashes, so at a point close to 2^32 elements in a dictionary (assuming you have memory for that much) you will start to get a lot of collisions and see a significant slowdown in dictionary usage. But that won't prevent it from working.
(Python developers are looking at making this hash 64-bit in future builds, at which point this will no longer be an issue.)
As for an absolute limit, there is none; the limiting factor is the available system memory.
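If you want a concrete figure for the theoretical cap in CPython (container lengths are stored in a signed machine word), it is sys.maxsize, though in practice memory runs out long before that:
import sys
print(sys.maxsize)   # 9223372036854775807 on a 64-bit build; the largest length any container can report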
The amount of memory you have is the limit.
I have an application that runs a certain experiment 1000 times (multi-threaded, so that multiple experiments are done at the same time). Every experiment needs approximately 50,000 random.random() calls.
What is the best approach to make this really random? I could give every experiment a copy of a random object and then do a jumpahead of 50,000 * expid. The documentation suggests that jumpahead(1) already scrambles the state, but is that really true?
Or is there another way to do this in 'the best way'?
(No, the random numbers are not used for security but for a Metropolis-Hastings algorithm. The only requirement is that the experiments are independent; it doesn't matter whether the random sequence is somehow predictable.)
I could give every experiment a copy of a random object and then do a jumpahead of 50,000 * expid.
Approximately correct. Each thread gets its own Random instance.
Seed all of them to the same seed value. Use a constant to test, use /dev/random when you "run for the record".
Edit. Outside Python, and in older implementations, use jumpahead(50000 * expid) to avoid the situation where two generators wind up with parallel sequences of values. In any reasonably current (post-2.3) Python, jumpahead no longer does a simple linear jump, and using expid is sufficient to scramble the state.
You can't simply do jumpahead(1) in each thread, since that would just ensure they stay synchronized. Use jumpahead(expid) so that each thread's state is scrambled distinctly.
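A minimal sketch of that setup, assuming Python 2 (jumpahead() was removed in Python 3, as a later answer notes); the names SEED and CALLS_PER_EXPERIMENT are just placeholders:
import random
SEED = 12345                     # a constant while testing; seed from /dev/random when you "run for the record"
CALLS_PER_EXPERIMENT = 50000
generators = []
for expid in range(1000):
    r = random.Random(SEED)                    # every experiment starts from the same seed
    r.jumpahead(CALLS_PER_EXPERIMENT * expid)  # distinct scramble per experiment
    generators.append(r)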
The documentation suggests that jumpahead(1) already scrambles the state, but is that really true?
Yes, jumpahead does indeed "scramble" the state. Recall that for a given seed you get one long but fixed sequence of pseudo-random numbers; you're jumping ahead within that sequence. To pass randomness tests, you must get all your values from this one sequence.
Edit. Once upon a time, jumpahead(1) was limited. Now jumpahead(1) really does a larger scrambling. The scrambling, however, is deterministic. You can't simply do jumpahead(1) in each thread.
If you have multiple generators with different seeds, you violate the "one sequence from one seed" assumption and your numbers aren't going to be as random as if you get them from a single sequence.
If you only jumpahead by 1, you may be getting parallel sequences which might be similar. [This similarity might not be detectable; theoretically, there's a similarity.]
When you jumpahead by 50,000, you ensure that you follow the one-sequence-one-seed premise. You also ensure that you won't have adjacent sequences of numbers in two experiments.
Finally, you also have repeatability. For a given seed, you get consistent results.
Same jumpahead: Not Good.
>>> y=random.Random( 1 )
>>> z=random.Random( 1 )
>>> y.jumpahead(1)
>>> z.jumpahead(1)
>>> [ y.random() for i in range(5) ]
[0.99510321786951772, 0.92436920169905545, 0.21932404923057958, 0.20867489035315723, 0.91525579001682567]
>>> [ z.random() for i in range(5) ]
[0.99510321786951772, 0.92436920169905545, 0.21932404923057958, 0.20867489035315723, 0.91525579001682567]
You shouldn't use that function. There is no proof that it works correctly with the Mersenne Twister generator. Indeed, it was removed from Python 3 for that reason.
For more information about generating pseudo-random numbers in parallel environments, see this article by David Hill.
jumpahead(1) is indeed sufficient (and identical to jumpahead(50000) or any other such call, in the current implementation of random; I believe that came in at the same time as the Mersenne Twister based implementation). So use whatever argument fits well with your program's logic. (Do use a separate random.Random instance per thread for thread-safety purposes, of course, as your question already hints.)
(Numbers generated by the random module are not meant to be cryptographically strong, so it's a good thing that you're not using them for security purposes. ;-)
Per the random module docs at python.org:
"You can instantiate your own instances of Random to get generators that don’t share state."
And there's also a relevant-looking note on jumpahead, as you mention. But the guarantees there are kind of vague. If the calls to OS-provided randomness aren't so expensive as to dominate your running time, I'd skip all the subtlety and do something like:
import os, random
randoms = [random.Random(os.urandom(4)) for _ in range(num_expts)]
If num_expts is ~1000, then you're unlikely to have any collisions in your seed (birthday paradox says you need about 65000 experiments before there's a >50% probability that you have a collision). If this isn't good enough for you or if the number of experiments is more like 100k instead of 1k, then I think it's reasonable to follow this up with
for idx, r in enumerate(randoms):
    r.jumpahead(idx)
Note that I don't think it will work to just make your seed longer (os.urandom(8), for example), since the random docs state that the seed must be hashable, and so on a 32-bit platform you're only going to get at most 32 bits (4 bytes) of useful entropy in your seed.
This question piqued my curiosity, so I went and looked at the code implementing the random module. I am definitely not a PRNG expert, but it does seem like slightly differing values of n in jumpahead(n) will lead to markedly different Random instance states. (Always scary to contradict Alex Martelli, but the code does use the value of n when shuffling the random state).