Linear congruential generator testing full period - python

Write a program to demonstrate that for a linear congruential generator with modulus M = 2^N and constant b = 1, in order to achieve the full period, the multiplier a must be equal to 4K + 1. Consider N = 5, 7, and 10; and K = 2 and 9. You should also consider two values of the multiplier that do not match this.
My problem with this is that I have the code to run this in Python. The suggestions I have been given are
def lcg(modulus, a, c, seed):
    """Linear congruential generator."""
    while True:
        seed = (a * seed + c) % modulus
        yield seed
However, the problem does not mention the seed value, nor do I know how to tell if it will achieve a full period. Any guidance would be appreciated

What is the seed?
The variable seed refers to the state of the RNG (Random-Number Generator). You must have some sort of changeable state, even if it's the position in a file or true random numbers, or the RNG will return the same number on every call.
For an LCG, the seed is simply the most recent value. You save that in a driver routine, or as some sort of internal state integer. It's usually an optional parameter: you provide it on the first call to seed, or begin, the sequence.
How to tell whether I get the full sequence?
Since the LCG state is entirely described by the most recent value, it returns a constant cycle of values, repeating that cycle for as long as you use it. To test this, you can simply start with a seed of 0 (any value will do), and then count how many iterations it takes to get that value again.
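For instance, a minimal sketch of that test, reusing the lcg generator from the question (the helper achieves_full_period, the seed of 0 and c = 1 are my assumptions, not part of the assignment text):

def lcg(modulus, a, c, seed):
    """Linear congruential generator (as in the question)."""
    while True:
        seed = (a * seed + c) % modulus
        yield seed

def achieves_full_period(modulus, a, c=1, seed=0):
    """True if the starting seed reappears only after exactly `modulus` steps."""
    for count, value in enumerate(lcg(modulus, a, c, seed), start=1):
        if value == seed:
            return count == modulus   # back at the seed: full period iff it took M steps
        if count >= modulus:
            return False              # never returned within M steps

# K = 2 and K = 9 give a = 9 and a = 37; 6 and 11 are multipliers that are not 4K + 1
for N in (5, 7, 10):
    M = 2 ** N
    for a in (9, 37, 6, 11):
        print("N=%2d  a=%2d  full period: %s" % (N, a, achieves_full_period(M, a)))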
You can also determine the cycle length from the LCG parameters. This is a research project for the student (start with Wikipedia).
Does that unblock your worries?

Related

Generate random postfix expressions of an arbitrary length without producing duplicate expressions

For a research project, we are currently using the Gamblers Ruin Algorithm to produce random postfix expressions (terms) using term variables "x", "y", and "z" and an operator "*", as shown in the method below:
from random import uniform, choice

def gamblers_ruin_algorithm(prob=0.3,
                            min_term_length=1,
                            max_term_length=None):
    """
    Generate a random term using the gamblers ruin algorithm

    :type prob: float
    :param prob: Probability of growing the size of a random term
    :type max_term_length: int
    :param max_term_length: Maximum length of the generated term
    """
    term_variables = ["x", "y", "z"]
    substitutions = ("EE*", "I")
    term = "E"
    term_length = 0
    # randomly build a term
    while "E" in term:
        rand = uniform(0, 1)
        if rand < prob or term_length < min_term_length:
            index = 0
            term_length += 1
        else:
            index = 1
        if (max_term_length is not None and
                term_length >= max_term_length):
            term = term.replace("E", "I")
            break
        term = term.replace("E", substitutions[index], 1)
    # randomly replace operands
    while "I" in term:
        term = term.replace("I", choice(term_variables), 1)
    return term
The following method will produce a random postfix term like the following:
xyz*xx***zyz*zzx*zzyy***x**x****x**
The issue with this method is that when run thousands of times, it tends to produce duplicate expressions frequently.
Is there a different algorithm for producing random postfix expressions of arbitrary length that minimizes the probability of producing the same expression more than once?
Your basic problem is not the algorithm; it's the way you force the minimum size of the resulting expression. That procedure introduces an important bias into the generation of the first min_term_length operators. If the expression manages to grow further, that bias will slowly decrease, but it will never disappear.
Until the expression reaches the minimum length, you replace the first E with EE*. So the first few expressions are always:
E
EE*
EE*E*
EE*E*E*
...
When the minimum length is reached, the function starts replacing E with I with probability 1-prob, which is 70% using the default argument. If this succeeds for all the Es, the function will return a tree with the above shape.
Suppose that min_term_length is 5. After the five forced growth steps there are six Es left, and the probability that six successive tests all choose not to extend the expression is 0.7^6, or about 11.8%. At that point, the expression will be II*I*I*I*I*, and the six Is will be randomly replaced by a variable name. There are three variables, making a total of 3^6 = 729 different postfix expressions. If you do a large number of samples, the fact that more than a tenth of the samples fall into only 729 possible expressions will certainly create lots of duplicates. That's unnecessary, because there are actually 42 possible shapes of a postfix expression with six operands (the fifth Catalan number), so there are really 42 × 729 = 30618 possible postfix expressions. If all of those could be produced, you'd expect less than one duplicate in a hundred thousand samples.
Note that the bias introduced by forcing a particular replacement for the first min_term_length steps will continue to show up for longer strings as well. For example, if the algorithm happens to expand the string exactly once during the first six steps, which has a probability of about 10%, then it will choose one of only six shapes, although there are 132 possibilities. So you can expect duplicates of that size as well, although somewhat fewer.
Instead of forcing a choice when the string is still short, you should let the algorithm just continue until the gambler is ruined or the maximum length occurs. If the gambler is ruined too soon, throw out that sample and start over. That will slow things down a bit, but it's still quite practical. If tossing out so many possibilities annoys you, you could instead pre-generate all possible patterns of the minimum length -- as noted above, if the minimum length is six operators, then that's 132 shapes, which are easy to enumerate -- and select one of those at random as the starting point.
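A rough sketch of that rejection approach (the name gamblers_ruin_unbiased and the exact structure are mine, not the original code):

from random import uniform, choice

def gamblers_ruin_unbiased(prob=0.3, min_term_length=1, max_term_length=None):
    """Let the gambler's ruin run without forced growth; retry if the result is too short."""
    term_variables = ["x", "y", "z"]
    while True:
        term = "E"
        term_length = 0
        while "E" in term:
            can_grow = max_term_length is None or term_length < max_term_length
            if can_grow and uniform(0, 1) < prob:
                term = term.replace("E", "EE*", 1)
                term_length += 1
            else:
                term = term.replace("E", "I", 1)
        if term_length >= min_term_length:   # otherwise: ruined too soon, throw it out and retry
            return "".join(choice(term_variables) if ch == "I" else ch for ch in term)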
You have four digits: x, y, z, and * such that:
x = 1
y = 2
z = 3
* = 4
So any expression can be expressed as a number using those digits. For example, the postfix expression xy*z* is 12434. And every such number maps to a unique expression.
With this technique, you can map each expression to a unique 32 bit or 64 bit number. And there are many good techniques for generating unique random numbers. See, for example, https://stackoverflow.com/a/34420445/56778.
So:
1. Generate a bunch of unique random numbers.
2. For each random number:
3. Convert it to that modified base 5 number.
4. Generate the expression from that number.
You can of course combine the 3rd and 4th steps. That is, instead of generating a '1', generate an 'x'.
There will of course be some limit on the length of the expressions. Each digit requires two bits to represent, so the maximum length of an expression from a 32 bit number will be 16 characters. You can extend to longer expressions easily enough by generating 64 bit random numbers. Or 128. Or whatever you like. The basic algorithm remains the same.
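A rough sketch of that mapping (the helper names term_to_number / number_to_term / is_valid_postfix are mine; note that a randomly drawn number still needs a well-formedness check, since not every digit string is a valid postfix term):

SYMBOLS = {"x": 1, "y": 2, "z": 3, "*": 4}
DIGITS = {v: k for k, v in SYMBOLS.items()}

def term_to_number(term):
    """Encode a term as an integer whose base-5 digits are 1..4 (0 is unused)."""
    n = 0
    for ch in term:
        n = n * 5 + SYMBOLS[ch]
    return n

def number_to_term(n):
    """Decode an integer back into a term; returns None if a 0 digit appears."""
    chars = []
    while n:
        n, d = divmod(n, 5)
        if d == 0:
            return None
        chars.append(DIGITS[d])
    return "".join(reversed(chars))

def is_valid_postfix(term):
    """Check that the candidate is a well-formed postfix term over a binary operator."""
    depth = 0
    for ch in term:
        depth += -1 if ch == "*" else 1   # an operand pushes one; '*' pops two and pushes one
        if depth < 1:
            return False
    return depth == 1

print(term_to_number("xy*z*"))                           # digit string 1,2,4,3,4 read in base 5
print(number_to_term(term_to_number("xy*z*")))           # -> xy*z*
print(is_valid_postfix(number_to_term(term_to_number("xy*z*"))))   # -> True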

How do you take randomness into account when finding the complexity of a function?

I've been trying to understand complexities, but all the online material has left me confused, especially the part where they create actual mathematical functions. I have a for loop and a while loop. My confusion arises from the while loop. I know the complexity of the for loop is O(n), but the while loop is based on randomness. A random number is picked, and if this number is not in the list, it is added and the while loop broken. My confusion arises here: the while loop may run (in my thoughts) for m times in the worst case, until it is done. So I was thinking the complexity would then be O(n*m)?
I'm just really lost, and need some help.
Technically the worst-case complexity is O(inf): random.randint, if we consider it a true random generator (it isn't, of course), can produce an arbitrarily long run of repeated values. However, we can estimate the "average-case" complexity. It isn't the real average-case complexity (best, worst and average cases must be defined by the input, not by randomness), but it shows how many iterations the program will do on average if we run it many times for a fixed n.
Note that the list works as a set here (you never add a repeated number), so I'd use a not in set check instead, which is O(1) (while not in list is O(i)), to remove that source of complexity and simplify things a bit: now the count of iterations and the complexity can be estimated with the same big-O bounds. A single trial here is a draw from the uniform integer distribution on [1, n]. Success means drawing a number that is not in the set yet.
So what is the expected number of trials before getting an item that is not yet in the set? The set size before each step is i in your code. We can pick any of the n-i remaining numbers, so the probability of success is p_i = (n-i)/n (the distribution is uniform). Each outer iteration is an instance of the geometric distribution: the number of trials up to the first success, with expectation n_i = 1/p_i = n/(n-i). To get the final complexity we sum these counts over the for iterations: sum(n_i for i in range(n)). This is equal to n * Harmonic(n), where Harmonic(n) is the n-th harmonic number (the sum of the reciprocals of the first n natural numbers). Since Harmonic(n) ~ O(log n), the "average-case" complexity of this code is O(n log n).
For a list it will be sum(i*n / (n-i) for i in range(n)) ~ O(n^2 log n) (the proof of this equality is a little longer).
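The code under discussion isn't shown in the question, so here is a reconstruction of the loop as I understand it (using a set, as suggested above), together with an empirical check that the number of draws tracks n * Harmonic(n):

import random

def fill_with_distinct(n):
    """Draw uniform integers from [1, n] until a new one appears, n times; return total draws."""
    seen = set()
    draws = 0
    for _ in range(n):
        while True:
            draws += 1
            x = random.randint(1, n)
            if x not in seen:          # O(1) membership test on a set
                seen.add(x)
                break
    return draws

for n in (100, 1000, 10000):
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    print(n, fill_with_distinct(n), round(n * harmonic))   # observed draws vs. n * Harmonic(n)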
Big O notation is most commonly used for worst-case scenarios. Figure out what the worst case for the given loop could be, express the running time as a function of n, take the highest power of n and ignore any constant factors, and you get the time complexity.

Detect significant changes in a data-set that gradually changes

I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For e.g. if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For e.g. since the list starts with 10, that is our starting normal point
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?
This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol=2):
    iterable = iter(iterable)  # this is necessary if it is a container rather than a generator.
    # note that if the iterable is already a generator, iter(iterable) returns itself.
    base = next(iterable)
    for x in iterable:
        if x >= (base * tol) or x <= (base / tol):
            yield x
            base = x

my_list = [10, 15, 17, 20, 30, 40, 50, 70, 80, 60, 40, 20]
print(list(gen_significant_changes(my_list)))
I can't help with the Python part, but in terms of math, the problem you're asking about is fairly simple to solve using log base 2. A significant change occurs when the current value, divided by a reference constant, falls into a different integer power-of-two band than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at t, compute:
current = math.log(Array[t] / Array[0], 2)
previous = math.log(Array[t-1] / Array[0], 2)
if math.floor(current) != math.floor(previous):
    ...  # a significant change has occurred
Using this method you do not need to keep track of a "normal point" at all, you just need the array. By removing the additional state variable we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.
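A runnable sketch of that idea (count_band_changes is my name for it; I've used math.log2, which is exact for ratios that are powers of two, to avoid floor() flipping on rounding error):

import math

def count_band_changes(data):
    """Count positions where the value crosses into a different power-of-two band relative to data[0]."""
    changes = 0
    for t in range(1, len(data)):
        current = math.floor(math.log2(data[t] / data[0]))
        previous = math.floor(math.log2(data[t - 1] / data[0]))
        if current != previous:
            changes += 1
    return changes

my_list = [10, 15, 17, 20, 30, 40, 50, 70, 80, 60, 40, 20]
print(count_band_changes(my_list))   # -> 5 for the example data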

What numbers that I can put in numpy.random.seed()?

I have noticed that you can put various numbers inside of numpy.random.seed(), for example numpy.random.seed(1), numpy.random.seed(101). What do the different numbers mean? How do you choose the numbers?
Consider a very basic random number generator:
Z[i] = (a*Z[i-1] + c) % m
Here, Z[i] is the ith random number, a is the multiplier, c is the increment, and m is the modulus; different a, c and m combinations give you different generators. This is known as the linear congruential generator introduced by Lehmer. The modulo operation (%) keeps each Z[i] between zero and m-1, and by setting U[i] = Z[i] / m you get random numbers between zero and one.
As you may have noticed, in order to start this generative process - in order to have a Z[1] you need to have a Z[0] - an initial value. This initial value that starts the process is called the seed. Take a look at this example:
The initial value, the seed, is set to 7 to start the process. That value itself is not returned as a random number; instead, it is used to generate the first Z.
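(The worked example referred to above isn't reproduced here; a stand-in with made-up parameters a = 5, c = 3, m = 16 and seed 7 would look like this:)

a, c, m = 5, 3, 16
Z = 7                          # the seed: it starts the recurrence but is not itself an output
for i in range(1, 6):
    Z = (a * Z + c) % m
    print("Z[%d] = %2d   U[%d] = %.4f" % (i, Z, i, Z / m))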
The most important feature of a pseudo-random number generator would be its unpredictability. Generally, as long as you don't share your seed, you are fine with all seeds as the generators today are much more complex than this. However, as a further step you can generate the seed randomly as well. You can skip the first n numbers as another alternative.
Main source: Law, A. M. (2007). Simulation modeling and analysis. Tata McGraw-Hill.
The short answer:
There are three ways to seed() a random number generator in numpy.random:
use no argument or use None -- the RNG initializes itself from the OS's random number generator (which generally is cryptographically random)
use some 32-bit integer N -- the RNG will use this to initialize its state based on a deterministic function (same seed → same state)
use an array-like sequence of 32-bit integers n0, n1, n2, etc. -- again, the RNG will use this to initialize its state based on a deterministic function (same values for seed → same state). This is intended to be done with a hash function of sorts, although there are magic numbers in the source code and it's not clear why they are doing what they're doing.
If you want to do something repeatable and simple, use a single integer.
If you want to do something repeatable but unlikely for a third party to guess, use a tuple or a list or a numpy array containing some sequence of 32-bit integers. You could, for example, use numpy.random with a seed of None to generate a bunch of 32-bit integers (say, 32 of them, which would generate a total of 1024 bits) from the OS's RNG, store in some seed S which you save in some secret place, then use that seed to generate whatever sequence R of pseudorandom numbers you wish. Then you can later recreate that sequence by re-seeding with S again, and as long as you keep the value of S secret (as well as the generated numbers R), no one would be able to reproduce that sequence R. If you just use a single integer, there's only 4 billion possibilities and someone could potentially try them all. That may be a bit on the paranoid side, but you could do it.
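A minimal sketch of that idea (the names S, R1 and R2 are just for illustration):

import numpy as np

np.random.seed(None)                                        # seed from the OS
S = np.random.randint(0, 2**32, size=32, dtype=np.int64)    # 32 values of 32 bits each: keep S secret

np.random.seed(S)                                           # seed with the saved sequence
R1 = np.random.random(5)

np.random.seed(S)                                           # re-seed with the same S later
R2 = np.random.random(5)

print(np.array_equal(R1, R2))                               # True: same seed, same sequence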
Longer answer
The numpy.random module uses the Mersenne Twister algorithm, which you can confirm yourself in one of two ways:
Either by looking at the documentation for numpy.random.RandomState, of which numpy.random uses an instance for the numpy.random.* functions (though you can also use a separate, independent instance of it)
Looking at the source code in mtrand.pyx which uses something called Pyrex to wrap a fast C implementation, and randomkit.c and initarray.c.
In any case here's what the numpy.random.RandomState documentation says about seed():
Compatibility Guarantee A fixed seed and a fixed series of calls to RandomState methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged.
Parameters:
seed : {None, int, array_like}, optional
Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
It doesn't say how the seed is used, but if you dig into the source code it refers to the init_by_array function: (docstring elided)
def seed(self, seed=None):
    cdef rk_error errcode
    cdef ndarray obj "arrayObject_obj"
    try:
        if seed is None:
            with self.lock:
                errcode = rk_randomseed(self.internal_state)
        else:
            idx = operator.index(seed)
            if idx > int(2**32 - 1) or idx < 0:
                raise ValueError("Seed must be between 0 and 2**32 - 1")
            with self.lock:
                rk_seed(idx, self.internal_state)
    except TypeError:
        obj = np.asarray(seed).astype(np.int64, casting='safe')
        if ((obj > int(2**32 - 1)) | (obj < 0)).any():
            raise ValueError("Seed must be between 0 and 2**32 - 1")
        obj = obj.astype('L', casting='unsafe')
        with self.lock:
            init_by_array(self.internal_state, <unsigned long *>PyArray_DATA(obj),
                          PyArray_DIM(obj, 0))
And here's what the init_by_array function looks like:
extern void
init_by_array(rk_state *self, unsigned long init_key[], npy_intp key_length)
{
    /* was signed in the original code. RDH 12/16/2002 */
    npy_intp i = 1;
    npy_intp j = 0;
    unsigned long *mt = self->key;
    npy_intp k;

    init_genrand(self, 19650218UL);
    k = (RK_STATE_LEN > key_length ? RK_STATE_LEN : key_length);
    for (; k; k--) {
        /* non linear */
        mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1664525UL))
            + init_key[j] + j;
        /* for > 32 bit machines */
        mt[i] &= 0xffffffffUL;
        i++;
        j++;
        if (i >= RK_STATE_LEN) {
            mt[0] = mt[RK_STATE_LEN - 1];
            i = 1;
        }
        if (j >= key_length) {
            j = 0;
        }
    }
    for (k = RK_STATE_LEN - 1; k; k--) {
        mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1566083941UL))
            - i; /* non linear */
        mt[i] &= 0xffffffffUL; /* for WORDSIZE > 32 machines */
        i++;
        if (i >= RK_STATE_LEN) {
            mt[0] = mt[RK_STATE_LEN - 1];
            i = 1;
        }
    }

    mt[0] = 0x80000000UL; /* MSB is 1; assuring non-zero initial array */
    self->gauss = 0;
    self->has_gauss = 0;
    self->has_binomial = 0;
}
This essentially "munges" the random number state in a nonlinear, hash-like method using each value within the provided sequence of seed values.
What is normally called a random number sequence in reality is a "pseudo-random" number sequence because the values are computed using a deterministic algorithm and probability plays no real role.
The "seed" is a starting point for the sequence and the guarantee is that if you start from the same seed you will get the same sequence of numbers. This is very useful for example for debugging (when you are looking for an error in a program you need to be able to reproduce the problem and study it, a non-deterministic program would be much harder to debug because every run would be different).
Basically the number guarantees the same 'randomness' every time.
More properly, the number is a seed, which can be an integer, an array (or other sequence) of integers of any length, or the default (None). If seed is None, then random will try to read data from /dev/urandom if available, or make a seed from the clock otherwise.
Edit: In all honesty, as long as your program isn't something that needs to be super secure, it shouldn't matter what you pick. If it is, don't use these methods; use os.urandom() or SystemRandom if you require a cryptographically secure pseudo-random number generator.
The most important concept to understand here is that of pseudo-randomness. Once you understand this idea, you can determine if your program really needs a seed etc. I'd recommend reading here.
To understand the meaning of random seeds, you first need to understand that a "pseudo-random" number sequence is computed by a deterministic algorithm.
So you can think of this number as a starting value from which the generator calculates the next number it returns. Putting the same value here will make your program get the same "random" values every time, so your program becomes deterministic.
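A quick illustration:

import numpy as np

np.random.seed(101)
print(np.random.randint(0, 10, size=5))   # some fixed sequence of 5 numbers

np.random.seed(101)
print(np.random.randint(0, 10, size=5))   # the exact same 5 numbers again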
As said in this post
they (numpy.random and random.random) both use the Mersenne twister sequence to generate their random numbers, and they're both completely deterministic - that is, if you know a few key bits of information, it's possible to predict with absolute certainty what number will come next.
If you really care about randomness, ask the user to generate some noise (some arbitrary words) or just use the system time as the seed.
If your code runs on an Intel CPU (or a recent AMD chip), I also suggest checking out the RdRand package, which uses the CPU instruction rdrand to collect "true" (hardware) randomness.
Refs:
Random seed
What is a seed in terms of generating a random number
One very specific answer: np.random.seed can take values from 0 to 2**32 - 1, which interestingly differs from random.seed, which can take any hashable object.
A side comment: it is better to set your seed to a rather large number that is still within the generator's limit. Doing so gives the seed a good balance of 0 and 1 bits; avoid seeds with many 0 bits.
Reference: pyTorch documentation

How to predict Mersenne Twister patterns after more than 625 iterations?

I read the Wikipedia article for Mersenne twister, in which it mentioned:
The algorithm in its native form is not suitable for cryptography (i.e. it is not a CSPRNG). The reason is that observing a sufficient number of iterations (624 in the case of MT19937, since this is the size of the state vector from which future iterations are produced) allows one to predict all future iterations.
So I thought that, given a particular seed, after 625 iterations I'd see the return values start to repeat. So I wrote this code:
import random

L = []
random.seed(10)
for i in xrange(10000):
    n = random.random()
    if n not in L:
        L.append(n)
Then I checked len(L) and, to my surprise, it was 10000: they are all different values! Then I thought some values might be close to others, so I did:
for i in L:
    if abs(i - L[0]) <= 0.00001:
        print L.index(i)
I got 0, which means no numbers are close to the first number of the list.
Through random.getstate() I can get a tuple whose second element is another tuple containing 625 long integers; I thought it was related to the pattern, but I don't know how. So, how should I predict the behaviour of Python's random.random() for a particular seed?
