I have noticed that you can put various numbers inside of numpy.random.seed(), for example numpy.random.seed(1), numpy.random.seed(101). What do the different numbers mean? How do you choose the numbers?
Consider a very basic random number generator:
Z[i] = (a*Z[i-1] + c) % m
Here, Z[i] is the i-th random number, a is the multiplier, c is the increment, and m is the modulus; different combinations of a, c and m give different generators. This is known as the linear congruential generator, introduced by Lehmer. The modulo operation (%) yields a number between zero and m-1, and by setting U[i] = Z[i] / m you get random numbers between zero and one.
As you may have noticed, to start this generative process - to compute Z[1] - you need an initial value Z[0]. This initial value that starts the process is called the seed. Take a look at this example:
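The example below is a minimal sketch of the process in Python; the parameters a = 5, c = 3 and m = 16 are made-up values, chosen only for illustration:

a, c, m = 5, 3, 16   # made-up LCG parameters, for illustration only
z = 7                # Z[0]: the seed

for i in range(1, 6):
    z = (a * z + c) % m   # Z[i] = (a*Z[i-1] + c) % m
    u = z / m             # U[i]: a value in [0, 1)
    print(i, z, u)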
Here the initial value, the seed, is set to 7 to start the process. However, the seed itself is not returned as a random number; it is only used to generate the first Z.
The most important feature of a pseudo-random number generator is its unpredictability. Generally, as long as you don't share your seed, any seed is fine, since today's generators are much more complex than this one. As a further step, you can generate the seed itself randomly, or skip the first n numbers, as sketched below:
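For instance, a minimal sketch of both alternatives (the choice of 4 bytes and of skipping 1000 draws is illustrative, not a fixed recipe):

import os
import numpy as np

# seed from the OS's entropy source instead of a hand-picked number
np.random.seed(int.from_bytes(os.urandom(4), "big"))

# alternatively (or additionally), discard the first n draws
_ = np.random.rand(1000)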
Main source: Law, A. M. (2007). Simulation modeling and analysis. Tata McGraw-Hill.
The short answer:
There are three ways to seed() a random number generator in numpy.random:
use no argument or use None -- the RNG initializes itself from the OS's random number generator (which generally is cryptographically random)
use some 32-bit integer N -- the RNG will use this to initialize its state based on a deterministic function (same seed → same state)
use an array-like sequence of 32-bit integers n0, n1, n2, etc. -- again, the RNG will use this to initialize its state based on a deterministic function (same values for seed → same state). This is intended to be done with a hash function of sorts, although there are magic numbers in the source code and it's not clear why they are doing what they're doing.
If you want to do something repeatable and simple, use a single integer.
If you want to do something repeatable but unlikely for a third party to guess, use a tuple, a list, or a numpy array containing some sequence of 32-bit integers. You could, for example, seed numpy.random with None to generate a bunch of 32-bit integers from the OS's RNG (say, 32 of them, for 1024 bits in total), store them as a seed S in some secret place, and then use S to generate whatever sequence R of pseudorandom numbers you wish. You can later recreate R by re-seeding with S, and as long as you keep S secret (along with the generated numbers R), no one else will be able to reproduce that sequence. If you just use a single integer, there are only about 4 billion possibilities, and someone could potentially try them all. That may be a bit on the paranoid side, but you could do it.
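For example, a quick sketch of both options (the array values below are arbitrary illustrations, not anything special):

import numpy as np

np.random.seed(42)                  # a single integer: simple and repeatable
first = np.random.rand(3)

np.random.seed(42)                  # same seed, same state, same stream
assert np.all(first == np.random.rand(3))

# an array-like seed: much harder for a third party to guess
np.random.seed([941059652, 342423301, 2820254174, 120201473])
print(np.random.rand(3))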
Longer answer
The numpy.random module uses the Mersenne Twister algorithm, which you can confirm yourself in one of two ways:
By looking at the documentation for numpy.random.RandomState, an instance of which backs the numpy.random.* functions (you can also create your own isolated, independent instance), or
By looking at the source code: mtrand.pyx, which uses something called Pyrex to wrap a fast C implementation, together with randomkit.c and initarray.c.
In any case here's what the numpy.random.RandomState documentation says about seed():
Compatibility Guarantee A fixed seed and a fixed series of calls to RandomState methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged.
Parameters:
seed : {None, int, array_like}, optional
Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
It doesn't say how the seed is used, but if you dig into the source code it refers to the init_by_array function: (docstring elided)
def seed(self, seed=None):
    cdef rk_error errcode
    cdef ndarray obj "arrayObject_obj"
    try:
        if seed is None:
            with self.lock:
                errcode = rk_randomseed(self.internal_state)
        else:
            idx = operator.index(seed)
            if idx > int(2**32 - 1) or idx < 0:
                raise ValueError("Seed must be between 0 and 2**32 - 1")
            with self.lock:
                rk_seed(idx, self.internal_state)
    except TypeError:
        obj = np.asarray(seed).astype(np.int64, casting='safe')
        if ((obj > int(2**32 - 1)) | (obj < 0)).any():
            raise ValueError("Seed must be between 0 and 2**32 - 1")
        obj = obj.astype('L', casting='unsafe')
        with self.lock:
            init_by_array(self.internal_state, <unsigned long *>PyArray_DATA(obj),
                          PyArray_DIM(obj, 0))
And here's what the init_by_array function looks like:
extern void
init_by_array(rk_state *self, unsigned long init_key[], npy_intp key_length)
{
    /* was signed in the original code. RDH 12/16/2002 */
    npy_intp i = 1;
    npy_intp j = 0;
    unsigned long *mt = self->key;
    npy_intp k;

    init_genrand(self, 19650218UL);
    k = (RK_STATE_LEN > key_length ? RK_STATE_LEN : key_length);
    for (; k; k--) {
        /* non linear */
        mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1664525UL))
                + init_key[j] + j;
        /* for > 32 bit machines */
        mt[i] &= 0xffffffffUL;
        i++;
        j++;
        if (i >= RK_STATE_LEN) {
            mt[0] = mt[RK_STATE_LEN - 1];
            i = 1;
        }
        if (j >= key_length) {
            j = 0;
        }
    }
    for (k = RK_STATE_LEN - 1; k; k--) {
        mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1566083941UL))
                - i; /* non linear */
        mt[i] &= 0xffffffffUL; /* for WORDSIZE > 32 machines */
        i++;
        if (i >= RK_STATE_LEN) {
            mt[0] = mt[RK_STATE_LEN - 1];
            i = 1;
        }
    }

    mt[0] = 0x80000000UL; /* MSB is 1; assuring non-zero initial array */
    self->gauss = 0;
    self->has_gauss = 0;
    self->has_binomial = 0;
}
This essentially "munges" the random number state in a nonlinear, hash-like method using each value within the provided sequence of seed values.
What is normally called a random number sequence in reality is a "pseudo-random" number sequence because the values are computed using a deterministic algorithm and probability plays no real role.
The "seed" is a starting point for the sequence and the guarantee is that if you start from the same seed you will get the same sequence of numbers. This is very useful for example for debugging (when you are looking for an error in a program you need to be able to reproduce the problem and study it, a non-deterministic program would be much harder to debug because every run would be different).
Basically the number guarantees the same 'randomness' every time.
More properly, the number is a seed, which can be an integer, an array (or other sequence) of integers of any length, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom if available, or make a seed from the clock otherwise.
Edit: In all honesty, as long as your program isn't something that needs to be super secure, it shouldn't matter what you pick. If you do need security, don't use these methods; use os.urandom() or SystemRandom if you require a cryptographically secure pseudo-random number generator.
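For example, a minimal sketch using only the standard library:

import os
import random

token = os.urandom(16)          # 16 cryptographically secure random bytes
secure = random.SystemRandom()  # an RNG backed by the OS entropy source
print(token.hex(), secure.randint(0, 2**32 - 1))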
The most important concept to understand here is that of pseudo-randomness. Once you understand this idea, you can determine if your program really needs a seed etc. I'd recommend reading here.
To understand the meaning of random seeds, you first need to understand that a "random" number sequence is really pseudo-random: the values are computed by a deterministic algorithm.
So you can think of this number as a starting value for calculating the next number you get from the random generator. Putting the same value here will make your program get the same "random" values every time, so your program becomes deterministic.
As said in this post
they (numpy.random and random.random) both use the Mersenne twister sequence to generate their random numbers, and they're both completely deterministic - that is, if you know a few key bits of information, it's possible to predict with absolute certainty what number will come next.
If you really care about randomness, ask the user to generate some noise (some arbitrary words) or use the system time as the seed.
If your code runs on an Intel CPU (or a recent AMD chip), you can also check out the RdRand package, which uses the CPU instruction rdrand to collect "true" (hardware) randomness.
Refs:
Random seed
What is a seed in terms of generating a random number
One very specific answer: np.random.seed can take values between 0 and 2**32 - 1, which interestingly differs from random.seed, which can take any hashable object.
A side comment: it is better to set your seed to a rather large number, while staying within the generator's limit. That way the seed has a good balance of 0 and 1 bits; avoid seeds with many 0 bits.
Reference: pyTorch documentation
Related
Write a program to demonstrate that for a linear congruential generator with modulus M = 2^N and constant b = 1, in order to achieve the full period, the multiplier a must be equal to 4K + 1. Consider N = 5, 7, and 10, and K = 2 and 9. You should also consider two values of the multiplier that do not satisfy this condition.
My problem is how to implement this in Python. The suggestion I have been given is:
def lcg(modulus, a, c, seed):
    """Linear congruential generator."""
    while True:
        seed = (a * seed + c) % modulus
        yield seed
However, the problem does not mention the seed value, nor do I know how to tell whether the generator achieves a full period. Any guidance would be appreciated.
What is the seed?
The variable seed refers to the state of the RNG (random-number generator). You must have some sort of changeable state, even if it's just the position in a file of true random numbers; otherwise the RNG will return the same number on every call.
For an LCG, the seed is simply the most recent value. You save that in a driver routine, or as some sort of internal state integer. It's usually an optional parameter; you provide it on the first call to seed (that is, begin) the sequence.
How to tell whether I get the full sequence?
Since the LCG state is entirely described by the most recent value, it returns a constant cycle of values, repeating that cycle for as long as you use it. To test this, you can simply start with a seed of 0 (any value will do), and then count how many iterations it takes to get that value again.
You can also determine the cycle length from the LCG parameters. This is a research project for the student (start with Wikipedia).
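As a sketch of the counting approach (reusing the lcg generator from the question; the cap at modulus iterations is a safeguard I've added for parameter choices whose cycle never returns to the starting seed):

def lcg(modulus, a, c, seed):
    """Linear congruential generator."""
    while True:
        seed = (a * seed + c) % modulus
        yield seed

def period(modulus, a, c, seed=0):
    """Count steps until the starting value recurs; a full period equals modulus."""
    gen = lcg(modulus, a, c, seed)
    for count in range(1, modulus + 1):
        if next(gen) == seed:
            return count
    return None  # the starting value never recurred within modulus steps

print(period(2**5, 4 * 2 + 1, 1))  # N=5, K=2, a=9: prints 32, the full period
print(period(2**5, 8, 1))          # a is not of the form 4K + 1: prints None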
Does that unblock your worries?
As an intermediate Python learner, I made an 8-ball program in Python.
Now that I am starting to learn C, is there a way to simulate the way random.choice selects a string from a list of strings, but in C?
The closest thing to a "list of strings" in C is an array of string pointers; and the only standard library function that produces random numbers is rand(), defined in <stdlib.h>.
A simple example:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   // needed for the usual srand() initialization

int main(void)
{
    const char *string_table[] = {   // array of pointers to constant strings
        "alpha",
        "beta",
        "gamma",
        "delta",
        "epsilon"
    };
    int table_size = 5;   // This must match the number of entries above

    srand(time(NULL));    // randomize the start value

    for (int i = 1; i <= 10; ++i)
    {
        const char *rand_string = string_table[rand() % table_size];
        printf("%2d. %s\n", i, rand_string);
    }
    return 0;
}
That will generate and print ten random choices from an array of five strings.
The string_table variable is an array of const char * pointers. You should always use a pointer to const to refer to a string literal like "alpha"; it keeps you from using that pointer in a context where the string contents might be changed.
The random numbers are what are called "pseudorandom"; statistically uncorrelated, but completely determined by a starting "seed" value. Using the statement srand(time(NULL)) takes the current time/date value (seconds since some starting date) and uses that as a seed that won't be repeated in any computer's lifetime. But you will get exactly the same "random" numbers if you manage to run the program twice in the same second. This is easy to do in a shell script, for example. A higher-resolution timestamp would be nice, but there isn't anything useful in the C standard library.
The rand() function returns a non-negative int value from 0 to some implementation-dependent maximum value. The symbolic constant RAND_MAX has that value. The expression rand() % N will return the remainder from dividing that value by N, which is a number from 0 to N-1.
Aconcagua has pointed out that this isn't ideal: if N doesn't evenly divide RAND_MAX, there will be a bias toward smaller numbers. It's okay for now, but plan to learn other methods later if you do serious simulation or statistical work; by that point you probably won't be using the built-in rand() function anyway.
You can write a function that, given the size of your array, uses rand() % size to get a random index into the array, and then returns the value of arr[randidx].
My queries concern the generation of uniform random numbers using numpy.random.uniform on [0, 1).
Does this implementation involve a uniform step-size, i.e. is the universe of possibilities {0, a, 2a, ..., Na}, where (N+1)a = 1 and a is constant?
If the above is true, then what's the value of this step-size? I noticed that the value of numpy.nextafter(x,y) keeps on changing depending upon x. Hence my question regarding whether a uniform step-size was used to implement numpy.random.uniform.
If the step-size is not uniform, then what would be the best way to figure out the number of unique values that numpy.random.uniform(low=0, high=1) can take?
What's the recurrence period of numpy.random.uniform, i.e. after how many samples will I see my original number again? For maximum efficiency, this should be equal to the number of unique values.
I tried looking up the source code at Github but didn't find anything directly interpretable there.
The relevant function is
double
rk_double(rk_state *state)
{
    /* shifts : 67108864 = 0x4000000, 9007199254740992 = 0x20000000000000 */
    long a = rk_random(state) >> 5, b = rk_random(state) >> 6;
    return (a * 67108864.0 + b) / 9007199254740992.0;
}
which is found in randomkit.c inside the numpy source tree.
As you can see, the granularity is 1 / 9007199254740992.0, which equals 2**-53, the (downward) float64 resolution at 1.0.
>>> 1 / 9007199254740992.0
1.1102230246251565e-16
>>> 2**-53
1.1102230246251565e-16
>>> 1-np.nextafter(1.0, 0)
1.1102230246251565e-16
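As a quick sanity check of that granularity (a sketch assuming the legacy numpy.random generator shown above), every draw should be an exact multiple of 2**-53:

import numpy as np

np.random.seed(0)
samples = np.random.uniform(0.0, 1.0, size=100000)

scaled = samples * 2.0**53   # exact in float64 for values of this form
assert np.all(scaled == np.floor(scaled))

As for the recurrence period: that is a property of the underlying Mersenne Twister state, whose period is 2**19937 - 1, vastly larger than the 2**53 distinct output values, so the generator does not simply cycle through the unique values one by one.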
Now I may be nitpicking here, but I wanted to know which is computationally more efficient over a large number of iterations (assuming no restrictions on memory), purely in terms of the time taken to run the overall program.
I am asking mainly in terms of two languages, Python and C, since these are the two I use most.
For example, in C, something like:
int count = 0, sum = 0;
while (count < 99999) {
    // count * count instead of pow(): pow() returns a double,
    // which cannot be used with the integer % operator
    if (((count * count) / 10) % 10 == 4) { // just some random operation
        sum += count * count;               // the square is recomputed here
    }
    count++;
}
or
int count = 0, sum = 0, temp = 0;
while (count < 99999) {
    temp = count * count;        // computed once and cached
    if ((temp / 10) % 10 == 4) { // just some random operation
        sum += temp;
    }
    count++;
}
In Python, something like:
for i in range(99999):
    n = len(str(i))
    print "length of string", str(i), "is", n
or
for i in range(99999):
    temp = str(i)
    n = len(temp)
    print "length of string", temp, "is", n
Now, these are just random operations I thought of on the fly; my main question is whether it is better to create a new variable or just repeat the operation. I know the computing power required changes as i and count grow, but generalizing over a large number of iterations, which is better?
I have tested both of your code samples in Python with timeit, and the result was 1.0221271729969885 seconds for the first option and 1.0154028950055363 seconds for the second.
I ran this test only once for each example and, as you can see, the results are extremely close to each other, too close to be reliable from a single run. If you want to learn more, I suggest you run these tests yourself with timeit, as in the sketch below.
Note, however, that your change replaces a variable assignment to temp with two calls to str(i), not one, and the outcome can of course vary wildly for other functions that are more or less complicated and time-consuming than str.
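For instance, a minimal timeit sketch of that comparison (the loop bodies are illustrative, with str(i) deliberately called twice in the first variant):

import timeit

recompute = """
for i in range(99999):
    n = len(str(i))
    s = str(i)          # second call: the value is recomputed
"""

cached = """
for i in range(99999):
    temp = str(i)       # computed once and cached
    n = len(temp)
    s = temp
"""

print(timeit.timeit(recompute, number=10))
print(timeit.timeit(cached, number=10))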
Caching improves performance and readability of the code most of the time. There may be corner cases for some super-fast operations, but I'd say the style of the second examples is almost always better.
A cached value will most probably be stored in a register, so only a single mov instruction is involved. According to this document, there are not many operations cheaper than that.
As for readability, consider bigger function calls with more arguments: while reviewing code, it takes more brainwork to check that the calls are identical; moreover, if the call changes, bugs can creep in where one call has been modified and the other forgotten (true mostly for Python).
All in all, readability should drive your choices in such cases rather than performance gains, unless a profiler points at this piece of code.
In C, C++, and Java, an integer has a fixed range. One thing I realized in Python is that I can calculate really large integers such as pow(2, 100). The equivalent code in C would clearly overflow, since on a 32-bit architecture the unsigned integer type ranges from 0 to 2^32 - 1. How is it possible for Python to calculate these large numbers?
Basically, big numbers in Python are stored in arrays of 'digits'. 'Digit' is quoted because each 'digit' can itself be quite a big number.
You can check the details of implementation in longintrepr.h and longobject.c:
There are two different sets of parameters: one set for 30-bit digits,
stored in an unsigned 32-bit integer type, and one set for 15-bit
digits with each digit stored in an unsigned short. The value of
PYLONG_BITS_IN_DIGIT, defined either at configure time or in pyport.h,
is used to decide which digit size to use.
/* Long integer representation.
   The absolute value of a number is equal to
        SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero. Also, in all cases, for all valid i,
        0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.
*/
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
How is it possible for Python to calculate these large numbers?
How is it possible for you to calculate these large numbers if you only have the 10 digits 0-9? Well, you use more than one digit!
Bignum arithmetic works the same way, except the individual "digits" are not 0-9 but 0 to 2**32 - 1 or 0 to 2**64 - 1.
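A quick illustration (the exact sizes are CPython implementation details and may vary by version and platform):

import sys

n = pow(2, 100)
print(n)                  # 1267650600228229401496703205376
print(sys.getsizeof(1))   # a small int, e.g. 28 bytes on 64-bit CPython
print(sys.getsizeof(n))   # larger: extra internal "digits" were allocated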