Python, NumPy, and R all use the same algorithm (the Mersenne Twister) for generating random number sequences, so in theory setting the same seed should produce the same sequence in all three. In practice it does not. I suspect the three implementations seed or initialize the generator differently, which would explain this behavior.
R
>set.seed(1)
>runif(5)
[1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Python
In [3]: random.seed(1)
In [4]: [random.random() for x in range(5)]
Out[4]:
[0.13436424411240122,
0.8474337369372327,
0.763774618976614,
0.2550690257394217,
0.49543508709194095]
NumPy
In [23]: import numpy as np
In [24]: np.random.seed(1)
In [25]: np.random.rand(5)
Out[25]:
array([ 4.17022005e-01, 7.20324493e-01, 1.14374817e-04,
3.02332573e-01, 1.46755891e-01])
Is there some way in which the NumPy and Python implementations could produce the same random number sequence? Of course, as some comments and answers point out, one could use rpy2. What I am specifically looking for is how to fine-tune the parameters in the respective calls in Python and NumPy to get the same sequence.
Context: The concern comes from an edX course offering in which R is used. In one of the forums, it was asked if Python could be used and the staff replied that some assignments would require setting specific seeds and submitting answers.
Related:
Comparing Matlab and Numpy code that uses random number generation: from this it seems that the underlying NumPy and Matlab implementations are similar.
python vs octave random generator: This question does come fairly close to the intended answer. Some sort of wrapper around the default state generator is required.
Use rpy2 to call R from Python. Here is a demo; the NumPy array data shares memory with x in R:
import numpy as np
import rpy2.robjects as robjects

data = robjects.r("""
set.seed(1)
x <- runif(5)
""")

print(np.array(data))   # view the values generated by R
data[1] = 1.0           # modify the R vector in place...
print(robjects.r["x"])  # ...and the change is visible on the R side
I realize this is an old question, but I've stumbled upon the same problem recently, and created a solution which can be useful to others.
I've written a random number generator in C, and linked it to both R and Python. This way, the random numbers are guaranteed to be the same in both languages since they are generated using the same C code.
The program is called SyncRNG and can be found here: https://github.com/GjjvdBurg/SyncRNG.
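Separately from the cross-language solutions above, for the narrower NumPy-versus-Python part of the question there is a trick worth sketching. Both CPython's random module and NumPy's legacy RandomState are built on MT19937 and, as far as I can tell, use the same 53-bit conversion to doubles; only the seeding differs. If that assumption holds, copying the state (rather than the seed) from one to the other lines the float sequences up:

import random
import numpy as np

random.seed(1)

# random.getstate() returns (version, internal_state, gauss_next), where
# internal_state is the 624 Mersenne Twister words plus the current position.
version, state, gauss_next = random.getstate()
mt_words, pos = state[:-1], state[-1]

# Transplant that state into NumPy's legacy global RandomState.
np.random.set_state(('MT19937', np.array(mt_words, dtype=np.uint32), pos))

print([random.random() for _ in range(3)])
print(np.random.random(3))   # should match the three values above,
                             # if the float-conversion assumption holds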
Related
I am looking for a Python function that will mimic the behavior of rand() (and srand()) in C with the following requirements:
I can provide the same epoch time into the Python equivalent of srand() to seed the function
The equivalent of rand()%256 should result in the same char value as in C if both were provided the same seed.
So far, I have considered both the random library and NumPy's random library. However, instead of returning a random integer from 0 to 32767 as C does, both yield a floating-point number between 0 and 1 from their random() functions. When I tried random.randint(0, 32767), I got different results than from my C function.
TL;DR
Is there an existing function/library in Python that follows the same random sequence as C's rand()?
You can accomplish this with CDLL from the ctypes module (https://docs.python.org/3/library/ctypes.html), which lets you call the C library's own srand and rand:

from ctypes import CDLL

# Load the C standard library and call its srand()/rand() directly, so the
# values match a C program linked against the same libc. "libc.so.6" assumes
# glibc on Linux; other platforms use a different library name.
libc = CDLL("libc.so.6")

libc.srand(42)
print(libc.rand() % 32768)
You can't make a Python version of rand and srand functions "follo[w] the same random sequence" of C's rand and srand because the C standard doesn't specify exactly what that sequence is, even if the seed is given. Notably:
rand uses an unspecified pseudorandom number algorithm, and that algorithm can differ between C implementations, including versions of the same standard library.
rand returns values no greater than RAND_MAX, and RAND_MAX can differ between C implementations.
In general, the best way to "sync" PRNGs between two programs in different languages is to implement the same PRNG in both languages. This may be viable assuming you're not using pseudorandom numbers for information security purposes. See also this question: How to sync a PRNG between C#/Unity and Python?
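To make that concrete, here is a minimal sketch of a shared PRNG for the rand()-style use case in the question. It mirrors the example LCG shown in the C standard's description of rand()/srand(); note the assumption: a real libc is free to use a different algorithm, so this only matches a C program that compiles this same recurrence rather than calling its libc's rand():

class PortableRand:
    # LCG matching the sample rand()/srand() given in the C standard's text.
    # Treat this as a shared reference PRNG to mirror on the C side,
    # not as a clone of any particular libc's rand().
    def __init__(self, seed=1):
        self.state = seed & 0xFFFFFFFF

    def srand(self, seed):
        self.state = seed & 0xFFFFFFFF

    def rand(self):
        self.state = (self.state * 1103515245 + 12345) & 0xFFFFFFFF
        return (self.state >> 16) % 32768

rng = PortableRand()
rng.srand(42)
print(rng.rand() % 256)   # same byte as a C program using the same recurrence

With seed 1, the first two outputs of this recurrence are 16838 and 5758, the well-known values of that reference implementation.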
You can use random.seed(). The following example should print the same sequence every time it runs:
import random

random.seed(42)
for _ in range(10):
    # Note: random.randint(a, b) is inclusive of both endpoints, so this
    # draws from 0..32768; use randint(0, 32767) to mimic a 15-bit rand().
    print(random.randint(0, 32768))
Update: I just saw the last comment by the OP. No, this code won't give you the same sequence as the C code, for the reasons given in other comments; two different C implementations won't agree with each other either.
I'm working on a piece of code for a game that calculates the distances between all the objects on the screen using their in-game coordinate positions. Originally I was going to use basic Python and lists to do this, but since the number of distances that need to be calculated grows quadratically with the number of objects, I thought it might be faster to do it with NumPy.
I'm not very familiar with NumPy, and I've been experimenting on basic bits of code with it. I wrote a little bit of code to time how long it takes for the same function to complete a calculation in NumPy and in regular Python, and NumPy seems to consistently take a good bit more time than regular Python.
The function is very simple. It starts with 1.1 and then increments 200,000 times, adding 0.1 to the last value and then finding the square root of the new value. It's not what I'll actually be doing in the game code, which will involve finding total distance vectors from position coordinates; it's just a quick test I threw together. I already read here that the initialization of arrays takes more time in NumPy, so I moved the initializations of both the NumPy and Python arrays outside their functions, but plain Python is still faster than NumPy.
Here is the bit of code:
#!/usr/bin/python3
import numpy
from timeit import timeit
#from time import process_time as timer
import math
thing = numpy.array([1.1,0.0], dtype='float')
thing2 = [1.1,0.0]
def NPFunc():
    for x in range(1, 200000):
        thing[0] += 0.1
        thing[1] = numpy.sqrt(thing[0])

    print(thing)
    return None

def PyFunc():
    for x in range(1, 200000):
        thing2[0] += 0.1
        thing2[1] = math.sqrt(thing2[0])

    print(thing2)
    return None
print(timeit(NPFunc, number=1))
print(timeit(PyFunc, number=1))
It gives this result, which indicates normal Python is 3x faster:
[ 20000.99999999 141.42489173]
0.2917748889885843
[20000.99999998944, 141.42489172698504]
0.10341173503547907
Am I doing something wrong, or is this calculation just so simple that it isn't a good test for NumPy?
"Am I doing something wrong, or is this calculation just so simple that it isn't a good test for NumPy?"
It's not really that the calculation is simple, but that you're not taking any advantage of NumPy.
The main benefit of NumPy is vectorization: you can apply an operation to every element of an array in one go, and whatever looping is needed happens inside some tightly-optimized C (or Fortran or C++ or whatever) loop inside NumPy, rather than in a slow generic Python iteration.
But you're only accessing a single value, so there's no looping to be done in C.
On top of that, because the values in an array are stored as "native" values, NumPy functions don't need to unbox them, pulling the raw C double out of a Python float, and then re-box them in a new Python float, the way any Python math functions have to.
But you're not doing that either. In fact, you're doubling that work: you're pulling the value out of the array as a float (boxing it), then passing it to a function (which has to unbox it, and then rebox it to return a result), then storing it back in an array (unboxing it again).
And meanwhile, because np.sqrt is designed to work on arrays, it has to first check the type of what you're passing it and decide whether it needs to loop over an array or unbox and rebox a single value or whatever, while math.sqrt just takes a single value. When you call np.sqrt on an array of 200000 elements, the added cost of that type switch is negligible, but when you're doing it every time through the inner loop, that's a different story.
So, it's not an unfair test.
You've demonstrated that using NumPy to pull out values one at a time, act on them one at a time, and store them back in the array one at a time is slower than just not using NumPy.
But, if you compare it to actually taking advantage of NumPy—e.g., by creating an array of 200000 floats and then calling np.sqrt on that array vs. looping over it and calling math.sqrt on each one—you'll demonstrate that using NumPy the way it was intended is faster than not using it.
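A quick sketch of that fairer comparison (timings vary by machine; the point is only the relative gap):

import math
from timeit import timeit
import numpy as np

values = 1.1 + 0.1 * np.arange(200_000)   # 200,000 floats as a NumPy array
values_list = values.tolist()             # the same values as a plain list

# Vectorized: one np.sqrt call over the whole array; the loop runs in C.
print(timeit(lambda: np.sqrt(values), number=100))

# Scalar: a Python-level loop calling math.sqrt on each element.
print(timeit(lambda: [math.sqrt(v) for v in values_list], number=100))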
You're comparing it wrong. For a fair test, give NumPy a whole array to work on:

import numpy as np
from timeit import timeit

a_list = np.arange(0, 20000, 0.1)
print(timeit(lambda: np.sqrt(a_list), number=1))
I've tried searching quite a lot on this one, but being relatively new to python I feel I am missing the required terminology to find what I'm looking for.
I have a function:
def my_function(x, y):
    # code...
    return (a, b, c)
Where x and y are numpy arrays of length 2000 and the return values are integers. I'm looking for a shorthand (one-liner) to loop over this function as such:
Output = [my_function(X[i],Y[i]) for i in range(len(Y))]
Where X and Y are of the shape (135,2000). However, after running this I am currently having to do the following to separate out 'Output' into three numpy arrays.
Output = np.asarray(Output)
a = Output.T[0]
b = Output.T[1]
c = Output.T[2]
Which I feel isn't the best practice. I have tried:
(a,b,c) = [my_function(X[i],Y[i]) for i in range(len(Y))]
But this doesn't seem to work. Does anyone know a quick way around my problem?
my_function(X[i], Y[i]) for i in range(len(Y))
On the verge of crossing the "opinion-based" border, ...Y[i]... for i in range(len(Y)) is usually a big no-no in Python, and an even bigger no-no when working with NumPy arrays. One of the advantages of working with NumPy is the 'vectorization' it provides, which pushes the for loop down to the C level rather than the (slower) Python level.
So, if you rewrite my_function so it can handle the arrays in a vectorized fashion using the multiple tools and methods that numpy provides, you may not even need that "one-liner" you are looking for.
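For illustration, here is a hypothetical vectorized version; the real my_function isn't shown in the question, so the body below is made up purely to show the shape of a vectorized rewrite:

import numpy as np

def my_function_vectorized(X, Y):
    # Hypothetical per-row computations over the full (135, 2000) arrays;
    # each result has shape (135,), with no Python-level loop.
    a = (X * Y).sum(axis=1)            # one value per row
    b = np.abs(X - Y).argmax(axis=1)   # per-row index of the largest difference
    c = (X > Y).sum(axis=1)            # per-row count
    return a, b, c

X = np.random.rand(135, 2000)
Y = np.random.rand(135, 2000)
a, b, c = my_function_vectorized(X, Y)

If the per-row computation genuinely cannot be vectorized, a common way to split the list of 3-tuples from the original loop is a, b, c = map(np.asarray, zip(*Output)).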
I am currently implementing this simple code trying to find the n-th element of the Fibonacci sequence using Python 2.7:
import numpy as np
def fib(n):
    F = np.empty(n + 2)
    F[1] = 1
    F[0] = 0
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return int(F[n])
This works fine for n < 79, but after that I get wrong numbers. For example, according to Wolfram Alpha, F(79) should be equal to 14472334024676221, but fib(79) gives me 14472334024676220. I think this could be caused by the way Python deals with integers, but I have no idea what exactly the problem is. Any help is greatly appreciated!
The issue is NumPy's fixed-width data types: np.empty defaults to a 64-bit float, which can only represent integers exactly up to 2**53 (and NumPy's integer dtypes are likewise limited to 64 or 32 bits, depending on architecture). F(79) is the first Fibonacci number past that 2**53 limit, which is why the results go off by one right there.
Pure Python would let you have arbitrarily long integers; NumPy does not.
So it's more the way NumPy deals with numbers; pure Python would do just fine.
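To make the precision point concrete (a quick illustration, not part of the original answer): a 64-bit float has a 53-bit mantissa, so integers above 2**53 can no longer all be represented exactly, and F(79) is past that threshold.

print(2 ** 53)                               # 9007199254740992
print(float(2 ** 53) == float(2 ** 53 + 1))  # True: float64 cannot tell them apart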
Python will deal with integers perfectly fine here. Indeed, that is the beauty of python. numpy, on the other hand, introduces ugliness and just happens to be completely unnecessary, and will likely slow you down. Your implementation will also require much more space. Python allows you to write beautiful, readable code. Here is Raymond Hettinger's canonical implementation of iterative fibonacci in Python:
def fib(n):
    x, y = 0, 1
    for _ in range(n):
        x, y = y, x + y
    return x
That is O(n) time and constant space. It is beautiful, readable, and succinct. It will also give you the correct integer as long as you have memory to store the number on your machine. Learn to use numpy when it is the appropriate tool, and as importantly, learn to not use it when it is inappropriate.
Unless you want to generate a list of all the Fibonacci numbers up to F(n), there is no need to use a list, NumPy, or anything else like that; a simple loop and two variables are enough, since you only really need to know the two previous values:
def fib(n):
    Fk, Fk1 = 0, 1
    for _ in range(n):
        Fk, Fk1 = Fk1, Fk + Fk1
    return Fk
Of course, there are better ways to do it using the mathematical properties of the Fibonacci numbers. For instance, there is a matrix whose n-th power gives the right result:
import numpy

def fib_matrix(n):
    mat = numpy.matrix([[1, 1], [1, 0]], dtype=object) ** n
    return mat[0, 1]
I assume NumPy uses an optimized matrix exponentiation (by squaring) here, making this more efficient than the previous method.
Using the properties of the underlying Lucas sequence, it is possible to do the same without the matrix, just as efficiently as exponentiation by squaring and with the same number of variables as the first example, but it is a little harder to understand at first glance because it requires more mathematical background.
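The answer doesn't spell that approach out, but one common concrete form of it is the fast-doubling recurrence; the sketch below is one possible reading, not necessarily exactly what the answerer had in mind:

def fib_fast_doubling(n):
    # Fast doubling, derived from the matrix/Lucas identities:
    #   F(2k)   = F(k) * (2*F(k+1) - F(k))
    #   F(2k+1) = F(k)**2 + F(k+1)**2
    # O(log n) arithmetic operations, using only plain Python ints.
    def pair(k):  # returns (F(k), F(k+1))
        if k == 0:
            return (0, 1)
        a, b = pair(k // 2)
        c = a * (2 * b - a)
        d = a * a + b * b
        return (c, d) if k % 2 == 0 else (d, c + d)
    return pair(n)[0]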
The closed form, the one with the golden ratio, will give you the result even faster, but it runs the risk of being inaccurate because of floating-point arithmetic.
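A sketch of that closed form, with the precision caveat made explicit:

import math

def fib_binet(n):
    # Binet's formula, rounded to the nearest integer. Mathematically exact,
    # but with 64-bit floats the result starts drifting somewhere around n ~ 70.
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))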
As an additional note to the previous answer by hiro protagonist: if using NumPy is a requirement, you can solve your issue very easily by replacing:
F = np.empty(n+2)
with
F = np.empty(n+2, dtype=object)
but this does nothing more than hand the computation back to pure Python.
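Put together, a minimal sketch of the question's code with that one change; the dtype=object array holds ordinary Python ints, so no precision is lost:

import numpy as np

def fib_object(n):
    F = np.empty(n + 2, dtype=object)   # elements are Python ints, not float64
    F[0], F[1] = 0, 1
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return int(F[n])

print(fib_object(79))   # 14472334024676221, now exact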
Let's start with a simple function which always returns a random integer:
import numpy as np
def f(x):
    return np.random.randint(1000)
and a RDD filled with zeros and mapped using f:
rdd = sc.parallelize([0] * 10).map(f)
Since the above RDD is not persisted, I expect to get a different output every time I collect:
> rdd.collect()
[255, 512, 512, 512, 255, 512, 255, 512, 512, 255]
If we ignore the fact that the distribution of values doesn't really look random, this is more or less what happens. The problem starts when we take only the first element:
assert len(set(rdd.first() for _ in xrange(100))) == 1
or
assert len(set(tuple(rdd.take(1)) for _ in xrange(100))) == 1
It seems to return the same number each time. I've been able to reproduce this behavior on two different machines with Spark 1.2, 1.3 and 1.4. Here I am using np.random.randint but it behaves the same way with random.randint.
This issue, like the not-exactly-random results from collect, seems to be Python-specific; I couldn't reproduce it using Scala:
def f(x: Int) = scala.util.Random.nextInt(1000)
val rdd = sc.parallelize(List.fill(10)(0)).map(f)
(1 to 100).map(x => rdd.first).toSet.size
rdd.collect()
Did I miss something obvious here?
Edit:
It turns out the source of the problem is the Python RNG implementation. To quote the official documentation:
The functions supplied by this module are actually bound methods of a hidden instance of the random.Random class. You can instantiate your own instances of Random to get generators that don’t share state.
I assume NumPy works the same way, and rewriting f to use a RandomState instance as follows
import os
import binascii
def f(x, seed=None):
    seed = (
        seed if seed is not None
        else int(binascii.hexlify(os.urandom(4)), 16))
    rs = np.random.RandomState(seed)
    return rs.randint(1000)
makes it slower but solves the problem.
While the above explains the not-so-random results from collect, I still don't understand how it affects first / take(1) across multiple actions.
So the actual problem here is relatively simple. Each subprocess in Python inherits its state from its parent:
len(set(sc.parallelize(range(4), 4).map(lambda _: random.getstate()).collect()))
# 1
Since the parent state has no reason to change in this particular scenario and workers have a limited lifespan, the state of every child will be exactly the same on each run.
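A hedged sketch of one way around this in the RDD API: reseed inside each task instead of relying on inherited module-level state (the names below are illustrative, not from the original answer):

import os
import random

def reseeded(index, iterator):
    # A fresh generator per partition, seeded from OS entropy, so the
    # inherited parent state no longer determines the output. The partition
    # index could be mixed in instead if reproducibility is wanted.
    rng = random.Random(os.urandom(8))
    for _ in iterator:
        yield rng.randint(0, 999)

rdd = sc.parallelize([0] * 10, 4).mapPartitionsWithIndex(reseeded)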
This seems to be a bug (or feature) of randint. I see the same behavior, but as soon as I change f, the values do indeed change, so I'm not sure of the actual randomness of this method. I can't find any documentation, but it seems to use some deterministic math algorithm rather than more variable features of the running machine. Even if I go back and forth, the numbers seem to be the same upon returning to the original value.
For my use case, most of the solution was buried in an edit at the bottom of the question. However, there is another complication: I wanted to use the same function to generate multiple (different) random columns. It turns out that Spark has an assumption that the output of a UDF is deterministic, which means that it can skip later calls to the same function with the same inputs. For functions that return random output this is obviously not what you want.
To work around this, I generated a separate seed column for every random column that I wanted using the built-in PySpark rand function:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
import numpy as np
@F.udf(IntegerType())
def my_rand(seed):
    rs = np.random.RandomState(seed)
    return rs.randint(1000)
seed_expr = (F.rand()*F.lit(4294967295).astype('double')).astype('bigint')
my_df = (
    my_df
    .withColumn('seed_0', seed_expr)
    .withColumn('seed_1', seed_expr)
    .withColumn('myrand_0', my_rand(F.col('seed_0')))
    .withColumn('myrand_1', my_rand(F.col('seed_1')))
    .drop('seed_0', 'seed_1')
)
I'm using the DataFrame API rather than the RDD of the original problem because that's what I'm more familiar with, but the same concepts presumably apply.
NB: apparently it is possible to disable the assumption of determinism for Scala Spark UDFs since v2.3: https://jira.apache.org/jira/browse/SPARK-20586.
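If I recall correctly, PySpark exposes the same switch on UDF objects from 2.3 onwards via asNondeterministic(); a hedged sketch, worth verifying against your Spark version (the helper names are mine):

import os
import binascii
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

def _draw(_):
    # Fresh OS entropy per call sidesteps the inherited-worker-state issue
    # discussed earlier in this thread.
    rs = np.random.RandomState(int(binascii.hexlify(os.urandom(4)), 16))
    return int(rs.randint(1000))

# asNondeterministic() tells the optimizer not to collapse repeated calls
# with identical inputs, so no separate seed column should be needed.
my_rand_nondet = F.udf(_draw, IntegerType()).asNondeterministic()

my_df = my_df.withColumn('myrand_2', my_rand_nondet(F.lit(0)))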