Random numbers generation in PySpark

Random numbers generation in PySpark - python

Lets start with a simple function which always returns a random integer:
import numpy as np
def f(x):
return np.random.randint(1000)
and a RDD filled with zeros and mapped using f:
rdd = sc.parallelize([0] * 10).map(f)
Since above RDD is not persisted I expect I'll get a different output every time I collect:
> rdd.collect()
[255, 512, 512, 512, 255, 512, 255, 512, 512, 255]
If we ignore the fact that distribution of values doesn't really look random it is more or less what happens. Problem starts we we when take only a first element:
assert len(set(rdd.first() for _ in xrange(100))) == 1
or
assert len(set(tuple(rdd.take(1)) for _ in xrange(100))) == 1
It seems to return the same number each time. I've been able to reproduce this behavior on two different machines with Spark 1.2, 1.3 and 1.4. Here I am using np.random.randint but it behaves the same way with random.randint.
This issue, same as non-exactly-random results with collect, seems to be Python specific and I couldn't reproduce it using Scala:
def f(x: Int) = scala.util.Random.nextInt(1000)
val rdd = sc.parallelize(List.fill(10)(0)).map(f)
(1 to 100).map(x => rdd.first).toSet.size
rdd.collect()
Did I miss something obvious here?
Edit:
Turns out the source of the problem is Python RNG implementation. To quote official documentation:
The functions supplied by this module are actually bound methods of a hidden instance of the random.Random class. You can instantiate your own instances of Random to get generators that don’t share state.
I assume NumPy works the same way and rewriting f using RandomState instance as follows
import os
import binascii
def f(x, seed=None):
seed = (
seed if seed is not None
else int(binascii.hexlify(os.urandom(4)), 16))
rs = np.random.RandomState(seed)
return rs.randint(1000)
makes it slower but solves the problem.
While above explains not random results from collect I still don't understand how it affects first / take(1) between multiple actions.

So the actual problem here is relatively simple. Each subprocess in Python inherits its state from its parent:
len(set(sc.parallelize(range(4), 4).map(lambda _: random.getstate()).collect()))
# 1
Since parent state has no reason to change in this particular scenario and workers have a limited lifespan, state of every child will be exactly the same on each run.

This seems to be a bug (or feature) of randint. I see the same behavior, but as soon as I change the f, the values do indeed change. So, I'm not sure of the actual randomness of this method....I can't find any documentation, but it seems to be using some deterministic math algorithm instead of using more variable features of the running machine. Even if I go back and forth, the numbers seem to be the same upon returning to the original value...

For my use case, most of the solution was buried in an edit at the bottom of the question. However, there is another complication: I wanted to use the same function to generate multiple (different) random columns. It turns out that Spark has an assumption that the output of a UDF is deterministic, which means that it can skip later calls to the same function with the same inputs. For functions that return random output this is obviously not what you want.
To work around this, I generated a separate seed column for every random column that I wanted using the built-in PySpark rand function:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
import numpy as np
#F.udf(IntegerType())
def my_rand(seed):
rs = np.random.RandomState(seed)
return rs.randint(1000)
seed_expr = (F.rand()*F.lit(4294967295).astype('double')).astype('bigint')
my_df = (
my_df
.withColumn('seed_0', seed_expr)
.withColumn('seed_1', seed_expr)
.withColumn('myrand_0', my_rand(F.col('seed_0')))
.withColumn('myrand_1', my_rand(F.col('seed_1')))
.drop('seed_0', 'seed_1')
)
I'm using the DataFrame API rather than the RDD of the original problem because that's what I'm more familiar with, but the same concepts presumably apply.
NB: apparently it is possible to disable the assumption of determinism for Scala Spark UDFs since v2.3: https://jira.apache.org/jira/browse/SPARK-20586.

Related

Is it possible to change the seed of a random generator in NumPy?

Say I instantiated a random generator with
import numpy as np
rng = np.random.default_rng(seed=42)
and I want to change its seed. Is it possible to update the seed of the generator instead of instantiating a new generator with the new seed?
I managed to find that you can see the state of the generator with rng.__getstate__(), for example in this case it is
{'bit_generator': 'PCG64',
'state': {'state': 274674114334540486603088602300644985544,
'inc': 332724090758049132448979897138935081983},
'has_uint32': 0,
'uinteger': 0}
and you can change it with rng.__setstate__ with arguments as printed above, but it is not clear to me how to set those arguments so that to have the initial state of the rng given a different seed. Is it possible to do so or instantiating a new generator is the only way?

A Numpy call like default_rng() gives you a Generator with an implicitly created BitGenerator. The difference between these is that a BitGenerator is the low-level method that just knows how to generate uniform uint32s, uint64s, and doubles. The Generator can then take these and turn them into other distributions.
As you noticed you can use __getstate__ but this is designed for the pickle module and not really for what you're using it for.
You're better off accessing the bit_generator directly. Which means you don't need to use any dunder methods.
The following code still uses default_rng but this means the BitGenerator could change in the future so I need a call to type to reseed. You'd probably be better off following the second example which uses an explicit BitGenerator.
import numpy as np
seed = 42
rng = np.random.default_rng()
# get the BitGenerator used by default_rng
BitGen = type(rng.bit_generator)
# use the state from a fresh bit generator
rng.bit_generator.state = BitGen(seed).state
# generate a random float
print(rng.random())
outputs 0.7739560485559633. If you're happy fixing the BitGenerator you can avoid the call to type, e.g.:
rng = np.random.Generator(np.random.PCG64())
rng.bit_generator.state = np.random.PCG64(seed).state
rng.random()
which outputs the same value.

N.B. The other answer (https://stackoverflow.com/a/74474377/2954547) is better. Use that one, not this one.
This is maybe a silly hack, but one solution is to create a new RNG instance using the desired new seed, then replace the state of the existing RNG instance with the state of the new instance:
import numpy as np
seed = 12345
rng = np.random.default_rng(seed)
x1 = rng.normal(size=10)
rng.__setstate__(np.random.default_rng(seed).__getstate__())
x2 = rng.normal(size=10)
np.testing.assert_array_equal(x1, x2)
However this isn't much different from just replacing the RNG instance.
Edit: To answer the question more directly, I don't think it's possible to replace the seed without constructing a new Generator or BitGenerator, unless you know how to correctly construct the state data for the particular BitGenerator inside your Generator. Creating a new RNG is fairly cheap, and while I understand the conceptual appeal of not instantiating a new one, the only alternative here is to post a feature request on the Numpy issue tracker or mailing list.

Using Python Ray With CPLEX Model Object

I am trying to parallelize an interaction with a Python object that is computationally expensive. I would like to use Ray to do this but so far my best efforts have failed.
The object is a CPLEX model object and I'm trying to add a set of constraints for a list of conditions.
Here's my setup:
import numpy as np
import docplex.mp.model as cpx
import ray
m = cpx.Model(name="mymodel")
def mask_array(arr, mask_val):
array_mask = np.argwhere(arr == mask_val)
arg_slice = [i[0] for i in array_mask]
return arg_slice
weeks = [1,3,7,8,9]
const = 1.5
r = rate = np.array(df['r'].tolist(), dtype=np.float)
x1 = m.integer_var_list(data_indices, lb=lower_bound, ub=upper_bound)
x2 = m.dot(x1, r)
#ray.remote
def add_model_constraint(m, x2, x2sum, const):
m.add_constraint(x2sum <= x2*const)
return m
x2sums = []
for w in weeks:
arg_slice = mask_array(x2, w)
x2sum = m.dot([x2[i] for i in arg_slice], r[arg_slice])
x2sums.append(x2sum)
#: this is the expensive part
for x2sum in x2sums:
add_model_constraint.remote(m, x2, x2sum, const)
In a nutshell, what I'm doing is creating a model object, some variables, and then looping over a set of weeks in order to build a constraint. I subset my variable, compute some dot products and apply the constraint. I would like to be able to create the constraint in parallel because it takes a while but so far my code just hangs and I'm not sure why.
I don't know if I should return the model object in my function because by default the m.add_constraint method modifies the object in place. But at the same time I know Ray returns references to the remote value so yea, not sure what's supposed to happen there.
Is this at all a valid use of ray? It it reasonable to expect to be able to modify a CPLEX object in this way (or any other arbitrary python object)?
I am new to Ray so I may be structuring this all wrong, or maybe this will never work for X, Y, and Z reason which would also be good to know.

The Model object is not designed to be used in parallel. You cannot add constraints from multiple threads at the same time. This will result in undefined behavior. You will have to at least a lock to make sure only thread at a time adds constraints.
Note that parallel model building may not be a good idea at all: the order of constraints will be more or less random. On the other hand, behavior of the solver may depend on the order of constraints (this is called performance variability). So you may have a hard time reproducing certain results/behavior.

I understand the primary issue was the performance of module building.
From the code you sent, I have two suggestions to address this:
post constraints in batches, that is store constraints in a list and add them once using Model.add_constraints(), this should be more efficient than adding them one at a time.
experiment with Model.dotf() (functional-style scalar product). It avoids building auxiliary lists, passing instead a function of the key , returning the coefficient.
This method is new in Docplex version 2.12.
For example, assuming a list of 3 variables:
abc = m.integer_var_list(3, name=["a", "b", "c"]) m.dotf(abc, lambda
k: k+2)
docplex.mp.LinearExpression(a+2b+3c)
Model.dotf() is usually faster than Model.dot()

Timeit showing that regular python is faster than numpy?

I'm working on a piece of code for a game that calculates the distances between all the objects on the screen using their in-game coordinate positions. Originally I was going to use basic Python and lists to do this, but since the number of distances that will need calculated will increase exponentially with the number of objects, I thought it might be faster to do it with numpy.
I'm not very familiar with numpy, and I've been experimenting on basic bits of code with it. I wrote a little bit of code to time how long it takes for the same function to complete a calculation in numpy and in regular Python, and numpy seems to consistently take a good bit more time than the regular python.
The function is very simple. It starts with 1.1 and then increments 200,000 times, adding 0.1 to the last value and then finding the square root of the new value. It's not what I'll actually be doing in the game code, which will involve finding total distance vectors from position coordinates; it's just a quick test I threw together. I already read here that the initialization of arrays takes more time in NumPy, so I moved the initializations of both the numpy and python arrays outside their functions, but Python is still faster than numpy.
Here is the bit of code:
#!/usr/bin/python3
import numpy
from timeit import timeit
#from time import process_time as timer
import math
thing = numpy.array([1.1,0.0], dtype='float')
thing2 = [1.1,0.0]
def NPFunc():
for x in range(1,200000):
thing[0] += 0.1
thing[1] = numpy.sqrt(thing[0])
print(thing)
return None
def PyFunc():
for x in range(1,200000):
thing2[0] += 0.1
thing2[1] = math.sqrt(thing2[0])
print(thing2)
return None
print(timeit(NPFunc, number=1))
print(timeit(PyFunc, number=1))
It gives this result, which indicates normal Python is 3x faster:
[ 20000.99999999 141.42489173]
0.2917748889885843
[20000.99999998944, 141.42489172698504]
0.10341173503547907
Am I doing something wrong, is is this calculation just so simple that it isn't a good test for numpy?

Am I doing something wrong, is is this calculation just so simple that it isn't a good test for NumPy?
It's not really that the calculation is simple, but that you're not taking any advantage of NumPy.
The main benefit of NumPy is vectorization: you can apply an operation to every element of an array in one go, and whatever looping is needed happens inside some tightly-optimized C (or Fortran or C++ or whatever) loop inside NumPy, rather than in a slow generic Python iteration.
But you're only accessing a single value, so there's no looping to be done in C.
On top of that, because the values in an array are stored as "native" values, NumPy functions don't need to unbox them, pulling the raw C double out of a Python float, and then re-box them in a new Python float, the way any Python math functions have to.
But you're not doing that either. In fact, you're doubling that work: You're pulling the value out o the array as a float (boxing it), then passing it to a function (which has to unbox it, and then rebox it to return a result), then storing it back in an array (unboxing it again).
And meanwhile, because np.sqrt is designed to work on arrays, it has to first check the type of what you're passing it and decide whether it needs to loop over an array or unbox and rebox a single value or whatever, while math.sqrt just takes a single value. When you call np.sqrt on an array of 200000 elements, the added cost of that type switch is negligible, but when you're doing it every time through the inner loop, that's a different story.
So, it's not an unfair test.
You've demonstrated that using NumPy to pull out values one at a time, act on them one at a time, and store them back in the array one at a time is slower than just not using NumPy.
But, if you compare it to actually taking advantage of NumPy—e.g., by creating an array of 200000 floats and then calling np.sqrt on that array vs. looping over it and calling math.sqrt on each one—you'll demonstrate that using NumPy the way it was intended is faster than not using it.

you're comparing it wrong
a_list = np.arange(0,20000,0.1)
timeit(lambda:np.sqrt(a_list),number=1)

Matching randomization seed between Python and R [duplicate]

Python, NumPy and R all use the same algorithm (Mersenne Twister) for generating random number sequences. Thus, theoretically speaking, setting the same seed should result in same random number sequences in all 3. This is not the case. I think the 3 implementations use different parameters causing this behavior.
R
>set.seed(1)
>runif(5)
[1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Python
In [3]: random.seed(1)
In [4]: [random.random() for x in range(5)]
Out[4]:
[0.13436424411240122,
0.8474337369372327,
0.763774618976614,
0.2550690257394217,
0.49543508709194095]
NumPy
In [23]: import numpy as np
In [24]: np.random.seed(1)
In [25]: np.random.rand(5)
Out[25]:
array([ 4.17022005e-01, 7.20324493e-01, 1.14374817e-04,
3.02332573e-01, 1.46755891e-01])
Is there some way, where NumPy and Python implementation could produce the same random number sequence? Ofcourse as some comments and answers point out, one could use rpy. What I am specifically looking for is to fine tune the parameters in the respective calls in Python and NumPy to get the sequence.
Context: The concern comes from an EDX course offering in which R is used. In one of the forums, it was asked if Python could be used and the staff replied that some assignments would require setting specific seeds and submitting answers.
Related:
Comparing Matlab and Numpy code that uses random number generation From this it seems that the underlying NumPy and Matlab implementation are similar.
python vs octave random generator: This question does come fairly close to the intended answer. Some sort of wrapper around the default state generator is required.

use rpy2 to call r in python, here is a demo, the numpy array data is sharing memory with x in R:
import rpy2.robjects as robjects
data = robjects.r("""
set.seed(1)
x <- runif(5)
""")
print np.array(data)
data[1] = 1.0
print robjects.r["x"]

I realize this is an old question, but I've stumbled upon the same problem recently, and created a solution which can be useful to others.
I've written a random number generator in C, and linked it to both R and Python. This way, the random numbers are guaranteed to be the same in both languages since they are generated using the same C code.
The program is called SyncRNG and can be found here: https://github.com/GjjvdBurg/SyncRNG.

how to implement 'daisychaining' of pluggable function in python?

here is the problem:
1) suppose that I have some measure data (like 1Msample read from my electronics) and I need to process them by a processing chain.
2) this processing chain consists of different operations, which can be swapped/omitted/have different parameters. A typical example would be to take this data, first pass them via a lookup table, then do exponential fit, then multiply by some calibration factors
3) now, as I do not know what algorithm the the best, I'd like to evaluate at each stage best possible implementation (as an example, the LUTs can be produced by 5 ways and I want to see which one is the best)
4) i'd like to daisychain those functions such, that I would construct a 'class' containing top-level algorithm and owning (i.e. pointing) to child class, containing lower-level algorithm.
I was thinking to use double-linked-list and generate sequence like:
myCaptureClass.addDataTreatment(pmCalibrationFactor(opt, pmExponentialFit (opt, pmLUT (opt))))
where myCaptureClass is the class responsible for datataking and it should as well (after the data being taken) trigger the top-level data processing module (pm). This processing would first go deep into the bottom-child (lut), treat data there, then middle (expofit), then top (califactors) and return the data to the capture, which would return the data to the requestor.
Now this has several issues:
1) everywhere on the net is said that in python one should not use double-linked-lists
2) this seems to me highly inefficient because the data vectors are huge, hence i would prefer solution using generator function, but i'm not sure how to provide the 'plugin-like' mechanism.
could someone give me a hint how to solve this using 'plugin-style' and generator so I do not need to process vector of X megabytes of data and process them 'on-request' as is correct when using generator function?
thanks a lot
david
An addendum to the problem:
it seems that I did not express myself exactly. Hence: the data are generated by an external HW card plugged into VME crate. They are 'fetched' in a single block transfer to the python tuple, which is stored in myCaptureClass.
The set of operation to be applied is in fact on a stream data, represented by this tuple. Even exponential fit is stream operation (it is a set of variable state filters applied on each sample).
The parameter 'opt' i've mistakenly shown was to express, that each of those data processing classes has some configuration data which come with, and modify behaviour of the method used to operate on data.
The goal is to introduce into myCaptureClass a daisychained class (rather than function), which - when user asks for data - us used to process 'raw' data into final form.
In order to 'save' memory resources i thought it might be a good idea to use generator function to provide the data.
from this perspective it seems that the closest match to what i want to do is shown in code of bukzor. I'd prefer to have a class implementation instead of function, but i guess this is just a cosmetic stuff of implementing call operator in particular class, which realizes the data operation....

This is how I imagine you would do this. I expect this is incomplete, since I don't fully understand your problem statement. Please let me know what I've done wrong :)
class ProcessingPipeline(object):
def __init__(self, *functions, **kwargs):
self.functions = functions
self.data = kwargs.get('data')
def __call__(self, data):
return ProcessingPipeline(*self.functions, data=data)
def __iter__(self):
data = self.data
for func in self.functions:
data = func(data)
return data
# a few (very simple) operators, of different kinds
class Multiplier(object):
def __init__(self, by):
self.by = by
def __call__(self, data):
for x in data:
yield x * self.by
def add(data, y):
for x in data:
yield x + y
from functools import partial
by2 = Multiplier(by=2)
sub1 = partial(add, y=-1)
square = lambda data: ( x*x for x in data )
pp = ProcessingPipeline(square, sub1, by2)
print list(pp(range(10)))
print list(pp(range(-3, 4)))
Output:
$ python how-to-implement-daisychaining-of-pluggable-function-in-python.py
[-2, 0, 6, 16, 30, 48, 70, 96, 126, 160]
[16, 6, 0, -2, 0, 6, 16]

Get the functional module from pypi. It has a compose function to compose two callables. With that, you can chain functions together.
Both that module, and functool provide a partial function, for partial-application.
You can use the composed functions in a generator expression just like any other.

Not knowing exactly what you want, I feel like I should point out that you can put whatever you want inside a list comprehension:
l = [myCaptureClass.addDataTreatment(
pmCalibrationFactor(opt, pmExponentialFit (opt, pmLUT (opt))))
for opt in data]
will create a new list of data that has been passed through the composed functions.
Or you could create a generator expression for looping over, this won't construct a whole new list, it will just create an iterator. I don't think that there's any advantage to doing things this way as opposed to just processing the data in the body of the loop, but it's kind of interesting to look at:
d = (myCaptureClass.addDataTreatment(
pmCalibrationFactor(opt, pmExponentialFit (opt, pmLUT (opt))))
for opt in data)
for thing in d:
# do something
pass
Or is opt the data?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.