How to make a mixed random variable in scipy.stats - python

I am trying to understand the random variables from scipy.stats. I can sample from a uniform random variable:
from scipy.stats import uniform
print(uniform.rvs(size=1000))
But how can I make a random variable that samples uniformly from 0..1 with probability 0.5 and uniformly from 5..6 with probability 0.5?
I could write a loop that picks a random number between 0 and 1; if it is < .5, pick a random number between 0 and 1, and if it is >= .5, pick a random number between 0 and 1 and add 5. But I would really like to be able to call it like:
mixed_uniform.rvs(size=1000)
I also need to use the survival function of this mixed distribution.

For the distribution, a custom function that does the transformation, applied with numpy.vectorize(), is more convenient than writing an explicit loop (though note that vectorize() is itself a loop under the hood, so don't expect a large speedup).
In [1]: from scipy.stats import uniform
In [2]: r = uniform.rvs(size=1000)
In [3]: r
Out[3]:
array([7.48816182e-02, 4.63880797e-01, 8.75315477e-01, 3.61116729e-01,
       ...
       3.13473322e-01, 3.45434625e-01, 9.49993090e-01, 1.55553018e-01])
In [4]: type(r)
Out[4]: numpy.ndarray
In [8]: def f(a):
   ...:     a *= 2
   ...:     if a > 1: a += 4
   ...:     return a
   ...:
In [10]: import numpy
In [11]: vf = numpy.vectorize(f)
In [12]: r2 = vf(r)
In [13]: r2
Out[13]:
array([1.49763236e-01, 9.27761594e-01, 5.75063095e+00, 7.22233457e-01,
       ...
       6.26946644e-01, 6.90869250e-01, 5.89998618e+00, 3.11106036e-01])
In [14]: max(r2)
Out[14]: 5.999360665646841
In [15]: min(r2)
Out[15]: 0.0004563758727054168
In [17]: len([x for x in r2 if x<=2])
Out[17]: 504
In [18]: len([x for x in r2 if x>=5])
Out[18]: 496
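The question also asks for a mixed_uniform.rvs(size=1000)-style interface and the survival function. A minimal sketch of how to get both is to subclass scipy.stats.rv_continuous; the class name MixedUniform and its _cdf are my own construction, not a scipy built-in:
import numpy as np
from scipy import stats

class MixedUniform(stats.rv_continuous):
    # 50/50 mixture of U(0, 1) and U(5, 6); scipy derives
    # sf = 1 - cdf and a (slow but generic) rvs from _cdf.
    def _cdf(self, x):
        return 0.5 * np.clip(x, 0, 1) + 0.5 * np.clip(x - 5, 0, 1)

mixed_uniform = MixedUniform(a=0, b=6, name='mixed_uniform')
samples = mixed_uniform.rvs(size=1000)
print(mixed_uniform.sf(0.5))  # survival function; should be ~0.75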

I generate a pool of 1000 uniform random numbers between 0 and 1 and randomly choose an element from it. If the element is greater than .5, it is mapped onto the 5..6 range, otherwise onto the 0..1 range (note that just adding 5, without rescaling, would only cover 5.5..6).
from scipy.stats import uniform
import matplotlib.pyplot as plt
import random

min_number = 0
max_number = 1
size = 1000
number_pool = uniform.rvs(min_number, max_number, size=size)
plt.hist(number_pool)
plt.show()

def getValue(number_pool):
    val = random.choice(number_pool)
    if val > .5:
        # map (.5, 1] onto (5, 6]; just adding 5 would only cover (5.5, 6]
        return (val - .5) * 2 + 5
    # map [0, .5] onto [0, 1]
    return val * 2

print(getValue(number_pool))
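As a side note, the same pick-and-shift idea can be fully vectorized, avoiding the per-element choice; this is my own sketch, not part of the answer above:
import numpy as np

u = np.random.rand(1000)            # uniform on [0, 1)
coin = np.random.rand(1000) < 0.5   # fair coin per sample
samples = np.where(coin, u, u + 5)  # heads stay in [0, 1), tails move to [5, 6)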

Related

Why does it work when columns are larger than rows in Python Sklearn (Linear Regression)

It's known that when the number of variables (p) is larger than the number of samples (n), the least squares estimator is not defined.
In sklearn I receive these values:
In [30]: lm = LinearRegression().fit(xx,y_train)
In [31]: lm.coef_
Out[31]:
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
          0.08619906, -0.08108713]])
In [32]: xx.shape
Out[32]: (1097, 3419)
The call in [30] should raise an error. How does sklearn work when p > n, as in this case?
EDIT:
It seems that the b matrix is padded with zeros:
if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m, :] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2
When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum-L2-norm solution, i.e.
argmin_w ||w||_2  subject to  Xw = y
This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.
w = np.linalg.pinv(X).dot(y)
The specific implementation of scipy.linalg.lstsq, which LinearRegression uses, calls get_lapack_funcs(('gelss',), ..., which is precisely a solver that finds the minimum-norm solution via singular value decomposition (provided by LAPACK).
Check out this example
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)
print(coef1)
print(coef2)
And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)
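To see concretely that the minimum-norm solution interpolates the training data when p > n, you can check Xw = y directly; a small sketch of my own along the lines of the example above:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(5, 10)  # n=5 samples, p=10 features: underdetermined
y = rng.randn(5)

coef = LinearRegression(fit_intercept=False).fit(X, y).coef_
print(np.allclose(X @ coef, y))  # True: the residual is numerically zero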

numpy random array values between -1 and 1

what is the best way to create a NumPy array of a given size with values randomly and uniformly spread between -1 and 1?
I tried 2*np.random.rand(size)-1
I'm not sure. Try:
s = np.random.uniform(-1, 1, size)
reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.uniform.html
Alternatively, numpy.arange gives evenly spaced values in the interval:
import numpy as np
print(np.arange(start=-1.0, stop=1.0, step=0.2, dtype=float))  # np.float is deprecated; use float
The step parameter defines the spacing of the elements; note that this produces a deterministic grid, not a uniform random sample.
In your solution, np.random.rand(size) returns random floats in the half-open interval [0.0, 1.0),
which means 2 * np.random.rand(size) - 1 returns numbers in the half-open interval [0, 2) - 1 = [-1, 1), i.e. a range that includes -1 but not 1.
If this is what you wish to do, then it is fine.
But, if you wish to generate numbers in the open interval (-1, 1), i.e. between -1 and 1 and hence not including either -1 or 1, may I suggest the following -
from numpy.random import default_rng
rg = default_rng(2)
size = (5,5)
rand_arr = rg.random(size)
rand_signs = rg.choice([-1,1], size)
rand_arr = rand_arr * rand_signs
print(rand_arr)
I have used the new Generator API recommended by numpy; see https://numpy.org/devdocs/reference/random/index.html#quick-start
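With the same Generator API, uniform() also accepts the bounds directly, which is simpler if the half-open interval [-1, 1) is acceptable (my own addition, not part of the answer above):
from numpy.random import default_rng

rng = default_rng(2)
rand_arr = rng.uniform(-1, 1, size=(5, 5))  # samples from the half-open [-1, 1)
print(rand_arr)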
You can also pass the bounds (and a size) directly to np.random.uniform:
a = np.random.uniform(-1, 1, size)  # without size, this returns a single float
print(a)

filter 3D signal using medfilt on each component separately gives different results than filtering the signal at once

I am using medfilt to filter a signal with 3 components (a, b, c):
import scipy as sp
import scipy.signal  # required; `import scipy as sp` alone does not expose sp.signal
import numpy as np
a = np.random.rand(180000)
b = np.random.rand(180000)
c = np.random.rand(180000)
If I filter the 3 components separately like this:
x = sp.signal.medfilt(a,3)
y = sp.signal.medfilt(b, 3)
z = sp.signal.medfilt(c, 3)
and then combine them into a numpy array
out1 = np.array([x,y,z]).T
I get a different result than when I filter the 3 components at the same time...
sigIn = np.array([a,b,c]).T
out2 = sp.signal.medfilt(sigIn,3)
Could you please explain why?
For the 2D stacked array, we need the 2D version of the median filter, with a list of kernel sizes, one per dimension. A scalar kernel size of 3 is applied along every axis, so it mixes values across the three columns; a kernel of [3, 1] filters each component independently.
Thus, we would have -
from scipy.signal import medfilt2d
out2 = medfilt2d(sigIn,[3,1])
Sample run -
In [48]: from scipy.signal import medfilt
In [49]: n = 4
    ...: a = np.random.rand(n)
    ...: b = np.random.rand(n)
    ...: c = np.random.rand(n)
    ...:
    ...: x = medfilt(a, 3)
    ...: y = medfilt(b, 3)
    ...: z = medfilt(c, 3)
    ...:
    ...: out1 = np.array([x, y, z]).T
    ...:
In [53]: from scipy.signal import medfilt2d
In [54]: sigIn = np.array([a,b,c]).T
In [55]: out2 = medfilt2d(sigIn,[3,1])
In [56]: np.allclose(out1, out2)
Out[56]: True
As a sidenote, we can use np.column_stack for the stacking operation and thus avoid that transpose, like so -
sigIn = np.column_stack((a,b,c))
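To make the difference concrete: a scalar kernel size passed to medfilt on a 2-D array is expanded to every axis, so medfilt(sigIn, 3) uses a 3x3 window that mixes the components across columns. A small sketch of my own, with made-up data, illustrating this:
import numpy as np
from scipy.signal import medfilt, medfilt2d

rng = np.random.default_rng(0)
sigIn = np.column_stack((rng.random(8), rng.random(8), rng.random(8)))

# A scalar kernel of 3 on a 2-D array means a 3x3 window...
print(np.allclose(medfilt(sigIn, 3), medfilt2d(sigIn, [3, 3])))  # True
# ...which differs from filtering each column independently.
print(np.allclose(medfilt(sigIn, 3), medfilt2d(sigIn, [3, 1])))  # False in general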

Interval containing specified percent of values

With numpy or scipy, is there any existing method that will return the endpoints of an interval which contains a specified percent of the values in a 1D array? I realize that this is simple to write myself, but it seems like the kind of thing that might be built in, although I can't find it.
E.g:
>>> import numpy as np
>>> x = np.random.randn(100000)
>>> print(np.bounding_interval(x, 0.68))
Would give approximately (-1, 1)
You can use np.percentile:
In [29]: x = np.random.randn(100000)
In [30]: p = 0.68
In [31]: lo = 50*(1 - p)
In [32]: hi = 50*(1 + p)
In [33]: np.percentile(x, [lo, hi])
Out[33]: array([-0.99206523, 1.0006089 ])
There is also scipy.stats.scoreatpercentile:
In [34]: from scipy.stats import scoreatpercentile
In [35]: scoreatpercentile(x, [lo, hi])
Out[35]: array([-0.99206523, 1.0006089 ])
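Equivalently, np.quantile takes fractions directly, which saves converting to percents (my own note, not part of the answer above):
import numpy as np

x = np.random.randn(100000)
p = 0.68
lo, hi = np.quantile(x, [(1 - p) / 2, (1 + p) / 2])
print(lo, hi)  # approximately -1 and 1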
I don't know of a built-in function to do it, but you can write one using the math package to compute approximate indices, like this:
from __future__ import division
import math
import numpy as np

def bound_interval(arr_in, interval):
    lhs = (1 - interval) / 2      # left-hand fraction to exclude
    rhs = 1 - lhs                 # and the right-hand one
    sorted_arr = np.sort(arr_in)  # renamed to avoid shadowing the builtin sorted()
    lower = sorted_arr[math.floor(lhs * len(arr_in))]  # floor to get an integer index
    upper = sorted_arr[math.floor(rhs * len(arr_in))]
    return (lower, upper)
On your specified array, I got the interval (-0.99072237819851039, 0.98691691784955549). Pretty close to (-1, 1)!

Python/Numpy/Scipy: Draw Poisson random values with different lambda

My problem is to extract, in the most efficient way, N Poisson random values (RV), each with a different mean/rate Lam. Basically, size(RV) == size(Lam).
Here is a naive (very slow) implementation:
import numpy as NP

def multi_rate_poisson(Lam):
    rv = NP.zeros(NP.size(Lam))
    for i, lam in enumerate(Lam):
        rv[i] = NP.random.poisson(lam=lam, size=1)
    return rv

That, on my laptop, with 1e6 samples gives:
Lam = NP.random.rand(int(1e6)) + 1  # newer numpy requires an integer size
timeit multi_rate_poisson(Lam)
1 loops, best of 3: 4.82 s per loop
Is it possible to improve on this?
Although the docstrings don't document this functionality, the source indicates it is possible to pass an array to the numpy.random.poisson function.
>>> import numpy
>>> # 1-D array of 1M random variates uniformly distributed between 1 and 2
>>> numpyarray = numpy.random.rand(int(1e6)) + 1
>>> # pass to poisson
>>> poissonarray = numpy.random.poisson(lam=numpyarray)
>>> poissonarray
array([4, 2, 3, ..., 1, 0, 0])
The Poisson random variable takes non-negative integer values, and its distribution approaches a bell curve as lambda grows well beyond one.
>>> import matplotlib.pyplot
>>> count, bins, ignored = matplotlib.pyplot.hist(
...     numpy.random.poisson(
...         lam=numpy.random.rand(int(1e6)) + 10),
...     14, density=True)  # density= replaces the deprecated normed=
>>> matplotlib.pyplot.show()
This method of passing the array to the poisson generator appears to be quite efficient.
>>> timeit.Timer("numpy.random.poisson(lam=numpy.random.rand(int(1e6)) + 1)",
...              'import numpy').repeat(3, 1)
[0.13525915145874023, 0.12136101722717285, 0.12127304077148438]
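The same vectorized call works with the newer Generator API, which current NumPy recommends (my own addition):
import numpy as np

rng = np.random.default_rng()
lam = rng.random(1_000_000) + 1  # one rate per sample, uniform in [1, 2)
samples = rng.poisson(lam)       # one Poisson draw per rate, no Python loop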
