Scipy Multivariate Normal: How to draw deterministic samples?

I am using Scipy.stats.multivariate_normal to draw samples from a multivariate normal distribution. Like this:
from scipy.stats import multivariate_normal
# Assume we have means and covs
mn = multivariate_normal(mean = means, cov = covs)
# Generate some samples
samples = mn.rvs()
The samples are different on every run. How do I always get the same sample?
I was expecting something like:
mn = multivariate_normal(mean = means, cov = covs, seed = aNumber)
or
samples = mn.rvs(seed = aNumber)

There are two ways:
The rvs() method accepts a random_state argument. Its value can
be an integer seed, or an instance of numpy.random.Generator or numpy.random.RandomState. In
this example, I use an integer seed:
In [46]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [47]: mn.rvs(size=5, random_state=12345)
Out[47]:
array([[-0.51943872, 1.07094986, -1.0235383 ],
[ 1.39340583, 4.39561899, -2.77865152],
[ 0.76902257, 0.63000355, 0.46453938],
[-1.29622111, 2.25214387, 6.23217368],
[ 1.35291684, 0.51186476, 1.37495817]])
In [48]: mn.rvs(size=5, random_state=12345)
Out[48]:
array([[-0.51943872, 1.07094986, -1.0235383 ],
[ 1.39340583, 4.39561899, -2.77865152],
[ 0.76902257, 0.63000355, 0.46453938],
[-1.29622111, 2.25214387, 6.23217368],
[ 1.35291684, 0.51186476, 1.37495817]])
This version uses an instance of numpy.random.Generator:
In [34]: rng = np.random.default_rng(438753948759384)
In [35]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [36]: mn.rvs(size=5, random_state=rng)
Out[36]:
array([[ 0.30626179, 0.60742839, 2.86919105],
[ 1.61859885, 2.63409111, 1.19018398],
[ 0.35469027, 0.85685011, 6.76892829],
[-0.88659459, -0.59922575, -5.43926698],
[ 0.94777687, -5.80057427, -2.16887719]])
You can set the seed for numpy's global random number generator. This is the generator that multivariate_normal.rvs() uses if random_state is not given:
In [54]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [55]: np.random.seed(123)
In [56]: mn.rvs(size=5)
Out[56]:
array([[ 0.2829785 , 2.23013222, -5.42815302],
[ 1.65143654, -1.2937895 , -7.53147357],
[ 1.26593626, -0.95907779, -12.13339622],
[ -0.09470897, -1.51803558, -4.33370201],
[ -0.44398196, -1.4286283 , 7.45694813]])
In [57]: np.random.seed(123)
In [58]: mn.rvs(size=5)
Out[58]:
array([[ 0.2829785 , 2.23013222, -5.42815302],
[ 1.65143654, -1.2937895 , -7.53147357],
[ 1.26593626, -0.95907779, -12.13339622],
[ -0.09470897, -1.51803558, -4.33370201],
[ -0.44398196, -1.4286283 , 7.45694813]])
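A minimal sketch of how the options above behave, using made-up means and variances. It illustrates one point that is easy to miss: a fresh Generator built from the same seed reproduces the draw, while reusing one Generator object advances its state, so the second call returns different samples. The last two lines assume the frozen distribution also accepts a seed keyword, which is what the question was hoping for; check this against your SciPy version.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (not from the question): diagonal covariance.
mean = [0.0, 0.0, 0.0]
cov = [1.0, 5.0, 25.0]


def draw(seed, size=5):
    """Draw samples using a brand-new Generator built from `seed`."""
    rng = np.random.default_rng(seed)
    return multivariate_normal(mean=mean, cov=cov).rvs(size=size, random_state=rng)


# Same seed, fresh Generator each time: identical samples.
a = draw(2024)
b = draw(2024)
same_across_fresh_generators = np.array_equal(a, b)

# One Generator reused: each call consumes random numbers, so draws differ.
rng = np.random.default_rng(2024)
mn = multivariate_normal(mean=mean, cov=cov)
first = mn.rvs(size=5, random_state=rng)
second = mn.rvs(size=5, random_state=rng)
differs_when_generator_reused = not np.array_equal(first, second)

# Assumption: the frozen distribution takes a `seed` keyword at construction.
seeded_1 = multivariate_normal(mean=mean, cov=cov, seed=2024).rvs(size=5)
seeded_2 = multivariate_normal(mean=mean, cov=cov, seed=2024).rvs(size=5)
```

Passing a Generator is usually preferable to np.random.seed(), because it does not touch global state that other code may depend on.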

Related

How to speed up looping arrays as inputs for pandas calculation?

I have two arrays named x and y. The goal is to iterate over them as inputs for a pandas calculation.
Here's an example.
Iterating over each x and y and appending the result to the res list is slow.
The calculation takes the exponential of the column scaled by a, sums the result, and multiplies it by b. It is only a placeholder and could be replaced by any other calculation.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,5,size=(5, 1)),columns=['data'])
x = np.linspace(1, 24, 4)
y = np.linspace(10, 1500, 5)
res = []
for a in x:
    for b in y:
        res.append(np.exp(-df/a).sum().values[0]*b)
res = np.array(res).reshape(4, 5)
expected output:
array([[ 11.67676844, 446.63639283, 881.59601721, 1316.5556416 ,
1751.51526599],
[ 37.52524129, 1435.34047927, 2833.15571725, 4230.97095523,
5628.78619321],
[ 42.79406912, 1636.87314392, 3230.95221871, 4825.0312935 ,
6419.1103683 ],
[ 44.93972433, 1718.94445549, 3392.94918665, 5066.95391781,
6740.95864897]])

You can use numpy broadcasting:
res = np.array(res).reshape(4, 5)
print (res)
[[ 11.67676844 446.63639283 881.59601721 1316.5556416 1751.51526599]
[ 37.52524129 1435.34047927 2833.15571725 4230.97095523 5628.78619321]
[ 42.79406912 1636.87314392 3230.95221871 4825.0312935 6419.1103683 ]
[ 44.93972433 1718.94445549 3392.94918665 5066.95391781 6740.95864897]]
res = np.exp(-df.to_numpy()/x).sum(axis=0)[:, None] * y
print (res)
[[ 11.67676844 446.63639283 881.59601721 1316.5556416 1751.51526599]
[ 37.52524129 1435.34047927 2833.15571725 4230.97095523 5628.78619321]
[ 42.79406912 1636.87314392 3230.95221871 4825.0312935 6419.1103683 ]
[ 44.93972433 1718.94445549 3392.94918665 5066.95391781 6740.95864897]]

I think what you want is:
z = -df['data'].to_numpy()
res = np.exp(z/x[:, None]).sum(axis=1)[:, None]*y
output:
array([[ 11.67676844, 446.63639283, 881.59601721, 1316.5556416 ,
1751.51526599],
[ 37.52524129, 1435.34047927, 2833.15571725, 4230.97095523,
5628.78619321],
[ 42.79406912, 1636.87314392, 3230.95221871, 4825.0312935 ,
6419.1103683 ],
[ 44.93972433, 1718.94445549, 3392.94918665, 5066.95391781,
6740.95864897]])
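As a sanity check on the broadcasting answers above, here is a small sketch that rebuilds the question's loop and compares it with a vectorized version. The np.outer formulation and the names loop_res, per_x and vec_res are my own choices for illustration, not from the original posts.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 1)), columns=["data"])
x = np.linspace(1, 24, 4)
y = np.linspace(10, 1500, 5)

# Reference result: the nested loop from the question.
loop_res = np.array(
    [[np.exp(-df / a).sum().values[0] * b for b in y] for a in x]
)

# Vectorized: (5, 1) data divided by (4,) x broadcasts to (5, 4);
# summing over axis 0 leaves one value per element of x.
data = df["data"].to_numpy()
per_x = np.exp(-data[:, None] / x).sum(axis=0)

# The outer product with y gives the (4, 5) grid of results.
vec_res = np.outer(per_x, y)
```

The sum over the data only depends on x, so it is computed once per x value and then scaled by every y value.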

Numpyic way to sort a matrix based on another similar matrix

Say I have a matrix Y of random float numbers from 0 to 10 with shape (10, 3):
import numpy as np
np.random.seed(99)
Y = np.random.uniform(0, 10, (10, 3))
print(Y)
Output:
[[6.72278559 4.88078399 8.25495174]
[0.31446388 8.08049963 5.6561742 ]
[2.97622499 0.46695721 9.90627399]
[0.06825733 7.69793028 7.46767101]
[3.77438936 4.94147452 9.28948392]
[3.95454044 9.73956297 5.24414715]
[0.93613093 8.13308413 2.11686786]
[5.54345785 2.92269116 8.1614236 ]
[8.28042566 2.21577372 6.44834702]
[0.95181622 4.11663239 0.96865261]]
I am now given a matrix X with the same shape, which can be seen as obtained by adding small noise to Y and then shuffling the rows:
X = np.random.normal(Y, scale=0.1)
np.random.shuffle(X)
print(X)
Output:
[[ 4.04067271 9.90959141 5.19126867]
[ 5.59873104 2.84109306 8.11175891]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 0.9400948 4.15448712 1.04187208]
[ 2.91884302 0.47222752 10.12700505]
[ 0.30995155 8.09263241 5.74876947]
[ 1.11247872 8.02092335 1.99767444]
[ 6.68543696 4.8345869 8.17330513]
[ 8.38904822 2.11830619 6.42013343]]
Now I want to sort the rows of X so that they line up with the rows of Y. I already know that, within each matching pair of rows, the values in each column differ by no more than a tolerance of 0.5. I managed to write the following code and it is working fine.
def sort_X_by_Y(X, Y, tol):
    idxs = [next(i for i in range(len(X)) if all(abs(X[i] - row) <= tol)) for row in Y]
    return X[idxs]
print(sort_X_by_Y(X, Y, tol=0.5))
Output:
[[ 6.68543696 4.8345869 8.17330513]
[ 0.30995155 8.09263241 5.74876947]
[ 2.91884302 0.47222752 10.12700505]
[ 0.10743952 7.74620162 7.51100441]
[ 3.60396019 4.91708372 9.07551354]
[ 4.04067271 9.90959141 5.19126867]
[ 1.11247872 8.02092335 1.99767444]
[ 5.59873104 2.84109306 8.11175891]
[ 8.38904822 2.11830619 6.42013343]
[ 0.9400948 4.15448712 1.04187208]]
However, in reality I am sorting (1000, 3) matrices and my code is way too slow. I feel like there should be a more numpyic way to code this. Any suggestions?

This is a vectorized version of your algorithm. It runs ~26.5x faster than your implementation for 1000 samples, but it creates an additional boolean array with shape (1000,1000,3). If several rows have similar values within the tolerance, the wrong row may be selected.
tol = .5
X[(np.abs(Y[:, np.newaxis] - X) <= tol).all(2).argmax(1)]
Output
array([[ 6.68543696, 4.8345869 , 8.17330513],
[ 0.30995155, 8.09263241, 5.74876947],
[ 2.91884302, 0.47222752, 10.12700505],
[ 0.10743952, 7.74620162, 7.51100441],
[ 3.60396019, 4.91708372, 9.07551354],
[ 4.04067271, 9.90959141, 5.19126867],
[ 1.11247872, 8.02092335, 1.99767444],
[ 5.59873104, 2.84109306, 8.11175891],
[ 8.38904822, 2.11830619, 6.42013343],
[ 0.9400948 , 4.15448712, 1.04187208]])
More robust solutions with L1-norm
X[np.abs(Y[:, np.newaxis] - X).sum(2).argmin(1)]
Or L2-norm
X[((Y[:, np.newaxis] - X)**2).sum(2).argmin(1)]
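The argmin variants above pick the nearest row of X for each row of Y independently, so two rows of Y could in principle pick the same row of X. As a different technique from the answer (not the author's method), the sketch below treats the problem as a one-to-one assignment using scipy.spatial.distance.cdist and scipy.optimize.linear_sum_assignment. The data is regenerated here with an arbitrary seed, so it does not reproduce the numbers printed in the question.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(99)
n = 1000
Y = rng.uniform(0, 10, size=(n, 3))
X = rng.normal(Y, scale=0.1)  # noisy copy of Y
rng.shuffle(X)                # shuffle the rows in place

# cost[i, j] is the Euclidean distance between Y[i] and X[j].
cost = cdist(Y, X)

# Choose one distinct row of X per row of Y, minimizing the total distance.
row_ind, col_ind = linear_sum_assignment(cost)

# For a square cost matrix row_ind is 0..n-1, so col_ind orders X like Y.
X_sorted = X[col_ind]
```

This still builds an (n, n) distance matrix, so it is intended for sizes like the 1000 rows mentioned in the question rather than very large inputs.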

Using array indexing to apply 2D array function on 3D array

I wrote a function that takes in one set of randomized cartesian coordinates and returns the subset that remains within some spatial domain. To illustrate:
grid = np.ones((5,5))
grid = np.pad(grid, ((10,10), (10,10)), 'constant')
>> np.shape(grid)
(25, 25)
random_pts = np.random.random(size=(100, 2)) * len(grid)
def inside(input):
    # np.int was removed in NumPy 1.24; the builtin int works here
    idx = np.floor(input).astype(int)
    mask = grid[idx[:,0], idx[:,1]] == 1
    return input[mask]
>> inside(random_pts)
array([[ 10.59441506, 11.37998288],
[ 10.39124766, 13.27615815],
[ 12.28225713, 10.6970708 ],
[ 13.78351949, 12.9933591 ]])
But now I want the ability to simultaneously generate n sets of random_pts and keep n corresponding subsets that satisfy the same functional condition. So, if n=3,
random_pts = np.random.random(size=(3, 100, 2)) * len(grid)
Without resorting to for loop, how could I index my variables such that inside(random_pts) returns something like
array([[[ 17.73323523, 9.81956681],
[ 10.97074592, 2.19671642],
[ 21.12081044, 12.80412997]],
[[ 11.41995519, 2.60974757]],
[[ 9.89827156, 9.74580059],
[ 17.35840479, 7.76972241]]])

One approach -
def inside3d(input):
    # Get idx in 3D (np.int was removed in NumPy 1.24; use the builtin int)
    idx3d = np.floor(input).astype(int)
    # Create a similar mask as with the 2D case, but in 3D now
    mask3d = grid[idx3d[:,:,0], idx3d[:,:,1]]==1
    # Count of mask matches for each index in 0th dim
    counts = np.sum(mask3d,axis=1)
    # Index into input to get masked matches across all elements in 0th dim
    out_cat_array = input.reshape(-1,2)[mask3d.ravel()]
    # Split the rows based on the counts, as the final output
    return np.split(out_cat_array,counts.cumsum()[:-1])
Verify results -
Create 3D random input:
In [91]: random_pts3d = np.random.random(size=(3, 100, 2)) * len(grid)
With inside3d:
In [92]: inside3d(random_pts3d)
Out[92]:
[array([[ 10.71196268, 12.9875877 ],
[ 10.29700184, 10.00506662],
[ 13.80111411, 14.80514828],
[ 12.55070282, 14.63155383]]), array([[ 10.42636137, 12.45736944],
[ 11.26682474, 13.01632751],
[ 13.23550598, 10.99431284],
[ 14.86871413, 14.19079225],
[ 10.61103434, 14.95970597]]), array([[ 13.67395756, 10.17229061],
[ 10.01518846, 14.95480515],
[ 12.18167251, 12.62880968],
[ 11.27861513, 14.45609646],
[ 10.895685 , 13.35214678],
[ 13.42690335, 13.67224414]])]
With inside:
In [93]: inside(random_pts3d[0])
Out[93]:
array([[ 10.71196268, 12.9875877 ],
[ 10.29700184, 10.00506662],
[ 13.80111411, 14.80514828],
[ 12.55070282, 14.63155383]])
In [94]: inside(random_pts3d[1])
Out[94]:
array([[ 10.42636137, 12.45736944],
[ 11.26682474, 13.01632751],
[ 13.23550598, 10.99431284],
[ 14.86871413, 14.19079225],
[ 10.61103434, 14.95970597]])
In [95]: inside(random_pts3d[2])
Out[95]:
array([[ 13.67395756, 10.17229061],
[ 10.01518846, 14.95480515],
[ 12.18167251, 12.62880968],
[ 11.27861513, 14.45609646],
[ 10.895685 , 13.35214678],
[ 13.42690335, 13.67224414]])
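For readers who want to rerun the comparison above, here is a self-contained sketch of the same idea that checks the split result against the per-set 2D function. It assumes a recent NumPy (builtin int instead of np.int, np.pad instead of np.lib.pad), and the parameter names pts and pts3d are mine.

```python
import numpy as np

# 5x5 block of ones surrounded by a 10-cell border of zeros -> (25, 25).
grid = np.pad(np.ones((5, 5)), ((10, 10), (10, 10)), "constant")


def inside(pts):
    """Keep the points of one (n, 2) set that fall on grid cells equal to 1."""
    idx = np.floor(pts).astype(int)
    return pts[grid[idx[:, 0], idx[:, 1]] == 1]


def inside3d(pts3d):
    """Same filter for an (m, n, 2) array; returns a list of m arrays."""
    idx = np.floor(pts3d).astype(int)
    mask = grid[idx[..., 0], idx[..., 1]] == 1       # shape (m, n)
    counts = mask.sum(axis=1)                        # matches per set
    flat = pts3d.reshape(-1, 2)[mask.ravel()]        # all matches, concatenated
    return np.split(flat, counts.cumsum()[:-1])      # cut back into m pieces


rng = np.random.default_rng(0)
random_pts3d = rng.random(size=(3, 100, 2)) * len(grid)
subsets = inside3d(random_pts3d)

# Each piece should equal what the 2D function returns for that set.
matches_loop = all(
    np.array_equal(s, inside(p)) for s, p in zip(subsets, random_pts3d)
)
```

The result is a list rather than a single array because the subsets generally have different lengths.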

How can I use scipy.interpolate.interp1d to interpolate multi Y arrays using the same X array?

As an example, I have an array of 2-D data with error bars on one of the dimensions, such as this:
In [1]: import numpy as np
In [2]: x = np.linspace(0,10,5)
In [3]: y = np.sin(x)
In [4]: y_er = (np.random.random(len(x))-0.5)*0.1
In [5]: data = np.vstack([x,y,y_er]).T
In [6]: data
array([[ 0.00000000e+00, 0.00000000e+00, -6.50361821e-03],
[ 2.50000000e+00, 5.98472144e-01, -3.69252108e-03],
[ 5.00000000e+00, -9.58924275e-01, -2.99042576e-02],
[ 7.50000000e+00, 9.37999977e-01, -7.66584515e-03],
[ 1.00000000e+01, -5.44021111e-01, -4.24650123e-02]])
If I want to use scipy.interpolate.interp1d, how do I format it to only have to call it once? I want to avoid this repeated method:
In [7]: import scipy.interpolate as interpolate
In [8]: new_x = np.linspace(0,10,20)
In [9]: interp_y = interpolate.interp1d(data[:,0], data[:,1], kind='cubic')
In [10]: interp_y_er = interpolate.interp1d(data[:,0], data[:,2], kind='cubic')
In [11]: data_int = np.vstack([new_x, interp_y(new_x), interp_y_er(new_x)]).T
In [12]: data_int
Out[12]:
array([[ 0.00000000e+00, 1.33226763e-15, -6.50361821e-03],
[ 5.26315789e-01, 8.34210211e-01, 4.03036906e-03],
[ 1.05263158e+00, 1.18950397e+00, 7.81676344e-03],
[ 1.57894737e+00, 1.17628260e+00, 6.43203582e-03],
[ 2.10526316e+00, 9.04947417e-01, 1.45265705e-03],
[ 2.63157895e+00, 4.85798968e-01, -5.54638391e-03],
[ 3.15789474e+00, 1.69424684e-02, -1.31694104e-02],
[ 3.68421053e+00, -4.27201979e-01, -2.03689966e-02],
[ 4.21052632e+00, -7.74935541e-01, -2.61377287e-02],
[ 4.73684211e+00, -9.54559384e-01, -2.94681929e-02],
[ 5.26315789e+00, -8.97599881e-01, -2.94003966e-02],
[ 5.78947368e+00, -6.09763178e-01, -2.60650399e-02],
[ 6.31578947e+00, -1.70935195e-01, -2.06835155e-02],
[ 6.84210526e+00, 3.35772943e-01, -1.45246375e-02],
[ 7.36842105e+00, 8.27250110e-01, -8.85721975e-03],
[ 7.89473684e+00, 1.21766391e+00, -4.99008827e-03],
[ 8.42105263e+00, 1.39749683e+00, -4.58031991e-03],
[ 8.94736842e+00, 1.24503605e+00, -9.46430377e-03],
[ 9.47368421e+00, 6.38467937e-01, -2.14799109e-02],
[ 1.00000000e+01, -5.44021111e-01, -4.24650123e-02]])
I believe it would be something like this:
In [13]: interp_data = interpolate.interp1d(data[:,0], data[:,1:], axis=?, kind='cubic')

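Looking back at my guess, I had tried axis=1. I double-checked the only other option that made sense, axis=0, and it worked: x has length 5 and data[:,1:] has shape (5, 2), so the axis of the y data that lines up with x is axis 0. So for the next dummy who has my same problem, this is what I wanted: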
In [14]: interp_data = interpolate.interp1d(data[:,0], data[:,1:], axis=0, kind='cubic')
In [15]: data_int = np.zeros((len(new_x),len(data[0])))
In [16]: data_int[:,0] = new_x
In [17]: data_int[:,1:] = interp_data(new_x)
In [18]: data_int
Out[18]:
array([[ 0.00000000e+00, 1.33226763e-15, -6.50361821e-03],
[ 5.26315789e-01, 8.34210211e-01, 4.03036906e-03],
[ 1.05263158e+00, 1.18950397e+00, 7.81676344e-03],
[ 1.57894737e+00, 1.17628260e+00, 6.43203582e-03],
[ 2.10526316e+00, 9.04947417e-01, 1.45265705e-03],
[ 2.63157895e+00, 4.85798968e-01, -5.54638391e-03],
[ 3.15789474e+00, 1.69424684e-02, -1.31694104e-02],
[ 3.68421053e+00, -4.27201979e-01, -2.03689966e-02],
[ 4.21052632e+00, -7.74935541e-01, -2.61377287e-02],
[ 4.73684211e+00, -9.54559384e-01, -2.94681929e-02],
[ 5.26315789e+00, -8.97599881e-01, -2.94003966e-02],
[ 5.78947368e+00, -6.09763178e-01, -2.60650399e-02],
[ 6.31578947e+00, -1.70935195e-01, -2.06835155e-02],
[ 6.84210526e+00, 3.35772943e-01, -1.45246375e-02],
[ 7.36842105e+00, 8.27250110e-01, -8.85721975e-03],
[ 7.89473684e+00, 1.21766391e+00, -4.99008827e-03],
[ 8.42105263e+00, 1.39749683e+00, -4.58031991e-03],
[ 8.94736842e+00, 1.24503605e+00, -9.46430377e-03],
[ 9.47368421e+00, 6.38467937e-01, -2.14799109e-02],
[ 1.00000000e+01, -5.44021111e-01, -4.24650123e-02]])
I did not figure out the syntax for using np.vstack or np.hstack to combine the new_x and interpolated data in one line but this post made me stop trying as it seems faster to pre-allocate the array (e.g. using np.zeros) then fill it with the new values.
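For what it's worth, a one-line combination does seem possible with np.column_stack, which accepts a 1-D array alongside a 2-D one. The sketch below regenerates similar data with an arbitrary seed, so the numbers will not match the output above, and it makes no claim about which approach is faster.

```python
import numpy as np
from scipy import interpolate

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 5)
y = np.sin(x)
y_er = (rng.random(len(x)) - 0.5) * 0.1
data = np.column_stack([x, y, y_er])  # shape (5, 3)

new_x = np.linspace(0, 10, 20)

# One interpolator for every y column: axis=0 is the axis that matches x.
interp_data = interpolate.interp1d(data[:, 0], data[:, 1:], axis=0, kind="cubic")

# new_x is 1-D and is treated as a single column next to the (20, 2) result.
data_int = np.column_stack([new_x, interp_data(new_x)])
```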

Sorting an Array Alongside a 2d Array

So I'm using NumPy's linear algebra routines to do some basic computational quantum mechanics. Say I have a matrix, hamiltonian, and I want its eigenvalues and eigenvectors:
import numpy as np
from numpy import linalg as la
hamiltonian = np.zeros((N, N)) # N is some constant I have defined
# fill up hamiltonian here
energies, states = la.eig(hamiltonian)
Now, I want to sort the energies in increasing order, and I want to sort the states along with them. For example, if I do:
groundStateEnergy = min(energies)
groundStateIndex = np.where(energies == groundStateEnergy)
groundState = states[groundStateIndex, :]
I correctly plot the ground state (eigenvector with the lowest eigenvalue). However, if I try something like this:
energies, states = zip(*sorted(zip(energies, states)))
or even
energies, states = zip(*sorted(zip(energies, states), key = lambda pair:pair[0]))
plotting in the same way no longer plots the correct state. So how can I sort states alongside energies, but only by row? (i.e., I want to associate each row of states with a value in energies, and I want to rearrange the rows so that the ordering of the rows corresponds to the sorted ordering of the values in energies)

You can use argsort as follows:
>>> x = np.random.random((10))
>>> x
array([ 0.69719108, 0.75828237, 0.79944838, 0.68245968, 0.36232211,
0.46565445, 0.76552493, 0.94967472, 0.43531813, 0.22913607])
>>> y = np.random.random((10))
>>> y
array([ 0.64332275, 0.34984653, 0.55240204, 0.31019789, 0.96354724,
0.76723872, 0.25721343, 0.51629662, 0.13096252, 0.86220311])
>>> idx = np.argsort(x)
>>> idx
array([9, 4, 8, 5, 3, 0, 1, 6, 2, 7])
>>> xsorted = x[idx]
>>> xsorted
array([ 0.22913607, 0.36232211, 0.43531813, 0.46565445, 0.68245968,
0.69719108, 0.75828237, 0.76552493, 0.79944838, 0.94967472])
>>> ysortedbyx = y[idx]
>>> ysortedbyx
array([ 0.86220311, 0.96354724, 0.13096252, 0.76723872, 0.31019789,
0.64332275, 0.34984653, 0.25721343, 0.55240204, 0.51629662])
and, as suggested in the comments, an example where we sort a 2d array by its first column
>>> x=np.random.random((10,2))
>>> x
array([[ 0.72789275, 0.29404982],
[ 0.05149693, 0.24411234],
[ 0.34863983, 0.58950756],
[ 0.81916424, 0.32032827],
[ 0.52958012, 0.00417253],
[ 0.41587698, 0.32733306],
[ 0.79918377, 0.18465189],
[ 0.678948 , 0.55039723],
[ 0.8287709 , 0.54735691],
[ 0.74044999, 0.70688683]])
>>> idx = np.argsort(x[:,0])
>>> idx
array([1, 2, 5, 4, 7, 0, 9, 6, 3, 8])
>>> xsorted = x[idx,:]
>>> xsorted
array([[ 0.05149693, 0.24411234],
[ 0.34863983, 0.58950756],
[ 0.41587698, 0.32733306],
[ 0.52958012, 0.00417253],
[ 0.678948 , 0.55039723],
[ 0.72789275, 0.29404982],
[ 0.74044999, 0.70688683],
[ 0.79918377, 0.18465189],
[ 0.81916424, 0.32032827],
[ 0.8287709 , 0.54735691]])
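Applied to the eigenvalue problem in the question, the same argsort idea might look like the sketch below. One caveat: np.linalg.eig documents its eigenvectors as columns, so column states[:, i] belongs to energies[i], and the reordering is applied to columns rather than rows. The tridiagonal matrix H is a made-up stand-in for the question's hamiltonian.

```python
import numpy as np

# Hypothetical Hamiltonian: a small symmetric tridiagonal matrix.
N = 6
H = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

energies, states = np.linalg.eig(H)

# Indices that would sort the eigenvalues in increasing order.
order = np.argsort(energies.real)

# Reorder the eigenvalues and the matching eigenvector columns together.
energies = energies[order]
states = states[:, order]

# The ground state is now the first column.
ground_state = states[:, 0]
```

If the matrix is known to be symmetric or Hermitian, np.linalg.eigh is another option, since it returns the eigenvalues in ascending order already.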
