How to best get a sample from a truncated normal distribution? - python

I have done some searching but I cannot seem to be able to find a reasonable way to sample from a truncated normal distribution.
Without truncation I was doing:
samples = [np.random.normal(loc=x,scale=d) for (x,d) in zip(X,D)]
X and D being lists of floats.
Currently I am implementing truncation as such:
def truncnorm(loc, scale, bounds):
    s = np.random.normal(loc, scale)
    if s > bounds[1]:
        return bounds[1]
    elif s < bounds[0]:
        return bounds[0]
    return s
samples = [truncnorm(loc=x,scale=d,bounds=b) for (x,d,b) in zip(X,D,bounds)]
bounds being a list of tuples (min,max)
This approach feels a little awkward, so I'm wondering if there is a better way?

Returning the bound value for samples that fall outside the bounds will result in too many samples sitting exactly on the bounds, which is not representative of the actual distribution. Such samples need to be rejected and replaced by a new draw. Such code could be:
def test_truncnorm(loc, scale, bounds):
    while True:
        s = np.random.normal(loc, scale)
        if bounds[0] <= s <= bounds[1]:
            break
    return s
This can be extremely slow given narrow bounds.
Scipy's truncnorm handles such cases more efficiently. A bit surprisingly, the bounds are expressed in terms of the standard normal, so your call would be:
s = scipy.stats.truncnorm.rvs((bounds[0]-loc)/scale, (bounds[1]-loc)/scale, loc=loc, scale=scale)
Note that scipy works much faster when making use of numpy's vectorization and broadcasting. And once you're used to the notation, it also looks simpler to write and read. All samples can be calculated in one go as:
X = np.array(X)
D = np.array(D)
bounds = np.array(bounds)
samples = scipy.stats.truncnorm.rvs((bounds[:, 0] - X) / D, (bounds[:, 1] - X) / D, loc=X, scale=D)
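For illustration, a minimal self-contained sketch of the vectorized call (the X, D and bounds values below are made up, not from the question):
import numpy as np
import scipy.stats

# Hypothetical example data: one (mean, scale, bounds) triple per sample
X = np.array([0.0, 1.0, 2.0])
D = np.array([0.5, 1.0, 0.3])
bounds = np.array([(-1.0, 1.0), (0.0, 3.0), (1.5, 2.5)])

# truncnorm expects the bounds in standard-normal units
a = (bounds[:, 0] - X) / D
b = (bounds[:, 1] - X) / D
samples = scipy.stats.truncnorm.rvs(a, b, loc=X, scale=D)
print(samples)  # three samples, each guaranteed to lie within its own bounds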

Related

How to better perform Pearson R from two arrays of dimensions (m, n) and (n), returning an array of size (m)? [Python, NumPy, SciPy]

I'm trying to improve a simple algorithm for obtaining the Pearson correlation coefficient from two arrays, X(m, n) and Y(n), returning another array R of dimension (m).
In this case, I want to know the behavior of each row of X with respect to the values of Y. A sample (working) code is presented below:
import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)
r = np.empty(m)
for i in range(m):
    r[i] = pearsonr(x[i], y)[0]
For this particular case, I get: r = array([0.95272843, -0.69134753, 0.36419159, 0.27467137, 0.76887201, 0.08823868, -0.72608421, -0.01224453, 0.58375626, 0.87442889])
For small values of m (near 10k) this runs pretty fast, but I'm starting to work with m ~ 30k, and so this is taking much longer than I expected. I'm aware I could implement multiprocessing/multi-threading but I believe there's a (better) pythonic way of doing this.
I tried to use pearsonr(x, np.ones((m, n))*y), but it returns only (nan, nan).
pearsonr only supports 1D arrays internally. Moreover, it computes the p-value, which is not used here, so it would be more efficient not to compute it at all. Additionally, the code recomputes the statistics of the y vector on every iteration and does not make efficient use of vectorized NumPy operations. This is why the computation is a bit slow. You can check this in the code here.
One way to compute this is to write your own custom implementation based on the SciPy one:
def multi_pearsonr(x, y):
    xmean = x.mean(axis=1)
    ymean = y.mean()
    xm = x - xmean[:, None]
    ym = y - ymean
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    return np.clip(np.dot(xm/normxm[:, None], ym/normym), -1.0, 1.0)
It is 450 times faster on my machine for m = 10_000.
Note that I did not keep the checks of the SciPy code, but it may be a good idea to keep them if your input is not guaranteed to be statistically safe (i.e. well-formed for the computation of the Pearson test).
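As a quick sanity check, a sketch using the same synthetic x and y as in the question (and assuming multi_pearsonr from above is defined): the vectorized version should match the pearsonr loop up to floating-point error.
import numpy as np
from scipy.stats import pearsonr

np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)

# Reference result from the loop in the question
r_loop = np.array([pearsonr(x[i], y)[0] for i in range(m)])

# Vectorized result
r_fast = multi_pearsonr(x, y)
print(np.allclose(r_loop, r_fast))  # expected: True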

Using a for loop to solve coupled differential equations in python

I am trying to solve a set of differential equations, but I have been having difficulty making this work. My differential equations contain an "i" subscript that represents numbers from 1 to n. I tried implementing a for loop as follows, but I keep getting an index error (the error message is below). I have tried changing the initial conditions (y0) and other values, but nothing seems to work. In this code, I am using solve_ivp. The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.integrate import solve_ivp

def testmodel(t, y):
    X = y[0]
    Y = y[1]
    J = y[2]
    Q = y[3]
    a = 3
    S = 0.4
    K = 0.8
    L = 2.3
    n = 100
    for i in range(1, n+1):
        dXdt[i] = K**a + (Q[i]**a) - S*X[i]
        dYdt[i] = (K*X[i]) - (L*Y[i])
        dJdt[i] = S*Y[i] - (K*Q[i])
        dQdt[i] = K*X[i]/L + J[i]
    return dXdt, dYdt, dJdt, dQdt

t_span = np.array([0, 120])
times = np.linspace(t_span[0], t_span[1], 1000)
y0 = 0, 0, 0, 0
soln = solve_ivp(testmodel, t_span, y0, t_eval=times,
                 vectorized=True)
t = soln.t
X = soln.y[0]
Y = soln.y[1]
J = soln.y[2]
Q = soln.y[3]

plt.plot(t, X, linewidth=2, color='red')
plt.show()
The error I get is
IndexError Traceback (most recent call last)
<ipython-input-107-3a0cfa6e42ed> in testmodel(t, y)
15 n = 100
16 for i in range(1,n+1):
--> 17 dXdt[i] = K**a+(Q[i]**a) - S*X[i]
IndexError: index 1 is out of bounds for axis 0 with size 1
I have scoured the web for a solution to this, but I have been unable to apply any of them to this problem. I am not sure what I am doing wrong or what to actually change.
I have tried removing the "vectorized=True" argument, but then I get an error that states I cannot index scalar variables. This is confusing because I do not think these values should be scalar. How do I resolve this problem? My ultimate goal is to plot these differential equations. Thank you in advance.
It is nice that you provide the standard solver with a vectorized ODE function for multi-point evaluations. But the default method is the explicit RK45, and explicit methods do not use Jacobian matrices, so there is no need for multi-point evaluations of difference quotients for the partial derivatives.
In essence, the coordinate arrays always have size 1, as the evaluation is at a single point; so, for instance, Q is an array of length 1, and the only valid index is 0. Remember that in all "true" programming languages array indices start at 0. It is only some CAS script languages that use the "more mathematical" 1 as the index start. (Setting n=100 and ignoring the length of the arrays provided by the solver is wrong as well.)
You can avoid all that and shorten your routine by taking into account that the standard arithmetic operations are applied element-wise for numpy arrays, so
def testmodel(t, y):
    X, Y, J, Q = y
    a = 3; S = 0.4; K = 0.8; L = 2.3
    dXdt = K**a + Q**a - S*X
    dYdt = K*X - L*Y
    dJdt = S*Y - K*Q
    dQdt = K*X/L + J
    return dXdt, dYdt, dJdt, dQdt
Modifying your code for multiple compartments with the same dynamics
You need to pass the solver a flat vector of the state. The first design decision is how the compartments and their components are arranged in the flat vector. One variant that is most compatible with the existing code is to cluster the same components together. Then in the ODE function the first operation is to separate out these clusters.
X,Y,J,Q = y.reshape([4,-1])
This splits the input vector into 4 pieces of equal length. At the end you need to reverse this split so that the derivatives are again in a flat vector.
return np.concatenate([dXdt, dYdt, dJdt, dQdt])
Everything else remains the same, apart from the initial vector, which needs to have 4 segments of length N containing the data for the compartments. Here that could just be
y0 = np.zeros(4*N)
If the initial data is from any other source, and given in records per compartment, you might have to transpose the resulting array before flattening it.
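Putting these pieces together, a minimal sketch of the modified code might look as follows (N is a hypothetical compartment count; any coupling between compartments is omitted, so every compartment just follows the same dynamics):
import numpy as np
from scipy.integrate import solve_ivp

N = 100  # hypothetical number of compartments

def testmodel(t, y):
    # split the flat state vector into the four component arrays
    X, Y, J, Q = y.reshape([4, -1])
    a = 3; S = 0.4; K = 0.8; L = 2.3
    dXdt = K**a + Q**a - S*X
    dYdt = K*X - L*Y
    dJdt = S*Y - K*Q
    dQdt = K*X/L + J
    # flatten the derivatives back into a single vector
    return np.concatenate([dXdt, dYdt, dJdt, dQdt])

y0 = np.zeros(4*N)
t_span = (0, 120)
times = np.linspace(t_span[0], t_span[1], 1000)
soln = solve_ivp(testmodel, t_span, y0, t_eval=times)

X = soln.y[:N]  # rows 0..N-1 hold the X components of all compartments over time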
Note that this construction is not vectorized, so leave that option at its default of False.
For uniform interaction patterns, like in a circle, I recommend the use of numpy.roll to continue to avoid explicit loops. For an interaction pattern that looks like a network, one can use connectivity matrices and masks as in Using python built-in functions for coupled ODEs.

How to simulate a variable with a fixed interval?

I am trying to simulate the performance of a real-life process. The variables that have been measured historically show a fixed interval, so being lower or greater than those values is physically impossible.
To simulate the process output, each input variable's historical data was represented by its best-fit probability distribution (using this approach: Fitting empirical distribution to theoretical ones with Scipy (Python)?).
However, when the resulting theoretical distribution is simulated n times, it does not reproduce the real-life expected minimum and maximum values. I am thinking of applying a try-except test in each simulation to check whether each simulated value falls within the expected interval, but I am not sure this is the best way to handle it, since the experimental mean and variance are then not achieved.
You can use a boolean mask in numpy for regenerating the values that are outside the required boundaries. For example:
def random_with_bounds(func, size, bounds):
    x = func(size=size)
    r = (x < bounds[0]) | (x > bounds[1])
    while r.any():
        x[r] = func(size=r.sum())
        r[r] = (x[r] < bounds[0]) | (x[r] > bounds[1])
    return x
Then you can use it like:
random_with_bounds(np.random.normal, 1000, (-1, 1))
Another version using index arrays via np.argwhere gives slightly increased performance:
def random_with_bounds_2(func, size, bounds):
    x = func(size=size)
    r = np.argwhere((x < bounds[0]) | (x > bounds[1])).ravel()
    while r.size > 0:
        x[r] = func(size=r.size)
        r = r[(x[r] < bounds[0]) | (x[r] > bounds[1])]
    return x
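Since func is only ever called with a size keyword, distribution parameters can be bound beforehand, e.g. with functools.partial. A small sketch (the loc/scale values below are made up for illustration):
import numpy as np
from functools import partial

# Bind loc and scale so the sampler only needs the size argument
sampler = partial(np.random.normal, loc=10.0, scale=2.0)
samples = random_with_bounds(sampler, 1000, (5.0, 15.0))
print(samples.min(), samples.max())  # both within (5, 15)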

fsum for numpy.arrays, stable summation

I have a number of multidimensional numpy.arrays with small values
that I need to add up with little numerical error. For floats, there is math.fsum (with its implementation here), which has always served me well. numpy.sum isn't stable enough.
How can I get a stable summation for numpy.arrays?
Background
This is for the quadpy package. The arrays of small values are the evaluations of a function at specific points of (many) intervals, times their weights. The sum of these is an approximation of the integral of said function over the intervals.
Alright then, I've implemented accupy which gives a few stable summation algorithms.
Here's a quick and dirty implementation of Kahan summation for numpy arrays. Notice, however, that it is not very accurate for ill-conditioned sums.
import numpy

def kahan_sum(a, axis=0):
    '''Kahan summation of the numpy array along an axis.
    '''
    s = numpy.zeros(a.shape[:axis] + a.shape[axis+1:])
    c = numpy.zeros(s.shape)
    for i in range(a.shape[axis]):
        # https://stackoverflow.com/a/42817610/353337
        y = a[(slice(None),) * axis + (i,)] - c
        t = s + y
        c = (t - s) - y
        s = t.copy()
    return s
It does the job, but it's slow because it's Python-looping over the axis-th dimension.
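For illustration, a hypothetical usage sketch (random made-up data): kahan_sum reduces the chosen axis just like numpy.sum(a, axis=...), and the two results can be compared directly.
import numpy

a = numpy.random.rand(1000, 3, 4).astype(numpy.float32)

s_kahan = kahan_sum(a, axis=0)   # compensated summation along axis 0
s_plain = numpy.sum(a, axis=0)   # ordinary summation for comparison
print(s_kahan.shape)             # (3, 4)
print(numpy.abs(s_kahan - s_plain).max())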

Averaging unevenly sampled data

I have data which consist of the radial distance to the ground, sampled evenly every d_theta. I would like to do gaussian smoothing on it, but make the size of the smoothing window a constant in x, rather than be a constant number of points. What is a good way to do this?
I made a function to do it, but it is slow and I haven't even put in the parts that will calculate the edges yet.
If it helps to do it faster, I guess you can assume the floor is flat and use that to calculate how many points to sample, rather than using the actual x-values.
Here is what I have attempted so far:
import numpy as np
from scipy.signal.windows import gaussian  # assumed source of the Gaussian window used below

bs = [gaussian(2*n-1, n/2) for n in range(1, 500)]  # bring the computation of the
bs = [b/b.sum() for b in bs]                        # gaussian outside to speed it up

def uneven_gauss_smoothing(xvals, yvals, sigma):
    newy = []
    for i, xval in enumerate(xvals):
        # find how big the window should be to have the chosen sigma
        # (or .5*sigma, whatever):
        wheres = np.where(xvals > xval + sigma)[0]
        iright = wheres[0] - i if len(wheres) else 100
        if i - iright < 0:
            newy.append(0)  # not implemented yet
            continue
        if i + iright >= len(xvals):
            newy.append(0)  # not implemented
            continue
        else:
            # weighted average with gaussian curve:
            newy.append((yvals[i-iright:i+iright+1]*bs[iright]).sum())
    return np.array(newy)
Sorry it's a bit of a mess; it was so incredibly frustrating to debug that I just ended up using the first solution that came to mind (usually one which was difficult to read) for some of the problems that popped up. But it does work in its limited way.
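Purely to show how the function above might be called, a sketch with synthetic data and a made-up sigma (assuming the imports added at the top of the snippet):
import numpy as np

# Hypothetical, unevenly spaced samples
xvals = np.cumsum(np.random.rand(200)*0.1)         # strictly increasing x positions
yvals = np.sin(xvals) + 0.1*np.random.randn(200)   # noisy signal

smoothed = uneven_gauss_smoothing(xvals, yvals, sigma=0.5)
# points near the edges (where the window would run off the data) are currently returned as 0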
