Apply truncnorm and draw to 3d arrays of parameters - python

I have two 3D arrays, mean and std, containing, as their names state, mean values and standard deviations. Both arrays have the same shape, so the mean value and standard deviation at each position correspond to one another. For each position of the arrays, I would like to use the value in mean and the corresponding value in std to define a truncated normal distribution, draw a single value from it, and store that value at the corresponding position in another array p of the same shape as mean and std.
Of course, I thought of using scipy.stats.truncnorm, but I run into broadcasting problems and am a bit lost on how to use it elegantly. A for loop would take too much time, since the aim is to apply this process to very big arrays.
As a simple example, let us consider
mean = [[[4 0]
         [1 3]]
        [[3 1]
         [3 4]]]
std = [[[0.84700368 0.78628226]
        [0.54893714 0.68086502]]
       [[0.23237688 0.46543749]
        [0.01420151 0.25461322]]]
For simplicity, I initialize p as an array containing indices:
p = [[[1 2]
      [3 4]]
     [[5 6]
      [7 8]]]
For instance, I would like to replace the value 5 in p by a value randomly drawn from a truncated normal distribution (say, truncated between user-chosen values lower and upper) with mean 3 and standard deviation 0.23237688, as given at the corresponding position in mean and std. The aim is to apply this process to all values at once.
Thank you in advance for your answers!

It's easier than you think.
import numpy as np

mean = np.array([[[4, 0],
                  [1, 3]],
                 [[3, 1],
                  [3, 4]]])
std = np.array([[[0.84700368, 0.78628226],
                 [0.54893714, 0.68086502]],
                [[0.23237688, 0.46543749],
                 [0.01420151, 0.25461322]]])
lower = 1
upper = 3
# from the documentation of truncnorm:
a, b = (lower - mean) / std, (upper - mean) / std
from scipy.stats import truncnorm
# remove random_state from parameters if you don't want reproducible
# results.
p = truncnorm.rvs(a, b, loc=mean, scale=std, random_state=1)
print(np.around(p, 2))
# output:
[[[2.6  1.5 ]
  [1.   2.3 ]]
 [[2.66 1.05]
  [2.98 2.94]]]
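One note on why this works: per the truncnorm documentation, the clip points a and b are expressed in standard deviations relative to loc, which is why lower and upper are standardized above. As a quick sanity check on the result (using the variables defined above):
# every draw respects the user-chosen bounds, and the output keeps the input shape
assert np.all((p >= lower) & (p <= upper))
print(p.shape)  # (2, 2, 2), same shape as mean and std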

Related

How to customize numpy.random.normal() so that the sum of the probabilities in a row doesn't exceed 1?

I want to generate a normal (probability) distribution using numpy.random.normal(), where the sum of the probabilities in a row (over all columns) must be 1.
I used the following code to generate a sample 4 by 3 probability matrix:
import numpy as np

mu, sigma = 0.5, 0.20  # mean and standard deviation
np.random.seed(40)
sample_probability = np.random.normal(mu, sigma, size=(4, 3))
but the sum of the probabilities in each row becomes larger than 1, which I don't want to have.
[[0.37849046 0.47477272 0.36307873]
[0.68574295 0.13111979 0.40659952]
[0.95849807 0.59776201 0.6420534 ]
[0.71110689 0.51081462 0.55159068]]
i.e. np.sum(sample_probability[0,:]) yields 1.216341905895543, but I want to make it 1.
Would you please share your insights on how I can customize numpy.random.normal() so that it limits the sum of the probabilities in a row to 1?
Thanks.
[UPDATE] I went for manually normalizing each row rather than introducing modifications into numpy.random.normal(). Thanks to Mikhail and Frank.
First of all, make sure you understand that introducing this modification will change the joint distribution, so that your variables will no longer be distributed as i.i.d. Gaussians with the given mean and std.
A simple way to do it would be to manually normalize each row by its sum of the entries (after sampling):
sample_probability /= np.sum(sample_probability, axis=1)[:, np.newaxis]
You have to be clearer about what you want. What sort of distribution do you want at the end? You'll have to change either mu, sigma, or both to convert the distribution you have into one whose rows sum to 1.
If you just want to divide each row by its sum:
row_sums = np.sum(sample_probability, axis=1)
result = sample_probability / row_sums[:,None]
Alternatively, you could look at each row, see how much its total differs from 1, divide that delta by the number of items in the row, and add delta/n to each element (see the sketch below). This also yields rows that sum to 1.
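A minimal sketch of that additive correction (assuming sample_probability is the 4x3 array generated in the question):
import numpy as np

# spread each row's deviation from 1 evenly across that row's elements
n = sample_probability.shape[1]
delta = 1 - sample_probability.sum(axis=1, keepdims=True)
adjusted = sample_probability + delta / n
print(adjusted.sum(axis=1))  # -> [1. 1. 1. 1.]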
random.normal does not generate probabilities. It generates random numbers with a particular normal distribution. For a large number of those values, the mean will be close to the specified mu. In your case the row sum will approximate 3*mu, i.e. 1.5.
In [1]: mu, sigma = 0.5, 0.20 # mean and standard deviation
...: np.random.seed(40)
In [2]: x = np.random.normal(mu, sigma, size=(4,3))
In [3]: x.mean()
Out[3]: 0.5101817035650666
In [4]: x.mean(axis=1)
Out[4]: array([0.53043459, 0.494771 , 0.4213001 , 0.59422113])
In [5]: x.sum(axis=1)
Out[5]: array([1.59130377, 1.48431299, 1.26390029, 1.78266339])
For a larger dimension:
In [6]: x = np.random.normal(mu, sigma, size=(4,1000))
In [7]: x.mean(axis=1)
Out[7]: array([0.50881455, 0.50950833, 0.49800201, 0.49708817])
In [8]: x.sum(axis=1)
Out[8]: array([508.8145494 , 509.50833417, 498.00200654, 497.08816538])
We could scale the values so the row sum is 1, but the row mean will no longer be mu:
In [19]: x1 = x/x.sum(axis=1, keepdims=True)
In [20]: x1.sum(axis=1)
Out[20]: array([1., 1., 1., 1.])
In [21]: x1.mean(axis=1)
Out[21]: array([0.001, 0.001, 0.001, 0.001])
It would probably make more sense to use a Dirichlet distribution for this. Whereas a normal distribution can theoretically generate any number (some with very low probability), a Dirichlet distribution by definition generates sets of n numbers that add up to one.
If, as you say, you are looking for a matrix of probabilities, well, that's exactly what a Dirichlet distribution is for! It's a probabilistic way to generate probabilities. (Probabilities for a Multinomial distribution, to be precise.)
Here's a simple usage example:
import numpy
prob_mat = numpy.random.dirichlet([5, 5, 5, 5], 4)
print(prob_mat)
Output:
[[ 0.22564822 0.31584644 0.22485089 0.23365445]
[ 0.16188422 0.3077273 0.35070738 0.1796811 ]
[ 0.33209931 0.32359204 0.11584078 0.22846787]
[ 0.21951849 0.02267694 0.50503356 0.25277101]]
Here the numbers will always have the same mean. If you want to give more weight to some than others, pass larger or smaller numbers in the first argument. The number of elements in the first argument determines the size of the rows.
prob_mat = numpy.random.dirichlet([1, 9], 4)
print(prob_mat)
Output:
[[ 0.09191857 0.90808143]
[ 0.05854907 0.94145093]
[ 0.12310873 0.87689127]
[ 0.10848055 0.89151945]]
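By construction, each Dirichlet draw sums to 1, which you can verify directly (using prob_mat from above):
print(prob_mat.sum(axis=1))
# [1. 1. 1. 1.]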

Excluding rightmost edge in numpy.histogram

I have a list of numbers a and a list of bins which I shall use to bin the numbers in a using numpy.histogram. The bins are calculated from the mean and standard deviation (std) of a: the number of bins is B, the minimum value of the first bin is mean - std, and the maximum of the last bin is mean + std.
An example goes like the following:
>>> a
array([1, 1, 3, 2, 2, 6])
>>> bins = np.linspace(mean - std, mean + std, B + 1)
>>> bins
array([0.79217487, 1.93072496, 3.06927504, 4.20782513])
>>> numpy.histogram(a, bins=bins)[0]
array([2, 3, 0], dtype=int32)
However, I want to exclude the rightmost edge of the last bin - i.e. if some value in a exactly equals mean + std, I do not wish to include it in the last bin. The specifics about mean and std are not important; excluding the rightmost edge (i.e., making the last bin a half-open interval) is. Unfortunately, the doc says in this regard:
All but the last (righthand-most) bin is half-open. In other words, if bins is [1, 2, 3, 4] then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
Is there a simple solution I can employ? That is, one that does not involve manually fixing edges. That is something I can do, but that's not what I'm looking for. Is there a flag I can pass or a different method I can use?
Here's one (kind of crude?) way to make the last bin half-open instead of closed. What I'm doing is subtracting the smallest possible value from the right edge of the rightmost bin:
a = np.array([1, 1, 3, 2, 2, 6])
B = 3 # (in this example)
bins = np.linspace(a.mean() - a.std(), a.mean() + a.std(), B + 1)
# array([0.79217487, 1.93072496, 3.06927504, 4.20782513])
bins[-1] -= np.finfo(float).eps # <== this is the crucial line
np.histogram(a, bins = bins)
If you're using some type other than float for the values in a, use that type in the call to finfo. For example:
np.finfo(float).eps
np.finfo(np.float128).eps
Alternatively, clip the array first. Do NOT use the numpy.clip() function: it just sets out-of-bounds data to the clip high/low values, so those values end up counted in the leftmost and rightmost bins, which creates high peaks at both ends.
The following code worked for me. My case is an integer array, but I guess it should also be OK with a float array.
clip_low = a.mean() - a.std()    # I converted these to int for my integer data
clip_high = a.mean() + a.std()   # should also be OK with floats
clip = a[(clip_low <= a) & (a < clip_high)]   # strictly below clip_high; do NOT use np.clip()
bins = int(clip_high - clip_low)              # use your own number of bins here
hist, bins_edge = np.histogram(clip, bins=bins, range=(clip_low, clip_high))

How to calculate average value of items in a 3D array?

I am trying to get an average value for parameters to then plot with a given function. I think I have to somehow fill a 3-column array and then take the average of the values of that array. I want to create 1000 values for popt[0], popt[1], and popt[2], then take the average of all those values and plot them.
for n in range(0, 1000):
    params = np.zeros(3, 1000)
    y3 = y2 + np.random.normal(loc=0.0, scale=0.1*y2)
    popt, pcov = optimize.curve_fit(fluxmeasureMW, bands, y3)
    params.append(popt[0], popt[1], popt[2])
a_avg = st.mean(params[0:])
b_avg = st.mean(params[1:])
e_avg = st.mean(params[2:])
The final goal is to plot:
fluxmeasureMW(bands,a_avg,b_avg,e_avg)
I am just not sure how to iterate the fitting function to then output 1000 values. 1000 is arbitrary, I just want a good sample size. The values for y2 and bands are already defined and can be plotted without issue, as well as the function fluxmeasureMW.
Say your function looks like this:
def fluxmeasureMW(x, f, g, h):
    return result_of_calc
Just run the fit in a loop; accumulate the popts in a list then take the mean
from scipy import optimize
import numpy as np

n = 1000
t = []
for i in range(n):
    y3 = y2 + np.random.normal(loc=0.0, scale=0.1*y2)
    popt, pcov = optimize.curve_fit(fluxmeasureMW, bands, y3)
    t.append(popt)
f, g, h = np.mean(t, 0)
t will be a list of lists...
[[f,g,h],
[f,g,h],
...]
np.mean(t,0) will average the values over the columns.
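To reach the question's stated goal, you can then evaluate the model with the averaged parameters (a sketch, assuming bands, y2, and fluxmeasureMW are defined as in the question; matplotlib is only brought in here for the plot):
import matplotlib.pyplot as plt

plt.scatter(bands, y2, label='data')                                   # the original measurements
plt.plot(bands, fluxmeasureMW(bands, f, g, h), label='averaged fit')   # model at the mean parameters
plt.legend()
plt.show()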
You could also use
import statistics
a = [[0, 1, 2],
     [1, 2, 3],
     [2, 3, 4],
     [3, 4, 5]]
for column in zip(*a):
    # print(column)
    print(statistics.mean(column))

How does sklearn's jaccard_score get calculated?

I was trying to understand what is going on with sklearn's jaccard_score.
This is the result I got
1. jaccard_score([0, 1, 1], [1, 1, 1])
   0.6666666666666666
2. jaccard_score([1, 1, 0], [1, 0, 0])
   0.5
3. jaccard_score([1, 1, 0], [1, 0, 1])
   0.3333333333333333
I understand that the formula is
intersection / (size of A + size of B - intersection)
I thought the last one should give me 0.2, because the overlap is 1 and the total number of entries is 6, resulting in 1/5, but I got 0.33333...
Can anyone explain how sklearn calculates jaccard_score?
Per sklearn's docs, the jaccard_score function "is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true". If the labels are binary, the computation reduces to tp / (tp + fp + fn) on the confusion matrix, i.e. only the positive class is considered. Otherwise, the same computation is done using the confusion matrix for each attribute value / class label.
The above definition for binary attributes / classes can be reduced to the set definition as explained in the following.
Assume that there are three records r1, r2, and r3. The vectors [0, 1, 1] and [1, 1, 1] -- which are the true and predicted classes of the records -- can be mapped to the two sets {r2, r3} and {r1, r2, r3}, respectively. Here, each element in the vector indicates whether the corresponding record is in the set. The Jaccard similarity of the two sets is the same as the similarity value defined for the two vectors.
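To make the third example concrete, here is the same value computed by hand from sets and compared against sklearn (a short sketch):
from sklearn.metrics import jaccard_score

y_true = [1, 1, 0]
y_pred = [1, 0, 1]

# indices where the label is 1, viewed as sets
A = {i for i, v in enumerate(y_true) if v == 1}  # {0, 1}
B = {i for i, v in enumerate(y_pred) if v == 1}  # {0, 2}

print(len(A & B) / len(A | B))        # 0.3333...
print(jaccard_score(y_true, y_pred))  # 0.3333...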

Why does the mean output of multivariate_normal method differ from the mean of distribution?

import numpy as np

np.random.seed(12)
num_observations = 5
x1 = np.random.multivariate_normal([1, 1], [[1, .75], [.75, 1]], num_observations)
sum = 0
for i in x1:
    sum += i
print(sum / num_observations)
In this snippet the output comes out as [0.95766788 0.79287083], but shouldn't it be [1, 1], since I set the mean to [1, 1] when generating the multivariate distribution?
What multivariate_normal does is:
Draw random samples from a multivariate normal distribution.
With the key word here being draw. You are basically taking a fairly small sample that is not guaranteed to have the same mean as the distribution itself. (That's the mathematical expectation, nothing more, and your sample size is 5.)
x1.mean(axis=0)
# array([ 0.958, 0.793])
Consider testing this by taking a much larger sample, where the law of large numbers dictates that your means should more reliably approach 1.00000...
x2 = np.random.multivariate_normal([1, 1], [[1, .75],[.75, 1]], 10000)
x2.mean(axis=0)
# array([ 1.001, 1.009])
In other words: say you had a population of 300 million people where the average age was 50. If you randomly picked 5 of them, you would expect your mean of the 5 to be 50, but it probably wouldn't be exactly 50, and might even be significantly far off from 50.
