Probability functions convolution in Python

There are N distributions that take on integer values 0, 1, ... with associated probabilities. For example, assume three variables, each stored as [value, prob] pairs:
import numpy as np
x = np.array([ [0,0.3],[1,0.2],[3,0.5] ])
y = np.array([ [10,0.2],[11,0.4],[13,0.1],[14,0.3] ])
z = np.array([ [21,0.3],[23,0.7] ])
As there are N variables, I convolve x+y first, then add z, and so on.
Unfortunately numpy.convolve() takes 1-D arrays as input, so it does not suit this case directly. I have played with padding the variables so that they all cover the values 0, 1, 2, ..., 23 (with Pr = 0 where a value does not occur)... but I feel there must be a much better solution.
Does anyone have a suggestion for making it more efficient? Thanks in advance.

I don't see a built-in method for this in SciPy; there is a way to define a custom discrete random variable, but those don't support addition. Here is an approach using pandas, assuming import pandas as pd and x, y, z as in your example:
values = np.add.outer(x[:,0], y[:,0]).flatten()
probs = np.multiply.outer(x[:,1], y[:,1]).flatten()
df = pd.DataFrame({'values': values, 'probs': probs})
conv = df.groupby('values').sum()
result = conv.reset_index().values
The output is
array([[ 10. , 0.06],
[ 11. , 0.16],
[ 12. , 0.08],
[ 13. , 0.13],
[ 14. , 0.31],
[ 15. , 0.06],
[ 16. , 0.05],
[ 17. , 0.15]])
With more than two variables, you don't have to go back and forth between numpy and pandas: the additional variables can be included at the beginning.
values = np.add.outer(np.add.outer(x[:,0], y[:,0]), z[:,0]).flatten()
probs = np.multiply.outer(np.multiply.outer(x[:,1], y[:,1]), z[:,1]).flatten()
Aside: it would be better to keep values and probabilities in separate numpy arrays, if they have different intrinsic data types (integers vs reals).
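For the general N-variable case, the same two outer products can be wrapped in a small helper and folded over the list of distributions; a minimal sketch (the helper name convolve_two is just illustrative):
import numpy as np
import pandas as pd
from functools import reduce

def convolve_two(a, b):
    """Convolve two [value, prob] distributions and merge duplicate sums."""
    values = np.add.outer(a[:, 0], b[:, 0]).flatten()
    probs = np.multiply.outer(a[:, 1], b[:, 1]).flatten()
    df = pd.DataFrame({'values': values, 'probs': probs})
    return df.groupby('values').sum().reset_index().values

x = np.array([[0, 0.3], [1, 0.2], [3, 0.5]])
y = np.array([[10, 0.2], [11, 0.4], [13, 0.1], [14, 0.3]])
z = np.array([[21, 0.3], [23, 0.7]])

result = reduce(convolve_two, [x, y, z])   # (x + y) + z
print(result)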

Related

Vectorize Scipy cubic interpolation for multiple Numpy arrays

I have an np.array of 50 elements. For example:
data = np.array([9.22, 9. , 9.01, ..., 7.98, 6.77, 7.3 ])
For each element of the data np.array, I have an x and y data pair (both with the same length) that I want to interpolate with. For example:
x = np.array([[ 1, 2, 3, 4, 5 ],
...,
[ 1.01, 2.01, 3.02, 4.03, 5.07 ]])
y = np.array([[0. , 1. , 0.95, ..., 0.07, 0.06, 0.06],
...,
[0. , 0.99 , 0.85, ..., 0.03, 0.05, 0.06]])
I want to interpolate each data element with the respective np.array of x and y.
I have the following solution using map():
def cubic_spline(i):
    return scipy.interpolate.splev(x=data[i],
                                   tck=scipy.interpolate.splrep(x[i], y[i], k=3))
list(map(cubic_spline, np.arange(len(data))))
But I'm wondering if there is a way to do it directly with scipy and numpy to optimize the execution time. Something like:
scipy.interpolate.splev(x=data,
tck=scipy.interpolate.splrep(x, y, k=3))
Any suggestions will be appreciated. Thanks in advance.
If you have a single x array and multiple y arrays, newer interpolators (make_interp_spline, PchipInterpolator, etc.) support multidimensional y arrays automatically.
If you really have a collection of pairs of 1D arrays, x and y, where x arrays differ, and you want scipy to loop over these datasets, then no, scipy does not support that. You'd need to loop over them manually.
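A minimal sketch of the first case, with illustrative data (one shared x grid, one y dataset per row):
import numpy as np
from scipy.interpolate import make_interp_spline

x = np.linspace(1.0, 5.0, 5)
y = np.array([[0.0, 1.0, 0.95, 0.07, 0.06],
              [0.0, 0.99, 0.85, 0.03, 0.05]])

# The default axis=0 expects the interpolation axis first, so pass y transposed:
# len(x) rows, one column per dataset. One call builds a spline for every dataset.
spline = make_interp_spline(x, y.T, k=3)

# One call also evaluates every dataset's spline at the same query points.
query = np.array([1.5, 2.5, 3.5])
print(spline(query).shape)   # (3, 2): query points x datasets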

Markov Clustering in Python

As the title says, I'm trying to get a Markov Clustering Algorithm to work in Python, namely Python 3.7
Unfortunately, it's not doing much of anything, and it's driving me up the wall trying to fix it.
EDIT: First, I've made adjustments to the main code so that each column sums to 100, even if it's not perfectly balanced. I'm going to try to account for that in the final answer.
To be clear, the biggest problem is that the numbers spiral out of control, into such easily-understandable numbers as 5.56268465e-309, and I don't know how to convert that into something understandable.
Here's the code so far:
import numpy as np
import math

## How far you'd like your random-walkers to go (bigger number -> more walking)
EXPANSION_POWER = 2
## How tightly clustered you'd like your final picture to be (bigger number -> more clusters)
INFLATION_POWER = 2
ITERATION_COUNT = 10

def normalize(matrix):
    return matrix/np.sum(matrix, axis=0)

def expand(matrix, power):
    return np.linalg.matrix_power(matrix, power)

def inflate(matrix, power):
    for entry in np.nditer(transition_matrix, op_flags=['readwrite']):
        entry[...] = math.pow(entry, power)
    return matrix

def run(matrix):
    #np.fill_diagonal(matrix, 1)
    #print(matrix)
    matrix = normalize(matrix)
    print(matrix)
    for _ in range(ITERATION_COUNT):
        matrix = normalize(inflate(expand(matrix, EXPANSION_POWER), INFLATION_POWER))
    return matrix
transition_matrix = np.array ([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0.5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0.34,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0.33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0.33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0.34,0,0,0,0,0,0,0,0,0,0,0,0,0.125,0],
[0,0,0,0.33,0,0,0.5,0,0,0,0,0,0,0,0,0,0.125,1],
[0,0,0,0.33,0,0,0.5,1,1,0,0,0,0,0,0,0,0.125,0],
[0,0,0,0,0.166,0,0,0,0,0,0,0,0,0,0,0,0.125,0],
[0,0,0,0,0.166,0,0,0,0,0.2,0,0,0,0,0,0,0.125,0],
[0,0,0,0,0.167,0,0,0,0,0.2,0.25,0,0,0,0,0,0.125,0],
[0,0,0,0,0.167,0,0,0,0,0.2,0.25,0.5,0,0,0,0,0,0],
[0,0,0,0,0.167,0,0,0,0,0.2,0.25,0.5,0,1,0,0,0.125,0],
[0,0,0,0,0.167,0,0,0,0,0.2,0.25,0,1,0,1,0,0.125,0],
[0,0,0,0,0,0.34,0,0,0,0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0.33,0,0,0,0,0,0,0,0,0,0.5,0,0],
[0,0,0,0,0,0.33,0,0,0,0,0,0,0,0,0,0.5,0,0]])
run(transition_matrix)
print(transition_matrix)
This is part of a uni assignment - I need to do this for the array both weighted and unweighted (though the weighted part can wait until I've got the bloody thing working at all). Any tips or suggestions?
Your transition matrix is not valid.
>>> transition_matrix.sum(axis=0)
matrix([[1.  , 1.  , 0.99, 0.99, 0.96, 0.99, 1.  , 1.  , 0.  , 1.  ,
         1.  , 1.  , 1.  , 0.  , 0.  , 1.  , 0.88, 1.  ]])
Not only do some of your columns not sum to 1, some of them sum to 0.
This means when you try to normalize your matrix, you will end up with nan because you are dividing by 0.
Lastly, is there a reason why you are using a NumPy matrix instead of just a NumPy array, which is the recommended container for such data? Using NumPy arrays simplifies some of the operations, such as raising each entry to a power. There are also some differences between NumPy matrix and NumPy array that can result in subtle bugs.
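To illustrate that last point, a small standalone sketch (with a toy 2x2 array) of how inflate and expand look with a plain NumPy array, where elementwise ** replaces the nditer loop:
import numpy as np

def inflate(matrix, power):
    return matrix ** power                        # elementwise (Hadamard) power

def expand(matrix, power):
    return np.linalg.matrix_power(matrix, power)  # repeated matrix product

m = np.array([[0.5, 0.25],
              [0.5, 0.75]])
print(inflate(m, 2))   # squares each entry
print(expand(m, 2))    # equivalent to m @ m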

How to blur 3D array of points, while maintaining their original values? (Python)

I have a sparse 3D array of values. I am trying to turn each "point" into a fuzzy "sphere", by applying a Gaussian filter to the array.
I would like the original value at the point (x,y,z) to remain the same. I just want to create falloff values around this point... But applying the Gaussian filter changes the original (x,y,z) value as well.
I am currently doing this:
dataCube = scipy.ndimage.filters.gaussian_filter(dataCube, 3, truncate=8)
Is there a way for me to normalize this, or do something so that my original values are still in this new dataCube? I am not necessarily tied to using a Gaussian filter, if that is not the best approach.
You can do this using a convolution with a kernel that has 1 as its central value, and a width smaller than the spacing between your data points.
1-d example:
import numpy as np
import scipy.signal
data = np.array([0,0,0,0,0,5,0,0,0,0,0])
kernel = np.array([0.5,1,0.5])
scipy.signal.convolve(data, kernel, mode="same")
gives
array([ 0. , 0. , 0. , 0. , 2.5, 5. , 2.5, 0. , 0. , 0. , 0. ])
Note that fftconvolve might be much faster for large arrays. You also have to specify what should happen at the boundaries of your array.
Update: 3-d example
import numpy as np
from scipy import signal
# first build the smoothing kernel
sigma = 1.0 # width of kernel
x = np.arange(-3,4,1) # coordinate arrays -- make sure they contain 0!
y = np.arange(-3,4,1)
z = np.arange(-3,4,1)
xx, yy, zz = np.meshgrid(x,y,z)
kernel = np.exp(-(xx**2 + yy**2 + zz**2)/(2*sigma**2))
# apply to sample data
data = np.zeros((11,11,11))
data[5,5,5] = 5.
filtered = signal.convolve(data, kernel, mode="same")
# check output
print(filtered[:,5,5])
gives
[ 0. 0. 0.05554498 0.67667642 3.0326533 5. 3.0326533
0.67667642 0.05554498 0. 0. ]
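As noted above, scipy.signal.fftconvolve is a drop-in replacement that is usually much faster for large arrays; a minimal standalone check (toy 32x32x32 grid, same Gaussian kernel construction as the 3-d example):
import numpy as np
from scipy import signal

data = np.zeros((32, 32, 32))
data[16, 16, 16] = 5.0                                 # one sparse point

x = np.arange(-3, 4)
xx, yy, zz = np.meshgrid(x, x, x)
kernel = np.exp(-(xx**2 + yy**2 + zz**2) / 2.0)        # centre value is 1

direct = signal.convolve(data, kernel, mode="same", method="direct")
fast = signal.fftconvolve(data, kernel, mode="same")
print(np.allclose(direct, fast))   # True: same result up to floating-point error
print(fast[16, 16, 16])            # ~5.0: the original value is preserved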

Construct 2 time series random variables with fixed correlation

Is there an easy way to generate two time-series with a fixed correlation? For instance 0.5.
Does anyone know a solution in R or Python?
Thanks!
This question is quite general, I think, and not limited to time series. What you are asking for is a 2D random variable with a known covariance. r = 0.5 with std1 = 1 and std2 = 2 translates to a covariance matrix of [[1, 1], [1, 4]]. Therefore, if we assume the data is multivariate normally distributed, we can generate such a random variable:
In [42]:
import numpy as np
val=np.random.multivariate_normal((0,0),[[1,1],[1,4]],1000)
In [43]:
np.corrcoef(val.T)
Out[43]:
array([[ 1. , 0.488883],
[ 0.488883, 1. ]])
In [44]:
np.cov(val.T)
Out[44]:
array([[ 1.03693888, 0.96490767],
[ 0.96490767, 3.75671707]])
In [45]:
val=np.random.multivariate_normal((0,0),[[1,1],[1,4]],10)
In [46]:
np.corrcoef(val.T)
Out[46]:
array([[ 1. , 0.56807297],
[ 0.56807297, 1. ]])
In [48]:
val[:,0]
Out[48]:
array([-0.77425116, 0.35758601, -1.21668939, -0.95127533, -0.5714381 ,
0.87530824, 0.9594394 , 1.30123373, 1.92511929, 0.98070711])
In [49]:
val[:,1]
Out[49]:
array([-1.75698285, 2.24011423, -3.5129411 , -1.33889305, 2.32720257,
0.53750133, 3.23935645, 2.96819425, -0.72551024, 3.0743096 ])
As shown in this example, if your sample size is small, the resulting random variable may deviate considerably from r = 0.5.
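If you want this wrapped up for reuse, here is a minimal helper built on the same idea (the function name correlated_pair is just illustrative); the off-diagonal covariance entry is r * std1 * std2, as above:
import numpy as np

def correlated_pair(r, n, std1=1.0, std2=1.0, seed=None):
    """Draw two series of length n from a bivariate normal with correlation r."""
    cov = [[std1**2,         r * std1 * std2],
           [r * std1 * std2, std2**2        ]]
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal((0.0, 0.0), cov, size=n)
    return samples[:, 0], samples[:, 1]

a, b = correlated_pair(0.5, 100_000, std1=1, std2=2, seed=0)
print(np.corrcoef(a, b)[0, 1])   # close to 0.5 for a large sample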

Euclidean Distances between points

I have an array of points in numpy:
points = rand(dim, n_points)
And I want to:
Calculate all the l2 norms (Euclidean distances) between a certain point and all other points
Calculate all pairwise distances.
and preferably all in NumPy, with no for loops. How can one do it?
If you're willing to use SciPy, the scipy.spatial.distance module (the functions cdist and/or pdist) does exactly what you want, with all the looping done in C (a minimal sketch follows). You can do it with broadcasting too, but there's some extra memory overhead.
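A minimal sketch of the SciPy route, assuming the points are stored as in the question (shape dim x n_points), so the arrays are transposed before the calls since SciPy expects one point per row:
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

points = np.random.rand(3, 10)          # dim x n_points, as in the question

# 1) distances from one chosen point (here column 0) to all points
one = points[:, [0]].T                  # shape (1, dim)
d_to_all = cdist(one, points.T)[0]      # shape (n_points,)

# 2) all pairwise distances as a symmetric n_points x n_points matrix
d_pairs = squareform(pdist(points.T))
print(d_to_all.shape, d_pairs.shape)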
This might help with the second part:
import numpy as np

p = np.random.rand(3, 4)   # this is column-wise so each vector has length 3
np.sqrt(np.sum((p[:, np.newaxis, :] - p[:, :, np.newaxis])**2, axis=0))
which gives
array([[ 0. , 0.37355868, 0.64896708, 1.14974483],
[ 0.37355868, 0. , 0.6277216 , 1.19625254],
[ 0.64896708, 0.6277216 , 0. , 0.77465192],
[ 1.14974483, 1.19625254, 0.77465192, 0. ]])
if p was
array([[ 0.46193242, 0.11934744, 0.3836483 , 0.84897951],
[ 0.19102709, 0.33050367, 0.36382587, 0.96880535],
[ 0.84963349, 0.79740414, 0.22901247, 0.09652746]])
and you can check one of the entries via
np.sqrt(np.sum((p[:, 0] - p[:, 2])**2))
0.64896708223796884
The trick is to insert newaxis and then let broadcasting do the work.
Good luck!
