Vectorizing a Numpy operation on differently shaped matrices - python

I just cannot find a solution to the following problem:
Consider two NumPy arrays, one of shape (10, 64, 10) and one of shape (X, 64).
Array A, of shape (10, 64, 10), represents 10 classes with 64 features each, and for each feature a PDF split into 10 bins --> (Classes, Features, Bins). Each value along the innermost axis is a probability.
[[[0.62, 0.  , 0.  ],
  [0.12, 0.09, 0.01],
  [0.59, 0.01, 0.  ],
  [0.62, 0.  , 0.  ]],

 [[0.62, 0.  , 0.  ],
  [0.59, 0.01, 0.  ],
  [0.62, 0.  , 0.  ],
  [0.62, 0.  , 0.  ]]]
(simplified to (2, 4, 3) so you can test it by copying it directly; the represented classes are "0" and "1")
Array B, of shape (X, 64), holds the instances of a dataset X; each entry is the bin index that the i-th feature of that instance falls into:
[[0, 0, 2, 1],
 [0, 0, 1, 0],
 [0, 2, 1, 0]]
(simplified to (X=3, 4))
What I want to do is: for each row in Array B, e.g. [0, 0, 2, 1], get the per-feature probabilities that an instance with these bin indices belongs to class "0" and to class "1".
The expected output for the first instance here would be:
"0" = [0.62, 0.12, 0.00, 0.00]
"1" = [0.62, 0.59, 0.00, 0.00]
and if possible then for all X instances.
I do not expect any kind of dictionary or the like, just some array that contains the values in a somewhat sorted manner (the ordering may also differ from the example).
Of course, I could do all this in giant nested for-loops, but I want at least some vectorization. Has anybody any good suggestions? Your answer does not have to be a full-fledged solution.
EDIT:
The best nested loop I came up with was
prediction = np.empty((bins.shape[0], histograms.shape[1], histograms.shape[0]))
for n, instance in enumerate(bins):
    for i, instance_bin in enumerate(instance):
        # prob. for every class that the bin given in "instance_bin"
        # of feature "i" corresponds to a possible instance of that class
        prediction[n, i] = histograms[:, i, instance_bin]
where histograms is Array A and bins is Array B.
Please also point out any other bad practice you find in my way of working with NumPy or anything else in this snippet.
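One vectorized equivalent of that loop (a sketch, assuming histograms has shape (classes, features, bins) and bins holds integer bin indices of shape (X, features), as above): move the class axis to the end, then let integer fancy indexing pick one bin per feature for every instance at once.

import numpy as np

# the simplified (2, 4, 3) example from the question: (classes, features, bins)
histograms = np.array([[[0.62, 0.00, 0.00],
                        [0.12, 0.09, 0.01],
                        [0.59, 0.01, 0.00],
                        [0.62, 0.00, 0.00]],
                       [[0.62, 0.00, 0.00],
                        [0.59, 0.01, 0.00],
                        [0.62, 0.00, 0.00],
                        [0.62, 0.00, 0.00]]])
bins = np.array([[0, 0, 2, 1],
                 [0, 0, 1, 0],
                 [0, 2, 1, 0]])              # (X, features)

h = histograms.transpose(1, 2, 0)            # -> (features, bins, classes)
prediction = h[np.arange(h.shape[0]), bins]  # -> (X, features, classes)

# prediction[0, :, 0] == [0.62, 0.12, 0.0, 0.0]   (class "0")
# prediction[0, :, 1] == [0.62, 0.59, 0.0, 0.0]   (class "1")

This produces the same array as the nested loop; np.take_along_axis is another option on newer NumPy versions.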

Related

Vectorize Scipy cubic interpolation for multiple Numpy arrays

I have a np.array of 50 elements. For example:
data = np.array([9.22, 9. , 9.01, ..., 7.98, 6.77, 7.3 ])
For each element of the data np.array, I have an x and y data pair (both with the same length) that I want to interpolate with. For example:
x = np.array([[ 1,    2,    3,    4,    5   ],
              ...,
              [ 1.01, 2.01, 3.02, 4.03, 5.07]])
y = np.array([[ 0.  , 1.  , 0.95, ..., 0.07, 0.06, 0.06],
              ...,
              [ 0.  , 0.99, 0.85, ..., 0.03, 0.05, 0.06]])
I want to interpolate each data element with the respective np.array of x and y.
I have the following solution using map():
def cubic_spline(i):
    return scipy.interpolate.splev(x=data[i],
                                   tck=scipy.interpolate.splrep(x[i], y[i], k=3))

list(map(cubic_spline, np.arange(len(data))))
But I'm wondering if there is a way to do it directly with scipy and numpy to optimize the execution time. Something like:
scipy.interpolate.splev(x=data,
                        tck=scipy.interpolate.splrep(x, y, k=3))
Any suggestions will be appreciated. Thanks in advance.
If you have a single x array and multiple y arrays, newer interpolators (make_interp_spline, PchipInterpolator, etc.) support multidimensional y arrays automatically.
If you really have a collection of pairs of 1D arrays, x and y, where x arrays differ, and you want scipy to loop over these datasets, then no, scipy does not support that. You'd need to loop over them manually.
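For the shared-x case, a minimal sketch with made-up data (not the asker's arrays): a single make_interp_spline call fits one spline per row, and taking the diagonal of the evaluation pairs spline i with its own query point.

import numpy as np
from scipy.interpolate import make_interp_spline

x = np.linspace(1.0, 5.0, 20)                 # one shared grid
y = np.random.rand(50, 20)                    # 50 datasets, one per row
data = np.random.uniform(1.0, 5.0, 50)        # one query point per dataset

spl = make_interp_spline(x, y, k=3, axis=1)   # fits all 50 splines at once
# spl(data) has shape (50, 50): row i is spline i evaluated at every
# query point, so the diagonal pairs spline i with data[i]
vals = spl(data)[np.arange(50), np.arange(50)]

Evaluating every spline at all 50 points does extra work, but the fitting (usually the expensive part) happens in one vectorized call.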

affinity propagation in python

I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 numpy ndarray, which is basically the affinity scores: sim[i, j] has the affinity score of (i, j). Now, when I feed it into the AffinityPropagation function, I get a total of 4 labels.
Here is a similar example with a smaller matrix:
In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3],
     ...:               [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]])

In [216]: x
Out[216]:
array([[ 1. ,  0.2,  0.4,  0. ],
       [ 0.2,  1. ,  0.8,  0.3],
       [ 0.4,  0.8,  1. ,  0.7],
       [ 0. ,  0.3,  0.7,  1. ]])
In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')
In [218]: f = clusterer.fit(x)
In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])
This says (according to Kevin) that the first sample (0th-indexed row) is a cluster (cluster #0) on its own and the rest of the samples are in another cluster (cluster #1). But still, I do not understand this output. What is a sample here? What are the members? I want to have a set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster, and so on.
It looks like a 4-sample x 4-feature matrix..which I do not want. Is this the problem? If so, how to convert this to a nice 4-sample x 4-sample affinity-matrix?
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says
fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters:
X : array-like, shape (n_samples, n_features) or (n_samples, n_samples)
    Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
Thanks!
By your description it sounds like you are working with a "pairwise similarity matrix" x (although your example data does not show that). If this is the case, your matrix should be symmetric, so that sim[i,j] == sim[j,i], with the diagonal values equal to 1. Example similarity data S:
S
array([[ 1.        ,  0.08276253,  0.16227766,  0.47213595,  0.64575131],
       [ 0.08276253,  1.        ,  0.56776436,  0.74456265,  0.09901951],
       [ 0.16227766,  0.56776436,  1.        ,  0.47722558,  0.58257569],
       [ 0.47213595,  0.74456265,  0.47722558,  1.        ,  0.87298335],
       [ 0.64575131,  0.09901951,  0.58257569,  0.87298335,  1.        ]])
Typically, when you already have a distance matrix you would use affinity='precomputed'. But in your case you have similarities, not distances. In this specific example you can transform to a pseudo-distance using 1 - S (the reason being that I don't know whether Affinity Propagation will give you the expected results if you give it a similarity matrix as input):
1 - S
array([[ 0.        ,  0.91723747,  0.83772234,  0.52786405,  0.35424869],
       [ 0.91723747,  0.        ,  0.43223564,  0.25543735,  0.90098049],
       [ 0.83772234,  0.43223564,  0.        ,  0.52277442,  0.41742431],
       [ 0.52786405,  0.25543735,  0.52277442,  0.        ,  0.12701665],
       [ 0.35424869,  0.90098049,  0.41742431,  0.12701665,  0.        ]])
With that being said, I think this is where your interpretation was off:
This says that the first 3-rows are similar, 4th row is a cluster on its own, and the 5th row is also a cluster on its own. Totally of 3 clusters.
The f.labels_ array:
array([0, 1, 1, 1, 0])
is telling you that samples (not pairs of samples) 0 and 4 are in cluster 0, AND that samples 1, 2, and 3 are in cluster 1. You don't need 25 different labels for a 5-sample problem; that wouldn't make sense. Hope this helps a little; try the demo (inspect the variables along the way and compare them with your data), which starts with raw data. It should help you decide if Affinity Propagation is the right clustering algorithm for you.
According to this page (https://scikit-learn.org/stable/modules/clustering.html), you can use a similarity matrix with AffinityPropagation.
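A minimal sketch of that route with the asker's 4 x 4 matrix (random_state is assumed to exist in your scikit-learn version; old releases don't accept it):

import numpy as np
from sklearn.cluster import AffinityPropagation

S = np.array([[1. , 0.2, 0.4, 0. ],
              [0.2, 1. , 0.8, 0.3],
              [0.4, 0.8, 1. , 0.7],
              [0. , 0.3, 0.7, 1. ]])          # symmetric similarities, diag = 1

ap = AffinityPropagation(affinity='precomputed', random_state=0)
labels = ap.fit_predict(S)                    # one cluster label per sample (row)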

numpy mean of rows when speed is a concern

I want to take the mean of the rows of a numpy matrix. So for the input:
array([[ 1,  1, -1],
       [ 2,  0,  0],
       [ 3,  1,  1],
       [ 4,  0, -1]])
my output will be:
array([[ 0.33333333],
       [ 0.66666667],
       [ 1.66666667],
       [ 1.        ]])
I came up with the solution result = np.array([[x] for x in np.mean(my_matrix, axis=1)]), but this function will be called many times on matrices of 40 rows x 10-300 columns, so I would like to make it faster, and this implementation seems slow.
You can do something like this:
>>> my_matrix.mean(axis=1)[:, np.newaxis]
array([[ 0.33333333],
       [ 0.66666667],
       [ 1.66666667],
       [ 1.        ]])
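If your NumPy supports keepdims, a small sketch along the same lines avoids the explicit reindexing:

import numpy as np

my_matrix = np.array([[1, 1, -1],
                      [2, 0, 0],
                      [3, 1, 1],
                      [4, 0, -1]])

# keepdims=True keeps the reduced axis as length 1, so the result
# is already a column vector and no extra indexing step is needed
result = my_matrix.mean(axis=1, keepdims=True)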
If the matrices are fresh and independent there isn't much you can save because the only way to compute the mean is to actually sum the numbers.
If, however, the matrices are obtained from partial views of a single fixed dataset (e.g. you're computing a moving average), then you can use a sum table. For example, after:
st = data.cumsum(0)
you can compute the average of the elements between index x0 and x1 with
avg = (st[x1] - st[x0]) / (x1 - x0)
in O(1) (i.e. the computing time doesn't depend on how many elements you are averaging).
You can even use numpy to compute an array with the moving averages directly with:
res = (st[n:] - st[:-n]) / n
This approach can even be extended to higher dimensions, e.g. computing the sum of the values in a rectangle in O(1) (divide by the rectangle's area to get the average) with
st = data.cumsum(0).cumsum(1)
rectsum = (st[y1][x1] + st[y0][x0] - st[y0][x1] - st[y1][x0])
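A concrete 1-D sketch of the sum-table idea (made-up data; prepending a zero to the table makes every window, including the first, come out of the same formula):

import numpy as np

data = np.random.rand(1000)                   # hypothetical fixed dataset
n = 5                                         # window length

st = np.concatenate(([0.0], data.cumsum()))   # st[i] == data[:i].sum()
moving_avg = (st[n:] - st[:-n]) / n           # mean of every length-n window
assert np.allclose(moving_avg[0], data[:n].mean())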

Scale (or normalize) an array like this in numpy?

In numpy, the original array has the shape (2, 2, 2), like this:
[[[0.2,0.3],[0.1,0.5]],[[0.1,0.3],[0.1,0.4]]]
I'd like to scale the array so that the max value of each position in the innermost tuples is 1, like this:
max([0.2, 0.1, 0.1, 0.1]) is 0.2, and 1/0.2 is 5, so multiply the first element of each inner tuple by 5.
max([0.3, 0.5, 0.3, 0.4]) is 0.5, and 1/0.5 is 2, so multiply the second element of each inner tuple by 2.
So the final array is like this:
[[[1,0.6],[0.5,1]],[[0.5,0.6],[0.5,0.8]]]
I know how to multiply an array by a scalar in numpy, but I'm not sure how to multiply the array by different factors. Does anyone have ideas about this?
If your array is a:
>>> import numpy as np
>>> a = np.array([[[0.2,0.3],[0.1,0.5]],[[0.1,0.3],[0.1,0.4]]])
You can do this:
>>> a / np.amax(a.reshape(4, 2), axis=0)
array([[[ 1. ,  0.6],
        [ 0.5,  1. ]],

       [[ 0.5,  0.6],
        [ 0.5,  0.8]]])
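A shape-agnostic variant (my addition, not part of the answer above): reducing over every axis except the last gives the same result without hard-coding the reshape.

import numpy as np

a = np.array([[[0.2, 0.3], [0.1, 0.5]],
              [[0.1, 0.3], [0.1, 0.4]]])

# max over all axes except the last; keepdims makes the broadcast explicit
scaled = a / a.max(axis=tuple(range(a.ndim - 1)), keepdims=True)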

Matplotlib: How to convert a histogram to a discrete probability mass function?

I have a question regarding the hist() function with matplotlib.
I am writing code to plot a histogram of data whose values vary from 0 to 1. For example:
values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]
bins = np.arange(0, 1.1, 0.1)
a, b, c = plt.hist(values, bins=bins, normed=0)
plt.show()
The code above generates a correct histogram (I could not post an image since I do not have enough reputation). In terms of frequencies, it looks like:
[0 0 2 0 1 1 0 0 1 1]
I would like to convert this output to a discrete probability mass function, i.e. for the above example, I would like to get a following frequency values:
[ 0. 0. 0.333333333 0. 0.166666667 0.166666667 0. 0. 0.166666667 0.166666667 ] # each item in the previous array divided by 6)
I thought I simply need to change the parameter in the hist() function to 'normed=1'. However, I get the following histogram frequencies:
[ 0. 0. 3.33333333 0. 1.66666667 1.66666667 0. 0. 1.66666667 1.66666667 ]
This is not what I expect and I don't know how to get the discrete probability mass function whose sum should be 1.0. A similar question was asked in the following link (link to the question), but I do not think the question was resolved.
I appreciate your help in advance.
The reason is that normed=True gives the probability density function. In probability theory, a probability density function, or density, of a continuous random variable describes the relative likelihood of this random variable taking on a given value.
Let us consider a very simple example.
>>> x = np.arange(0.1, 1.1, 0.1)
>>> x
array([ 0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])
>>> # bin width 0.1
>>> bins = np.arange(0.05, 1.15, 0.1)
>>> np.histogram(x, bins=bins, normed=1)[0]
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> np.histogram(x, bins=bins, normed=0)[0] / float(len(x))
array([ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1])
>>> # change the bin width to 0.2
>>> bins = np.arange(0.05, 1.15, 0.2)
>>> np.histogram(x, bins=bins, normed=1)[0]
array([ 1.,  1.,  1.,  1.,  1.])
>>> np.histogram(x, bins=bins, normed=0)[0] / float(len(x))
array([ 0.2,  0.2,  0.2,  0.2,  0.2])
As you can see above, the probability that x lies in [0.05, 0.15] or in [0.15, 0.25] is 1/10, whereas if you change the bin size to 0.2 then the probability that it lies in [0.05, 0.25] or in [0.25, 0.45] is 1/5. These actual probability values depend on the bin size; the probability density, however, does not. So normalizing by density is the only proper general approach; otherwise one would need to state the bin width in every plot.
So in your case, if you really want to plot the probability value at each bin (and not the probability density), you can simply divide the frequency of each histogram bin by the total number of elements. However, I would suggest not doing this unless you are working with discrete variables and each of your bins represents a single possible value of that variable.
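A sketch of that division for the example in the question (using np.histogram, which also sidesteps the normed argument; in current NumPy/Matplotlib it has been replaced by density):

import numpy as np
import matplotlib.pyplot as plt

values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]
bins = np.arange(0, 1.1, 0.1)

counts, edges = np.histogram(values, bins=bins)
pmf = counts / counts.sum()                   # bar heights now sum to 1.0
plt.bar(edges[:-1], pmf, width=np.diff(edges), align='edge')
plt.show()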
"Plotting a Continuous Probability Function (PDF) from a Histogram – Solved in Python": refer to this blog for a detailed explanation: http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html
Otherwise, you can use the code below.
import numpy as np
import matplotlib.pyplot as plt

# A is the raw data array to be histogrammed
n, bins, patches = plt.hist(A, 40, histtype='bar')
plt.show()

n = n / len(A)               # convert counts to probabilities
n = np.append(n, 0)          # pad so n and bins have the same length
mu = np.mean(n)
sigma = np.std(n)
plt.bar(bins, n, width=(bins[len(bins) - 1] - bins[0]) / 40)
# overlay a Gaussian; the 0.03 factor is an ad-hoc visual scaling
y1 = (1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2))) * 0.03
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()
