I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 numpy ndarray which is basically the affinity scores: sim[i, j] has the affinity score of (i, j). Now, when I feed it into the AffinityPropagation function, I get a total of 4 labels.
Here is a similar example with a smaller matrix:
In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3], [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]]
.....: )
In [216]: x
Out[216]:
array([[ 1. , 0.2, 0.4, 0. ],
[ 0.2, 1. , 0.8, 0.3],
[ 0.4, 0.8, 1. , 0.7],
[ 0. , 0.3, 0.7, 1. ]])
In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')
In [218]: f = clusterer.fit(x)
In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])
This says (according to Kevin) that the first sample (0th-indexed row) is a cluster (cluster #0) on its own and the rest of the samples are in another cluster (cluster #1). But I still do not understand this output. What is a sample here? What are the members? I want to have a set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster, and so on.
It looks like the input is being treated as a 4-sample x 4-feature matrix, which I do not want. Is this the problem? If so, how do I convert it to a proper 4-sample x 4-sample affinity matrix?
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says
fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters:
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
Thanks!
From your description it sounds like you are working with a "pairwise similarity matrix" x (although your example data does not show that). If this is the case, your matrix should be symmetric so that sim[i, j] == sim[j, i], with the diagonal values equal to 1. Example similarity data S:
S
array([[ 1. , 0.08276253, 0.16227766, 0.47213595, 0.64575131],
[ 0.08276253, 1. , 0.56776436, 0.74456265, 0.09901951],
[ 0.16227766, 0.56776436, 1. , 0.47722558, 0.58257569],
[ 0.47213595, 0.74456265, 0.47722558, 1. , 0.87298335],
[ 0.64575131, 0.09901951, 0.58257569, 0.87298335, 1. ]])
Typically when you already have a distance matrix you should use affinity='precomputed'. But in your case you are working with similarities. In this specific example you can transform to a pseudo-distance using 1 - S (the reason being that I am not sure Affinity Propagation will give the expected results if you hand it a similarity matrix as input):
1 - S
array([[ 0. , 0.91723747, 0.83772234, 0.52786405, 0.35424869],
[ 0.91723747, 0. , 0.43223564, 0.25543735, 0.90098049],
[ 0.83772234, 0.43223564, 0. , 0.52277442, 0.41742431],
[ 0.52786405, 0.25543735, 0.52277442, 0. , 0.12701665],
[ 0.35424869, 0.90098049, 0.41742431, 0.12701665, 0. ]])
With that being said, I think this is where your interpretation was off:
This says that the first 3-rows are similar, 4th row is a cluster on its own, and the 5th row is also a cluster on its own. Totally of 3 clusters.
The f.labels_ array:
array([0, 1, 1, 1, 0])
is telling you that samples (not rows) 0 and 4 are in cluster 0 AND that samples 1, 2, and 3 are in cluster 1. You don't need 25 different labels for a 5-sample problem; that wouldn't make sense. Hope this helps a little; try the demo (inspect the variables along the way and compare them with your data), which starts with raw data; it should help you decide if Affinity Propagation is the right clustering algorithm for you.
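If you also want the actual members of each cluster rather than just the label array, here is a minimal sketch using the labels from the 5-sample example above:
import numpy as np

labels = np.array([0, 1, 1, 1, 0])       # f.labels_ from the example above
for c in np.unique(labels):
    print(c, np.where(labels == c)[0])   # 0 -> [0 4], 1 -> [1 2 3]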
According to this page https://scikit-learn.org/stable/modules/clustering.html
you can use a similarity matrix for AffinityPropagation.
I have a np.array of 50 elements. For example:
data = np.array([9.22, 9. , 9.01, ..., 7.98, 6.77, 7.3 ])
For each element of the data np.array, I have an x and y data pair (both with the same length) that I want to interpolate with. For example:
x = np.array([[ 1, 2, 3, 4, 5 ],
...,
[ 1.01, 2.01, 3.02, 4.03, 5.07 ]])
y = np.array([[0. , 1. , 0.95, ..., 0.07, 0.06, 0.06],
...,
[0. , 0.99 , 0.85, ..., 0.03, 0.05, 0.06]])
I want to interpolate each data element with the respective np.array of x and y.
I have the following solution using map():
def cubic_spline(i):
    return scipy.interpolate.splev(x=data[i],
                                   tck=scipy.interpolate.splrep(x[i], y[i], k=3))

list(map(cubic_spline, np.arange(len(data))))
But I'm wondering if there is a way to do it directly with scipy and numpy to optimize the execution time. Something like:
scipy.interpolate.splev(x=data,
                        tck=scipy.interpolate.splrep(x, y, k=3))
Any suggestions will be appreciated. Thanks in advance.
If you have a single x array and multiple y arrays, newer interpolators (make_interp_spline, PchipInterpolator etc) support multidimensional y arrays automatically.
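A minimal sketch of that first case, with made-up sample data (one shared x grid, the y curves stacked as columns):
import numpy as np
from scipy.interpolate import make_interp_spline

x = np.linspace(0, 5, 50)
y = np.stack([np.sin(x), np.cos(x)], axis=1)   # shape (50, 2): two curves over the same grid

spl = make_interp_spline(x, y, k=3)   # fits both columns at once (interpolation axis defaults to 0)
print(spl(2.5))                       # one interpolated value per curve, shape (2,)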
If you really have a collection of pairs of 1D arrays, x and y, where x arrays differ, and you want scipy to loop over these datasets, then no, scipy does not support that. You'd need to loop over them manually.
I want to check whether some vectors are dependent on each other or not using numpy. I found some good suggestions for checking the linear dependency of rows of a matrix in the link below:
How to find linearly independent rows from a matrix
I cannot understand the 'Cauchy-Schwarz inequality' method, which I think is due to my lack of knowledge; however, I tried the eigenvalue method to check linear dependency among columns, and here is my code:
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1]
])
lambdas, V = np.linalg.eig(A)
print(lambdas)
print(V)
and I get:
[ 1. 0. 1.61803399 -0.61803399]
[[ 0. 0.70710678 0.2763932 -0.7236068 ]
[ 0. 0. 0.4472136 0.4472136 ]
[ 0. 0. 0.7236068 -0.2763932 ]
[ 1. -0.70710678 0.4472136 0.4472136 ]]
My question is: what is the relationship between these eigenvalues/eigenvectors and the dependency of the columns of my matrix? How can I tell from these values which columns are dependent on each other and which are independent?
The second column vector corresponds to the eigenvalue of 0.
Just take a look at the API documentation when you get confused.
v : (…, M, M) array
The normalized (unit “length”) eigenvectors, such that the column
v[:,i] is the eigenvector corresponding to the eigenvalue w[i].
You can find the linearly independent columns by QR decomposition as described here.
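For example, here is a rough sketch with plain (unpivoted) QR on the matrix A from the question; a (near-)zero diagonal entry of R flags a column that adds nothing beyond the columns before it, so this simple check assumes the earlier columns are themselves independent:
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1]
])

Q, R = np.linalg.qr(A)
independent = np.abs(np.diag(R)) > 1e-10
print(independent)   # [ True  True  True False] -> the 4th column is linearly dependent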
I just cannot find a solution to the following problem:
Consider two NumPy arrays, one of shape (10, 64, 10) and one of shape (X, 64).
Array A (10, 64, 10) represents 10 classes with 64 features, and over each of these features I have a PDF split into 10 bins --> (Classes, Features, Bins). Each value in the innermost array represents a probability.
[[[0.62, 0., 0. ],
[0.12, 0.09, 0.01],
[0.59, 0.01, 0. ],
[0.62, 0., 0. ]],
[[0.62, 0., 0. ],
[0.59, 0.01, 0. ],
[0.62, 0., 0. ],
[0.62, 0., 0. ]]]
(simplified to (2, 4, 3) so you can test it by copying it directly; the represented classes are "0" and "1")
Array B (X, 64) holds the instances of a dataset: each row is one instance, and each value is the bin index the i-th feature falls into.
[[0, 0, 2, 1],
 [0, 0, 1, 0],
 [0, 2, 1, 0]]
(simplified to (X=3, 4))
What I want to do is, for each row in Array B, e.g. [0, 0, 2, 1]: for every feature, look up the probability of the given bin under class "0" and under class "1".
The expected output for the first instance here would be:
"0" = [0.62, 0.12, 0.00, 0.00]
"1" = [0.62, 0.59, 0.00, 0.00]
and if possible then for all X instances.
I do not expect any kind of dictionary or anything like that, just some array that contains the values in a somewhat sorted manner (it can also be sorted differently than shown in the example).
Of course, I could do all this in giant nested for-loops, but I want at least some vectorization. Has anybody any good suggestions? Your answer does not have to be a full-fledged solution.
EDIT:
The best nested loop I came up with was
prediction = np.empty((bins.shape[0], histograms.shape[1], histograms.shape[0]))
for n, instance in enumerate(bins):
    for i, instance_bin in enumerate(instance):
        # prob., for every class, that feature i of this instance falls into the bin given by "instance_bin"
        prediction[n, i] = histograms[:, i, instance_bin]
histograms = Array A;
bins = Array B
Please also point out any other bad practice you find in my way of working with numpy, or anything else in this snippet.
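One way to vectorize that lookup is NumPy fancy indexing; this is only a sketch, assuming histograms is Array A with shape (classes, features, bins) and bins is Array B, as in the loop above:
import numpy as np

n_instances, n_features = bins.shape
feat_idx = np.arange(n_features)
# pick histograms[c, i, bins[n, i]] for every class c, instance n and feature i;
# the two index arrays broadcast to shape (n_instances, n_features), so the
# result has shape (n_classes, n_instances, n_features)
probs = histograms[:, feat_idx, bins]
# reorder to the loop's (n_instances, n_features, n_classes) layout
prediction = probs.transpose(1, 2, 0)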
Doing k-means cluster analysis, how do I manually define a certain cluster center?
For example I want to say my cluster centers are [1,2,3] and [3,4,5] and now I want to cluster my vectors to the predefined centers.
something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]] ?
To work around my problem, this is what I do at the moment:
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)
It basically defines a cluster for each vector. But it takes ages to compute, as I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster coordinates without the need to compute them with the k-means algorithm (as the center outputs are basically the vector coordinates after I run the algorithm...).
Edit to be more specific about my task:
So what I want is: I have tons of vectors (generated from sentences) and now I want to cluster these. But imagine I have two columns of sentences and I always want to assign a B-column sentence to an A-column sentence, not A-column sentences to each other. That's why I want to set the cluster centers to the A-column vectors and afterwards predict the closest B vectors to these centers. Hope that makes sense.
I am using sklearn kmeans atm
I think I know what you want to do. So you want to manually select the centroids for k-means from some known examples and then perform the clustering to assign the closest data points to your pre-defined centroids.
The parameter you are looking for is the k-means initialization parameter init; see the documentation.
I have prepared a small example that would do exactly this.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
# 5 datapoints with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]
X = np.array(data)
distance_matrix(X,X)
The pairwise distance matrix shows which examples are the closest.
> array([[0. , 0.2 , 1.41421356, 1.3453624 , 0.1 ],
> [0.2 , 0. , 1.42828569, 1.36014705, 0.2236068 ],
> [1.41421356, 1.42828569, 0. , 0.1 , 1.3453624 ],
> [1.3453624 , 1.36014705, 0.1 , 0. , 1.28062485],
> [0.1 , 0.2236068 , 1.3453624 , 1.28062485, 0. ]])
You can select certain data points to be used as your initial centroids:
centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
# [0. 0. 1.]]
kmeans = KMeans(n_clusters=2, init=centroids, max_iter=1) # just run one k-Means iteration so that the centroids are not updated
kmeans.fit(X)
kmeans.labels_
>>> array([0, 0, 1, 1, 0], dtype=int32)
As you can see k-Means labels the data points as expected. You might want to omit the max_iter parameter if you want your centroids to be updated.
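If, as in your edit, you then want to assign the B-column vectors to those fixed A-column centers, the fitted model's predict method does that; a sketch with made-up B vectors:
B = np.array([[0.9, 0.1, 0.0],    # hypothetical B-column vectors
              [0.1, 0.0, 0.8]])
print(kmeans.predict(B))          # each row is assigned to the nearest learned center, e.g. [0 1]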
I want to compute the mean of the rows of a numpy matrix. So for the input:
array([[ 1, 1, -1],
[ 2, 0, 0],
[ 3, 1, 1],
[ 4, 0, -1]])
my output will be:
array([[ 0.33333333],
[ 0.66666667],
[ 1.66666667],
[ 1. ]])
I came up with the solution result = array([[x] for x in np.mean(my_matrix, axis=1)]), but this function will be called many times on matrices of 40 rows x 10-300 columns, so I would like to make it faster; this implementation seems slow.
You can do something like this:
>>> my_matrix.mean(axis=1)[:,np.newaxis]
array([[ 0.33333333],
[ 0.66666667],
[ 1.66666667],
[ 1. ]])
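Equivalently, if my_matrix is a plain ndarray, the keepdims argument keeps the reduced axis as a length-1 dimension, which gives the same (4, 1) result:
>>> my_matrix.mean(axis=1, keepdims=True)   # same result as above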
If the matrices are fresh and independent there isn't much you can save because the only way to compute the mean is to actually sum the numbers.
If, however, the matrices are obtained from partial views of a single fixed dataset (e.g. you're computing a moving average), then you can use a sum table. For example, after:
st = data.cumsum(0)
you can compute the average of the elements between index x0 and x1 with
avg = (st[x1] - st[x0]) / (x1 - x0)
in O(1) (i.e. the computing time doesn't depend on how many elements you are averaging).
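A small self-check of that bookkeeping on made-up data (note the cumulative sum is inclusive, so the window covered is rows x0+1 through x1):
import numpy as np

data = np.random.rand(100, 3)        # hypothetical fixed dataset
st = data.cumsum(0)

x0, x1 = 10, 60
avg = (st[x1] - st[x0]) / (x1 - x0)
print(np.allclose(avg, data[x0 + 1 : x1 + 1].mean(axis=0)))   # True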
You can even use numpy to compute an array with the moving averages directly with:
res = (st[n:] - st[:-n]) / n
This approach can even be extended to higher dimensions, like computing the sum (and from it the average) of the values in a rectangle in O(1) with
st = data.cumsum(0).cumsum(1)
rectsum = (st[y1][x1] + st[y0][x0] - st[y0][x1] - st[y1][x0])