Printing principal features in clusters (Python)

I have an m×n matrix, with m features and n samples. The matrix is called term_individual. The clustering is done using scikit-learn:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(term_individual.T)
centroids = kmeans.cluster_centers_.squeeze()
labels = kmeans.labels_
Each sample is a vector filled with positive integers. If the i-th component of a sample is k, it means that the i-th feature occurs k times in that sample.
I would like to know the most representative features of each cluster. For instance, suppose that the i-th feature is present many times in the first and second sample, causing these samples to be in the same cluster along with many others in which the i-th feature is present as well. I would like to print that feature (or the index associated with it, i.e. print i).
I appreciate your help.

It appears that the question you're asking is which features are most important to each cluster. Essentially, you can compute the mean z-scored value of each feature within each cluster:
from sklearn.preprocessing import scale
import numpy as np

def cluster_feature_importance(X, Y):
    N, M = X.shape  # N samples, M features
    X = scale(X)    # z-score each feature across samples
    out = {}
    for c in set(Y):
        # mean z-score of every feature over the samples in cluster c
        out[c] = dict(zip(range(M), np.mean(X[Y == c, :], axis=0)))
    return out
Here X is your matrix term_individual.T (samples as rows) and Y is a list informing which cluster each sample belongs to, e.g. like so: [0, 0, 1, 1, 0, 3, 2, 2, 3, 0, ...], where Y is n long.
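To then print the most representative features per cluster, you can sort each cluster's feature dictionary by its mean z-score. A minimal sketch using the function above (top_k is just an illustrative parameter):

labels = kmeans.labels_
importance = cluster_feature_importance(term_individual.T, labels)

top_k = 5  # illustrative: how many features to show per cluster
for c, scores in importance.items():
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    print('cluster %d: features %s' % (c, best))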


Find optimal unique neighbour pairs based on closest distance

General problem
First let's explain the problem more generally. I have a collection of points with x,y coordinates and want to find the optimal unique neighbour pairs such that the distance between the neighbours in all pairs is minimised, but points cannot be used in more than one pair.
Some simple examples
Note: points are not ordered and x and y coordinates will both vary between 0 and 1000, but for simplicity, in the examples below x==y and the items are ordered.
First, let's say I have the following matrix of points:
matrix1 = np.array([[1, 1],[2, 2],[5, 5],[6, 6]])
For this dataset, the output should be [0,0,1,1], as points 1 and 2 are closest to each other, as are points 3 and 4, giving pairs 0 and 1.
Second, two points cannot have the same partner. If we have the matrix:
matrix2 = np.array([[1, 1],[2, 2],[4, 4],[6, 6]])
Here pt1 and pt3 are closest to pt2, but pt1 is relatively closer, so the output should again be [0,0,1,1].
Third, if we have the matrix:
matrix3 = np.array([[1, 1],[2, 2],[3, 3],[4, 4]])
Now pt1 and pt3 are again closest to pt2, but this time they are at the same distance. The output should again be [0,0,1,1], as pt4 is closest to pt3.
Fourth, in the case of an uneven number of points, the furthest point should be made nan, e.g.
matrix4 = np.array([[1, 1],[2, 2],[4,4]])
should give output [0,0,nan]
Fifth, in the case there are three or more points with exactly the same distance, the pairing can be random, e.g.
matrix5 = np.array([[1, 1],[2, 2],[3, 3]])
both an output of [0,0,nan] and [nan,0,0] should be fine.
My effort
Using sklearn:
import numpy as np
from sklearn.neighbors import NearestNeighbors
data = matrix3
nbrs = NearestNeighbors(n_neighbors=len(data), algorithm="ball_tree")
nbrs = nbrs.fit(data)
distances,indices = nbrs.kneighbors(data)
This outputs indices:
array([[0, 1, 2, 3],
       [1, 2, 0, 3],
       [2, 1, 3, 0],
       [3, 2, 1, 0]])
The second column provides the nearest points:
nearinds = indices[:,1]
Next, in case there are duplicates in the list, we need to find the nearest distance:
if len(set(nearinds)) != len(nearinds):
    dupvals = [i for i in set(nearinds) if list(nearinds).count(i) > 1]
    for dupval in dupvals:
        dupinds = [i for i, j in enumerate(nearinds) if j == dupval]
        dupdists = distances[dupinds, 1]
Using these dupdists I would be able to find that one is closer to the pt than the other:
if len(set(dupdists)) == len(dupdists):
    duppriority = np.argsort(dupdists)
Using the duppriority values we can give the closer point its correct pairing. But giving the other point its pairing will then depend on its second-nearest pairing and on the distance of all other points to that same point. Furthermore, if both points are at the same distance from their closest point, I would also need to go one layer deeper:
if len(set(dupdists)) != len(dupdists):
    dupdists2 = [distances[i, 2] for i, j in enumerate(nearinds) if j == dupval]
    if len(set(dupdists2)) == len(dupdists2):
        duppriority2 = np.argsort(dupdists2)
etc.
I am kind of stuck here, and I also feel this approach is not very efficient, especially for conditions more complicated than 4 points, where multiple points can be at a similar distance to one or more nearest, second-nearest, etc. points.
I also found that with scipy there is a similar one-line command that could be used to get the distances and indices:
from scipy.spatial import cKDTree
distances,indices = cKDTree(matrix3).query(matrix3, k=len(matrix3))
so I am wondering which of the two would be better to continue with.
More specific problem that I want to solve
I have a list of points and need to match them optimally to a list of points from a previous point in time. The number of points is limited, generally ranging from 2 to 10, and is fairly consistent over time (i.e. it won't jump much between values). Data tends to look like:
prevdat = {'loc': [(300, 200), (425, 400), (400, 300)], 'contid': [0, 1, 2]}
currlocs = [(435, 390), (405, 295), (290, 215), (440, 330)]
Points are generally closer to their own previous position than to the other points'. Thus I should be able to link the identities of the points over time. There are however a number of complications that need to be overcome:
sometimes there is no equal number of current and previous points
points often have the same closest neighbour but should not be able to be allocated the same identity
points sometimes have the same distance to their closest neighbour (but it is very unlikely that they also tie for the 2nd, 3rd nearest neighbours, etc.)
Any advice to help solve my problem would be much appreciated. I hope my examples and effort above will help. Thanks!
This can be formulated as a mixed integer linear programming problem.
In python you can model and solve such problems using cvxpy.
import numpy as np
import cvxpy

def connect_point_cloud(points):
    '''
    Given a set of points, return the pairs of points
    whose summed distance is minimised.
    '''
    N = points.shape[0]
    I, J = np.indices((N, N))
    # pairwise Euclidean distances between all points
    d = np.sqrt(sum((points[I, i] - points[J, i])**2 for i in range(points.shape[1])))
    use = cvxpy.Variable((N, N), integer=True)
    # each entry use[i,j] indicates that point i is connected to point j
    # each pair may count 0 or 1 times
    constraints = [use >= 0, use <= 1]
    # point i must be used in at most one connection
    constraints += [cvxpy.sum(use[i, :]) + cvxpy.sum(use[:, i]) <= 1 for i in range(N)]
    # at least floor(N/2) connections must be present
    constraints += [cvxpy.sum(use) >= N // 2]
    # let the solver handle the problem
    P = cvxpy.Problem(cvxpy.Minimize(cvxpy.sum(cvxpy.multiply(use, d))), constraints)
    dist = P.solve()
    return use.value
Here is a piece of code to visualize the result for a 2D problem:
import matplotlib.pyplot as plt

# create a random set with 50 points
p = np.random.rand(50, 2)
# find the pairs with minimum total distance
pairs = connect_point_cloud(p)
# plot all the points as circles
plt.plot(p[:, 0], p[:, 1], 'o')
# plot lines connecting the paired points
for i1, i2 in zip(*np.nonzero(pairs)):
    plt.plot([p[i1, 0], p[i2, 0]], [p[i1, 1], p[i2, 1]])
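One practical note (an assumption about your environment, not part of the original answer): because use is declared integer, P.solve() needs a mixed-integer-capable solver installed, and depending on your cvxpy installation you may have to select one explicitly, for example:

# pick any installed mixed-integer solver; GLPK_MI is one option
dist = P.solve(solver=cvxpy.GLPK_MI)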

Scipy and the hierarchical clustering input

When performing hierarchical clustering with scipy, it is said in the docs here that scipy.cluster.hierarchy.linkage takes a 1-D condensed distance matrix or a 2-D array of observation vectors as input. However, I generated a simple (symmetric) similarity matrix with a pandas DataFrame, and scipy took it as input with no problem at all, and the resulting dendrogram is just fine.
Can someone explain how this is possible? Do I have outdated docs, or...?
The docs are accurate; they just don't tell you what will happen if you actually try to use an uncondensed distance matrix.
The function raises a warning but still runs because it first tries to convert input into a numpy array. This creates a 2-D array from your 2-D DataFrame while at the same time recognizing that this likely isn't the expected input based on the array dimensions and symmetry.
Depending on the complexity of your input data (e.g. cluster separation, number of clusters, distribution of data across clusters), the clustering may still look like it succeeds in generating a suitable dendrogram, as you noted. This makes sense conceptually, because the result is a clustering of n similarity vectors, which may be well-separated in simple cases.
For example, here is some synthetic data with 150 observations and 2 clusters:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cosine, pdist, squareform

np.random.seed(42)  # for repeatability
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
obs_df = pd.DataFrame(np.concatenate((a, b)), columns=['x', 'y'])
obs_df.plot.scatter(x='x', y='y')

Z = linkage(obs_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)
If you generate a similarity matrix, this is an n x n matrix that could still be clustered as if it were n vectors. I can't plot 150-D vectors, but plotting the magnitude of each vector and then the dendrogram seems to confirm a similar clustering.
def similarity_func(u, v):
    return 1 - cosine(u, v)

dists = pdist(obs_df, similarity_func)
sim_df = pd.DataFrame(squareform(dists), columns=obs_df.index, index=obs_df.index)
sim_array = np.asarray(sim_df)
sim_lst = []
for vec in sim_array:
    mag = np.linalg.norm(vec, ord=1)
    sim_lst.append(mag)
pd.Series(sim_lst).plot.bar()

Z = linkage(sim_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)
What we're really clustering here is, for each point, a vector whose components are that point's similarity measures to each of the 150 points. In other words, we're clustering a collection of each point's intra- and inter-cluster similarity measures. Since the two clusters have different sizes, a point in one cluster will have a rather different collection of intra- and inter-cluster similarities than a point in the other cluster. Hence, we get two primary clusters proportionate to the number of points in each cluster, just as we did in the first step.
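For reference, the documented input format from the question is the condensed form, so a warning-free call would pass the pdist output directly. A minimal sketch reusing the objects above:

condensed = pdist(obs_df)                # Euclidean by default; 1-D condensed distances
Z = linkage(condensed, 'ward')

# if you start from a square distance matrix, condense it first
square = squareform(condensed)           # n x n symmetric matrix, zero diagonal
Z = linkage(squareform(square), 'ward')  # squareform also condenses back to 1-D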

Generating linearly independent columns for a matrix

As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement the same using numpy and python?
If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is very, very small (assuming N >= d).
Let A = [B | x] where A is a N x d matrix, B is an N x (d-1) matrix with independent column vectors, and x is a column vector with N elements. The set of all x with no constraints is a subspace with dimension N, while the set of all x such that x is NOT linearly independent with all column vectors in B would be a subspace with dimension d-1 (since every column vector in B serves as a basis vector for this set).
Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not being linearly independent will not be exactly zero. The more possible values each element can take, in general, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you choose elements at random. You can always verify after the fact that the matrix has linearly independent column vectors by calculating its column-echelon form. You can generate such a matrix with np.random.rand(N, d).
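As a quick empirical check of that claim (a sketch with arbitrary sizes), you can count how often a random matrix comes out rank-deficient:

import numpy as np

N, d, trials = 50, 20, 1000
deficient = sum(
    np.linalg.matrix_rank(np.random.rand(N, d)) < d
    for _ in range(trials)
)
print(deficient)  # expected to be 0 in practice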
One way to guarantee random independent columns would be to iteratively add a random column and check the matrix rank:
import numpy as np

N, d = 1000, 200
M = np.random.rand(N, 1)
r = 1  # current matrix rank
while r < d:
    t = np.random.rand(N, 1)  # candidate column
    if np.linalg.matrix_rank(np.hstack([M, t])) > r:
        M = np.hstack([M, t])
        r += 1
However, this process is quite slow, since it requires computing the rank of a matrix at least d times.
A faster approach would be to generate a random Nxd 2d-array and check its rank:
M = np.random.rand(N, d)
r = np.linalg.matrix_rank(M)
while r < d:
    M = np.random.rand(N, d)
    r = np.linalg.matrix_rank(M)
This loop is likely never to be entered; however, the check is there so that, in the unlikely case of a rank deficiency, another random 2d-array is generated.
You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring that is to use the principal component scores. A brief explanation from wiki:
Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and several related procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

N = 50
d = 20
a = np.random.normal(0, 1, (N, d))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(np.corrcoef(np.transpose(a)), ax=ax[0], cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)), ax=ax[1], cmap="YlGnBu")
The heatmap of the raw matrix (left) shows you can still have some degree of correlation by chance (drawing from a standard normal, but with a small sample size), while the PC scores (right) are uncorrelated.
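As a quick numerical check (a sketch reusing the variables above), the correlations between distinct PC scores should vanish up to floating-point error:

# off-diagonal correlations of the PC scores are ~0 (up to float error)
corr = np.corrcoef(pc_scores.T)
print(np.abs(corr - np.eye(d)).max())  # on the order of machine precision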

Cosine similarity for very large dataset

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits perfectly in my memory, but I get a MemoryError during the internal np.dot() call.
Here's my use-case and how I am currently tackling it.
Here's my parent vector of 100 dimensions, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):
parent_vector = [1, 2, 3, 4 ..., 100]
Here are my child vectors (with some made-up random numbers for this example)
child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]
My final goal is to get top-N child vectors (with their names such as child_vector_1 and their corresponding cosine score) who have very high cosine similarity with the parent vector.
My current approach (which I know is inefficient and memory consuming):
Step 1: Create a super-dataframe of following shape
parent_vector 1, 2, 3, ....., 100
child_vector_1 2, 3, 4, ....., 101
child_vector_2 3, 4, 5, ....., 102
child_vector_3 4, 5, 6, ....., 103
......................................
child_vector_500000 3, 4, 5, ....., 103
Step 2: Use
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
to get pair-wise cosine similarity between all vectors (shown in above dataframe)
Step 3: Make a list of tuple to store the key such as child_vector_1 and value such as the cosine similarity number for all such combinations.
Step 4: Get the top-N using sort() of list -- so that I get the child vector name as well as its cosine similarity score with the parent vector.
PS: I know this is highly inefficient, but I couldn't think of a better way to compute the cosine similarity between each child vector and the parent vector faster and to get the top-N values.
Any help would be highly appreciated.
Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between any two children. To store these distances you would need a (500000, 500000) array of floats, which at 8 bytes per entry works out to about 2 TB of memory.
Thankfully there is an easy solution for your problem: if I understand you correctly, you only want the distances between the children and the parent, which results in a vector of length 500000 that is easily stored in memory.
To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.rand(500000,100))
# here I assume the parent vector is stored as the first row in the dataframe,
# but you could also store it separately
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel()
n = 10 # or however many you want
n_largest = df['distances'].nlargest(n + 1) # this contains the parent itself as the most similar entry, hence n+1 to get n children
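If you also need the child names, the index of n_largest holds the corresponding row labels. A small sketch, under the same assumption that the parent sits in row 0:

top_children = n_largest.iloc[1:]    # drop the parent itself
print(top_children.index.tolist())   # row labels of the top-n children
print(top_children.values)           # their cosine similarities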
Hope that solves your question.
I couldn't even fit the entire corpus in memory, so a solution for me was to load it gradually and compute cosine similarity on smaller batches, always retaining the least/most similar n items (depending on your use case):
import numpy as np
from scipy.sparse import csr_matrix

def get_bottom_k(corpus: list, k: int):
    pairwise_similarity = make_similarity_matrix(corpus)  # returns pairwise similarity matrix
    # similarity index for each item in corpus; bigger means more similar to the other texts
    sums = csr_matrix.sum(pairwise_similarity, axis=1)
    sums = np.squeeze(np.asarray(sums))
    indexes = np.argpartition(sums, k, axis=0)[:k]  # bottom k in terms of similarity (-k and [-k:] for top)
    return [corpus[i] for i in indexes]

data = []
iterations = 0
with open('/media/corpus.txt', 'r') as f:
    for line in f:
        data.append(line)
        if len(data) <= 1000:
            pass
        else:
            print('Getting bottom k, iteration {x}'.format(x=iterations))
            data = get_bottom_k(data, 500)
            iterations += 1
filtered = get_bottom_k(data, 500)  # final 500 most dissimilar texts in the corpus
This is by no means an optimal solution, but it's the easiest one I found so far; maybe it will be of help to someone.
This solution is insanely fast:
child_vectors = np.array([child_vector_1, child_vector_2, ....., child_vector_500000])
# parent_vector is 1-D, so normalise it with a plain norm and make it 2-D for the dot product
input_norm = np.atleast_2d(parent_vector) / np.linalg.norm(parent_vector)
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1)[:, np.newaxis]
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities
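If you also need to know which children those top scores belong to, keep the indices rather than sorting the scores away. A sketch reusing the arrays above (n is illustrative):

sims = np.dot(input_norm, embed_norm.T)[0]  # similarity of the parent to every child
n = 10                                      # illustrative top-N
top = np.argpartition(sims, -n)[-n:]        # indices of the top-n children, unordered
top = top[np.argsort(sims[top])[::-1]]      # order those n by similarity, descending
print(top, sims[top])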

How to find linearly independent rows from a matrix

How to identify the linearly independent rows of a matrix? For instance, take the matrix
[0, 1, 0, 0]
[0, 0, 1, 0]
[0, 1, 1, 0]
[1, 0, 0, 1]
The 4th row is independent.
First, your 3rd row is linearly dependent on the 1st and 2nd rows. However, your 1st and 4th columns are linearly dependent.
Two methods you could use:
Eigenvalue
If one eigenvalue of the matrix is zero, its corresponding eigenvector is linearly dependent. The documentation of eig states that the returned eigenvalues are repeated according to their multiplicity and not necessarily ordered. However, assuming the eigenvalues correspond to your row vectors, one method would be:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

lambdas, V = np.linalg.eig(matrix.T)
# the linearly dependent row vectors
# (with floating-point eigenvalues, compare against a tolerance,
# e.g. np.abs(lambdas) < 1e-10, rather than testing == 0 exactly)
print(matrix[lambdas == 0, :])
Cauchy-Schwarz inequality
To test linear dependence of vectors and figure out which ones, you could use the Cauchy-Schwarz inequality. Basically, if the absolute value of the inner product of two vectors equals the product of their norms, the vectors are linearly dependent. Here is an example for the columns:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

print(np.linalg.det(matrix))

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[0]):
        if i != j:
            inner_product = np.inner(
                matrix[:, i],
                matrix[:, j]
            )
            norm_i = np.linalg.norm(matrix[:, i])
            norm_j = np.linalg.norm(matrix[:, j])

            print('I: ', matrix[:, i])
            print('J: ', matrix[:, j])
            print('Prod: ', inner_product)
            print('Norm i: ', norm_i)
            print('Norm j: ', norm_j)
            if np.abs(inner_product - norm_j * norm_i) < 1E-5:
                print('Dependent')
            else:
                print('Independent')
Testing the rows follows a similar approach.
You could then extend this to test all combinations of vectors, but I imagine this solution scales badly with size.
With sympy you can find the linearly independent rows using sympy.Matrix.rref:
>>> import sympy
>>> import numpy as np
>>> mat = np.array([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]]) # your matrix
>>> _, inds = sympy.Matrix(mat).T.rref() # to check the rows you need to transpose!
>>> inds
[0, 1, 3]
This basically tells you that rows 0, 1 and 3 are linearly independent, while row 2 isn't (it's a linear combination of rows 0 and 1).
Then you could keep just these rows with slicing:
>>> mat[inds]
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[1, 0, 0, 1]])
This also works well for rectangular matrices (not only square ones).
I edited the code for the Cauchy-Schwarz inequality, which scales better with dimension. The inputs are the matrix and its dimension, while the output is a new rectangular matrix which contains along its rows the linearly independent columns of the starting matrix. This works under the assumption that the first column is never null, but it can readily be generalized to handle that case too.
Another thing I observed is that 1e-5 seems to be a "sloppy" threshold, since some particular pathological vectors were found to be linearly dependent with it; 1e-4 doesn't give me the same problems.
I hope this is of some help: it was pretty difficult for me to find a really working routine to extract linearly independent vectors, so I'm willing to share mine. If you find some bugs, please report them!
from numpy import dot, zeros, absolute
from numpy.linalg import matrix_rank, norm

def find_li_vectors(dim, R):
    r = matrix_rank(R)
    index = zeros(r, dtype=int)  # this will save the positions of the li columns in the matrix
    counter = 0
    index[0] = 0  # without loss of generality we pick the first column as linearly independent
    j = 0  # therefore the second index is simply 0
    for i in range(R.shape[0]):  # loop over the columns
        if i != j:  # if the two columns are not the same
            inner_product = dot(R[:, i], R[:, j])  # compute the scalar product
            norm_i = norm(R[:, i])  # compute norms
            norm_j = norm(R[:, j])
            # inner product and the product of the norms are equal only if the two vectors are parallel
            # therefore we are looking for the ones which exhibit a difference which is bigger than a threshold
            if absolute(inner_product - norm_j * norm_i) > 1e-4:
                counter += 1  # counter is incremented
                index[counter] = i  # index is saved
                j = i  # j is refreshed
                # do not forget to refresh j: otherwise you would compare only against the first column!
    R_independent = zeros((r, dim))
    i = 0
    # now save everything in a new matrix
    while i < r:
        R_independent[i, :] = R[index[i], :]
        i += 1
    return R_independent
I know this was asked a while ago, but here is a very simple (although probably inefficient) solution. Given an array, the following finds a set of linearly independent vectors by progressively adding a vector and testing if the rank has increased:
from numpy.linalg import matrix_rank
def LI_vecs(dim, M):
    LI = [M[0]]
    for i in range(dim):
        tmp = []
        for r in LI:
            tmp.append(r)
        tmp.append(M[i])  # set tmp = LI + [M[i]]
        if matrix_rank(tmp) > len(LI):  # test if M[i] is linearly independent from all (row) vectors in LI
            LI.append(M[i])  # note that matrix_rank does not need to take in a square matrix
    return LI  # return set of linearly independent (row) vectors

# Example
mat = [[1, 2, 3, 4], [4, 5, 6, 7], [5, 7, 9, 11], [2, 4, 6, 8]]
LI_vecs(4, mat)
I interpret the problem as finding rows that are linearly independent from other rows.
That is equivalent to finding rows that are linearly dependent on other rows.
Gaussian elimination, treating numbers smaller than a threshold as zero, can do that. It is faster than finding the eigenvalues of the matrix, testing all combinations of rows with the Cauchy-Schwarz inequality, or using singular value decomposition.
See:
https://math.stackexchange.com/questions/1297437/using-gauss-elimination-to-check-for-linear-dependence
Problem with floating point numbers:
http://numpy-discussion.10968.n7.nabble.com/Reduced-row-echelon-form-td16486.html
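A minimal sketch of this idea (my own illustration, not from the linked threads): eliminate each row against an orthonormal basis of the rows kept so far, treating residual norms below a threshold as zero:

import numpy as np

def independent_row_indices(A, tol=1e-10):
    """Indices of a maximal set of linearly independent rows of A."""
    basis = []  # orthonormal vectors spanning the rows kept so far
    kept = []
    for i, row in enumerate(np.asarray(A, dtype=float)):
        r = row.copy()
        for q in basis:              # eliminate r against the kept rows
            r -= (q @ r) * q
        if np.linalg.norm(r) > tol:  # a nonzero residual means a new direction
            basis.append(r / np.linalg.norm(r))
            kept.append(i)
    return kept

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]])
print(independent_row_indices(A))  # [0, 1, 3]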
With regards to the following discussion:
Find dependent rows/columns of a matrix using Matlab?
from sympy import *
A = Matrix([[1,1,1],[2,2,2],[1,7,5]])
print(A.nullspace())
It is obvious that the first and second rows are multiples of each other.
If we execute the above code we get [-1/3, -2/3, 1]. The indices of the zero elements in the null space show independence. But why is the third element here not zero? If we multiply the A matrix with the null space, we get a zero column vector. So what's wrong?
The answer which we are looking for is the null space of the transpose of A.
B = A.T
print(B.nullspace())
Now we get [-2, 1, 0], which shows that the third row is independent.
Two important notes here:
Consider whether we want to check the row dependencies or the column dependencies.
Notice that the null space of a matrix is not equal to the null space of the transpose of that matrix unless it is symmetric.
You can find the vectors spanning the column space of the matrix by using the columnspace() method of SymPy's Matrix object. Automatically, these are the linearly independent columns of the matrix.
import sympy as sp
import numpy as np

M = sp.Matrix([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 1]])

for i in M.columnspace():
    print(np.array(i))
    print()

# The output is the following:
# [[0]
#  [0]
#  [1]]
# [[1]
#  [0]
#  [0]]
# [[0]
#  [1]
#  [0]]
