Generating linearly independent columns for a matrix - python

As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement the same using numpy and python?

If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is vanishingly small (assuming N >= d).
Let A = [B | x], where A is an N x d matrix, B is an N x (d-1) matrix with linearly independent columns, and x is a column vector with N elements. The set of all possible x is a space of dimension N, while the set of all x that are linearly dependent on the columns of B is only the column space of B, a subspace of dimension d-1 (the column vectors of B form a basis for it).
Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not being linearly independent will not be exactly zero. The more possible values each element can take, in general, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you choose the elements at random. You can always verify after the fact that the matrix has linearly independent columns by computing its rank (or its column-echelon form). You could do this with np.random.rand(N, d).
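As a quick sketch of that recommendation (the dimensions here are placeholders, not taken from the question):
import numpy as np

N, d = 1000, 200              # placeholder dimensions with N >= d
M = np.random.rand(N, d)      # elements chosen at random

# with continuous random entries, full column rank is all but guaranteed
assert np.linalg.matrix_rank(M) == d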

One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:
import numpy as np

N, d = 1000, 200
M = np.random.rand(N, 1)
r = 1  # current matrix rank
while r < d:
    t = np.random.rand(N, 1)
    if np.linalg.matrix_rank(np.hstack([M, t])) > r:
        M = np.hstack([M, t])
        r += 1
However, this process is quite slow, since it requires computing the rank of a matrix at least d times.
A faster approach would be to generate a random Nxd 2d-array and check its rank:
M = np.random.rand(N, d)
r = np.linalg.matrix_rank(M)
while r < d:
    M = np.random.rand(N, d)
    r = np.linalg.matrix_rank(M)
This version is unlikely to ever enter the while loop; the rank check is just a safeguard, and if the matrix does happen to be rank-deficient we simply generate another random 2d-array.

You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring uncorrelated columns is to use the principal component scores. A brief explanation from the wiki:
Repeating this process yields an orthogonal basis in which different
individual dimensions of the data are uncorrelated. These basis
vectors are called principal components, and several related
procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

N = 50
d = 20
a = np.random.normal(0, 1, (N, d))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.heatmap(np.corrcoef(np.transpose(a)), ax=ax[0], cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)), ax=ax[1], cmap="YlGnBu")
The left heatmap (raw data) shows that you can still have some degree of correlation by chance (drawing from a standard normal, but with a small sample size), while the principal component scores on the right are uncorrelated.
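If you want a hard guarantee rather than a near-certainty, a further option (an aside, not part of the answers above) is to orthonormalize a random matrix with a QR decomposition; orthonormal columns are in particular linearly independent:
import numpy as np

N, d = 1000, 200
Q, _ = np.linalg.qr(np.random.rand(N, d))  # reduced QR: Q has shape (N, d)
M = Q                                      # columns are orthonormal, hence independent
assert np.linalg.matrix_rank(M) == d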

Related

Generate M by N matrix

I am trying to generate an M by N matrix where N is the number of multivariate normal random variables and M is the number of samples. I want the matrix to be such that each column has unit variance and the correlation between any two columns is Y.
I have tried
M = 100
N = 3
Y = 0.5
mean = (0,0,0)
cov = np.array([[0.5,0.5,0.5],[0.5,0.5,0.5],[0.5,0.5,0.5]])
np.random.multivariate_normal(mean, cov, (M,N))
it returns an np array consisting of M sub-arrays, where each sub-array consists of N values that are all the same.
Can anyone advise how to generate an M by N matrix such that each column has unit variance and the correlation between any two columns is Y, where N is the number of standard multivariate normal random variables?
So it turns out that I am almost entirely wrong. The reason your values are not deviating is that your covariance matrix implies a correlation of 1 between the variables: every variance is 0.5 and every covariance is 0.5, so every pairwise correlation is 0.5/0.5 = 1. As a result, the values do not vary relative to one another.
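For reference, a minimal sketch of how the covariance matrix could be built so that each column has unit variance and every pairwise correlation is Y, passing size=M so the result is an M x N array (this construction is a suggestion, not part of the original answer):
import numpy as np

M, N, Y = 100, 3, 0.5
mean = np.zeros(N)
# unit variance on the diagonal, correlation Y everywhere else
cov = Y * np.ones((N, N)) + (1 - Y) * np.eye(N)

samples = np.random.multivariate_normal(mean, cov, size=M)  # shape (M, N)
print(np.corrcoef(samples, rowvar=False))                   # off-diagonals near 0.5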

Increase speed of finding minimum element in a 2-D numpy array which has many entries set to np.inf

I have a 16000*16000 matrix and want to find the minimum entry. This matrix is a distance matrix, so it is symmetric about diagonal. In order to get exactly one minimum at each time, I set the lower triangle and the diagonal to np.inf. Below is an 5*5 matrix example:
inf a0 a1 a2 a3
inf inf a4 a5 a6
inf inf inf a7 a8
inf inf inf inf a9
inf inf inf inf inf
I want to find the index of the minimum entry only in the upper triangle. However, when I use np.argmin(), it will still go through the whole matrix. Is there any way to "ignore" the lower triangle and increase speed?
I have tried many methods, such as:
Use masked array
Use triu_indices() to extract the upper triangle and then find the minimum
Set the entries in the lower triangle and diagonal to None instead of np.inf, then use np.nanargmin() to find the minimum
However, all of the methods I tried are slower than using np.argmin() directly.
Thank you for your time, I would appreciate it if you can help me.
UPDATE 1: Some background of my problem
In fact, I am implementing a modified version of agglomerative clustering from scratch. The original dataset is 16000*64 (I have 16000 points, each 64-dimensional). At first, I build 16000 clusters, each containing exactly one point. In each iteration, I find the nearest 2 clusters and merge them, until a termination condition is met.
To avoid repeated calculation of distances, I store the distances in a 16000*16000 distance matrix, with the diagonal and lower triangle set to np.inf. In each iteration, I find the smallest entry in the distance matrix; the index of this entry corresponds to the 2 nearest clusters, say c_i and c_j. Afterwards, in the distance matrix, I fill the 2 rows and 2 columns corresponding to c_i and c_j with np.inf, meaning that these 2 clusters have been merged and no longer exist. Then I calculate an array of distances between the new cluster and all remaining clusters and put that array in the row and column corresponding to c_i.
Let me make it clear: in the whole process, the size of the distance matrix never changes. In each iteration, of the 2 rows and 2 columns corresponding to the 2 nearest clusters found, I fill one row and one column with np.inf and put the distance array of the new cluster in the other row and column.
Now the bottleneck of the performance is finding the smallest entry in the distance matrix, which takes 0.008s. The run time of the whole algorithm is about 40 minutes.
UPDATE 2: How I compute distance matrix
Below is the code I used to generate distance matrix:
from sklearn.metrics import pairwise_distances
dis_matrix = pairwise_distances(dataset)
for i in range(num_dim):
    for j in range(num_dim):
        if i >= j or (cluster_list[i].contain_reference_point and cluster_list[j].contain_reference_point):
            dis_matrix[i][j] = np.inf
Nevertheless, I need to say that generating the distance matrix is not the bottleneck in the algorithm now, because I generate it only once, and then I just update the distance matrix (as mentioned above).
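As a side note, the double loop above can be replaced by a vectorized masking step. The sketch below only handles the i >= j part, deliberately leaves out the reference-point condition from the original loop, and uses a smaller stand-in dataset:
import numpy as np
from sklearn.metrics import pairwise_distances

dataset = np.random.rand(1000, 64)        # stand-in for the 16000 x 64 data
dis_matrix = pairwise_distances(dataset)

# set the diagonal and the lower triangle to inf in one vectorized step
il = np.tril_indices(dis_matrix.shape[0])
dis_matrix[il] = np.inf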
If we back up a step, assuming the distance matrix is symmetric, based on an (i, n) shaped array with i points in n dimensions, and the distance metric is Euclidean, this can be done very efficiently with a KDTree data structure:
i = 16000
n = 3
points = np.random.rand(i, n) * 100
from scipy.spatial import cKDTree
tree = cKDTree(points)
close = tree.sparse_distance_matrix(tree,
                                    max_distance=1,  # can tune for your application
                                    output_type="coo_matrix")
close.eliminate_zeros()
ix = close.data.argmin()
i, j = (close.row[ix], close.col[ix])
This is pretty blazing fast, but whether it is useful for you depends on your application and distance metric.
If you don't need the distance matrix at all (and only need indices), you can do:
d, ix = tree.query(points, 2)
j, i = ix[d[:, 1].argmin()]
EDIT: this doesn't work well for high-dimensionality data. Since you're up against the curse of dimensionality, you'll probably need to brute force. I recommend scipy.spatial.distance.pdist for this:
from scipy.spatial.distance import pdist
D = pdist(points, metric='seuclidean')  # this only returns the upper triangle, as a condensed 1D array
ix = np.argmin(D)

def ix_to_ij(ix, n):
    sorter = np.arange(n-1)[::-1].cumsum()
    j = np.searchsorted(sorter, ix)
    i = ix - sorter[j]
    return i, j

ix_to_ij(ix, 16000)
Not completely tested but I think that should work.
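As a cross-check of that index mapping, here is a sketch that recovers (i, j) with np.triu_indices instead; it enumerates pairs in the same order pdist uses, at the cost of O(n^2) memory for the index arrays, and uses a smaller stand-in array:
import numpy as np
from scipy.spatial.distance import pdist

points = np.random.rand(1000, 64)            # stand-in; the real set is 16000 x 64
D = pdist(points)                            # condensed upper-triangle distances
ix = np.argmin(D)

# np.triu_indices(n, k=1) lists pairs in the same order as pdist's condensed output
rows, cols = np.triu_indices(len(points), k=1)
i, j = rows[ix], cols[ix]
print(i, j, D[ix])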
One thing I can think of that might give you a boost is using numba.njit:
import numpy as np
from numba import njit

@njit
def upper_min(m):
    x = np.inf
    for r in range(0, m.shape[0] - 1):
        for c in range(r + 1, m.shape[1]):
            if m[r, c] < x:  # keep the smallest upper-triangle entry seen so far
                x = m[r, c]
    return x
Be sure not to time it the first time you run it. The compilation is slow.
Another way could be to use sparse matrices somehow.
You can select upper triangle of array by masking, simple example:
import numpy as np
arr = np.array([[0, 1], [2, 3]])
# Mask of upper triangle
mask = np.array([[True, True],[False, True]])
# Masking returns only upper triangle as 1D array
min_val = np.min(arr[mask]) # Equal to np.min([0, 1, 3])
So instead of setting the lower triangle to inf, you generate a boolean mask where the lower triangle is False and the upper triangle is True, apply it as arr[mask] to get a 1D array of the upper triangle, and then take the minimum of that.
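A minimal sketch of that idea for the full case, using np.triu to build the mask and recovering the 2D index afterwards (the mask costs one byte per entry; a smaller stand-in size is used here):
import numpy as np

N = 2000                                   # stand-in size; the real matrix is 16000 x 16000
arr = np.random.rand(N, N)

# boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones((N, N), dtype=bool), k=1)

flat_ix = np.argmin(arr[mask])             # position within the masked 1D view ...
rows, cols = np.nonzero(mask)              # ... mapped back to 2D coordinates
i, j = rows[flat_ix], cols[flat_ix]
print(i, j, arr[i, j])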

Most efficient algorithm in Python to generate all 6x6 (0,1) matrices with sum in columns and rows lower than 2?

I am working on a problem which requires me to find all 6x6 (0,1) matrices with some given properties:
The sum of a row/column must be lower than 2.
The matrices are not symmetrical.
I am using this code:
import numpy as np
import itertools as it
n=6
li=[]
for i in it.product([0, 1], repeat=n**2):
    if (np.reshape(np.array(i), (n, n)).sum(axis=1) < 2).all() and (np.reshape(np.array(i), (n, n)).sum(axis=0) < 2).all():
        if (np.transpose(np.reshape(np.array(i), (n, n))) != np.reshape(np.array(i), (n, n))).any():
            li.append(np.reshape(np.array(i), (n, n)))
The problem is that this method has to go through all 68719476736 (0,1) matrices. After this piece of code I still have to impose extra conditions.
Is there a faster algorithm to find this list of matrices?
Edit:
The problem I am working on is one to find unique adjacency matrices (graph theory) up to a certain equivalence class. For instance, in the 4x4 version of the problem I wanted to find all (0,1) matrices such that:
The sum in a row/column is lower than 2;
Are not symmetrical, i.e. A^T != A;
Also A^T != P^T A P, where P is a matrix representation of the dihedral group D8 (order 8) which is a subgroup of S4.
After this last step I get a certain number of matrices. If A relates to B through the relation B = P^T A P, then they represent the same matrix, and I keep only one representative of each equivalence class.
In the 4x4 problem I go from 65536 to 3.
My estimate of the result after sorting through the first condition (sums) is 46080. In the 6x6 problem, the group of transformations P is of order 48.
You have trouble with your math: if the row/column sum must be less than 2, it can only be 0 or 1 -- which means every row/column can contain at most one non-zero element. Counting row by row (the single 1 can sit in any of the 6 columns, or the row can be all zeros), that is at most 7^6 = 117649 possible matrices.
Roughly 100k matrices is perfectly doable by brute force, with additional filtering to remove vertical/horizontal flips and diagonal symmetries.
Here's a simple code that should get you started:
import numpy as np
from itertools import permutations

for perm in permutations(range(7), 6):  # there are only 5040 permutations
    m = np.zeros((6, 6))                # start with an empty matrix
    for i, j in enumerate(perm):
        if j == 6:
            continue                    # leave this row all zeros
        m[i, j] = 1                     # put `1` in the current (i)-th row, (j)-th column
    # here you check `m` for symmetry and save it somewhere or not
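For completeness, a self-contained sketch of the same loop with that final comment filled in; the results list, the symmetry test, and the count are additions for illustration, not part of the original answer:
import numpy as np
from itertools import permutations

results = []
for perm in permutations(range(7), 6):
    m = np.zeros((6, 6))
    for i, j in enumerate(perm):
        if j == 6:
            continue                  # leave this row all zeros
        m[i, j] = 1
    if not np.array_equal(m, m.T):    # condition 2: A^T != A
        results.append(m)
print(len(results))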

Optimizing histogram distance metric for two matrices in Python

I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is compute the chi-square distance between the two matrices for every pair of samples. That is, each row in matrix A is compared to every row in matrix B, resulting in a final matrix C of size NxN, where C[i,j] is the chi-square distance between the A[i] and B[j] histograms.
Here is my python code that does the job:
import numpy as np

def chi_square(histA, histB):
    eps = 1.e-10
    d = np.sum((histA - histB)**2 / (histA + histB + eps))
    return 0.5 * d

def matrix_cost(A, B):
    a, _ = A.shape
    b, _ = B.shape
    C = np.zeros((a, b))
    for i in range(a):
        for j in range(b):
            C[i, j] = chi_square(A[i], B[j])
    return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could broadcast the arrays and use numpy's efficient vectorization of the operations on those arrays.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
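As a usage sketch (the shapes are those from the question; A and B here are just random stand-ins for histograms):
import numpy as np

A = np.random.rand(100, 70)   # stand-in histograms, one per row
B = np.random.rand(100, 70)

Abroad = A[:, np.newaxis, :]  # shape (100, 1, 70), prepared for broadcasting
C = np.sum((Abroad - B)**2 / (Abroad + B), axis=-1) / 2.
print(C.shape)                # (100, 100): C[i, j] compares A[i] with B[j]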
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A, B):
    C = np.empty((A.shape[0], B.shape[0]))
    for rowind, row in enumerate(A):
        C[rowind, :] = np.sum((row - B)**2/(row + B), axis=-1)/2.
    return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result

What's wrong with my PCA?

My code:
from numpy import *

def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues and not 150 like the eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to sum to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divided the eigenvalues by len(data) I would get the result I want, but I don't understand why. Either way, the proportions of the eigenvalues aren't altered, but they are important to me, so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues
of the covariance matrix, not the data itself. The covariance matrix built from an m x n data matrix (m observations, n features) is an n x n matrix; since your data are standardized, it is in fact the correlation matrix, which has ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

# usually a good idea to mean-center your data first:
data -= NP.mean(data, axis=0)

# calculate the correlation matrix (the covariance matrix of the standardized data)
C = NP.corrcoef(data, rowvar=0)
# returns an n x n matrix, here a 5 x 5 matrix

# now get the eigenvalues/eigenvectors of C:
evals, evecs = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD,
though, you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's)
LA module--it is a little easier to work with than svd, the return values are the eigenvectors
and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however when doing PCA, you'll always have a square matrix to decompose,
regardless of the form that your data is in. This is obvious because the matrix you
are decomposing in PCA is a covariance matrix, which by definition is always square
(i.e., its rows and columns both index the same set of variables, and each cell holds the covariance of a pair of them; for standardized data the main diagonal is all ones, since every variable is perfectly correlated with itself).
The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a dataset A whose columns are the observations is (1/(N-1)) * AA^T.
Now, when you do PCA by using the SVD, the eigenvalues of the covariance matrix are the squared singular values divided by N-1 (or by N, depending on the normalization convention), so you have to square s and apply that scaling to get the eigenvalues of the covariance at the correct scale.
In your case, N=150 and you have neither squared the singular values nor divided by N, hence the discrepancy.
This is explained in detail here
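A small sketch of that relationship, on random stand-in data of the same shape as the standardized iris array (150 x 4); the scaling here divides by N rather than N-1, matching the question's use of np.std with its default ddof=0:
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(150, 4))                      # stand-in for the iris data
data = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize as in the question

u, s, vt = np.linalg.svd(data, full_matrices=False)
evals_from_svd = s**2 / data.shape[0]                 # squared singular values, scaled by N

C = np.corrcoef(data, rowvar=False)                   # correlation matrix of the data
evals_direct = np.sort(np.linalg.eigvalsh(C))[::-1]

print(np.allclose(evals_from_svd, evals_direct))      # True: same eigenvalues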
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
Question already addressed: Principal component analysis in Python
