I want to perform principal component analysis for dimension reduction and data integration.
I have 3 features (variables) and 5 samples, as shown below. I want to integrate them into a 1-dimensional (single-feature) output by computing the 1st PC. I want to use the transformed data for further statistical analysis, because I believe it captures the 'main' characteristics of the 3 input features.
I first wrote test code in Python using scikit-learn, shown below. It is the simple case where the values of the 3 features are all identical. In other words, I applied PCA to three copies of the same vector, [0, 1, 2, 1, 0].
Code
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
samples = np.array([[0,0,0],[1,1,1],[2,2,2],[1,1,1],[0,0,0]])
pc1 = pca.fit_transform(samples)
print (pc1)
Output
[[-1.38564065]
[ 0.34641016]
[ 2.07846097]
[ 0.34641016]
[-1.38564065]]
1. Is taking the 1st PC after dimension reduction a proper approach to data integration?
1-2. For example, take a 2-feature case with features [power rank, speed rank], where power is roughly negatively correlated with speed. I want to find the samples that have both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult in a case like [power 4, speed 2] vs [power 3, speed 3].
So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the 1st PC. Is this kind of approach still proper?
2. In the test case above, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is this the right answer?
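Yes. It is also called data projection (onto a lower dimension).
The result is correct. The output is centered on the mean of the training data and expressed along the unit-length first component, which here is (1, 1, 1)/sqrt(3), so each value equals sqrt(3) * (x - 0.8).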
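To see where those numbers come from, here is a minimal sketch that redoes the projection by hand with only the standard library. It assumes the special structure of this test case (three identical columns, so the first component is known to be (1, 1, 1)/sqrt(3)); it is not how scikit-learn computes components in general.

```python
import math

samples = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]]

# Centre each column on its mean (0.8 for every column here).
n_features = len(samples[0])
means = [sum(row[j] for row in samples) / len(samples) for j in range(n_features)]
centered = [[x - mu for x, mu in zip(row, means)] for row in samples]

# Assumption: with three identical columns all variance lies along (1, 1, 1),
# so the unit-length first component is (1, 1, 1) / sqrt(3).
direction = [1 / math.sqrt(n_features)] * n_features

# The first PC score is the dot product of each centred row with that direction.
pc1 = [sum(c * d for c, d in zip(row, direction)) for row in centered]
print([round(v, 8) for v in pc1])
```

The printed values match the scikit-learn output in the question. In general the sign of a component is arbitrary, so a flipped result would be equally valid.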
With only 5 samples I don't think it is wise to run any statistical methods. And if you believe that your features are the same, just check that the correlation between dimensions is close to 1, and then you can simply disregard the other dimensions.
There is no need to use PCA for a dataset this small. Also, for PCA your array should be scaled.
In any case, you have only 3 dimensions: you can plot the points and look at them by eye, or you can calculate distances (use some kind of nearest-neighbors algorithm).
I am using scikit-learn preprocessing scaling for sparse matrices.
My goal is to "scale" each feature column by taking the logarithm with the column's maximum value as the base. My wording may be inexact, so let me try to explain.
Say feature-column has values: 0, 8, 2:
Max value = 8
Log-8 of feature value 0 should be 0.0 = math.log(0+1, 8+1) (the +1 is to cope with zeros; so yes, we are actually taking log-base 9)
Log-8 of feature value 8 should be 1.0 = math.log(8+1, 8+1)
Log-8 of feature value 2 should be 0.5 = math.log(2+1, 8+1)
Yes, I can easily apply any arbitrary function with FunctionTransformer, but I want the base of the log to change with each column (in particular, to be the column's maximum value). That is, I want to do something like the MaxAbsScaler, only taking logarithms.
I see that MaxAbsScaler first gets a vector (scale) of the maximum values of each column and then multiplies the original matrix by 1 / scale.
However, I don't know what to do if I want to take logarithms based on the scale vector. Is it even possible to turn the logarithm operation into a multiplication, or are there other efficient scipy sparse operations I could use?
I hope my intent is clear (and possible).
The logarithm of x in base b is the same as log(x)/log(b), where the logs are natural. So the process you describe amounts to first applying a log(x+1) transformation to everything, and then scaling by the maximum absolute value. Conveniently, log(x+1) is a built-in function, log1p. Example:
from sklearn.preprocessing import FunctionTransformer, maxabs_scale
from scipy.sparse import csc_matrix
import numpy as np
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
X = csc_matrix([[ 1., 0, 8], [ 2., 0, 0], [ 0, 1., 2]])
Y = maxabs_scale(logtran.transform(X))
Output (sparse matrix Y):
(0, 0) 0.630929753571
(1, 0) 1.0
(2, 1) 1.0
(0, 2) 1.0
(2, 2) 0.5
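The change-of-base identity can be checked on the single column from the question with plain Python. This is only a sketch: `log_maxabs_column` is a hypothetical helper name, and it assumes non-negative values with at least one non-zero entry in the column.

```python
import math

def log_maxabs_column(column):
    """Scale one column to log base (max + 1) of (x + 1), using natural logs."""
    logged = [math.log1p(x) for x in column]       # log(x + 1)
    scale = max(abs(v) for v in logged)            # equals log(max + 1)
    return [v / scale for v in logged]

column = [0, 8, 2]
scaled = log_maxabs_column(column)

# The same values computed directly with an explicit base of max + 1 = 9.
direct = [math.log(x + 1, max(column) + 1) for x in column]

print(scaled)
print(direct)
```

Both lists agree up to floating-point rounding, which is why the log1p step followed by max-abs scaling gives the per-column base.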
I would like to count how many m by n matrices whose elements are 1 or -1 have the property that all its floor(m/2)+1 by n submatrices have full rank. My current method is naive and slow and is in the following python/numpy code. It simply iterates over all matrices and tests all the submatrices.
import itertools
from math import comb
import numpy as np

m = 8
n = 4
rowstochoose = m // 2 + 1
maxnumber = comb(m, rowstochoose)  # number of row subsets to test per matrix
# every m x n matrix with entries in {-1, 1}
matrix_g = (np.array(x).reshape(m, n) for x in itertools.product([-1, 1], repeat=m*n))
nofound = 0
for A in matrix_g:
    count = 0
    for rows in itertools.combinations(range(m), rowstochoose):
        if np.linalg.matrix_rank(A[list(rows)]) == min(n, rowstochoose):
            count += 1
        else:
            break  # one rank-deficient submatrix rules this matrix out
    if count == maxnumber:
        nofound += 1
print(nofound, 2**(m*n))
Is there a better/faster way to do this? I would like to do this calculation for n and m up to 20 but any significant improvements would be great.
Context. I am interested in getting some exact solutions for https://math.stackexchange.com/questions/640780/probability-that-every-vector-is-not-orthogonal-to-half-of-the-others .
As data points for comparing implementations: n,m = 4,4 should output 26880. n,m = 5,5 is too slow for me to run. For n = 2 and m = 2,3,4,5,6 the outputs should be 8, 0, 96, 0, 1280.
Current status Feb 2, 2014:
The answer of leewangzhong is fast but is not correct for m > n . leewangzhong is considering how to fix it.
The answer of Hooked does not run for m > n .
(Now a partial solution for n = m//2+1, and the requested code.)
Let k := m//2+1
This is somewhat equivalent to asking, "How many collections of m n-dimensional vectors of {-1,1} have no linearly dependent sets of size min(k,n)?"
For those matrices, we know or can assume:
The first entry of every vector is 1 (if not, multiply the whole vector by -1). This reduces the count by a factor of 2**m.
All vectors in the list are distinct (if not, any submatrix with two identical vectors has non-full rank). This eliminates a lot. There are choose(2**(n-1),m) matrices of distinct vectors.
The list of vectors is sorted lexicographically (rank isn't affected by permutations). So we're really thinking about sets of vectors instead of lists. This reduces the count by a factor of m! (because we require distinctness).
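For the square case, these three reductions can be checked against the reference values from the question. The following is a rough sketch using only the standard library. The names `exact_rank` and `count_square` are made up for illustration, rows are normalised to first entry 1 as in the first point, and it only covers m = n, where min(k, n) = k.

```python
from fractions import Fraction
from itertools import combinations, product
from math import factorial

def exact_rank(rows):
    """Exact rank of an integer matrix (sequence of rows) by Gaussian elimination."""
    mat = [[Fraction(x) for x in row] for row in rows]
    found = 0
    for col in range(len(mat[0]) if mat else 0):
        pivot = next((r for r in range(found, len(mat)) if mat[r][col] != 0), None)
        if pivot is None:
            continue
        mat[found], mat[pivot] = mat[pivot], mat[found]
        for r in range(found + 1, len(mat)):
            factor = mat[r][col] / mat[found][col]
            if factor:
                mat[r] = [a - factor * b for a, b in zip(mat[r], mat[found])]
        found += 1
    return found

def count_square(n):
    """Count n x n matrices over {-1, 1} whose every (n//2 + 1)-row submatrix has full rank."""
    k = n // 2 + 1
    # Reduction 1: fix the first entry of every row to 1 (accounts for a factor 2**n).
    vectors = [(1,) + rest for rest in product((1, -1), repeat=n - 1)]
    # Reductions 2 and 3: rows are distinct and their order is irrelevant (factor n!).
    good_sets = sum(
        all(exact_rank(sub) == k for sub in combinations(rows, k))
        for rows in combinations(vectors, n)
    )
    return good_sets * factorial(n) * 2 ** n

print(count_square(2), count_square(4))  # the question expects 8 and 26880
```

Exact rational arithmetic is used instead of a floating-point rank so that the rank test cannot be affected by a tolerance.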
With this, we have a solution for n=4, m=8. There are only eight different vectors with the property that the first entry is positive. There is only one combination (sorted list) of 8 distinct vectors from 8 distinct vectors.
array([[ 1, 1, 1, 1],
[ 1, 1, 1, -1],
[ 1, 1, -1, 1],
[ 1, 1, -1, -1],
[ 1, -1, 1, 1],
[ 1, -1, 1, -1],
[ 1, -1, -1, 1],
[ 1, -1, -1, -1]], dtype=int8)
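Several size-4 combinations from this list have rank 3 (for example, the four vectors whose second entry is 1, which agree in their first two columns). So there are 0 matrices with the property.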
For a more general solution:
Note that there are 2**(n-1) vectors with first coordinate -1, and choose(2**(n-1),m) matrices to inspect. For n=8 and m=8, there are 128 vectors, and 1.4297027e+12 matrices. It might help to answer, "For i=1,...,k, how many combinations have rank i?"
Alternatively, "What kind of matrices (with the above assumptions) have less than full rank?" A sufficient condition is, "Two columns are multiples of each other". (I originally believed this was also necessary and thought I had verified it for all 4x4, 5x5, and 6x6 matrices, but I must have screwed up the tests.) Since the first column was chosen to be homogeneous, and since all homogeneous vectors are multiples of each other, any submatrix of size k with a homogeneous column other than the first column will have rank less than k.
This is not a necessary condition, though. The following matrix is singular (the first row plus the fourth row equals the second row plus the third row), yet no two of its columns are multiples of each other.
array([[ 1, 1, 1, 1, 1],
[ 1, 1, 1, 1, -1],
[ 1, 1, -1, -1, 1],
[ 1, 1, -1, -1, -1],
[ 1, -1, 1, -1, 1]], dtype=int8)
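Since there are only two possible values (-1 and 1), every column of an m x n matrix has a majority value. When m > 2 is odd, k := m//2+1, n = k and the first column is all -1, each column has at least k equal members, so some k rows give a submatrix with a homogeneous column other than the first. So for odd m and n=k, the answer is 0. (For even m the majority value is only guaranteed m/2 = k-1 members, so this argument does not apply.)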
For n<=8, here's code to generate the vectors.
from numpy import unpackbits, arange, uint8, int8
#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1),dtype=uint8)) #unpackbits only takes uint
            .reshape((-1,8)) #unpackbits flattens, so we need to shape it to 8 bits
            [:,-n:] #only take the last n bits
            .view(int8) #need signed
        )
Matrix generator:
#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n,m):
    return (array(mat) for mat in combinations(nvectors(n),m))
The following is a function to check that all submatrices of length maxrank have full rank. It stops if any have less than maxrank, instead of checking all combinations.
rankof = np.linalg.matrix_rank
#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix,maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix,maxrank))
Generate all matrices that have all half-matrices with full rank:
def nicematrices(m,n):
    maxrank = min(m//2+1,n)
    return (matr for matr in matrix_g(n,m) if halfrank(matr,maxrank))
Putting it all together:
import numpy as np
from numpy import unpackbits, arange, uint8, int8, array
from itertools import combinations

#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    if n==0:
        return array([])
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1),dtype=uint8)) #unpackbits only takes uint
            .reshape((-1,8)) #unpackbits flattens, so we need to shape it to 8 bits
            [:,-n:] #only take the last n bits
            .view(int8) #need signed
        )

#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n,m):
    return (array(mat) for mat in combinations(nvectors(n),m))

rankof = np.linalg.matrix_rank

#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix,maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix,maxrank))

#generate all matrices that have all half-matrices with full rank
def nicematrices(m,n):
    maxrank = min(m//2+1,n)
    return (matr for matr in matrix_g(n,m) if halfrank(matr,maxrank))

#returns (number of nice matrices, number of all matrices)
def count_nicematrices(m,n):
    from math import factorial
    return (len(list(nicematrices(m,n)))*factorial(m)*2**m, 2**(m*n))

for i in range(0,6):
    print (i, count_nicematrices(i,i))
count_nicematrices(5,5) takes about 15 seconds for me, the vast majority of which is taken by the matrix_rank function.
Since no one's answered yet, here's an answer without code. The useful symmetries that I see are as follows.
Multiply a row by -1.
Multiply a column by -1.
Permute the rows.
Permute the columns.
I would attack this problem by exhaustively generating the non-isomorphs, filtering them, and summing the sizes of their orbits. nauty will be quite useful for the first and third steps. Assuming that most matrices have few symmetries (undoubtedly an excellent assumption for n large, but it's not obvious a priori how large), I would expect 8x8 to be doable, 9x9 to be borderline, and 10x10 to be out of reach.
Expanded pseudocode:
Generate one representative of each orbit of the (m - 1) by (n - 1) 0-1 matrices acted upon by the group generated by row and column permutations, together with the size of the orbit (= (m - 1)! (n - 1)! / the size of the automorphism group). Perhaps the author of the paper that Tim linked would be willing to share his code; otherwise, see below.
For each matrix, replace entries x by (-1)^x. Add one row and one column of 1s. Multiply the size of its orbit by 2^(m + n - 1). This takes care of the sign change symmetries.
Filter the matrices and sum the orbit sizes of the ones that remain. You might save a little computation time here by implementing Gram-Schmidt yourself so that when you try all combinations in lexicographic order there's an opportunity to reuse partial results for the shared prefixes.
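For tiny sizes the three steps above can be tried without nauty. The sketch below is only an illustration under loud assumptions: the canonical form is found by brute force over all row and column permutations (a stand-in for nauty that is hopeless beyond very small matrices), orbit sizes are obtained by counting rather than from automorphism groups, `exact_rank`, `canonical` and `count_by_orbits` are hypothetical names, and m, n >= 2 is assumed.

```python
from fractions import Fraction
from itertools import combinations, permutations, product

def exact_rank(rows):
    """Exact rank of an integer matrix (sequence of rows) by Gaussian elimination."""
    mat = [[Fraction(x) for x in row] for row in rows]
    found = 0
    for col in range(len(mat[0]) if mat else 0):
        pivot = next((r for r in range(found, len(mat)) if mat[r][col] != 0), None)
        if pivot is None:
            continue
        mat[found], mat[pivot] = mat[pivot], mat[found]
        for r in range(found + 1, len(mat)):
            factor = mat[r][col] / mat[found][col]
            if factor:
                mat[r] = [a - factor * b for a, b in zip(mat[r], mat[found])]
        found += 1
    return found

def canonical(core):
    """Smallest image of a 0-1 matrix under all row and column permutations."""
    return min(
        tuple(tuple(core[r][c] for c in cp) for r in rp)
        for rp in permutations(range(len(core)))
        for cp in permutations(range(len(core[0])))
    )

def count_by_orbits(m, n):
    k = m // 2 + 1
    # Step 1: one representative per orbit of (m-1) x (n-1) 0-1 matrices, with its size.
    orbit_size = {}
    for bits in product((0, 1), repeat=(m - 1) * (n - 1)):
        core = tuple(tuple(bits[r * (n - 1):(r + 1) * (n - 1)]) for r in range(m - 1))
        key = canonical(core)
        orbit_size[key] = orbit_size.get(key, 0) + 1
    total = 0
    for core, size in orbit_size.items():
        # Step 2: replace x by (-1)**x and add one row and one column of 1s.
        full = [(1,) * n] + [(1,) + tuple((-1) ** x for x in row) for row in core]
        # Step 3: keep the representative if every k-row submatrix has full rank.
        if all(exact_rank(sub) == min(k, n) for sub in combinations(full, k)):
            total += size * 2 ** (m + n - 1)   # sign changes of rows and columns
    return total

print(count_by_orbits(4, 4), count_by_orbits(4, 2))  # the question expects 26880 and 96
```

The filter only runs once per orbit, which is the point of the approach. The brute-force canonical form is what a real implementation would replace with nauty.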
Isomorph-free enumeration:
McKay's template can be used to generate the representatives for (m + 1) by n 0-1 matrices from the representatives for m by n 0-1 matrices, in a manner amenable to depth-first search. With each m by n 0-1 matrix, associate a bipartite graph with m black vertices, n white vertices, and the appropriate edge for each 1 entry. Do the following for each m by n representative.
For each length-n vector, construct the graph for the (m + 1) by n matrix consisting of the representative together with the new vector and run nauty to get a canonical labeling and the vertex orbits.
Filter out the possibilities where the vertex corresponding to the new vector is in a different orbit from the black vertex with the lowest number.
Filter out the possibilities with duplicate canonical labelings.
nauty also computes the orders of automorphism groups.
You will need to rethink this problem from a mathematical point of view. That said even with brute force, there are some programming tricks you can use to speed up the process (as SO is a programming site). Little tricks like not recalculating int(min(n,rowstochoose)) and itertools.combinations(range(m), int(rowstochoose)) can save a few percent - but the real gain comes from memoization. Others have mentioned it, but I thought it might be useful to have a complete, working, code example:
import hashlib
import itertools
import numpy as np

m, n = 4, 4
rowstochoose = m // 2 + 1
combo_itr = itertools.product([-1, 1], repeat=m*n)
matrix_itr = (np.array(x, dtype=np.int8).reshape((m, n)) for x in combo_itr)
# materialize the row subsets once; they are reused for every matrix
sub_shapes = [list(s) for s in itertools.combinations(range(m), rowstochoose)]
required_rank = min(n, rowstochoose)
memo = {}  # hash of a submatrix -> its rank
no_found = 0
for A in matrix_itr:
    check = True
    for s in sub_shapes:
        view = A[s]
        h = hashlib.sha1(view).hexdigest()
        if h not in memo:
            memo[h] = np.linalg.matrix_rank(view)
        if memo[h] != required_rank:
            check = False
            break
    if check:
        no_found += 1
print(no_found, 2**(m*n))
This gives a speed gain of almost 10x for the 4x4 case, and you'll see substantial improvements for larger matrices if you care to wait long enough. For larger matrices, where computing the rank is proportionally more expensive, you can also sort the rows of each submatrix before hashing, so that row permutations of the same submatrix share one memo entry:
idx = np.lexsort(view.T)
h = hashlib.sha1(view[idx]).hexdigest()
For the 4x4 case this makes it slightly worse, but I expect that to reverse for the 5x5 case.
Algorithm 1 - memorizing small ones
I would use memorizing of the already checked smaller matrices.
You could simply write down all smaller matrices in binary format (0 for -1, 1 for 1). Let us call these codings IMAGES. Note that the image is only a compact key: the rank must still be computed on the (-1, 1) matrix, because the corresponding (0, 1) matrix can have a different rank. Using a long type you can encode matrices of up to 64 cells, so up to 8x8. It is fast. Using a String you can make them as large as you need.
Really, 8x8 is more than enough: 8 GB of memory holds 1G longs, which is about 2^30, so you can remember all matrices of up to about 25-28 elements.
For every size you'll have a set of images:
for 2x2: 1000, 0100, 0010, 0001, 0111, 1011, 1101, 1110.
So, you'll have archive=array of NxN, each element of which will be an ordered list of binary images of good matrices.
- (for matrix size MxN, where M>=N, the appropriate place in archive will have coordinates M,N. If M
When you are checking a new large matrix, divide it into small ones
For every small matrix T
If the appropriate place in the archive for the size of T has no list yet, create it, fill it with the images of all full-rank matrices of that size, and sort the images. If you run out of memory, stop filling the archive.
If T could be in the archive, according to its size:
Make the image of T
Look for the image of T in the list: if it is there, T is OK; if not, the large matrix should be discarded.
If T is too big for the archive, check it the way you do now.
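A minimal sketch of the image idea for the 2x2 case, with the archive held as a Python set instead of a sorted list. The names `image` and `build_archive_2x2` are invented for this example, and full rank is tested with the 2x2 determinant on the -1/1 entries.

```python
from itertools import product

def image(matrix):
    """Pack a -1/1 matrix into an int, one bit per cell (0 for -1, 1 for 1)."""
    bits = 0
    for row in matrix:
        for x in row:
            bits = (bits << 1) | (1 if x == 1 else 0)
    return bits

def build_archive_2x2():
    """Images of all full-rank 2 x 2 matrices over {-1, 1}."""
    archive = set()
    for a, b, c, d in product((-1, 1), repeat=4):
        if a * d - b * c != 0:          # non-zero determinant means rank 2
            archive.add(image([[a, b], [c, d]]))
    return archive

archive = build_archive_2x2()
print(sorted(format(img, "04b") for img in archive))

# Checking a small matrix is now a set lookup instead of a rank computation.
print(image([[1, -1], [1, 1]]) in archive)
print(image([[1, -1], [-1, 1]]) in archive)
```

For sizes beyond 2x2 the determinant test would have to be replaced by a general rank computation when filling the archive, but the lookups stay the same.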
Algorithm 2 - increasing sizes
The other possibility is to create larger matrices by adding pieces to smaller ones that have already been found.
You should decide up to what size the matrices will grow.
When you find a "correct" matrix of size MxN, try adding a row on top of it. The new matrix only needs to be checked for submatrices that include the new row. The same goes for a new column.
You should fix exactly which sizes are derived from which. That way you can minimize the number of remembered matrices. I thought about this sequence:
Start from 2x2 matrices.
continue with 3x2
4x2, 3x3
5x2, 4x3
6x2, 5x3, 4x4
7x2, 6x3, 5x4
...
So you can remember only (M+N)/2-1 matrices for searching among sizes MxN.
If, whenever a new size can be created from two old ones, we derive it from the more square one, we can also save a lot of the space needed to remember matrices: for "long" matrices such as 7x2 we only need to remember and check the last line, 1x2. For 6x3 matrices we should remember their 2x3 stub, and so on.
Also, you don't need to remember the largest matrices, since you won't use them for further counting.
Again, use "images" for remembering the matrices.
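The core saving of Algorithm 2, checking only the submatrices that contain the new row, might look like the sketch below. It is a simplification under stated assumptions: the number of rows per submatrix, k, is held fixed while the row is added (in the original problem k = m//2+1 grows with m, so this only covers steps where k does not change), and `exact_rank` and `extension_ok` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import combinations

def exact_rank(rows):
    """Exact rank of an integer matrix (sequence of rows) by Gaussian elimination."""
    mat = [[Fraction(x) for x in row] for row in rows]
    found = 0
    for col in range(len(mat[0]) if mat else 0):
        pivot = next((r for r in range(found, len(mat)) if mat[r][col] != 0), None)
        if pivot is None:
            continue
        mat[found], mat[pivot] = mat[pivot], mat[found]
        for r in range(found + 1, len(mat)):
            factor = mat[r][col] / mat[found][col]
            if factor:
                mat[r] = [a - factor * b for a, b in zip(mat[r], mat[found])]
        found += 1
    return found

def extension_ok(rows, new_row, k):
    """Check only the k-row submatrices that contain new_row.

    Assumes every k-row submatrix of `rows` is already known to have full rank.
    """
    full = min(k, len(new_row))
    return all(
        exact_rank(chosen + (new_row,)) == full
        for chosen in combinations(rows, k - 1)
    )

rows = ((1, 1, 1), (1, 1, -1), (1, -1, 1))      # a "correct" 3 x 3 matrix for k = 3
print(extension_ok(rows, (1, -1, -1), 3))       # every new 3-row submatrix has rank 3
print(extension_ok(rows, (-1, -1, -1), 3))      # the new row is -1 times the first row
```

With M old rows this tests choose(M, k-1) submatrices instead of all choose(M+1, k).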