I'm trying to learn how to reduce dimensionality in datasets. I came across some tutorials on Principal Component Analysis and Singular Value Decomposition. I understand that it takes the direction of greatest variance and sequentially collapses directions of the next-highest variance (an oversimplification, I know).
I'm confused about how to interpret the output matrices. I looked at the documentation, but it wasn't much help. I followed some tutorials and was not too sure what the resulting matrices were exactly. I've provided some code below to get a feel for the distribution of each variable in the dataset (from sklearn.datasets).
My initial input array is an (n x m) matrix of n samples and m attributes. I could do a common PCA plot of PC1 vs. PC2, but how do I know which dimensions each PC represents?
Sorry if this is a basic question. A lot of the resources are very math-heavy, which I'm fine with, but a more intuitive answer would be useful. Nowhere I've looked talks about how to interpret the output in terms of the original labeled data.
I'm open to using sklearn's decomposition.PCA.
#Singular Value Decomposition
U, s, V = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, V.shape, sep="\n")
(442, 442)
(10,)
(10, 10)
As you stated above, the matrix M can be decomposed as a product of three matrices: M = U * S * V^T. (Note that np.linalg.svd returns the third factor already transposed, so the V in your code is really V^T.)
The geometric meaning is this: any linear transformation can be viewed as a rotation (V^T), followed by a scaling (S), followed by another rotation (U). Here's a good description and animation.
What's important for us?
The matrix S is diagonal: all of its values off the main diagonal are 0. (np.linalg.svd returns it as the 1-D vector s of singular values; np.diag(s) rebuilds the matrix.)
Like:
np.diag(s)
array([[ 2.00604441, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 1.22160478, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 1.09816315, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.97748473, 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.81374786, 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0.77634993, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0.73250287, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.65854628, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.27985695, 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.09252313]])
Geometrically, each value is a scaling factor along a particular axis.
For our purposes (classification and regression), these values show how much each axis contributes to the overall result.
As you can see, the values decrease from about 2.01 down to 0.093.
One of the most important applications is easy low-rank matrix approximation to a given precision. If you do not need an ultra-precise decomposition (which is usually true for ML problems), you can throw away the smallest values and keep only the important ones. That way you can refine your solution step by step: estimate quality on a test set, drop the smallest values, and repeat. As a result you obtain a simple and robust model.
Here, good candidates to drop are values 8 and 9 (0.280 and 0.093), then 5-7, and as a last resort you could approximate the model with only the first value.
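To connect this back to the question of which original attributes each PC represents: the rows of V^T are the principal directions expressed in the original attribute space, so their entries tell you how much each original column contributes to each component. Below is a minimal sketch of both that and the truncated (low-rank) reconstruction. It assumes X is the diabetes dataset from sklearn.datasets, which matches the (442, 10) shapes above; the names k and X_approx are just illustrative:
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = data.data                                # (442, 10), as in the question
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Weights of the original attributes in the first component.
for name, weight in zip(data.feature_names, Vt[0]):
    print(f"{name}: {weight:+.3f}")

# Rank-k approximation: keep only the k largest singular values.
k = 3
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print("reconstruction error:", np.linalg.norm(X - X_approx))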
I currently have a dataset of n x m tensors that I need to pad to 13 x m tensors (n <= 13); however, I cannot figure out how to do this without TensorFlow losing the shape of my tensor.
I am trying to apply the map function to these, but tf.constant cannot accept a tensor as part of the padding specification, and because of map's requirements I cannot just use the numpy method.
def pad_tensor(x):
    current_length = tf.shape(x)[0]
    additional_length = 13 - current_length
    padding = tf.constant([[0, additional_length], [0, 0]])
    return tf.pad(x, padding, "CONSTANT")
I know I can use py_func, but when I do that, TensorFlow loses the shape of the data in the dataset.
Any help would be appreciated.
PS: I'm not sure exactly what you mean by "apply the map function to these" and "because of map's requirement I cannot just use the numpy method"; if you still have a problem after following the example below, please make the definition of the problem clearer.
FWIW, adding .numpy() makes the code run without any error:
import tensorflow as tf
import numpy as np

def pad_tensor(x):
    current_length = tf.shape(x)[0]
    additional_length = 13 - current_length.numpy()
    padding = tf.constant([[0, additional_length], [0, 0]])
    return tf.pad(x, padding, "CONSTANT")

n = 3
m = 5
x = tf.constant(np.random.rand(n, m))
pad_tensor(x)
Outputs:
<tf.Tensor: shape=(13, 5), dtype=float64, numpy=
array([[0.35710346, 0.49611589, 0.18744049, 0.91046784, 0.19934265],
[0.51464596, 0.96416921, 0.87008494, 0.52756893, 0.23010099],
[0.05335277, 0.88451633, 0.25949178, 0.91156944, 0.03638372],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ]])>
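Note that .numpy() only works in eager execution, so the version above won't run inside Dataset.map, where the function is traced as a graph. A minimal graph-safe sketch, relying on the fact that tf.pad also accepts tensor-valued paddings (so tf.constant isn't needed at all); this is a hedged variant, not the answer's code:
import tensorflow as tf
import numpy as np

@tf.function  # traced like a Dataset.map function would be
def pad_tensor_graph(x):
    # Build the paddings from tensors directly; tf.pad converts
    # the nested list (including the scalar tensor) for us.
    additional_length = 13 - tf.shape(x)[0]
    paddings = [[0, additional_length], [0, 0]]
    return tf.pad(x, paddings, "CONSTANT")

x = tf.constant(np.random.rand(3, 5))
print(pad_tensor_graph(x).shape)  # (13, 5)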
I have a numpy array as follows:
array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.00791667, 0. , 0. , 0. , 0. ,
0. , 0.06837452, 0.09166667, 0.00370881, 0. ,
0. , 0.00489809, 0. , 0. , 0. ,
0. , 0. , 0.23888889, 0. , 0.05927778,
0.12138889, 0. , 0. , 0. , 0.36069444,
0.31711111, 0.16333333, 0.15005556, 0.01 , 0.005 ,
0.14357413, 0. , 0.15722222, 0.29494444, 0.3245 ,
0.31276639, 0.095 , 0.04750292, 0.09127039, 0. ,
0.06847222, 0.17 , 0.18039233, 0.21567804, 0.15913079,
0.4579781 , 0. , 0.2459 , 0.14886556, 0.08447222,
0. , 0.13722222, 0.28336984, 0.0725 , 0.077355 ,
0.45166391, 0. , 0.24892933, 0.25360062, 0. ,
0.12923041, 0.16145892, 0.48771795, 0.38527778, 0.29432968,
0.31983305, 1.07573089, 0.30611111, 0. , 0.0216475 ,
0. , 0.62268056, 0.16829156, 0.46239719, 0.6415958 ,
0.02138889, 0.76457155, 0.05711551, 0.35050949, 0.34856278,
0.15686164, 0.23158889, 0.16593262, 0.34961111, 0.21247575,
0.14116667, 0.19414785, 0.09166667, 0.93376627, 0.12772222,
0.00366667, 0.10297222, 0.173 , 0.0381225 , 0.22441667,
0.46686111, 0.18761111, 0.56037889, 0.47566111])
From this array, I need to calculate the area under the curve for each sub-array that starts at a 0, rises above 0, and ends at the 0 that follows the non-zero values. Obviously the sub-array lengths will vary. It may also happen that two of these sub-arrays share a 0 value (the last 0 of the first sub-array is the first 0 of the second).
The expected first two arrays should be:
[0. , 0.00791667, 0. ]
[0. , 0.06837452, 0.09166667, 0.00370881, 0. ]
I've looked into splitting Python lists based on an element being equal to 0, but haven't found anything useful. What can I do?
See the code below; I think this is about the most efficient you'll be able to do.
First, split the array using the indices of all of the zeroes. Where multiple zeroes are together, this produces several [ 0. ] arrays, so filter those out (based on length, as all arrays must necessarily begin with a zero) to produce C. Finally, since they all begin with zero, but none end with zero, append a zero to each array.
import numpy as np
# <Your array here>
A = np.array(...)
# Split into arrays based on zeroes
B = np.split(A, np.where(A == 0)[0])
# Filter out arrays of length 1
# (just a zero, caused by multiple zeroes together)
f = np.vectorize(lambda a: len(a) > 1)
C = np.extract(f(B), B)
# Append a zero to each array
g = np.vectorize(lambda a: np.append(a, 0), otypes=[object])
D = g(C)
# Output result
for array in D:
print(array)
This gives the following output:
[ 0. 0.00791667 0. ]
[ 0. 0.06837452 0.09166667 0.00370881 0. ]
[ 0. 0.00489809 0. ]
[ 0. 0.23888889 0. ]
[ 0. 0.05927778 0.12138889 0. ]
[ 0. 0.36069444 0.31711111 0.16333333 0.15005556 0.01 0.005
0.14357413 0. ]
[ 0. 0.15722222 0.29494444 0.3245 0.31276639 0.095
0.04750292 0.09127039 0. ]
[ 0. 0.06847222 0.17 0.18039233 0.21567804 0.15913079
0.4579781 0. ]
[ 0. 0.2459 0.14886556 0.08447222 0. ]
[ 0. 0.13722222 0.28336984 0.0725 0.077355 0.45166391
0. ]
[ 0. 0.24892933 0.25360062 0. ]
[ 0. 0.12923041 0.16145892 0.48771795 0.38527778 0.29432968
0.31983305 1.07573089 0.30611111 0. ]
[ 0. 0.0216475 0. ]
[ 0. 0.62268056 0.16829156 0.46239719 0.6415958 0.02138889
0.76457155 0.05711551 0.35050949 0.34856278 0.15686164 0.23158889
0.16593262 0.34961111 0.21247575 0.14116667 0.19414785 0.09166667
0.93376627 0.12772222 0.00366667 0.10297222 0.173 0.0381225
0.22441667 0.46686111 0.18761111 0.56037889 0.47566111 0. ]
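Since the stated goal was the area under the curve of each sub-array, here's a minimal follow-up sketch using the trapezoidal rule via np.trapz (renamed np.trapezoid in NumPy 2.0+). It assumes unit spacing between samples; pass dx or x to np.trapz if your data are sampled differently:
import numpy as np

# D is the list of zero-bounded sub-arrays produced above; the
# first two expected segments are used here as a standalone example.
D = [np.array([0., 0.00791667, 0.]),
     np.array([0., 0.06837452, 0.09166667, 0.00370881, 0.])]

# Area under each segment, assuming dx = 1.
areas = [np.trapz(sub) for sub in D]
for sub, area in zip(D, areas):
    print(f"length {len(sub)}: area = {area:.5f}")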
Suppose we have an array with numbers between 0 and 1:
arr=np.array([ 0. , 0. , 0. , 0. , 0.6934264 ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0.6934264 , 0. , 0.6934264 ,
0. , 0. , 0. , 0. , 0.251463 ,
0. , 0. , 0. , 0.87104906, 0.251463 ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.48419626,
0. , 0. , 0. , 0. , 0. ,
0.87104906, 0. , 0. , 0.251463 , 0.48419626,
0. , 0.251463 , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.251463 , 0. , 0.35524532, 0. ,
0. , 0. , 0. , 0. , 0.251463 ,
0.251463 , 0. , 0.74209813, 0. , 0. ])
Using seaborn, I want to plot a distribution plot:
sns.distplot(arr, hist=False)
This will give us a KDE plot of the distribution. The estimate ranges from somewhere near -0.20 to 1.10, even though the data lie in [0, 1]. Is it possible to force the estimation to be between 0 and 1? I have tried the following with no luck:
sns.distplot(arr, hist=False, hist_kws={'range': (0.0, 1.0)})
sns.distplot(arr, hist=False, kde_kws={'range': (0.0, 1.0)})
The second line raises an exception: range is not a valid keyword for kde_kws.
The correct way of doing this is to use the clip keyword instead of range:
sns.distplot(arr, hist=False, kde_kws={'clip': (0.0, 1.0)})
which will produce a KDE restricted to the [0, 1] range.
Indeed, if you only care about the kde and not the histogram, you can use the kdeplot function, which will produce the same result:
sns.kdeplot(arr, clip=(0.0, 1.0))
Setting plt.xlim(0, 1) beforehand should also help, though note that this only limits the visible range; the estimate itself still extends beyond the data:
import matplotlib.pyplot as plt
plt.xlim(0, 1)
sns.distplot(arr, hist=False)
I have a numpy array:
arr=np.array([0,1,0,0.5])
I need to form a new array from it as follows: every element expands to three output elements, where a zero becomes three zeros and a non-zero element gets two preceding zeros followed by the number itself. That is:
0, 0, 0    [for index 0, a zero]
0, 0, 1    [for index 1]
0, 0, 0    [for index 2, again a zero]
0, 0, 0.5  [for index 3]
The final output should be:
new_arr=[0,0,0,0,0,1,0,0,0,0,0,0.5]
np.repeat() repeats every array element n times, but that's not quite what I want. How should this be done? Thanks for the help.
A quick reshape followed by a call to np.pad will do it:
np.pad(arr.reshape(-1, 1), ((0, 0), (2, 0)), 'constant')
Output:
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 1. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0.5]])
You'll want to flatten it back again. That's simply done by calling .reshape(-1, ).
>>> np.pad(arr.reshape(-1, 1), ((0, 0), (2, 0)), 'constant').reshape(-1, )
array([ 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0.5])
A variant on the pad idea is to concatenate a 2D array of zeros:
In [477]: arr=np.array([0,1,0,0.5])
In [478]: np.column_stack([np.zeros((len(arr),2)),arr])
Out[478]:
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 1. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0.5]])
In [479]: _.ravel()
Out[479]:
array([ 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0.5])
or padding in the other direction:
In [481]: np.vstack([np.zeros((2,len(arr))),arr])
Out[481]:
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0.5]])
In [482]: _.T.ravel()
Out[482]:
array([ 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0.5])
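Another variant, and arguably the most direct, is to allocate the output and assign with a stride; a small sketch on the same input (the name out is illustrative):
import numpy as np

arr = np.array([0, 1, 0, 0.5])

# Each input element lands at output position 3*i + 2;
# the remaining slots stay zero.
out = np.zeros(3 * len(arr))
out[2::3] = arr
print(out)
# [0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.5]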
I've implemented a matrix factorization model, say R = U*V, and now I would like to train and test this model.
To this end, given a sparse matrix R (zero for missing values), I want to first hide some non-zero elements during training and use these non-zero elements as a test set later.
How can I randomly select some non-zero elements from a numpy.ndarray? Besides, I need to remember the row and column positions of the selected elements so I can use them in testing.
For example:
In [2]: import numpy as np
In [4]: mtr = np.random.rand(10,10)
In [5]: mtr
Out[5]:
array([[ 0.92685787, 0.95496193, 0.76878455, 0.12304856, 0.13804963,
0.30867502, 0.60245974, 0.00797898, 0.1060602 , 0.98277982],
[ 0.88879888, 0.40209901, 0.35274404, 0.73097713, 0.56238248,
0.380625 , 0.16432029, 0.5383006 , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0.5991212 , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0.91442668, 0.72827097, 0.4511198 ],
[ 0.63979934, 0.33421621, 0.09218392, 0.71520048, 0.57100522,
0.37205284, 0.59726293, 0.58224992, 0.58690505, 0.4791199 ],
[ 0.35219557, 0.34954002, 0.93837312, 0.2745864 , 0.89569075,
0.81244084, 0.09661341, 0.80673646, 0.83756759, 0.7948081 ],
[ 0.09173706, 0.86250006, 0.22121994, 0.21097563, 0.55090202,
0.80954817, 0.97159981, 0.95888693, 0.43151554, 0.2265607 ],
[ 0.00723128, 0.95690539, 0.94214806, 0.01721733, 0.12552314,
0.65977765, 0.20845669, 0.44663729, 0.98392716, 0.36258081],
[ 0.65994805, 0.47697842, 0.35449045, 0.73937445, 0.68578224,
0.44278095, 0.86743906, 0.5126411 , 0.75683392, 0.73354572],
[ 0.4814301 , 0.92410622, 0.85267402, 0.44856078, 0.03887269,
0.48868498, 0.83618382, 0.49404473, 0.37328248, 0.18134919],
[ 0.63999748, 0.48718656, 0.54826717, 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0.70610649, 0.03213063, 0.88371607]])
In [6]: mtr = np.where(mtr>0.5, 0, mtr)
In [7]: %clear
In [8]: mtr
Out[8]:
array([[ 0. , 0. , 0. , 0.12304856, 0.13804963,
0.30867502, 0. , 0.00797898, 0.1060602 , 0. ],
[ 0. , 0.40209901, 0.35274404, 0. , 0. ,
0.380625 , 0.16432029, 0. , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0. , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0. , 0. , 0.4511198 ],
[ 0. , 0.33421621, 0.09218392, 0. , 0. ,
0.37205284, 0. , 0. , 0. , 0.4791199 ],
[ 0.35219557, 0.34954002, 0. , 0.2745864 , 0. ,
0. , 0.09661341, 0. , 0. , 0. ],
[ 0.09173706, 0. , 0.22121994, 0.21097563, 0. ,
0. , 0. , 0. , 0.43151554, 0.2265607 ],
[ 0.00723128, 0. , 0. , 0.01721733, 0.12552314,
0. , 0.20845669, 0.44663729, 0. , 0.36258081],
[ 0. , 0.47697842, 0.35449045, 0. , 0. ,
0.44278095, 0. , 0. , 0. , 0. ],
[ 0.4814301 , 0. , 0. , 0.44856078, 0.03887269,
0.48868498, 0. , 0.49404473, 0.37328248, 0.18134919],
[ 0. , 0.48718656, 0. , 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0. , 0.03213063, 0. ]])
Given such a sparse ndarray, how can I select 20% of the non-zero elements and remember their positions?
We'll use numpy.random.choice. First, we get arrays of the (i, j) indices where mtr is nonzero:
i, j = np.nonzero(mtr)
Then we'll select 20% of these:
ix = np.random.choice(len(i), int(np.floor(0.2 * len(i))), replace=False)
Here ix is an array of random, unique indices whose length is 20% of the length of i and j (their length is the number of nonzero entries). To recover the positions, we use i[ix] and j[ix], so we can select 20% of the nonzero entries of mtr by writing:
print(mtr[i[ix], j[ix]])
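Putting it together for the train/test split described in the question, here's a minimal sketch that zeroes out the sampled entries to form the training matrix and keeps their positions and values for testing (hide_entries and the 0.2 fraction are illustrative, not a fixed recipe):
import numpy as np

rng = np.random.default_rng(0)

def hide_entries(R, frac=0.2):
    # Pick a random fraction of the non-zero entries, remember
    # their (row, col, value) triples, and zero them out.
    i, j = np.nonzero(R)
    ix = rng.choice(len(i), int(np.floor(frac * len(i))), replace=False)
    test_vals = R[i[ix], j[ix]].copy()
    R_train = R.copy()
    R_train[i[ix], j[ix]] = 0
    return R_train, (i[ix], j[ix], test_vals)

mtr = np.random.rand(10, 10)
mtr = np.where(mtr > 0.5, 0, mtr)   # same sparsification as above
R_train, (ti, tj, tv) = hide_entries(mtr)
print(len(tv), "entries held out for testing")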