Principal Component Analysis: converting a 3D array to a 1D array using Python

I have the following input data [-5, 10,2], [-2, -3,3], [-4, -9,1], [7, 11,-3], [12, 6,-1], [13, 4,5] on hand and would like to use PCA to convert it from a 3D array to a 1D array. I wrote the following code:
import numpy as np
input = np.array([[-5, 10,2], [-2, -3,3], [-4, -9,1], [7, 11,-3], [12, 6,-1], [13, 4,5]])
mean_x = np.mean(input[0,:])
mean_y = np.mean(input[1,:])
mean_z = np.mean(input[2,:])
scaled_vector = np.array([input[0,:]-[mean_x],input[1,:]-[mean_y],input[2,:]-[mean_z]])
data=np.vstack((scaled_vector)).T
scatter_matrix=np.dot(np.transpose(data),data)
eig_val, eig_vec = np.linalg.eig(scatter_matrix)
eig_pairs = [(np.abs(eig_val[i]), eig_vec[:,i]) for i in range(len(eig_val))]
eig_pairs.sort(reverse=True)
feature=eig_pairs[0][1][2]
new_data_reduced=np.dot(data,np.transpose(feature))
print(new_data_reduced)
I also used sklearn.decomposition.PCA as a verification.
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-5, 10,2], [-2, -3,3], [-4, -9,1], [7, 11,-3], [12, 6,-1], [13, 4,5]])
pca = PCA(n_components=1)
newX = pca.fit_transform(X)
print (newX)
The result from sklearn is:
[[ 1.81922968]
[ 8.34080915]
[ 13.64517202]
[ -8.17114609]
[ -8.37254693]
[ -7.26151783]]
I am not sure whether this result is correct. However, when I use my own PCA, the results are extremely different. How can I correct my code?

First, you subtract the mean along rows instead of columns. Then, after computing the eigenvectors, you perform several steps that are unnecessary for PCA. A reduced version of your code is:
import numpy as np
input = np.array([[-5, 10, 2], [-2, -3, 3], [-4, -9, 1],
                  [7, 11, -3], [12, 6, -1], [13, 4, 5]])
# center each column (feature), not each row
data = input - np.mean(input, axis=0)
# 6 x 6 Gram matrix of the centered samples
scatter_matrix = np.dot(data, data.T)
eig_val, eig_vec = np.linalg.eig(scatter_matrix)
# project onto the first eigenpair (np.linalg.eig does not sort eigenvalues,
# but here the first one happens to be the largest)
new_reduced_data = np.sqrt(eig_val[0]) * eig_vec.T[0].reshape(-1, 1)
print(new_reduced_data)
which seems to give the correct result.
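For comparison, here is a minimal sketch of the more conventional covariance-matrix route: center the data, take the eigenvectors of data.T @ data, and project onto the eigenvector of the largest eigenvalue. The variable names are my own, and the output should match the sklearn result up to an overall sign.
import numpy as np
X = np.array([[-5, 10, 2], [-2, -3, 3], [-4, -9, 1],
              [7, 11, -3], [12, 6, -1], [13, 4, 5]])
data = X - X.mean(axis=0)
# 3 x 3 scatter (unnormalized covariance) matrix
scatter = data.T @ data
eig_val, eig_vec = np.linalg.eigh(scatter)  # eigh returns eigenvalues in ascending order
top_component = eig_vec[:, -1]              # eigenvector of the largest eigenvalue
projected = data @ top_component            # one value per sample, shape (6,)
print(projected.reshape(-1, 1))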

Related

How can you make a tensorflow model from data that contains an array next to the other elements?

Let's assume we have such x_data to our array
import numpy as np
import tensorflow as tf
x_data = np.array([[ [1,2] , [3,4] , [5,6] ], [ [7,8], [9,10], [11,12] ]])
then the input_shape of x_data is (3, 2); that is clear, so let's step further.
We want to add an additional array of n elements to each of the innermost elements, which will look as follows:
import numpy as np
import tensorflow as tf
x_data = np.array([[ [1,2,[n elements]] , [3,4,[...]] , [5,6,[...]] ], [ [7,8,[...]], [9,10,[...]], [11,12,[...]] ]])
Is this possible? How do I make the right keras.model with this data? What is the right shape of the input data?
If this is not possible, then how should we approach such a problem?
You can use the numpy function concatenate, which allows you to concatenate two arrays along an axis.
Find more info in the official documentation.
In your case:
import numpy as np
x_data = [[1, 2], [3, 4], [5, 6]]
z_data = [[10, 20, 30, 40], [50, 60, 70, 80], [100, 200, 300, 400]]
ar = np.concatenate((x_data, z_data), axis=1)  # shape (3, 6)
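Once every sample is a flat feature vector like this, the input shape of the model is simply the number of columns. Here is a minimal sketch of how the wiring could look; the layer sizes, targets, and loss are hypothetical and only illustrate the shape handling.
import numpy as np
import tensorflow as tf
x_data = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)
z_data = np.array([[10, 20, 30, 40], [50, 60, 70, 80], [100, 200, 300, 400]], dtype=np.float32)
features = np.concatenate((x_data, z_data), axis=1)     # shape (3, 6)
targets = np.array([0.0, 1.0, 0.0], dtype=np.float32)   # hypothetical labels
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(features.shape[1],)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(features, targets, epochs=2, verbose=0)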

Pick random samples from a 2D matrix and keep the indexes in python

I have a numpy 2D matrix with data in python and I want to perform downsampling by keeping 25% of the initial samples. In order to do so, I am using the following random.randint call:
reduced_train_face = face_train[np.random.randint(face_train.shape[0], size=300), :]
However, I also have a second matrix which contains the labels associated with the faces, and I want to reduce it in the same way. How can I keep the indexes from the reduced matrix and apply them to the train_lbls matrix?
You can fix the seed just before applying your extraction:
import numpy as np
# Each label corresponds to the first element of each line of face_train
labels_train = np.array(range(0,15,3))
face_train = np.array(range(15)).reshape(5,3)
np.random.seed(0)
reduced_train_face = face_train[np.random.randint(face_train.shape[0], size=3), :]
np.random.seed(0)
reduced_train_labels = labels_train[np.random.randint(labels_train.shape[0], size=3)]
print(reduced_train_face, reduced_train_labels)
# [[12, 13, 14], [ 0, 1, 2], [ 9, 10, 11]], [12, 0, 9]
With the same seed, both arrays will be reduced the same way.
Edit: I advise you to use np.random.choice(n_total_elem, n_reduce_elem, replace=False) in order to ensure that each sample is chosen at most once, instead of possibly picking the same sample twice.
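A minimal sketch of that choice-based variant on the same toy arrays; the shared index array is what keeps faces and labels aligned.
import numpy as np
labels_train = np.array(range(0, 15, 3))
face_train = np.array(range(15)).reshape(5, 3)
idx = np.random.choice(face_train.shape[0], size=3, replace=False)  # unique row indexes
reduced_train_face = face_train[idx, :]
reduced_train_labels = labels_train[idx]
print(reduced_train_face, reduced_train_labels)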
Why don't you keep the selected indexes and use them to select data from both matrices?
import numpy as np
# setting up matrices
np.random.seed(1234) # make example repeatable
# the seeding is optional, only for showing the
# same results as below!
face_train = np.random.rand(8,3)
train_lbls= np.random.rand(8)
print('face_train:\n', face_train)
print('labels:\n', train_lbls)
# Setting the random indexes
random_idxs= np.random.randint(face_train.shape[0], size=4)
print('random_idxs:\n', random_idxs)
# Using the indexes to slice the matrixes
reduced_train_face = face_train[random_idxs, :]
reduced_labels = train_lbls[random_idxs]
print('reduced_train_face:\n', reduced_train_face)
print('reduced_labels:\n', reduced_labels)
Gives as output:
face_train:
[[ 0.19151945 0.62210877 0.43772774]
[ 0.78535858 0.77997581 0.27259261]
[ 0.27646426 0.80187218 0.95813935]
[ 0.87593263 0.35781727 0.50099513]
[ 0.68346294 0.71270203 0.37025075]
[ 0.56119619 0.50308317 0.01376845]
[ 0.77282662 0.88264119 0.36488598]
[ 0.61539618 0.07538124 0.36882401]]
labels:
[ 0.9331401 0.65137814 0.39720258 0.78873014 0.31683612 0.56809865
0.86912739 0.43617342]
random_idxs:
[1 7 5 4]
reduced_train_face:
[[ 0.78535858 0.77997581 0.27259261]
[ 0.61539618 0.07538124 0.36882401]
[ 0.56119619 0.50308317 0.01376845]
[ 0.68346294 0.71270203 0.37025075]]
reduced_labels:
[ 0.65137814 0.43617342 0.56809865 0.31683612]

Can sklearn.decomposition.TruncatedSVD be applied to a matrix in parts?

I am applying sklearn.decomposition.TruncatedSVD to very large matrices. If the matrix is above a certain size (say 350k by 25k), svd.fit(x) runs out of RAM.
I am applying svd to feature matrices, where each row represents a set of features extracted from a single image.
To work around the memory issues, is it safe to apply svd to parts of the matrix (and then concatenate)?
Will the result be the same? I.e.:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=128)
part_1 = svd.fit_transform(features[0:100000, :])
part_2 = svd.fit_transform(features[100000:, :])
svd_features = np.concatenate((part_1, part_2), axis=0)
.. equivalent to(?):
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=128)
svd_features = svd.fit_transform(svd_features)
If not, is there a workaround for dim reduction of very large matrices?
The results will not be the same.
For example, consider the code below:
import numpy as np
from sklearn.decomposition import TruncatedSVD
features = np.array([[3, 2, 1, 3, 1],
                     [2, 0, 1, 2, 2],
                     [1, 3, 2, 1, 3],
                     [1, 1, 3, 2, 3],
                     [1, 1, 2, 1, 3]])
svd = TruncatedSVD(n_components=2)
part_1 = svd.fit_transform(features[0:2, :])
part_2 = svd.fit_transform(features[2:, :])
svd_features = np.concatenate((part_1, part_2), axis=0)
svd_b = TruncatedSVD(n_components=2)
svd_features_b = svd_b.fit_transform(features)
print(svd_features)
print(svd_features_b)
This prints
[[ 4.81379561 -0.90959982]
[ 3.36212985 1.30233746]
[ 4.70088886 1.37354278]
[ 4.76960857 -1.06524658]
[ 3.94551566 -0.34876626]]
[[ 4.17420185 2.47515867]
[ 3.23525763 0.9479915 ]
[ 4.53499272 -1.13912762]
[ 4.69967028 -0.89231578]
[ 3.81909069 -1.05765576]]
which are different from each other.
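As a sketch of one common workaround (not part of the original answer): fit the decomposition once, on the full matrix or on a subsample of rows if the full fit does not go through memory, and then transform the rows in chunks, so every chunk is projected onto the same fitted components. The array below is a toy stand-in for the large feature matrix.
import numpy as np
from sklearn.decomposition import TruncatedSVD
rng = np.random.RandomState(0)
features = rng.rand(10, 5)          # toy stand-in for the large matrix
svd = TruncatedSVD(n_components=2)
svd.fit(features)                   # fit once (optionally on a row subsample)
part_1 = svd.transform(features[:5, :])
part_2 = svd.transform(features[5:, :])
svd_features = np.concatenate((part_1, part_2), axis=0)
# identical (up to floating point) to svd.transform(features)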

Standardizing X different in Python Lasso and R glmnet?

I was trying to get the same result when fitting a lasso model with Python's scikit-learn and R's glmnet. A helpful link
If I specify normalize=True in Python and standardize=T in R, they give me the same result.
Python:
from sklearn.linear_model import Lasso
X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha =0.01, fit_intercept = True, normalize =True)
reg.fit(X, y)
np.hstack((reg.intercept_, reg.coef_))
Out[95]: array([-0.89607695, 0. , -0.24743375, 1.03286824])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = T)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.8960770
V1 .
V2 -0.2474338
V3 1.0328682
However, if I don't want to standardize the variables and set normalize=False and standardize=F, they give me quite different results.
Python:
from sklearn.linear_model import Lasso
Z = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha =0.01, fit_intercept = True, normalize =False)
reg.fit(Z, y)
np.hstack((reg.intercept_, reg.coef_))
Out[96]: array([-0.88 , 0.09384212, -0.36159299, 1.05958478])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = F)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.76000000
V1 0.04441697
V2 -0.29415542
V3 0.97623074
What's the difference between "normalize" in Python's Lasso and "standardize" in R's glmnet?
Currently, with regard to the normalize parameter, the docs state: "If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False."
So evidently normalize and standardize are not the same thing for sklearn.linear_model.Lasso. Having read the StandardScaler docs I fail to pin down the difference, but the fact that there is one is implied by the description of the normalize parameter.
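As an aside that is not part of the original answer, my reading of the sklearn documentation is that normalize=True centers each column and divides it by its l2-norm, whereas StandardScaler divides by the standard deviation; since the l2-norm of a centered column equals its standard deviation times sqrt(n_samples), the two rescalings differ by a sample-size-dependent factor. Here is a minimal sketch of the route the docs recommend, scaling explicitly with StandardScaler and then fitting Lasso; note that the coefficients are then expressed on the scaled features, so they will not numerically match either output above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
# scale each column to zero mean and unit variance, then fit on the scaled data
X_scaled = StandardScaler().fit_transform(X)
reg = Lasso(alpha=0.01, fit_intercept=True)
reg.fit(X_scaled, y)
print(np.hstack((reg.intercept_, reg.coef_)))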

Interpolate Table from txt

I'm fairly new to Python programming and I'm trying to write a program that plots a graph from a txt file and then interpolates the data.
To get the data, I know that I can use:
precos = np.genfromtxt('Precos.txt', delimiter=',')
or
precos = sp.loadtxt("Precos.txt", delimiter=",")
And the data is something simple like:
1, 69.00
2, 69.00
3, 69.00
4, 69.00
5, 69.00
6, 69.00
7, 69.00
8, 79.00
9, 56.51
10, 56.51
I also know that I can use
plt.plot(precos)
to plot the graph, but I don't know how to interpolate. I saw that sp.interpolate.interp1d can help, but I am still unable to get my head around it.
----EDIT----
Ok, I tried a new approach, and now my code is almost done, but I am still getting one error.
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt
## Import the data into an Nx2 matrix
M = sp.loadtxt('Precos.txt', delimiter=',')
## Build the x and y vectors
x = np.zeros(len(M))
y = np.zeros(len(M))
for i in range(len(M)):
    x[i] = M[i][0]
    y[i] = M[i][1]
## Graph plot
plt.plot(x, y)
plt.title("Fone de Ouvido JBL com Microfone T100A - Fevereiro 2017")
plt.xlabel("Dia")
plt.ylabel("Preco em R$")
## Interpolation
F = sp.interpolate.interp1d(x, y)
xn = sp.arange(0, 9, 0.1)
yn = F(xn)
plt.plot(x, y, 'o', xn, yn, '-')
plt.show()
But now I am getting: ValueError: A value in x_new is below the interpolation range.
Any ideas?
sp.interpolate.interp1d generates a function that you can reuse to interpolate the original data at intermediate points. Here's some specific code to breathe some life into it:
import numpy as np
from scipy import interpolate
data = np.array([[1, 69.00],
                 [2, 69.00],
                 [3, 69.00],
                 [4, 69.00],
                 [5, 69.00],
                 [6, 69.00],
                 [7, 69.00],
                 [8, 79.00],
                 [9, 56.51],
                 [10, 56.51]])
x = data[:, 0]
y = data[:, 1]
# Define an interpolation function
interpolation_function = interpolate.interp1d(x, y, kind='linear')
# Define intermediate points to interpolate at, and print the result
xi = [1, 1.5, 2.5, 9.5]
print(interpolation_function(xi))
gives the result:
[ 69. 69. 69. 56.51]
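Regarding the ValueError in the edit: by default interp1d only evaluates inside the range of the original x values, and xn = sp.arange(0, 9, 0.1) starts at 0 while the data start at day 1. A minimal sketch of a fix, keeping the evaluation points inside [min(x), max(x)] and assuming the same days and prices as above:
import numpy as np
from scipy import interpolate
x = np.arange(1, 11)                       # days 1..10, as in the data above
y = np.array([69.00, 69.00, 69.00, 69.00, 69.00,
              69.00, 69.00, 79.00, 56.51, 56.51])
F = interpolate.interp1d(x, y)
xn = np.arange(x.min(), x.max(), 0.1)      # stays inside the interpolation range
yn = F(xn)
print(yn[:5])
Alternatively, recent scipy versions accept fill_value="extrapolate" in interp1d if values outside the data range are really needed.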
