PCA().fit() is using the wrong axis for data input - python

I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a csv file using pandas. I then scale the data
training_data = StandardScaler().fit_transform(training[:,1:4096])
before calling the PCA algorithm to obtain the variance for each dimension,
pca = PCA(n_components)
pca.fit(training_data).
The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.
My code looks like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))
# ID Number = [0]
# features = [1:4096]
training_data = StandardScaler().fit_transform(training[:,1:4096])
test_data = StandardScaler().fit_transform(test[:,1:4096])
training_labels = training[:,4609]
pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_
I have tried taking the transpose of training_data, but this didn't change the output. I have also tried changing n_components in the argument of the PCA function, but it is insistent that there can only be 247 dimensions.
This may be a stupid question, but I'm very new to this sort of data processing. Thank you.

You said:
" but it should have length 4095 so that I can work out the variance of
each dimension, not the variance of each data point."
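No. That would only be true if 4095 components could be estimated, e.g. with pca = PCA(n_components=4095). With only 247 samples they cannot: at most min(n_samples, n_features) components exist, which is why changing n_components did not help.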
On the other hand, you define:
pca = PCA() # this is actually PCA(n_components=None)
so n_components is set to None.
When this happens we have (see the documentation here):
n_components == min(n_samples, n_features)
Thus, in your case, you have min(247, 4095) = 247 components.
So pca.explained_variance_ will be a vector of length 247, since you have 247 PC dimensions.
Why do we have n_components == min(n_samples, n_features) ?
This is related to the rank of the covariance/correlation matrix. For a data matrix X with shape [247, 4095], the covariance/correlation matrix has shape [4095, 4095], but its rank is at most min(n_samples, n_features). (After mean-centering, the rank is in fact at most n_samples - 1, so the last of the 247 components carries essentially zero variance.) Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
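The point above can be checked on a small example. This is a minimal sketch, assuming random stand-in data with 20 samples and 50 features (not the asker's data), and all variable names are illustrative. It also shows that the per-feature variance the question asks about can be computed directly, without PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the real data: fewer samples than features.
rng = np.random.default_rng(0)
n_samples, n_features = 20, 50
X = rng.normal(size=(n_samples, n_features))

pca = PCA()  # n_components=None -> min(n_samples, n_features)
pca.fit(X)

print(pca.n_components_)              # number of components kept
print(pca.explained_variance_.shape)  # one variance per component
print(pca.components_.shape)          # (n_components, n_features)

# Per-feature variance is a different quantity and needs no PCA.
feature_variance = X.var(axis=0, ddof=1)
print(feature_variance.shape)         # one variance per original feature
```

Each row of components_ still has one entry per original feature, so the contribution of every feature to a component remains available even though there are only as many components as samples.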

Related

Scaling the target variable is giving error in Python using StandardScaler of Sklearn library

Scaling the target variable the usual way with the StandardScaler class gives an error. The error went away after adding the line y = y.reshape(-1,1), after which fit_transform on the target variable returned the standardized values. I can't figure out why adding y.reshape(-1,1) made it work.
X is the independent variable, with one feature, and y is the numerical target variable 'Salary'. I was trying to apply Support Vector Regression to the problem, which needs explicit feature scaling. I tried the following code:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
It gave me an error like:
ValueError: Expected 2D array, got 1D array instead Reshape your data
either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample.
After I made the following changes:
X = sc_X.fit_transform(X)
y = y.reshape(-1,1)
y = sc_y.fit_transform(y)
The standardization worked just fine. I need to understand how adding this y = y.reshape(-1,1) helped achieve it.
Thanks.
In short, yes, you need to reshape it. As per the sklearn documentation, fit_transform expects X to be a 2-D array of shape (n_samples, n_features), which makes sense given what it is used for. A 1-D array is ambiguous: it could be one sample with n features or n samples with one feature, so sklearn raises the error you saw instead of guessing. Perhaps the code below will make this clearer:
In [1]: x_arr
Out[1]: array([1, 2, 3, 4, 5]) # 1-D, shape (5,): samples or features? ambiguous
In [2]: x_arr.reshape(-1,1)
Out[2]:
array([[1], # 1st sample
       [2], # 2nd sample
       [3], # 3rd sample
       [4], # 4th sample
       [5]]) # 5th sample
Anyway, on how you use the StandardScaler (unrelated to why your code produces the error, which is answered above): what you want is to use the same fitted StandardScaler throughout your data. Generally speaking, scaling the target variable isn't necessary, since it's the variable you'd like to predict, not a predictor (assuming y in your code is the target variable).
First, you'd like to store the mean and standard deviation of your training data, to be used later for scaling the test data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Here the scaler will learn the mean and std of train data
x_train_scaled = scaler.fit_transform(x_train)
# Use here to transform test data
# This ensures both the train and test data are in the same scale
x_test_scaled = scaler.transform(x_test)
Hope this helps!
This comes up a lot in SKLearn.
From the docs of the scaler's .transform function, the input to .transform has to be a 2D matrix where the second dimension is the number of features:
Perform standardization by centering and scaling
Parameters: X : array-like, shape [n_samples, n_features] The data
used to scale along the features axis.
Now, the last dimension has to be explicitly set to 1, not missing. Before the data is reshaped (i.e. y=y.reshape(-1,1)), the last dimension is missing - see this example:
import numpy as np
a = np.array([0,0,0])
print(a) # [0 0 0]
print(a.shape) # (3,)
b = a.reshape(-1,1)
print(b) # [[0] [0] [0]]
print(b.shape) # (3,1)
The reshape method returns the array's data in a new shape: for example, if a is an array with 6 elements (of whatever shape), a.reshape(3,2) returns it as a 3-by-2 array. The -1 argument basically means "use whatever size is needed here so that the data fits". So a.reshape(-1,1) turns an array with n elements into an n-by-1 2-D array (without explicitly specifying n).
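To tie this back to the scaler, here is a minimal sketch, assuming a made-up 1-D array of salary values (the numbers and variable names are illustrative, not from the question). It shows the 1-D input being rejected, the reshaped input being accepted, and inverse_transform recovering the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical salary targets, stored as a 1-D array of shape (5,).
y = np.array([45000.0, 50000.0, 60000.0, 80000.0, 110000.0])

scaler = StandardScaler()

# A 1-D input is rejected because its orientation is ambiguous.
try:
    scaler.fit_transform(y)
    raised = False
except ValueError:
    raised = True

# One column, one row per sample: shape (5, 1).
y_2d = y.reshape(-1, 1)
y_scaled = scaler.fit_transform(y_2d)

# Undo the scaling and flatten back to 1-D.
y_restored = scaler.inverse_transform(y_scaled).ravel()

print(raised, y_scaled.shape, y_restored.shape)
```

If the target is scaled for training, predictions come back on the scaled axis, so the same fitted scaler's inverse_transform is what maps them back to the original units.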

CNN with vector output and 2D image graph input (input is an array)

I am trying to create a CNN in Keras (Python 3.7) which ingests a 2D matrix input (much like a grayscale image) and outputs a 1 dimensional vector. So far I did manage to get results, but I am not sure if what I am doing is correct (or if my intuition is).
I input a 100x50 array into my convolutional layer. This 2D array holds the peak information at every position (ie. x axis pertains to the position, y-axis pertains to the frequency, and each cell gives the intensity). The 3D graph of this shows something akin to the one given in this link.
From the (all of the) literature I have read, I learned that CNN accepts image data--image is converted into pixel values and then repeatedly convolved and pooled to get the output. However, I am using a MatLab simulator to get my input data, and I have access to the raw 2D array containing information on the peak frequency at each point.
My intuition is this: if we normalize each cell and feed the information to the CNN, it will be as if I fed the normalized pixel values of the image to the CNN, since my raw 2D array also has height, width and depth=1, like an image.
Please enlighten me if my thinking is correct or wrong.
My code is as follows:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
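import tensorflow as tf
import keras
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization, Reshape
from keras.models import Model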
'''load sample input'''
BGS1 = pd.read_csv("C:/Users/strain1_input.csv")
BGS2 = pd.read_csv("C:/Users/strain2_input.csv")
BGS3 = pd.read_csv("C:/Users/strain3_input.csv")
BGS_ = np.array([BGS1, BGS2, BGS3]) #3x100x50 array
BGS_normalized = BGS_/np.amax(BGS_)
'''load sample output'''
BFS1 = pd.read_csv("C:/Users/strain1_output.csv")
BFS2 = pd.read_csv("C:/Users/strain2_output.csv")
BFS3 = pd.read_csv("C:/Users/strain3_output.csv")
BFS_ = np.array([BFS1, BFS2, BFS3]) #3x100
BFS_normalized = BFS_/50 #since max value for each cell is 50
#after splitting data into training, validation and testing sets,
output_nodes = 100
n_classes = 1
batch_size_ = 8 #so far, optimized for 8 batch size
epoch = 100
input_layer = Input(shape=(45,300,1))
conv1 = Conv2D(16,3,padding="same",activation="relu", input_shape =
(45,300,1))(input_layer)
pool1 = MaxPooling2D(pool_size=(2,2),padding="same")(conv1)
flat = Flatten()(pool1)
hidden1 = Dense(10, activation='softmax')(flat) #relu
batchnorm1 = BatchNormalization()(hidden1)
output_layer = Dense(output_nodes*n_classes, activation="softmax")(batchnorm1)
output_layer2 = Dense(output_nodes*n_classes, activation="relu")(output_layer)
output_reshape = Reshape((output_nodes, n_classes))(output_layer2)
model = Model(inputs=input_layer, outputs=output_reshape)
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', sample_weight_mode='temporal')
model.fit(train_X,train_label,batch_size=batch_size_,epochs=epoch)
predictions = model.predict(train_X)
What you did is exactly the strategy used to feed non-image data into 2D convolutional layers. As long as the model predicts correctly, what you did is correct. Just be aware that CNNs can perform poorly on non-image data, and there is a chance of overfitting. But then again, as long as it performs correctly, it's good.
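For the "height, width and depth=1" part of the question, the following is a minimal sketch of the array preparation only, assuming random stand-in data for three 100x50 intensity maps (the value range and names are hypothetical). Keras Conv2D layers in the default channels_last layout take input of shape (batch, height, width, channels), so a single-channel array needs an explicit trailing axis.

```python
import numpy as np

# Hypothetical stand-in for three simulated 100x50 intensity maps.
rng = np.random.default_rng(0)
maps = rng.uniform(0.0, 250.0, size=(3, 100, 50))

# Scale to [0, 1] by the global maximum, as in the question.
maps_normalized = maps / np.amax(maps)

# Add a trailing channel axis: (samples, height, width, channels=1).
cnn_input = maps_normalized[..., np.newaxis]

print(cnn_input.shape)
```

With data prepared this way, the shape given to Input would need to match cnn_input.shape[1:].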

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so its shape is (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict a value 1 timestep from the last data point. How do I reshape it into the 3D form for an LSTM model in Keras?
It would also be very helpful if a small sample program were written. There doesn't seem to be any example/tutorial where the input has more than one feature (and is also not NLP).
The first question you should ask yourself is :
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset :
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10 # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((5000,1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context,:])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy array
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end :
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
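The same windows can also be built without a Python loop. This is a minimal sketch, assuming NumPy 1.20 or later (for sliding_window_view) and the same random stand-in data as above; it is an alternative to the loop, not a replacement the answer itself proposes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

recording_length = 5000
n_features = 4
prediction_context = 10

# Stand-in data, as in the answer above.
rng = np.random.default_rng(0)
X_data = rng.random((recording_length, n_features))
to_predict = rng.random((recording_length, 1))

# All windows of length prediction_context along the time axis.
# The window axis comes last: (n_windows, n_features, prediction_context).
windows = sliding_window_view(X_data, prediction_context, axis=0)

# Reorder to (n_windows, timesteps, n_features) and drop the final window,
# which has no following value to predict.
X_train = windows.transpose(0, 2, 1)[:-1]

# The target for window i is the value right after it.
Y_train = to_predict[prediction_context:]

print(X_train.shape, Y_train.shape)
```

The result has the (samples, timesteps, features) layout that a Keras LSTM layer takes as input.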

Really confused with this numpy shape mismatch error

I was implementing a simple k-nearest-neighbor algorithm from sklearn on google cloud platform ml engine. I used a custom metric to calculate the distance between two input vectors so that the distance is the weighted sum of elements in the element-wise squared difference between two vectors. The code is below:
import os.path
from sklearn import neighbors
import numpy as np
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.python.lib.io import file_io
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('input_dir', 'input', 'Input Directory.')
flags.DEFINE_string('input_train_data','train_data','Input Training Data File Name.')
pickle_file = os.path.join(FLAGS.input_dir, FLAGS.input_train_data)
def mydist(x, y):
    return np.dot((x - y) ** 2, weight)
with file_io.FileIO(pickle_file, 'r') as f:
    save = pickle.load(f)
train_dataset, train_labels, valid_dataset, valid_labels = save['train_dataset'], save['train_labels'], save[
    'valid_dataset'], save['valid_labels']
train_data = train_dataset[:1000]
train_label = train_labels[:1000]
test_data = valid_dataset[:100]
weight = [1.0]* len(train_dataset[1])
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=lambda x, y: mydist(x, y))
knn.fit(train_data, train_label)
predict = knn.predict(test_data)
print(predict)
train_dataset is a numpy array of shape (86667,13) and valid_dataset has shape (8000,13). train_labels has shape (86667,1) and valid_labels has shape (8000,1). For some reason I got a dimension mismatch:
line 15, in mydist return np.dot((x - y) ** 2, weight) ValueError: shapes
(10,) and (13,) not aligned: 10 (dim 0) != 13 (dim 0)
Both x and y in the input of the custom metric should have size 13 yet somehow they have size 10. Can anyone explain what is wrong here?
You are taking the distance between the wrong terms. You cannot take the distance between the labels and the training features, since they have different dimensions. You need to calculate the distance between two feature points, say x1 and x2, not between a label and its feature point (say x1 and y1). Secondly, look at how you specify the metric when declaring the KNeighborsRegressor object. One way is to make a DistanceMetric object first and then pass it as metric; a user-defined function is wrapped under the 'pyfunc' metric name:
my_metric = DistanceMetric.get_metric('pyfunc', func=mydist)
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=my_metric)
Sklearn will itself take care of how the parameters are passed to the distance function. I am assuming that the weight variable is global so that your code functions correctly.
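A different route, not taken from the answer above, avoids a custom metric altogether. This is a minimal sketch, assuming non-negative weights and using made-up 13-feature vectors: the weighted sum of squared differences equals an ordinary squared Euclidean distance once each feature is multiplied by the square root of its weight.

```python
import numpy as np

# Hypothetical data with 13 features, matching the question's feature count.
rng = np.random.default_rng(0)
x = rng.normal(size=13)
y = rng.normal(size=13)
weight = np.linspace(0.5, 2.0, 13)  # illustrative non-negative weights

def mydist(x, y):
    # Weighted sum of element-wise squared differences, as in the question.
    return np.dot((x - y) ** 2, weight)

# Scaling every feature by sqrt(weight) turns the same quantity into an
# ordinary squared Euclidean distance.
scale = np.sqrt(weight)
squared_euclidean = np.sum((x * scale - y * scale) ** 2)

print(np.isclose(mydist(x, y), squared_euclidean))
```

Because squaring preserves the ordering of distances, rescaled features with the default Euclidean metric would select the same neighbours. With weights='distance', the neighbour weighting would then use the unsquared distance, so predictions could differ from the custom-metric version.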

scikit learn incremental pca confusion

I have a dataset with (n_samples, n_features) = (466000, 4338093).
I want to perform PCA on this data, so I am using scikit-learn's IncrementalPCA.
Since the data is huge, it is split into 466 chunks of 1000 samples each, i.e. each chunk has (n_samples, n_features) = (1000, 4338093). Each chunk is stored as a hickle file. Also, the matrix is in sparse format.
I have set the n_components of PCA to min(n_samples, n_features) which is 466000.
The following is how I do it for 2 chunks:
import hickle
from sklearn.decomposition import IncrementalPCA
import os
fv_list = list()
for file_name in os.listdir("PATH_TO_DIR"):
    if file_name.endswith(".hkl"):
        fv_list.append(os.path.join("PATH_TO_DIR", file_name))
data_shape = hickle.load(open(fv_list[0])).shape
ipca = IncrementalPCA(n_components=min(len(fv_list) * data_shape[0], data_shape[1]))
for each_chunk in fv_list:
    part = hickle.load(open(each_chunk))
    ipca.partial_fit(part.todense())
Now, when I call the partial fit method of the pca, for the 2nd iteration I get the following error:
ValueError: Number of input features has changed from 1000 to 466000 between calls to partial_fit! Try setting n_components to a fixed value.
I am confused about why this ValueError occurs. Is my approach wrong?
