scikit learn incremental pca confusion - python

I have a dataset with (n_samples, n_features) = (466000, 4338093).
I want to perform PCA on this data, so I am using scikit-learn's IncrementalPCA.
Since the data is huge, it is split into 466 chunks of 1000 samples each, i.e. each chunk has (n_samples, n_features) = (1000, 4338093). Each chunk is stored as a hickle file, and the matrices are in sparse format.
I have set n_components of the PCA to min(n_samples, n_features), which is 466000.
The following is how I do it for 2 chunks:
import hickle
from sklearn.decomposition import IncrementalPCA
import os

fv_list = list()
for file_name in os.listdir("PATH_TO_DIR"):
    if file_name.endswith(".hkl"):
        fv_list.append(os.path.join("PATH_TO_DIR", file_name))

data_shape = hickle.load(open(fv_list[0])).shape
ipca = IncrementalPCA(n_components=min(len(fv_list) * data_shape[0], data_shape[1]))

for each_chunk in fv_list:
    part = hickle.load(open(each_chunk))
    ipca.partial_fit(part.todense())
Now, when I call the partial_fit method of the PCA, I get the following error on the second iteration:
ValueError: Number of input features has changed from 1000 to 466000 between calls to partial_fit! Try setting n_components to a fixed value.
I am confused about why this ValueError occurs. Is my approach wrong?
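For reference, a hedged sketch of how such a chunked loop is often written, under the documented constraint that IncrementalPCA's n_components cannot exceed the number of samples in each batch passed to partial_fit (here 1000 per chunk). Whether this resolves the error above depends on the actual shapes stored in the hickle files; the path is a placeholder, as in the question:

import os

import hickle
from sklearn.decomposition import IncrementalPCA

CHUNK_DIR = "PATH_TO_DIR"  # placeholder path, as in the question
fv_list = sorted(
    os.path.join(CHUNK_DIR, f) for f in os.listdir(CHUNK_DIR) if f.endswith(".hkl")
)

# Each chunk holds 1000 samples, so n_components must be <= 1000 per partial_fit call.
chunk_samples = hickle.load(fv_list[0]).shape[0]
ipca = IncrementalPCA(n_components=chunk_samples)

for chunk_path in fv_list:
    part = hickle.load(chunk_path)     # sparse chunk of shape (1000, 4338093)
    ipca.partial_fit(part.toarray())   # densify one chunk at a time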

Related

How to do multiple inferencing on onnx(onnxruntime) similar to sklearn

I want to infer outputs against many inputs from an onnx model using onnxruntime in Python. One way is to use a for loop, but that seems like a trivial and slow method. Is there a way to do it the same way as in sklearn?
Single prediction on onnxruntime:
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40]]).astype(np.int64),
    input_name[1].name: np.array([[0]]).astype(np.int64),
    input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/multiple prediction in sklearn (depending on the size of x_test):
test_predictions = model.predict(x_test)
The best way is for the ONNX model to support batches. Based on the input you're providing, it may already do that. Your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. An example input with shape [2,1] (2 batches, 1 element per batch) would look like [[40],[50]].
I'm guessing that if you provide two batches' worth of input you'd get two outputs, so something like this
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40],[40]]).astype(np.int64),
    input_name[1].name: np.array([[0],[0]]).astype(np.int64),
    input_name[2].name: np.array([[0],[0]]).astype(np.int64)
})
May give output of
[array([[23],[23]], dtype=float32)]
Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd

# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

regr = pipeline.Pipeline(
    [("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)

# export model to onnx
initial_type = list(
    zip(
        dataset.feature_names,
        [FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
    )
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
    f: df_test[f].astype(np.float32).values.reshape(-1, 1)
    for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]

# compare results
regr.predict(X_test)
pred_onx.flatten()
I think the trickiest part is getting the input shape right for inference.
Since we specified FloatTensorType([None, 1]), each single input array must have shape (x, 1), where x is the batch size. Thus we need to reshape column values of shape (x,) into (x, 1).
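If the expected shapes are ever unclear, the session itself can be inspected. A small sketch, assuming the model.onnx file produced above:

import onnxruntime

sess = onnxruntime.InferenceSession("model.onnx")

# Each entry reports its name, expected shape (None marks a dynamic/batch dimension) and type
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print(out.name, out.shape, out.type)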

PCA().fit() is using the wrong axis for data input

I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a csv file using pandas. I then scale the data
training_data = StandardScaler().fit_transform(training[:,1:4096])
before calling the PCA algorithm to obtain the variance for each dimension,
pca = PCA(n_components)
pca.fit(training_data)
The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.
My code looks like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))

# ID Number = [0]
# features = [1:4096]
training_data = StandardScaler().fit_transform(training[:,1:4096])
test_data = StandardScaler().fit_transform(test[:,1:4096])
training_labels = training[:,4609]

pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_
I have tried taking the transpose of training_data, but this didn't change the output. I have also tried changing n_components in the argument of the PCA function, but it insists that there can only be 247 dimensions.
This may be a stupid question, but I'm very new to this sort of data processing. Thank you.
You said:
"but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point."
No. This would only be true if you estimated 4095 components using pca = PCA(n_components=4095).
On the other hand, you define:
pca = PCA() # this is actually PCA(n_components=None)
so n_components is set to None.
When this happens we have (see the documentation here):
n_components == min(n_samples, n_features)
Thus, in your case, you have min(247, 4095) = 247 components.
So pca.explained_variance_ will be a vector of length 247, since you have 247 PC dimensions.
Why do we have n_components == min(n_samples, n_features) ?
This is related to the rank of the covariance/correlation matrix. Having a data matrix X with shape [247,4095], the covariance/correlation matrix would be [4095,4095] with max rank = min(n_samples, n_features). Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
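A quick way to see this behaviour on random data (a small sketch, not the asker's dataset):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(247, 4095))   # more features than samples, as in the question

pca = PCA()                        # n_components=None
pca.fit(X)

print(pca.n_components_)               # 247 == min(n_samples, n_features)
print(pca.explained_variance_.shape)   # (247,)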

Scaling the target variable is giving error in Python using StandardScaler of Sklearn library

Scaling the target variable by the normal procedure of using the StandardScaler class gives an error. However, the error was resolved by adding the line y = y.reshape(-1,1), after which applying the fit_transform method on the target variable gave the standardized values. I am not able to figure out how adding y.reshape(-1,1) made it work.
X is the independent variable with one feature and y is the numerical target variable 'Salary'. I was trying to apply Support Vector Regression to the problem, which needs explicit feature scaling. I tried the following code:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
It gave me an error like:
ValueError: Expected 2D array, got 1D array instead.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
After I made the following changes:
X = sc_X.fit_transform(X)
y = y.reshape(-1,1)
y = sc_y.fit_transform(y)
The standardization worked just fine. I need to understand how adding this y = y.reshape(-1,1) helped achieve it.
Thanks.
In short, yes, you need to reshape it. This is because, as per the sklearn documentation, fit_transform expects X (the predictor variables) to consist of n_samples with n_features, which makes sense given what it is used for. If you supply only a 1-D array, the function will read it as one sample with n features. Perhaps the code below will make this clearer:
In [1]: x_arr
Out[1]: array([1, 2, 3, 4, 5])  # will be considered as 1 sample with 5 features

In [2]: x_arr.reshape(-1,1)
Out[2]:
array([[1],   # 1st sample
       [2],   # 2nd sample
       [3],   # 3rd sample
       [4],   # 4th sample
       [5]])  # 5th sample
Anyway, regarding how you use the StandardScaler (unrelated to your question about why your code produces an error, which is answered above), what you want to do is use the same StandardScaler throughout your data. Generally speaking, scaling the target variable isn't necessary, since it's the variable you'd like to predict, not a predictor (assuming y in your code is the target variable).
First, you'd want to store the mean and standard deviation of your training data, to be used later for scaling the test data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Here the scaler will learn the mean and std of train data
x_train_scaled = scaler.fit_transform(x_train, y_train)
# Use here to transform test data
# This ensures both the train and test data are in the same scale
x_test_scaled = scaler.transform(x_test)
Hope this helps!
This comes up a lot in SKLearn.
From the docs of the scaler's .transform function, the input to .transform has to be a 2D matrix where the second dimension is the number of features:
Perform standardization by centering and scaling.
Parameters: X : array-like, shape [n_samples, n_features]. The data used to scale along the features axis.
Now, the last dimension has to be explicitly set to 1, not missing. Before the data is reshaped (i.e. y = y.reshape(-1,1)), the last dimension is missing; see this example:
import numpy as np
a = np.array([0,0,0])
print(a) # [0 0 0]
print(a.shape) # (3,)
b = a.reshape(-1,1)
print(b) # [[0] [0] [0]]
print(b.shape) # (3,1)
The reshape method changes the shape of an array: for example, if a is an array with 6 elements (and whatever shape), a.reshape(3,2) changes its shape to 3-by-2. The -1 argument basically means "use whatever dimension is needed here so that the data fits". So, a.reshape(-1,1) turns an array with n elements into an n-by-1 2D array (without explicitly specifying n).
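Putting the two pieces together for the original SVR use case, a minimal sketch; the data values and SVR settings here are illustrative assumptions, not taken from the question:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Illustrative data: X has one feature, y is a 1-D target such as 'Salary'
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float)

sc_X = StandardScaler()
sc_y = StandardScaler()

X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y.reshape(-1, 1))  # reshape (n,) -> (n, 1) before scaling

svr = SVR(kernel="rbf")
svr.fit(X_scaled, y_scaled.ravel())              # SVR expects a 1-D target again

# Predictions come out on the scaled axis; map them back to the original units
y_pred = sc_y.inverse_transform(svr.predict(X_scaled).reshape(-1, 1)).ravel()
print(y_pred[:3])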

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so it has shape (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict a value 1 timestep after the last data point. How do I reshape it into the 3D form for an LSTM model in Keras?
Also, it would be much more helpful if a small sample program were written. There doesn't seem to be any example/tutorial where the input has more than one feature (and is also not NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np

recording_length = 5000
n_features = 4
prediction_context = 10  # Change here

# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((5000, 1))

# Make lists of training examples
X_in = []
Y_out = []

# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context, :])
    Y_out.append(to_predict[i+prediction_context])

# Convert them to numpy arrays
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
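To complete the picture with a small model, here is a hedged sketch that continues from the arrays built above; the layer size, optimizer, and training settings are arbitrary illustrative choices, not something prescribed by the answer:

from tensorflow import keras
from tensorflow.keras import layers

# X_train from above has shape (n_examples, prediction_context, n_features)
model = keras.Sequential([
    layers.LSTM(32, input_shape=(prediction_context, n_features)),
    layers.Dense(1),   # a single value, one timestep ahead
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, Y_train, epochs=10, batch_size=32)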

Really confused with this numpy shape mismatch error

I was implementing a simple k-nearest-neighbor algorithm from sklearn on Google Cloud Platform ML Engine. I used a custom metric to calculate the distance between two input vectors, so that the distance is the weighted sum of the element-wise squared differences between the two vectors. The code is below:
import os.path
from sklearn import neighbors
import numpy as np
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.python.lib.io import file_io

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('input_dir', 'input', 'Input Directory.')
flags.DEFINE_string('input_train_data', 'train_data', 'Input Training Data File Name.')

pickle_file = os.path.join(FLAGS.input_dir, FLAGS.input_train_data)

def mydist(x, y):
    return np.dot((x - y) ** 2, weight)

with file_io.FileIO(pickle_file, 'r') as f:
    save = pickle.load(f)

train_dataset, train_labels, valid_dataset, valid_labels = (
    save['train_dataset'], save['train_labels'], save['valid_dataset'], save['valid_labels'])

train_data = train_dataset[:1000]
train_label = train_labels[:1000]
test_data = valid_dataset[:100]

weight = [1.0] * len(train_dataset[1])

knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=lambda x, y: mydist(x, y))
knn.fit(train_data, train_label)
predict = knn.predict(test_data)
print(predict)
train_dataset is a numpy array of shape (86667,13) and valid_dataset has shape (8000,13). Train_labels has shape (86667,1) and valid_labels (8000,1). For some reason I got a dimension mismatch:
line 15, in mydist
    return np.dot((x - y) ** 2, weight)
ValueError: shapes (10,) and (13,) not aligned: 10 (dim 0) != 13 (dim 0)
Both x and y in the input of the custom metric should have size 13 yet somehow they have size 10. Can anyone explain what is wrong here?
You are taking the distance between the wrong terms. You cannot take the distance between the labels and the train features: these have two different dimensions. You need to calculate the distance between any two feature points, say x1 and x2, not between a label and its feature point (say x1 and y1). Secondly, when declaring the KNeighborsRegressor object, you have specified the parameter incorrectly. In the metric parameter you specify either a string or a DistanceMetric object. First you have to make a distance metric object and then pass it as metric. So, this is the correct way of calling your function:
from sklearn.metrics import DistanceMetric  # lived in sklearn.neighbors in older versions

my_metric = DistanceMetric.get_metric('pyfunc', func=mydist)
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=my_metric)
Sklearn will take care of how the parameters are passed to the distance function. I am assuming that the weight variable is global, so that your code works correctly.
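For reference, a small self-contained sketch of the weighted squared-difference metric on synthetic 13-feature data, passed as a plain callable (which sklearn also accepts for metric); the key point is that weight has one entry per feature, so its length matches the vectors handed to the metric:

import numpy as np
from sklearn import neighbors

rng = np.random.default_rng(0)
train_data = rng.normal(size=(1000, 13))   # mirrors the question's 13 features
train_label = rng.normal(size=1000)
test_data = rng.normal(size=(100, 13))

weight = np.ones(train_data.shape[1])      # one weight per feature

def mydist(x, y):
    # Weighted sum of element-wise squared differences
    return np.dot((x - y) ** 2, weight)

knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=mydist)
knn.fit(train_data, train_label)
print(knn.predict(test_data)[:5])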
