I'm currently trying to fit a Gaussian Process model to my data and have it predict some days ahead. I have reduced my ~10 features down to just 2 components via PCA in sklearn. So now I have PCA1 and PCA2. This was obtained by performing PCA on the training set (40%).
pca = PCA(n_components=2)
pca.fit(train_data)
PCAs = pca.transform(train_data)
PCA1 = PCAs[:,0]
PCA2 = PCAs[:,1]
where train_data is the dataframe with ~10 features and 50 rows, with StandardScaler() applied to it.
kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(x_days_train, PCA1)
y_pred, y_std = model.predict(x_days, return_std=True)
model.score(x_days_train, PCA1)
where x_days is the full 50 days and x_days_train is the first 20 days (0, 1, 2, ...). I get a score of 1.0. However, my predicted results look terrible (as per below): after the training data ends, the prediction just falls and then stagnates.
Not entirely sure what went wrong, but a couple guesses:
Since my data has no target variable, I applied PCA to all the features in the dataframe, which are supposed to be x variables, and then I used one of them as a y variable (the thing being predicted). Maybe this is an incorrect approach?
Following that, can PCA even be used as y_prediction?
Am I supposed to apply PCA to not just the training data, but also to the test data (apply fit_transform)?
I seem to be only using PCA1 and not PCA2 (nor a combination of the two). Should I use both? If so, how?
Would appreciate any help, thank you.
Since my data has no target variable, I applied PCA to all the features in the dataframe, which are supposed to be x variables, and then I used one of them as a y variable (the thing being predicted). Maybe this is an incorrect approach?
You are correct. PCA is meant to transform high-dimensional data into far fewer dimensions: the data is compressed but still retains as much of the original variance as possible. scikit-learn's transform() function does not accept a y variable at all. Instead, use the fit_transform() function, whose signature fit_transform(X, y=None) accepts both arguments but applies the transformation only to X and ignores y.
Following that, can PCA even be used as y_prediction?
PCA only transforms the data; the Gaussian Process Regression (GPR) is what makes the predictions.
Am I supposed to apply PCA to not just the training data, but also to
the test data (apply fit_transform)?
Yes, but transform the test data with the PCA already fitted on the training data (pca.transform) rather than calling fit_transform again on the test set.
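For example (a minimal sketch, reusing the pca instance fitted on train_data; test_data is a hypothetical hold-out frame scaled the same way as the training data):
# project the hold-out data with the already-fitted PCA; do not refit it here
test_PCAs = pca.transform(test_data)
test_PCA1 = test_PCAs[:, 0]
test_PCA2 = test_PCAs[:, 1]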
I seem to be only using PCA1 and not PCA2 (nor a combination of the
two). Should I use both? If so, how?
After using the fit_transform() method, which returns a single array of shape (n_samples, 2), split out the two components by column:
pcs = pca.fit_transform(train_data)
pca_x = pcs[:, 0]  # first principal component
pca_y = pcs[:, 1]  # second principal component
Then apply the data like this (the GPR expects a 2D X, hence the reshape):
kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(pca_x.reshape(-1, 1), pca_y)
Here are the scikit-learn user guides for PCA (https://scikit-learn.org/stable/modules/decomposition.html#pca) and GPR (https://scikit-learn.org/stable/modules/gaussian_process.html).
Related
I am trying to implement a machine learning algorithm which detects irregular ECG signals. I extracted some features, but I am not sure how to build the correct input for the classifier.
I have 20k different ECG signals, each with 1000 values. They are all labeled as correct or incorrect.
I chose, for example, the two features heart_rate and xposition_of_3_highest_peaks, but how do I feed them into the classifier?
Below is my attempt, but every time I add a second feature the score decreases. Why?
clf = svm.SVC()
#[64,70,48,89...74,58]
X_train_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
X_train_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate,X_train_3_peaks),axis =1)
X_te = np.concatenate((X_test_heartRate,X_test_3_peaks),axis =1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
I am not sure whether the StandardScaler().fit_transform is necessary, or whether the np.concatenate is correct. Maybe there is even a better classifier for this use case?
Sorry I am a complete beginner, please be kind :)
When you do any pre-processing transformation, you must take the process fitted on the training data and apply it to the validation / test data. This process must use the training data's statistics, because you are assuming the validation / test data come from that same distribution. Therefore, create an object that stores the transformation learned from the training data and apply it to the training and test data alike. Your decreased performance comes from not doing this: you are scaling the two datasets with separate means and standard deviations, which can produce out-of-distribution inputs if your sample size isn't large enough.
Therefore, call fit_transform on the training data, then just transform on the validation / test data. fit_transform finds the scaling parameters for each column and applies them to the input data in one step, returning the transformed data; transform assumes an already-fitted scaler (such as one produced by fit_transform) and applies that scaling. I sometimes like to separate the operations: a separate fit on the training data, then transform on the training and validation / test data afterwards. This is a common source of confusion for new practitioners. You also need to keep the scaler object around so you can apply it to your validation / test data later.
clf = svm.SVC()

# [64, 70, 48, 89, ..., 74, 58]
heartRate_scaler = StandardScaler()
X_train_heartRate = heartRate_scaler.fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = heartRate_scaler.transform(fe.get_avg_heart_rate(X_test))

# [[23, 56, 89], [24, 45, 78], ..., [21, 58, 90]]
three_peaks_scaler = StandardScaler()
X_train_3_peaks = three_peaks_scaler.fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = three_peaks_scaler.transform(fe.get_intervalls(X_test))

X_tr = np.concatenate((X_train_heartRate, X_train_3_peaks), axis=1)
X_te = np.concatenate((X_test_heartRate, X_test_3_peaks), axis=1)

clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("Real solution:", Y_test)
print(clf.score(X_te, Y_test))
Take note that you could also concatenate the features first and apply a single StandardScaler afterwards, because the scaler standardizes each feature/column independently: scaling the two sets of features separately and then concatenating is equivalent to concatenating first and then scaling.
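A minimal sketch of that equivalent ordering (the fe calls are the asker's own feature extractors; reshaping the 1D heart-rate output into a column is an assumption based on the comment in the question):
# make the 1D heart-rate feature a column so it can be concatenated with the 2D peak features
hr_tr = np.asarray(fe.get_avg_heart_rate(X_train)).reshape(-1, 1)
hr_te = np.asarray(fe.get_avg_heart_rate(X_test)).reshape(-1, 1)
peaks_tr = np.asarray(fe.get_intervalls(X_train))
peaks_te = np.asarray(fe.get_intervalls(X_test))

# concatenate the raw features first, then fit one scaler on the training data only
scaler = StandardScaler()
X_tr = scaler.fit_transform(np.concatenate((hr_tr, peaks_tr), axis=1))
X_te = scaler.transform(np.concatenate((hr_te, peaks_te), axis=1))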
Minor Note
I forgot to ask about the fe object. What is it doing under the hood? Does it use the training data in any way to produce your features? As with the scaling above, any statistics this object learns must come from the training data and then be applied unchanged to the test data; they must not be computed separately on each set. I assume it either applies the training data's statistics to both sets or is an independent, data-agnostic transformation; you haven't specified what it does under the hood, so I will assume the happy path.
Possible Improvement
Consider using a decision-tree-based algorithm like a Random Forest classifier, which does not require scaling of the input features: its job is to partition the feature space of your data into N-dimensional hyperrectangles, with N being the number of features in your dataset (for N=2 these are rectangles, for N=3 boxes, and so on). Depending on how your data is distributed, tree-based algorithms can do better, and they are among the first things to try in Kaggle competitions.
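A minimal sketch of that alternative, reusing the X_tr / X_te feature matrices built above (scaling does not change the tree splits, so the raw concatenated features would work just as well):
from sklearn.ensemble import RandomForestClassifier

# trees split on raw feature values, so no StandardScaler is required
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_tr, Y_train)
print("Random forest accuracy:", rf.score(X_te, Y_test))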
I am practicing with MNIST using sklearn.cluster.KMeans.
Intuitively, I just fit the training data to the sklearn estimator, but I got pretty low accuracy and I am wondering what step I have missed. Should I extract feature vectors with PCA first? Or should I use a larger n_clusters?
from sklearn import cluster
from sklearn.metrics import accuracy_score
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
y_pred=clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
I got a poor 0.137 as the result. Any recommendations? Thanks!
How are you passing the images in? Are the pixels flattened or kept in the 2D format? Are the pixel values normalized to the range 0-1?
Since you are running clustering, I would advise against PCA regardless and would opt for t-SNE instead, which preserves neighbourhood information; but you should not need either before running K-Means.
The best way to debug is to see what your fitted model is predicting as the clusters. You can see an example here:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
With this info, you can get an idea of where mistakes might be. Good luck!
Adding a note: K-Means is probably not the best model for your purposes anyway. It is best suited to unsupervised contexts, where you need to cluster unlabelled data, whereas MNIST is a classification use case. KNN would be a better option while still letting you experiment with the number of neighbours and similar parameters.
Here is an example I created with KNN: https://gist.github.com/andrew-x/0bb997b129647f3a7b7c0907b7e836fc
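A minimal sketch of that supervised route (assuming X_train / X_test are flattened pixel arrays in the 0-255 range and y_train / y_test are the digit labels):
from sklearn.neighbors import KNeighborsClassifier

# scale pixels to 0-1 and fit a simple KNN classifier on the labelled digits
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train / 255.0, y_train)
print("KNN accuracy:", knn.score(X_test / 255.0, y_test))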
Unless I'm missing something: you are comparing cluster labels, which are numbered 0-9 arbitrarily, against class labels whose numbering 0-9 is fixed. The zeros in your data might not land in the cluster numbered 0, yet that is the comparison accuracy_score makes. Clustering results are evaluated differently because of this. Some options for a correct evaluation (sketched below):
Generate a contingency matrix and plot it
Calculate the adjusted rand index
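A minimal sketch of both checks, using the cluster assignments y_pred and true labels y_test from the question:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# contingency matrix: rows are true digits, columns are cluster ids
print(contingency_matrix(y_test, y_pred))

# adjusted Rand index: 1.0 is a perfect clustering, around 0.0 is random labelling
print(adjusted_rand_score(y_test, y_pred))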
I am currently working on an image recognition project with machine learning.
The train set has 1600 images with size 300x300, so 90000 features per image.
To speed up training, I apply PCA with n_components = 50
The test set has 450 images, and I can test the model on this test set successfully.
Now I want to predict a single image captured by a webcam. The question is: should I apply PCA to that image?
If I don't apply PCA, I get ValueError: X.shape[1] = 90000 should be equal to 50, the number of features at training time
If I apply PCA, I get ValueError: n_components=50 must be between 0 and min(n_samples, n_features)=1 with svd_solver='full'
I use Python 3, scikit-learn 0.20.3, this is how I apply PCA:
from sklearn.decomposition import PCA
pca = PCA(50)
pca.fit_transform(features)
You need to apply PCA on your test set as well.
You need to consider what PCA does:
PCA constructs a new feature set (containing fewer features than the original feature space) and you then train on this new feature set. You need to construct this same new feature set for the test set for your model to be valid!
It's important to note that each feature in your 'reduced' feature set is a linear combination of the original features; for a given number of new features (n_components), they are the combinations that maximize how much of the original space's variance is preserved in the new space.
Practically to perform the relevant transformation on your test set, you need to do:
# X_test - your untransformed test set
X_test_reduced = pca.transform(X_test)
where pca is the PCA() instance fitted on your training set. Essentially, you are constructing a transformation to a lower-dimensional space, and you want this transformation to be the same for the training and test sets. If you fit a PCA independently on the training and test sets, you are (almost certainly) embedding the data into different low-dimensional representations and end up with different feature sets.
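For the single webcam frame in the question, the same idea applies; a minimal sketch (image is assumed to be the flattened 300x300 capture, and model is whatever classifier was trained on the reduced features):
# a single image must be reshaped to a 2D array with one row: (1 sample, 90000 features)
image_2d = image.reshape(1, -1)
image_reduced = pca.transform(image_2d)  # shape (1, 50) after the fitted PCA
prediction = model.predict(image_reduced)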
Yes, you need to apply PCA, following the principle of doing the same things to data during training and testing.
However, the key thing is that you must not "retrain"/re-fit the PCA. Use PCA's transform:
pca.transform(X_test) #where X_test is a collection of images for testing, should be similar to your features.
The idea being that fit_transform is a two-step process: fitting the PCA, then transforming the dataset accordingly.
Sort of taking inspiration from here.
My problem
So I have a dataset with 3 features and n observations. I also have n responses. Basically I want to see if this model is a good fit or not.
From the question above, people use R^2 for this purpose, but I am not sure I understand.
Can I just fit the model and then calculate the Mean Squared Error?
Should I use train/test split?
All of these involve prediction in some way, but here I just want to see how good the model is at fitting the data.
For instance this is my idea
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]  # same single-feature selection as in the split example below

# my idea: fit on all the data and compute the MSE on that same data
regr = linear_model.LinearRegression()
regr.fit(diabetes_X, diabetes.target)
print(np.mean((regr.predict(diabetes_X) - diabetes.target) ** 2))
However I often see people doing things like
diabetes_X = diabetes.data[:, np.newaxis, 2]
# split X
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# split y
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# instantiate and fit
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# MSE but based on the prediction on test
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2))
In the first case I get 3890.4565854612724, while in the second I get 2548.07. Which one is correct?
IMPORTANT: I WANT THIS TO WORK IN MULTIPLE REGRESSION, THIS IS JUST A MWE!
Can I just fit the model and then calculate the Mean Squared Error? Should I use train/test split?
No, you would run the risk of overfitting the model. That is the reason the data is split into train and test sets (or even an additional validation set): so that the model doesn't just 'memorize' what it sees but learns to perform well on new, unseen samples.
It is always preferable to evaluate the performance of the model on data that was not observed during training. If you are going to optimize hyper-parameters or choose among several models, an additional validation set is the right choice.
However, sometimes data is scarce and entirely removing some of it from the training process is prohibitive. In these cases, I strongly recommend more efficient ways of validating your models, such as k-fold cross-validation (see KFold and StratifiedKFold in scikit-learn).
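A minimal sketch of k-fold cross-validation on the diabetes data from the question (all features, since the asker wants multiple regression; cross_val_score reports one score per fold):
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
regr = linear_model.LinearRegression()

# 5-fold CV: each fold is held out once while the model is trained on the other folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(regr, diabetes.data, diabetes.target,
                         cv=cv, scoring='neg_mean_squared_error')
print("MSE per fold:", -scores)
print("Mean MSE:", -scores.mean())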
Finally, it is a good idea to ensure that your partitions behave similarly in the training and test sets. I recommend sampling the data uniformly over the target space, so that you train and validate your model on the same distribution of target values.
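One hedged way to do that with scikit-learn is to bin the continuous target and stratify the split on those bins (the quartile binning here is an arbitrary choice, not part of the original answer):
import numpy as np
from sklearn.model_selection import train_test_split

# bin the target into quartiles and stratify on the bins so that train and test
# see a similar distribution of target values
bins = np.digitize(diabetes.target, np.quantile(diabetes.target, [0.25, 0.5, 0.75]))
X_tr, X_te, y_tr, y_te = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, stratify=bins, random_state=0)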
I have a large dataset of lists of triplets of numbers, mostly integers, with a few hundred triplets per list.
[(50,100,0.5),(20,35,1.0),.....]
[(70,80,0.3),(30,45,2.0),......]
....
I'm looking at sklearn to write a simple generative model that learns the patterns in these data and generates a likely list of triplets, but my background is rather weak, which makes the documentation difficult to follow.
Is there example sklearn code doing a similar job that I can take a look at?
I agree that this question is probably more appropriate for the data science or statistics sites, but I'll take a stab at it.
First, I'll assume that your data is in a pandas dataframe; this is convenient for scikit-learn as well as other Python packages.
I would first visualize the data. Since you only have three dimensions, a three-dimensional scatter plot might be useful. For instance, see here.
Another useful way to plot the data is to use pair plots. The seaborn package makes this very easy. See here. Pair plots are useful because they show distributions of each of the variables/features, as well as correlations between pairs of features.
At this point, creating a generative model depends on what the plots tell you. If, for instance, all of the variables are independent of one another, then you simply need to estimate the pdf of each variable separately (for instance, using kernel density estimation, which is also implemented in seaborn), and then generate new samples by drawing a value from each of the three distributions and combining those values into a single tuple.
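A minimal sketch of that independent-variables case with scikit-learn's kernel density estimator (the data array is a placeholder for your (n_samples, 3) triplets, and the bandwidth is an arbitrary assumption you would tune):
import numpy as np
from sklearn.neighbors import KernelDensity

data = np.random.random((500, 3))  # placeholder for your triplets

# fit one 1D KDE per column, sample each column independently, and recombine into triplets
new_samples = np.column_stack([
    KernelDensity(bandwidth=0.1).fit(data[:, i:i + 1]).sample(10)
    for i in range(data.shape[1])
])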
If the variables are not independent, then the task becomes more complicated, and probably warrants a separate post on the statistics site. For instance, your samples could be generated from different clusters, possibly overlapping, in which case something like a mixture model might be useful.
Here is a small code example that does exactly that (discriminative model):
import numpy as np
from sklearn.linear_model import LinearRegression

# generate random NumPy arrays of shape (10, 3)
X_train = np.random.random((10, 3))
y_train = np.random.random((10, 3))
X_test = np.random.random((10, 3))

# define the regression
clf = LinearRegression()

# fit & predict (predict returns a NumPy array of the same dimensions as y_train)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
Otherwise here are more examples:
http://scikit-learn.org/stable/auto_examples/index.html
The generative model would be sklearn.mixture.GaussianMixture (available since scikit-learn 0.18).
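A minimal sketch of that generative route (the number of components is an assumption you would tune, e.g. with BIC):
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.random((500, 3))  # placeholder for your triplets

# fit a mixture of Gaussians to the joint distribution of the triplets,
# then draw new, likely triplets from the fitted model
gmm = GaussianMixture(n_components=5, random_state=0).fit(data)
new_triplets, _ = gmm.sample(20)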