PCA trained model and then back to original feature space

I am trying to predict a special movement with my smartphone. To do so, I developed an application that creates a dataset containing acceleration, gyroscope, magnetic field readings, etc.
The problem is that I don't really know which features are good ones.
That's why I tried to use PCA.
So far, no problems:
from sklearn.decomposition import PCA
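pca = PCA(0.95)  # keep 95% of the variance; I don't want to lose too much information
# ... split recorded data into train and test samples
pc_train = pca.fit_transform(data_train)  # fit the PCA on the training data only
pc_test = pca.transform(data_test)  # reuse the fitted PCA; do not refit it on the test data
and then fit a Random Forest, Ridge Regression, etc. on the transformed data.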
But now I have the problem that all my trained classifiers only work on PCA-transformed data.
This means I have to do PCA on my phone to make my intended prediction.
Is this the correct way to proceed, or have I missed something?
I thought of PCA as a one-time analysis tool.

First, I do not think it is always a good idea to fix the variance ratio at a static value like 0.95 from the beginning. Retaining as much information as possible (up to all the dimensions you originally have) sometimes does not lead to the best result or model, given that you are trying PCA here. I would try a series of variance ratios like:
import numpy as np
n_s = np.linspace(0.65, 0.85, num=21)
for n in n_s:
    pca = PCA(n_components=n)
    # ...
and look at the results. Then you can fix your variance ratio/number of components at the value that gives the highest accuracy in your model. This is an important point in ML. To your question: most likely you are not going to fit the PCA or even train the model on your phone; you are only going to use the resulting fitted model there. You want your training data set to be as large as your computing hardware allows (that generally leads to better accuracy), and that "superior" hardware cannot be your mobile phone.
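As a rough sketch of the sweep described above, the variance ratio can be tuned together with the classifier by putting both into one scikit-learn Pipeline and cross-validating over `pca__n_components`. The synthetic data, the candidate ratios and the choice of a random forest are assumptions made for illustration only; they are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the recorded sensor features (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# PCA and classifier are fitted together, on training data only.
pipe = Pipeline([
    ("pca", PCA(svd_solver="full")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Candidate variance ratios (hypothetical values for illustration).
ratios = [0.65, 0.75, 0.85, 0.95]
search = GridSearchCV(pipe, {"pca__n_components": ratios}, cv=3)
search.fit(X_train, y_train)

best_ratio = search.best_params_["pca__n_components"]
test_accuracy = search.score(X_test, y_test)

# Applying the fitted PCA to new data is only a mean shift and a matrix product.
fitted_pca = search.best_estimator_.named_steps["pca"]
manual = (X_test - fitted_pca.mean_) @ fitted_pca.components_.T
matches = np.allclose(manual, fitted_pca.transform(X_test))

print(best_ratio, fitted_pca.n_components_, round(test_accuracy, 3), matches)
```

The last lines illustrate why the fitted transform is cheap to reuse: with the default `whiten=False`, only `mean_` and `components_` need to be shipped to the device alongside the trained classifier.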

Related

Principal Component Analysis (PCA) vs. Extra Tree Classifier for Data Reduction

I have a dataset that consists of 13 columns and I wanted to use PCA for data reduction to remove unwanted columns. My problem is that PCA doesn't really show column names, only PC1, PC2, etc. I found out that an extra trees classifier does the same thing but does indicate the variation of each column. I just wanted to make sure whether they both have the same objective or whether they differ in their outcome. Also, would anyone suggest better methods for data reduction?
My last question: I have code for an extra trees classifier and wanted to confirm whether it is correct.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import ExtraTreesClassifier
df = pd.read_csv('D:\\Project\\database\\5-FINAL2\\Final After Simple Filtering.csv')
extra_tree_forest = ExtraTreesClassifier(n_estimators=500,
                                         criterion='entropy', max_features='auto')
extra_tree_forest.fit(df)
feature_importance = extra_tree_forest.feature_importances_
feature_importance_normalized = np.std([tree.feature_importances_ for tree in
                                        extra_tree_forest.estimators_],
                                       axis=0)
plt.bar(df.columns, feature_importance_normalized)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()
Thank You.

The two methods are very different.
PCA doesn't show you the feature names because dimensionality reduction with PCA doesn't really have anything to do with the relative importance of the features. PCA takes the original data and transforms it into a space where each new 'feature' (principal component) is uncorrelated with the others, and you can tell how important each principal component is to faithfully representing the data based on its corresponding eigenvalue. Removing the least important principal components reduces dimensionality in principal component space, but not in the original feature space - so you need to do PCA on all future data, too, and then perform all your classification on the (shortened) principal component vectors.
An extra tree classifier trains an entire classifier on your data, so it's much more powerful than just dimensionality reduction. However, it does seem closer to what you're looking for, since the feature importance does directly tell you how relevant each feature is when making a classification.
Note that in PCA, the principal components with the highest eigenvalues contribute the most to accurately reconstructing the data. This is not the same as contributing the most to accurately classifying the data. The extra tree classifier is the reverse: it tells you what features are most important when classifying the data, not when reconstructing it.
Basically, if you think you have a representative dataset right now and are comfortable only storing variables that are relevant to classifying the data you already have, dimensionality reduction with extra trees is a good choice for you. If you just want to faithfully represent the data with less space without being overly concerned about the effects on classification, PCA is the better choice. Dimensionality reduction with PCA will often also help remove irrelevant features from the original data, but that's not what it's optimized for.
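A minimal sketch of the difference, assuming scikit-learn's built-in wine dataset (13 columns) as a stand-in for the asker's CSV, which is not available here. Two details are worth noting: `fit` needs both the features and the class labels, and `feature_importances_` is already normalized to sum to 1, whereas the standard deviation across trees only measures how much the importances vary from tree to tree.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): 13 numeric features plus a class label.
data = load_wine()
X, y, names = data.data, data.target, list(data.feature_names)

# PCA: unsupervised, returns new axes (PC1, PC2, ...) rather than original columns.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)
X_reduced = pca.transform(X_scaled)  # the same transform is needed for all future data

# Extra trees: supervised, scores each original column by its use in classification.
forest = ExtraTreesClassifier(n_estimators=200, criterion="entropy", random_state=0)
forest.fit(X, y)  # features AND labels are required
importances = forest.feature_importances_  # already sums to 1

# Spread of the importances across the individual trees (useful as error bars).
spread = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

ranking = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
print("components kept by PCA:", pca.n_components_)
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```

Columns with low importance can then be dropped by name, which is not possible with the principal components.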

Do I need to extract feature vectors from MNIST before using Kmeans

I am practicing with MNIST by sklearn.cluster.KMeans.
Intuitively, I just fit the training data with the sklearn function, but I got pretty low accuracy. I am wondering what step I have missed. Should I extract feature vectors with PCA first? Or should I use a larger n_clusters?
from sklearn import cluster
from sklearn.metrics import accuracy_score
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
y_pred=clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
I got poor 0.137 as result. Any recommendation? Thanks!

How are you passing the images in? Are the pixels flattened or kept in the 2D format? Are the pixels normalized to between 0 and 1?
As you are running clustering, I would advise against PCA regardless and would instead opt for t-SNE, which keeps neighbourhood information, but you should not need to do so before running K-Means.
The best way to debug is to see what your fitted model is predicting as the clusters. You can see an example here:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
With this info, you can get an idea of where mistakes might be. Good luck!
Adding a note: K-Means is also probably not the best model for your purposes. It is best suited to unsupervised contexts where you want to cluster data, whereas MNIST is a classification use case. KNN would be a better option while still allowing you to experiment with neighbours and such.
Here is an example I created with KNN: https://gist.github.com/andrew-x/0bb997b129647f3a7b7c0907b7e836fc

Unless I'm missing something: you are comparing clustering labels, which are arbitrarily numbered 0-9, to true labels, whose numbers 0-9 are meaningful. The 0s in your clustering might not end up in cluster number 0, yet this is the comparison you make. Clustering results are evaluated differently because of this. Some options to get a correct evaluation:
Generate a contingency matrix and plot it
Calculate the adjusted rand index
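The two options above can be sketched as follows. This is only an illustration under assumptions: it uses scikit-learn's small 8x8 digits dataset instead of MNIST so that it runs quickly, and it adds one extra step that is not part of the answer, namely mapping each cluster to a digit with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) so that an accuracy-like number can still be reported.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Small stand-in for MNIST (assumption): 8x8 digit images, pixel values 0..16.
X, y_true = load_digits(return_X_y=True)
kmeans = KMeans(init="k-means++", n_clusters=10, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X / 16.0)

# Naive comparison: treats arbitrary cluster numbers as if they were digit labels.
raw_accuracy = accuracy_score(y_true, cluster_ids)

# Option 1: contingency matrix (rows = true digits, columns = cluster ids).
table = contingency_matrix(y_true, cluster_ids)

# Option 2: adjusted Rand index, which does not depend on cluster numbering.
ari = adjusted_rand_score(y_true, cluster_ids)

# Extra step: best one-to-one mapping from clusters to digits.
digit_idx, cluster_idx = linear_sum_assignment(-table)
cluster_to_digit = dict(zip(cluster_idx, digit_idx))
mapped_labels = np.array([cluster_to_digit[c] for c in cluster_ids])
mapped_accuracy = accuracy_score(y_true, mapped_labels)

print(table)
print(f"raw accuracy:    {raw_accuracy:.3f}")
print(f"adjusted Rand:   {ari:.3f}")
print(f"mapped accuracy: {mapped_accuracy:.3f}")
```

Because the identity mapping is one of the candidate assignments, the mapped accuracy can never be lower than the naive one.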

I think using train_test_split to sample a large data set and then use cross_validation on the sample may be wrong. agree?

I have been trying to solve DAT102x: Predicting Mortgage Approvals From Government Data for a couple of months.
My goal is to understand the pieces of a classification problem, not to rank at the top.
However, I found something that is not clear to me:
I get almost the same performance from a cross-validated model based on a sample (accuracy = 0.69) as I get when scoring this model on the whole dataset (accuracy = 0.69).
BUT when I submit predictions for the competition dataset I get a "beautiful" 0.5.
It sounds like an overfitting problem,
but I assume that an overfitting problem would be spotted by CV...
The only logical explanation I have is that the CV fails because it is based on a sample that I created using the "train_test_split" function.
In other words: because I sampled this way, my sample has become a kind of FRACTAL: whatever sub-sample I create, it is always a very precise reproduction of the population.
So the CV "fails" to detect overfitting.
OK. I hope I have been able to explain what is going on.
(By the way, if you wonder why I do not check this by running on the full population: I am using an HP Core Duo 2.8 Mhz 8 RAM... it takes forever...)
Here are the steps of my code:
0) prepare the dataset (NaN, etc.) and transform everything into categorical (numerical --> binning)
1) use train_test_split to sample 12,000 records out of the 500K dataset
2) encode (selected) categorical features with OHE
3) reduce features via PCA
4) perform CV to identify the best Log_reg hyperparameter "C" value
5) split the sample using train_test_split, holding out 2,000 records out of 12,000
6) build a Log_reg model based on Xtrain, y_train (accuracy: 0.69)
7) score the whole dataset with the log_reg model (accuracy: 0.69)
8) run the whole competition dataset through the log_reg model
9) get a great 0.5 accuracy result...
The only other explanation I have is that the features I selected are somehow "overridden" in the competition dataset by those I left out.
(You know: the competition guys are there to make us sweat...)
Here, too, I have a "hardware problem" when it comes to crunching the numbers.
If anyone has any clue about this, I am happy to learn.
Thanks a lot.
I also tried Random Forest but I get the same problem. In this case I understand that OHE is not something the sklearn RF model loves: it jeopardizes the model and suppresses valuable features with "many" categories.
Happy to share if requested.
I would expect one of these two:
either a model that performs poorly on the whole dataset,
or a model that has comparable (0.66?) performance on the competition dataset.
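For reference, steps 2 to 8 above can be sketched as a single scikit-learn Pipeline, so that the one-hot encoder and the PCA are fitted only on training folds and exactly the same fitted transformers are reused on any later dataset. This is a sketch under assumptions, not the asker's code: the data is synthetic, and the helper `to_dense`, the column counts and the grid of `C` values are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def to_dense(X):
    """Convert a sparse one-hot matrix to a dense array so PCA accepts it."""
    return X.toarray() if hasattr(X, "toarray") else X


rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 6 binned categorical columns with 4 levels each.
X_sample = rng.integers(0, 4, size=(1200, 6))
noise = rng.integers(0, 2, size=1200)
y_sample = (X_sample[:, 0] + X_sample[:, 1] + noise > 3).astype(int)
X_new = rng.integers(0, 4, size=(300, 6))  # plays the role of the competition data

# Step 5: hold out part of the sample.
X_train, X_hold, y_train, y_hold = train_test_split(
    X_sample, y_sample, test_size=200, random_state=0, stratify=y_sample)

# Steps 2, 3 and 6 chained, so every CV fold refits encoder and PCA on its own training part.
pipe = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ("dense", FunctionTransformer(to_dense)),
    ("pca", PCA(n_components=10)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Step 4: cross-validate the regularization strength C.
search = GridSearchCV(pipe, {"logreg__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

holdout_accuracy = search.score(X_hold, y_hold)   # step 7 analogue
new_predictions = search.predict(X_new)           # step 8: same fitted encoder and PCA
print(search.best_params_, round(holdout_accuracy, 3), new_predictions[:10])
```

With this layout there is no separate encoding or PCA step to repeat by hand on the competition data, which removes one place where the two datasets could end up being transformed differently.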

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400,000 x 250.
My problem is that the model yields a very good R^2 score when tested on the training set but performs extremely poorly on the test set. Initially, this sounds like overfitting. But the data is split into training and test sets at random and the dataset is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'], axis=1),
                                                    df.SalePrice, test_size=0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on the training set: 0.64
R^2 score when testing on the test set: -10^23 (approximately)

While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree with his answer that a neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need to take care of your data somehow; hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably more important than anything else, you should consider adapting your input data. The very first thing I'd try, assuming you are really trying to predict a price as your code implies, is to replace it with its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit that single object that was sold for $1 million while ignoring everything below $1k. Just as important, check whether you have any non-numeric (categorical) columns and keep them only if you need them, reducing them to macro-categories where necessary: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical value for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data.
sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples.
Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better).
Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however, if you're not familiar with those it is certainly not worth the trouble!
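A minimal sketch of the first and third suggestions combined (regularization plus a log-transformed target), assuming synthetic data with a skewed, strictly positive "price". The feature count, the weights and `alpha=1.0` are illustrative assumptions, not values tuned for the asker's dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 50 numeric features, only 5 of them informative.
X = rng.normal(size=(2000, 50))
weights = np.zeros(50)
weights[:5] = [0.8, -0.5, 0.3, 0.2, -0.1]
# Log-normal target, so a few very expensive items dominate the raw scale.
y = np.exp(10 + X @ weights + rng.normal(scale=0.3, size=2000))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Ridge is fitted on log(1 + price); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)

# Compare train and test R^2 on the log scale, where the model was fitted.
r2_log_train = r2_score(np.log1p(y_train), np.log1p(model.predict(X_train)))
r2_log_test = r2_score(np.log1p(y_test), np.log1p(model.predict(X_test)))
print(f"R^2 (log scale) train: {r2_log_train:.3f}  test: {r2_log_test:.3f}")
```

Swapping `Ridge` for `Lasso` in the same pipeline would additionally zero out coefficients, which is the feature-discarding behaviour mentioned above.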

Trouble Training SVM (scikit-learn package)

Background / Question
I am trying to create an SVM using scikit-learn. I have a training set (here is the link to it: https://dl.dropboxusercontent.com/u/9876125/training_patients.txt) which I load and then use to train the SVM. The training set is 3600 lines long. When I use all 3600 tuples the SVM never finishes training... BUT when I only use the first 3594 tuples it finishes training in under a minute. I've tried a variety of different-sized training sets and the same thing continues to happen: depending on how many tuples I use, the SVM either trains very quickly or never completes. This has led me to the conclusion that the SVM is having difficulty converging on an answer depending on the data.
Is my assumption about this being a convergence problem correct? If so, what is the solution? If not, what other problem could it be?
Code
import pylab as pl  # @UnresolvedImport
from sklearn.datasets import load_svmlight_file
print(__doc__)
import numpy as np
from sklearn import svm, datasets
print("loading training set\n")
X_train, y_train = load_svmlight_file("training_patients.txt")
h = .02  # step size in the mesh
C = 1.0  # SVM regularization parameter
print("creating svm\n")
poly_svc = svm.SVC(kernel='poly', cache_size=600, degree=40, C=C).fit(X_train, y_train)
print("all done")

The optimization algorithm behind SVM has cubic (O(n^3)) complexity assuming relatively high cost (C) and a high-dimensional feature space (a polynomial kernel with d=40 implies a ~1600-dimensional feature space). I would not call this "problems with convergence": for over 3000 samples it can take a while to train such a model, and that is normal. The fact that you achieve much faster convergence for some subsets is an effect of the very rich feature projection (the same can happen with the RBF kernel), and it is a common phenomenon that occurs even for very simple data from the UCI library. As mentioned in the comments, setting "verbose=True" may give you additional information about your optimization process: it will output the number of iterations and the number of support vectors (the higher the number of SVs, the more the SVM is overfitting, which can also be a reason for slow convergence).

I would also add to @lejlot's answer that standardizing the input variables (centering and scaling to unit variance or rescaling to some range such as [0, 1] or [-1, 1]) can make the optimization problem much easier and speed up the convergence as well.
By having a look at your data, it seems that some features have min and max values significantly larger than others. Maybe the MinMaxScaler can help. Have a look at the preprocessing doc in general.
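A small sketch of the scaling suggestion, assuming synthetic data in place of training_patients.txt (the file is not available here). One feature is blown up on purpose to mimic the uneven ranges mentioned above. The polynomial degree of 3 is an assumption made so that the example finishes quickly; it is not a recommendation derived from the asker's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in (assumption) with one feature on a much larger scale.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The scaler is fitted on the training data and reused for every later prediction.
model = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),
    SVC(kernel="poly", degree=3, C=1.0, cache_size=600),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
n_support = model.named_steps["svc"].n_support_  # support vectors per class
print(f"test accuracy: {test_accuracy:.3f}, support vectors: {n_support.sum()}")
```

Passing `verbose=True` to `SVC` would additionally print the solver's iteration count, as suggested in the first answer.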
