scikit-learn to learn and generate list of numbers - python

I have a large dataset of lists of triplets of numbers (mostly integers), each list a few hundred triplets long.
[(50,100,0.5),(20,35,1.0),.....]
[(70,80,0.3),(30,45,2.0),......]
....
I'm looking at sklearn to write a simple generative model that learns the patterns from these data and generates a likely list of triplets, but my background is rather weak, which makes the documentation difficult to follow.
Is there example sklearn code that does a similar job that I could take a look at?

I agree that this question is probably more appropriate for the data science or statistics sites, but I'll take a stab at it.
First, I'll assume that your data is in a pandas dataframe; this is convenient for scikit-learn as well as other Python packages.
I would first visualize the data. Since you only have three dimensions, a three-dimensional scatter plot might be useful. For instance, see here.
Another useful way to plot the data is to use pair plots. The seaborn package makes this very easy. See here. Pair plots are useful because they show distributions of each of the variables/features, as well as correlations between pairs of features.
At this point, creating a generative model depends on what the plots tell you. If, for instance, all of the variables are independent of one another, then you simply need to estimate the pdf for each variable independently (for instance, using kernel density estimation, which is also implemented in seaborn), and then generate new samples by drawing values from each of the three distributions separately and combining these values in a single tuple.
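As a minimal sketch of that independent-variables case (assuming the triplets are stacked into a NumPy array with one column per variable, and using scikit-learn's KernelDensity, whose sample() method draws new values):
import numpy as np
from sklearn.neighbors import KernelDensity
# Toy stand-in for the real data: each row is one (a, b, c) triplet.
data = np.array([(50, 100, 0.5), (20, 35, 1.0), (70, 80, 0.3), (30, 45, 2.0)], dtype=float)
n_samples = 5
new_triplets = np.empty((n_samples, 3))
for j in range(3):
    # Fit a 1-D kernel density estimate to each variable on its own.
    kde = KernelDensity(kernel='gaussian', bandwidth=5.0)  # the bandwidth is a guess and should be tuned
    kde.fit(data[:, j].reshape(-1, 1))
    # Draw new values for this variable independently of the others.
    new_triplets[:, j] = kde.sample(n_samples).ravel()
print(new_triplets)  # each row is a newly generated triplet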
If the variables are not independent, then the task becomes more complicated, and probably warrants a separate post on the statistics site. For instance, your samples could be generated from different clusters, possibly overlapping, in which case something like a mixture model might be useful.

Here is a small code example of the fit/predict workflow (note that this is a discriminative model, not a generative one):
import numpy as np
from sklearn.linear_model import LinearRegression
# generate random numpy arrays of shape (10, 3)
X_train = np.random.random((10,3))
y_train = np.random.random((10,3))
X_test = np.random.random((10,3))
#define the regression
clf = LinearRegression()
#fit & predict (predict returns numpy array of the same dimensions)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
Otherwise here are more examples:
http://scikit-learn.org/stable/auto_examples/index.html
The generative model would be sklearn.mixture.GaussianMixture (available from version 0.18 onwards).
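For completeness, a minimal sketch of the generative route with GaussianMixture (the toy data and the choice of n_components are assumptions; in practice you would tune the number of components, e.g. by BIC):
import numpy as np
from sklearn.mixture import GaussianMixture
# Toy stand-in for the real triplets, one row per (a, b, c) sample.
X = np.array([(50, 100, 0.5), (20, 35, 1.0), (70, 80, 0.3),
              (30, 45, 2.0), (55, 95, 0.6), (25, 40, 1.2)], dtype=float)
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)
# sample() returns (new_points, component_labels); each row of new_points is a generated triplet.
new_points, _ = gmm.sample(10)
print(new_points)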

Related

Fitting a Gaussian Process model to PCA. Predictions look very wrong

I'm currently trying to fit a Gaussian Process model to my data and have it predict some days ahead. I have reduced my ~10 features down to just 2 components via PCA in sklearn. So now I have PCA1 and PCA2. This was obtained by performing PCA on the training set (40%).
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(train_data)
PCAs = pca.transform(train_data)
PCA1 = PCAs[:,0]
PCA2 = PCAs[:,1]
where train_data is the dataframe with ~10 features and 50 rows and StandardScaler() applied to it.
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import RBF
kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(x_days_train, PCA1)
y_pred, y_std = model.predict(x_days, return_std=True)
model.score(x_days_train, PCA1)
where x_days is the full 50 days and x_days_train is the first 20 days (0, 1, 2, ...). I get a score of 1.0. However, my predicted results look terrible. It's as if, after the training data, the prediction just falls and then stagnates.
Not entirely sure what went wrong, but a couple guesses:
Since my data has no target variables, I used PCA on all the features in the dataframe and they are supposed to be x variables? And then I used them as a y variable (by predicting). Maybe this is an incorrect approach?
Following that, can PCA even be used as y_prediction?
Am I supposed to apply PCA to not just the training data, but also to the test data (apply fit_transform)?
I seem to be only using PCA1 and not PCA2 (nor a combination of the two). Should I use both? If so, how?
Would appreciate any help, thank you.
Since my data has no target variables, I used PCA on all the features in the dataframe and they are supposed to be x variables? And then I used them as a y variable (by predicting). Maybe this is an incorrect approach?
You are correct. PCA is meant to transform high-dimensional data into a much smaller number of dimensions. Essentially the data is compressed but still contains most of the information relative to each element in the data. scikit-learn's transform() function does not accept a y variable. Instead use the fit_transform() method, which accepts both arguments, applies the transformation to the x variable, and simply ignores y.
Following that, can PCA even be used as y_prediction?
PCA only transforms the data; it is the Gaussian Process Regression (GPR) that makes the predictions.
Am I supposed to apply PCA to not just the training data, but also to the test data (apply fit_transform)?
Yes.
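As a small, hedged illustration (test_data is a hypothetical held-out set prepared the same way as train_data), one common pattern is to fit the PCA on the training data and reuse the fitted transform on the test data so that both live in the same component space:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
train_components = pca.fit_transform(train_data)  # fit the components on the training data
test_components = pca.transform(test_data)        # project the test data onto the same components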
I seem to be only using PCA1 and not PCA2 (nor a combination of the two). Should I use both? If so, how?
After using the fit_transform() method like this:
pca_components = pca.fit_transform(train_data)  # returns a single array of shape (n_samples, 2)
pca_x = pca_components[:, 0]
pca_y = pca_components[:, 1]
Apply the data like this:
kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(pca_x.reshape(-1, 1), pca_y)  # reshape because scikit-learn expects a 2-D X
Here are the scikit-learn user guides for PCA and GPR.

Principal Component Analysis (PCA) vs. Extra Tree Classifier for Data Reduction

I have a dataset that consists of 13 columns, and I wanted to use PCA for data reduction to remove unwanted columns. My problem is that PCA doesn't really show column names, only PC1, PC2, etc. I found that the extra tree classifier does a similar thing but does indicate the variation of each column. I just wanted to make sure whether they both have the same objective, or whether they differ in their outcome. Also, would anyone suggest better methods for data reduction?
My last question: I have code for the extra tree classifier and wanted to confirm whether it is correct or not.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import ExtraTreesClassifier
df = pd.read_csv('D:\\Project\\database\\5-FINAL2\\Final After Simple Filtering.csv')
extra_tree_forest = ExtraTreesClassifier(n_estimators=500,
                                         criterion='entropy', max_features='auto')
extra_tree_forest.fit(df)
feature_importance = extra_tree_forest.feature_importances_
feature_importance_normalized = np.std([tree.feature_importances_ for tree in extra_tree_forest.estimators_],
                                       axis=0)
plt.bar(df.columns, feature_importance_normalized)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()
Thank You.
The two methods are very different.
PCA doesn't show you the feature names because dimensionality reduction with PCA doesn't really have anything to do with the relative importance of the features. PCA takes the original data and transforms it into a space where each new 'feature' (principal component) is independent of the others, and you can tell how important each principal component is to faithfully representing the data based on its corresponding eigenvalue. Removing the least important principal components reduces dimensionality in principal component space, but not in the original feature space - so you need to do PCA on all future data, too, and then perform all your classification on the (shortened) principal component vectors.
An extra tree classifier trains an entire classifier on your data, so it's much more powerful than just dimensionality reduction. However, it does seem closer to what you're looking for, since the feature importance does directly tell you how relevant each feature is when making a classification.
Note that in PCA, the principal components with the highest eigenvalues contribute the most to accurately reconstructing the data. This is not the same as contributing the most to accurately classifying the data. The extra tree classifier is the reverse: it tells you what features are most important when classifying the data, not when reconstructing it.
Basically, if you think you have a representative dataset right now and are comfortable only storing variables that are relevant to classifying the data you already have, dimensionality reduction with extra trees is a good choice for you. If you just want to faithfully represent the data with less space without being overly concerned about the effects on classification, PCA is the better choice. Dimensionality reduction with PCA will often also help remove irrelevant features from the original data, but that's not what it's optimized for.
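To make the contrast concrete, here is a hedged sketch on a synthetic stand-in for the 13-column dataset (the column names and make_classification parameters are assumptions): the extra-trees importances refer to the original, named columns, while PCA returns components rather than columns.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
# Synthetic stand-in for the 13-column dataset; the column names are hypothetical.
X, y = make_classification(n_samples=200, n_features=13, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=['col_%d' % i for i in range(13)])
# Extra trees: the importances line up with the original named columns.
forest = ExtraTreesClassifier(n_estimators=500, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(5))  # the most relevant original columns, by name
# PCA: the output is a set of components (PC1, PC2, ...), not a subset of the original columns.
components = PCA(n_components=5).fit_transform(X)
print(components.shape)  # (200, 5)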

Do I need to extract feature vectors from MNIST before using Kmeans

I am practicing with MNIST by sklearn.cluster.KMeans.
Intuitively, I just fit the training data with the sklearn function, but I got pretty low accuracy, and I am wondering what step I have missed. Should I extract feature vectors with PCA first? Or should I use a bigger n_clusters?
from sklearn import cluster
from sklearn.metrics import accuracy_score
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
y_pred=clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
I got a poor 0.137 as the result. Any recommendations? Thanks!
How are you passing the images in? Are the pixels flattened or kept in the 2-D format? Are the pixels being normalized to between 0 and 1?
As you are running clustering, I would advise against PCA regardless and would instead opt for t-SNE, which keeps neighbourhood information; but you should not need to do either before running K-Means.
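If you do want a 2-D view for debugging, a minimal t-SNE sketch (reusing X_train and the fitted clf from the question; purely for visualization, not as a preprocessing step) could look like this:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Embed the flattened, scaled images into 2-D for inspection only.
embedding = TSNE(n_components=2, random_state=42).fit_transform(X_train)
plt.scatter(embedding[:, 0], embedding[:, 1], c=clf.labels_, s=2, cmap='tab10')
plt.title('t-SNE of the training images, coloured by K-Means cluster')
plt.show()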
The best way to debug is to see what your fitted model is predicting as the clusters. You can see an example here:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
With this info, you can get an idea of where mistakes might be. Good luck!
Adding a note: K-Means is probably not the best model for your purposes either. It is best suited for unsupervised contexts where you want to cluster data, whereas MNIST is a classification use case. KNN would be a better option while still allowing you to experiment with neighbours and such.
Here is an example I created with KNN: https://gist.github.com/andrew-x/0bb997b129647f3a7b7c0907b7e836fc
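A hedged sketch of that supervised alternative, reusing X_train, y_train, X_test and y_test from the question (n_neighbors=5 is just a starting point to tune):
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Supervised nearest neighbours: unlike K-Means, the labels are used during fitting.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))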
Unless I'm missing something: you are comparing cluster labels, which are arbitrarily numbered 0-9, to class labels, whose numbering is not arbitrary. The images of the digit 0 might not end up in cluster number 0, yet that is exactly the comparison accuracy_score makes. Clustering results are evaluated differently because of this. Some options to get a correct evaluation (see the sketch after this list):
Generate a contingency matrix and plot it
Calculate the adjusted Rand index
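A short sketch of both options, reusing y_test and y_pred from the question:
import pandas as pd
from sklearn.metrics import adjusted_rand_score
# Contingency matrix: rows are the true digits, columns are the cluster ids.
print(pd.crosstab(y_test, y_pred, rownames=['digit'], colnames=['cluster']))
# The adjusted Rand index does not care how the clusters happen to be numbered.
print(adjusted_rand_score(y_test, y_pred))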

Scikit learn-Classification

Is there a straightforward way to view the top features of each class? Based on tfidf?
I am using the KNeighbors classifier, linear SVC, and MultinomialNB.
Secondly, I have been searching for a way to view documents that have not been classified correctly. I can view the confusion matrix, but I would like to see the specific documents to understand which features are causing the misclassification.
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
counts = tfidf_vectorizer.transform(test['text'].values).toarray()  # reuse the fitted vectorizer; do not re-fit it on the test set
predictions = classifier.predict(counts)
EDIT: I have added the code snippet where I am only creating a tfidf vectorizer and using it to train the classifier.
Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.
I. Determining top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with the .feature_importances_ attribute, which scores each feature based on its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty which penalizes the size of coefficients, meaning that the coefficient sizes are somewhat a reflection of feature importance (although you need to keep in mind the numeric scales of the individual features); these can be accessed through the .coef_ attribute of the model class.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
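For the specific case in the question, a hedged sketch of pulling the highest-weighted tf-idf terms out of the fitted linear SVC (it assumes the classifier and tfidf_vectorizer from the snippet in the question; older sklearn versions use get_feature_names() instead of get_feature_names_out()):
import numpy as np
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
# Each row of coef_ is a weight vector: one row for a binary linear SVC, one per class for
# LinearSVC, and one per class pair for a multi-class SVC(kernel='linear').
for i, weights in enumerate(classifier.coef_):
    top = np.argsort(weights)[-10:][::-1]  # indices of the ten largest weights
    print(i, feature_names[top])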
II. There is no out-of-the-box sklearn method that provides the misclassified records, but if you are feeding the model from a pandas DataFrame (which you should), it can be accomplished in a few lines of code like this.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
incorrect = df[df['predict']!=df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!

How to ensure disjoint set of features when using Feature Union

I'm trying to learn how to use some of the helper features in sklearn, but I am struggling to understand how to use FeatureUnion.
One part of the documentation states this:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
However, an example on the Iris dataset shows this:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
How is it ensured that PCA and SelectKBest don't select the same feature? In other words, how can the user ensure that the two selections are disjoint?
http://scikit-learn.org/dev/modules/pipeline.html#feature-union
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
I think you pretty much answered your own question with that quote from the docs:
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
The FeatureUnion does not ensure features are different.
In the example of the Iris dataset it is possible (though highly unlikely) that the PCA and the feature selection process will generate identical features. In that case, you just have twice the same feature in the output of the FeatureUnion.
This is usually not a huge deal, though if you can avoid it it's probably cleaner to do so (for instance a random forest model would be biased towards a feature that is present several times, as it would have a higher probability to be picked as a candidate to split a node).
To be a bit clearer, I don't think there's a lot you can do about it beyond avoiding combining feature extraction processes that obviously create duplicate features in a FeatureUnion.
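If you do want a sanity check, one hedged option is to inspect the stacked output for duplicate columns after transforming, reusing X_features from the example in the question:
import numpy as np
n_cols = X_features.shape[1]
# Compare every pair of output columns; identical columns indicate duplicated features.
duplicate_pairs = [(i, j) for i in range(n_cols) for j in range(i + 1, n_cols)
                   if np.allclose(X_features[:, i], X_features[:, j])]
print(duplicate_pairs)  # an empty list means no two output features are identical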
