Remove some features from sklearn PolynomialFeatures - python

I am using the sklearn class PolynomialFeatures to fit my model with polynomials over my data.
To that end I am doing the following:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
P = PolynomialFeatures(3, interaction_only=False, include_bias=False)
model = make_pipeline(P, Ridge(tol=0.001, alpha=1, fit_intercept=False))
model.fit(initial_conditions, times_of_flight)
It works well, and now I would like to be able to remove some of these features to refine my model. Say I would like to remove every feature that contains one of the first two variables, x_1 and x_2, without the other.
I have tried to modify my PolynomialFeatures attributes (powers_, n_input_features_, ...) before fitting, but scikit-learn raises a sklearn.exceptions.NotFittedError.
How should I proceed ?
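For example, would it be acceptable to fit PolynomialFeatures once, read its powers_ matrix to build a column mask, and drop the unwanted columns with a small selection step between PolynomialFeatures and Ridge? Here is a sketch of what I have in mind (initial_conditions and times_of_flight are my training arrays; the mask keeps only the terms where x_1 and x_2 appear together or not at all):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

P = PolynomialFeatures(3, interaction_only=False, include_bias=False)
P.fit(initial_conditions)  # fit once, only so that powers_ becomes available

# powers_ has shape (n_output_features, n_input_features); keep only the
# terms where x_1 and x_2 appear together or are both absent
powers = P.powers_
keep = ~((powers[:, 0] > 0) ^ (powers[:, 1] > 0))

select = FunctionTransformer(lambda Z: Z[:, keep], validate=False)
model = make_pipeline(P, select, Ridge(tol=0.001, alpha=1, fit_intercept=False))
model.fit(initial_conditions, times_of_flight)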

Related

How to display feature names in sklearn decision tree?

I've currently got a decision tree displaying the feature names as X[index], i.e. X[0], X[1], X[2], etc.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# plot tree
plt.figure(figsize=(20, 16))  # set plot size (denoted in inches)
tree.plot_tree(dt, fontsize=10)
I'm looking to replace these X[featureNumber] with the actual feature name.
So instead of it displaying X[0], I would want it to display the feature name returned by X.columns.values[0] (I don't know if this code is correct).
I'm also aware there is an easy way of doing this using graphviz, but for some reason I can't get graphviz running in Jupyter, so I'm looking for a way of doing it without.
Photo of current decision tree:
This is explained in the documentation:
sklearn.tree.plot_tree(decision_tree, *, max_depth=None, feature_names=None, class_names=None, label='all', filled=False, impurity=True, node_ids=False, proportion=False, rotate='deprecated', rounded=False, precision=3, ax=None, fontsize=None)
feature_names: list of strings, default=None
Names of each of the features. If None, generic names will be used (“X[0]”, “X[1]”, …).
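So, assuming X_train is a pandas DataFrame (you already mention X.columns in the question), passing its column names to that parameter should be enough, for example:
plt.figure(figsize=(20, 16))
tree.plot_tree(dt, feature_names=list(X_train.columns), fontsize=10)
plt.show()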

What exactly does fit() do here?

Well, basically I want to know what the fit() function does in general, but especially in the pieces of code below.
I'm taking the Machine Learning A-Z course because I'm pretty new to machine learning (I just started). I know some basic conceptual terms, but not the technical part.
CODE 1:
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean', verbose = 0)
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
Here is another example where I still have the same doubt:
CODE 2:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
print(sc_X)
X_train = sc_X.fit_transform(X_train)
print(X_train)
X_test = sc_X.transform(X_test)
I think that if I know the general use of this function and what exactly it does, I'll be good to go. But I'd certainly also like to know what it is doing in that code.
Here is also a nice resource to check: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
In machine learning, the fit method is always used to learn something from the data.
You normally have the following steps:
Separate your data into two or three datasets.
Pick one part of your data (normally X_train) to learn/train something with fit.
Use the learned algorithm to predict something on unseen data (normally X_test) with predict.
In your first example, missingvalues.fit(X[:, 1:3]):
You are training the SimpleImputer based on your data X, using only columns 1 and 2; with transform you then use this training to overwrite that data.
In your second example: You are training the StandardScaler with X_train and are using this training for both datasets, X_train and X_test. The StandardScaler learns from X_train; that means if it learned that 10 has to be converted to 2, it will convert 10 to 2 in both sets, X_train and X_test.
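A small toy sketch of that idea (the numbers are invented just for illustration): the scaler learns the mean and standard deviation from X_train only and then reuses exactly those values on both sets.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0]])   # toy training data
X_test = np.array([[10.0], [5.0]])    # toy "unseen" data

sc = StandardScaler()
sc.fit(X_train)                        # learns mean=5 and std=5 from X_train only
print(sc.transform(X_train))           # [[-1.], [ 1.]]
print(sc.transform(X_test))            # [[ 1.], [ 0.]]  <- same mean/std reused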
Sklearn uses classes. See the Python documentation for more info about classes in Python, and the sklearn documentation for more info about sklearn in particular.
Here's a short description of how you are using classes in sklearn.
First you instantiate your sklearn classes with sc_X = StandardScaler() or missingvalues = SimpleImputer(...).
The objects, sc_X and missingvalues, each have methods. You can call a method by typing object_name.method_name(...). For example, you used the fit_transform() method of the sc_X instance when you typed sc_X.fit_transform(...). This method takes your data and returns a scaled version of it: it both fits (determines the scaling parameters) and transforms (applies the scaling) to your data. The transform() method will transform new data using the same scaling parameters it learned from your previous data.
In the first example, you have separated the fit and transform methods into two separate lines, but the idea is similar -- you first learn the imputation parameters with the fit method, and then you transform your data.
By the way, I think missingvalues = missingvalues.fit(X[:, 1:3]) could be changed to missingvalues.fit(X[:, 1:3]).
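For the imputer the pattern is the same; here is a toy sketch (invented data, not from the course): fit learns the column means, and transform fills the missing values with them.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])             # toy data with one missing value

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)                             # learns the column means: [4.0, 4.0]
print(imp.transform(X))                # the NaN is replaced by the learned mean 4.0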

How to use k-means for a product recommendation dataset

I have a data set with the columns product name, brand, rating (1-5), review text, and review-helpfulness. What I need is to propose a recommendation algorithm using the reviews. I have to use Python for the coding, and the data set is in .csv format.
To identify the nature of the data set I need to use k-means on it. How can I use k-means on this data set?
Thus I did the following:
1. data pre-processing,
2. review text data cleaning,
3. sentiment analysis,
4. giving a sentiment score from 1 to 5 according to the sentiment value (given by the sentiment analysis) and tagging reviews as very negative, negative, neutral, positive, or very positive.
After these procedures I have these columns in my data set: product name, brand, rating (1-5), review text, review-helpfulness, sentiment-value, sentiment-tag.
This is the link to the data set https://drive.google.com/file/d/1YhCJNvV2BQk0T7PbPoR746DCL6tYmH7l/view?usp=sharing
I tried to run k-means using the following code. It runs without error, but I don't know whether this output is useful, or whether there are other ways to use k-means on this data set to get more useful outputs. To learn more about the data, how should I use k-means on this data set?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# df is assumed to be loaded beforehand from the .csv file
df.info()
X = np.array(df.drop(['sentiment_value'], axis=1).astype(float))
y = np.array(df['rating'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
# console output of the fitted estimator (just its default parameters):
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)
plt.show()
You did not plot anything.
So nothing shows up.
Unless you are more specific about what you are trying to achieve, we won't be able to help. Figure out what exactly you want to predict. Do you just want to cluster products according to their sentiment score (which isn't especially promising), or do you want to predict actual product preferences on a new dataset?
If you want to build a recommendation system, the only possibility (considering your dataset) would be to identify similar products according to their rating/sentiment. Is that what you want?
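If clustering products by rating and sentiment is what you want, a minimal sketch could look like the following (assuming df is your already-loaded DataFrame and that the rating and sentiment_value columns are numeric):
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# cluster on the two numeric columns only
X = df[['rating', 'sentiment_value']].astype(float).values
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# actually plot something, so that plt.show() has output to display
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.xlabel('rating')
plt.ylabel('sentiment_value')
plt.show()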

Up-/downsampling with One vs. rest classifier

I have a data set (tf-idf weighted words) with multiple classes that I try to predict. My classes are imbalanced. I would like to use the one-vs.-rest classification approach with some classifiers (e.g. Multinomial Naive Bayes) using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely one of the combinations of up- and downsampling) to enhance my data. The normal approach of using imbalanced-learn is:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
I now have a data set with roughly the same number of cases for every label. I then would use the classifier on the resampled data.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But: now there is a huge imbalance for every label when it's fitted, because I have in total more than 50 labels. Right? I imagine that I need to apply the up-/downsampling method for every label instead of doing it once at the beginning. How can I use the resampling for every label?
As per the discussion in comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
                 ('clf', MultinomialNB())])
# OVR will transform the `y` as you know and
# then pass single label data to different copies of pipe
# multiple times (as many labels in data)
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass the individual data to it.
Step 3:
a. Each copy of pipe will get a different version of y, which is passed to SMOTEENN inside it and so will do a different sampling to balance the classes there.
b. The second part of pipe (clf) will get that balanced dataset for each label as you wanted.
Step 4: At prediction time, the sampling part will be turned off, so the data will reach the clf as it is. The sklearn pipeline doesn't handle that part, which is why I used imblearn.pipeline.
Hope this helps.

Mixture of Gaussians using scikit learn mixture

I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.
The data looks like this:
So here's how I cluster the data using R; it gives me 14 nicely separated clusters and is as easy as falling down stairs:
data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")
And here's what I do when I try it with Python:
from sklearn import mixture
from matplotlib import pyplot
import numpy as np
import os

os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv', "rb"), delimiter=",", skiprows=0)
gmm = mixture.GMM(n_components=14, n_iter=5000, covariance_type='full')
gmm.fit(data)
classes = gmm.predict(data)
pyplot.scatter(data[:, 0], data[:, 1], c=classes)
pyplot.show()
This assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with increasing numbers of clusters. What am I doing wrong? Are there additional parameters I need to consider?
Is there a difference in the models used by Mclust and by sklearn.mixture?
But more important: what is the best way in sklearn to cluster my data?
The trick is to set GMM's min_covar. So in this case I get good results from:
mixture.GMM(n_components=14, n_iter=5000, covariance_type='full', min_covar=0.0000001)
The large default value for min_covar assigns all points to one cluster.
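Note that in current scikit-learn versions the GMM class has been replaced by GaussianMixture; the analogous knob there is reg_covar, whose default is already much smaller than GMM's old min_covar default. A rough equivalent, assuming a recent scikit-learn, would be:
from sklearn.mixture import GaussianMixture

# reg_covar plays roughly the role that min_covar played in the old GMM class
gmm = GaussianMixture(n_components=14, max_iter=5000,
                      covariance_type='full', reg_covar=1e-7)
gmm.fit(data)
classes = gmm.predict(data)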
