Up-/downsampling with One vs. rest classifier - python

I have a data set (tf-idf weighted words) with multiple classes that I am trying to predict. My classes are imbalanced. I would like to use the one-vs-rest classification approach with some classifiers (e.g. Multinomial Naive Bayes) using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely one of its combinations of up- and downsampling) to enhance my data. The normal approach with imbalanced-learn is:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
I now have a data set with roughly the same number of cases for every label. I then would use the classifier on the resampled data.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But now every label is hugely imbalanced again when it is fitted, because I have more than 50 labels in total and each one-vs-rest problem sets one label against all the others. Right? I imagine that I need to apply the up-/downsampling method for every label instead of doing it once at the beginning. How can I use the resampling for every label?

As per the discussion in comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
                 ('clf', MultinomialNB())])
# OVR will binarize `y` as you know and
# then pass the single-label data to a separate copy of pipe,
# once for each label in the data
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass the data for that label to it.
Step 3:
a. Each copy of pipe will get a different version of y, which is passed to the SMOTEENN inside it, so each copy does its own sampling to balance the two classes it sees.
b. The second part of pipe (clf) will get that balanced dataset for each label, as you wanted.
Step 4: At prediction time the sampling step is skipped, so the data reaches clf as it is. The sklearn Pipeline does not handle samplers, which is why I used imblearn.pipeline.
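To make Step 1 concrete, here is a small sketch of the imbalance that each per-label problem sees. It uses LabelBinarizer on a made-up, perfectly balanced target (50 labels with 20 samples each, both numbers are assumptions for illustration); OneVsRestClassifier binarizes y in a similar way internally.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical, perfectly balanced target: 50 labels, 20 samples each.
y = np.repeat(np.arange(50), 20)

# One column per label: 1 where the sample has that label, 0 otherwise.
Y = LabelBinarizer().fit_transform(y)

positives_per_label = Y.sum(axis=0)
negatives_per_label = len(y) - positives_per_label

print(Y.shape)
print(positives_per_label[0], negatives_per_label[0])
```

Even though the multiclass target is balanced, every binary column has far fewer positives than negatives, which is why the resampling has to happen inside the estimator that OneVsRestClassifier clones.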
Hope this helps.


How does SelectFromModel from scikit-learn select features?

When I use XGBClassifier with SelectFromModel, the algorithm always returns around five features regardless of the max_features value.
My question is: does XGBClassifier think that there are only five useful features in my dataset?
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
sf=SelectFromModel(XGBClassifier(), max_features=10).fit(X, y)
#The output only contains five True, all remaining are False
print(sf.get_support())
"To only select based on max_features, set threshold=-np.inf."
I found the above text in the documentation of sklearn.feature_selection.SelectFromModel. It means that SelectFromModel gives priority to the threshold parameter: features that fail the threshold are dropped even if fewer than max_features remain.
If you want max_features to take full effect, set threshold=-np.inf. In that case all features pass the threshold, and max_features then selects the requested number of features based on their rank.
You can use threshold='median' to keep the best half of the features, then apply it again to the resulting dataset to keep the best quarter, and so on. A Pipeline makes this repeated feature reduction easier to handle.
from sklearn.ensemble import RandomForestClassifier
sf = SelectFromModel(RandomForestClassifier(), threshold='median')
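A quick way to see the effect of threshold=-np.inf is to compare it with the default threshold on synthetic data. This is only a sketch: it swaps in RandomForestClassifier for XGBClassifier so that it needs nothing beyond scikit-learn, and the dataset from make_classification is made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

est = RandomForestClassifier(n_estimators=50, random_state=0)

# Default threshold: a feature must also beat the threshold to be kept.
default_sf = SelectFromModel(est, max_features=10).fit(X, y)

# threshold=-np.inf: every feature passes, so only the ranking matters.
ranked_sf = SelectFromModel(est, threshold=-np.inf, max_features=10).fit(X, y)

n_default = default_sf.get_support().sum()
n_ranked = ranked_sf.get_support().sum()
print(n_default, n_ranked)
```

With the default threshold the count can be anything up to max_features; with threshold=-np.inf it is exactly max_features.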

decision tree always returns the same value for different inputs

I am new to coding. I am learning machine learning with Python. Using a decision tree, I am trying to predict an individual's chance of a heart attack using a dataset from Kaggle. After fitting the model, when I try to predict for different inputs it always returns the same output, [1]. What may be the problem? What can I do? This is my code.
import pandas as pd
import numpy as np
heart=pd.read_csv('heart_attack.csv')
heart.fillna(heart.mean(),inplace=True)
x=heart.iloc[:,:-1]
y=heart.iloc[:,-1]
import sklearn
from sklearn.preprocessing import LabelEncoder
x=x.apply(LabelEncoder().fit_transform)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.20, random_state=85)
from sklearn.tree import DecisionTreeClassifier
result=DecisionTreeClassifier()
result.fit(x_train,y_train)
y_pred=result.predict(x_test)
This is the code where I store the input values:
Patient_Data = [Patient_Age,Patient_Gender,Patient_pain,Patient_RBP,Patient_chol,Patient_FBS,Patient_ECG,Patient_thalach,Patient_exang,Patient_oldpeak,Patient_slope,Patient_thal]
Patient_Data_New= pd.DataFrame([Patient_Data],columns=['Age','Gender','cp','restbps','chol','FBS','restecg','thalach','exang','oldpeak','slope', 'thal'])
Patient= result.predict(Patient_Data_New)
if Patient>0:
    print ('This patient has a chance to get heart attack')
else:
    print ('This patient does not have a chance to get heart attack')
Thanks in advance.
This is probably because you are predicting for data that is already present in the training set, so the model reproduces every value correctly and reports 100 percent accuracy.
Try this:
result.score(x_test, y_test)
If the result is 100 percent, that is a sign of overfitting, and you need to drop a few columns and find the best parameters using k-fold cross-validation.
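As a rough sketch of that check, the snippet below compares training accuracy, held-out accuracy and 5-fold cross-validation for a decision tree. The data is a synthetic stand-in from make_classification (an assumption, since heart_attack.csv is not available here), and max_depth=3 is just an illustrative choice of parameter to validate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (assumption: 12 numeric features).
X, y = make_classification(n_samples=300, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=85
)

# An unrestricted tree, as in the question.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

# 5-fold cross-validation of a shallower tree on the training split.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_train, y_train, cv=5
)
print(train_acc, test_acc, cv_scores.mean())
```

A large gap between the training score and the held-out or cross-validated score is the usual symptom of overfitting.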

Remove some features from sklearn PolynomialFeatures

I am using the sklearn class PolynomialFeatures to fit polynomials to my data.
To this end I am doing the following:
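from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
P = PolynomialFeatures(3, interaction_only=False, include_bias=False)
model = make_pipeline(P, Ridge(tol=0.001, alpha=1, fit_intercept=False))
model.fit(initial_conditions, times_of_flight)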
It works well, and now I would like to remove some of these features to refine my model. Say I would like to remove every feature that contains one of the first two variables, x_1 and x_2, without the other.
I have tried to modify the attributes of my PolynomialFeatures object (powers_, n_input_features_...) before fitting, but scikit-learn raises a sklearn.exceptions.NotFittedError.
How should I proceed?
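One possible direction, given only as a sketch and not as an accepted answer: powers_ becomes available once PolynomialFeatures has been fitted, so it can be read after fit_transform and used to build a column mask. The random X with three input variables is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # hypothetical data with three input variables

P = PolynomialFeatures(3, interaction_only=False, include_bias=False)
X_poly = P.fit_transform(X)  # powers_ can only be read after fitting

# powers_[i, j] is the exponent of input j in output feature i.
has_x1 = P.powers_[:, 0] > 0
has_x2 = P.powers_[:, 1] > 0

# Keep a term only if x_1 and x_2 appear together or not at all.
keep = has_x1 == has_x2
X_reduced = X_poly[:, keep]
print(X_poly.shape, X_reduced.shape)
```

The reduced matrix could then be passed to Ridge directly instead of going through make_pipeline.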

What exactly does fit() do here?

Basically, I want to know what the fit() function does in general, and especially in the pieces of code below.
I'm taking the Machine Learning A-Z course because I'm pretty new to machine learning (I just started). I know some basic conceptual terms, but not the technical part.
CODE 1:
import numpy as np
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
Another example where I still have doubts:
CODE 2:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
print(sc_X)
X_train = sc_X.fit_transform(X_train)
print(X_train)
X_test = sc_X.transform(X_test)
I think that if I know the general use of this function and what exactly it does, I'll be good to go. But I'd certainly like to know what it is doing in that code.
The scikit-learn tutorial is also a good place to check: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
In machine learning, the fit method is always where something is learned.
You normally have the following steps:
1. Separate your data into two or three datasets.
2. Pick one part of your data (normally X_train) to learn/train something with fit.
3. Use the learned model to predict something for unseen data (normally X_test) with predict.
In your first example: missingvalues.fit(X[:, 1:3])
You are training SimpleImputer on your data X, using only columns 1 and 2 (the slice 1:3 stops before column 3). With transform you then use what was learned to overwrite this data.
In your second example: You are training StandardScaler with X_train and using what it learned for both datasets, X_train and X_test. The StandardScaler learns from X_train only, which means that if it learned that 10 has to be converted to 2, it will convert 10 to 2 in both sets.
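The second example can be sketched with tiny made-up arrays (the numbers are assumptions chosen so the result is easy to follow). The scaler learns its mean and scale from X_train only and then applies the same conversion to both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0], [20.0]])  # made-up training column
X_test = np.array([[10.0], [30.0]])          # made-up test column

sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)  # learns mean_ and scale_, then scales
X_test_scaled = sc_X.transform(X_test)        # reuses the mean_ and scale_ from training

print(sc_X.mean_)              # mean learned from X_train only
print(X_train_scaled.ravel())  # the value 10 becomes 0 here ...
print(X_test_scaled.ravel())   # ... and also here
```

Calling fit_transform on X_test instead would make the scaler learn a different mean, and the same raw value would no longer map to the same scaled value in both sets.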
Sklearn uses Classes. See the Python documentation for more info about Classes in Python. For more info about sklearn in particular, take a look at this sklearn documentation.
Here's a short description of how you are using Classes in sklearn.
First you instantiate your sklearn Classes with sc_X = StandardScaler() or missingvalues = SimpleImputer(...).
The objects, sc_X and missingvalues, each have methods. You call a method by typing object_name.method_name(...). For example, you used the fit_transform() method of the sc_X instance when you typed sc_X.fit_transform(...). This method takes your data and returns a scaled version of it. It both fits (determines the scaling parameters) and transforms (applies the scaling to) your data. The transform() method will transform new data, using the same scaling parameters it learned from your previous data.
In the first example, you have separated the fit and transform methods into two separate lines, but the idea is similar -- you first learn the imputation parameters with the fit method, and then you transform your data.
By the way, I think missingvalues = missingvalues.fit(X[:, 1:3]) could be changed to missingvalues.fit(X[:, 1:3]).
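A small sketch of why the reassignment is optional: fit stores what it learned on the object and returns that same object. The array X below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each of columns 1 and 2.
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 4.0, 8.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
returned = imputer.fit(X[:, 1:3])  # learns the mean of columns 1 and 2

print(returned is imputer)  # fit returns the object it was called on
print(imputer.statistics_)  # the column means learned by fit

# transform fills each missing value with the learned column mean.
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```

Because fit returns the imputer itself, both spellings leave missingvalues pointing at the same fitted object.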

How to use k means for a product recommendation dataset

I have a data set with the columns product name, brand, rating (1 to 5), review text and review-helpfulness. What I need is to propose a recommendation algorithm using the reviews. I have to use Python for the coding. The data set is in .csv format.
To identify the nature of the data set I need to use k-means on it. How do I use k-means on this data set?
So far I have done the following:
1. data pre-processing,
2. review text data cleaning,
3. sentiment analysis,
4. giving each review a sentiment score from 1 to 5 according to the sentiment value from the sentiment analysis, and tagging reviews as very negative, negative, neutral, positive or very positive.
After these steps I have these columns in my data set: product name, brand, rating (1 to 5), review text, review-helpfulness, sentiment-value, sentiment-tag.
This is the link to the data set https://drive.google.com/file/d/1YhCJNvV2BQk0T7PbPoR746DCL6tYmH7l/view?usp=sharing
I tried to run k-means using the following code. It runs without error, but I don't know whether the result is useful, or whether there are other ways to use k-means on this data set to get other useful outputs. How should I use k-means on this data set to learn more about the data?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df.info()
X = np.array(df.drop(['sentiment_value'], axis=1).astype(float))
y = np.array(df['rating'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
# Output echoed by kmeans.fit(X):
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)
plt.show()
You did not plot anything.
So nothing shows up.
Unless you are more specific about what you are trying to achieve, we won't be able to help. Figure out what exactly you want to predict. Do you just want to cluster products according to their sentiment score, which isn't especially promising, or do you want to predict actual product preferences on a new dataset?
If you want to build a recommendation system the only possibility (considering your dataset) would be to identify similar products according to the rating/sentiment. Is that what you want?
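If grouping similar products is the goal, one way to sketch it is to average rating and sentiment per product and cluster those averages. The rows and the column names below (product_name, rating, sentiment_value) are made up to resemble the question's data, and n_clusters=2 is an arbitrary choice.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up reviews; the column names are assumptions modelled on the question.
df = pd.DataFrame({
    "product_name": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "rating": [5, 4, 1, 2, 5, 5, 2, 1],
    "sentiment_value": [5, 4, 1, 1, 4, 5, 2, 2],
})

# One row per product: mean rating and mean sentiment over its reviews.
per_product = df.groupby("product_name")[["rating", "sentiment_value"]].mean()

# Scale the two columns so neither dominates the distance, then cluster.
X = StandardScaler().fit_transform(per_product)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

per_product["cluster"] = kmeans.labels_
print(per_product)
```

Products that land in the same cluster have similar average rating and sentiment, which is about as far as this dataset can go toward "similar products".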
