How to properly use SMOTE for data balancing

I wanted to know whether SMOTE should be applied only after splitting the data into train and test sets. I applied SMOTE after train_test_split for churn prediction, but the results with and without SMOTE show no significant difference. I am not sure where the issue is, and I would like to know whether I used SMOTE properly.
Below is my entire code:
import pandas as pd
import numpy as np
from datetime import timedelta,datetime,date
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from numpy import percentile
tel_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
tel_data.info()
tel_data.isnull().sum()
num = {"No":0,"Yes":1}
tel_data = tel_data.replace({"Churn":num})
# TotalCharges is stored as object; converting to numeric (blank strings become NaN)
tel_data['TotalCharges'] = pd.to_numeric(tel_data['TotalCharges'], errors='coerce')
tel_data.head(2)
tel_data['Churn'].value_counts()
plt.figure(figsize=(6,5))
sns.countplot(x=tel_data['Churn'])
plt.show()
# using pd.to_numeric to convert the TotalCharges column to numeric will help us see the null values
tel_data.TotalCharges = pd.to_numeric(tel_data.TotalCharges, errors="coerce")
tel_data.isnull().sum()
# deleting the rows with null values
tel_data = tel_data.dropna(axis=0)
# encoding all categorical variables using one hot encoding
tel_data = pd.get_dummies(tel_data,drop_first=True,columns=['gender','Partner','Dependents',
                                                            'PhoneService','MultipleLines','InternetService',
                                                            'OnlineSecurity','OnlineBackup','DeviceProtection',
                                                            'TechSupport','StreamingTV','StreamingMovies',
                                                            'Contract','PaperlessBilling','PaymentMethod'])
# splitting the dataset (removing 'customerID' since it doesnt serve any purpose)
X = tel_data.drop(['customerID','Churn'],axis=1)
y = tel_data['Churn']
# performing feature selection using chi2 test
from sklearn.feature_selection import chi2
chi_scores = chi2(X,y)
print('chi_values:',chi_scores[0],'\n')
print('p_values:',chi_scores[1])
p_values = pd.Series(chi_scores[1],index = X.columns)
p_values.sort_values(ascending = False , inplace = True)
plt.figure(figsize=(12,8))
p_values.plot.bar()
plt.show()
tel_data.drop(['PhoneService_Yes','gender_Male','MultipleLines_No phone service','MultipleLines_Yes','customerID'],axis=1,inplace=True)
tel_data.head(2)
# rebuilding features and target after feature selection ('customerID' was already dropped above)
X = tel_data.drop(['Churn'],axis=1)
y = tel_data['Churn']
# import sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
from sklearn.metrics import accuracy_score
# splitting into train and test data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42,stratify=y)
model_xgb_1 = xgb.XGBClassifier(n_estimators=100,
                                learning_rate=0.3,
                                max_depth=5,
                                random_state=42)
xgbmod = model_xgb_1.fit(X_train,y_train)
# checking accuracy of training data
print('Accuracy of XGB classifier on training set: {:.2f}'
      .format(xgbmod.score(X_train, y_train)))
# evaluating the fitted model on the test data
y_xgb_pred = xgbmod.predict(X_test)
print(classification_report(y_test, y_xgb_pred))
from imblearn.over_sampling import SMOTE
smote_preprocess = SMOTE(random_state=42)
X_train_resampled,y_train_resampled = smote_preprocess.fit_resample(X_train,y_train)
model_xgb_smote = xgb.XGBClassifier(n_estimators=100,
                                    learning_rate=0.3,
                                    max_depth=5,
                                    random_state=42)
xgbmod_smote = model_xgb_smote.fit(X_train_resampled,y_train_resampled)
# checking accuracy of training data
print('Accuracy of XGB classifier on training set: {:.2f}'
      .format(xgbmod_smote.score(X_train_resampled,y_train_resampled)))
y_xgb_pred_smote = xgbmod_smote.predict(X_test)
print(classification_report(y_test, y_xgb_pred_smote))
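On the question of placement: resampling is normally done after the split and on the training portion only, so that the test set keeps the original class distribution. The sketch below illustrates that ordering on synthetic data. It is not the churn data and it does not use SMOTE itself: plain random oversampling with sklearn.utils.resample and a logistic regression are swapped in to keep the example dependent on scikit-learn only, and all data and names here are assumptions for illustration. The same ordering applies when fit_resample from imblearn is used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (about 90% class 0, 10% class 1); a stand-in for the churn table.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Split first, so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample the minority class in the training split only
# (random oversampling, used here as a simple stand-in for SMOTE).
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(X_train[minority], y_train[minority],
                              replace=True, n_samples=n_majority,
                              random_state=42)
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000).fit(X_train_bal, y_train_bal)

# Compare minority-class recall on the untouched test set; accuracy alone
# can hide the effect of rebalancing.
recall_baseline = recall_score(y_test, baseline.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))
print("minority recall without resampling:", recall_baseline)
print("minority recall with resampling:   ", recall_balanced)
```

Looking at per-class recall or F1 from classification_report, rather than overall accuracy, is usually where any effect of rebalancing becomes visible.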

Related

How to calculate the accuracy?

I'm trying to calculate the accuracy for a twitter sentiment analysis project. However, I get this error, and I was wondering if anyone could help me calculate the accuracy? Thanks
Error: ValueError: Classification metrics can't handle a mix of continuous and multiclass targets
My code:
import re
import pickle
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("updated_tweet_info.csv")
data = df.fillna(' ')
train,test = train_test_split(data, test_size = 0.2, random_state = 42)
train_clean_tweet=[]
for tweet in train['tweet']:
    train_clean_tweet.append(tweet)
test_clean_tweet=[]
for tweet in test['tweet']:
    test_clean_tweet.append(tweet)
v = CountVectorizer(analyzer = "word")
train_features= v.fit_transform(train_clean_tweet)
test_features=v.transform(test_clean_tweet)
lr = RandomForestRegressor(n_estimators=200)
fit1 = lr.fit(train_features, train['clean_polarity'])
pred = fit1.predict(test_features)
accuracy = accuracy_score(pred, test['clean_polarity'])

You are trying to use accuracy_score, but accuracy is a classification metric, and RandomForestRegressor returns continuous predictions. That mix of continuous predictions and class labels is what the ValueError reports.
In your case, use a regression metric instead, for example mean_squared_error() followed by np.sqrt(). This gives you the root mean squared error (RMSE); the lower the number, the better.
Try this:
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(test['clean_polarity'], pred))
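The difference between the two kinds of metric can be seen on a few made-up numbers. This is only a sketch: the arrays are invented, and mapping predictions onto the labels -1, 0 and 1 assumes that clean_polarity takes those values, which the question does not confirm.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Invented polarity values and continuous regressor outputs.
y_true = np.array([1.0, 0.0, -1.0, 1.0])
y_pred = np.array([0.8, 0.1, -0.6, 0.4])

# A regression metric works directly on continuous predictions.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Accuracy needs discrete labels on both sides, so the predictions are
# rounded to the nearest class first (assumed classes: -1, 0, 1).
labels_true = y_true.astype(int)
labels_pred = np.round(y_pred).astype(int)
accuracy = accuracy_score(labels_true, labels_pred)

print("RMSE:", rmse)
print("accuracy after rounding:", accuracy)
```

If the target really is a set of classes, training a classifier such as RandomForestClassifier is usually the more direct route than rounding a regressor's output.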

High Score in Train Test Split but Low Score in CV in Python Scikit-Learn

I am new to data science and have been struggling with a Kaggle problem. When I use random forest regression to predict the rating, I get a high score with a train/test split but a low score with cross-validation.
with train test split_randomforest 0.8746277302652172
with no train test split_randomforest 0.8750717943467078
with CV randomforest 10.713885026374156 %
https://www.kaggle.com/data13/machine-learning-model-to-predict-app-rating-94
import time
import datetime
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.metrics import r2_score
import statsmodels.api as sm
import sklearn.model_selection as ms
from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree
from sklearn import svm
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from lightgbm import LGBMClassifier
database = pd.read_csv(r"C:\Users\Anson\Downloads\49864_274957_bundle_archive\googleplaystore.csv")  # load the Google Play Store dataset
## Size - Strip the M and k value
database['Size'] = database['Size'].apply(lambda x : x.strip('M'))
database['Size'] = database['Size'].apply(lambda x : x.strip('k'))
##
## Rating - Fill the Blank Value with median
database['Rating'].fillna(database['Rating'].median(),inplace=True)
database['Rating'].replace(19,database['Rating'].median(),inplace=True)
###
## Reviews - replace the blank cell
database['Reviews'].replace('3.0M',3000000,inplace=True)
database['Reviews'].replace('0',float("NaN"),inplace=True)
database.dropna(subset=['Reviews'],inplace=True)
##
## Strip the + value
database['Installs'] = database['Installs'].apply(lambda x : x.strip('+'))
database['Installs'] = database['Installs'].apply(lambda x : x.replace(',',''))
database['Price'] = database['Price'].apply(lambda x : x.strip('$'))
###
## Drop Blank
database['Content Rating'].fillna("NaN",inplace=True)
database.dropna(subset=['Content Rating'],inplace=True)
##
## Drop Wrong Number
database['Last Updated'].replace('1.0.19',float("NaN"),inplace=True)
database.dropna(subset=['Last Updated'],inplace=True)
database['Last Updated'] = database['Last Updated'].apply(lambda x : time.mktime(datetime.datetime.strptime(x, '%B %d, %Y').timetuple()))
##
le = preprocessing.LabelEncoder()
database['App'] = le.fit_transform(database['App'])
database['Category'] = le.fit_transform(database['Category'])
database['Content Rating'] = le.fit_transform(database['Content Rating'])
database['Type'] = le.fit_transform(database['Type'])
database['Genres'] = le.fit_transform(database['Genres'])
###############################
##feature engineering
features = ['App', 'Reviews', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated']
X=database[features]
y=database['Rating']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=None)
rfc= RandomForestRegressor()
rfc.fit(X_train,y_train)
rfc.fit(X,y)
rfc_score=rfc.score(X_test,y_test)
rfc_score1=rfc.score(X,y)
score_CV_randomforest = cross_val_score(rfc,X,y,cv=KFold(n_splits=5, shuffle=True),scoring='r2')
score_CV_randomforest = score_CV_randomforest.mean()*100
print("with train test split_randomforest", rfc_score)
print("with no train test split_randomforest", rfc_score1)
print("with CV randomforest", score_CV_randomforest, "%")

Train/test split:
You are using an 80:20 ratio for training and testing.
Cross-validation:
The data set is randomly split into k groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. The process is then repeated until each group has been used as the test set.
You are using 5-fold cross-validation, so the data set is split into 5 groups and the model is trained and tested 5 separate times, with each group serving as the test set once.
So one reason for the different results is that the model is trained and scored on different random samples. Note also that the posted code calls rfc.fit(X, y) right after rfc.fit(X_train, y_train), so rfc.score(X_test, y_test) is computed on rows the model has already seen during training. That inflates the split score, whereas every cross-validation fold is scored on rows that were held out from fitting.
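The gap between a score on already-seen rows and a held-out score can be reproduced on synthetic data. The following is a minimal sketch, not the Play Store data: the features, noise level and variable names are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic, noisy regression problem (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Fitting on ALL rows and then scoring on X_test: the test rows were seen in training.
leaky_score = model.fit(X, y).score(X_test, y_test)

# Fitting on the training rows only: an honest hold-out score.
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every fold is scored on rows held out from fitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print("score on rows seen in training:", leaky_score)
print("hold-out score:                ", holdout_score)
print("mean 5-fold CV score:          ", cv_scores.mean())
```

In this setup the hold-out and cross-validated scores tend to agree with each other far more closely than either agrees with the score on rows seen in training.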

How to generate the confusion matrix through cross-validation in Python?

I am using the iris flower dataset for classification. I need to build a confusion matrix from 10-fold cross-validation, but I don't know how to do it. So far I have generated the confusion matrix for a single train/test round only.
# I am using TPOT autoML library for python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from sklearn.preprocessing import LabelEncoder
tpot_data = pd.read_csv('iris.csv')
tpot_data = tpot_data.apply(LabelEncoder().fit_transform)
features = tpot_data.drop('species', axis=1).values
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['species'].values, random_state=10)
exported_pipeline = make_pipeline(StackingEstimator(estimator=GaussianNB()),
                                  MultinomialNB(alpha=0.01, fit_prior=False)
                                  )
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(testing_target, results))
pd.crosstab(testing_target, results, rownames=['Actual Class'], colnames=['Predicted Class'])
from sklearn.model_selection import cross_val_score
array_cross_val_score = cross_val_score(estimator=exported_pipeline, X=training_features,
y=training_target, cv=10, scoring='accuracy')
# I would like the confusion matrix to be based on the average cross-validation
np.mean(array_cross_val_score)
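One common way to get a single confusion matrix out of cross-validation is to collect the out-of-fold prediction for every sample with cross_val_predict and pass those predictions to confusion_matrix. The sketch below assumes the copy of iris bundled with scikit-learn and a plain GaussianNB in place of the TPOT pipeline, so it runs without iris.csv or TPOT; the exported pipeline could be passed as the estimator in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 10 stratified folds; each sample is predicted exactly once,
# by the model that was trained without it.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
y_pred = cross_val_predict(GaussianNB(), X, y, cv=cv)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y, y_pred)
print(cm)
```

The result is a matrix of counts pooled over all folds rather than an average of ten separate matrices; dividing each row by its sum gives per-class rates if proportions are preferred.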

How to know whether I am overfitting/underfitting my data?

I have to build a regression model to predict wine quality based on 11 inputs. Currently I am evaluating the mean squared error, mean absolute error and R2 scores of various algorithms. I want to decide which algorithm to use, but before I do, I want to make sure my model is not overfitting or underfitting the data. Below is the link to the dataset I use (it's a bit different but the data is exactly the same) as well as my entire code.
Any help is greatly appreciated!
Data:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
Also, the Kaggle link where I copied most of my code from:
https://www.kaggle.com/jhansia/regression-models-analysis-on-the-wine-quality
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
wine = pd.read_csv('wineQualityReds.csv', usecols=lambda x: 'Unnamed' not in x,)
wine.head()
y = wine.quality
X = wine.drop('quality',axis = 1)
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state = 0, stratify = y)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(train_x)
train_x_scaled = scaler.transform(train_x)
test_x_scaled = scaler.transform(test_x)
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
models = []
models.append(('DecisionTree', DecisionTreeRegressor()))
models.append(('RandomForest', RandomForestRegressor()))
models.append(('GradienBoost', GradientBoostingRegressor()))
models.append(('SVR', SVR()))
names = []
for name,model in models:
    # random_state is only accepted together with shuffle=True
    kfold = model_selection.KFold(n_splits=5,shuffle=True,random_state=2)
    cv_results = model_selection.cross_val_score(model,train_x_scaled,train_y, cv= kfold, scoring = 'neg_mean_absolute_error')
    names.append(name)
    msg = "%s: %f" % (name, -1*(cv_results).mean())
    print(msg)
model = RandomForestRegressor()
model.fit(train_x_scaled,train_y)
pred_y = model.predict(test_x_scaled)
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(test_y, pred_y))
print('Mean Absolute Error:', metrics.mean_absolute_error(test_y, pred_y))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(test_y, pred_y)))
print('R2:', metrics.r2_score(test_y, pred_y))

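You can use cross-validation to check whether the model is overfitting or underfitting: compare the score on the training folds with the score on the held-out folds. A training error far below the validation error points to overfitting, while a high error on both points to underfitting.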
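A minimal sketch of that comparison uses cross_validate with return_train_score=True. The data here is synthetic (11 random features standing in for the wine inputs) and an unrestricted decision tree is used because it overfits readily; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the wine data: 11 inputs, noisy target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 11))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)

cv = KFold(n_splits=5, shuffle=True, random_state=2)
results = cross_validate(DecisionTreeRegressor(random_state=0), X, y,
                         cv=cv, scoring="neg_mean_absolute_error",
                         return_train_score=True)

# Scores are negated MAE, so flip the sign to get errors.
train_mae = -results["train_score"].mean()
val_mae = -results["test_score"].mean()
print("mean training MAE:  ", train_mae)
print("mean validation MAE:", val_mae)
```

A fully grown tree memorises the training folds, so its training error is close to zero while the validation error is not; limiting max_depth or switching to a more regularised model should narrow that gap.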

Using slices in Python

I use the dataset from UCI repo: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency
Then I do the following:
from pandas import *
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
dataset = read_excel('/Users/Half_Pint_boy/Desktop/ENB2012_data.xlsx')
dataset = dataset.drop(['X1','X4'], axis=1)
trg = dataset[['Y1','Y2']]
trn = dataset.drop(['Y1','Y2'], axis=1)
Then do the models and cross validate:
models = [LinearRegression(),
          RandomForestRegressor(n_estimators=100, max_features ='sqrt'),
          KNeighborsRegressor(n_neighbors=6),
          SVR(kernel='linear'),
          LogisticRegression()
          ]
Xtrn, Xtest, Ytrn, Ytest = train_test_split(trn, trg, test_size=0.4)
I'm creating a regression model for predicting values but have a problem. Here is the code:
TestModels = DataFrame()
tmp = {}
for model in models:
    m = str(model)
    tmp['Model'] = m[:m.index('(')]
    for i in range(Ytrn.shape[1]):
        model.fit(Xtrn, Ytrn[:,i])
        tmp[str(i+1)] = r2_score(Ytest[:,0], model.predict(Xtest))
    TestModels = TestModels.append([tmp])
TestModels.set_index('Model', inplace=True)
It shows unhashable type: 'slice' for the line model.fit(Xtrn, Ytrn[:,i])
How can I avoid this error and make the code work?
Thanks!

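I think I had a similar problem before! Ytrn is a pandas DataFrame, and a DataFrame does not support NumPy-style indexing such as Ytrn[:,i]; that is where the unhashable 'slice' comes from. Try converting your data to NumPy arrays before feeding them to sklearn estimators. That will most probably solve the hashing problem. For instance, you can do:
Xtrn_array = Xtrn.to_numpy()
Ytrn_array = Ytrn.to_numpy()
and use Xtrn_array and Ytrn_array when you fit your data to estimators (do the same for Ytest before indexing it). In older pandas versions the equivalent call was as_matrix(), which has since been removed.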
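The indexing difference can be checked on a tiny table. This is a sketch with invented numbers: the DataFrame below is a hypothetical stand-in for Ytrn, and the exact exception type raised by the failing lookup differs between pandas versions, so it is caught generically.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column target table standing in for Ytrn.
Ytrn = pd.DataFrame({"Y1": [15.5, 20.8, 21.4], "Y2": [21.3, 28.2, 25.4]})

# NumPy-style [:, i] on a DataFrame is treated as a column-label lookup and fails.
try:
    Ytrn[:, 0]
    numpy_style_failed = False
except Exception:
    numpy_style_failed = True

# Two working alternatives: positional indexing on the DataFrame,
# or converting to a NumPy array first.
first_by_position = Ytrn.iloc[:, 0]
first_from_array = Ytrn.to_numpy()[:, 0]

print("DataFrame[:, 0] failed:", numpy_style_failed)
print("iloc and array agree:  ",
      np.array_equal(first_by_position.to_numpy(), first_from_array))
```

Using .iloc[:, i] keeps the pandas index and column name, while to_numpy() drops them; either form is accepted by scikit-learn's fit and by r2_score.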
