I'm working on a dataset composed of 22 columns and 129 rows.
I'm using a Support Vector Machine to predict my dependent variable.
To do this, I converted the variable into a dummy that takes the values 0 and 1:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
Now, my question is:
I want to generate this dummy in a loop, for example:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 12 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 5 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 8 else 0)
and so on. I want to test my variable with different cutoffs (<12, <5, <8) and let the SVM evaluate each of them.
Full code:
import pandas as pd # pandas is used to load and manipulate data and for One-Hot Encoding
import numpy as np # data manipulation
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.model_selection import train_test_split # split data into training and testing sets
from sklearn import preprocessing # scale and center data
from sklearn.svm import SVC # this will make a support vector machine for classification
from sklearn.model_selection import GridSearchCV # this will do cross validation
from sklearn.metrics import confusion_matrix # this creates a confusion matrix
from sklearn.metrics import plot_confusion_matrix # draws a confusion matrix
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn import svm, datasets
datafile = (r'C:\Users\gpont\PycharmProjects\pythonProject2\data\Map\databaseCDP0.csv')
df = pd.read_csv(datafile, skiprows = 0, sep=';')
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
#Splitting data in two datasets
df_lowr = df[df['dummy_medianrat'] == 1]
df_higr = df[df['dummy_medianrat'] == 0]
df_downsample = pd.concat([df_lowr, df_higr])
len(df_downsample)
X = df_downsample.drop('dummy_medianrat', axis=1).copy()
X.head()
y = df_downsample['dummy_medianrat'].copy()
y.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    test_size=0.25)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.shape
X_test.shape
#Build A Preliminary Support Vector Machine
#We don't need to scale y_train because it only contains 0 and 1 (binary classification)
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)
titles_options = [("Confusion matrix, without normalization", None),
("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
disp = plot_confusion_matrix(clf_svm, X_test_scaled, y_test,
display_labels=["Did not default", "Defaulted"],
cmap=plt.cm.Blues,
normalize=normalize)
disp.ax_.set_title(title)
print(title)
print(disp.confusion_matrix)
After creating these dummies with different threshold values, I want to generate two confusion matrices (normalized and not) for each dummy created in the loop.
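Something like the sketch below is what I have in mind (the thresholds list is just an example; it assumes each cutoff leaves both classes non-empty, and I drop the raw median_rating column as well so the cutoff isn't leaked into the features):
thresholds = [5, 8, 12]
for t in thresholds:
    # rebuild the dummy for the current cutoff
    df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < t else 0)
    X = df.drop(['dummy_medianrat', 'median_rating'], axis=1).copy()
    y = df['dummy_medianrat'].copy()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                        test_size=0.25)
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    clf_svm = SVC(random_state=42)
    clf_svm.fit(X_train_scaled, y_train)
    # one raw and one normalized confusion matrix per cutoff
    for title, normalize in [("Confusion matrix, without normalization", None),
                             ("Normalized confusion matrix", 'true')]:
        disp = plot_confusion_matrix(clf_svm, X_test_scaled, y_test,
                                     cmap=plt.cm.Blues, normalize=normalize)
        disp.ax_.set_title(f"{title}, cutoff < {t}")
        print(title, "- cutoff <", t)
        print(disp.confusion_matrix)
    plt.show()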
I'm currently trying the following concept:
I applied np.log1p() to the independent variables and the dependent variable (price)
Assuming X = independent variables and Y = dependent variable, I train_test_split X & Y
Then I trained the LinearRegression(), Ridge(), Lasso(), and ElasticNet() models
Given that the labels I used to train the model were also log1p(Y), I'm assuming the model predictions are also log values?
If the predictions are log values, how come np.expm1 doesn't return a value that is on a similar scale?
Linear Regression Code for reference
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import skew
from scipy import stats
from scipy.stats import norm
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
df_num = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df_cat = pd.DataFrame(np.random.randint(0,2,size=(10000, 2)), columns=['cat1', 'cat2'])
price = pd.DataFrame(np.random.randint(0,100,size=(10000, 1)), columns=['price'])
y = price
skewness = df_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
df_num[skewed_features] = np.log1p(df_num[skewed_features])
y = np.log1p(y)
train = pd.concat([df_num, df_cat], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
def predict_price(A, B, C, D, cat1):
    cat1_index = np.where(train.columns == cat1)[0][0]
    x = np.zeros(len(train.columns))
    x[0] = np.log1p(A)
    x[1] = np.log1p(B)
    x[2] = np.log1p(C)
    x[3] = np.log1p(D)
    if cat1_index >= 0:
        x[cat1_index] = 1
    return np.expm1(lr_clf.predict([x])[0])
predict_price(20, 30, 15, 55, 'cat2')
EDIT1: I tried to recreate an example from scratch, but I can't seem to replicate the issue. With my real data the behaviour is:
predictions work totally fine if I DON'T log-normalize the inputs when training and DON'T log-normalize the inputs when predicting.
HOWEVER, when I do log-normalize when training, log-normalize the inputs when predicting, and np.expm1 the prediction, the value is totally off.
Please let me know if there is anything I can explain more clearly.
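For reference, this is the round trip I would expect to hold, sketched on synthetic data (all names and numbers here are made up):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_raw = rng.uniform(1, 100, size=(1000, 3))
y_raw = 5 * X_raw[:, 0] + 2 * X_raw[:, 1] + rng.normal(0, 1, 1000)

# apply the same log1p transform to the features at train time and at predict time
X_log = np.log1p(X_raw)
y_log = np.log1p(y_raw)
model = LinearRegression().fit(X_log, y_log)

x_new = np.array([[20.0, 30.0, 15.0]])
pred_log = model.predict(np.log1p(x_new))  # prediction is on the log1p scale
pred = np.expm1(pred_log)                  # back to the original price scale
print(pred)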
I have a data set and want to apply scaling and then PCA to a subset of a pandas dataframe, returning just the components plus the columns not being transformed. I'll use the mpg data set from seaborn, with mpg as the target and the remaining numeric columns as the features.
Now let's say I want to leave cylinders and displacement alone, scale everything else, and reduce it to 2 components. I'd expect the result to be 4 total columns: the original 2 plus the 2 components.
How can I use ColumnTransformer to apply the scaling to a subset of columns, then the PCA, and return only the components and the 2 passthrough columns?
MWE
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
pd.DataFrame(trans)
I strongly suspect my understanding of this step is wrong: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)]). I thought it would operate on the last 4 columns, first scaling them and then applying PCA, finally returning the 2 components. Instead I get 8 columns: the first 4 are scaled, the next 2 appear to be the components (likely computed on unscaled data), and the last 2 are the columns I pass through.
I think this works, but I don't know if it is the idiomatic Python/scikit-learn way to solve it:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
dtm_i2 = list(range(0, len(X_train.columns)-2))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)
pd.DataFrame(trans)
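An alternative I'm considering (I'm not sure it is more idiomatic) is to chain the scaler and the PCA in a Pipeline and hand that single pipeline to the ColumnTransformer, so the same columns are scaled first and then reduced; reusing X_train and dtm_i from the MWE above, that would be roughly:
from sklearn.pipeline import Pipeline

scale_then_pca = Pipeline([('scaler', StandardScaler()),
                           ('pca', PCA(n_components=2))])
preprocess = ColumnTransformer(transformers=[('scaled_pca', scale_then_pca, dtm_i)],
                               remainder='passthrough')
trans = preprocess.fit_transform(X_train)
pd.DataFrame(trans)  # 2 components followed by the 2 passthrough columns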
I'm trying to implement a complement naive Bayes classifier using sklearn. My data have very imbalanced classes (30k samples of class 0 and 6k samples of class 1) and I'm trying to compensate for this using class weights.
Here is the shape of my dataset: [screenshot of the dataframe]
I tried to use the compute_class_weight function to calculate the weights and then pass them to the fit function when training my model:
import numpy as np
import seaborn as sn
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.naive_bayes import ComplementNB
#Import the csv data
data = pd.read_csv('output_pt900.csv')
#Create the header of the csv file
header = []
for x in range(0, 2500):
    header.append('pixel' + str(x))
header.append('status')
#Add the header to the csv data
data.columns = header
#Replace the b's and the f's in the status column by 0 and 1
data['status'] = data['status'].replace('b',0)
data['status'] = data['status'].replace('f',1)
print(data)
#Drop the NaN values
data = data.dropna()
#Separate the features variables and the status
y = data['status']
x = data.drop('status',axis=1)
#Split the original dataset into two other: train and test
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)
all_together = y_train.to_numpy()
unique_classes = np.unique(all_together)
c_w = class_weight.compute_class_weight('balanced', unique_classes, all_together)
clf = ComplementNB()
clf.fit(x_train,y_train, c_w)
y_predict = clf.predict(x_test)
cm = confusion_matrix(y_test, y_predict)
svm = sn.heatmap(cm, cmap='Blues', annot=True, fmt='g')
figure=svm.get_figure()
figure.savefig('confusion_matrix_cnb.png', dpi=400)
plt.show()
but I got this error:
ValueError: sample_weight.shape == (2,), expected (29752,)!
Does anyone know how to use class weights with sklearn models?
compute_class_weight returns an array whose length equals the number of unique classes, with the weight to assign to instances of each class (link). So if there are 2 unique classes, c_w has length 2 and contains the weights that should be assigned to samples with label 0 and 1.
When calling fit on your model, a weight for each individual sample is expected via the sample_weight argument. This should explain the error you received: two class weights were passed where 29752 per-sample weights were expected. To solve this, use the c_w returned by compute_class_weight to build an array of individual sample weights. You could do this with [c_w[i] for i in all_together]. Your fit call would ultimately look something like:
clf.fit(x_train, y_train, sample_weight=[c_w[i] for i in all_together])
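If it is more convenient, sklearn also provides compute_sample_weight, which builds the per-sample weights directly from the labels; assuming the same x_train/y_train as above, the following should be equivalent:
from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' weights each sample inversely proportional to its class frequency
sample_weights = compute_sample_weight('balanced', y_train)
clf = ComplementNB()
clf.fit(x_train, y_train, sample_weight=sample_weights)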
I have made a classifier using Logistic Regression, and to test it I used the breast cancer dataset available at:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
This dataset contains missing values, so I have handled those values in three ways:
Fill them with a value that is far below any value in the dataset
Use an imputer with the data frame
Use an imputer, but with a numpy array instead of the data frame
The issue is that the results from options (1) and (3) are very similar, but option (2) produces a huge Type II error. My code and results are:
import pandas as pd
import numpy as np
from sklearn import preprocessing, model_selection, linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score,confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
def readfile(name):
    df = pd.read_csv(name, names=['id', 'clump_thickness', 'unif_cell_size',
                                  'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size',
                                  'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class'])
    return df
def outlier(df):
    #OPTION 1
    df.drop(['id'], 1, inplace=True)
    df.replace('?', -99999, inplace=True)
    return df
def mediaFill(df):
    #OPTION 2
    df.replace('?', np.NaN, inplace=True)
    imp = SimpleImputer(missing_values=np.NaN)
    idf = pd.DataFrame(imp.fit_transform(df))
    idf.columns = df.columns
    idf.index = df.index
    return idf
def funcFill():
    #OPTION 3
    data = np.genfromtxt("breast-cancer-wisconsin.data", delimiter=",")
    X = data[:, 1:-1]
    X[X == '?'] = 'NaN'
    imputer = Imputer()
    X = imputer.fit_transform(X)
    y = data[:, -1].astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    lg = linear_model.LogisticRegression(solver="liblinear")
    lg.fit(X_train, y_train)
    predictions = lg.predict(X_test)
    cm = confusion_matrix(y_test, predictions)
    print(cm)
    score = lg.score(X_test, y_test)
    print(score)
def LogisticFunc(df):
    X = np.array(df.drop(['class'], 1))
    y = np.array(df['class'])
    labels = [2, 4]
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    clf = linear_model.LogisticRegression(solver="liblinear")
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    conf = confusion_matrix(y_test, y_pred, labels)
    print(conf)
    print(accuracy_score(y_pred, y_test))
def main():
    file = "breast-cancer-wisconsin.data"
    df = readfile(file)
    df = outlier(df)
    LogisticFunc(df)
    df = readfile(file)
    df = mediaFill(df)
    LogisticFunc(df)
    df = readfile(file)
    funcFill()

if __name__ == "__main__":
    main()
My results are:
Option 1:
[[97 1]
[ 2 40]]
Option 2:
[[89 0]
[51 0]]
Option 3:
[[92 2]
[ 2 44]]
Why does option 2 differ so much from the others? Any help?
Thanks
In your third method you are using Imputer, while in the second you are using SimpleImputer.
The Imputer class was deprecated in sklearn version 0.20 and will be removed in version 0.22. You should use SimpleImputer instead.
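For example, the deprecated call in funcFill could be swapped for roughly the following (keeping the default mean strategy; np.nan is what genfromtxt produces for the unparsable '?' tokens):
from sklearn.impute import SimpleImputer

# impute the NaN cells with the column mean, as Imputer() did by default
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)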
I'm working on what I thought was a fairly simple machine learning problem.
In this problem the y (label) I want to classify is multi-class; the dataset has 6 possible classes.
I've been using the preprocessing.LabelBinarizer() function to pivot my y set into an array of ones and zeros, in the hope that this would be sufficient (e.g. [0 0 0 0 0 1]).
The code below fails on model.fit() due to a ValueError: Found arrays with inconsistent numbers of samples: [ 217 1302] (1302 is 217*6, by the way).
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
It seems that the binarizer returns results that appear to the algorithm as 6 columns instead of 1 column containing an array of 6 digits.
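For what it's worth, a tiny standalone sketch (with made-up gear labels) reproduces the shapes I'm seeing: LabelBinarizer returns a 2-D (n_samples, n_classes) array, and np.ravel flattens it into n_samples * 6 values, which is where the 1302 comes from:
import numpy as np
from sklearn import preprocessing

gears = ['g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g1']  # made-up labels, 6 classes
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(gears)
print(api_y.shape)            # (7, 6): one row per sample, one column per class
print(np.ravel(api_y).shape)  # (42,): ravel turns 7 samples into 7*6 flat labels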
I've tried to force it into an array format using the code below, but then the fit function bails for another reason: ValueError: Unknown label type: array([array([0, 1, 0, 0, 0]), array([0, 1, 0, 0...])
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y_list = []
for x in api_y:
    item = {'gear': np.array(x)}
    y_list.append(item)
y = pd.DataFrame(y_list)
print("after changing to binary classes array y is " + repr(y.shape))
y = np.ravel(y)
I also tried sklearn_pandas.DataFrameMapper, to no avail, as it also created 6 distinct fields instead of an array of 6 values represented as one field.
Any help or suggestions would be appreciated. The full version of what I thought was right is posted here for clarity:
#!/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn import metrics
import sklearn_pandas
#
# load training data taken from 2 years of strava rides
df = pd.DataFrame.from_csv("gear_train.csv")
#
# Prepare data for logistic regression
#
y, X = dmatrices('gear ~ distance + moving_time + total_elevation_gain + average_speed + max_speed + average_cadence + has_heartrate + device_watts', df, return_type="dataframe")
#
# Fix up y to be a flattened array of 1 column (binary array?)
#
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
#
# run the logistic regression
#
model = LogisticRegression()
model = model.fit(X, y)
score = model.score(X, y)
#
# evaluate the model by splitting into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
predicted = model2.predict(X_test)
print("predicted="+repr(lb2.inverse_transform(predicted)))
print(metrics.classification_report(y_test, predicted))
#
# do a 10-fold CV test to see if this model holds up
#
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores.mean())
The root cause of my problem was that the y field contained string values instead of numeric ones, for example b12345 as a key instead of 12345. Once I changed that to use a LabelEncoder for encoding and decoding, it worked like a champ.
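For anyone hitting the same thing, a minimal sketch of that encode/decode step (with made-up gear labels) could look like:
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'gear': ['b12345', 'b20567', 'b12345', 'b99999']})  # made-up labels
le = preprocessing.LabelEncoder()
y = le.fit_transform(df['gear'])    # string labels -> integers 0..n_classes-1
print(y)                            # [0 1 0 2]
print(le.inverse_transform(y))      # back to the original string labels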