I have a data set and want to apply scaling and then PCA to a subset of a pandas dataframe, returning just the components plus the columns not being transformed. I'm using the mpg data set from seaborn, with mpg as the target.
Now let's say I want to leave cylinders and displacement alone, scale everything else, and reduce it to 2 components. I'd expect the result to be 4 total columns: the original 2 plus the 2 components.
How can I use ColumnTransformer to do the scaling to a subset of columns, then the PCA and return only the components and the 2 passthrough columns?
MWE
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
pd.DataFrame(trans)
I strongly suspect my understanding of this step is wrong: preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i), ('PCA DTM', pca, dtm_i)]). I thought it would operate on the last 4 columns, first scaling and then applying PCA, and finally return the 2 components. Instead I get 8 columns: the first 4 are scaled, the next 2 appear to be the components (likely computed on data that wasn't scaled first), and the last 2 are the columns I passed through.
I think this works, but I don't know whether it is the idiomatic Python/scikit-learn way to solve it:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (StandardScaler)
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
df = sns.load_dataset('mpg').drop(["origin", "name"], axis = 1).dropna()
X = df.loc[:, ~df.columns.isin(['mpg'])]
y = df.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 21)
scaler = StandardScaler()
pca = PCA(n_components = 2)
dtm_i = list(range(2, len(X_train.columns)))
dtm_i2 = list(range(0, len(X_train.columns)-2))
preprocess = ColumnTransformer(transformers=[('scaler', scaler, dtm_i)], remainder='passthrough')
preprocess2 = ColumnTransformer(transformers=[('PCA DTM', pca, dtm_i2)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)
trans = preprocess2.fit_transform(trans)
pd.DataFrame(trans)
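For what it's worth, ColumnTransformer applies its transformers in parallel to the original input and concatenates their outputs, which explains the 8 columns above (4 scaled, 2 components computed on unscaled data, 2 passthrough). A common pattern for scale-then-PCA in one step is to nest a Pipeline inside the ColumnTransformer; a sketch reusing the variables from the MWE:
from sklearn.pipeline import Pipeline
# Scale first, then reduce, applied in sequence to the same columns
scale_pca = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))])
preprocess = ColumnTransformer(transformers=[('scale_then_pca', scale_pca, dtm_i)], remainder='passthrough')
trans = preprocess.fit_transform(X_train)  # 2 components + the 2 passthrough columns
pd.DataFrame(trans)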
I'm working on a dataset composed of 22 columns and 129 rows.
I'm using Support Vector Machine to predict my dependent variable.
To do this, I converted the variable into a dummy that takes the values 0 and 1:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
Now, here is my question:
I want to generate this dummy in a loop, for example:
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 12 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 5 else 0)
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 8 else 0)
and so on. I want to test my variable with different classifications (<12, <5, <8) and let the SVM test all of them.
Full code:
import pandas as pd # pandas is used to load and manipulate data and for One-Hot Encoding
import numpy as np # data manipulation
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.model_selection import train_test_split # split data into training and testing sets
from sklearn import preprocessing # scale and center data
from sklearn.svm import SVC # this will make a support vector machine for classification
from sklearn.model_selection import GridSearchCV # this will do cross validation
from sklearn.metrics import confusion_matrix # this creates a confusion matrix
from sklearn.metrics import plot_confusion_matrix # draws a confusion matrix
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn import svm, datasets
datafile = (r'C:\Users\gpont\PycharmProjects\pythonProject2\data\Map\databaseCDP0.csv')
df = pd.read_csv(datafile, skiprows = 0, sep=';')
df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < 13 else 0)
#Splitting data into two datasets
df_lowr = df[df['dummy_medianrat'] == 1]
df_higr = df[df['dummy_medianrat'] == 0]
df_downsample = pd.concat([df_lowr, df_higr])
len(df_downsample)
X = df_downsample.drop('dummy_medianrat', axis=1).copy()
X.head()
y = df_downsample['dummy_medianrat'].copy()
y.head()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    test_size=0.25)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.shape
X_test.shape
#Build A Preliminary Support Vector Machine
#We don't need to scale y_train because it is 0/1 (binary classification)
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(clf_svm, X_test_scaled, y_test,
                                 display_labels=["Did not default", "Defaulted"],
                                 cmap=plt.cm.Blues,
                                 normalize=normalize)
    disp.ax_.set_title(title)
    print(title)
    print(disp.confusion_matrix)
After creating some dummies with different threshold values, I want to generate two confusion matrices (normalized and not) for each dummy created in the loop.
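One way to do this is sketched below, reusing the imports and data from the code above and assuming the remaining feature columns are numeric. It loops over the thresholds, rebuilds the dummy, refits the SVM, and draws both matrices per threshold. Note that median_rating itself should probably be dropped from X, otherwise the model can read the label straight off it:
for t in [5, 8, 12]:
    df['dummy_medianrat'] = df['median_rating'].apply(lambda x: 1 if x < t else 0)
    # Drop the source column too, or the classifier recovers the dummy trivially
    X = df.drop(['dummy_medianrat', 'median_rating'], axis=1).copy()
    y = df['dummy_medianrat'].copy()
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                        test_size=0.25)
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    clf_svm = SVC(random_state=42)
    clf_svm.fit(X_train_scaled, y_train)
    titles_options = [("Confusion matrix (x < %d), without normalization" % t, None),
                      ("Normalized confusion matrix (x < %d)" % t, 'true')]
    for title, normalize in titles_options:
        disp = plot_confusion_matrix(clf_svm, X_test_scaled, y_test,
                                     cmap=plt.cm.Blues, normalize=normalize)
        disp.ax_.set_title(title)
        print(title)
        print(disp.confusion_matrix)
plt.show()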
I'm studying support vector regression, but I ran into a problem: my R^2 score comes out negative. Is that normal, or is there something in my code I can change to fix it?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR
df = pd.read_csv('Position_Salaries.csv')
df.head()
X = df.iloc[:, 1:2].values
y = df.iloc[:, -1].values
from sklearn.preprocessing import StandardScaler
y = y.reshape(len(y),1)
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X = x_scaler.fit_transform(X)
y = y_scaler.fit_transform(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
regressor = SVR(kernel="rbf")
regressor.fit(x_train,y_train.ravel())
y_pred = y_scaler.inverse_transform(regressor.predict(x_scaler.transform(x_test)))
from sklearn.metrics import r2_score
r2_score(y_scaler.inverse_transform(y_test), y_pred)
My output is -0.5313206322807349
In this part, your X is already scaled:
X = x_scaler.fit_transform(X)
In this part, your x_test is also in scaled form:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
When making predictions, you shouldn't transform your input again, since x_test is already scaled:
y_pred = y_scaler.inverse_transform(regressor.predict(x_scaler.transform(x_test)))
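A sketch of the corrected call; the reshape matters on recent scikit-learn versions, where StandardScaler.inverse_transform expects a 2-D array:
y_pred = y_scaler.inverse_transform(regressor.predict(x_test).reshape(-1, 1))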
From the documentation of sklearn.metrics.r2_score:
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
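A minimal illustration of how a negative score arises, using predictions that are worse than just predicting the mean:
from sklearn.metrics import r2_score
# The mean of y_true is 2; these predictions miss by more than the constant model would
print(r2_score([1, 2, 3], [3, 3, 3]))  # -1.5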
Let's take this data:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
And consider the following code:
#Defining X, y - independent and dependent variables
X = df.drop(df.columns[[1]], axis=1)
y = (df[1] == 'B').astype(int)
clf = LogisticRegression(solver="lbfgs")
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, validation in kfold.split(X, y):
    # Fit the model
    clf.fit(X[train], y[train])
And the following error occurs:
Do you have any idea why it occurs? I don't think I did anything complicated, so I'm not sure what exactly I did wrong.
X is a DataFrame, so X[train] tries to select columns by label; use .iloc to select rows by position instead:
for train_index, validation_index in kfold.split(X, y):
    # Fit the model
    X_train = X.iloc[train_index]
    y_train = y[train_index]
    clf.fit(X_train, y_train)
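For the same reason, it is arguably safer to index y positionally as well; y[train_index] only works here because y has a default RangeIndex:
y_train = y.iloc[train_index]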
I have made a classifier using Logistic Regression and for testing it I used the breast cancer dataset available at:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
This dataset contains missing values, so I have handled them in three ways:
Fill them with a value that is below any data in the dataset
Use a SimpleImputer with the data frame
Use an Imputer, but with a numpy array instead of the data frame
The issue is that the results from options (1) and (3) are almost identical, but option (2) makes a huge Type II error. My code and results are:
import pandas as pd
import numpy as np
from sklearn import preprocessing, model_selection, linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score,confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
def readfile(name):
    df = pd.read_csv(name, names=['id', 'clump_thickness', 'unif_cell_size',
                                  'unif_cell_shape', 'marg_adhesion', 'single_epith_cell_size',
                                  'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class'])
    return df
def outlier(df):
    #OPTION 1
    df.drop(['id'], 1, inplace=True)
    df.replace('?', -99999, inplace=True)
    return df

def mediaFill(df):
    #OPTION 2
    df.replace('?', np.NaN, inplace=True)
    imp = SimpleImputer(missing_values=np.NaN)
    idf = pd.DataFrame(imp.fit_transform(df))
    idf.columns = df.columns
    idf.index = df.index
    return idf

def funcFill():
    #OPTION 3
    data = np.genfromtxt("breast-cancer-wisconsin.data", delimiter=",")
    X = data[:, 1:-1]
    X[X == '?'] = 'NaN'
    imputer = Imputer()
    X = imputer.fit_transform(X)
    y = data[:, -1].astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    lg = linear_model.LogisticRegression(solver="liblinear")
    lg.fit(X_train, y_train)
    predictions = lg.predict(X_test)
    cm = confusion_matrix(y_test, predictions)
    print(cm)
    score = lg.score(X_test, y_test)
    print(score)

def LogisticFunc(df):
    X = np.array(df.drop(['class'], 1))
    y = np.array(df['class'])
    labels = [2, 4]
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    clf = linear_model.LogisticRegression(solver="liblinear")
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    conf = confusion_matrix(y_test, y_pred, labels)
    print(conf)
    print(accuracy_score(y_pred, y_test))

def main():
    file = "breast-cancer-wisconsin.data"
    df = readfile(file)
    df = outlier(df)
    LogisticFunc(df)
    df = readfile(file)
    df = mediaFill(df)
    LogisticFunc(df)
    df = readfile(file)
    funcFill()

if __name__ == "__main__":
    main()
My results are:
Option 1:
[[97 1]
[ 2 40]]
Option 2:
[[89 0]
[51 0]]
Option 3:
[[92 2]
[ 2 44]]
Why does option 2 differ so much? Any help?
Thanks
In your third method you are using Imputer, while in the second you are using SimpleImputer.
The Imputer class was deprecated in sklearn version 0.20 and will be removed in 0.22. You should always use SimpleImputer.
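A sketch of a drop-in replacement for the third method, assuming the same array-based setup; note that np.genfromtxt already turns unparseable fields such as '?' into np.nan:
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)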
I have a data sample of 750x256.
Rows = 750
Columns = 256
If I split my data with test_size = 0.2, I will have 600 samples in X_train but only 150 in y_train.
Then the problem occurs when fitting the DecisionTreeRegressor:
it says Number of y_train=150 does not match number of samples=600.
But if I set test_size to 50%, it works.
Is there a way around this? I don't want to use 50% of my data for testing.
Any help would be great!
Here is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import graphviz
#Load the data
dataset = pd.read_csv('new_york.csv')
dataset['Higher'] = dataset['2016-12'].gt(dataset['2016-11']).astype(int)
X = dataset.iloc[:, 6:254].values
y = dataset.iloc[:, 255].values
#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, :248])
X[:, :248] = imputer.transform(X[:, :248])
#Split the data into train and test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_test, y_train = train_test_split(X, y, test_size = .2, random_state = 0)
#let's build our first model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
clf = DecisionTreeClassifier(max_depth=6)
clf.fit(X_train, y_train)
clf.score(X_train, y_train)
train_test_split() returns X_train, X_test, y_train, y_test; you have y_train and y_test in the wrong order.
A 50% split does not raise an error because y_train and y_test then have the same size (but obviously hold the wrong values).
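The corrected unpacking is sketched below; as an aside, on modern scikit-learn train_test_split lives in sklearn.model_selection (the sklearn.cross_validation module has been removed):
from sklearn.model_selection import train_test_split
# Order matters: the X splits come first, then the y splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)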