I am getting a SettingWithCopyWarning from pandas when performing the operation below. I understand what the warning means and I know I can turn it off, but I am curious whether I am performing this type of standardization incorrectly with a pandas DataFrame (I have mixed data with categorical and numeric columns). My numbers look fine after checking, but I would like to clean up my syntax to make sure I am using pandas correctly.
I am curious if there is a better workflow for this type of operation when dealing with data sets that have mixed data types like this.
My process is as follows with some toy data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from typing import List
# toy data with categorical and numeric data
df: pd.DataFrame = pd.DataFrame([['0', 100, 'A', 10],
                                 ['1', 125, 'A', 15],
                                 ['2', 134, 'A', 20],
                                 ['3', 112, 'A', 25],
                                 ['4', 107, 'B', 35],
                                 ['5', 68, 'B', 50],
                                 ['6', 321, 'B', 10],
                                 ['7', 26, 'B', 27],
                                 ['8', 115, 'C', 64],
                                 ['9', 100, 'C', 72],
                                 ['10', 74, 'C', 18],
                                 ['11', 63, 'C', 18]], columns=['id', 'weight', 'type', 'age'])
df.dtypes
id object
weight int64
type object
age int64
dtype: object
# select categorical data for later operations
cat_cols: List = df.select_dtypes(include=['object']).columns.values.tolist()
# select numeric columns for later operations
numeric_cols: List = df.columns[df.dtypes.apply(lambda x: np.issubdtype(x, np.number))].values.tolist()
# prepare data for modeling by splitting into train and test
# use the means/standard deviations from the TRAINING SET only
# and apply them to the test set, so that no test-set information leaks into the fitted scaler
X: pd.DataFrame = df.copy()
y: pd.Series = df.pop('type')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# perform standardization of numeric variables using the mean and standard deviations of the training set only
X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train_numeric_tmp)
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
<ipython-input-15-74f3f6c70f6a>:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Your X_train, X_test are still slices of the original dataframe. Modifying a slice triggers the warning and often doesn't work.
You can either transform before train_test_split, or do X_train = X_train.copy() after the split and then transform.
The second approach prevents the information leak mentioned in your comments. So something like this:
# these 2 lines don't look good to me
# X: pd.DataFrame = df.copy() # don't you drop the label?
# y: pd.Series = df.pop('type') # y = df['type']
# pass them directly instead
features = [c for c in df if c!='type']
X_train, X_test, y_train, y_test = train_test_split(df[features], df['type'],
                                                    test_size=0.2,
                                                    random_state=0)
# now copy what we want to transform
X_train = X_train.copy()
X_test = X_test.copy()
## Code below should work without warning
############
# perform standardization of numeric variables using the mean and standard deviations of the training set only
# you don't need to copy the data to fit
# X_train_numeric_tmp: pd.DataFrame = X_train[numeric_cols].values
X_train_scaler = preprocessing.StandardScaler().fit(X_train[numeric_cols])
X_train[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]: pd.DataFrame = X_train_scaler.transform(X_test[numeric_cols])
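As an alternative sketch for this mixed-type workflow (reusing the numeric_cols, X_train and X_test defined above), you could let a ColumnTransformer handle the numeric scaling and leave the categorical columns untouched; it is still fit on the training split only, so there is no leakage:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# scale only the numeric columns, pass the categorical ones through unchanged
ct = ColumnTransformer([("scale", StandardScaler(), numeric_cols)], remainder='passthrough')
X_train_scaled = ct.fit_transform(X_train)  # statistics come from the training set only
X_test_scaled = ct.transform(X_test)        # reuse the training means/standard deviations
Note that the output is a NumPy array (scaled numeric columns first, passthrough columns after), not a DataFrame.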
I will try to explain both pd.get_dummies() and OneHotEncoder() for transforming categorical data into dummy columns, but I do recommend OneHotEncoder(), because it is a sklearn transformer that you can use in a Pipeline later if you want.
First, OneHotEncoder(): it does the same job as pandas' pd.get_dummies function, but it returns a NumPy ndarray or a sparse array. You can read more about this class in the sklearn documentation.
from sklearn.preprocessing import OneHotEncoder
X_train_cat = X_train[["type"]]
cat_encoder = OneHotEncoder(sparse=False)
X_train_cat_1hot = cat_encoder.fit_transform(X_train_cat) #This is a numpy ndarray!
#If you want to make a DataFrame again, you can do so like below:
#X_train_cat_1hot = pd.DataFrame(X_train_cat_1hot, columns=cat_encoder.categories_[0])
#You can also concatenate this transformed dataframe with your numerical transformed one.
Second method, pd.get_dummies():
df_dummies = pd.get_dummies(X_train[["type"]])
X_train = pd.concat([X_train, df_dummies], axis=1).drop("type", axis=1)
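Since the main selling point of OneHotEncoder() is that it slots into a Pipeline, here is a rough sketch of that idea (the feature lists and the LogisticRegression model are placeholders for illustration, not something from the question):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
# hypothetical feature lists -- substitute the numeric/categorical columns you keep as predictors
num_features = ['weight', 'age']
cat_features = ['some_categorical_column']
pre = ColumnTransformer([("num", StandardScaler(), num_features),
                         ("cat", OneHotEncoder(handle_unknown='ignore'), cat_features)])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
# model.fit(X_train, y_train)    # preprocessing statistics come from the training split only
# model.score(X_test, y_test)
Fitting the whole Pipeline on the training split keeps the scaling and encoding leak-free with no manual bookkeeping.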
I am working with the Adult data set. I have made a new DataFrame from a few of its columns and normalized a couple of them, and I am trying to run a KNN on that data. I am getting a ValueError: Unknown label type: 'continuous' error when trying to run
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
After researching the error online, it seems that I need to use a label encoder on my data after I have normalized it, because it is now a float rather than an int, but I am having trouble using the label encoder. The code I am using is:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Cleaning data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[Adult.loc[:, "race"] == "Amer-Indian-Eskimo", "race"] = "Other" #consolidating categorical data in the race column
Adult.loc[:,"race"].value_counts().plot(kind='bar') #plotting the consolidated categorical data in the race column
plt.title('race after consolidation')
plt.show()
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int) #One-hot encoding the categorical data/creating new indicator columns from the race column
Adult.loc[:, "Black"] = (Adult.loc[:, "race"] == "Black").astype(int)
Adult.loc[:, "Asian-Pac-Islander"] = (Adult.loc[:, "race"] == "Asian-Pac-Islander").astype(int)
Adult.loc[:, "Other"] = (Adult.loc[:, "race"] == "Other").astype(int)
Adult.loc[:,"Other"] #Verifying One-hot encoding for Other column
Adult = Adult.drop("race", axis=1) #removing the obsolete column "race"
Minage = min(Adult.loc[:,"age"]) #MinMax normalizing the age column
Maxage = max(Adult.loc[:,"age"])
MinMaxage = (Adult.loc[:,"age"] - Minage)/(Maxage - Minage)
Minhours = min(Adult.loc[:,"hoursperweek"]) #MinMax normalizing the hoursperweek column
Maxhours = max(Adult.loc[:,"hoursperweek"])
MinMaxhours = (Adult.loc[:,"hoursperweek"] - Minhours)/(Maxhours - Minhours)
df2 = pd.DataFrame() #creating a dataframe to plot the normalized data
df2.loc[:,0] = Adult.loc[:, "White"] #filling the data frame
df2.loc[:,1] = MinMaxage
df2.loc[:,2] = MinMaxhours
df2.columns = ["White","MinMaxage","MinMaxhours"]
X = np.array(df2.drop(['MinMaxhours'], 1))
y = np.array(df2['MinMaxhours'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
clf.predict(X_test)
y_test
Could someone help me out with how to label encode the data so I can perform KNN on it? I have looked it up on the sklearn site and in different examples, but I am still having trouble using it on my dataset. I receive the error message when trying to fit the data by running clf.fit(X_train, y_train).
It looks like you have a regression problem instead of a classification problem. You are trying to predict the MinMaxhours variable, which is a real number. If you are trying to predict a real number you should use the regression version of the nearest neighbors algorithm. The following code should work in order to get a prediction.
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor()
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
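One caveat worth keeping in mind: score on KNeighborsRegressor reports R² rather than classification accuracy, so the evaluation looks a little different. A minimal sketch, assuming the X_test/y_test split from your code:
from sklearn.metrics import mean_squared_error
r2 = clf.score(X_test, y_test)                 # coefficient of determination (R^2), not accuracy
mse = mean_squared_error(y_test, y_test_pred)  # mean squared error of the predictions
print(r2, mse)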
I have 2 csv files. One is a training dataset and the other is a test dataset. The training dataset contains 36 columns; one of them is the outcome, which has A-F as values. The test dataset has 35 columns and does not have the outcome. I want to add an outcome column to the test dataset as well. I searched several tutorials but did not find the method I should follow. Can anyone explain the process I should follow?
You haven't supplied any sample data or the technique you want to use, but the code below should help you understand how to make predictions in general:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
Assuming you have already read the two csv files into train and test:
X_train = train.loc[:, train.columns != 'Outcome'] # <---- Here you exclude the Outcome from train
y_train = train['Outcome'] # <---- This is your Outcome
le = LabelEncoder()
y_train = le.fit_transform(y_train) # <---- Here you convert your A-F values to numeric(0-5)
I am assuming the rest of the X variables are numeric.
rf = RandomForestClassifier() # <---- Here you call the classifier
rf.fit(X_train, y_train) # <---- Fit the classifier on train data
rf.score(X_train, y_train) # <---- Check model accuracy
y_pred = pd.DataFrame(rf.predict(test), columns = ['Outcome']) # <---- Make predictions on test data
test_pred = pd.concat([test, y_pred['Outcome']], axis = 1) # <---- Here you add predictions column to your test dataset
test_pred.to_excel(r'path\Test.xlsx')
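Note that rf.predict returns the encoded 0-5 values, so if you want the original A-F labels in the Outcome column, one way is to decode the predictions with the same LabelEncoder before concatenating:
y_pred['Outcome'] = le.inverse_transform(y_pred['Outcome']) # <---- Map the 0-5 predictions back to A-F
test_pred = pd.concat([test, y_pred['Outcome']], axis = 1)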
That depends on how you will find/calculate the outcome you need to add.
One way would be to load the test dataset as a pandas data frame, calculate the outcome, and add the values to a list which you then add to your pandas dataframe:
import pandas as pd
data = pd.DataFrame(columns=['Names', 'Age', 'Outcome'])
names = ['John', 'Nicole', 'Evan']
age = [53, 23, 27]
data['Names'] = names
data['Age'] = age
outcome = [6545, 5252, 85665]
data['Outcome'] = outcome
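Once the outcome values are in the data frame, you can write it back out, for example (the file name here is just an example):
data.to_csv('test_with_outcome.csv', index=False)  # save the dataframe with the new Outcome column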
I read this blog about new things in scikit-learn. The OneHotEncoder accepting strings seems like a useful feature. Below is my attempt to use it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_df = pd.read_csv('../../data/train.csv', usecols=cols)
test_df = pd.read_csv('../../data/test.csv', usecols=[e for e in cols if e != 'Survived'])
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False), ['Sex', 'Embarked'])], remainder='passthrough')
X_train_t = ct.fit_transform(train_df)
X_test_t = ct.fit_transform(test_df)
print(X_train_t[0])
print(X_test_t[0])
# [ 0. 1. 0. 0. 1. 0. 3. 22. 1. 0. 7.25]
# [ 0. 1. 0. 1. 0. 3. 34.5 0. 0. 7.8292]
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t) # ValueError: X has 10 features per sample; expecting 11
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print(acc_log)
I encounter the Python error below with this code, and I also have some additional concerns.
ValueError: X has 10 features per sample; expecting 11
To start from the beginning: this script is written for the "titanic" dataset from Kaggle. We have five numerical columns Pclass, Age, SibSp, Parch and Fare. The columns Sex and Embarked are categories: male/female and Q/S/C (abbreviations of city names).
What I understood from the OneHotEncoder is that it creates dummy variables by placing additional columns. Well, actually the output of ct.fit_transform() is no longer a pandas dataframe but a numpy array. But as seen in the print debug statement there are now more than the original 7 columns.
There are three problems I encounter:
For some reason the test.csv has one less column. That would indicate to me that there is one less option in one of the categories. To fix that I would have to find all the available options in the categories over both train + test data, and then use these options (such as male/female) to transform the train and the test data separately. I have no idea how to do this with the tools I'm working with (pandas, scikit, etc). On second thought, after inspecting the data I cannot find the missing option in the test.csv.
I want to avoid the "dummy variable trap". Right now it seems that there are too many columns created. I was expecting 1 column for Sex (total options 2 - 1 to avoid the trap) and 2 for Embarked. With the additional 5 numerical columns that would come to 8 in total.
I don't recognize the output of the transform anymore. I would prefer a new dataframe where the new dummy columns are given their own names, such as Sex_male (1/0), Embarked_Q (1/0) and Embarked_S (1/0).
I'm only used to gretl, where dummifying a variable and leaving out one option is very natural. I don't know whether I'm doing it wrong in Python or whether this scenario is not part of the standard scikit-learn toolkit. Any advice? Maybe I should write a custom encoder for this?
I will try and answer all your questions individually.
Answer for Question 1
In your code you have used the fit_transform method on both your train and test data, which is not the correct way of doing it. Generally, fit_transform is applied only to your train data set; it fits the transformer, which is then used to transform your test data set. When you apply fit_transform to your test data, you transform it using only the options/levels of the categorical variables that happen to be present in the test set, and it is very possible that your test data does not contain all options/levels of all categorical variables. Because of this, the dimensions of your train and test data sets will differ, resulting in the error you got.
So the correct way of doing it would be:
X_train_t = ct.fit_transform(X_train)
X_test_t = ct.transform(X_test)
Answer for Question 2
If you want to avoid the "dummy variable trap", you can make use of the parameter drop (by setting it to "first") while creating the OneHotEncoder object in the ColumnTransformer. This will create just one column for Sex and two columns for Embarked, since they have two and three options/levels respectively.
So the correct way of doing it would be:
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex','Embarked'])], remainder='passthrough')
Answer for Question 3
As of now, get_feature_names is not implemented in sklearn for a ColumnTransformer with a passthrough remainder, so you cannot directly reconstruct your data frame with the new dummy columns plus the untouched columns. One workaround is to set remainder to "drop" in the ColumnTransformer and construct your data frame separately, as shown below:
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex', 'Embarked'])], remainder='drop')
X_train_t = ct.fit_transform(X_train)
A = pd.concat([X_train.drop(["Sex", "Embarked"], axis=1), pd.DataFrame(X_train_t, columns=ct.get_feature_names())], axis=1)
A.head()
which will give you a data frame with the remaining original columns alongside the named dummy columns.
Your final code will look like this:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_df = pd.read_csv('train.csv', usecols=cols)
test_df = pd.read_csv('test.csv', usecols=[e for e in cols if e != 'Survived'])
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_df = train_df.dropna()
test_df = test_df.dropna()
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()
categorical_values = ['Sex', 'Embarked']
X_train_cont = X_train.drop(categorical_values, axis=1)
X_test_cont = X_test.drop(categorical_values, axis=1)
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), categorical_values)], remainder='drop')
X_train_categorical = ct.fit_transform(X_train)
X_test_categorical = ct.transform(X_test)
X_train_t = pd.concat([X_train_cont, pd.DataFrame(X_train_categorical, columns=ct.get_feature_names())], axis=1)
X_test_t = pd.concat([X_test_cont, pd.DataFrame(X_test_categorical, columns=ct.get_feature_names())], axis=1)
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t)
acc_log = round(logreg.score(X_train_t, Y_train) * 100, 2)
print(acc_log)
80.34
And when you call X_train_t.head() you will see the transformed data frame with the named dummy columns.
The recommended practice is the one suggested in Parthasarathy Subburaj's answer, but I have seen in Kaggle and other competitions that people fit on the complete data (train + test). If you want to try the same, use the following format:
ct.fit(X_complete)
X_train_t, X_test_t = ct.transform(X_train), ct.transform(X_test)
Yes, use drop='first' to get past this issue. At the same time, remember that this multicollinearity problem is not a big deal for non-linear models such as neural networks or even decision trees; I believe that is why drop='first' is not the default argument value.
get_feature_names is not implemented exhaustively for pipelines and other constructs in sklearn, so it is not completely supported in ColumnTransformer either.
Based on my experience, I built this wrapper for ColumnTransformer, which supports get_feature_names even when it contains pipelines or remainder='passthrough'.
It also picks up the actual feature names for get_feature_names instead of returning x0, x1, ..., because the real column names are available inside the ColumnTransformer via _feature_names_in.
from sklearn.compose import ColumnTransformer
from sklearn.utils.validation import check_is_fitted
def _get_features_out(name, trans, features_in):
    if hasattr(trans, 'get_feature_names'):
        return [name + "__" + f for f in
                trans.get_feature_names(features_in)]
    else:
        return features_in

class NamedColumnTransformer(ColumnTransformer):
    def get_feature_names(self):
        check_is_fitted(self)
        feature_names = []
        for name, trans, features, _ in self._iter(fitted=True):
            if trans == 'drop':
                continue
            if trans == 'passthrough':
                feature_names.extend(self._feature_names_in[features])
            elif hasattr(trans, '_iter'):
                for _, op_name, t in trans._iter():
                    features = _get_features_out(op_name, t, features)
                feature_names.extend(features)
            elif not hasattr(trans, 'get_feature_names'):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            else:
                feature_names.extend(_get_features_out(name, trans, features))
        return feature_names
Now, for your example,
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
# you can fetch the titanic dataset using this
X, y = fetch_openml("titanic", version=1,
                    as_frame=True, return_X_y=True)
# removing the columns which you are not using
X.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'],
       axis=1, inplace=True)
X.dropna(inplace=True)
X.reset_index(drop=True, inplace=True)
y = y[X.index]
categorical_values = ['sex', 'embarked']
ct = NamedColumnTransformer([("onehot", OneHotEncoder(
    sparse=False, drop="first"), categorical_values)], remainder='passthrough')
clf = Pipeline(steps=[('preprocessor', ct),
                      ('classifier', LogisticRegression(max_iter=5000))])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
clf[0].get_feature_names()
# ['onehot__sex_male',
# 'onehot__embarked_Q',
# 'onehot__embarked_S',
# 'pclass',
# 'age',
# 'sibsp',
# 'parch',
# 'fare']
pd.DataFrame(clf[0].transform(X_train), columns=clf[0].get_feature_names())
You can also try the NamedColumnTransformer on a more involved ColumnTransformer example.
I am attempting to train models with GradientBoostingClassifier using categorical variables.
The following is a primitive code sample, just for trying to input categorical variables into GradientBoostingClassifier.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
X_train = pandas.DataFrame(X_train)
# Insert fake categorical variable.
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40
# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
The following error appears:
ValueError: could not convert string to float: 'b'
From what I gather, it seems that One Hot Encoding on categorical variables is required before GradientBoostingClassifier can build the model.
Can GradientBoostingClassifier build models using categorical variables without having to do one hot encoding?
The R gbm package is capable of handling the sample data above. I'm looking for a Python library with equivalent capability.
pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back to the training data.
Below is the example code from the question with the above procedure carried out.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.
# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)
catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################
# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:,1] # Only look at P(y==1).
fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)
print(prob)
print(y_test)
print(roc_auc_prob)
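If you would rather use pandas.get_dummies for the dummy-matrix step (mentioned above as an alternative to statsmodels), a minimal sketch of the same idea looks like this:
import pandas as pd
# build the dummy matrix with pandas instead of statsmodels' categorical
cat_train = pd.get_dummies(pd.Series(['a']*40 + ['b']*40)).to_numpy(dtype=float)
cat_test = pd.get_dummies(pd.Series(['a']*10 + ['b']*10)).to_numpy(dtype=float)
X_train_dummies = np.concatenate((X[train_indices], cat_train), axis=1)
X_test_dummies = np.concatenate((X[test_indices], cat_test), axis=1)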
Thanks to Andreas Muller for advising that a pandas DataFrame should not be passed directly to scikit-learn estimators here.
Sure it can handle it; you just have to encode the categorical variables as a separate step in the pipeline. Sklearn is perfectly capable of handling categorical variables as well as R or any other ML package. The R package is still (presumably) doing one-hot encoding behind the scenes; it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).
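To make the "separate step in the pipeline" idea concrete, here is a minimal sketch (assuming the X_train DataFrame with the fake 'a'/'b' strings in column 0 and y_train from the question); the categorical column is one-hot encoded inside a ColumnTransformer and the numeric columns are passed through:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# encode column 0 (the 'a'/'b' strings), pass the remaining numeric columns through unchanged
pre = ColumnTransformer([("onehot", OneHotEncoder(handle_unknown='ignore'), [0])],
                        remainder='passthrough')
model = Pipeline([("pre", pre),
                  ("gbm", GradientBoostingClassifier(learning_rate=0.01, max_depth=8, n_estimators=50))])
model.fit(X_train, y_train)  # the encoding step runs automatically before the boosted trees are fit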