I'm trying to replace the categorical variable in the Gender column - M, F with 0, 1. However, after running my code I'm getting NaN in place of 0 & 1.
Code-
df['Gender'] = df['Gender'].map({'F':1, 'M':0})
My input data frame-
Dataframe after running the code-
Details- Gender (Data Type) - object
Kindly, suggest a way out!
The values in your DataFrame may differ from the expected strings 'F' and 'M' (for example, extra whitespace or different casing). Try LabelEncoder from scikit-learn:
from sklearn.preprocessing import LabelEncoder
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
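Before switching encoders, it can help to see why map returned NaN. map yields NaN for every value that is not a key in the dictionary, so stray whitespace, different casing, or running the same cell twice (the column already holds 0/1 the second time) are common causes. A minimal sketch, assuming a hypothetical frame whose Gender values carry that kind of noise:

```python
import pandas as pd

# Hypothetical data: values look like 'M'/'F' but carry whitespace or lowercase
df = pd.DataFrame({"Gender": ["M ", " F", "m", "F"]})

# repr() makes hidden whitespace visible
print([repr(v) for v in df["Gender"].unique()])

# Normalise first, then map; anything still unmapped would become NaN
cleaned = df["Gender"].str.strip().str.upper()
df["Gender"] = cleaned.map({"F": 1, "M": 0})
print(df["Gender"].tolist())
```

If NaN remains after this, the unique() output should show which raw value is missing from the dictionary.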
This particular code resolved the issue-
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Gender'.
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Gender'].unique()
Related
I have a column risk_appetite where there are some NaN in the column. The screenshot below is a summary of my column:
I plan to use KNN method to impute the missing value, and therefore, I need to do encoding first before the imputation. I'm using target encoding technique, and this is the function that I'm using:
from category_encoders import TargetEncoder
encoder = TargetEncoder(handle_missing = 'return_nan')
def targetencoder(data,col,target):
    data[col] = encoder.fit_transform(data[col], data[target])
Then, I call the function to encode my column:
listofcol_te = ['risk_appetite']
for col in listofcol_te:
    targetencoder(df,col,'target_variable')
Once the encoding is done, this is the output:
Everything is fine until here. Next, I start to do imputation (using MissForest imputation) for the column:
import sklearn.neighbors._base
import sys
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
# Copy the original dataset
data = df.copy()
# Impute
imputer = MissForest()
data_imputed = imputer.fit_transform(data)
data_imputed = pd.DataFrame(data=data_imputed, columns=data.columns)
I managed to impute all the NaN in risk_appetite, and this is the result:
As you can see from the screenshot above, there were initially only 5 categories for risk_appetite; after the imputation, there are 1332. MissForest seems to create new categories instead of assigning one of the existing categories to each NaN.
Did I do anything wrong, or should MissForest not be used for categorical features? What is the best way to impute risk_appetite if MissForest is not suitable? I have seen imputation by mean, mode and median, but I don't think those are good approaches. Any help or advice will be greatly appreciated!
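A likely explanation, offered as a guess rather than a confirmed diagnosis: after target encoding the column is numeric, so a forest-based imputer treats it as a regression target and predicts averaged values that need not match any of the 5 encoded levels. One alternative is to leave the column categorical and predict the missing labels with a classifier. The sketch below uses scikit-learn's RandomForestClassifier on made-up data; the age and income columns and the way risk_appetite is generated are purely hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data; only the column name risk_appetite comes from the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=200),
    "income": rng.normal(50_000, 10_000, size=200),
})
df["risk_appetite"] = pd.cut(
    df["age"], bins=[0, 30, 45, 100], labels=["high", "medium", "low"]
).astype(object)
# Knock out 30 values to simulate missing data
df.loc[rng.choice(200, size=30, replace=False), "risk_appetite"] = np.nan

features = ["age", "income"]
known = df["risk_appetite"].notna()

# Train on rows with a known label, predict labels for the rest
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(df.loc[known, features], df.loc[known, "risk_appetite"])

imputed = df.copy()
imputed.loc[~known, "risk_appetite"] = clf.predict(df.loc[~known, features])
print(imputed["risk_appetite"].value_counts())
```

Because the classifier can only output labels it saw during training, the imputed column keeps the original set of categories.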
I am using the Beers dataset in which I want to encode the data with datatype 'object'.
Following is my code.
from sklearn import preprocessing
df3 = BeerDF.select_dtypes(include=['object']).copy()
label_encoder = preprocessing.LabelEncoder()
df3 = df3.apply(label_encoder.fit_transform)
The following error is occurring.
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
Any insights are helpful!!!
Use:
df3 = df3.astype(str).apply(label_encoder.fit_transform)
From the TypeError, it seems that the column you want to encode contains two different data types, in your case string and float (NaN is a float), which raises the error. To avoid this, convert the column to a single uniform dtype (string or float). For example, in the Iris classification dataset, class=['Setosa', 'Versicolour', 'Virginica'] and not ['Setosa', 3, 5, 'Versicolour'].
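To find which columns mix types, you can count the Python types per column. The sketch below uses a hypothetical stand-in for the object columns of BeerDF (the style and city columns are invented) and fills missing values with an explicit 'missing' label before encoding, which is an assumption about how you want NaN treated.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for BeerDF.select_dtypes(include=['object'])
df3 = pd.DataFrame({
    "style": ["IPA", np.nan, "Stout"],
    "city": ["Bend", "Austin", "Bend"],
})

# Count the Python types in each column; NaN shows up as float
for col in df3.columns:
    print(col, df3[col].map(type).value_counts().to_dict())

# Give missing values their own label, then encode each column
# with a fresh encoder so the fitted classes do not overwrite each other
filled = df3.fillna("missing").astype(str)
encoded = filled.apply(lambda s: LabelEncoder().fit_transform(s))
print(encoded)
```

Using one encoder per column also makes it possible to keep each encoder around for inverse_transform later.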
I am trying to convert several columns of string data into numeric to feed into a classification model. An example dataset with one modification column would be:
input:
data = [['tom', 10], ['nick', 15], ['juli', 14], ['nick', '12']]
data = pd.DataFrame(data, columns = ['Name', 'Age'])
data
output:
   Name Age
0   tom  10
1  nick  15
2  juli  14
3  nick  12
I realize that scikit learn doesn't handle string data very well, but for now I'd really prefer to press onward with it, if possible (company restrictions). However, my issue is that if I use
sklearn.preprocessing.LabelEncoder
I am able to use '.classes_' to get some numeric values, such as:
input:
le = preprocessing.LabelEncoder()
le.fit(data['Name'])
le.classes_
vals = le.transform(le.classes_)
vals
I get
output:
array([0, 1, 2])
Since this array only contains three values, I cannot use
data['Name'] = vals
for assignment because my column length is 4 and my vals length is 3.
Considering this, is there an alternate way for me to go about this in scikit-learn or is my only option to use a different library?
You could also do this:
pd.get_dummies(data=data, columns=['Name'])
Output:
   Age  Name_juli  Name_nick  Name_tom
0   10          0          0         1
1   15          0          1         0
2   14          1          0         0
3   12          0          1         0
Now your data is ready for model training. One-hot encoding is usually better than label encoding for nominal features because label encoding implies a numerical relationship between your names. If Juli==0, Nick==1 and Tom==2, you're implying Juli < Nick < Tom, which might cause trouble in some models.
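Since the question asks to stay within scikit-learn, the same one-hot idea can be sketched with OneHotEncoder. The handle_unknown="ignore" setting is my assumption about how unseen names should be treated at prediction time; it is not part of the answer above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame(
    [["tom", 10], ["nick", 15], ["juli", 14], ["nick", 12]],
    columns=["Name", "Age"],
)

# The encoder expects 2-D input, hence the double brackets
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(data[["Name"]]).toarray()  # default output is sparse

print(enc.categories_[0])  # column order of the one-hot matrix
print(onehot)
```

The fitted encoder can be reused on new data with enc.transform, which pd.get_dummies cannot do on its own.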
try this:
le = preprocessing.LabelEncoder()
data['Name']= le.fit_transform(data['Name'])
This will assign labels to the whole column.
le = preprocessing.LabelEncoder()
le.fit(data['Name'])
le.classes_
vals = le.transform(data['Name'])
vals
When you call fit(data['Name']), you could equally call fit(data['Name'].unique()), because only the unique values are used for fitting. For transform, however, you must pass the whole column.
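A short sketch of that distinction on the question's sample data, with inverse_transform added to show that the mapping is reversible:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame(
    [["tom", 10], ["nick", 15], ["juli", 14], ["nick", 12]],
    columns=["Name", "Age"],
)

le = LabelEncoder()
le.fit(data["Name"].unique())        # fitting only needs the distinct values
codes = le.transform(data["Name"])   # transforming needs every row
data["Name"] = codes                 # same length as the column, so this works

# Recover the original strings from the codes
names_back = le.inverse_transform(codes)
print(codes, names_back)
```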
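import numpy as np
from sklearn.impute import SimpleImputer
# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it (no axis argument)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])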
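# Dummy (one-hot) variables for the categorical column
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer selects the column instead. OneHotEncoder accepts strings
# directly, so a separate LabelEncoder pass on column 0 is not needed.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)  # sparse_threshold=0 forces a dense array
# prepare input features
def LABEL_Encoding(data):
    # Encode every object-typed column of `data` in place
    objList = data.select_dtypes(include = "object").columns
    print (objList)
    for feat in objList:
        # Fresh encoder per column; cast to str so mixed types do not raise
        data[feat] = LabelEncoder().fit_transform(data[feat].astype(str))
    return data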
I'm trying to solve Kaggle's Titanic with Python.
But I have an error trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model
def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())
    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1
    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2
train = pd.read_csv("train.csv")
clean_data(train)
target = train["Survived"].values
features = train[["Pclass", "Age","Sex","SibSp", "Parch"]].values
classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target) # Here is where error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, it is best practice to check whether any of the features you want to use contain null values. You can use the following to check:
dataframe_name.isnull().any() gives each column name and True if at least one NaN value is present
dataframe_name.isnull().sum() gives each column name and the number of NaN values present
Once you know which columns are affected, you can clean the data.
After that, the NaN error should no longer occur.
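One possible source of the NaN in the question's code (my reading of the posted snippet, not something stated in the answers) is the line data.loc["Embarked"] = .... With a single label, .loc addresses a row, so that assignment appends a new row labelled "Embarked" whose values are all NaN. The sketch below reproduces this on a hypothetical two-row frame and applies the checks described above.

```python
import pandas as pd

# Hypothetical two-row stand-in for the Titanic training data
train = pd.DataFrame({"Age": [22.0, 38.0], "Embarked": ["S", None]})

buggy = train.copy()
# .loc["Embarked"] refers to a ROW label, so this adds a row full of NaN
buggy.loc["Embarked"] = buggy["Embarked"].fillna("S")
print(buggy.isnull().sum())

fixed = train.copy()
# Plain column assignment fills the missing port as intended
fixed["Embarked"] = fixed["Embarked"].fillna("S")
print(fixed.isnull().sum())
```

If this is what happened, every feature column gains one NaN, including the ones that were filled earlier in clean_data.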
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index()
NaN simply represents empty, None, or null values in a dataset. Before applying an ML algorithm to the dataset, you first need to preprocess it, a step also called data cleaning. You can use scikit-learn's imputer to handle NaN.
How to check if a dataset has NaN:
A Series' isnull() (alias isna()) returns True/False for each value to show whether it is NaN, for example:
s = pd.Series(['a','b',np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() returns the count of null values present in the series, in this case 2.
You can apply this method to a DataFrame as well, e.g. df.isnull()
Two techniques I know to handle NaN: 1. Removing the rows that contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so we mostly avoid it.
2. Imputing: replacing the NaN values with the mean/median of the column.
import numpy as np
from sklearn.impute import SimpleImputer
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
I hope this would help you.
I am trying to run LabelEncoder on all columns that are of type object. This is the code I wrote but it throws this error:
TypeError: '<' not supported between instances of 'int' and 'str'
Does anybody know how to fix this?
le=LabelEncoder()
for col in X_test.columns.values:
    if X_test[col].dtypes=='object':
        data=X_train[col].append(X_test[col])
        le.fit(data.values)
        X_train[col]=le.transform(X_train[col])
        X_test[col]=le.transform(X_test[col])
It looks like the combined column contains different types after appending. Try converting everything to str when fitting:
le.fit(data.values.astype(str))
And you have to change your data type to str for transform as well since the classes in LabelEncoder will be str:
X_train[col]=le.transform(X_train[col].astype(str))
X_test[col]=le.transform(X_test[col].astype(str))
Here is an attempt to recreate a similar problem. Suppose a DataFrame has a column with both int and str values:
import pandas as pd
df = pd.DataFrame({'col1':["tokyo", 1 , "paris"]})
print(df)
Result:
col1
0 tokyo
1 1
2 paris
Now, using LabelEncoder gives a similar error message, i.e. TypeError: unorderable types: int() < str() :
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values)
Converting everything to str in fit, or beforehand, may resolve the issue:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.col1.values.astype(str))
print(le.classes_)
Result:
['1' 'paris' 'tokyo']
If you just call le.transform(df.col1), it will throw a similar error again.
So it has to be le.transform(df.col1.astype(str)) instead.
The error is basically telling you the exact problem: some of the values are strings and some are not. You can solve this by calling c.astype(str) each time you call fit, fit_transform, or transform, on Series c, e.g.:
le.fit(data.values.astype(str))
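Putting the pieces together, a sketch of the loop from the question with the str cast applied at every step. The two small frames are hypothetical, and pd.concat is used because Series.append was removed in pandas 2.0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test frames with a mixed int/str column
X_train = pd.DataFrame({"city": ["tokyo", 1, "paris"]})
X_test = pd.DataFrame({"city": ["paris", 1]})

for col in X_train.columns:
    if X_train[col].dtype == "object":
        le = LabelEncoder()  # fresh encoder per column
        # Fit on train and test together so both share one set of classes
        combined = pd.concat([X_train[col], X_test[col]]).astype(str)
        le.fit(combined)
        X_train[col] = le.transform(X_train[col].astype(str))
        X_test[col] = le.transform(X_test[col].astype(str))

print(X_train)
print(X_test)
```

Note that after the cast the integer 1 and the string '1' become the same class, which may or may not be what you want for your data.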