Converting pandas string data to numeric for a decision tree - Python

I am trying to convert several columns of string data into numeric columns to feed into a classification model. An example dataset with one categorical column would be:
input:
data = [['tom', 10], ['nick', 15], ['juli', 14], ['nick', '12']]
data = pd.DataFrame(data, columns = ['Name', 'Age'])
data
output:
   Name  Age
0   tom   10
1  nick   15
2  juli   14
3  nick   12
I realize that scikit-learn doesn't handle string data very well, but for now I'd really prefer to press onward with it if possible (company restrictions). However, my issue is that if I use
sklearn.preprocessing.LabelEncoder
I am able to use '.classes_' to get some numeric values, such as:
input:
le = preprocessing.LabelEncoder()
le.fit(data['Name'])
le.classes_
vals = le.transform(le.classes_)
vals
I get
output:
array([0, 1, 2])
Since this array only contains three values, I cannot use
data['Name'] = vals
for assignment because my column length is 4 and my vals length is 3.
Considering this, is there an alternate way for me to go about this in scikit-learn or is my only option to use a different library?

You could also do this:
pd.get_dummies(data=data, columns=['Name'])
Output:
   Age  Name_juli  Name_nick  Name_tom
0   10          0          0         1
1   15          0          1         0
2   14          1          0         0
3   12          0          1         0
Now your data is ready for model training. Usually one-hot encoding is better than label encoding, because label encoding implies a numerical relationship between your names. If Juli == 0, Nick == 1, and Tom == 2, you're implying Juli < Nick < Tom, which may cause trouble in some models.
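If you need the same one-hot behaviour inside scikit-learn itself, so that the mapping learned on training data can be reapplied to new data, here is a minimal sketch with OneHotEncoder, assuming the example frame from the question:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14], ['nick', 12]],
                    columns=['Name', 'Age'])

# handle_unknown='ignore' produces an all-zero row for categories unseen at fit time
# (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False)
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
name_dummies = enc.fit_transform(data[['Name']])
print(enc.categories_)  # [array(['juli', 'nick', 'tom'], dtype=object)]
print(name_dummies)     # one 0/1 column per name, in the order above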

Try this:
le = preprocessing.LabelEncoder()
data['Name']= le.fit_transform(data['Name'])
This will assign labels to the whole column.
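Run on the example frame from the question, the codes follow alphabetical order of the unique names, which you can sanity-check with inverse_transform (a quick sketch):
le = preprocessing.LabelEncoder()
data['Name'] = le.fit_transform(data['Name'])
print(data['Name'].tolist())               # [2, 1, 0, 1] -- juli=0, nick=1, tom=2
print(le.inverse_transform([2, 1, 0, 1]))  # ['tom' 'nick' 'juli' 'nick']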

le = preprocessing.LabelEncoder()
le.fit(data['Name'])
le.classes_
vals = le.transform(data['Name'])
vals
When you use fit(data['Name']), you can actually use fit(data['Name'].unique()) instead, because only the unique values are used for fitting; for transform, however, you must pass all of your data.
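A short sketch of that distinction, using the Name column from the question:
le = preprocessing.LabelEncoder()
le.fit(data['Name'].unique())      # fitting on the 3 unique names is equivalent to fitting on all 4 rows
vals = le.transform(data['Name'])  # but transform must see every row
print(vals)                        # [2 1 0 1] -- one code per row, length 4
data['Name'] = vals                # lengths now match, so the assignment works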

# Impute missing numeric values with the column mean
# (Imputer is from older scikit-learn releases)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN", strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Concept of dummy variables, and handling the conflict between them (the dummy variable trap)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
# Give the index of the column that needs to be converted to numeric form
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
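Note that this snippet targets an older scikit-learn: Imputer and the categorical_features argument of OneHotEncoder were removed in later releases. A rough modern equivalent is a ColumnTransformer; a sketch, assuming X holds a categorical column 0 and numeric columns 1-2:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), [0]),  # encode column 0
    ('impute', SimpleImputer(strategy='mean'), [1, 2]),       # mean-impute columns 1-2
])
X_transformed = ct.fit_transform(X)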

# prepare input features: label-encode every object-dtype column
from sklearn.preprocessing import LabelEncoder

def LABEL_Encoding(data):
    objList = data.select_dtypes(include="object").columns
    print(objList)
    le = LabelEncoder()
    for feat in objList:
        data[feat] = le.fit_transform(data[feat].astype(str))
    return data
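Hypothetical usage, assuming a frame with one string column and one numeric column:
import pandas as pd

df = pd.DataFrame({'Name': ['tom', 'nick', 'juli', 'nick'],
                   'Age': [10, 15, 14, 12]})
df = LABEL_Encoding(df)
print(df)  # Name becomes [2, 1, 0, 1]; Age is left untouched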

Related

Getting NaN in a column after applying map() function

I'm trying to replace the categorical variable in the Gender column - M, F - with 0, 1. However, after running my code I'm getting NaN in place of 0 and 1.
Code-
df['Gender'] = df['Gender'].map({'F':1, 'M':0})
(Screenshots of the input dataframe and the dataframe after running the code are not reproduced here.)
Details: the Gender column's dtype is object.
Kindly, suggest a way out!
Maybe the values in your dataframe are different from the expected strings 'F' and 'M'. Try LabelEncoder from scikit-learn:
from sklearn.preprocessing import LabelEncoder
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
This particular code resolved the issue-
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Gender'.
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Gender'].unique()
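If you'd rather keep the explicit F/M to 1/0 mapping, the usual culprit is stray whitespace or casing in the raw values; a small diagnostic sketch:
print(df['Gender'].unique())  # look for variants like ' M' or 'f'

# normalize before mapping so every variant matches a dictionary key
df['Gender'] = df['Gender'].str.strip().str.upper().map({'F': 1, 'M': 0})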

What is the easiest way to convert a binned feature to a numeric categorical feature?

Converting a numeric feature into a categorical binned feature is pretty simple when using pandas.cut(). However, say you want to do the opposite by converting a binned object feature into a numeric categorical feature (1, 2, 3, 4... etc.), what would be the easiest way to do so?
Distinct binned categories: ["0-9%", "10-19%", "20-29%", "30-39%", "40-49%", "50-59%", etc...]
There are many naïve methods that spring to mind to solve this problem.
E.g., running a for loop with if statements:
temp = []
for i in list1:
    if i == "0-9%":
        temp.append(1)
    elif i == "10-19%":
        temp.append(2)
    elif i == "20-29%":
        temp.append(3)
    # etc.
Or by creating a dictionary with each distinct binned category as keys and using their index values as values:
temp = {}
for v, k in enumerate(pd.unique(list1)):
    temp[k] = v + 1  # +1 just to skip the first value, 0
list1 = [temp[bin] for bin in list1]
These two methods feel a bit naïve, however, and I'm curious as to whether there are simpler solutions to this issue.
There is already numerical information in a Categorical.
Use cat.codes to access it:
df = pd.DataFrame({'val': range(1,40,7)})
bins = [0,10,20,30,40]
labels = ["0-9%", "10-19%", "20-29%", "30-39%"]
df['cat'] = pd.cut(df['val'], bins=bins, labels=labels)
df['code'] = df['cat'].cat.codes.add(1)
print(df)
Output:
   val     cat  code
0    1    0-9%     1
1    8    0-9%     1
2   15  10-19%     2
3   22  20-29%     3
4   29  20-29%     3
5   36  30-39%     4
If the input is not a Categorical, you need to use factorize.
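A sketch of that factorize route, assuming a plain object column holding the same bin labels:
import pandas as pd

s = pd.Series(["0-9%", "0-9%", "10-19%", "20-29%", "20-29%", "30-39%"])
codes, uniques = pd.factorize(s, sort=True)  # sort=True keeps the bins in label order
print(codes + 1)      # [1 1 2 3 3 4]
print(list(uniques))  # ['0-9%', '10-19%', '20-29%', '30-39%']
(Lexicographic sorting happens to match the bin order for these labels, but check it if your bins include values like "100%".)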
Create a dictionary mapping each current bin to the number you want to convert it to, and then use the replace function:
conversion = {"0-9%": 1, "10-19%": 2, "20-29%": 3, ...}  # and so on for the remaining bins
df = df.replace(conversion)

How to convert string columns to numeric values without getting NaN values

I have columns of strings and I have to convert them into numeric values. I used the code below, and unfortunately the fillna method doesn't work in this example. How can I fix the problem?
(The head() of the dataframe was shown as a screenshot, not reproduced here.)
data['country_txt'] = data['country_txt'].astype('float64')
data['city'] = data['city'].astype('float64')
I expected a normal result, but the actual output is filled entirely with NaN values:
country_txt 0 non-null float64
city 0 non-null float64
Apparently, you need to map your strings to integer representations.
There are many different ways to do that.
1. pd.factorize
df['country_as_int'] = pd.factorize(df['country_txt'])[0]
2. LabelEncoder
from sklearn.preprocessing import LabelEncoder
f = LabelEncoder()
df['country_as_int'] = f.fit_transform(df['country_txt'])
3. np.unique
import numpy as np
df['country_as_int'] = np.unique(df['country_txt'], return_inverse=True)[-1]

Solving Kaggle's Titanic Machine Learning

I'm trying to solve Kaggle's Titanic with Python, but I get an error when trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model
def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())
    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1
    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2
train = pd.read_csv("train.csv")
clean_data(train)
target = train["Survived"].values
features = train[["Pclass", "Age","Sex","SibSp", "Parch"]].values
classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target) # Here is where error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, best practice is to check whether null values are present in all the features you want to use in building the model. You can use the methods below to check:
dataframe_name.isnull().any() gives the column names and True if at least one NaN value is present.
dataframe_name.isnull().sum() gives the column names and the count of NaN values present.
Knowing the affected column names, you can then clean the data, and NaN values will no longer cause a problem.
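A minimal sketch of those two checks:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22, np.nan, 38], 'Fare': [7.25, 71.28, np.nan]})
print(df.isnull().any())  # Age: True,  Fare: True
print(df.isnull().sum())  # Age: 1,     Fare: 1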
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index()
NaN simply represents an empty, None, or null value in a dataset. Before applying an ML algorithm, you first need to preprocess the dataset for streamlined processing; in other words, data cleaning. You can use scikit-learn's imputer module to handle NaN.
How to check if a dataset has NaN:
A Series' or DataFrame's isnull() returns True/False values showing whether each entry is NaN, for example:
s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() returns the count of null values present in the series, in this case 2.
You can apply the same method to a dataframe, e.g. df.isnull().
Two techniques I know to handle NaN:
1. Removing the rows that contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so it is mostly avoided.
2. Imputing: replacing the NaN values with the mean/median of the column.
from sklearn.preprocessing import Imputer  # replaced by SimpleImputer in newer scikit-learn
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
# strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
I hope this helps you.

Possible ways to do one hot encoding in scikit-learn?

I have a pandas data frame with some categorical columns. Some of these contain non-integer values.
I currently want to apply several machine learning models to this data. With some models, it is necessary to do normalization to get better results, for example by converting categorical variables into dummy/indicator variables. Indeed, pandas has a function called get_dummies for that purpose. However, this function returns a result that depends on the data. So if I call get_dummies on training data and then call it again on test data, the columns obtained in the two cases can be different, because a categorical column in the test data may contain just a subset of (or a different set of) possible values compared to the training data.
Therefore, I am looking for other methods to do one-hot encoding.
What are possible ways to do one hot encoding in python (pandas/sklearn)?
Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.
For encoding training data you can use fit_transform which will discover the category labels and create appropriate dummy variables.
import sklearn.preprocessing

label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data you can use the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
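Note that LabelBinarizer works on one column at a time; for several feature columns, OneHotEncoder with handle_unknown='ignore' gives the same fit-on-train, transform-on-test pattern. A sketch (the column names here are hypothetical):
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
train_mat = enc.fit_transform(train_df[['Label', 'OtherCat']])
test_mat = enc.transform(test_df[['Label', 'OtherCat']])  # unseen categories become all-zero rows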
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import pandas as pd
import numpy as np

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)

# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0

# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
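The column-alignment loop above can also be collapsed into a single reindex call, which adds the missing columns, fills them with zero, and reorders in one step (it also drops columns that appear only in test):
test = test.reindex(columns=train.columns, fill_value=0)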
Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set consists of only three categories "a", "b", "c" as values. If get_dummies is used at this stage, the features generated will be three (A_a, A_b, A_c). But ideally there should be a fourth feature, A_d, with all zeros. That can be achieved in the following way:
import pandas as pd
data = pd.DataFrame({"A" : ["a", "b", "c"]})
data["A"] = data["A"].astype("category", categories=["a", "b", "c", "d"])
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
The output being (the dummy columns may display as floats, ints, or booleans depending on your pandas version):
   A_a  A_b  A_c  A_d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
For text columns, you can try this:
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
To print the output:
for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]
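To see which column corresponds to which word, you can ask the vectorizer for its vocabulary (get_feature_names_out needs scikit-learn 1.0+; older versions used get_feature_names):
print(vectorizer.get_feature_names_out())
# ['bad' 'good' 'he' 'is' 'strong']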
