I'm trying to solve Kaggle's Titanic challenge with Python,
but I get an error when trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model

def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())

    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1

    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2

train = pd.read_csv("train.csv")
clean_data(train)

target = train["Survived"].values
features = train[["Pclass", "Age", "Sex", "SibSp", "Parch"]].values

classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target)  # Here is where error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, it is best practice to check whether null values are present in any of the features you want to use in building the model. You can use the following to check:
dataframe_name.isnull().any() gives the column names and True if at least one NaN value is present.
dataframe_name.isnull().sum() gives the column names and the number of NaN values present in each.
Once you know which columns are affected, you can clean the data, and the NaN problem will not arise.
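For example, a minimal sketch of these checks on the Titanic training frame from the question:

import pandas as pd

train = pd.read_csv("train.csv")
print(train.isnull().any())  # True for every column that has at least one NaN
print(train.isnull().sum())  # how many NaN values each column contains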
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index(drop=True)
(drop=True avoids adding the old index back as a regular column.)
NaN simply represents an empty, None, or null value in a dataset. Before applying an ML algorithm to a dataset, you first need to preprocess it for streamlined processing; in other words, data cleaning. You can use scikit-learn's imputer module to handle NaN.
How to check if the dataset has NaN:
A Series' isnull() method returns True/False values showing which entries are NaN, for example:
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() returns the count of null values present in the series, in this case 2.
You can apply the same method to a dataframe itself, e.g. df.isnull().
Two techniques I know to handle NaN:
1. Removing the rows which contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so it is mostly avoided.
2. Imputing: replacing the NaN values with the mean/median of the column.
# Note: sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer is its replacement
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
I hope this helps.
I have a column risk_appetite that contains some NaN values.
I plan to use the KNN method to impute the missing values, and therefore I need to do encoding first before the imputation. I'm using the target encoding technique, and this is the function that I'm using:
from category_encoders import TargetEncoder
encoder = TargetEncoder(handle_missing = 'return_nan')
def targetencoder(data, col, target):
    data[col] = encoder.fit_transform(data[col], data[target])
Then, I call the function to encode my column:
listofcol_te = ['risk_appetite']
for col in listofcol_te:
    targetencoder(df, col, 'target_variable')
Once the encoding is done, the output looks fine.
Everything works up to this point. Next, I do the imputation (using MissForest) for the column:
# missingpy still imports sklearn.neighbors.base, which newer scikit-learn
# versions renamed to sklearn.neighbors._base, so alias it before the import
import sklearn.neighbors._base
import sys
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
# Copy the original dataset
data = df.copy()
# Impute
imputer = MissForest()
data_imputed = imputer.fit_transform(data)
data_imputed = pd.DataFrame(data=data_imputed, columns=data.columns)
I managed to impute all the NaN values in risk_appetite, but initially there were only 5 categories for the column, and after the imputation there are 1332. The MissForest imputation method seems to be creating new values instead of assigning one of the existing categories to the NaN rows.
May I know whether I did anything wrong? Or should MissForest imputation not be used for categorical features? What is the best way for me to impute risk_appetite if MissForest is not suitable? I saw some imputation by mean, mode, and median, but I don't think that is a good way to do imputation here. Any help or advice will be greatly appreciated!
I'm trying to replace the categorical values in the Gender column (M, F) with 0 and 1. However, after running my code I'm getting NaN in place of 0 and 1.
Code-
df['Gender'] = df['Gender'].map({'F':1, 'M':0})
My input dataframe has the Gender column stored as strings; after running the code the column shows NaN instead of 0 and 1.
Details- Gender (data type) - object
Kindly suggest a way out!
Maybe the values in your dataframe are different from the expected strings 'F' and 'M' (for example, extra whitespace or different casing). Try LabelEncoder from scikit-learn.
from sklearn.preprocessing import LabelEncoder
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
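If you want to keep the map approach instead, first inspect the raw values; a minimal sketch (the normalization steps are assumptions about what the raw data might contain):

# map() returns NaN for any value that is not an exact key of the dict,
# so check what the column actually contains
print(df['Gender'].unique())
# strip whitespace and normalize case before mapping (assumes the
# intended values really are 'F' and 'M')
df['Gender'] = df['Gender'].str.strip().str.upper().map({'F': 1, 'M': 0})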
This particular code resolved the issue-
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Gender'.
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Gender'].unique()
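Note that LabelEncoder assigns codes in sorted label order, so here 'F' becomes 0 and 'M' becomes 1, which is the opposite of the original map({'F':1, 'M':0}). You can check the assignment with:

# classes_[i] is the label that was encoded as the integer i
print(label_encoder.classes_)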
While working on my submission for the famous Kaggle Titanic dataset (890 rows/11 columns), I would like to execute all of my feature-engineering steps within one scikit-learn pipeline. However, I could barely find any online examples that demonstrate how to use scikit-learn's FunctionTransformer() to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the 'Age' column depending on the passenger class (column 'Pclass'). The possible passenger classes are 1, 2, and 3, and the corresponding ages that should replace the NaN values are 38, 30, and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df

age_transformers = [("impute_age_class",
                     FunctionTransformer(impute_age_class, validate=False,
                                         kw_args={'column_1': 'Age', 'column_2': 'Pclass'}),
                     ["Age", "Pclass"])]
It seems like the code gets executed and I receive a slightly better accuracy score with my logreg model, but I also get pandas' SettingWithCopyWarning.
I would be very thankful for any hints on how the syntax of my code could be improved to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
def impute_age_class(df, fillme, groupby):
    df = df.copy()
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map(
            {1: 38, 2: 30, 3: 25})
    )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    kw_args={'fillme': 'Age', 'groupby': 'Pclass'}  # column names as in the question
)
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
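For illustration, here is a minimal sketch of that fit-time idea; the class name, defaults, and structure are my assumptions, not code from the linked answers:

from sklearn.base import BaseEstimator, TransformerMixin

class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Fill NaNs in one column with the per-group median learned at fit time."""
    def __init__(self, fillme='Age', groupby='Pclass'):
        self.fillme = fillme
        self.groupby = groupby

    def fit(self, X, y=None):
        # learn one median per group from the training data only
        self.medians_ = X.groupby(self.groupby)[self.fillme].median()
        return self

    def transform(self, X):
        X = X.copy()
        # map each row's group to its learned median, then fill the NaNs
        X[self.fillme] = X[self.fillme].fillna(X[self.groupby].map(self.medians_))
        return X

Fitted on the training set, the same learned medians are then applied to the test set, avoiding leakage from hard-coded values.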
I have a data frame "df" with columns "bedrooms", "bathrooms", "sqft_living", and "sqft_lot".
I want to fill in missing column values by building a regression model from the other columns: the missing value is predicted from the values in the remaining columns of that row.
As an example, the sqft_living value is missing in row 12. To determine it, the bedrooms, bathrooms, and sqft_lot values would be used to make a prediction of the missing value.
Is there any way to do this? Any help is appreciated. Thanks!
import pandas as pd
from sklearn.linear_model import LinearRegression
# setup
dictionary = {'bedrooms': [3,3,2,4,3,4,3,3,3,3,3,2,3,3],
              'bathrooms': [1,2.25,1,3,2,4.5,2.25,1.5,1,2.5,2.5,1,1,1.75],
              'sqft_living': [1180,2570,770,1960,1680,5420,1715,1060,1780,1890,'',1160,'',1370],
              'sqft_lot': [5650,7242,10000,5000,8080,101930,6819,9711,7470,6560,9796,6000,19901,9680]}
df = pd.DataFrame(dictionary)
# setup x and y for training
# drop data with empty row
clean_df = df[df['sqft_living'] != '']
# separate variables into my x and y
x = clean_df.iloc[:, [0,1,3]].values
y = clean_df['sqft_living'].values
# fit my model
lm = LinearRegression()
lm.fit(x, y)
# get the rows I am trying to do my prediction on
predict_x = df[df['sqft_living'] == ''].iloc[:, [0,1,3]].values
# perform my prediction
lm.predict(predict_x)
# I get the values 1964.983 for row 10 and 1567.068 for row 12
It should be noted that you're asking about imputation. I suggest reading up on the other methods, their trade-offs, and when imputation is appropriate.
Edit: putting the predictions back into the DataFrame:
# Get index of missing data
missing_index = df[df['sqft_living'] == ''].index
# Replace
df.loc[missing_index, 'sqft_living'] = lm.predict(predict_x)
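One caveat worth noting (my addition, not from the original answer): because the missing entries were empty strings, sqft_living keeps the object dtype even after filling, so you may want to cast it back to a numeric type:

# the '' placeholders made the column object dtype; cast to float once filled
df['sqft_living'] = df['sqft_living'].astype(float)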
I have a pandas data frame with some categorical columns. Some of these contain non-integer values.
I currently want to apply several machine learning models to this data. With some models it is necessary to do normalization to get better results, for example converting categorical variables into dummy/indicator variables. Indeed, pandas has a function called get_dummies for that purpose. However, this function's result depends on the data: if I call get_dummies on training data and then again on test data, the resulting columns can differ, because a categorical column in the test data may contain only a subset of (or a different set of) the possible values seen in the training data.
Therefore, I am looking for other methods to do one-hot coding.
What are possible ways to do one hot encoding in python (pandas/sklearn)?
Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.
For encoding training data you can use fit_transform which will discover the category labels and create appropriate dummy variables.
label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data you can use the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
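As a related sketch (my addition, not part of the original answer): in current scikit-learn, OneHotEncoder plays the same role and can ignore categories that were unseen at fit time:

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' makes transform() encode categories unseen at
# fit time as all-zero rows instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
training_mat = encoder.fit_transform(df[['Label']])  # note the 2-D column selection
test_mat = encoder.transform(test_df[['Label']])     # returns a sparse matrix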
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)

# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0

# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
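A more compact variant of the same alignment (my sketch, not the original answer's code) uses reindex, which adds the missing columns with zeros and drops test-only columns in one step:

# keep exactly train's column set and order; fill absent columns with 0
test = pd.get_dummies(test_df).reindex(columns=train.columns, fill_value=0)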
Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only the three categories "a", "b", "c" as values. If get_dummies is used at this stage, only three features are generated (A_a, A_b, A_c), but ideally there should also be a feature A_d that is all zeros. That can be achieved in the following way:

import pandas as pd

data = pd.DataFrame({"A": ["a", "b", "c"]})
# astype("category", categories=...) was removed from pandas;
# pd.CategoricalDtype is the current way to fix the full category set
data["A"] = data["A"].astype(pd.CategoricalDtype(categories=["a", "b", "c", "d"]))
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
The output being
A_a A_b A_c A_d
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0
For text columns, you can try this:
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
To print the output:
for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]
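To see which column corresponds to which word, you can print the learned vocabulary (the columns are its entries in sorted order); note that get_feature_names_out() requires scikit-learn >= 1.0, while older versions use get_feature_names():

# prints ['bad' 'good' 'he' 'is' 'strong'] for the data above
print(vectorizer.get_feature_names_out())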