Could not convert string to float error from the Titanic competition - python

I'm trying to solve the Titanic survival problem from Kaggle. It's my first step in actually learning machine learning. I have a problem where the gender column causes an error; the stack trace says could not convert string to float: 'female'. How did you get around this issue? I don't want a full solution, just a practical approach to this problem, because I do need the gender column to build my model.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

There are a couple of ways to deal with this, and it somewhat depends on what you're looking for:
You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,
or
dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In lots of machine learning applications, factors are better dealt with as dummy codes.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below there is redundant information, as I've given each of the 2 classes its own column; that is just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
Encoding Categories to numeric:
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric:
For example, if your column gender looks like this:
>>> df
   gender
0  Female
1    Male
2    Male
3    Male
4  Female
5  Female
6    Male
7  Female
8  Female
9  Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0
Method 2: categorical dtype
Another way would be to use category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This would result in the same output
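Note that pd.factorize numbers levels in order of appearance, while cat.codes follows the sorted category order; they agree here because 'Female' both appears first and sorts first, but on other data the codes can differ. To pin the mapping explicitly, a small sketch (the category order here is my own choice):

# Listing the categories up front fixes which level gets which code
df['gender_factor'] = pd.Categorical(df.gender, categories=['Female', 'Male']).codes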
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
   gender gender_factor
0  Female        Female
1    Male          Male
2    Male          Male
3    Male          Male
4  Female        Female
5  Female        Female
6    Male          Male
7  Female        Female
8  Female        Female
9  Female        Female
Dummy Coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender))

   gender  Female  Male
0  Female       1     0
1    Male       0     1
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8  Female       1     0
9  Female       1     0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))

   gender  Male
0  Female     0
1    Male     1
2    Male     1
3    Male     1
4  Female     0
5  Female     0
6    Male     1
7  Female     0
8  Female     0
9  Female     0
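To tie this back to the Titanic question at the top, a minimal sketch of how dummy coding could slot into the posted code. Two assumptions beyond what was asked: DecisionTreeClassifier is swapped in for the regressor, since Survived is a 0/1 label, and Survived is kept out of the feature columns so the target doesn't leak into x.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)

feature_columns = ['Pclass', 'Sex', 'Age']  # target left out of the features
filtered = train_data.dropna(subset=feature_columns + ['Survived'])

# Dummy code Sex; drop_first=True keeps only the non-redundant column
x = pd.get_dummies(filtered[feature_columns], columns=['Sex'], drop_first=True)
y = filtered.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeClassifier(random_state=0)
titanic_model.fit(train_x, train_y)
print(titanic_model.score(val_x, val_y))  # mean validation accuracy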

How can I create new column based on the odd even flag in Pandas
This is my data:
 id  Flag
001     1
002     2
003     3
004     4
I would like this output: if Flag is an even number, Gender should be Female; if Flag is odd, Gender should be Male:
 id  Flag  Gender
001     1    Male
002     2  Female
003     3    Male
004     4  Female
Use numpy.where with modulo 2 to check for even and odd numbers:
import numpy as np

df['Gender'] = np.where(df['Flag'] % 2, 'Male', 'Female')
print (df)

  id  Flag  Gender
0  1     1    Male
1  2     2  Female
2  3     3    Male
3  4     4  Female
Try apply:
import pandas as pd

Id = ['001', '002', '003', '004']
Flag = [1, 2, 3, 4]
df = pd.DataFrame({'id': Id, 'flag': Flag})
df['gender'] = df['flag'].apply(lambda x: 'Male' if x % 2 else 'Female')
output:
    id  flag  gender
0  001     1    Male
1  002     2  Female
2  003     3    Male
3  004     4  Female
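Along the same lines, if you prefer an explicit lookup table over a lambda, a small sketch using Series.map on the modulo result (same df as above):

df['gender'] = (df['flag'] % 2).map({1: 'Male', 0: 'Female'})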

How to unstack a df from excel table with multiple levels of duplicating columns? Set multi index?

df read from an xlsx: df = pd.read_excel('file.xlsx') arrives like this:
   Age Male Female Male.1 Female.1
0  NaN  Big  Small  Small      Big
1  1.0    2      3      2        3
2  2.0    3      4      3        4
3  3.0    4      5      4        5
df = pd.DataFrame({'Age': [np.nan, 1, 2, 3], 'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5], 'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})
Note that pandas suffixed the duplicate columns with .1, which was not desired. I'd like to unstack / melt to get this or similar:
    Age  Gender   Size  [measure]
1     1    Male    Big          2
2     2    Male    Big          3
3     3    Male    Big          4
4     1  Female    Big          3
5     2  Female    Big          4
6     3  Female    Big          5
7     1    Male  Small          2
8     2    Male  Small          3
9     3    Male  Small          4
10    1  Female  Small          3
11    2  Female  Small          4
12    3  Female  Small          5
Renaming columns and unstacking gets close but no cigar:
df = df.rename(columns={'Male.1': 'Male', 'Female.1': 'Female'})
df = df.set_index(['Age']).unstack()
How can I set the 1st row to be the 2nd index level of columns as shown here? What am I missing?
Instead of .unstack(), another approach would be .melt().
You can transpose the dataframe with .T and take everything after the first row with .iloc[1:]. Then, .rename the columns, .replace the .1 with some regex, .melt the dataframe and .sort_values.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3], 'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5], 'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

df = (df.T.reset_index().iloc[1:]
        .rename({'index': 'Gender', 0: 'Size'}, axis=1)
        .replace(r'\.\d+$', '', regex=True)
        .melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
        .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
        .reset_index(drop=True))

df = df[['Age', 'Gender', 'Size', '[measure]']]
df
Out[41]:
   Age  Gender   Size  [measure]
0    1    Male    Big          2
1    2    Male    Big          3
2    3    Male    Big          4
3    1  Female    Big          3
4    2  Female    Big          4
5    3  Female    Big          5
6    1    Male  Small          2
7    2    Male  Small          3
8    3    Male  Small          4
9    1  Female  Small          3
10   2  Female  Small          4
11   3  Female  Small          5
If possible, create a MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:
df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
print (df)
Age  Male Female  Male Female
      Big  Small Small    Big
1.0     2      3     2      3
2.0     3      4     3      4
3.0     4      5     4      5

print (df.columns)
MultiIndex([(  'Male',   'Big'),
            ('Female', 'Small'),
            (  'Male', 'Small'),
            ('Female',   'Big')],
           names=['Age', None])

print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')
So it is possible to use DataFrame.unstack:
df = (df.unstack()
        .rename_axis(['Gender', 'Size', 'Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age  measure
0     Male    Big  1.0        2
1     Male    Big  2.0        3
2     Male    Big  3.0        4
3   Female  Small  1.0        3
4   Female  Small  2.0        4
5   Female  Small  3.0        5
6     Male  Small  1.0        2
7     Male  Small  2.0        3
8     Male  Small  3.0        4
9   Female    Big  1.0        3
10  Female    Big  2.0        4
11  Female    Big  3.0        5
If that is not possible, you can create the MultiIndex with MultiIndex.from_arrays, removing the trailing . and digit with replace; then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt by the first column, and last set the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns = ['Age', 'Gender', 'Size', 'measure']
print (df)
    Age  Gender   Size  measure
0   1.0    Male    Big        2
1   2.0    Male    Big        3
2   3.0    Male    Big        4
3   1.0  Female  Small        3
4   2.0  Female  Small        4
5   3.0  Female  Small        5
6   1.0    Male  Small        2
7   2.0    Male  Small        3
8   3.0    Male  Small        4
9   1.0  Female    Big        3
10  2.0  Female    Big        4
11  3.0  Female    Big        5
Alternatively, a solution with DataFrame.unstack is possible: set the first column as the index with DataFrame.set_index, then name the levels of the MultiIndex with Series.rename_axis for the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
        .unstack()
        .rename_axis(['Gender', 'Size', 'Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age  measure
0     Male    Big  1.0        2
1     Male    Big  2.0        3
2     Male    Big  3.0        4
3   Female  Small  1.0        3
4   Female  Small  2.0        4
5   Female  Small  3.0        5
6     Male  Small  1.0        2
7     Male  Small  2.0        3
8     Male  Small  3.0        4
9   Female    Big  1.0        3
10  Female    Big  2.0        4
11  Female    Big  3.0        5
Create a MultiIndex column by combining row 0 with the existing column names:
df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']
df.columns
MultiIndex([(     'Age',     nan),
            (    'Male',   'Big'),
            (  'Female', 'Small'),
            (  'Male.1', 'Small'),
            ('Female.1',   'Big')],
           names=['gender', 'size'])
Now you can reshape and rename:
(df
 .dropna()
 .melt([('Age', np.nan)], value_name='measure')
 .replace(r'\.\d+$', '', regex=True)
 .rename(columns={('Age', np.nan): 'Age'}))
    Age  gender   size  measure
0   1.0    Male    Big        2
1   2.0    Male    Big        3
2   3.0    Male    Big        4
3   1.0  Female  Small        3
4   2.0  Female  Small        4
5   3.0  Female  Small        5
6   1.0    Male  Small        2
7   2.0    Male  Small        3
8   3.0    Male  Small        4
9   1.0  Female    Big        3
10  2.0  Female    Big        4
11  3.0  Female    Big        5

Pandas DataFrame removing NaN rows based on condition?

I'm trying to remove the rows where gender == 'male' and status is NaN.
Sample df:
   name  status  gender  leaves
0   tom     NaN    male       5
1   tom    True    male       6
2   tom    True    male       7
3  mary    True  female       1
4  mary     NaN  female      10
5  mary    True  female      15
6  john     NaN    male       2
7  mark    True    male       3
Expected output:
   name  status  gender  leaves
0   tom    True    male       6
1   tom    True    male       7
2  mary    True  female       1
3  mary     NaN  female      10
4  mary    True  female      15
5  mark    True    male       3
You can use the isna (or isnull) function to get the rows with a value of NaN.
With this knowledge, you can filter your dataframe using something like:
conditions = (df.gender == 'male') & (df.status.isna())
filtered_df = df[~conditions]
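For reference, a self-contained sketch built from the sample df in the question (values copied from above; the reset_index at the end only matches the renumbered index of the expected output):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'mary', 'mary', 'mary', 'john', 'mark'],
                   'status': [np.nan, True, True, True, np.nan, True, np.nan, True],
                   'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'male', 'male'],
                   'leaves': [5, 6, 7, 1, 10, 15, 2, 3]})

# Drop rows that are male AND have a missing status
conditions = (df.gender == 'male') & (df.status.isna())
print(df[~conditions].reset_index(drop=True))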
Good one given by @Derlin; another way I tried is using fillna() to fill the NaN with -1 and index on that, just like below:
>>> df[~((df.fillna(-1)['status'] == -1) & (df['gender'] == 'male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this:
df[np.logical_not((df.fillna(-1)['status'] == -1) & (df['gender'] == 'male'))]
(don't forget to import numpy as np) means the same.

Pandas comparing dataframes and changing column value based on number of similar rows in another dataframe

Suppose I have two dataframes:
df1:
  Person  Number  Type
0   Kyle      12  Male
1  Jacob      15  Male
2  Jacob      15  Male
df2:
A much larger dataset with a similar format, except there is a Count column that needs to be incremented based on df1:
  Person  Number    Type  Count
0   Kyle      12    Male      0
1  Jacob      15    Male      0
3  Sally      43  Female      0
4   Mary      15  Female      5
What I am looking to do is increase the Count column based on the number of occurrences of the same person in df1.
Expected output for this example:
  Person  Number    Type  Count
0   Kyle      12    Male      1
1  Jacob      15    Male      2
3  Sally      43  Female      0
4   Mary      15  Female      5
Increase Count to 1 for Kyle because there is one instance in df1, and to 2 for Jacob because there are two. Leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc, but I can't figure out how to account for two instances of the same row: I can only get Count to increase by one for Jacob even though there are two Jacobs in df1. I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However, this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person', 'Number', 'Type'])
df2 = df2.set_index(['Person', 'Number', 'Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
(df2.merge(df1, on=['Person', 'Number', 'Type'], how='left')
    .set_index(['Person', 'Number', 'Type'])
    .sum(axis=1)
    .to_frame('Count')
    .reset_index())
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))

        Number    Type  Count
Person
Kyle        12    Male    1.0
Jacob       15    Male    2.0
Sally       43  Female    0.0
Mary        15  Female    5.0
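Note that Count comes back as float because of the add with fill_value. If integer counts are wanted (an assumption on my part, the question doesn't say), a cast can be chained on:

u.assign(Count=df1["Person"].value_counts()
                            .add(u["Count"], fill_value=0)
                            .astype(int))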

pandas: transform based on count of row value in another dataframe

I have two dataframes:
df1:
Gender  Registered
female           1
male             0
female           0
female           1
male             1
male             0
df2:
Gender
female
female
male
male
I want to modify df2 so that there is a new column 'Count' with the count of Registered == 1 for the corresponding Gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform df2 so that the output is as follows:
output:
Gender  Count
female      2
female      2
male        1
male        1
I tried many things and got close but couldn't make it fully work.
sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))

   Gender  Count
0  female      2
1  female      2
2    male      1
3    male      1
pd.merge:
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())

   Gender  Registered
0  female           2
1  female           2
2    male           1
3    male           1
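Note that the merged column keeps the name Registered; if the output should carry the Count name from the question, a rename can be chained on:

pd.merge(df2, df1.groupby('Gender', as_index=False).sum()).rename(columns={'Registered': 'Count'})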
