Could not convert string to float error from the Titanic competition - python

I'm trying to solve the Titanic survival problem from Kaggle. It's my first step in actually learning machine learning. I have a problem where the gender column causes an error; the stack trace says could not convert string to float: 'female'. How did you get around this issue? I don't want a full solution, just a practical approach to this problem, because I do need the gender column to build my model.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

There are a couple of ways to deal with this, and it somewhat depends on what you're looking for:
You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,
or
dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In lots of machine learning applications, factors are better dealt with as dummy codes.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below there is redundant information, as I've given each of the 2 classes its own column; that is just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
Encoding Categories to numeric:
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric:
For example, if your column gender looks like this:
>>> df
   gender
0  Female
1    Male
2    Male
3    Male
4  Female
5  Female
6    Male
7  Female
8  Female
9  Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0
Method 2: categorical dtype
Another way would be to use category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This would result in the same output
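Note that pd.factorize numbers levels in order of appearance, while cat.codes follows the sorted category order; they agree here because 'Female' both appears first and sorts first, but on other data the codes can differ. To pin the mapping explicitly, a small sketch (the category order here is my own choice):

# Listing the categories up front fixes which level gets which code
df['gender_factor'] = pd.Categorical(df.gender, categories=['Female', 'Male']).codes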
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
   gender gender_factor
0  Female        Female
1    Male          Male
2    Male          Male
3    Male          Male
4  Female        Female
5  Female        Female
6    Male          Male
7  Female        Female
8  Female        Female
9  Female        Female
Dummy Coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender))

   gender  Female  Male
0  Female       1     0
1    Male       0     1
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8  Female       1     0
9  Female       1     0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))

   gender  Male
0  Female     0
1    Male     1
2    Male     1
3    Male     1
4  Female     0
5  Female     0
6    Male     1
7  Female     0
8  Female     0
9  Female     0
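To tie this back to the Titanic question at the top, a minimal sketch of how dummy coding could slot into the posted code. Two assumptions beyond what was asked: DecisionTreeClassifier is swapped in for the regressor, since Survived is a 0/1 label, and Survived is kept out of the feature columns so the target doesn't leak into x.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)

feature_columns = ['Pclass', 'Sex', 'Age']  # target left out of the features
filtered = train_data.dropna(subset=feature_columns + ['Survived'])

# Dummy code Sex; drop_first=True keeps only the non-redundant column
x = pd.get_dummies(filtered[feature_columns], columns=['Sex'], drop_first=True)
y = filtered.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeClassifier(random_state=0)
titanic_model.fit(train_x, train_y)
print(titanic_model.score(val_x, val_y))  # mean validation accuracy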

How can I create new column based on the odd even flag in Pandas
This is my data:
 id  Flag
001     1
002     2
003     3
004     4
I would like this output: if Flag is an even number, Gender should be Female; if Flag is odd, Gender should be Male:
 id  Flag  Gender
001     1    Male
002     2  Female
003     3    Male
004     4  Female
Use numpy.where with modulo 2 to check for even and odd numbers:
import numpy as np

df['Gender'] = np.where(df['Flag'] % 2, 'Male', 'Female')
print (df)

  id  Flag  Gender
0  1     1    Male
1  2     2  Female
2  3     3    Male
3  4     4  Female
Try apply:
import pandas as pd

Id = ['001', '002', '003', '004']
Flag = [1, 2, 3, 4]
df = pd.DataFrame({'id': Id, 'flag': Flag})
df['gender'] = df['flag'].apply(lambda x: 'Male' if x % 2 else 'Female')
output:
    id  flag  gender
0  001     1    Male
1  002     2  Female
2  003     3    Male
3  004     4  Female
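Along the same lines, if you prefer an explicit lookup table over a lambda, a small sketch using Series.map on the modulo result (same df as above):

df['gender'] = (df['flag'] % 2).map({1: 'Male', 0: 'Female'})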

How to unstack a df from excel table with multiple levels of duplicating columns? Set multi index?

df read from an xlsx: df = pd.read_excel('file.xlsx') arrives like this:
   Age Male Female Male.1 Female.1
0  NaN  Big  Small  Small      Big
1  1.0    2      3      2        3
2  2.0    3      4      3        4
3  3.0    4      5      4        5
df = pd.DataFrame({'Age': [np.nan, 1, 2, 3], 'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5], 'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})
Note that pandas suffixed the duplicate columns with .1, which was not desired. I'd like to unstack / melt to get this or similar:
    Age  Gender   Size  [measure]
1     1    Male    Big          2
2     2    Male    Big          3
3     3    Male    Big          4
4     1  Female    Big          3
5     2  Female    Big          4
6     3  Female    Big          5
7     1    Male  Small          2
8     2    Male  Small          3
9     3    Male  Small          4
10    1  Female  Small          3
11    2  Female  Small          4
12    3  Female  Small          5
Renaming columns and unstacking gets close but no cigar:
df = df.rename(columns={'Male.1': 'Male', 'Female.1': 'Female'})
df = df.set_index(['Age']).unstack()
How can I set the 1st row to be the 2nd index level of columns as shown here? What am I missing?
Instead of .unstack(), another approach would be .melt().
You can transpose the dataframe with .T and take everything after the first row with .iloc[1:]. Then, .rename the columns, .replace the .1 with some regex, .melt the dataframe and .sort_values.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 1, 2, 3], 'Male': ['Big', 2, 3, 4],
                   'Female': ['Small', 3, 4, 5], 'Male.1': ['Small', 2, 3, 4],
                   'Female.1': ['Big', 3, 4, 5]})

df = (df.T.reset_index().iloc[1:]
        .rename({'index': 'Gender', 0: 'Size'}, axis=1)
        .replace(r'\.\d+$', '', regex=True)
        .melt(id_vars=['Gender', 'Size'], value_name='[measure]', var_name='Age')
        .sort_values(['Size', 'Gender', 'Age'], ascending=[True, False, True])
        .reset_index(drop=True))

df = df[['Age', 'Gender', 'Size', '[measure]']]
df
Out[41]:
   Age  Gender   Size  [measure]
0    1    Male    Big          2
1    2    Male    Big          3
2    3    Male    Big          4
3    1  Female    Big          3
4    2  Female    Big          4
5    3  Female    Big          5
6    1    Male  Small          2
7    2    Male  Small          3
8    3    Male  Small          4
9    1  Female  Small          3
10   2  Female  Small          4
11   3  Female  Small          5
If possible, create a MultiIndex from the first two rows and use the first column as the index, via the header and index_col parameters of read_excel:
df = pd.read_excel('file.xlsx',header=[0,1], index_col=[0])
print (df)
Age  Male Female  Male Female
      Big  Small Small    Big
1.0     2      3     2      3
2.0     3      4     3      4
3.0     4      5     4      5

print (df.columns)
MultiIndex([(  'Male',   'Big'),
            ('Female', 'Small'),
            (  'Male', 'Small'),
            ('Female',   'Big')],
           names=['Age', None])

print (df.index)
Float64Index([1.0, 2.0, 3.0], dtype='float64')
So it is possible to use DataFrame.unstack:
df = (df.unstack()
        .rename_axis(['Gender', 'Size', 'Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age  measure
0     Male    Big  1.0        2
1     Male    Big  2.0        3
2     Male    Big  3.0        4
3   Female  Small  1.0        3
4   Female  Small  2.0        4
5   Female  Small  3.0        5
6     Male  Small  1.0        2
7     Male  Small  2.0        3
8     Male  Small  3.0        4
9   Female    Big  1.0        3
10  Female    Big  2.0        4
11  Female    Big  3.0        5
If that is not possible, you can create the MultiIndex with MultiIndex.from_arrays, removing the trailing . and digit with replace; then filter out the first row with DataFrame.iloc, reshape with DataFrame.melt by the first column, and last set the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])
df = df.iloc[1:].melt(df.columns[:1].tolist())
df.columns = ['Age', 'Gender', 'Size', 'measure']
print (df)
    Age  Gender   Size  measure
0   1.0    Male    Big        2
1   2.0    Male    Big        3
2   3.0    Male    Big        4
3   1.0  Female  Small        3
4   2.0  Female  Small        4
5   3.0  Female  Small        5
6   1.0    Male  Small        2
7   2.0    Male  Small        3
8   3.0    Male  Small        4
9   1.0  Female    Big        3
10  2.0  Female    Big        4
11  3.0  Female    Big        5
Alternatively, a solution with DataFrame.unstack is possible: set the first column as the index with DataFrame.set_index, then name the levels of the MultiIndex with Series.rename_axis for the new column names:
df.columns = pd.MultiIndex.from_arrays([df.columns.str.replace(r'\.\d+$', '', regex=True),
                                        df.iloc[0]])
df = (df.iloc[1:].set_index(df.columns[:1].tolist())
        .unstack()
        .rename_axis(['Gender', 'Size', 'Age'])
        .reset_index(name='measure'))
print (df)
    Gender   Size  Age  measure
0     Male    Big  1.0        2
1     Male    Big  2.0        3
2     Male    Big  3.0        4
3   Female  Small  1.0        3
4   Female  Small  2.0        4
5   Female  Small  3.0        5
6     Male  Small  1.0        2
7     Male  Small  2.0        3
8     Male  Small  3.0        4
9   Female    Big  1.0        3
10  Female    Big  2.0        4
11  Female    Big  3.0        5
Create a MultiIndex column by combining row 0 with the existing column names:
df.columns = pd.MultiIndex.from_arrays((df.columns, df.iloc[0]))
df.columns.names = ['gender', 'size']
df.columns
MultiIndex([(     'Age',     nan),
            (    'Male',   'Big'),
            (  'Female', 'Small'),
            (  'Male.1', 'Small'),
            ('Female.1',   'Big')],
           names=['gender', 'size'])
Now you can reshape and rename:
(df
 .dropna()
 .melt([('Age', np.nan)], value_name='measure')
 .replace(r'\.\d+$', '', regex=True)
 .rename(columns={('Age', np.nan): 'Age'}))
    Age  gender   size  measure
0   1.0    Male    Big        2
1   2.0    Male    Big        3
2   3.0    Male    Big        4
3   1.0  Female  Small        3
4   2.0  Female  Small        4
5   3.0  Female  Small        5
6   1.0    Male  Small        2
7   2.0    Male  Small        3
8   3.0    Male  Small        4
9   1.0  Female    Big        3
10  2.0  Female    Big        4
11  3.0  Female    Big        5

Pandas DataFrame removing NaN rows based on condition?

I'm trying to remove the rows where gender == 'male' and status is NaN.
Sample df:
   name  status  gender  leaves
0   tom     NaN    male       5
1   tom    True    male       6
2   tom    True    male       7
3  mary    True  female       1
4  mary     NaN  female      10
5  mary    True  female      15
6  john     NaN    male       2
7  mark    True    male       3
Expected output:
   name  status  gender  leaves
0   tom    True    male       6
1   tom    True    male       7
2  mary    True  female       1
3  mary     NaN  female      10
4  mary    True  female      15
5  mark    True    male       3
You can use the isna (or isnull) function to get the rows with a value of NaN.
With this knowledge, you can filter your dataframe using something like:
conditions = (df.gender == 'male') & (df.status.isna())
filtered_df = df[~conditions]
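For reference, a self-contained sketch built from the sample df in the question (values copied from above; the reset_index at the end only matches the renumbered index of the expected output):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['tom', 'tom', 'tom', 'mary', 'mary', 'mary', 'john', 'mark'],
                   'status': [np.nan, True, True, True, np.nan, True, np.nan, True],
                   'gender': ['male', 'male', 'male', 'female', 'female', 'female', 'male', 'male'],
                   'leaves': [5, 6, 7, 1, 10, 15, 2, 3]})

# Drop rows that are male AND have a missing status
conditions = (df.gender == 'male') & (df.status.isna())
print(df[~conditions].reset_index(drop=True))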
Good one given by @Derlin; another way I tried is using fillna() to fill the NaN with -1 and index on that, just like below:
>>> df[~((df.fillna(-1)['status'] == -1) & (df['gender'] == 'male'))]
Just for reference, the ~ operator is the same as numpy's np.logical_not(). So this:
df[np.logical_not((df.fillna(-1)['status'] == -1) & (df['gender'] == 'male'))]
(don't forget to import numpy as np) means the same.

Pandas comparing dataframes and changing column value based on number of similar rows in another dataframe

Suppose I have two dataframes:
df1:
  Person  Number  Type
0   Kyle      12  Male
1  Jacob      15  Male
2  Jacob      15  Male
df2:
A much larger dataset with a similar format, except there is a Count column that needs to be incremented based on df1:
  Person  Number    Type  Count
0   Kyle      12    Male      0
1  Jacob      15    Male      0
3  Sally      43  Female      0
4   Mary      15  Female      5
What I am looking to do is increase the Count column based on the number of occurrences of the same person in df1.
Expected output for this example:
  Person  Number    Type  Count
0   Kyle      12    Male      1
1  Jacob      15    Male      2
3  Sally      43  Female      0
4   Mary      15  Female      5
Increase Count to 1 for Kyle because there is one instance in df1, and to 2 for Jacob because there are two. Leave the values for Sally and Mary unchanged.
How do I do this? I have tried using .loc, but I can't figure out how to account for two instances of the same row: I can only get Count to increase by one for Jacob even though there are two Jacobs in df1. I have tried
df2.loc[df2['Person'].values == df1['Person'].values, 'Count'] += 1
However, this does not account for duplicates.
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
df1 = df1.set_index(['Person', 'Number', 'Type'])
df2 = df2.set_index(['Person', 'Number', 'Type'])
df1.add(df2, fill_value=0).reset_index()
Or
df1 = df1.groupby(df1.columns.tolist()).size().to_frame('Count').reset_index()
(df2.merge(df1, on=['Person', 'Number', 'Type'], how='left')
    .set_index(['Person', 'Number', 'Type'])
    .sum(axis=1)
    .to_frame('Count')
    .reset_index())
value_counts + Index alignment.
u = df2.set_index("Person")
u.assign(Count=df1["Person"].value_counts().add(u["Count"], fill_value=0))

        Number    Type  Count
Person
Kyle        12    Male    1.0
Jacob       15    Male    2.0
Sally       43  Female    0.0
Mary        15  Female    5.0
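Note that Count comes back as float because of the add with fill_value. If integer counts are wanted (an assumption on my part, the question doesn't say), a cast can be chained on:

u.assign(Count=df1["Person"].value_counts()
                            .add(u["Count"], fill_value=0)
                            .astype(int))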

pandas: transform based on count of row value in another dataframe

I have two dataframes:
df1:
Gender  Registered
female           1
male             0
female           0
female           1
male             1
male             0
df2:
Gender
female
female
male
male
I want to modify df2 so that there is a new column 'Count' with the count of Registered == 1 for the corresponding Gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform df2 so that the output is as follows:
output:
Gender  Count
female      2
female      2
male        1
male        1
I tried many things and got close but couldn't make it fully work.
sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))

   Gender  Count
0  female      2
1  female      2
2    male      1
3    male      1
pd.merge:
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())

   Gender  Registered
0  female           2
1  female           2
2    male           1
3    male           1
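Note that the merged column keeps the name Registered; if the output should carry the Count name from the question, a rename can be chained on:

pd.merge(df2, df1.groupby('Gender', as_index=False).sum()).rename(columns={'Registered': 'Count'})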
