I'm new to Python and I'm trying to use fillna() to replace NaN values in my dataset.
I want the replacement value to depend on another column's value.
For example, I'm going to use the famous 'titanic' dataframe to better explain myself. Imagine I want to replace 'age' NaN values with the mean age of each 'embark_town'.
First of all, mean_age would be calculated as follows:
mean_age = round(titanic.groupby('embark_town')['age'].mean(), 2)
Being the result:
embark_town
Cherbourg 30.81
Queenstown 28.09
Southampton 29.45
Name: age, dtype: float64
So, to replace NaN age values with the proper mean_age value of each row using 'embark_town' column, I did this:
titanic['age'] = titanic['age'].fillna(titanic['embark_town'].map(mean_age))
And it works fine.
But my question is: how can this be done without .map(), using inplace=True instead? Do I have to use .loc or .iloc?
Can anyone give me a hint or explain how to use fillna() with inplace=True depending on another column's value?
Just curious,
Thanks a lot!
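For what it's worth, one way to avoid .map() is to compute each group's mean with groupby(...).transform('mean'), which returns a Series already aligned row-by-row with the frame, and assign it through .loc. A minimal sketch with a made-up stand-in for the titanic data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real titanic frame (made-up values)
titanic = pd.DataFrame({
    'embark_town': ['Cherbourg', 'Cherbourg', 'Queenstown', 'Queenstown'],
    'age': [30.0, np.nan, 28.0, np.nan],
})

# transform('mean') ignores NaN and broadcasts each group's mean back
# to that group's rows, so the result aligns with the original index
group_mean = titanic.groupby('embark_town')['age'].transform('mean')

# .loc writes into the original frame in place, no .map() needed
titanic.loc[titanic['age'].isna(), 'age'] = group_mean
```

This mutates the frame directly, which is effectively what inplace=True would give you.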
I have a dataframe (train) that has an Age column in it. This column has missing values. I have merged it with another dataframe, static_vals, which also has an Age column. I am using the lines below to substitute the missing values for the Age column in the train df.
predicted_vals = pd.merge(static_vals, train, on=['Pclass','Sex'])
# num of missing values
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'].isna().sum() # 177
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'] = predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x']
After running the above lines, I run the following to see if the values have been substituted-
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y']
And this is the output I get:
Series([], Name: Age_x, dtype: float64)
It's empty. No assignment has happened. The strange part is that when I check the values for the Age_x column after running the above lines, I get a blank there too.
>>> predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x']
Series([], Name: Age_x, dtype: float64)
Below is what the column holds right before I run the lines where I am trying to assign the missing values
>>> predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x'].head()
3 34.240964
8 34.240964
15 34.240964
25 34.240964
34 34.240964
I searched for similar questions here but they all deal with assigning a single value to many rows. I can't figure out what's wrong here. Any help?
Is there actually a problem here?
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'] should be empty because you have filled the values! Try predicted_vals.loc[~predicted_vals['Age_y'].isna(),'Age_y']
This is an alternative solution, which avoids merging and handling column name suffixes. We align the 2 indices and use fillna to map from static_vals.
predicted_vals = predicted_vals.set_index(['Pclass','Sex'])
predicted_vals['Age'] = predicted_vals['Age'].fillna(static_vals.set_index(['Pclass','Sex'])['Age'])
predicted_vals = predicted_vals.reset_index()
If you would like to do an explicit merge, @jezrael's solution is the way to go.
I think you need combine_first:
predicted_vals['Age_y'] = predicted_vals['Age_y'].combine_first(predicted_vals['Age_x'])
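To see what combine_first does, here is a minimal sketch on made-up Series: it keeps the caller's values wherever they exist and falls back to the other Series (aligned by index) only for NaN positions.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])  # has a hole at index 1
s2 = pd.Series([9.0, 2.0, 9.0])     # fallback values

# combine_first keeps s1 where it has values and
# takes s2's value only where s1 is NaN
result = s1.combine_first(s2)
print(result.tolist())  # [1.0, 2.0, 3.0]
```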
I'm new to Pandas and Numpy. I was trying to solve the Kaggle Titanic dataset. Now I have to fix two columns, "Age" and "Embarked", because they contain NaN values.
I tried fillna without any success, soon discovering that I was missing inplace=True.
So I added it, but only the first imputation was successful; the second one was not. I searched on SO and Google but did not find anything useful. Please help me.
Here's the code that I was trying.
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)
print(titanic_df["Age"][titanic_df["Age"].isnull()].size)
print(titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size)
and I got the output as
0
2
However I managed to get what I want without using inplace=True
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious what's wrong with the second usage of inplace=True.
Please bear with me if I'm asking something extremely stupid; I'm totally new and may miss small things. Any help is appreciated. Thanks in advance.
pd.Series.mode returns a Series.
A variable has a single arithmetic mean and a single median but it may have several modes. If more than one value has the highest frequency, there will be multiple modes.
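A quick made-up example of a Series with two tied modes:

```python
import pandas as pd

# 'S' and 'C' both appear twice, so there are two modes;
# mode() returns them as a Series, sorted ascending
s = pd.Series(['S', 'C', 'S', 'C', 'Q'])
print(s.mode().tolist())  # ['C', 'S']
```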
pandas operates on labels.
titanic_df.mean()
Out:
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean()) it would return a new DataFrame where the column PassengerId is filled with 446.0, column Survived is filled with 0.38 and so on.
However, if I call the mean method on a Series, the returning value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()) all the missing values in all the columns will be filled with 29.699.
Why the first attempt was not successful
You tried to fill the entire DataFrame, titanic_df with titanic_df["Embarked"].mode(). Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0 S
dtype: object
It is a Series with a single element: the index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value. Here, we only have one label, 0. So fillna will only fill values if the frame has a column named 0. Try adding titanic_df[0] = np.nan and executing your code again; you'll see that the new column is filled with S.
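This label alignment is easy to demonstrate on a tiny made-up frame: when fillna is given a Series, only columns whose names match the Series' index labels get filled, and non-matching labels are silently ignored.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})

# Fill values are labelled 'a' and 'c'; only column 'a' matches,
# so column 'b' keeps its NaN and label 'c' is silently ignored
filled = df.fillna(pd.Series({'a': 99.0, 'c': 7.0}))
print(filled['a'].tolist())      # [1.0, 99.0]
print(int(filled['b'].isna().sum()))  # 1
```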
Why the second attempt was (not) successful
The right hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode()) returns a new DataFrame. In this new DataFrame, Embarked column still has nan's:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.
What you should do instead
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now, it is not important if you use inplace=True or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S.
Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).
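A sketch of that random-tie-breaking idea, on made-up data with two tied modes:

```python
import random

import numpy as np
import pandas as pd

s = pd.Series(['S', 'C', np.nan, 'S', 'C'])

# mode() ignores NaN; here 'S' and 'C' tie, so it returns both
modes = s.mode()

# Pick one of the tied modes at random instead of always the first
fill_value = random.choice(modes.tolist())
filled = s.fillna(fill_value)
```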
I am trying to get the row with the maximum value of one column within a groupby. I tried to follow the solutions given in Python : Getting the Row which has the max value in groups using groupby, but it doesn't work when I apply
annotations.groupby(['bookid','conceptid'], sort=False)['weight'].max()
I get
bookid conceptid
12345678 3942 0.137271
10673 0.172345
1002 0.125136
34567819 44407 1.370921
5111 0.104729
6160 0.114766
200 0.151629
3504 0.152793
But I'd like to get only the row with the highest weight, e.g.,
bookid conceptid
12345678 10673 0.172345
34567819 44407 1.370921
I'd appreciate any help
If you need the bookid and conceptid for the maximum weight, try this
annotations.ix[annotations.groupby(['bookid'], sort=False)['weight'].idxmax()][['bookid', 'conceptid', 'weight']]
Note: Since Pandas v0.20 ix has been deprecated. Use .loc instead.
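On modern pandas the same idxmax idea can be written with .loc. A self-contained sketch using made-up data shaped like the question's:

```python
import pandas as pd

annotations = pd.DataFrame({
    'bookid': [12345678, 12345678, 34567819, 34567819],
    'conceptid': [3942, 10673, 44407, 5111],
    'weight': [0.137271, 0.172345, 1.370921, 0.104729],
})

# idxmax returns the row label of the maximum weight within each bookid,
# and .loc pulls those full rows back out
rows = annotations.groupby('bookid', sort=False)['weight'].idxmax()
result = annotations.loc[rows, ['bookid', 'conceptid', 'weight']]
print(result['conceptid'].tolist())  # [10673, 44407]
```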
Based on your example of what you want, I think you have too much in your group. I think you want only:
annotations.groupby(['bookid'], sort=False)['weight'].max()
After grouping we can pass aggregation functions to the grouped object as a dictionary within the agg function.
annotations.groupby('bookid').agg({'weight': ['max']})
I'm digging into Kaggle's Titanic exercise.
I have a pandas.DataFrame whose 'Age' column has some NaN values scattered through it, and another column I created called IsAlone, whose values are 1 or 0 depending on whether the person was alone on that ship, based on a personal rule.
I'm trying to replace the NaN values in the Age column for people who were alone with the mean age of those who were alone, and do the same for those who weren't alone. The purpose is just to exercise pandas DataFrames, replacing NaN values based on a rule.
I'm doing this to those who were alone:
df_train[(df_train.IsAlone.astype(bool) & df_train.Age.isnull() )].Age = \
df_train[(df_train.IsAlone.astype(bool) & ~df_train.Age.isnull() )].Age.mean()
And the same way to those who weren't alone:
df_train[(~df_train.IsAlone.astype(bool) & df_train.Age.isnull() )].Age = \
df_train[(~df_train.IsAlone.astype(bool) & ~df_train.Age.isnull() )].Age.mean()
But this is not working at all, the column Age still have the same NaN values.
Any thoughts on this?
The problem is that the values are changed on a copy of the original frame. Refer to Returning a view versus a copy for details. As in the documentation:
When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
To change the values on a view of the original frame you may do:
j = df_train.IsAlone.astype(bool) & df_train.Age.isnull()
i = df_train.IsAlone.astype(bool) & ~df_train.Age.isnull()
df_train.loc[j, 'Age'] = df_train.loc[i, 'Age'].mean()
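If you want to fill both groups (alone and not alone) in one pass, a groupby-transform sketch on made-up data, assuming the same column names as the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train
df_train = pd.DataFrame({
    'IsAlone': [1, 1, 0, 0],
    'Age': [20.0, np.nan, 40.0, np.nan],
})

# transform('mean') skips NaN and broadcasts each group's mean back
# to that group's rows, so a single fillna covers both groups at once
df_train['Age'] = df_train['Age'].fillna(
    df_train.groupby('IsAlone')['Age'].transform('mean'))
print(df_train['Age'].tolist())  # [20.0, 20.0, 40.0, 40.0]
```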
This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you see the problems with the above are now there are two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve an integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
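A small sketch of the difference, on made-up data: with as_index=False the grouping key stays an ordinary column and the result keeps a plain 0..n-1 integer index.

```python
import pandas as pd

# Hypothetical stand-in for voldataframe
voldataframe = pd.DataFrame({
    'Symbol': ['A', 'A', 'AA', 'AA'],
    'volatility': [0.1, 0.3, 0.2, 0.4],
})

# as_index=False keeps Symbol as a regular column instead of the index
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
print(list(vardataframe.columns))  # ['Symbol', 'volatility']
print(list(vardataframe.index))    # [0, 1]
```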
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but it returns a new dataframe, which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a nonexistent column with ix gives all NaN. As @user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()