What's wrong with fillna in Pandas running twice?

I'm new to Pandas and NumPy. I was trying to solve the Kaggle Titanic dataset, and I have to fix two columns, "Age" and "Embarked", because they contain NaN values.
I tried fillna without any success, and soon discovered that I was missing inplace=True.
After adding it, the first imputation was successful but the second one was not. I searched SO and Google but did not find anything useful. Please help me.
Here's the code I was trying:
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace = True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace = True)
print titanic_df["Age"][titanic_df["Age"].isnull()].size
print titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size
and I got the output as
0
2
However, I managed to get what I wanted without using inplace=True:
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious why the second usage with inplace=True didn't work.
Please bear with me if I'm asking something extremely stupid; I'm totally new and may be missing small things. Any help is appreciated. Thanks in advance.

pd.Series.mode returns a Series.
A variable has a single arithmetic mean and a single median, but it may have several modes: if more than one value shares the highest frequency, there are multiple modes.
pandas operates on labels.
titanic_df.mean()
Out:
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean()), it would return a new DataFrame where the column PassengerId is filled with 446.0, the column Survived with 0.38, and so on.
However, if I call the mean method on a Series, the return value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()), all the missing values in all the columns will be filled with 29.699.
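A toy example (with made-up column names, not the Titanic data) shows the difference between filling with a label-aligned Series and filling with a bare scalar:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 20.0, 30.0]})

# Filling with df.mean() aligns on column labels:
# column "a" gets its own mean (2.0), column "b" gets 25.0.
by_label = df.fillna(df.mean())

# Filling with a scalar has no label, so every NaN in every
# column gets the same value (here, the mean of column "a").
by_scalar = df.fillna(df["a"].mean())
```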
Why the first attempt was not successful
You tried to fill the entire DataFrame, titanic_df with titanic_df["Embarked"].mode(). Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0 S
dtype: object
It is a Series with a single element. The index is 0 and the value is S. Now, remember how filling with titanic_df.mean() would work: it fills each column with the corresponding mean value. Here, we only have one label, so it will only fill values if there is a column named 0. Try adding titanic_df[0] = np.nan and executing your code again. You'll see that the new column is filled with S.
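Here's a small sketch of that behaviour on an invented frame (hypothetical data, not the real titanic_df):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "S", np.nan, "C"]})
mode = df["Embarked"].mode()  # Series: index 0, value "S"

# No column is labelled 0, so the fill matches nothing.
untouched = df.fillna(mode)
assert untouched["Embarked"].isnull().sum() == 1

# Add a column literally named 0 and *it* gets filled with "S",
# while Embarked keeps its NaN.
df[0] = np.nan
filled = df.fillna(mode)
```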
Why the second attempt was (not) successful
The right-hand side of the assignment, titanic_df.fillna(titanic_df["Embarked"].mode()), returns a new DataFrame. In this new DataFrame, the Embarked column still has NaNs:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However, you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series, titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column; it just used the index values of the DataFrame. If you actually check the new column, you'll see the numbers 1, 2, ... instead of S, C and Q.
What you should do instead
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now it doesn't matter whether you use inplace=True or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S.
Of course this assumes you want to use the first value if there are multiple modes. You may need to improve your algorithm there (for example randomly select from the values if there are multiple modes).
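For instance, one way to handle ties is to pick one of the modes at random (a sketch with invented data and a fixed seed for reproducibility):

```python
import numpy as np
import pandas as pd

s = pd.Series(["S", "S", "C", "C", "Q", np.nan])
modes = s.mode()  # both "C" and "S" tie for most frequent

# Choose one of the tied modes at random instead of always
# taking the first.
rng = np.random.default_rng(0)
chosen = rng.choice(modes.to_numpy())
filled = s.fillna(chosen)
```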

Related

Why is dataframe.sum(axis=0) getting NAN's when every value in every column is a real number?

All column values in the selected measure_cols of the dfm DataFrame are real numbers; in fact, all are in [-1.0, 1.0] inclusive.
The following gives False for all Series/columns in the dfc DataFrame:
[print(f"{c}: {dfc[c].hasnans}") for c in dfc.columns]
Results: all False
But all row sums in dfc['shap_sum'] are coming up as NaNs. Why would this be?
dfc['shap_sum'] = dfc[measure_cols].sum(axis=0)
Update: The following has the correct results, as seen in the debugger:
dfc[measure_cols].sum(axis=0)
But when assigned to a new column in the DataFrame, they get distorted into NaNs.
Why is this happening?
dfc['shap_sum'] = dfc[measure_cols].sum(axis=0)
Oh, I made the mistake of using axis=0 when I intended to do row sums. It's axis=1 that does row sums. I will never agree with that decision on polarity.
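A minimal reconstruction of the mix-up (with invented column names): column sums carry the column labels as their index, so assigning them to a new column aligns on the row index and produces all NaNs, while row sums assign cleanly.

```python
import pandas as pd

df = pd.DataFrame({"x": [0.1, 0.2], "y": [0.3, 0.4]})

col_sums = df.sum(axis=0)  # index is ["x", "y"] -- the column labels
row_sums = df.sum(axis=1)  # index is [0, 1] -- the row labels

# "x" and "y" are not in df's row index, so alignment fills with NaN.
df["bad"] = col_sums
# Row sums share df's index, so they line up as expected.
df["good"] = row_sums
```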

Pandas set values of multiple rows of a column

I have a DataFrame (train) that has an Age column with missing values. I have merged it with another DataFrame, static_vals, which also has an Age column. I am using the lines below to substitute the missing values for the Age column in the train DataFrame.
predicted_vals = pd.merge(static_vals, train, on=['Pclass','Sex'])
# num of missing values
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'].isna().sum() # 177
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'] = predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x']
After running the above lines, I run the following to see if the values have been substituted-
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y']
And this is the output I get:
Series([], Name: Age_x, dtype: float64)
It's empty. No assignment has happened. The strange part is that when I check the values for the Age_x column after running the above lines, I get a blank there too.
>>> predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x']
Series([], Name: Age_x, dtype: float64)
Below is what the column holds right before I run the lines where I try to assign the missing values:
>>> predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_x'].head()
3 34.240964
8 34.240964
15 34.240964
25 34.240964
34 34.240964
I searched for similar questions here but they all deal with assigning a single value to many rows. I can't figure out what's wrong here. Any help?
Is there actually a problem here?
predicted_vals.loc[predicted_vals['Age_y'].isna(),'Age_y'] should be empty because you have filled the values! Try predicted_vals.loc[~predicted_vals['Age_y'].isna(),'Age_y']
This is an alternative solution which avoids merging and handling column-name suffixes. We align the two indices and use fillna to map from static_vals:
predicted_vals = predicted_vals.set_index(['Pclass','Sex'])
predicted_vals['Age'] = predicted_vals['Age'].fillna(static_vals.set_index(['Pclass','Sex'])['Age'])
predicted_vals = predicted_vals.reset_index()
If you would like to do an explicit merge, @jezrael's solution is the way to go.
I think you need combine_first:
predicted_vals['Age_y'] = predicted_vals['Age_y'].combine_first(predicted_vals['Age_x'])
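A tiny sketch of what combine_first does, with made-up ages:

```python
import numpy as np
import pandas as pd

age_y = pd.Series([22.0, np.nan, np.nan], name="Age_y")
age_x = pd.Series([30.0, 34.2, 34.2], name="Age_x")

# Keep age_y where it is present; fall back to age_x elsewhere.
result = age_y.combine_first(age_x)
```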

Remove excess info from dataframe

I have a big DataFrame that contains 6 columns. When I want to print the info out of one cell, I use the following code:
df = pd.read_excel(Path_files_data)
info_rol = df.loc[df.Rank == Ranknumber]
print(info_rol['Art_Nr'])
Here Rank is the column that gives the rank of every item and Ranknumber is the rank of the item I try to look up. What I get back looks like this:
0 10399
Name: Art_Nr, dtype: object
Here 0 is the rank and 10399 is the Art_Nr. How do I get it to print out only the Art_Nr and leave out the extra info like dtype: object?
PS. I tried strip, but that didn't work for me.
I think you need to select the first value of the Series with iat or iloc for a scalar:
print(info_rol['Art_Nr'].iat[0])
print(info_rol['Art_Nr'].iloc[0])
If you want string or numeric output:
print(info_rol['Art_Nr'].values[0])
But after filtering, it is possible you get multiple values; then the second, third, ... values are lost.
So converting to a list is a more general solution:
print(info_rol['Art_Nr'].tolist())

Replacing missing data in pandas.DataFrame not working

I'm digging into Kaggle's Titanic exercise.
I have a pandas.DataFrame whose column 'Age' has some NaN values scattered through it, and another column I created called IsAlone whose values are 1 or 0 depending on whether the person was alone on the ship, based on a personal rule.
I'm trying to replace the NaN values in column Age for people who were alone with the mean age of those who were alone, and do the same for those who weren't alone. The purpose is just to exercise pandas DataFrames, replacing NaN values based on a rule.
I'm doing this to those who were alone:
df_train[(df_train.IsAlone.astype(bool) & df_train.Age.isnull() )].Age = \
df_train[(df_train.IsAlone.astype(bool) & ~df_train.Age.isnull() )].Age.mean()
And the same way to those who weren't alone:
df_train[(~df_train.IsAlone.astype(bool) & df_train.Age.isnull() )].Age = \
df_train[(~df_train.IsAlone.astype(bool) & ~df_train.Age.isnull() )].Age.mean()
But this is not working at all, the column Age still have the same NaN values.
Any thoughts on this?
The problem is that the values are changed on a copy of the original frame. Refer to Returning a view versus a copy in the pandas docs for details. As the documentation says:
When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
To change the values on a view of the original frame you may do:
j = df_train.IsAlone.astype(bool) & df_train.Age.isnull()
i = df_train.IsAlone.astype(bool) & ~df_train.Age.isnull()
df_train.loc[j, 'Age'] = df_train.loc[i, 'Age'].mean()
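Applying the same .loc pattern to both groups could look like this (a sketch on an invented frame, not the real df_train):

```python
import numpy as np
import pandas as pd

df_train = pd.DataFrame({
    "IsAlone": [1, 1, 0, 0, 1, 0],
    "Age": [20.0, np.nan, 30.0, np.nan, 40.0, 50.0],
})

for alone in (True, False):
    grp = df_train.IsAlone.astype(bool) == alone
    # Mean age of the group, computed from its non-missing rows...
    fill = df_train.loc[grp & df_train.Age.notnull(), "Age"].mean()
    # ...written into the missing rows via .loc (no chained indexing).
    df_train.loc[grp & df_train.Age.isnull(), "Age"] = fill
```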

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy, but I'm having a surprisingly annoying time at it. The code below shows me doing a pandas groupby operation so I can calculate variance by symbol. Unfortunately, the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list, add it as a column to the table, and set it as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However, what comes out is the vardataframe.head() result below, which does not properly change the index of the table from Symbol back to numeric. And this hurts me a line or two later when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you see, the problem with the above is that there are now two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command.) Much appreciated!
You can use as_index=False to preserve the integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but it returns a new DataFrame which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As @user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
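The as_index=False one-liner in action on a toy frame (symbols and values invented):

```python
import pandas as pd

voldataframe = pd.DataFrame({
    "Symbol": ["A", "A", "AA", "AA"],
    "volatility": [0.1, 0.2, 0.3, 0.5],
})

# as_index=False keeps Symbol as a regular column and leaves a
# fresh integer index, so no manual newindex column is needed.
vardataframe = voldataframe.groupby("Symbol", as_index=False).var()
```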
Instead of making a new index manually, just reset it using
df = df.reset_index()
