Replacing missing data in pandas.DataFrame not working - python

I'm digging into Kaggle's Titanic exercise.
I have a pandas.DataFrame whose column 'Age' has some NaN values scattered around, and another column I created called IsAlone, whose values are 1 or 0 depending on whether the person was alone on the ship, based on a personal rule.
I'm trying to replace the NaN values in the Age column for people who were alone with the mean age of those who were alone, and likewise for those who weren't alone. The purpose is just to exercise pandas DataFrames, replacing NaN values based on a rule.
I'm doing this to those who were alone:
df_train[(df_train.IsAlone.astype(bool) & df_train.Age.isnull())].Age = \
    df_train[(df_train.IsAlone.astype(bool) & ~df_train.Age.isnull())].Age.mean()
And the same way for those who weren't alone:
df_train[(~df_train.IsAlone.astype(bool) & df_train.Age.isnull())].Age = \
    df_train[(~df_train.IsAlone.astype(bool) & ~df_train.Age.isnull())].Age.mean()
But this is not working at all; the Age column still has the same NaN values.
Any thoughts on this?

The problem is that the values are changed on a copy of the original frame. Refer to Returning a view versus a copy in the pandas docs for details. As the documentation says:
When setting values in a pandas object, care must be taken to avoid what is called chained indexing.
To set the values on the original frame, you may do:
j = df_train.IsAlone.astype(bool) & df_train.Age.isnull()
i = df_train.IsAlone.astype(bool) & ~df_train.Age.isnull()
df_train.loc[j, 'Age'] = df_train.loc[i, 'Age'].mean()
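Alternatively, groupby().transform() handles both the alone and not-alone groups in one pass; a minimal sketch, assuming df_train as above:
# Fill each missing Age with the mean Age of the row's IsAlone group;
# transform('mean') returns a Series aligned with df_train's index
df_train['Age'] = df_train['Age'].fillna(
    df_train.groupby('IsAlone')['Age'].transform('mean'))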

Related

Populating a subset of rows in a dataframe with values from another column / Collapsing several columns

First time posting here. I expect there's a better way of doing what I'm trying to do. I've been going round in circles for days and would really appreciate some help.
I am working with survey data about prisoners and their sentences.
Each prisoner has a type for the purpose of the survey, and this is stored in the column 'prisoner_type'. For each prisoner type, there is a group of 5 columns where their offenses can be recorded (not all columns are necessarily used). I'd like to collapse these groups of columns into one set of 5 columns and add these to the dataset so that, on each row, there is one set of 5 columns where I can find the offenses.
I have created a dictionary to look up the column names where the offense codes and offense types are stored for each prisoner type. The key in the outer dictionary is the prisoner type. Here is an abridged version:
offense_variables = {
    3: {'codes':     {1: 'V0114', 2: 'V0115', 3: 'V0116', 4: 'V0117', 5: 'V0118'},
        'off_types': {1: 'V0124', 2: 'V0125', 3: 'V0126', 4: 'V0127', 5: 'V0128'}},
    8: {'codes':     {1: 'V0270', 2: 'V0271', 3: 'V0272', 4: 'V0273', 5: 'V0274'},
        'off_types': {1: 'V0280', 2: 'V0281', 3: 'V0282', 4: 'V0283', 5: 'V0285'}}}
I am first creating 10 new columns: off_1_code...off_5_code and off_1_type...off_5_type.
I am then trying to:
Use pandas .loc to locate all the rows for a given prisoner type
Set the values for the new columns by looking up the variable for each offense number under that prisoner type in the dictionary, and assign that column as the new values.
Problems:
The code doesn't terminate. I'm not sure why it keeps running on and on.
I receive the error message "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
pris_types = [3, 8]
for pt in pris_types:
    # five offenses are listed in the survey, so we need five columns to hold offense codes
    # and five to hold offense types
    # 1 and 2 are just placeholder values
    for item in [i+1 for i in range(5)]:
        dataset[f'off_{item}_code'] = '1'
        dataset[f'off_{item}_type'] = '2'
    # then use .loc to get indexes for this prisoner type
    # look up the variable of the column that we need to take the values from
    # using the dictionary shown above
    for item in [i+1 for i in range(5)]:
        dataset.loc[dataset['prisoner_type'] == pt,
                    dataset[f'off_{item}_code']] = \
            dataset[offense_variables[pt]['codes'][item]]
        dataset.loc[dataset[prisoner_type] == pt,
                    dataset[f'off_{item}_type']] = \
            dataset[offense_variables[pt]['types'][item]]
The problem is that in your .loc[] sections, you just need to use the column label (a string) to identify the column where values are to be set, not an entire Series/column object, as you are currently doing. With your current code, you are creating new columns named after the values stored in the dataset[f'off_{item}_code'] and dataset[f'off_{item}_type'] columns. So, instead of:
for item in [i+1 for i in range(5)]:
    dataset.loc[dataset['prisoner_type'] == pt,
                dataset[f'off_{item}_code']] = \
        dataset[offense_variables[pt]['codes'][item]]
    dataset.loc[dataset[prisoner_type] == pt,
                dataset[f'off_{item}_type']] = \
        dataset[offense_variables[pt]['types'][item]]
use:
for item in range(1, 6):
    dataset.loc[dataset['prisoner_type'] == pt,
                f'off_{item}_code'] = \
        dataset[offense_variables[pt]['codes'][item]]
    dataset.loc[dataset['prisoner_type'] == pt,
                f'off_{item}_type'] = \
        dataset[offense_variables[pt]['off_types'][item]]
(I simplified your range loop line too.)
Also, you don't need the statements creating the 10 new columns inside the loop over prisoner types; you can move them outside of that loop. In fact, you don't need to create them manually at all: the .loc[] code will create them for you, as sketched below.
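Putting that together, a minimal sketch of the whole loop under those assumptions (note it uses the 'off_types' key from the dictionary above):
for pt in pris_types:
    mask = dataset['prisoner_type'] == pt
    for item in range(1, 6):
        # .loc creates the off_*_code / off_*_type columns on first assignment
        dataset.loc[mask, f'off_{item}_code'] = \
            dataset.loc[mask, offense_variables[pt]['codes'][item]]
        dataset.loc[mask, f'off_{item}_type'] = \
            dataset.loc[mask, offense_variables[pt]['off_types'][item]]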

Numpy, pandas merge rows

I'm working with numpy and pandas and need to "merge" rows. I have a column marital-status with values like:
'Never-married', 'Divorced', 'Separated', 'Widowed'
and:
'Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse'
I'm wondering how to merge them into just two values: the first four to single, and the second group to in relationship. I need it for one-hot encoding later.
For sample output, the marital-status column should be just single or in relationship, matching what I mentioned above.
You can use pd.Series.map to convert certain values to others. For this you need a dictionary that assigns each existing value a new value. Values not present in the dictionary will be replaced with NaN.
married_map = {
    status: 'Single'
    for status in ['Never-married', 'Divorced', 'Separated', 'Widowed']}
married_map.update({
    status: 'In-relationship'
    for status in ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']})
df['marital-status'] = df['marital-status'].map(married_map)
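Since the stated goal is one-hot encoding afterwards, the mapped column can be fed straight into pd.get_dummies; a minimal sketch (the 'marital' prefix is just an illustrative choice):
import pandas as pd

# One-hot encode the two collapsed statuses; produces columns like
# marital_Single and marital_In-relationship (hypothetical prefix)
one_hot = pd.get_dummies(df['marital-status'], prefix='marital')
df = df.join(one_hot)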

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month ProductID(SKU) Family Sales ProporcionVenta
1 1234 FISH 10000.0 0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month relative to the sum of sales for the family-month. For example, the family FISH has sold 100,000 in month 1, so in this specific case it would be calculated as 10,000/100,000 (productid-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==familia)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code runs, and I have even printed the proportions individually and recalculated them in Excel; they are correct. The problem is with the last line: as soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new values.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in pandas (and even NumPy), unlike general-purpose Python, analysts should avoid for loops, as there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, broadcasts to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a DataFrame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original DataFrame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping altogether, since transform works nicely to assign new DataFrame columns. Also, below, div is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
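As a quick illustration of why transform keeps row alignment, here is a sketch with made-up numbers based on the 10,000/100,000 FISH example from the question:
import pandas as pd

toy = pd.DataFrame({'SKU':    [1234, 5678, 9999],
                    'Family': ['FISH', 'FISH', 'MEAT'],
                    'Month':  [1, 1, 1],
                    'Qty':    [10000.0, 90000.0, 5000.0]})
# transform('sum') broadcasts each Family-Month total back onto every row
family_totals = toy.groupby(['Family', 'Month'])['Qty'].transform('sum')
toy['ProporcionVenta'] = toy['Qty'].div(family_totals)
print(toy['ProporcionVenta'].tolist())  # [0.1, 0.9, 1.0]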

What's wrong with fillna in Pandas running twice?

I'm new to pandas and NumPy. I was trying to solve the Kaggle Titanic dataset. I have to fix two columns, "Age" and "Embarked", because they contain NaN values.
I tried fillna without any success, and soon discovered that I was missing inplace = True.
So I added it. The first imputation was successful, but the second one was not. I tried searching SO and Google, but did not find anything useful. Please help me.
Here's the code that I was trying.
# imputing "Age" with mean
titanic_df["Age"].fillna(titanic_df["Age"].mean(), inplace=True)
# imputing "Embarked" with mode
titanic_df["Embarked"].fillna(titanic_df["Embarked"].mode(), inplace=True)
print(titanic_df["Age"][titanic_df["Age"].isnull()].size)
print(titanic_df["Embarked"][titanic_df["Embarked"].isnull()].size)
and I got the output as
0
2
However, I managed to get what I want without using inplace=True:
titanic_df["Age"] =titanic_df["Age"].fillna(titanic_df["Age"].mean())
titanic_df["Embarked"] = titanic_df.fillna(titanic_df["Embarked"].mode())
But I am curious about what's happening with the second usage of inplace=True.
Please bear with me if I'm asking something extremely stupid; I'm totally new and may be missing small things. Any help is appreciated. Thanks in advance.
pd.Series.mode returns a Series.
A variable has a single arithmetic mean and a single median, but it may have several modes: if more than one value ties for the highest frequency, there are multiple modes.
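For instance, in a small sketch where two values tie for the highest count, mode() returns both (sorted):
pd.Series(['S', 'S', 'C', 'C', 'Q']).mode()
Out:
0    C
1    S
dtype: object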
pandas operates on labels.
titanic_df.mean()
Out:
PassengerId 446.000000
Survived 0.383838
Pclass 2.308642
Age 29.699118
SibSp 0.523008
Parch 0.381594
Fare 32.204208
dtype: float64
If I were to use titanic_df.fillna(titanic_df.mean()) it would return a new DataFrame where the column PassengerId is filled with 446.0, column Survived is filled with 0.38 and so on.
However, if I call the mean method on a Series, the returning value is a float:
titanic_df['Age'].mean()
Out: 29.69911764705882
There is no label associated here. So if I use titanic_df.fillna(titanic_df['Age'].mean()) all the missing values in all the columns will be filled with 29.699.
Why the first attempt was not successful
You tried to fill the entire DataFrame, titanic_df with titanic_df["Embarked"].mode(). Let's check the output first:
titanic_df["Embarked"].mode()
Out:
0 S
dtype: object
It is a Series with a single element. The index is 0 and the value is S. Now, remember how it would work if we used titanic_df.mean() to fill: it would fill each column with the corresponding mean value. Here, we only have one label, so it will only fill values if there is a column named 0. Try adding titanic_df[0] = np.nan and executing your code again. You'll see that the new column is filled with S.
Why the second attempt was (not) successful
The right-hand side of the equation, titanic_df.fillna(titanic_df["Embarked"].mode()), returns a new DataFrame. In this new DataFrame, the Embarked column still has NaNs:
titanic_df.fillna(titanic_df["Embarked"].mode())['Embarked'].isnull().sum()
Out: 2
However you didn't assign it back to the entire DataFrame. You assigned this DataFrame to a Series - titanic_df['Embarked']. It didn't actually fill the missing values in the Embarked column, it just used the index values of the DataFrame. If you actually check the new column, you'll see numbers 1, 2, ... instead of S, C and Q.
What you should do instead
You are trying to fill a single column with a single value. First, disassociate that value from its label:
titanic_df['Embarked'].mode()[0]
Out: 'S'
Now, it is not important if you use inplace=True or assign the result back. Both
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])
and
titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0], inplace=True)
will fill the missing values in the Embarked column with S.
Of course, this assumes you want to use the first value when there are multiple modes. You may need to improve your algorithm there (for example, randomly selecting from the values if there are multiple modes).
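A minimal sketch of that random-pick variant (using np.random.choice is my assumption for the randomization; it fills every gap with one randomly chosen mode):
import numpy as np

modes = titanic_df['Embarked'].mode()  # may contain more than one value
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(np.random.choice(modes))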

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy, but I'm having a surprisingly annoying time with it. The code below shows me doing a pandas groupby operation so I can calculate variance by symbol. Unfortunately, the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list, add it as a column to the table, and set it as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you can see, the problems with the above are that there are now two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Does anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command.) Much appreciated!
You can use as_index=False to preserve an integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
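With as_index=False, Symbol stays a regular column and the result carries a fresh 0-based integer index; a sketch of the expected head(), reusing the numbers from the question:
vardataframe.head()
Out:
  Symbol  volatility
0      A    0.000249
1     AA    0.000413
2   AAIT    0.000237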
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a column Symbol of all NaN, because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As @user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()
