How to fill missing values based on grouped average? - python

My data has missing values for 'Age' and I want to replace them with the average of their group, grouping by the column 'Title'. After the command:
df.groupby('Title').mean()['Age']
I get a Series, for example:
Mr 32
Miss 21.7
Ms 28
etc.
I tried:
df['Age'].replace(np.nan, 0, inplace=True)
df[(df.Age==0.0)&(df.Title=='Mr')]
to see just the rows where age is missing and the title is of one type, but it doesn't work.
Question 1: Why does the code above not show any rows, despite multiple rows satisfying both conditions at the same time (age = 0.0 and title is 'Mr')?
Question 2: How can I replace all missing values with the group average as described above?

I cannot reproduce the first error, so I will use an example like the one below:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({'Title':np.random.choice(['Mr','Miss','Mrs'],20),'Age':np.random.randint(20,50,20)})
df.loc[[5,9,10,11,12],['Age']]=np.nan
the data frame looks like:
Title Age
0 Mr 42.0
1 Mr 28.0
2 Mr 25.0
3 Mr 32.0
4 Mrs 26.0
5 Miss NaN
6 Mrs 32.0
7 Mrs 33.0
8 Mrs 25.0
9 Mr NaN
10 Miss NaN
11 Mr NaN
12 Mrs NaN
13 Miss 38.0
14 Mr 31.0
15 Mr 42.0
16 Mr 24.0
17 Mrs 23.0
18 Mrs 49.0
19 Miss 27.0
And we can replace the missing values with just one more step:
ave_age = df.groupby('Title').mean()['Age']
df.loc[pd.isna(df['Age']),'Age'] = ave_age[df.loc[pd.isna(df['Age']),'Title']].values

Question 1:
Please provide a snippet that reproduces the error.
Question 2:
Try df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean')). This is similar to Pandas: filling missing values by mean in each group
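For completeness, a small self-contained sketch of the transform-based fill on the same randomly generated frame as above (it should give the same result as the loc-based approach):
import pandas as pd
import numpy as np

np.random.seed(111)
df = pd.DataFrame({'Title': np.random.choice(['Mr', 'Miss', 'Mrs'], 20),
                   'Age': np.random.randint(20, 50, 20)})
df.loc[[5, 9, 10, 11, 12], ['Age']] = np.nan

# transform('mean') broadcasts each group's mean back onto the original rows,
# so fillna can align it element-wise with the missing ages
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
print(df)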

Related

Can't fill nan values in pandas even with inplace flag

I have a pandas dataframe containing NaN values for some column.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
    print(f"Total count: {pandas_df.count()}")
    print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
    pandas_df[column_name].fillna(default_value, inplace=True)
    print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
    return pandas_df
df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and the list of columns are too long to show all the steps here (join with another dataframe, drop useless columns, add useful columns, join with other dataframes, filter, etc.).
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
The type of the column before applying "fillna" is float64, the same as default_value.
Therefore, my question is: what are the potential reasons for this problem?
What kind of transformation can lead to it? The same method works for another, similar data frame; the only difference between them lies in the chain of transformations.
BTW, the following warning is logged at this point:
/home/hadoop/.local/lib/python3.6/site-packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
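The SettingWithCopyWarning in that log is a strong hint: if the frame passed into the function is itself a slice of an earlier DataFrame, an inplace fillna may silently modify only a temporary copy. A minimal sketch of the situation and two common workarounds, using made-up data rather than your actual pipeline:
import pandas as pd
import numpy as np

# Made-up stand-in for the real transformation chain
raw = pd.DataFrame({'type': ['CAR', 'CAR', 'BIKE'],
                    'avg_speed': [32.0, np.nan, 16.2]})

# If the frame handed to fill_with_default came from a slice like this,
# an inplace fillna may only touch a temporary copy and appear to do nothing
df = raw[raw['type'] == 'CAR']

# Workaround 1: make ownership explicit with .copy() before mutating
df = raw[raw['type'] == 'CAR'].copy()

# Workaround 2: avoid inplace and assign the result back to the column
df['avg_speed'] = df['avg_speed'].fillna(30)
print(df)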

How to combine 2 columns with different names to fill the nulls of the first with the values of the second?

I have this df:
country customer_id invoice price stream_id times_viewed year month day total_price StreamID TimesViewed
0 United Kingdom 13085.0 489434 6.95 85048 12.0 2017 11 28 NaN NaN NaN
1 United Kingdom NaN 489597 8.65 22130 1.0 2017 11 28 NaN NaN NaN
2 United Kingdom NaN 489597 1.70 22132 6.0 2017 11 28 NaN NaN NaN
3 United Kingdom NaN 489597 1.70 22133 4.0 2017 11 28 NaN NaN NaN
4 United Kingdom NaN 489597 0.87 22134 1.0 2017 11 28 NaN NaN NaN
The columns stream_id and StreamID are in fact the same thing. The df I have is much larger and was created in chunks. The problem is that when reading those chunks, some of them had the column named stream_id and others had StreamID instead, so when putting all the chunks together with pd.concat the final result looks like this.
What I would like to do is fill the null values of StreamID with the values of stream_id when the latter is not null. I'm not sure if this is the right approach or whether there is a more efficient way of solving this problem.
The same problem occurred with the times_viewed and TimesViewed columns, so the same solution would apply to them too.
I tried using np.where like this:
df['new_col'] = np.where(df['StreamID'].isnull(), df['stream_id'], df['StreamID'])
But I'm not sure if this is right or if there is a better way to do it. Could someone please help me solve this?
Thank you very much in advance.
I finally solved it by renaming the wrong column names after checking they exist, then adding each df created from each file to a temporary list that is concatenated at the end, giving the final result:
import glob
import os
import pandas as pd

files = sorted(glob.glob(os.getcwd() + "/data_dir/*.json"))
df_list = []
for i in files:
    temp_df = pd.read_json(i)
    if 'StreamID' in temp_df.columns or 'total_price' in temp_df.columns or 'TimesViewed' in temp_df.columns:
        temp_df.rename(columns={'StreamID': 'stream_id', 'total_price': 'price', 'TimesViewed': 'times_viewed'}, inplace=True)
    df_list.append(temp_df)
df = pd.concat(df_list, axis=0)
It totally solved the issue of duplicated columns with wrong names. Hopefully, this will help someone.
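As a side note, if renaming at read time is not possible, the column-merging step from the question can also be done after the concat. A minimal sketch with made-up values, using combine_first:
import pandas as pd
import numpy as np

# Made-up values illustrating the two half-filled columns after concat
df = pd.DataFrame({'stream_id': [85048, np.nan, 22132],
                   'StreamID': [np.nan, 22130, np.nan]})

# combine_first keeps stream_id where present and falls back to StreamID for the gaps
df['stream_id'] = df['stream_id'].combine_first(df['StreamID'])
df = df.drop(columns=['StreamID'])
print(df)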

how to remove rows in python data frame with condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the data frame. I tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors = 'coerce')
but this will convert those values to NaN.
Let's start with a note concerning your sample data:
It contains Nan strings, which are not among strings automatically
recognized as NaNs.
To treat them as NaN, I read the source text with read_fwf,
passing na_values=['Nan'].
And now get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the string unknown, but you don't
accept a cell if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
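A small end-to-end sketch of that filter on made-up rows (the values are illustrative, not the question's full table):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Emp_Name': ['Christy', 'Rocky', 'Harry', 'Mady'],
                   'Leaves': ['20', '10', 'nn', '6'],
                   'Salary': ['3000.0', 'kkkk', '1800.1', 'unknown'],
                   'Performance': ['56.6', '22.4', "'58.3'", np.nan]})

def isAcceptable(cell):
    # NaN and the bare word 'unknown' are allowed; anything else must be purely numeric
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)

mask = df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)
# Keeps Christy and Mady; drops Rocky ('kkkk') and Harry ("'58.3'" contains quotes)
print(df[mask].reset_index(drop=True))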

How to shift the values of a certain group by different amounts

I have a DataFrame that looks like this:
user data
0 Kevin 1
1 Kevin 3
2 Sara 5
3 Kevin 23
...
And I want to get the historical values (looking, let's say, 2 entries forward) as new columns:
user data data_1 data_2
0 Kevin 1 3 23
1 Sara 5 24 NaN
2 Kim ...
...
Right now I'm able to do this through the following command:
_temp = df.groupby(['user'], as_index=False)['data']
for i in range(1, 3):
    df['data_{0}'.format(i)] = _temp.shift(-i)
I feel like my approach is very inefficient and that there is a much faster way to do this (especially when the number of lookahead/lookback values goes up)!
You can use groupby.cumcount() with set_index() and unstack():
m=df.assign(k=df.groupby('user').cumcount().astype(str)).set_index(['user','k']).unstack()
m.columns=m.columns.map('_'.join)
print(m)
data_0 data_1 data_2
user
Kevin 1.0 3.0 23.0
Sara 5.0 NaN NaN
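A self-contained variant of the same idea, selecting the data column as a Series before unstacking so the flattened column names are easier to build (the sample values mirror the question):
import pandas as pd

df = pd.DataFrame({'user': ['Kevin', 'Kevin', 'Sara', 'Kevin'],
                   'data': [1, 3, 5, 23]})

# cumcount numbers the entries within each user; those numbers become the column suffixes
m = (df.assign(k=df.groupby('user').cumcount().astype(str))
       .set_index(['user', 'k'])['data']
       .unstack())
m.columns = ['data_' + c for c in m.columns]
print(m.reset_index())
#     user  data_0  data_1  data_2
# 0  Kevin     1.0     3.0    23.0
# 1   Sara     5.0     NaN     NaN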

Write value in next available cell csv

I have code that writes people's names, ages and scores for a quiz that I made. I simplified the code to write the names and ages together rather than separately, but I can't write the score with the names, as they are handled in separate parts of the code. The CSV file looks like this:
name, age, score
Alfie, 15, 20
Michael, 16, 19
Alfie, 15, #After I simplified
Dylan, 16,
As you can see, I don't know how to write a value into the 3rd column. Does anyone know how to write a value into the next available cell in that column of a CSV file? I'm new to programming, so any help would be greatly appreciated.
Michael
This is your data:
df = pd.DataFrame({'name':['Alfie','Michael','Alfie','Dylan'], 'age':[15,16,15,16], 'score':[20,19,None,None]})
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 nan
3 Dylan 16 nan
If you need to read the CSV into pandas, use:
import pandas as pd
df = pd.read_csv('Your_file_name.csv')
I suggest two ways to solve your problem:
df.fillna(0, inplace=True) fills all missing values (this example fills with 0).
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 0.0
3 Dylan 16 0.0
df.loc[2,'score'] = 22 fills a specific cell
Out:
name age score
0 Alfie 15 20.0
1 Michael 16 19.0
2 Alfie 15 22.0
3 Dylan 16 nan
If, after that, you need to write the fixed data back to CSV, use:
df.to_csv('New_name.csv', sep=',', index=False)
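A compact round trip for the quiz file, assuming a hypothetical file name quiz_results.csv with the name, age, score header shown above:
import pandas as pd

# skipinitialspace handles the "name, age, score" spacing in the header
df = pd.read_csv('quiz_results.csv', skipinitialspace=True)

# Fill the next empty score cell (the first row where score is still NaN)
mask = df['score'].isna()
if mask.any():
    df.loc[mask.idxmax(), 'score'] = 18   # example score value

# index=False keeps the file shaped as name,age,score without an extra index column
df.to_csv('quiz_results.csv', index=False)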
