I am trying to replace NaN values with mean values - python

I have to replace the s_months and incidents NaN values with the corresponding group means in a Jupyter notebook.
Input data:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
I have tried the code below, but it does not seem to work; I have also tried different variations, such as swapping out the transform call.
df.fillna['s_months'] = df.fillna(df.grouby(['types' , 'o_periods']['s_months','incidents']).tranform('mean'),inplace = True)
s_months incidents
Types o_periods
1 1 911 3
2 1688 8
2 1 26851 36
2 14440 36
3 1 914 2
2 862 1
4 1 296 0
2 889 3
5 1 663 4
2 1046 6

From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
Types,c_years,o_periods,s_months,incidents
0,1,1,1,127.0,0.0
1,1,1,2,63.0,0.0
2,1,2,1,1095.0,3.0
3,1,2,2,1095.0,4.0
4,1,3,1,1512.0,6.0
5,1,3,2,3353.0,18.0
6,1,4,1,NaN,NaN
7,1,4,2,2244.0,11.0
14,2,4,1,NaN,NaN"""), sep=',')
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
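Note that the header row in the CSV string has one field fewer than the data rows, so read_csv treats the first column as the index; that is why the original index (0 to 14) is preserved.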
>>> df[['c_years', 's_months', 'incidents']] = df.groupby(['Types', 'o_periods']).transform(lambda x: x.fillna(x.mean()))
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
The last NaN remains because it belongs to the last group, which contains no values at all in the columns s_months and incidents and therefore has no mean to fill with.
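If you also want such empty groups filled, a possible second pass (my own addition, using the overall column means as a fallback) is:
>>> cols = ['s_months', 'incidents']
>>> df[cols] = df[cols].fillna(df[cols].mean())  # overall column means, ignoring NaN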

Try this: df['s_months'] = df['s_months'].fillna(df['s_months'].mean())
df['s_months'].mean() computes the mean while ignoring NaN values. Note that without the assignment the fillna result is discarded, and that this fills with the overall column mean rather than the per-group mean.
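For illustration, Series.mean uses skipna=True by default, so NaN entries are simply excluded from the calculation:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())              # 2.0 -- the NaN is ignored
print(s.mean(skipna=False))  # nan -- unless skipping is disabled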

Your code is close; you can modify it as follows to make it work:
df[['s_months','incidents']] = df[['s_months','incidents']].fillna(df.groupby(['Types' , 'o_periods'])[['s_months','incidents']].transform('mean'))
Data Input:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
Output
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
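This works because transform('mean') returns a result indexed like df itself (each group's mean broadcast to every member row), so fillna can align it element-wise. A quick way to see the difference, assuming the df built above:
g = df.groupby(['Types', 'o_periods'])['s_months']
print(g.mean().shape)             # one value per group
print(g.transform('mean').shape)  # one value per row, aligned with df.index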

Related

Add missing rows based on column

I have given the following df
df = pd.DataFrame(data={'day': [1, 1, 1, 2, 2, 3], 'pos': 2*[1, 14, 18], 'value': 2*[1, 2, 3]})
df
day pos value
0 1 1 1
1 1 14 2
2 1 18 3
3 2 1 1
4 2 14 2
5 3 18 3
and I want to fill in rows so that every day has every possible value of the column 'pos'.
Desired result:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
My attempt:
df.set_index('pos').reindex(pd.Index(3*[1,14,18])).reset_index()
yields:
ValueError: cannot reindex from a duplicate axis
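The error occurs because after set_index('pos') the index contains duplicate labels (1, 14 and 18 each appear twice), and reindex refuses to work on a non-unique axis. You can confirm this directly:
print(df.set_index('pos').index.is_unique)  # False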
Let's try pivot then stack:
df.pivot(index='day', columns='pos', values='value').stack(dropna=False).reset_index(name='value')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
Option 2: merge with MultiIndex:
df.merge(pd.DataFrame(index=pd.MultiIndex.from_product([df['day'].unique(), df['pos'].unique()])),
left_on=['day','pos'], right_index=True, how='outer')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 3 18 3.0
5 2 18 NaN
5 3 1 NaN
5 3 14 NaN
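The outer merge appends the missing combinations at the end with repeated index labels; to get the tidy ordering of the other answers, you can sort afterwards (with out being the merge result above):
out = out.sort_values(['day', 'pos']).reset_index(drop=True)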
You can reindex:
s = pd.MultiIndex.from_product([df["day"].unique(),df["pos"].unique()], names=["day","pos"])
print (df.set_index(["day","pos"]).reindex(s).reset_index())
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
I'd avoid the manual product of all possible values.
Instead, one can get the unique values and just reindex per day:
u = df.pos.unique()
df.groupby('day').apply(lambda s: s.set_index('pos').reindex(u))['value']\
.reset_index()
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
You could use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('day', 'pos')
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0

How to group data using one column, perform some operation on another column, and assign new groups in pandas

I have a dataframe as below :
distance_along_path ID
0 0 1
1 2.2 1
2 4.5 1
3 7.0 1
4 0 2
5 0 3
6 3.0 2
7 5.0 3
8 0 4
9 2.0 4
10 5.0 4
11 0 5
12 3.0 5
11 7.0 4
I want to be able to group these by ID first and then by the distance_along_path values: every time a 0 is seen in distance_along_path for an ID, a new group is created, and until the next 0 all those rows belong to that group, as indicated below
distance_along_path ID group
0 0 1 1
1 2.2 1 1
2 4.5 1 1
3 7.0 1 1
4 0 1 2
5 0 2 3
6 3.0 1 2
7 5.0 2 3
8 0 2 4
9 2.0 2 4
10 5.0 2 4
11 0 1 5
12 3.0 1 5
13 7.0 1 5
14 0 1 6
15 0 2 7
16 3.0 1 6
17 5.0 2 7
18 1.0 2 7
Thank you
Try the following (the key idea is that every 0 starts a new segment within its ID):
starts = df['distance_along_path'].eq(0)                      # True at each segment start
seg = starts.groupby(df['ID']).cumsum()                       # per-ID segment counter
df['group'] = pd.factorize(list(zip(df['ID'], seg)))[0] + 1   # number (ID, segment) pairs by first appearance
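A minimal check of the idea on a small frame that follows the question's pattern (hypothetical data, not the original):
import pandas as pd

df = pd.DataFrame({'distance_along_path': [0, 2.2, 0, 0, 5.0],
                   'ID': [1, 1, 1, 2, 2]})
starts = df['distance_along_path'].eq(0)
seg = starts.groupby(df['ID']).cumsum()
df['group'] = pd.factorize(list(zip(df['ID'], seg)))[0] + 1
print(df)
#    distance_along_path  ID  group
# 0                  0.0   1      1
# 1                  2.2   1      1
# 2                  0.0   1      2
# 3                  0.0   2      3
# 4                  5.0   2      3
pd.factorize assigns codes in order of first appearance, which matches the numbering in the desired output.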

Mapping data from one dataframe to another based on groupby

Probably a similar question has been asked before, but I could not find one that solves my problem. Maybe I am not using the proper search words!
I have two pandas Dataframes as below:
import pandas as pd
import numpy as np
df1
a = np.array([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3])
b = np.array([1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3])
df1 = pd.DataFrame({'a':a, 'b':b})
print(df1)
a b
0 1 1
1 1 1
2 1 2
3 1 2
4 1 3
5 1 3
6 2 1
7 2 1
8 2 2
9 2 2
10 2 3
11 2 3
12 3 1
13 3 1
14 3 2
15 3 2
16 3 3
17 3 3
df2 is as below:
a2 = np.array([1,1,1,2,2,2,3,3,3])
b2 = np.array([1,2,3,1,2,3,1,2,3])
c = np.array([4,8,3,np.nan, 2, 5,6, np.nan, 1])
df2 = pd.DataFrame({'a':a2, 'b':b2, 'c': c})
a b c
0 1 1 4.0
1 1 2 8.0
2 1 3 3.0
3 2 1 NaN
4 2 2 2.0
5 2 3 5.0
6 3 1 6.0
7 3 2 NaN
8 3 3 1.0
Now I want to map column c from df2 onto df1, matching on the grouping columns a and b. Therefore, df1 is modified as shown below
a b c
0 1 1 4
1 1 1 4
2 1 2 8
3 1 2 8
4 1 3 3
5 1 3 3
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
How can I achieve this in a simple and intuitive way using pandas?
Quite simple using merge:
df1.merge(df2)
a b c
0 1 1 4.0
1 1 1 4.0
2 1 2 8.0
3 1 2 8.0
4 1 3 3.0
5 1 3 3.0
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
If you have more columns and you want to specifically only merge on a and b, use:
df1.merge(df2, on=['a','b'])
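If you prefer not to merge (for instance, to guarantee the row count of df1 can never change), a possible alternative, assuming a and b uniquely identify rows in df2, is an index-aligned map:
df1['c'] = df1.set_index(['a', 'b']).index.map(df2.set_index(['a', 'b'])['c'])
Unlike an outer merge, this always produces exactly one value per existing row of df1.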

Python pandas : groupby on two columns and create new variables

I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type, computing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with mean aggregation, reshape with unstack into a helper DataFrame, and join it back to the original df:
s = (df.groupby(['company','investor'])['pct']
.mean()
.unstack(fill_value=0)
.add_prefix('pct_mean_'))
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table with default aggregate function mean:
s = df.pivot_table(index='company',
columns='investor',
values='pct',
fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
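A third option for building the helper table is pd.crosstab, whose aggfunc can compute the mean directly; this is a sketch equivalent to the pivot_table version above (crosstab has no fill_value parameter, hence the explicit fillna):
s = (pd.crosstab(df['company'], df['investor'],
                 values=df['pct'], aggfunc='mean')
       .fillna(0)
       .add_prefix('pct_mean_'))
df = df.join(s, 'company')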

Fast way to get the number of NaNs in a column counted from the last valid value in a DataFrame

Say I have a DataFrame like
A B
0 0.1880 0.345
1 0.2510 0.585
2 NaN NaN
3 NaN NaN
4 NaN 1.150
5 0.2300 1.210
6 0.1670 1.290
7 0.0835 1.400
8 0.0418 NaN
9 0.0209 NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
I want a new DataFrame of the same shape where each entry counts the NaNs from the last valid value up to that position, as follows
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
I wonder if this can be done efficiently by utilizing some of the Pandas/Numpy functions?
You can use:
a = df.isnull()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
For better understanding:
#insert NaN into the cumulative count where df was NaN
a2 = b.mask(a)
#forward-fill those NaN
a3 = b.mask(a).ffill()
#replace remaining NaN with 0, cast to int
a4 = b.mask(a).ffill().fillna(0).astype(int)
#subtract a4 from b
a5 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
df1 = pd.concat([a, b, a2, a3, a4, a5], axis=1,
                keys=['a', 'b', 'where', 'ffill nan', 'subtract', 'output'])
print (df1)
a b where ffill nan subtract output
A B A B A B A B A B A B
0 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
1 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
2 True True 1 1 NaN NaN 0.0 0.0 0 0 1 1
3 True True 2 2 NaN NaN 0.0 0.0 0 0 2 2
4 True False 3 2 NaN 2.0 0.0 2.0 0 2 3 0
5 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
6 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
7 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
8 False True 3 3 3.0 NaN 3.0 2.0 3 2 0 1
9 False True 3 4 3.0 NaN 3.0 2.0 3 2 0 2
10 True True 4 5 NaN NaN 3.0 2.0 3 2 1 3
11 True True 5 6 NaN NaN 3.0 2.0 3 2 2 4
12 True True 6 7 NaN NaN 3.0 2.0 3 2 3 5
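Since the question asks for a fast approach, here is a possible NumPy sketch of the same idea (my own variant, not from the original answer): take the running NaN count and subtract its value at the most recent valid position.
import numpy as np

def nan_count_since_valid(col):
    isna = col.isna().to_numpy()
    cum = isna.cumsum()
    # running count frozen at the last valid (non-NaN) position
    at_valid = np.where(~isna, cum, 0)
    last_valid = np.maximum.accumulate(at_valid)
    return cum - last_valid

df1 = df.apply(nan_count_since_valid)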
