I am trying to replace NaN values with mean values - python

I have to replace the s_months and incidents NaN values with the corresponding group means in a Jupyter notebook.
Input data:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
I have tried the code below, but it does not seem to work; I have also tried different variations, such as swapping out the transform call.
df.fillna['s_months'] = df.fillna(df.grouby(['types' , 'o_periods']['s_months','incidents']).tranform('mean'),inplace = True)
s_months incidents
Types o_periods
1 1 911 3
2 1688 8
2 1 26851 36
2 14440 36
3 1 914 2
2 862 1
4 1 296 0
2 889 3
5 1 663 4
2 1046 6

From your DataFrame:
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
Types,c_years,o_periods,s_months,incidents
0,1,1,1,127.0,0.0
1,1,1,2,63.0,0.0
2,1,2,1,1095.0,3.0
3,1,2,2,1095.0,4.0
4,1,3,1,1512.0,6.0
5,1,3,2,3353.0,18.0
6,1,4,1,NaN,NaN
7,1,4,2,2244.0,11.0
14,2,4,1,NaN,NaN"""), sep=',')
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
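Note that the header row in the CSV string has one field fewer than the data rows, so read_csv treats the first column as the index; that is why the original index (0 to 14) is preserved.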
>>> df[['c_years', 's_months', 'incidents']] = df.groupby(['Types', 'o_periods']).transform(lambda x: x.fillna(x.mean()))
>>> df
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
The last NaN remains because it belongs to the last group, which contains no values at all in the columns s_months and incidents and therefore has no mean to fill with.
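If you also want such empty groups filled, a possible second pass (my own addition, using the overall column means as a fallback) is:
>>> cols = ['s_months', 'incidents']
>>> df[cols] = df[cols].fillna(df[cols].mean())  # overall column means, ignoring NaN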

Try this: df['s_months'] = df['s_months'].fillna(df['s_months'].mean())
df['s_months'].mean() computes the mean while ignoring NaN values. Note that without the assignment the fillna result is discarded, and that this fills with the overall column mean rather than the per-group mean.
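For illustration, Series.mean uses skipna=True by default, so NaN entries are simply excluded from the calculation:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())              # 2.0 -- the NaN is ignored
print(s.mean(skipna=False))  # nan -- unless skipping is disabled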

Your code is close; you can modify it as follows to make it work:
df[['s_months','incidents']] = df[['s_months','incidents']].fillna(df.groupby(['Types' , 'o_periods'])[['s_months','incidents']].transform('mean'))
Data Input:
Types c_years o_periods s_months incidents
0 1 1 1 127.0 0.0
1 1 1 2 63.0 0.0
2 1 2 1 1095.0 3.0
3 1 2 2 1095.0 4.0
4 1 3 1 1512.0 6.0
5 1 3 2 3353.0 18.0
6 1 4 1 NaN NaN
7 1 4 2 2244.0 11.0
14 2 4 1 NaN NaN
Output
Types c_years o_periods s_months incidents
0 1 1 1 127.000000 0.0
1 1 1 2 63.000000 0.0
2 1 2 1 1095.000000 3.0
3 1 2 2 1095.000000 4.0
4 1 3 1 1512.000000 6.0
5 1 3 2 3353.000000 18.0
6 1 4 1 911.333333 3.0
7 1 4 2 2244.000000 11.0
14 2 4 1 NaN NaN
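This works because transform('mean') returns a result indexed like df itself (each group's mean broadcast to every member row), so fillna can align it element-wise. A quick way to see the difference, assuming the df built above:
g = df.groupby(['Types', 'o_periods'])['s_months']
print(g.mean().shape)             # one value per group
print(g.transform('mean').shape)  # one value per row, aligned with df.index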

Related

Add missing rows based on column

I have given the following df
df = pd.DataFrame(data={'day': [1, 1, 1, 2, 2, 3], 'pos': 2*[1, 14, 18], 'value': 2*[1, 2, 3]})
df
day pos value
0 1 1 1
1 1 14 2
2 1 18 3
3 2 1 1
4 2 14 2
5 3 18 3
and I want to fill in rows so that every day has every possible value of the column 'pos'.
Desired result:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
My attempt:
df.set_index('pos').reindex(pd.Index(3*[1,14,18])).reset_index()
yields:
ValueError: cannot reindex from a duplicate axis
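The error occurs because after set_index('pos') the index contains duplicate labels (1, 14 and 18 each appear twice), and reindex refuses to work on a non-unique axis. You can confirm this directly:
print(df.set_index('pos').index.is_unique)  # False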
Let's try pivot then stack:
df.pivot(index='day', columns='pos', values='value').stack(dropna=False).reset_index(name='value')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
Option 2: merge with MultiIndex:
df.merge(pd.DataFrame(index=pd.MultiIndex.from_product([df['day'].unique(), df['pos'].unique()])),
left_on=['day','pos'], right_index=True, how='outer')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 3 18 3.0
5 2 18 NaN
5 3 1 NaN
5 3 14 NaN
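The outer merge appends the missing combinations at the end with repeated index labels; to get the tidy ordering of the other answers, you can sort afterwards (with out being the merge result above):
out = out.sort_values(['day', 'pos']).reset_index(drop=True)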
You can reindex:
s = pd.MultiIndex.from_product([df["day"].unique(),df["pos"].unique()], names=["day","pos"])
print (df.set_index(["day","pos"]).reindex(s).reset_index())
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
I'd avoid the manual product of all possible values.
Instead, one can get the unique values and just reindex per day:
u = df.pos.unique()
df.groupby('day').apply(lambda s: s.set_index('pos').reindex(u))['value']\
.reset_index()
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
You could use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('day', 'pos')
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0

How to group data using one column, perform some operation on another column, and assign new groups in pandas

I have a dataframe as below :
distance_along_path ID
0 0 1
1 2.2 1
2 4.5 1
3 7.0 1
4 0 2
5 0 3
6 3.0 2
7 5.0 3
8 0 4
9 2.0 4
10 5.0 4
11 0 5
12 3.0 5
11 7.0 4
I want to be able to group these by ID first and then by the distance_along_path values: every time a 0 is seen in distance_along_path for an ID, a new group is created, and until the next 0 all those rows belong to that group, as indicated below
distance_along_path ID group
0 0 1 1
1 2.2 1 1
2 4.5 1 1
3 7.0 1 1
4 0 1 2
5 0 2 3
6 3.0 1 2
7 5.0 2 3
8 0 2 4
9 2.0 2 4
10 5.0 2 4
11 0 1 5
12 3.0 1 5
13 7.0 1 5
14 0 1 6
15 0 2 7
16 3.0 1 6
17 5.0 2 7
18 1.0 2 7
Thank you
Try the following (the key idea is that every 0 starts a new segment within its ID):
starts = df['distance_along_path'].eq(0)                      # True at each segment start
seg = starts.groupby(df['ID']).cumsum()                       # per-ID segment counter
df['group'] = pd.factorize(list(zip(df['ID'], seg)))[0] + 1   # number (ID, segment) pairs by first appearance
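A minimal check of the idea on a small frame that follows the question's pattern (hypothetical data, not the original):
import pandas as pd

df = pd.DataFrame({'distance_along_path': [0, 2.2, 0, 0, 5.0],
                   'ID': [1, 1, 1, 2, 2]})
starts = df['distance_along_path'].eq(0)
seg = starts.groupby(df['ID']).cumsum()
df['group'] = pd.factorize(list(zip(df['ID'], seg)))[0] + 1
print(df)
#    distance_along_path  ID  group
# 0                  0.0   1      1
# 1                  2.2   1      1
# 2                  0.0   1      2
# 3                  0.0   2      3
# 4                  5.0   2      3
pd.factorize assigns codes in order of first appearance, which matches the numbering in the desired output.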

Mapping data from one dataframe to another based on groupby

Probably a similar question has been asked before, but I could not find one that solves my problem. Maybe I am not using the proper search words!
I have two pandas Dataframes as below:
import pandas as pd
import numpy as np
df1
a = np.array([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3])
b = np.array([1,1,2,2,3,3,1,1,2,2,3,3,1,1,2,2,3,3])
df1 = pd.DataFrame({'a':a, 'b':b})
print(df1)
a b
0 1 1
1 1 1
2 1 2
3 1 2
4 1 3
5 1 3
6 2 1
7 2 1
8 2 2
9 2 2
10 2 3
11 2 3
12 3 1
13 3 1
14 3 2
15 3 2
16 3 3
17 3 3
df2 is as below:
a2 = np.array([1,1,1,2,2,2,3,3,3])
b2 = np.array([1,2,3,1,2,3,1,2,3])
c = np.array([4,8,3,np.nan, 2, 5,6, np.nan, 1])
df2 = pd.DataFrame({'a':a2, 'b':b2, 'c': c})
a b c
0 1 1 4.0
1 1 2 8.0
2 1 3 3.0
3 2 1 NaN
4 2 2 2.0
5 2 3 5.0
6 3 1 6.0
7 3 2 NaN
8 3 3 1.0
Now I want to map column c from df2 onto df1, matching on the grouping columns a and b. Therefore, df1 is modified as shown below
a b c
0 1 1 4
1 1 1 4
2 1 2 8
3 1 2 8
4 1 3 3
5 1 3 3
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
How can I achieve this in a simple and intuitive way using pandas?
Quite simple using merge:
df1.merge(df2)
a b c
0 1 1 4.0
1 1 1 4.0
2 1 2 8.0
3 1 2 8.0
4 1 3 3.0
5 1 3 3.0
6 2 1 NaN
7 2 1 NaN
8 2 2 2.0
9 2 2 2.0
10 2 3 5.0
11 2 3 5.0
12 3 1 6.0
13 3 1 6.0
14 3 2 NaN
15 3 2 NaN
16 3 3 1.0
17 3 3 1.0
If you have more columns and you want to specifically only merge on a and b, use:
df1.merge(df2, on=['a','b'])
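If you prefer not to merge (for instance, to guarantee the row count of df1 can never change), a possible alternative, assuming a and b uniquely identify rows in df2, is an index-aligned map:
df1['c'] = df1.set_index(['a', 'b']).index.map(df2.set_index(['a', 'b'])['c'])
Unlike an outer merge, this always produces exactly one value per existing row of df1.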

Python pandas : groupby on two columns and create new variables

I have the following dataframe describing the percent of shares held by a type of investor in a company:
company investor pct
1 A 1
1 A 2
1 B 4
2 A 2
2 A 4
2 A 6
2 C 10
2 C 8
And I would like to create a new column for each investor type, computing the mean of the shares held in each company. I also need to keep the same length of the dataset, using transform for instance.
Here is the result I would like to have:
company investor pct pct_mean_A pct_mean_B pct_mean_C
1 A 1 1.5 4 0
1 A 2 1.5 4 0
1 B 4 1.5 4 0
2 A 2 4.0 0 9
2 A 4 4.0 0 9
2 A 6 4.0 0 9
2 C 10 4.0 0 9
2 C 8 4.0 0 9
Thanks a lot for your help!
Use groupby with mean aggregation, reshape with unstack into a helper DataFrame, and join it back to the original df:
s = (df.groupby(['company','investor'])['pct']
.mean()
.unstack(fill_value=0)
.add_prefix('pct_mean_'))
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4.0 0.0
1 1 A 2 1.5 4.0 0.0
2 1 B 4 1.5 4.0 0.0
3 2 A 2 4.0 0.0 9.0
4 2 A 4 4.0 0.0 9.0
5 2 A 6 4.0 0.0 9.0
6 2 C 10 4.0 0.0 9.0
7 2 C 8 4.0 0.0 9.0
Or use pivot_table with default aggregate function mean:
s = df.pivot_table(index='company',
columns='investor',
values='pct',
fill_value=0).add_prefix('pct_mean_')
df = df.join(s, 'company')
print (df)
company investor pct pct_mean_A pct_mean_B pct_mean_C
0 1 A 1 1.5 4 0
1 1 A 2 1.5 4 0
2 1 B 4 1.5 4 0
3 2 A 2 4.0 0 9
4 2 A 4 4.0 0 9
5 2 A 6 4.0 0 9
6 2 C 10 4.0 0 9
7 2 C 8 4.0 0 9
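A third option for building the helper table is pd.crosstab, whose aggfunc can compute the mean directly; this is a sketch equivalent to the pivot_table version above (crosstab has no fill_value parameter, hence the explicit fillna):
s = (pd.crosstab(df['company'], df['investor'],
                 values=df['pct'], aggfunc='mean')
       .fillna(0)
       .add_prefix('pct_mean_'))
df = df.join(s, 'company')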

Fast way to get the number of NaNs in a column counted from the last valid value in a DataFrame

Say I have a DataFrame like
A B
0 0.1880 0.345
1 0.2510 0.585
2 NaN NaN
3 NaN NaN
4 NaN 1.150
5 0.2300 1.210
6 0.1670 1.290
7 0.0835 1.400
8 0.0418 NaN
9 0.0209 NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
I want a new DataFrame of the same shape where each entry counts the NaNs from the last valid value up to that position, as follows
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
I wonder if this can be done efficiently by utilizing some of the Pandas/Numpy functions?
You can use:
a = df.isnull()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
For better understanding:
#insert NaN into the cumulative count where df was NaN
a2 = b.mask(a)
#forward-fill those NaN
a3 = b.mask(a).ffill()
#replace remaining NaN with 0, cast to int
a4 = b.mask(a).ffill().fillna(0).astype(int)
#subtract a4 from b
a5 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
df1 = pd.concat([a, b, a2, a3, a4, a5], axis=1,
                keys=['a', 'b', 'where', 'ffill nan', 'subtract', 'output'])
print (df1)
a b where ffill nan subtract output
A B A B A B A B A B A B
0 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
1 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
2 True True 1 1 NaN NaN 0.0 0.0 0 0 1 1
3 True True 2 2 NaN NaN 0.0 0.0 0 0 2 2
4 True False 3 2 NaN 2.0 0.0 2.0 0 2 3 0
5 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
6 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
7 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
8 False True 3 3 3.0 NaN 3.0 2.0 3 2 0 1
9 False True 3 4 3.0 NaN 3.0 2.0 3 2 0 2
10 True True 4 5 NaN NaN 3.0 2.0 3 2 1 3
11 True True 5 6 NaN NaN 3.0 2.0 3 2 2 4
12 True True 6 7 NaN NaN 3.0 2.0 3 2 3 5
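Since the question asks for a fast approach, here is a possible NumPy sketch of the same idea (my own variant, not from the original answer): take the running NaN count and subtract its value at the most recent valid position.
import numpy as np

def nan_count_since_valid(col):
    isna = col.isna().to_numpy()
    cum = isna.cumsum()
    # running count frozen at the last valid (non-NaN) position
    at_valid = np.where(~isna, cum, 0)
    last_valid = np.maximum.accumulate(at_valid)
    return cum - last_valid

df1 = df.apply(nan_count_since_valid)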
