Pandas: filling missing values by weighted average in each group - python

I have a dataFrame where 'value'column has missing values. I'd like to filling missing values by weighted average within each 'name' group. There was post on how to fill the missing values by simple average in each group but not weighted average. Thanks a lot!
df = pd.DataFrame({'value': [1, np.nan, 3, 2, 3, 1, 3, np.nan, np.nan],'weight':[3,1,1,2,1,2,2,1,1], 'name': ['A','A', 'A','B','B','B', 'C','C','C']})
name value weight
0 A 1.0 3
1 A NaN 1
2 A 3.0 1
3 B 2.0 2
4 B 3.0 1
5 B 1.0 2
6 C 3.0 2
7 C NaN 1
8 C NaN 1
I'd like to fill in "NaN" with weighted value in each "name" group, i.e.
name value weight
0 A 1.0 3
1 A 1.5 1
2 A 3.0 1
3 B 2.0 2
4 B 3.0 1
5 B 1.0 2
6 C 3.0 2
7 C 3.0 1
8 C 3.0 1

You can group data frame by name, and use fillna method to fill the missing values with weighted average which can calculated with np.average with weights parameter:
df['value'] = (df.groupby('name', group_keys=False)
.apply(lambda g: g.value.fillna(np.average(g.dropna().value, weights=g.dropna().weight))))
df
#name value weight
#0 A 1.0 3
#1 A 1.5 1
#2 A 3.0 1
#3 B 2.0 2
#4 B 3.0 1
#5 B 1.0 2
#6 C 3.0 2
#7 C 3.0 1
#8 C 3.0 1
To make this less convoluted, define a fillValue function:
import numpy as np
import pandas as pd
def fillValue(g):
gNotNull = g.dropna()
wtAvg = np.average(gNotNull.value, weights=gNotNull.weight)
return g.value.fillna(wtAvg)
df['value'] = df.groupby('name', group_keys=False).apply(fillValue)

Related

Duplicate positions from group

I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group, the rest in Nan. What I want to do know, is fill the NaN with he value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby(col).transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, but doesn't give me the wished result.
Anybody an idea?
It depends of data - if there is always one non missing value you can sorting and then replace by GroupBy.ffill, it working well if some groups has NANs only:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
#if always only one non missing value per group, fail if all NaNs of some group
#df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
Or if there is multiple values and need replace by mean, for improve performace change your solution with GroupBy.transform only mean passed to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill which is the same as fillna() with method=ffill (see docs)
df["value"] = df["value"].ffill()

Pandas Dataframe Question: Subtract next row and add specific value if NaN

Trying to groupby in pandas, then sort values and have a result column show what you need to add to get to the next row in the group, and if your are the end of the group. To replace the value with the number 3. Anyone have an idea how to do it?
import pandas as pd
df = pd.DataFrame({'label': 'a a b c b c'.split(), 'Val': [2,6,6, 4,16, 8]})
df
label Val
0 a 2
1 a 6
2 b 6
3 c 4
4 b 16
5 c 8
Id like the results as shown below, that you have to add 4 to 2 to get 6. So the groups are sorted. But if there is no next value in the group and NaN is added. To replace it with the value 3. I have shown below what the results should look like:
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0
I tried this, and was thinking of shifting values up but the problem is that the labels aren't sorted.
df['Results'] = df.groupby('label').apply(lambda x: x - x.shift())`
df
label Val Results
0 a 2 NaN
1 a 6 4.0
2 b 6 NaN
3 c 4 NaN
4 b 16 10.0
5 c 8 4.0
Hope someone can help:D!
Use groupby, diff and abs:
df['Results'] = abs(df.groupby('label')['Val'].diff(-1)).fillna(3)
label Val Results
0 a 2 4.0
1 a 6 3.0
2 b 6 10.0
3 c 4 4.0
4 b 16 3.0
5 c 8 3.0

Pandas: How to replace values of Nan in column based on another column?

Given that, i have a dataset as below:
dict = {
"A": [math.nan,math.nan,1,math.nan,2,math.nan,3,5],
"B": np.random.randint(1,5,size=8)
}
dt = pd.DataFrame(dict)
My favorite output is, if the in column A we have an Nan then multiply the value of the column B in the same row and replace it with Nan. So, given that, the below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My favorite output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution is as below which does not work:
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case with fillna
df.A.fillna(df.B*2, inplace=True)
df
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3

Interpolate NaN values over a DataFrame as a ring

I need to interpolate the NaN values over a Dataframe but I want that interpolation to get the first values of the DataFrame in case the NaN value is the last value. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"a": [1,2,3], "b":[1,2,np.nan]})
So the DataFrame is:
a b
0 1 1.0
1 2 2.0
2 3 NaN
But when I interpolate the nan values like:
df.interpolate(method="linear", inplace=True)
I got:
a b
0 1 1.0
1 2 2.0
2 3 2.0
The interpolation doesn't use the first value to do it. My desired output wold be to fill in with the value of 1.5 because of that circular interpolation.
One possible solution is add first row, interpolate and remove last row:
df = df.append(df.iloc[0]).interpolate(method="linear").iloc[:-1]
print (df)
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 1.5
EDIT:
More general solution:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df = pd.concat([df] * 3).interpolate(method="linear").iloc[len(df):-len(df)]
print (df)
a b
0 1 1.333333
1 2 1.000000
2 3 2.000000
3 4 1.666667
Or if need working only with last non missing values:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df1 = df.ffill().iloc[[-1]]
df2 = df.bfill().iloc[[0]]
df = pd.concat([df1, df, df2]).interpolate(method="linear").iloc[1:-1]
print (df)
a b
0 1 1.5
1 2 1.0
2 3 2.0
3 4 1.5

DataFrameGroupBy diff() on condition

Suppose i have a DataFrame:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
'VALUE':[pd.np.NaN,1,0,0,5,0,4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
And now, let me show, what i want with the help of example on one group 'b':
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
I need: In the scope of each group, count diff() between VALUE values, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1
You can use diff to subtract values after dropping 0 and NaN values:
df = pd.DataFrame({'CATEGORY':['a','b','c','b','b','a','b'],
'VALUE':[pd.np.NaN,1,0,0,5,0,4]})
grouped = df.groupby("CATEGORY")
# define diff func
diff = lambda x: x["VALUE"].replace(0, np.NaN).dropna().diff()
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values, before we group the data
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()]
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift internally in the groups and calculate the diff
nonull_df['next_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['next_value']
Lastly and optionally you can copy the data back to the original dataframe
df.loc[nonull_df.index] = nonull_df
df
CATEGORY VALUE next_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 1.0 -1.0
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0

Categories