interpolate(method="nearest") in a groupby in pandas - python

I have a dataset that I want to groupby("CustomerID") and fill NaNs with the nearest number within the group.
I can fill with the nearest number regardless of group like this:
df['num'] = df['num'].interpolate(method="nearest")
When I tried:
df['num'] = df.groupby('CustomerID')['num'].transform(lambda x: x.interpolate(method="nearest"))
I got ValueError: x and y arrays must have at least 2 entries, which I assume is because
some customers only have one entry with NaN or only NaNs.
However, when I extracted a select few rows that should have worked and made a new dataframe, nothing happened.
Is there a way I can group by customerID and fill NaNs with nearest number within the group, and skip customers with only NaNs or just one observation?

I ran into the same "ValueError: x and y arrays must have at least 2 entries" in my code. Adapted to your code (which I obviously could not reproduce), here is how I solved the problem:
import pandas as pd
import numpy as np
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
It does the following:
The first "groupby + apply" interpolates each group with the method 'nearest' ONLY if the group has at least two non NaNs values.
np.isnan(group) returns an array containing True where group has NaNs and False where it has values.
np.count_nonzero(np.isnan(group)) returns the number of True in the previous array (i.e. the number of NaNs in the group).
If the number of NaNs is strictly smaller than the length of the group minus 1 (i.e. there are at least two non-NaN values in the group), the group is interpolated using 'nearest'; otherwise it is left untouched.
The second "groupby + apply" finishes interpolating each group, using method='linear' with limit_area='outside' and limit_direction='both'.
If a group was fully interpolated in the previous step: nothing happens.
If a group had only one non-NaN value (therefore was left untouched in the previous step): that non-NaN value is used to fill the entire group.
If a group had only NaNs (therefore was left untouched in the previous step): the group remains full of NaNs.
Here's a dummy example using your notations:
df=pd.DataFrame({'CustomerID':['a']*3+['b']*3+['c']*3,'num':[1,np.nan,2,np.nan,1,np.nan,np.nan,np.nan,np.nan]})
df
CustomerID num
0 a 1.0
1 a NaN
2 a 2.0
3 b NaN
4 b 1.0
5 b NaN
6 c NaN
7 c NaN
8 c NaN
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df
CustomerID num
0 a 1.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 1.0
5 b NaN
6 c NaN
7 c NaN
8 c NaN
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
df
CustomerID num
0 a 1.0
1 a 1.0
2 a 2.0
3 b 1.0
4 b 1.0
5 b 1.0
6 c NaN
7 c NaN
8 c NaN
EDIT: important note
The interpolate method 'nearest' uses the numerical values of the index (see documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html). It works well in my dummy example above because the index is clean. If the index of your dataframe is messy (e.g. after concatenating dataframes) you may want to do df.reset_index(inplace=True) before you interpolate.
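Here is a minimal sketch of that pitfall (the series and its index are made up for illustration; method='nearest' needs SciPy installed):
import numpy as np
import pandas as pd

# Hypothetical series whose index is messy, e.g. left over from a concat
s_messy = pd.Series([1.0, np.nan, np.nan, 9.0], index=[0, 98, 99, 100])
print(s_messy.interpolate(method='nearest'))
# Both NaNs become 9.0, because index values 98 and 99 are numerically closest to 100

s_clean = s_messy.reset_index(drop=True)
print(s_clean.interpolate(method='nearest'))
# Positions 1 and 2 are now closest to 0 and 3 respectively, so they get 1.0 and 9.0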

Related

Python Pandas: How does Axis Parameter Work in Pandas?

Background:
In pandas, if I use the following:
df.sum(axis=1)
It returns sum of each row.
In the same manner, I expect the following to drop any row that contains a missing value:
df.dropna(how='any', axis=1)
But the above code line actually drops any column that contains missing values rather than dropping rows with missing values.
The Question: I understand why the first line returns sum of rows; but how come dropna(axis=1) drops columns?
=========
To clarify the question, I have provided the following example:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), columns=list('ABC'))
A B C
0 NaN 9 4.0
1 8.0 8 1.0
2 5.0 3 6.0
3 3.0 3 NaN
df.sum(axis=1)
0 13.0
1 17.0
2 14.0
3 6.0
df.dropna(how='any', axis=1)
B
0 9
1 8
2 3
3 3
df.sum(axis=1) sums across the columns: for each row, the values in all of its columns are added together, so you get one result per row (as shown above). sum aggregates and therefore reduces, and axis=1 tells it to reduce along the column axis, i.e. the columns are the dimension that gets collapsed.
df.sum(axis=0) does the opposite: it reduces along the row axis, summing each column down its rows and giving one result per column.
The same logic applies to dropna: axis=1 refers to the columns, so df.dropna(how='any', axis=1) inspects each column and drops it if it contains a NaN; axis=0 (the default) would drop rows instead.
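A small sketch to make the axis behaviour concrete. Note that the np.random.randint call in the question cannot actually produce NaNs, so the frame below is typed in by hand to match the printed example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 8.0, 5.0, 3.0],
                   'B': [9, 8, 3, 3],
                   'C': [4.0, 1.0, 6.0, np.nan]})

print(df.sum(axis=1))     # one value per row: 13.0, 17.0, 14.0, 6.0
print(df.sum(axis=0))     # one value per column: A 16.0, B 23, C 11.0
print(df.dropna(axis=0))  # drops rows 0 and 3 (the rows containing a NaN)
print(df.dropna(axis=1))  # drops columns A and C, leaving only column B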

Pandas: Replace missing dataframe values / conditional calculation: fillna

I want to do a calculation on a pandas dataframe, but some rows contain missing values. For those missing values, I want to use a different algorithm. Let's say:
If column B contains a value, then subtract A from B
If column B does not contain a value, then subtract A from C
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
df['calc'] = df['b']-df['a']
results in:
print(df)
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 NaN
3 4 1.0 2 -3.0
Approach 1: fill the NaN rows using .where:
df['calc'].where(df['b'].isnull()) = df['c']-df['a']
which results in SyntaxError: cannot assign to function call.
Approach 2: fill the NaN rows using .iterrows():
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
is executed without errors and the calculation is correct; these i values are printed to the console:
0.0
-1.0
-1.0
-3.0
but the values are not written into df['calc']; the dataframe remains as is:
print(df['calc'])
0 0.0
1 -1.0
2 NaN
3 -3.0
What is the correct way of overwriting the NaN values?
Finally, I stumbled over .fillna:
df['calc'] = df['calc'].fillna( df['c']-df['a'] )
gets the job done! Can anyone explain what is wrong with the above two approaches?
Approach 2:
you are assigning the result to the local variable i, but that won't modify your original dataframe. You need to write i back into the dataframe:
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
    df.loc[index,'calc'] = i  # <------------- here
Also, don't use iterrows(); it is too slow.
Approach 1:
Pandas' where() method checks a dataframe against one or more conditions and returns the result accordingly. By default, the positions not satisfying the condition are filled with NaN.
it should be:
df['calc'] = df['calc'].where(df['b'].isnull(), df['c']-df['a'])
but this keeps the values where df['b'] is null (i.e. the NaNs) and replaces everything else, which is the opposite of what you want.
Use:
df['calc'] = df['calc'].where(~df['b'].isnull(), df['c']-df['a'])
OR
df['calc'] = np.where(df['b'].isnull(), df['c']-df['a'], df['calc'])
Instead of subtracting a from b and then a from c, what you can do is first fill the NaN values in column b with the values from column c, then subtract column a:
df['calc'] = df['b'].fillna(df['c']) - df['a']
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 -1.0
3 4 1.0 2 -3.0

Fill NaN values with mean of previous rows?

I have to fill the nan values of a column in a dataframe with the mean of the previous 3 instances.
Here is the following example:
df = pd.DataFrame({'col1': [1, 3, 4, 5, np.NaN, np.NaN, np.NaN, 7]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 NaN
5 NaN
6 NaN
7 7.0
And here is the output I need:
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0
5 4.3
6 4.4
7 7.0
I tried pd.rolling, but it does not work the way I want when the column has more than one NaN value in a row:
df.fillna(df.rolling(3, min_periods=1).mean().shift())
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 4.0 # np.nanmean([3, 4, 5])
5 4.5 # np.nanmean([np.NaN, 4, 5])
6 5.0 # np.nanmean([np.NaN, np.naN ,5])
7 7.0
Can someone help me with that? Thanks in advance!
Probably not the most efficient, but it is terse and gets the job done:
from functools import reduce
reduce(lambda d, _: d.fillna(d.rolling(3, min_periods=3).mean().shift()), range(df['col1'].isna().sum()), df)
output
col1
0 1.000000
1 3.000000
2 4.000000
3 5.000000
4 4.000000
5 4.333333
6 4.444444
7 7.000000
We basically use fillna, but require min_periods=3, meaning only a single NaN gets filled at a time, or rather only those NaNs that have three non-NaN numbers immediately preceding them. We then use reduce to repeat this operation as many times as there are NaNs in col1.
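If the reduce() call is hard to read, here is the same idea unrolled into an explicit loop (a sketch that should behave the same as the one-liner above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 4, 5, np.nan, np.nan, np.nan, 7]})

out = df.copy()
# One pass per NaN in col1; min_periods=3 makes the rolling mean NaN unless all
# three preceding values are known, so each pass fills only the next fillable NaN
for _ in range(int(out['col1'].isna().sum())):
    out = out.fillna(out.rolling(3, min_periods=3).mean().shift())
print(out)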
I tried two approaches to this problem. One is a loop over the dataframe, and the second is essentially trying the approach you suggest multiple times, to converge on the right answer.
Loop approach
For each row in the dataframe, get the value from col1. Then take the average of the values seen in the last few rows (there can be fewer than 3 in this list if we're at the beginning of the dataframe). If the value is NaN, replace it with that average. Then save the value back into the dataframe. If the list of recent values has more than 3 entries, remove the oldest one.
def impute(df2, col_name):
    last_3 = []
    for index in df2.index:
        val = df2.loc[index, col_name]
        if len(last_3) > 0:
            imputed = np.nanmean(last_3)
        else:
            imputed = None
        if np.isnan(val):
            val = imputed
        last_3.append(val)
        df2.loc[index, col_name] = val
        if len(last_3) > 3:
            last_3.pop(0)
Repeated column operation
The core idea here is to notice that in your example of pd.rolling, the first NA replacement value is correct. So, you apply the rolling average, take the first NA value for each run of NA values, and use that number. If you apply this repeatedly, you fill in the first missing value, then the second missing value, then the third. You'll need to run this loop as many times as the longest series of consecutive NA values.
def impute(df2, col_name):
    while df2[col_name].isna().any():
        # If there are multiple NA values in a row, identify just
        # the first one
        first_na = df2[col_name].isna().diff() & df2[col_name].isna()
        # Compute the mean of the previous 3 values
        imputed = df2.rolling(3, min_periods=1).mean().shift()[col_name]
        # Replace NA values with that mean if they are the very first NA
        # value in a run of NA values
        df2.loc[first_na, col_name] = imputed
Performance comparison
Running both of these on an 80000 row dataframe, I get the following results:
Loop approach takes 20.744 seconds
Repeated column operation takes 0.056 seconds
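For reference, here is a sketch of how such a comparison could be timed. The answer does not show its test data, so the 80,000-row frame with randomly injected NaNs below is an assumption:
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
test = pd.DataFrame({'col1': rng.random(80_000)})
# Hypothetically blank out roughly 10% of the values, keeping the first rows intact
nan_idx = test.sample(frac=0.1, random_state=0).index
test.loc[nan_idx[nan_idx > 2], 'col1'] = np.nan

start = time.perf_counter()
impute(test, 'col1')  # call whichever impute() version from above you want to time
print(f"{time.perf_counter() - start:.3f} seconds")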

Count all NaNs in a pandas DataFrame

I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series to find out how many there are.
The Series itself is of data type class 'pandas.core.series.Series'.
This is my attempt at counting the null values in the Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but instead it raises 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size: returns the total element count of the dataframe, including NaNs.
oc.count().sum(): returns the total element count of the dataframe, excluding NaNs.
Therefore, another way to count the number of NaNs in a dataframe is to subtract the two:
NaN_count = oc.size - oc.count().sum()
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1, 2, np.nan], [3, np.nan, 5], [8, 7, 6],
                            [np.nan, np.nan, 0]]), columns=['a', 'b', 'c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaNs per column, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaNs:
aa.isnull().values.sum()
4

How to compare two dataframes and filter rows and columns where a difference is found

I am testing dataframes for equality.
df_diff=(df1!=df2)
I get df_diff, which is the same shape as df1 and df2 and contains boolean True/False values.
Now I would like to keep only the columns and rows of df1 where there was at least a different value.
If I simply do
df1=[df_diff.values]
I get all the rows where there was at least one True in df_diff, but the result still includes lots of columns that only contained False.
As a second step, I would then like to be able to replace all the values (element-wise in the dataframe) which were equal (where df_diff==False) with NaNs.
example:
df1=pd.DataFrame(data=[[1,2,3],[4,5,6],[7,8,9]])
df2=pd.DataFrame(data=[[1,99,3],[4,5,99],[7,8,9]])
I would like to get from df1
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
to
1 2
0 2 NaN
1 NaN 6
I think you need DataFrame.any to check for at least one True per row or per column:
df = df_diff[df_diff.any(axis=1)]
It is possible to filter both of the original dataframes like so:
df11 = df1[df_diff.any(axis=1)]
df22 = df2[df_diff.any(axis=1)]
If you want to filter both rows and columns:
df = df_diff.loc[df_diff.any(axis=1), df_diff.any()]
EDIT: Filter df1 and add NaNs using where:
df_diff=(df1!=df2)
m1 = df_diff.any(axis=1)
m2 = df_diff.any()
out = df1.loc[m1, m2].where(df_diff.loc[m1, m2])
print (out)
1 2
0 2.0 NaN
1 NaN 6.0
