Rolling Average in Python with inf values

I'm trying to produce the final output below, calculating the 3-period moving average (MA3) of Count.
Expected Output
Classification Name Count MA3
0 Fruits Apple inf NaN
1 Fruits Apple inf NaN
2 Fruits Apple inf NaN
3 Fruits Apple inf NaN
4 Fruits Apple 5.0 5.0
5 Fruits Apple 6.0 6.5
6 Fruits Apple 7.0 6.0
7 Fruits Apple 8.0 7.0
8 Veg Broc 10.0 NaN
9 Veg Broc 11.0 NaN
10 Veg Broc 12.0 11.0
But the pandas .rolling code below does not take the inf values into account. Is there any workaround for this?
df['MA3'] = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.rolling(3,3).mean())
Current Output
Classification Name Count MA3
0 Fruits Apple inf NaN
1 Fruits Apple inf NaN
2 Fruits Apple inf NaN
3 Fruits Apple inf NaN
4 Fruits Apple 5.0 NaN
5 Fruits Apple 6.0 NaN
6 Fruits Apple 7.0 6.0
7 Fruits Apple 8.0 7.0
8 Veg Broc 10.0 NaN
9 Veg Broc 11.0 NaN
10 Veg Broc 12.0 11.0

Create a series S that contains the same calculation with inf replaced by NaN and min_periods=1. Then create a mask for the rows that need to be modified, i.e. the ones that are one or two positions after an inf:
df['MA3'] = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.replace(np.inf, np.nan).rolling(3, min_periods=3).mean())
S = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.replace(np.inf, np.nan).rolling(3, min_periods=1).mean())
mask = df['Count'].lt(np.inf) & df['MA3'].isnull() & (df['Count'].shift(1).eq(np.inf) | df['Count'].shift(2).eq(np.inf))
df.loc[mask, 'MA3'] = S.loc[mask]


How can I find a value in a row that is not a NaN value?

I know the annual price for four different apples.
Unfortunately, in some years the price is missing for some apples.
For each apple, I would like to find the unit price for the first year in which a price was entered, skipping the missing years.
The code is below.
import pandas as pd
import numpy as np
a = {'Price_Y19': [np.nan, np.nan, np.nan, 10],
     'Price_Y20': [np.nan, np.nan, 10, 9],
     'Price_Y21': [np.nan, 10, 9, 8],
     'Price_Y22': [10, 9, 8, 7]}
index_name = ['yellow apple', 'red apple', 'white apple', 'gray apple']
df = pd.DataFrame(data=a,
                  index=index_name)
df
I would like to get below DataFrame
b = {'Price_Y19': [np.nan, np.nan, np.nan, 10],
     'Price_Y20': [np.nan, np.nan, 10, 9],
     'Price_Y21': [np.nan, 10, 9, 8],
     'Price_Y22': [10, 9, 8, 7],
     'Price_Initial': [10, 10, 10, 10],
     'Price_Final': [10, 9, 8, 7],
     'Price_Gap': [0, 1, 2, 3]}
df1 = pd.DataFrame(data=b,
                   index=index_name)
df1
I don't have any idea how to write the code. I got as far as:
df['Price_Initial'] = # some method
I look forward to your answers. Thanks!
You can backward or forward fill the missing values, then select the first or last column by position for the Initial/Final prices, and finally subtract the new columns:
df = df.assign(Price_Initial = df.bfill(axis=1).iloc[:, 0],
               Price_Final = df.ffill(axis=1).iloc[:, -1],
               Price_Gap = lambda x: x['Price_Initial'].sub(x['Price_Final']))
print (df)
Price_Y19 Price_Y20 Price_Y21 Price_Y22 Price_Initial \
yellow apple NaN NaN NaN 10 10.0
red apple NaN NaN 10.0 9 10.0
white apple NaN 10.0 9.0 8 10.0
gray apple 10.0 9.0 8.0 7 10.0
Price_Final Price_Gap
yellow apple 10.0 0.0
red apple 9.0 1.0
white apple 8.0 2.0
gray apple 7.0 3.0
If there are multiple columns and you need to filter only the Price_YY columns:
df1 = df.filter(regex=r'Price_Y\d{2}')
df = df.assign(Price_Initial = df1.bfill(axis=1).iloc[:, 0],
               Price_Final = df1.ffill(axis=1).iloc[:, -1],
               Price_Gap = lambda x: x['Price_Initial'].sub(x['Price_Final']))
print (df)
Price_Y19 Price_Y20 Price_Y21 Price_Y22 Price_Initial \
yellow apple NaN NaN NaN 10 10.0
red apple NaN NaN 10.0 9 10.0
white apple NaN 10.0 9.0 8 10.0
gray apple 10.0 9.0 8.0 7 10.0
Price_Final Price_Gap
yellow apple 10.0 0.0
red apple 9.0 1.0
white apple 8.0 2.0
gray apple 7.0 3.0
With pandas.Series.first_valid_index and pandas.Series.last_valid_index functions:
df[['Price_Initial', 'Price_Final']] = df.apply(
    lambda x: (x[x.first_valid_index()], x[x.last_valid_index()]),
    axis=1, result_type='expand')
df['Price_Gap'] = df['Price_Initial'] - df['Price_Final']
print(df)
Price_Y19 Price_Y20 Price_Y21 Price_Y22 Price_Initial \
yellow apple NaN NaN NaN 10 10.0
red apple NaN NaN 10.0 9 10.0
white apple NaN 10.0 9.0 8 10.0
gray apple 10.0 9.0 8.0 7 10.0
Price_Final Price_Gap
yellow apple 10.0 0.0
red apple 9.0 1.0
white apple 8.0 2.0
gray apple 7.0 3.0
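For reference, the bfill/ffill answer can be assembled into one runnable snippet on the question's data (the fixed four-column layout is assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price_Y19': [np.nan, np.nan, np.nan, 10],
                   'Price_Y20': [np.nan, np.nan, 10, 9],
                   'Price_Y21': [np.nan, 10, 9, 8],
                   'Price_Y22': [10, 9, 8, 7]},
                  index=['yellow apple', 'red apple', 'white apple', 'gray apple'])

df = df.assign(Price_Initial=df.bfill(axis=1).iloc[:, 0],   # first non-NaN per row
               Price_Final=df.ffill(axis=1).iloc[:, -1],    # last non-NaN per row
               Price_Gap=lambda x: x['Price_Initial'].sub(x['Price_Final']))
```

Within assign, later keyword arguments may reference columns created by earlier ones, which is why Price_Gap can use Price_Initial and Price_Final directly.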

Pandas dynamically replace nan values

I have a DataFrame that looks like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the NaN values. I have tried (df.ffill()+df.bfill())/2, but that does not yield the desired output, as it computes every fill value from the original column at once rather than dynamically. I have tried interpolate, but it doesn't work well for non-linear data.
I have seen this answer but did not fully understand it and not sure if it would work.
Update on the computation of the values
I want every NaN value to be the mean of the previous and next non-NaN value. In case there is more than one NaN value in sequence, I want to replace them one at a time and then recompute the mean. E.g. given 1, np.nan, np.nan, 4, I first take the mean of 1 and 4 (2.5) for the first NaN, obtaining 1, 2.5, np.nan, 4, and then the second NaN becomes the mean of 2.5 and 4, giving 1, 2.5, 3.25, 4.
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
Inspired by the #ye olde noobe answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (times comparison below):
def custom_fillna(s: pd.Series):
    for i in range(len(s)):
        if pd.isna(s[i]):
            last_valid_number = s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0
            next_valid_number = s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0
            s[i] = (last_valid_number + next_valid_number) / 2
custom_fillna(df['a'])
df
Times comparison: (the benchmark table was posted as an image in the original answer)
Maybe not the most optimized, but it works. (Note: from your example, I assume that if there is no valid value before or after a NaN, like the last row of column a, 0 is used as a replacement.)
import numpy as np
import pandas as pd

def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
        ) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you have other columns and don't want to apply it to the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
fill_dynamically(df['a'])
In this case, df looks like this:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0
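One caveat not raised in the original answers: both functions mutate the Series they are given, and under pandas' Copy-on-Write mode (the default in pandas 3.0) mutating df['a'] in place may no longer write back to the frame. A variant that returns a new Series side-steps this; a sketch, with the hypothetical name fillna_running_mean:

```python
import numpy as np
import pandas as pd

def fillna_running_mean(s: pd.Series) -> pd.Series:
    """Return a copy where each NaN becomes the mean of the previous
    (possibly just-filled) value and the next valid value, using 0
    when one side has no valid value."""
    out = s.to_numpy(dtype=float, copy=True)
    for i in range(len(out)):
        if np.isnan(out[i]):
            prev_vals = out[:i][~np.isnan(out[:i])]
            next_vals = out[i + 1:][~np.isnan(out[i + 1:])]
            prev = prev_vals[-1] if len(prev_vals) else 0.0
            nxt = next_vals[0] if len(next_vals) else 0.0
            out[i] = (prev + nxt) / 2
    return pd.Series(out, index=s.index)

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan]})
df['a'] = fillna_running_mean(df['a'])
```

Assigning the returned Series back to the column works regardless of the Copy-on-Write setting.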

Python Dataframe Groupby Mean and STD

I know how to compute the groupby mean or std. But now I want to compute both at a time.
My code:
df =
a b c d
0 Apple 3 5 7
1 Banana 4 4 8
2 Cherry 7 1 3
3 Apple 3 4 7
xdf = df.groupby('a').agg([np.mean(),np.std()])
Present output:
TypeError: _mean_dispatcher() missing 1 required positional argument: 'a'
Try to remove () from the np. functions:
xdf = df.groupby("a").agg([np.mean, np.std])
print(xdf)
Prints:
b c d
mean std mean std mean std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
EDIT: To "flatten" column multi-index:
xdf = df.groupby("a").agg([np.mean, np.std])
xdf.columns = xdf.columns.map("_".join)
print(xdf)
Prints:
b_mean b_std c_mean c_std d_mean d_std
a
Apple 3 0.0 4.5 0.707107 7 0.0
Banana 4 NaN 4.0 NaN 8 NaN
Cherry 7 NaN 1.0 NaN 3 NaN
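A note beyond the original answer: passing NumPy callables like np.mean directly to agg is deprecated in recent pandas versions (pandas 2.x emits a FutureWarning asking for the string names instead). The string spelling produces the same result, as a sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['Apple', 'Banana', 'Cherry', 'Apple'],
                   'b': [3, 4, 7, 3],
                   'c': [5, 4, 1, 4],
                   'd': [7, 8, 3, 7]})

# String aggregation names are the forward-compatible spelling
xdf = df.groupby('a').agg(['mean', 'std'])
# Flatten the resulting column MultiIndex, e.g. ('b', 'mean') -> 'b_mean'
xdf.columns = xdf.columns.map('_'.join)
```

The single-member groups (Banana, Cherry) still get NaN for std, since the sample standard deviation of one value is undefined.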

Difference with limit of NaNs between values

I would like to calculate the difference (derivative) between contiguous values, for example:
list = 1, 3, 7, 6
list_diff = NaN, 2, 4, -1
The case above only works when there are no NaNs among the values. In the case below, I would like to know the grade difference to see how a student's learning evolves over time. The problem is that some grades are missing! We still want to calculate the difference, but only if there are at most 2 missing grades in between.
How can I do this?
df:
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
001 1 6 5 9 1 7 9
002 5 8 NaN 8' NaN NaN 2'
003 7 *8* NaN NaN NaN *2* 6
df_diff:
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
001 NaN 5 -1 4 -8 6 2
002 NaN 3 NaN 0 NaN NaN -6'
003 NaN 1 NaN NaN NaN *NaN* 4
Note in df that for students 001 and 002 the differences between grades are calculated even when NaNs are in the middle, because they have at most 2 missing grades, e.g. 2' - 8' = -6'.
However, student 003 has a gap of 3 missing grades, so the difference in that case is not calculated, e.g. *2* - *8* = *NaN*.
Use ffill with the limit parameter to forward fill at most 2 values before DataFrame.diff, and then use DataFrame.mask to restore NaN wherever the original value was missing:
df = df.ffill(axis=1, limit=2).diff(axis=1).mask(df.isna())
print (df)
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
1 NaN 5.0 -1.0 4.0 -8.0 6.0 2.0
2 NaN 3.0 NaN 0.0 NaN NaN -6.0
3 NaN 1.0 NaN NaN NaN NaN 4.0
Details:
print (df.ffill(axis=1, limit=2))
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
1 1.0 6.0 5.0 9.0 1.0 7.0 9.0
2 5.0 8.0 8.0 8.0 8.0 8.0 2.0
3 7.0 8.0 8.0 8.0 NaN 2.0 6.0
print (df.ffill(axis=1, limit=2).diff(axis=1))
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
1 NaN 5.0 -1.0 4.0 -8.0 6.0 2.0
2 NaN 3.0 0.0 0.0 0.0 0.0 -6.0
3 NaN 1.0 0.0 0.0 NaN NaN 4.0
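Assembled into a runnable snippet (the frame is rebuilt from the question's table, so the exact dtypes are an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'GRD1': [1, 5, 7], 'GRD2': [6, 8, 8], 'GRD3': [5, np.nan, np.nan],
     'GRD4': [9, 8, np.nan], 'GRD5': [1, np.nan, np.nan],
     'GRD6': [7, np.nan, 2], 'GRD7': [9, 2, 6]},
    index=['001', '002', '003'])

# Forward fill at most 2 NaNs per row, take the row-wise difference,
# then put NaN back wherever the original grade was missing.
out = df.ffill(axis=1, limit=2).diff(axis=1).mask(df.isna())
```

Because limit=2 stops the fill after two NaNs, student 003's run of three missing grades leaves GRD5 as NaN, so the GRD6 difference is NaN exactly as required.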

Pandas: sum up multiple columns into one column without last column

If I have a dataframe similar to this one
Apples Bananas Grapes Kiwis
2 3 nan 1
1 3 7 nan
nan nan 2 3
I would like to add a column like this
Apples Bananas Grapes Kiwis Fruit Total
2 3 nan 1 6
1 3 7 nan 11
nan nan 2 3 5
I guess you could use df['Apples'] + df['Bananas'] and so on, but my actual dataframe is much larger than this. I was hoping a formula like df['Fruit Total']=df[-4:-1].sum could do the trick in one line of code. That didn't work, however. Is there any way to do it without explicitly summing up all the columns?
You can first select by iloc and then sum:
df['Fruit Total']= df.iloc[:, -4:-1].sum(axis=1)
print (df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 5.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 2.0
For sum all columns use:
df['Fruit Total']= df.sum(axis=1)
This may be helpful for beginners, so for the sake of completeness, if you know the column names (e.g. they are in a list), you can use:
column_names = ['Apples', 'Bananas', 'Grapes', 'Kiwis']
df['Fruit Total']= df[column_names].sum(axis=1)
This gives you flexibility about which columns you use, since you only have to manipulate the list column_names, and you can do things like pick only columns with the letter 'a' in their name. Another benefit is that it's easier for humans to understand what the code is doing through column names. Combine this with list(df.columns) to get the column names in list format. Thus, if you want to drop the last column, all you have to do is:
column_names = list(df.columns)
df['Fruit Total']= df[column_names[:-1]].sum(axis=1)
It is possible to do it without knowing the number of columns and even without iloc:
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
cols_to_sum = df.columns[ : df.shape[1]-1]
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 5.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 2.0
Using df['Fruit Total']= df.iloc[:, -4:-1].sum(axis=1) over your original df won't include the last column ('Kiwis'); you should use df.iloc[:, -4:] instead to select all four columns:
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
df['Fruit Total']=df.iloc[:,-4:].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 6.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 5.0
I want to build on Ramon's answer if you want to come up with the total without knowing the shape/size of the dataframe.
I will use his answer below but fix one item that didn't include the last column for the total.
I have removed the -1 from the shape:
cols_to_sum = df.columns[ : df.shape[1]-1]
To this:
cols_to_sum = df.columns[ : df.shape[1]]
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
cols_to_sum = df.columns[ : df.shape[1]]
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 6.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 5.0
Which then gives you the correct total without skipping the last column.
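If the goal really is to exclude the last column (as the title says), a simpler spelling than the shape arithmetic above is df.iloc[:, :-1]. A small sketch on the sample data, with the hypothetical column name 'No Kiwi Total':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Apples': [2, 1, np.nan],
                   'Bananas': [3, 3, np.nan],
                   'Grapes': [np.nan, 7, 2],
                   'Kiwis': [1, np.nan, 3]})

# All columns except the last one (Kiwis); sum skips NaN by default.
df['No Kiwi Total'] = df.iloc[:, :-1].sum(axis=1)

# All original fruit columns, selected by name so that re-running the
# line never accidentally includes a previously added total column.
df['Fruit Total'] = df[['Apples', 'Bananas', 'Grapes', 'Kiwis']].sum(axis=1)
```

Selecting by name is the safer habit once total columns start accumulating, since positional slices like iloc[:, :-1] shift meaning every time a column is appended.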
