Reset cumulative sum based on condition in Pandas - python

I have a data frame like:
customer spend hurdle
A 20 50
A 31 50
A 20 50
B 50 100
B 51 100
B 30 100
I want to calculate an additional Cumulative column that resets, per customer, whenever the cumulative sum becomes greater than or equal to the hurdle, like the following:
customer spend hurdle Cumulative
A 20 50 20
A 31 50 51
A 20 50 20
B 50 100 50
B 51 100 101
B 30 100 30
I used cumsum and groupby in pandas, but I do not know how to reset the sum based on the condition.
The following is the code I am currently using:
df1['cum_sum'] = df1.groupby(['customer'])['spend'].apply(lambda x: x.cumsum())
which, I know, is just a normal cumulative sum. I would very much appreciate your help.

There could be a faster, more efficient way. Here's one inefficient apply-based way to do it.
In [3270]: def custcum(x):
      ...:     total = 0
      ...:     for i, v in x.iterrows():
      ...:         total += v.spend
      ...:         x.loc[i, 'cum'] = total
      ...:         if total >= v.hurdle:
      ...:             total = 0
      ...:     return x
      ...:
In [3271]: df.groupby('customer').apply(custcum)
Out[3271]:
customer spend hurdle cum
0 A 20 50 20.0
1 A 31 50 51.0
2 A 20 50 20.0
3 B 50 100 50.0
4 B 51 100 101.0
5 B 30 100 30.0
You may consider using Cython or Numba to speed up custcum.
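For illustration, here is a minimal Numba sketch of that same reset logic; the function name and the per-group assembly below are my own, not part of the original answer:
import numpy as np
import pandas as pd
from numba import njit

@njit
def cumsum_reset(spend, hurdle):
    # Running total that resets to zero once it reaches the hurdle.
    out = np.empty(spend.shape[0], dtype=np.float64)
    total = 0.0
    for i in range(spend.shape[0]):
        total += spend[i]
        out[i] = total
        if total >= hurdle[i]:
            total = 0.0
    return out

# Apply the compiled kernel per customer and stitch the pieces back together.
parts = []
for _, g in df.groupby('customer'):
    parts.append(pd.Series(cumsum_reset(g['spend'].to_numpy(np.float64),
                                        g['hurdle'].to_numpy(np.float64)),
                           index=g.index))
df['cum'] = pd.concat(parts)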
[Update]
Improved version of Ido's answer.
In [3276]: s = df.groupby('customer').spend.cumsum()
In [3277]: np.where(s > df.hurdle.shift(-1), s, df.spend)
Out[3277]: array([ 20, 51, 20, 50, 101, 30], dtype=int64)
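To attach that result as the Cumulative column from the question, one could write something like the following (a usage sketch, assuming the same df):
s = df.groupby('customer').spend.cumsum()
df['Cumulative'] = np.where(s > df.hurdle.shift(-1), s, df.spend)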

One way would be the code below, but it's a really inefficient and inelegant one-liner.
df1.groupby('customer').apply(lambda x: (x['spend'].cumsum() *(x['spend'].cumsum() > x['hurdle']).astype(int).shift(-1)).fillna(x['spend']))
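Spelled out in steps, the intent of that one-liner might be read as the sketch below (my reading, not the original author's code). Note that, like the one-liner, it only behaves correctly when each customer crosses the hurdle at most once; for the general repeated-reset case, the loop-based custcum above is safer.
def reset_once(g):
    cs = g['spend'].cumsum()
    # True from the row *after* the cumulative sum first reaches the hurdle.
    crossed_before = (cs >= g['hurdle']).shift(fill_value=False)
    # Keep the running total until the hurdle was crossed, then fall back to the row's own spend.
    return cs.where(~crossed_before, g['spend'])

df1['Cumulative'] = df1.groupby('customer', group_keys=False).apply(reset_once)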

Related

Rolling average based on another column

I have a dataframe df which looks like:
time(float)  value (float)
10.45        10
10.50        20
10.55        25
11.20        30
11.44        20
12.30        30
I need help calculating a new column called rolling_average_value, which is the average of the value in that row and all values within one hour before it, so that the new dataframe looks like:
time(float)  value (float)  rolling_average_value
10.45        10             10
10.50        20             15
10.55        25             18.33
11.20        30             21.25
11.44        20             21
12.30        30             25
Note: the time column is a float column.
You can temporarily set a datetime index and apply rolling.mean:
# extract hours/minutes from the float time
import numpy as np
minutes, hours = np.modf(df['time(float)'])
hours = hours.astype(int)
minutes = minutes.mul(100).astype(int)
dt = pd.to_datetime(hours.astype(str) + minutes.astype(str), format='%H%M')

# perform the rolling computation on a temporary datetime index
df['rolling_mean'] = (df.set_axis(dt)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index)
                      )
output:
time(float) value (float) rolling_mean
0 10.45 10 10.000000
1 10.50 20 15.000000
2 10.55 25 18.333333
3 11.20 30 21.250000
4 11.44 20 21.000000
5 12.30 30 25.000000
Alternative to compute dt:
dt = pd.to_datetime(df['time(float)'].astype(str)
                      .str.replace(r'\d+', lambda x: x.group().zfill(2),
                                   regex=True),
                    format='%H.%M')
Assuming your data frame is sorted by time, you can also solve this with a simple list comprehension. Iterate over the times, get all indices where the distance from the earlier time values to the current one is less than one (meaning less than one hour), and slice the value column (converted to an array) by those indices. Then just compute the mean of the sliced array:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
     "value": [10, 20, 25, 30, 20, 30]}
)
times = df["time"].values
values = df["value"].values
df["rolling_mean"] = [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
                      for i in range(len(times))]
If your data frame is large, you can compile this loop (here with Numba) to make it significantly faster:
from numba import njit

@njit
def compute_rolling_mean(times, values):
    return [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
            for i in range(len(times))]

df["rolling_mean"] = compute_rolling_mean(df["time"].values, df["value"].values)
Output:
time value rolling_mean
0 10.45 10 10.00
1 10.50 20 15.00
2 10.55 25 18.33
3 11.20 30 21.25
4 11.44 20 21.00
5 12.30 30 25.00

calculate new column values based on conditions in pandas

I have columns in the pandas dataframe df_profit:
profit_date profit
0 01.04 70
1 02.04 80
2 03.04 80
3 04.04 100
4 05.04 120
5 06.04 120
6 07.04 120
7 08.04 130
8 09.04 140
9 10.04 140
And I have the second dataframe df_deals:
deals_date
0 03.04
1 05.04
2 06.04
I want to create a new column 'delta' in df_profit, equal to the difference between the current value and a previous value in the 'profit' column. However, the delta should only be calculated after the first date in 'profit_date' that matches a date in the 'deals_date' column of df_deals, and the 'previous value' in the delta calculation should always stay the same: the profit on that first matching date.
So, the result would look like:
profit_date profit delta
0 01.04 70
1 02.04 80
2 03.04 80
3 04.04 100 20
4 05.04 120 40
5 06.04 120 40
6 07.04 120 40
7 08.04 130 50
8 09.04 140 60
9 10.04 140 60
For next time, you should provide better data to make it easier to help (the DataFrame-creation code, so that we can copy-paste it).
I think this code does what you want:
import pandas as pd

df_profit = pd.DataFrame(columns=["profit_date", "profit"],
                         data=[
                             ["01.04", 70],
                             ["02.04", 80],
                             ["03.04", 80],
                             ["04.04", 100],
                             ["05.04", 120],
                             ["06.04", 120],
                             ["07.04", 120],
                             ["08.04", 130],
                             ["09.04", 140],
                             ["10.04", 140]])
df_deals = pd.DataFrame(columns=["deals_date"], data=["03.04", "05.04", "06.04"])

# combine both dataframes, based on date columns
df = df_profit.merge(right=df_deals, left_on="profit_date", right_on="deals_date", how="left")
# find the first value (first row with deals date) and set it to 'base'
df["base"] = df.loc[df["deals_date"].first_valid_index()]["profit"]
# calculate delta
df["delta"] = df["profit"] - df["base"]
# remove unused values
df.loc[:df["deals_date"].first_valid_index(), "delta"] = None
# remove temporary cols
df.drop(columns=["base", "deals_date"], inplace=True)
print(df)
output is:
profit_date profit delta
0 01.04 70 NaN
1 02.04 80 NaN
2 03.04 80 NaN
3 04.04 100 20.0
4 05.04 120 40.0
5 06.04 120 40.0
6 07.04 120 40.0
7 08.04 130 50.0
8 09.04 140 60.0
9 10.04 140 60.0
You can try this one if you don't want to get NaN values:
start_profit = df_profit.loc[(df_profit["profit_date"] == df_deals.iloc[0][0])]
start_profit = start_profit.iloc[0][1]
for i in range(len(df_profit)):
    if int(str(df_profit.iloc[i][0]).split(".")[0]) > 3 and int(str(df_profit.iloc[i][0]).split(".")[1]) >= 4:
        df_profit.loc[i, "delta"] = df_profit.iloc[i][1] - start_profit
Hope it helps
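A vectorized sketch of the same idea, avoiding both the NaNs and the hard-coded date parsing (my own rewording, assuming df_profit and df_deals as defined above; rows up to and including the first deal date simply get 0):
first_deal_date = df_deals["deals_date"].iloc[0]
start_idx = df_profit.index[df_profit["profit_date"] == first_deal_date][0]
start_profit = df_profit.loc[start_idx, "profit"]

# Rows up to and including the first deal date keep a delta of 0 instead of NaN.
df_profit["delta"] = 0
after = df_profit.index > start_idx
df_profit.loc[after, "delta"] = df_profit.loc[after, "profit"] - start_profit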

Group pandas dataframe by quantile of single column

Sorry if this is a duplicate post; I can't find a related post, though.
import numpy as np
import pandas as pd
from random import seed

seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
What I'd like is to group P by the quartiles/quantiles/deciles/etc. of column A and then calculate an aggregate statistic (such as the mean) by group. I can define deciles of the column as
P['A'].quantile(np.arange(10) / 10)
I'm not sure how to groupby the deciles of A. Thanks in advance!
If you want to group P e.g. by quartiles, run:
gr = P.groupby(pd.qcut(P.A, 4, labels=False))
Then you can perform any operations on these groups.
For presentation, below you have just a printout of P limited to 20 rows:
for key, grp in gr:
    print(f'\nGroup: {key}\n{grp}')
which gives:
Group: 0
A B
0 8 24
3 10 94
10 9 93
15 4 91
17 7 49
Group: 1
A B
7 34 24
8 15 60
12 27 4
13 31 1
14 13 83
Group: 2
A B
4 52 98
5 53 66
9 58 16
16 59 67
18 47 65
Group: 3
A B
1 67 87
2 79 48
6 98 14
11 86 2
19 61 14
As you can see, each group (quartile) has 5 members, so the grouping is
correct.
As a supplement
If you are interested in borders of each quartile, run:
pd.qcut(P.A, 4, labels=False, retbins=True)[1]
qcut then returns two results (a tuple). The first element (index 0) is the result returned before, but this time we are interested in the second element (index 1): the bin borders.
For your data they are:
array([ 4. , 12.25, 40.5 , 59.5 , 98. ])
So e.g. the first quartile is between 4 and 12.25.
You can use the quantile Series to make another column marking each row with its quantile label, and then group by that column. NumPy's searchsorted is very useful for this:
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
q = P['A'].quantile(np.arange(10) / 10)
P['G'] = P['A'].apply(lambda x : q.index[np.searchsorted(q, x, side='right')-1])
Since the quantile Series stores the lower values of the quantile intervals, be sure to pass the parameter side='right' to np.searchsorted to not get 0 (the minimum should be 1 or you have one index more than you need).
Now you can elaborate your statistics by doing, for example:
P.groupby('G').agg(['sum', 'mean']) #add to the list all the statistics method you wish
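Tying the two answers back to the deciles mentioned in the question, a short sketch could be:
# Group by deciles of A (10 equal-frequency bins) and aggregate B per decile.
deciles = pd.qcut(P['A'], 10, labels=False)
print(P.groupby(deciles)['B'].agg(['mean', 'count']))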

Speeding up the insertion of null rows into a large Pandas DataFrame?

I have a Pandas data frame with hundreds of millions of rows that looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/05/16 A 1 60
01/02/16 B 1 59
01/04/16 B 1 90
01/10/16 B 1 84
For each unique combination (call it b) of Attribute A x Attribute B, I need to fill in empty dates starting from the oldest date for that unique group b to the maximum date in the entire dataframe df. That is, so it looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/02/16 A 1 0
01/03/16 A 1 0
01/04/16 A 1 0
01/05/16 A 1 60
01/02/16 B 1 59
01/03/16 B 1 0
01/04/16 B 1 90
01/05/16 B 1 0
01/06/16 B 1 0
01/07/16 B 1 0
01/08/16 B 1 84
and then calculate the coefficient of variation (standard deviation/mean) for each unique combination's values (after inserting 0s). My code is this:
final = pd.DataFrame()
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A', 'Attribute_B']):
    idx = pd.date_range(group['Date'].min(), max_date)
    temp = group.set_index('Date').reindex(idx, fill_value=0)
    coeff_var = temp['Value'].std() / temp['Value'].mean()
    final = pd.concat([final, pd.DataFrame({'Attribute_A': [name[0]],
                                            'Attribute_B': [name[1]],
                                            'Coeff_Var': [coeff_var]})])
This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?
I don't have a ready solution; however, this is how I suggest you approach the problem:
Understand what makes this slow
Find ways to make the critical parts faster
Or, alternatively, find a new approach
Here's the analysis of your code using line profiler:
Timer unit: 1e-06 s
Total time: 0.028074 s
File: <ipython-input-54-ad49822d490b>
Function: foo at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo():
2 1 875 875.0 3.1 final = pd.DataFrame()
3 1 302 302.0 1.1 max_date = df['Date'].max()
4 3 3343 1114.3 11.9 for name, group in df.groupby(['Attribute_A','Attribute_B']):
5 2 836 418.0 3.0 idx = pd.date_range(group['Date'].min(),
6 2 3601 1800.5 12.8 max_date)
7
8 2 6713 3356.5 23.9 temp = group.set_index('Date').reindex(idx, fill_value=0)
9 2 1961 980.5 7.0 coeff_var = temp['Value'].std()/temp['Value'].mean()
10 2 10443 5221.5 37.2 final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
In conclusion, the .reindex and concat statements take 60% of the time.
A first approach that saves 42% of time in my measurement is to collect the data for the final data frame as a list of rows, and create the dataframe as the very last step. Like so:
newdata = []
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A', 'Attribute_B']):
    idx = pd.date_range(group['Date'].min(), max_date)
    temp = group.set_index('Date').reindex(idx, fill_value=0)
    coeff_var = temp['Value'].std() / temp['Value'].mean()
    newdata.append({'Attribute_A': name[0], 'Attribute_B': name[1], 'Coeff_Var': coeff_var})
final = pd.DataFrame.from_records(newdata)
Using timeit to measure best execution times I get
your solution: 100 loops, best of 3: 11.5 ms per loop
improved concat: 100 loops, best of 3: 6.67 ms per loop
For details, see this IPython notebook.
Note: your mileage may vary; I used the sample data provided in the original post. You should run the line profiler on a subset of your real data, since the dominating factor with regard to runtime may well be something else there.
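As one possible instance of the "find a new approach" option above, the coefficient of variation can in principle be computed without materialising the zero-filled rows at all, because the inserted zeros only change the effective number of observations. This is a hedged sketch, not the original answer's code; it assumes Date is already a datetime column and relies on pandas' default ddof=1, matching .std():
max_date = df['Date'].max()
g = df.groupby(['Attribute_A', 'Attribute_B'])

# Sums and sums of squares over the existing rows only; the filled-in zeros add nothing to either.
agg = g['Value'].agg(total='sum', total_sq=lambda s: (s ** 2).sum())
# Number of calendar days from each group's first date through the global max date.
n = (max_date - g['Date'].min()).dt.days + 1

mean = agg['total'] / n
var = (agg['total_sq'] - n * mean ** 2) / (n - 1)   # sample variance, same ddof=1 as .std()
final = (var ** 0.5 / mean).rename('Coeff_Var').reset_index()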
I am not sure if my way is faster than the way that you set up, but here goes:
df = pd.DataFrame({'Date': ['1/1/2016', '1/5/2016', '1/2/2016', '1/4/2016', '1/10/2016'],
                   'Attribute A': ['A', 'A', 'B', 'B', 'B'],
                   'Attribute B': [1, 1, 1, 1, 1],
                   'Value': [50, 60, 59, 90, 84]})

unique_attributes = df['Attribute A'].unique()
groups = []
for i in unique_attributes:
    subset = df[df['Attribute A'] == i]
    dates = subset['Date'].tolist()
    Dates = pd.date_range(dates[0], dates[-1])
    subset.set_index('Date', inplace=True)
    subset.index = pd.DatetimeIndex(subset.index)
    subset = subset.reindex(Dates)
    subset['Attribute A'].fillna(method='ffill', inplace=True)
    subset['Attribute B'].fillna(method='ffill', inplace=True)
    subset['Value'].fillna(0, inplace=True)
    groups.append(subset)
result = pd.concat(groups)

Numpy Conditional Max of Range

I'm trying to make a version of my program faster using as much Pandas and NumPy as possible. I am new to NumPy and have been grasping most of it, but I am having trouble conditionally filling a column with the max of a range. This is the code I am trying to use to achieve this:
x=3
df1['Max']=numpy.where(df1.index>=x,max(df1.High[-x:],0))
Basically, I am trying to conditionally put the maximum value over the last 3 entries into a cell and repeat down the column. Any and all help is appreciated.
Use SciPy's maximum_filter:
from scipy.ndimage import maximum_filter1d
df['max'] = maximum_filter1d(df.High, size=3, origin=1, mode='nearest')
Basically, maximum_filter operates on a sliding window, looking for the maximum in that window. By default, each such max computation is performed with the window centered at the index itself. Since we want the window to cover the current element and the ones before it, ending at the current one, we need to change that centering with the origin parameter; therefore, we set it to 1.
Sample run -
In [21]: df
Out[21]:
High max
0 13 13
1 77 77
2 16 77
3 30 77
4 25 30
5 98 98
6 79 98
7 58 98
8 51 79
9 23 58
Runtime test
This got me interested in seeing how SciPy's sliding max operation performs against Pandas's rolling max method. Here are some results on bigger data sizes:
In [55]: df = pd.DataFrame(np.random.randint(0,99,(10000)),columns=['High'])

In [56]: %%timeit # @Merlin's rolling based solution :
    ...: df['max'] = df.High.rolling(window=3, min_periods=1).max()
    ...:
1000 loops, best of 3: 1.35 ms per loop

In [57]: %%timeit # Using Scipy's max filter :
    ...: df['max1'] = maximum_filter1d(df.High,size=3,\
    ...:                               origin=1,mode='nearest')
    ...:
1000 loops, best of 3: 487 µs per loop
Here is the logic of np.where:
numpy.where(condition, value_if_true, value_if_false)
I think you need the below.
dd= {'to': [100, 200, 300, 400, -500, 600, 700,800, 900, 1000]}
df = pd.DataFrame(dd)
df
to
0 100
1 200
2 300
3 400
4 -500
5 600
6 700
7 800
8 900
9 1000
df['Max'] = df.rolling(window=3, min_periods=1).max()
to Max
0 100 100.0
1 200 200.0
2 300 300.0
3 400 400.0
4 -500 400.0
5 600 600.0
6 700 700.0
7 800 800.0
8 900 900.0
9 1000 1000.0
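If the original np.where attempt was meant to apply the rolling max only from row x onward (leaving earlier rows untouched, here as NaN), a hedged sketch combining the two ideas might look like this; the exact condition depends on what the question actually intends:
import numpy as np

x = 3
# Rolling max over the current row and the two before it, as in the answers above.
rolling_max = df1['High'].rolling(window=x, min_periods=1).max()
# Keep the rolling max only where the index condition from the question holds.
df1['Max'] = np.where(df1.index >= x, rolling_max, np.nan)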
