faster way to calculate the weighted mean based on rolling offset - python

I have an example DataFrame: df = pd.DataFrame({'price': [100, 101, 99, 95, 97, 88], 'qty': [12, 5, 1, 3, 1, 3]}). I want to calculate the quantity-weighted average price over the most recent 5 units of qty, i.e. sum(price * qty) / sum(qty) over a rolling window of 5 units. The desired output is 100, 101, 100.6, 97, 96.2, 91.2.
I don't currently have a good way to calculate this. I have a slow approach that gets close: compute the cumulative sum of qty, then use df.qty_cumsum[(df.qty_cumsum <= x.qty_cumsum - 5)].argmax() to find the last row whose cumulative qty is at least 5 units behind the current row, and use that position to calculate the weighted average in a second step.
Thanks

One option is to repeat each price qty times, take a rolling mean over 5 rows, then group by the original index and take the last value of each group:
np.repeat(df['price'], df['qty']).rolling(5).mean().groupby(level=0).last()
Output:
0 100.0
1 101.0
2 100.6
3 97.0
4 96.2
5 91.2
Name: price, dtype: float64
P.S. If you have large qty values, it would probably also make sense to clip qty to 5 for efficiency, since a window of 5 units gives the same result whether qty is 5 or 12:
np.repeat(df['price'], np.clip(df['qty'], 0, 5)).rolling(5).mean().groupby(level=0).last()
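As a self-contained check of the clipped version, here is a minimal sketch that wraps it in a helper. The function name rolling_qty_weighted_mean and its window parameter are illustrative assumptions, not part of the original answer, and the sketch assumes qty holds non-negative integers.

```python
import pandas as pd

def rolling_qty_weighted_mean(df, window=5):
    # Hypothetical helper: expand each price into min(qty, window) rows,
    # take a plain rolling mean over `window` rows, then keep the last
    # value belonging to each original row.
    reps = df['qty'].clip(upper=window)
    expanded = df['price'].repeat(reps)
    return expanded.rolling(window).mean().groupby(level=0).last()

df = pd.DataFrame({'price': [100, 101, 99, 95, 97, 88],
                   'qty': [12, 5, 1, 3, 1, 3]})
result = rolling_qty_weighted_mean(df)
print(result)
```

Rows that occur before a full window of units has accumulated come out as NaN, because rolling(window) needs window rows before it produces a value.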

Related

how to get averages of values in columns for each value in another column

I have a dataframe looks like following:
import pandas as pd
df = pd.DataFrame({'Date':['2019-08', '2019-08', '2019-09', '2019-09', '2019-10', '2019-10'], 'Name':['A','B','A','C', 'A', 'B'], 'math':[100,90,69,80,0,70], 'science':[100,90,0,80,92,95]})
df['Date'] = pd.to_datetime(df['Date'])
I want to iterate through the data and find the average math and science grades per person.
My desired output should look like this:
df = pd.DataFrame({'Name':['A','B','C'], 'math':[56.3333, 80, 80], 'science':[64, 92.5, 80]})
To be more clear, here are the steps I want to take.
1) get a Name (ex. A)
2) get math grades for that person (100, 69, 0)
3) calculate the average (56.333)
4) get science grades for that person (100, 0, 92)
5) calculate the average (64)
6) repeat the steps for every name (B and C)
Group them by 'Name' and calculate the average for each:
df.groupby('Name')[['math','science']].mean()
           math  science
Name
A     56.333333     64.0
B     80.000000     92.5
C     80.000000     80.0
You can use the groupby feature, selecting the grade columns so that Date is left out of the average:
grp_grade = df.groupby(by='Name')
grp_grade[['math', 'science']].mean()
Out[17]:
           math  science
Name
A     56.333333     64.0
B     80.000000     92.5
C     80.000000     80.0

Counting number of spikes in a graph in python

With dataset df I plotted a graph looking like the following:
df
Time Temperature
8:23:04 18.5
8:23:04 19
9:12:57 19
9:12:57 20
9:12:58 20
9:12:58 21
9:12:59 21
9:12:59 23
9:13:00 23
9:13:00 25
9:13:01 25
9:13:01 27
9:13:02 27
9:13:02 28
9:13:03 28
[Graph: overall view]
Zooming in on the data shows more detail:
[Graph: zoomed-in view]
I would like to count the number of activations of this temperature measurement device, which gives rise to temperature increasing drastically. I have defined an activation as below:
Let T0, T1, T2, T3 be temperature at time t=0,t=1,t=2,t=3, and d0= T1-T0, d1= T2-T1, d2= T3-T2, ... be the difference of 2 adjacent values.
If
1) d0 ≥ 0 and d1 ≥ 0 and d2 ≥ 0, and
2) T2- T0 > max(d0, d1, d2), and
3) the time elapsed between T0 and T2 is less than 30 seconds,
then it is considered an activation. I want to count how many activations there are in total. What's a good way to do this?
Thanks.
There could be a number of different, valid answers depending on how a spike is defined.
Assuming you just want the indices where the temperature increases significantly, one simple method is to look for jumps in value above some threshold. The threshold can be derived from the mean absolute difference of the data, which gives a rough indication of where the significant variations occur. Here's a basic implementation:
import numpy as np
# Data
x = np.array([0, 1, 2, 50, 51, 52, 53, 100, 99, 98, 97, 96, 10, 9, 8, 80])
# Data diff
xdiff = x[1:] - x[0:-1]
# Find mean change
xdiff_mean = np.abs(xdiff).mean()
# Identify all indices greater than the mean
spikes = xdiff > abs(xdiff_mean)+1
print(x[1:][spikes]) # prints 50, 100, 80
print(np.where(spikes)[0]+1) # prints 3, 7, 15
You could also use outlier rejection, which would be much more clever than this basic comparison to the mean difference. There are lots of answers on how to do that:
Can scipy.stats identify and mask obvious outliers?
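Since the question asks for a count rather than the positions, the threshold approach above can be wrapped into a small function. This is only a sketch: the name count_spikes and the margin parameter are assumptions for illustration, and it counts jumps above the mean absolute change rather than implementing the three-condition activation rule from the question.

```python
import numpy as np

def count_spikes(values, margin=1.0):
    # Hypothetical helper: count upward jumps that exceed the mean
    # absolute change between neighbours by more than `margin`.
    diffs = np.diff(np.asarray(values, dtype=float))
    threshold = np.abs(diffs).mean() + margin
    return int((diffs > threshold).sum())

x = [0, 1, 2, 50, 51, 52, 53, 100, 99, 98, 97, 96, 10, 9, 8, 80]
n_spikes = count_spikes(x)
print(n_spikes)  # 3, matching the jumps to 50, 100 and 80
```

Only increases are counted, so the large drop from 96 to 10 is ignored, in line with the original code's comparison on the signed difference.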

Add new column based on mean slice of another column

Suppose I have a DataFrame
my_df = pd.DataFrame([10, 20, 30, 40, 50], columns=['col_1'])
I would like to add a new column where the value of each row is the mean of the values in col_1 from that row to the end. In this case the new column (let's call it 'col_2') would be [30, 35, 40, 45, 50].
The following is not good code but it at least describes generating the values.
for i in range(len(my_df)):
    my_df.loc[i]['col_2'] = my_df[i:]['col_1'].mean()
How can I do this in a clean, idiomatic way that doesn't raise a SettingWithCopyWarning?
You can reverse the column, take the incremental mean, and then reverse it back again.
my_df.loc[::-1, 'col_1'].expanding().mean()[::-1]
# 0 30.0
# 1 35.0
# 2 40.0
# 3 45.0
# 4 50.0
# Name: col_1, dtype: float64
A similar ndarray-level approach could be to use np.cumsum and divide by the increasing number of elements.
np.true_divide(np.cumsum(my_df.col_1.values[::-1]),
np.arange(1, len(my_df)+1))[::-1]
# array([30., 35., 40., 45., 50.])
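To actually store the result as the new column, the reversed expanding mean can be assigned directly. A minimal sketch, assuming the same my_df as in the question; assigning a whole Series to a column of the original frame is not chained assignment, so it should not raise SettingWithCopyWarning.

```python
import pandas as pd

my_df = pd.DataFrame([10, 20, 30, 40, 50], columns=['col_1'])

# Reverse, take the expanding mean, reverse back, then assign as a column.
# The assignment aligns on the index, so each mean lands on its own row.
my_df['col_2'] = my_df['col_1'][::-1].expanding().mean()[::-1]
print(my_df)
```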

if negative then with weighted average

I have a DataFrame:
a = {'Price': [10, 15, 20, 25, 30], 'Total': [10000, 12000, 15000, 14000, 10000],
'Previous Quarter': [0, 10000, 12000, 15000, 14000]}
a = pd.DataFrame(a)
print (a)
With this raw data, I have added a number of additional columns, including a weighted average price (WAP):
a['Change'] = a['Total'] - a['Previous Quarter']
a['Amount'] = a['Price']*a['Change']
a['Cum Sum Amount'] = np.cumsum(a['Amount'])
a['WAP'] = a['Cum Sum Amount'] / a['Total']
This is fine, however as Total starts to decrease, it brings down the weighted average price.
My question is: if Total decreases, how would I get WAP to reflect the row above? For instance, in row 3 Total is 14000, which is lower than the 15000 in row 2. This brings WAP down from 12.67 to 11.79, but I would like it to stay at 12.67.
I have tried looping through a['Total'] < 0 then a['WAP'] = 0 but this impacts the whole column.
Ultimately I am looking for a WAP column which reads:
10, 10.83, 12.67, 12.67, 12.67
You could use cummax:
a['WAP'] = (a['Cum Sum Amount'] / a['Total']).cummax()
print (a['WAP'])
0 10.000000
1 10.833333
2 12.666667
3 12.666667
4 12.666667
Name: WAP, dtype: float64
As a total Python beginner, here are two options I could think of
Either
a['WAP'] = np.maximum.accumulate(a['Cum Sum Amount'] / a['Total'])
Or, after you've already created WAP, you could modify only the affected subset using the diff method (thanks to @ayhan for the loc, which will modify a in place):
a.loc[a['WAP'].diff() < 0, 'WAP'] = max(a['WAP'])
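The answers above rely on WAP never needing to rise above its earlier maximum. As an alternative sketch (not from the original answers), the rule "when Total decreases, reuse the WAP from the row above" can be expressed literally by blanking those rows and forward-filling. The variable names raw_wap and decreased are illustrative.

```python
import pandas as pd

a = pd.DataFrame({'Price': [10, 15, 20, 25, 30],
                  'Total': [10000, 12000, 15000, 14000, 10000],
                  'Previous Quarter': [0, 10000, 12000, 15000, 14000]})
a['Change'] = a['Total'] - a['Previous Quarter']
a['Amount'] = a['Price'] * a['Change']
a['Cum Sum Amount'] = a['Amount'].cumsum()

raw_wap = a['Cum Sum Amount'] / a['Total']
# True where Total fell relative to the previous row (first row is False).
decreased = a['Total'].diff() < 0
# Blank out the rows where Total decreased, then carry the last kept value forward.
a['WAP'] = raw_wap.mask(decreased).ffill()
print(a['WAP'])
```

On this data the result matches cummax; the two would differ if Total rose again and the recomputed WAP came out lower than an earlier value.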

pandas qcut not putting equal number of observations into each bin

I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger data-set. What I am trying to do, is to create bins of this data set, e.g. quintile
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 rows. The nearest number divisible by 5 bins is 1565, which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations, and one with 313 observations. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one value from the third bin and place it in the second. I suspect the same thing is happening at a larger scale with your percentiles, since you're condensing the ranks onto a 1 to 100 scale. Is there any reason not to run qcut directly on the data instead of the percentiles, or to return percentiles with greater precision?
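Just try the code below, which ranks the values first so that ties are broken by order of appearance (nbins is the number of bins):
pd.qcut(df['value_rank'].rank(method='first'), nbins)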
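To see the effect of ranking first, here is a small sketch on the ten-value example used earlier in this thread. The variable names ranks, bins and counts are illustrative.

```python
import pandas as pd

test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53],
                        name='value_rank')

# method='first' gives tied values distinct ranks in order of appearance,
# so qcut sees ten unique values and can split them evenly.
ranks = test_series.rank(method='first')
bins = pd.qcut(ranks, 5, labels=False)
counts = bins.value_counts().sort_index()
print(counts)
```

Each of the five bins ends up with two observations. The trade-off is that two equal values (the two 30s here) can land in different bins.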
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So, now we have two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion because results can vary based on your data, for instance how large your data set is and how the values are spaced. The above "trick" works well for integers because even though we are "salting" the test_list, it still preserves rank order, in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of your noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-
