Pandas groupby - how to see distribution of values in a mean - python

The code below downloads daily prices for Apple (AAPL) back to 1980.
I wanted to see the average gain in price over the next 10 days by day of year, e.g. October 20th historically averages a 2% gain over the next 10 days.
I converted each date to a unique identifier ("uniqdttm"), then used .groupby to group all price gains by this identifier, and .mean() to see the average of all price gains for each date.
import pandas as pd
import yfinance as yf
tick = yf.Ticker("AAPL")
df = tick.history(period="max", interval="1d", prepost=False, auto_adjust=True, back_adjust=False, actions=False)
nmm = "FutureGainDays-10"
df["gaintmp"] = df["Close"].pct_change(periods=10).fillna(0).mul(100)
df[nmm] = df["gaintmp"].shift(-10) # Shift back x days to see what the gain would have been
df["uniqdt"] = df.index.strftime('%b-%d')
df["tmpdate"] = "1972-"+df.index.strftime('%m-%d') # 1972 is a leap year needed for Feb29!
df["uniqdttm"] = pd.to_datetime(df["tmpdate"], format="%Y-%m-%d")
aggs = df.groupby("uniqdttm")[nmm].agg(['mean']).round(3)
print(aggs)
Output:
mean
uniqdttm
1972-01-02 0.960
1972-01-03 2.855
1972-01-04 0.905
1972-01-05 1.289
1972-01-06 1.601
... ...
1972-12-27 3.995
1972-12-28 2.732
1972-12-29 4.588
1972-12-30 4.359
1972-12-31 2.108
Where I am confused is how to see how the values behind each mean are distributed.
An example:
For date x: 5 price gains of:
-4
-6
+10
-2
+7
sum to a total gain of +5 (an average gain of +1)
But trading this would be emotionally very difficult. 3 years of losses and 2 years of big gains.
5 price gains of +1, +2, -1, +1, +1 would sum to +4, an average of +0.8 (less than the example above), but would be much easier to trade.
What statistical measure would give a better representation of how many positives are present in the samples - and how do I see this in the dataframe groupby output?
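For illustration, candidate measures alongside the mean could be the median, the standard deviation, and the fraction of positive gains. A minimal sketch on the five example gains above (win_rate is an invented helper name, not a pandas built-in):

```python
import pandas as pd

# Toy stand-in for the grouped gains, using the example values above
df = pd.DataFrame({'uniqdttm': ['Jan-02'] * 5,
                   'gain': [-4, -6, 10, -2, 7]})

def win_rate(s):
    # Fraction of observations that were positive gains
    return (s > 0).mean()

# median and std show the spread behind each mean; win_rate shows how
# often the trade would actually have been a winner
aggs = df.groupby('uniqdttm')['gain'].agg(['mean', 'median', 'std', win_rate]).round(3)
```

Here the mean is +1 but win_rate is only 0.4, which captures exactly the "hard to trade" problem described above.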
EDIT:
Is it also possible to see all values that contributed to the mean in the groupby output, sorted smallest to largest, e.g.:
mean
uniqdttm
1972-01-02 5.0 -6 -4 -2 7 10
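One possible way to get this is named aggregation with list, sorting the frame first so each group's list comes out ordered. A sketch on the example gains (the column names mean/values are my own):

```python
import pandas as pd

# Toy stand-in for the grouped gains, using the example values above
df = pd.DataFrame({'uniqdttm': ['Jan-02'] * 5,
                   'gain': [-4, -6, 10, -2, 7]})

# Sort first so the per-group list comes out smallest to largest
aggs = (df.sort_values('gain')
          .groupby('uniqdttm')['gain']
          .agg(mean='mean', values=list))
```

Each row then carries both the mean and every raw value that produced it.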

Related

Replace outliers with groupby average in multi-index dataframe

I have the following multi-index data frame, where ID and Year are part of the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame by the average of its company (the same for everything smaller than the 1st percentile).
ID Year ROA
1 2016 1.5
1 2017 0.8
1 2018 NaN
2 2016 0.7
2 2017 0.8
2 2018 0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
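It is hard to diagnose without the full frame, but one alternative sketch replaces outliers with a per-ID mean computed only from non-outlier rows, via transform on a masked series (toy frame mirroring the sample above; note this version also fills NaN rows with the group mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Year': [2016, 2017, 2018] * 2,
                   'ROA': [1.5, 0.8, np.nan, 0.7, 0.8, 0.4]}).set_index(['ID', 'Year'])

lo, hi = df['ROA'].quantile([0.01, 0.99])
mask = df['ROA'].between(lo, hi)  # True for rows kept as-is (False for NaN too)
# Per-ID mean computed only from non-outlier rows, so extremes don't pull it
clean_mean = df['ROA'].where(mask).groupby(level='ID').transform('mean')
df['ROA'] = df['ROA'].where(mask, clean_mean)
```

Masking before the groupby mean is what keeps the replacement value itself free of outlier influence.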
EDIT:
Alternatively, I thought of this function:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) | (df['ROA'] > df['ROA'].quantile(0.99))]
for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the purpose of the function, as it currently excludes only one year from the calculation of the mean, not several.

Calculating Monthly Anomalies in Pandas

Hello StackOverflow Community,
I have been interested in calculating anomalies for data in pandas 1.2.0 using Python 3.9.1 and Numpy 1.19.5, but have been struggling to figure out the most "Pythonic" and "pandas" way to complete this task (or any way for that matter). Below I have created some dummy data and put it into a pandas DataFrame. In addition, I have tried to clearly outline my methodology for calculating monthly anomalies for the dummy data.
What I am trying to do is take "n" years of monthly values (in this example, 2 years of monthly data = 25 months) and calculate monthly averages for all years (for example, group all the January values together and calculate the mean). I have been able to do this using pandas.
Next, I would like to take each monthly average and subtract it from all elements in my DataFrame that fall into that specific month (for example subtract each January value from the overall January mean value). In the code below you will see some lines of code that attempt to do this subtraction, but to no avail.
If anyone has any thought or tips on what may be a good way to approach this, I really appreciate your insight. If you require further clarification, let me know. Thanks for your time and thoughts.
-Marian
#Import packages
import numpy as np
import pandas as pd
#-------------------------------------------------------------
#Create a pandas dataframe with some data that will represent:
#Column of dates for two years, at monthly resolution
#Column of corresponding values for each date.
#Create two years worth of monthly dates
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
#Create some random data that will act as our data that we want to compute the anomalies of
values = np.random.randint(0,100,size=25)
#Put our dates and values into a dataframe to demonstrate how we have tried to calculate our anomalies
df = pd.DataFrame({'Dates': dates, 'Values': values})
#-------------------------------------------------------------
#Anomalies will be computed by finding the mean value of each month over all years
#And then subtracting the mean value of each month by each element that is in that particular month
#Group our df according to the month of each entry and calculate monthly mean for each month
monthly_means = df.groupby(df['Dates'].dt.month).mean()
#-------------------------------------------------------------
#Now, how do we go about subtracting these grouped monthly means from each element that falls
#in the corresponding month.
#For example, if the monthly mean over 2 years for January is 20 and the value is 21 in January 2018, the anomaly would be +1 for January 2018
#Example lines of code I have tried, but have not worked
#ValueError:Unable to coerce to Series, length must be 1: given 12
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month) - monthly_means
#TypeError: unhashable type: "list"
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month).transform([np.subtract])
You can use pd.merge like this :
import numpy as np
import pandas as pd
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
values = np.random.randint(0,100,size=25)
df = pd.DataFrame({'Dates': dates, 'Values': values})
monthly_means = df.groupby(df['Dates'].dt.month)['Values'].mean().reset_index()
df['month']=df['Dates'].dt.strftime("%m").astype(int)
df=df.merge(monthly_means.rename(columns={'Dates':'month','Values':'Mean'}),on='month',how='left')
df['Diff']=df['Mean']-df['Values']
output:
df['Diff']
Out[19]:
0 33.333333
1 19.500000
2 -29.500000
3 -22.500000
4 -24.000000
5 -3.000000
6 10.000000
7 2.500000
8 14.500000
9 -17.500000
10 44.000000
11 31.000000
12 -11.666667
13 -19.500000
14 29.500000
15 22.500000
16 24.000000
17 3.000000
18 -10.000000
19 -2.500000
20 -14.500000
21 17.500000
22 -44.000000
23 -31.000000
24 -21.666667
You can use abs() if you want the absolute difference.
A one-line solution is:
df = pd.DataFrame({'Values': values}, index=dates)
df.groupby(df.index.month).transform(lambda x: x-x.mean())
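A closely related sketch uses transform('mean'), which broadcasts each month's mean back onto its rows without a Python-level lambda (same random dummy data as the question):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
values = np.random.randint(0, 100, size=25)
df = pd.DataFrame({'Values': values}, index=dates)

# transform('mean') returns a series aligned to df, holding each row's
# monthly mean, so the subtraction is a plain vectorized operation
monthly_mean = df.groupby(df.index.month)['Values'].transform('mean')
df['Anomaly'] = df['Values'] - monthly_mean
```

By construction the anomalies within each month sum to zero, which is a quick sanity check.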

Rolling mean with varying window length in Python

I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)
id year income 'smoothed income'
1 1979 20,000 21,250
1 1980 22,000
1 1981 21,000
1 1982 22,000
...
1 2014 34,000 34,500
1 2016 35,000
2 1979 28,000 28,333
2 1980 NaN
2 1981 28,000
2 1982 29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code handles NaNs, but it does not accomplish objectives 1) and 2).
Any help would be much appreciated
Ok, I've modified the code provided by ansev to make it work. filling in NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))
               .rolling(4, min_periods=1).mean().shift(-3)
).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer that 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?
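One possible answer to that last question: pandas ships a forward-looking window indexer, pd.api.indexers.FixedForwardWindowIndexer, whose window simply shrinks near the end of the sample instead of producing NaN the way shift(-3) does. A sketch on an invented single-id series:

```python
import numpy as np
import pandas as pd

# Toy income series for one id, indexed by year (values invented)
income = pd.Series([20000.0, 22000.0, 21000.0, 22000.0, np.nan, 24000.0],
                   index=range(1979, 1985))

# Forward-looking 4-year window; min_periods=1 lets the window shrink at
# the end of the data and skip NaN observations
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=4)
smoothed = income.rolling(indexer, min_periods=1).mean()
```

The last rows then average over whatever observations remain, rather than coming out as NaN.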

Incremental spend in 6 weeks for two groups using pandas

I have an excel data with the following information,
df.head()
User_id Group Week Spend Purchases Group
170309867 Test 2014-10-13 794.66 2 Test-NonRed
57954586 Test 2014-10-13 55.99 1 Test-Red
118068583 Test 2014-10-13 40.87 1 Test-NonRed
516478393 Test 2014-10-13 17.5 2 Test-NonRed
457873235 Test 2014-10-13 20.44 1 Test-Red
From the above information, I need to calculate the incremental spend in the six weeks for the total Test group (test-Red and test-NonRed) vs. control. I need it in absolute ($) and % terms.
I have tried pandas as,
df2= df.groupby(by=['Group','Week']).sum().abs().groupby(level=[0]).cumsum()
And I have the following result,
df2.head()
And then I calculated the sum for each group as,
df2.groupby(by=['group2']).sum()
df2.head()
I would like to have the incremental spend as an absolute ($) value, which I tried with abs(), and I also need it as a percentage. Any help would be much appreciated.
The expected result is the incremental spend over the six weeks for the total Test group (Test-Red and Test-NonRed) vs. Control, in absolute spend and as a percentage. Something like this,
Group incremental_spend incremental_%
Control 11450175 #%
test-NonRed 50288158 #%
test-Red 12043938 #%
So my real questions,
1. Is the above-mentioned approach the right way to calculate the incremental spend per Group over the 6 weeks in column Week, based on Spend?
2. Also, I need all my results in Absolute counts and Absolute %
I think there are several problems here which make your question difficult to answer.
Vocabulary
What you describe as "Incremental spend" is just the sum.
What you do in two steps is the sum of the cumulative sum .cumsum().sum(), which is not right.
Also, I am not sure whether you need abs(), which gives the absolute value (abs(-1) gives 1) and will thus only have an effect if there are negative values in your data.
Unfortunately the sample dataset is not large enough to get a conclusion.
Dataset
Your dataset has two columns Group with identical names, which is error prone.
Missing information
You want to get final values (sums) as a ratio (%), but you do not indicate what is the reference value for this ratio.
Is it the sum of Spend for the control group ?
Potential solution
>>> df # Sample dataframe with one entry as 'Control' group
Out[]:
User_id Group Week Spend Purchases Group.1
0 170309867 Test 2014-10-13 794.66 2 Test-NonRed
1 57954586 Test 2014-10-13 55.99 1 Test-Red
2 118068583 Test 2014-10-13 40.87 1 Test-NonRed
3 516478393 Test 2014-10-13 17.50 2 Control
4 457873235 Test 2014-10-13 20.44 1 Test-Red
df2 = pd.DataFrame(df.groupby('Group.1').Spend.sum()) # Get 'Spend' sum for each group
>>> df2
Out[]:
Spend
Group.1
Control 17.50
Test-NonRed 835.53
Test-Red 76.43
control_spend_total = df2.loc['Control'].values # Get total spend for 'Control' group
>>> control_spend_total
Out[]: array([ 17.5])
df2['Spend_%'] = df2.Spend / control_spend_total * 100 # Add 'Spend_ratio' column
>>> df2
Out[]:
Spend Spend_%
Group.1
Control 17.50 100.000000
Test-NonRed 835.53 4774.457143
Test-Red 76.43 436.742857
Does it look like what you want?

How to use Pandas rolling average without guaranteed number of observations

I am looking at annualized baseball statistics and would like to calculate a rolling mean looking back at the previous 3 years' worth of performance in regard to number of Hits. However, I want to account for the fact that while my dataset reaches back more than 3 years, one single player may have only been in the league for 1-2 years and will not have 3 years' worth of observations off of which I can calculate the rolling mean. For example:
In[6]: df = pd.DataFrame({'PLAYER_ID': ['A', 'A', 'A', 'B', 'B'],
'HITS': [45, 55, 50, 20, 24]})
In[9]: df
Out[9]:
PLAYER_ID HITS
0 A 45
1 A 55
2 A 50
3 B 20
4 B 24
How would I use a groupby and aggregation/transform (or some other process) to calculate the rolling mean for each player with a max 3 years historic totals and then just use the max available historic observations for a player with less than 3 years' historic performance data available?
Pretty sure my answer lies within the Pandas package but would be interested in any solution.
Thanks!
pd.DataFrame.rolling handles this problem for you automatically. Using your example data, df.groupby('PLAYER_ID').rolling(1).mean() will give you:
HITS PLAYER_ID
PLAYER_ID
A 0 45.0 A
1 55.0 A
2 50.0 A
B 3 20.0 B
4 24.0 B
For your example case I'm using a window size of just 1, which means that we're treating each individual observation as its own mean. This isn't particularly interesting. With more data you can use a larger window size: for example, if your data is weekly, rolling(5) would give you an approximately monthly window size (or rolling(31) if your data is daily, and so on).
Two issues to be aware of when using this methodology:
If your data isn't sampled on a regular basis (e.g. if it skips a week or a month at a time), your rolling average won't be aligned in time. For this reason if your data isn't already regularly sampled you'll usually want to resample it.
If your data contains NaN values, those will be propagated: every window containing that NaN will also be NaN. You'll have to impute those values somehow to keep that from happening.
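For the original goal (up to 3 seasons of history, using whatever a newer player has), a sketch with a window of 3 and min_periods=1 on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'PLAYER_ID': ['A', 'A', 'A', 'B', 'B'],
                   'HITS': [45, 55, 50, 20, 24]})

# Up to 3 prior seasons per player; min_periods=1 means a player with
# only 1 or 2 seasons is averaged over whatever history exists
rolled = df.groupby('PLAYER_ID')['HITS'].rolling(3, min_periods=1).mean()
```

Player B, with only two seasons, gets means over one and then two observations rather than NaN.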
