Rolling mean with varying window length in Python

I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I have only computed values for 'smoothed income' that can be computed from the data provided).
id  year  income  'smoothed income'
1   1979  20,000  21,250
1   1980  22,000
1   1981  21,000
1   1982  22,000
...
1   2014  34,000  34,500
1   2016  35,000
2   1979  28,000  28,333
2   1980  NaN
2   1981  28,000
2   1982  29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code handles the NaNs (objective 3), but it does not accomplish objectives 1) and 2): the window counts rows rather than years, and it trails each observation instead of starting from it.
Any help would be much appreciated

OK, I've modified the code provided by ansev to make it work; filling in the NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))
               .rolling(4, min_periods=1).mean()
               .shift(-3)
).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer than 4 years remaining (e.g. from 2014 onward, because my data goes until 2016). Is there a way of shortening the window length after 2014?
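One possibility, as a hedged sketch rather than a tested answer: pandas (1.0+) ships FixedForwardWindowIndexer, which builds windows that look forward from each row and simply use fewer observations near the end of each group, so both the shift(-3) trick and the truncated tail go away. Variable and column names follow the question:
from pandas.api.indexers import FixedForwardWindowIndexer

indexer = FixedForwardWindowIndexer(window_size=4)  # 4-year window starting at each row
df.set_index('year').groupby('id').income.apply(
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))  # one row per calendar year
               .rolling(indexer, min_periods=1).mean()            # forward mean; NaNs skipped
).reset_index()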

Related

Pandas groupby - how to see distribution of values in a mean

The code below downloads daily prices for Apple (AAPL) back to 1980.
I wanted to see the average gain in price over the next 10 days, by day of year, e.g. October 20th historically averages a 2% gain over the next 10 days.
I converted each date to a unique identifier ("uniqdttm"), then used .groupby to group all price gains by this identifier and .mean() to see the average of all price gains for each date.
import pandas as pd
import yfinance as yf
tick = yf.Ticker("AAPL")
df = tick.history(period="max", interval="1d", prepost=False, auto_adjust=True, back_adjust=False, actions=False)
nmm = "FutureGainDays-10"
df["gaintmp"] = df["Close"].pct_change(periods=10).fillna(0).mul(100)
df[nmm] = df["gaintmp"].shift(-10) # Shift back x days to see what the gain would have been
df["uniqdt"] = df.index.strftime('%b-%d')
df["tmpdate"] = "1972-"+df.index.strftime('%m-%d') # 1972 is a leap year needed for Feb29!
df["uniqdttm"] = pd.to_datetime(df["tmpdate"], format="%Y-%m-%d")
aggs = df.groupby("uniqdttm")[nmm].agg(['mean']).round(3)
print(aggs)
Output:
mean
uniqdttm
1972-01-02 0.960
1972-01-03 2.855
1972-01-04 0.905
1972-01-05 1.289
1972-01-06 1.601
... ...
1972-12-27 3.995
1972-12-28 2.732
1972-12-29 4.588
1972-12-30 4.359
1972-12-31 2.108
Where I am confused is how to see where the distribution of the mean lies in the data.
An example:
For date x, 5 price gains of:
-4
-6
+10
-2
+7
give a total gain of +5 (a mean of +1).
But trading this would be emotionally very difficult: 3 years of losses and 2 years of big gains.
5 price gains of +1, +2, -1, +1, +1 would give a total of +4 (less than the +5 above) but would be much easier to trade.
What statistical measure would give a better representation of how many positives are present in the samples - and how do I see this in the dataframe groupby output?
EDIT:
Is it also possible to see all the values that contributed to the mean in the groupby output, sorted smallest to largest, e.g.:
mean
uniqdttm
1972-01-02 1.0 -6 -4 -2 7 10
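A hedged sketch addressing both asks: the win rate (the fraction of positive samples) is one distribution measure groupby can compute alongside the mean, and apply can collect each group's sorted values. It builds on the frame above; the win_rate and values column names are made up for illustration:
aggs = df.groupby("uniqdttm")[nmm].agg(
    mean="mean",
    median="median",
    std="std",
    win_rate=lambda s: s.gt(0).mean(),  # fraction of years with a positive gain
).round(3)
# every value that fed each mean, sorted smallest to largest
aggs["values"] = df.groupby("uniqdttm")[nmm].apply(lambda s: sorted(s.round(1)))
print(aggs)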

Replace outliers with groupby average in multi-index dataframe

I have the following multi-index data frame, where ID and Year form the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame with the average of its company (and likewise for everything smaller than the 1st percentile).
ID  Year  ROA
1   2016  1.5
1   2017  0.8
1   2018  NaN
2   2016  0.7
2   2017  0.8
2   2018  0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
EDIT:
Alternatively, I thought of this function:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) |
                 (df['ROA'] > df['ROA'].quantile(0.99))]
for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the function's purpose: as written, it only excludes one year from the calculation of the mean, not several.
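For what it's worth, a hedged sketch of one way around both problems (assuming ID and Year form the multi-index as in the sample): mask every outlier to NaN first, so the per-company mean automatically excludes all outlier years, then write that mean back only onto the outlier rows.
lo, hi = df['ROA'].quantile([0.01, 0.99])
mask = df['ROA'].lt(lo) | df['ROA'].gt(hi)
# per-company mean over the non-outlier rows only (NaNs are skipped)
clean_mean = df['ROA'].where(~mask).groupby(level='ID').transform('mean')
df['ROA'] = df['ROA'].where(~mask, clean_mean)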

How to produce a new data frame of mean monthly data, given a data frame consisting of daily data?

I have a data frame containing the daily CO2 data since 2015, and I would like to produce the monthly mean data for each year, then put this into a new data frame. A sample of the data frame I'm using is shown below.
month day cycle trend
year
2011 1 1 391.25 389.76
2011 1 2 391.29 389.77
2011 1 3 391.32 389.77
2011 1 4 391.36 389.78
2011 1 5 391.39 389.79
... ... ... ... ...
2021 3 13 416.15 414.37
2021 3 14 416.17 414.38
2021 3 15 416.18 414.39
2021 3 16 416.19 414.39
2021 3 17 416.21 414.40
I plan on using something like the code below to create the new monthly-mean data frame, but the main problem I'm having is selecting the specific subset for each month of each year so that the mean can be taken over it. If I could select all rows for year "2015" and month "1" and then average those, and so on, that might work?
Any suggestions would be hugely appreciated and if I need to make any edits please let me know, thanks so much!
dfs = list()
for l in L:
    dfs.append(refined_data[index = 2015, "month" = 1. day <=31].iloc[l].mean(axis=0))
mean_matrix = pd.concat(dfs, axis=1).T
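A hedged alternative sketch: since a mean is wanted for every (year, month) pair, a single groupby avoids looping over subsets entirely. This assumes the frame is indexed by year with month as a column, as in the sample (groupby accepts index level names alongside column names):
# one row per (year, month), averaging the daily cycle/trend values
monthly_mean = refined_data.groupby(['year', 'month'])[['cycle', 'trend']].mean()
print(monthly_mean)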

Using groupby calculations in Pandas data frames

I am working on a geospatial project where I need to do some calculations between groups of data within a data frame. The data I am using spans several years and is specific to the Local Authority District (LAD) code; each year has a numerical ID.
I need to be able to calculate the mean average of a group of years within that data set relative to the LAD code.
LAC LAN JAN FEB MAR APR MAY JUN ID
K04000001 ENGLAND AND WALES 56597 43555 49641 88049 52315 42577 5
E92000001 ENGLAND 53045 40806 46508 83504 49413 39885 5
I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for ID 1:3 for example.
Which is more efficient: separating into separate dataframes stored in a dict, for example, or keeping everything in one dataframe and using an ID?
df.groupby('LAC').mean()
I come from a MATLAB background, so I am just getting the hang of the best way to do things.
Secondly, once these operations are complete, I would like to do the following:
(mean of IDs 1:5) - (mean of ID 6), using LAC as the key.
Sorry if I haven't explained this very well!
Edit: Expected output.
To be able to average a group of rows by specific ID for a given value of LAC.
For example:
Average monthly values for E92000001 rows with ID 3
LAC JAN FEB MAR APR MAY JUN ID
K04000001 56706 43653 49723 88153 52374 42624 5
K04000001 56597 43555 49641 88049 52315 42577 5
E92000001 49186 36947 42649 79645 45554 36026 5
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 68715 56476 62178 99174 65083 55555 4
E92000001 41075 28836 34538 71534 37443 27915 3
E92000001 54595 42356 48058 85054 50963 41435 1
Rows to be averaged:
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 41075 28836 34538 71534 37443 27915 3
Result
E92000001 47060 34821 40523 77519 43428 33900 3
edit: corrected error.
To match the update in your question: this will give you a dataframe with only one row for each ID-LAC combination, containing the average of all the rows that had that index.
df.groupby(['ID', 'LAC']).mean()
I would start by setting ID and LAC as the index (set_index with inplace=True returns None, so the two steps cannot be chained with inplace):
df = df.set_index(['ID', 'LAC']).sort_index()
Now you can group by the index levels and take the mean for every month, or even each row's running average since the first row:
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']
g = df.groupby(level=['ID', 'LAC'])[months]
expanding_mean = g.cumsum().div(g.cumcount() + 1, axis=0)
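For the follow-up calculation, a hedged sketch building on the ID/LAC index set up above (and assuming ID 6 is the one being subtracted):
by_id = df.groupby(level=['ID', 'LAC'])[months].mean()
mean_1_5 = by_id.loc[1:5].groupby(level='LAC').mean()  # mean over IDs 1-5, per LAC
mean_6 = by_id.xs(6, level='ID')                       # mean of ID 6, per LAC
diff = mean_1_5 - mean_6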

Extract Quarterly Data from Multi Quarter Periods

Public companies in the US make quarterly filings (10-Q) and yearly filings (10-K). In most cases they will file three 10Qs per year and one 10K.
In most cases, the quarterly filings (10Qs) contain quarterly data. For example, "revenue for the three months ending March 31, 2005."
The yearly filings will often only have year end sums. For example: "revenue for the twelve months ending December 31, 2005."
In order to get the value for Q4 of 2005, I need to take the yearly data and subtract the values for each of the quarters (Q1-Q3).
In some cases, the quarterly data is expressed year to date. For example, the first quarterly filing is "revenue for the three months ending March 31, 2005," the second is "revenue for the six months ending June 30, 2005," the third is "revenue for the nine months ending September 30, 2005," and the yearly filing, as above, is "revenue for the twelve months ending December 31, 2005." This is a generalization of the problem above: single-quarter data can be recovered by repeatedly subtracting the previous period's figures.
My question is what is the best way in pandas to accomplish this quarterly data extraction?
There are a large number of fields (revenue, profit, expenses, etc.) per period.
A related question that I asked in regards to how to express this period data in pandas: Creating Period for Multi Quarter Timespan in Pandas
Here is some example data of the first problem (three 10Qs and one 10K which only has year end data):
10Q:
http://www.sec.gov/Archives/edgar/data/1174922/000119312512225309/d326512d10q.htm#tx326512_4
http://www.sec.gov/Archives/edgar/data/1174922/000119312512347659/d360762d10q.htm#tx360762_3
http://www.sec.gov/Archives/edgar/data/1174922/000119312512463380/d411552d10q.htm#tx411552_3
10K:
http://www.sec.gov/Archives/edgar/data/1174922/000119312513087674/d459372d10k.htm#tx459372_29
Calcbench refers to this problem at http://www.calcbench.com/Home/userGuide: "Q4 calculation: Companies often do not report Q4 data, rather opting to report full year data instead. We'll automatically calculate it for you. Data in blue is calculated."
There will be multiple years of data and for each year I want to calculate the missing fourth quarter:
2012Q2 2012Q3 2012Y 2013Q1 2013Q2 2013Q3 2013Y
Revenue 1 1 1 1 1 1 1
Expense 10 10 10 10 10 10 10
You could define a function to subtract the quarterly totals from the annual number, and then apply the function to each row, storing the result in a new column.
In [2]: df
Out[2]:
Annual Q1 Q2 Q3
Revenue 18 3 4 5
Expense 17 2 3 4
In [3]: def calc_Q4(row):
   ...:     return row['Annual'] - row['Q1'] - row['Q2'] - row['Q3']
In [4]: df['Q4'] = df.apply(calc_Q4, axis = 1)
In [5]: df
Out[5]:
Annual Q1 Q2 Q3 Q4
Revenue 18 3 4 5 6
Expense 17 2 3 4 8
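For the year-to-date variant described in the question, the same idea generalizes to row-wise differences. A hedged sketch with made-up cumulative numbers (column names are illustrative):
import pandas as pd

# hypothetical cumulative (year-to-date) totals, columns in period order
ytd = pd.DataFrame({'Q1': [3, 2], 'Q2': [7, 5], 'Q3': [12, 9], 'FY': [18, 17]},
                   index=['Revenue', 'Expense'])
quarters = ytd.diff(axis=1)                        # each column minus the one before it
quarters['Q1'] = ytd['Q1']                         # Q1 has no prior period to subtract
quarters = quarters.rename(columns={'FY': 'Q4'})   # the FY difference is Q4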
I work for Calcbench.
I wrote an API for Calcbench and have an example of getting SEC data into Pandas dataframes: https://www.calcbench.com/home/api.
You would need to sign up for Calcbench to use it.
