Python Pandas Simple Moving Average (deprecated pd.rolling_mean) [duplicate] - python

I would like to add a moving average calculation to my exchange time series.
Original data from Quandl
Exchange = Quandl.get("BUNDESBANK/BBEX3_D_SEK_USD_CA_AC_000",
authtoken="xxxxxxx")
# Value
# Date
# 1989-01-02 6.10500
# 1989-01-03 6.07500
# 1989-01-04 6.10750
# 1989-01-05 6.15250
# 1989-01-09 6.25500
# 1989-01-10 6.24250
# 1989-01-11 6.26250
# 1989-01-12 6.23250
# 1989-01-13 6.27750
# 1989-01-16 6.31250
# Calculating Moving Avarage
MovingAverage = pd.rolling_mean(Exchange,5)
# Value
# Date
# 1989-01-02 NaN
# 1989-01-03 NaN
# 1989-01-04 NaN
# 1989-01-05 NaN
# 1989-01-09 6.13900
# 1989-01-10 6.16650
# 1989-01-11 6.20400
# 1989-01-12 6.22900
# 1989-01-13 6.25400
# 1989-01-16 6.26550
I would like to add the calculated Moving Average as a new column to the right after Value using the same index (Date). Preferably I would also like to rename the calculated moving average to MA.

The rolling mean returns a Series you only have to add it as a new column of your DataFrame (MA) as described below.
For information, the rolling_mean function has been deprecated in pandas newer versions. I have used the new method in my example, see below a quote from the pandas documentation.
Warning Prior to version 0.18.0, pd.rolling_*, pd.expanding_*, and pd.ewm* were module level functions and are now deprecated. These are replaced by using the Rolling, Expanding and EWM. objects and a corresponding method call.
df['MA'] = df.rolling(window=5).mean()
print(df)
# Value MA
# Date
# 1989-01-02 6.11 NaN
# 1989-01-03 6.08 NaN
# 1989-01-04 6.11 NaN
# 1989-01-05 6.15 NaN
# 1989-01-09 6.25 6.14
# 1989-01-10 6.24 6.17
# 1989-01-11 6.26 6.20
# 1989-01-12 6.23 6.23
# 1989-01-13 6.28 6.25
# 1989-01-16 6.31 6.27

A moving average can also be calculated and visualized directly in a line chart by using the following code:
Example using stock price data:
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime
plt.style.use('ggplot')
# Input variables
start = datetime.datetime(2016, 1, 01)
end = datetime.datetime(2018, 3, 29)
stock = 'WFC'
# Extrating data
df = web.DataReader(stock,'morningstar', start, end)
df = df['Close']
print df
plt.plot(df['WFC'],label= 'Close')
plt.plot(df['WFC'].rolling(9).mean(),label= 'MA 9 days')
plt.plot(df['WFC'].rolling(21).mean(),label= 'MA 21 days')
plt.legend(loc='best')
plt.title('Wells Fargo\nClose and Moving Averages')
plt.show()
Tutorial on how to do this: https://youtu.be/XWAPpyF62Vg

In case you are calculating more than one moving average:
for i in range(2,10):
df['MA{}'.format(i)] = df.rolling(window=i).mean()
Then you can do an aggregate average of all the MA
df[[f for f in list(df) if "MA" in f]].mean(axis=1)

To get the moving average in pandas we can use cum_sum and then divide by count.
Here is the working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': range(5),
'value': range(100,600,100)})
# some other similar statistics
df['cum_sum'] = df['value'].cumsum()
df['count'] = range(1,len(df['value'])+1)
df['mov_avg'] = df['cum_sum'] / df['count']
# other statistics
df['rolling_mean2'] = df['value'].rolling(window=2).mean()
print(df)
output
id value cum_sum count mov_avg rolling_mean2
0 0 100 100 1 100.0 NaN
1 1 200 300 2 150.0 150.0
2 2 300 600 3 200.0 250.0
3 3 400 1000 4 250.0 350.0
4 4 500 1500 5 300.0 450.0

Related

Rolling Correlation of Multi-Column Panda

I am trying to calcualte and then visualize the rolling correlation between multiple columns in a 180 (3 in this example) days window.
My data is formatted like that (in the orginal file there are 12 columns plus the timestamp and thousands of rows):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])
Timestamp Austria Belgium France
1 1993-11-01 6.18 7.05 7.69
2 1993-11-02 6.18 7.05 7.61
3 1993-11-03 6.17 7.20 7.67
4 1993-11-04 6.17 7.50 7.91
5 1993-11-15 6.40 7.60 8.61
I cant just use this formula, because I get a formatting error if I do because of the Timestamp column:
df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'
When I drop the Timestamp column I get a result of 1.0 for every cell, thats also not right and additionally I lose the Timestamp which I will need for the visualization graph in the end.
df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)
Austria Belgium France
1 NaN NaN NaN
2 NaN NaN 1.0
3 1.0 1.0 1.0
4 -inf1.0 1.0
5 1.0 1.0 1.0
Any experiences how to do the rolling correlation with multiple columns and a data index?
Building on the answer of Shreyans Jain I propose the following. It should work with an arbitrary number of columns:
import itertools as it
# omit timestamp-col
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]
res = pd.DataFrame()
for pair in col_pairs:
# select the first three letters of each name of the pair
corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
res[corr_name] = df[list(pair)].\
rolling(min_periods=1, window=3).\
corr().iloc[0::2, -1].reset_index(drop=True)
print(str(res))
Aus_Bel_corr Aus_Fra_corr Bel_Fra_corr
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.000000 -0.277350 0.277350
3 -0.755929 -0.654654 0.989743
4 0.693375 0.969346 0.849167
The NaN-Values at the beginning result from the windowing.
Update: I uploaded a notebook with detailed explanations for what happens inside the loop.
https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb
You can probably calculate pair-wise correlation like this, instead of going for all 3 at once.
Once you have the correlation, you can directly add them as your columns as well, preserving the timestamp.
df['Aus_Bel_corr'] = df[['Austria','Belgium']].rolling(min_periods = 1, window = 3).corr().iloc[0::2,-1].reset_index(drop = True)
df['Bel_Fin_corr'] = df[['Belgium','Finland']].rolling(min_periods = 1, window = 3).corr().iloc[0::2,-1].reset_index(drop = True)
df['Aus_Fin_corr'] = df[['Austria','Finland']].rolling(min_periods = 1, window = 3).corr().iloc[0::2,-1].reset_index(drop = True)```
I guess that there is an another way.
df['Aus_Bel_corr'] = df['Austria']\
.rolling(min_periods = 1, window = 3)\
.corr(df['Belgium'])
For me, I think it is a little simple than the previous answer.

How to get the datetime index corresponding to the n smallest values of a column in python

I am creating a variable 'spike' as an indicator variable that is 1 for the date corresponding to old column, Cost, n smallest values and a 0 otherwise. The code illustrated below is apart of a larger for loop.
I can only get results using the idxmin() function. I would like help in getting the index for the n smallest values.
import pandas as pd
import numpy as np
df3 = pd.DataFrame({'Dept':['A', 'A', 'B', 'B'],
'Benefit':[2000,25,55,400],
'Cost':[1000, 500, 1500, 2000]})
# Let's create an index using Timestamps
index_ = [pd.Timestamp('01-06-2018'), pd.Timestamp('04-06-2018'),
pd.Timestamp('07-06-2018'), pd.Timestamp('10-06-2018')]
df3.index = index_
print(df3)
df3.index = index_
print(df3)
df3['spike'] = np.where(df3.index.isin(lookup), 1, 0)
If you sort, then you can get the top-3 with standard Python / numpy array slicing.
low_cost = df3.sort_values('Cost')[:3]
low_cost
# Dept Benefit Cost
# 2018-04-06 A 25 500
# 2018-01-06 A 2000 1000
# 2018-07-06 B 55 1500
To get the spike column, for efficiency I would recommend a join.
spikes = low_cost.assign(spike=1)[['spike']]
spikes
# spike
# 2018-04-06 1
# 2018-01-06 1
# 2018-07-06 1
df3.join(spikes, how='left').fillna(0)
# Dept Benefit Cost spike
# 2018-01-06 A 2000 1000 1.0
# 2018-04-06 A 25 500 1.0
# 2018-07-06 B 55 1500 1.0
# 2018-10-06 B 400 2000 0.0

Trouble with for loop in a function and combing multiple series output

I'm new to python and am struggling to figure something out. I'm doing some data analysis on an invoice database in pandas with columns of $ amounts, credits, date, and a unique company ID for each package bought.
I want to run every unique company id through a function that will calculate the average spend rate of these credits based on the difference of package purchase dates. I have the basics figured out in my function, and it returns a series indexed to the original dataframe with the values of the average amount of credits spent each day between packages. However, I only have it working with one company ID at a time, and I don'tknow what kind of process I can do to combine all of these different series for each company id to be able to correctly add a new column onto my dataframe with this average credit spend value for each package. Here's my code so far:
def creditspend(mylist = []):
for i in mylist:
a = df.loc[df['CompanyId'] == i]
a = a.sort_values(by=['Date'], ascending=False)
days = a.Date.diff().map(lambda x: abs(x.days))
spend = a['Credits']/days
print(spend)
If I call
creditspend(mylist=[8, 15])
(with multiple inputs) it obviously does not work. What do I need to do to complete this function?
Thanks in advance.
apply() is a very useful method in pandas that applies a function to every row or column of a DataFrame.
So, if your DataFrame is df:
def creditspend(row):
# some calculation code here
return spend
df['spend_rate'] = df.apply(creditspend)
(You can also use apply() on columns with the axis=1 keyword.)
Consider a groupby for a CompanyID aggregation. Below demonstrates with random data:
import numpy as np
import pandas as pd
np.random.seed(7182018)
df = pd.DataFrame({'CompanyID': np.random.choice(['julia', 'pandas', 'r', 'sas', 'stata', 'spss'],50),
'Date': np.random.choice(pd.Series(pd.date_range('2018-01-01', freq='D', periods=180)), 50),
'Credits': np.random.uniform(0,1000,50)
}, columns=['Date', 'CompanyID', 'Credits'])
# SORT ONCE OUTSIDE OF PROCESSING
df = df.sort_values(by=['CompanyID', 'Date'], ascending=[True, False]).reset_index(drop=True)
def creditspend(g):
g['days'] = g.Date.diff().map(lambda x: abs(x.days))
g['spend'] = g['Credits']/g['days']
return g
grp_df = df.groupby('CompanyID').apply(creditspend)
Output
print(grp_df.head(20))
# Date CompanyID Credits days spend
# 0 2018-06-20 julia 280.522287 NaN NaN
# 1 2018-06-12 julia 985.009523 8.0 123.126190
# 2 2018-05-17 julia 892.308179 26.0 34.319545
# 3 2018-05-03 julia 97.410360 14.0 6.957883
# 4 2018-03-26 julia 480.206077 38.0 12.637002
# 5 2018-03-07 julia 78.892365 19.0 4.152230
# 6 2018-03-03 julia 878.671506 4.0 219.667877
# 7 2018-02-25 julia 905.172807 6.0 150.862135
# 8 2018-02-19 julia 970.016418 6.0 161.669403
# 9 2018-02-03 julia 669.073067 16.0 41.817067
# 10 2018-01-23 julia 636.926865 11.0 57.902442
# 11 2018-01-11 julia 790.107486 12.0 65.842291
# 12 2018-06-16 pandas 639.180696 NaN NaN
# 13 2018-05-21 pandas 565.432415 26.0 21.747401
# 14 2018-04-22 pandas 145.232115 29.0 5.008004
# 15 2018-04-13 pandas 379.964557 9.0 42.218284
# 16 2018-04-12 pandas 538.168690 1.0 538.168690
# 17 2018-03-20 pandas 783.572993 23.0 34.068391
# 18 2018-03-14 pandas 618.354489 6.0 103.059081
# 19 2018-02-10 pandas 859.278127 32.0 26.852441

python pandas:get rolling value of one Dataframe by rolling index of another Dataframe

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as example, the stock sh600870 is in sector ’家用电器视听器材白色家电‘. So in 2016/2/29, I wanna get the sector price in 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
sector, stock = col
mask = pd.notnull(index[col])
idx = index.loc[mask, col].astype(int)
result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)

Pandas: using multiple functions in a group by

My data has ages, and also payments per month.
I'm trying to aggregate summing the payments, but without summing the ages (averaging would work).
Is it possible to use different functions for different columns?
You can pass a dictionary to agg with column names as keys and the functions you want as values.
import pandas as pd
import numpy as np
# Create some randomised data
N = 20
date_range = pd.date_range('01/01/2015', periods=N, freq='W')
df = pd.DataFrame({'ages':np.arange(N), 'payments':np.arange(N)*10}, index=date_range)
print(df.head())
# ages payments
# 2015-01-04 0 0
# 2015-01-11 1 10
# 2015-01-18 2 20
# 2015-01-25 3 30
# 2015-02-01 4 40
# Apply np.mean to the ages column and np.sum to the payments.
agg_funcs = {'ages':np.mean, 'payments':np.sum}
# Groupby each individual month and then apply the funcs in agg_funcs
grouped = df.groupby(df.index.to_period('M')).agg(agg_funcs)
print(grouped)
# ages payments
# 2015-01 1.5 60
# 2015-02 5.5 220
# 2015-03 10.0 500
# 2015-04 14.5 580
# 2015-05 18.0 540

Categories