Rolling Correlation of Multi-Column Pandas DataFrame - python

I am trying to calculate and then visualize the rolling correlation between multiple columns over a 180-day window (3 days in this example).
My data is formatted like this (the original file has 12 columns plus the timestamp, and thousands of rows):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp" : ['1993-11-01' ,'1993-11-02', '1993-11-03', '1993-11-04','1993-11-15'], "Austria" : [6.18 ,6.18, 6.17, 6.17, 6.40],"Belgium" : [7.05, 7.05, 7.2, 7.5, 7.6],"France" : [7.69, 7.61, 7.67, 7.91, 8.61]},index = [1, 2, 3,4,5])
    Timestamp  Austria  Belgium  France
1  1993-11-01     6.18     7.05    7.69
2  1993-11-02     6.18     7.05    7.61
3  1993-11-03     6.17     7.20    7.67
4  1993-11-04     6.17     7.50    7.91
5  1993-11-15     6.40     7.60    8.61
I can't simply use the following, because the Timestamp column raises a conversion error:
df.rolling(2).corr(df)
ValueError: could not convert string to float: '1993-11-01'
When I drop the Timestamp column I get a result of 1.0 for every cell, which is also not right, and additionally I lose the Timestamp, which I will need for the final visualization:
df_drop = df.drop(columns=['Timestamp'])
df_drop.rolling(2).corr(df_drop)
   Austria  Belgium  France
1      NaN      NaN     NaN
2      NaN      NaN     1.0
3      1.0      1.0     1.0
4     -inf      1.0     1.0
5      1.0      1.0     1.0
Does anyone have experience with computing a rolling correlation over multiple columns while keeping a date index?

Building on Shreyans Jain's answer, I propose the following. It should work with an arbitrary number of columns:
import itertools as it

# omit the timestamp column
cols = list(df.columns)[1:]
# -> ['Austria', 'Belgium', 'France']
col_pairs = list(it.combinations(cols, 2))
# -> [('Austria', 'Belgium'), ('Austria', 'France'), ('Belgium', 'France')]

res = pd.DataFrame()
for pair in col_pairs:
    # use the first three letters of each column name in the pair
    corr_name = f"{pair[0][:3]}_{pair[1][:3]}_corr"
    res[corr_name] = df[list(pair)].\
        rolling(min_periods=1, window=3).\
        corr().iloc[0::2, -1].reset_index(drop=True)
print(res)
Aus_Bel_corr Aus_Fra_corr Bel_Fra_corr
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.000000 -0.277350 0.277350
3 -0.755929 -0.654654 0.989743
4 0.693375 0.969346 0.849167
The NaN values at the beginning result from the windowing: with a single observation, or a constant series within the window, the correlation is undefined.
Update: I uploaded a notebook with detailed explanations of what happens inside the loop:
https://github.com/cknoll/demo-material/blob/main/pandas/pandas_rolling_correlation_iloc.ipynb
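In short, rolling(...).corr() on a column pair returns a frame with a MultiIndex of (original row label, column name), i.e. two rows per original row, and taking every second row of the last column extracts the cross-correlation per window. A minimal sketch of that intermediate step, assuming the example frame from the question:
# the pairwise rolling corr has two rows per original row:
# (row label, 'Austria') and (row label, 'Belgium')
pair_corr = df[['Austria', 'Belgium']].rolling(min_periods=1, window=3).corr()
# every second row of the last column is corr(Austria, Belgium)
# for the window ending at that row
aus_bel = pair_corr.iloc[0::2, -1].reset_index(drop=True)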

You can probably calculate the pair-wise correlations like this, instead of going for all three at once.
Once you have each correlation, you can directly add it as a column as well, preserving the timestamp:
df['Aus_Bel_corr'] = df[['Austria', 'Belgium']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Bel_Fra_corr'] = df[['Belgium', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)
df['Aus_Fra_corr'] = df[['Austria', 'France']].rolling(min_periods=1, window=3).corr().iloc[0::2, -1].reset_index(drop=True)

I guess there is another way:
df['Aus_Bel_corr'] = df['Austria']\
    .rolling(min_periods=1, window=3)\
    .corr(df['Belgium'])
I think it is a little simpler than the previous answer.
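If you want this for every pair while keeping the Timestamp column for plotting, the pair loop from the first answer combines naturally with this form (a sketch, assuming the example frame from the question):
import itertools as it

cols = [c for c in df.columns if c != 'Timestamp']
for a, b in it.combinations(cols, 2):
    df[f'{a[:3]}_{b[:3]}_corr'] = df[a].rolling(min_periods=1, window=3).corr(df[b])
# the Timestamp column is untouched, so it can serve as the x-axis when plotting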


Pandas combine two data series into one series

I need to combine the data series rateScore and rate into one.
This is the current DataFrame I have:
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and then dropping rate (this assumes no row has a value in both columns):
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Let us use add:
df['rateScore'] = df['rateScore'].add(df['rate'], fill_value=0)
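For reference, a minimal, self-contained sketch of the combine_first variant on the sample data (index labels taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'rateScore': [np.nan, 2.5, 4.5, np.nan],
                   'rate':      [4.5, np.nan, np.nan, 5.0]},
                  index=[10, 11, 12, 13])
# fill the gaps in rateScore from rate, then drop rate;
# the index labels 10..13 are preserved
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
df = df.drop(columns='rate')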

Python pandas: how to fill values between existing ones in dataframe column?

I have a pandas DataFrame with 3 columns. The first column contains string values in ascending order at a certain frequency (e.g. '20173070000', '20173070020', '20173070040', etc.). The second and third columns contain corresponding numeric values. I would like to re-sample the first column to unit steps ('20173070000', '20173070001', '20173070002', ...), simultaneously filling the second and third columns with NaN values, and then I would like to interpolate those NaN values.
I've looked into re-sampling, but that appears to work only for datetime values. I have also looked into pd.interpolate, but that interpolates existing missing values. As stated above, my dataset does not contain missing data; I am simply looking to increase the frequency of my entries, i.e. to fill between existing values.
To give some reference, my current DataFrame looks like this:
0 1 2
0 20173070000 14.0 13.9
1 20173070020 14.1 14.1
2 20173070040 13.8 13.6
3 20173070060 13.7 13.7
4 20173070080 13.8 13.5
5 20173070100 13.9 14.0
I would like to generate a DataFrame that looks like:
0 1 2
0 20173070000 14.0 13.9
1 20173070001 NaN NaN
2 20173070002 NaN NaN
3 20173070003 NaN NaN
4 20173070004 NaN NaN
5 20173070005 NaN NaN
...
20 20173070020 14.1 14.1
21 20173070021 NaN NaN
...
I have no problem sorting out the interpolation afterwards, but I have not worked out how to upsample yet.
You can just use the reindex function. By default, it places NaN at locations of the new index that had no value in the original index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20173070000, 20173070020, 20173070040, 20173070060, 20173070080, 20173070100],
                   'B': [14.0, 14.1, 13.8, 13.7, 13.8, 13.9],
                   'C': [13.9, 14.1, 13.6, 13.7, 13.5, 14.0]})
df.set_index('A').reindex(np.arange(np.min(df.A), np.max(df.A) + 1)).reset_index()
I believe interpolate() is the way to go for you. After having upsampled as you described, and given that the column containing the values you want to interpolate is called 'val1', you can do:
df.loc[:, 'val1'] = df.loc[:, 'val1'].interpolate()
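Putting the two answers together, a minimal end-to-end sketch (column names taken from the reindex example above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20173070000, 20173070020, 20173070040],
                   'B': [14.0, 14.1, 13.8],
                   'C': [13.9, 14.1, 13.6]})
# upsample: one row per unit step of A, with NaN in the new rows
full = df.set_index('A').reindex(np.arange(df.A.min(), df.A.max() + 1))
# linearly interpolate the gaps in all value columns, then restore A as a column
full = full.interpolate().reset_index()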

Python Pandas Simple Moving Average (deprecated pd.rolling_mean) [duplicate]

I would like to add a moving average calculation to my exchange time series.
Original data from Quandl
Exchange = Quandl.get("BUNDESBANK/BBEX3_D_SEK_USD_CA_AC_000",
                      authtoken="xxxxxxx")
# Value
# Date
# 1989-01-02 6.10500
# 1989-01-03 6.07500
# 1989-01-04 6.10750
# 1989-01-05 6.15250
# 1989-01-09 6.25500
# 1989-01-10 6.24250
# 1989-01-11 6.26250
# 1989-01-12 6.23250
# 1989-01-13 6.27750
# 1989-01-16 6.31250
# Calculating Moving Average
MovingAverage = pd.rolling_mean(Exchange,5)
# Value
# Date
# 1989-01-02 NaN
# 1989-01-03 NaN
# 1989-01-04 NaN
# 1989-01-05 NaN
# 1989-01-09 6.13900
# 1989-01-10 6.16650
# 1989-01-11 6.20400
# 1989-01-12 6.22900
# 1989-01-13 6.25400
# 1989-01-16 6.26550
I would like to add the calculated moving average as a new column to the right of Value, using the same index (Date). Preferably I would also like to rename the calculated moving average column to MA.
The rolling mean returns a Series; you only have to add it as a new column of your DataFrame (MA) as described below.
For information, the rolling_mean function has been deprecated in newer pandas versions. I have used the new method in my example; see below a quote from the pandas documentation.
Warning: Prior to version 0.18.0, pd.rolling_*, pd.expanding_*, and pd.ewm* were module level functions and are now deprecated. These are replaced by using the Rolling, Expanding and EWM objects and a corresponding method call.
df['MA'] = df['Value'].rolling(window=5).mean()
print(df)
# Value MA
# Date
# 1989-01-02 6.11 NaN
# 1989-01-03 6.08 NaN
# 1989-01-04 6.11 NaN
# 1989-01-05 6.15 NaN
# 1989-01-09 6.25 6.14
# 1989-01-10 6.24 6.17
# 1989-01-11 6.26 6.20
# 1989-01-12 6.23 6.23
# 1989-01-13 6.28 6.25
# 1989-01-16 6.31 6.27
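If you would rather have values than NaN for the first rows, rolling also accepts min_periods, which emits a partial-window average until the window is full:
df['MA'] = df['Value'].rolling(window=5, min_periods=1).mean()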
A moving average can also be calculated and visualized directly in a line chart by using the following code:
Example using stock price data:
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime
plt.style.use('ggplot')
# Input variables
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2018, 3, 29)
stock = 'WFC'
# Extracting data
df = web.DataReader(stock, 'morningstar', start, end)
df = df['Close']
print(df)
plt.plot(df['WFC'],label= 'Close')
plt.plot(df['WFC'].rolling(9).mean(),label= 'MA 9 days')
plt.plot(df['WFC'].rolling(21).mean(),label= 'MA 21 days')
plt.legend(loc='best')
plt.title('Wells Fargo\nClose and Moving Averages')
plt.show()
Tutorial on how to do this: https://youtu.be/XWAPpyF62Vg
In case you are calculating more than one moving average:
for i in range(2, 10):
    df['MA{}'.format(i)] = df['Value'].rolling(window=i).mean()
Then you can take an aggregate average of all the MA columns:
df[[f for f in list(df) if "MA" in f]].mean(axis=1)
To get the moving average in pandas we can use cumsum and then divide by the count.
Here is the working example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(5),
                   'value': range(100, 600, 100)})
# cumulative mean via cumulative sum and a running count
df['cum_sum'] = df['value'].cumsum()
df['count'] = range(1, len(df['value']) + 1)
df['mov_avg'] = df['cum_sum'] / df['count']
# fixed-window rolling mean for comparison
df['rolling_mean2'] = df['value'].rolling(window=2).mean()
print(df)
Output:
id value cum_sum count mov_avg rolling_mean2
0 0 100 100 1 100.0 NaN
1 1 200 300 2 150.0 150.0
2 2 300 600 3 200.0 250.0
3 3 400 1000 4 250.0 350.0
4 4 500 1500 5 300.0 450.0
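Note that pandas can compute this cumulative mean directly with an expanding window, which avoids the helper columns (equivalent to cum_sum / count above):
# expanding() grows the window from the start of the series,
# so expanding().mean() is the cumulative moving average
df['mov_avg_expanding'] = df['value'].expanding().mean()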

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas DataFrame containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values, which take a certain time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame: the prices of the first 5 days are needed to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-Lib for example, the tulip library does not provide NaN values for those first few empty days...
Is there an easy way to prepend these NaNs to the ndarray?
Or to insert the values into df at a certain index, and have it create NaN for the rows before automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code:
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
df.loc[df.index[period - 1:], 'mdi_14'] = mdi14
I hope they will fill the first values with NaN in the lib in the future. It's dangerous to leave time series data unlabelled like this.
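As a reusable variant, you can left-pad the indicator output with NaN so that it matches the frame length. pad_left below is a hypothetical helper, not part of tulipy:
import numpy as np

def pad_left(values, total_len):
    # hypothetical helper: prepend NaN so the array matches the frame length
    return np.concatenate([np.full(total_len - len(values), np.nan), values])

df['mdi_14'] = pad_left(mdi14, len(df))
df['pdi_14'] = pad_left(pdi14, len(df))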
Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.row_stack([b, a])
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, the Python bindings library, tulipy, does not appear to include this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
import numpy as np
import pandas as pd
import tulipy as ti

# Create the dataframe with close prices; a list keeps the order
# (a set literal, as in the original post, is unordered)
prices = pd.DataFrame(data=[81.06, 81.59, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
# Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real=prices['close'].to_numpy(), period=5, stddev=2)))
# Dynamically realign the index; note from the tulip library documentation that the
# price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
# Put the indicator values into the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129

python pandas: get rolling value of one DataFrame by rolling index of another DataFrame

I have two dataframes: one has multiple column levels, and the other has a single-level column (which is the first level of the first dataframe; or put differently, the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
(The original post showed the two dataframes, df1 and df2, as images.)
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of the rolling max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little. For example, during the five days (the window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 occurred on 2016/2/24, and the index of 2016/2/24 within that five-day range is 1. So, in result1, the value for stock sh600870 on 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as an example: the stock sh600870 is in the sector '家用电器视听器材白色家电' (household appliances / AV equipment / white goods). So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling window. To make the index relative to df1, add the index of the left edge of the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note: in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
         .rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
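In recent pandas versions you can additionally pass raw=True, so that each window is handed to the function as a plain NumPy array instead of a Series, which is faster for NumPy functions like np.argmax:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax, raw=True)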
