Rolling correlation pandas - python

I have a dataframe with time series.
I'd like to compute the rolling correlation (periods=20) between columns.
import numpy as np
import pandas as pd

store_corr = []  # empty list to store the rolling correlation of each pair
names = []       # empty list to store the column-pair names
df = df.pct_change(periods=1).dropna(axis=0)  # prepare the DataFrame of returns
for i in range(0, len(df.columns)):
    for j in range(i, len(df.columns)):
        corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
        names.append('col ' + str(i) + ' - col ' + str(j))
        store_corr.append(corr)
df_corr = pd.DataFrame(np.transpose(np.array(store_corr)), columns=names)
This solution works and gives me the rolling correlations (credit to Austin Mackillop in the comments).
Is there a faster way, i.e. one that avoids the double for loop?

This line:
corr=df.rolling(20).corr(df[df.columns[i]],df[df.columns[j]])
will produce an error because the second argument of corr (pairwise) expects a bool, but you passed a DataFrame, which has an ambiguous truth value. You can view the docs here.
Does applying the rolling method to the first DataFrame in the second line of code that you provided achieve what you are trying to do?
corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
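If the goal is to avoid the double for loop entirely, one possible sketch (not part of the original answer) is to let pandas compute every pairwise rolling correlation in a single call: when corr is given no other argument it defaults to pairwise mode and returns a MultiIndexed DataFrame, indexed by (original index, column name). This assumes df is the returns DataFrame prepared above.

all_corr = df.rolling(20).corr()  # all pairwise rolling correlations at once
# e.g. extract the rolling correlation between the first two columns:
pair = all_corr.xs(df.columns[0], level=1)[df.columns[1]]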

Related

How to run different functions in different parts of a dataframe in python?

I have a DataFrame (df).
I need to find the standard deviation DataFrame from it. For the first row I want to use the traditional variance formula:
variance = sum((x - mean(x))**2) / n
and from the second row onwards (row i) I want to use the following recursion, where "previous row" means row i-1:
d[i] = lamb * (variance in row i-1) + (1 - lamb) * (returns in row i-1)**2
# Generate sample DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
# Generate the returns DataFrame
returns = df.pct_change()
# Generate a new all-zero DataFrame
d = pd.DataFrame(0, index=np.arange(len(returns)), columns=returns.columns)
# Populate the first row
lamb = 0.94
d.iloc[0] = list(returns.var())
Now my question is how to populated the second row till the end using the second formula?
It should be something like
d[1:].agg(lambda x: lamb*x.shift(-1) + (1-lamb)*returns[:2])
but it obviously returned a long error.
Could you please help?
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i-1] + (1 - lamb) * returns.iloc[i-1]**2
I'm not completely sure this gives the right answer, but it won't throw an error. Using a plain for loop with .iloc to iterate over the rows should do the job for you, provided you plug in the correct formula. (Note that .agg and .apply aren't needed here; the loop assigns each row directly from the previous one.)
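For completeness, here is a self-contained sketch of the whole computation, assuming the recursion above (d[i] = lamb*d[i-1] + (1 - lamb)*returns[i-1]**2) is the one intended; the NaN row that pct_change creates is dropped so the recursion sees real numbers:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
returns = df.pct_change().dropna().reset_index(drop=True)  # drop the NaN first row
d = pd.DataFrame(0.0, index=np.arange(len(returns)), columns=returns.columns)
lamb = 0.94
d.iloc[0] = returns.var()  # seed row 0 with the sample variance
for i in range(1, len(d)):
    # EWMA-style update from the previous variance and previous squared return
    d.iloc[i] = lamb * d.iloc[i-1] + (1 - lamb) * returns.iloc[i-1]**2
std = np.sqrt(d)  # the standard deviation DataFrame, if that is the end goal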

Ambiguous behaviour of pandas aggregate function

So I was using pandas for some analysis and ran into a scenario in which I had to run 2 functions for different groups in my data, and I decided to use pandas' .agg function.
Below are my 2 functions:
from statsmodels.tsa.stattools import acf, pacf
import numpy as np

def pick_pacf(df, alpha=0.05, nlags=192):
    '''
    This function returns the lags in the time series which are highly correlated with the original time series.
    Input
        1. df: pandas Series, the column for which we are trying to find the AR lag
        2. alpha: float, confidence interval
        3. nlags: int, the number of lags to be tested
    Return
        1. lags: list, all the lags (# of timestamps) that are highly correlated
    '''
    values, conf_int = pacf(df.values, alpha=alpha, nlags=nlags)
    lags = []
    # in the pacf function, the confidence interval is centered around the pacf values;
    # we need it centered around 0, which produces the intervals we see in the graph
    conf_int_cntrd = [value[0] - value[1] for value in zip(conf_int, values)]
    for obs_index, obs in enumerate(zip(conf_int_cntrd, values)):
        if obs[1] >= obs[0][1]:    # obs[0][1] contains the high value of the conf int
            lags.append(obs_index)
        elif obs[1] <= obs[0][0]:  # obs[0][0] contains the low value of the conf int
            lags.append(obs_index)
    lags.remove(0)  # remove lag 0, the series' correlation with itself
    return lags
def pick_acf(df, nlags=192):
    '''
    This function returns the ACF cutoff value for an MA model of a time series.
    Input
        1. df: pandas Series, the series for which we want to find the ACF value
        2. nlags: the number of lags to be taken into consideration for the ACF
    Returns
        1. q: numpy array, the lag values at which the ACF cuts off
    '''
    acf_values = acf(df.values, nlags=nlags)
    acf_values = np.round(acf_values, 1)
    q = np.where(acf_values == 0)[0]
    return q
No need to go through the functions line by line (you can if you want to), but the main thing to focus on here is what the two functions return: pick_pacf returns a list, whereas pick_acf returns a numpy array.
The calls to these functions are like this:
pacf_values = train_ads[['INVERTER_ID','PER_TS_YIELD']].groupby('INVERTER_ID').agg(lambda x: pick_pacf(x))
acf_values = train_ads[['INVERTER_ID','PER_TS_YIELD']].groupby('INVERTER_ID').agg(lambda x: pick_acf(x))
PER_TS_YIELD is a numeric column and INVERTER_ID is an alphanumeric column.
The ambiguous behaviour here is that when I call the pick_pacf function, only the pandas Series of the PER_TS_YIELD column is sent as input to the function for each INVERTER_ID.
Whereas when I call the pick_acf function, first the pandas Series of the PER_TS_YIELD column is sent, and then the whole DataFrame made up of the INVERTER_ID and PER_TS_YIELD columns is sent to the function. This leads to an error, as I am doing calculations that error out when the alphanumeric column is received.
Why is this happening? Does the behaviour of the .agg function depend on what is being returned from the user-defined function?
Can someone please explain this to me. Thanks in advance.
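One way to sidestep the ambiguity (a workaround sketch, not a confirmed explanation of agg's internals): select the column explicitly and use GroupBy.apply, which accepts arbitrary return types and only ever passes the selected Series to the function. The names train_ads, INVERTER_ID, and PER_TS_YIELD are taken from the question above.

pacf_values = train_ads.groupby('INVERTER_ID')['PER_TS_YIELD'].apply(pick_pacf)
acf_values = train_ads.groupby('INVERTER_ID')['PER_TS_YIELD'].apply(pick_acf)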

Iterating optimization on a dataframe

I'm trying to build an iterating interpolation of Series x and DataFrame y.
DataFrame y has n rows and m columns, and I would like to run the interpolation for every row of y.
So far, I've been able to successfully build the iteration for one single row by using iloc[0, :]:
### SX5E
from scipy import interpolate

z = np.linspace(0.2, 0.99, 200)
z_pd_SX5E = pd.Series(z)

def f(z_pd_SX5E):
    x_SX5E = x
    y_SX5E = y.iloc[0, :]
    tck_SX5E = interpolate.splrep(x_SX5E, y_SX5E)
    return interpolate.splev(z_pd_SX5E, tck_SX5E)

Optimal_trigger_P_SX5E = z_pd_SX5E[f(z_pd_SX5E).argmax(axis=0)]
How can I run the function through every row of y?
Many thanks
In general you can run any function on each row by using .apply with axis=1. So something like:
y.apply(lambda val: interpolate.splrep(x, val), axis=1)
This then returns a new Series object.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
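Putting the pieces together, a sketch of the full per-row computation (assuming x, y, and z_pd_SX5E as defined in the question; best_trigger is a hypothetical helper name):

def best_trigger(row):
    # fit a spline through (x, row) and evaluate it on the z grid
    tck = interpolate.splrep(x, row)
    return z_pd_SX5E[interpolate.splev(z_pd_SX5E, tck).argmax()]

optimal_triggers = y.apply(best_trigger, axis=1)  # one optimal trigger per row of y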

Create new columns in pandas DataFrame using List Comprehension

So I have a pandas DataFrame that has several columns that contain values I'd like to use to create new columns using a function I've defined. I'd been planning on doing this using Python's List Comprehension as detailed in this answer. Here's what I'd been trying:
df['NewCol1'], df['NewCol2'] = [myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])]
This runs correctly until it comes time to assign the values to the new columns, at which point it fails. I believe this is because it isn't iteratively assigning the values, and instead tries to assign a constant value to each column. I feel like I'm close to doing this correctly, but I can't quite figure out the assignment.
EDIT:
The data are all strings, and the function performs a fetching of some different information from another source based on those strings like so:
def myFunction(x, y):
    # read a file based on the value of x
    # search the file for values a and b based on the value of y
    return (a, b)
I know this is a little vague, but the helper function is fairly complicated to explain.
The error received is:
ValueError: too many values to unpack (expected 4)
You can use zip()
df['NewCol1'], df['NewCol2'] = zip(*[myFunction(x=row[0], y=row[1]) for row in zip(df['OldCol1'], df['OldCol2'])])
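Why this works (an illustrative aside, not from the original answer): zip(*pairs) transposes a list of 2-tuples into two sequences, one per new column, so each column receives one value per row instead of the whole list of tuples.

pairs = [(1, 'a'), (2, 'b'), (3, 'c')]  # e.g. what the list comprehension produces
col1, col2 = zip(*pairs)                # col1 == (1, 2, 3), col2 == ('a', 'b', 'c')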

Shifting all rows in dask dataframe

In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

with pd.HDFStore(path) as store:
    data = dd.from_hdf(store, 'sim')[col1]

shifted = data.shift(1)
idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation, though it could if you raise an issue. In principle this is not so dissimilar from the rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc.
In fact, if you were to create a pandas function that adheres to the same API as the pandas.rolling_foo functions, then you could use the dask.dataframe.rolling.wrap_rolling function to turn your pandas-style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift down by multiple rows, repeat the whole operation that many times with the same window.
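A quick sanity check of the trick on a plain pandas Series (a toy sketch; assumes numeric data with no NaNs of its own):

import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])
shifted = s.rolling(window=2).sum() - s  # row i becomes s[i-1] + s[i] - s[i] = s[i-1]
print(shifted.equals(s.shift(1)))        # True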
