Correlation between two dataframes / curves

Correlation between two dataframes / curves - python

I have yield curve data for two currencies (Euro and U.S. Dollar). For each of these currencies I have 16 variables (16 maturities). I have generated, using some model, synthetic curves and I want to relate the curves of the two currencies. That is, what is the correlation between the two currencies? I am asking this question as my model should capture this correlation. For example, it wouldn't be great if a 4% euro curve is generated and at the same time a -4% level curve is generated for the dollar. How can I do this? I don't like a correlation matrix as this yields a 16x16 matrix per model (I have multiple). Any thoughts?! Could be very helpful.

Just build a new DataFrame out of the two other DataFrames with the Series you want and call corr?
It is hard to provide a exact solution (and i may not understand your problem fully), as you didn't provide any information on how the data looks or what your code looks already
import pandas as pd
# your code so far...
df_corr = pd.DataFrame()
df_corr['eur_curve'] = df_eur # asumming you have a df_eur with the curve and 16 "variables"
df_corr['usd_cuve'] = df_usd # asumming you have a df_usd with the curve and 16 "variables"
corr = df_corr.corr()

Related

Strange results when scaling data using scikit learn

I have an input dataset that has 4 time series with 288 values for 80 days. So the actual shape is (80,4,288). I would like to cluster differnt days. I have 80 days and all of them have 4 time series: outside temperature, solar radiation, electrical demand, electricity prices. What I want is to group similar days with regard to these 4 time series combined into clusters. Days belonging to the same cluster should have similar time series.
Before clustering the days using k-means or Ward's method, I would like to scale them using scikit learn. For this I have to transform the data into a 2 dimensional shape array with the shape (80, 4*288) = (80, 1152), as the Standard Scaler of scikit learn does not accept 3-dimensional input. The Standard Scaler just standardizes features by removing the mean and scaling to unit variance.
Now I scale this data using sckit learn's standard scaler:
import numpy as np
from sklearn.preprocessing import StandardScaler
import pandas as pd
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";")
scaler = StandardScaler()
data_Scaled = scaler.fit_transform(data_Unscaled)
np.savetxt("C:/Users/User1/Desktop/data_Scaled.csv", data_Scaled, delimiter=";")
When I now compare the unscaled and scaled data e.g. for the first day (1 row) and the 4th time series (columns 864 - 1152 in the csv file), the results look quite strange as you can see in the following figure:
As far as I see it, they are not in line with each other. For example in the timeslots between 111 and 201 the unscaled data does not change at all whereas the scaled data fluctuates. I can't explain that. Do you have any idea why this is happening and why they don't seem to be in line?
Here is the unscaled input data with shape (80,1152): https://filetransfer.io/data-package/CfbGV9Uk#link
and here the scaled output of the scaling with shape (80,1152): https://filetransfer.io/data-package/23dmFFCb#link

You have two issues here: scaling and clustering. As the question title refers to scaling, I'll handle that one in detail. The clustering issue is probably better suited for CrossValidated.
You don't say it, but it seems natural that all temperatures, be it on day 1 or day 80, are measured on a same scale. The same holds for the other three variables. So, for the purpose of scaling you essentially have four time series.
StandardScaler, like basically everything in sklearn, expects your observations to be organised in rows and variables in columns. It treats each column separately, deducting its mean from all the values in the column and dividing the resulting values by their standard deviation.
I reckon from your data that the first 288 entries in each row correspond to one variable, the next 288 to the second one etc. You need to reshape these data to form 288*80=23040 rows and 4 columns, one for each variable.
You apply StandardScaler on that array and reformat the data into the original shape, with 80 rows and 4*288=1152 columns. The code below should do the trick:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
data_Unscaled = pd.read_csv("C:/Users/User1/Desktop/data_Unscaled.csv", sep=";", header=None)
X = data_Unscaled.to_numpy()
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
scaler = StandardScaler()
X_narrow_scaled = scaler.fit_transform(X_narrow)
X_scaled = np.array([X_narrow_scaled[i*288:(i+1)*288, :].T.ravel() for i in range(80)])
# Plot the original data:
i=3
j=0
plt.plot(X[j, i*288:(i+1)*288])
plt.title('TimeSeries_Unscaled')
plt.show()
# plot the scaled data:
plt.plot(X_scaled[j, i*288:(i+1)*288])
plt.title('TimeSeries_Scaled')
plt.show()
resulting in the following graphs:
The line
X_narrow = np.array([X[:, i*288:(i+1)*288].ravel() for i in range(4)]).T
uses list comprehension to generate the four columns of the long, narrow array X_narrow. Basically, it is just a shorthand for a for-loop over your four variables. It takes the first 288 columns of X, flattens them into a vector, which it then puts into the first column of X_narrow. Then it does the same for the next 288 columns, X[:, 288:576], and then for the third and the fourth block of the 288 observed values per day. This way, each column in X_narrow contains a long time series, spanning 80 days (and 288 observations per day), of exactly one of your variables (outside temperature, solar radiation, electrical demand, electricity prices).
Now, you might try to cluster X_scaled using K-means, but I doubt it will work. You have just 80 points in a 1152-dimensional space, so the curse of dimensionality will almost certainly kick in. You'll most probably need to perform some kind of dimensionality reduction, but, as I noted above, that's a different question.

How to determine multiple Periodicities present in Timeseries data?

My objective is to detect all kinds of seasonalities and their time periods that are present in a timeseries waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
#https://www.kaggle.com/rakannimer/air-passengers
df=pd.read_csv('AirPassengers.csv')
df.head()
frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0] # np.size(t)
frequencies_rel = len(A_signal_fft)/frequency_eval_max * np.linspace(0,1,int(n))
fig=plt.figure(3, figsize=(15,6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't result in anything conclusive or comprehensible.
Ideally I wish to see the peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what am I doing wrong?
2) Autocorrelation:
from pandas.plotting import autocorrelation_plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing which I get a plot like the following:
But how do I read this plot and how can I derive the presence of the various seasonalities and their periods from this?
3) SLT Decomposition Algorithm
df.set_index('Month',inplace=True)
df.index=pd.to_datetime(df.index)
#drop null values
df.dropna(inplace=True)
df.plot()
result=seasonal_decompose(df['#Passengers'], model='multiplicable', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonalities and their time periods that are present using this method?
Hence, I've tried 3 different approaches but they seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all kinds of seasonalities and their time periods for any given timeseries data?

I still think a Fourier analysis is the way to go, its just that the 0-frequency result is shadowing any insight.
This is essentially the square of the average of your data set, and all records are positive, far from the typical sinusoidal function you would analyze with Fourier Transforms. So simply subtract the average of your dataset to your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT. Those are related to the sampling frequency and span of your dataset. Reason about it and adequately label the daily, weekly, monthly and anual frequencies in your chart.

using FFT, you can get the fundamental frequency. you can then use a low-pass filter or just manually select the first n frequencies. these frequencies will correspond to the 'seasonalities'. transform your filtered FFT into time domain and you can visualize the most basic underlying repetitions, you can easily calculate the time period of those repetitions and visualize it by individually plotting the F0,F1,... in time domain.

Classification method for unevenly spread data

I have several datasets with very unevenly distributed values: Most values are very low, but a few are very high, for example, in the histogram screenshot or even more extreme.
I am actually interested in the differences in the high values.
So what I am looking for is a classification method that sets many break values where there are few data values and large classes where there are many values. Maybe something like a reversed quantile classification.
Do you have a suggestion on which algorithm could help with this task, preferably in Python?

if you are using pandas, couldn't you just select the values above your chosen threshold and analyze the difference seperately?
import pandas as pd
df = pd.DataFrame(your data)
df_to_analyze_large_values = df[df.your_Column_of_interest > 100000]

Python- np.mean() giving wrong means?

The issue
So I have 50 netCDF4 data files that contain decades of monthly temperature predictions on a global grid. I'm using np.mean() to make an ensemble average of all 50 data files together while preserving time length & spatial scale, but np.mean() gives me two different answers. The first time I run its block of code, it gives me a number that, when averaged over latitude & longitude & plotted against the individual runs, is slightly lower than what the ensemble mean should be. If I re-run the block, it gives me a different mean which looks correct.
The code
I can't copy every line here since it's long, but here's what I do for each run.
#Historical (1950-2020) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_195001-202012.nc") #Import data file
tash1 = ncin_1.variables['tas'][:] #extract tas (temperature) variable
ncin_1.close() #close to save memory
#Repeat for future (2021-2100) data
ncin_1 = Dataset("/project/wca/AR5/CanESM2/monthly/histr1/tas_Amon_CanESM2_historical-r1_r1i1p1_202101-210012.nc")
tasr1 = ncin_1.variables['tas'][:]
ncin_1.close()
#Concatenate historical & future files together to make one time series array
tas11 = np.concatenate((tash1,tasr1),axis=0)
#Subtract the 1950-1979 mean to obtain anomalies
tas11 = tas11 - np.mean(tas11[0:359],axis=0,dtype=np.float64)
And I repeat that 49 times more for other datasets. Each tas11, tas12, etc file has the shape (1812, 64, 128) corresponding to time length in months, latitude, and longitude.
To get the ensemble mean, I do the following.
#Move all tas data to one array
alltas = np.zeros((1812,64,128,51)) #years, lat, lon, members (no ensemble mean value yet)
alltas[:,:,:,0] = tas11
(...)
alltas[:,:,:,49] = tas50
#Calculate ensemble mean & fill into 51st slot in axis 3
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
When I check a coordinate & month, the ensemble mean is off from what it should be. Here's what a plot of globally averaged temperatures from 1950-2100 looks like with the first mean (with monhly values averaged into annual values. Black line is ensemble mean & colored lines are individual runs.
Obviously that deviated below the real ensemble mean. Here's what the plot looks like when I run alltas[:,:,:,50]=np.mean(alltas,axis=3,dtype=np.float64) a second time & keep everything else the same.
Much better.
The question
Why does np.mean() calculate the wrong value the first time? I tried specifying the data type as a float when using np.mean() like in this question- Wrong numpy mean value?
But it didn't work. Any way I can fix it so it works correctly the first time? I don't want this problem to occur on a calculation where it's not so easy to notice a math error.

In the line
alltas[:,:,:,50] = np.mean(alltas,axis=3,dtype=np.float64)
the argument to mean should be alltas[:,:,:,:50]:
alltas[:,:,:,50] = np.mean(alltas[:,:,:,:50], axis=3, dtype=np.float64)
Otherwise you are including those final zeros in the calculation of the ensemble means.

AgglomerativeClustering on a correlation matrix

I have a correlation matrix of typical structure that is of size 288x288 that is defined by:
from sklearn.cluster import AgglomerativeClustering
df = read_returns()
correl_matrix = df.corr()
where read_returns gives me a dataframe with a date index, and columns of the returns of assets.
Now - I want to cluster these correlations to reduce the population size.
By doing some reading and experimenting I discovered AgglomerativeClustering - and it appears at first pass to be an appropriate solution to my problem.
I define a distance metric as ((.5*(1-correl_matrix))**.5) and have:
cluster = AgglomerativeClustering(n_clusters=40, linkage='average')
cluster.fit(((.5*(1-correl_matrix))**.5).values)
label_groups = cluster.labels_
To observe some of the data and cross check my work I pick out cluster 1 and observe the pairwise correlations and find the min correlation between two items with that group in my dataset to find :
single_cluster = []
for i in range(0,correl_matrix.shape[0]):
if label_groups[i]==1:
single_cluster.append(correl_matrix.index[i])
min_correl = 1.0
for x in single_cluster:
for y in single_cluster:
if x<>y:
if correl_matrix[x][y]<min_correl:
min_correl = correl_matrix[x][y]
print min_correl
and get a min pairwise correlation of .20
To me this seems quite low - but "low based off what?" is a fair question to which I have no answer.
I would like to anticipate/enforce each pairwise correlation of a cluster to be >=.7 or something like this.
Is this possible in AgglomerativeClustering?
Am I accidentally going down the wrong path?

Hierarchical clustering supports different "linkage" strategies.
single-link: this connects points on the minimum distance to the others in the cluster
complete-link: this connects based on the maximum distance to the cluster
...
If you want a high minimum correlation = small maximum distance, this calls for complete linkage.
You may want to treat negative correlations as "good", too.
i.e. use dist = 1 - abs(corr).
Make sure to use ghe dendrogram. If you have outliers in your data, you want to cut into (n_clusters+n_outliers) partitions.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Correlation between two dataframes / curves - python

Related

Strange results when scaling data using scikit learn

How to determine multiple Periodicities present in Timeseries data?

Classification method for unevenly spread data

Python- np.mean() giving wrong means?

AgglomerativeClustering on a correlation matrix

Categories

Resources