Calculating Pearson correlation - python

I'm trying to calculate the Pearson correlation coefficient of two variables. These variables are to determine if there is a relationship between number of postal codes to a range of distances. So I want to see if the number of postal codes increases/decreases as the distance ranges changes.
I'll have one list which will count the number of postal codes within a distance range and the other list will have the actual ranges.
Is it ok to have a list that contain a range of distances? Or would it be better to have a list like this [50, 100, 500, 1000] where each element would then contain ranges up that amount. So for example the list represents up to 50km, then from 50km to 100km and so on.

Use scipy :
scipy.stats.pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Parameters :
x : 1D array
y : 1D array the same length as x
Returns :
(Pearson’s correlation coefficient, :
2-tailed p-value)

You can also use numpy:
numpy.corrcoef(x, y)
which would give you a correlation matrix that looks like:
[[1 correlation(x, y)]
[correlation(y, x) 1]]

try this:
val=Top15[['Energy Supply per Capita','Citable docs per Capita']].rank().corr(method='pearson')

In Python 3.10 correlation() function was added to the statistics module of the Python standard library, it can be directly used by importing the statistics module:
import statistics
statistics.correlation(words, views)

Related

Interpret correlation coefficient using correlation matrix

I want to apply a regression algorithm but before that I decided to study the relationship between my target output(Temperature value) and the remaining variables so I decided to plot the correlation matrix as a first step.
I got the next picture :
As it is shown the Image below : the Temperature is strongly correlated with the first 4 variables
Based on a the difference between VBATT and VIN I decided to split the total samples into two subsets : subset1 and subset 2. Then, I plotted the correlation variables
Subset 1: VIN-VBATT> X
Subset 2: VIN-VBATT=< x
I can not understand why the correlation between the Temperature and VIN has changed completely ?
And why all the correlation coefficients has been decreased ?
Thank you in adavance !

Discretize normal distribution to get prob of a random variable

Suppose I draw randomly from a normal distribution with mean zero and standard deviation represented by a vector of, say, dimension 3 with
scale_rng=np.array([1,2,3])
eps=np.random.normal(0,scale_rng)
I need to compute a weighted average based on some simulations for which I draw the above mentioned eps. The weights of this average are "the probability of eps" (hence I will have a vector with 3 weights). For weighted average I simply mean an arithmetic sum wehere each component is multiplied by a weight, i.e. a number between 0 and 1 and where all the weights should sum up to one.
Such weighted average shall be calculated as follows: I have a time series of observations for one variable, x. I calculate an expanding rolling standard deviation of x (say this is the values in scale). Then, I extract a random variable eps from a normal distribution as explained above for each time-observation in x and I add it to it, say obtaining y=x+eps. Finally, I need to compute the weighted average of y where each value of y is weighted by the "probability of drawing each value of eps from a normal distribution with mean zero and standard deviation equal to scale.
Now, I know that I cannot think of this being the points on the pdf corresponding to the values randomly drawn because a normal random variable is continuous and as such the pdf at a certain point is zero. Hence, the only solution I Found out is to discretize a normal distribution with a certain number of bins and then find the probability that a value extracted with the code of above is actually drawn. How could I do this in Python?
EDIT: the solution I found is to use
norm.cdf(eps_it+0.5, loc=0, scale=scale_rng)-norm.cdf(eps_it-0.5, loc=0, scale=scale_rng)
which is not really based on the discretization but at least it seems feasible to me "probability-wise".
here's an example leaving everything continuous.
import numpy as np
from scipy import stats
# some function we want a monte carlo estimate of
def fn(eps):
return np.sum(np.abs(eps), axis=1)
# define distribution of eps
sd = np.array([1,2,3])
d_eps = stats.norm(0, sd)
# draw uniform samples so we don't double apply the normal density
eps = np.random.uniform(-6*sd, 6*sd, size=(10000, 3))
# calculate weights (working with log-likelihood is better for numerical stability)
w = np.prod(d_eps.pdf(eps), axis=1)
# normalise so weights sum to 1
w /= np.sum(w)
# get estimate
np.sum(fn(eps) * w)
which gives me 4.71, 4.74, 4.70 4.78 if I run it a few times. we can verify this is correct by just using a mean when eps is drawn from a normal directly:
np.mean(fn(d_eps.rvs(size=(10000, 3))))
which gives me essentially the same values, but with expected lower variance. e.g. 4.79, 4.76, 4.77, 4.82, 4.80.

scipy.stats.wasserstein_distance implementation

I am trying to understand the implementation that is used in
scipy.stats.wasserstein_distance
for p=1 and no weights, with u_values, v_values the two 1-D distributions, the code comes down to
u_sorter = np.argsort(u_values) (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values)) (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values) (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right') (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas)) (6)
What is the reasoning behind this implementation, is there some literature?
I did look at the paper cited which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral,
\int_{-\infty}^{+\infty} |U-V|,
with U and V the cumulative distribution functions for the distributions u_values and v_values,
but I don't understand how this integral is evaluated in scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distribution u_values and v_values is not preserved. Shouldn't this be the case in the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram or KDE is preserved and is important in Wasserstein distance. If you only pass the u_values and v_values then it has to calculate something like a PDF, KDE or histogram. Normally you would provide the PDF and the range of U and V as the 4 arguments to the function wasserstein_distance. So in the case where samples are provided you are not passing a real datapoint, simply a collection of repeated "experiments". Numbers 1 and 4 in your list of code blocks basically bins your data by the number of discrete values. A CDF is the number of discrete values until that point or P(x<X). The CDF is basically the cumulative sum of a PDF, histogram or KDE. Number 5 does the normalization of the CDF to between 0.0 and 1.0 or said another way it divides the bin by the number of bins.
So the order of the discrete values is preserved, not the original order in the datapoint.
B) It may make more sense if you plot the CDF's of a datapoint such as an image file by using the code above.
The transportation problem however may not need a PDF, but rather a datapoint of ordered features or some way to measure distance between features in which case you would calculate it differently.

How to calculate covariance Matrix with Pandas

I'm trying to figure out how to calculate a covariance matrix with Pandas.
I'm not a data scientist or a finance guy, i'm just a regular dev going a out of his league.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(252, 4)), columns=list('ABCD'))
print(df.cov())
So, if I do this, I get that kind of output:
I find that the number are huge, and i was expecting them to be closer to zero. Do i have to calculate the return before getting the cov ?
Does anyone familiar with this could explain this a little bit or point me to a good link with explanation ? I couldn't find any link to Covariance Matrix For Dummies.
Regards,
Julien
Covariance is a measure of the degree to which returns on two assets (or any two vector or array) move in tandem. A positive covariance means that asset returns move together, while a negative covariance means returns move inversely.
On the other side we have:
The correlation coefficient is a measure that determines the degree to which two variables' movements are associated. Note that the correlation coefficient measures linear relationship between two arrays/vector/asset.
So, portfolio managers try to reduce covariance between two assets and keep the correlation coefficient negative to have enough diversification in the portfolio. Meaning that a decrease in one asset's return will not cause a decrease in return of the second asset(That's why we need negative correlation).
Maybe you meant correlation coefficient close to zero, not covariance.
The fact that you haven't provided a seed for your randomly generated numbers makes th reproducibility of your experiment difficoult. However, I tried the code you are providing here and the closer covariance matrix I get is this one :
To understand why the numbers in your cov_matrix are so huge you should first understand what is a covarance matrix. The covariance matrix is is a matrix that has as elements in the i, j position the the covariance between the i-th and j-th elements of a random vector.
A good link you might check is https://en.wikipedia.org/wiki/Covariance_matrix . Also understanding the correlation matrix might help : https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_matrices

AgglomerativeClustering on a correlation matrix

I have a correlation matrix of typical structure that is of size 288x288 that is defined by:
from sklearn.cluster import AgglomerativeClustering
df = read_returns()
correl_matrix = df.corr()
where read_returns gives me a dataframe with a date index, and columns of the returns of assets.
Now - I want to cluster these correlations to reduce the population size.
By doing some reading and experimenting I discovered AgglomerativeClustering - and it appears at first pass to be an appropriate solution to my problem.
I define a distance metric as ((.5*(1-correl_matrix))**.5) and have:
cluster = AgglomerativeClustering(n_clusters=40, linkage='average')
cluster.fit(((.5*(1-correl_matrix))**.5).values)
label_groups = cluster.labels_
To observe some of the data and cross check my work I pick out cluster 1 and observe the pairwise correlations and find the min correlation between two items with that group in my dataset to find :
single_cluster = []
for i in range(0,correl_matrix.shape[0]):
if label_groups[i]==1:
single_cluster.append(correl_matrix.index[i])
min_correl = 1.0
for x in single_cluster:
for y in single_cluster:
if x<>y:
if correl_matrix[x][y]<min_correl:
min_correl = correl_matrix[x][y]
print min_correl
and get a min pairwise correlation of .20
To me this seems quite low - but "low based off what?" is a fair question to which I have no answer.
I would like to anticipate/enforce each pairwise correlation of a cluster to be >=.7 or something like this.
Is this possible in AgglomerativeClustering?
Am I accidentally going down the wrong path?
Hierarchical clustering supports different "linkage" strategies.
single-link: this connects points on the minimum distance to the others in the cluster
complete-link: this connects based on the maximum distance to the cluster
...
If you want a high minimum correlation = small maximum distance, this calls for complete linkage.
You may want to treat negative correlations as "good", too.
i.e. use dist = 1 - abs(corr).
Make sure to use ghe dendrogram. If you have outliers in your data, you want to cut into (n_clusters+n_outliers) partitions.

Categories