Calculating Covariance with Python and Numpy

I am trying to figure out how to calculate covariance with the Python Numpy function cov. When I pass it two one-dimensional arrays, I get back a 2x2 matrix of results. I don't know what to do with that. I'm not great at statistics, but I believe covariance in such a situation should be a single number. This is what I am looking for. I wrote my own:
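import numpy as np

def cov(a, b):
    # Sample covariance of two equal-length sequences (divides by n - 1).
    if len(a) != len(b):
        return
    a_mean = np.mean(a)
    b_mean = np.mean(b)
    total = 0  # named total to avoid shadowing the built-in sum
    for i in range(len(a)):
        total += (a[i] - a_mean) * (b[i] - b_mean)
    return total / (len(a) - 1)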
That works, but I figure the Numpy version is much more efficient, if I could figure out how to use it.
Does anybody know how to make the Numpy cov function perform like the one I wrote?
Thanks,
Dave

When a and b are 1-dimensional sequences, numpy.cov(a,b)[0][1] is equivalent to your cov(a,b).
The 2x2 array returned by np.cov(a,b) has elements equal to
cov(a,a) cov(a,b)
cov(a,b) cov(b,b)
(where, again, cov is the function you defined above.)
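As a quick check of the layout described above, here is a minimal sketch. The data and the helper name manual_cov are made up for illustration; manual_cov uses the same n - 1 formula as the function in the question.

```python
import numpy as np

def manual_cov(a, b):
    """Sample covariance with the n - 1 denominator, as in the question."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

# Hypothetical data chosen only for illustration.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 9.0]

matrix = np.cov(a, b)   # 2x2 covariance matrix
cov_ab = matrix[0, 1]   # off-diagonal entry: covariance of a and b

print(matrix.shape)                                  # (2, 2)
print(np.isclose(cov_ab, manual_cov(a, b)))          # True
print(np.isclose(matrix[0, 0], np.var(a, ddof=1)))   # True: diagonal is the variance
```

The matrix is symmetric, so matrix[1, 0] holds the same number as matrix[0, 1].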

Thanks to unutbu for the explanation. By default numpy.cov calculates the sample covariance. To obtain the population covariance you can specify normalisation by the total N samples like this:
numpy.cov(a, b, bias=True)[0][1]
or like this:
numpy.cov(a, b, ddof=0)[0][1]
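A small sketch of how the two normalisations relate, using made-up data. With n samples, the population value should equal the sample value scaled by (n - 1) / n.

```python
import numpy as np

# Hypothetical data chosen only for illustration.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 9.0])
n = len(a)

sample_cov = np.cov(a, b)[0, 1]                      # default: divides by n - 1
population_cov = np.cov(a, b, ddof=0)[0, 1]          # divides by n
population_cov_bias = np.cov(a, b, bias=True)[0, 1]  # same normalisation as ddof=0

print(np.isclose(population_cov, population_cov_bias))       # True
print(np.isclose(population_cov, sample_cov * (n - 1) / n))  # True
```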

Note that starting in Python 3.10, you can obtain the covariance directly from the standard library with statistics.covariance, which returns a single number (the one you're looking for) measuring the joint variability of two inputs:
from statistics import covariance
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
covariance(x, y)
# 0.75

Related

Method with numpy gives different result when called with array

I created a cosine similarity method, which gives the correct results when called with individual vectors, but when I supply a list of vectors I suddenly get different results. Isn't numpy supposed to calculate the formula for every element in the list? Is my understanding wrong?
Cosine similarity:
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
Example:
a = [1, 2, 3]
b = [4, 5, 6]
print(cosine_similarity(a, a), cosine_similarity(a, b), cosine_similarity(a, [a, b]))
With the result:
1.0 0.9746318461970762 [0.39223227 0.8965309 ]
The first two values are correct. The array should contain the same two values, but it doesn't.
Is this just not possible or do I have to change something?

Your understanding is actually correct. Many functions in numpy accept the keyword argument axis. np.linalg.norm, for example, computes the norm along the specified axis. In your case, because axis is not specified, norm calculates the norm of the whole 2x3 matrix [a, b] instead of calculating the norm per row.
To fix the code just do the following:
def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2, axis=-1))
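To see the difference the axis argument makes, here is a short sketch that reuses the vectors from the question. It assumes vec1 is always a single vector, while vec2 may be one vector or a list of vectors.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # axis=-1 takes the norm of each row of vec2 rather than of the whole matrix.
    return np.inner(vec1, vec2) / (
        np.linalg.norm(vec1) * np.linalg.norm(vec2, axis=-1)
    )

a = [1, 2, 3]
b = [4, 5, 6]

single = cosine_similarity(a, b)
batch = cosine_similarity(a, [a, b])

print(np.linalg.norm([a, b]))             # one number: the norm of the whole 2x3 matrix
print(np.linalg.norm([a, b], axis=-1))    # two numbers: one norm per row
print(np.allclose(batch, [1.0, single]))  # True
```

If vec1 could also be a list of vectors, both norms would need the axis treatment and np.inner would return a matrix of pairwise products, so that case needs a different formula.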

Why is the percentile() method not calculating the expected percentile? The 25th percentile for this data should be 1.5, or 2 if rounded off

import numpy as np
value = [1, 2, 3, 4, 5, 6]
x = np.percentile(value, 25)
print(x)
I am calculating the percentile using this code to cross-verify:
import math

def my_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    if p.is_integer():
        return sorted(data)[int(p)]
    else:
        return sorted(data)[int(math.ceil(p)) - 1]

t = [1, 2, 3, 4, 5, 6]
per = my_percentile(t, 25)
print(per)

There's more than one way to calculate quartiles. Wikipedia has a good summary under quantiles.
The values returned by numpy's default calculation match those returned by, for example, R's summary() function.
You need to do one of these things:
Switch to numpy.percentile's default way of calculating quartiles,
provide a value to numpy.percentile's parameter interpolation (named method in NumPy 1.22 and later), or
write your own custom function.
The valid values for interpolation are listed in the numpy.percentile documentation.
I didn't suggest a value for interpolation, because you didn't include your expected output in your question. You need to consider the effect of your decision on all quartiles, not just on one.
(I don't think scipy.stats.percentileofscore() will work for you.)
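To make the choice concrete, here is a minimal sketch comparing the default rule with two alternatives on the data from the question. The helper percentile_with is hypothetical. It exists only because the keyword is called method in NumPy 1.22 and later and interpolation in older releases.

```python
import numpy as np

def percentile_with(data, q, how):
    """Call np.percentile with a named estimation rule.

    Hypothetical helper for illustration: tries the newer `method`
    keyword first and falls back to the older `interpolation` keyword.
    """
    try:
        return float(np.percentile(data, q, method=how))
    except TypeError:
        return float(np.percentile(data, q, interpolation=how))

value = [1, 2, 3, 4, 5, 6]

# Default ("linear"): position (n - 1) * q = 5 * 0.25 = 1.25 in the sorted
# data, i.e. a quarter of the way from 2 to 3.
print(np.percentile(value, 25))              # 2.25
print(percentile_with(value, 25, "lower"))   # 2.0
print(percentile_with(value, 25, "higher"))  # 3.0
```

None of these rules is wrong. They are different conventions, which is why the result should be checked against all the quartiles you care about.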

How can I use Numpy to obtain a function that represents the relationship between a pair of values and another?

I am not a mathematician but I think that what I am after is called a "multiple linear regression"; please correct me if I am wrong.
I use numpy.polyfit and numpy.poly1d on a series of angle/pulse_width values from a servo motor, to obtain a function, angles_to_pulsewidths.
angles_to_pulsewidths is a polynomial function that models the servo and represents a line of good fit for the series. Given an angle value, it returns a corresponding pulse_width.
I am now trying to do a similar thing, but instead of a single angle value in my series, I have a pair of x/y co-ordinates for each pulse_width. I want to obtain a function that, given an x/y pair, returns a corresponding pulse_width.
This is my code for creating my angles_to_pulsewidths function:
import numpy

angles_and_pulsewidths = [
    [-162, 2490],
    [-144, 2270],
    [-126, 2070],
    [-108, 1880]
]

angles_values_array = numpy.array(angles_and_pulsewidths)[:, 0]
pulsewidths_values_array = numpy.array(angles_and_pulsewidths)[:, 1]

coefficients = numpy.polyfit(
    angles_values_array,
    pulsewidths_values_array,
    3
)

angles_to_pulsewidths = numpy.poly1d(coefficients)
I have been trying to modify this so that instead of providing a one-dimensional array of angles I will provide a two-dimensional array of x/y values:
xy_values = [[1, 2], [3, 4], [5, 6], [6, 7]]
pulse_widths = [2490, 2270, 2070, 1880]
However in this case, I can't use polyfit, because that takes only a one-dimensional array for its x parameter.
I can use numpy.linalg.lstsq instead, but I can't work out what to do with the results it gives me.
I'm also not even sure if I am on the right track; am I? I have read numerous related questions here, and have found numerous clues, but not enough to get me to the next step.

It is possible to use scipy's curve_fit for this.
If you know the general form of the function, for example if you think it will be something of the form:
a x^2 + b x y + c y^2 + d x + e y + f
then you can use scipy's curve_fit to estimate what I will refer to as "parameters": a, b, c, d, e, f.
First we need to define the general form of our function:
def func(variables, a, b, c, d, e, f):
    x, y = variables
    return a * x ** 2 + b * x * y + c * y ** 2 + d * x + e * y + f
Note that our function has 6 parameters. To demonstrate how this works we need more data points than parameters, so I'm extending your example data set to 7 pairs of xy values and 7 pulse widths:
xy_values = [[1, 2], [3, 4], [5, 6], [6, 7], [8, 9], [10, 11], [12, 13]]
pulse_widths = [2490, 2270, 2070, 1880, 2000, 500, 600]
(If you do not have more data points than parameters, you can probably choose a general form of your function with fewer parameters.)
We need to reshape our xy_values so that it is not a list of pairs but a single pair of two sets of values (the xs and the ys). To do this I'm choosing to create a numpy array and "transpose" it:
import numpy as np
xy_values = np.array(xy_values).T
We can now call our func on our array:
func(variables=xy_values, a=0, b=0, c=0, d=0, e=0, f=4)
Which gives:
array([4, 4, 4, 4, 4, 4, 4])
We can now actually use our data and curve_fit to estimate the best parameters:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=func, xdata=xy_values, ydata=pulse_widths)
pcov is the estimated covariance of the parameters, which tells you how well they are determined, and popt holds the fitted parameter values, which we can inspect and use directly:
popt
gives:
array([ -25.61043682, -106.84636863, 119.10145249, -374.6200899 ,
230.65326227, 2141.55126789])
and we can call the function with it on some new value of x and y:
func([0, 5], *popt)
which gives:
6272.353891536915
Choosing the correct general form of the function you want to fit is case-dependent. If you have any knowledge of the problem at hand (perhaps you expect some trigonometric relationship), use it; otherwise it's a case of trial and error until you get a relationship that's "good enough" for your use case.
EDIT: Your original suggestion of needing to use multiple linear regression (MLR) is not completely incorrect. The approach I've described allows you to do MLR; MLR just assumes a specific type of func, one where all the terms are linear.
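For the purely linear case, numpy.linalg.lstsq (mentioned in the question) can do the fit directly. The following is a minimal sketch under stated assumptions: the data is made up and noise-free, generated from pulse = 3*x - 2*y + 10 so that the recovered coefficients are easy to check, and xy_to_pulsewidth is a hypothetical name.

```python
import numpy as np

# Hypothetical, noise-free data generated from pulse = 3*x - 2*y + 10.
xy_values = np.array([[1, 2], [3, 1], [5, 6], [6, 2], [8, 9]], dtype=float)
pulse_widths = 3 * xy_values[:, 0] - 2 * xy_values[:, 1] + 10

# Design matrix with columns x, y and a constant 1 for the intercept.
design = np.column_stack([xy_values, np.ones(len(xy_values))])

# lstsq returns (solution, residuals, rank, singular_values);
# the solution holds the coefficients d, e, f of d*x + e*y + f.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(
    design, pulse_widths, rcond=None
)

def xy_to_pulsewidth(x, y):
    """Evaluate the fitted plane at (x, y)."""
    d, e, f = coeffs
    return d * x + e * y + f

print(np.round(coeffs, 6))     # approximately [ 3. -2. 10.]
print(xy_to_pulsewidth(2, 3))  # approximately 10.0
```

The first element of the returned tuple is what you need "to do something with": it is the list of coefficients, in the same order as the columns of the design matrix. Note that the fit is only well determined when the x/y points do not all lie on one line.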

Get mean of a distribution?

So, I generated a vector d of data that follows a normal distribution with some mean and variance.
I then want to calculate a vector s such that each component is a function of the corresponding component of d: s_i = f(d_i).
Then I want to take the mean. Is there any quick way to do that in Python without a loop?

You can use numpy to apply a function to an entire array. For example, if I had such a function
def f(x):
    return x * 2
Then I could use numpy as follows
>>> import numpy
>>> d = numpy.array([1,2,6,7])
>>> f(d)
array([ 2, 4, 12, 14])
Then to calculate the mean
>>> s = f(d)
>>> numpy.mean(s)
8.0
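The same idea applied to normally distributed data, as in the question, might look like the sketch below. The function f, the mean 5.0, the standard deviation 2.0 and the sample size are all assumptions made for illustration.

```python
import numpy as np

def f(x):
    # Built from array operations only, so it applies elementwise to an array.
    return x ** 2 + 1

rng = np.random.default_rng(seed=0)              # seeded for reproducibility
d = rng.normal(loc=5.0, scale=2.0, size=10_000)  # hypothetical mean and std

s = f(d)         # s[i] == f(d[i]) for every i, with no explicit loop
print(s.shape)   # (10000,)
print(s.mean())  # sample mean of f(d)
```

This works because f only uses operations that numpy arrays support elementwise. A function with Python-level branching on its argument would need to be rewritten with array operations such as numpy.where first.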

Overcoming broadcasting error for Legendre polynomials, scipy eval_legendre

I am trying to evaluate the Legendre polynomial P_n(x) with scipy's special function
scipy.special.eval_legendre(n, x)
which allows you to evaluate a Legendre polynomial at certain points. I would then like to sum these Legendre polynomials together, \Sigma_n P_n(x).
Begin by evaluating P_n(x) at several n values, let's say up to n = 10. Define an array
arr = np.arange(11)  # array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
and you can evaluate P_n(x) at these values.
My argument however is a 100 by 100 matrix. So,
eval_legendre(np.arange(11), matrix)
will not work as there's a broadcasting error. That's easy to overcome.
But then, I would like to take the sum of all of these Legendre polynomials
"Sum = P_0(x) + P_1(x) + P_2(x) + ... + P_10(x)"
using
import numpy as np
np.sum()
That is more complex, as I am summing each P_n(x).
I suspect the correct approach is something like
for i in arr:
    np.sum(i, matrix)
Is there a cleaner way to do this?

This should do the job:
sum([eval_legendre(n, matrix) for n in range(11)])
Each call to the eval_legendre function returns a matrix of the shape of the matrix you pass to it. So we can make a list of these matrices using a list comprehension, and sum them as you suggested. Here range(11) covers n = 0 through 10, matching the sum P_0(x) + ... + P_10(x) in the question.
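An alternative that avoids the Python-level loop is to let the degrees broadcast against the matrix. This is a sketch rather than a definitive recipe: the matrix contents are made up, and it relies on eval_legendre accepting array arguments for both n and x.

```python
import numpy as np
from scipy.special import eval_legendre

# Hypothetical 100 x 100 argument with entries in [-1, 1].
matrix = np.linspace(-1.0, 1.0, 100 * 100).reshape(100, 100)

# Degrees 0..10 shaped (11, 1, 1) so they broadcast against (100, 100).
degrees = np.arange(11)[:, np.newaxis, np.newaxis]

# The broadcast result has shape (11, 100, 100); sum over the degree axis.
total = eval_legendre(degrees, matrix).sum(axis=0)

# Cross-check against the explicit loop version.
looped = sum(eval_legendre(n, matrix) for n in range(11))
print(total.shape)                 # (100, 100)
print(np.allclose(total, looped))  # True
```

The broadcast version holds all 11 intermediate matrices in memory at once, so for many degrees or a very large matrix the loop may be the more economical choice.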
