I created a cosine similarity method that gives the correct results when called with individual vectors, but when I supply a list of vectors I suddenly get different results. Isn't numpy supposed to calculate the formula for every element in the list? Is my understanding wrong?
Cosine similarity:
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
Example:
a = [1, 2, 3]
b = [4, 5, 6]
print(cosine_similarity(a, a), cosine_similarity(a, b), cosine_similarity(a, [a, b]))
With the result:
1.0 0.9746318461970762 [0.39223227 0.8965309 ]
The first two values are correct. The array should contain the same two values, but it doesn't.
Is this just not possible or do I have to change something?
Your understanding is actually correct. Many functions in numpy accept the keyword argument axis. np.linalg.norm, for example, computes the norm along the specified axis. In your case axis is not specified, so norm calculates a single norm (the Frobenius norm) of the whole 2x3 matrix [a, b] instead of calculating one norm per row.
To fix the code just do the following:
def cosine_similarity(vec1, vec2):
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2, axis=-1))
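As a quick check, the sketch below reruns the example from the question with the per-row norm. The name cosine_similarity_rows is only illustrative, and the sketch assumes vec1 is a single vector while vec2 is either a single vector or a list of row vectors.

```python
import numpy as np

def cosine_similarity_rows(vec1, vec2):
    # Illustrative name: vec1 is one vector, vec2 is one vector or a stack of row vectors.
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    # axis=-1 gives one norm per row instead of one norm for the whole matrix.
    return np.inner(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2, axis=-1))

a = [1, 2, 3]
b = [4, 5, 6]

batched = cosine_similarity_rows(a, [a, b])
print(batched)

# Without axis, norm returns a single number for the whole 2x3 matrix,
# which is why the original version scaled both entries by the wrong value.
print(np.linalg.norm([a, b]))           # one value for the whole matrix
print(np.linalg.norm([a, b], axis=-1))  # one value per row
```

The batched call should now agree with the two single-vector calls from the question.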
I am not a mathematician but I think that what I am after is called a "multiple linear regression"; please correct me if I am wrong.
I use numpy.polyfit and numpy.poly1d on a series of angle/pulse_width values from a servo motor, to obtain a function, angles_to_pulsewidths.
angles_to_pulsewidths is a polynomial function that models the servo and represents a line of good fit for the series. Given an angle value, it returns a corresponding pulse_width.
I am now trying to do a similar thing, but instead of a single angle value in my series, I have a pair of x/y co-ordinates for each pulse_width. I want to obtain a function that, given an x/y pair, returns a corresponding pulse_width.
This is my code for creating my angles_to_pulsewidths function:
import numpy
angles_and_pulsewidths = [
[-162, 2490],
[-144, 2270],
[-126, 2070],
[-108, 1880]
]
angles_values_array = numpy.array(angles_and_pulsewidths)[:, 0]
pulsewidths_values_array = numpy.array(angles_and_pulsewidths)[:, 1]
coefficients = numpy.polyfit(
angles_values_array,
pulsewidths_values_array,
3
)
angles_to_pulsewidths = numpy.poly1d(coefficients)
I have been trying to modify this so that instead of providing a one-dimensional array of angles I will provide a two-dimensional array of x/y values:
xy_values = [[1, 2], [3, 4], [5, 6], [6, 7]]
pulse_widths = [2490, 2270, 2070, 1880]
However in this case, I can't use polyfit, because that takes only a one-dimensional array for its x parameter.
I can use numpy.linalg.lstsq instead, but I can't work out what to do with the results it gives me.
I'm also not even sure if I am on the right track; am I? I have read numerous related questions here, and have found numerous clues, but not enough to get me to the next step.
It is possible to use scipy's curve_fit for this.
If you know the general format of the function, perhaps you think it will be something of the form:
a x^2 + b x y + c y^2 + d x + e y + f
then you can use scipy's curve_fit to estimate what I will refer to as "parameters": a, b, c, d, e, f.
First we need to define the general form of our function:
def func(variables, a, b, c, d, e, f):
    x, y = variables
    return a * x ** 2 + b * x * y + c * y ** 2 + d * x + e * y + f
Note that our function has 6 parameters. To demonstrate how this works we need more data points than parameters, so I'm extending your example data set to have 7 pairs of xy values and 7 pulse widths:
xy_values = [[1, 2], [3, 4], [5, 6], [6, 7], [8, 9], [10, 11], [12, 13]]
pulse_widths = [2490, 2270, 2070, 1880, 2000, 500, 600]
(If you do not have more data points than parameters, you can probably choose a general form of your function that has fewer parameters.)
We need to reshape our xy_values so that it is not a list of pairs but a single pair of two sets of values (the xs and the ys). To do this I'm choosing to create a numpy array and "transpose" it:
import numpy as np
xy_values = np.array(xy_values).T
We can now call our func on our array:
func(variables=xy_values, a=0, b=0, c=0, d=0, e=0, f=4)
Which gives:
array([4, 4, 4, 4, 4, 4, 4])
We can now actually use our data and curve_fit to estimate the best parameters:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=func, xdata=xy_values, ydata=pulse_widths)
pcov is the estimated covariance matrix of the fitted parameters, which tells you how well they are determined, and popt holds the actual parameter values, which we can inspect and use directly:
popt
gives:
array([ -25.61043682, -106.84636863, 119.10145249, -374.6200899 ,
230.65326227, 2141.55126789])
and we can call the function with it on some new value of x and y:
func([0, 5], *popt)
which gives:
6272.353891536915
Choosing the correct general form of the function you want to fit is case-dependent. If you have any knowledge of the problem at hand (perhaps you expect some trigonometric relationship), then you can use it; otherwise it's a case of trial and error until you get a relationship that's "good enough" for your use case.
EDIT: Your original suggestion of needing to use multiple linear regression (MLR) is not completely incorrect. The solution approach I've described allows you to do MLR but it just assumes a specific type of func: one where all the terms are linear.
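Since the question asks what to do with the output of numpy.linalg.lstsq, here is a minimal sketch of the purely linear case, pulse_width ≈ d*x + e*y + f. The sample points are made up for illustration (generated from a known plane so the result is easy to check), and xy_to_pulsewidth is a hypothetical helper name, not something from the question.

```python
import numpy as np

# Made-up sample points (not the question's data), generated from a known plane
# so the recovered coefficients can be checked by eye.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
pulse = 3.0 * xy[:, 0] - 2.0 * xy[:, 1] + 100.0

# Design matrix: one column per term of the model (x, y, and a constant 1).
A = np.column_stack([xy, np.ones(len(xy))])

# lstsq returns the coefficients first; the other values describe the fit.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, pulse, rcond=None)

def xy_to_pulsewidth(x, y):
    # Hypothetical helper: evaluate the fitted plane d*x + e*y + f.
    d, e, f = coeffs
    return d * x + e * y + f

print(coeffs)
print(xy_to_pulsewidth(4.0, 5.0))
```

The first return value is what you need: one coefficient per column of the design matrix. A rank smaller than the number of columns would suggest that the x and y columns do not carry independent information.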
I was trying to figure out how to calculate the Frobenius norm of a matrix in numpy. This way I can get the 2-norm of each row in the matrix x below:
My question is about the ord parameter of numpy's linalg.norm function and how the relevant part of the numpy documentation describes which matrix norms one can calculate. I was able to get the Frobenius norm by setting ord=2; however, the documentation says that only ord=None gives the Frobenius norm.
Here is my example:
x = np.array([[0, 3, 4],
[1, 6, 4]])
I found that I can get the Frobenius norm with the following line of code:
x_norm = np.linalg.norm(x, ord = 2, axis=1,keepdims=True )
>>> x_norm
array([[ 5. ],
[ 7.28010989]])
My question is whether the documentation here would be considered not as helpful as possible and if this warrants a request to change the description of setting ord=2 in the aforementioned table.
You're not taking a matrix norm. Since you've passed axis=1, you're taking vector norms, and you should be looking at the vector norm column instead of the matrix norm column.
For vector norms, ord=None and ord=2 both produce a 2-norm.
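A small sketch of the difference, using the matrix from the question. The variable names are only illustrative.

```python
import numpy as np

x = np.array([[0, 3, 4],
              [1, 6, 4]])

# With axis=1, norm treats each row as a vector: ord=None and ord=2 agree.
rows_default = np.linalg.norm(x, axis=1)
rows_ord2 = np.linalg.norm(x, ord=2, axis=1)

# Without axis, the 2-D input is treated as a matrix.
frobenius = np.linalg.norm(x)        # ord=None: Frobenius norm
spectral = np.linalg.norm(x, ord=2)  # ord=2: largest singular value

print(rows_default, rows_ord2)
print(frobenius, spectral)
```

The two row-wise results match, while the two matrix norms generally differ.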
I have the following problem: I want to compute the softmax function in Python and get an unexpected result. The code is the following:
import numpy as np
def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
It works perfectly, but I don't know why. It works on matrices as follows: if I insert a 2x2 matrix A, the output is another 2x2 matrix. Why is that? Shouldn't it return a differently sized array, since every element of the matrix, e.g. $x=A[0,0]$, yields 2 output values (namely $\exp(x)/(\exp(A[0,0])+\exp(A[1,0]))$ and $\exp(x)/(\exp(A[0,1])+\exp(A[1,1]))$) because of the axis=0 argument? That would lead to an 8-element output array, but the actual result only has 4 elements.
Also, how exactly does the axis=0 argument work? If I type A=np.array([2, 4]), then the logical result of np.sum(A, axis=0) should be array([2, 4]), since the columns are summed up. But the result is array([6]). And the command np.sum(A, axis=1) strangely yields "'axis' entry is out of bounds", although the result should be array([6]) since the rows are summed up. Maybe my two problems are linked.
Any help will be appreciated!
Thanks,
Leon
I will jump straight to the "final" problem:
matrix_22 / vector_2
That does not make sense as a mathematical matrix operation, so numpy applies a convention (broadcasting). It is the same idea as:
matrix_22 * 5
which multiplies each element of the matrix by 5. Likewise, if we consider matrix_22 as a vector of row vectors, then matrix_22 / vector_2 divides each row of the matrix element-wise by vector_2.
You can easily check that behaviour by executing the following:
np.array([[14, 28], [70, 56]]) / np.array([2, 7])
Notation: matrix_22 is "some variable which contains a numpy array of shape 2x2, so it is a 2x2 matrix". And vector_2 is a numpy array of two elements.
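The sketch below runs that check and ties it back to the softmax from the question. The matrix A is an arbitrary example chosen for illustration.

```python
import numpy as np

matrix_22 = np.array([[14, 28], [70, 56]])
vector_2 = np.array([2, 7])

# Broadcasting pairs vector_2 with every row: column 0 is divided by 2, column 1 by 7.
divided = matrix_22 / vector_2
print(divided)

def softmax(x):
    """Softmax over axis 0 (each column of a 2-D array sums to 1)."""
    e = np.exp(x)
    return e / np.sum(e, axis=0)

A = np.array([[1.0, 2.0], [3.0, 4.0]])  # arbitrary example matrix
S = softmax(A)
print(S.shape)        # same shape as A
print(S.sum(axis=0))  # each column sums to 1

# A 1-D array has only axis 0, so summing over it collapses to a single number.
total = np.sum(np.array([2, 4]), axis=0)
print(total)
```

So each element is divided only by the sum of its own column, which is why a 2x2 input gives a 2x2 output rather than 8 values.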
I am trying to call scipy.stats.multivariate_normal with four different parameters for mu and sigma. And then for each generated probability density function I need to call that pdf on an array of say, 10 values.
For simplicity let's say that above mentioned function is addXY:
def addXY(x, y):
    return x + y

params = [[1, 2], [1, 3], [1, 4], [1, 5]]  # mu and sigma, four versions
inputs = [1, 2, 3]  # values, in this case 3 of them
matrix = []
for pdf_params in params:
    row = []
    for inp in inputs:
        entry = addXY(*pdf_params)
        row.append(entry * inp)
    matrix.append(row)
print(matrix)
Is this pythonic?
Is there a way to pass params and inputs and get a matrix with all combinations in it that is more pythonic/vectorized/faster?
Important notice: the inputs in the example are scalar values (I've used scalars to simplify the problem description; I am actually using an array of n-dimensional vectors, and thus the multivariate_normal pdf).
Hints and tips about similar operations are welcome.
Based on your description of what you are trying to compute, you don't need multivariate_normal. You are calling the PDF method with a set of scalar values for a distribution with a scalar mu and sigma. So you can use the pdf() method of scipy.stats.norm. This method will broadcast its arguments, so by passing in arrays with the proper shape, you can compute the PDF for the different values of mu and sigma in one call. Here's an example.
Here are your x values (you called them inputs), and the parameters:
In [23]: x = np.array([1, 2, 3])
In [24]: params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
For convenience, separate the parameters into arrays of mu and sigma values.
In [25]: mu = params[:, 0]
In [26]: sig = params[:, 1]
We'll use scipy.stats.norm to compute the PDF.
In [27]: from scipy.stats import norm
This call computes the PDF for the desired combinations of x and parameters. mu.reshape(-1, 1) and sig.reshape(-1, 1) are 2D arrays with shape (4, 1). x has shape (3,), so when these arguments are broadcast, the result has shape (4, 3). Each row is the PDF evaluated at x for one of the pairs of mu and sigma.
In [28]: p = norm.pdf(x, loc=mu.reshape(-1, 1), scale=sig.reshape(-1, 1))
In [29]: p
Out[29]:
array([[ 0.19947114, 0.17603266, 0.12098536],
[ 0.13298076, 0.12579441, 0.10648267],
[ 0.09973557, 0.09666703, 0.08801633],
[ 0.07978846, 0.07820854, 0.07365403]])
In other words, the rows of p are:
norm.pdf(x, loc=mu[0], scale=sig[0])
norm.pdf(x, loc=mu[1], scale=sig[1])
norm.pdf(x, loc=mu[2], scale=sig[2])
norm.pdf(x, loc=mu[3], scale=sig[3])
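To see that nothing beyond broadcasting is going on, the same table can be reproduced with plain NumPy by writing out the normal density by hand. This is only an illustrative sketch under that assumption, not how scipy implements pdf().

```python
import numpy as np

x = np.array([1, 2, 3])
params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])

# Column vectors of shape (4, 1) broadcast against x of shape (3,) to give (4, 3).
mu = params[:, 0].reshape(-1, 1)
sig = params[:, 1].reshape(-1, 1)

# Normal density written out by hand (not a scipy call).
p = np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

print(p.shape)
print(p)
```

Each row again corresponds to one (mu, sigma) pair and each column to one value of x.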
This is just one idea for shortening the code and leaning more on library functions.
Your code does not actually use numpy or scipy. The question is whether you want a numpy.array for further data processing.
Option 1: use a list to represent an array and a list of lists to represent a matrix:
from itertools import product
matrix_list = [sum(param) * input_x for param, input_x in product(params, inputs)]
matrix = list(zip(*[iter(matrix_list)] * len(inputs)))
print(matrix)
Credit for the zip idiom goes to the question "convert a flat list to list of list in python".
Option 2: use numpy.array and numpy.matrix for further processing
from itertools import product
import numpy as np
matrix_array = np.array([sum(param)*input_x for param, input_x in product(params, inputs)])
matrix = matrix_array.reshape(len(params),len(inputs))
print(matrix)
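For the toy addXY example, the loop can also be dropped entirely in favour of broadcasting. This is a sketch that assumes the per-parameter function reduces to a sum of the pair, as in the question; pair_sums is an illustrative name.

```python
import numpy as np

params = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
inputs = np.array([1, 2, 3])

# Sum each parameter pair (the toy addXY), then multiply every sum by every input.
pair_sums = params.sum(axis=1)              # shape (4,)
matrix = pair_sums[:, np.newaxis] * inputs  # shape (4, 3) via broadcasting

print(matrix)
```

The column of sums is stretched across the three inputs, giving one row per parameter pair without any Python-level loop.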
I am trying to figure out how to calculate covariance with the Python Numpy function cov. When I pass it two one-dimensional arrays, I get back a 2x2 matrix of results. I don't know what to do with that. I'm not great at statistics, but I believe covariance in such a situation should be a single number. This is what I am looking for. I wrote my own:
def cov(a, b):
    if len(a) != len(b):
        return
    a_mean = np.mean(a)
    b_mean = np.mean(b)
    sum = 0
    for i in range(0, len(a)):
        sum += ((a[i] - a_mean) * (b[i] - b_mean))
    return sum/(len(a)-1)
That works, but I figure the Numpy version is much more efficient, if I could figure out how to use it.
Does anybody know how to make the Numpy cov function perform like the one I wrote?
Thanks,
Dave
When a and b are 1-dimensional sequences, numpy.cov(a,b)[0][1] is equivalent to your cov(a,b).
The 2x2 array returned by np.cov(a,b) has elements equal to
cov(a,a) cov(a,b)
cov(a,b) cov(b,b)
(where, again, cov is the function you defined above.)
Thanks to unutbu for the explanation. By default numpy.cov calculates the sample covariance. To obtain the population covariance you can specify normalisation by the total N samples like this:
numpy.cov(a, b, bias=True)[0][1]
or like this:
numpy.cov(a, b, ddof=0)[0][1]
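A short sketch comparing these calls with a hand-written version. The data and the helper name manual_cov are made up for illustration.

```python
import numpy as np

# Made-up sample data for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]

def manual_cov(a, b, ddof=1):
    # Illustrative helper: sum of products of deviations, divided by N - ddof.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - ddof)

full = np.cov(x, y)                      # 2x2 matrix: variances on the diagonal
sample = full[0, 1]                      # sample covariance (divides by N - 1)
population = np.cov(x, y, ddof=0)[0, 1]  # population covariance (divides by N)

print(full)
print(sample, manual_cov(x, y))
print(population, manual_cov(x, y, ddof=0))
```

The off-diagonal entries of the matrix are the single number the question is after, and the diagonal entries are the variances of x and y.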
Note that starting in Python 3.10, one can obtain the covariance directly from the standard library, using statistics.covariance, which is a measure (the number you're looking for) of the joint variability of two inputs:
from statistics import covariance
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
covariance(x, y)
# 0.75