I have two object arrays not necessarily of the same length:
import numpy as np
class Obj_A:
    def __init__(self,n):
        self.type = 'a'+str(n)
    def __eq__(self,other):
        return self.type==other.type
class Obj_B:
    def __init__(self,n):
        self.type = 'b'+str(n)
    def __eq__(self,other):
        return self.type==other.type
a = np.array([Obj_A(n) for n in range(2)])
b = np.array([Obj_B(n) for n in range(3)])
I would like to generate the matrix
mat = np.array([[[a[0],b[0]],[a[0],b[1]],[a[0],b[2]]],
[[a[1],b[0]],[a[1],b[1]],[a[1],b[2]]]])
this matrix has shape (len(a),len(b),2). Its elements are
mat[i,j] = [a[i],b[j]]
A solution is
mat = np.empty((len(a),len(b),2),dtype='object')
for i,aa in enumerate(a):
    for j,bb in enumerate(b):
        mat[i,j] = np.array([aa,bb],dtype='object')
but this is too expensive for my problem, where len(a) and len(b) are both of order 1e5.
I suspect there is a clean numpy solution involving np.repeat, np.tile and np.transpose, similar to the accepted answer here, but the output in this case does not simply reshape to the desired result.
I would suggest using np.meshgrid(), which takes two input arrays and repeats both along different axes so that looking at corresponding positions of the outputs gets you all possible combinations. For example:
>>> x, y = np.meshgrid([1, 2, 3], [4, 5])
>>> x
array([[1, 2, 3],
[1, 2, 3]])
>>> y
array([[4, 4, 4],
[5, 5, 5]])
In your case, you can put the two arrays together and transpose them into the proper configuration. Based on some experimentation I think this should work for you:
>>> np.transpose(np.meshgrid(a, b), (2, 1, 0))
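As a sanity check, here is a minimal sketch that compares the meshgrid/transpose result with the explicit double loop on small inputs. The Item class is a hypothetical stand-in for Obj_A and Obj_B, and np.stack is used only to make the intermediate (2, len(b), len(a)) shape explicit. Keep in mind that for len(a) = len(b) = 1e5 the full result holds 2e10 object references, roughly 160 GB at 8 bytes per reference, so materializing it may not be feasible whichever way it is built.

```python
import numpy as np

class Item:
    """Hypothetical stand-in for Obj_A / Obj_B."""
    def __init__(self, label):
        self.label = label

a = np.array([Item('a%d' % n) for n in range(2)], dtype=object)
b = np.array([Item('b%d' % n) for n in range(3)], dtype=object)

# meshgrid gives two (len(b), len(a)) grids; stacking them yields (2, len(b), len(a)).
stacked = np.stack(np.meshgrid(a, b))

# Reversing the axes turns that into (len(a), len(b), 2).
mat = np.transpose(stacked, (2, 1, 0))

# Compare with the definition mat[i, j] = [a[i], b[j]] by object identity.
matches = all(
    mat[i, j, 0] is a[i] and mat[i, j, 1] is b[j]
    for i in range(len(a))
    for j in range(len(b))
)
print(mat.shape, matches)
```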
I have a numpy matrix X and I would like to add to this matrix as new variables all the possible products between 2 columns.
So if X=(x1,x2,x3) I want X=(x1,x2,x3,x1x2,x2x3,x1x3)
Is there an elegant way to do that?
I think a combination of numpy and itertools should work
EDIT:
Very good answers, but do they take into account that X is a matrix? So x1, x2, x3 can themselves be arrays?
EDIT:
A Real example
a=array([[1,2,3],[4,5,6]])
Itertools should be the answer here.
import itertools
a = [1, 2, 3]
p = (x * y for x, y in itertools.combinations(a, 2))
print(list(itertools.chain(a, p)))
Result:
[1, 2, 3, 2, 3, 6] # 1, 2, 3, 1 x 2, 1 x 3, 2 x 3
I think Samy's solution is pretty good. If you need to use numpy, you could transform it a little like this:
from itertools import combinations
from numpy import prod
x = [1, 2, 3]
print(x + [int(prod(c)) for c in combinations(x, 2)])
Gives the same output as Samy's solution:
[1, 2, 3, 2, 3, 6]
If your arrays are small, then Samy's pure-Python solution using itertools.combinations should be fine:
from itertools import combinations, chain
def all_products1(a):
    p = (x * y for x, y in combinations(a, 2))
    return list(chain(a, p))
But if your arrays are large, then you'll get a substantial speedup by fully vectorizing the computation, using numpy.triu_indices, like this:
import numpy as np
def all_products2(a):
    x, y = np.triu_indices(len(a), 1)
    return np.r_[a, a[x] * a[y]]
Let's compare these:
>>> data = np.random.uniform(0, 100, (10000,))
>>> timeit(lambda:all_products1(data), number=1)
53.745754408999346
>>> timeit(lambda:all_products2(data), number=1)
12.26144006299728
The solution using numpy.triu_indices also works for multi-dimensional data:
>>> np.random.uniform(0, 100, (3,2))
array([[ 63.75071196, 15.19461254],
[ 94.33972762, 50.76916376],
[ 88.24056878, 90.36136808]])
>>> all_products2(_)
array([[ 63.75071196, 15.19461254],
[ 94.33972762, 50.76916376],
[ 88.24056878, 90.36136808],
[ 6014.22480172, 771.41777239],
[ 5625.39908354, 1373.00597677],
[ 8324.59122432, 4587.57109368]])
If you want to operate on columns rather than rows, use:
def all_products3(a):
    x, y = np.triu_indices(a.shape[1], 1)
    return np.c_[a, a[:,x] * a[:,y]]
For example:
>>> np.random.uniform(0, 100, (2,3))
array([[ 33.0062385 , 28.17575024, 20.42504351],
[ 40.84235995, 61.12417428, 58.74835028]])
>>> all_products3(_)
array([[ 33.0062385 , 28.17575024, 20.42504351, 929.97553238,
674.15385734, 575.4909246 ],
[ 40.84235995, 61.12417428, 58.74835028, 2496.45552756,
2399.42126888, 3590.94440122]])
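To see which column pairs np.triu_indices picks, here is a small sketch that runs all_products3 on the 2x3 array from the question and compares the appended columns with itertools.combinations. The variable names pairs and same_order are only for illustration.

```python
from itertools import combinations
import numpy as np

def all_products3(a):
    # Indices of the strict upper triangle: every pair (i, j) with i < j.
    x, y = np.triu_indices(a.shape[1], 1)
    return np.c_[a, a[:, x] * a[:, y]]

a = np.array([[1, 2, 3], [4, 5, 6]])
result = all_products3(a)

# The same column pairs, generated in pure Python.
pairs = list(combinations(range(a.shape[1]), 2))
same_order = all(
    np.array_equal(result[:, a.shape[1] + k], a[:, i] * a[:, j])
    for k, (i, j) in enumerate(pairs)
)
print(result)
print(pairs, same_order)
```

The appended columns come out in the order x1x2, x1x3, x2x3, which is the order itertools.combinations produces.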
Say I want to calculate a value for every point on a grid. I would define some function func that takes two values x and y as parameters and returns a third value. In the example below, calculating this value requires a look-up in an external dictionary. I would then generate a grid of points and evaluate func on each of them to get my desired result.
The code below does precisely this, but in a somewhat roundabout way. First I reshape both the X and Y coordinate matrices into one-dimensional arrays, calculate all the values, and then reshape the result back into a matrix. My question is, can this be done in a more elegant manner?
import collections as c
import numpy as np
# some arbitrary lookup table
a = c.defaultdict(int)
a[1] = 2
a[2] = 3
a[3] = 2
a[4] = 3
def func(x,y):
    # some arbitrary function
    return a[x] + a[y]
X,Y = np.mgrid[1:3, 1:4]
X = X.T
Y = Y.T
Z = np.array([func(x,y) for (x,y) in zip(X.ravel(), Y.ravel())]).reshape(X.shape)
print(Z)
The purpose of this code is to generate a set of values that I can use with pcolor in matplotlib to create a heatmap-type plot.
I'd use numpy.vectorize to "vectorize" your function. Note that despite the name, vectorize is not intended to make your code run faster -- it just simplifies it a bit.
Here are some examples:
>>> import numpy as np
>>> @np.vectorize
... def foo(a, b):
...     return a + b
...
>>> foo([1,3,5], [2,4,6])
array([ 3, 7, 11])
>>> foo(np.arange(9).reshape(3,3), np.arange(9).reshape(3,3))
array([[ 0, 2, 4],
[ 6, 8, 10],
[12, 14, 16]])
With your code, it should be enough to decorate func with np.vectorize and then you can probably just call it as func(X, Y) -- no raveling or reshaping necessary:
import numpy as np
import collections as c
# some arbitrary lookup table
a = c.defaultdict(int)
a[1] = 2
a[2] = 3
a[3] = 2
a[4] = 3
@np.vectorize
def func(x,y):
    # some arbitrary function
    return a[x] + a[y]
X,Y = np.mgrid[1:3, 1:4]
X = X.T
Y = Y.T
Z = func(X, Y)
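Since np.vectorize still calls the Python function once per element, a different technique may be worth trying if speed matters: copy the dictionary into a NumPy array and use integer (fancy) indexing. This is only a sketch, and it assumes the keys are small non-negative integers, which holds for the example table but may not hold for your real data. The names lookup and table are made up for the illustration.

```python
import collections
import numpy as np

# Same lookup table as in the question; missing keys default to 0.
lookup = collections.defaultdict(int, {1: 2, 2: 3, 3: 2, 4: 3})

X, Y = np.mgrid[1:3, 1:4]
X, Y = X.T, Y.T

# Dense copy of the dictionary: table[k] == lookup[k] for 0 <= k <= largest key.
table = np.zeros(max(lookup) + 1, dtype=int)
for key, value in lookup.items():
    table[key] = value

# Integer-array indexing looks up every grid point without a Python-level loop.
Z_fast = table[X] + table[Y]

# Reference result from np.vectorize, for comparison.
Z_vec = np.vectorize(lambda x, y: lookup[x] + lookup[y])(X, Y)
print(Z_fast)
print(np.array_equal(Z_fast, Z_vec))
```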
I want to calculate the cosine similarity between two lists, let's say for example list 1 which is dataSetI and list 2 which is dataSetII.
Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The two lists always have the same length. I want to report cosine similarity as a number between 0 and 1.
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
def cosine_similarity(list1, list2):
    # How to?
    pass
print(cosine_similarity(dataSetI, dataSetII))
Another version, based on NumPy only:
from numpy import dot
from numpy.linalg import norm
cos_sim = dot(a, b)/(norm(a)*norm(b))
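The snippet above leaves a and b undefined. A complete sketch using the two lists from the question, converted to arrays (the conversion is my assumption about what a and b are meant to be):

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

# The two data sets from the question, as NumPy arrays.
a = np.array([3, 45, 7, 2])
b = np.array([2, 54, 13, 15])

# Dot product divided by the product of the Euclidean norms.
cos_sim = dot(a, b) / (norm(a) * norm(b))
print(cos_sim)
```

For vectors with only non-negative entries, as here, the result lies between 0 and 1. In general cosine similarity ranges from -1 to 1.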
You should try SciPy. It has a bunch of useful scientific routines, for example "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the superfast optimized NumPy for its number crunching. See here for installing.
Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
You can use the cosine_similarity function from sklearn.metrics.pairwise (docs):
In [23]: from sklearn.metrics.pairwise import cosine_similarity
In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])
I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:
import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))
Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712
That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.
ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.
CPython 2.7.3 on a 3.0GHz Core 2 Duo:
>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264
So, the unpythonic way is about 3.6 times faster in this case.
Without using any imports,
math.sqrt(x)
can be replaced with
x** .5
Without numpy.dot() you have to write your own dot function, here using a generator expression:
def dot(A,B):
    return (sum(a*b for a,b in zip(A,B)))
and then it's just a simple matter of applying the cosine similarity formula:
def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )
I benchmarked several of the answers to this question, and the following snippet appears to be the best choice:
import math
import operator
def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))
def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)
I was surprised that the implementation based on scipy was not the fastest one. Profiling showed that cosine in scipy spends a lot of time converting the vectors from Python lists to numpy arrays.
Python code to calculate:
Cosine Distance
Cosine Similarity
Angular Distance
Angular Similarity
import math
from scipy import spatial
def calculate_cosine_distance(a, b):
    cosine_distance = float(spatial.distance.cosine(a, b))
    return cosine_distance
def calculate_cosine_similarity(a, b):
    cosine_similarity = 1 - calculate_cosine_distance(a, b)
    return cosine_similarity
def calculate_angular_distance(a, b):
    cosine_similarity = calculate_cosine_similarity(a, b)
    angular_distance = math.acos(cosine_similarity) / math.pi
    return angular_distance
def calculate_angular_similarity(a, b):
    angular_similarity = 1 - calculate_angular_distance(a, b)
    return angular_similarity
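One thing to watch with the angular variants: floating-point rounding can push a computed cosine similarity slightly outside [-1, 1], in which case math.acos raises a ValueError. Below is a minimal sketch that clamps the value first. It uses plain Python instead of SciPy so that it is self-contained, and the function names are hypothetical, not part of any library.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_distance(a, b):
    # Clamp to [-1, 1] so rounding error cannot take acos out of its domain.
    similarity = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(similarity) / math.pi

def angular_similarity(a, b):
    return 1 - angular_distance(a, b)

print(angular_distance([1, 0], [0, 1]))    # orthogonal vectors
print(angular_distance([1, 0], [-1, 0]))   # opposite vectors
print(angular_similarity([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))
```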
Similarity Search:
If you want to find the closest cosine similarity in an array of embeddings, you can use TensorFlow, as in the following code.
In my testing, the closest value to an embedding of shape 1x512 was found among 1M embeddings (1'000'000 x 512) in less than a second (using a GPU).
import time
import numpy as np # np.__version__ == '1.23.5'
import tensorflow as tf # tf.__version__ == '2.11.0'
EMBEDDINGS_LENGTH = 512
NUMBER_OF_EMBEDDINGS = 1000 * 1000
def calculate_cosine_similarities(x, embeddings):
    cosine_similarities = -1 * tf.keras.losses.cosine_similarity(x, embeddings)
    return cosine_similarities.numpy()
def find_closest_embeddings(x, embeddings, top_k=1):
    cosine_similarities = calculate_cosine_similarities(x, embeddings)
    values, indices = tf.math.top_k(cosine_similarities, k=top_k)
    return values.numpy(), indices.numpy()
def main():
    # x shape: (512)
    # Embeddings shape: (1000000, 512)
    x = np.random.rand(EMBEDDINGS_LENGTH).astype(np.float32)
    embeddings = np.random.rand(NUMBER_OF_EMBEDDINGS, EMBEDDINGS_LENGTH).astype(np.float32)
    print('Embeddings shape: ', embeddings.shape)
    n = 100
    sum_duration = 0
    for i in range(n):
        start = time.time()
        best_values, best_indices = find_closest_embeddings(x, embeddings, top_k=1)
        end = time.time()
        duration = end - start
        sum_duration += duration
        print('Duration (seconds): {}, Best value: {}, Best index: {}'.format(duration, best_values[0], best_indices[0]))
    # Average duration (seconds): 1.707 for Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
    # Average duration (seconds): 0.961 for NVIDIA 1080 ti
    print('Average duration (seconds): ', sum_duration / n)
if __name__ == '__main__':
    main()
For more advanced similarity search, you can use Milvus, Weaviate or Faiss.
https://en.wikipedia.org/wiki/Cosine_similarity
https://gist.github.com/amir-saniyan/e102de09b01c4ed1632e3d1a1a1cbf64
import math
# In Python 3 the built-in zip is lazy, like Python 2's itertools.izip
def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))
def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)
You can round it after computing:
cosine = format(round(cosine_measure(v1, v2), 3))
If you want it really short, you can use this one-liner:
from math import sqrt
from functools import reduce
def cosine_measure(v1, v2):
    # accumulate (dot product, sum of squares of v1, sum of squares of v2) in one pass
    return (lambda t: t[0] / sqrt(t[1] * t[2]))(reduce(lambda x, y: (x[0] + y[0] * y[1], x[1] + y[0]**2, x[2] + y[1]**2), zip(v1, v2), (0, 0, 0)))
You can use this simple function to calculate the cosine similarity:
import math
def cosine_similarity(a, b):
    return sum([i*j for i,j in zip(a, b)])/(math.sqrt(sum([i*i for i in a]))* math.sqrt(sum([i*i for i in b])))
You can do this in Python using a simple function:
import math
def get_cosine(text1, text2):
    # treat each list as a vector keyed by position, since the code below expects dicts
    vec1 = dict(enumerate(text1))
    vec2 = dict(enumerate(text2))
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return round(float(numerator) / denominator, 3)
dataSet1 = [3, 45, 7, 2]
dataSet2 = [2, 54, 13, 15]
get_cosine(dataSet1, dataSet2)
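Using NumPy to compare one list of numbers to multiple lists (a matrix):
import numpy as np
def cosine_similarity(vector,matrix):
    # the trailing [::-1] reverses the result, so the scores come back in reverse row order
    return ( np.sum(vector*matrix,axis=1) / ( np.sqrt(np.sum(matrix**2,axis=1)) * np.sqrt(np.sum(vector**2)) ) )[::-1]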
If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.
Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:
import torch
import torch.nn as nn
cos = nn.CosineSimilarity()
cos(torch.tensor([v1]), torch.tensor([v2])).item()
Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:
cos(torch.tensor(w1), torch.tensor(w2)).tolist()
You can use SciPy (easiest way):
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(1 - spatial.distance.cosine(dataSetI, dataSetII))
Note that spatial.distance.cosine() gives you a dissimilarity (distance) value, and thus to get the similarity, you need to subtract that value from 1.
Another way to get to the solution is to write the function yourself, so that it also handles lists of different lengths:
import math
def cosineSimilarity(v1, v2):
    scalarProduct = moduloV1 = moduloV2 = 0
    # pad the shorter list with zeros (in place) so both have the same length
    if len(v1) > len(v2):
        v2.extend(0 for _ in range(len(v1) - len(v2)))
    else:
        v1.extend(0 for _ in range(len(v2) - len(v1)))
    for i in range(len(v1)):
        scalarProduct += v1[i] * v2[i]
        moduloV1 += v1[i] * v1[i]
        moduloV2 += v2[i] * v2[i]
    return round(scalarProduct/(math.sqrt(moduloV1) * math.sqrt(moduloV2)), 3)
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(cosineSimilarity(dataSetI, dataSetII))
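Padding with extend modifies the caller's lists. If that is a concern, here is a sketch of the same zero-padding idea using itertools.zip_longest, which leaves the inputs untouched. The function name cosine_similarity_padded is made up for this example.

```python
import math
from itertools import zip_longest

def cosine_similarity_padded(v1, v2):
    # zip_longest fills the shorter vector with zeros without changing it.
    pairs = list(zip_longest(v1, v2, fillvalue=0))
    dot = sum(x * y for x, y in pairs)
    norm1 = math.sqrt(sum(x * x for x, _ in pairs))
    norm2 = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm1 * norm2)

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(round(cosine_similarity_padded(dataSetI, dataSetII), 3))

short = [1, 0]
longer = [1, 0, 0]
print(cosine_similarity_padded(short, longer), short, longer)
```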
Another version: if you have a list of vectors and a query vector, and you want the cosine similarity of the query vector with every vector in the list, you can compute them all in one go like this:
>>> import numpy as np
>>> A # list of vectors, shape -> m x n
array([[ 3, 45, 7, 2],
[ 1, 23, 3, 4]])
>>> B # query vector, shape -> (n,)
array([ 2, 54, 13, 15])
>>> similarity_scores = A.dot(B)/ (np.linalg.norm(A, axis=1) * np.linalg.norm(B))
>>> similarity_scores
array([0.97228425, 0.99026919])
We can easily calculate cosine similarity with simple mathematical equations.
Cosine_similarity = dot product of the vectors / (product of the norms of the vectors). Subtracting this value from 1 gives the cosine distance instead. We can define two functions, one for the dot product and one for the norm.
def dprod(a,b):
    total=0
    for i in range(len(a)):
        total+=a[i]*b[i]
    return total
def norm(a):
    sq_sum=0
    for i in range(len(a)):
        sq_sum+=a[i]**2
    return sq_sum**0.5
cosine_a_b = dprod(a,b)/(norm(a)*norm(b))
Here is an implementation that would work for matrices as well. Its behaviour is exactly like sklearn cosine similarity:
def cosine_similarity(a, b):
    return np.divide(
        np.dot(a, b.T),
        np.linalg.norm(
            a,
            axis=1,
            keepdims=True
        )
        @  # matrix multiplication
        np.linalg.norm(
            b,
            axis=1,
            keepdims=True
        ).T
    )
The @ symbol stands for matrix multiplication. See
What does the "at" (@) symbol do in Python?
All the answers are great for situations where you cannot use NumPy. If you can, here is another approach:
def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)
Also bear in mind that EPSILON = 1e-07 must be defined before calling the function; it guards the division against a zero denominator.
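A self-contained sketch of the function with EPSILON defined, showing what the guard buys you: an all-zero vector gives 0.0 instead of a division-by-zero warning and nan. The test vectors are only illustrative.

```python
import numpy as np

EPSILON = 1e-07  # small constant that keeps the denominator non-zero

def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)

v = np.array([3.0, 45.0, 7.0, 2.0])
zero = np.zeros(4)

print(cosine(v, v))      # very close to 1, slightly below because of EPSILON
print(cosine(v, zero))   # 0.0 rather than nan
```

The price is a tiny bias: the result for identical vectors is just under 1 rather than exactly 1.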