Cosine Similarity between 2 Number Lists

I want to calculate the cosine similarity between two lists, let's say for example list 1 which is dataSetI and list 2 which is dataSetII.
Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The length of the lists are always equal. I want to report cosine similarity as a number between 0 and 1.
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
def cosine_similarity(list1, list2):
    # How to?
    pass
print(cosine_similarity(dataSetI, dataSetII))

another version based on numpy only
from numpy import dot
from numpy.linalg import norm
cos_sim = dot(a, b)/(norm(a)*norm(b))
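Applied to the two lists from the question, the NumPy expression above could be wrapped as in the sketch below. The function name cosine_similarity and the zero-vector check are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length number sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # The cosine is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(cosine_similarity(dataSetI, dataSetII))  # roughly 0.9723
```

For vectors with only non-negative entries, as here, the result falls between 0 and 1; with mixed signs it can be anywhere in [-1, 1].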

You should try SciPy. It has a bunch of useful scientific routines, for example "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the fast, optimized NumPy for its number crunching.
Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)

You can use the cosine_similarity function from sklearn.metrics.pairwise:
In [23]: from sklearn.metrics.pairwise import cosine_similarity
In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])

I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:
import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))
Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712
That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.
ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.
CPython 2.7.3 on 3.0GHz Core 2 Duo:
>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264
So, the unpythonic way is about 3.6 times faster in this case.

without using any imports
math.sqrt(x)
can be replaced with
x** .5
Without numpy.dot() you have to create your own dot function, here using a generator expression:
def dot(A,B):
    return (sum(a*b for a,b in zip(A,B)))
and then it's just a simple matter of applying the cosine similarity formula:
def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )

I ran a benchmark based on several answers to this question, and the following snippet appears to be the best choice:
import math
import operator
def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))
def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)
I was surprised that the SciPy-based implementation is not the fastest one. Profiling showed that cosine in SciPy spends a lot of time converting the vectors from Python lists to NumPy arrays.

Python code to calculate:
Cosine Distance
Cosine Similarity
Angular Distance
Angular Similarity
import math
from scipy import spatial
def calculate_cosine_distance(a, b):
    cosine_distance = float(spatial.distance.cosine(a, b))
    return cosine_distance
def calculate_cosine_similarity(a, b):
    cosine_similarity = 1 - calculate_cosine_distance(a, b)
    return cosine_similarity
def calculate_angular_distance(a, b):
    cosine_similarity = calculate_cosine_similarity(a, b)
    angular_distance = math.acos(cosine_similarity) / math.pi
    return angular_distance
def calculate_angular_similarity(a, b):
    angular_similarity = 1 - calculate_angular_distance(a, b)
    return angular_similarity
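One practical detail with the angular distance: floating-point rounding can leave a cosine similarity marginally outside [-1, 1], and math.acos then raises a ValueError. The sketch below is a pure-Python variant that clamps the value first. The name safe_angular_distance and the clamping step are assumptions for illustration, not part of the answer above.

```python
import math


def safe_angular_distance(a, b):
    """Angular distance in [0, 1]; clamps the cosine before calling acos."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b)
    # Rounding can push the ratio slightly past +/-1, which acos rejects.
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine) / math.pi


print(safe_angular_distance([1, 0], [0, 1]))  # orthogonal vectors: 0.5
```

Identical directions give a distance near 0 and opposite directions give 1.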
Similarity Search:
If you want to find the closest cosine similarity in an array of embeddings, you can use TensorFlow, as in the following code.
In my testing, the closest value to an embedding of shape 1x512 was found among 1M embeddings (1'000'000 x 512) in less than a second (using a GPU).
import time
import numpy as np # np.__version__ == '1.23.5'
import tensorflow as tf # tf.__version__ == '2.11.0'
EMBEDDINGS_LENGTH = 512
NUMBER_OF_EMBEDDINGS = 1000 * 1000
def calculate_cosine_similarities(x, embeddings):
    cosine_similarities = -1 * tf.keras.losses.cosine_similarity(x, embeddings)
    return cosine_similarities.numpy()
def find_closest_embeddings(x, embeddings, top_k=1):
    cosine_similarities = calculate_cosine_similarities(x, embeddings)
    values, indices = tf.math.top_k(cosine_similarities, k=top_k)
    return values.numpy(), indices.numpy()
def main():
    # x shape: (512)
    # Embeddings shape: (1000000, 512)
    x = np.random.rand(EMBEDDINGS_LENGTH).astype(np.float32)
    embeddings = np.random.rand(NUMBER_OF_EMBEDDINGS, EMBEDDINGS_LENGTH).astype(np.float32)
    print('Embeddings shape: ', embeddings.shape)
    n = 100
    sum_duration = 0
    for i in range(n):
        start = time.time()
        best_values, best_indices = find_closest_embeddings(x, embeddings, top_k=1)
        end = time.time()
        duration = end - start
        sum_duration += duration
        print('Duration (seconds): {}, Best value: {}, Best index: {}'.format(duration, best_values[0], best_indices[0]))
    # Average duration (seconds): 1.707 for Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
    # Average duration (seconds): 0.961 for NVIDIA 1080 ti
    print('Average duration (seconds): ', sum_duration / n)
if __name__ == '__main__':
    main()
For more advanced similarity search, you can use Milvus, Weaviate or Faiss.
https://en.wikipedia.org/wiki/Cosine_similarity
https://gist.github.com/amir-saniyan/e102de09b01c4ed1632e3d1a1a1cbf64

import math
# zip is lazy in Python 3 (it replaces itertools.izip from Python 2)
def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))
def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)
You can round it after computing:
cosine = format(round(cosine_measure(v1, v2), 3))
If you want it really short, you can use this one-liner:
from math import sqrt
from functools import reduce
def cosine_measure(v1, v2):
    # Python 3 has no tuple-parameter lambdas, so the accumulated tuple is indexed instead
    return (lambda t: t[0] / sqrt(t[1] * t[2]))(reduce(lambda x, y: (x[0] + y[0] * y[1], x[1] + y[0]**2, x[2] + y[1]**2), zip(v1, v2), (0, 0, 0)))

You can use this simple function to calculate the cosine similarity:
import math
def cosine_similarity(a, b):
    return sum([i*j for i,j in zip(a, b)])/(math.sqrt(sum([i*i for i in a]))* math.sqrt(sum([i*i for i in b])))

You can do this in Python using a simple function that takes two dict-like vectors (key -> value):
import math
def get_cosine(vec1, vec2):
    # only keys present in both vectors contribute to the dot product
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return round(float(numerator) / denominator, 3)
dataSet1 = [3, 45, 7, 2]
dataSet2 = [2, 54, 13, 15]
# lists have no keys(), so map each position to its value first
get_cosine(dict(enumerate(dataSet1)), dict(enumerate(dataSet2)))

Using numpy, compare one list of numbers to multiple lists (a matrix):
def cosine_similarity(vector,matrix):
    return ( np.sum(vector*matrix,axis=1) / ( np.sqrt(np.sum(matrix**2,axis=1)) * np.sqrt(np.sum(vector**2)) ) )[::-1]

If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.
Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:
import torch
import torch.nn as nn
cos = nn.CosineSimilarity()
cos(torch.tensor([v1]), torch.tensor([v2])).item()
Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:
cos(torch.tensor(w1), torch.tensor(w2)).tolist()

You can use SciPy (easiest way):
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(1 - spatial.distance.cosine(dataSetI, dataSetII))
Note that spatial.distance.cosine() gives you a dissimilarity (distance) value, and thus to get the similarity, you need to subtract that value from 1.
Another way is to write the function yourself, which can also handle lists of different lengths by padding the shorter one with zeros:
import math
def cosineSimilarity(v1, v2):
    scalarProduct = moduloV1 = moduloV2 = 0
    # pad the shorter list with zeros (note: this modifies that list in place)
    if len(v1) > len(v2):
        v2.extend(0 for _ in range(len(v1) - len(v2)))
    else:
        v1.extend(0 for _ in range(len(v2) - len(v1)))
    for i in range(len(v1)):
        scalarProduct += v1[i] * v2[i]
        moduloV1 += v1[i] * v1[i]
        moduloV2 += v2[i] * v2[i]
    return round(scalarProduct/(math.sqrt(moduloV1) * math.sqrt(moduloV2)), 3)
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(cosineSimilarity(dataSetI, dataSetII))

Another version, if you have a scenario where you have list of vectors and a query vector and you want to compute the cosine similarity of query vector with all the vectors in the list, you can do it in one go in the below fashion:
>>> import numpy as np
>>> A # list of vectors, shape -> m x n
array([[ 3, 45, 7, 2],
[ 1, 23, 3, 4]])
>>> B # query vector, shape -> 1 x n
array([ 2, 54, 13, 15])
>>> similarity_scores = A.dot(B)/ (np.linalg.norm(A, axis=1) * np.linalg.norm(B))
>>> similarity_scores
array([0.97228425, 0.99026919])

We can easily calculate cosine similarity with simple mathematical equations.
Cosine_similarity = dot product of the vectors / (product of the norms of the vectors); subtracting that value from 1 gives the cosine distance instead. We can define two functions, one for the dot product and one for the norm.
def dprod(a,b):
    sum=0
    for i in range(len(a)):
        sum+=a[i]*b[i]
    return sum
def norm(a):
    norm=0
    for i in range(len(a)):
        norm+=a[i]**2
    return norm**0.5
cosine_a_b = dprod(a,b)/(norm(a)*norm(b))

Here is an implementation that works for matrices as well. Its behaviour is exactly like sklearn cosine similarity:
def cosine_similarity(a, b):
    return np.divide(
        np.dot(a, b.T),
        np.linalg.norm(
            a,
            axis=1,
            keepdims=True
        )
        @  # matrix multiplication
        np.linalg.norm(
            b,
            axis=1,
            keepdims=True
        ).T
    )
The @ symbol stands for matrix multiplication. See
What does the "at" (@) symbol do in Python?

All the answers are great for situations where you cannot use NumPy. If you can, here is another approach:
EPSILON = 1e-07  # small constant that keeps the division safe
def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)
Bear in mind that EPSILON = 1e-07 is there to guard against division by zero.

Related

numpy create array of the max of consecutive pairs in another array

I have a numpy array:
A = np.array([8, 2, 33, 4, 3, 6])
What I want is to create another array B where each element is the max of a pair of consecutive elements in A, so I get:
B = np.array([8, 33, 33, 4, 6])
Any ideas on how to implement this?
Any ideas on how to implement this for more than 2 elements? (The same thing, but for n consecutive elements.)
Edit:
The answers gave me a way to solve this question, but for the n-size window case, is there a more efficient way that does not require loops?
Edit2:
Turns out that the question is equivalent for asking how to perform 1d max-pooling of a list with a window of size n.
Does anyone know how to implement this efficiently?

One solution to the pairwise problem is using the np.maximum function and array slicing:
B = np.maximum(A[:-1], A[1:])

A loop-free solution is to use max on the windows created by skimage.util.view_as_windows:
list(map(max, view_as_windows(A, (2,))))
[8, 33, 33, 4, 6]
Copy/pastable example:
import numpy as np
from skimage.util import view_as_windows
A = np.array([8, 2, 33, 4, 3, 6])
list(map(max, view_as_windows(A, (2,))))

Here is an approach specifically tailored for larger windows. It is O(1) in window size and O(n) in data size.
I've done a pure numpy and a pythran implementation.
How do we achieve O(1) in window size? We use a "sawtooth" trick: If w is the window width we group the data into lots of w and for each group we do the cumulative maximum from left to right and from right to left. The elements of any in-between window distribute over two groups and the maxima of the intersections are among the cumulative maxima we have computed earlier. So we need a total of 3 comparisons per data point.
benchit (thanks @Divakar) for w=100; my functions are pp (numpy) and winmax (pythran):
For small window size w=5 the picture is more even. Interestingly, pythran still has a huge edge even for very small sizes. They must be doing something right to minimize call overhead.
python code:
cummax = np.maximum.accumulate
def pp(a,w):
    N = a.size//w
    if a.size-w+1 > N*w:
        out = np.empty(a.size-w+1,a.dtype)
        out[:-1] = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-1:-1]
        out[-1] = a[w*N:].max()
    else:
        out = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-2:-1]
    out[1:N*w-w+1] = np.maximum(out[1:N*w-w+1],
                                cummax(a[w:w*N].reshape(N-1,w),axis=1).ravel())
    out[N*w-w+1:] = np.maximum(out[N*w-w+1:],cummax(a[N*w:]))
    return out
pythran version; compile with pythran -O3 <filename.py>; this creates a compiled module which you can import:
import numpy as np
# pythran export winmax(float[:],int)
# pythran export winmax(int[:],int)
def winmax(data,winsz):
    N = data.size//winsz
    if N < 1:
        raise ValueError
    out = np.empty(data.size-winsz+1,data.dtype)
    nxt = winsz
    for j in range(winsz,data.size):
        if j == nxt:
            nxt += winsz
            out[j+1-winsz] = data[j]
        else:
            out[j+1-winsz] = out[j-winsz] if out[j-winsz]>data[j] else data[j]
    running = data[-winsz:N*winsz].max()
    nxt -= winsz << (nxt > data.size)
    for j in range(data.size-winsz,0,-1):
        if j == nxt:
            nxt -= winsz
            running = data[j-1]
        else:
            running = data[j] if data[j] > running else running
        out[j] = out[j] if out[j] > running else running
    out[0] = data[0] if data[0] > running else running
    return out
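The "sawtooth" idea described above (cumulative maxima inside blocks of width w, taken from both directions) may be easier to follow in a deliberately simple form. The sketch below is not the pp or winmax code from this answer; the name sliding_max_blocks and the padding with the array minimum are assumptions made to keep the illustration short.

```python
import numpy as np


def sliding_max_blocks(a, w):
    """Sliding-window maximum via per-block prefix and suffix maxima."""
    a = np.asarray(a)
    n = a.size
    if not 1 <= w <= n:
        raise ValueError("window must satisfy 1 <= w <= len(a)")
    # Pad to a multiple of w with the minimum, so padding can never win a max.
    pad = (-n) % w
    padded = np.concatenate([a, np.full(pad, a.min(), dtype=a.dtype)])
    blocks = padded.reshape(-1, w)
    # prefix[k]: max from the start of k's block up to k (left to right).
    prefix = np.maximum.accumulate(blocks, axis=1).ravel()
    # suffix[k]: max from k to the end of k's block (right to left).
    suffix = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()
    m = n - w + 1
    # A window [i, i+w-1] is the tail of one block plus the head of the next.
    return np.maximum(suffix[:m], prefix[w - 1:w - 1 + m])


A = np.array([8, 2, 33, 4, 3, 6])
print(sliding_max_blocks(A, 2))  # [ 8 33 33  4  6]
print(sliding_max_blocks(A, 3))  # [33 33 33  6]
```

Each element is touched a constant number of times regardless of w, which is where the O(1)-in-window-size behaviour comes from.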

In this Q&A, we are basically asking for sliding max values. This has been explored before in "Max in a sliding window in NumPy array". Since we are looking to be efficient, we can look further. One option is numba, and here are the two final variants I ended up with, which leverage the parallel directive for a performance boost over the non-parallel version:
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def numba1(a, W):
    L = len(a)-W+1
    out = np.empty(L, dtype=a.dtype)
    v = np.iinfo(a.dtype).min
    for i in prange(L):
        max1 = v
        for j in range(W):
            cur = a[i + j]
            if cur>max1:
                max1 = cur
        out[i] = max1
    return out
@njit(parallel=True)
def numba2(a, W):
    L = len(a)-W+1
    out = np.empty(L, dtype=a.dtype)
    for i in prange(L):
        out[i] = a[i]  # initialise first: np.empty leaves the values undefined
        for j in range(W):
            cur = a[i + j]
            if cur>out[i]:
                out[i] = cur
    return out
From the earlier linked Q&A, the equivalent SciPy version would be -
from scipy.ndimage import maximum_filter1d  # the scipy.ndimage.filters namespace is deprecated
def scipy_max_filter1d(a, W):
    L = len(a)-W+1
    hW = W//2 # Half window size
    return maximum_filter1d(a,size=W)[hW:hW+L]
Benchmarking
Other posted working approaches for a generic window argument:
from skimage.util import view_as_windows
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
# @mathfux's soln
def npmax_strided(a,n):
    return np.max(rolling(a, n), axis=1)
# @Nicolas Gervais's soln
def mapmax_strided(a, W):
    return list(map(max, view_as_windows(a,W)))
cummax = np.maximum.accumulate
def pp(a,w):
    N = a.size//w
    if a.size-w+1 > N*w:
        out = np.empty(a.size-w+1,a.dtype)
        out[:-1] = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-1:-1]
        out[-1] = a[w*N:].max()
    else:
        out = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-2:-1]
    out[1:N*w-w+1] = np.maximum(out[1:N*w-w+1],
                                cummax(a[w:w*N].reshape(N-1,w),axis=1).ravel())
    out[N*w-w+1:] = np.maximum(out[N*w-w+1:],cummax(a[N*w:]))
    return out
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions:
import benchit
funcs = [mapmax_strided, npmax_strided, numba1, numba2, scipy_max_filter1d, pp]
in_ = {(n,W):(np.random.randint(0,100,n),W) for n in 10**np.arange(2,6) for W in [2, 10, 20, 50, 100]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['Array-length', 'Window-length'])
t.plot(logx=True, sp_ncols=1, save='timings.png')
So, the numba versions are great for window sizes lower than 10, where there is no clear winner, and for larger window sizes pp wins, with the SciPy version in second place.

For the case of n consecutive items, the extended solution requires looping over the n shifted slices:
np.maximum.reduce([A[i:len(A)-n+i+1] for i in range(n)])
In order to avoid it you can use stride tricks and convert A to array of n-length blocks:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
For example:
>>> rolling(A, 3)
array([[ 8,  2, 33],
       [ 2, 33,  4],
       [33,  4,  3],
       [ 4,  3,  6]])
Once that's done, you can finish with np.max(rolling(A, n), axis=1).
Despite its elegance, neither this solution nor the first one is efficient, because we repeatedly apply maximum to adjacent blocks that differ by only two items.
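If your NumPy is 1.20 or newer, sliding_window_view builds the same kind of windowed view as the as_strided helper above without computing strides by hand. A minimal sketch follows; the wrapper name rolling_max is an assumption for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_max(a, n):
    """Max over every run of n consecutive items, using a read-only view."""
    windows = sliding_window_view(np.asarray(a), n)  # shape: (len(a) - n + 1, n)
    return windows.max(axis=1)


A = np.array([8, 2, 33, 4, 3, 6])
print(rolling_max(A, 2))  # [ 8 33 33  4  6]
print(rolling_max(A, 3))  # [33 33 33  6]
```

This shares the efficiency caveat of the strided approach: every window is still reduced separately, so the cost grows with n.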

A recursive solution, for all n:
import numpy as np
import sys
def recursive(a: np.ndarray, n: int, b=None, level=2):
    if n <= 0 or n > len(a):
        raise ValueError(f'len(a):{len(a)} n:{n}')
    if n == 1:
        return a
    if len(a) == n:
        return np.max(a)
    b = np.maximum(a[:-1], a[1:]) if b is None else np.maximum(a[level - 1:], b)
    if n == level:
        return b
    return recursive(a, n, b[:-1], level + 1)
test_data = np.array([8, 2, 33, 4, 3, 6])
for test_n in range(1, len(test_data) + 2):
    try:
        print(recursive(test_data, n=test_n))
    except ValueError as e:
        sys.stderr.write(str(e))
output
[ 8 2 33 4 3 6]
[ 8 33 33 4 6]
[33 33 33 6]
[33 33 33]
[33 33]
33
len(a):6 n:7
about recursive function
You can observe the following data, and then you will know how to write the recursive function.
"""
np.array([8, 2, 33, 4, 3, 6])
n=2: (8, 2), (2, 33), (33, 4), (4, 3), (3, 6) => [8, 33, 33, 4, 6] => B' = [8, 33, 33, 4]
n=3: (8, 2, 33), (2, 33, 4), (33, 4, 3), (4, 3, 6) => B' [33, 4, 3, 6] => np.maximum([8, 33, 33, 4], [33, 4, 3, 6]) => 33, 33, 33, 6
...
"""

Using Pandas:
A = pd.Series([8, 2, 33, 4, 3, 6])
res = pd.concat([A,A.shift(-1)],axis=1).max(axis=1,skipna=False).dropna()
>>res
0 8.0
1 33.0
2 33.0
3 4.0
4 6.0
Or using numpy:
np.vstack([A[1:],A[:-1]]).max(axis=0)

Memory efficient mean pairwise distance

I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626
In the example above y gets quite large (nearly 2,500 times as large as the input array):
>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793
But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead, the mean of each row (or column) can be computed separately. The final mean value can then be computed from the row mean values.
Is there already a function which exploits this property, or is there an easy way to extend/combine existing functions to do so?

If you use the squared Euclidean distance, the mean is equal to twice the sum of the per-column variances computed with n-1 (ddof=1):
from scipy.spatial.distance import pdist, squareform
import numpy as np
x = np.random.rand(10000, 2)
y = np.array([[1,1], [0,0], [2,0]])
print(pdist(x, 'sqeuclidean').mean())
print(np.var(x, 0, ddof=1).sum()*2)
>>0.331474285845873
0.33147428584587346

You will have to weight each row by the number of observations that make up the mean. For example the pdist of a 3 x 2 matrix is the flattened upper triangle (offset of 1) of the squareform 3 x 3 distance matrix.
arr = np.arange(6).reshape(3,2)
arr
array([[0, 1],
[2, 3],
[4, 5]])
pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])
from sklearn.metrics import pairwise_distances
square = pairwise_distances(arr)
square
array([[0. , 2.82842712, 5.65685425],
[2.82842712, 0. , 2.82842712],
[5.65685425, 2.82842712, 0. ]])
square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
There is the pairwise_distances_chunked function, which can be used to iterate over the distance matrix row by row, but you will need to keep track of the row index to make sure you only take the mean of values in the upper/lower triangle of the matrix (the distance matrix is symmetrical). This isn't complicated, but I imagine you will introduce a significant slowdown.
from sklearn.metrics import pairwise_distances_chunked
# working_memory=0 forces one row of the distance matrix per chunk
gen = pairwise_distances_chunked(arr, working_memory=0)
r = 0  # index of the current row
tot = ((arr.shape[0]**2) - arr.shape[0]) / 2
weighted_means = 0
for i in gen:
    if r < arr.shape[0]:
        sm = i[0, r:].mean()
        wgt = (i.shape[1] - r) / tot
        weighted_means += sm * wgt
    r += 1
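The same block-by-block idea can be sketched with NumPy alone. The function below is an illustration under stated assumptions rather than an existing library routine: the name mean_pairwise_distance and the chunk_size parameter are hypothetical, and the distances are Euclidean. It sums the full symmetric matrix one block of rows at a time, so the diagonal contributes zero and every pair is counted twice, which the final division accounts for.

```python
import numpy as np


def mean_pairwise_distance(x, chunk_size=1024):
    """Mean Euclidean distance over all unordered pairs of rows in x."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if n < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for start in range(0, n, chunk_size):
        block = x[start:start + chunk_size]
        # Distances from this block of rows to every row: (len(block), n).
        diff = block[:, None, :] - x[None, :, :]
        total += np.sqrt((diff ** 2).sum(axis=-1)).sum()
    # n * (n - 1) ordered pairs, i.e. each unordered pair twice.
    return total / (n * (n - 1))


arr = np.arange(6).reshape(3, 2)
print(mean_pairwise_distance(arr))  # about 3.7712, the mean of pdist(arr)
```

Peak memory is roughly chunk_size * n * d floats instead of n * (n - 1) / 2 distances, at the cost of computing every distance twice.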

Getting eigenvalues from 3x3 matrix in Python using Power method

I'm trying to get all eigenvalues of a 3x3 matrix by using the power method in Python. However, my method returns eigenvalues that differ from the correct ones for some reason.
My matrix: A = [[1, 2, 3], [2, 4, 5], [3, 5,-1]]
Correct eigenvalues: [ 8.54851285, -4.57408723, 0.02557437 ]
Eigenvalues returned by my method: [ 8.5485128481521926, 4.5740872291939381, 9.148174458392436 ]
So the first one is correct, the second one has the wrong sign, and the third one is completely wrong. I don't know what I'm doing wrong and I can't see where I made the mistake.
Here's my code:
import numpy as np
import numpy.linalg as la
eps = 1e-8 # Precision of eigenvalue
def trans(v): # transposes vector (v^T)
    v_1 = np.copy(v)
    return v_1.reshape((-1, 1))
def power(A):
    eig = []
    Ac = np.copy(A)
    lamb = 0
    for i in range(3):
        x = np.array([1, 1, 1])
        while True:
            x_1 = Ac.dot(x) # y_n = A*x_(n-1)
            x_norm = la.norm(x_1)
            x_1 = x_1/x_norm # x_n = y_n/||y_n||
            if(abs(lamb - x_norm) <= eps): # If precision is reached, it returns eigenvalue
                break
            else:
                lamb = x_norm
                x = x_1
        eig.append(lamb)
        # Matrix deflation: A - Lambda * norm[V]*norm[V]^T
        v = x_1/la.norm(x_1)
        R = v * trans(v)
        R = eig[i]*R
        Ac = Ac - R
    return eig
def main():
    A = np.array([1, 2, 3, 2, 4, 5, 3, 5, -1]).reshape((3, 3))
    print(power(A))
if __name__ == '__main__':
    main()
PS. Is there a simpler way to get the second and third eigenvalues from the power method than matrix deflation?

With
lamb = x_norm
you only ever compute the absolute value of the eigenvalues. It is better to compute them as
lamb = dot(x,x_1)
where x is assumed to be normalized.
Because you do not remove the negative eigenvalue -4.57408723 but effectively add it instead, the largest eigenvalue in the third stage is 2*-4.574.. = -9.148.., of which you again computed the absolute value.
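Putting that advice together, a signed version of the question's routine might look like the sketch below. It assumes a real symmetric matrix; the function name power_eigenvalues, the tolerance, and the iteration cap are choices made for this illustration, not part of the original code. The eigenvalue estimate is the Rayleigh quotient x·(A x) with x normalized, which keeps the sign, so the deflation step subtracts the right multiple of v v^T.

```python
import numpy as np


def power_eigenvalues(A, tol=1e-13, max_iter=10_000):
    """Eigenvalues of a symmetric matrix via power iteration plus deflation."""
    Ac = np.array(A, dtype=float)
    n = Ac.shape[0]
    eigenvalues = []
    for _stage in range(n):
        x = np.ones(n) / np.sqrt(n)   # normalized start vector
        lamb = 0.0
        for _step in range(max_iter):
            y = Ac @ x
            new_lamb = x @ y          # Rayleigh quotient: signed estimate
            norm = np.linalg.norm(y)
            if norm == 0.0:           # x lies in the null space; eigenvalue 0
                new_lamb = 0.0
                break
            x = y / norm
            converged = abs(new_lamb - lamb) <= tol
            lamb = new_lamb
            if converged:
                break
        eigenvalues.append(float(new_lamb))
        # Deflation: remove the component just found, with its sign intact.
        Ac = Ac - new_lamb * np.outer(x, x)
    return eigenvalues


A = np.array([[1, 2, 3], [2, 4, 5], [3, 5, -1]])
print(power_eigenvalues(A))
```

For the matrix in the question this should come out close to the reference values 8.54851285, -4.57408723 and 0.02557437, in that order.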

I didn't know this method, so I googled it and found here:
http://ergodic.ugr.es/cphys/LECCIONES/FORTRAN/power_method.pdf
that it is valid only for finding the leading (largest) eigenvalue, thus, it seems that it is working for you fine, and it is not guaranteed that the following eigenvalues will be correct.
Btw. numpy.linalg.eig() works faster than your code for this matrix, but I am guessing you implemented it as an exercise.

Dot Product in Python without NumPy

Is there a way to perform a dot product of two lists of values without using NumPy or the operator module in Python, so that the code is as simple as it can get?
For example:
V_1=[1,2,3]
V_2=[4,5,6]
Dot(V_1,V_2)
Answer: 32

Without numpy, you can write yourself a function for the dot product which uses zip and sum.
>>> def dot(v1, v2):
... return sum(x*y for x, y in zip(v1, v2))
...
>>> dot([1, 2, 3], [4, 5, 6])
32
As of Python 3.10, you can use zip(v1, v2, strict=True) to ensure that v1 and v2 have the same length.
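On Python versions before 3.10, where zip has no strict argument, the same safeguard can be written as an explicit length check. This is a minimal sketch; the error message wording is an assumption.

```python
def dot(v1, v2):
    """Dot product of two equal-length sequences, without any imports."""
    if len(v1) != len(v2):
        # zip() would silently truncate to the shorter input otherwise.
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(v1, v2))


print(dot([1, 2, 3], [4, 5, 6]))  # 32
```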

def dot_product(x, y):
    dp = 0
    for i in range(len(x)):
        dp += (x[i]*y[i])
    return dp
sample1 = [1,2,3,4,5]
sample2 = [2,1,1,1,1]
dot_product(sample1, sample2) #16

We can simply use the @ operator from Python.
For example:
import numpy as np
x = np.array([25, 2, 5])
y = np.array([0, 1, 2])
print(x@y)
12

Efficiently Doing Diffusion on a 2d map in Python

I'm pretty new to Python, so I'm doing a project in it. Part of it includes a diffusion across a map. I'm implementing it by going through the map and making the current tile equal to .2 * the sum of its neighbors (n, w, s, e). If I were doing this in C, I'd just do a double for loop over an array, setting arr[i*width + j] from the neighbors at j+1, j-1, i+1, i-1, and I'd have several different arrays that I'd do the same thing for (different qualities of the map I'd be changing). However, I'm not sure if this is really the fastest way in Python. Some people I have asked suggest things like NumPy, but the width probably won't be more than ~200 (so 40-50k elements max) and I wasn't sure if the overhead is worth it. I don't really know any builtin functions to do what I want. Any advice?
edit: This will be very dense i.e. every spot is going to have a non-trivial calculation

This is quite simple to arrange with NumPy. The function np.roll returns a copy of the array, "rolled" in a specified direction.
For example, given the array x,
x=np.arange(9).reshape(3,3)
# array([[0, 1, 2],
# [3, 4, 5],
# [6, 7, 8]])
you can roll the columns to the right with
np.roll(x,shift=1,axis=1)
# array([[2, 0, 1],
# [5, 3, 4],
# [8, 6, 7]])
Using np.roll, boundaries are wrapped like on a torus. If you do not want wrapped boundaries, you could pad the array with an edge of zeros, and reset the edge to zero before every iteration.
import numpy as np
def diffusion(arr):
    while True:
        arr+=0.2*np.roll(arr,shift=1,axis=1) # right
        arr+=0.2*np.roll(arr,shift=-1,axis=1) # left
        arr+=0.2*np.roll(arr,shift=1,axis=0) # down
        arr+=0.2*np.roll(arr,shift=-1,axis=0) # up
        yield arr
N=5
initial=np.random.random((N,N))
for state in diffusion(initial):
    print(state)
    input()  # wait for Enter before the next step (raw_input in Python 2)
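For the non-wrapping case mentioned above, one option is to pad a copy of the array with zeros and read the four neighbours from slices of the padded array. The sketch below follows the rule stated in the question (each tile becomes 0.2 times the sum of its n, w, s, e neighbours); the name diffuse_step is an assumption for illustration.

```python
import numpy as np


def diffuse_step(arr):
    """One diffusion step with zero (non-wrapping) boundaries."""
    padded = np.pad(arr, 1, mode="constant")  # ring of zeros around the map
    north = padded[:-2, 1:-1]   # value of the tile one row up
    south = padded[2:, 1:-1]    # value of the tile one row down
    west = padded[1:-1, :-2]    # value of the tile one column left
    east = padded[1:-1, 2:]     # value of the tile one column right
    return 0.2 * (north + south + west + east)


grid = np.zeros((3, 3))
grid[1, 1] = 1.0
print(diffuse_step(grid))
```

Because the result is a new array computed from the old state, every tile sees its neighbours' values from before the step, unlike the in-place += version.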

Use convolution.
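import numpy as np
from scipy.signal import convolve2d
mapArr = np.array(map)  # map is your own 2-D grid here, not the built-in function
kernel = np.array([[0 , 0.2, 0],
                   [0.2, 0, 0.2],
                   [0 , 0.2, 0]])
# mode='same' keeps the output the same shape as the map
diffused = convolve2d(mapArr, kernel, mode='same', boundary='wrap')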
Is this for the ants challenge? If so, in the ants context, convolve2d worked ~20 times faster than the loop, in my implementation.

This modification of unutbu's code keeps the global sum of the array constant while diffusing its values:
import numpy as np
def diffuse(arr, d):
    contrib = (arr * d)
    w = contrib / 8.0
    r = arr - contrib
    N = np.roll(w, shift=-1, axis=0)
    S = np.roll(w, shift=1, axis=0)
    E = np.roll(w, shift=1, axis=1)
    W = np.roll(w, shift=-1, axis=1)
    NW = np.roll(N, shift=-1, axis=1)
    NE = np.roll(N, shift=1, axis=1)
    SW = np.roll(S, shift=-1, axis=1)
    SE = np.roll(S, shift=1, axis=1)
    diffused = r + N + S + E + W + NW + NE + SW + SE
    return diffused

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
