I have a numpy array:
A = np.array([8, 2, 33, 4, 3, 6])
What I want is to create another array B where each element is the pairwise max of 2 consecutive pairs in A, so I get:
B = np.array([8, 33, 33, 4, 6])
Any ideas on how to implement?
Any ideas on how to implement this for more than 2 elements? (same thing but for consecutive n elements)
Edit:
The answers gave me a way to solve this question, but for the n-size window case, is there a more efficient way that does not require loops?
Edit2:
Turns out that the question is equivalent to asking how to perform 1d max-pooling of a list with a window of size n.
Does anyone know how to implement this efficiently?
One solution to the pairwise problem is using the np.maximum function and array slicing:
B = np.maximum(A[:-1], A[1:])
A loop-free solution is to use max on the windows created by skimage.util.view_as_windows:
list(map(max, view_as_windows(A, (2,))))
[8, 33, 33, 4, 6]
Copy/pastable example:
import numpy as np
from skimage.util import view_as_windows
A = np.array([8, 2, 33, 4, 3, 6])
list(map(max, view_as_windows(A, (2,))))
Here is an approach specifically tailored for larger windows. It is O(1) in window size and O(n) in data size.
I've done a pure numpy and a pythran implementation.
How do we achieve O(1) in window size? We use a "sawtooth" trick: if w is the window width, we group the data into blocks of w and for each block we take the cumulative maximum from left to right and from right to left. Any in-between window spans two adjacent blocks, and the maxima of its two parts are among the cumulative maxima we have already computed. So we need a total of 3 comparisons per data point.
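To make that concrete, here is a small illustrative sketch of the idea for a single window position (the data, names and window position are my own, and it assumes the window straddles two adjacent blocks, i.e. r > 0):

import numpy as np

a = np.array([4, 1, 7, 3, 9, 2, 5, 8, 6])   # hypothetical data
w = 3
blocks = a[:len(a) // w * w].reshape(-1, w)

prefix = np.maximum.accumulate(blocks, axis=1)                     # left-to-right cummax per block
suffix = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1]   # right-to-left cummax per block

s = 4                  # window covers a[4:7] = [9, 2, 5]
k, r = divmod(s, w)    # k = 1, r = 1: tail of block 1 plus head of block 2
window_max = max(suffix[k, r], prefix[k + 1, r - 1])
print(window_max, a[s:s + w].max())   # 9 9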
benchit (thanks @Divakar) for w=100; my functions are pp (numpy) and winmax (pythran):
For small window size w=5 the picture is more even. Interestingly, pythran still has a huge edge even for very small sizes. They must be doing something right to minimize call overhead.
python code:
cummax = np.maximum.accumulate
def pp(a,w):
    N = a.size//w
    if a.size-w+1 > N*w:
        out = np.empty(a.size-w+1,a.dtype)
        out[:-1] = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-1:-1]
        out[-1] = a[w*N:].max()
    else:
        out = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-2:-1]
    out[1:N*w-w+1] = np.maximum(out[1:N*w-w+1],
                                cummax(a[w:w*N].reshape(N-1,w),axis=1).ravel())
    out[N*w-w+1:] = np.maximum(out[N*w-w+1:],cummax(a[N*w:]))
    return out
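A quick sanity check of pp on the example from the question (my own check, not part of the original post); w=2 should reproduce the pairwise result:

A = np.array([8, 2, 33, 4, 3, 6])
print(pp(A, 2))   # [ 8 33 33  4  6], same as np.maximum(A[:-1], A[1:])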
pythran version; compile with pythran -O3 <filename.py>; this creates a compiled module which you can import:
import numpy as np
# pythran export winmax(float[:],int)
# pythran export winmax(int[:],int)
def winmax(data,winsz):
    N = data.size//winsz
    if N < 1:
        raise ValueError
    out = np.empty(data.size-winsz+1,data.dtype)
    nxt = winsz
    for j in range(winsz,data.size):
        if j == nxt:
            nxt += winsz
            out[j+1-winsz] = data[j]
        else:
            out[j+1-winsz] = out[j-winsz] if out[j-winsz]>data[j] else data[j]
    running = data[-winsz:N*winsz].max()
    nxt -= winsz << (nxt > data.size)
    for j in range(data.size-winsz,0,-1):
        if j == nxt:
            nxt -= winsz
            running = data[j-1]
        else:
            running = data[j] if data[j] > running else running
            out[j] = out[j] if out[j] > running else running
    out[0] = data[0] if data[0] > running else running
    return out
In this Q&A, we are basically asking for sliding max values. This has been explored before: Max in a sliding window in NumPy array. Since we are looking to be efficient, we can look further. One option is numba, and here are the two final variants I ended up with; they leverage the parallel directive, which boosts performance over a version without it:
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def numba1(a, W):
    L = len(a)-W+1
    out = np.empty(L, dtype=a.dtype)
    v = np.iinfo(a.dtype).min
    for i in prange(L):
        max1 = v
        for j in range(W):
            cur = a[i + j]
            if cur>max1:
                max1 = cur
        out[i] = max1
    return out
@njit(parallel=True)
def numba2(a, W):
    L = len(a)-W+1
    out = np.empty(L, dtype=a.dtype)
    for i in prange(L):
        out[i] = a[i]  # initialize with the window's first element so the comparison below is well-defined
        for j in range(W):
            cur = a[i + j]
            if cur>out[i]:
                out[i] = cur
    return out
From the earlier linked Q&A, the equivalent SciPy version would be -
from scipy.ndimage import maximum_filter1d

def scipy_max_filter1d(a, W):
    L = len(a)-W+1
    hW = W//2 # Half window size
    return maximum_filter1d(a,size=W)[hW:hW+L]
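As a quick check (mine, not from the original answer), this reproduces the pairwise result on the question's example:

import numpy as np
A = np.array([8, 2, 33, 4, 3, 6])
print(scipy_max_filter1d(A, 2))   # [ 8 33 33  4  6]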
Benchmarking
Other posted working approaches for a generic window arg:
from skimage.util import view_as_windows
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# @mathfux's soln
def npmax_strided(a,n):
    return np.max(rolling(a, n), axis=1)

# @Nicolas Gervais's soln
def mapmax_strided(a, W):
    return list(map(max, view_as_windows(a,W)))
cummax = np.maximum.accumulate
def pp(a,w):
    N = a.size//w
    if a.size-w+1 > N*w:
        out = np.empty(a.size-w+1,a.dtype)
        out[:-1] = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-1:-1]
        out[-1] = a[w*N:].max()
    else:
        out = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-2:-1]
    out[1:N*w-w+1] = np.maximum(out[1:N*w-w+1],
                                cummax(a[w:w*N].reshape(N-1,w),axis=1).ravel())
    out[N*w-w+1:] = np.maximum(out[N*w-w+1:],cummax(a[N*w:]))
    return out
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
import benchit
funcs = [mapmax_strided, npmax_strided, numba1, numba2, scipy_max_filter1d, pp]
in_ = {(n,W):(np.random.randint(0,100,n),W) for n in 10**np.arange(2,6) for W in [2, 10, 20, 50, 100]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['Array-length', 'Window-length'])
t.plot(logx=True, sp_ncols=1, save='timings.png')
So, the numba ones are great for window sizes lower than 10, at which point there's no clear winner, and on larger window sizes pp wins, with the SciPy one in second spot.
For n consecutive items, the extended solution requires looping over the window offsets:
np.maximum.reduce([A[i:len(A)-n+i+1] for i in range(n)])
In order to avoid it you can use stride tricks and convert A to array of n-length blocks:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
For example:
>>> rolling(A, 3)
array([[ 8,  2, 33],
       [ 2, 33,  4],
       [33,  4,  3],
       [ 4,  3,  6]])
Once that's done, you can finish it off with np.max(rolling(A, n), axis=1).
Though, despite its elegance, neither this solution nor the first one is efficient, because we repeatedly apply maximum on adjacent blocks that differ by only two items.
A recursive solution, for all n:
import numpy as np
import sys
def recursive(a: np.ndarray, n: int, b=None, level=2):
    if n <= 0 or n > len(a):
        raise ValueError(f'len(a):{len(a)} n:{n}')
    if n == 1:
        return a
    if len(a) == n:
        return np.max(a)
    b = np.maximum(a[:-1], a[1:]) if b is None else np.maximum(a[level - 1:], b)
    if n == level:
        return b
    return recursive(a, n, b[:-1], level + 1)
test_data = np.array([8, 2, 33, 4, 3, 6])
for test_n in range(1, len(test_data) + 2):
    try:
        print(recursive(test_data, n=test_n))
    except ValueError as e:
        sys.stderr.write(str(e))
output
[ 8  2 33  4  3  6]
[ 8 33 33  4  6]
[33 33 33  6]
[33 33 33]
[33 33]
33
len(a):6 n:7
About the recursive function:
You can observe the following data, and then you will see how to write the recursion.
"""
np.array([8, 2, 33, 4, 3, 6])
n=2: (8, 2), (2, 33), (33, 4), (4, 3), (3, 6) => [8, 33, 33, 4, 6] => B' = [8, 33, 33, 4]
n=3: (8, 2, 33), (2, 33, 4), (33, 4, 3), (4, 3, 6) => B' [33, 4, 3, 6] => np.maximum([8, 33, 33, 4], [33, 4, 3, 6]) => 33, 33, 33, 6
...
"""
Using Pandas:
import pandas as pd

A = pd.Series([8, 2, 33, 4, 3, 6])
res = pd.concat([A,A.shift(-1)],axis=1).max(axis=1,skipna=False).dropna()
>>> res
0     8.0
1    33.0
2    33.0
3     4.0
4     6.0
Or using numpy:
np.vstack([A[1:],A[:-1]]).max(axis=0)
I have an n row, m column numpy array, and would like to create a new k x m array by selecting k random elements from each column of the array. I wrote the following python function to do this, but would like to implement something more efficient and faster:
def sample_array_cols(MyMatrix, nelements):
    vmat = []
    TempMat = MyMatrix.T
    for v in TempMat:
        v = np.ndarray.tolist(v)
        subv = random.sample(v, nelements)
        vmat = vmat + [subv]
    return(np.array(vmat).T)
One question is whether there's a way to loop over each column without transposing the array (and then transposing back). More importantly, is there some way to map the random sample onto each column that would be faster than having a for loop over all columns? I don't have that much experience with numpy objects, but I would guess that there should be something analogous to apply/mapply in R that would work?
One alternative is to randomly generate the indices first, and then use take_along_axis to map them to the original array:
arr = np.random.randn(1000, 5000) # arbitrary
k = 10 # arbitrary
n, m = arr.shape
idx = np.random.randint(0, n, (k, m))
new = np.take_along_axis(arr, idx, axis=0)
Output (shape):
In [215]: new.shape
Out[215]: (10, 5000)  # (k x m)
To sample each column without replacement, just like your original solution:
import numpy as np
matrix = np.arange(4*3).reshape(4,3)
matrix
Output
array([[ 0, 1, 2],
       [ 3, 4, 5],
       [ 6, 7, 8],
       [ 9, 10, 11]])
k = 2
np.take_along_axis(matrix, np.random.rand(*matrix.shape).argsort(axis=0)[:k], axis=0)
Output
array([[ 9, 1, 2],
       [ 3, 4, 11]])
I would:
Pre-allocate the result array and fill in columns, and
Use numpy index-based indexing.
def sample_array_cols(matrix, n_result):
    (n, m) = matrix.shape
    vmat = numpy.empty((n_result, m), dtype=matrix.dtype)
    for c in range(m):
        random_indices = numpy.random.randint(0, n, n_result)
        vmat[:, c] = matrix[random_indices, c]
    return vmat
Not quite fully vectorized, but better than building up a list, and the code scans just like your description.
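For instance, a quick check of this sketch on a small matrix (my own example; note that randint samples with replacement):

import numpy
matrix = numpy.arange(4 * 3).reshape(4, 3)
print(sample_array_cols(matrix, 2).shape)   # (2, 3): 2 values sampled per column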
I have 2 2D-arrays. I am trying to convolve along axis 1. np.convolve doesn't provide an axis argument. The answer here convolves a 2D array with a 1D array using np.apply_along_axis, but it cannot be directly applied to my use case. The question here doesn't have an answer.
MWE is as follows.
import numpy as np
a = np.random.randint(0, 5, (2, 5))
"""
a=
array([[4, 2, 0, 4, 3],
       [2, 2, 2, 3, 1]])
"""
b = np.random.randint(0, 5, (2, 2))
"""
b=
array([[4, 3],
       [4, 0]])
"""
# What I want
c = np.convolve(a, b, axis=1) # axis is not supported as an argument
"""
c=
array([[16, 20, 6, 16, 24, 9],
       [ 8, 8, 8, 12, 4, 0]])
"""
I know I can do it using np.fft.fft, but it seems like an unnecessary step to get a simple thing done. Is there a simple way to do this? Thanks.
Why not just do a list comprehension with zip?
>>> np.array([np.convolve(x, y) for x, y in zip(a, b)])
array([[16, 20, 6, 16, 24, 9],
       [ 8, 8, 8, 12, 4, 0]])
Or with scipy.signal.convolve2d:
>>> from scipy.signal import convolve2d
>>> convolve2d(a, b)[[0, 2]]
array([[16, 20, 6, 16, 24, 9],
       [ 8, 8, 8, 12, 4, 0]])
One possibility could be to manually go the way to the Fourier spectrum, and back:
n = np.max([a.shape, b.shape]) + 1
np.abs(np.fft.ifft(np.fft.fft(a, n=n) * np.fft.fft(b, n=n))).astype(int)
# array([[16, 20, 6, 16, 24, 9],
#        [ 8, 8, 8, 12, 4, 0]])
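One caveat (my own note, not from the answer above): the padding length n happens to equal the full convolution length for these shapes. In general, padding to a.shape[-1] + b.shape[-1] - 1 avoids circular wrap-around:

n = a.shape[-1] + b.shape[-1] - 1
c = np.rint(np.fft.irfft(np.fft.rfft(a, n=n) * np.fft.rfft(b, n=n), n=n)).astype(int)  # rounding is safe here only because the inputs are integers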
Would it be considered too ugly to loop over the orthogonal dimension? That would not add much overhead unless the main dimension is very short. Creating the output array ahead of time ensures that no memory needs to be copied about.
def convolvesecond(a, b):
    N1, L1 = a.shape
    N2, L2 = b.shape
    if N1 != N2:
        raise ValueError("Not compatible")
    c = np.zeros((N1, L1 + L2 - 1), dtype=a.dtype)
    for n in range(N1):
        c[n,:] = np.convolve(a[n,:], b[n,:], 'full')
    return c
For the generic case (convolving along the k-th axis of a pair of multidimensional arrays), I would resort to a pair of helper functions I always keep on hand to convert multidimensional problems to the basic 2d case:
def semiflatten(x, d=0):
    '''SEMIFLATTEN - Permute and reshape an array to convenient matrix form
    y, s = SEMIFLATTEN(x, d) permutes and reshapes the arbitrary array X so
    that input dimension D (default: 0) becomes the second dimension of the
    output, and all other dimensions (if any) are combined into the first
    dimension of the output. The output is always 2-D, even if the input is
    only 1-D.
    If D<0, dimensions are counted from the end.
    Return value S can be used to invert the operation using SEMIUNFLATTEN.
    This is useful to facilitate looping over arrays with unknown shape.'''
    x = np.array(x)
    shp = x.shape
    ndims = x.ndim
    if d<0:
        d = ndims + d
    perm = list(range(ndims))
    perm.pop(d)
    perm.append(d)
    y = np.transpose(x, perm)
    # Y has the original D-th axis last, preceded by the other axes, in order
    rest = np.array(shp, int)[perm[:-1]]
    y = np.reshape(y, [np.prod(rest), y.shape[-1]])
    return y, (d, rest)
def semiunflatten(y, s):
    '''SEMIUNFLATTEN - Reverse the operation of SEMIFLATTEN
    x = SEMIUNFLATTEN(y, s), where Y, S are as returned from SEMIFLATTEN,
    reverses the reshaping and permutation.'''
    d, rest = s
    x = np.reshape(y, np.append(rest, y.shape[-1]))
    perm = list(range(x.ndim))
    perm.pop()
    perm.insert(d, x.ndim-1)
    x = np.transpose(x, perm)
    return x
(Note that reshape and transpose do not create copies, so these functions are extremely fast.)
With those, the generic form can be written as:
def convolvealong(a, b, axis=-1):
    a, S1 = semiflatten(a, axis)
    b, S2 = semiflatten(b, axis)
    c = convolvesecond(a, b)
    return semiunflatten(c, S1)
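A quick check of convolvealong (my own, not part of the answer) with the arrays from the question:

print(convolvealong(a, b))
# [[16 20  6 16 24  9]
#  [ 8  8  8 12  4  0]]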
I have two numpy arrays and want to create a third one using the information in these two.
Here is a simple example:
have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])
solution = np.array([[1, 1, 3, 4], [5, 5, 5, 8]])
What I want is to use the "use" array, which tells me how often I want to use the first element of each row of my "have" array.
So the 2 in "use" means that I want a "1" to appear two times in my new array "solution". Similarly, for the 3 in "use", I want my new array to have a "5" three times. The rest of "have" should stay the same.
It is important to use the "use" array for doing this (or a numpy array in general).
Do you have some ideas?
If the data structures are only small and performance is not an issue, then you can do it as simply as:
np.array([ [a[0]]*b[0]+list(a[b[0]:]) for a,b in zip(have,use)])
Simply iterate through have and replace the values based on use.
Use:
for i in range(use.shape[0]):
    have[i, :use[i, 0]] = np.repeat(have[i, 0], use[i, 0])
Using only numpy operations:
First create a boolean mask of the same size as have. mask[i, j] is True if j < use[i, 0], otherwise it's False. So mask is True for the indices which are to be replaced by the first-column value. Now use np.where to replace.
n, m = have.shape
mask = np.repeat(np.arange(m)[None, :], n, axis = 0) < use
have = np.where(mask, have[:, 0:1], have)
Output:
>>> have
array([[1, 1, 3, 4],
       [5, 5, 5, 8]])
If performance matters, you can use np.apply_along_axis().
import numpy as np
have = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
use = np.array([[2], [3]])
def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    res = np.concatenate([res, arr[rep+1:]])
    return res
solution = np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))
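For the example arrays this yields (my own check of the snippet above):

print(solution)
# [[1 1 3 4]
#  [5 5 5 8]]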
Update:
As @hpaulj said, the method using apply_along_axis above is actually not as efficient as I expected; I misunderstood it. Reference: numpy np.apply_along_axis function speed up?.
However, I ran some tests on the current methods:
import numpy as np
from timeit import timeit
def rep1st(arr):
    rep = arr[0]
    res = np.repeat(arr[1], rep)
    res = np.concatenate([res, arr[rep + 1:]])
    return res

def test(row, col, run):
    have = np.random.randint(0, 100, size=(row, col))
    use = np.random.randint(0, col, size=(row, 1))
    d = locals()
    d.update(globals())
    # method by me
    t1 = timeit("np.apply_along_axis(rep1st, 1, np.concatenate([use, have], axis=1))", number=run, globals=d)
    # method by @quantummind
    t2 = timeit("np.array([[a[0]] * b[0] + list(a[b[0]:]) for a, b in zip(have, use)])", number=run, globals=d)
    # method by @Amit Vikram Singh
    t3 = timeit(
        "np.where(np.repeat(np.arange(have.shape[1])[None, :], have.shape[0], axis=0) < use, have[:, 0:1], have)",
        number=run, globals=d
    )
    print(f"{t1:8.6f}, {t2:8.6f}, {t3:8.6f}")
test(1000, 10, 10)
test(100, 100, 10)
test(10, 1000, 10)
test(1000000, 10, 1)
test(100000, 100, 1)
test(10000, 1000, 1)
test(1000, 10000, 1)
test(100, 100000, 1)
test(10, 1000000, 1)
Results (t1 = apply_along_axis, t2 = list comprehension, t3 = np.where):
0.062488, 0.028484, 0.000408
0.010787, 0.013811, 0.000270
0.001057, 0.009146, 0.000216
6.146863, 3.210017, 0.044232
0.585289, 1.186013, 0.034110
0.091086, 0.961570, 0.026294
0.039448, 0.917052, 0.022553
0.028719, 0.919377, 0.022751
0.035121, 1.027036, 0.025216
It shows that the second method proposed by @Amit Vikram Singh (the np.where one) always works well, even when the arrays are huge.
I want to calculate the cosine similarity between two lists, let's say for example list 1 which is dataSetI and list 2 which is dataSetII.
Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The lengths of the lists are always equal. I want to report the cosine similarity as a number between 0 and 1.
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
def cosine_similarity(list1, list2):
    # How to?
    pass
print(cosine_similarity(dataSetI, dataSetII))
Another version based on numpy only:
from numpy import dot
from numpy.linalg import norm
cos_sim = dot(a, b)/(norm(a)*norm(b))
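As a self-contained sketch with the question's lists (converting them to arrays is my addition):

import numpy as np
from numpy import dot
from numpy.linalg import norm

a = np.array([3, 45, 7, 2])
b = np.array([2, 54, 13, 15])
cos_sim = dot(a, b) / (norm(a) * norm(b))
print(cos_sim)   # ~0.972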
You should try SciPy. It has a bunch of useful scientific routines, for example "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the superfast optimized NumPy for its number crunching. See here for installing.
Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
You can use the cosine_similarity function from sklearn.metrics.pairwise (docs):
In [23]: from sklearn.metrics.pairwise import cosine_similarity
In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])
I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:
import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)
v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))
Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712
That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.
ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.
CPython 2.7.3 on a 3.0 GHz Core 2 Duo:
>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264
So, the unpythonic way is about 3.6 times faster in this case.
Without using any imports, math.sqrt(x) can be replaced with x ** .5.
Without using numpy.dot(), you have to create your own dot function using a generator expression:
def dot(A,B):
    return (sum(a*b for a,b in zip(A,B)))
and then its just a simple matter of applying the cosine similarity formula:
def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )
I did a benchmark based on several answers in the question and the following snippet is believed to be the best choice:
import math
import operator

def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))

def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)
The result surprised me: the implementation based on scipy is not the fastest one. I profiled it and found that cosine in scipy spends a lot of time casting a vector from a Python list to a numpy array.
Python code to calculate cosine distance, cosine similarity, angular distance, and angular similarity:
import math
from scipy import spatial
def calculate_cosine_distance(a, b):
    cosine_distance = float(spatial.distance.cosine(a, b))
    return cosine_distance

def calculate_cosine_similarity(a, b):
    cosine_similarity = 1 - calculate_cosine_distance(a, b)
    return cosine_similarity

def calculate_angular_distance(a, b):
    cosine_similarity = calculate_cosine_similarity(a, b)
    angular_distance = math.acos(cosine_similarity) / math.pi
    return angular_distance

def calculate_angular_similarity(a, b):
    angular_similarity = 1 - calculate_angular_distance(a, b)
    return angular_similarity
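A small usage sketch with the question's lists (the values shown are approximate and my own addition):

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(calculate_cosine_similarity(dataSetI, dataSetII))   # ~0.972
print(calculate_angular_distance(dataSetI, dataSetII))    # ~0.075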
Similarity search:
If you want to find the closest cosine similarity in an array of embeddings, you can use TensorFlow, like the following code.
In my testing, the closest value to an embedding of shape 1x512 was found among 1M embeddings (1,000,000 x 512) in less than a second (using a GPU).
import time
import numpy as np # np.__version__ == '1.23.5'
import tensorflow as tf # tf.__version__ == '2.11.0'
EMBEDDINGS_LENGTH = 512
NUMBER_OF_EMBEDDINGS = 1000 * 1000
def calculate_cosine_similarities(x, embeddings):
    cosine_similarities = -1 * tf.keras.losses.cosine_similarity(x, embeddings)
    return cosine_similarities.numpy()

def find_closest_embeddings(x, embeddings, top_k=1):
    cosine_similarities = calculate_cosine_similarities(x, embeddings)
    values, indices = tf.math.top_k(cosine_similarities, k=top_k)
    return values.numpy(), indices.numpy()

def main():
    # x shape: (512)
    # Embeddings shape: (1000000, 512)
    x = np.random.rand(EMBEDDINGS_LENGTH).astype(np.float32)
    embeddings = np.random.rand(NUMBER_OF_EMBEDDINGS, EMBEDDINGS_LENGTH).astype(np.float32)
    print('Embeddings shape: ', embeddings.shape)

    n = 100
    sum_duration = 0
    for i in range(n):
        start = time.time()
        best_values, best_indices = find_closest_embeddings(x, embeddings, top_k=1)
        end = time.time()
        duration = end - start
        sum_duration += duration
        print('Duration (seconds): {}, Best value: {}, Best index: {}'.format(duration, best_values[0], best_indices[0]))

    # Average duration (seconds): 1.707 for Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
    # Average duration (seconds): 0.961 for NVIDIA 1080 ti
    print('Average duration (seconds): ', sum_duration / n)

if __name__ == '__main__':
    main()
For more advanced similarity search, you can use Milvus, Weaviate or Faiss.
https://en.wikipedia.org/wiki/Cosine_similarity
https://gist.github.com/amir-saniyan/e102de09b01c4ed1632e3d1a1a1cbf64
import math

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)
You can round it after computing:
cosine = format(round(cosine_measure(v1, v2), 3))
If you want it really short, you can use this one-liner:
from math import sqrt
from functools import reduce

def cosine_measure(v1, v2):
    return (lambda t: t[0] / sqrt(t[1] * t[2]))(
        reduce(lambda x, y: (x[0] + y[0] * y[1], x[1] + y[0] ** 2, x[2] + y[1] ** 2), zip(v1, v2), (0, 0, 0))
    )
You can use this simple function to calculate the cosine similarity:
import math

def cosine_similarity(a, b):
    return sum([i*j for i, j in zip(a, b)]) / (math.sqrt(sum([i*i for i in a])) * math.sqrt(sum([i*i for i in b])))
You can do this in Python using a simple function that treats each vector as a dict of key-to-value entries:

import math

def get_cosine(text1, text2):
    vec1 = text1
    vec2 = text2
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return round(float(numerator) / denominator, 3)

# get_cosine expects dict-like vectors, so index the lists first:
dataSet1 = dict(enumerate([3, 45, 7, 2]))
dataSet2 = dict(enumerate([2, 54, 13, 15]))
get_cosine(dataSet1, dataSet2)
Using numpy, compare one list of numbers to multiple lists (a matrix):

def cosine_similarity(vector, matrix):
    return (np.sum(vector * matrix, axis=1) / (np.sqrt(np.sum(matrix**2, axis=1)) * np.sqrt(np.sum(vector**2))))[::-1]
If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.
Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:
import torch
import torch.nn as nn
cos = nn.CosineSimilarity()
cos(torch.tensor([v1]), torch.tensor([v2])).item()
Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:
cos(torch.tensor(w1), torch.tensor(w2)).tolist()
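For instance, continuing with the cos object defined above and the question's data (a minimal sketch of my own; the float dtype is an assumption):

import numpy as np
v1 = np.array([3, 45, 7, 2], dtype=np.float32)
v2 = np.array([2, 54, 13, 15], dtype=np.float32)
print(cos(torch.tensor([v1]), torch.tensor([v2])).item())   # ~0.972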
You can use SciPy (easiest way):
from scipy import spatial
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(1 - spatial.distance.cosine(dataSetI, dataSetII))
Note that spatial.distance.cosine() gives you a dissimilarity (distance) value, and thus to get the similarity, you need to subtract that value from 1.
Another way to get to the solution is to write the function yourself, one that even handles lists of different lengths:
import math

def cosineSimilarity(v1, v2):
    scalarProduct = moduloV1 = moduloV2 = 0
    if len(v1) > len(v2):
        v2.extend(0 for _ in range(len(v1) - len(v2)))
    else:
        v1.extend(0 for _ in range(len(v2) - len(v1)))
    for i in range(len(v1)):
        scalarProduct += v1[i] * v2[i]
        moduloV1 += v1[i] * v1[i]
        moduloV2 += v2[i] * v2[i]
    return round(scalarProduct/(math.sqrt(moduloV1) * math.sqrt(moduloV2)), 3)
dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(cosineSimilarity(dataSetI, dataSetII))
Another version: if you have a list of vectors and a query vector and want to compute the cosine similarity of the query vector with all the vectors in the list, you can do it in one go as follows:
>>> import numpy as np
>>> A # list of vectors, shape -> m x n
array([[ 3, 45,  7,  2],
       [ 1, 23,  3,  4]])
>>> B # query vector, shape -> 1 x n
array([ 2, 54, 13, 15])
>>> similarity_scores = A.dot(B)/ (np.linalg.norm(A, axis=1) * np.linalg.norm(B))
>>> similarity_scores
array([0.97228425, 0.99026919])
We can easily calculate cosine similarity with simple mathematics equations.
Cosine similarity = dot product of the vectors / (product of the norms of the vectors), and cosine distance = 1 - cosine similarity. We can define two functions, one each for calculating the dot product and the norm.

def dprod(a, b):
    sum = 0
    for i in range(len(a)):
        sum += a[i]*b[i]
    return sum

def norm(a):
    norm = 0
    for i in range(len(a)):
        norm += a[i]**2
    return norm**0.5

cosine_a_b = dprod(a, b)/(norm(a)*norm(b))
Here is an implementation that works for matrices as well. Its behaviour is exactly like sklearn's cosine similarity:

def cosine_similarity(a, b):
    return np.divide(
        np.dot(a, b.T),
        np.linalg.norm(a, axis=1, keepdims=True)
        @  # matrix multiplication
        np.linalg.norm(b, axis=1, keepdims=True).T
    )

The @ symbol stands for matrix multiplication. See
What does the "at" (@) symbol do in Python?
All the answers are great for situations where you cannot use NumPy. If you can, here is another approach:
EPSILON = 1e-07  # guards the division against a zero denominator

def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)

Also bear in mind the EPSILON = 1e-07 term, which secures the division.
I'd like to 'shear' a numpy array. I'm not sure I'm using the term 'shear' correctly; by shear, I mean something like:
Shift the first column by 0 places
Shift the second column by 1 place
Shift the third column by 2 places
etc...
So this array:
array([[11, 12, 13],
       [17, 18, 19],
       [35, 36, 37]])
would turn into either this array:
array([[11, 36, 19],
       [17, 12, 37],
       [35, 18, 13]])
or something like this array:
array([[11,  0,  0],
       [17, 12,  0],
       [35, 18, 13]])
depending on how we handle the edges. I'm not too particular about edge behavior.
Here's my attempt at a function that does this:
import numpy
def shear(a, strength=1, shift_axis=0, increase_axis=1, edges='clip'):
    strength = int(strength)
    shift_axis = int(shift_axis)
    increase_axis = int(increase_axis)
    if shift_axis == increase_axis:
        raise UserWarning("Shear can't shift in the direction it increases")
    temp = numpy.zeros(a.shape, dtype=int)
    indices = []
    for d, num in enumerate(a.shape):
        coords = numpy.arange(num)
        shape = [1] * len(a.shape)
        shape[d] = num
        coords = coords.reshape(shape) + temp
        indices.append(coords)
    indices[shift_axis] -= strength * indices[increase_axis]
    if edges == 'clip':
        indices[shift_axis][indices[shift_axis] < 0] = -1
        indices[shift_axis][indices[shift_axis] >= a.shape[shift_axis]] = -1
        res = a[tuple(indices)]
        res[indices[shift_axis] == -1] = 0
    elif edges == 'roll':
        indices[shift_axis] %= a.shape[shift_axis]
        res = a[tuple(indices)]
    return res

if __name__ == '__main__':
    a = numpy.random.random((3, 4))
    print(a)
    print(shear(a))
It seems to work. Please tell me if it doesn't!
It also seems clunky and inelegant. Am I overlooking a builtin numpy/scipy function that does this? Is there a cleaner/better/more efficient way to do this in numpy? Am I reinventing the wheel?
EDIT:
Bonus points if this works on an N-dimensional array, instead of just the 2D case.
This function will be at the very center of a loop I'll repeat many times in our data processing, so I suspect it's actually worth optimizing.
SECOND EDIT:
I finally did some benchmarking. It looks like numpy.roll is the way to go, despite the loop. Thanks, tom10 and Sven Marnach!
Benchmarking code: (run on Windows, don't use time.clock on Linux I think)
import time, numpy
def shear_1(a, strength=1, shift_axis=0, increase_axis=1, edges='roll'):
    strength = int(strength)
    shift_axis = int(shift_axis)
    increase_axis = int(increase_axis)
    if shift_axis == increase_axis:
        raise UserWarning("Shear can't shift in the direction it increases")
    temp = numpy.zeros(a.shape, dtype=int)
    indices = []
    for d, num in enumerate(a.shape):
        coords = numpy.arange(num)
        shape = [1] * len(a.shape)
        shape[d] = num
        coords = coords.reshape(shape) + temp
        indices.append(coords)
    indices[shift_axis] -= strength * indices[increase_axis]
    if edges == 'clip':
        indices[shift_axis][indices[shift_axis] < 0] = -1
        indices[shift_axis][indices[shift_axis] >= a.shape[shift_axis]] = -1
        res = a[tuple(indices)]
        res[indices[shift_axis] == -1] = 0
    elif edges == 'roll':
        indices[shift_axis] %= a.shape[shift_axis]
        res = a[tuple(indices)]
    return res

def shear_2(a, strength=1, shift_axis=0, increase_axis=1, edges='roll'):
    indices = numpy.indices(a.shape)
    indices[shift_axis] -= strength * indices[increase_axis]
    indices[shift_axis] %= a.shape[shift_axis]
    res = a[tuple(indices)]
    if edges == 'clip':
        res[indices[shift_axis] < 0] = 0
        res[indices[shift_axis] >= a.shape[shift_axis]] = 0
    return res

def shear_3(a, strength=1, shift_axis=0, increase_axis=1):
    if shift_axis > increase_axis:
        shift_axis -= 1
    res = numpy.empty_like(a)
    index = numpy.index_exp[:] * increase_axis
    roll = numpy.roll
    for i in range(0, a.shape[increase_axis]):
        index_i = index + (i,)
        res[index_i] = roll(a[index_i], i * strength, shift_axis)
    return res

numpy.random.seed(0)
for a in (
        numpy.random.random((3, 3, 3, 3)),
        numpy.random.random((50, 50, 50, 50)),
        numpy.random.random((300, 300, 10, 10)),
):
    print('Array dimensions:', a.shape)
    for sa, ia in ((0, 1), (1, 0), (2, 3), (0, 3)):
        print('Shift axis:', sa)
        print('Increase axis:', ia)
        ref = shear_1(a, shift_axis=sa, increase_axis=ia)
        for shear, label in ((shear_1, '1'), (shear_2, '2'), (shear_3, '3')):
            start = time.clock()
            b = shear(a, shift_axis=sa, increase_axis=ia)
            end = time.clock()
            print(label + ': %0.6f seconds' % (end - start))
            if (b - ref).max() > 1e-9:
                print("Something's wrong.")
        print()
The approach in tom10's answer can be extended to arbitrary dimensions:
def shear3(a, strength=1, shift_axis=0, increase_axis=1):
    if shift_axis > increase_axis:
        shift_axis -= 1
    res = numpy.empty_like(a)
    index = numpy.index_exp[:] * increase_axis
    roll = numpy.roll
    for i in range(0, a.shape[increase_axis]):
        index_i = index + (i,)
        res[index_i] = roll(a[index_i], -i * strength, shift_axis)
    return res
numpy roll does this. For example, if your original array is x, then

for i in range(x.shape[1]):
    x[:, i] = np.roll(x[:, i], i)
produces
[[11 36 19]
 [17 12 37]
 [35 18 13]]
This can be done using a trick described in this answer by Joe Kington:
import numpy
from numpy.lib.stride_tricks import as_strided

a = numpy.array([[11, 12, 13],
                 [17, 18, 19],
                 [35, 36, 37]])
shift_axis = 0
increase_axis = 1
b = numpy.vstack((a, a))
strides = list(b.strides)
strides[increase_axis] -= strides[shift_axis]
# equivalently, for these axes: strides = (b.strides[0], b.strides[1] - b.strides[0])
as_strided(b, shape=b.shape, strides=strides)[a.shape[0]:]
# array([[11, 36, 19],
#        [17, 12, 37],
#        [35, 18, 13]])
To get "clip" instead of "roll", use
b = numpy.vstack((numpy.zeros(a.shape, int), a))
This is probably the most efficient way of doing it, since it does not use any Python loop at all.
Here is a cleaned-up version of your own approach:
def shear2(a, strength=1, shift_axis=0, increase_axis=1, edges='clip'):
    indices = numpy.indices(a.shape)
    indices[shift_axis] -= strength * indices[increase_axis]
    indices[shift_axis] %= a.shape[shift_axis]
    res = a[tuple(indices)]
    if edges == 'clip':
        res[indices[shift_axis] < 0] = 0
        res[indices[shift_axis] >= a.shape[shift_axis]] = 0
    return res
The main difference is that it uses numpy.indices() instead of rolling your own version of this.
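A quick check (my own, not part of the answer) on the example array from the question, using the 'roll' edge mode:

import numpy
a = numpy.array([[11, 12, 13],
                 [17, 18, 19],
                 [35, 36, 37]])
print(shear2(a, edges='roll'))
# [[11 36 19]
#  [17 12 37]
#  [35 18 13]]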
r = lambda l, n: l[n:] + l[:n]
transpose(map(r, transpose(a), range(0, len(a))))
I think. You should probably consider this pseudocode more than actual Python. Basically: transpose the array, map a general rotate function over it to do the rotation, then transpose it back.
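A rough numpy rendering of that rotate-each-column idea (my own sketch, matching the 'roll' output shown in the question):

import numpy as np

a = np.array([[11, 12, 13],
              [17, 18, 19],
              [35, 36, 37]])
sheared = np.column_stack([np.roll(a[:, i], i) for i in range(a.shape[1])])
print(sheared)
# [[11 36 19]
#  [17 12 37]
#  [35 18 13]]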