Probability Jaccard Similarity in numpy - python

Does anyone have an idea how to efficiently implement a 2D probability Jaccard similarity algorithm in NumPy? This specific algorithm seems to be almost non-existent in computer vision libraries (it is not in PyTorch, TensorFlow, or scikit-learn; I wonder whether there is a specific reason for this). The formula for probability Jaccard similarity is (taken from Wikipedia):
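(Reconstructed here from the Wikipedia definition that the answer below implements:)

J_P(x, y) = \sum_{i \,:\, x_i \neq 0,\, y_i \neq 0} \frac{1}{\sum_j \max\left( \frac{x_j}{x_i}, \frac{y_j}{y_i} \right)}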

This is one way of doing it. It's pretty straightforward: we use broadcasting to perform the divisions for all pairs of points without loops:
import numpy as np

def jaccard_probability(x, y):
    # Ignore == 0 terms (note: this assumes x and y have their zeros in the
    # same positions; otherwise x0 and y0 end up with different lengths)
    x0 = x[x != 0]
    y0 = y[y != 0]
    # x0[:, None] / x0 is the matrix of all pairwise ratios x0[i] / x0[j];
    # the inner sum runs over i, the outer sum over j
    jac = np.sum(
        1.0 / np.sum(np.maximum(x0[:, None] / x0, y0[:, None] / y0), axis=0)
    )
    return jac
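A quick check on two small made-up distributions (chosen without zeros, so the masking above has no effect):

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.4, 0.5])
print(jaccard_probability(x, y))  # a value in (0, 1]; it equals 1.0 when x and y are identical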
However, I suggest you read the NumPy guide to get a grasp of the basics, at least of broadcasting, as it is a very useful tool to know if you plan on using NumPy in the future and want to write efficient code!
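To illustrate the broadcasting trick the answer relies on, here is what the pairwise-ratio step looks like on a tiny made-up vector (x0[:, None] has shape (3, 1), so dividing by x0 of shape (3,) produces a 3x3 matrix):

import numpy as np

x0 = np.array([1.0, 2.0, 4.0])
print(x0[:, None] / x0)
# element [i, j] is x0[i] / x0[j]:
# [[1.    0.5   0.25]
#  [2.    1.    0.5 ]
#  [4.    2.    1.  ]]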

Related

Custom vectorized non-linear filter in Numpy

In digital image processing, many filters are non-linear, such as the harmonic mean filter.
I know NumPy provides many vectorized functions that can speed up the computation tremendously, but I have not yet found any that work well with non-linear masks.
Specifically, I want to speed up my implementation of the above filter by removing its two ugly, snail-paced Python for loops:
import math as m
import numpy as np
import cv2 as cv

def harmonic(im, ksize):
    # Make a copy of the original image
    result = im.copy().astype(np.float32)
    # Calculate padding size, and pad the original image
    psize = m.floor(ksize / 2)  # padding size
    im = cv.copyMakeBorder(im, psize, psize, psize, psize, cv.BORDER_REFLECT)
    # Perform non-linear operations
    for i in range(0, result.shape[0]):
        for j in range(0, result.shape[1]):
            # Get the neighborhood, same size as the kernel
            neighbor = im[i:(i + 2 * psize + 1), j:(j + 2 * psize + 1)].astype(np.float32)
            # Calculate the reciprocal sum
            recp_sum = np.sum(np.reciprocal(neighbor, where=neighbor != 0).astype(np.float32))
            # Harmonic mean for that neighborhood
            if recp_sum != 0:
                result[i][j] = float((ksize * ksize) / recp_sum)
    return result.astype(np.uint8)
In general, can we use NumPy to create arbitrary custom vectorized operations on an array, or only a limited set of operations, and if so, of what kinds? And what could I do specifically to optimize the above code?
I have tried to explore NumPy vectorization recently, and np.vectorize really caught my attention. However, the examples provided in the documentation seemed a bit irrelevant (as far as I can tell) to the problem I am trying to solve. (English is not my native language, so I may have missed something; I'd be happy to have it explained further!)
Related to np.vectorize, I do not really understand the pyfunc parameter. Does it really eliminate the traditional Python loop wrapped around that pyfunc, or is it just there to define the mapping applied at each pixel of the array?
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. That is,
tmp = 1 / im.astype(np.float32)
tmp = cv2.blur(tmp, (ksize, ksize))
out = 1 / tmp
You might want to add a bit of code there to avoid division by zero. The simplest way is to replace zeros with very small values.
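A minimal sketch of that zero handling (the epsilon value here is an arbitrary choice, not something prescribed by the answer):

import numpy as np
import cv2

eps = 1e-6
tmp = np.maximum(im.astype(np.float32), eps)   # replace zeros with a tiny value
tmp = cv2.blur(1.0 / tmp, (ksize, ksize))      # arithmetic mean of the reciprocals
out = np.clip(1.0 / tmp, 0, 255).astype(np.uint8)  # reciprocal of that mean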

Covariance and correlation coefficient

I have two random variables and I need to calculate some of their characteristics precisely.
https://math.stackexchange.com/questions/3052308/calculated-covariance-corr-coefficient-confirmation?noredirect=1#
I already did this in Java but I want to confirm my answers with at least one more tool.
Could anyone good at Python / probability give me some guidance on how I can calculate these 6 values in Python? I guess it is really simple, but I am not very confident in Python.
I looked at the documentation of NumPy's cov function, but I have difficulty understanding it.
The best solution is to use the functions from numpy:
import numpy as np

# Weighted means
e_X = np.average(X_values, weights=X_weights)
e_Y = np.average(Y_values, weights=Y_weights)
# Weighted variances
varX = np.average((X_values - e_X) ** 2, weights=X_weights)
varY = np.average((Y_values - e_Y) ** 2, weights=Y_weights)
# Covariance and correlation matrices of the value pairs
cov_XY = np.cov(X_values, Y_values)
corrcoef_XY = np.corrcoef(X_values, Y_values)
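If the six values come from a joint distribution of paired (X, Y) outcomes with a single probability per pair, the weights can also be folded into the covariance via np.cov's aweights argument. A sketch with made-up numbers (the actual values are in the linked question):

import numpy as np

X_values = np.array([1.0, 2.0, 3.0, 4.0])
Y_values = np.array([2.0, 1.0, 4.0, 3.0])
weights = np.array([0.1, 0.4, 0.3, 0.2])  # joint probabilities, summing to 1

e_X = np.average(X_values, weights=weights)
e_Y = np.average(Y_values, weights=weights)
varX = np.average((X_values - e_X) ** 2, weights=weights)
varY = np.average((Y_values - e_Y) ** 2, weights=weights)
cov_XY = np.cov(X_values, Y_values, aweights=weights, ddof=0)[0, 1]
corrcoef_XY = cov_XY / np.sqrt(varX * varY)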

How do you fit a polynomial to a data set?

I'm working on two functions. I have two data sets, e.g. [[x(1), y(1)], ..., [x(n), y(n)]], called dataSet and testData.
createMatrix(D, S) which returns a data matrix, where D is the degree and S is a vector of real numbers [s(1), s(2), ..., s(n)].
I know numpy has a function called polyfit. But polyfit takes in three variables; any advice on how I'd create the matrix?
polyFit(D), which takes in the polynomial degree D and fits it to the data sets using linear least squares. I'm trying to return the weight vector and errors. I also know that there is lstsq in numpy.linalg, which I found in this question: Fitting polynomials to data
Is it possible to use that question to recreate what I'm trying?
This is what I have so far, but it isn't working.
def createMatrix(D, S):
    x = []
    y = []
    for i in dataSet:
        x.append(i[0])
        y.append(i[1])
    polyfit(x, y, D)
What I don't get here is what does S, the vector of real numbers, have to do with this?
def polyFit(D)
I'm basing a lot of this on the question posted above. I'm unsure about how to get just w, the weight vector, though. I'll be coding the errors myself, so that's fine; I was just wondering if you have any advice on getting the weight vector itself.
It looks like all createMatrix is doing is creating the two vectors required by polyfit. What you have will work, but the more Pythonic way to do it is
from numpy import polyfit

def createMatrix(dataSet, D):
    D = 3  # set this to whatever degree you're trying
    x, y = zip(*dataSet)
    return polyfit(x, y, D)
(This S/O link provides a detailed explanation of the zip(*dataSet) idiom.)
This will return a vector of coefficients that you can then pass to something like poly1d to generate results. (Further explanation of both polyfit and poly1d can be found here.)
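For instance, with made-up data (the numbers here are purely illustrative):

import numpy as np

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

w = np.polyfit(x, y, 1)  # the weight vector: coefficients, highest degree first
p = np.poly1d(w)         # polynomial object built from those coefficients
print(w)
print(p(2.5))            # evaluate the fitted polynomial at a new point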
Obviously, you'll need to decide what value you want for D. The simple answer to that is 1, 2, or 3. Polynomials of higher order than cubic tend to be rather unstable and the intrinsic errors make their output rather meaningless.
It sounds like you might be trying to do some sort of correlation analysis (i.e., does y vary with x and, if so, to what extent?) You'll almost certainly want to just use linear (D = 1) regression for this type of analysis. You can try to do a least squares quadratic fit (D = 2) but, again, the error bounds are probably wider than your assumptions (e.g. normality of distribution) will tolerate.

Pandas Matrix to Distance Matrix as fast as possible

I want to calculate an NxN similarity matrix using the cosine distance formula from sklearn. My problem is that my matrix is very, very large: it has about 1000 entries. My current approach is very slow and I need a real speed-up. Can anybody help me speed the code up?
for i in similarity_matrix.columns:
    for j in similarity_matrix.columns:
        if i == j:
            similarity_matrix.ix[i, j] = 0
        else:
            similarity_matrix.ix[i, j] = cosine(documents[int(i)], documents[int(j)])
Bonus task: In addition I would like to use the weighted cosine formula. But it seems not to be implemented in sklearn? Is that true?
Using for-loops is not the ideal solution here. I would recommend falling back to SciPy's pdist function. My read is that you don't mean your matrix has 1000 entries but that it is 1000x1000? Either way, SciPy can handle this easily.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

res = pdist(documents.T, 'cosine')
distances = 1 - pd.DataFrame(squareform(res), index=documents.columns, columns=documents.columns)
I have trouble understanding what your weight vector looks like. Is it a constant value? pdist also accepts custom functions, so you can, for example, calculate your weighted cosine distance using plain NumPy (which is also really fast):
from numpy.linalg import norm

pdist(X, lambda u, v: np.dot(np.multiply(u, v), weightvec) / (norm(np.multiply(u, weightvec)) * norm(np.multiply(v, weightvec))))

sparse least square regression

I am trying to fit a linear regression Ax = b where A is a sparse matrix and b is a sparse vector. I tried scipy.sparse.linalg.lsqr, but apparently b needs to be a (dense) numpy array. Indeed, if I run
import scipy.sparse
import scipy.sparse.linalg

A = [list(range(0, 10)) for i in range(0, 15)]
A = scipy.sparse.coo_matrix(A)
b = list(range(0, 15))
b = scipy.sparse.coo_matrix(b)
scipy.sparse.linalg.lsqr(A, b)
I end up with:
AttributeError: squeeze not found
While
scipy.sparse.linalg.lsqr(A,b.toarray())
seems to work.
Unfortunately, in my case b is a 1.5 billion x 1 vector and I simply can't use a dense array. Does anybody know of a workaround or another library for running linear regression with a sparse matrix and vector?
It seems that the documentation specifically asks for a numpy array. However, given the scale of your problem, maybe it's easier to use the closed-form solution of linear least squares.
Given that you want to solve Ax = b in the least-squares sense, i.e. to minimize ||Ax - b||, you can form the normal equations and solve those instead.
The closed-form solution is x = (A.T * A)^{-1} * A.T * b.
Of course, this closed form solution comes with its own requirements (specifically, on the rank of the matrix A).
You can solve for x using spsolve or, if that's too expensive, using an iterative solver (like conjugate gradients) to get an inexact solution.
The code would be:
import scipy.sparse
import scipy.sparse.linalg
import scipy.linalg

A = scipy.sparse.rand(1500, 1000, 0.5)  # create a random instance
b = scipy.sparse.rand(1500, 1, 0.5)
x = scipy.sparse.linalg.spsolve(A.T * A, A.T * b)
x_lsqr = scipy.sparse.linalg.lsqr(A, b.toarray())  # just for comparison
print(scipy.linalg.norm(x_lsqr[0] - x))
which on a few random instances consistently gave me values less than 1e-7.
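If factorizing A.T * A with spsolve is too expensive, a rough sketch of the conjugate-gradient route mentioned above (reusing A and b from the snippet, with SciPy's cg solver) could be:

from scipy.sparse.linalg import cg

rhs = (A.T * b).toarray().ravel()  # dense right-hand side of the normal equations
x_cg, info = cg(A.T * A, rhs)      # info == 0 indicates convergence
print(info)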
Apparently billions of observations is too much for my machine. I ended up:
1. Changing the algorithm to stochastic gradient descent (SGD), which is faster with many observations
2. Removing completely sparse examples (i.e. rows whose features and label are all zero)
Indeed, the update rule of SGD with a least-squares loss is always zero for the observations removed in point 2. This reduced the number of observations from billions to millions, which turned out to be feasible with SGD on my machine.
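A rough sketch of that approach with scikit-learn's SGDRegressor (the synthetic A and b, the filtering step, and the chunk size are illustrative assumptions, not the original code):

import numpy as np
import scipy.sparse
from sklearn.linear_model import SGDRegressor

# Synthetic stand-ins for the real (much larger) sparse data
A = scipy.sparse.random(100000, 1000, density=0.01, format='csr')
b = scipy.sparse.random(100000, 1, density=0.01, format='csr')

# Drop completely sparse examples: rows where every feature and the label are zero
keep = (A.getnnz(axis=1) > 0) | (b.getnnz(axis=1) > 0)
A, b = A[keep], b[keep]
y = np.asarray(b.todense()).ravel()  # the surviving labels now fit in memory

sgd = SGDRegressor()  # default loss is squared error
chunk = 10000
for start in range(0, A.shape[0], chunk):
    sgd.partial_fit(A[start:start + chunk], y[start:start + chunk])

x_sgd = sgd.coef_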
