How does sklearn's jaccard_score get calculated? - python

I was trying to understand what is going on with sklearn's jaccard_score.
These are the results I got:
1. jaccard_score([0, 1, 1], [1, 1, 1])
0.6666666666666666
2. jaccard_score([1, 1, 0], [1, 0, 0])
0.5
3. jaccard_score([1, 1, 0], [1, 0, 1])
0.3333333333333333
I understand that the formula is
intersection / (size of A + size of B - intersection)
I thought the last one should give me 0.2, because the overlap is 1 and the total number of entries is 6, giving 1 / (6 - 1) = 1/5. But I got 0.33333...
Can anyone explain how sklearn calculates jaccard_score?

Per sklearn's doc, the jaccard_score function "is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true". If the labels are binary, the score is computed from the confusion matrix as TP / (TP + FP + FN). Otherwise, the same computation is done on the confusion matrix for each class label separately.
The above definition for binary labels reduces to the set definition, as explained in the following.
Assume that there are three records r1, r2, and r3. The vectors [0, 1, 1] and [1, 1, 1] -- the true and predicted classes of the records -- can be mapped to the two sets {r2, r3} and {r1, r2, r3}, respectively. Each element of a vector indicates whether the corresponding record belongs to the set. The Jaccard similarity of the two sets is then the same as the Jaccard score of the two vectors.
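Applying this to the third call above makes the 1/3 clear: [1, 1, 0] and [1, 0, 1] map to the sets A = {r1, r2} and B = {r1, r3}. The set sizes are the numbers of 1s (|A| = |B| = 2), not the length of the vectors, so the score is 1 / (2 + 2 - 1) = 1/3. A quick sketch of the set-based computation for comparison:
from sklearn.metrics import jaccard_score

y_true = [1, 1, 0]   # set A = {r1, r2}
y_pred = [1, 0, 1]   # set B = {r1, r3}

# |A ∩ B| = 1 (only r1 is in both), |A ∪ B| = 3 (r1, r2, r3)
intersection = sum(t == p == 1 for t, p in zip(y_true, y_pred))
union = sum(t == 1 or p == 1 for t, p in zip(y_true, y_pred))

print(intersection / union)           # 0.3333...
print(jaccard_score(y_true, y_pred))  # 0.3333...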

Related

Apply truncnorm and draw to 3d arrays of parameters

I have two 3D arrays mean and std, containing, as their names state, mean values and standard deviations respectively. Both arrays have the same shape, so each position has a matching mean value and standard deviation. For each position of the array, I would like to use the value in mean and the corresponding value in std to define a truncated normal distribution, draw a single value from it, and store it at the same position in another array p that has the same shape as mean and std.
Of course, I thought of using scipy.stats.truncnorm but I encounter broadcasting problems and I am a bit lost on how to use it elegantly. A for loop would take too much time as the aim is to apply this process to very big arrays.
As a simple example, let us consider
mean = [[[4 0]
         [1 3]]
        [[3 1]
         [3 4]]]
std = [[[0.84700368 0.78628226]
        [0.54893714 0.68086502]]
       [[0.23237688 0.46543749]
        [0.01420151 0.25461322]]]
For simplicity, I initialize p as an array containing indices:
p = [[[1 2]
      [3 4]]
     [[5 6]
      [7 8]]]
For instance, I would like to replace value 5 in p by a value randomly drawn from a truncated normal distribution (say truncated between user-chosen values lower and upper) of mean value 3 and standard deviation 0.23237688, as given at corresponding position in mean and std. The aim is to apply this process to all values at once.
Thank you in advance for your answers!
It's easier than you think.
import numpy as np
from scipy.stats import truncnorm

mean = np.array([[[4, 0],
                  [1, 3]],
                 [[3, 1],
                  [3, 4]]])
std = np.array([[[0.84700368, 0.78628226],
                 [0.54893714, 0.68086502]],
                [[0.23237688, 0.46543749],
                 [0.01420151, 0.25461322]]])
lower = 1
upper = 3

# from the documentation of truncnorm: a and b are defined in terms of
# the standard normal, hence the rescaling below.
a, b = (lower - mean) / std, (upper - mean) / std

# remove random_state from the parameters if you don't want reproducible
# results.
p = truncnorm.rvs(a, b, loc=mean, scale=std, random_state=1)
print(np.around(p, 2))
# output:
[[[2.6  1.5 ]
  [1.   2.3 ]]

 [[2.66 1.05]
  [2.98 2.94]]]

SVD with numpy - interpretation of results

I'm trying to get into Singular Value Decomposition (SVD). I've found this YouTube Lecture that contains an example. However, when I try this example in numpy I'm getting "kind of" different results. In this example the input matrix is
import numpy as np

A = [[1,1,1,0,0], [3,3,3,0,0], [4,4,4,0,0], [5,5,5,0,0], [0,2,0,4,4], [0,0,0,5,5], [0,1,0,2,2]]
A = np.asarray(A)
print(A)
[[1 1 1 0 0]
 [3 3 3 0 0]
 [4 4 4 0 0]
 [5 5 5 0 0]
 [0 2 0 4 4]
 [0 0 0 5 5]
 [0 1 0 2 2]]
The rank of this matrix is 3 (np.linalg.matrix_rank(A)). The lecture states that the number of singular values is the rank of the matrix, and in the example the Sigma matrix S is indeed of size 3x3. However, when I perform
U, S, V = np.linalg.svd(A)
matrix S contains 5 values. On the other hand, the first 3 values match the ones in the example, and the other 2 are basically 0. Can I assume that I get more singular values than the rank because of the numerical algorithm behind SVD and the finite representation of real numbers on computers, or something along those lines?
As mentioned on this page, numpy internally uses the LAPACK routine _gesdd to compute the SVD. If you look at the _gesdd documentation, it mentions:
To find the SVD of a general matrix A, call the LAPACK routine ?gebrd
or ?gbbrd for reducing A to a bidiagonal matrix B by a unitary
(orthogonal) transformation: A = Q B P^H. Then call ?bdsqr, which forms
the SVD of a bidiagonal matrix: B = U1 Σ V1^H.
So, there are 2 steps involved here:
Bidiagonalization by an orthogonal transformation (Householder transformations)
Getting the SVD of the bidiagonal matrix, using the implicit zero-shift QR algorithm
The QR algorithm is iterative, meaning you don't get an "exact" answer but better and better approximations with each iteration, stopping once the change in values falls below a threshold, so it is "approximate" in that sense.
Thus, along with the numerical inaccuracies due to the finite machine representation of reals, even if we had infinite representational capacity we would still get "approximate" results (after running the algorithm for a finite time) because of the iterative nature of the algorithm.
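In practice you can simply discard the singular values that are numerically zero. A minimal sketch (the tolerance below mirrors the default used by np.linalg.matrix_rank; any sensible threshold works):
import numpy as np

A = np.asarray([[1,1,1,0,0], [3,3,3,0,0], [4,4,4,0,0], [5,5,5,0,0],
                [0,2,0,4,4], [0,0,0,5,5], [0,1,0,2,2]])

U, S, Vh = np.linalg.svd(A)
print(S.shape)   # (5,) -- np.linalg.svd always returns min(n_rows, n_cols) values

# treat tiny singular values as zero, like np.linalg.matrix_rank does
tol = S.max() * max(A.shape) * np.finfo(S.dtype).eps
rank = int((S > tol).sum())
print(rank)      # 3
print(S[:rank])  # the three non-negligible singular values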

Scipy fitting polynomial model to some data

I am trying to find an appropriate function for the permeability of cells under varying conditions. If I assume constant permeability, I can fit it to the experimental data and use sklearn's PolynomialFeatures together with a linear model (as explained in this post) to determine a correlation between the conditions and the permeability. However, the permeability is not constant, and now I am trying to fit my model with the permeability as a function of the process conditions. The PolynomialFeatures module of sklearn is quite nice to use.
Is there an equivalent function within scipy or numpy which allows me to create a polynomial model (including interaction terms, e.g. a*x[0]*x[1]) of varying order without writing the whole function by hand?
The standard polynomial class in numpy does not seem to support interaction terms.
I'm not aware of such a function that does exactly what you need, but you can achieve it using a combination of itertools and numpy.
If you have n_features predictor variables, you essentially must generate all vectors of length n_features whose entries are non-negative integers and sum to the specified order. Each new feature column is the product of the original features raised component-wise to the powers given by one of these vectors.
For example, if order = 3 and n_features = 2, one of the new features will be the old features raised to the respective powers [2, 1], i.e. x[0]**2 * x[1]. I've written some code below for arbitrary order and number of features. The generation of vectors that sum to order is adapted from this post.
import itertools
import numpy as np
from scipy.special import binom

def polynomial_features_with_cross_terms(X, order):
    """
    X: numpy ndarray
        Matrix of shape `(n_samples, n_features)` to be transformed.
    order: integer, default 2
        Order of polynomial features to be computed.

    returns: T, powers.

    `T` is a matrix of shape `(n_samples, n_poly_features)`.
    Note that `n_poly_features` is equal to:

        `n_features+order-1` Choose `n_features-1`

    See: https://en.wikipedia.org\
    /wiki/Stars_and_bars_%28combinatorics%29#Theorem_two

    `powers` is a matrix of shape `(n_features, n_poly_features)`.
    Each column specifies the power by row of the respective feature,
    in the respective column of `T`.
    """
    n_samples, n_features = X.shape
    n_poly_features = int(binom(n_features + order - 1, n_features - 1))
    powers = np.zeros((n_features, n_poly_features))
    T = np.zeros((n_samples, n_poly_features), dtype=X.dtype)

    combos = itertools.combinations(range(n_features + order - 1), n_features - 1)
    for i, c in enumerate(combos):
        powers[:, i] = np.array([
            b - a - 1 for a, b in zip((-1,) + c, c + (n_features + order - 1,))
        ])
        T[:, i] = np.prod(np.power(X, powers[:, i]), axis=1)

    return T, powers
Here's some example usage:
>>> X = np.arange(-5, 5).reshape(5, 2)
>>> T, p = polynomial_features_with_cross_terms(X, order=3)
>>> print(X)
[[-5 -4]
 [-3 -2]
 [-1  0]
 [ 1  2]
 [ 3  4]]
>>> print(p)
[[ 0.  1.  2.  3.]
 [ 3.  2.  1.  0.]]
>>> print(T)
[[ -64  -80 -100 -125]
 [  -8  -12  -18  -27]
 [   0    0    0   -1]
 [   8    4    2    1]
 [  64   48   36   27]]
Finally, I should mention that the SVM polynomial kernel achieves exactly this effect without explicitly computing the polynomial map. There are of course pros and cons to this, but I figured I should mention it for you to consider if you have not already.
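For completeness, a minimal sketch of that alternative using scikit-learn's SVR with a polynomial kernel; X and y below are made-up placeholder data, not from the question:
import numpy as np
from sklearn.svm import SVR

# placeholder data: 50 samples, 2 features, and a noisy polynomial-ish target
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = X[:, 0]**2 * X[:, 1] + 0.05 * rng.normal(size=50)

# degree=3 corresponds to order=3 above; the kernel handles the
# polynomial (and interaction) terms implicitly.
model = SVR(kernel="poly", degree=3)
model.fit(X, y)
print(model.predict(X[:3]))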

How can I interpret Scikit-learn confusion matrix

I am using a confusion matrix to check the performance of my classifier.
I am using Scikit-Learn and I am a little bit confused. How can I interpret the result from
from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
How can I decide whether these predicted values are good or not?
The simplest way to decide whether the classifier is good or bad is to calculate an error using one of the standard error measures (for example, the mean squared error). I imagine your example is copied from Scikit's documentation, so I assume you've read the definition.
We have three classes here: 0, 1, and 2. On the diagonal, the confusion matrix tells you how often a particular class has been predicted correctly. So from the diagonal 2 0 2 we can say that the class with index 0 was classified correctly 2 times, the class with index 1 was never predicted correctly, and the class with index 2 was predicted correctly 2 times.
Off the diagonal you have numbers which tell you how many times a sample whose true class equals the element's row index was classified as the class given by the column index. For example, if you look at the first column below the diagonal, you have 0 and 1 (in the lower left corner of the matrix). The 1 tells you that the class with index 2 (the last row) was once erroneously classified as 0 (the first column). This corresponds to the fact that in your y_true there is one sample with label 2 that was classified as 0; it is the first sample.
If you sum all the numbers in the confusion matrix you get the number of test samples (2 + 2 + 1 + 1 = 6, equal to the length of y_true and y_pred). If you sum each row you get the number of samples per true label: as you can verify, there are indeed two 0s, one 1, and three 2s in y_true.
If you divide each row by its sum, you can tell, for example, that the class with label 2 is recognized correctly in ~66% of cases, and in 1/3 of cases it is confused (hence the name) with the class with label 0.
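A small sketch of that row-wise normalization (not part of the original example):
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)

# divide each row by the number of true samples of that class
per_class = cm / cm.sum(axis=1, keepdims=True)
print(np.around(per_class, 2))
# row for class 2 is [0.33, 0., 0.67]: correct ~66% of the time,
# confused with class 0 in 1/3 of the cases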
TL;DR:
While single-number error measures capture overall performance, with a confusion matrix you can determine whether (some examples):
your classifier just fails at everything,
it handles some classes well and others poorly (which gives you a hint to look at that particular part of your data and observe the classifier's behaviour for those cases),
it does well overall but confuses label A with B quite often (for linear classifiers, you may then want to check whether these classes are linearly separable),
etc.

Python - How to find a correlation between two vectors?

Given two vectors X and Y, I have to find their correlation, i.e. their linear dependence/independence. Both vectors have the same dimension. The result should be a floating-point number in [-1.0, 1.0].
Example:
X=[-1, 2, 0]
Y=[ 4, 2, -0.3]
Find y = cor(X, Y) such that y belongs to [-1.0, 1.0].
It should be a simple construction involving a list comprehension. No external library is allowed.
UPDATE: OK, if the dot product is enough, then here is my solution:
nX = 1 / (sum([x*x for x in X]) ** 0.5)
nY = 1 / (sum([y*y for y in Y]) ** 0.5)
cor = sum([(x*nX) * (y*nY) for x, y in zip(X, Y)])
right?
Sounds like a dot product to me.
Solve the equation for the cosine of the angle between the two vectors, which is always in the range [-1, 1], and you'll have what you want.
It's equal to the dot product divided by the product of the magnitudes of the two vectors.
Since the range is supposed to be [-1, 1], I think the Pearson correlation is fine for your purposes.
The dot product would also work, but you'll have to normalize the vectors before calculating it, and you only get the full [-1, 1] range if the vectors also contain negative values; otherwise the result lies in [0, 1].
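As a rough sketch (not from the original answers), the Pearson correlation with plain list comprehensions and no external library is just the cosine formula above applied to the mean-centered vectors:
X = [-1, 2, 0]
Y = [4, 2, -0.3]

# center both vectors on their means, then take the cosine of the
# centered vectors -- this is exactly the Pearson correlation
mx = sum(X) / len(X)
my = sum(Y) / len(Y)
xc = [x - mx for x in X]
yc = [y - my for y in Y]

cor = sum(a * b for a, b in zip(xc, yc)) / (
    sum(a * a for a in xc) ** 0.5 * sum(b * b for b in yc) ** 0.5
)
print(cor)  # a value in [-1.0, 1.0]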
Don't assume because a formula is algebraically correct that its direct implementation in code will work. There can be numerical problems with some definitions of correlation.
See How to calculate correlation accurately
