Fastest way to calculate "cosine" metrics with scipy - python

I am given a matrix of ones and zeros. I need to find the 20 rows with the highest cosine similarity to one specific row of the matrix.
For example, if I have 10 rows and the 5th is the specific one, I want to choose the highest values among:
cosine(row1, row5), cosine(row2, row5), ..., cosine(row8, row5), cosine(row9, row5)
First, I tried to compute the metrics. This didn't work:
A = ratings[:,100]
A = A.reshape(1,A.shape[0])
B = ratings.transpose()
similarity = -cosine(A,B)+1
A.shape = (1L, 71869L)
B.shape = (10000L, 71869L)
The error is: Input vector should be 1-D. I'd like to know how to implement this cleanly and without errors, but most importantly: which solution will be the fastest?
In my opinion, the fastest way doesn't involve scipy at all:
we just take the indices of the ones in the specific row and look at those same indices in all other rows. The rows with the greatest overlap will have the highest metric.
Are there any faster ways?

The fastest way is to use matrix operations: something like np.multiply(A, B).
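For instance, here is a minimal sketch of that matrix-operation approach (the variable ratings and the column index 100 follow the question; the normalization assumes no all-zero vectors):
import numpy as np

# ratings: 2-D array of zeros and ones, one item per column, as in the question
v = ratings[:, 100].astype(float)   # the "specific" vector
M = ratings.T.astype(float)         # one candidate vector per row

# cosine similarity of v against every row of M in one shot
similarity = M.dot(v) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v))

# indices of the 20 most similar rows
top20 = np.argsort(similarity)[-20:][::-1]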

Related

Creating a matrix with at most one value in every row and column

From a matrix filled with values (see picture), I want to obtain a matrix with at most one value for every row and column. If there is more than one value, the maximum should be kept and the others set to 0. I know I can do that with np.max and np.argmax, but I'm wondering if there is some clever way to do it that I'm not aware of.
Here's the solution I have for now:
tmp = np.zeros_like(matrix)
for x in np.argmax(matrix, axis=0):      # get max on x axis
    for y in np.argmax(matrix, axis=1):  # get max on y axis
        tmp[x][y] = matrix[x][y]
matrix = tmp
A sparse structure may be used for efficiency; however, right now I see a contradiction between "at most one value for every row and column" and your current implementation, which may leave more than one value per row/column.
Either you need an order of preference (e.g. rows over columns), or you need to work through an absolute sorting of all matrix values.
One idea that is guaranteed to produce at most one entry per row and column is to first select the maxima of the rows, and then select the maxima of the columns from this intermediate matrix:
import numpy as np

rows = 5
cols = 5
matrix = np.random.rand(rows, cols)

# keep only each row's maximum
rowmax = np.argmax(matrix, 1)
rowmax_matrix = np.zeros((rows, cols))
for ri, rm in enumerate(rowmax):
    rowmax_matrix[ri, rm] = matrix[ri, rm]

# then keep only each column's maximum of the intermediate matrix
colrowmax = np.argmax(rowmax_matrix, 0)
colrowmax_matrix = np.zeros((rows, cols))
for ci, cm in enumerate(colrowmax):
    colrowmax_matrix[cm, ci] = rowmax_matrix[cm, ci]
This is probably not the final answer, but may help to formulate the desired algorithm precisely.
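As a note on the design choice, the two explicit loops above can also be expressed with integer array indexing; a minimal sketch using the same variable names (not from the original answer):
import numpy as np

rows, cols = 5, 5
matrix = np.random.rand(rows, cols)

# keep only each row's maximum
rowmax = np.argmax(matrix, axis=1)
rowmax_matrix = np.zeros_like(matrix)
rowmax_matrix[np.arange(rows), rowmax] = matrix[np.arange(rows), rowmax]

# then keep only each column's maximum of the intermediate matrix
colrowmax = np.argmax(rowmax_matrix, axis=0)
colrowmax_matrix = np.zeros_like(matrix)
colrowmax_matrix[colrowmax, np.arange(cols)] = rowmax_matrix[colrowmax, np.arange(cols)]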

Python cosine_similarity doesn't work for matrix with NaNs

I need to find a Python function that works like this R function:
proxy::simil(method = "cosine", by_rows = FALSE)
i.e. it builds a similarity matrix by computing the pairwise cosine distance between dataframe rows.
If NaNs are present, it should drop exactly those columns that contain NaNs in either of the two rows being compared.
Simil function description (R)
Python error because of NaNs
Update: I have also tried deleting the NaNs in every pair of rows in a loop, using the cosine function from scipy.spatial.distance. It gives the same result as in R, but takes ages :(
You can try this approach: https://github.com/Midnighter/nadist. Alternatively, you can use _chk_weights with nan_screen=True, as described by metaperture here: https://github.com/scipy/scipy/issues/3870. Hope that helps.
I have found that Midnighter had posted the same problem previously on Stack Overflow: Compute the pairwise distance in scipy with missing values. There are some other solutions there but, as he moved on to Cythonize it, I suspect they were not the best.
I solved the problem by creating a mask (a boolean array indicating which values are missing) and calculating pairwise cosine distances between the row vectors of the matrix. As a result I received a long vector of similarities, which I then reshaped into the similarity matrix.
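A minimal sketch of that masking idea for a single pair of rows (the function name and variables are mine, not from the original post):
import numpy as np

def masked_cosine(u, v):
    # cosine similarity of two 1-D arrays, ignoring positions where either is NaN
    mask = ~np.isnan(u) & ~np.isnan(v)
    u, v = u[mask], v[mask]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return np.dot(u, v) / denom if denom else np.nan
Looping this function over all row pairs of the matrix gives the long vector of similarities mentioned above, which can then be reshaped into the square similarity matrix.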
You can replace NaN with 0 and then try calculating the cosine similarity.
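For example, assuming the cosine_similarity in question is scikit-learn's pairwise function and X is the matrix with NaNs:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X_filled = np.nan_to_num(X)       # NaN -> 0.0
S = cosine_similarity(X_filled)   # pairwise similarity between rows
Note that zero-filling is only an approximation: it treats missing entries as real zeros rather than dropping them pairwise as proxy::simil does.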

Find the worst element making the correlation worse in pandas DataFrame

I'd like to find the worst record, i.e. the one that makes the correlation worse, in a pandas.DataFrame, so that I can remove anomalous records.
When I have the following DataFrame:
df = pd.DataFrame({'a':[1,2,3], 'b':[1,2,30]})
The correlation becomes better after removing the third row.
print df.corr() #-> correlation is 0.88
print df.ix[0:1].corr() # -> correlation is 1.00
In this case, my question is how to find that the third row is a candidate anomaly that makes the correlation worse.
My idea is to run a linear regression and calculate the error of each element (row). But I don't know a simple way to try that idea, and I also believe there is a simpler and more straightforward approach.
Update
Of course, you can keep removing elements until the correlation reaches 1. But I'd like to find just one (or a few) anomalous row(s). Intuitively, I hope to get a non-trivial set of records that achieves a better correlation.
First, you could brute force it to get exact solution:
import pandas as pd
import numpy as np
from itertools import combinations, chain, imap
df = pd.DataFrame(zip(np.random.randn(10), np.random.randn(10)))
# set the maximal number of lines you are willing to remove
remove_up_to_n = 3
# all combinations of indices to keep
to_keep = imap(list, chain(*map(lambda i: combinations(df.index, df.shape[0] - i), range(1, remove_up_to_n + 1))))
# find index with highest remaining correlation
highest_correlation_index = max(to_keep, key = lambda ks: df.ix[ks].corr().ix[0,1])
df_remaining = df.ix[highest_correlation_index]
This can be costly. You could get a greedy approximation by adding a column holding something like each row's contribution to the correlation.
df['CorComp'] = (df.icol(0).mean() - df.icol(0)) * (df.icol(1).mean() - df.icol(1))
df = df.sort(['CorComp'])
Now you can remove rows starting from the top, which may raise your correlation.
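As a usage note, a simple leave-one-out check (a sketch, not part of the original answer) directly identifies the single row whose removal improves the correlation the most:
import pandas as pd

# toy data: mostly linear, with one clear outlier in the last row
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 50]})

# correlation of the column pair with each row left out in turn
loo = {i: df.drop(i).corr().iloc[0, 1] for i in df.index}

# row whose removal yields the highest remaining correlation
worst_row = max(loo, key=loo.get)
print(worst_row)  # -> 4, the outlier row
This needs only one correlation computation per row, a middle ground between the exact brute force above and the greedy sort.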
Your question is about outlier detection. There are many ways to perform this detection, but a simple one is to exclude values whose deviation from the mean exceeds some multiple of the standard deviation of the series.
# keep only values within 1.1 standard deviations of the mean of the series
df[np.abs(df.b-df.b.mean())<=(1.1*df.b.std())]
# result
   a  b
0  1  1
1  2  2

Numpy signed maximum magnitude of cumsum along an axis

I have a numpy array a, a.shape=(17,90,144). I want to find the maximum magnitude of each column of cumsum(a, axis=0), but retaining the original sign. In other words, if for a given column a[:,j,i] the largest magnitude of cumsum corresponds to a negative value, I want to retain the minus sign.
The code np.amax(np.abs(a.cumsum(axis=0))) gets me the magnitude, but doesn't retain the sign. Using np.argmax instead will get me the indices I need, which I can then plug into the original cumsum array. But I can't find a good way to do so.
The following code works, but is dirty and really slow:
max_mag_signed = np.zeros((90,144))
indices = np.argmax(np.abs(a.cumsum(axis=0)), axis=0)
for j in range(90):
    for i in range(144):
        max_mag_signed[j,i] = a.cumsum(axis=0)[indices[j,i],j,i]
There must be a cleaner, faster way to do this. Any ideas?
I can't find any alternative to argmax, but you can at least speed it up with a more vectorized approach:
# store the cumsum, since it's used multiple times
cum_a = a.cumsum(axis=0)
# find the indices as before
indices = np.argmax(abs(cum_a), axis=0)
# construct the indices for the second and third dimensions
y, z = np.indices(indices.shape)
# get the values with np indexing
max_mag_signed = cum_a[indices, y, z]
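For reference, on reasonably recent NumPy (1.15 or later) the same fancy-indexing step can also be written with np.take_along_axis; a sketch assuming cum_a and indices as defined above:
# indices has shape (90, 144); take_along_axis expects it to have the same
# number of dimensions as cum_a, so add the reduced axis back first
idx = np.expand_dims(indices, axis=0)                       # shape (1, 90, 144)
max_mag_signed = np.take_along_axis(cum_a, idx, axis=0)[0]  # shape (90, 144)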

Element-wise median and percentiles of arrays with Numeric Python

I am using Numeric Python. Unfortunately, NumPy is not an option. If I have multiple arrays, such as:
a=Numeric.array(([1,2,3],[4,5,6],[7,8,9]))
b=Numeric.array(([9,8,7],[6,5,4],[3,2,1]))
c=Numeric.array(([5,9,1],[5,4,7],[5,2,3]))
How do I return an array that represents the element-wise median of arrays a, b, and c? Such as:
array(([5,8,3],[5,5,6],[5,2,3]))
Now for a more general situation: given n arrays, how do I find the percentiles of each element? For example, return an array that represents the 30th percentile of 10 arrays. Thank you very much for your help!
Combine your stack of 2-D arrays into one 3-D array, d = Numeric.array([a, b, c]), and then sort along the new axis (the one that indexes the stacked arrays). Afterwards, the successive 2-D planes will be in rank order, so you can extract planes for the low, high, quartiles, percentiles, or median.
Well, I'm not versed in Numeric, but I'll just start with a naive solution and see if we can make it any better.
To get the 30th percentile of a list foo, let x = 0.3, sort the list, and pick the element at foo[int(len(foo)*x)].
For your data, you want to put it in a matrix, transpose it, sort each row, and get the median of each row.
A matrix in Numeric (just like numpy) is an array with two dimensions.
I think that bar = Numeric.array([a, b, c]) would make the array you want, and then you could get the nth column with bar[:, n] if Numeric has the same slicing techniques as NumPy.
foo = sorted(bar[:,n])
foo[int(len(foo)*x)]
I hope that helps you.
Putting Raymond Hettinger's description into Python:
a=Numeric.array(([1,2,3],[4,5,6],[7,8,9]))
b=Numeric.array(([9,8,7],[6,5,4],[3,2,1]))
c=Numeric.array(([5,9,1],[5,4,7],[5,2,3]))
d = Numeric.array([a, b, c])
d.sort(axis=0)
Since there are n = 3 input matrices, the median is the middle plane, the one at index 1:
print d[n//2]
[[5 8 3]
 [5 5 6]
 [5 2 3]]
And if you had 4 input matrices, you would take the element-wise mean of d[1] and d[2].
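Extending the same idea to the percentile part of the question, a sketch using the nearest-rank scheme described in the answer above (and assuming Numeric's in-place sort behaves as used here):
import Numeric

a = Numeric.array(([1,2,3],[4,5,6],[7,8,9]))
b = Numeric.array(([9,8,7],[6,5,4],[3,2,1]))
c = Numeric.array(([5,9,1],[5,4,7],[5,2,3]))

# stack the n input arrays and sort element-wise along the stacking axis
d = Numeric.array([a, b, c])
d.sort(axis=0)

n = d.shape[0]
p = 0.3                            # 30th percentile
percentile_plane = d[int(n * p)]   # element-wise 30th percentile (nearest rank)
With 10 input arrays, int(10 * 0.3) picks the 4th-smallest plane, which is the element-wise 30th percentile in the same nearest-rank sense.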
