I need to find a Python function that works like this R function:
proxy::simil(method = "cosine", by_rows = FALSE)
i.e. it builds a similarity matrix by computing the cosine similarity pair-wise between dataframe rows.
If NaNs are present, it should drop exactly those columns that contain NaNs in either of the two rows being compared.
Simil function description (R)
Python error because of NaNs
Update: I have also tried deleting the NaNs in every pair of rows in a loop, using the cosine function from scipy.spatial.distance. It gives the same result as in R, but takes ages :(
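Roughly, the per-pair loop looks like this (a sketch of the approach just described, assuming the data is in a pandas DataFrame df; it is slow because every pair is handled in Python):

from scipy.spatial.distance import cosine
import numpy as np

X = df.values
n = X.shape[0]
sim = np.empty((n, n))
for i in range(n):
    for j in range(n):
        ok = ~(np.isnan(X[i]) | np.isnan(X[j]))   # drop columns with NaN in either row
        sim[i, j] = 1 - cosine(X[i, ok], X[j, ok])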
You can try this approach: https://github.com/Midnighter/nadist.
Alternatively, you can use _chk_weights with nan_screen=True as described by metaperture here: https://github.com/scipy/scipy/issues/3870. Hope that helps.
I have found that Midnighter had posted the same problem previously on Stack Overflow: Compute the pairwise distance in scipy with missing values. There are some other solutions there, but since he moved on to Cythonize it, I bet they were not the best.
I solved the problem by creating a mask (a boolean array indicating which values are missing) and calculating the pairwise cosine distances between the row vectors of the matrix. The result was a long vector of similarities, which I then pivoted to get the similarity matrix.
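A minimal, fully vectorized sketch of that mask idea (assuming the data is in a pandas DataFrame with NaN marking the missing values; this reproduces the pairwise-deletion behaviour of proxy::simil):

import numpy as np
import pandas as pd

def nan_cosine_sim(df):
    X = df.values.astype(float)
    V = ~np.isnan(X)                 # mask: True where a value is present
    Xz = np.where(V, X, 0.0)         # missing values contribute 0 to every sum
    num = Xz.dot(Xz.T)               # numerators over the shared (non-NaN) columns
    M = (Xz ** 2).dot(V.T)           # M[i, j] = sum of X[i, k]**2 over columns valid in both rows
    with np.errstate(divide='ignore', invalid='ignore'):
        return pd.DataFrame(num / np.sqrt(M * M.T), index=df.index, columns=df.index)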
You can swap NaN with 0 and then calculate the cosine similarity.
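For example (a sketch using scikit-learn, which the question did not mention; note that zero-filling changes the result compared with dropping the NaN columns pair-wise):

from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(df.fillna(0).values)   # row-by-row similarity matrix, df assumed to be the DataFrame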
I am given a matrix of ones and zeros. I need to find the 20 rows that have the highest cosine similarity to one specific row of the matrix.
For example, if I have 10 rows and the 5th one is the specific row, I want to choose the highest values among:
cosine(row1, row5), cosine(row2, row5), ..., cosine(row8, row5), cosine(row9, row5)
First, I tried to compute the metric.
This didn't work:
A = ratings[:,100]
A = A.reshape(1,A.shape[0])
B = ratings.transpose()
similarity = -cosine(A,B)+1
A.shape = (1L, 71869L)
B.shape = (10000L, 71869L)
The error is: "Input vector should be 1-D". I'd like to know how to implement this cleanly and without errors, but most importantly: which solution will be the fastest?
In my opinion, the fastest way is not achieved with scipy:
we just have to take all the ones in the specific row and look at those indices in all the other rows. The rows with the highest number of coincidences will have the highest metric.
Are there any faster ways?
The fastest way is to use matrix operations: something like np.multiply(A,B)
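For example, a sketch of the matrix-operation idea (names are illustrative; it assumes ratings is arranged with one row per item to compare, shape (n_rows, n_features), and target is the index of the specific row):

import numpy as np

def top_k_cosine(ratings, target, k=20):
    M = np.asarray(ratings, dtype=float)
    v = M[target]
    with np.errstate(divide='ignore', invalid='ignore'):
        sims = M.dot(v) / (np.sqrt((M ** 2).sum(axis=1)) * np.linalg.norm(v))  # cosine of every row vs. the target
    sims[np.isnan(sims)] = -1.0                  # all-zero rows cannot be similar
    sims[target] = -np.inf                       # exclude the target row itself
    best = np.argpartition(sims, -k)[-k:]        # k most similar rows, unordered
    return best[np.argsort(sims[best])[::-1]]    # ordered from most to least similar

For a 0/1 matrix this is the same counting idea as above: the numerator is just the number of shared ones, scaled by the row norms.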
I have a data frame of points. The first two columns are positions. I am filtering the data based on each point's proximity to the other points. I calculate the distances between all the points with cdist and then filter the result to find the indices of the points that are less than 0.5 apart. I also have to apply two small filters to these indices first: I remove the indices that compare a point with itself, since distance[n,n] is always zero and I don't want to remove all of my points, and I remove the indices for the mirrored comparisons, since distance[n,m] = distance[m,n]. That leaves basically twice the number of points I need to remove, so I use unique to filter out half.
My question: loc_find is a numpy array of indices of rows that should be removed. How do I use this array to remove those rows from my pandas dataframe without iterating over the dataframe?
from scipy.spatial.distance import cdist
import numpy as np
import pandas as pd
# make points and calculate distances
east=data['easting'].values
north=data['northing'].values
points=np.vstack((east,north)).T
distances=cdist(points,points) # big row x row matrix
zzzz=np.where(distances<0.5)
loc_dist=np.vstack((zzzz[0],zzzz[1])).T # array of index pairs where points are
# too close together; it still contains unwanted distance comparisons,
# such as comparing data[1,1] with data[1,1], which is always zero since it
# is the same point, and distance[1,2], which is the same as distance[2,1]
# My code for filtering the indices
loc_dist=loc_dist.astype('int')
diff_loc=zzzz[0]-zzzz[1] # remove indices that compare a point with itself,
# where distance[n,n] is always zero
diff_zero=np.where(diff_loc==0)
loc_dist_s=np.delete(loc_dist, diff_zero[0],axis=0)
loc_find=np.unique(loc_dist_s) # remove indices for the mirrored comparisons,
# since distance[n,m] = distance[m,n]
Thanks to @EdChum I found these two answered questions, which work for me:
A faster alternative to Pandas `isin` function
Select rows from a DataFrame based on values in a column in pandas
I just needed to convert the dataframe index to a column with
data.loc[:,'rindex1']=data.index.get_values()
and then remove the rows with
data_df2=data.loc[~data['rindex1'].isin(loc_find)]
Hope this helps someone else.
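For what it's worth, a shorter variant that skips the helper column (a sketch, assuming the dataframe still has its default integer index, so the positions in loc_find coincide with the index labels):

data_df2 = data.loc[~data.index.isin(loc_find)]
# or, equivalently under the same assumption
data_df2 = data.drop(loc_find)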
I'd like to find the worst record, the one that makes the correlation worse, in a pandas.DataFrame, so that I can remove anomalous records.
When I have the following DataFrame:
df = pd.DataFrame({'a':[1,2,3], 'b':[1,2,30]})
The correlation becomes better after removing the third row:
print df.corr() #-> correlation is 0.88
print df.ix[0:1].corr() # -> correlation is 1.00
In this case, my question is how to find that the third row is a candidate anomaly that makes the correlation worse.
My idea is to run a linear regression and calculate the error of each element (row). But I don't know a simple way to try that idea, and I also believe there is a simpler and more straightforward way.
Update
Of course, you can keep removing elements until the correlation reaches 1. But I'd like to find just one (or a few) anomalous row(s). Intuitively, I hope to get a non-trivial set of records that achieves a better correlation.
First, you could brute force it to get an exact solution:
import pandas as pd
import numpy as np
from itertools import combinations, chain, imap
df = pd.DataFrame(zip(np.random.randn(10), np.random.randn(10)))
# set the maximal number of lines you are willing to remove
remove_up_to_n = 3
# all combinations of indices to keep
to_keep = imap(list, chain(*map(lambda i: combinations(df.index, df.shape[0] - i), range(1, remove_up_to_n + 1))))
# find index with highest remaining correlation
highest_correlation_index = max(to_keep, key = lambda ks: df.ix[ks].corr().ix[0,1])
df_remaining = df.ix[highest_correlation_index]
This can be costly. You could get a greedy approximation by adding a column with something like each row's contribution to the correlation:
df['CorComp'] = (df.icol(0).mean() - df.icol(0)) * (df.icol(1).mean() - df.icol(1))
df = df.sort(['CorComp'])
Now you can remove rows starting from the top, which may raise your correlation.
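If you only need the single worst row (or a few, removed one at a time), a leave-one-out scan is exact and simple. A sketch, assuming a two-column numeric DataFrame like the one in the question:

base = df.corr().iloc[0, 1]
loo = {i: df.drop(i).corr().iloc[0, 1] for i in df.index}   # correlation with each row left out
worst = max(loo, key=loo.get)                               # removing this row helps the most
print worst, loo[worst] - base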
Your question is about outlier detection. There are many ways to perform this detection, but a simple one is to exclude values whose deviation from the mean exceeds the standard deviation of the series by more than x %.
# Keep only values that deviate from the mean by less than 1.1 times the standard deviation of the series.
df[np.abs(df.b-df.b.mean())<=(1.1*df.b.std())]
# result
a b
0 1 1
1 2 2
I am using Numeric Python. Unfortunately, NumPy is not an option. If I have multiple arrays, such as:
a=Numeric.array(([1,2,3],[4,5,6],[7,8,9]))
b=Numeric.array(([9,8,7],[6,5,4],[3,2,1]))
c=Numeric.array(([5,9,1],[5,4,7],[5,2,3]))
How do I return an array that represents the element-wise median of arrays a, b and c? Such as:
array(([5,8,3],[5,5,6],[5,2,3]))
And then, looking at a more general situation: given n arrays, how do I find a given percentile of each element? For example, return an array that represents the 30th percentile of 10 arrays. Thank you very much for your help!
Combine your stack of 2-D arrays into one 3-D array, d = Numeric.array([a, b, c]), and then sort along the new (stacking) axis. Afterwards, the successive 2-D planes will be in rank order, so you can extract planes for the low, high, quartiles, percentiles, or median.
Well, I'm not versed in Numeric, but I'll just start with a naive solution and see if we can make it any better.
To get the 30th percentile of a list foo, let x = 0.3, sort the list, and pick the element at foo[int(len(foo)*x)].
For your data, you want to put it in a matrix, transpose it, sort each row, and get the median of each row.
A matrix in Numeric (just like numpy) is an array with two dimensions.
I think that bar = Numeric.array([a, b, c]) would make the array you want, and then you could get the nth column with bar[:,n] if Numeric has the same slicing techniques as NumPy.
foo = sorted(bar[:,n])
foo[int(len(foo)*x)]
I hope that helps you.
Putting Raymond Hettinger's description into python:
a=Numeric.array(([1,2,3],[4,5,6],[7,8,9]))
b=Numeric.array(([9,8,7],[6,5,4],[3,2,1]))
c=Numeric.array(([5,9,1],[5,4,7],[5,2,3]))
d = Numeric.array([a, b, c])
d.sort(axis=0)
Since there are n = 3 input matrices, the median is the middle plane, the one at index 1:
n = 3
print d[n//2]
[[5 8 3]
[5 5 6]
[5 2 3]]
And if you had 4 input matrices, you would have to take the element-wise mean of d[1] and d[2].
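The same trick answers the general percentile question: stack the n arrays, sort along the stacking axis, and take the plane at the rank you want. A sketch, using the simple nearest-rank rule (no interpolation) and assuming Numeric's in-place axis sort behaves as in the snippet above:

arrays = [a, b, c]               # in general, your n input arrays
d = Numeric.array(arrays)
d.sort(axis=0)                   # rank-order the planes element-wise
p = 0.3                          # e.g. the 30th percentile
print d[int(p * (len(arrays) - 1))]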
I am looking for coding examples to learn Numpy.
Usage would be dtype='object'.
To construct the array, the code used would be
a = np.asarray(d, dtype='object')
not np.asarray(d) or np.asarray(d, dtype='float32').
Is sorting any different than with float32/64?
Coming from Excel "cell" equations, I am wrapping my head around row/column math.
Ex:
A = array([['a',2,3,4],['b',5,6,2],['c',5,1,5]], dtype ='object')
array([['a', 2, 3, 4],
       ['b', 5, 6, 2],
       ['c', 5, 1, 5]], dtype=object)
Create a new array with:
How would I sort high to low by column [3]?
How would I calculate, for an entire column, (1,1) - (1,0)? Example, without sorting A:
['b',3],
['c',0]
How would I calculate, for the entire array, (1,1) - (2,0)? Example, without sorting A:
['b',2],
['c',-1]
Despite the fact that I still cannot understand exactly what you are asking, here is my best guess. Let's say you want to sort A by the values in the 3rd column:
import numpy as np
A = np.array([['a',2,3,4],['b',5,6,2],['c',5,1,5]], dtype='object')
ii = np.argsort(A[:,2])
print A[ii,:]
Here the rows have been sorted according to the 3rd column, but each row is left unsorted.
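If you want the high-to-low order the question asks for, just reverse the sorted indices:

ii = np.argsort(A[:,2])[::-1]    # descending instead of ascending
print A[ii,:]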
Subtracting whole columns is a problem because of the string objects; however, if you exclude them, you can for example subtract the 3rd row from the 1st with:
A[0,1:] - A[2,1:]
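And a guess at the column-difference part of the question, since the ['b',3], ['c',0] example looks like successive differences down column 1 (a sketch only):

diffs = A[1:, 1] - A[:-1, 1]     # each row's column-1 value minus the previous row's
print zip(A[1:, 0], diffs)       # -> [('b', 3), ('c', 0)]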
If I didn't understand the basic point of your question, then please revise it. I highly recommend you take a look at the numpy tutorial and documentation if you have not done so already:
http://docs.scipy.org/doc/numpy/reference/
http://docs.scipy.org/doc/numpy/user/