Calculate the cosine distance between two columns in Spark - python

I am using Python & Spark to solve an issue.
I have a Spark DataFrame containing two columns.
Each of the columns contains scalars of a numeric type (e.g. double or float).
I want to interpret these two columns as vectors and calculate the cosine similarity between them.
So far I have only found Spark linear algebra routines that work on DenseVectors stored in the cells of a DataFrame.
Code sample in numpy:
import numpy as np
from numpy.linalg import norm

vec = np.array([1, 2])
vec_2 = np.array([2, 1])
# Cosine similarity: dot product divided by the product of the norms
angle_vec_vec = np.dot(vec, vec_2) / (norm(vec) * norm(vec_2))
print(angle_vec_vec)
The result should be 0.8.
How can I do this in Spark?
df_small = spark.createDataFrame([(1, 2), (2, 1)])
df_small.show()
Is there a way to convert a column of double values to a DenseVector?
Do you see any other solution to my problem?

You can see here a sample that calculates the cosine distance in Scala. The strategy is to represent the documents as a RowMatrix and then use its columnSimilarities() method.
If you want to use PySpark, you can try what's suggested here.
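For example, a minimal PySpark sketch of that RowMatrix approach, assuming df_small is the two-column DataFrame from the question:
from pyspark.mllib.linalg.distributed import RowMatrix

# Each DataFrame row becomes one row of the matrix; each column is one vector.
rows = df_small.rdd.map(lambda row: [float(row[0]), float(row[1])])
mat = RowMatrix(rows)

# columnSimilarities() returns an upper-triangular CoordinateMatrix of
# pairwise cosine similarities between the matrix columns.
sims = mat.columnSimilarities()
print(sims.entries.collect())  # [MatrixEntry(0, 1, 0.8)] for the sample data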

Related

Matrix/Array multiplication - What's Excel doing (=MMULT) and how to mimic it in Pandas

I want to reproduce in Python with Pandas a calculation from a dataset I found online in Excel. The data is depicted in the image below and I also added links to a CSV and Markdown file. There's a vector in cells B2:M2 and a matrix in cells B4:M15. In Excel, cells B17:M17 contain the formula {=MMULT(B2:M2,B4:M15)}.
Let's assume B2:M2 is the dataframe E and B4:M15 is the dataframe L. How can I reproduce the result (E*L) in row 17 with Pandas?
Data as CSV : https://pastebin.com/raw/Q00ZWLCC
Edit: A solution in numpy would also work for me.
I am following your link:
import pandas as pd

df = pd.read_csv('https://pastebin.com/raw/Q00ZWLCC', delimiter='\t')
vector = df.iloc[0, 1:]     # the 1x12 vector
matrix = df.iloc[2:14, 1:]  # the 12x12 matrix
result = matrix.dot(vector)
This code multiplies the matrix of shape (12, 12) by a column vector of shape (12, 1), obtaining a column vector of shape (12, 1).
If you want to obtain a row vector (1, 12) from the multiplication (1, 12) x (12, 12), you can use numpy. Add the following to the previous code:
import numpy as np
v = vector.to_numpy()
m = matrix.to_numpy()
result_as_a_row_1_by_12 = np.dot(v, m)
This will work for you.
You could also transpose the matrix and stay in pandas, but I think this is a clearer solution.
Regards.

Sklearn Decision Tree - using sparse matrix and other features simultaneously

I am using a Sklearn Decision Tree for some classification and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features, which is, of course, quite a lot.
The thing is that I then iterate over the max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matrices for my categorical features, instead of 70 columns of 0s and 1s.
The question is: can Sklearn use a mix of sparse matrices and regular arrays, and if yes, how do I do that? Currently I get the error: setting an array element with a sequence.
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

a = np.random.randn(10, 3)
b = np.random.random((10, 1))
df = pd.DataFrame(a, columns="A B C".split())
df['temp'] = b

# Two dummy (one-hot) columns derived from 'temp'
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()

# Attempt to store the dummies as a sparse matrix in a single column
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
the sparse matrix dums is not handled correctly by the pandas DataFrame and gets broadcast to each row. pandas does not complain about it because it treats the sparse matrix as a non-array object.
That means that each element in the df['dums'] column points to the whole sparse matrix dums. So essentially, each array element is being set with an array, hence the error setting an array element with a sequence when the data is processed by scikit-learn estimators.
To fix that, stack the dense columns and the sparse dummies horizontally into a single sparse matrix:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this on to your scikit-learn estimator.
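As a quick sanity check (a sketch with hypothetical random labels, since the original question shows no target column), a decision tree accepts the stacked sparse matrix directly:
from sklearn.tree import DecisionTreeClassifier

y = np.random.randint(0, 2, size=10)  # hypothetical binary labels
clf = DecisionTreeClassifier()
clf.fit(df_with_sparse, y)            # sparse input is accepted by the tree estimators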

perform linear algebra operation with pandas data frame

Suppose I have 2 pandas Series, x1 and x2, which I think of as column vectors in linear algebra.
I want to compute x1 * x2^T, i.e. a column vector multiplied by a row vector, producing a matrix (a pandas DataFrame).
What is the best procedure for this?
You want to import numpy and call:
pandas.DataFrame(numpy.outer(x1, x2))
Staying inside pandas, you can convert the series back to DataFrames to do it, e.g.
x1.to_frame().dot(x2.to_frame().T)
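A small self-contained check of both approaches, using hypothetical series that are not from the original question:
import numpy as np
import pandas as pd

x1 = pd.Series([1, 2, 3])  # treated as a column vector
x2 = pd.Series([4, 5, 6])  # treated as a column vector

via_numpy = pd.DataFrame(np.outer(x1, x2))       # 3x3 outer product
via_pandas = x1.to_frame().dot(x2.to_frame().T)  # same values, pandas only
print(via_numpy)
print(via_pandas)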

Covariance Matrix showing covariance between each value in vectors

Is there a python function that allows me to compute an n*n auto-covariance matrix, displaying the covariance between each pair of entries in a vector [a1, a2, a3, ..., an]? I can't get np.cov to do that...
I want it to look like this:
cov(a1,a1) cov(a1,a2)... cov(a1,an)
cov(a2,a1) cov(a2,a2)...
...
cov(an,a1) ... cov(an,an)
Any help is appreciated!
Cheers,
Lena
You could use the pandas package.
http://pandas.pydata.org/
Dataframes have a covariance method that computes the covariances between all columns in the DataFrame.
This should roughly be the workflow.
import pandas as pd
df = pd.DataFrame(your_vector)
cov = df.cov()
cov is now a DataFrame that contains the covariances between all the columns of your data. You can then do whatever analysis you want on the resulting DataFrame.
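A minimal sketch of that workflow, with hypothetical random data; note that df.cov() needs several observations (rows) of the entries a1..an to estimate the covariances:
import numpy as np
import pandas as pd

# Hypothetical data: 100 observations, each row is one sample of [a1, a2, a3, a4]
data = np.random.randn(100, 4)
df = pd.DataFrame(data, columns=['a1', 'a2', 'a3', 'a4'])

cov = df.cov()  # 4x4 DataFrame with cov(ai, aj) at row ai, column aj
print(cov)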

Numpy: evaluation of standard deviation of values above/below the average

I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[0]):
    stdleft[jpar] = p[p[:, jpar] < mean[jpar], jpar].std()
where p is a matrix like (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, so these three lines take ages to execute.
Any idea would be really helpful!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
You can also use pandas for this, as @SudeepJuvekar mentioned. The performance should be broadly similar, but pandas may be a bit faster for this particular operation (untested).
Pandas is your friend. Convert your matrix into a pandas DataFrame and index the DataFrame logically. Something like this:
mat = pandas.DataFrame(p)
This creates a DataFrame from the original numpy matrix p. Then we compute the column means of the DataFrame.
m = mat.mean()
This creates an n_par-sized Series of the column means of mat. Finally, index mat using the < comparison and apply std to that.
stdleft = mat[mat < m].std()
Similarly for stdright. It takes a couple of minutes to compute on my machine.
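For completeness, the mirrored computation would presumably be (using > here; values exactly equal to the column mean fall in neither group, so adjust the comparison to taste):
stdright = mat[mat > m].std()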
Here's the doc page for pandas: http://pandas.pydata.org/
Edit: Updated based on the comment below. You can do almost the same indexing using the original p.
m = p.mean(axis=0)
logical = p < m
logical contains a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a pandas DataFrame with a boolean mask of the same size. Doing so in numpy is slightly harder; I guess looping is the best way to achieve it?
for i in range(p.shape[1]):
    stdleft[i] = p[logical[:, i], i].std()
