perform linear algebra operation with pandas data frame - python

Suppose I have two pandas Series, x1 and x2, which I think of as column vectors in the linear-algebra sense.
I want to compute x1 * x2^T, i.e. multiply a column vector by a row vector to produce a matrix (a pandas DataFrame).
What is the best way to do this?

You want to import numpy and call:
pandas.DataFrame(numpy.outer(x1, x2))

If you want to stay within pandas, you can convert the Series back to DataFrames and use .dot(), e.g.
x1.to_frame().dot(x2.to_frame().T)
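For example, a minimal sketch of both approaches (the toy Series and the index handling are illustrative, not part of the original answers):
import numpy as np
import pandas as pd
x1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
x2 = pd.Series([10.0, 20.0], index=["p", "q"])
# Option 1: numpy outer product, wrapped in a DataFrame (labels added manually)
outer_np = pd.DataFrame(np.outer(x1, x2), index=x1.index, columns=x2.index)
# Option 2: stay in pandas; .dot() keeps x1's index as rows and x2's index as columns
outer_pd = x1.to_frame().dot(x2.to_frame().T)
print(outer_np.equals(outer_pd))  # True: both give the 3x2 outer-product matrix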

Related

Calculate the cosine distance between two columns in Spark

I am using Python & Spark to solve an issue.
I have a Spark DataFrame containing two columns.
Each column contains a scalar numeric value (e.g. double or float) per row.
I want to interpret these two columns as vectors and calculate the cosine similarity between them.
So far I have only found Spark linear algebra routines that operate on DenseVectors stored in the cells of a DataFrame.
Code sample in numpy:
import numpy as np
from numpy.linalg import norm
vec = np.array([1, 2])
vec_2 = np.array([2, 1])
# cosine similarity = dot(vec, vec_2) / (||vec|| * ||vec_2||)
angle_vec_vec = np.dot(vec, vec_2) / (norm(vec) * norm(vec_2))
print(angle_vec_vec)
The result should be 0.8.
How can I do this in Spark?
df_small = spark.createDataFrame([(1, 2), (2, 1)])
df_small.show()
Is there a way to convert a column of double values to a DenseVector?
Do you see any other solution to my problem?
You can see here a sample that calculates the cosine distance in Scala. The strategy is to represent the documents as a RowMatrix and then use its columnSimilarities() method.
If you want to use PySpark, you can try what's suggested here.
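If what you need is the similarity of the two columns taken as whole vectors, a minimal sketch using plain Spark SQL aggregations (no MLlib vectors required) could look like the following; it assumes an active SparkSession named spark and relies on the default column names _1 and _2 that createDataFrame assigns:
from pyspark.sql import functions as F
df_small = spark.createDataFrame([(1, 2), (2, 1)])  # columns "_1" and "_2"
row = df_small.agg(
    F.sum(F.col("_1") * F.col("_2")).alias("dot"),
    F.sqrt(F.sum(F.col("_1") * F.col("_1"))).alias("norm1"),
    F.sqrt(F.sum(F.col("_2") * F.col("_2"))).alias("norm2"),
).first()
cosine_similarity = row["dot"] / (row["norm1"] * row["norm2"])
print(cosine_similarity)  # 0.8 for the sample data above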

pandas dataframe rows scaling with sklearn

How can I apply a sklearn scaler to all rows of a pandas DataFrame? The question is related to pandas dataframe columns scaling with sklearn. How can I apply a sklearn scaler to all values of a row?
NOTE: I know that for feature scaling it is normal to have features in columns and to scale them column-wise, as in the referenced question. However, I'd like to use sklearn scalers to preprocess data for visualization, where scaling row-wise is reasonable in my case.
Sklearn works with both pandas DataFrames and numpy arrays, and numpy arrays allow some basic matrix transformations that DataFrames don't.
You can convert the DataFrame to a numpy array with vectors = df.values. Then transpose the array, scale the transposed array column-wise, and transpose it back:
scaled_rows = scaler.fit_transform(vectors.T).T
Finally, convert it back to a DataFrame: scaled_df = pd.DataFrame(data=scaled_rows, columns=df.columns)
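A minimal runnable sketch of this transpose-scale-transpose approach (MinMaxScaler and the toy DataFrame are illustrative choices, not from the original question):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
scaler = MinMaxScaler()
vectors = df.values                               # DataFrame -> numpy array
scaled_rows = scaler.fit_transform(vectors.T).T   # transpose, scale column-wise, transpose back
scaled_df = pd.DataFrame(scaled_rows, columns=df.columns, index=df.index)
print(scaled_df)                                  # each row is now scaled to [0, 1]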

Covariance Matrix showing covariance between each value in vectors

Is there a python function that allows me to compute an n*n auto-covariance matrix, showing the covariance between each pair of entries in a vector [a1, a2, a3, ..., an]? I can't get np.cov to do that...
I want it to look like this:
cov(a1,a1) cov(a1,a2)... cov(a1,an)
cov(a2,a1) cov(a2,a2)...
...
cov(an,a1) ... cov(an,an)
Any help is appreciated!
Cheers,
Lena
You could use the pandas package.
http://pandas.pydata.org/
Dataframes have a covariance method that computes the covariances between all columns in the DataFrame.
This should roughly be the workflow.
import pandas as pd
df = pd.DataFrame(your_vector)
cov = df.cov()
cov is now a DataFrame that contains the covariances between all the columns of your data. You can then do whatever analysis you want on the resulting DataFrame.
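A short sketch of that workflow; it assumes your "vector" is really a set of observations (rows) of the entries a1..an (columns), since covariances need more than one sample per entry. The random data and column names here are just for illustration:
import numpy as np
import pandas as pd
samples = np.random.random((100, 4))          # 100 observations of a1..a4
df = pd.DataFrame(samples, columns=["a1", "a2", "a3", "a4"])
cov = df.cov()                                # 4x4 DataFrame of pairwise covariances
print(cov.loc["a1", "a2"])                    # cov(a1, a2)
print(cov)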

Numpy: evaluation of standard deviation of values above/below the average

I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[1]):
    stdleft[jpar] = p[p[:, jpar] < mean[jpar], jpar].std()
where p is a matrix of shape (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, so these three lines take ages to run.
Any idea would be really helpful!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
You can also use pandas for this, as @SudeepJuvekar mentioned. The performance should be broadly similar, but pandas should be a bit faster for this particular operation (untested).
Pandas is your friend. Convert your matrix to a pandas DataFrame and index the DataFrame logically. Something like this:
mat = pandas.DataFrame(p)
This creates a DataFrame from the original numpy matrix p. Then we compute the column means of the DataFrame:
m = mat.mean()
This creates an n_par-sized array of the column means of mat. Finally, index mat using the < logical operation and apply std to the result:
stdleft = mat[mat < m].std()
Similarly for stdright. It takes a couple of minutes to compute on my machine.
Here's the doc page for pandas: http://pandas.pydata.org/
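A compact, runnable version of that pandas approach, with stdright included for symmetry; the random demo data is just for illustration:
import numpy as np
import pandas as pd
p = np.random.random((1000, 3))      # 1000 samples, 3 parameters
mat = pd.DataFrame(p)
m = mat.mean()                       # per-column means
stdleft = mat[mat < m].std()         # std of values below each column's mean
stdright = mat[mat >= m].std()       # std of values at or above each column's mean
print(stdleft)
print(stdright)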
Edit: Edited using the comment below. You can do almost the same indexing using the original p:
m = p.mean(axis=0)
logical = p < m
logical is a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a pandas DataFrame with a boolean mask of the same size. Doing so in numpy is slightly harder, and looping over the columns is probably the simplest way to achieve it:
for i in range(p.shape[1]):
    stdleft[i] = p[logical[:, i], i].std()

How do I compute the variance of a column of a sparse matrix in Scipy?

I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself from the mean, using the formula:
Var[X] = E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would square the csc_matrix element-wise and then use the mean function. To get (E[X])^2 you simply square the result of the mean function applied to the original matrix.
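A minimal sketch of that formula for a csc_matrix; .multiply(X) squares the matrix element-wise while keeping it sparse (the small example matrix is purely illustrative):
import numpy as np
from scipy.sparse import csc_matrix
X = csc_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
mean = np.asarray(X.mean(axis=0)).ravel()                          # E[X] per column
mean_of_squares = np.asarray(X.multiply(X).mean(axis=0)).ravel()   # E[X^2] per column
var = mean_of_squares - mean ** 2                                  # E[X^2] - (E[X])^2
std = np.sqrt(var)
print(var, std)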
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np
# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
The variances are then in the attribute var_:
X_var = scaler.var_
The curious thing, though, is that when I densified first using pandas (which is very slow) my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix and then standardize it column-wise in the usual way with
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (the subtraction step introduces lots of non-zero elements), so there's no use keeping the matrix in a sparse format.
