PySpark: accessing vector elements in sql - python

I have a Spark dataframe with a column named features that holds vectors of data. This column is the output of PySpark's StandardScaler. Below I create a sample dataset similar to the one I have.
# create sample data
from pyspark.ml.feature import VectorAssembler

arr = [[1, 2, 3], [4, 5, 6]]
df_example = spark.createDataFrame(arr, ['A', 'B', 'C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns], outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
I want to find the Euclidean distance between each vector and a particular cluster center (an array of the same length). Assume the cluster center is:
cluster_center_0 = np.array([0.6, 0.7, 0.8])
How do I achieve this? I tried creating a SQL query, hoping I could access the elements inside the vector using OFFSET, from which it would be easy to calculate the distances. But that didn't work out. This is the query I used; unfortunately it doesn't work, and I have very limited knowledge of SQL:
SELECT aml_cluster_inpt_features
aml_cluster_inpt_features[OFFSET(0)] AS offset_0,
aml_cluster_inpt_features[OFFSET(1)] AS offset_1,
aml_cluster_inpt_features[OFFSET(2)] AS offset_2,
aml_cluster_inpt_features[OFFSET(3)] AS offset_3,
FROM event_rate_holder
Is there a simpler way of doing this? If not, am I headed in the right direction with the SQL query above?

Just use a UDF:
import numpy as np

from pyspark.sql.functions import udf
from scipy.spatial import distance

def euclidean(v1):
    # Build a UDF that computes the distance from the fixed point v1
    @udf("double")
    def _(v2):
        # cast to a plain Python float to match the declared "double" return type
        return float(distance.euclidean(v1, v2)) if v2 is not None else None
    return _

center = np.array([0.6, 0.7, 0.8])

df_vector.withColumn("dist", euclidean(center)("features")).show()
# +-------------+-----------------+
# | features| dist|
# +-------------+-----------------+
# |[1.0,2.0,3.0]|2.586503431275513|
# |[4.0,5.0,6.0]|7.555792479945437|
# +-------------+-----------------+
If you want to disassemble vectors into separate columns, see How to split Vector into columns - using PySpark.
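For reference, on Spark 3.0+ the vector column can also be expanded without a Python UDF using pyspark.ml.functions.vector_to_array, after which the distance is plain column arithmetic. A minimal sketch, assuming the df_vector and cluster_center_0 defined above:
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

arr_df = df_vector.withColumn("arr", vector_to_array("features"))

# Sum of squared differences against the fixed center, then the square root
dist_expr = F.sqrt(sum(
    (F.col("arr")[i] - float(c)) ** 2 for i, c in enumerate(cluster_center_0)
))

arr_df.withColumn("dist", dist_expr).show()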

Related

Efficiently compute 2D histogram with Pyspark (Numpy and UDF)

I'm trying to do something really simple which somehow translates into something really difficult when PySpark is involved.
I have a really large dataframe (~2B rows) on our platform which I'm not allowed to download, only analyse using PySpark code. The dataframe contains the positions of some objects over Europe during the last year, and I want to compute the density of those objects over time. I've successfully used numpy.histogram2d in the past with good results (it's the fastest option I've found in numpy, at least). Since there is no equivalent of this function in PySpark, I've defined a UDF that computes the density and returns a new dataframe. This works when I only process a few rows (I've tried with 100K rows):
import pandas as pd
import numpy as np

def compute_density(df):
    lon_bins = np.linspace(-15, 45, 100)
    lat_bins = np.linspace(35, 70, 100)
    density, xedges, yedges = np.histogram2d(df["corrected_latitude_degree"].values,
                                             df["corrected_longitude_degree"].values,
                                             [lat_bins, lon_bins])
    x2d, y2d = np.meshgrid(xedges[:-1], yedges[:-1])
    x_out = x2d.ravel()
    y_out = y2d.ravel()
    density_out = density.ravel()
    data = {
        'latitude': x_out,
        'longitude': y_out,
        'density': density_out
    }
    return pd.DataFrame(data)
which I then call like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("density", DoubleType())
])

preproc = (
    inp
    .limit(100000)
    .withColumn("groups", F.lit(0))
)

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def compute_density_udf(df):
    return compute_density(df)

result = preproc.groupby(["groups"]).apply(compute_density_udf)
Why am I using the GROUPED_MAP version to apply the UDF? I didn't manage to get it to work with the SCALAR type UDF when returning with a schema, although I don't really need to group.
When I try to use this UDF on the full dataset I get an OOM, because I believe there is only one group and it is too much data for the UDF to process. I'm sure there is a smarter way to compute this directly with PySpark without a UDF, or alternatively to split into groups and then assemble the results at the end. Does anyone have any ideas or suggestions?
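One possible UDF-free direction (a sketch only, assuming inp carries the corrected_latitude_degree / corrected_longitude_degree columns used above) is to map each point to a bin index with plain column arithmetic and count per bin, mirroring the bin edges from the np.linspace calls:
from pyspark.sql import functions as F

lat_min, lat_max = 35.0, 70.0
lon_min, lon_max = -15.0, 45.0
n_bins = 99  # np.linspace(..., 100) yields 100 edges, i.e. 99 bins

lat_step = (lat_max - lat_min) / n_bins
lon_step = (lon_max - lon_min) / n_bins

density = (
    inp
    .withColumn("lat_bin", F.floor((F.col("corrected_latitude_degree") - F.lit(lat_min)) / lat_step))
    .withColumn("lon_bin", F.floor((F.col("corrected_longitude_degree") - F.lit(lon_min)) / lon_step))
    .groupBy("lat_bin", "lon_bin")
    .count()  # one row per non-empty bin; the counts are the histogram values
)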

Is there a way to get dot product of two huge matrix using pandas or pyspark

I'm doing collaborative filtering, and in the predict phase I need the matrix product of two big matrices (4M x 7 and 25k x 7) for SVD predictions. Is there an efficient and fast way to do so, maybe using pandas or PySpark?
Right now I have come up with a solution that gets the dot product row by row, but that is time consuming:
for i in range(products):
    user_ratings = np.dot(X_products[i], X_user)
    m = np.min(user_ratings)
    items[:,-1] = j
    ratings[:,-1] = user_ratings
    reorder_cols = np.fliplr(np.argsort(ratings, axis=1))
    rows = np.arange(num_users)[:,np.newaxis]
    # reorder
    ratings = ratings[rows, reorder_cols]
    items = items[rows, reorder_cols]
Any suggestions will be appreciated.
I would suggest using PySpark's mllib.linalg.distributed module. Suppose your big matrices are M1 and M2 and you have converted them into RDDs.
1. Convert them into BlockMatrices:
bm_M1 = IndexedRowMatrix(M1.zipWithIndex().map(lambda x:
    (x[1], Vectors.dense(x[0])))).toBlockMatrix(10, 10)
bm_M2 = IndexedRowMatrix(M2.zipWithIndex().map(lambda x:
    (x[1], Vectors.dense(x[0])))).toBlockMatrix(10, 10)
2. Transpose bm_M2 and multiply:
bm_M1.multiply(bm_M2.transpose())
An example
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRowMatrix

mat = sc.parallelize(np.random.rand(4, 4))
bm_M1 = IndexedRowMatrix(mat.zipWithIndex().map(lambda x:
    (x[1], Vectors.dense(x[0])))).toBlockMatrix(1, 1)
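To complete the example, the second block matrix is built the same way (reusing the same random RDD purely for illustration) and multiplied against the transpose, as described in step 2:
bm_M2 = IndexedRowMatrix(mat.zipWithIndex().map(lambda x:
    (x[1], Vectors.dense(x[0])))).toBlockMatrix(1, 1)

product = bm_M1.multiply(bm_M2.transpose())

# The result here is tiny (4x4), so it can be collected locally for inspection
print(product.toLocalMatrix())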

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference
Unfortunately, pandas.DataFrame.rolling() seems to flatten the df before rolling, so it cannot be used as one might expect to roll over the rows of the df and pass windows of rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)

# Set the window size
window = 100

# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))

# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output dataframe.
def rolling_pca(window_data):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data.astype(int)])
    df_pca.iloc[int(window_data[0])] = transf[0, :]
    # rolling().apply expects a numeric return value; it is not used here
    return 1.0

# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))

# Use `rolling` to apply the PCA function (raw=True passes plain ndarrays)
_ = df_idx.rolling(window).apply(rolling_pca, raw=True)

# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
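A minimal sketch of that manual check (the start index 200 is arbitrary; the variables from the snippet above are assumed to still be in scope):
start = 200
manual = PCA().fit_transform(df.iloc[start:start + window])

# The stored row for this window should match the first transformed row
print(np.allclose(manual[0, :], df_pca.iloc[start].values))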

Spark: how to get cluster's points (KMeans)

I'm trying to retrieve the data points belonging to a specific cluster in Spark. In the following piece of code the data is made up, but I actually obtain the predicted cluster.
Here is the code I have so far:
import numpy as np
from pyspark.mllib.clustering import KMeans

# Example data
flight_routes = np.array([[1, 3, 2, 0],
                          [4, 2, 1, 4],
                          [3, 6, 2, 2],
                          [0, 5, 2, 1]])
flight_routes = sc.parallelize(flight_routes)

model = KMeans.train(rdd=flight_routes, k=500, maxIterations=10)

route_test = np.array([[0, 2, 3, 4]])
test = sc.parallelize(route_test)

prediction = model.predict(test)
cluster_number_predicted = prediction.collect()
print(cluster_number_predicted)  # it returns [100] <-- COOL!!
Now, I'd like to have all the data points belonging to cluster number 100. How do I get those?
What I want to achieve is something like the answer given to this SO question: Cluster points after KMeans (Sklearn)
Thank you in advance.
If you want both the record and the prediction (and are not willing to switch to Spark ML) you can zip the RDDs:
predictions_and_values = model.predict(test).zip(test)
and filter afterwards:
predictions_and_values.filter(lambda x: x[0] == 100)
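To actually pull the matching points out (a small follow-up sketch that keeps only the record half of each pair):
points_in_cluster_100 = (
    predictions_and_values
    .filter(lambda x: x[0] == 100)  # pairs whose prediction is cluster 100
    .map(lambda x: x[1])            # drop the prediction, keep the data point
    .collect()
)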

Can we generate contingency table for chisquare test using python?

I am using the scipy.stats.chi2_contingency method to get the chi-square statistic. We need to pass a frequency table, i.e. a contingency table, as a parameter. But I have a feature vector and want to generate the frequency table automatically. Is there such a function available?
I am doing it like this currently:
def contigency_matrix_categorical(data_series, target_series, target_val, indicator_val):
    observed_freq = {}
    for targets in target_val:
        observed_freq[targets] = {}
        for indicators in indicator_val:
            observed_freq[targets][indicators['val']] = data_series[((target_series == targets) & (data_series == indicators['val']))].count()

    f_obs = []
    var1 = 0
    var2 = 0
    for i in observed_freq:
        var1 = var1 + 1
        var2 = 0
        for j in observed_freq[i]:
            f_obs.append(observed_freq[i][j] + 5)
            var2 = var2 + 1

    arr = np.array(f_obs).reshape(var1, var2)
    c, p, dof, expected = chi2_contingency(arr)
    return {'score': c, 'pval': p, 'dof': dof}
where data_series and target_series are the column values and the other two are the indicator names.
Can anyone help?
Thanks
You can use pandas.crosstab to generate a contingency table from a DataFrame. From the documentation:
Compute a simple cross-tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
Below is a usage example:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
# Some fake data.
n = 5 # Number of samples.
d = 3 # Dimensionality.
c = 2 # Number of categories.
data = np.random.randint(c, size=(n, d))
data = pd.DataFrame(data, columns=['CAT1', 'CAT2', 'CAT3'])
# Contingency table.
contingency = pd.crosstab(data['CAT1'], data['CAT2'])
# Chi-square test of independence.
c, p, dof, expected = chi2_contingency(contingency)
The generated data table and the resulting contingency table are shown in the original answer (both depend on the random draw, so they vary from run to run). For that particular draw, scipy.stats.chi2_contingency(contingency) returned (0.052, 0.819, 1, array([[1.6, 0.4], [2.4, 0.6]])).
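Applied to the variables from the question (a rough sketch, assuming data_series and target_series are pandas Series of equal length), the helper above collapses to a single call:
c, p, dof, expected = chi2_contingency(pd.crosstab(data_series, target_series))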
