How to get correlation matrix values in PySpark (Python)

I have a correlation matrix calculated as follows on PySpark 2.2:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

datos = sql("""select * from proceso_riesgos.jdgc_bd_train_mn_ingresos""")

Variables_corr = ['ingreso_final_mix', 'ingreso_final_promedio',
                  'ingreso_final_mediana', 'ingreso_final_trimedia', 'ingresos_serv_q1',
                  'ingresos_serv_q2', 'ingresos_serv_q3', 'prom_ingresos_serv', 'y_correc']

assembler = VectorAssembler(
    inputCols=Variables_corr,
    outputCol="features")

datos1 = datos.select(Variables_corr).filter("y_correc is not null")
output = assembler.transform(datos1)
r1 = Correlation.corr(output, "features")
The result is a DataFrame with a single column called "pearson(features)" that holds the matrix:
[Row(pearson(features)=DenseMatrix(20, 20, [1.0, 0.9428, 0.8908, 0.913,
0.567, 0.5832, 0.6148, 0.6488, ..., -0.589, -0.6145, -0.5906, -0.5534,
-0.5346, -0.0797, -0.617, 1.0], False))]
I need to take those values and export them to Excel, or at least be able to manipulate the result; a list would be desirable.
Thanks for the help!

You are almost there! There is no need to use the old RDD-based MLlib API.
This is my method to generate a pandas DataFrame; from there you can export to Excel, CSV or other formats.
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

def correlation_matrix(df, corr_columns, method='pearson'):
    # Assemble the columns of interest into a single vector column
    vector_col = "corr_features"
    assembler = VectorAssembler(inputCols=corr_columns, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    # Compute the correlation matrix and pull its values back to the driver
    matrix = Correlation.corr(df_vector, vector_col, method)
    result = matrix.collect()[0]["{}({})".format(method, vector_col)].values
    return pd.DataFrame(result.reshape(-1, len(corr_columns)), columns=corr_columns, index=corr_columns)
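For example, with the DataFrame and column list from the question, the result can be written straight to a file (to_excel needs an Excel writer such as openpyxl installed):
# Usage sketch with the question's datos1 and Variables_corr
corr_df = correlation_matrix(datos1, Variables_corr)
print(corr_df)
corr_df.to_csv("correlations.csv")      # or corr_df.to_excel("correlations.xlsx")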

Related

How to do feature selection/feature importance using PySpark?

I am trying to get feature importances (for feature selection) from my dataset, but I am having trouble doing it with PySpark.
This is what I have done with Python pandas, and I would like to accomplish the same with PySpark:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

cols = [col for col in new_result.columns if col not in ['treatment']]
data = new_result[cols]
target = new_result['treatment']

model = ExtraTreesClassifier()
model.fit(data, target)
print(model.feature_importances_)

feat_importances = pd.Series(model.feature_importances_, index=data.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
This is what I have tried, but I don't feel the PySpark code has achieved what I wanted. I know the model is different, but I would like to get the same kind of result as with pandas:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

assembler = VectorAssembler(
    inputCols=['Primary_ID',
               'Age',
               'Gender',
               'Country',
               'self_employed',
               'family_history',
               'work_interfere',
               'no_employees',
               'remote_work',
               'tech_company',
               'benefits',
               'care_options',
               'wellness_program',
               'seek_help',
               'anonymity',
               'leave',
               'mental_health_consequence',
               'phys_health_consequence',
               'coworkers',
               'supervisor',
               'mental_vs_physical',
               'obs_consequence',
               'mental_issue_in_tech'],
    outputCol="features")
output = assembler.transform(new_result)

indexer = StringIndexer(inputCol="treatment", outputCol="treatment_index")
output_fixed = indexer.fit(output).transform(output)
final_data = output_fixed.select("features", 'treatment_index')
train_data, test_data = final_data.randomSplit([0.7, 0.3])

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="treatment_index", seed=42)
model = rf.fit(train_data)
model.featureImportances
This returns SparseVector(23, {2: 0.0961, 5: 0.1798, 6: 0.3232, 11: 0.0006, 14: 0.1307, 22: 0.2696}). What does this mean? Please advise, and thank you in advance for all the help!
Vectors are represented internally in Spark in two flavours.
DenseVector
This takes more memory, since all the elements are stored in an Array[Double].
SparseVector
This is a memory-efficient way of storing a vector. The representation has 3 parts:
size of the vector
array of indices - contains only the indices whose values are non-zero
array of values - contains the actual values associated with those indices
Example -
import org.apache.spark.ml.linalg.Vectors
val sparseVector = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0))
println(sparseVector.toArray.mkString(", "))
// 0.0, 3.0, 0.0, 4.0
All missing indices are treated as 0.
Regarding your problem:
You can map the sparse vector of feature importances back to the VectorAssembler's input columns. Note that the feature vector and the feature-importances vector have the same size.
val vectorToIndex = vectorAssembler.getInputCols.zipWithIndex.map(_.swap).toMap
val featureToWeight = rf.fit(trainingData).featureImportances.toArray.zipWithIndex.map {
  case (featureWeight, index) => vectorToIndex(index) -> featureWeight
}.toMap
println(featureToWeight)
Similar code should work in Python too.
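For reference, here is a minimal Python sketch of the same mapping, assuming assembler and rf are the VectorAssembler and RandomForestClassifier from the question and train_data is the training DataFrame:
# Map each assembler input column to its importance weight (Python sketch)
fitted_model = rf.fit(train_data)
feature_to_weight = {
    col: float(weight)
    for col, weight in zip(assembler.getInputCols(), fitted_model.featureImportances.toArray())
}
print(feature_to_weight)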

How to fix an error when trying to cluster 2 rows in a csv file

I am trying to learn how to cluster a simple data set.
'suns.csv' is a CSV with just 2 columns of data; I'd like to build a clustering model in Python with the pyclustering package.
The code below gives me the error 'KeyError: 0':
import pandas as pd
import pyclustering
from pyclustering.cluster.kmedoids import kmedoids

# Read data 'SampleSimple3' from Simple Sample collection.
# sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
sample = pd.read_csv('suns.csv')

kmedoids_instance = kmedoids(sample, [8, 23, 36, 50])
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()
medoids = kmedoids_instance.get_medoids()

for i in range(len(clusters)):
    medoid_point = sample[medoids[i]]
    clusters[i] = sorted(clusters[i], key=lambda index: metric.euclidean_distance(medoid_point, sample[index]))
    print(clusters[i])
    print("\n")
I'd like the model to create its own clustering groups, and I'd like to plot the model.
pd.read_csv('suns.csv') returns a DataFrame (or TextParser). You have to convert it to a list of points, where each point is a built-in list or numpy.array, something like [[1.0, 2.3], [1.3, 2.4], ...].
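A minimal sketch of that conversion, assuming the two columns of 'suns.csv' are the point coordinates:
import pandas as pd
from pyclustering.cluster.kmedoids import kmedoids

# Turn the DataFrame into a plain list of [x, y] points before handing it to kmedoids
sample = pd.read_csv('suns.csv').values.tolist()
kmedoids_instance = kmedoids(sample, [8, 23, 36, 50])
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()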

Scale data from dataframe obtained with pyspark

I'm trying to scale some data from a csv file. I'm using pyspark to obtain the dataframe and sklearn for the scaling part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header', 'true').csv('flights.csv')
X_scaled = preprocessing.scale(df)
If I build the dataframe with pandas, the scaling part works without problems, but with Spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types are different between pandas and pyspark, but how can I do the scaling with pyspark?
sklearn works with pandas DataFrames, so you have to convert the Spark DataFrame to a pandas one first:
X_scaled = preprocessing.scale(df.toPandas())
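Note that columns read from CSV come in as strings unless a schema is supplied, so a numeric cast may still be needed before scaling; a rough sketch (the column names are placeholders):
# Hypothetical: select and cast the numeric columns before scaling
numeric_cols = ['dep_delay', 'arr_delay']   # placeholder column names
X_scaled = preprocessing.scale(df.select(numeric_cols).toPandas().astype(float))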
You can use the StandardScaler class from pyspark.ml.feature. Here is a sample script that performs the same pre-processing as sklearn.
Step 1:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features",
                        outputCol="scaled_features",
                        withStd=True, withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember that before you perform step 1 you need to assemble all the features with a VectorAssembler, so this will be your step 0:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)
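Putting step 0 and step 1 together on the question's DataFrame (a sketch assuming required_features lists the numeric columns, already cast to numeric types):
# Step 0 then step 1, followed by a quick look at the scaled vectors
transformed_data = assembler.transform(df)
scaled_data = scaler.fit(transformed_data).transform(transformed_data)
scaled_data.select("scaled_features").show(5, truncate=False)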

PySpark: accessing vector elements in sql

I have a Spark DataFrame with a column named features that holds vectors of data. This column is the output of PySpark's StandardScaler. Here I create a dataset similar to the one I have.
from pyspark.ml.feature import VectorAssembler

# create sample data
arr = [[1, 2, 3], [4, 5, 6]]
df_example = spark.createDataFrame(arr, ['A', 'B', 'C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns], outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
I want to find the Euclidean distance between each vector and a particular cluster center (an array of the same length). Assume the cluster center is:
cluster_center_0 = np.array([0.6, 0.7, 0.8])
How do I achieve this? I tried creating a SQL query, hoping that I could access the elements inside the vector using OFFSET and from there easily calculate the distances, but that didn't work out. This is the query I used; unfortunately it doesn't work, and I have very limited knowledge of SQL:
SELECT aml_cluster_inpt_features
aml_cluster_inpt_features[OFFSET(0)] AS offset_0,
aml_cluster_inpt_features[OFFSET(1)] AS offset_1,
aml_cluster_inpt_features[OFFSET(2)] AS offset_2,
aml_cluster_inpt_features[OFFSET(3)] AS offset_3,
FROM event_rate_holder
Is there a simpler way of doing this? If not, am I headed in the right direction with the SQL query above?
Just use a UDF:
from pyspark.sql.functions import udf
from scipy.spatial import distance
import numpy as np

def euclidean(v1):
    @udf("double")
    def _(v2):
        return distance.euclidean(v1, v2) if v2 is not None else None
    return _

center = np.array([0.6, 0.7, 0.8])
df_vector.withColumn("dist", euclidean(center)("features")).show()
# +-------------+-----------------+
# | features| dist|
# +-------------+-----------------+
# |[1.0,2.0,3.0]|2.586503431275513|
# |[4.0,5.0,6.0]|7.555792479945437|
# +-------------+-----------------+
If you want to disassemble vectors into separate columns, you can follow How to split Vector into columns - using PySpark.
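In Spark 3.0+ one way to do that is pyspark.ml.functions.vector_to_array; a rough sketch for the 3-element vectors above:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Expand the vector column into an array column, then into individual columns
df_arr = df_vector.withColumn("arr", vector_to_array("features"))
df_arr.select([col("arr")[i].alias("f{}".format(i)) for i in range(3)]).show()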

Pyspark Dataframe to Array RDD for KMEANS

I am trying to run the KMeans clustering algorithm in Spark 2.2, but I am not able to find the correct input format. It gives a TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector error. I checked further and found that my inputrdd is an RDD of Rows. Can we convert it to an RDD of arrays? This MLlib doc shows that we can pass a parallelized array RDD into the KMeans model.
The error occurs at the KMeans.train step.
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel

df = pd.DataFrame({"c1": [1, 2, 3, 4, 5, 6], "c2": [2, 6, 1, 2, 4, 6], "c3": [21, 32, 12, 65, 43, 52]})
sdf = sqlContext.createDataFrame(df)
inputrdd = sdf.rdd

model = KMeans.train(inputrdd, 2, maxIterations=10, initializationMode="random",
                     seed=50, initializationSteps=5, epsilon=1e-4)
This is inputrdd when .collect() is called:
[Row(c1=1, c2=2, c3=21),
Row(c1=2, c2=6, c3=32),
Row(c1=3, c2=1, c3=12),
Row(c1=4, c2=2, c3=65),
Row(c1=5, c2=4, c3=43),
Row(c1=6, c2=6, c3=52)]
The following change helped: I converted my RDD of Rows to dense vectors directly using Vectors.dense.
from pyspark.mllib.linalg import Vectors
inputrdd = sdf.rdd.map(lambda s : Vectors.dense(s))
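With that change the training call from the question runs; a quick sketch that also inspects the fitted model:
# Train on the RDD of dense vectors and look at the learned cluster centres
model = KMeans.train(inputrdd, 2, maxIterations=10, initializationMode="random",
                     seed=50, initializationSteps=5, epsilon=1e-4)
print(model.clusterCenters)
print(model.predict(Vectors.dense([1.0, 2.0, 21.0])))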
