Perform PCA on each group of a groupBy in PySpark - python

I am looking for a way to run the spark.ml.feature.PCA function over grouped data returned from a groupBy() call on a dataframe. But I'm not sure if this is possible, or how to achieve it. This is a basic example that hopefully illustrates what I want to do:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA
df = spark.createDataFrame([[3, 1, 1], [4, 2, 1], [5, 2, 1], [3, 3, 2], [6, 2, 2], [4, 4, 2]], ["Value1", "Value2", "ID"])
df.show()
+------+------+---+
|Value1|Value2| ID|
+------+------+---+
|     3|     1|  1|
|     4|     2|  1|
|     5|     2|  1|
|     3|     3|  2|
|     6|     2|  2|
|     4|     4|  2|
+------+------+---+
assembler = VectorAssembler(inputCols=["Value1", "Value2"], outputCol="features")
df2 = assembler.transform(df)
df2.show()
+------+------+---+---------+
|Value1|Value2| ID| features|
+------+------+---+---------+
|     3|     1|  1|[3.0,1.0]|
|     4|     2|  1|[4.0,2.0]|
|     5|     2|  1|[5.0,2.0]|
|     3|     3|  2|[3.0,3.0]|
|     6|     2|  2|[6.0,2.0]|
|     4|     4|  2|[4.0,4.0]|
+------+------+---+---------+
pca = PCA(k=1, inputCol="features", outputCol="component")
At this point I have the dataframe and the pca object that I want to use. I would like to now perform PCA on the dataframe but grouped by "ID", so I would get the PCA for all of the features with ID 1, and the PCA for all of the features where ID is 2, just returning the components. I can get these manually by:
>>> pca.fit(df2.where("ID==1")).pc
DenseMatrix(2, 1, [-0.8817, -0.4719], 0)
>>> pca.fit(df2.where("ID==2")).pc
DenseMatrix(2, 1, [-0.8817, 0.4719], 0)
But I would like to run this over all of the different IDs in the dataframe in parallel, something like:
df2.groupBy("ID").map(lambda group: pca.fit(group).pc)
But you can't use map() on grouped data like this. Is there a way to achieve this?

Spark>=3.0.0
As of Spark 3.0.0, you can use applyInPandas to apply a simple Python function to each group of the current DataFrame and return the result as another DataFrame. You basically need to define the output schema of the returned DataFrame.
Here I will use scikit-learn's PCA function instead of the Spark implementation as it has to be applied to single pandas DataFrames, not Spark ones. The principal components to be found should be the same anyway.
import pandas as pd
from sklearn.decomposition import PCA
from pyspark.sql.types import StructField, StructType, DoubleType
# define PCA parameters
cols = ['Value1', 'Value2']
pca_components = 1
# define Python function
def pca_udf(pdf):
    X = pdf[cols]
    pca = PCA(n_components=pca_components)
    PC = pca.fit_transform(X)
    PC_df = pd.DataFrame(PC, columns=['PC_' + str(i+1) for i in range(pca_components)])
    result = pd.concat([pdf, PC_df], axis=1, ignore_index=True)
    return result
# define output schema; principal components are generated dynamically based on `pca_components`
to_append = [StructField('PC_' + str(i+1), DoubleType(), True) for i in range(pca_components)]
output_schema = StructType(df.schema.fields + to_append)
df \
    .groupby('ID') \
    .applyInPandas(pca_udf, output_schema) \
    .show()
+------+------+---+-------------------+
|Value1|Value2| ID|               PC_1|
+------+------+---+-------------------+
|     3|     1|  1| 1.1962465491226262|
|     4|     2|  1|-0.1572859751773413|
|     5|     2|  1|-1.0389605739452852|
|     3|     3|  2|-1.1755661316905914|
|     6|     2|  2|  1.941315590145264|
|     4|     4|  2|-0.7657494584546719|
+------+------+---+-------------------+
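Note that this returns the transformed values per row. If instead you want the loadings per group (the pc matrix the question asks for), a minimal sketch along the same lines for a single component (reusing cols and scikit-learn's PCA from above; the output column names here are just illustrative) could be:
from pyspark.sql.types import LongType, StringType

loadings_schema = StructType([
    StructField('ID', LongType(), True),
    StructField('feature', StringType(), True),
    StructField('PC_1', DoubleType(), True),
])

def pca_loadings_udf(pdf):
    pca = PCA(n_components=1)
    pca.fit(pdf[cols])
    # components_ has shape (n_components, n_features); emit one row per input feature
    return pd.DataFrame({
        'ID': [pdf['ID'].iloc[0]] * len(cols),
        'feature': cols,
        'PC_1': pca.components_[0],
    })

df.groupby('ID').applyInPandas(pca_loadings_udf, loadings_schema).show()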
Spark<3.0.0
Before Spark 3.0.0, but still with Spark >= 2.3.0, the solution is similar, except that we need to explicitly define a pandas_udf, a vectorized user-defined function executed by Spark, which uses Arrow to transfer the data and pandas to work with it. The concepts are otherwise the same as above.
import pandas as pd
from sklearn.decomposition import PCA
from pyspark.sql.types import StructField, StructType, DoubleType
from pyspark.sql.functions import pandas_udf, PandasUDFType
# macro-function that wraps the pandas_udf and allows passing it some parameters
def pca_by_group(df, cols, pca_components=1):
    # build output schema for the Pandas UDF
    # principal components are generated dynamically based on `pca_components`
    to_append = [StructField('PC_' + str(i+1), DoubleType(), True) for i in range(pca_components)]
    output_schema = StructType(df.schema.fields + to_append)

    # Pandas UDF for applying PCA within each group
    @pandas_udf(output_schema, functionType=PandasUDFType.GROUPED_MAP)
    def pca_udf(pdf):
        X = pdf[cols]
        pca = PCA(n_components=pca_components)
        PC = pca.fit_transform(X)
        PC_df = pd.DataFrame(PC, columns=['PC_' + str(i+1) for i in range(pca_components)])
        result = pd.concat([pdf, PC_df], axis=1, ignore_index=True)
        return result

    # apply the Pandas UDF
    df = df \
        .groupby('ID') \
        .apply(pca_udf)
    return df
new_df = pca_by_group(df, cols=['Value1', 'Value2'], pca_components=1)
new_df.show()
+------+------+---+-------------------+
|Value1|Value2| ID|               PC_1|
+------+------+---+-------------------+
|     3|     1|  1| 1.1962465491226262|
|     4|     2|  1|-0.1572859751773413|
|     5|     2|  1|-1.0389605739452852|
|     3|     3|  2|-1.1755661316905914|
|     6|     2|  2|  1.941315590145264|
|     4|     4|  2|-0.7657494584546719|
+------+------+---+-------------------+

Related

Aggregate GroupBy columns with "all"-like function pyspark

I have a dataframe with a primary key, date, variable, and value. I want to group by the primary key and determine if all values are equal to a provided value. Example data:
import pandas as pd
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({
    "pk": [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "date": [
        date(2022, 5, 6),
        date(2022, 5, 13),
        date(2022, 5, 6),
        date(2022, 5, 6),
        date(2022, 5, 14),
        date(2022, 5, 15),
        date(2022, 5, 5),
        date(2022, 5, 5),
        date(2022, 5, 11),
        date(2022, 5, 12)
    ],
    "variable": ["A", "B", "C", "D", "A", "A", "E", "F", "A", "G"],
    "value": [2, 3, 2, 2, 1, 1, 1, 1, 5, 4]
})
df = spark.createDataFrame(df)
df.show()
#+-----+-----------+--------+-----+
#|   pk|       date|variable|value|
#+-----+-----------+--------+-----+
#| 1| 2022-05-06| A| 2|
#| 1| 2022-05-13| B| 3|
#| 1| 2022-05-06| C| 2|
#| 1| 2022-05-06| D| 2|
#| 2| 2022-05-14| A| 1|
#| 2| 2022-05-15| A| 1|
#| 2| 2022-05-05| E| 1|
#| 2| 2022-05-05| F| 1|
#| 3| 2022-05-11| A| 5|
#| 4| 2022-05-12| G| 4|
#+-----+-----------+--------+-----+
So if I want to know whether, given a primary key pk, all the values are equal to 1 (or pass any arbitrary Boolean test), how should I do this? I've tried using applyInPandas, but that is not super efficient, and it seems like there is probably a pretty simple method to do this.
For Spark 3.0+, you could use the forall function to check whether all values collected by collect_list satisfy the Boolean test.
import pyspark.sql.functions as F
df1 = (df
    .groupby("pk")
    .agg(F.expr("forall(collect_list(value), v -> v == 1)").alias("value"))
)
df1.show()
# +---+-----+
# | pk|value|
# +---+-----+
# | 1|false|
# | 3|false|
# | 2| true|
# | 4|false|
# +---+-----+
# or create a column using a window function
df2 = df.withColumn("test", F.expr("forall(collect_list(value) over (partition by pk), v -> v == 1)"))
df2.show()
# +---+----------+--------+-----+-----+
# | pk| date|variable|value| test|
# +---+----------+--------+-----+-----+
# | 1|2022-05-06| A| 2|false|
# | 1|2022-05-13| B| 3|false|
# | 1|2022-05-06| C| 2|false|
# | 1|2022-05-06| D| 2|false|
# | 3|2022-05-11| A| 5|false|
# | 2|2022-05-14| A| 1| true|
# | 2|2022-05-15| A| 1| true|
# | 2|2022-05-05| E| 1| true|
# | 2|2022-05-05| F| 1| true|
# | 4|2022-05-12| G| 4|false|
# +---+----------+--------+-----+-----+
You might want to put it inside a case clause to handle NULL values.
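For example, a minimal sketch of that idea (collect_list drops NULLs, so a group whose values are all NULL would otherwise produce an empty list, and forall over an empty array returns true):
# sketch: groups where every value is NULL come back as NULL instead of true
df3 = (df
    .groupby("pk")
    .agg(F.expr("""
        case when count(value) = 0 then null
             else forall(collect_list(value), v -> v == 1)
        end
    """).alias("value"))
)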

How to apply F.when condition separately for unique subsets of the data

I want to apply a condition over subsets of my data. In the example, I want to use F.when over "A" and "B" from col1 separately, and return a DataFrame that contains both "A" and "B" with the condition applied.
I have tried to use a group by to do this, but I'm not interested in aggregating the data; I want to return the same number of rows before and after the condition is applied.
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame(pd.DataFrame({"col1": ["A", "A", "A", "B", "B"], "score": [1, 2, 3, 1, 2]}))
condition = F.when(F.col("score") > 2, 1).otherwise(0)
Does anyone have any advice as to how to solve this problem? Below is my expected output, but it is crucial that the condition is applied over "A" and "B" separately, as my actual use case is a bit different than the toy example supplied.
Try with:
df.select(df.col1, df.score, condition.alias("send")).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# | A| 1| 0|
# | A| 2| 0|
# | A| 3| 1|
# | B| 1| 0|
# | B| 2| 0|
# +----+-----+----+
(see: pyspark.sql.Column.when)
To apply multiple conditions depending on the row values use:
from pyspark.sql.functions import when
df.withColumn("send", when((df.col1 == "A") & (F.col("score") > 2), 1)
                      .when((df.col1 == "B") & (F.col("score") > 1), 1)
                      .otherwise(0)
).show()
# Out:
# +----+-----+----+
# |col1|score|send|
# +----+-----+----+
# | A| 1| 0|
# | A| 2| 0|
# | A| 3| 1|
# | B| 1| 0|
# | B| 2| 1|
# +----+-----+----+
(pyspark.sql.functions.when)
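If the per-group thresholds are data rather than hard-coded, a minimal sketch (the thresholds DataFrame and its values here are purely illustrative) is to join them in and apply the condition against the joined column:
import pyspark.sql.functions as F

# hypothetical per-group thresholds; adjust to your real lookup data
thresholds = spark.createDataFrame([("A", 2), ("B", 1)], ["col1", "threshold"])
result = (df
    .join(thresholds, on="col1", how="left")
    .withColumn("send", F.when(F.col("score") > F.col("threshold"), 1).otherwise(0))
    .drop("threshold"))
result.show()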

Spread List of Lists to Spark DF with PySpark?

I'm currently struggling with the following issue:
Let's take the following list of lists:
[[1, 2, 3], [4, 5], [6, 7]]
How can I create the following Spark DF out of it, with one row per element of each sublist:
| min_value | value |
---------------------
| 1| 1|
| 1| 2|
| 1| 3|
| 4| 4|
| 4| 5|
| 6| 6|
| 6| 7|
The only way I've found to get this done is by processing this list into another list with for-loops, which then basically already represents all rows of my DF; this is probably not the best way to solve it.
You can create a dataframe and use explode and array_min to get the desired output:
import pyspark.sql.functions as F
l = [[1, 2, 3], [4, 5], [6, 7]]
df = spark.createDataFrame(
    [[l]],
    ['col']
).select(
    F.explode('col').alias('value')
).withColumn(
    'min_value',
    F.array_min('value')
).withColumn(
    'value',
    F.explode('value')
)
df.show()
+-----+---------+
|value|min_value|
+-----+---------+
| 1| 1|
| 2| 1|
| 3| 1|
| 4| 4|
| 5| 4|
| 6| 6|
| 7| 6|
+-----+---------+
import pyspark.sql.functions as F
from pyspark.sql.functions import col

data = [[1, 2, 3], [4, 5], [6, 7]]
Extract the first element of each list in data (here each sublist starts with its minimum):
j = [item[0] for item in data]
Zip the first elements with data, create the df, and explode the value column:
df = spark.createDataFrame(zip(j, data), ['min_value', 'value']).withColumn('value', F.explode(col('value')))
df.show()
+---------+-----+
|min_value|value|
+---------+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 4| 4|
| 4| 5|
| 6| 6|
| 6| 7|
+---------+-----+
Here is my Spark-like solution to this:
Basic Imports and SparkSession creation for example purposes
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Stackoverflow problem 1") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Creating an arbitrary list of lists:
list_of_lists = [[1,3,5], [1053,23], [1,3], [5,2,15,23,5,3]]
Solution below:
# Create an rdd from the list of lists
rdd_ll = spark.sparkContext.parallelize(list_of_lists)
# Compute the min from the previous rdd and store in another rdd
rdd_min = rdd_ll.map(lambda x: min(x))
# create a dataframe by zipping the two rdds and using the explode function
df = spark.createDataFrame(rdd_min.zip(rdd_ll)) \
    .withColumn('value', F.explode('_2')) \
    .drop('_2') \
    .withColumnRenamed('_1', 'min_value')
Output
df.show()
+---------+-----+
|min_value|value|
+---------+-----+
| 1| 1|
| 1| 3|
| 1| 5|
| 23| 1053|
| 23| 23|
| 1| 1|
| 1| 3|
| 2| 5|
| 2| 2|
| 2| 15|
| 2| 23|
| 2| 5|
| 2| 3|
+---------+-----+

Comparing columns in Pyspark

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to create a new column holding, for each row, the max value across those m columns.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?
You can use reduce over a list of column expressions:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
      .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep the name of the max column you can use structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use the above to select the "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())
df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
    [[1, 2, 3], [2, 1, 2], [3, 4, 5]],
    ['col_1', 'col_2', 'col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desired results:
from pyspark.sql.functions import lit, min
df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
          lit('c2').alias('cn2'), min(df.c2).alias('c2'),
          lit('c3').alias('cn3'), min(df.c3).alias('c3')
)\
.rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
.toDF(['Columnn', 'Min']).show()
+-------+---+
|Columnn|Min|
+-------+---+
| c1| 3|
| c2| 2|
| c3| 1|
+-------+---+
Scala solution:
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")
df.rdd.map(row => List(row.getInt(0), row.getInt(1), row.getInt(2)))
  .map(x => (x(0), x(1), x(2), x.min))
  .toDF("c1", "c2", "c3", "min")
  .show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?
This was not obvious. I see no row-based sum of the columns defined in the spark Dataframes API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
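For instance, a minimal sketch that sums only a subset of the columns (the names 'a' and 'b' here are just illustrative):
# builtin Python sum over a generator of Column objects
subset = ['a', 'b']
newdf = df.withColumn('subtotal', sum(df[c] for c in subset))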
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that names list to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With python's reduce, some knowledge of how operator overloading works, and the pyspark code for columns, that becomes:
from functools import reduce  # needed on Python 3; on Python 2 reduce is a builtin

def column_add(a, b):
    return a.__add__(b)

newdf = df.withColumn('total_col',
                      reduce(column_add, (df[col] for col in df.columns)))
Note this is a Python reduce, not a Spark RDD reduce; the second argument to reduce needs the parentheses because it is a generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
... return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
The most straight forward way of doing it is to use the expr function
from pyspark.sql.functions import *
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by @Paul works. Nevertheless I was getting the error, as many others have,
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so that line imported pyspark's sum function, while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal Python built-in sum function.
You can delete the reference to the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.
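A minimal sketch of that pattern, keeping the two sums distinct:
import pyspark.sql.functions as F

# builtin Python sum: row-wise addition of Column objects
df_tot = df.withColumn('total', sum(df[c] for c in df.columns))
# F.sum: the Spark aggregate, kept under the F namespace so it cannot shadow the builtin
df_tot.agg(F.sum('total').alias('grand_total')).show()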
Summing multiple columns from a list into one column
PySpark's sum function (the F.sum aggregate) doesn't do row-wise column addition.
This can be achieved using the expr function.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.
My problem was similar to the above (a bit more complex), as I had to add consecutive column sums as new columns in a PySpark dataframe. This approach uses code from Paul's answer above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df = spark.createDataFrame(data=[(1, 2, 3), (4, 5, 6), (3, 2, 1),
                                 (6, 1, -4), (0, 2, -2), (6, 4, 1),
                                 (4, 5, 2), (5, -3, -5), (6, 4, -1)],
                           schema=['x1', 'x2', 'x3'])
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 3| 2| 1|
| 6| 1| -4|
| 0| 2| -2|
| 6| 4| 1|
| 4| 5| 2|
| 5| -3| -5|
| 6| 4| -1|
+---+---+---+
colnames=df.columns
add new columns that are cumulative sums (consecutive):
for i in range(0, len(colnames)):
    colnameLst = colnames[0:i+1]
    colname = 'cm' + str(i+1)
    df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 3| 6|
| 4| 5| 6| 4| 9| 15|
| 3| 2| 1| 3| 5| 6|
| 6| 1| -4| 6| 7| 3|
| 0| 2| -2| 0| 2| 0|
| 6| 4| 1| 6| 10| 11|
| 4| 5| 2| 4| 9| 11|
| 5| -3| -5| 5| 2| -3|
| 6| 4| -1| 6| 10| 9|
+---+---+---+---+---+---+
'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3
df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
| linha1| valor1| 2|
| linha2| valor2| 5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
| linha1| valor1| 2| 1.0|
| linha2| valor2| 5| 2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
| linha1| valor1| 2| 1.0| 3.0|
| linha2| valor2| 5| 2.5| 7.5|
+--------+--------+--------+--------------+------------+
A very simple approach would be to just use select instead of withColumn, as below:
df = df.select('*', (col("a") + col("b") + col("c")).alias("total"))
This should give you the required sum, with minor changes based on your requirements.
The following approach works for me:
Import pyspark sql functions
from pyspark.sql import functions as F
Use F.expr with the column names joined into an arithmetic expression: data_frame.withColumn('Total_Sum', F.expr('col_name1 + col_name2 + ... + col_nameN'))
