Comparing columns in Pyspark - python

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is to choose the column with the max values in it.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?

You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
    # Fold the columns pairwise, keeping the larger value at each step
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
    .toDF(["a", "b", "c"]))
df.select(row_max("a", "b", "c").alias("max"))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
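One difference is worth knowing before choosing between the two. As far as I understand Spark SQL's null semantics, a comparison with NULL is never true, so the when/otherwise fold above falls through to y, while greatest skips NULLs. The sketch below models that in plain Python (no Spark session needed), with None standing in for NULL. fold_step and sql_greatest are hypothetical helper names, not PySpark functions, and the last row is made-up data.

```python
from functools import reduce

def fold_step(x, y):
    # Models when(x > y, x).otherwise(y): if either side is None (NULL),
    # the comparison is not true, so the otherwise branch returns y.
    if x is not None and y is not None and x > y:
        return x
    return y

def sql_greatest(*values):
    # Models greatest(): NULLs are skipped; the result is NULL only if all are NULL.
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None

rows = [(1, 2, 3), (2, 1, 2), (3, 4, 5), (3, 1, None)]
fold_results = [reduce(fold_step, row) for row in rows]
greatest_results = [sql_greatest(*row) for row in rows]
print(fold_results)      # [3, 2, 5, None]
print(greatest_results)  # [3, 2, 5, 3]
```

If the columns can contain NULLs, it is worth checking which of the two behaviours you actually want.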
If you want to keep the name of the max column, you can use structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
    # Structs are compared field by field: "value" decides, "col" breaks ties
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use the above to select the "top" column:
from pyspark.sql.functions import max as max_
# Count how often each column wins, then take the (count, col) pair with the highest count
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max_(struct(col("count"), col("col"))))
    .first())
df.select(c)
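To see what the struct trick computes, here is a plain-Python model that runs without Spark: Python tuples are compared field by field, which is the ordering the struct-based greatest and max rely on here. The variable names are illustrative only.

```python
from collections import Counter

rows = [{"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 1, "c": 2}, {"a": 3, "b": 4, "c": 5}]

# Per row: the largest (value, column name) pair, compared field by field
winners = [max((value, name) for name, value in row.items()) for row in rows]

# Across rows: count wins per column, then take the largest (count, name) pair
counts = Counter(name for _, name in winners)
top_count, top_col = max((count, name) for name, count in counts.items())
print(winners)             # [(3, 'c'), (2, 'c'), (5, 'c')]
print(top_col, top_count)  # c 3
```

Note the second row: a and c tie on the value 2, and the tie goes to c because the column name is the second field in the comparison.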

We can use greatest
Creating DataFrame
df = spark.createDataFrame(
[[1,2,3], [2,1,2], [3,4,5]],
['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    1|    2|    3|
|    2|    1|    2|
|    3|    4|    5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
|    1|    2|    3|          3|
|    2|    1|    2|          2|
|    3|    4|    5|          5|
+-----+-----+-----+-----------+

You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))

Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10|  1|
|200|  2| 20|
|  3| 30|300|
|400| 40|  4|
+---+---+---+
You can process the above df as below to get the desired results:
from pyspark.sql.functions import lit, min
# Note: this computes the minimum of each column (an aggregate), not a row-wise minimum
(df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
           lit('c2').alias('cn2'), min(df.c2).alias('c2'),
           lit('c3').alias('cn3'), min(df.c3).alias('c3'))
    .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])
    .toDF(['Columnn', 'Min']).show())
+-------+---+
|Columnn|Min|
+-------+---+
|     c1|  3|
|     c2|  2|
|     c3|  1|
+-------+---+

Scala solution:
// Note: the values are converted to strings, so x.min is a lexicographic minimum
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")
df.rdd.map(row => List[String](row(0).toString, row(1).toString, row(2).toString)).map(x => (x(0), x(1), x(2), x.min)).toDF("c1", "c2", "c3", "min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10|  1|  1|
|200|  2| 20|  2|
|  3| 30|300|  3|
|400| 40|  4|  4|
+---+---+---+---+
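A caveat on the snippet above: it takes the minimum of strings, which happens to agree with the numeric minimum for this sample data but will not in general. The plain-Python sketch below shows the same effect (string ordering is lexicographic in Python as well); the third row is hypothetical data added to expose the difference.

```python
rows = [(10, 10, 1), (200, 2, 20), (9, 10, 200)]

# Minimum after converting to strings, as the snippet above does
string_min = [min(str(v) for v in row) for row in rows]
# Minimum of the numbers themselves
numeric_min = [min(row) for row in rows]

print(string_min)   # ['1', '2', '10']
print(numeric_min)  # [1, 2, 9]
```

For the last row the string minimum is "10" because "1" sorts before "9", while the numeric minimum is 9, so comparing the numeric values directly is the safer choice.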

Related

How to generate the max values for new columns in PySpark dataframe?

Suppose I have a pyspark dataframe df.
+---+---+
| a| b|
+---+---+
|  1|  2|
|  2|  3|
|  4|  5|
+---+---+
I'd like to add new column c.
column c = max(0, column b - 100)
+---+---+---+
| a| b| c|
+---+---+---+
|  1|200|100|
|  2|300|200|
|  4| 50|  0|
+---+---+---+
How should I generate the new column c in pyspark dataframe? Thanks in advance!
I hope you are looking for something like this:
from pyspark.sql.functions import col, lit, greatest
df = spark.createDataFrame(
    [
        (1, 200),
        (2, 300),
        (4, 50),
    ],
    ["a", "b"]
)
# greatest(0, b - 100) is the row-wise equivalent of max(0, b - 100)
df_new = df.withColumn("c", greatest(lit(0), col("b") - lit(100)))
df_new.show()

Aggregate GroupBy columns with "all"-like function pyspark

I have a dataframe with a primary key, date, variable, and value. I want to group by the primary key and determine if all values are equal to a provided value. Example data:
import pandas as pd
from datetime import date
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({
    "pk": [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "date": [
        date.fromisoformat("2022-05-06"),
        date.fromisoformat("2022-05-13"),
        date.fromisoformat("2022-05-06"),
        date.fromisoformat("2022-05-06"),
        date.fromisoformat("2022-05-14"),
        date.fromisoformat("2022-05-15"),
        date.fromisoformat("2022-05-05"),
        date.fromisoformat("2022-05-05"),
        date.fromisoformat("2022-05-11"),
        date.fromisoformat("2022-05-12")
    ],
    "variable": ["A", "B", "C", "D", "A", "A", "E", "F", "A", "G"],
    "value": [2, 3, 2, 2, 1, 1, 1, 1, 5, 4]
})
df = spark.createDataFrame(df)
df.show()
#+-----+-----------+--------+-----+
#|   pk|       date|variable|value|
#+-----+-----------+--------+-----+
#|    1| 2022-05-06|       A|    2|
#|    1| 2022-05-13|       B|    3|
#|    1| 2022-05-06|       C|    2|
#|    1| 2022-05-06|       D|    2|
#|    2| 2022-05-14|       A|    1|
#|    2| 2022-05-15|       A|    1|
#|    2| 2022-05-05|       E|    1|
#|    2| 2022-05-05|       F|    1|
#|    3| 2022-05-11|       A|    5|
#|    4| 2022-05-12|       G|    4|
#+-----+-----------+--------+-----+
So if I want to know whether, given a primary key, pk, all the values are equal to 1 (or satisfy any arbitrary Boolean test), how should I do this? I've tried performing an applyInPandas, but that is not very efficient, and it seems like there is probably a pretty simple method to do this.
For Spark 3.0+, you could use the forall function to check whether all values collected by collect_list satisfy the Boolean test.
import pyspark.sql.functions as F
df1 = (df
    .groupby("pk")
    .agg(F.expr("forall(collect_list(value), v -> v == 1)").alias("value"))
)
df1.show()
# +---+-----+
# | pk|value|
# +---+-----+
# |  1|false|
# |  3|false|
# |  2| true|
# |  4|false|
# +---+-----+
# or create a column using window function
df2 = df.withColumn("test", F.expr("forall(collect_list(value) over (partition by pk), v -> v == 1)"))
df2.show()
# +---+----------+--------+-----+-----+
# | pk|      date|variable|value| test|
# +---+----------+--------+-----+-----+
# |  1|2022-05-06|       A|    2|false|
# |  1|2022-05-13|       B|    3|false|
# |  1|2022-05-06|       C|    2|false|
# |  1|2022-05-06|       D|    2|false|
# |  3|2022-05-11|       A|    5|false|
# |  2|2022-05-14|       A|    1| true|
# |  2|2022-05-15|       A|    1| true|
# |  2|2022-05-05|       E|    1| true|
# |  2|2022-05-05|       F|    1| true|
# |  4|2022-05-12|       G|    4|false|
# +---+----------+--------+-----+-----+
You might want to put it inside a case clause to handle NULL values.
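To illustrate why NULLs need attention, here is a plain-Python model of the aggregation (no Spark required). It rests on two assumptions that you should verify against your Spark version: that collect_list drops NULL values, and that forall over an empty array is vacuously true, as Python's all() is. The row with pk 5 is hypothetical data, and naive and strict are illustrative names.

```python
from collections import defaultdict

rows = [(1, 2), (1, 3), (2, 1), (2, 1), (5, None)]

collected = defaultdict(list)   # models collect_list(value) per pk
row_counts = defaultdict(int)   # models count(*) per pk
for pk, value in rows:
    row_counts[pk] += 1
    bucket = collected[pk]      # creates the key even if every value is None
    if value is not None:
        bucket.append(value)

# The test as written in the answer: passes vacuously when nothing was collected
naive = {pk: all(v == 1 for v in vals) for pk, vals in collected.items()}
# A stricter variant: also require that no value in the group was NULL
strict = {pk: len(vals) == row_counts[pk] and all(v == 1 for v in vals)
          for pk, vals in collected.items()}
print(naive)   # {1: False, 2: True, 5: True}
print(strict)  # {1: False, 2: True, 5: False}
```

Under those assumptions, a group whose values are all NULL passes the naive test, which is the situation an extra case clause would guard against.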

Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(a=1, b=[1,2,3],c=[7,8,9], d='foo')])
# +---+---------+---------+---+
# |  a|        b|        c|  d|
# +---+---------+---------+---+
# |  1|[1, 2, 3]|[7, 8, 9]|foo|
# +---+---------+---------+---+
What I want:
+---+---+----+------+
|  a|  b|  c |  d   |
+---+---+----+------+
|  1|  1|  7 |  foo |
|  1|  2|  8 |  foo |
|  1|  3|  9 |  foo |
+---+---+----+------+
If I only had one list column, this would be easy by just doing an explode:
df_exploded = df.withColumn('b', explode('b'))
# >>> df_exploded.show()
# +---+---+---------+---+
# |  a|  b|        c|  d|
# +---+---+---------+---+
# |  1|  1|[7, 8, 9]|foo|
# |  1|  2|[7, 8, 9]|foo|
# |  1|  3|[7, 8, 9]|foo|
# +---+---+---------+---+
However, if I try to also explode the c column, I end up with a dataframe with a length the square of what I want:
df_exploded_again = df_exploded.withColumn('c', explode('c'))
# >>> df_exploded_again.show()
# +---+---+---+---+
# |  a|  b|  c|  d|
# +---+---+---+---+
# |  1|  1|  7|foo|
# |  1|  1|  8|foo|
# |  1|  1|  9|foo|
# |  1|  2|  7|foo|
# |  1|  2|  8|foo|
# |  1|  2|  9|foo|
# |  1|  3|  7|foo|
# |  1|  3|  8|foo|
# |  1|  3|  9|foo|
# +---+---+---+---+
What I want is, for each column, to take the nth element of the array in that column and add it to a new row. I've tried mapping an explode across all columns in the dataframe, but that doesn't seem to work either:
df_split = df.rdd.map(lambda col: df.withColumn(col, explode(col))).toDF()
Spark >= 2.4
You can replace the zip_ udf (defined in the Spark < 2.4 section below) with the arrays_zip function:
from pyspark.sql.functions import arrays_zip, col, explode
(df
    .withColumn("tmp", arrays_zip("b", "c"))
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.b"), col("tmp.c"), "d"))
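The reason zipping first matters can be seen in a plain-Python model (no Spark needed): exploding one array and then the other pairs every b with every c, which is a Cartesian product, whereas zipping pairs elements by position. The names chained and zipped are illustrative only.

```python
from itertools import product

row = {"a": 1, "b": [1, 2, 3], "c": [7, 8, 9], "d": "foo"}

# Two successive explodes: every element of b paired with every element of c
chained = [(row["a"], b, c, row["d"]) for b, c in product(row["b"], row["c"])]
# Zip, then a single explode: elements paired by position
zipped = [(row["a"], b, c, row["d"]) for b, c in zip(row["b"], row["c"])]

print(len(chained))  # 9
print(zipped)        # [(1, 1, 7, 'foo'), (1, 2, 8, 'foo'), (1, 3, 9, 'foo')]
```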
Spark < 2.4
With DataFrames and UDF:
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, udf, explode
zip_ = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        # Adjust types to reflect data types
        StructField("first", IntegerType()),
        StructField("second", IntegerType())
    ]))
)
(df
    .withColumn("tmp", zip_("b", "c"))
    # UDF output cannot be directly passed to explode
    .withColumn("tmp", explode("tmp"))
    .select("a", col("tmp.first").alias("b"), col("tmp.second").alias("c"), "d"))
With RDDs:
(df
    .rdd
    .flatMap(lambda row: [(row.a, b, c, row.d) for b, c in zip(row.b, row.c)])
    .toDF(["a", "b", "c", "d"]))
Both solutions are inefficient due to Python communication overhead. If the array length is fixed and known in advance, you can do something like this:
from functools import reduce
from pyspark.sql import DataFrame
# Length of array
n = 3
# For legacy Python you'll need a separate function
# in place of method accessor
reduce(
    DataFrame.unionAll,
    (df.select("a", col("b").getItem(i), col("c").getItem(i), "d")
        for i in range(n))
).toDF("a", "b", "c", "d")
or even:
from pyspark.sql.functions import array, struct
# SQL level zip of arrays of known size
# followed by explode
tmp = explode(array(*[
    struct(col("b").getItem(i).alias("b"), col("c").getItem(i).alias("c"))
    for i in range(n)
]))
(df
    .withColumn("tmp", tmp)
    .select("a", col("tmp").getItem("b"), col("tmp").getItem("c"), "d"))
This should be significantly faster compared to UDF or RDD. Generalized to support an arbitrary number of columns:
# This uses keyword only arguments
# If you use legacy Python you'll have to change signature
# Body of the function can stay the same
def zip_and_explode(*colnames, n):
    return explode(array(*[
        struct(*[col(c).getItem(i).alias(c) for c in colnames])
        for i in range(n)
    ]))
df.withColumn("tmp", zip_and_explode("b", "c", n=3))
You'd need to use flatMap, not map as you want to make multiple output rows out of each input row.
from pyspark.sql import Row
def dualExplode(r):
    rowDict = r.asDict()
    bList = rowDict.pop('b')
    cList = rowDict.pop('c')
    # Emit one output row per (b, c) pair, copying the remaining fields
    for b, c in zip(bList, cList):
        newDict = dict(rowDict)
        newDict['b'] = b
        newDict['c'] = c
        yield Row(**newDict)
df_split = sqlContext.createDataFrame(df.rdd.flatMap(dualExplode))
One liner (for Spark>=2.4.0):
(df.withColumn("bc", arrays_zip("b", "c"))
    .select("a", explode("bc").alias("tbc"))
    .select("a", "tbc.b", "tbc.c").show())
Imports required:
from pyspark.sql.functions import arrays_zip, explode
Steps:
Create a column bc which is an arrays_zip of columns b and c
Explode bc to get a struct tbc
Select the required columns a, b and c (all exploded as required).
Output:
> df.withColumn("bc", arrays_zip("b","c")).select("a", explode("bc").alias("tbc")).select("a", "tbc.b", col("tbc.c")).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  7|
|  1|  2|  8|
|  1|  3|  9|
+---+---+---+

Spark pivot one column but keep others intact

Given the following dataframe, how do I pivot the max scores but aggregate the sum of plays?
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([
("u1", "g1", 10, 0, 1),
("u1", "g3", 2, 2, 1),
("u1", "g3", 5, 3, 1),
("u1", "g4", 5, 4, 1),
("u2", "g2", 1, 1, 1),
], ["UserID", "GameID", "Score", "Time", "Plays"])
Desired Output
+------+-------------+-------------+-----+
|UserID|MaxScoreGame1|MaxScoreGame2|Plays|
+------+-------------+-------------+-----+
|    u1|           10|            5|    4|
|    u2|            1|         null|    1|
+------+-------------+-------------+-----+
I posted a solution below but I'm hoping to avoid using join.
I don't think it is a real improvement but you can add total number of plays
...
.select(
F.col("*"),
F.row_number().over(rowNumberWindow).alias("GameNumber"),
F.sum("Plays").over(rowNumberWindow.orderBy()).alias("total_plays")
)
...
and use it later as a secondary grouping column for pivot:
...
.groupBy("UserID", "total_plays")
.pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
.agg(F.max("Score"))
...
Here's a solution using join, which I'm hoping to avoid:
Sum dataframe
df_sum = df.groupBy("UserID").agg(F.sum("Plays").alias("Plays")).alias("df_sum")
df_sum.show()
+------+-----+
|UserID|Plays|
+------+-----+
|    u1|    4|
|    u2|    1|
+------+-----+
Pivot dataframe
rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))
df_piv = (df
    .groupBy("UserID", "GameID")
    .agg(F.sum("Plays").alias("Plays"),
         F.max("Score").alias("Score"),
         F.min("Time").alias("Time"))
    .select(F.col("*"),
            F.row_number().over(rowNumberWindow).alias("GameNumber"))
    .filter(F.col("GameNumber") <= F.lit(2))
    .withColumn("GameCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
    .groupBy("UserID")
    .pivot("GameCol", ["MaxScoreGame1", "MaxScoreGame2"])
    .agg(F.max("Score"))
).alias("df_piv")
df_piv.show()
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
|    u1|           10|            5|
|    u2|            1|         null|
+------+-------------+-------------+
Joined dataframe
df_joined = df_sum.join(df_piv, F.col("df_sum.UserID") == F.col("df_piv.UserID"))
df_joined.show()
+------+-----+------+-------------+-------------+
|UserID|Plays|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-----+------+-------------+-------------+
|    u1|    4|    u1|           10|            5|
|    u2|    1|    u2|            1|         null|
+------+-----+------+-------------+-------------+

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?
This was not obvious. I see no row-based sum of the columns defined in the spark Dataframes API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
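As for why it works: Python's builtin sum starts from 0 and adds each item in turn, so the first step is 0 + column. To my understanding, PySpark's Column defines the reflected operator __radd__ alongside __add__, which is what lets that first step succeed. The sketch below imitates this in plain Python; Expr is a hypothetical stand-in that only records the expression text, not a PySpark class.

```python
from functools import reduce
import operator

class Expr:
    """Hypothetical stand-in for a Spark Column: it only records the expression text."""
    def __init__(self, text):
        self.text = text
    def __add__(self, other):
        return Expr("({} + {})".format(self.text, _text(other)))
    def __radd__(self, other):
        # Called for `0 + Expr`, because int does not know how to add an Expr
        return Expr("({} + {})".format(_text(other), self.text))

def _text(value):
    return value.text if isinstance(value, Expr) else repr(value)

columns = [Expr(name) for name in ("a", "b", "c")]
via_sum = sum(columns)                      # builtin sum starts from 0
via_reduce = reduce(operator.add, columns)  # no start value involved
print(via_sum.text)     # (((0 + a) + b) + c)
print(via_reduce.text)  # ((a + b) + c)
```

The only visible difference between the two is the leading literal 0 that the builtin sum introduces.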
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that names list to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With Python's reduce, some knowledge of how operator overloading works, and the PySpark source code for Column, that becomes:
from functools import reduce  # reduce is not a builtin in Python 3
def column_add(a, b):
    return a.__add__(b)
newdf = df.withColumn('total_col',
    reduce(column_add, (df[col] for col in df.columns)))
Note this is a Python reduce, not a Spark RDD reduce, and the parentheses around the second argument to reduce are required because it is a generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
...     return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
The most straightforward way of doing it is to use the expr function:
from pyspark.sql.functions import expr
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by @Paul works. Nevertheless, I was getting the following error, as have many others I have seen:
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so the line imported the sum pyspark command while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal python sum function.
You can delete the reference of the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.
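Another option, if you would rather keep the from-import, is Python's standard builtins module, which always exposes the original sum regardless of what the name sum currently points to in your module. The sketch below is plain Python; the shadowing function is a hypothetical stand-in for the imported pyspark sum so that the example runs without Spark.

```python
import builtins

def sum(column_name):
    # Hypothetical stand-in for an imported aggregate that shadows the builtin
    return "SUM({})".format(column_name)

values = [1, 2, 3]
shadowed = sum("a")             # calls the stand-in, not the builtin
total = builtins.sum(values)    # always the real builtin
print(shadowed)  # SUM(a)
print(total)     # 6
```

In the withColumn line that would mean writing builtins.sum(df[col] for col in df.columns), although the namespaced import shown above is arguably the cleaner fix.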
Summing multiple columns from a list into one column
PySpark's sum function is an aggregate over the rows of a single column; it does not add columns together.
Row-wise addition can be achieved using the expr function.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.
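One thing to watch with a joined expression string: column names that contain spaces or other special characters have to be quoted, and Spark SQL uses backticks for that. Below is a minimal sketch of building the quoted string; it only does the string handling, so it runs without Spark. The column name "unit price" is hypothetical, and names that themselves contain backticks are not handled.

```python
cols_list = ["a", "b", "unit price"]

# Wrap every name in backticks before joining, so odd names survive parsing
expression = " + ".join("`{}`".format(c) for c in cols_list)
print(expression)  # `a` + `b` + `unit price`
```

The resulting string can then be passed to expr in the same way as the unquoted version.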
My problem was similar to the above (a bit more complex), as I had to add consecutive column sums as new columns in a PySpark dataframe. This approach uses the sum-based code from Paul's Version 2 above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df = spark.createDataFrame(
    data=[(1, 2, 3), (4, 5, 6), (3, 2, 1),
          (6, 1, -4), (0, 2, -2), (6, 4, 1),
          (4, 5, 2), (5, -3, -5), (6, 4, -1)],
    schema=['x1', 'x2', 'x3'])
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  3|  2|  1|
|  6|  1| -4|
|  0|  2| -2|
|  6|  4|  1|
|  4|  5|  2|
|  5| -3| -5|
|  6|  4| -1|
+---+---+---+
colnames=df.columns
add new columns that are cumulative sums (consecutive):
for i in range(0, len(colnames)):
    colnameLst = colnames[0:i+1]
    colname = 'cm' + str(i+1)
    # Python's builtin sum over the first i+1 columns
    df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
|  1|  2|  3|  1|  3|  6|
|  4|  5|  6|  4|  9| 15|
|  3|  2|  1|  3|  5|  6|
|  6|  1| -4|  6|  7|  3|
|  0|  2| -2|  0|  2|  0|
|  6|  4|  1|  6| 10| 11|
|  4|  5|  2|  4|  9| 11|
|  5| -3| -5|  5|  2| -3|
|  6|  4| -1|  6| 10|  9|
+---+---+---+---+---+---+
'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3
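For a quick check of these definitions without a Spark session, the same running totals can be reproduced per row in plain Python with itertools.accumulate. The three rows are taken from the table above; this only models the arithmetic and is not a replacement for the DataFrame code.

```python
from itertools import accumulate

rows = [(1, 2, 3), (4, 5, 6), (6, 1, -4)]

# For each row, the running totals (cm1, cm2, cm3) across x1, x2, x3
cumulative = [tuple(accumulate(row)) for row in rows]
print(cumulative)  # [(1, 3, 6), (4, 9, 15), (6, 7, 3)]
```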
df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
|  linha1|  valor1|       2|
|  linha2|  valor2|       5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
|  linha1|  valor1|       2|           1.0|
|  linha2|  valor2|       5|           2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
|  linha1|  valor1|       2|           1.0|         3.0|
|  linha2|  valor2|       5|           2.5|         7.5|
+--------+--------+--------+--------------+------------+
A very simple approach would be to just use select instead of withColumn, as below:
df = df.select('*', (col("a") + col("b") + col("c")).alias("total"))
This should give you the required sum, with minor changes based on your requirements.
The following approach works for me:
Import pyspark sql functions
from pyspark.sql import functions as F
Use F.expr(list_of_columns):
data_frame.withColumn('Total_Sum', F.expr('col_name1+col_name2+..col_namen'))
