Is it possible to return the "fav_place" column in pyspark without using a join (see example below)?
+-----+-------+------+       +-----+-------+------+-----------+----------+
|  day|  place|number|       |  day|  place|number| max_number| fav_place|
+-----+-------+------+       +-----+-------+------+-----------+----------+
|  Mon|Place A|    10|       |  Mon|Place A|    10|         42|   Place B|
|  Mon|Place B|    42| ===>> |  Mon|Place B|    42|         42|   Place B|
| Tues|Place C|    47|       | Tues|Place C|    47|         47|   Place C|
| Tues|Place D|    41|       | Tues|Place D|    41|         47|   Place C|
|Thurs|Place D|    45|       |Thurs|Place D|    45|         45|   Place D|
|  Fri|Place E|    64|       |  Fri|Place E|    64|         64|   Place E|
|  Fri|Place A|    12|       |  Fri|Place A|    12|         64|   Place E|
|  Wed|Place F|    54|       |  Wed|Place F|    54|         54|   Place F|
|  Wed|Place A|     7|       |  Wed|Place A|     7|         54|   Place F|
|  Wed|Place C|     1|       |  Wed|Place C|     1|         54|   Place F|
+-----+-------+------+       +-----+-------+------+-----------+----------+
The max(number) column should be relatively simple to calculate with...
eg_window = Window.partitionBy(['day'])
df = df.withColumn('max_number', max('number').over(eg_window))
...but I can't figure out how to get the corresponding place without having to do a join.
Is it possible? Or is it better to just use a group by + join?
Is it possible?
Yes.
from pyspark.sql.functions import desc, first
from pyspark.sql.window import Window

eg_window = Window.partitionBy(['day']).orderBy(desc('number'))
df.select('*',
          first('number').over(eg_window).alias('max_number'),
          first('place').over(eg_window).alias('fav_place'))
Or is it better to just use a group by + join?
With the sample data, the window function was faster; however, that could be due to how I wrote the join. I'll leave the actual benchmarking to you.
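For reference, a minimal sketch of the group by + join alternative (assuming the same df; note that ties on number within a day would produce duplicate rows):

from pyspark.sql import functions as F

# max number per day
max_df = df.groupBy('day').agg(F.max('number').alias('max_number'))

# recover the place holding that max, then join back onto the original rows
fav_df = (df.join(max_df, on='day')
            .where(F.col('number') == F.col('max_number'))
            .select('day', 'max_number', F.col('place').alias('fav_place')))

result = df.join(fav_df, on='day', how='left')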
I have a df on the left like this one:
+----+-----+
| id|value|
+----+-----+
| 2| xx|
| 4| xx|
| 11| xx|
| 14| xx|
| 27| xx|
| 28| xx|
| 56| xx|
| 55| xx|
+----+-----+
And another one on the right like this one:
+-----+---+----+
|start|end| ov |
+-----+---+----+
| 0| 9| A|
| 10| 19| B|
| 20| 29| C|
| 30| 39| D|
| 40| 49| F|
+-----+---+----+
And I need to join the rows when the id of the first table falls within the start/end range of the second table. The output should look like this:
+----+-----+----+
| id|value| ov |
+----+-----+----+
| 2| xx| A|
| 4| xx| A|
| 11| xx| B|
| 14| xx| B|
| 27| xx| C|
| 28| xx| C|
| 56| xx| |
| 55| xx| |
+----+-----+----+
How can I achieve this result with PySpark?
Use the between operator with a left join.
Example:
# using the DataFrame API
df.join(df1, (df['id'] >= df1['start']) & (df['id'] <= df1['end']), 'left') \
  .select(df["*"], df1['ov']) \
  .show(10, False)

# using the Spark SQL API
df.createOrReplaceTempView("t1")
df1.createOrReplaceTempView("t2")
spark.sql("select t1.*, t2.ov from t1 left join t2 on t1.id between t2.start and t2.end").show()

# this is just sample data
#+---+-----+----+
#| id|value| ov|
#+---+-----+----+
#| 2| xx| A|
#| 4| xx| A|
#| 55| zz|null|
#+---+-----+----+
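As a side note, the DataFrame API condition can also be written with Column.between; a minimal sketch assuming the same df and df1:

df.join(df1, df['id'].between(df1['start'], df1['end']), 'left') \
  .select(df['*'], df1['ov']) \
  .show(10, False)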
I have a Spark data frame like below:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame as below by merging the null rows,
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
preferably in Scala.
You can group on id and aggregate using first with ignorenulls=True for the other columns:
import pyspark.sql.functions as F

(df.groupBy('id')
   .agg(*[F.first(x, ignorenulls=True) for x in df.columns if x != 'id'])
   .show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
// drop the first column ("id"); aggregate the rest with first(…, ignoreNulls = true)
val inputColumns = inputLoadDF.columns.toList.drop(1)
val exprs = inputColumns.map(x => first(x, true))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()
I have a CSV file with columns as below
The required result is nested JSON objects as below:
{
  "Type": "A",
  "Value": "1"
}
{
  "Type": "B",
  "Value": "1"
}
I tried the code below; any help would be deeply appreciated.
from pyspark.sql.functions import *

list1 = ['A', 'B', 'C']
df2 = spark.sql("select * from test limit 10")
df3 = df2.select(list1)
for i in range(0, len(list1)):
    df4 = df3.withColumn("Type", lit(list1[i]))
This can be done with the trick of unpivoting.
Suppose you have a dataset like the one below; let's call it a patient's test results. Columns A, B, C, ... stand for Test Type A, Test Type B, and so on, and the values in those columns are the numeric test results.
+-------------+---+---+---+---+---+---+---+
|PatientNumber| A| B| C| D| E| F| G|
+-------------+---+---+---+---+---+---+---+
| 101| 1| 2| 3| 4| 5| 6| 7|
| 102| 11| 12| 13| 14| 15| 16| 17|
+-------------+---+---+---+---+---+---+---+
I've added a PatientNumber column just to make the data look more sensible; you can remove it from your code.
I saved this dataset to a CSV:
val testDF = spark.read.format("csv").option("header", "true").load("""C:\TestData\CSVtoJSon.csv""")
Let's create two arrays, one for the id columns and another for all the test types:
val idCols = Array("PatientNumber")
val valCols = testDF.columns.diff(idCols)
Then here's the code to unpivot:
val valcolNames = valCols.map(x => List("'" + x + "'", x))
val unPivotedDF = testDF.select($"PatientNumber", expr(s"""stack(${valCols.size}, ${valcolNames.flatMap(x => x).mkString(",")}) as (Type, Value)"""))
Here's what the unpivoted data looks like:
+-------------+----+-----+
|PatientNumber|Type|Value|
+-------------+----+-----+
| 101| A| 1|
| 101| B| 2|
| 101| C| 3|
| 101| D| 4|
| 101| E| 5|
| 101| F| 6|
| 101| G| 7|
| 102| A| 11|
| 102| B| 12|
| 102| C| 13|
| 102| D| 14|
| 102| E| 15|
| 102| F| 16|
| 102| G| 17|
+-------------+----+-----+
Finally, write this unpivoted DF as JSON:
unPivotedDF.coalesce(1).write.format("json").mode("Overwrite").save("""C:\TestData\output""")
The content of the JSON file looks the same as your desired result:
{"PatientNumber":"101","Type":"A","Value":"1"}
{"PatientNumber":"101","Type":"B","Value":"2"}
{"PatientNumber":"101","Type":"C","Value":"3"}
{"PatientNumber":"101","Type":"D","Value":"4"}
{"PatientNumber":"101","Type":"E","Value":"5"}
{"PatientNumber":"101","Type":"F","Value":"6"}
{"PatientNumber":"101","Type":"G","Value":"7"}
{"PatientNumber":"102","Type":"A","Value":"11"}
{"PatientNumber":"102","Type":"B","Value":"12"}
{"PatientNumber":"102","Type":"C","Value":"13"}
{"PatientNumber":"102","Type":"D","Value":"14"}
{"PatientNumber":"102","Type":"E","Value":"15"}
{"PatientNumber":"102","Type":"F","Value":"16"}
{"PatientNumber":"102","Type":"G","Value":"17"}
Here is my dataframe:
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
df.show()
+--------+----------+-----------+-------+
|Customer|FlightDate|IssuingDate|Revenue|
+--------+----------+-----------+-------+
| a| 20| 10| 100|
| a| 40| 15| 50|
| a| 51| 44| 40|
| a| 50| 45| 70|
| a| 60| 55| 60|
| b| 15| 10| 40|
| b| 27| 2| 30|
| b| 37| 30| 100|
| b| 36| 32| 200|
| b| 50| 24| 100|
+--------+----------+-----------+-------+
For convenience, I used numbers for days.
For each customer, I would like to sum the revenues for all issuing dates between the studied FlightDate and the studied FlightDate + 10 days.
That is to say:
For the first line: I sum all revenues for IssuingDate between day 20 and day 30, which gives 0 here.
For the second line: I sum all revenues for IssuingDate between day 40 and day 50, that is to say 40 + 70 = 110.
Here is the desired result:
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|Result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 51| 44| 40| 60|
| a| 50| 45| 70| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 37| 30| 100| 0|
| b| 36| 32| 200| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
I know it will involve some window functions, but this one seems a bit tricky. Thanks!
No need for a window function. It is just a join and an agg:
df.alias("df").join(
df.alias("df_2"),
on=F.expr(
"df.Customer = df_2.Customer "
"and df_2.issuingdate between df.flightdate and df.flightdate+10"
),
how='left'
).groupBy(
*('df.{}'.format(c)
for c
in df.columns)
).agg(
F.sum(F.coalesce(
"df_2.revenue",
F.lit(0))
).alias("result")
).show()
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 50| 45| 70| 60|
| a| 51| 44| 40| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 36| 32| 200| 0|
| b| 37| 30| 100| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
If you would like to sum the Revenue for the current row and the next 10 days, then you can use the code below.
For example:
First line: FlightDate = 20 and you need the revenue between 20 and 30 (both dates inclusive), which means Total Revenue = 100.
Second line: FlightDate = 40 and you need the revenue between 40 and 50 (both dates inclusive), which means Total Revenue = 50 (for date 40) + 70 (for date 50) = 120.
Third line: FlightDate = 50 and you need the revenue between 50 and 60 (both dates inclusive), which means Total Revenue = 70 (for date 50) + 40 (for date 51) + 60 (for date 60) = 170.
from pyspark.sql import *
from pyspark.sql.functions import *
import pandas as pd
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
# per customer, sum Revenue over FlightDates in [current FlightDate, FlightDate + 10]
windowSpec = Window.partitionBy("Customer").orderBy("FlightDate").rangeBetween(0, 10)
df.withColumn("Sum", sum("Revenue").over(windowSpec)).sort("Customer").show()
The result is as shown below:
+--------+----------+-----------+-------+---+
|Customer|FlightDate|IssuingDate|Revenue|Sum|
+--------+----------+-----------+-------+---+
| a| 20| 10| 100|100|
| a| 40| 15| 50|120|
| a| 50| 45| 70|170|
| a| 51| 44| 40|100|
| a| 60| 55| 60| 60|
| b| 15| 10| 40| 70|
| b| 17| 2| 30| 30|
| b| 36| 32| 200|300|
| b| 37| 30| 100|100|
| b| 50| 24| 100|100|
+--------+----------+-----------+-------+---+