Is it possible to return the "fav_place" column in pyspark without using a join (see example below)?
+-----+-------+------+       +-----+-------+------+-----------+----------+
|  day|  place|number|       |  day|  place|number| max_number| fav_place|
+-----+-------+------+       +-----+-------+------+-----------+----------+
|  Mon|Place A|    10|       |  Mon|Place A|    10|         42|   Place B|
|  Mon|Place B|    42| ===>> |  Mon|Place B|    42|         42|   Place B|
| Tues|Place C|    47|       | Tues|Place C|    47|         47|   Place C|
| Tues|Place D|    41|       | Tues|Place D|    41|         47|   Place C|
|Thurs|Place D|    45|       |Thurs|Place D|    45|         45|   Place D|
|  Fri|Place E|    64|       |  Fri|Place E|    64|         64|   Place E|
|  Fri|Place A|    12|       |  Fri|Place A|    12|         64|   Place E|
|  Wed|Place F|    54|       |  Wed|Place F|    54|         54|   Place F|
|  Wed|Place A|     7|       |  Wed|Place A|     7|         54|   Place F|
|  Wed|Place C|     1|       |  Wed|Place C|     1|         54|   Place F|
+-----+-------+------+       +-----+-------+------+-----------+----------+
The max(number) column should be relatively simple to calculate with...
eg_window = Window.partitionBy(['day'])
df = df.withColumn('max_number', max('number').over(eg_window))
...but I can't figure out how to get the corresponding place without having to do a join.
Is it possible? Or is it better to just use a group by + join?
Is it possible?
Yes.
from pyspark.sql.functions import desc, first
from pyspark.sql.window import Window

eg_window = Window.partitionBy(['day']).orderBy(desc('number'))
df.select('*',
          first('number').over(eg_window).alias('max_number'),
          first('place').over(eg_window).alias('fav_place'))
Or is it better to just use a group by + join?
With the sample data, the window function was faster; however, that could be due to how I wrote the join. I'll leave the actual benchmarking to you.
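For reference, a minimal sketch of the group by + join alternative (assuming the same df; note that ties on number within a day would produce duplicate rows):

from pyspark.sql import functions as F

# max number per day
max_df = df.groupBy('day').agg(F.max('number').alias('max_number'))

# recover the place holding that max, then join back onto the original rows
fav_df = (df.join(max_df, on='day')
            .where(F.col('number') == F.col('max_number'))
            .select('day', 'max_number', F.col('place').alias('fav_place')))

result = df.join(fav_df, on='day', how='left')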
I have a df on the left like this one:
+----+-----+
| id|value|
+----+-----+
| 2| xx|
| 4| xx|
| 11| xx|
| 14| xx|
| 27| xx|
| 28| xx|
| 56| xx|
| 55| xx|
+----+-----+
And another one on the right like this one:
+-----+---+----+
|start|end| ov |
+-----+---+----+
| 0| 9| A|
| 10| 19| B|
| 20| 29| C|
| 30| 39| D|
| 40| 49| F|
+-----+---+----+
And I need to join the rows when the id of the first table falls within the start/end range of the second table. The output should look like this:
+----+-----+----+
| id|value| ov |
+----+-----+----+
| 2| xx| A|
| 4| xx| A|
| 11| xx| B|
| 14| xx| B|
| 27| xx| C|
| 28| xx| C|
| 56| xx| |
| 55| xx| |
+----+-----+----+
How can I achieve this result with PySpark?
Use the between operator with a left join.
Example:
# using the DataFrame API
df.join(df1, (df['id'] >= df1['start']) & (df['id'] <= df1['end']), 'left') \
  .select(df["*"], df1['ov']) \
  .show(10, False)

# using the Spark SQL API
df.createOrReplaceTempView("t1")
df1.createOrReplaceTempView("t2")
spark.sql("select t1.*, t2.ov from t1 left join t2 on t1.id between t2.start and t2.end").show()

# this is just sample data
#+---+-----+----+
#| id|value| ov|
#+---+-----+----+
#| 2| xx| A|
#| 4| xx| A|
#| 55| zz|null|
#+---+-----+----+
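As a side note, the DataFrame API condition can also be written with Column.between; a minimal sketch assuming the same df and df1:

df.join(df1, df['id'].between(df1['start'], df1['end']), 'left') \
  .select(df['*'], df1['ov']) \
  .show(10, False)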
I have a Spark data frame like below:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame as below by merging the null rows,
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
preferably in Scala.
You can group on id and aggregate using first with ignorenulls=True for the other columns:
import pyspark.sql.functions as F

(df.groupBy('id')
   .agg(*[F.first(x, ignorenulls=True) for x in df.columns if x != 'id'])
   .show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
// drop the first column ("id"); aggregate the rest with first(…, ignoreNulls = true)
val inputColumns = inputLoadDF.columns.toList.drop(1)
val exprs = inputColumns.map(x => first(x, true))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()
I have a CSV file with columns as below
The required result is nested JSON objects as below:
{
  "Type": "A",
  "Value": "1"
}
{
  "Type": "B",
  "Value": "1"
}
I tried the code below; any help would be deeply appreciated.
from pyspark.sql.functions import *

list1 = ['A', 'B', 'C']
df2 = spark.sql("select * from test limit 10")
df3 = df2.select(list1)
for i in range(0, len(list1)):
    df4 = df3.withColumn("Type", lit(list1[i]))
This can be done with the trick of unpivoting.
Suppose you have a dataset like the one below; let's call it a patient's test results. Columns A, B, C, ... stand for Test Type A, Test Type B, and so on, and the values in those columns are the numeric test results.
+-------------+---+---+---+---+---+---+---+
|PatientNumber| A| B| C| D| E| F| G|
+-------------+---+---+---+---+---+---+---+
| 101| 1| 2| 3| 4| 5| 6| 7|
| 102| 11| 12| 13| 14| 15| 16| 17|
+-------------+---+---+---+---+---+---+---+
I've added a PatientNumber column just to make the data look more sensible; you can remove it from your code.
I saved this dataset to a CSV:
val testDF = spark.read.format("csv").option("header", "true").load("""C:\TestData\CSVtoJSon.csv""")
Let's create two arrays, one for the id columns and another for all the test types:
val idCols = Array("PatientNumber")
val valCols = testDF.columns.diff(idCols)
Then here's the code to unpivot:
val valcolNames = valCols.map(x => List("'" + x + "'", x))
val unPivotedDF = testDF.select($"PatientNumber", expr(s"""stack(${valCols.size}, ${valcolNames.flatMap(x => x).mkString(",")}) as (Type, Value)"""))
Here's what the unpivoted data looks like:
+-------------+----+-----+
|PatientNumber|Type|Value|
+-------------+----+-----+
| 101| A| 1|
| 101| B| 2|
| 101| C| 3|
| 101| D| 4|
| 101| E| 5|
| 101| F| 6|
| 101| G| 7|
| 102| A| 11|
| 102| B| 12|
| 102| C| 13|
| 102| D| 14|
| 102| E| 15|
| 102| F| 16|
| 102| G| 17|
+-------------+----+-----+
Finally, write this unpivoted DF as JSON:
unPivotedDF.coalesce(1).write.format("json").mode("Overwrite").save("""C:\TestData\output""")
The content of the JSON file looks the same as your desired result:
{"PatientNumber":"101","Type":"A","Value":"1"}
{"PatientNumber":"101","Type":"B","Value":"2"}
{"PatientNumber":"101","Type":"C","Value":"3"}
{"PatientNumber":"101","Type":"D","Value":"4"}
{"PatientNumber":"101","Type":"E","Value":"5"}
{"PatientNumber":"101","Type":"F","Value":"6"}
{"PatientNumber":"101","Type":"G","Value":"7"}
{"PatientNumber":"102","Type":"A","Value":"11"}
{"PatientNumber":"102","Type":"B","Value":"12"}
{"PatientNumber":"102","Type":"C","Value":"13"}
{"PatientNumber":"102","Type":"D","Value":"14"}
{"PatientNumber":"102","Type":"E","Value":"15"}
{"PatientNumber":"102","Type":"F","Value":"16"}
{"PatientNumber":"102","Type":"G","Value":"17"}
Here is my dataframe:
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
df.show()
+--------+----------+-----------+-------+
|Customer|FlightDate|IssuingDate|Revenue|
+--------+----------+-----------+-------+
| a| 20| 10| 100|
| a| 40| 15| 50|
| a| 51| 44| 40|
| a| 50| 45| 70|
| a| 60| 55| 60|
| b| 15| 10| 40|
| b| 27| 2| 30|
| b| 37| 30| 100|
| b| 36| 32| 200|
| b| 50| 24| 100|
+--------+----------+-----------+-------+
For convenience, I used numbers for days.
For each customer, I would like to sum the revenues for all issuing dates between the studied FlightDate and the studied FlightDate + 10 days.
That is to say:
For the first line: I sum all revenues for IssuingDate between day 20 and day 30, which gives 0 here.
For the second line: I sum all revenues for IssuingDate between day 40 and day 50, that is to say 40 + 70 = 110.
Here is the desired result:
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|Result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 51| 44| 40| 60|
| a| 50| 45| 70| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 37| 30| 100| 0|
| b| 36| 32| 200| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
I know it will involve some window functions, but this one seems a bit tricky. Thanks!
No need for a window function. It is just a join and an agg:
df.alias("df").join(
df.alias("df_2"),
on=F.expr(
"df.Customer = df_2.Customer "
"and df_2.issuingdate between df.flightdate and df.flightdate+10"
),
how='left'
).groupBy(
*('df.{}'.format(c)
for c
in df.columns)
).agg(
F.sum(F.coalesce(
"df_2.revenue",
F.lit(0))
).alias("result")
).show()
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 50| 45| 70| 60|
| a| 51| 44| 40| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 36| 32| 200| 0|
| b| 37| 30| 100| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
If you would like to sum the Revenue for the current row and the next 10 days, then you can use the code below.
For example:
First line: FlightDate = 20 and you need the revenue between 20 and 30 (both dates inclusive), which means Total Revenue = 100.
Second line: FlightDate = 40 and you need the revenue between 40 and 50 (both dates inclusive), which means Total Revenue = 50 (for date 40) + 70 (for date 50) = 120.
Third line: FlightDate = 50 and you need the revenue between 50 and 60 (both dates inclusive), which means Total Revenue = 70 (for date 50) + 40 (for date 51) + 60 (for date 60) = 170.
from pyspark.sql import *
from pyspark.sql.functions import *
import pandas as pd
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
# per customer, sum Revenue over FlightDates in [current FlightDate, FlightDate + 10]
windowSpec = Window.partitionBy("Customer").orderBy("FlightDate").rangeBetween(0, 10)
df.withColumn("Sum", sum("Revenue").over(windowSpec)).sort("Customer").show()
The result is as shown below:
+--------+----------+-----------+-------+---+
|Customer|FlightDate|IssuingDate|Revenue|Sum|
+--------+----------+-----------+-------+---+
| a| 20| 10| 100|100|
| a| 40| 15| 50|120|
| a| 50| 45| 70|170|
| a| 51| 44| 40|100|
| a| 60| 55| 60| 60|
| b| 15| 10| 40| 70|
| b| 17| 2| 30| 30|
| b| 36| 32| 200|300|
| b| 37| 30| 100|100|
| b| 50| 24| 100|100|
+--------+----------+-----------+-------+---+