PySpark advanced window function - Python

Here is my dataframe:
import pandas as pd

FlightDate = [20, 40, 51, 50, 60, 15, 17, 37, 36, 50]
IssuingDate = [10, 15, 44, 45, 55, 10, 2, 30, 32, 24]
Revenue = [100, 50, 40, 70, 60, 40, 30, 100, 200, 100]
Customer = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']

df = spark.createDataFrame(
    pd.DataFrame([Customer, FlightDate, IssuingDate, Revenue]).T,
    schema=["Customer", "FlightDate", "IssuingDate", "Revenue"],
)
df.show()
+--------+----------+-----------+-------+
|Customer|FlightDate|IssuingDate|Revenue|
+--------+----------+-----------+-------+
| a| 20| 10| 100|
| a| 40| 15| 50|
| a| 51| 44| 40|
| a| 50| 45| 70|
| a| 60| 55| 60|
| b| 15| 10| 40|
| b| 27| 2| 30|
| b| 37| 30| 100|
| b| 36| 32| 200|
| b| 50| 24| 100|
+--------+----------+-----------+-------+
For convenience, I used plain numbers for days.
For each customer, I would like to sum the revenues over all issuing dates between the studied FlightDate and that FlightDate + 10 days.
That is to say:
For the first line: I sum all revenue for IssuingDate between day 20 and day 30, which gives 0 here.
For the second line: I sum all revenue for IssuingDate between day 40 and day 50, that is to say 40 + 70 = 110.
Here is the desired result :
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|Result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 51| 44| 40| 60|
| a| 50| 45| 70| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 37| 30| 100| 0|
| b| 36| 32| 200| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
I know it will involve some window functions, but this one seems a bit tricky. Thanks!

No need for a window function. It is just a join and an agg:
df.alias("df").join(
df.alias("df_2"),
on=F.expr(
"df.Customer = df_2.Customer "
"and df_2.issuingdate between df.flightdate and df.flightdate+10"
),
how='left'
).groupBy(
*('df.{}'.format(c)
for c
in df.columns)
).agg(
F.sum(F.coalesce(
"df_2.revenue",
F.lit(0))
).alias("result")
).show()
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 50| 45| 70| 60|
| a| 51| 44| 40| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 36| 32| 200| 0|
| b| 37| 30| 100| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+

If you would like to keep the Revenue for the current row and the next 10 days, then you can use the code below.
For example:
First line: FlightDate = 20 and you need the revenue between days 20 and 30 (both inclusive), which means Total Revenue = 100.
Second line: FlightDate = 40 and you need the revenue between days 40 and 50 (both inclusive), which means Total Revenue = 50 (for day 40) + 70 (for day 50) = 120.
Third line: FlightDate = 50 and you need the revenue between days 50 and 60 (both inclusive), which means Total Revenue = 70 (for day 50) + 40 (for day 51) + 60 (for day 60) = 170.
from pyspark.sql import Window
import pyspark.sql.functions as F
import pandas as pd

FlightDate = [20, 40, 51, 50, 60, 15, 17, 37, 36, 50]
IssuingDate = [10, 15, 44, 45, 55, 10, 2, 30, 32, 24]
Revenue = [100, 50, 40, 70, 60, 40, 30, 100, 200, 100]
Customer = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']

df = spark.createDataFrame(
    pd.DataFrame([Customer, FlightDate, IssuingDate, Revenue]).T,
    schema=["Customer", "FlightDate", "IssuingDate", "Revenue"],
)

# Range frame per customer: current FlightDate up to FlightDate + 10.
windowSpec = Window.partitionBy("Customer").orderBy("FlightDate").rangeBetween(0, 10)
df.withColumn("Sum", F.sum("Revenue").over(windowSpec)).sort("Customer").show()
The result is shown below:
+--------+----------+-----------+-------+---+
|Customer|FlightDate|IssuingDate|Revenue|Sum|
+--------+----------+-----------+-------+---+
| a| 20| 10| 100|100|
| a| 40| 15| 50|120|
| a| 50| 45| 70|170|
| a| 51| 44| 40|100|
| a| 60| 55| 60| 60|
| b| 15| 10| 40| 70|
| b| 17| 2| 30| 30|
| b| 36| 32| 200|300|
| b| 37| 30| 100|100|
| b| 50| 24| 100|100|
+--------+----------+-----------+-------+---+
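If the window above complains about the range frame in your Spark version (the example frame is built from a transposed pandas DataFrame, so its columns come through as strings, and rangeBetween expects a numeric or date orderBy column), casting first should help. A minimal sketch under that assumption:
import pyspark.sql.functions as F

# Cast the numeric columns to int so the range frame can be evaluated.
df_num = df.select(
    "Customer",
    *[F.col(c).cast("int").alias(c) for c in ["FlightDate", "IssuingDate", "Revenue"]],
)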

Related

Grouping and sum of columns and eliminate duplicates in PySpark

I have a data frame like below in pyspark
df = spark.createDataFrame(
    [
        ('14_100_00', 'A', 25, 0),
        ('14_100_00', 'A', 0, 24),
        ('15_100_00', 'A', 20, 1),
        ('150_100', 'C', 21, 0),
        ('16', 'A', 0, 20),
        ('16', 'A', 20, 0),
    ],
    ("rust", "name", "value_1", "value_2"),
)
df.show()
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 0|
|14_100_00| A| 0| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 0| 20|
| 16| A| 20| 0|
+---------+----+-------+-------+
I am trying to update the value_1 and value_2 columns based on the conditions below:
when the rust and name columns are the same, use the sum of value_1 within that group as value_1
when the rust and name columns are the same, use the sum of value_2 within that group as value_2
Expected result:
+---------+----+-------+-------+
| rust|name|value_1|value_2|
+---------+----+-------+-------+
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
| 150_100| C| 21| 0|
| 16| A| 20| 20|
+---------+----+-------+-------+
I have tried this:
df1 = df.withColumn("VALUE_1", f.sum("VALUE_1").over(Window.partitionBy("rust", "name"))).withColumn("VALUE_2", f.sum("VALUE_2").over(Window.partitionBy("rust", "name")))
df1.show()
+---------+----+-------+-------+
| rust|name|VALUE_1|VALUE_2|
+---------+----+-------+-------+
| 150_100| C| 21| 0|
| 16| A| 20| 20|
| 16| A| 20| 20|
|14_100_00| A| 25| 24|
|14_100_00| A| 25| 24|
|15_100_00| A| 20| 1|
+---------+----+-------+-------+
Is there a better way to achieve this without having duplicates?
Use groupBy instead of window functions:
df1 = df.groupBy("rust", "name").agg(
F.sum("value_1").alias("value_1"),
F.sum("value_2").alias("value_2"),
)
df1.show()
#+---------+----+-------+-------+
#| rust|name|value_1|value_2|
#+---------+----+-------+-------+
#|14_100_00| A| 25| 24|
#|15_100_00| A| 20| 1|
#| 150_100| C| 21| 0|
#| 16| A| 20| 20|
#+---------+----+-------+-------+
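If you do want to keep the window version, one option (just a sketch, not from the answer above) is to deduplicate on the grouping keys afterwards; the groupBy above is usually the cleaner and cheaper choice, though.
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("rust", "name")
df1 = (
    df.withColumn("value_1", F.sum("value_1").over(w))
      .withColumn("value_2", F.sum("value_2").over(w))
      .dropDuplicates(["rust", "name"])   # keep one row per (rust, name) group
)
df1.show()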

Merge Rows in Apache spark by eliminating null values

I have a spark data frame like below
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I want to transform the data frame as below by merging the null rows:
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
Preferably in Scala.
You can group on id and aggregate using first with ignorenulls for the other columns:
import pyspark.sql.functions as F

df.groupBy('id').agg(
    *[F.first(x, ignorenulls=True) for x in df.columns if x != 'id']
).show()
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
The Scala way of doing it:
// Aggregate every non-id column with first(_, ignoreNulls = true).
val inputColumns = inputLoadDF.columns.toList.drop(1)  // drop the leading "id" column
val exprs = inputColumns.map(x => first(x, true))
inputLoadDF.groupBy("id").agg(exprs.head, exprs.tail: _*).show()

Get "circular lag" of a column

I would like to create a new column in a pyspark.sql.DataFrame based on lagged values of an existing column. But... I would also like the last values to become the first ones, and the first values to become the last ones. Here is an example:
df = spark.createDataFrame(
    [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500)],
    ['id', 'value'],
)
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| 100|
| 2| 200|
| 3| 300|
| 4| 400|
| 5| 500|
+---+-----+
And the desired output would be:
+---+-----+----------------+-----------------+
| id|value|lag_value_plus_2|lag_value_minus_2|
+---+-----+----------------+-----------------+
| 1| 100| 300| 400|
| 2| 200| 400| 500|
| 3| 300| 500| 100|
| 4| 400| 100| 200|
| 5| 500| 200| 300|
+---+-----+----------------+-----------------+
I can feel it has something to do with window functions or the pyspark.sql.functions.lag function, but I can't figure out how to do it.
Here is one solution I can offer, but I'm not sure it is the most optimized one:
from functools import reduce
from pyspark.sql import Window
import pyspark.sql.functions as F

# Stack three copies of the dataframe: one "before" (x=-1), the original (x=0)
# and one "after" (x=1), so lead/lag can wrap around the edges.
df = reduce(
    lambda a, b: a.union(b),
    [df.withColumn("x", F.lit(i)) for i in [-1, 0, 1]]
)

w = Window.partitionBy().orderBy("x", "id")
df.withColumn(
    "lag_value_plus_2", F.lead("value", 2).over(w)
).withColumn(
    "lag_value_minus_2", F.lag("value", 2).over(w)
).where("x = 0").drop("x").show()
+---+-----+----------------+-----------------+
| id|value|lag_value_plus_2|lag_value_minus_2|
+---+-----+----------------+-----------------+
| 1| 100| 300| 400|
| 2| 200| 400| 500|
| 3| 300| 500| 100|
| 4| 400| 100| 200|
| 5| 500| 200| 300|
+---+-----+----------------+-----------------+
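An alternative sketch (not from the answer above, and not benchmarked): give each row a 0-based position and self-join with modular arithmetic, which avoids tripling the data. It starts from the original df; n and the aliases are names introduced here only for illustration.
from pyspark.sql import Window
import pyspark.sql.functions as F

n = df.count()  # number of rows in the "circle"
indexed = df.withColumn("rn", F.row_number().over(Window.orderBy("id")) - 1)

result = (
    indexed.alias("a")
    # Row two positions ahead, wrapping around the end.
    .join(indexed.alias("b"), F.expr("b.rn = (a.rn + 2) % {0}".format(n)))
    # Row two positions behind, wrapping around the start.
    .join(indexed.alias("c"), F.expr("c.rn = (a.rn - 2 + {0}) % {0}".format(n)))
    .select(
        F.col("a.id"), F.col("a.value"),
        F.col("b.value").alias("lag_value_plus_2"),
        F.col("c.value").alias("lag_value_minus_2"),
    )
    .orderBy("id")
)
result.show()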

Calculating Cumulative sum in PySpark using Window Functions

I have the following sample DataFrame:
rdd = sc.parallelize([(1,20), (2,30), (3,30)])
df2 = spark.createDataFrame(rdd, ["id", "duration"])
df2.show()
+---+--------+
| id|duration|
+---+--------+
| 1| 20|
| 2| 30|
| 3| 30|
+---+--------+
I want to sort this DataFrame in desc order of duration and add a new column which has the cumulative sum of the duration. So I did the following:
from pyspark.sql import Window
import pyspark.sql.functions as F

windowSpec = Window.orderBy(df2['duration'].desc())
df_cum_sum = df2.withColumn("duration_cum_sum", F.sum('duration').over(windowSpec))
df_cum_sum.show()
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 60|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
My desired output is:
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 30|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
How do I get this?
Here is the breakdown:
+--------+----------------+
|duration|duration_cum_sum|
+--------+----------------+
| 30| 30| #First value
| 30| 60| #Current duration + previous cum sum value
| 20| 80| #Current duration + previous cum sum value
+--------+----------------+
You can introduce row_number to break the ties; written in SQL:
df2.selectExpr(
"id", "duration",
"sum(duration) over (order by row_number() over (order by duration desc)) as duration_cum_sum"
).show()
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 30|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
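The same tie-breaking idea with the DataFrame API (just a sketch of the equivalent): materialize the row number first, then run the cumulative sum over it.
from pyspark.sql import Window
import pyspark.sql.functions as F

w_rank = Window.orderBy(F.col("duration").desc())  # tie-breaking order
w_cum = Window.orderBy("rn")                       # cumulative frame over the unique rn

df_cum_sum = (
    df2.withColumn("rn", F.row_number().over(w_rank))
       .withColumn("duration_cum_sum", F.sum("duration").over(w_cum))
       .drop("rn")
)
df_cum_sum.show()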
You can also check this version, which uses an explicit row frame:
df2.withColumn(
    'cumu',
    F.sum('duration').over(
        Window.orderBy(F.col('duration').desc())
              .rowsBetween(Window.unboundedPreceding, 0)
    )
).show()

cumulative sum function in pyspark grouping on multiple columns based on condition

I need to create an event_id, basically a counter, grouping on multiple columns (v_id, d_id, ip, l_id) and incrementing it when delta > 40, to get output like this:
v_id  d_id  ip  l_id  delta  event_id  last_event_flag
1     20    30  40    1      1         N
1     20    30  40    2      1         N
1     20    30  40    3      1         N
1     20    30  40    4      1         Y
1     20    20  40    1      1         Y
1     30    30  40    2      1         N
1     30    30  40    3      1         N
1     30    30  40    4      1         N
1     30    30  40    5      1         Y
I was able to achieve this using a pandas data frame:
df['event_id'] = (df.delta >= 40.0).groupby([df.l_id, df.v_id, df.d_id, df.ip]).cumsum() + 1
df.append(df['event_id'], ignore_index=True)
but I am seeing a memory error when executing it on larger data.
How can I do a similar thing in pyspark?
In pyspark you can do it using a window function.
First, let's create the dataframe. Note that you could also load it directly from a CSV (a short sketch of that follows the snippet):
df = spark.createDataFrame(
    sc.parallelize([
        [1, 20, 30, 40, 1, 1],
        [1, 20, 30, 40, 2, 1],
        [1, 20, 30, 40, 3, 1],
        [1, 20, 30, 40, 4, 1],
        [1, 20, 30, 40, 45, 2],
        [1, 20, 30, 40, 1, 2],
        [1, 30, 30, 40, 2, 1],
        [1, 30, 30, 40, 3, 1],
        [1, 30, 30, 40, 4, 1],
        [1, 30, 30, 40, 5, 1],
    ]),
    ["v_id", "d_id", "ip", "l_id", "delta", "event_id"],
)
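As mentioned above, the same frame could be loaded straight from a CSV instead; a minimal sketch, where the path and options are assumptions:
# Hypothetical path; header/inferSchema depend on how the file was written.
df = spark.read.csv("events.csv", header=True, inferSchema=True)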
Your table has an implicit ordering, so we need to create a monotonically increasing id first, so that we don't end up shuffling the rows around:
import pyspark.sql.functions as psf
df = df.withColumn(
"rn",
psf.monotonically_increasing_id()
)
+----+----+---+----+-----+--------+----------+
|v_id|d_id| ip|l_id|delta|event_id| rn|
+----+----+---+----+-----+--------+----------+
| 1| 20| 30| 40| 1| 1| 0|
| 1| 20| 30| 40| 2| 1| 1|
| 1| 20| 30| 40| 3| 1| 2|
| 1| 20| 30| 40| 4| 1| 3|
| 1| 20| 30| 40| 45| 2| 4|
| 1| 20| 30| 40| 1| 2|8589934592|
| 1| 30| 30| 40| 2| 1|8589934593|
| 1| 30| 30| 40| 3| 1|8589934594|
| 1| 30| 30| 40| 4| 1|8589934595|
| 1| 30| 30| 40| 5| 1|8589934596|
+----+----+---+----+-----+--------+----------+
Now to compute event_id and last_event_flag:
from pyspark.sql import Window

w1 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy("rn")
w2 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy(psf.desc("rn"))

df.withColumn(
    "event_id",
    # Running count of rows with delta >= 40 seen so far in the group, plus 1.
    psf.sum((df.delta >= 40).cast("int")).over(w1) + 1
).withColumn(
    "last_event_flag",
    # True only for the last row of each group (first row when ordered by rn desc).
    psf.row_number().over(w2) == 1
).drop("rn")
+----+----+---+----+-----+--------+---------------+
|v_id|d_id| ip|l_id|delta|event_id|last_event_flag|
+----+----+---+----+-----+--------+---------------+
| 1| 20| 30| 40| 1| 1| false|
| 1| 20| 30| 40| 2| 1| false|
| 1| 20| 30| 40| 3| 1| false|
| 1| 20| 30| 40| 4| 1| false|
| 1| 20| 30| 40| 45| 2| false|
| 1| 20| 30| 40| 1| 2| true|
| 1| 30| 30| 40| 2| 1| false|
| 1| 30| 30| 40| 3| 1| false|
| 1| 30| 30| 40| 4| 1| false|
| 1| 30| 30| 40| 5| 1| true|
+----+----+---+----+-----+--------+---------------+
Perhaps you should calculate df = df[df.delta >= 40] before running the groupby - I'm not sure if that matters.
Also, you can look into chunksize to perform the calculations on chunks of the csv for memory efficiency. For example, you could break the data into chunks of 10,000 lines and then run the calculations to avoid the memory error; see the sketch below the links.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
How to read a 6 GB csv file with pandas
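A minimal sketch of the chunked pandas read mentioned above (the file name and the per-chunk filter are assumptions; adapt the per-chunk step to your actual calculation):
import pandas as pd

pieces = []
for chunk in pd.read_csv("events.csv", chunksize=10_000):  # hypothetical file
    # Example per-chunk step: keep only the rows that can start a new event.
    pieces.append(chunk[chunk["delta"] >= 40.0])

df_small = pd.concat(pieces, ignore_index=True)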
