cumulative sum function in pyspark grouping on multiple columns based on condition - python

I need to create an event_id, basically a counter, grouping on multiple columns (v_id, d_id, ip, l_id) and incrementing it when delta > 40, to get output like this:
v_id d_id ip l_id delta event_id last_event_flag
1    20   30 40   1     1        N
1    20   30 40   2     1        N
1    20   30 40   3     1        N
1    20   30 40   4     1        Y
1    20   20 40   1     1        Y
1    30   30 40   2     1        N
1    30   30 40   3     1        N
1    30   30 40   4     1        N
1    30   30 40   5     1        Y
I was able to achieve this using a pandas dataframe:
df['event_id'] = (df.delta >= 40.0).groupby([df.l_id, df.v_id, df.d_id, df.ip]).cumsum() + 1
df.append(df['event_id'], ignore_index=True)
but I am seeing a memory error when executing it on larger data.
How can I do the same thing in PySpark?

In PySpark you can do it with a window function.
First let's create the dataframe (note that you could also load it directly as a dataframe from a CSV; a minimal sketch of that follows the block below):
df = spark.createDataFrame(
    sc.parallelize([
        [1, 20, 30, 40, 1, 1],
        [1, 20, 30, 40, 2, 1],
        [1, 20, 30, 40, 3, 1],
        [1, 20, 30, 40, 4, 1],
        [1, 20, 30, 40, 45, 2],
        [1, 20, 30, 40, 1, 2],
        [1, 30, 30, 40, 2, 1],
        [1, 30, 30, 40, 3, 1],
        [1, 30, 30, 40, 4, 1],
        [1, 30, 30, 40, 5, 1]
    ]),
    ["v_id", "d_id", "ip", "l_id", "delta", "event_id"]
)
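For reference, a minimal CSV-loading sketch; the file name events.csv and the header/inferSchema options are assumptions about your input, not something given in the question:
df = spark.read.csv("events.csv", header=True, inferSchema=True)  # infer column types from the data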
Your table has an implicit ordering, so we need to create a monotonically increasing id to make sure we don't end up shuffling it around:
import pyspark.sql.functions as psf
df = df.withColumn(
    "rn",
    psf.monotonically_increasing_id()
)
+----+----+---+----+-----+--------+----------+
|v_id|d_id| ip|l_id|delta|event_id| rn|
+----+----+---+----+-----+--------+----------+
| 1| 20| 30| 40| 1| 1| 0|
| 1| 20| 30| 40| 2| 1| 1|
| 1| 20| 30| 40| 3| 1| 2|
| 1| 20| 30| 40| 4| 1| 3|
| 1| 20| 30| 40| 45| 2| 4|
| 1| 20| 30| 40| 1| 2|8589934592|
| 1| 30| 30| 40| 2| 1|8589934593|
| 1| 30| 30| 40| 3| 1|8589934594|
| 1| 30| 30| 40| 4| 1|8589934595|
| 1| 30| 30| 40| 5| 1|8589934596|
+----+----+---+----+-----+--------+----------+
Now to compute event_id and last_event_flag:
from pyspark.sql import Window
w1 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy("rn")
w2 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy(psf.desc("rn"))
df.withColumn(
    "event_id",
    psf.sum((df.delta >= 40).cast("int")).over(w1) + 1
).withColumn(
    "last_event_flag",
    psf.row_number().over(w2) == 1
).drop("rn")
+----+----+---+----+-----+--------+---------------+
|v_id|d_id| ip|l_id|delta|event_id|last_event_flag|
+----+----+---+----+-----+--------+---------------+
| 1| 20| 30| 40| 1| 1| false|
| 1| 20| 30| 40| 2| 1| false|
| 1| 20| 30| 40| 3| 1| false|
| 1| 20| 30| 40| 4| 1| false|
| 1| 20| 30| 40| 45| 2| false|
| 1| 20| 30| 40| 1| 2| true|
| 1| 30| 30| 40| 2| 1| false|
| 1| 30| 30| 40| 3| 1| false|
| 1| 30| 30| 40| 4| 1| false|
| 1| 30| 30| 40| 5| 1| true|
+----+----+---+----+-----+--------+---------------+

Perhaps you should filter with df = df[df.delta >= 40] before running the groupby; I'm not sure whether that matters.
You can also look into the chunksize parameter to perform the calculation on chunks of the CSV for memory efficiency: break the data up into chunks of, say, 10,000 lines and run the calculation per chunk to avoid the memory error (a sketch follows the links below).
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
How to read a 6 GB csv file with pandas
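A minimal sketch of chunked reading, assuming the data lives in a hypothetical events.csv; note that a grouped cumulative sum would still need per-group state carried across chunk boundaries:
import pandas as pd

for chunk in pd.read_csv("events.csv", chunksize=10000):
    # each `chunk` is an ordinary DataFrame of up to 10,000 rows
    process(chunk)  # process() is a placeholder for the per-chunk computation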

Related

Manipulate a complex dataframe in PySpark

I'm preparing a dataset to train a machine learning model using PySpark. The dataframe I'm working on contains thousands of records about presences registered inside rooms of different buildings and cities on different days. These presences are in this format:
+----+--------+----+---+-----+------+--------+-------+---------+
|room|building|city|day|month|inHour|inMinute|outHour|outMinute|
+----+--------+----+---+-----+------+--------+-------+---------+
| 1| 1| 1| 9| 11| 8| 27| 13| 15|
| 1| 1| 1| 9| 11| 8| 28| 13| 5|
| 1| 1| 1| 9| 11| 8| 32| 13| 7|
| 1| 1| 1| 9| 11| 8| 32| 8| 50|
| 1| 1| 1| 9| 11| 8| 32| 8| 48|
+----+--------+----+---+-----+------+--------+-------+---------+
inHour and inMinute stand for the hour and minute of access; outHour and outMinute refer to the time of exit. Hours are in 0-23 format.
All the columns contain only integer values.
What I'm missing is the target value of my machine learning model, which is the number of persons for each combination of room, building, city, day, month and a time interval. To explain better: the first row refers to a presence with access hour 8 and exit hour 13, so it should be counted in the records for the intervals 8-9, 9-10, 10-11, 11-12, 12-13 and also 13-14.
What I want to accomplish is something like the following:
+----+--------+----+---+-----+------+-------+-----+
|room|building|city|day|month|timeIn|timeOut|count|
+----+--------+----+---+-----+------+-------+-----+
| 1| 1| 1| 9| 11| 8| 9| X|
| 1| 1| 1| 9| 11| 9| 10| X|
| 1| 1| 1| 9| 11| 10| 11| X|
| 1| 1| 1| 9| 11| 11| 12| X|
| 1| 1| 1| 9| 11| 12| 13| X|
+----+--------+----+---+-----+------+-------+-----+
So the 4th row of the first table should be counted in the 1st row of this table and so on...
You can explode a sequence of hours (e.g. the first row would get [8, 9, 10, 11, 12, 13]), group by the hour (and the other columns) and take the count of each group; F.sequence is available in Spark 2.4+. Here hour refers to timeIn. It's not necessary to keep timeOut in the result dataframe because it's always timeIn + 1 (a small follow-up after the output below shows how to add it back).
import pyspark.sql.functions as F

df2 = df.withColumn(
    'hour',
    F.explode(F.sequence('inHour', 'outHour'))
).groupBy(
    'room', 'building', 'city', 'day', 'month', 'hour'
).count().orderBy('hour')

df2.show()
+----+--------+----+---+-----+----+-----+
|room|building|city|day|month|hour|count|
+----+--------+----+---+-----+----+-----+
| 1| 1| 1| 9| 11| 8| 5|
| 1| 1| 1| 9| 11| 9| 3|
| 1| 1| 1| 9| 11| 10| 3|
| 1| 1| 1| 9| 11| 11| 3|
| 1| 1| 1| 9| 11| 12| 3|
| 1| 1| 1| 9| 11| 13| 3|
+----+--------+----+---+-----+----+-----+
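If you want the exact timeIn/timeOut layout from the question, here is a small optional follow-up on the df2 above, relying on the timeOut = timeIn + 1 observation:
df3 = (df2.withColumnRenamed('hour', 'timeIn')
          .withColumn('timeOut', F.col('timeIn') + 1))
df3.show()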

Joining with a lookup table in PySpark

I have 2 tables: Table 'A' and Table 'Lookup'
Table A:
ID Day
A 1
B 1
C 2
D 4
The lookup table has percentage values for each ID-Day combination.
Table Lookup:
ID 1 2 3 4
A 20 10 50 30
B 0 50 0 50
C 50 10 10 30
D 10 25 25 40
My expected output is to have an additional field in Table 'A' named 'Percent' with values filled in from the lookup table:
ID Day Percent
A 1 20
B 1 0
C 2 10
D 4 40
Since both the tables are large, I do not want to pivot any of the tables.
I have written the code in Scala; you can adapt the same approach for Python.
scala> TableA.show()
+---+---+
| ID|Day|
+---+---+
| A| 1|
| B| 1|
| C| 2|
| D| 4|
+---+---+
scala> lookup.show()
+---+---+---+---+---+
| ID| 1| 2| 3| 4|
+---+---+---+---+---+
| A| 20| 10| 50| 30|
| B| 0| 50| 0| 50|
| C| 50| 10| 10| 30|
| D| 10| 25| 25| 40|
+---+---+---+---+---+
import org.apache.spark.sql.Row
import spark.implicits._

// Plain Scala function (not a registered UDF) that reads the column named after the Day value
val lookupUDF = (r: Row, s: String) => r.getAs[Any](s).toString

// Join over key column "ID"
val joindf = TableA.join(lookup, "ID")

// Final output DataFrame creation
val final_df = joindf.map(x => (x.getAs[Any]("ID").toString, x.getAs[Any]("Day").toString, lookupUDF(x, x.getAs[Any]("Day").toString))).toDF("ID", "Day", "Percentage")
final_df.show()
+---+---+----------+
| ID|Day|Percentage|
+---+---+----------+
| A| 1| 20|
| B| 1| 0|
| C| 2| 10|
| D| 4| 40|
+---+---+----------+
(Posting my answer a day after I posted the question)
I was able to solve this by converting the tables to a pandas dataframe.
from pyspark.sql.types import *

schema = StructType([
    StructField("id", StringType()),
    StructField("day", StringType()),   # day is kept as a string so it can index the column names later
    StructField("1", IntegerType()),
    StructField("2", IntegerType()),
    StructField("3", IntegerType()),
    StructField("4", IntegerType())
])
# day values are given as strings to match the schema above
data = [['A', '1', 20, 10, 50, 30], ['B', '1', 0, 50, 0, 50], ['C', '2', 50, 10, 10, 30], ['D', '4', 10, 25, 25, 40]]
df = spark.createDataFrame(data, schema=schema)
df.show()
# After joining the 2 tables on "id", the result looks like this:
+---+---+---+---+---+---+
| id|day| 1| 2| 3| 4|
+---+---+---+---+---+---+
| A| 1| 20| 10| 50| 30|
| B| 1| 0| 50| 0| 50|
| C| 2| 50| 10| 10| 30|
| D| 4| 10| 25| 25| 40|
+---+---+---+---+---+---+
# Converting to a pandas dataframe
pandas_df = df.toPandas()
id day 1 2 3 4
A 1 20 10 50 30
B 1 0 50 0 50
C 2 50 10 10 30
D 4 10 25 25 40
# Row-wise lookup with pandas apply (not a Spark UDF): pick the column named by the day value
def day_lookup(x):
    return x[x['day']]

pandas_df['percent'] = pandas_df.apply(day_lookup, axis=1)
# Converting back to a Spark DataFrame:
spark_df = spark.createDataFrame(pandas_df)
+---+---+---+---+---+---+-------+
| id|day|  1|  2|  3|  4|percent|
+---+---+---+---+---+---+-------+
|  A|  1| 20| 10| 50| 30|     20|
|  B|  1|  0| 50|  0| 50|      0|
|  C|  2| 50| 10| 10| 30|     10|
|  D|  4| 10| 25| 25| 40|     40|
+---+---+---+---+---+---+-------+
spark_df.select("id", "day", "percent").show()
+---+---+-------+
| id|day|percent|
+---+---+-------+
| A| 1| 20|
| B| 1| 0|
| C| 2| 10|
| D| 4| 40|
+---+---+-------+
I would appreciate if someone posts an answer in PySpark without the pandas-df conversion.
from pyspark.sql.functions import col

df = spark.createDataFrame([{'ID': 'A', 'Day': 1},
                            {'ID': 'B', 'Day': 1},
                            {'ID': 'C', 'Day': 2},
                            {'ID': 'D', 'Day': 4}])
df1 = spark.createDataFrame([{'ID': 'A', '1': 20, '2': 10, '3': 50, '4': 30},
                             {'ID': 'B', '1': 0, '2': 50, '3': 0, '4': 50},
                             {'ID': 'C', '1': 50, '2': 10, '3': 10, '4': 30},
                             {'ID': 'D', '1': 10, '2': 25, '3': 25, '4': 40}])

df1 = df1.withColumn('1', col('1').cast('int')).withColumn('2', col('2').cast('int')).withColumn('3', col('3').cast('int')).withColumn('4', col('4').cast('int'))
df = df.withColumn('Day', col('Day').cast('int'))

df_final = df.join(df1, 'ID')
df_final_rdd = df_final.rdd
print(df_final_rdd.collect())

# Build (ID, Day, Percent) tuples by reading the column named after the Day value
def create_list(r, s):
    s = str(s)
    return (r['ID'], r['Day'], r[s])

l = []
for element in df_final_rdd.collect():
    l.append(create_list(element, element['Day']))

rdd = sc.parallelize(l)
df = spark.createDataFrame(rdd).toDF('ID', 'Day', 'Percent')
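As a side note, here is a hedged sketch that keeps everything in Spark (no collect and no pandas), selecting the percentage column matching Day with a chained when expression. It assumes two dataframes shaped like Table A and Table Lookup above, here called table_a and lookup_df (hypothetical names), with Day stored as an integer:
import pyspark.sql.functions as F

day_cols = ['1', '2', '3', '4']
percent = None
for c in day_cols:
    match = F.col('Day') == int(c)  # column names are strings, Day is assumed to be an int
    percent = F.when(match, F.col(c)) if percent is None else percent.when(match, F.col(c))

table_a.join(lookup_df, 'ID').select('ID', 'Day', percent.alias('Percent')).show()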

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 4|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while for colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the ties. Is it achievable in a single pass with some particular ranking method?
Currently I am thinking of 2 passes, but that seems to require ordering the dataset twice on colA, once without a partition and once with a partition.
from pyspark.sql import functions as F, Window

# Step 1: normal rank
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy("colA")))
# Step 2: resolve ties by taking the mean rank within each colA group
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))

Pyspark advanced window function

Here is my dataframe :
import pandas as pd

FlightDate = [20, 40, 51, 50, 60, 15, 27, 37, 36, 50]
IssuingDate = [10, 15, 44, 45, 55, 10, 2, 30, 32, 24]
Revenue = [100, 50, 40, 70, 60, 40, 30, 100, 200, 100]
Customer = ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']
df = spark.createDataFrame(pd.DataFrame([Customer, FlightDate, IssuingDate, Revenue]).T, schema=["Customer", 'FlightDate', 'IssuingDate', 'Revenue'])
df.show()
+--------+----------+-----------+-------+
|Customer|FlightDate|IssuingDate|Revenue|
+--------+----------+-----------+-------+
| a| 20| 10| 100|
| a| 40| 15| 50|
| a| 51| 44| 40|
| a| 50| 45| 70|
| a| 60| 55| 60|
| b| 15| 10| 40|
| b| 27| 2| 30|
| b| 37| 30| 100|
| b| 36| 32| 200|
| b| 50| 24| 100|
+--------+----------+-----------+-------+
For convenience, I used numbers for days.
For each customer, I would like to sum the revenues of all issuing dates between the studied FlightDate and that FlightDate + 10 days.
That is to say:
For the first line: I sum all revenue with IssuingDate between day 20 and day 30, which gives 0 here.
For the second line: I sum all revenue with IssuingDate between day 40 and day 50, that is to say 40 + 70 = 110.
Here is the desired result :
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|Result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 51| 44| 40| 60|
| a| 50| 45| 70| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 37| 30| 100| 0|
| b| 36| 32| 200| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
I know it will involve some window functions but this one seems a bit tricky. Thanks
No need for a window function, it is just a join and an aggregation:
import pyspark.sql.functions as F

df.alias("df").join(
    df.alias("df_2"),
    on=F.expr(
        "df.Customer = df_2.Customer "
        "and df_2.issuingdate between df.flightdate and df.flightdate + 10"
    ),
    how='left'
).groupBy(
    *('df.{}'.format(c) for c in df.columns)
).agg(
    F.sum(F.coalesce("df_2.revenue", F.lit(0))).alias("result")
).show()
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 50| 45| 70| 60|
| a| 51| 44| 40| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 36| 32| 200| 0|
| b| 37| 30| 100| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
If you would like to keep the revenue for the current row and the next 10 days, then you can use the code below.
For example:
First line: FlightDate = 20 and you need revenue between 20 and 30 (both dates inclusive), which means Total Revenue = 100.
Second line: FlightDate = 40 and you need revenue between 40 and 50 (both dates inclusive), which means Total Revenue = 50 (for date 40) + 70 (for date 50) = 120.
Third line: FlightDate = 50 and you need revenue between 50 and 60 (both dates inclusive), which means Total Revenue = 70 (for date 50) + 40 (for date 51) + 60 (for date 60) = 170.
from pyspark.sql import *
from pyspark.sql.functions import *
import pandas as pd
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
windowSpec = Window.partitionBy("Customer").orderBy("FlightDate").rangeBetween(0,10)
df.withColumn("Sum", sum("Revenue").over(windowSpec)).sort("Customer").show()
Result as mentioned below
+--------+----------+-----------+-------+---+
|Customer|FlightDate|IssuingDate|Revenue|Sum|
+--------+----------+-----------+-------+---+
| a| 20| 10| 100|100|
| a| 40| 15| 50|120|
| a| 50| 45| 70|170|
| a| 51| 44| 40|100|
| a| 60| 55| 60| 60|
| b| 15| 10| 40| 70|
| b| 17| 2| 30| 30|
| b| 36| 32| 200|300|
| b| 37| 30| 100|100|
| b| 50| 24| 100|100|
+--------+----------+-----------+-------+---+

Window timeseries with step in Spark/Scala

I have this input :
timestamp,user
1,A
2,B
5,C
9,E
12,F
The result wanted is :
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know whether a windowing function will cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data :
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two dataframes on the start, end and timestamp columns:
val joined = df_ranges.join(
  df_data,
  df_ranges.col("start").equalTo(df_data.col("timestamp"))
    .or(df_ranges.col("end").equalTo(df_data.col("timestamp"))),
  "left"
)
joined.show()
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
