PySpark get max and min non-zero values of column - python

I have a dataframe as follows:
+-------+--------------------+--------------------+--------------+---------+----------+
| label| app_id| title|download_count|entity_id|risk_score|
+-------+--------------------+--------------------+--------------+---------+----------+
|ANDROID|com.aaron.test.ze...| Aaron Test| 0| 124| 100|
|ANDROID|com.boulderdailyc...|Boulder Daily Cam...| 100| 122| 100|
|ANDROID|com.communitybank...| Budgeting Tools| 0| 123| 100|
|ANDROID|com.communitybank...| PB Mobile Banking| 600| 123| 100|
|ANDROID|com.mendocinobeac...|Mendocino Beacon ...| 10| 122| 100|
|ANDROID|com.profitstars.t...|Johnson City Mobi...| 500| 123| 100|
|ANDROID|com.spreedinc.pro...|Oneida Dispatch f...| 1000| 122| 100|
+-------+--------------------+--------------------+--------------+---------+----------+
I wish to get the non-zero max and min download_count values grouped by entity ID. I'm not too sure how to do this with aggregation, of course simple max and min won't work.
apps_by_entity = (
group_by_entity_id(df)
.agg(F.min(df.download_count), F.max(df.download_count), F.count("entity_id").alias("app_count"))
.withColumnRenamed("max(download_count)", "download_max")
.withColumnRenamed("min(download_count)", "download_min")
)
as this will get 0 for the min of entity 123 and 124.
+---------+------------+------------+---------+
|entity_id|download_min|download_max|app_count|
+---------+------------+------------+---------+
| 124| 0| 0| 1|
| 123| 0| 600| 3|
| 122| 10| 1000| 3|
+---------+------------+------------+---------+
The desired output would look something like
+---------+------------+------------+---------+
|entity_id|download_min|download_max|app_count|
+---------+------------+------------+---------+
| 124| 0| 0| 1|
| 123| 500| 600| 3|
| 122| 10| 1000| 3|
+---------+------------+------------+---------+
Is there a way to do this with aggregation? If not what would be the best way to get this non-zero value? In the case of max = min = 0 just returning 0 or null would be fine.

I'm not sure if you can exclude zeros while doing min, max aggregations, without losing counts.
One way to achieve your output is to do (min, max) and count aggregations separately, and then join them back.
from pyspark.sql.functions import *
from pyspark.sql import functions as F
min_max_df = df.filter(col("download_count")!=0).groupBy('entity_id')\
.agg(F.min('download_count').alias("download_min"),\
F.max('download_count').alias("download_max"))\
.withColumnRenamed("entity_id", "entity_id_1")
count_df =df.groupBy('entity_id').agg(F.count('download_count')\
.alias("app_count"))
count_df.join(min_max_df, (count_df.entity_id == min_max_df.entity_id_1), \
"left").drop("entity_id_1").fillna(0, subset=['download_min',\
'download_max']).show()
+---------+---------+------------+------------+
|entity_id|app_count|download_min|download_max|
+---------+---------+------------+------------+
| 124| 1| 0| 0|
| 123| 3| 500| 600|
| 122| 3| 10| 1000|
+---------+---------+------------+------------+

Related

PySpark: Using Window Functions to roll-up dataframe

I have a dataframe my_df that contains 4 columns:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| josh| wanadoo.fr| 1| 15|
| josh| random.it| 0| 12|
| samantha| wanadoo.fr| 1| 16|
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| somesite.net| 0| 20|
| dylan| yolosite.net| 0| 49|
| dylan| random.it| 0| 3|
| don| vodafone.it| 1| 39|
| don| popsugar.com| 0| 10|
| don| fabio.com| 1| 49|
+----------------+---------------+--------+---------+
This is what I'm planning to do-
Find all the user_id where the maximum frequency domain with isp_flag=0 has a frequency that is less than 25% of the maximum frequency domain with isp_flag=1.
So, in the example that I have above, my output_df would look like-
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| yolosite.net| 0| 49|
| don| fabio.com| 1| 49|
| don| popsugar.com| 0| 10|
+----------------+---------------+--------+---------+
I believe I need window functions to do this, and so I tried the following to first find the maximum frequency domains for isp_flag=0 and isp_flag=1 respectively, for each of the user_id-
>>> win_1 = Window().partitionBy("user_id", "domain", "isp_flag").orderBy((col("frequency").desc()))
>>> final_df = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank")==1)
>>> final_df.show(5) # this just gives me the original dataframe back
What am I doing wrong here? How do I get to the final output_df I printed above?
IIUC, you can try the following: calculate the max_frequencies (max_0, max_1) for each user having isp_flag == 0 or 1 respectively. and then filter by condition max_0 < 0.25*max_1 and plus frequency in (max_1, max_0) to select only the records with maximum frequency.
from pyspark.sql import Window, functions as F
# set up the Window to calculate max_0 and max_1 for each user
# having isp_flag = 0 and 1 respectively
w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1))\
.withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1))\
.where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)') \
.show()
+-------+------------+--------+---------+-----+-----+
|user_id| domain|isp_flag|frequency|max_1|max_0|
+-------+------------+--------+---------+-----+-----+
| don|popsugar.com| 0| 10| 49| 10|
| don| fabio.com| 1| 49| 49| 10|
| dylan| vodafone.it| 1| 448| 448| 49|
| dylan|yolosite.net| 0| 49| 448| 49|
| bob| eidsiva.net| 1| 5| 5| 1|
| bob| media.net| 0| 1| 5| 1|
+-------+------------+--------+---------+-----+-----+
Some Explanations per request:
the WindowSpec w1 is set to examine all records for the same user(partitionBy), so that the F.max() function will compare all rows based on the same user.
we use IF(isp_flag==1, frequency, NULL) to find frequency for rows having isp_flag==1, it returns NULL when isp_flag is not 1 and thus is skipped in F.max() function. this is an SQL expression and thus we need F.expr() function to run it.
F.max(...).over(w1) will take the max value of the result from executing the above SQL expression. this calculation is based on the Window w1.

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 3|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while with colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the ties. Is it achievable with a single pass and some particular ranking method ?
Currently I am thinking of 2 passes but that would seem to require ordering the dataset twice on colA, one without partition and one with partition.
#Step 1: normal rank
df = df.withColumn("colA_rank",F.row_number().over(Window.orderBy("colA")))
#Step 2 : solve ties :
df = df.withColumn("colA_rank_mean",F.mean(colA_rank).over(Window.partitionBy("colA"))

Pyspark advanced window function

Here is my dataframe :
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
df.show()
+--------+----------+-----------+-------+
|Customer|FlightDate|IssuingDate|Revenue|
+--------+----------+-----------+-------+
| a| 20| 10| 100|
| a| 40| 15| 50|
| a| 51| 44| 40|
| a| 50| 45| 70|
| a| 60| 55| 60|
| b| 15| 10| 40|
| b| 27| 2| 30|
| b| 37| 30| 100|
| b| 36| 32| 200|
| b| 50| 24| 100|
+--------+----------+-----------+-------+
For convenience, I used number for days.
For each customer, I would like to sum revenues for all issuing dates between studied FlightDate and studied FlightDate + 10 days.
That is to say :
For the first line : I sum all revenue for IssuingDate between day 20 and day 30... which gives 0 here.
For the second line : I sum all revenus for IssuingDate between day 40 and 50, that is to say 40+70 = 110
Here is the desired result :
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|Result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 51| 44| 40| 60|
| a| 50| 45| 70| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 37| 30| 100| 0|
| b| 36| 32| 200| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
I know it will involve some window functions but this one seems a bit tricky. Thanks
no need of a window function. It is just a join and an agg :
df.alias("df").join(
df.alias("df_2"),
on=F.expr(
"df.Customer = df_2.Customer "
"and df_2.issuingdate between df.flightdate and df.flightdate+10"
),
how='left'
).groupBy(
*('df.{}'.format(c)
for c
in df.columns)
).agg(
F.sum(F.coalesce(
"df_2.revenue",
F.lit(0))
).alias("result")
).show()
+--------+----------+-----------+-------+------+
|Customer|FlightDate|IssuingDate|Revenue|result|
+--------+----------+-----------+-------+------+
| a| 20| 10| 100| 0|
| a| 40| 15| 50| 110|
| a| 50| 45| 70| 60|
| a| 51| 44| 40| 60|
| a| 60| 55| 60| 0|
| b| 15| 10| 40| 100|
| b| 27| 2| 30| 300|
| b| 36| 32| 200| 0|
| b| 37| 30| 100| 0|
| b| 50| 24| 100| 0|
+--------+----------+-----------+-------+------+
If you would like to keep the Revenue for the current row and next 10 days then you can use below code.
For e.g.
First line: flightDate = 20 and you need revenue between 20 and 30 (both dates inclusive) which means Total Revenue = 100.
Second Line: flightDate = 40 and you need revenue between 40 and 50 (both dates inclusive) which means Total revenue = 50 (for date 40) + 50 (for date 50) = 120.
Third Line: flightDate = 50 and you need revenue between 50 and 60 (both dates inclusive) which mean Total revenue = 70(for date 50) + 40(for date 51) + 60(for date 60) = 170
from pyspark.sql import *
from pyspark.sql.functions import *
import pandas as pd
FlightDate=[20,40,51,50,60,15,17,37,36,50]
IssuingDate=[10,15,44,45,55,10,2,30,32,24]
Revenue = [100,50,40,70,60,40,30,100,200,100]
Customer = ['a','a','a','a','a','b','b','b','b','b']
df = spark.createDataFrame(pd.DataFrame([Customer,FlightDate,IssuingDate, Revenue]).T, schema=["Customer",'FlightDate', 'IssuingDate','Revenue'])
windowSpec = Window.partitionBy("Customer").orderBy("FlightDate").rangeBetween(0,10)
df.withColumn("Sum", sum("Revenue").over(windowSpec)).sort("Customer").show()
Result as mentioned below
+--------+----------+-----------+-------+---+
|Customer|FlightDate|IssuingDate|Revenue|Sum|
+--------+----------+-----------+-------+---+
| a| 20| 10| 100|100|
| a| 40| 15| 50|120|
| a| 50| 45| 70|170|
| a| 51| 44| 40|100|
| a| 60| 55| 60| 60|
| b| 15| 10| 40| 70|
| b| 17| 2| 30| 30|
| b| 36| 32| 200|300|
| b| 37| 30| 100|100|
| b| 50| 24| 100|100|
+--------+----------+-----------+-------+---+

Get "circular lag" of a column

I would like to create a new column in a pyspark.sql.DataFrame based on lagged values of an existing column. But... I would also like the last values to become the first ones, and the first values to become the last ones. Here is an example :
df = spark.createDataFrame([(1,100),
(2,200),
(3,300),
(4,400),
(5,500)],
['id','value'])
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| 100|
| 2| 200|
| 3| 300|
| 4| 400|
| 5| 500|
+---+-----+
And the desired output would be :
+---+-----+----------------+-----------------+
| id|value|lag_value_plus_2|lag_value_minus_2|
+---+-----+----------------+-----------------+
| 1| 100| 300| 400|
| 2| 200| 400| 500|
| 3| 300| 500| 100|
| 4| 400| 100| 200|
| 5| 500| 200| 300|
+---+-----+----------------+-----------------+
I can feel it has something to do with window functions or pyspark.sql.lag function, but can't figure out how to do.
Here is one solution I can offer. But I'm not sure it is the most optimized one :
from functools import reduce
# Duplicate the dataframe twice, one "before" and one "after"
df = reduce(
lambda a, b : a.union(b),
[df.withColumn("x", F.lit(i)) for i in [-1,0,1]]
)
df.withColumn(
"lag_value_plus_2",
F.lead("value", 2).over(Window.partitionBy().orderBy("x", "id"))
).withColumn(
"lag_value_minus_2",
F.lag("value", 2).over(Window.partitionBy().orderBy("x", "id"))
).where("x=0").drop("x").show()
+---+-----+----------------+-----------------+
| id|value|lag_value_plus_2|lag_value_minus_2|
+---+-----+----------------+-----------------+
| 1| 100| 300| 400|
| 2| 200| 400| 500|
| 3| 300| 500| 100|
| 4| 400| 100| 200|
| 5| 500| 200| 300|
+---+-----+----------------+-----------------+

Calculating Cumulative sum in PySpark using Window Functions

I have the following sample DataFrame:
rdd = sc.parallelize([(1,20), (2,30), (3,30)])
df2 = spark.createDataFrame(rdd, ["id", "duration"])
df2.show()
+---+--------+
| id|duration|
+---+--------+
| 1| 20|
| 2| 30|
| 3| 30|
+---+--------+
I want to sort this DataFrame in desc order of duration and add a new column which has the cumulative sum of the duration. So I did the following:
windowSpec = Window.orderBy(df2['duration'].desc())
df_cum_sum = df2.withColumn("duration_cum_sum", sum('duration').over(windowSpec))
df_cum_sum.show()
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 60|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
My desired output is:
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 30|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
How do I get this?
Here is the breakdown:
+--------+----------------+
|duration|duration_cum_sum|
+--------+----------------+
| 30| 30| #First value
| 30| 60| #Current duration + previous cum sum value
| 20| 80| #Current duration + previous cum sum value
+--------+----------------+
You can introduce the row_number to break the ties; If written in sql:
df2.selectExpr(
"id", "duration",
"sum(duration) over (order by row_number() over (order by duration desc)) as duration_cum_sum"
).show()
+---+--------+----------------+
| id|duration|duration_cum_sum|
+---+--------+----------------+
| 2| 30| 30|
| 3| 30| 60|
| 1| 20| 80|
+---+--------+----------------+
Here you can check this
df2.withColumn('cumu', F.sum('duration').over(Window.orderBy(F.col('duration').desc()).rowsBetween(Window.unboundedPreceding, 0)
)).show()

Categories