PySpark: Using Window Functions to roll-up dataframe

I have a dataframe my_df that contains 4 columns:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| josh| wanadoo.fr| 1| 15|
| josh| random.it| 0| 12|
| samantha| wanadoo.fr| 1| 16|
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| somesite.net| 0| 20|
| dylan| yolosite.net| 0| 49|
| dylan| random.it| 0| 3|
| don| vodafone.it| 1| 39|
| don| popsugar.com| 0| 10|
| don| fabio.com| 1| 49|
+----------------+---------------+--------+---------+
This is what I'm planning to do-
Find all the user_id where the maximum frequency domain with isp_flag=0 has a frequency that is less than 25% of the maximum frequency domain with isp_flag=1.
So, in the example that I have above, my output_df would look like-
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| yolosite.net| 0| 49|
| don| fabio.com| 1| 49|
| don| popsugar.com| 0| 10|
+----------------+---------------+--------+---------+
I believe I need window functions to do this, and so I tried the following to first find the maximum frequency domains for isp_flag=0 and isp_flag=1 respectively, for each of the user_id-
>>> win_1 = Window().partitionBy("user_id", "domain", "isp_flag").orderBy((col("frequency").desc()))
>>> final_df = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank")==1)
>>> final_df.show(5) # this just gives me the original dataframe back
What am I doing wrong here? How do I get to the final output_df I printed above?

IIUC, you can try the following: calculate the max frequencies (max_0, max_1) for each user over the rows having isp_flag == 0 or 1 respectively, then filter by the condition max_0 < 0.25*max_1 and additionally frequency in (max_1, max_0) to keep only the rows that carry those maximum frequencies.
from pyspark.sql import Window, functions as F

# set up the Window to calculate max_0 and max_1 for each user
# having isp_flag = 0 and 1 respectively
w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1))\
  .withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1))\
  .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)') \
  .show()
+-------+------------+--------+---------+-----+-----+
|user_id| domain|isp_flag|frequency|max_1|max_0|
+-------+------------+--------+---------+-----+-----+
| don|popsugar.com| 0| 10| 49| 10|
| don| fabio.com| 1| 49| 49| 10|
| dylan| vodafone.it| 1| 448| 448| 49|
| dylan|yolosite.net| 0| 49| 448| 49|
| bob| eidsiva.net| 1| 5| 5| 1|
| bob| media.net| 0| 1| 5| 1|
+-------+------------+--------+---------+-----+-----+
Some explanations per request:
The WindowSpec w1 is set to examine all records of the same user (partitionBy), so that the F.max() function compares all rows belonging to that user.
We use IF(isp_flag==1, frequency, NULL) to pick up the frequency of rows having isp_flag==1; it returns NULL when isp_flag is not 1, and NULLs are skipped by F.max(). This is a SQL expression, so we need the F.expr() function to run it.
F.max(...).over(w1) takes the max of the values produced by the above SQL expression, calculated over the Window w1.
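As a side note (my own sketch, not part of the answer above): the same conditional maxima can be written with F.when instead of a SQL string passed to F.expr, which keeps everything in the DataFrame API. Assuming the same df and column names:

from pyspark.sql import Window, functions as F

w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# F.when without an otherwise() yields NULL for non-matching rows,
# exactly like IF(..., NULL), so F.max skips them
result = (df
    .withColumn('max_1', F.max(F.when(F.col('isp_flag') == 1, F.col('frequency'))).over(w1))
    .withColumn('max_0', F.max(F.when(F.col('isp_flag') == 0, F.col('frequency'))).over(w1))
    .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)'))
result.show()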

Related

PySpark SQL Dataframe identify and store consecutive matching value rows incrementally

I was wondering whether I could take a boolean column in a PySpark dataframe that has runs of consecutive true and false values, and store in another column the "occurrence #" of each run of consecutive TRUE values.
+------------------+-----------------------+
|Bool Column I have|Column I want to create|
+------------------+-----------------------+
|              True|                      1|
|              True|                      1|
|              True|                      1|
|              True|                      1|
|             False|                      0|
|             False|                      0|
|              True|                      2|
|              True|                      2|
|             False|                      0|
|              True|                      3|
|              True|                      3|
+------------------+-----------------------+
I would like the TRUE rows to increment and FALSE to stay at 0.
I am fairly new to PySpark and have only managed to do this by converting to a pandas dataframe. A PySpark solution or some guidance would be much appreciated.
The data is ordered by a date column not pictured in the table above.
You can add a helper column begin as shown below, and then take a rolling sum over it using a window function:
Sample dataframe:
df.show()
+---+-----+
| ts| col|
+---+-----+
| 0| true|
| 1| true|
| 2| true|
| 3| true|
| 4|false|
| 5|false|
| 6| true|
| 7| true|
| 8|false|
| 9| true|
| 10| true|
+---+-----+
Code:
from pyspark.sql import Window, functions as F

w = Window.orderBy('ts')

df2 = df.withColumn(
    'begin',
    F.coalesce(
        # flag the first row of every run of consecutive True values
        (F.lag('col').over(w) != F.col('col')) & F.col('col'),
        F.lit(True)  # take care of first row where lag is null
    ).cast('int')
).withColumn(
    'newcol',
    F.when(
        F.col('col'),
        F.sum('begin').over(w)  # running count of True runs started so far
    ).otherwise(0)
)
df2.show()
+---+-----+-----+------+
| ts| col|begin|newcol|
+---+-----+-----+------+
| 0| true| 1| 1|
| 1| true| 0| 1|
| 2| true| 0| 1|
| 3| true| 0| 1|
| 4|false| 0| 0|
| 5|false| 0| 0|
| 6| true| 1| 2|
| 7| true| 0| 2|
| 8|false| 0| 0|
| 9| true| 1| 3|
| 10| true| 0| 3|
+---+-----+-----+------+
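A small variant (my sketch, not part of the answer above): lag() accepts a default value as its third argument, which can replace the coalesce() used above to handle the first row:

from pyspark.sql import Window, functions as F

w = Window.orderBy('ts')

# lag('col', 1, False) returns False instead of NULL on the first row,
# so no coalesce is needed
df2 = df.withColumn(
    'begin',
    ((F.lag('col', 1, False).over(w) != F.col('col')) & F.col('col')).cast('int')
).withColumn(
    'newcol',
    F.when(F.col('col'), F.sum('begin').over(w)).otherwise(0)
)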

PySpark group by to return constant if all values are negative but average if only some are

I have a dataframe that looks like this:
df =
+----+-----+
|Year|Value|
+----+-----+
| 1| 50|
| 1| 30|
| 1| 20|
| 2| -14|
| 2| -34|
| 3| 10|
| 3| 20|
| 3| -34|
+----+-----+
I want to group by Year and show the average of Value. Negative values should be ignored, unless all the values for a particular year are negative (year = 2); in that case I just want to show avg(Value) as -1.
I am doing:
df.filter(df.Value > 0).groupBy('Year').agg(avg('Value').alias('Average')).show()
which gives me this
+----+------------------+
|Year| Average|
+----+------------------+
| 1|33.333333333333336|
| 3| 15.0|
+----+------------------+
The result I want is
+----+------------------+
|Year| Average|
+----+------------------+
| 1|33.333333333333336|
| 2| -1|
| 3| 15.0|
+----+------------------+
Does anyone have an idea how to achieve the above result?
Replace the negative values in the Value column with null, then group by Year and compute the average.
avg ignores nulls, which is exactly what you need here. At the end, replace the null averages with -1.
from pyspark.sql.functions import avg, when, col

df.withColumn("Value", when(col("Value") < 0, None).otherwise(col("Value")))\
  .groupBy('Year').agg(avg('Value').alias('Average'))\
  .fillna(-1, subset=['Average']).show()
+----+------------------+
|Year| Average|
+----+------------------+
| 1|33.333333333333336|
| 2| -1.0|
| 3| 15.0|
+----+------------------+
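An equivalent one-pass variant (my sketch, not part of the answer above, assuming the same df): fold the null-replacement and the -1 fallback into the aggregation itself with a conditional avg and coalesce:

from pyspark.sql import functions as F

(df.groupBy('Year')
   .agg(F.coalesce(
            F.avg(F.when(F.col('Value') >= 0, F.col('Value'))),  # average of non-negative values only
            F.lit(-1.0)                                          # fallback when every value was negative
        ).alias('Average'))
   .show())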

PySpark get max and min non-zero values of column

I have a dataframe as follows:
+-------+--------------------+--------------------+--------------+---------+----------+
| label| app_id| title|download_count|entity_id|risk_score|
+-------+--------------------+--------------------+--------------+---------+----------+
|ANDROID|com.aaron.test.ze...| Aaron Test| 0| 124| 100|
|ANDROID|com.boulderdailyc...|Boulder Daily Cam...| 100| 122| 100|
|ANDROID|com.communitybank...| Budgeting Tools| 0| 123| 100|
|ANDROID|com.communitybank...| PB Mobile Banking| 600| 123| 100|
|ANDROID|com.mendocinobeac...|Mendocino Beacon ...| 10| 122| 100|
|ANDROID|com.profitstars.t...|Johnson City Mobi...| 500| 123| 100|
|ANDROID|com.spreedinc.pro...|Oneida Dispatch f...| 1000| 122| 100|
+-------+--------------------+--------------------+--------------+---------+----------+
I wish to get the non-zero max and min download_count values grouped by entity ID. I'm not too sure how to do this with aggregation; of course a simple max and min won't work:
apps_by_entity = (
    group_by_entity_id(df)
    .agg(F.min(df.download_count), F.max(df.download_count), F.count("entity_id").alias("app_count"))
    .withColumnRenamed("max(download_count)", "download_max")
    .withColumnRenamed("min(download_count)", "download_min")
)
since this gives 0 for the min of entities 123 and 124:
+---------+------------+------------+---------+
|entity_id|download_min|download_max|app_count|
+---------+------------+------------+---------+
| 124| 0| 0| 1|
| 123| 0| 600| 3|
| 122| 10| 1000| 3|
+---------+------------+------------+---------+
The desired output would look something like
+---------+------------+------------+---------+
|entity_id|download_min|download_max|app_count|
+---------+------------+------------+---------+
| 124| 0| 0| 1|
| 123| 500| 600| 3|
| 122| 10| 1000| 3|
+---------+------------+------------+---------+
Is there a way to do this with aggregation? If not, what would be the best way to get these non-zero values? In the case of max = min = 0, just returning 0 or null would be fine.
I'm not sure if you can exclude zeros while doing min, max aggregations, without losing counts.
One way to achieve your output is to do (min, max) and count aggregations separately, and then join them back.
from pyspark.sql.functions import col
from pyspark.sql import functions as F

min_max_df = df.filter(col("download_count") != 0).groupBy('entity_id')\
    .agg(F.min('download_count').alias("download_min"),
         F.max('download_count').alias("download_max"))\
    .withColumnRenamed("entity_id", "entity_id_1")

count_df = df.groupBy('entity_id')\
    .agg(F.count('download_count').alias("app_count"))

count_df.join(min_max_df, count_df.entity_id == min_max_df.entity_id_1, "left")\
    .drop("entity_id_1")\
    .fillna(0, subset=['download_min', 'download_max'])\
    .show()
+---------+---------+------------+------------+
|entity_id|app_count|download_min|download_max|
+---------+---------+------------+------------+
| 124| 1| 0| 0|
| 123| 3| 500| 600|
| 122| 3| 10| 1000|
+---------+---------+------------+------------+
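For what it's worth, a single conditional aggregation can also produce this in one pass (my sketch, not part of the answer above, assuming the same df): min and max over a when(...) expression ignore the NULLs produced for the zero rows, so the count is unaffected:

from pyspark.sql import functions as F

# NULL for zero downloads, so min/max skip those rows
nonzero = F.when(F.col('download_count') != 0, F.col('download_count'))

(df.groupBy('entity_id')
   .agg(F.min(nonzero).alias('download_min'),
        F.max(nonzero).alias('download_max'),
        F.count('entity_id').alias('app_count'))
   .fillna(0, subset=['download_min', 'download_max'])
   .show())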

How to filter rows which include in PySpark window

I have a dataframe like
|TRADEID|time_period|value|
+-------+-----------+-----+
| 1| 31-01-2019| 5|
| 1| 31-05-2019| 6|
| 2| 31-01-2019| 15|
| 2| 31-03-2019| 20|
+-------+-----------+-----+
Entries for some months are missing, so I forward-fill them with window operations:
from pyspark.sql import Window
from pyspark.sql.functions import last

window = Window.partitionBy(['TRADEID'])\
    .orderBy('time_period')\
    .rowsBetween(Window.unboundedPreceding, 0)

filled_column = last(df['value'], ignorenulls=True).over(window)
df_filled = df.withColumn('value_filled', filled_column)
After this I get
|TRADEID|time_period|value|
+-------+-----------+-----+
| 1| 31-01-2019| 5|
| 1| 28-02-2019| 5|
| 1| 31-03-2019| 5|
| 1| 30-04-2019| 5|
| 1| 31-05-2019| 6|
| 2| 31-01-2019| 15|
| 2| 28-02-2019| 15|
| 2| 31-03-2019| 20|
+-------+-----------+-----+
However, I do not want to fill a month if the gap between this month and the last available month is more than 2 months. For example, in my case I want to get
|TRADEID|time_period|value |
+-------+-----------+--------+
| 1| 31-01-2019| 5 |
| 1| 28-02-2019| 5 |
| 1| 31-03-2019| 5 |
| 1| 30-04-2019| null|
| 1| 31-05-2019| 6 |
| 2| 31-01-2019| 15 |
| 2| 28-02-2019| 15 |
| 2| 31-03-2019| 20 |
+-------+-----------+--------+
How can I do this?
Should I add one more window operation that first removes the entries for months that should not be filled? If that is a good way, please help me with the code; I am not good with window operations.
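There is no answer to this question here, but one possible sketch (an outline only, assuming time_period has already been parsed into a date type and that the missing months already exist as rows with value = NULL) is to carry forward both the last non-null value and the date it came from, then blank out fills whose gap exceeds 2 months:

from pyspark.sql import Window, functions as F

# assumes time_period is a DateType column and missing months are present as NULL-valued rows
w = Window.partitionBy('TRADEID').orderBy('time_period')\
    .rowsBetween(Window.unboundedPreceding, 0)

last_value = F.last('value', ignorenulls=True).over(w)
# also carry forward the time_period of the last row that actually had a value
last_value_date = F.last(F.when(F.col('value').isNotNull(), F.col('time_period')),
                         ignorenulls=True).over(w)

df_filled = (df
    .withColumn('value_filled', last_value)
    .withColumn('gap', F.months_between('time_period', last_value_date))
    # drop fills that are more than 2 months away from their source row
    .withColumn('value_filled', F.when(F.col('gap') <= 2, F.col('value_filled')))
    .drop('gap'))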

PySpark: group by between two dates

I am trying to count the number of times the field "measure" has the value "M" between date 2 and date 1, for each pair (id, id_2).
** date 1 is when the measure is taken and date 2 is date 1 - 5 days, since 5 days is the period over which we want to count the number of measures taken.
+----------+----------+---+----+-------+
|    date 1|    date 2| id|id_2|measure|
+----------+----------+---+----+-------+
|2016-10-25|2016-10-20|  1|   1|      M|
|2016-10-26|2016-10-21|  1|   5|       |
|2016-10-22|2016-10-17|  1|   1|      M|
|2016-10-30|2016-10-25|  1|   2|      M|
|2016-10-22|2016-10-17|  2|   1|      M|
|2016-10-26|2016-10-21|  2|   2|       |
|2016-10-18|2016-10-13|  2|   5|      M|
|2016-10-25|2016-10-20|  2|   1|       |
+----------+----------+---+----+-------+
And the desired output would be:
+----------+----------+---+----+-------+-----+
|    date 1|    date 2| id|id_2|measure|count|
+----------+----------+---+----+-------+-----+
|2016-10-25|2016-10-20|  1|   1|      M|    1|
|2016-10-26|2016-10-21|  1|   5|       |    0|
|2016-10-22|2016-10-17|  1|   1|      M|    0|
|2016-10-30|2016-10-25|  1|   2|      M|    0|
|2016-10-22|2016-10-17|  2|   1|      M|    0|
|2016-10-26|2016-10-21|  2|   2|       |    0|
|2016-10-18|2016-10-13|  2|   5|      M|    0|
|2016-10-25|2016-10-20|  2|   1|       |    1|
+----------+----------+---+----+-------+-----+
I believe that window functions might be helpful, but I am not sure how to apply them to this problem. Previously I solved this kind of problem with a correlated subquery, but that approach cannot be applied in PySpark.
The query I have previously used is:
SELECT a.date1,
       a.date2,
       a.id,
       a.id_2,
       a.measure,
       (SELECT COUNT(data.[id]) AS CountMeasure
        FROM dataOrigin data
        WHERE data.id = a.id
          AND data.id_2 = a.id_2
          AND data.measure = 'M'
          AND data.date1 > a.date2
          AND data.date1 < a.date1) AS count
FROM dataOrigin a
If more info is needed, please let me know.
Thank you, everybody!
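This question also has no answer here, but one possible sketch in PySpark (an outline only; it assumes the columns are renamed to date_1/date_2 without spaces, that date_1 is a 'yyyy-MM-dd' string, and that a day-granularity range window is acceptable) replaces the correlated subquery with a count over a range window covering the previous 4 days:

from pyspark.sql import Window, functions as F

DAY = 24 * 3600  # seconds in a day

# order each (id, id_2) group by the epoch seconds of date_1
df2 = df.withColumn('ts', F.unix_timestamp('date_1', 'yyyy-MM-dd'))

# strictly between date_2 and date_1 at day granularity,
# i.e. from date_1 - 4 days up to date_1 - 1 day, excluding the current row
w = Window.partitionBy('id', 'id_2').orderBy('ts')\
    .rangeBetween(-4 * DAY, -1 * DAY)

result = df2.withColumn(
    'count',
    F.count(F.when(F.col('measure') == 'M', 1)).over(w)  # count only rows with measure = 'M'
).drop('ts')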
