Split count results of different events into different columns in pyspark - python

I have a rdd from which I need to extract counts of multiple events. The initial rdd looks like this
+----------+--------------------+-------------------+
| event| user| day|
+----------+--------------------+-------------------+
|event_x |user_A | 0|
|event_y |user_A | 2|
|event_x |user_B | 2|
|event_y |user_B | 1|
|event_x |user_A | 0|
|event_x |user_B | 1|
|event_y |user_B | 2|
|event_y |user_A | 1|
+----------+--------------------+-------------------+
I need a count column for each type of event (in this case 2 types of events: event_x and event_y), grouped by player and day. So far, I managed to do it with only one event, resulting in the following:
+--------------------+-------------------+------------+
| user| day|count(event)|
+--------------------+-------------------+------------+
|user_A | 0| 11|
|user_A | 1| 8|
|user_A | 2| 4|
|user_B | 0| 2|
|user_B | 1| 1|
|user_B | 2| 25|
+--------------------+-------------------+------------+
But I need arbitrarily many columns, being the number of columns the same as the number of events that appear in the leftmost column of the first rdd displayed above. So, if I only had 2 events (x and y) it should be something like this:
+--------------------+-------------------+--------------+--------------+
| user| day|count(event_x)|count(event_y)|
+--------------------+-------------------+--------------+--------------+
|user_A | 0| 11| 3|
|user_A | 1| 8| 23|
|user_A | 2| 4| 2|
|user_B | 0| 2| 0|
|user_B | 1| 1| 1|
|user_B | 2| 25| 11|
+--------------------+-------------------+--------------+--------------+
The code I have currently is:
rdd = rdd.groupby('user', 'day').agg({'event': 'count'}).orderBy('user', 'day')
What should I do to achieve the desired result?
Thanks in advance ;)

you can try group by with pivot option
df =spark.createDataFrame([["event_x","user_A",0],["event_y","user_A",2],["event_x","user_B",2],["event_y","user_B",1],["event_x","user_A",0],["event_x","user_B",1],["event_y","user_B",2],["event_y","user_A",1]],["event","user","day"])
>>> df.show()
+-------+------+---+
| event| user|day|
+-------+------+---+
|event_x|user_A| 0|
|event_y|user_A| 2|
|event_x|user_B| 2|
|event_y|user_B| 1|
|event_x|user_A| 0|
|event_x|user_B| 1|
|event_y|user_B| 2|
|event_y|user_A| 1|
+-------+------+---+
>>> df.groupBy(["user","day"]).pivot("event").agg({"event":"count"}).show()
+------+---+-------+-------+
| user|day|event_x|event_y|
+------+---+-------+-------+
|user_A| 0| 2| null|
|user_B| 1| 1| 1|
|user_A| 2| null| 1|
|user_A| 1| null| 1|
|user_B| 2| 1| 1|
+------+---+-------+-------+
please have a look and do let me know if you have any doubts about same.

temp = temp.groupBy("columnname").pivot("bucket_columnaname").agg({"bucket_columnaname":"count"})

Related

Get the first row that matches some condition over a window in PySpark

To give an example, suppose we have a stream of user actions as follows:
from pyspark.sql import *
spark = SparkSession.builder.appName('test').master('local[8]').getOrCreate()
df = spark.sparkContext.parallelize([
Row(user=1, action=1, time=1),
Row(user=1, action=1, time=2),
Row(user=2, action=1, time=3),
Row(user=1, action=2, time=4),
Row(user=2, action=2, time=5),
Row(user=2, action=2, time=6),
Row(user=1, action=1, time=7),
Row(user=2, action=1, time=8),
]).toDF()
df.show()
The dataframe looks like:
+----+------+----+
|user|action|time|
+----+------+----+
| 1| 1| 1|
| 1| 1| 2|
| 2| 1| 3|
| 1| 2| 4|
| 2| 2| 5|
| 2| 2| 6|
| 1| 1| 7|
| 2| 1| 8|
+----+------+----+
Then, I want to add a column next_alt_time to each row, giving the time when user changes action type in the following rows. For the input above, the output should be:
+----+------+----+-------------+
|user|action|time|next_alt_time|
+----+------+----+-------------+
| 1| 1| 1| 4|
| 1| 1| 2| 4|
| 2| 1| 3| 5|
| 1| 2| 4| 7|
| 2| 2| 5| 8|
| 2| 2| 6| 8|
| 1| 1| 7| null|
| 2| 1| 8| null|
+----+------+----+-------------+
I know I can create a window like this:
wnd = Window().partitionBy('user').orderBy('time').rowsBetween(1, Window.unboundedFollowing)
But then I don't know how to impose a condition over the window and select the first row that has a different action than current row, over the window defined above.
Here's how to do it. Spark cannot keep the dataframe order, but if you check the rows one by one, you can confirm that it's giving your expected answer:
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as F
df = spark.sparkContext.parallelize([
Row(user=1, action=1, time=1),
Row(user=1, action=1, time=2),
Row(user=2, action=1, time=3),
Row(user=1, action=2, time=4),
Row(user=2, action=2, time=5),
Row(user=2, action=2, time=6),
Row(user=1, action=1, time=7),
Row(user=2, action=1, time=8),
]).toDF()
win = Window().partitionBy('user').orderBy('time')
df = df.withColumn('new_action', F.lag('action').over(win) != F.col('action'))
df = df.withColumn('new_action_time', F.when(F.col('new_action'), F.col('time')))
df = df.withColumn('next_alt_time', F.first('new_action', ignorenulls=True).over(win.rowsBetween(1, Window.unboundedFollowing)))
df.show()
+----+------+----+----------+---------------+-------------+
|user|action|time|new_action|new_action_time|next_alt_time|
+----+------+----+----------+---------------+-------------+
| 1| 1| 1| null| null| 4|
| 1| 1| 2| false| null| 4|
| 1| 2| 4| true| 4| 7|
| 1| 1| 7| true| 7| null|
| 2| 1| 3| null| null| 5|
| 2| 2| 5| true| 5| 8|
| 2| 2| 6| false| null| 8|
| 2| 1| 8| true| 8| null|
+----+------+----+----------+---------------+-------------+

How do I count consecutive values with pyspark?

I am trying to count consecutive values that appear in a column with Pyspark. I have the column "a" in my dataframe and expect to create the column "b".
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create the column "b" with lag function over some window, but without success.
w = Window\
.partitionBy(df.some_id)\
.orderBy(df.timestamp_column)
df.withColumn(
"b",
f.when(df.a == f.lag(df.a).over(w),
f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
I could resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))

replace NA with median in pyspark using window function

I want to replace NA with medain based on partition columns using window function in pyspark?
Sample Input:
Required Output:
Creating your dataframe:
list=([1,5,4],
[1,5,None],
[1,5,1],
[1,5,4],
[2,5,1],
[2,5,2],
[2,5,None],
[2,5,None],
[2,5,4])
df=spark.createDataFrame(list,['I_id','p_id','xyz'])
df.show()
+----+----+----+
|I_id|p_id| xyz|
+----+----+----+
| 1| 5| 4|
| 1| 5|null|
| 1| 5| 1|
| 1| 5| 4|
| 2| 5| 1|
| 2| 5| 2|
| 2| 5|null|
| 2| 5|null|
| 2| 5| 4|
+----+----+----+
To keep the solution as generic and dynamic as possible, I had to create many new columns to compute the median, and to be able to send it to the nulls. With that said, solution will not be slow, and will be scalable for big data.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
w2= Window().partitionBy("I_id","p_id")
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w))\
.withColumn("xyz2", F.max(F.row_number().over(w)).over(w2))\
.withColumn("xyz3", F.first("xyz1").over(w))\
.withColumn("xyz10", F.col("xyz2")-F.col("xyz3"))\
.withColumn("xyz9", F.when((F.col("xyz2")-F.col("xyz3"))%2!=0, F.col("xyz2")-F.col("xyz3")+1).otherwise(F.col("xyz2")-F.col("xyz3")))\
.withColumn("xyz4", (F.col("xyz9")/2))\
.withColumn("xyz6", F.col("xyz4")+F.col("xyz3"))\
.withColumn("xyz7", F.when(F.col("xyz10")%2==0,(F.col("xyz4")+F.col("xyz3")+1)).otherwise(F.lit(None)))\
.withColumn("xyz5", F.row_number().over(w))\
.withColumn("medianr", F.when(F.col("xyz6")==F.col("xyz5"), F.col("xyz")).when(F.col("xyz7")==F.col("xyz5"),F.col("xyz")).otherwise(F.lit(None)))\
.withColumn("medianr2", (F.mean("medianr").over(w2)))\
.withColumn("xyz", F.when(F.col("xyz").isNull(), F.col("medianr2")).otherwise(F.col("xyz")))\
.select("I_id","p_id","xyz")\
.orderBy("I_id").show()
+----+----+---+
|I_id|p_id|xyz|
+----+----+---+
| 1| 5| 4|
| 1| 5| 1|
| 1| 5| 4|
| 1| 5| 4|
| 2| 5| 2|
| 2| 5| 2|
| 2| 5| 1|
| 2| 5| 2|
| 2| 5| 4|
+----+----+---+

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 3|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while with colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the ties. Is it achievable with a single pass and some particular ranking method ?
Currently I am thinking of 2 passes but that would seem to require ordering the dataset twice on colA, one without partition and one with partition.
#Step 1: normal rank
df = df.withColumn("colA_rank",F.row_number().over(Window.orderBy("colA")))
#Step 2 : solve ties :
df = df.withColumn("colA_rank_mean",F.mean(colA_rank).over(Window.partitionBy("colA"))

Generate a group ID with only one Window function

I am trying to assign a group ID to my table.
For each group in "part" column, acording to order in "ord" column, the 1st elements I encounter in "id" column will receive a new_id 0, then each time I encounter a different ID, i increase the "new_id".
Currently, I need to use 2 window functions and the process is therefore quite slow.
df = sqlContext.createDataFrame(
[
(1,1,'X'),
(1,2,'X'),
(1,3,'X'),
(1,4,'Y'),
(1,5,'Y'),
(1,6,'Y'),
(1,7,'X'),
(1,8,'X'),
(2,1,'X'),
(2,2,'X'),
(2,3,'X'),
(2,4,'Y'),
(2,5,'Y'),
(2,6,'Y'),
(2,7,'X'),
(2,8,'X'),
],
["part", "ord", "id"]
)
df.withColumn(
"new_id",
F.lag(F.col("id")).over(Window.partitionBy("part").orderBy("ord"))!=F.col('id')
).withColumn(
"new_id",
F.sum(
F.col("new_id").cast('int')
).over(Window.partitionBy("part").orderBy("ord"))
).na.fill(0).show()
+----+---+---+------+
|part|ord| id|new_id|
+----+---+---+------+
| 2| 1| X| 0|
| 2| 2| X| 0|
| 2| 3| X| 0|
| 2| 4| Y| 1|
| 2| 5| Y| 1|
| 2| 6| Y| 1|
| 2| 7| X| 2|
| 2| 8| X| 2|
| 1| 1| X| 0|
| 1| 2| X| 0|
| 1| 3| X| 0|
| 1| 4| Y| 1|
| 1| 5| Y| 1|
| 1| 6| Y| 1|
| 1| 7| X| 2|
| 1| 8| X| 2|
+----+---+---+------+
Can I achieve the same using only one window function ?

Categories