Generate a group ID with only one Window function - python

I am trying to assign a group ID to my table.
For each group in the "part" column, following the order given by the "ord" column, the first id I encounter in the "id" column receives a new_id of 0; then, every time I encounter a different id, I increment "new_id".
Currently, I need two window functions, and the process is therefore quite slow.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = sqlContext.createDataFrame(
    [
        (1, 1, 'X'),
        (1, 2, 'X'),
        (1, 3, 'X'),
        (1, 4, 'Y'),
        (1, 5, 'Y'),
        (1, 6, 'Y'),
        (1, 7, 'X'),
        (1, 8, 'X'),
        (2, 1, 'X'),
        (2, 2, 'X'),
        (2, 3, 'X'),
        (2, 4, 'Y'),
        (2, 5, 'Y'),
        (2, 6, 'Y'),
        (2, 7, 'X'),
        (2, 8, 'X'),
    ],
    ["part", "ord", "id"]
)

df.withColumn(
    "new_id",
    F.lag(F.col("id")).over(Window.partitionBy("part").orderBy("ord")) != F.col("id")
).withColumn(
    "new_id",
    F.sum(
        F.col("new_id").cast("int")
    ).over(Window.partitionBy("part").orderBy("ord"))
).na.fill(0).show()
+----+---+---+------+
|part|ord| id|new_id|
+----+---+---+------+
|   2|  1|  X|     0|
|   2|  2|  X|     0|
|   2|  3|  X|     0|
|   2|  4|  Y|     1|
|   2|  5|  Y|     1|
|   2|  6|  Y|     1|
|   2|  7|  X|     2|
|   2|  8|  X|     2|
|   1|  1|  X|     0|
|   1|  2|  X|     0|
|   1|  3|  X|     0|
|   1|  4|  Y|     1|
|   1|  5|  Y|     1|
|   1|  6|  Y|     1|
|   1|  7|  X|     2|
|   1|  8|  X|     2|
+----+---+---+------+
Can I achieve the same result using only one window function?

Related

I have a question about how to add a specific column value in a column with the same key value with pyspark [closed]

I am studying Spark.
Is there a way to create and return a dataframe that combines (sums) the playtime of rows whose userId and movieId are duplicated in the table?
Thank you.
Although I am not able to fully understand the expected result from the attached screenshot, I came up with this solution.
I am assuming that, for the duplicated records, one record has to be selected for the final result while aggregating playtime.
I created two DataFrames, df and df1, from the dataset and joined them.
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number, sum
>>> windowSpec = Window.partitionBy("movieid").orderBy("userid")
>>> df1=df.withColumn("row_number",row_number().over(windowSpec))
>>> df1.show()
+------+-------+-----+---------+--------+--------+----------+
|userid|movieid|score|scoretype|playtime|duration|row_number|
+------+-------+-----+---------+--------+--------+----------+
|     3|    306|    5|        1|       0|       0|         1|
|     2|    202|    2|        2|    1800|    3600|         1|
|     7|    100|    0|        1|    3600|    3600|         1|
|     1|    102|    5|        1|       0|       0|         1|
|     1|    102|    0|        2|    1800|    3600|         2|
|     2|    204|    3|        1|       0|       0|         1|
|     1|    106|    3|        2|    3600|    3600|         1|
|     7|    700|    0|        2|     100|    3600|         1|
|     7|    700|    0|        2|     500|    3600|         2|
|     7|    700|    0|        2|     600|    3600|         3|
+------+-------+-----+---------+--------+--------+----------+
>>> df1=df1.where(df1.row_number==1)
>>> df1.show()
+------+-------+-----+---------+--------+--------+----------+
|userid|movieid|score|scoretype|playtime|duration|row_number|
+------+-------+-----+---------+--------+--------+----------+
|     3|    306|    5|        1|       0|       0|         1|
|     2|    202|    2|        2|    1800|    3600|         1|
|     7|    100|    0|        1|    3600|    3600|         1|
|     1|    102|    5|        1|       0|       0|         1|
|     2|    204|    3|        1|       0|       0|         1|
|     1|    106|    3|        2|    3600|    3600|         1|
|     7|    700|    0|        2|     100|    3600|         1|
+------+-------+-----+---------+--------+--------+----------+
>>> df=df.groupBy("userid","movieid").agg(sum("playtime"))
>>> df.show()
+------+-------+-------------+
|userid|movieid|sum(playtime)|
+------+-------+-------------+
|     1|    102|         1800|
|     2|    202|         1800|
|     7|    100|         3600|
|     2|    204|            0|
|     3|    306|            0|
|     7|    700|         1200|
|     1|    106|         3600|
+------+-------+-------------+
>>> df=df.withColumnRenamed("userid","useriddf").withColumnRenamed("movieid","movieiddf").withColumnRenamed("sum(playtime)","playtime1")
>>> df.show()
+--------+---------+-------------+
|useriddf|movieiddf|    playtime1|
+--------+---------+-------------+
|       1|      102|         1800|
|       2|      202|         1800|
|       7|      100|         3600|
|       2|      204|            0|
|       3|      306|            0|
|       7|      700|         1200|
|       1|      106|         3600|
+--------+---------+-------------+
>>> df_join=df.join(df1,df.movieiddf == df1.movieid,"inner")
>>> df_join.show()
+--------+---------+---------+------+-------+-----+---------+--------+--------+
|useriddf|movieiddf|playtime1|userid|movieid|score|scoretype|playtime|duration|
+--------+---------+---------+------+-------+-----+---------+--------+--------+
|       3|      306|        0|     3|    306|    5|        1|       0|       0|
|       2|      202|     1800|     2|    202|    2|        2|    1800|    3600|
|       7|      100|     3600|     7|    100|    0|        1|    3600|    3600|
|       1|      102|     1800|     1|    102|    5|        1|       0|       0|
|       2|      204|        0|     2|    204|    3|        1|       0|       0|
|       1|      106|     3600|     1|    106|    3|        2|    3600|    3600|
|       7|      700|     1200|     7|    700|    0|        2|     100|    3600|
+--------+---------+---------+------+-------+-----+---------+--------+--------+
>>> df_join.select("userid","movieid","score","scoretype","playtime1","duration").sort("userid").show()
+------+-------+-----+---------+---------+--------+
|userid|movieid|score|scoretype|playtime1|duration|
+------+-------+-----+---------+---------+--------+
|     1|    102|    5|        1|     1800|       0|
|     1|    106|    3|        2|     3600|    3600|
|     2|    202|    2|        2|     1800|    3600|
|     2|    204|    3|        1|        0|       0|
|     3|    306|    5|        1|        0|       0|
|     7|    100|    0|        1|     3600|    3600|
|     7|    700|    0|        2|     1200|    3600|
+------+-------+-----+---------+---------+--------+
I hope this is helpful for you.
# You can use groupBy + sum.
# Filter the rows with userid==1 or userid==7 (filtering on movieid is not needed
# because the two duplicated groups already have distinct userids).
from pyspark.sql import functions as F

duplicated_user_movie = df.filter((F.col('userid') == 1) | (F.col('userid') == 7))
duplicated_sum = duplicated_user_movie.groupBy(
    'userid', 'movieid', 'score', 'scoretype', 'duration'
).agg(F.sum('playtime').alias('playtime'))
cols = ['userid', 'movieid', 'score', 'scoretype', 'playtime', 'duration']
duplicated_sum = duplicated_sum.select(*cols)  # put the columns back in the original order
df = df.filter((F.col('userid') != 1) & (F.col('userid') != 7))
df = df.union(duplicated_sum)  # concatenate the two dataframes (union matches by position)
df = df.orderBy('userid')      # sort by userid
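For a fully generic version that does not hardcode the duplicated user ids, the same idea can be written as a single groupBy over both key columns. A hedged sketch (F.first without an explicit ordering is not deterministic, and note that the join in the first answer matches on movieid alone, which only works here because each movieid belongs to a single userid):
from pyspark.sql import functions as F

result = (
    df.groupBy('userid', 'movieid')
      .agg(
          F.first('score').alias('score'),
          F.first('scoretype').alias('scoretype'),
          F.sum('playtime').alias('playtime'),
          F.first('duration').alias('duration'),
      )
      .orderBy('userid', 'movieid')
)
result.show()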

Merging rows that have the same credentials - pyspark dataframe

How can I merge two rows in a pyspark dataframe that satisfy a condition?
Example:
dataframe
+---+---+------+
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  1|  4|     1|
|  4|  4|     2|
|  4|  1|     3|
|  1|  7|     1|
+---+---+------+
condition: (df.src,df.dst) == (df.dst,df.src)
Expected output (the weights summed and the (4,1) row deleted):
+---+---+------+
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  1|  4|     4| #
|  4|  4|     2|
|  1|  7|     1|
+---+---+------+
or (the weights summed and the (1,4) row deleted):
+---+---+------+
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  4|  4|     2|
|  4|  1|     4| #
|  1|  7|     1|
+---+---+------+
You can add a src_dst column with the sorted array of src and dst, then get the sum of weights for each src_dst, and remove duplicate rows of src_dst:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'src_dst',
    F.sort_array(F.array('src', 'dst'))
).withColumn(
    'weight',
    F.sum('weight').over(Window.partitionBy('src_dst'))
).dropDuplicates(['src_dst']).drop('src_dst')
df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
|  1|  7|     1|
|  1|  1|    93|
|  1|  4|     4|
|  8|  7|     1|
|  4|  4|     2|
+---+---+------+
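An equivalent groupBy-based sketch, for anyone who prefers an aggregation over a window; it always keeps the sorted orientation of each pair, so (4,1) comes out as (1,4):
from pyspark.sql import functions as F

df2 = (
    df.withColumn('pair', F.sort_array(F.array('src', 'dst')))
      .groupBy('pair')
      .agg(F.sum('weight').alias('weight'))
      .select(
          F.col('pair')[0].alias('src'),
          F.col('pair')[1].alias('dst'),
          'weight',
      )
)
df2.show()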

How do I count consecutive values with pyspark?

I am trying to count consecutive values that appear in a column with Pyspark. I have the column "a" in my dataframe and expect to create the column "b".
+---+---+
|  a|  b|
+---+---+
|  0|  1|
|  0|  2|
|  0|  3|
|  0|  4|
|  0|  5|
|  1|  1|
|  1|  2|
|  1|  3|
|  1|  4|
|  1|  5|
|  1|  6|
|  2|  1|
|  2|  2|
|  2|  3|
|  2|  4|
|  2|  5|
|  2|  6|
|  3|  1|
|  3|  2|
|  3|  3|
+---+---+
I have tried to create the column "b" with the lag function over a window, but without success.
w = Window \
    .partitionBy(df.some_id) \
    .orderBy(df.timestamp_column)

df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
I could resolve this issue with the following code:
df.withColumn(
    "b",
    f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
)
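One caveat about this resolution: partitioning by "a" numbers all rows that share the same value, so it only matches the expected output if equal values never reappear after a gap. If they can, a run-based variant is needed; here is a hedged sketch that reuses some_id and timestamp_column from the attempt above:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy("some_id").orderBy("timestamp_column")

# Flag the rows where "a" changes, turn the flags into a running run id,
# then number the rows inside each run.
runs = (
    df.withColumn("change", (f.lag("a").over(w) != f.col("a")).cast("int"))
      .na.fill({"change": 0})
      .withColumn("run_id", f.sum("change").over(w))
)
result = runs.withColumn(
    "b",
    f.row_number().over(
        Window.partitionBy("some_id", "run_id").orderBy("timestamp_column")
    ),
).drop("change", "run_id")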

replace NA with median in pyspark using window function

I want to replace NA values with the median, based on partition columns, using a window function in pyspark.
The sample input and required output were provided as images.
Creating your dataframe:
data = [
    [1, 5, 4],
    [1, 5, None],
    [1, 5, 1],
    [1, 5, 4],
    [2, 5, 1],
    [2, 5, 2],
    [2, 5, None],
    [2, 5, None],
    [2, 5, 4],
]
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
df.show()
+----+----+----+
|I_id|p_id| xyz|
+----+----+----+
|   1|   5|   4|
|   1|   5|null|
|   1|   5|   1|
|   1|   5|   4|
|   2|   5|   1|
|   2|   5|   2|
|   2|   5|null|
|   2|   5|null|
|   2|   5|   4|
+----+----+----+
To keep the solution as generic and dynamic as possible, I had to create several new columns to compute the median and to propagate it to the null rows. That said, the solution will not be slow and will scale to big data.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when
w= Window().partitionBy("I_id","p_id").orderBy(F.col("xyz").asc_nulls_first())
w2= Window().partitionBy("I_id","p_id")
df.withColumn("xyz1",F.count(F.col("xyz").isNotNull()).over(w))\
.withColumn("xyz2", F.max(F.row_number().over(w)).over(w2))\
.withColumn("xyz3", F.first("xyz1").over(w))\
.withColumn("xyz10", F.col("xyz2")-F.col("xyz3"))\
.withColumn("xyz9", F.when((F.col("xyz2")-F.col("xyz3"))%2!=0, F.col("xyz2")-F.col("xyz3")+1).otherwise(F.col("xyz2")-F.col("xyz3")))\
.withColumn("xyz4", (F.col("xyz9")/2))\
.withColumn("xyz6", F.col("xyz4")+F.col("xyz3"))\
.withColumn("xyz7", F.when(F.col("xyz10")%2==0,(F.col("xyz4")+F.col("xyz3")+1)).otherwise(F.lit(None)))\
.withColumn("xyz5", F.row_number().over(w))\
.withColumn("medianr", F.when(F.col("xyz6")==F.col("xyz5"), F.col("xyz")).when(F.col("xyz7")==F.col("xyz5"),F.col("xyz")).otherwise(F.lit(None)))\
.withColumn("medianr2", (F.mean("medianr").over(w2)))\
.withColumn("xyz", F.when(F.col("xyz").isNull(), F.col("medianr2")).otherwise(F.col("xyz")))\
.select("I_id","p_id","xyz")\
.orderBy("I_id").show()
+----+----+---+
|I_id|p_id|xyz|
+----+----+---+
|   1|   5|  4|
|   1|   5|  1|
|   1|   5|  4|
|   1|   5|  4|
|   2|   5|  2|
|   2|   5|  2|
|   2|   5|  1|
|   2|   5|  2|
|   2|   5|  4|
+----+----+---+
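If an exact, hand-rolled median is not required, a much shorter alternative sketch (assuming Spark 3.1+, where percentile_approx is available in pyspark.sql.functions) computes an approximate median per group and fills the nulls with it; note that it uses a groupBy plus a join rather than a window:
from pyspark.sql import functions as F

medians = df.groupBy("I_id", "p_id").agg(
    F.percentile_approx("xyz", 0.5).alias("median_xyz")
)
filled = (
    df.join(medians, ["I_id", "p_id"], "left")
      .withColumn("xyz", F.coalesce("xyz", "median_xyz"))
      .drop("median_xyz")
)
filled.orderBy("I_id").show()
Because the percentile is approximate and computed differently, the filled values can differ slightly from the output above.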

How to add an ID to pyspark dataframe rows that increases only if a certain condition is met?

I have a pyspark dataframe, and I want to add an Id column to it that only increases if a condition is met.
Example:
Over a window partitioned by col1, the ID needs to be incremented by 1 whenever the value of col2 changes.
Input:
+----+----+
|col1|col2|
+----+----+
|   1|   A|
|   1|   A|
|   1|   B|
|   1|   C|
|   2|   A|
|   2|   B|
|   2|   B|
|   2|   B|
|   2|   C|
|   2|   C|
+----+----+
output:
+----+----+----+
|col1|col2|  ID|
+----+----+----+
|   1|   A|   1|
|   1|   A|   1|
|   1|   B|   2|
|   1|   C|   3|
|   2|   A|   1|
|   2|   B|   2|
|   2|   B|   2|
|   2|   B|   2|
|   2|   C|   3|
|   2|   C|   3|
+----+----+----+
Thanks :)
What you are looking for is the dense_rank function (see the pyspark docs).
Assuming your dataframe variable is df, you can do something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
...
df.withColumn('ID', F.dense_rank().over(Window.partitionBy('col1').orderBy('col2')))
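One caveat, as a hedged note: dense_rank ranks by the value of col2, so it reproduces the expected output only because each new value sorts after the previous ones and never reappears. If a value can come back later (e.g. A, B, A) and an explicit ordering column is available (a hypothetical ord below), the lag-plus-cumulative-sum pattern from the first question applies instead:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "ord" is a hypothetical ordering column; the sample data above has none.
w = Window.partitionBy("col1").orderBy("ord")

df_with_id = (
    df.withColumn("change", (F.lag("col2").over(w) != F.col("col2")).cast("int"))
      .na.fill({"change": 0})
      .withColumn("ID", F.sum("change").over(w) + 1)
      .drop("change")
)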
