How do I count consecutive values with pyspark?

I am trying to count consecutive values that appear in a column with PySpark. I have column "a" in my dataframe and want to create column "b".
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create column "b" with the lag function over a window, but without success:
from pyspark.sql import Window
from pyspark.sql import functions as f

w = Window\
    .partitionBy(df.some_id)\
    .orderBy(df.timestamp_column)

df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)

I could resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))

Related

Merging rows that have same credentials -pyspark dataframe

How can I merge two rows in a pyspark dataframe that satisfy a condition?
Example:
dataframe
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 1|
| 4| 4| 2|
| 4| 1| 3|
| 1| 7| 1|
+---+---+------+
condition: (df.src,df.dst) == (df.dst,df.src)
expected output
summed the weights and deleted (4,1)
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 4| #
| 4| 4| 2|
| 1| 7| 1|
+---+---+------+
or
summed the weights and deleted (1,4)
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 4| 4| 2|
| 4| 1| 4| #
| 1| 7| 1|
+---+---+------+
You can add a src_dst column with the sorted array of src and dst, then get the sum of weights for each src_dst, and remove duplicate rows of src_dst:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'src_dst',
    F.sort_array(F.array('src', 'dst'))
).withColumn(
    'weight',
    F.sum('weight').over(Window.partitionBy('src_dst'))
).dropDuplicates(['src_dst']).drop('src_dst')
df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
| 1| 7| 1|
| 1| 1| 93|
| 1| 4| 4|
| 8| 7| 1|
| 4| 4| 2|
+---+---+------+
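As a side note, an equivalent sketch (my own, not from the original answer) does the same thing with groupBy instead of a window. Which of the two orientations survives for each pair is nondeterministic, which matches the question's "either output is acceptable".
from pyspark.sql import functions as F

df2 = df.groupBy(
    # sorted pair so (1, 4) and (4, 1) land in the same group
    F.sort_array(F.array('src', 'dst')).alias('src_dst')
).agg(
    F.first('src').alias('src'),
    F.first('dst').alias('dst'),
    F.sum('weight').alias('weight')
).drop('src_dst')
df2.show()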

replace NA with median in pyspark using window function

I want to replace NA values with the median, based on partition columns, using a window function in PySpark.
Creating your dataframe:
data = ([1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4])

df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
df.show()
+----+----+----+
|I_id|p_id| xyz|
+----+----+----+
| 1| 5| 4|
| 1| 5|null|
| 1| 5| 1|
| 1| 5| 4|
| 2| 5| 1|
| 2| 5| 2|
| 2| 5|null|
| 2| 5|null|
| 2| 5| 4|
+----+----+----+
To keep the solution as generic and dynamic as possible, I had to create many new columns to compute the median and assign it to the nulls. That said, the solution will not be slow and will scale to big data.
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window().partitionBy("I_id", "p_id").orderBy(F.col("xyz").asc_nulls_first())
w2 = Window().partitionBy("I_id", "p_id")

df.withColumn("xyz1", F.count(F.col("xyz").isNotNull()).over(w))\
  .withColumn("xyz2", F.max(F.row_number().over(w)).over(w2))\
  .withColumn("xyz3", F.first("xyz1").over(w))\
  .withColumn("xyz10", F.col("xyz2") - F.col("xyz3"))\
  .withColumn("xyz9", F.when((F.col("xyz2") - F.col("xyz3")) % 2 != 0, F.col("xyz2") - F.col("xyz3") + 1).otherwise(F.col("xyz2") - F.col("xyz3")))\
  .withColumn("xyz4", F.col("xyz9") / 2)\
  .withColumn("xyz6", F.col("xyz4") + F.col("xyz3"))\
  .withColumn("xyz7", F.when(F.col("xyz10") % 2 == 0, F.col("xyz4") + F.col("xyz3") + 1).otherwise(F.lit(None)))\
  .withColumn("xyz5", F.row_number().over(w))\
  .withColumn("medianr", F.when(F.col("xyz6") == F.col("xyz5"), F.col("xyz")).when(F.col("xyz7") == F.col("xyz5"), F.col("xyz")).otherwise(F.lit(None)))\
  .withColumn("medianr2", F.mean("medianr").over(w2))\
  .withColumn("xyz", F.when(F.col("xyz").isNull(), F.col("medianr2")).otherwise(F.col("xyz")))\
  .select("I_id", "p_id", "xyz")\
  .orderBy("I_id").show()
+----+----+---+
|I_id|p_id|xyz|
+----+----+---+
| 1| 5| 4|
| 1| 5| 1|
| 1| 5| 4|
| 1| 5| 4|
| 2| 5| 2|
| 2| 5| 2|
| 2| 5| 1|
| 2| 5| 2|
| 2| 5| 4|
+----+----+---+
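For comparison, a much shorter alternative (my own sketch, not part of the answer above, assuming a Spark version where the percentile_approx SQL function is available) computes each group's approximate median and joins it back. It returns one of the two middle values for even-sized groups instead of averaging them, but it should reproduce the output shown above for this data.
from pyspark.sql import functions as F

# per-group (approximate) median of the non-null xyz values
medians = df.groupBy("I_id", "p_id")\
    .agg(F.expr("percentile_approx(xyz, 0.5)").alias("xyz_median"))

# fill the nulls with the group median, then drop the helper column
df.join(medians, ["I_id", "p_id"], "left")\
    .withColumn("xyz", F.coalesce("xyz", "xyz_median"))\
    .drop("xyz_median")\
    .orderBy("I_id")\
    .show()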

How to add an ID to pyspark dataframe rows that increases only if a certain condition is met?

I have a pyspark dataframe, and I want to add an Id column to it that only increases if a condition is met.
Example:
Over a window on col1, whenever the col2 value changes, the ID needs to be incremented by 1.
Input:
+----+----+
|col1|col2|
+----+----+
| 1| A|
| 1| A|
| 1| B|
| 1| C|
| 2| A|
| 2| B|
| 2| B|
| 2| B|
| 2| C|
| 2| C|
+----+----+
output:
+----+----+----+
|col1|col2| ID|
+----+----+----+
| 1| A| 1|
| 1| A| 1|
| 1| B| 2|
| 1| C| 3|
| 2| A| 1|
| 2| B| 2|
| 2| B| 2|
| 2| B| 2|
| 2| C| 3|
| 2| C| 3|
+----+----+----+
Thanks :)
What you are looking for is the dense_rank function (see the PySpark documentation).
Assuming your dataframe variable is df you can do something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
...
df.withColumn('ID', F.dense_rank().over(Window.partitionBy('col1').orderBy('col2')))
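A self-contained usage sketch (the input is recreated from the example above, and a SparkSession named spark is assumed to exist). Ordering by col2 happens to give exactly the expected IDs here because col2 values appear in sorted order within each col1 group.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [(1, "A"), (1, "A"), (1, "B"), (1, "C"),
     (2, "A"), (2, "B"), (2, "B"), (2, "B"), (2, "C"), (2, "C")],
    ["col1", "col2"]
)

# dense_rank produces consecutive ranks with no gaps, so equal col2
# values share an ID and each new value increases it by exactly 1
df.withColumn(
    "ID",
    F.dense_rank().over(Window.partitionBy("col1").orderBy("col2"))
).show()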

Duplicate Rows in Pyspark DataFrame Dynamic Number of Times Based on Difference between Two Columns [duplicate]

This question already has answers here:
Pyspark Replicate Row based on column value
(2 answers)
Closed 4 years ago.
I have a PySpark dataframe that looks like:
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
I want to duplicate each row n number of times where n = day2 - day1. The resulting dataframe would look like:
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 1| 2| 4|
| 1| 2| 4|
| 2| 1| 2|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
How can I do this?
Here is one way to do that.
from pyspark.sql import functions as F
from pyspark.sql.types import *

@F.udf(ArrayType(StringType()))
def gen_array(day1, day2):
    # one (empty) element per row we want in the output
    return ['' for i in range(day2 - day1 + 1)]

df.withColumn(
    "dup",
    F.explode(gen_array(F.col("day1"), F.col("day2")))
).drop("dup").show()
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 1| 2| 4|
| 1| 2| 4|
| 2| 1| 2|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
Another option using rdd.flatMap:
df.rdd.flatMap(lambda r: [r] * (r.day2 - r.day1 + 1)).toDF().show()
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 1| 2| 4|
| 1| 2| 4|
| 2| 1| 2|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
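A third option (my own sketch, not from the original answers, assuming Spark 2.4+ where sequence is available) stays in the DataFrame API and avoids the UDF: sequence(day1, day2) produces day2 - day1 + 1 elements per row, and exploding it duplicates the row that many times.
from pyspark.sql import functions as F

df.withColumn(
    "dup",
    F.explode(F.sequence(F.col("day1"), F.col("day2")))
).drop("dup").show()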

Generate a group ID with only one Window function

I am trying to assign a group ID to my table.
For each group in the "part" column, in the order given by the "ord" column, the first value I encounter in the "id" column receives a new_id of 0; then, each time I encounter a different id, I increase new_id by 1.
Currently I need to use 2 window functions, and the process is therefore quite slow.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = sqlContext.createDataFrame(
    [
        (1, 1, 'X'),
        (1, 2, 'X'),
        (1, 3, 'X'),
        (1, 4, 'Y'),
        (1, 5, 'Y'),
        (1, 6, 'Y'),
        (1, 7, 'X'),
        (1, 8, 'X'),
        (2, 1, 'X'),
        (2, 2, 'X'),
        (2, 3, 'X'),
        (2, 4, 'Y'),
        (2, 5, 'Y'),
        (2, 6, 'Y'),
        (2, 7, 'X'),
        (2, 8, 'X'),
    ],
    ["part", "ord", "id"]
)

df.withColumn(
    "new_id",
    F.lag(F.col("id")).over(Window.partitionBy("part").orderBy("ord")) != F.col('id')
).withColumn(
    "new_id",
    F.sum(
        F.col("new_id").cast('int')
    ).over(Window.partitionBy("part").orderBy("ord"))
).na.fill(0).show()
+----+---+---+------+
|part|ord| id|new_id|
+----+---+---+------+
| 2| 1| X| 0|
| 2| 2| X| 0|
| 2| 3| X| 0|
| 2| 4| Y| 1|
| 2| 5| Y| 1|
| 2| 6| Y| 1|
| 2| 7| X| 2|
| 2| 8| X| 2|
| 1| 1| X| 0|
| 1| 2| X| 0|
| 1| 3| X| 0|
| 1| 4| Y| 1|
| 1| 5| Y| 1|
| 1| 6| Y| 1|
| 1| 7| X| 2|
| 1| 8| X| 2|
+----+---+---+------+
Can I achieve the same result using only one window function?
