Window timeseries with step in Spark/Scala - python

I have this input:
timestamp,user
1,A
2,B
5,C
9,E
12,F
The result wanted is :
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.

I don't know whether a window function can cover the gaps between ranges, but you can take the following approach:
Define a DataFrame of ranges, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
The data with the timestamp column, df_data:
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Left-join the two DataFrames, matching each timestamp against the start or the end of a range, so that ranges without data are kept:
df_ranges.join(
  df_data,
  df_ranges.col("start").equalTo(df_data.col("timestamp"))
    .or(df_ranges.col("end").equalTo(df_data.col("timestamp"))),
  "left")
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
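Now do a simple aggregation with the collect_list function. Here res4 is the REPL name of the join result above; collect_list skips nulls, so the empty ranges yield []:
import org.apache.spark.sql.functions.collect_list
res4.groupBy("start", "end").agg(collect_list("user")).orderBy("start")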
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+
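The ranges above are hard-coded. As a minimal plain-Python sketch (no Spark) of the same idea, the ranges can be derived from the smallest and largest timestamp, assuming integer timestamps and a fixed step of 2. The names `bucket_users` and `step` are illustrative and not part of the original answer; in Spark, the generated (start, end) pairs could be used to build df_ranges.

```python
def bucket_users(rows, step=2):
    """Group users into fixed-width timestamp ranges, keeping empty ranges."""
    if not rows:
        return []
    first = min(ts for ts, _ in rows)
    last = max(ts for ts, _ in rows)
    # Number of ranges needed to cover first..last inclusive.
    n_buckets = (last - first) // step + 1
    buckets = [[] for _ in range(n_buckets)]
    for ts, user in sorted(rows):
        # Integer division maps each timestamp to the index of its range.
        buckets[(ts - first) // step].append(user)
    return [
        (first + i * step, first + (i + 1) * step - 1, users)
        for i, users in enumerate(buckets)
    ]


rows = [(1, "A"), (2, "B"), (5, "C"), (9, "E"), (12, "F")]
result = bucket_users(rows)
for start, end, users in result:
    print(f"{start} to {end}, {users}")
```

Pre-allocating one list per range is what keeps the empty ranges (3 to 4 and 7 to 8) in the output, which is the same job the left join does in the answer above.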

Related

Merging rows that have same credentials -pyspark dataframe

How can I merge two rows in a PySpark dataframe that satisfy a condition?
Example:
dataframe
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 1|
| 4| 4| 2|
| 4| 1| 3|
| 1| 7| 1|
+---+---+------+
condition: (df.src,df.dst) == (df.dst,df.src)
expected output
summed the weight and deleted (4,1)
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 4| #
| 4| 4| 2|
| 1| 7| 1|
+---+---+------+
or
summed the weights and deleted (1,4)
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 4| 4| 2|
| 4| 1| 4| #
| 1| 7| 1|
+---+---+------+
You can add a src_dst column with the sorted array of src and dst, then get the sum of weights for each src_dst, and remove duplicate rows of src_dst:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'src_dst',
    F.sort_array(F.array('src', 'dst'))
).withColumn(
    'weight',
    F.sum('weight').over(Window.partitionBy('src_dst'))
).dropDuplicates(['src_dst']).drop('src_dst')
df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
| 1| 7| 1|
| 1| 1| 93|
| 1| 4| 4|
| 8| 7| 1|
| 4| 4| 2|
+---+---+------+
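To make the sorted-key idea concrete, here is a small plain-Python sketch (no Spark) of the same merge. The function name `merge_undirected` is hypothetical. One difference to be aware of: this sketch always keeps the first orientation it sees, whereas dropDuplicates in Spark does not guarantee which of the duplicate rows survives.

```python
def merge_undirected(edges):
    """Sum the weights of edges that connect the same pair of nodes."""
    merged = {}  # sorted (src, dst) pair -> (src, dst, total weight)
    for src, dst, weight in edges:
        # (1, 4) and (4, 1) both map to the key (1, 4).
        key = tuple(sorted((src, dst)))
        if key in merged:
            first_src, first_dst, total = merged[key]
            merged[key] = (first_src, first_dst, total + weight)
        else:
            merged[key] = (src, dst, weight)
    # Dicts preserve insertion order, so rows come out in first-seen order.
    return list(merged.values())


edges = [(8, 7, 1), (1, 1, 93), (1, 4, 1), (4, 4, 2), (4, 1, 3), (1, 7, 1)]
merged_edges = merge_undirected(edges)
print(merged_edges)
```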

How do I count consecutive values with pyspark?

I am trying to count consecutive values that appear in a column with Pyspark. I have the column "a" in my dataframe and expect to create the column "b".
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I tried to create column "b" with the lag function over a window, but without success:
from pyspark.sql import functions as f, Window
w = Window\
    .partitionBy(df.some_id)\
    .orderBy(df.timestamp_column)
df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
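I resolved this with the following code, which numbers the rows within each value of "a" (so it assumes each value occurs in a single consecutive run):
df.withColumn(
    "b",
    f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
)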
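For comparison, the following plain-Python sketch (no Spark) counts strictly consecutive occurrences, restarting whenever the value changes. The name `run_lengths` is illustrative. It gives the same result as numbering rows per value when each value forms one run, but differs when a value comes back later.

```python
def run_lengths(values):
    """Return, for each position, how many times the value has repeated in a row."""
    counts = []
    previous = object()  # sentinel that never equals a real value
    current = 0
    for value in values:
        # Extend the current run or start a new one at 1.
        current = current + 1 if value == previous else 1
        counts.append(current)
        previous = value
    return counts


single_runs = run_lengths([0, 0, 0, 1, 1, 2])
repeated_value = run_lengths([0, 0, 1, 1, 1, 0])
print(single_runs)
print(repeated_value)
```

In the second call the final 0 gets a count of 1, whereas a row number partitioned by the value would assign it 3.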

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+----+----+---------+--------------+
|colA|colB|colA_rank|colA_rank_mean|
+----+----+---------+--------------+
|  21|  50|        1|             1|
|   9|  23|        2|           2.5|
|   9|  21|        3|           2.5|
|   8|  21|        4|             4|
|   2|  21|        5|           5.5|
|   2|   5|        6|           5.5|
|   1|   5|        7|           7.5|
|   1|   4|        8|           7.5|
|   0|   4|        9|            11|
|   0|   3|       10|            11|
|   0|   3|       11|            11|
|   0|   2|       12|            11|
|   0|   2|       13|            11|
+----+----+---------+--------------+
colA_rank is a normal ranking, while with colA_rank_mean I would like to resolve ties by replacing the rank with the mean rank of the tied rows. Is this achievable in a single pass with some particular ranking method?
Currently I am thinking of two passes, but that seems to require ordering the dataset twice on colA, once without a partition and once with a partition.
from pyspark.sql import functions as F, Window
# Step 1: normal rank (descending, as in the table above)
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy(F.col("colA").desc())))
# Step 2: resolve ties with the mean rank of each colA value
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))
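The arithmetic behind the mean rank can be checked with a plain-Python sketch (no Spark). It computes the mean rank of ties in two ways: by averaging the row numbers of each value, and with the closed form "smallest rank of the group plus (group size - 1) / 2". Both function names are hypothetical, and the sketch does not settle whether Spark can do this with a single window.

```python
from collections import Counter, defaultdict


def mean_ranks(values):
    """Average the 1-based descending row numbers of tied values."""
    order = sorted(values, reverse=True)
    ranks_by_value = defaultdict(list)
    for rank, value in enumerate(order, start=1):
        ranks_by_value[value].append(rank)
    return [sum(ranks_by_value[v]) / len(ranks_by_value[v]) for v in order]


def mean_ranks_closed_form(values):
    """Same result from the smallest rank and the size of each tie group."""
    order = sorted(values, reverse=True)
    counts = Counter(order)
    min_rank = {}
    for rank, value in enumerate(order, start=1):
        min_rank.setdefault(value, rank)  # keeps the first (smallest) rank
    return [min_rank[v] + (counts[v] - 1) / 2 for v in order]


col_a = [21, 9, 9, 8, 2, 2, 1, 1, 0, 0, 0, 0, 0]
means = mean_ranks(col_a)
print(means)
```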

Generate a group ID with only one Window function

I am trying to assign a group ID to my table.
For each group in the "part" column, following the order of the "ord" column, the first element I encounter in the "id" column receives new_id 0; then, each time I encounter a different id, I increase new_id.
Currently I need two window functions, so the process is quite slow.
from pyspark.sql import functions as F, Window
df = sqlContext.createDataFrame(
    [
        (1, 1, 'X'),
        (1, 2, 'X'),
        (1, 3, 'X'),
        (1, 4, 'Y'),
        (1, 5, 'Y'),
        (1, 6, 'Y'),
        (1, 7, 'X'),
        (1, 8, 'X'),
        (2, 1, 'X'),
        (2, 2, 'X'),
        (2, 3, 'X'),
        (2, 4, 'Y'),
        (2, 5, 'Y'),
        (2, 6, 'Y'),
        (2, 7, 'X'),
        (2, 8, 'X'),
    ],
    ["part", "ord", "id"]
)
# First window: flag rows whose id differs from the previous row in the same part
df.withColumn(
    "new_id",
    F.lag(F.col("id")).over(Window.partitionBy("part").orderBy("ord")) != F.col('id')
).withColumn(
    # Second window: running sum of the change flags gives the group ID
    "new_id",
    F.sum(
        F.col("new_id").cast('int')
    ).over(Window.partitionBy("part").orderBy("ord"))
).na.fill(0).show()
+----+---+---+------+
|part|ord| id|new_id|
+----+---+---+------+
| 2| 1| X| 0|
| 2| 2| X| 0|
| 2| 3| X| 0|
| 2| 4| Y| 1|
| 2| 5| Y| 1|
| 2| 6| Y| 1|
| 2| 7| X| 2|
| 2| 8| X| 2|
| 1| 1| X| 0|
| 1| 2| X| 0|
| 1| 3| X| 0|
| 1| 4| Y| 1|
| 1| 5| Y| 1|
| 1| 6| Y| 1|
| 1| 7| X| 2|
| 1| 8| X| 2|
+----+---+---+------+
Can I achieve the same using only one window function?
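As a reference for what the two windows compute together, here is a plain-Python sketch (no Spark) of the same logic: compare each id with the previous one in the same part, and keep a running count of changes. The name `assign_group_ids` is hypothetical, and the sketch only illustrates the intended result; it does not show a single-window Spark solution.

```python
def assign_group_ids(rows):
    """rows: iterable of (part, ord, id); returns (part, ord, id, new_id) tuples."""
    result = []
    state = {}  # part -> (last id seen, current new_id)
    for part, ord_, id_ in sorted(rows, key=lambda r: (r[0], r[1])):
        if part not in state:
            new_id = 0  # first row of a part starts at 0
        else:
            last_id, new_id = state[part]
            if id_ != last_id:
                new_id += 1  # id changed, so a new group starts
        state[part] = (id_, new_id)
        result.append((part, ord_, id_, new_id))
    return result


ids = ['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X']
rows = [(part, i + 1, id_) for part in (1, 2) for i, id_ in enumerate(ids)]
grouped = assign_group_ids(rows)
for row in grouped:
    print(row)
```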

Pyspark: how to duplicate a row n time in dataframe?

I've got a dataframe like this and I want to duplicate the row n times if the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"])
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use a udf to turn each n into a list holding n copies of n
n_to_array = udf(lambda n : [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
| A | B | n |
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
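What the "repeat, then explode" combination does can be modelled in plain Python (no Spark). This is only a sketch of the semantics, with the hypothetical name `repeat_rows`: build a list of n copies per row, then emit one output row per list element.

```python
def repeat_rows(rows):
    """rows: list of (A, B, n); returns each row repeated n times."""
    exploded = []
    for a, b, n in rows:
        repeated = [n] * n  # same list the udf builds for this row
        # "Explode": one output row per element of the list.
        for value in repeated:
            exploded.append((a, b, value))
    return exploded


rows = [(1, 2, 1), (2, 9, 1), (3, 8, 2), (4, 1, 1), (5, 3, 3)]
duplicated = repeat_rows(rows)
for row in duplicated:
    print(row)
```

A row with n equal to 0 produces an empty list and therefore no output rows, which mirrors how explode drops rows whose array is empty.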
I think the udf answer by @Ahmed is the best way to go, but here is an alternative method that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
import pyspark.sql.functions as f
max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array of length max_n for each row, containing the numbers in range(max_n). This intermediate step produces a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column, and filter to keep only the values in the array that are less than n. This will ensure that we have n copies of each row. Finally we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
    .select('A', 'B', 'n', f.explode('n_array').alias('col'))\
    .where(f.col('col') < f.col('n'))\
    .drop('col')\
    .show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating an array of length max_n for each row, as opposed to an array of length n in the udf solution. It's not immediately clear to me how this will scale vs. the udf for large max_n, but I suspect the udf will win out.
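A rough way to see the difference in intermediate size, using the sample data and plain Python only: the max_n approach explodes every row into max_n elements before filtering, while the udf approach produces exactly n elements per row. This counts rows only and says nothing about actual Spark runtime.

```python
ns = [1, 1, 2, 1, 3]  # the n column of the sample DataFrame
max_n = max(ns)

# max_n approach: every row is exploded to max_n rows, then filtered.
exploded_rows = len(ns) * max_n
# udf approach: each row is exploded to exactly n rows.
kept_rows = sum(ns)

print(exploded_rows, kept_rows)
```

For this sample, 15 rows are generated and 8 survive the filter; the gap grows when a few rows have a much larger n than the rest.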
