Manipulate a complex dataframe in PySpark

I'm preparing a dataset to train a machine learning model using PySpark. The dataframe I'm working on contains thousands of records about presences registered inside rooms of different buildings and cities on different days. These presences are in this format:
+----+--------+----+---+-----+------+--------+-------+---------+
|room|building|city|day|month|inHour|inMinute|outHour|outMinute|
+----+--------+----+---+-----+------+--------+-------+---------+
|   1|       1|   1|  9|   11|     8|      27|     13|       15|
|   1|       1|   1|  9|   11|     8|      28|     13|        5|
|   1|       1|   1|  9|   11|     8|      32|     13|        7|
|   1|       1|   1|  9|   11|     8|      32|      8|       50|
|   1|       1|   1|  9|   11|     8|      32|      8|       48|
+----+--------+----+---+-----+------+--------+-------+---------+
inHour and inMinute stand for the hour and minute of access and, of course, outHour and outMinute refer to the time of exit. Hours are in 0-23 format.
All the columns contain only integer values.
What I'm missing is the target value for my machine learning model, which is the number of persons for each combination of room, building, city, day, month and a time interval. To explain better: the first row refers to a presence with access time 8 and exit time 13, so it should be counted in the records for the intervals 8-9, 9-10, 10-11, 11-12, 12-13 and also 13-14.
What I want to accomplish is something like the following:
+----+--------+----+---+-----+------+-------+-----+
|room|building|city|day|month|timeIn|timeOut|count|
+----+--------+----+---+-----+------+-------+-----+
|   1|       1|   1|  9|   11|     8|      9|    X|
|   1|       1|   1|  9|   11|     9|     10|    X|
|   1|       1|   1|  9|   11|    10|     11|    X|
|   1|       1|   1|  9|   11|    11|     12|    X|
|   1|       1|   1|  9|   11|    12|     13|    X|
+----+--------+----+---+-----+------+-------+-----+
So the 4th row of the first table should be counted in the 1st row of this table and so on...

You can explode a sequence of hours (e.g. the first row would give [8, 9, 10, 11, 12, 13]), group by the hour (and the other columns) and get the aggregate count for each group. Here hour corresponds to timeIn. I don't think it's necessary to include timeOut in the result dataframe because it's always timeIn + 1.
import pyspark.sql.functions as F

df2 = df.withColumn(
    'hour',
    F.explode(F.sequence('inHour', 'outHour'))  # one row per hour from inHour to outHour (inclusive)
).groupBy(
    'room', 'building', 'city', 'day', 'month', 'hour'
).count().orderBy('hour')

df2.show()
+----+--------+----+---+-----+----+-----+
|room|building|city|day|month|hour|count|
+----+--------+----+---+-----+----+-----+
|   1|       1|   1|  9|   11|   8|    5|
|   1|       1|   1|  9|   11|   9|    3|
|   1|       1|   1|  9|   11|  10|    3|
|   1|       1|   1|  9|   11|  11|    3|
|   1|       1|   1|  9|   11|  12|    3|
|   1|       1|   1|  9|   11|  13|    3|
+----+--------+----+---+-----+----+-----+
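The question's expected output also lists an explicit timeIn/timeOut pair. Since timeOut is always timeIn + 1, it can be derived from hour afterwards; a minimal sketch (not part of the original answer, reusing df2 from above):

import pyspark.sql.functions as F

# rename hour to timeIn and add timeOut = timeIn + 1
df3 = (df2.withColumnRenamed('hour', 'timeIn')
          .withColumn('timeOut', F.col('timeIn') + 1))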

Related

Merging rows that have same credentials - pyspark dataframe

How can I merge two rows in a pyspark dataframe that satisfy a condition?
Example:
dataframe
+---+---+------+
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  1|  4|     1|
|  4|  4|     2|
|  4|  1|     3|
|  1|  7|     1|
+---+---+------+
condition: (df.src,df.dst) == (df.dst,df.src)
Expected output:
the weights summed and (4,1) deleted:
+---+---+------+
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  1|  4|     4|  #
|  4|  4|     2|
|  1|  7|     1|
+---+---+------+
or
the weights summed and (1,4) deleted:
+---+---+------+
|src|dst|weight|
+---+---+------+
|  8|  7|     1|
|  1|  1|    93|
|  4|  4|     2|
|  4|  1|     4|  #
|  1|  7|     1|
+---+---+------+
You can add a src_dst column with the sorted array of src and dst, then get the sum of weights for each src_dst, and remove duplicate rows of src_dst:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'src_dst',
    F.sort_array(F.array('src', 'dst'))  # same key for (1,4) and (4,1)
).withColumn(
    'weight',
    F.sum('weight').over(Window.partitionBy('src_dst'))  # total weight per unordered pair
).dropDuplicates(['src_dst']).drop('src_dst')
df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
|  1|  7|     1|
|  1|  1|    93|
|  1|  4|     4|
|  8|  7|     1|
|  4|  4|     2|
+---+---+------+
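If keeping the original src/dst orientation is not important, an equivalent sketch (a hypothetical alternative, not from the original answer) can aggregate directly on the sorted pair and skip the window and dropDuplicates:

from pyspark.sql import functions as F

# group on the order-insensitive pair and sum the weights
df2 = (df.groupBy(F.sort_array(F.array('src', 'dst')).alias('src_dst'))
         .agg(F.sum('weight').alias('weight'))
         .select(F.col('src_dst')[0].alias('src'),
                 F.col('src_dst')[1].alias('dst'),
                 'weight'))

Note that this always reports the smaller value as src (e.g. (8,7) becomes (7,8)), which is why the window-based version above is preferable when the original orientation must be kept.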

replace NA with median in pyspark using window function

How can I replace NA with the median, based on partition columns, using a window function in pyspark?
Sample Input:
Required Output:
Creating your dataframe:
data = [[1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4]]
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
df.show()
+----+----+----+
|I_id|p_id| xyz|
+----+----+----+
|   1|   5|   4|
|   1|   5|null|
|   1|   5|   1|
|   1|   5|   4|
|   2|   5|   1|
|   2|   5|   2|
|   2|   5|null|
|   2|   5|null|
|   2|   5|   4|
+----+----+----+
To keep the solution as generic and dynamic as possible, I had to create many new columns to compute the median and to be able to send it to the nulls. That said, the solution will not be slow and will scale to big data.
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy("I_id", "p_id").orderBy(F.col("xyz").asc_nulls_first())
w2 = Window.partitionBy("I_id", "p_id")

# The helper columns (xyz1..xyz10) locate the middle row(s) of each partition;
# medianr picks their xyz values, medianr2 averages them, and the nulls are
# finally replaced with that per-partition median.
df.withColumn("xyz1", F.count(F.col("xyz").isNotNull()).over(w))\
  .withColumn("xyz2", F.max(F.row_number().over(w)).over(w2))\
  .withColumn("xyz3", F.first("xyz1").over(w))\
  .withColumn("xyz10", F.col("xyz2") - F.col("xyz3"))\
  .withColumn("xyz9", F.when((F.col("xyz2") - F.col("xyz3")) % 2 != 0,
                             F.col("xyz2") - F.col("xyz3") + 1)
                       .otherwise(F.col("xyz2") - F.col("xyz3")))\
  .withColumn("xyz4", (F.col("xyz9") / 2))\
  .withColumn("xyz6", F.col("xyz4") + F.col("xyz3"))\
  .withColumn("xyz7", F.when(F.col("xyz10") % 2 == 0,
                             (F.col("xyz4") + F.col("xyz3") + 1))
                       .otherwise(F.lit(None)))\
  .withColumn("xyz5", F.row_number().over(w))\
  .withColumn("medianr", F.when(F.col("xyz6") == F.col("xyz5"), F.col("xyz"))
                          .when(F.col("xyz7") == F.col("xyz5"), F.col("xyz"))
                          .otherwise(F.lit(None)))\
  .withColumn("medianr2", (F.mean("medianr").over(w2)))\
  .withColumn("xyz", F.when(F.col("xyz").isNull(), F.col("medianr2")).otherwise(F.col("xyz")))\
  .select("I_id", "p_id", "xyz")\
  .orderBy("I_id").show()
+----+----+---+
|I_id|p_id|xyz|
+----+----+---+
|   1|   5|  4|
|   1|   5|  1|
|   1|   5|  4|
|   1|   5|  4|
|   2|   5|  2|
|   2|   5|  2|
|   2|   5|  1|
|   2|   5|  2|
|   2|   5|  4|
+----+----+---+
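On Spark 3.1+ a much shorter alternative is possible (a sketch, not part of the original answer): compute the per-partition median with percentile_approx, which ignores nulls, and join it back to fill the gaps.

from pyspark.sql import functions as F

# approximate median of xyz per (I_id, p_id) group
medians = df.groupBy("I_id", "p_id").agg(
    F.percentile_approx("xyz", 0.5).alias("xyz_median")
)

df_filled = (df.join(medians, ["I_id", "p_id"], "left")
               .withColumn("xyz", F.coalesce(F.col("xyz"), F.col("xyz_median")))
               .drop("xyz_median"))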

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+--------+----+-----------+--------------+
|    colA|colB|  colA_rank|colA_rank_mean|
+--------+----+-----------+--------------+
|      21|  50|          1|             1|
|       9|  23|          2|           2.5|
|       9|  21|          3|           2.5|
|       8|  21|          4|             4|
|       2|  21|          5|           5.5|
|       2|   5|          6|           5.5|
|       1|   5|          7|           7.5|
|       1|   4|          8|           7.5|
|       0|   4|          9|            11|
|       0|   3|         10|            11|
|       0|   3|         11|            11|
|       0|   2|         12|            11|
|       0|   2|         13|            11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while for colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the ties. Is this achievable in a single pass with some particular ranking method?
Currently I am thinking of two passes, but that would seem to require ordering the dataset twice on colA, once without a partition and once with a partition.
from pyspark.sql import functions as F, Window

# Step 1: normal rank (descending colA, as in the example above)
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy(F.col("colA").desc())))
# Step 2: resolve ties with the mean rank
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))
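For reference, the mean rank of a tie group can be written in closed form: it is rank() (the first position of the group) plus (group size - 1) / 2. A minimal sketch of that formula (not from the original post; it still uses two window specs, but both columns come out of one statement):

from pyspark.sql import functions as F, Window

w_order = Window.orderBy(F.col("colA").desc())  # global ordering on colA
w_ties = Window.partitionBy("colA")             # frame = the whole tie group

df = (df.withColumn("colA_rank", F.row_number().over(w_order))
        .withColumn("colA_rank_mean",
                    F.rank().over(w_order) + (F.count("*").over(w_ties) - 1) / 2))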

Generate a group ID with only one Window function

I am trying to assign a group ID to my table.
For each group in the "part" column, following the order given by the "ord" column, the first element I encounter in the "id" column receives a new_id of 0; then, each time I encounter a different id, I increase new_id.
Currently, I need to use 2 window functions and the process is therefore quite slow.
df = sqlContext.createDataFrame(
    [
        (1, 1, 'X'),
        (1, 2, 'X'),
        (1, 3, 'X'),
        (1, 4, 'Y'),
        (1, 5, 'Y'),
        (1, 6, 'Y'),
        (1, 7, 'X'),
        (1, 8, 'X'),
        (2, 1, 'X'),
        (2, 2, 'X'),
        (2, 3, 'X'),
        (2, 4, 'Y'),
        (2, 5, 'Y'),
        (2, 6, 'Y'),
        (2, 7, 'X'),
        (2, 8, 'X'),
    ],
    ["part", "ord", "id"]
)
from pyspark.sql import functions as F, Window

w = Window.partitionBy("part").orderBy("ord")

df.withColumn(
    "new_id",
    F.lag(F.col("id")).over(w) != F.col("id")   # True whenever id changes within a part
).withColumn(
    "new_id",
    F.sum(F.col("new_id").cast('int')).over(w)  # running count of changes = group ID
).na.fill(0).show()
+----+---+---+------+
|part|ord| id|new_id|
+----+---+---+------+
|   2|  1|  X|     0|
|   2|  2|  X|     0|
|   2|  3|  X|     0|
|   2|  4|  Y|     1|
|   2|  5|  Y|     1|
|   2|  6|  Y|     1|
|   2|  7|  X|     2|
|   2|  8|  X|     2|
|   1|  1|  X|     0|
|   1|  2|  X|     0|
|   1|  3|  X|     0|
|   1|  4|  Y|     1|
|   1|  5|  Y|     1|
|   1|  6|  Y|     1|
|   1|  7|  X|     2|
|   1|  8|  X|     2|
+----+---+---+------+
Can I achieve the same using only one window function ?

Window timeseries with step in Spark/Scala

I have this input:
timestamp,user
1,A
2,B
5,C
9,E
12,F
The desired result is:
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using a Window, but the problem is that it doesn't include the empty timestamp ranges.
Any hints would be helpful.
I don't know if a windowing function will cover the gaps between ranges, but you can take the following approach:
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
|    1|  2|
|    3|  4|
|    5|  6|
|    7|  8|
|    9| 10|
+-----+---+
Data with the timestamp column, df_data:
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
|        1|   A|
|        2|   B|
|        5|   C|
|        9|   E|
+---------+----+
Join the two dataframes on the start, end and timestamp columns:
val df_joined = df_ranges.join(df_data,
  df_ranges.col("start").equalTo(df_data.col("timestamp"))
    .or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
df_joined.show()
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
|    1|  2|        1|   A|
|    1|  2|        2|   B|
|    5|  6|        5|   C|
|    9| 10|        9|   E|
|    3|  4|     null|null|
|    7|  8|     null|null|
+-----+---+---------+----+
Now do a simple aggregation with the collect_list function:
import org.apache.spark.sql.functions.collect_list
df_joined.groupBy("start", "end").agg(collect_list("user")).orderBy("start").show()
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
|    1|  2|            [A, B]|
|    3|  4|                []|
|    5|  6|               [C]|
|    7|  8|                []|
|    9| 10|               [E]|
+-----+---+------------------+
