I am very new to PySpark and my problem is as follows:
I have this sample data frame, and I want to use withColumn to add an average coefficient column whose value is computed per country, i.e. the same value for every row that shares the same Country.
This is a self-made dataframe, similar to the one I am actually working with:
+--------+--------+---+-----+----+----+-----+------------+---------+----+
|Relegion| Country|Day|Month|Year|Week|cases|Weekly Coeff|Avg Coeff|rank|
+--------+--------+---+-----+----+----+-----+------------+---------+----+
| Hindu| India| 3| 1| 20| 1| 30| 0.5| 0.616| 3|
| Hindu| India| 2| 1| 20| 1| 20| 0.7| 0.616| 3|
| Hindu| India| 5| 2| 20| 2| 100| 0.9| 0.616| 3|
| Hindu| India| 6| 2| 20| 2| 160| 0.4| 0.616| 3|
| Hindu| India| 6| 2| 20| 1| 160| 0.4| 0.616| 3|
| Hindu| India| 1| 1| 20| 2| 5| 0.6| 0.616| 3|
| Hindu| India| 1| 1| 20| 1| 5| 0.6| 0.616| 3|
| Hindu| India| 2| 1| 20| 2| 20| 0.7| 0.616| 3|
| Hindu| India| 4| 2| 20| 1| 10| 0.6| 0.616| 3|
| Hindu| India| 3| 1| 20| 2| 30| 0.5| 0.616| 3|
| Hindu| India| 4| 2| 20| 2| 10| 0.6| 0.616| 3|
| Hindu| India| 5| 2| 20| 1| 100| 0.9| 0.616| 3|
| Muslim|Pakistan| 1| 1| 20| 1| 100| 0.6| 0.683| 2|
| Muslim|Pakistan| 4| 2| 20| 1| 200| 0.6| 0.683| 2|
| Muslim|Pakistan| 2| 1| 20| 1| 20| 0.9| 0.683| 2|
| Muslim|Pakistan| 5| 2| 20| 1| 300| 0.8| 0.683| 2|
| Muslim|Pakistan| 2| 1| 20| 2| 20| 0.9| 0.683| 2|
| Muslim|Pakistan| 3| 1| 20| 1| 50| 0.4| 0.683| 2|
| Muslim|Pakistan| 6| 2| 20| 1| 310| 0.8| 0.683| 2|
| Muslim|Pakistan| 3| 1| 20| 2| 50| 0.4| 0.683| 2|
+--------+--------+---+-----+----+----+-----+------------+---------+----+
I have to find the average coefficient (one value per country). I have added the Avg Coeff column manually to show the expected result, but I haven't been able to compute it.
You can use average over a window:
from pyspark.sql import functions as F, Window
df2 = df.withColumn('Avg_coeff', F.avg('Weekly_coeff').over(Window.partitionBy('Country')))
Note that it's not a good practice to have spaces in column names.
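If the real columns do contain spaces, as in the sample above, a small hedged sketch (assuming the dataframe is called df; the underscore names are my own choice) is to rename them first and then apply the window average:
from pyspark.sql import functions as F, Window

# rename the space-containing column (the new name is an assumption, not from the original data)
df = df.withColumnRenamed('Weekly Coeff', 'Weekly_coeff')

df2 = df.withColumn('Avg_coeff', F.avg('Weekly_coeff').over(Window.partitionBy('Country')))
df2.select('Country', 'Avg_coeff').distinct().show()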
I am studying Spark.
Is there a way to create and return a dataframe that combines (sums) playtime across rows that have duplicate userId and movieId values?
Thank you.
[screenshot of the sample table attached to the question]
Although I am not fully able to understand the desired result data, by looking at the attached screenshot I came up with this solution.
I am assuming that, from each set of duplicate records, I first have to select one representative record and then aggregate playtime in the final result.
I created two DataFrames, df and df1, from the dataset and joined them.
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number, sum
>>> windowSpec = Window.partitionBy("movieid").orderBy("userid")
>>> df1 = df.withColumn("row_number", row_number().over(windowSpec))
>>> df1.show()
+------+-------+-----+---------+--------+--------+----------+
|userid|movieid|score|scoretype|playtime|duration|row_number|
+------+-------+-----+---------+--------+--------+----------+
| 3| 306| 5| 1| 0| 0| 1|
| 2| 202| 2| 2| 1800| 3600| 1|
| 7| 100| 0| 1| 3600| 3600| 1|
| 1| 102| 5| 1| 0| 0| 1|
| 1| 102| 0| 2| 1800| 3600| 2|
| 2| 204| 3| 1| 0| 0| 1|
| 1| 106| 3| 2| 3600| 3600| 1|
| 7| 700| 0| 2| 100| 3600| 1|
| 7| 700| 0| 2| 500| 3600| 2|
| 7| 700| 0| 2| 600| 3600| 3|
+------+-------+-----+---------+--------+--------+----------+
>>> df1=df1.where(df1.row_number==1)
>>> df1.show()
+------+-------+-----+---------+--------+--------+----------+
|userid|movieid|score|scoretype|playtime|duration|row_number|
+------+-------+-----+---------+--------+--------+----------+
| 3| 306| 5| 1| 0| 0| 1|
| 2| 202| 2| 2| 1800| 3600| 1|
| 7| 100| 0| 1| 3600| 3600| 1|
| 1| 102| 5| 1| 0| 0| 1|
| 2| 204| 3| 1| 0| 0| 1|
| 1| 106| 3| 2| 3600| 3600| 1|
| 7| 700| 0| 2| 100| 3600| 1|
+------+-------+-----+---------+--------+--------+----------+
>>> df=df.groupBy("userid","movieid").agg(sum("playtime"))
>>> df.show()
+------+-------+-------------+
|userid|movieid|sum(playtime)|
+------+-------+-------------+
| 1| 102| 1800|
| 2| 202| 1800|
| 7| 100| 3600|
| 2| 204| 0|
| 3| 306| 0|
| 7| 700| 1200|
| 1| 106| 3600|
+------+-------+-------------+
>>> df=df.withColumnRenamed("userid","useriddf").withColumnRenamed("movieid","movieiddf").withColumnRenamed("sum(playtime)","playtime1")
>>> df.show()
+--------+---------+-------------+
|useriddf|movieiddf|playtime1 |
+--------+---------+-------------+
| 1| 102| 1800|
| 2| 202| 1800|
| 7| 100| 3600|
| 2| 204| 0|
| 3| 306| 0|
| 7| 700| 1200|
| 1| 106| 3600|
+--------+---------+-------------+
>>> df_join=df.join(df1,df.movieiddf == df1.movieid,"inner")
>>> df_join.show()
+--------+---------+---------+------+-------+-----+---------+--------+--------+
|useriddf|movieiddf|playtime1|userid|movieid|score|scoretype|playtime|duration|
+--------+---------+---------+------+-------+-----+---------+--------+--------+
| 3| 306| 0| 3| 306| 5| 1| 0| 0|
| 2| 202| 1800| 2| 202| 2| 2| 1800| 3600|
| 7| 100| 3600| 7| 100| 0| 1| 3600| 3600|
| 1| 102| 1800| 1| 102| 5| 1| 0| 0|
| 2| 204| 0| 2| 204| 3| 1| 0| 0|
| 1| 106| 3600| 1| 106| 3| 2| 3600| 3600|
| 7| 700| 1200| 7| 700| 0| 2| 100| 3600|
+--------+---------+---------+------+-------+-----+---------+--------+--------+
>>> df_join.select("userid","movieid","score","scoretype","playtime1","duration").sort("userid").show()
+------+-------+-----+---------+---------+--------+
|userid|movieid|score|scoretype|playtime1|duration|
+------+-------+-----+---------+---------+--------+
| 1| 102| 5| 1| 1800| 0|
| 1| 106| 3| 2| 3600| 3600|
| 2| 202| 2| 2| 1800| 3600|
| 2| 204| 3| 1| 0| 0|
| 3| 306| 5| 1| 0| 0|
| 7| 100| 0| 1| 3600| 3600|
| 7| 700| 0| 2| 1200| 3600|
+------+-------+-----+---------+---------+--------+
I hope this is helpful for you.
# you can use groupBy + the sum function
# filter the rows where userid == 1 or userid == 7 (filtering on movieid is not necessary because the duplicated groups have distinct userids)
import pyspark.sql.functions as F

duplicated_user_movie = df.filter((F.col('userid') == 1) | (F.col('userid') == 7))
# this assumes score, scoretype and duration are identical within each duplicated (userid, movieid) group
duplicated_sum = duplicated_user_movie.groupBy('userid', 'movieid', 'score', 'scoretype', 'duration').agg(F.sum('playtime').alias('playtime'))
duplicated_sum = duplicated_sum.select('userid', 'movieid', 'score', 'scoretype', 'playtime', 'duration')  # restore the original column order
df = df.filter((F.col('userid') != 1) & (F.col('userid') != 7))
df = df.union(duplicated_sum)  # concat the two dataframes (union is positional)
df = df.orderBy('userid')  # sort by userid
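If the duplicated (userid, movieid) pairs are not known up front, a hedged variant of the same idea could detect them with a window count instead of hardcoding the userid values; this is only a sketch, assuming the same column names as above:
from pyspark.sql import functions as F, Window

# flag rows whose (userid, movieid) pair occurs more than once
w = Window.partitionBy('userid', 'movieid')
flagged = df.withColumn('n', F.count('*').over(w))

duplicated_user_movie = flagged.filter(F.col('n') > 1).drop('n')
remaining = flagged.filter(F.col('n') == 1).drop('n')
# the aggregation and union then proceed exactly as above, starting from these two dataframes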
I'm preparing a dataset to train a machine learning model using PySpark. The dataframe I'm working on contains thousands of records about presences registered inside rooms of different buildings and cities on different days. These presences are in this format:
+----+--------+----+---+-----+------+--------+-------+---------+
|room|building|city|day|month|inHour|inMinute|outHour|outMinute|
+----+--------+----+---+-----+------+--------+-------+---------+
| 1| 1| 1| 9| 11| 8| 27| 13| 15|
| 1| 1| 1| 9| 11| 8| 28| 13| 5|
| 1| 1| 1| 9| 11| 8| 32| 13| 7|
| 1| 1| 1| 9| 11| 8| 32| 8| 50|
| 1| 1| 1| 9| 11| 8| 32| 8| 48|
+----+--------+----+---+-----+------+--------+-------+---------+
inHour and inMinute stand for the hour and minute of access and, of course, outHour and outMinute refer to the time of exit. The hours are in a 0-23 format.
All the columns contain just integer values.
What I'm missing is the target value of my machine learning model, which is the number of people for each combination of room, building, city, day, month and a time interval. To explain better: the first row refers to a presence with access time 8 and exit time 13, so it should be counted in the records for the intervals 8-9, 9-10, 10-11, 11-12, 12-13 and also 13-14.
What I want to accomplish is something like the following:
+----+--------+----+---+-----+------+-------+-----+
|room|building|city|day|month|timeIn|timeOut|count|
+----+--------+----+---+-----+------+-------+-----+
| 1| 1| 1| 9| 11| 8| 9| X|
| 1| 1| 1| 9| 11| 9| 10| X|
| 1| 1| 1| 9| 11| 10| 11| X|
| 1| 1| 1| 9| 11| 11| 12| X|
| 1| 1| 1| 9| 11| 12| 13| X|
+----+--------+----+---+-----+------+-------+-----+
So the 4th row of the first table should be counted in the 1st row of this table and so on...
You can explode a sequence of hours (e.g. the first row would have [8,9,10,11,12,13]), group by the hour (and other columns) and get the aggregate count for each group. Here hour refers to timeIn. I think it's not necessary to specify timeOut in the result dataframe because it's always timeIn + 1.
import pyspark.sql.functions as F
df2 = df.withColumn(
'hour',
F.explode(F.sequence('inHour', 'outHour'))
).groupBy(
'room', 'building', 'city', 'day', 'month', 'hour'
).count().orderBy('hour')
df2.show()
+----+--------+----+---+-----+----+-----+
|room|building|city|day|month|hour|count|
+----+--------+----+---+-----+----+-----+
| 1| 1| 1| 9| 11| 8| 5|
| 1| 1| 1| 9| 11| 9| 3|
| 1| 1| 1| 9| 11| 10| 3|
| 1| 1| 1| 9| 11| 11| 3|
| 1| 1| 1| 9| 11| 12| 3|
| 1| 1| 1| 9| 11| 13| 3|
+----+--------+----+---+-----+----+-----+
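If the result is still needed in the timeIn/timeOut shape shown in the question, a small follow-up sketch (reusing df2 from above; df3 is just a name I chose) can derive both bounds from the exploded hour, since timeOut is always timeIn + 1:
import pyspark.sql.functions as F

df3 = df2.withColumnRenamed('hour', 'timeIn') \
    .withColumn('timeOut', F.col('timeIn') + 1) \
    .select('room', 'building', 'city', 'day', 'month', 'timeIn', 'timeOut', 'count')
df3.show()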
To give an example, suppose we have a stream of user actions as follows:
from pyspark.sql import *
spark = SparkSession.builder.appName('test').master('local[8]').getOrCreate()
df = spark.sparkContext.parallelize([
Row(user=1, action=1, time=1),
Row(user=1, action=1, time=2),
Row(user=2, action=1, time=3),
Row(user=1, action=2, time=4),
Row(user=2, action=2, time=5),
Row(user=2, action=2, time=6),
Row(user=1, action=1, time=7),
Row(user=2, action=1, time=8),
]).toDF()
df.show()
The dataframe looks like:
+----+------+----+
|user|action|time|
+----+------+----+
| 1| 1| 1|
| 1| 1| 2|
| 2| 1| 3|
| 1| 2| 4|
| 2| 2| 5|
| 2| 2| 6|
| 1| 1| 7|
| 2| 1| 8|
+----+------+----+
Then, I want to add a column next_alt_time to each row, giving the time at which the user changes action type in the following rows. For the input above, the output should be:
+----+------+----+-------------+
|user|action|time|next_alt_time|
+----+------+----+-------------+
| 1| 1| 1| 4|
| 1| 1| 2| 4|
| 2| 1| 3| 5|
| 1| 2| 4| 7|
| 2| 2| 5| 8|
| 2| 2| 6| 8|
| 1| 1| 7| null|
| 2| 1| 8| null|
+----+------+----+-------------+
I know I can create a window like this:
wnd = Window().partitionBy('user').orderBy('time').rowsBetween(1, Window.unboundedFollowing)
But then I don't know how to impose a condition over that window and select the first row whose action differs from the current row's action.
Here's how to do it. Spark does not preserve the dataframe order, but if you check the rows one by one, you can confirm that it gives the expected answer:
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as F
df = spark.sparkContext.parallelize([
Row(user=1, action=1, time=1),
Row(user=1, action=1, time=2),
Row(user=2, action=1, time=3),
Row(user=1, action=2, time=4),
Row(user=2, action=2, time=5),
Row(user=2, action=2, time=6),
Row(user=1, action=1, time=7),
Row(user=2, action=1, time=8),
]).toDF()
win = Window().partitionBy('user').orderBy('time')
# flag rows where the action differs from the previous action of the same user
df = df.withColumn('new_action', F.lag('action').over(win) != F.col('action'))
# keep the time only on the rows where the action actually changed
df = df.withColumn('new_action_time', F.when(F.col('new_action'), F.col('time')))
# for each row, take the first such change time among the following rows
df = df.withColumn('next_alt_time', F.first('new_action_time', ignorenulls=True).over(win.rowsBetween(1, Window.unboundedFollowing)))
df.show()
+----+------+----+----------+---------------+-------------+
|user|action|time|new_action|new_action_time|next_alt_time|
+----+------+----+----------+---------------+-------------+
| 1| 1| 1| null| null| 4|
| 1| 1| 2| false| null| 4|
| 1| 2| 4| true| 4| 7|
| 1| 1| 7| true| 7| null|
| 2| 1| 3| null| null| 5|
| 2| 2| 5| true| 5| 8|
| 2| 2| 6| false| null| 8|
| 2| 1| 8| true| 8| null|
+----+------+----+----------+---------------+-------------+
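If only the four columns from the expected output are needed, the helper columns can simply be dropped afterwards; a minimal follow-up sketch (sorted by user and time for readability):
result = df.drop('new_action', 'new_action_time').orderBy('user', 'time')
result.show()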
Consider a data set with a ranking:
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 4|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while for colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the tied values. Is this achievable in a single pass with some particular ranking method?
Currently I am thinking of two passes, but that seems to require ordering the dataset on colA twice, once without a partition and once with one.
from pyspark.sql import functions as F, Window

#Step 1: normal rank (descending on colA, to match the table above)
df = df.withColumn("colA_rank", F.row_number().over(Window.orderBy(F.desc("colA"))))
#Step 2: resolve ties by averaging the rank within each colA group
df = df.withColumn("colA_rank_mean", F.mean("colA_rank").over(Window.partitionBy("colA")))
I am trying to assign a group ID to my table.
For each group in the "part" column, ordered by the "ord" column, the first id I encounter in the "id" column receives a new_id of 0; then, each time I encounter a different id, I increase new_id by 1.
Currently I need to use two window functions, so the process is quite slow.
df = sqlContext.createDataFrame(
[
(1,1,'X'),
(1,2,'X'),
(1,3,'X'),
(1,4,'Y'),
(1,5,'Y'),
(1,6,'Y'),
(1,7,'X'),
(1,8,'X'),
(2,1,'X'),
(2,2,'X'),
(2,3,'X'),
(2,4,'Y'),
(2,5,'Y'),
(2,6,'Y'),
(2,7,'X'),
(2,8,'X'),
],
["part", "ord", "id"]
)
from pyspark.sql import functions as F, Window

df.withColumn(
    "new_id",
    F.lag(F.col("id")).over(Window.partitionBy("part").orderBy("ord")) != F.col('id')
).withColumn(
    "new_id",
    F.sum(
        F.col("new_id").cast('int')
    ).over(Window.partitionBy("part").orderBy("ord"))
).na.fill(0).show()
+----+---+---+------+
|part|ord| id|new_id|
+----+---+---+------+
| 2| 1| X| 0|
| 2| 2| X| 0|
| 2| 3| X| 0|
| 2| 4| Y| 1|
| 2| 5| Y| 1|
| 2| 6| Y| 1|
| 2| 7| X| 2|
| 2| 8| X| 2|
| 1| 1| X| 0|
| 1| 2| X| 0|
| 1| 3| X| 0|
| 1| 4| Y| 1|
| 1| 5| Y| 1|
| 1| 6| Y| 1|
| 1| 7| X| 2|
| 1| 8| X| 2|
+----+---+---+------+
Can I achieve the same using only one window function?