I am trying to count the number of times the field "measure" has the value "M" between date 1 and date 2, for each pair (id, id_2).
Note: date 1 is when the measure is taken and date 2 is date 1 minus 5 days, since 5 days is the period over which we want to count the measures taken.
+----------+----------+---+----+-------+
|    date 1|    date 2| id|id_2|measure|
+----------+----------+---+----+-------+
|2016-10-25|2016-10-20|  1|   1|      M|
|2016-10-26|2016-10-21|  1|   5|       |
|2016-10-22|2016-10-17|  1|   1|      M|
|2016-10-30|2016-10-25|  1|   2|      M|
|2016-10-22|2016-10-17|  2|   1|      M|
|2016-10-26|2016-10-21|  2|   2|       |
|2016-10-18|2016-10-13|  2|   5|      M|
|2016-10-25|2016-10-20|  2|   1|       |
+----------+----------+---+----+-------+
And the desired output would be:
+----------+----------+---+----+-------+-----+
|    date 1|    date 2| id|id_2|measure|count|
+----------+----------+---+----+-------+-----+
|2016-10-25|2016-10-20|  1|   1|      M|    1|
|2016-10-26|2016-10-21|  1|   5|       |    0|
|2016-10-22|2016-10-17|  1|   1|      M|    0|
|2016-10-30|2016-10-25|  1|   2|      M|    0|
|2016-10-22|2016-10-17|  2|   1|      M|    0|
|2016-10-26|2016-10-21|  2|   2|       |    0|
|2016-10-18|2016-10-13|  2|   5|      M|    0|
|2016-10-25|2016-10-20|  2|   1|       |    1|
+----------+----------+---+----+-------+-----+
I believe window functions might be helpful, but I am not sure how to apply them to this problem. Previously I have solved this kind of problem with a subquery, but in PySpark that approach cannot be applied.
The query I have previously used is:
SELECT a.date1,
       a.date2,
       a.id,
       a.id_2,
       a.measure,
       (SELECT COUNT(data.[id]) AS CountMeasure
        FROM dataOrigin data
        WHERE data.id = a.id
          AND data.id_2 = a.id_2
          AND data.measure = 'M'
          AND data.date1 > a.date2
          AND data.date1 < a.date1) AS count
FROM dataOrigin a
If more info is needed, please let me know.
Thank you, everybody!
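For reference, one way this could be done with a window in PySpark (a minimal sketch, assuming a DataFrame df with the columns named date1/date2 as in the SQL query above and already date-typed):
from pyspark.sql import Window, functions as F

# Order each (id, id_2) partition by the day number of date1 so that a range
# frame can look back over the previous days. rangeBetween(-4, -1) mirrors the
# strict inequalities of the SQL query (data.date1 > a.date2 AND data.date1 < a.date1,
# with date2 being date1 minus 5 days).
w = (Window.partitionBy("id", "id_2")
     .orderBy(F.datediff(F.col("date1"), F.to_date(F.lit("1970-01-01"))))
     .rangeBetween(-4, -1))

result = df.withColumn(
    "count",
    F.count(F.when(F.col("measure") == "M", True)).over(w)
)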
Related
I am trying to count consecutive values that appear in a column with PySpark. I have the column "a" in my dataframe and expect to create the column "b".
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create the column "b" with the lag function over a window, but without success.
from pyspark.sql import Window, functions as f

w = Window\
    .partitionBy(df.some_id)\
    .orderBy(df.timestamp_column)

df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
I could resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
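This works here because each value of "a" appears in a single consecutive run. If the same value could reappear later in a non-adjacent run, a more general sketch (assuming the same some_id and timestamp_column names as above) is to first mark run boundaries with lag and then number the rows within each run:
from pyspark.sql import Window, functions as f

w = Window.partitionBy("some_id").orderBy("timestamp_column")

df_runs = (
    df
    # 1 whenever the value of "a" changes (or on the first row), else 0
    .withColumn("new_run",
                f.when(f.lag("a").over(w).isNull() |
                       (f.lag("a").over(w) != f.col("a")), 1).otherwise(0))
    # cumulative sum of run starts gives an id per consecutive block
    .withColumn("run_id", f.sum("new_run").over(w))
    # row number inside each run is the consecutive count
    .withColumn("b",
                f.row_number().over(
                    Window.partitionBy("some_id", "run_id")
                          .orderBy("timestamp_column")))
    .drop("new_run", "run_id")
)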
I have an RDD from which I need to extract counts of multiple events. The initial RDD looks like this:
+----------+--------------------+-------------------+
| event| user| day|
+----------+--------------------+-------------------+
|event_x |user_A | 0|
|event_y |user_A | 2|
|event_x |user_B | 2|
|event_y |user_B | 1|
|event_x |user_A | 0|
|event_x |user_B | 1|
|event_y |user_B | 2|
|event_y |user_A | 1|
+----------+--------------------+-------------------+
I need a count column for each type of event (in this case 2 types of events: event_x and event_y), grouped by user and day. So far, I have managed to do it with only one event, resulting in the following:
+--------------------+-------------------+------------+
| user| day|count(event)|
+--------------------+-------------------+------------+
|user_A | 0| 11|
|user_A | 1| 8|
|user_A | 2| 4|
|user_B | 0| 2|
|user_B | 1| 1|
|user_B | 2| 25|
+--------------------+-------------------+------------+
But I need arbitrarily many count columns, one for each distinct event that appears in the leftmost column of the first RDD displayed above. So, if I only had 2 events (x and y), it should look something like this:
+--------------------+-------------------+--------------+--------------+
| user| day|count(event_x)|count(event_y)|
+--------------------+-------------------+--------------+--------------+
|user_A | 0| 11| 3|
|user_A | 1| 8| 23|
|user_A | 2| 4| 2|
|user_B | 0| 2| 0|
|user_B | 1| 1| 1|
|user_B | 2| 25| 11|
+--------------------+-------------------+--------------+--------------+
The code I have currently is:
rdd = rdd.groupby('user', 'day').agg({'event': 'count'}).orderBy('user', 'day')
What should I do to achieve the desired result?
Thanks in advance ;)
You can try groupBy with the pivot option:
df = spark.createDataFrame(
    [["event_x", "user_A", 0], ["event_y", "user_A", 2],
     ["event_x", "user_B", 2], ["event_y", "user_B", 1],
     ["event_x", "user_A", 0], ["event_x", "user_B", 1],
     ["event_y", "user_B", 2], ["event_y", "user_A", 1]],
    ["event", "user", "day"])
>>> df.show()
+-------+------+---+
| event| user|day|
+-------+------+---+
|event_x|user_A| 0|
|event_y|user_A| 2|
|event_x|user_B| 2|
|event_y|user_B| 1|
|event_x|user_A| 0|
|event_x|user_B| 1|
|event_y|user_B| 2|
|event_y|user_A| 1|
+-------+------+---+
>>> df.groupBy(["user","day"]).pivot("event").agg({"event":"count"}).show()
+------+---+-------+-------+
| user|day|event_x|event_y|
+------+---+-------+-------+
|user_A| 0| 2| null|
|user_B| 1| 1| 1|
|user_A| 2| null| 1|
|user_A| 1| null| 1|
|user_B| 2| 1| 1|
+------+---+-------+-------+
Please have a look and let me know if you have any doubts. As a general pattern:
temp = temp.groupBy("column_name").pivot("bucket_column_name").agg({"bucket_column_name": "count"})
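To match the desired output above exactly (zeros instead of nulls and sorted rows), a small follow-up sketch on the same df; the explicit value list passed to pivot is optional but avoids an extra pass over the data:
result = (df.groupBy("user", "day")
            .pivot("event", ["event_x", "event_y"])  # explicit pivot values
            .count()
            .na.fill(0)       # 0 instead of null where an event never occurred
            .orderBy("user", "day"))
result.show()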
I have a dataframe my_df that contains 4 columns:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| josh| wanadoo.fr| 1| 15|
| josh| random.it| 0| 12|
| samantha| wanadoo.fr| 1| 16|
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| somesite.net| 0| 20|
| dylan| yolosite.net| 0| 49|
| dylan| random.it| 0| 3|
| don| vodafone.it| 1| 39|
| don| popsugar.com| 0| 10|
| don| fabio.com| 1| 49|
+----------------+---------------+--------+---------+
This is what I'm planning to do:
Find all the user_id where the maximum frequency domain with isp_flag=0 has a frequency that is less than 25% of the maximum frequency domain with isp_flag=1.
So, in the example that I have above, my output_df would look like:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| yolosite.net| 0| 49|
| don| fabio.com| 1| 49|
| don| popsugar.com| 0| 10|
+----------------+---------------+--------+---------+
I believe I need window functions to do this, so I tried the following to first find the maximum frequency domains for isp_flag=0 and isp_flag=1 respectively, for each user_id:
>>> win_1 = Window().partitionBy("user_id", "domain", "isp_flag").orderBy((col("frequency").desc()))
>>> final_df = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank")==1)
>>> final_df.show(5) # this just gives me the original dataframe back
What am I doing wrong here? How do I get to the final output_df I printed above?
IIUC, you can try the following: calculate the maximum frequencies (max_0, max_1) for each user, over rows having isp_flag == 0 or 1 respectively, and then filter by the condition max_0 < 0.25*max_1 plus frequency in (max_1, max_0) to select only the records with the maximum frequency.
from pyspark.sql import Window, functions as F
# set up the Window to calculate max_0 and max_1 for each user
# having isp_flag = 0 and 1 respectively
w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1))\
.withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1))\
.where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)') \
.show()
+-------+------------+--------+---------+-----+-----+
|user_id| domain|isp_flag|frequency|max_1|max_0|
+-------+------------+--------+---------+-----+-----+
| don|popsugar.com| 0| 10| 49| 10|
| don| fabio.com| 1| 49| 49| 10|
| dylan| vodafone.it| 1| 448| 448| 49|
| dylan|yolosite.net| 0| 49| 448| 49|
| bob| eidsiva.net| 1| 5| 5| 1|
| bob| media.net| 0| 1| 5| 1|
+-------+------------+--------+---------+-----+-----+
Some Explanations per request:
The WindowSpec w1 is set to examine all records for the same user (partitionBy), so that the F.max() function compares all rows for that user.
We use IF(isp_flag==1, frequency, NULL) to pick the frequency for rows having isp_flag==1; it returns NULL when isp_flag is not 1, and NULLs are skipped by F.max(). This is a SQL expression, so we need F.expr() to run it.
F.max(...).over(w1) takes the max value of the result of the above SQL expression; this calculation is based on the Window w1.
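If you need output_df to have only the original four columns, the helper columns can be dropped at the end; a sketch reusing the same chain as above:
from pyspark.sql import Window, functions as F

w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding,
                                               Window.unboundedFollowing)

output_df = (df
    .withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1))
    .withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1))
    .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)')
    .drop('max_1', 'max_0'))   # keep only user_id, domain, isp_flag, frequency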
I have a dataframe like
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 31-05-2019|    6|
|      2| 31-01-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
Entries for some months are missing, so I forward-fill them with window operations:
from pyspark.sql import Window
from pyspark.sql.functions import last

window = Window.partitionBy(['TRADEID'])\
               .orderBy('time_period')\
               .rowsBetween(Window.unboundedPreceding, 0)

filled_column = last(df['value'], ignorenulls=True).over(window)
df_filled = df.withColumn('value_filled', filled_column)
After this I get
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 28-02-2019|    5|
|      1| 31-03-2019|    5|
|      1| 30-04-2019|    5|
|      1| 31-05-2019|    6|
|      2| 31-01-2019|   15|
|      2| 28-02-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
However, I do not want to fill a month if the gap between this month and the last available month is more than 2 months. For example, in my case I want to get
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 28-02-2019|    5|
|      1| 31-03-2019|    5|
|      1| 30-04-2019| null|
|      1| 31-05-2019|    6|
|      2| 31-01-2019|   15|
|      2| 28-02-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
How can I do this?
Should I have one more window operation that removes, beforehand, the entries for months that should not be filled? If that is a good way, please help me with the code; I am not good at window operations.
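One possible approach (a minimal sketch, assuming time_period is a string in dd-MM-yyyy format and that the missing months have already been inserted as rows with a null value, as in the tables above) is to also track the date of the last non-null value and keep the fill only when the gap is at most 2 months:
from pyspark.sql import Window, functions as F

# Parse the string dates so that ordering and the month gap are computed correctly.
df = df.withColumn("period_date", F.to_date("time_period", "dd-MM-yyyy"))

w = (Window.partitionBy("TRADEID")
     .orderBy("period_date")
     .rowsBetween(Window.unboundedPreceding, 0))

df_filled = (
    df
    # last non-null value so far (plain forward fill)
    .withColumn("last_value", F.last("value", ignorenulls=True).over(w))
    # date of the row that contributed that last non-null value
    .withColumn("last_value_date",
                F.last(F.when(F.col("value").isNotNull(), F.col("period_date")),
                       ignorenulls=True).over(w))
    # keep the fill only if the gap to that row is at most 2 months
    .withColumn("value_filled",
                F.when(F.months_between("period_date", "last_value_date") <= 2,
                       F.col("last_value")))
    .drop("last_value", "last_value_date")
)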
I have a PySpark dataframe that looks like:
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
I want to duplicate each row n number of times where n = day2 - day1. The resulting dataframe would look like:
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 1| 2| 4|
| 1| 2| 4|
| 2| 1| 2|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
How can I do this?
Here is one way to do that.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(ArrayType(StringType()))
def gen_array(day1, day2):
    # one (empty) element per desired copy of the row
    return ['' for i in range(day2 - day1 + 1)]

df.withColumn(
    "dup",
    F.explode(gen_array(F.col("day1"), F.col("day2")))
).drop("dup").show()
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 1| 2| 4|
| 1| 2| 4|
| 2| 1| 2|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
Another option using rdd.flatMap:
df.rdd.flatMap(lambda r: [r] * (r.day2 - r.day1 + 1)).toDF().show()
+---+----+----+
| id|day1|day2|
+---+----+----+
| 1| 2| 4|
| 1| 2| 4|
| 1| 2| 4|
| 2| 1| 2|
| 2| 1| 2|
| 3| 3| 3|
+---+----+----+
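For completeness, a non-UDF sketch that should work on Spark 2.4+ (assuming the same df with integer day1 and day2 columns), using the built-in sequence function:
from pyspark.sql import functions as F

# sequence(1, day2 - day1 + 1) builds an array with one element per desired copy;
# exploding it duplicates the row, and the helper column is dropped afterwards.
(df.withColumn("n", F.expr("explode(sequence(1, day2 - day1 + 1))"))
   .drop("n")
   .show())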