PySpark SQL Dataframe identify and store consecutive matching value rows incrementally - python

I have a boolean column in a PySpark dataframe with runs of consecutive true and false values, and I would like to create another column showing the "occurrence #" of each run of consecutive TRUE values.
Bool Column I have | Column I want to create
True               | 1
True               | 1
True               | 1
True               | 1
False              | 0
False              | 0
True               | 2
True               | 2
False              | 0
True               | 3
True               | 3
I would like the TRUE rows to increment and FALSE to stay at 0.
I am fairly new to pyspark and have only been able to succeed at doing this by converting to a pandas dataframe. If you could give me a pyspark solution or guidance this would be much appreciated.
The data is ordered by a date column not pictured in the table above.

You can add a helper column begin that flags the start of each run of TRUE values, then do a rolling sum over it with a window function:
Sample dataframe:
df.show()
+---+-----+
| ts| col|
+---+-----+
| 0| true|
| 1| true|
| 2| true|
| 3| true|
| 4|false|
| 5|false|
| 6| true|
| 7| true|
| 8|false|
| 9| true|
| 10| true|
+---+-----+
Code:
from pyspark.sql import Window, functions as F

w = Window.orderBy('ts')

df2 = df.withColumn(
    'begin',
    F.coalesce(
        (F.lag('col').over(w) != F.col('col')) & F.col('col'),
        F.lit(True)  # take care of first row where lag is null
    ).cast('int')
).withColumn(
    'newcol',
    F.when(
        F.col('col'),
        F.sum('begin').over(w)
    ).otherwise(0)
)
df2.show()
+---+-----+-----+------+
| ts| col|begin|newcol|
+---+-----+-----+------+
| 0| true| 1| 1|
| 1| true| 0| 1|
| 2| true| 0| 1|
| 3| true| 0| 1|
| 4|false| 0| 0|
| 5|false| 0| 0|
| 6| true| 1| 2|
| 7| true| 0| 2|
| 8|false| 0| 0|
| 9| true| 1| 3|
| 10| true| 0| 3|
+---+-----+-----+------+
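Since your real data is ordered by a date column rather than ts, the only change is the ordering key of the window. A minimal sketch, assuming that column is simply called date and the boolean column is called col, dropping the helper column at the end:
# minimal sketch; `date` and `col` are assumed column names
w = Window.orderBy('date')

result = df.withColumn(
    'begin',
    F.coalesce(
        (F.lag('col').over(w) != F.col('col')) & F.col('col'),
        F.lit(True)
    ).cast('int')
).withColumn(
    'newcol',
    F.when(F.col('col'), F.sum('begin').over(w)).otherwise(0)
).drop('begin')
Note that a Window.orderBy without partitionBy pulls all rows into a single partition, which is fine for small data but worth keeping in mind at scale.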

Related

Apply value to a dataframe after filtering another dataframe based on conditions

My question:
There are two dataframes, and the info one is still being built.
What I want to do is filter the reference dataframe based on a condition: when key is b, take its value (2) and apply it to the info table as a whole new column.
The output dataframe below is the final result I want.
Dataframe (info)
+-----+-----+
| key|value|
+-----+-----+
| a| 10|
| b| 20|
| c| 50|
| d| 40|
+-----+-----+
Dataframe (Reference)
+-----+-----+
| key|value|
+-----+-----+
| a| 42|
| b| 2|
| c| 9|
| d| 100|
+-----+-----+
Below is the output I want:
Dataframe (Output)
+-----+-----+-----+
| key|value|const|
+-----+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+-----+-----+-----+
I have tried several methods; the latest one is below, but the system warned me that PySpark does not have a loc function.
df_cal = (
    info
    .join(reference)
    .withColumn('const', reference.loc[reference['key'] == 'b', 'value'].iloc[0])
    .select('key', 'result', 'const')
)
df_cal.show()
And below is the warning the system returned:
AttributeError: 'DataFrame' object has no attribute 'loc'
This solves it:
from pyspark.sql.functions import lit

# df1 is the info dataframe, df2 is the reference dataframe
target = 'b'
const = [row['value'] for row in df2.collect() if row['key'] == target]
df_cal = df1.withColumn('const', lit(const[0]))
df_cal.show()
+---+-----+-----+
|key|value|const|
+---+-----+-----+
| a| 10| 2|
| b| 20| 2|
| c| 50| 2|
| d| 40| 2|
+---+-----+-----+
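Collecting the whole reference dataframe works for small data, but you can also pull just the single matching row. A sketch of that, assuming the info/reference names from the question:
from pyspark.sql.functions import lit

# look up the value for key 'b' in the reference dataframe,
# then attach it to every row of info as a literal column
row = reference.filter(reference['key'] == 'b').select('value').first()
df_cal = info.withColumn('const', lit(row['value']))
df_cal.show()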

Split count results of different events into different columns in pyspark

I have an rdd from which I need to extract counts of multiple events. The initial rdd looks like this:
+----------+--------------------+-------------------+
| event| user| day|
+----------+--------------------+-------------------+
|event_x |user_A | 0|
|event_y |user_A | 2|
|event_x |user_B | 2|
|event_y |user_B | 1|
|event_x |user_A | 0|
|event_x |user_B | 1|
|event_y |user_B | 2|
|event_y |user_A | 1|
+----------+--------------------+-------------------+
I need a count column for each type of event (in this case 2 types of events: event_x and event_y), grouped by user and day. So far, I managed to do it with only one event, resulting in the following:
+--------------------+-------------------+------------+
| user| day|count(event)|
+--------------------+-------------------+------------+
|user_A | 0| 11|
|user_A | 1| 8|
|user_A | 2| 4|
|user_B | 0| 2|
|user_B | 1| 1|
|user_B | 2| 25|
+--------------------+-------------------+------------+
But I need arbitrarily many columns, one per distinct event that appears in the leftmost column of the first rdd displayed above. So, if I only had 2 events (x and y), it should be something like this:
+--------------------+-------------------+--------------+--------------+
| user| day|count(event_x)|count(event_y)|
+--------------------+-------------------+--------------+--------------+
|user_A | 0| 11| 3|
|user_A | 1| 8| 23|
|user_A | 2| 4| 2|
|user_B | 0| 2| 0|
|user_B | 1| 1| 1|
|user_B | 2| 25| 11|
+--------------------+-------------------+--------------+--------------+
The code I have currently is:
rdd = rdd.groupby('user', 'day').agg({'event': 'count'}).orderBy('user', 'day')
What should I do to achieve the desired result?
Thanks in advance ;)
You can try groupBy with the pivot option:
>>> df = spark.createDataFrame([["event_x","user_A",0],["event_y","user_A",2],["event_x","user_B",2],["event_y","user_B",1],["event_x","user_A",0],["event_x","user_B",1],["event_y","user_B",2],["event_y","user_A",1]],["event","user","day"])
>>> df.show()
+-------+------+---+
| event| user|day|
+-------+------+---+
|event_x|user_A| 0|
|event_y|user_A| 2|
|event_x|user_B| 2|
|event_y|user_B| 1|
|event_x|user_A| 0|
|event_x|user_B| 1|
|event_y|user_B| 2|
|event_y|user_A| 1|
+-------+------+---+
>>> df.groupBy(["user","day"]).pivot("event").agg({"event":"count"}).show()
+------+---+-------+-------+
| user|day|event_x|event_y|
+------+---+-------+-------+
|user_A| 0| 2| null|
|user_B| 1| 1| 1|
|user_A| 2| null| 1|
|user_A| 1| null| 1|
|user_B| 2| 1| 1|
+------+---+-------+-------+
Please have a look and let me know if you have any doubts.
temp = temp.groupBy("column_name").pivot("bucket_column_name").agg({"bucket_column_name": "count"})
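One note on the pivoted output above: it shows null where a user/day combination has no rows for an event. If you want zeros instead, as in the desired output, a small follow-up on the same dataframe could be:
# replace missing counts with 0 and order the result for readability
df.groupBy("user", "day").pivot("event").agg({"event": "count"})\
  .fillna(0).orderBy("user", "day").show()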

PySpark: Using Window Functions to roll-up dataframe

I have a dataframe my_df that contains 4 columns:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| josh| wanadoo.fr| 1| 15|
| josh| random.it| 0| 12|
| samantha| wanadoo.fr| 1| 16|
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| somesite.net| 0| 20|
| dylan| yolosite.net| 0| 49|
| dylan| random.it| 0| 3|
| don| vodafone.it| 1| 39|
| don| popsugar.com| 0| 10|
| don| fabio.com| 1| 49|
+----------------+---------------+--------+---------+
This is what I'm planning to do:
Find all the user_id where the maximum frequency domain with isp_flag=0 has a frequency that is less than 25% of the maximum frequency domain with isp_flag=1.
So, in the example that I have above, my output_df would look like:
+----------------+---------------+--------+---------+
| user_id| domain|isp_flag|frequency|
+----------------+---------------+--------+---------+
| bob| eidsiva.net| 1| 5|
| bob| media.net| 0| 1|
| dylan| vodafone.it| 1| 448|
| dylan| yolosite.net| 0| 49|
| don| fabio.com| 1| 49|
| don| popsugar.com| 0| 10|
+----------------+---------------+--------+---------+
I believe I need window functions to do this, and so I tried the following to first find the maximum frequency domains for isp_flag=0 and isp_flag=1 respectively, for each of the user_id-
>>> win_1 = Window().partitionBy("user_id", "domain", "isp_flag").orderBy((col("frequency").desc()))
>>> final_df = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank")==1)
>>> final_df.show(5) # this just gives me the original dataframe back
What am I doing wrong here? How do I get to the final output_df I printed above?
IIUC, you can try the following: calculate the max frequencies (max_0, max_1) for each user over rows having isp_flag == 0 or 1 respectively, then filter by the condition max_0 < 0.25*max_1 plus frequency in (max_1, max_0) to keep only the records with the maximum frequency.
from pyspark.sql import Window, functions as F

# set up the Window to calculate max_0 and max_1 for each user
# having isp_flag = 0 and 1 respectively
w1 = Window.partitionBy('user_id').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn('max_1', F.max(F.expr("IF(isp_flag==1, frequency, NULL)")).over(w1))\
  .withColumn('max_0', F.max(F.expr("IF(isp_flag==0, frequency, NULL)")).over(w1))\
  .where('max_0 < 0.25*max_1 AND frequency in (max_1, max_0)')\
  .show()
+-------+------------+--------+---------+-----+-----+
|user_id| domain|isp_flag|frequency|max_1|max_0|
+-------+------------+--------+---------+-----+-----+
| don|popsugar.com| 0| 10| 49| 10|
| don| fabio.com| 1| 49| 49| 10|
| dylan| vodafone.it| 1| 448| 448| 49|
| dylan|yolosite.net| 0| 49| 448| 49|
| bob| eidsiva.net| 1| 5| 5| 1|
| bob| media.net| 0| 1| 5| 1|
+-------+------------+--------+---------+-----+-----+
Some Explanations per request:
The WindowSpec w1 is set to examine all records for the same user (partitionBy), so that the F.max() function compares all rows belonging to that user.
We use IF(isp_flag==1, frequency, NULL) to pick up the frequency of rows having isp_flag==1; it returns NULL when isp_flag is not 1, and NULLs are skipped by F.max(). This is a SQL expression, so we need the F.expr() function to run it.
F.max(...).over(w1) takes the max value of the result of the above SQL expression; this calculation is based on the Window w1.
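As for the original attempt: partitioning the window by user_id, domain and isp_flag makes every partition essentially a single row, so rank() is always 1 and the filter keeps the whole dataframe. A sketch of that attempt fixed to partition only by user_id and isp_flag (this isolates the per-user max rows, but you would still need the 25% comparison on top):
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

# one partition per (user_id, isp_flag); the highest-frequency domain in each gets rank 1
win_1 = Window.partitionBy("user_id", "isp_flag").orderBy(col("frequency").desc())
max_rows = my_df.select("*", rank().over(win_1).alias("rank")).filter(col("rank") == 1)
max_rows.show()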

pyspark count not null values for pairs in two column within group

I have some data like this
A B C
1 Null 3
1 2 4
2 Null 6
2 2 Null
2 1 2
3 Null 4
and I want to group by A and then calculate the number of rows that don't contain a Null value. So, the result should be:
A count
1 1
2 1
3 0
I don't think this will work..., does it?
df.groupby('A').agg(count('B','C'))
Personally, I would use an auxiliary column saying whether B or C is Null: when either is Null the row gets 0, otherwise 1. Then sum that column per group.
from pyspark.sql.functions import sum, when
# ...
df.withColumn("isNotNull", when(df.B.isNull() | df.C.isNull(), 0).otherwise(1))\
  .groupBy("A").agg(sum("isNotNull"))
Demo:
df.show()
# +---+----+----+
# | _1| _2| _3|
# +---+----+----+
# | 1|null| 3|
# | 1| 2| 4|
# | 2|null| 6|
# | 2| 2|null|
# | 2| 1| 2|
# | 3|null| 4|
# +---+----+----+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1)).show()
# +---+----+----+---------+
# | _1| _2| _3|isNotNull|
# +---+----+----+---------+
# | 1|null| 3| 0|
# | 1| 2| 4| 1|
# | 2|null| 6| 0|
# | 2| 2|null| 0|
# | 2| 1| 2| 1|
# | 3|null| 4| 0|
# +---+----+----+---------+
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1))\
.groupBy("_1").agg(sum("isNotNull")).show()
# +---+--------------+
# | _1|sum(isNotNull)|
# +---+--------------+
# | 1| 1|
# | 3| 0|
# | 2| 1|
# +---+--------------+
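If you want the output column named count as in the expected result, you can alias the aggregate, e.g.:
df.withColumn("isNotNull", when(df._2.isNull() | df._3.isNull(), 0).otherwise(1))\
  .groupBy("_1").agg(sum("isNotNull").alias("count")).show()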
You can drop rows that contain null values and then groupby + count:
df.select('A').dropDuplicates().join(
    df.dropna(how='any').groupby('A').count(), on=['A'], how='left'
).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| null|
| 2| 1|
+---+-----+
If you don't want to do the join, create another column to indicate whether there is null in columns B or C:
import pyspark.sql.functions as f
df.selectExpr('*',
    'case when B is not null and C is not null then 1 else 0 end as D'
).groupby('A').agg(f.sum('D').alias('count')).show()
+---+-----+
| A|count|
+---+-----+
| 1| 1|
| 3| 0|
| 2| 1|
+---+-----+
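Another option, just as a sketch, relies on count() ignoring nulls: build a marker that is non-null only when both B and C are present, then count it per group:
import pyspark.sql.functions as f

# f.count() skips nulls, so the marker is only counted when both B and C are present
df.groupby('A').agg(
    f.count(f.when(f.col('B').isNotNull() & f.col('C').isNotNull(), 1)).alias('count')
).show()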

Subtract 2 pyspark dataframes based on column

I have 2 pyspark dataframes,
i
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 2| 456|
| 3| 111|
| 4| 678|
+---+-----+
j
+----+-----+
|ID_B|COL_B|
+----+-----+
| 2| 456|
| 3| 111|
| 4| 876|
+----+-----+
I'm trying to subtract j from i based on the values of a particular column, i.e., keep only the rows of i whose COL_A value is not present in COL_B of j.
Expected output should be,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 4| 678|
+---+-----+
This is my code,
common = i.join(j.withColumnRenamed('COL_B', 'COL_A'), ['COL_A'], 'leftsemi')
diff = i.subtract(common)
diff.show()
But the output comes out wrong:
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 2| 456|
| 1| 123|
| 4| 678|
| 3| 111|
+---+-----+
Am I doing something wrong here? Thanks in advance.
Try:
left_join = i.join(j, j.COL_B == i.COL_A, how='left')
left_join.filter(left_join.COL_B.isNull()).show()
If you have the column names as variables, you can do:
left_join = i.join(j, j[colb] == i[cola], how='left')
left_join.filter(left_join[colb].isNull()).show()
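An arguably more direct alternative, as a sketch, is a left_anti join, which keeps exactly the rows of i that have no match in j:
# left_anti keeps the rows of i whose COL_A has no matching COL_B in j
diff = i.join(j, i.COL_A == j.COL_B, how='left_anti')
diff.show()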
