Can the pySpark lag function reference itself? - python

I am looking for a way to grow a cumulative value in a column using the lag function in pySpark: first fetch the previous value in the column, then add to it. This is failing, presumably because the column can't reference itself before it exists. Is there a way around this?

Maybe something like this is what you are looking for?
from pyspark.sql import Window as W
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12)
    ], ['id', 'value'])
# Frame from the first row up to the current row, ordered by id
w = W.orderBy('id').rowsBetween(W.unboundedPreceding, 0)
df\
    .withColumn('cumul_sum', F.sum(F.col('value')).over(w))\
    .show()
+---+-----+---------+
| id|value|cumul_sum|
+---+-----+---------+
|  1|   20|       20|
|  2|   34|       54|
|  3|   12|       66|
+---+-----+---------+
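To see why a running-sum window can stand in for a self-referencing lag, here is a plain-Python sketch (not Spark code) using the three values from the example. The recurrence "previous cumulative value plus current value" unrolls to the sum of all values up to the current row, which is what the unboundedPreceding-to-current-row frame computes.

```python
from itertools import accumulate

# Values from the example DataFrame, already in 'id' order.
values = [20, 34, 12]

# A self-referencing lag would need cumul[i] = cumul[i - 1] + values[i].
# Unrolled, that is the sum of values[0..i], i.e. a running sum.
cumul_sum = list(accumulate(values))

print(cumul_sum)  # [20, 54, 66]
```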

Related

How to do a cumsum in a lambda call using PySpark

I am trying to replicate some Python code using PySpark, and I have run into a problem. This is the code I am trying to replicate:
df_act = (df_act.assign(n_cycles = (lambda x: (x.cycles_bol != x.cycles_bol.shift(1)).cumsum())))
Keep in mind that I am working with a DataFrame, and that cycles_bol is a column of the DataFrame "df_act".
I simply can't get it to work. The closest I think I have gotten to a solution is the following:
df_act=df_act.withColumn(
"grp",
when(df_act['cycles_bol'] == lead("cycles_bol").over(Window.partitionBy("user_id").orderBy("timestamp")),0).otherwise(1).over(Window.orderBy("timestamp"))
).drop("grp").show()
Can anyone please help me?
Thanks in advance!
You didn't give much information.
You have to order the rows, use lag to check whether consecutive cycles_bol values are the same, and conditionally add. Use an existing column in orderBy if it won't change the order of cycles_bol. If you don't have such a column, generate one using the monotonically_increasing_id function like I did.
from pyspark.sql import Window
# Note: this `sum` is the Spark aggregate and shadows the Python builtin
from pyspark.sql.functions import col, lag, monotonically_increasing_id, sum, when
w = Window.orderBy('id')
(df_act
    .withColumn('id', monotonically_increasing_id())
    .withColumn('n_cycles', sum(when(lag('cycles_bol').over(w) != col('cycles_bol'), 1).otherwise(0)).over(w))
    .drop('id')
    .show())
+----------+--------+
|cycles_bol|n_cycles|
+----------+--------+
|         A|       0|
|         B|       1|
|         B|       1|
|         A|       2|
|         B|       3|
|         A|       4|
|         A|       4|
|         B|       5|
|         C|       6|
+----------+--------+
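The lag-and-sum logic can be checked by hand with a plain-Python sketch (not Spark code). The helper name count_changes is made up for illustration, and the input is the cycles_bol column from the output above.

```python
def count_changes(values):
    """Running count of positions where the value differs from the previous one."""
    n_cycles = []
    changes = 0
    prev = None
    for i, value in enumerate(values):
        # The first row has no previous value, so it never counts as a change
        # (in Spark, lag returns null there and the `when` falls through to 0).
        if i > 0 and value != prev:
            changes += 1
        n_cycles.append(changes)
        prev = value
    return n_cycles


cycles_bol = ["A", "B", "B", "A", "B", "A", "A", "B", "C"]
n_cycles = count_changes(cycles_bol)
print(n_cycles)  # [0, 1, 1, 2, 3, 4, 4, 5, 6]
```

One difference to be aware of: in the pandas expression, the first row is compared against NaN, which counts as "different", so the pandas result starts at 1 rather than 0. Adding 1 to n_cycles should line the two up if that matters.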

Is there a way to add a column with range of values to a Spark Dataframe?

I have a spark dataframe: df1 as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10|     1|
| 11|     2|
| 13|     3|
+---+------+
None of the columns in my dataframe contains unique values.
I tried to use F.monotonically_increasing_id(), but it just produces seemingly random numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am worried about using windowing functions to generate row numbers.
So, is there a way I can add a column with row_count to the dataframe that gives:
+---+------+
|age|col_id|
+---+------+
| 10|     1|
| 11|     2|
| 13|     3|
+---+------+
If a windowing function is the only way to implement this, how can I make sure all the data ends up in a single partition?
Or, if there is a way to implement the same thing without windowing functions, how do I do it?
Any help is appreciated.
Use zipWithIndex.
I could not find the code I wrote myself in the past, as I was busy working on issues yesterday, but here is a good post that explains it: https://sqlandhadoop.com/pyspark-zipwithindex-example/
The PySpark version differs from the Scala one.
The other answer is not good for performance, since it sends everything to a single executor. zipWithIndex is a narrow transformation, so it works per partition.
Here goes; you can tailor it accordingly:
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F
df1 = spark.createDataFrame([ ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4') ], StringType())
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
|  abc|    0|
|    2|    1|
|    3|    2|
|    4|    3|
|  abc|    4|
|    2|    5|
|    3|    6|
|    4|    7|
|  abc|    8|
|    2|    9|
|    3|   10|
|    4|   11|
+-----+-----+
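For intuition on why zipWithIndex can hand out consecutive numbers without pulling everything into one partition, here is a plain-Python sketch (not Spark code). The three-way partition split and the helper name zip_with_index are assumptions made up for illustration; the point is only that each partition needs to know the total size of the partitions before it.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Index items partition by partition, offsetting by earlier partition sizes."""
    sizes = [len(part) for part in partitions]
    # Start offset of each partition = total number of items in earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Hypothetical split of the 12 example values into 3 partitions.
partitions = [["abc", "2", "3", "4"]] * 3
indexed = zip_with_index(partitions)
indices = [idx for part in indexed for _, idx in part]
print(indices)  # [0, 1, 2, ..., 11]
```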
Assumption: This answer is based on the assumption that the order of col_id should depend on the age column. If that assumption does not hold, the other suggested solution is zipWithIndex, as mentioned in the comments on the question. An example usage of zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and the row number to get the expected numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
'col_id',
F.row_number().over(windowSpec)
)
[EDIT] Add assumption of requirements and reference to alternative solution.
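As a rough plain-Python picture of what row_number over an ordered, unpartitioned window produces (a sketch, not Spark code): sort the values, then number them starting at 1. Note that age is a string column in the question, so the ordering is by string comparison rather than numeric value.

```python
ages = ["10", "11", "13"]

# Sort by age (as strings, like the column in the question), then number from 1.
col_id = [(age, i) for i, age in enumerate(sorted(ages), start=1)]
print(col_id)  # [('10', 1), ('11', 2), ('13', 3)]

# String ordering caveat: "9" sorts after "10".
print(sorted(["9", "10"]))  # ['10', '9']
```

If numeric ordering is what you want, casting age to an integer in the orderBy is probably the safer choice.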

Pyspark - Using two time indices for window function

I have a dataframe where each row has two date columns. I would like to create a window function with a range between that counts the number of rows in a particular range, where BOTH date columns are within the range. In the case below, both timestamps of a row must be before the timestamp of the current row, to be included in the count.
Example df including the count column:
+---+-----------+-----------+-----+
| ID|Timestamp_1|Timestamp_2|Count|
+---+-----------+-----------+-----+
|  a|          0|          3|    0|
|  b|          2|          5|    0|
|  d|          5|          5|    3|
|  c|          5|          9|    3|
|  e|          8|         10|    4|
+---+-----------+-----------+-----+
I tried creating two windows and creating the new column over both of these:
w_1 = Window.partitionBy().orderBy('Timestamp_1').rangeBetween(Window.unboundedPreceding, 0)
w_2 = Window.partitionBy().orderBy('Timestamp_2').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('count', F.count('ID').over(w_1).over(w_2))
However, this is not allowed in Pyspark and therefore gives an error.
Any ideas? Solutions in SQL are also fine!
Would a self-join work?
from pyspark.sql import functions as F
df_count = (
    df.alias('a')
    .join(
        df.alias('b'),
        (F.col('b.Timestamp_1') <= F.col('a.Timestamp_1')) &
        (F.col('b.Timestamp_2') <= F.col('a.Timestamp_2')),
        'left'
    )
    .groupBy('a.ID')
    .agg(F.count('b.ID').alias('count'))
)
df = df.join(df_count, 'ID')
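To check what the join condition counts, here is a plain-Python trace of it over the five example rows (a sketch, not Spark code). It mirrors the two `<=` comparisons exactly as written in the join.

```python
# Rows from the question: (ID, Timestamp_1, Timestamp_2)
rows = [("a", 0, 3), ("b", 2, 5), ("d", 5, 5), ("c", 5, 9), ("e", 8, 10)]

# For each row `a`, count rows `b` with b.T1 <= a.T1 and b.T2 <= a.T2,
# mirroring the join condition. A row always matches itself.
counts = {
    a_id: sum(1 for _, b1, b2 in rows if b1 <= a1 and b2 <= a2)
    for a_id, a1, a2 in rows
}
print(counts)  # {'a': 1, 'b': 2, 'd': 3, 'c': 4, 'e': 5}
```

These numbers include the current row itself and use `<=`, so they differ from the Count column in the question. Depending on the rule the OP actually intends, which I'm not certain of, the condition may need strict `<` or a different comparison column.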

PySpark DataFrame: Find closest value and slice the DataFrame

So, I've done enough research and haven't found a post that addresses what I want to do.
I have a PySpark DataFrame my_df which is sorted by value column-
+----+-----+
|name|value|
+----+-----+
|   A|   30|
|   B|   25|
|   C|   20|
|   D|   18|
|   E|   18|
|   F|   15|
|   G|   10|
+----+-----+
The summation of all the counts in value column is equal to 136. I want to get all the rows whose combined values >= x% of 136. In this example, let's say x=80. Then target sum = 0.8*136 = 108.8. Hence, the new DataFrame will consist of all the rows that have a combined value >= 108.8.
In our example, this would come down to row D (since the combined values up to D = 30+25+20+18 = 93).
However, the hard part is that I also want to include the immediately following rows with duplicate values. In this case, I also want to include row E since it has the same value as row D i.e. 18.
I want to slice my_df by giving a percentage x variable, for example 80 as discussed above. The new DataFrame should consist of the following rows-
+----+-----+
|name|value|
+----+-----+
|   A|   30|
|   B|   25|
|   C|   20|
|   D|   18|
|   E|   18|
+----+-----+
One thing I could do here is iterate through the DataFrame (which is ~360k rows), but I guess that defeats the purpose of Spark.
Is there a concise function for what I want here?
Use pyspark SQL functions to do this concisely.
result = my_df.filter(my_df.value > target).select(my_df.name,my_df.value)
result.show()
Edit: Based on the OP's question edit: compute a running sum and keep rows until the target value is reached. Note that this will result in rows up to D, not E, which seems like a strange requirement.
from pyspark.sql import Window
from pyspark.sql import functions as f
# Total sum of all `values`
target = (my_df.agg(f.sum("value")).collect())[0][0]  # f.sum, not the Python builtin sum
w = Window.orderBy(my_df.name) #Ideally this should be a column that specifies ordering among rows
running_sum_df = my_df.withColumn('rsum',f.sum(my_df.value).over(w))
running_sum_df.filter(running_sum_df.rsum <= 0.8*target)
Your requirements are quite strict, so it's difficult to formulate an efficient solution to your problem. Nevertheless, here is one approach:
First calculate the cumulative sum and the total sum for the value column and filter the DataFrame using the percentage of target condition you specified. Let's call this result df_filtered:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.orderBy(f.col("value").desc(), "name").rangeBetween(Window.unboundedPreceding, 0)
target = 0.8
df_filtered = df.withColumn("cum_sum", f.sum("value").over(w))\
.withColumn("total_sum", f.sum("value").over(Window.partitionBy()))\
.where(f.col("cum_sum") <= f.col("total_sum")*target)
df_filtered.show()
#+----+-----+-------+---------+
#|name|value|cum_sum|total_sum|
#+----+-----+-------+---------+
#|   A|   30|     30|      136|
#|   B|   25|     55|      136|
#|   C|   20|     75|      136|
#|   D|   18|     93|      136|
#+----+-----+-------+---------+
Then join this filtered DataFrame back on the original on the value column. Since your DataFrame is already sorted by value, the final output will contain the rows you want.
df.alias("r")\
    .join(
        df_filtered.alias('l'),
        on="value"
    ).select("r.name", "r.value").sort(f.col("value").desc(), "name").show()
#+----+-----+
#|name|value|
#+----+-----+
#|   A|   30|
#|   B|   25|
#|   C|   20|
#|   D|   18|
#|   E|   18|
#+----+-----+
The total_sum and cum_sum columns are calculated using a Window function.
The Window w orders on the value column descending, followed by the name column. The name column is used to break ties: without it, both rows D and E would have the same cumulative sum of 111 = 75+18+18 and you'd incorrectly lose both of them in the filter.
w = (
    Window                                       # Define Window
    .orderBy(                                    # This will define ordering
        f.col("value").desc(),                   # First sort by value descending
        "name"                                   # Sort on name second
    )
    .rangeBetween(Window.unboundedPreceding, 0)  # Extend back to beginning of window
)
The rangeBetween(Window.unboundedPreceding, 0) specifies that the Window should include all rows from the start of the ordering up to and including the current row (as defined by the orderBy). This is what makes it a cumulative sum.
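The two steps (filter on the cumulative sum, then pull tied values back in) can be traced in plain Python on the example data. This is a sketch of the logic only, not Spark code, and the variable names are made up for illustration.

```python
from itertools import accumulate

rows = [("A", 30), ("B", 25), ("C", 20), ("D", 18), ("E", 18), ("F", 15), ("G", 10)]
target = 0.8

# Order by value descending, then name, so ties get distinct cumulative sums.
ordered = sorted(rows, key=lambda r: (-r[1], r[0]))
total = sum(value for _, value in ordered)
cum_sums = list(accumulate(value for _, value in ordered))

# Step 1: keep the values of rows whose cumulative sum is within target * total.
kept_values = {value for (_, value), c in zip(ordered, cum_sums) if c <= total * target}

# Step 2: bring back every row sharing one of those values (this re-adds E).
result = [row for row in ordered if row[1] in kept_values]
print(result)  # [('A', 30), ('B', 25), ('C', 20), ('D', 18), ('E', 18)]
```

The sketch matches on a set of values, so each row comes back once. In the Spark version, if several tied rows all pass the filter, the join on value could return each of them more than once; a distinct on the filtered values before joining might be worth considering.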

PySpark: Add a new column with a tuple created from columns

Here I have a DataFrame created as follows,
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
["Id","V1","V2","V3"])
It looks like
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
|  a|  5|  R|  X|
|  b|  7|  G|  S|
|  c|  8|  G|  S|
+---+---+---+---+
I'm looking to add a column that is a tuple consisting of V1,V2,V3.
The result should look like
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
|  a|  5|  R|  X|(5,R,X)|
|  b|  7|  G|  S|(7,G,S)|
|  c|  8|  G|  S|(8,G,S)|
+---+---+---+---+-------+
I've tried to use similar syntax as in Python but it didn't work:
df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
TypeError: zip argument #1 must support iteration.
Any help would be appreciated!
I'm coming from Scala, but I believe there's a similar way in Python:
Using a method from the sql.functions package:
If you want to get a StructType with these three columns, use the struct(cols: Column*): Column method like this:
from pyspark.sql.functions import struct
df.withColumn("V_tuple",struct(df.V1,df.V2,df.V3))
But if you want to get it as a String, you can use the concat(exprs: Column*): Column method like this:
from pyspark.sql.functions import concat
df.withColumn("V_tuple",concat(df.V1,df.V2,df.V3))
With this second method you may have to cast the columns to Strings.
I'm not sure about the Python syntax; just edit the answer if there's a syntax error.
Hope this helps. Best regards
Use struct:
from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))
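As a rough plain-Python analogy for the difference between the two suggestions (a sketch, not Spark code): struct keeps the three fields separate with their own types, while concat glues their string forms together with no separator.

```python
row = {"Id": "a", "V1": 5, "V2": "R", "V3": "X"}

# Roughly what struct(V1, V2, V3) keeps: separate fields with their own types.
as_struct = (row["V1"], row["V2"], row["V3"])

# Roughly what concat(V1, V2, V3) produces once everything is a string:
# a single string with no separators.
as_concat = "".join(str(row[c]) for c in ("V1", "V2", "V3"))

print(as_struct)  # (5, 'R', 'X')
print(as_concat)  # 5RX
```

If a separator is wanted in the string form, pyspark.sql.functions.concat_ws takes one as its first argument.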
