pyspark get the row immediately after the one selected - python

i have a dataframe (df) like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| One| Two|   x|
| One| Two|full|
| One| Two|   y|
| One| Two|   z|
| One| Two|full|
| One| Two|   u|
| One| Two|   e|
+----+----+----+
Using PySpark I want to flag the rows immediately after a row with col3 == "full" with 1, otherwise 0, like this:
+----+----+----+----+
|col1|col2|col3|flag|
+----+----+----+----+
| One| Two|   x|   0|
| One| Two|full|   0|
| One| Two|   y|   1|
| One| Two|   z|   0|
| One| Two|full|   0|
| One| Two|   u|   1|
| One| Two|   e|   0|
+----+----+----+----+
At the moment this is my idea, but it doesn't take the row immediately after:
df.withColumn('flag', f.when(f.col('col3') == 'full', 1).otherwise(0))
Can you help me?

Use lag with a when statement:
from pyspark.sql import Window
from pyspark.sql.functions import lag, when

w = Window.partitionBy('col1', 'col2').orderBy('col1')
df.withColumn('x', when(lag('col3').over(w) == 'full', 1).otherwise(0)).show()
+----+----+----+---+
|col1|col2|col3| x|
+----+----+----+---+
| One| Two| x| 0|
| One| Two|full| 0|
| One| Two| y| 1|
| One| Two| z| 0|
| One| Two|full| 0|
| One| Two| u| 1|
| One| Two| e| 0|
+----+----+----+---+

Step 1: assign a row number to each row using the row_number function.
Step 2: filter the dataframe with col3 == 'full'; now you have the row numbers where col3 is full. Call it dataframe2, let's say.
Step 3: create a new column in dataframe2 by adding one to the row number column; now you have the row numbers of the rows immediately after the ones having col3 as 'full'.
Step 4: inner join dataframe1 with dataframe2, matching dataframe1's row_number against the new row number column selected from dataframe2.
Pardon for no code, I'm on my mobile. Let me know if you still want help - a sketch of the idea is below.
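A minimal sketch of those steps, assuming the row order is given by a monotonically increasing id (the example has no natural ordering key, so that part is an assumption):

from pyspark.sql import Window
from pyspark.sql import functions as f

# Step 1: assign a row number to every row
w_order = Window.orderBy(f.monotonically_increasing_id())
df1 = df.withColumn('rn', f.row_number().over(w_order))

# Steps 2-3: take the rows where col3 == 'full' and shift their row numbers by one
df2 = df1.filter(f.col('col3') == 'full').select((f.col('rn') + 1).alias('rn_next'))

# Step 4: flag the rows whose row number matches a shifted row number
result = (df1.join(df2, df1.rn == df2.rn_next, 'left')
             .withColumn('flag', f.when(f.col('rn_next').isNotNull(), 1).otherwise(0))
             .drop('rn', 'rn_next'))
result.show()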

Related

How to make a new pyspark df column that's the average of the last n values by day of week?

What I'm trying to do is make a PySpark dataframe with item and date and another column "3_avg" that's the average of the last three same days of the week counting back from the given date. Said another way, if 2022-5-5 is a Thursday, I want the 3_avg value for that row to be the average sales for that item over the last three Thursdays: 4/28, 4/21, and 4/14.
I've got this far, but it just averages the whole column for that day of week... I can't figure out how to make it distinct by item and date and only use the last three. I was trying to get it to work with day_of_week, but my brain can't connect that to what needs to happen.
df_fcst_dow = (
    df_avg
    .withColumn("day_of_week", F.dayofweek(F.col("trn_dt")))
    .groupBy("item", "date", "day_of_week")
    .agg(
        F.sum(F.col("sales") / 3).alias("3_avg")
    )
)
You can do this with a window or with a groupBy. Here I'd encourage the groupBy, as it will distribute the work better among the worker nodes. We create an array of the current date and the next two dates, then explode that array, which duplicates the data across all the dates we want so we can group it up to make an average.
import pyspark.sql.functions as F
>>> spark.table("trn_dt").show()
+----+----------+-----+
|item| date|sales|
+----+----------+-----+
| 1|2016-01-03| 16.0|
| 1|2016-01-02| 15.0|
| 1|2016-01-05| 9.0|
| 1|2016-01-04| 10.0|
| 1|2016-01-01| 11.0|
| 1|2016-01-07| 10.0|
| 1|2016-01-06| 7.0|
+----+----------+-----+
df_avg.withColumn( "dates",
F.array( #building array of dates
df_avg["date"],
F.date_add( df_avg["date"], 1),
F.date_add( df_avg["date"], 2)
)).select(
F.col("item"),
F.explode("dates") ).alias("ThreeDayAve"), # tripling our data
F.col("sales")
).groupBy( "item","ThreeDayAve")
.agg( F.avg("sales").alias("3_avg")).show()
+----+-----------+------------------+
|item|ThreeDayAve| 3_avg|
+----+-----------+------------------+
| 1| 2016-01-05|11.666666666666666|
| 1| 2016-01-04|13.666666666666666|
| 1| 2016-01-07| 8.666666666666666|
| 1| 2016-01-01| 11.0|
| 1| 2016-01-03| 14.0|
| 1| 2016-01-02| 13.0|
| 1| 2016-01-09| 10.0|
| 1| 2016-01-06| 8.666666666666666|
| 1| 2016-01-08| 8.5|
+----+-----------+------------------+
You likely could use a window for this, but it wouldn't perform as well on large data sets. A window-based sketch follows below.
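For reference, a rough window-based sketch of the original same-day-of-week requirement (assuming the columns are item, date and sales, as in the question) could look like this:

import pyspark.sql.functions as F
from pyspark.sql import Window

# average of the previous three rows for the same item and day of week
w = (Window.partitionBy("item", F.dayofweek("date"))
           .orderBy("date")
           .rowsBetween(-3, -1))

df_3avg = df_avg.withColumn("3_avg", F.avg("sales").over(w))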

PySpark: Moving rows from one dataframe into another if column values are not found in second dataframe

I have two spark dataframes with similar schemas:
DF1:
id category flag
123abc type 1 1
456def type 1 1
789ghi type 2 0
101jkl type 3 0
DF2:
id category flag
123abc type 1 1
456def type 1 1
789ghi type 2 1
101xyz type 3 0
DF1 has more data than DF2 so I cannot replace it. However, DF2 will have ids not found in DF1, as well as several ids with more accurate flag data. This means there are two situations that I need resolved:
789ghi has a different flag and needs to overwrite the 789ghi in DF1.
101xyz is not found in DF1 and needs to be moved over.
Each dataframe is millions of rows, so I am looking for an efficient way to perform this operation. I am not sure if this is a situation that requires an outer join or anti-join.
You can union the two dataframes and keep the first record for each id.
from functools import reduce
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import monotonically_increasing_id, col, rank

# df2 goes first so that its rows win ties for each id
df = reduce(DataFrame.unionByName, [df2, df1])
df = df.withColumn('row_num', monotonically_increasing_id())
window = Window.partitionBy("id").orderBy('row_num')
df = (df.withColumn('rank', rank().over(window))
        .filter(col('rank') == 1)
        .drop('rank', 'row_num'))
Output
+------+--------+----+
| id|category|flag|
+------+--------+----+
|101jkl| type 3| 0|
|101xyz| type 3| 0|
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
+------+--------+----+
Option 1:
I would find ids in df1 that are not in df2, put them into a subset dataframe, and then union the subset with df2.
Or
Option 2:
Find the elements in df1 that are also in df2, drop those rows, and then union with df2 (a sketch of this option follows the Option 1 output below).
The approach I'd take would obviously be based on which is less expensive computationally.
Option 1 code
from pyspark.sql.functions import col

# ids present in df1 but not in df2 (there may be more than one)
ids = [r[0] for r in df1.select('id').subtract(df2.select('id')).collect()]
df2.union(df1.filter(col('id').isin(ids))).show()
Outcome
+------+--------+----+
| id|category|flag|
+------+--------+----+
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
|101xyz| type 3| 0|
|101jkl| type 3| 0|
+------+--------+----+
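For completeness, a sketch of Option 2 using a left anti join (same dataframe names as above):

# keep only the df1 rows whose id does not appear in df2, then union with df2
df1_only = df1.join(df2.select('id'), on='id', how='left_anti')
df2.union(df1_only).show()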

PySpark multi groupBy with different columns

I have data like below
year name percent sex
1880 John 0.081541 boy
1881 William 0.080511 boy
1881 John 0.050057 boy
I need to groupby and count using different columns
df_year = df.groupby('year').count()
df_name = df.groupby('name').count()
df_sex = df.groupby('sex').count()
then I have to create a Window to get the top-3 data by each column
window = Window.partitionBy('year').orderBy(col("count").desc())
top4_res = (df_year.withColumn('topn', func.row_number().over(window))
                   .filter(col('topn') <= 4)
                   .repartition(1))
Suppose I have hundreds of columns to group by, count, and take the top-k from.
Can I do it all at once, or is there a better way to do it?
I am not sure if this will meet your requirement, but if you are okay with a single dataframe, I think it can give you a start; let me know if otherwise. You can stack these 3 columns (or more) and then group by and take the count:
import pyspark.sql.functions as F

cols = ['year', 'name', 'sex']
# builds: stack(3,"year",year,"name",name,"sex",sex) as (col,val)
e = f"""stack({len(cols)},{','.join(map(','.join, zip([f'"{i}"' for i in cols], cols)))}) as (col,val)"""

(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
   .groupBy('col', 'val').agg(F.count("col").alias("Counts")).orderBy('col')).show()
+----+-------+------+
| col| val|Counts|
+----+-------+------+
|name| John| 2|
|name|William| 1|
| sex| boy| 3|
|year| 1881| 2|
|year| 1880| 1|
+----+-------+------+
If you want a wide form you can also pivot, but I think the long form would be more helpful:
(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
   .groupBy('col').pivot('val').agg(F.count('val')).show())
+----+----+----+----+-------+----+
| col|1880|1881|John|William| boy|
+----+----+----+----+-------+----+
|name|null|null| 2| 1|null|
|year| 1| 2|null| null|null|
| sex|null|null|null| null| 3|
+----+----+----+----+-------+----+
If you want the top n values with the biggest counts for each column, this should work:
from pyspark.sql.functions import *

columns_to_check = ['year', 'name']
n = 4
for c in columns_to_check:
    # returns a dataframe
    x = df.groupBy(c).count().sort(col("count").desc()).limit(n)
    x.show()
    # returns a list of rows
    x = df.groupBy(c).count().sort(col("count").desc()).take(n)
    print(x)
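Combining the two ideas above, a hedged sketch that returns the top-n values per column in a single dataframe (column names as in the question) could be:

import pyspark.sql.functions as F
from pyspark.sql import Window

cols = ['year', 'name', 'sex']
n = 3

# long form: one (col, val) row per original cell, counted per value
e = f"""stack({len(cols)},{','.join(map(','.join, zip([f'"{i}"' for i in cols], cols)))}) as (col,val)"""
counts = (df.select(*[F.col(i).cast('string') for i in cols])
            .selectExpr(e)
            .groupBy('col', 'val')
            .agg(F.count('val').alias('Counts')))

# rank values within each column and keep the top n
w = Window.partitionBy('col').orderBy(F.col('Counts').desc())
top_n = counts.withColumn('topn', F.row_number().over(w)).filter(F.col('topn') <= n)
top_n.show()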

PySpark DataFrame: Find closest value and slice the DataFrame

So, I've done enough research and haven't found a post that addresses what I want to do.
I have a PySpark DataFrame my_df which is sorted by value column-
+----+-----+
|name|value|
+----+-----+
| A| 30|
| B| 25|
| C| 20|
| D| 18|
| E| 18|
| F| 15|
| G| 10|
+----+-----+
The sum of all the values in the value column is 136. I want to get all the leading rows whose combined (cumulative) value stays within x% of 136. In this example, let's say x = 80. Then the target sum = 0.8 * 136 = 108.8, and the new DataFrame will consist of the rows whose running total does not exceed 108.8.
In our example, this comes down to row D (since the combined value up to D = 30 + 25 + 20 + 18 = 93, while adding E would give 111 > 108.8).
However, the hard part is that I also want to include the immediately following rows with duplicate values. In this case, I also want to include row E since it has the same value as row D i.e. 18.
I want to slice my_df by giving a percentage x variable, for example 80 as discussed above. The new DataFrame should consist of the following rows-
+----+-----+
|name|value|
+----+-----+
| A| 30|
| B| 25|
| C| 20|
| D| 18|
| E| 18|
+----+-----+
One thing I could do here is iterate through the DataFrame (which is ~360k rows), but I guess that defeats the purpose of Spark.
Is there a concise function for what I want here?
Use pyspark SQL functions to do this concisely.
result = my_df.filter(my_df.value > target).select(my_df.name,my_df.value)
result.show()
Edit: Based on OP's question edit - compute the running sum and keep rows until the target value is reached. Note that this will result in rows up to D, not E, which seems like a strange requirement.
from pyspark.sql import Window
from pyspark.sql import functions as f

# Total sum of all `values`
target = my_df.agg(f.sum("value")).collect()[0][0]
w = Window.orderBy(my_df.name)  # Ideally this should be a column that specifies ordering among rows
running_sum_df = my_df.withColumn('rsum', f.sum(my_df.value).over(w))
running_sum_df.filter(running_sum_df.rsum <= 0.8*target)
Your requirements are quite strict, so it's difficult to formulate an efficient solution to your problem. Nevertheless, here is one approach:
First calculate the cumulative sum and the total sum for the value column and filter the DataFrame using the percentage of target condition you specified. Let's call this result df_filtered:
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.orderBy(f.col("value").desc(), "name").rangeBetween(Window.unboundedPreceding, 0)
target = 0.8
df_filtered = df.withColumn("cum_sum", f.sum("value").over(w))\
    .withColumn("total_sum", f.sum("value").over(Window.partitionBy()))\
    .where(f.col("cum_sum") <= f.col("total_sum")*target)
df_filtered.show()
#+----+-----+-------+---------+
#|name|value|cum_sum|total_sum|
#+----+-----+-------+---------+
#| A| 30| 30| 136|
#| B| 25| 55| 136|
#| C| 20| 75| 136|
#| D| 18| 93| 136|
#+----+-----+-------+---------+
Then join this filtered DataFrame back to the original on the value column. Since your DataFrame is already sorted by value, the final output will contain the rows you want.
df.alias("r")\
.join(
df_filtered.alias('l'),
on="value"
).select("r.name", "r.value").sort(f.col("value").desc(), "name").show()
#+----+-----+
#|name|value|
#+----+-----+
#| A| 30|
#| B| 25|
#| C| 20|
#| D| 18|
#| E| 18|
#+----+-----+
The total_sum and cum_sum columns are calculated using a Window function.
The Window w orders on the value column descending, followed by the name column. The name column is used to break ties: without it, both rows D and E would have the same cumulative sum of 111 = 75 + 18 + 18, and you'd incorrectly lose both of them in the filter.
w = (Window                                        # define the Window
     .orderBy(                                     # this defines the ordering
         f.col("value").desc(),                    # first sort by value descending
         "name"                                    # then sort on name
     )
     .rangeBetween(Window.unboundedPreceding, 0))  # extend the frame back to the beginning of the window
The rangeBetween(Window.unboundedPreceding, 0) specifies that the Window should include all rows before the current row (defined by the orderBy). This is what makes it a cumulative sum.

Merge with Pyspark

I'm working with PySpark on Spark 1.6, and I would like to group some values together.
+--------+-----+
|    Item|value|
+--------+-----+
|       A|  187|
|       B|  200|
|       C|    3|
|       D|   10|
+--------+-----+
I would like to group together all items with less than 10% of the total value (in this case C and D would be grouped into a new value "Other").
So the new table would look like:
+--------+-----+
|    Item|value|
+--------+-----+
|       A|  187|
|       B|  200|
|   Other|   13|
+--------+-----+
Does someone know a function or a simple way to do that?
Many thanks for your help.
You can filter the dataframe twice: once to get a dataframe with just the values you want to keep, and once to get one with just the others. Perform an aggregation on the "others" dataframe to sum them, then union the two dataframes back together. Depending on the data, you may want to persist the original dataframe before all this so that it doesn't need to be evaluated twice. A sketch of this approach is below.
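A minimal sketch of that approach, assuming the columns are named Item and value as in the question (written against the current DataFrame API; on Spark 1.6 use unionAll instead of union):

import pyspark.sql.functions as F

df = df.persist()  # avoid evaluating the source twice

# anything below 10% of the total value gets folded into "Other"
total = df.agg(F.sum("value")).collect()[0][0]
threshold = 0.10 * total

keep = df.filter(F.col("value") >= threshold)
other = (df.filter(F.col("value") < threshold)
           .agg(F.sum("value").alias("value"))
           .select(F.lit("Other").alias("Item"), "value"))

keep.union(other).show()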
