PySpark dataframe doesn't change after some processing

I create a dataframe and use a window function to get a cumulative value, but after applying the function, df and df.select() show the rows in different orders:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
import pyspark.sql.functions as F

spark = SparkSession.builder.master('local[8]').appName("SparkByExample.com").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", '50')

# create the dataframe
data = [(1, 1, 2),
        (1, 2, 2),
        (1, 3, 2),
        (1, 2, 1),
        (2, 1, 2),
        (2, 3, 2),
        (2, 2, 1),
        (3, 1, 2),
        (3, 3, 2),
        (3, 2, 1)]
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", IntegerType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 2|
| 1| 2| 2|
| 1| 3| 2|
| 1| 2| 1|
| 2| 1| 2|
| 2| 3| 2|
| 2| 2| 1|
| 3| 1| 2|
| 3| 3| 2|
| 3| 2| 1|
+----+----+----+
# window function to get the cumulative value
from pyspark.sql.window import Window

w = Window().partitionBy(F.col('col1')).orderBy(F.col('col2'))

def f(lag_val, current_val):
    value = 0
    if lag_val != current_val:
        value = 1
    return value

# register the UDF so we can use it with our dataframe
func_udf = F.udf(f, IntegerType())
print(id(df))
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3'])).withColumn('new_column2', F.sum('new_column').over(w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)))
df.show()
+----+----+----+----------+-----------+
|col1|col2|col3|new_column|new_column2|
+----+----+----+----------+-----------+
| 3| 1| 2| 1| 1|
| 3| 2| 1| 1| 2|
| 3| 3| 2| 1| 3|
| 2| 1| 2| 1| 1|
| 2| 2| 1| 1| 2|
| 2| 3| 2| 1| 3|
| 1| 1| 2| 1| 1|
| 1| 2| 2| 0| 1|
| 1| 2| 1| 1| 2|
| 1| 3| 2| 1| 3|
+----+----+----+----------+-----------+
test = df.select('col1','col2','col3')
test.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 2|
| 1| 2| 2|
| 1| 3| 2|
| 1| 2| 1|
| 2| 1| 2|
| 2| 3| 2|
| 2| 2| 1|
| 3| 1| 2|
| 3| 3| 2|
| 3| 2| 1|
+----+----+----+
We can see that the row order of 'col1', 'col2', 'col3' differs between df and test: test shows the original order of df. I suspect this is related to actions and transformations, but I am not sure.

Python dataframe packages sometimes alter the row order automatically after certain functions are executed. For debugging, I suggest you break down your code from:
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3'])).withColumn('new_column2', F.sum('new_column').over(w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)))
to something like this:
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3']))
df_partition = w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('new_column2', F.sum('new_column').over(df_partition))
You can break those lines into even smaller steps until you understand what each one does.
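As a side note, the Python UDF in the question can usually be replaced by built-in column expressions, which avoids the BatchEvalPython step you will see in the query plan below. A minimal sketch, assuming the same window w as in the question (the lag value is null on the first row of each partition, which the UDF treats as "different"):
prev = F.lag('col3').over(w)
df = df.withColumn(
    'new_column',
    # 1 when there is no previous value or the previous value differs, else 0
    F.when(prev.isNull() | (prev != F.col('col3')), F.lit(1)).otherwise(F.lit(0))
)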

If you call explain() to check the query plans:
test.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[col1#255,col2#256,col3#257]
df.explain()
== Physical Plan ==
Window [sum(cast(new_column#288 as bigint)) windowspecdefinition(col1#255, col2#256 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_column2#295L], [col1#255], [col2#256 ASC NULLS FIRST]
+- *(4) Sort [col1#255 ASC NULLS FIRST, col2#256 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#255, 50), true, [id=#553]
      +- *(3) Project [col1#255, col2#256, col3#257, pythonUDF0#389 AS new_column#288]
         +- BatchEvalPython [f(_we0#289, col3#257)], [pythonUDF0#389]
            +- Window [lag(col3#257, 1, null) windowspecdefinition(col1#255, col2#256 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#289], [col1#255], [col2#256 ASC NULLS FIRST]
               +- *(2) Sort [col1#255 ASC NULLS FIRST, col2#256 ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col1#255, 50), true, [id=#544]
                     +- *(1) Scan ExistingRDD[col1#255,col2#256,col3#257]
You can see that the query optimizer knows the test dataframe only needs columns that already exist in the original dataframe, so it skips the transformations that build the new columns in df, since they are irrelevant to test. That's why test shows the same row order as the original dataframe.
For the new df with the additional columns, however, sorting is performed inside the window functions, so the row order changes.
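More generally, row order in a Spark dataframe is never guaranteed after a shuffle; if you rely on a particular order for display or downstream processing, sort explicitly. A minimal sketch:
# sort explicitly instead of relying on whatever order show() happens to print
df.orderBy('col1', 'col2').show()
test.orderBy('col1', 'col2').show()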

Related

Merging rows that have the same credentials - pyspark dataframe

How can I merge two rows in a pyspark dataframe that satisfy a condition?
Example:
dataframe
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 1|
| 4| 4| 2|
| 4| 1| 3|
| 1| 7| 1|
+---+---+------+
condition: (df.src,df.dst) == (df.dst,df.src)
expected output: weights summed and (4,1) deleted
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 4| #
| 4| 4| 2|
| 1| 7| 1|
+---+---+------+
or
weights summed and (1,4) deleted
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 4| 4| 2|
| 4| 1| 4| #
| 1| 7| 1|
+---+---+------+
You can add a src_dst column with the sorted array of src and dst, then get the sum of weights for each src_dst, and remove duplicate rows of src_dst:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'src_dst',
    F.sort_array(F.array('src', 'dst'))
).withColumn(
    'weight',
    F.sum('weight').over(Window.partitionBy('src_dst'))
).dropDuplicates(['src_dst']).drop('src_dst')
df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
| 1| 7| 1|
| 1| 1| 93|
| 1| 4| 4|
| 8| 7| 1|
| 4| 4| 2|
+---+---+------+
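If it does not matter which of the two orientations survives, a hedged alternative is to aggregate on a normalized pair with least/greatest instead of using a window; note that, unlike the answer above, this rewrites every edge to the (smaller, larger) orientation:
from pyspark.sql import functions as F

# group on the canonical (smaller, larger) pair and sum the weights
df3 = (
    df.groupBy(
        F.least('src', 'dst').alias('src'),
        F.greatest('src', 'dst').alias('dst'))
    .agg(F.sum('weight').alias('weight'))
)
df3.show()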

Add distinct count of a column to each row in PySpark

I need to add the distinct count of a column to each row in a PySpark dataframe.
Example:
If the original dataframe is this:
+----+----+
|col1|col2|
+----+----+
|abc | 1|
|xyz | 1|
|dgc | 2|
|ydh | 3|
|ujd | 1|
|ujx | 3|
+----+----+
Then I want something like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|abc | 1| 3|
|xyz | 1| 3|
|dgc | 2| 3|
|ydh | 3| 3|
|ujd | 1| 3|
|ujx | 3| 3|
+----+----+----+
I tried df.withColumn('total_count', f.countDistinct('col2')) but it gives an error.
You can count the distinct elements in the column and create a new column with that value:
distincts = df.dropDuplicates(["col2"]).count()
df = df.withColumn("col3", f.lit(distincts))
Cross join with the distinct count as below:
df2 = df.crossJoin(df.select(F.countDistinct('col2').alias('col3')))
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
You can use Window, collect_set and size:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([("abc", 1), ("xyz", 1), ("dgc", 2), ("ydh", 3), ("ujd", 1), ("ujx", 3)], ['col1', 'col2'])
window = Window.orderBy("col2").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("col3", F.size(F.collect_set(F.col("col2")).over(window))).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
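As a related note, countDistinct cannot be used directly over a window (DISTINCT aggregates are not supported there). If an approximate result is acceptable, approx_count_distinct can be used as a window function; a sketch reusing the window variable defined above:
# approximate distinct count over the same unbounded window
df.withColumn("col3", F.approx_count_distinct("col2").over(window)).show()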

Get the first row that matches some condition over a window in PySpark

To give an example, suppose we have a stream of user actions as follows:
from pyspark.sql import *
spark = SparkSession.builder.appName('test').master('local[8]').getOrCreate()
df = spark.sparkContext.parallelize([
    Row(user=1, action=1, time=1),
    Row(user=1, action=1, time=2),
    Row(user=2, action=1, time=3),
    Row(user=1, action=2, time=4),
    Row(user=2, action=2, time=5),
    Row(user=2, action=2, time=6),
    Row(user=1, action=1, time=7),
    Row(user=2, action=1, time=8),
]).toDF()
df.show()
The dataframe looks like:
+----+------+----+
|user|action|time|
+----+------+----+
| 1| 1| 1|
| 1| 1| 2|
| 2| 1| 3|
| 1| 2| 4|
| 2| 2| 5|
| 2| 2| 6|
| 1| 1| 7|
| 2| 1| 8|
+----+------+----+
Then, I want to add a column next_alt_time to each row, giving the time when user changes action type in the following rows. For the input above, the output should be:
+----+------+----+-------------+
|user|action|time|next_alt_time|
+----+------+----+-------------+
| 1| 1| 1| 4|
| 1| 1| 2| 4|
| 2| 1| 3| 5|
| 1| 2| 4| 7|
| 2| 2| 5| 8|
| 2| 2| 6| 8|
| 1| 1| 7| null|
| 2| 1| 8| null|
+----+------+----+-------------+
I know I can create a window like this:
wnd = Window().partitionBy('user').orderBy('time').rowsBetween(1, Window.unboundedFollowing)
But then I don't know how to impose a condition over the window and select the first row that has a different action than current row, over the window defined above.
Here's how to do it. Spark does not preserve the dataframe's row order, but if you check the rows one by one, you can confirm that it gives the expected answer:
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as F
df = spark.sparkContext.parallelize([
    Row(user=1, action=1, time=1),
    Row(user=1, action=1, time=2),
    Row(user=2, action=1, time=3),
    Row(user=1, action=2, time=4),
    Row(user=2, action=2, time=5),
    Row(user=2, action=2, time=6),
    Row(user=1, action=1, time=7),
    Row(user=2, action=1, time=8),
]).toDF()
win = Window().partitionBy('user').orderBy('time')
df = df.withColumn('new_action', F.lag('action').over(win) != F.col('action'))
df = df.withColumn('new_action_time', F.when(F.col('new_action'), F.col('time')))
df = df.withColumn('next_alt_time', F.first('new_action_time', ignorenulls=True).over(win.rowsBetween(1, Window.unboundedFollowing)))
df.show()
+----+------+----+----------+---------------+-------------+
|user|action|time|new_action|new_action_time|next_alt_time|
+----+------+----+----------+---------------+-------------+
| 1| 1| 1| null| null| 4|
| 1| 1| 2| false| null| 4|
| 1| 2| 4| true| 4| 7|
| 1| 1| 7| true| 7| null|
| 2| 1| 3| null| null| 5|
| 2| 2| 5| true| 5| 8|
| 2| 2| 6| false| null| 8|
| 2| 1| 8| true| 8| null|
+----+------+----+----------+---------------+-------------+
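If you only need the final column, you can drop the helper columns afterwards, e.g.:
# keep just the original columns plus the result
df.select('user', 'action', 'time', 'next_alt_time').show()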

How do I count consecutive values with pyspark?

I am trying to count consecutive values that appear in a column with Pyspark. I have the column "a" in my dataframe and expect to create the column "b".
+---+---+
| a| b|
+---+---+
| 0| 1|
| 0| 2|
| 0| 3|
| 0| 4|
| 0| 5|
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 2| 1|
| 2| 2|
| 2| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 1|
| 3| 2|
| 3| 3|
+---+---+
I have tried to create column "b" with the lag function over some window, but without success.
w = Window \
    .partitionBy(df.some_id) \
    .orderBy(df.timestamp_column)
df.withColumn(
    "b",
    f.when(df.a == f.lag(df.a).over(w),
           f.sum(f.lit(1)).over(w)).otherwise(f.lit(0))
)
I could resolve this issue with the following code:
df.withColumn("b",
f.row_number().over(Window.partitionBy("a").orderBy("timestamp_column"))
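Note that partitioning by "a" only gives the consecutive count when each value of "a" forms a single contiguous run. If the same value can re-appear after a gap, a gaps-and-islands approach is needed: flag changes with lag, build a run id with a cumulative sum, then number rows within each run. A sketch under that assumption, using the column names from the question:
from pyspark.sql import functions as F, Window

order_w = Window.orderBy("timestamp_column")

df_runs = (
    df
    # 1 whenever the value differs from the previous row, 0 otherwise
    .withColumn("changed",
                F.coalesce((F.lag("a").over(order_w) != F.col("a")).cast("int"), F.lit(0)))
    # cumulative sum of the change flags labels each consecutive run
    .withColumn("run_id", F.sum("changed").over(order_w))
    # position within the run is the consecutive count
    .withColumn("b", F.row_number().over(
        Window.partitionBy("run_id").orderBy("timestamp_column")))
    .drop("changed", "run_id")
)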

How to add an ID to pyspark dataframe rows that increases only if a certain condition is met?

I have a pyspark dataframe, and I want to add an Id column to it that only increases if a condition is met.
Example:
Over a window on col1, if the value of col2 changes, the ID needs to be incremented by 1.
Input:
+----+----+
|col1|col2|
+----+----+
| 1| A|
| 1| A|
| 1| B|
| 1| C|
| 2| A|
| 2| B|
| 2| B|
| 2| B|
| 2| C|
| 2| C|
+----+----+
output:
+----+----+----+
|col1|col2| ID|
+----+----+----+
| 1| A| 1|
| 1| A| 1|
| 1| B| 2|
| 1| C| 3|
| 2| A| 1|
| 2| B| 2|
| 2| B| 2|
| 2| B| 2|
| 2| C| 3|
| 2| C| 3|
+----+----+----+
Thanks :)
What you are looking for is the dense_rank window function.
Assuming your dataframe variable is df you can do something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
...
df.withColumn('ID', F.dense_rank().over(Window.partitionBy('col1').orderBy('col2')))
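A minimal runnable sketch with the example data from the question:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

data = [(1, 'A'), (1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'),
        (2, 'B'), (2, 'B'), (2, 'B'), (2, 'C'), (2, 'C')]
df = spark.createDataFrame(data, ['col1', 'col2'])
df.withColumn('ID', F.dense_rank().over(Window.partitionBy('col1').orderBy('col2'))).show()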
