Sum of variable number of columns in PySpark - python

I have a Spark DataFrame like this one:
+-----+--------+-------+-------+-------+-------+-------+
| Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|
+-----+--------+-------+-------+-------+-------+-------+
| Cat| 1| 1| 2| 3| 4| 5|
| Dog| 2| 1| 2| 3| 4| 5|
|Mouse| 4| 1| 2| 3| 4| 5|
| Fox| 5| 1| 2| 3| 4| 5|
+-----+--------+-------+-------+-------+-------+-------+
You can reproduce it with the following code:
data = [('Cat', 1, 1, 2, 3, 4, 5),
        ('Dog', 2, 1, 2, 3, 4, 5),
        ('Mouse', 4, 1, 2, 3, 4, 5),
        ('Fox', 5, 1, 2, 3, 4, 5)]
columns = ['Type', 'Criteria', 'Value#1', 'Value#2', 'Value#3', 'Value#4', 'Value#5']
df = spark.createDataFrame(data, schema=columns)
df.show()
My task is to add a Total column that is the sum of all Value columns whose # is no more than the Criteria for that row.
In this example:
For row 'Cat': Criteria is 1, so Total is just Value#1.
For row 'Dog': Criteria is 2, so Total is the sum of Value#1 and Value#2.
For row 'Fox': Criteria is 5, so Total is the sum of all columns (Value#1 through Value#5).
Result should look like this:
+-----+--------+-------+-------+-------+-------+-------+-----+
| Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|Total|
+-----+--------+-------+-------+-------+-------+-------+-----+
| Cat| 1| 1| 2| 3| 4| 5| 1|
| Dog| 2| 1| 2| 3| 4| 5| 3|
|Mouse| 4| 1| 2| 3| 4| 5| 10|
| Fox| 5| 1| 2| 3| 4| 5| 15|
+-----+--------+-------+-------+-------+-------+-------+-----+
I can do it with a Python UDF, but my datasets are large and Python UDFs are slow because of serialization. I'm looking for a pure Spark solution.
I'm using PySpark on Spark 2.1.
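For reference, the UDF approach the question rules out might look like the sketch below (sum_udf and _total are just illustrative names); it produces the same Total column, but every row is serialized to Python, which is exactly the overhead to avoid.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

def _total(criteria, *values):
    # Python-side sum of the first <criteria> values.
    return sum(values[:criteria])

sum_udf = udf(_total, LongType())  # hypothetical helper, for comparison only

value_cols = [col('Value#{}'.format(i)) for i in range(1, 6)]
df.withColumn('Total', sum_udf(col('Criteria'), *value_cols)).show()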

You can easily adapt to PySpark the solution from compute row maximum of the subset of columns and add to an existing dataframe by user6910411:
from pyspark.sql.functions import col, when
total = sum([
    when(col("Criteria") >= i, col("Value#{}".format(i))).otherwise(0)
    for i in range(1, 6)
])
df.withColumn("total", total).show()
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|total|
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Cat| 1| 1| 2| 3| 4| 5| 1|
# | Dog| 2| 1| 2| 3| 4| 5| 3|
# |Mouse| 4| 1| 2| 3| 4| 5| 10|
# | Fox| 5| 1| 2| 3| 4| 5| 15|
# +-----+--------+-------+-------+-------+-------+-------+-----+
For an arbitrary set of ordered columns, define a list:
cols = df.columns[2:]
and redefine total as:
total_ = sum([
    when(col("Criteria") > i, col(cols[i])).otherwise(0)
    for i in range(len(cols))
])
df.withColumn("total", total_).show()
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|total|
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Cat| 1| 1| 2| 3| 4| 5| 1|
# | Dog| 2| 1| 2| 3| 4| 5| 3|
# |Mouse| 4| 1| 2| 3| 4| 5| 10|
# | Fox| 5| 1| 2| 3| 4| 5| 15|
# +-----+--------+-------+-------+-------+-------+-------+-----+
Important:
Here sum is the Python builtin sum (__builtin__.sum), not pyspark.sql.functions.sum.
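If you have already done from pyspark.sql.functions import sum (or a star import), the builtin is shadowed; one way to sidestep that, sketched under that assumption, is to fold the column expressions with functools.reduce instead:
import operator
from functools import reduce
from pyspark.sql.functions import col, when

# Same conditional sum as above, but without relying on the builtin sum.
total = reduce(operator.add, [
    when(col("Criteria") >= i, col("Value#{}".format(i))).otherwise(0)
    for i in range(1, 6)
])
df.withColumn("total", total).show()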

Related

Modify column values with list values when a condition is satisfied - PySpark

I want to assign values to a dataframe column from a list based on a condition, but my code only works with hard-coded replacements, not a dynamic version driven by a list.
I also can't convert the list directly into a dataframe column because its length is much shorter than the column's.
no_connections = network_data.map(lambda row: (row[1], 1)).reduceByKey(lambda a, b: a + b).collect()

network_data1 = network_data1\
    .withColumn("NoUserConnections", when(network_data1.NoUserConnections == 0, no_connections[0])
                .otherwise(network_data1.NoUserConnections))
I can also get the values of no_connections from a dataframe like so
network_data1.groupby('User').count().show()
My Dataframe looks like this:
+---+----+-----------+-----------------+
|_c0|User|Connections|NoUserConnections|
+---+----+-----------+-----------------+
| 0| 0| 1| 0|
| 1| 0| 2| 0|
| 2| 0| 3| 0|
| 3| 0| 4| 0|
| 4| 0| 5| 0|
| 5| 0| 6| 0|
| 6| 1| 7| 1|
| 7| 1| 8| 1|
| 8| 1| 9| 1|
| 9| 1| 10| 1|
+---+----+-----------+-----------------+
and I want to put the number of instances of each User value to their corresponding User like this
+---+----+-----------+-----------------+
|_c0|User|Connections|NoUserConnections|
+---+----+-----------+-----------------+
| 0| 0| 1| 6|
| 1| 0| 2| 6|
| 2| 0| 3| 6|
| 3| 0| 4| 6|
| 4| 0| 5| 6|
| 5| 0| 6| 6|
| 6| 1| 7| 4|
| 7| 1| 8| 4|
| 8| 1| 9| 4|
| 9| 1| 10| 4|
+---+----+-----------+-----------------+
Assuming you are trying to compute the number of occurrences of each user in the dataframe and assign it to that user, you can use a window function in PySpark and apply a count aggregate.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [(0, 0, 1, 0),
        (1, 0, 2, 0),
        (2, 0, 3, 0),
        (3, 0, 4, 0),
        (4, 0, 5, 0),
        (5, 0, 6, 0),
        (6, 1, 7, 1),
        (7, 1, 8, 1),
        (8, 1, 9, 1),
        (9, 1, 10, 1)]
df = spark.createDataFrame(data, ("Id", "User", "Connections", "NoUserConnections"))
window_spec = W.partitionBy("User").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
df.withColumn("NoUserConnections", F.count("Connections").over(window_spec)).show()
Output
+---+----+-----------+-----------------+
| Id|User|Connections|NoUserConnections|
+---+----+-----------+-----------------+
| 0| 0| 1| 6|
| 1| 0| 2| 6|
| 2| 0| 3| 6|
| 3| 0| 4| 6|
| 4| 0| 5| 6|
| 5| 0| 6| 6|
| 6| 1| 7| 4|
| 7| 1| 8| 4|
| 8| 1| 9| 4|
| 9| 1| 10| 4|
+---+----+-----------+-----------------+
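An equivalent sketch without a window is to aggregate the per-user counts separately and join them back onto the dataframe; which of the two performs better will depend on your data and cluster.
# Count rows per User, then join the count back onto the original dataframe.
counts = df.groupBy("User").agg(F.count("*").alias("NoUserConnections"))

df.drop("NoUserConnections").join(counts, on="User", how="left").show()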

Merging rows that have the same credentials - PySpark dataframe

How can I merge two rows in a PySpark dataframe that satisfy a condition?
Example:
dataframe
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 1|
| 4| 4| 2|
| 4| 1| 3|
| 1| 7| 1|
+---+---+------+
condition: (df.src,df.dst) == (df.dst,df.src)
expected output
summed the weights and deleted (4, 1):
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 1| 4| 4| #
| 4| 4| 2|
| 1| 7| 1|
+---+---+------+
or
summed the weights and deleted (1,4)
+---+---+------+
|src|dst|weight|
+---+---+------+
| 8| 7| 1|
| 1| 1| 93|
| 4| 4| 2|
| 4| 1| 4| #
| 1| 7| 1|
+---+---+------+
You can add a src_dst column with the sorted array of src and dst, then get the sum of weights for each src_dst, and remove duplicate rows of src_dst:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'src_dst',
    F.sort_array(F.array('src', 'dst'))
).withColumn(
    'weight',
    F.sum('weight').over(Window.partitionBy('src_dst'))
).dropDuplicates(['src_dst']).drop('src_dst')
df2.show()
+---+---+------+
|src|dst|weight|
+---+---+------+
| 1| 7| 1|
| 1| 1| 93|
| 1| 4| 4|
| 8| 7| 1|
| 4| 4| 2|
+---+---+------+
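If you don't need to preserve which orientation of each pair survives, a groupBy-based sketch using least and greatest achieves the same merge; note it canonicalizes every pair, so (8, 7) would come out as (7, 8), unlike the output above.
from pyspark.sql import functions as F

# Normalize each pair so that (1, 4) and (4, 1) fall into the same group, then sum.
(df.groupBy(F.least('src', 'dst').alias('src'), F.greatest('src', 'dst').alias('dst'))
   .agg(F.sum('weight').alias('weight'))
   .show())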

Get the first row that matches some condition over a window in PySpark

To give an example, suppose we have a stream of user actions as follows:
from pyspark.sql import *
spark = SparkSession.builder.appName('test').master('local[8]').getOrCreate()
df = spark.sparkContext.parallelize([
    Row(user=1, action=1, time=1),
    Row(user=1, action=1, time=2),
    Row(user=2, action=1, time=3),
    Row(user=1, action=2, time=4),
    Row(user=2, action=2, time=5),
    Row(user=2, action=2, time=6),
    Row(user=1, action=1, time=7),
    Row(user=2, action=1, time=8),
]).toDF()
df.show()
The dataframe looks like:
+----+------+----+
|user|action|time|
+----+------+----+
| 1| 1| 1|
| 1| 1| 2|
| 2| 1| 3|
| 1| 2| 4|
| 2| 2| 5|
| 2| 2| 6|
| 1| 1| 7|
| 2| 1| 8|
+----+------+----+
Then I want to add a column next_alt_time to each row, giving the time at which the user next changes action type in the following rows. For the input above, the output should be:
+----+------+----+-------------+
|user|action|time|next_alt_time|
+----+------+----+-------------+
| 1| 1| 1| 4|
| 1| 1| 2| 4|
| 2| 1| 3| 5|
| 1| 2| 4| 7|
| 2| 2| 5| 8|
| 2| 2| 6| 8|
| 1| 1| 7| null|
| 2| 1| 8| null|
+----+------+----+-------------+
I know I can create a window like this:
wnd = Window().partitionBy('user').orderBy('time').rowsBetween(1, Window.unboundedFollowing)
But then I don't know how to impose a condition over the window and select the first row that has a different action than the current row, over the window defined above.
Here's how to do it. Spark does not preserve the dataframe's row order, but if you check the rows one by one, you can confirm that this gives the expected answer:
from pyspark.sql import Row
from pyspark.sql.window import Window
import pyspark.sql.functions as F
df = spark.sparkContext.parallelize([
    Row(user=1, action=1, time=1),
    Row(user=1, action=1, time=2),
    Row(user=2, action=1, time=3),
    Row(user=1, action=2, time=4),
    Row(user=2, action=2, time=5),
    Row(user=2, action=2, time=6),
    Row(user=1, action=1, time=7),
    Row(user=2, action=1, time=8),
]).toDF()
win = Window().partitionBy('user').orderBy('time')
df = df.withColumn('new_action', F.lag('action').over(win) != F.col('action'))
df = df.withColumn('new_action_time', F.when(F.col('new_action'), F.col('time')))
df = df.withColumn('next_alt_time', F.first('new_action_time', ignorenulls=True).over(win.rowsBetween(1, Window.unboundedFollowing)))
df.show()
+----+------+----+----------+---------------+-------------+
|user|action|time|new_action|new_action_time|next_alt_time|
+----+------+----+----------+---------------+-------------+
| 1| 1| 1| null| null| 4|
| 1| 1| 2| false| null| 4|
| 1| 2| 4| true| 4| 7|
| 1| 1| 7| true| 7| null|
| 2| 1| 3| null| null| 5|
| 2| 2| 5| true| 5| 8|
| 2| 2| 6| false| null| 8|
| 2| 1| 8| true| 8| null|
+----+------+----+----------+---------------+-------------+
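To end up with exactly the columns from the expected output, you can drop the helper columns afterwards:
df.drop('new_action', 'new_action_time').orderBy('user', 'time').show()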

replace NA with median in pyspark using window function

I want to replace NA with the median, based on partition columns, using a window function in PySpark.
(Sample input and required output were provided as images; the input dataframe is recreated in the answer below.)
Creating your dataframe:
data = [[1, 5, 4],
        [1, 5, None],
        [1, 5, 1],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None],
        [2, 5, 4]]
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
df.show()
+----+----+----+
|I_id|p_id| xyz|
+----+----+----+
| 1| 5| 4|
| 1| 5|null|
| 1| 5| 1|
| 1| 5| 4|
| 2| 5| 1|
| 2| 5| 2|
| 2| 5|null|
| 2| 5|null|
| 2| 5| 4|
+----+----+----+
To keep the solution as generic and dynamic as possible, I had to create quite a few intermediate columns to compute the median and propagate it to the nulls. That said, the solution will not be slow and will scale to big data.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when
w = Window().partitionBy("I_id", "p_id").orderBy(F.col("xyz").asc_nulls_first())
w2 = Window().partitionBy("I_id", "p_id")

df.withColumn("xyz1", F.count(F.col("xyz").isNotNull()).over(w))\
  .withColumn("xyz2", F.max(F.row_number().over(w)).over(w2))\
  .withColumn("xyz3", F.first("xyz1").over(w))\
  .withColumn("xyz10", F.col("xyz2") - F.col("xyz3"))\
  .withColumn("xyz9", F.when((F.col("xyz2") - F.col("xyz3")) % 2 != 0, F.col("xyz2") - F.col("xyz3") + 1).otherwise(F.col("xyz2") - F.col("xyz3")))\
  .withColumn("xyz4", F.col("xyz9") / 2)\
  .withColumn("xyz6", F.col("xyz4") + F.col("xyz3"))\
  .withColumn("xyz7", F.when(F.col("xyz10") % 2 == 0, F.col("xyz4") + F.col("xyz3") + 1).otherwise(F.lit(None)))\
  .withColumn("xyz5", F.row_number().over(w))\
  .withColumn("medianr", F.when(F.col("xyz6") == F.col("xyz5"), F.col("xyz")).when(F.col("xyz7") == F.col("xyz5"), F.col("xyz")).otherwise(F.lit(None)))\
  .withColumn("medianr2", F.mean("medianr").over(w2))\
  .withColumn("xyz", F.when(F.col("xyz").isNull(), F.col("medianr2")).otherwise(F.col("xyz")))\
  .select("I_id", "p_id", "xyz")\
  .orderBy("I_id").show()
+----+----+---+
|I_id|p_id|xyz|
+----+----+---+
| 1| 5| 4|
| 1| 5| 1|
| 1| 5| 4|
| 1| 5| 4|
| 2| 5| 2|
| 2| 5| 2|
| 2| 5| 1|
| 2| 5| 2|
| 2| 5| 4|
+----+----+---+
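If an approximate median is good enough, a much shorter sketch (assuming a Spark version where the SQL percentile_approx aggregate is available) is to compute it per group and join it back; unlike the exact computation above, percentile_approx returns an existing value rather than interpolating between the two middle ones.
from pyspark.sql import functions as F

# Approximate median per (I_id, p_id) group, joined back and used to fill the nulls.
medians = df.groupBy("I_id", "p_id").agg(F.expr("percentile_approx(xyz, 0.5)").alias("median_xyz"))

(df.join(medians, on=["I_id", "p_id"], how="left")
   .withColumn("xyz", F.coalesce("xyz", "median_xyz"))
   .drop("median_xyz")
   .orderBy("I_id")
   .show())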

PySpark: how to duplicate a row n times in a dataframe?

I've got a dataframe like this, and I want to duplicate each row n times if the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"])
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use udf function to transform the n value to n times
n_to_array = udf(lambda n : [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
I think the udf answer by @Ahmed is the best way to go, but here is an alternative method that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
import pyspark.sql.functions as f

max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array of length max_n for each row, containing the numbers in range(max_n). This intermediate step results in a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column, and filter to keep only the values in the array that are less than n. This will ensure that we have n copies of each row. Finally we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
    .select('A', 'B', 'n', f.explode('n_array').alias('col'))\
    .where(f.col('col') < f.col('n'))\
    .drop('col')\
    .show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating a max_n-length array for each row, as opposed to just an n-length array in the udf solution. It's not immediately clear to me how this will scale vs. the udf for large max_n, but I suspect the udf will win out.
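On Spark 2.4+ there is also a sequence-based sketch that avoids both the udf and the max_n-sized arrays: build a 1..n array per row, explode it, and drop the exploded index (the helper column name i is arbitrary).
import pyspark.sql.functions as f

# explode(sequence(1, n)) yields n copies of each row; the index column is then discarded.
df.withColumn('i', f.explode(f.expr('sequence(1, n)'))).drop('i').show()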
