Aggregate GroupBy columns with "all"-like function pyspark - python

I have a dataframe with a primary key, date, variable, and value. I want to group by the primary key and determine if all values are equal to a provided value. Example data:
import pandas as pd
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    "pk": [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "date": [
        date(2022, 5, 6),
        date(2022, 5, 13),
        date(2022, 5, 6),
        date(2022, 5, 6),
        date(2022, 5, 14),
        date(2022, 5, 15),
        date(2022, 5, 5),
        date(2022, 5, 5),
        date(2022, 5, 11),
        date(2022, 5, 12)
    ],
    "variable": ["A", "B", "C", "D", "A", "A", "E", "F", "A", "G"],
    "value": [2, 3, 2, 2, 1, 1, 1, 1, 5, 4]
})
df = spark.createDataFrame(df)
df.show()
#+-----+-----------+--------+-----+
#|pk | date|variable|value|
#+-----+-----------+--------+-----+
#| 1| 2022-05-06| A| 2|
#| 1| 2022-05-13| B| 3|
#| 1| 2022-05-06| C| 2|
#| 1| 2022-05-06| D| 2|
#| 2| 2022-05-14| A| 1|
#| 2| 2022-05-15| A| 1|
#| 2| 2022-05-05| E| 1|
#| 2| 2022-05-05| F| 1|
#| 3| 2022-05-11| A| 5|
#| 4| 2022-05-12| G| 4|
#+-----+-----------+--------+-----+
So if I want to know whether, given a primary key pk, all the values are equal to 1 (or pass any arbitrary Boolean test), how should I do this? I've tried applyInPandas, but that is not very efficient, and it seems like there is probably a pretty simple way to do this.

For Spark 3.0+, you can use the forall function to check whether all values collected by collect_list satisfy the Boolean test.
import pyspark.sql.functions as F
df1 = (df
    .groupby("pk")
    .agg(F.expr("forall(collect_list(value), v -> v == 1)").alias("value"))
)
df1.show()
# +---+-----+
# | pk|value|
# +---+-----+
# | 1|false|
# | 3|false|
# | 2| true|
# | 4|false|
# +---+-----+
# or create a column using window function
df2 = df.withColumn("test", F.expr("forall(collect_list(value) over (partition by pk), v -> v == 1)"))
df2.show()
# +---+----------+--------+-----+-----+
# | pk| date|variable|value| test|
# +---+----------+--------+-----+-----+
# | 1|2022-05-06| A| 2|false|
# | 1|2022-05-13| B| 3|false|
# | 1|2022-05-06| C| 2|false|
# | 1|2022-05-06| D| 2|false|
# | 3|2022-05-11| A| 5|false|
# | 2|2022-05-14| A| 1| true|
# | 2|2022-05-15| A| 1| true|
# | 2|2022-05-05| E| 1| true|
# | 2|2022-05-05| F| 1| true|
# | 4|2022-05-12| G| 4|false|
# +---+----------+--------+-----+-----+
You might want to wrap it in a CASE expression to handle NULL values.
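As a possible alternative (my own sketch, not part of the answer above), Spark 3.0+ also ships a dedicated bool_and (alias every) aggregate, which avoids materializing the collected array:
import pyspark.sql.functions as F

# bool_and is true only when the predicate holds for every row in the group;
# like most SQL aggregates it ignores NULL inputs
df1_alt = df.groupby("pk").agg(F.expr("bool_and(value = 1)").alias("value"))
df1_alt.show()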

Related

Spread List of Lists to Sparks DF with PySpark?

I'm currently struggling with the following issue:
Let's take the following list of lists:
[[1, 2, 3], [4, 5], [6, 7]]
How can I create the following Spark DF out of it, with one row per element of each sublist:
| min_value | value |
---------------------
| 1| 1|
| 1| 2|
| 1| 3|
| 4| 4|
| 4| 5|
| 6| 6|
| 6| 7|
The only way I've gotten this done is by processing this list into another list with for-loops, which then basically already represents all rows of my DF; that is probably not the best way to solve this.
THX & BR
IntoNumbers
You can create a dataframe and use explode and array_min to get the desired output:
import pyspark.sql.functions as F
l = [[1, 2, 3], [4, 5], [6, 7]]
df = spark.createDataFrame(
    [[l]],
    ['col']
).select(
    F.explode('col').alias('value')
).withColumn(
    'min_value',
    F.array_min('value')
).withColumn(
    'value',
    F.explode('value')
)
df.show()
+-----+---------+
|value|min_value|
+-----+---------+
| 1| 1|
| 2| 1|
| 3| 1|
| 4| 4|
| 5| 4|
| 6| 6|
| 7| 6|
+-----+---------+
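Note the two explodes: the first turns each sublist into its own row, and the second flattens each sublist into one row per element, while array_min (computed in between) keeps the per-sublist minimum attached to every resulting row.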
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.functions import col, expr
import pyspark.sql.functions as F
data=[[1, 2, 3], [4, 5], [6, 7]]
Extract the first element of each list in data:
j = [item[0] for item in data]
Zip the first elements with data, create the df and explode the value column:
df = spark.createDataFrame(zip(j, data), ['min_value', 'value']).withColumn('value', F.explode(col('value')))
df.show()
+---------+-----+
|min_value|value|
+---------+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 4| 4|
| 4| 5|
| 6| 6|
| 6| 7|
+---------+-----+
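Note that item[0] only equals the minimum here because each sublist happens to start with its smallest value; for arbitrary input you would compute it explicitly (my own tweak, not part of the answer above):
j = [min(item) for item in data]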
Here is my Spark-style solution to this:
Basic imports and SparkSession creation for example purposes:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Stackoverflow problem 1") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Creating an arbitrary list of lists:
list_of_lists = [[1,3,5], [1053,23], [1,3], [5,2,15,23,5,3]]
Solution below:
# Create an rdd from the list of lists
rdd_ll = spark.sparkContext.parallelize(list_of_lists)
# Compute the min from the previous rdd and store in another rdd
rdd_min = rdd_ll.map(lambda x: min(x))
# create a dataframe by zipping the two rdds and using the explode function
df = spark.createDataFrame(rdd_min.zip(rdd_ll)) \
    .withColumn('value', F.explode('_2')) \
    .drop('_2') \
    .withColumnRenamed('_1', 'min_value')
Output
df.show()
+---------+-----+
|min_value|value|
+---------+-----+
| 1| 1|
| 1| 3|
| 1| 5|
| 23| 1053|
| 23| 23|
| 1| 1|
| 1| 3|
| 2| 5|
| 2| 2|
| 2| 15|
| 2| 23|
| 2| 5|
| 2| 3|
+---------+-----+
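One caveat worth noting (my addition, not stated in the original answer): RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which holds here because rdd_min is derived from rdd_ll via map.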

Sum of pyspark columns to ignore NaN values

I have a pyspark dataframe that looks like this:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1| 3|
| 2| NaN| 4|
| 3| 3| 5|
+---+----+----+
I would like to sum col1 and col2 so that the result looks like this:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4| 4|
| 3| 3| 5| 8|
+---+----+----+---+
Here's what I have tried:
import pandas as pd
import pyspark.sql.functions as F

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()
This code returns:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4|NaN| # <-- I want a 4 here, not this NaN
| 3| 3| 5| 8|
+---+----+----+---+
Can anyone help me with this?
Use F.nanvl to replace NaN with a given value (0 here):
import pyspark.sql.functions as F
result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
For your comment (keeping NaN when both columns are NaN):
result = test.withColumn('sum',
    F.when(
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
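As a quick sanity check (a sketch; the exact numeric formatting in the output may differ), both variants should give the sums the question asks for, since no row in the sample has NaN in both columns:
result.show()
# id 1 -> sum 4, id 2 -> sum 4 (the NaN in col1 is ignored), id 3 -> sum 8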

PySpark making dataframe with three columns from RDD with tuple and int

I have an RDD of the form:
[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]
What I have done is
df = spark.createDataFrame(rdd, ["src", "rp"])
where I end up with a column holding the tuple and a column holding the int, which looks like this:
+-------+-----+
| src|rp |
+-------+-----+
|[1, 10]| 1|
|[10, 1]| 1|
|[1, 12]| 1|
|[12, 1]| 1|
+-------+-----+
But I can't figure out how to turn the first element of [x, y] into a src column and the second element into a dst column, so that I end up with a dataframe with three columns src, dst and rp:
+-------+-----+-----+
| src|dst |rp |
+-------+-----+-----+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+-------+-----+-----+
You need an intermediate transformation on your RDD to make it a flat list of three elements:
spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"])
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
You can just do a simple select on the dataframe to separate out the columns. There is no need for an intermediate transformation as the other answer suggests.
from pyspark.sql.functions import col
df = sqlContext.createDataFrame(rdd, ["src", "rp"])
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"),col("rp"))
df.show()
Here's the result
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
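Both approaches rely on the same detail: when createDataFrame ingests the nested Python tuple, it becomes a struct column with fields named _1 and _2, which is why the select version can address them as src._1 and src._2.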

Explode nested JSON with Index using posexplode

I'm using the below function to flatten a deeply nested JSON (it has nested structs and arrays).
# Flatten nested df
def flatten_df(nested_df):
    for col in nested_df.columns:
        array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
        for col in array_cols:
            nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)
I'm successfully able to explode. But I also want to add the order (index) of the elements in the exploded dataframe, so in the above code I replaced the explode_outer function with posexplode_outer. But I get the below error:
An error was encountered:
'The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases'
I tried changing nested_df.withColumn to nested_df.select but I wasn't successful. Can anyone help me explode the nested JSON while also keeping the order of the array elements as a column in the exploded dataframe?
Read the JSON data as a dataframe and create a view or table. In Spark SQL you can use any number of LATERAL VIEW EXPLODE (or POSEXPLODE) clauses with alias references. If part of the JSON structure is a struct type, you can use dot notation to reference the nested fields, e.g. level1.level2.
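A minimal sketch of that approach (the view name, id column and items array column are hypothetical, purely to illustrate the syntax):
df.createOrReplaceTempView("json_view")
spark.sql("""
    SELECT t.id, p.pos, p.item
    FROM json_view t
    LATERAL VIEW POSEXPLODE(t.items) p AS pos, item
""").show()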
Replace nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col])) with nested_df = nested_df.selectExpr("*", f"posexplode({col}) as (position, col)").drop(col)
You might need to write some logic to rename the columns back to the originals, but that should be simple.
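A sketch of how that replacement could sit inside the array loop of flatten_df (the _pos/_val column naming is my own choice, not from the answer above):
for col in array_cols:
    # posexplode_outer emits two columns, so alias both in one selectExpr
    nested_df = nested_df.selectExpr(
        "*", f"posexplode_outer({col}) as ({col}_pos, {col}_val)"
    ).drop(col)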
The error is because posexplode_outer returns two columns, pos and col, so you cannot use it with withColumn(). It can be used in a select, as shown in the code below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst = sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)], schema=['col1','col2','col3'])
tst_new = tst.withColumn("arr", F.array(tst.columns))
expr = tst.columns
expr.append(F.posexplode_outer('arr'))
tst_explode = tst_new.select(*expr)
results:
tst_explode.show()
+----+----+----+---+---+
|col1|col2|col3|pos|col|
+----+----+----+---+---+
| 1| 7| 80| 0| 1|
| 1| 7| 80| 1| 7|
| 1| 7| 80| 2| 80|
| 1| 8| 40| 0| 1|
| 1| 8| 40| 1| 8|
| 1| 8| 40| 2| 40|
| 1| 5| 100| 0| 1|
| 1| 5| 100| 1| 5|
| 1| 5| 100| 2|100|
| 5| 8| 90| 0| 5|
| 5| 8| 90| 1| 8|
| 5| 8| 90| 2| 90|
| 7| 6| 50| 0| 7|
| 7| 6| 50| 1| 6|
| 7| 6| 50| 2| 50|
| 0| 3| 60| 0| 0|
| 0| 3| 60| 1| 3|
| 0| 3| 60| 2| 60|
+----+----+----+---+---+
If you need to rename the columns, you can use the .withColumnRenamed() function
df_final = tst_explode.withColumnRenamed('pos', 'position').withColumnRenamed('col', 'column')
You can try select with list-comprehension to posexplode the ArrayType columns in your existing code:
for col in array_cols:
    nested_df = nested_df.select([F.posexplode_outer(col).alias(col+'_pos', col) if c == col else c for c in nested_df.columns])
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,"n1", ["a", "b", "c"]),(2,"n2", ["foo", "bar"])],["id", "name", "vals"])
#+---+----+----------+
#| id|name| vals|
#+---+----+----------+
#| 1| n1| [a, b, c]|
#| 2| n2|[foo, bar]|
#+---+----+----------+
col = "vals"
df.select([F.posexplode_outer(col).alias(col+'_pos', col) if c == col else c for c in df.columns]).show()
#+---+----+--------+----+
#| id|name|vals_pos|vals|
#+---+----+--------+----+
#| 1| n1| 0| a|
#| 1| n1| 1| b|
#| 1| n1| 2| c|
#| 2| n2| 0| foo|
#| 2| n2| 1| bar|
#+---+----+--------+----+

Pyspark: how to duplicate a row n times in a dataframe?

I've got a dataframe like this and I want to duplicate the row n times if the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform it like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
>>> new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"])
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use udf function to transform the n value to n times
n_to_array = udf(lambda n : [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
| A | B | n |
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
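A quick note on the trade-off (my addition): the udf approach pays the usual cost of shipping each row through Python, whereas the array_repeat answer above stays entirely in builtin Spark SQL functions, so on Spark 2.4+ the builtin route is usually the cheaper option.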
I think the udf answer by #Ahmed is the best way to go, but here is an alternative method that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
import pyspark.sql.functions as f

max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array of length max_n for each row, containing the numbers in range(max_n). The output of this intermediate step will be a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column, and filter to keep only the values in the array that are less than n. This will ensure that we have n copies of each row. Finally we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
    .select('A', 'B', 'n', f.explode('n_array').alias('col'))\
    .where(f.col('col') < f.col('n'))\
    .drop('col')\
    .show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating a max_n-length array for each row, as opposed to just an n-length array in the udf solution. It's not immediately clear to me how this will scale vs. the udf for large max_n, but I suspect the udf will win out.
