I'm currently struggling with the following issue:
Let's take the following list of lists:
[[1, 2, 3], [4, 5], [6, 7]]
How can I create the following Spark DF out of it, with one row per element of each sublist:
| min_value | value |
|-----------|-------|
|         1 |     1 |
|         1 |     2 |
|         1 |     3 |
|         4 |     4 |
|         4 |     5 |
|         6 |     6 |
|         6 |     7 |
The only way I've managed to do this so far is by processing the list with for-loops into another list that already represents all rows of my DF, which is probably not the best way to solve this.
THX & BR
IntoNumbers
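For reference, a minimal sketch of the for-loop approach described above; it builds every row on the driver before calling createDataFrame, which is exactly what the answers below avoid:
rows = []
for sublist in [[1, 2, 3], [4, 5], [6, 7]]:
    m = min(sublist)           # minimum of the current sublist
    for v in sublist:
        rows.append((m, v))    # one (min_value, value) pair per element

df = spark.createDataFrame(rows, ['min_value', 'value'])
df.show()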
You can create a dataframe and use explode and array_min (available since Spark 2.4) to get the desired output:
import pyspark.sql.functions as F
l = [[1, 2, 3], [4, 5], [6, 7]]
df = spark.createDataFrame(
    [[l]],
    ['col']
).select(
    F.explode('col').alias('value')
).withColumn(
    'min_value',
    F.array_min('value')
).withColumn(
    'value',
    F.explode('value')
)
df.show()
+-----+---------+
|value|min_value|
+-----+---------+
| 1| 1|
| 2| 1|
| 3| 1|
| 4| 4|
| 5| 4|
| 6| 6|
| 7| 6|
+-----+---------+
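If you need the columns in the order shown in the question, a final select reorders them:
df.select('min_value', 'value').show()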
from pyspark.sql.functions import col
import pyspark.sql.functions as F
data = [[1, 2, 3], [4, 5], [6, 7]]
Compute the minimum of each list in data (the sublists here happen to be sorted, so this matches their first elements, but min is safer in general)
j = [min(item) for item in data]
Zip the minimums with data, create the df and explode the value column
df = spark.createDataFrame(zip(j, data), ['min_value', 'value']) \
    .withColumn('value', F.explode(col('value')))
df.show()
+---------+-----+
|min_value|value|
+---------+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 4| 4|
| 4| 5|
| 6| 6|
| 6| 7|
+---------+-----+
Here is my Spark-like solution to this:
Basic imports and SparkSession creation, for example purposes:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Stackoverflow problem 1") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Creating an arbitrary list of lists:
list_of_lists = [[1,3,5], [1053,23], [1,3], [5,2,15,23,5,3]]
Solution below:
# Create an rdd from the list of lists
rdd_ll = spark.sparkContext.parallelize(list_of_lists)
# Compute the min from the previous rdd and store in another rdd
rdd_min = rdd_ll.map(lambda x: min(x))
# create a dataframe by zipping the two rdds and using the explode function
df = spark.createDataFrame(rdd_min.zip(rdd_ll)) \
    .withColumn('value', F.explode('_2')) \
    .drop('_2') \
    .withColumnRenamed('_1', 'min_value')
Output
df.show()
+---------+-----+
|min_value|value|
+---------+-----+
| 1| 1|
| 1| 3|
| 1| 5|
| 23| 1053|
| 23| 23|
| 1| 1|
| 1| 3|
| 2| 5|
| 2| 2|
| 2| 15|
| 2| 23|
| 2| 5|
| 2| 3|
+---------+-----+
I have a dataframe with a primary key, date, variable, and value. I want to group by the primary key and determine if all values are equal to a provided value. Example data:
import pandas as pd
from datetime import date
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({
    "pk": [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "date": [
        date(2022, 5, 6),
        date(2022, 5, 13),
        date(2022, 5, 6),
        date(2022, 5, 6),
        date(2022, 5, 14),
        date(2022, 5, 15),
        date(2022, 5, 5),
        date(2022, 5, 5),
        date(2022, 5, 11),
        date(2022, 5, 12)
    ],
    "variable": ["A", "B", "C", "D", "A", "A", "E", "F", "A", "G"],
    "value": [2, 3, 2, 2, 1, 1, 1, 1, 5, 4]
})
df = spark.createDataFrame(df)
df.show()
#+-----+-----------+--------+-----+
#|   pk|       date|variable|value|
#+-----+-----------+--------+-----+
#|    1| 2022-05-06|       A|    2|
#|    1| 2022-05-13|       B|    3|
#|    1| 2022-05-06|       C|    2|
#|    1| 2022-05-06|       D|    2|
#|    2| 2022-05-14|       A|    1|
#|    2| 2022-05-15|       A|    1|
#|    2| 2022-05-05|       E|    1|
#|    2| 2022-05-05|       F|    1|
#|    3| 2022-05-11|       A|    5|
#|    4| 2022-05-12|       G|    4|
#+-----+-----------+--------+-----+
So given a primary key pk, how should I check whether all of its values are equal to 1 (or pass any arbitrary Boolean test)? I've tried performing an applyInPandas, but that is not super efficient, and it seems like there is probably a pretty simple method to do this.
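For context, here is a rough sketch of the applyInPandas approach mentioned above (the helper function and the schema string are assumptions based on the example data); the answer below stays in Spark SQL instead:
def all_equal_one(pdf):
    # pdf is the pandas DataFrame for one pk group; flag whether every value equals 1
    return pdf.assign(test=(pdf["value"] == 1).all())

df.groupby("pk").applyInPandas(
    all_equal_one,
    schema="pk long, date date, variable string, value long, test boolean"
).show()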
For Spark 3.0+, you could use the forall function to check whether all values collected by collect_list satisfy the boolean test.
import pyspark.sql.functions as F
df1 = (df
    .groupby("pk")
    .agg(F.expr("forall(collect_list(value), v -> v == 1)").alias("value"))
)
df1.show()
# +---+-----+
# | pk|value|
# +---+-----+
# | 1|false|
# | 3|false|
# | 2| true|
# | 4|false|
# +---+-----+
# or create a column using window function
df2 = df.withColumn("test", F.expr("forall(collect_list(value) over (partition by pk), v -> v == 1)"))
df2.show()
# +---+----------+--------+-----+-----+
# | pk| date|variable|value| test|
# +---+----------+--------+-----+-----+
# | 1|2022-05-06| A| 2|false|
# | 1|2022-05-13| B| 3|false|
# | 1|2022-05-06| C| 2|false|
# | 1|2022-05-06| D| 2|false|
# | 3|2022-05-11| A| 5|false|
# | 2|2022-05-14| A| 1| true|
# | 2|2022-05-15| A| 1| true|
# | 2|2022-05-05| E| 1| true|
# | 2|2022-05-05| F| 1| true|
# | 4|2022-05-12| G| 4|false|
# +---+----------+--------+-----+-----+
You might want to put it inside a case clause to handle NULL values.
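For instance, a minimal sketch of wrapping it in a case clause so that any group containing a NULL value yields NULL instead of true/false (count(value) only counts non-NULL values, while count(*) counts all rows):
df3 = (df
    .groupby("pk")
    .agg(F.expr(
        "case when count(value) < count(*) then null "
        "else forall(collect_list(value), v -> v == 1) end"
    ).alias("value"))
)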
I have an existing pyspark dataframe that has around 200 columns. I have a list of the column names (in the correct order and length).
How can I apply the list to the dataframe without using StructType?
Assuming the list of column names is in the right order and has a matching length, you can use toDF:
Preparing an example dataframe
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(np.random.randint(1,10,(5,4)).tolist(), list('ABCD'))
df.show()
Output
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 6| 9| 4| 7|
| 6| 4| 7| 9|
| 2| 5| 2| 2|
| 3| 7| 4| 5|
| 8| 9| 6| 8|
+---+---+---+---+
Changing the column names
newcolumns = ['new_A','new_B','new_C','new_D']
df.toDF(*newcolumns).show()
Output
+-----+-----+-----+-----+
|new_A|new_B|new_C|new_D|
+-----+-----+-----+-----+
| 6| 9| 4| 7|
| 6| 4| 7| 9|
| 2| 5| 2| 2|
| 3| 7| 4| 5|
| 8| 9| 6| 8|
+-----+-----+-----+-----+
If you have a pre-existing list of columns, it works fine:
df_list = ["newName_1", "newName_2", "newName_3", "newName_4"]
renamed_df = df.toDF(*df_list)
renamed_df.show()
But if you want to make it dynamic, deriving the new names from the existing ones instead of relying on a hand-written list, here is an alternate way of doing it (simply prefixing each name, as an example):
from pyspark.sql.functions import col
df.select([col(c).alias('new_' + c) for c in df.columns])
I have a pyspark dataframe in the following way:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1| 3|
| 2| NaN| 4|
| 3| 3| 5|
+---+----+----+
I would like to sum col1 and col2 so that the result looks like this:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4| 4|
| 3| 3| 5| 8|
+---+----+----+---+
Here's what I have tried:
import pandas as pd
import pyspark.sql.functions as F

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()
This code returns:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4|NaN| # <-- I want a 4 here, not this NaN
| 3| 3| 5| 8|
+---+----+----+---+
Can anyone help me with this?
Use F.nanvl to replace NaN with a given value (0 here):
import pyspark.sql.functions as F
result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
For the follow-up in the comments, where the sum should stay NaN only when both columns are NaN:
result = test.withColumn('sum',
    F.when(
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
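Note that nanvl handles NaN, which is what the pandas conversion produces here for the missing value; if the column instead contained SQL NULLs, coalesce would be the analogous fix:
result = test.withColumn('sum', F.coalesce(F.col('col1'), F.lit(0)) + F.col('col2'))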
I have an RDD of the form:
[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]
What I have done is
df = spark.createDataFrame(rdd, ["src", "rp"])
which gives me a column with the tuple and a column with the int, and looks like this:
+-------+-----+
|    src|   rp|
+-------+-----+
|[1, 10]|    1|
|[10, 1]|    1|
|[1, 12]|    1|
|[12, 1]|    1|
+-------+-----+
But I can't figure out how to make a src column from the first element in [x, y] and a dst column from the second element, so that I would have a dataframe with three columns src, dst and rp:
+-------+-----+-----+
|    src|  dst|   rp|
+-------+-----+-----+
|      1|   10|    1|
|     10|    1|    1|
|      1|   12|    1|
|     12|    1|    1|
+-------+-----+-----+
You need an intermediate transformation on your RDD to make it a flat list of three elements:
spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"])
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
You can just do a simple select on the dataframe to separate out the columns. No need for an intermediate transformation as the other answer suggests.
from pyspark.sql.functions import col
df = spark.createDataFrame(rdd, ["src", "rp"])
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"), col("rp"))
df.show()
Here's the result
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
I've got a dataframe like this and I want to duplicate the row n times if the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
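For background, explode produces one output row per element of an array (or map) column; a tiny standalone sketch:
from pyspark.sql.functions import explode

demo = spark.createDataFrame([(1, [10, 20, 30])], ["id", "values"])
# each element of the array becomes its own row, so id 1 appears three times
demo.withColumn("value", explode("values")).show()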
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
>>> new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"])
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use a udf to turn the value n into an array containing n copies of n
n_to_array = udf(lambda n : [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
|  A|  B|  n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
I think the udf answer by @Ahmed is the best way to go, but here is an alternative method that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
import pyspark.sql.functions as f

max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array of length max_n for each row, containing the numbers in range(max_n). This intermediate step results in a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column, and filter to keep only the values in the array that are less than n. This will ensure that we have n copies of each row. Finally we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
    .select('A', 'B', 'n', f.explode('n_array').alias('col'))\
    .where(f.col('col') < f.col('n'))\
    .drop('col')\
    .show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating a max_n-length array for each row, as opposed to just an n-length array in the udf solution. It's not immediately clear to me how this will scale vs. udf for large max_n, but I suspect the udf will win out.
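As a side note, on Spark 2.4+ the builtin sequence function can build an n-length array per row, similar in spirit to the array_repeat answer above, which avoids both the udf and the max_n-sized arrays; a sketch:
import pyspark.sql.functions as f

# sequence(1, int(n)) builds [1, ..., n] for each row; exploding it yields n
# copies of the row, and the helper column is dropped afterwards
df.withColumn('seq', f.explode(f.expr('sequence(1, int(n))'))) \
    .drop('seq') \
    .show()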