How to get unique pairs of values in DataFrame - python

Given a PySpark DataFrame, how can I get all possible unique combinations of the columns col1 and col2?
I can get the unique values for a single column, but I cannot get the unique pairs of col1 and col2:
df.select('col1').distinct().rdd.map(lambda r: r[0]).collect()
I tried this, but it doesn't seem to work:
df.select(['col1','col2']).distinct().rdd.map(lambda r: r[0]).collect()

Here is a walk-through of what your attempt actually does:
>>> df = spark.createDataFrame([(1,2),(1,3),(1,2),(2,3)],['col1','col2'])
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   3|
|   1|   2|
|   2|   3|
+----+----+
>>> df.select('col1','col2').distinct().rdd.map(lambda r: r[0]).collect()  # your mapping keeps only r[0] of each row
[1, 2, 1]
>>> df.select('col1','col2').distinct().show()
+----+----+
|col1|col2|
+----+----+
|   1|   3|
|   2|   3|
|   1|   2|
+----+----+
>>> df.select('col1','col2').distinct().rdd.map(lambda r:(r[0],r[1])).collect()
[(1, 3), (2, 3), (1, 2)]

Alternatively, drop_duplicates (an alias of dropDuplicates) works directly on the DataFrame:
`df[['col1', 'col2']].drop_duplicates()`
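For completeness, a minimal sketch (using the same df as above) that collects the distinct pairs back to the driver as plain tuples rather than Row objects:
pairs = [tuple(row) for row in df[['col1', 'col2']].drop_duplicates().collect()]
# e.g. [(1, 3), (2, 3), (1, 2)] -- row order is not guaranteed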

Related

How to generate the max values for new columns in PySpark dataframe?

Suppose I have a pyspark dataframe df.
+---+---+
|  a|  b|
+---+---+
|  1|200|
|  2|300|
|  4| 50|
+---+---+
I'd like to add a new column c, where
column c = max(0, column b - 100)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|200|100|
|  2|300|200|
|  4| 50|  0|
+---+---+---+
How should I generate the new column c in pyspark dataframe? Thanks in advance!
Hope you are looking for something like this:
from pyspark.sql.functions import col, lit, greatest

df = spark.createDataFrame(
    [
        (1, 200),
        (2, 300),
        (4, 50),
    ],
    ["a", "b"],
)

df_new = df.withColumn("c", greatest(lit(0), col("b") - lit(100)))
df_new.show()
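If you prefer an explicit conditional over greatest, a minimal equivalent sketch using when/otherwise (same df as above):
from pyspark.sql.functions import col, when

df_new = df.withColumn("c", when(col("b") - 100 > 0, col("b") - 100).otherwise(0))
df_new.show()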

How to iterate over a pyspark dataframe and create a dictionary out of it

I have the following pyspark dataframe:
import pandas as pd

foo = pd.DataFrame({'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'time': [1, 2, 3, 4, 1, 2, 3, 4],
                    'col': ['1', '2', '1', '2', '3', '2', '3', '2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
|  a|   1|  1|
|  a|   2|  2|
|  a|   3|  1|
|  a|   4|  2|
|  b|   1|  3|
|  b|   2|  2|
|  b|   3|  3|
|  b|   4|  2|
+---+----+---+
I would like to iterate over all ids and obtain a Python dictionary that has the ids as keys and the col values as values, like this:
foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']}
I have in total 10k ids and around 10m rows in foo, so I am looking for an efficient implementation.
Any ideas ?
It's a pandas dataframe. You should check out the documentation. The dataframe object has built-in methods to help you iterate, slice, and dice your data.
pandas has a ready-made method, to_dict, to convert a dataframe to a dict.
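If you want to stay in Spark and only collect the final result, here is a hedged sketch using collect_list. Note that collect_list alone gives no ordering guarantee after a shuffle, so this carries the time column into the aggregation and sorts on it:
import pyspark.sql.functions as F

# collect (time, col) pairs per id, sort by time, then keep only col
grouped = foo_df.groupBy('id').agg(
    F.sort_array(F.collect_list(F.struct('time', 'col'))).alias('pairs'))
foo_dict = {row['id']: [p['col'] for p in row['pairs']] for row in grouped.collect()}
# {'a': ['1', '2', '1', '2'], 'b': ['3', '2', '3', '2']}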

Aggregate GroupBy columns with "all"-like function pyspark

I have a dataframe with a primary key, date, variable, and value. I want to group by the primary key and determine if all values are equal to a provided value. Example data:
import pandas as pd
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({
    "pk": [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "date": [
        date(2022, 5, 6),
        date(2022, 5, 13),
        date(2022, 5, 6),
        date(2022, 5, 6),
        date(2022, 5, 14),
        date(2022, 5, 15),
        date(2022, 5, 5),
        date(2022, 5, 5),
        date(2022, 5, 11),
        date(2022, 5, 12),
    ],
    "variable": ["A", "B", "C", "D", "A", "A", "E", "F", "A", "G"],
    "value": [2, 3, 2, 2, 1, 1, 1, 1, 5, 4],
})
df = spark.createDataFrame(df)
df.show()
#+---+----------+--------+-----+
#| pk|      date|variable|value|
#+---+----------+--------+-----+
#|  1|2022-05-06|       A|    2|
#|  1|2022-05-13|       B|    3|
#|  1|2022-05-06|       C|    2|
#|  1|2022-05-06|       D|    2|
#|  2|2022-05-14|       A|    1|
#|  2|2022-05-15|       A|    1|
#|  2|2022-05-05|       E|    1|
#|  2|2022-05-05|       F|    1|
#|  3|2022-05-11|       A|    5|
#|  4|2022-05-12|       G|    4|
#+---+----------+--------+-----+
So if I want to know whether, given a primary key pk, all the values are equal to 1 (or pass any arbitrary Boolean test), how should I do this? I've tried performing an applyInPandas, but that is not super efficient, and it seems like there is probably a pretty simple method to do this.
For Spark 3+, you could use the forall function to check whether all the values collected by collect_list satisfy the Boolean test.
import pyspark.sql.functions as F

df1 = (df
    .groupby("pk")
    .agg(F.expr("forall(collect_list(value), v -> v == 1)").alias("value"))
)
df1.show()
# +---+-----+
# | pk|value|
# +---+-----+
# |  1|false|
# |  3|false|
# |  2| true|
# |  4|false|
# +---+-----+
# or create a column using window function
df2 = df.withColumn("test", F.expr("forall(collect_list(value) over (partition by pk), v -> v == 1)"))
df2.show()
# +---+----------+--------+-----+-----+
# | pk|      date|variable|value| test|
# +---+----------+--------+-----+-----+
# |  1|2022-05-06|       A|    2|false|
# |  1|2022-05-13|       B|    3|false|
# |  1|2022-05-06|       C|    2|false|
# |  1|2022-05-06|       D|    2|false|
# |  3|2022-05-11|       A|    5|false|
# |  2|2022-05-14|       A|    1| true|
# |  2|2022-05-15|       A|    1| true|
# |  2|2022-05-05|       E|    1| true|
# |  2|2022-05-05|       F|    1| true|
# |  4|2022-05-12|       G|    4|false|
# +---+----------+--------+-----+-----+
You might want to wrap it in a CASE expression to handle NULL values.
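If you are on Spark 2.x, where forall is not available, a hedged alternative is to aggregate the per-row test with min: a group's values are all 1 exactly when the smallest test result is 1.
df3 = (df
    .groupby("pk")
    .agg((F.min((F.col("value") == 1).cast("int")) == 1).alias("value"))
)
df3.show()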

How to stack two columns into a single one in PySpark?

I have the following PySpark DataFrame:
id col1 col2
A 2 3
A 2 4
A 4 6
B 1 2
I want to stack col1 and col2 in order to get a single column as follows:
id col3
A 2
A 3
A 4
A 6
B 1
B 2
How can I do so?
df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)
The simplest approach is to merge col1 and col2 into an array column and then explode it:
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+
df.selectExpr('id', 'explode(array(col1, col2))').show()
+---+---+
| id|col|
+---+---+
|  A|  2|
|  A|  3|
|  A|  2|
|  A|  4|
|  A|  4|
|  A|  6|
|  B|  1|
|  B|  2|
+---+---+
You can drop the duplicates afterwards if you don't need them.
Alternatively, group by "id", collect the lists from both "col1" and "col2" in an aggregation, and then explode the result back into one column.
To get the unique numbers, just drop the duplicates afterwards.
Your end result also has the numbers sorted; this is done by sorting the concatenated lists in the aggregation.
The following code:
from pyspark.sql.functions import concat, collect_list, explode, col, sort_array

df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)
result = (df.groupBy("id")
    .agg(sort_array(concat(collect_list("col1"), collect_list("col2"))).alias("all_numbers"))
    .orderBy("id")
    .withColumn('number', explode(col('all_numbers')))
    .dropDuplicates()
    .select("id", "number"))
result.show()
will yield:
+---+------+
| id|number|
+---+------+
|  A|     2|
|  A|     3|
|  A|     4|
|  A|     6|
|  B|     1|
|  B|     2|
+---+------+
A rather simple solution, if the number of columns involved is small:
df = (
    sc.parallelize([
        ('A', 2, 3), ('A', 2, 4), ('A', 4, 6),
        ('B', 1, 2),
    ]).toDF(["id", "col1", "col2"])
)
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  A|   2|   3|
|  A|   2|   4|
|  A|   4|   6|
|  B|   1|   2|
+---+----+----+
df1 = df.select(['id', 'col1'])
df2 = df.select(['id', 'col2']).withColumnRenamed('col2', 'col1')
df_new = df1.union(df2)
df_new = df_new.drop_duplicates()
df_new.show()
+---+----+
| id|col1|
+---+----+
|  A|   3|
|  A|   4|
|  B|   1|
|  A|   6|
|  A|   2|
|  B|   2|
+---+----+
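A third option, not shown in the answers above, is the SQL stack() function, which unpivots both columns in a single pass; a minimal sketch:
df.selectExpr("id", "stack(2, col1, col2) as (col3)") \
    .dropDuplicates() \
    .orderBy("id", "col3") \
    .show()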

PySpark: Add a column to DataFrame when column is a list

I have read similar questions but couldn't find a solution to my specific problem.
I have a list
l = [1, 2, 3]
and a DataFrame
df = sc.parallelize([
    ['p1', 'a'],
    ['p2', 'b'],
    ['p3', 'c'],
]).toDF(('product', 'name'))
I would like to obtain a new DataFrame where the list l is added as a further column, namely
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
|     p1|   a|      1|
|     p2|   b|      2|
|     p3|   c|      3|
+-------+----+-------+
Approaches with JOIN, where I was joining df with an
sc.parallelize([[1], [2], [3]])
have failed. Approaches using withColumn, as in
new_df = df.withColumn('new_col', l)
have failed because the list is not a Column object.
So, from reading some interesting stuff here, I've ascertained that you can't really just append a random / arbitrary column to a given DataFrame object. It appears what you want is more of a zip than a join. I looked around and found this ticket, which makes me think you won't be able to zip given that you have DataFrame rather than RDD objects.
The only way I've been able to solve your issue involves leaving the world of DataFrame objects and returning to RDD objects. I've also needed to create an index for the purpose of the join, which may or may not work with your use case.
l = sc.parallelize([1, 2, 3])
index = sc.parallelize(range(0, l.count()))
z = index.zip(l)
rdd = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']])
rdd_index = index.zip(rdd)
# just in case!
assert(rdd.count() == l.count())
# perform an inner join on the index we generated above, then map it to look pretty.
new_rdd = rdd_index.join(z).map(lambda kv: [kv[1][0][0], kv[1][0][1], kv[1][1]])
new_df = new_rdd.toDF(["product", 'name', 'new_col'])
When I run new_df.show(), I get:
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
|     p1|   a|      1|
|     p2|   b|      2|
|     p3|   c|      3|
+-------+----+-------+
Sidenote: I'm really surprised this didn't work. A join with no condition behaves like a cross join:
from pyspark.sql import Row
l = sc.parallelize([1, 2, 3])
new_row = Row("new_col_name")
l_as_df = l.map(new_row).toDF()
new_df = df.join(l_as_df)
When I run new_df.show(), I get:
+-------+----+------------+
|product|name|new_col_name|
+-------+----+------------+
|     p1|   a|           1|
|     p1|   a|           2|
|     p1|   a|           3|
|     p2|   b|           1|
|     p3|   c|           1|
|     p2|   b|           2|
|     p2|   b|           3|
|     p3|   c|           2|
|     p3|   c|           3|
+-------+----+------------+
If the product column is unique then consider the following approach:
original dataframe:
df = spark.sparkContext.parallelize([
    ['p1', 'a'],
    ['p2', 'b'],
    ['p3', 'c'],
]).toDF(('product', 'name'))
df.show()
+-------+----+
|product|name|
+-------+----+
|     p1|   a|
|     p2|   b|
|     p3|   c|
+-------+----+
new column (and new index column):
lst = [1, 2, 3]
indx = ['p1','p2','p3']
create a new dataframe from the list above (with an index):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

myschema = StructType([
    StructField("indx", StringType(), True),
    StructField("newCol", IntegerType(), True),
])
df1 = spark.createDataFrame(zip(indx, lst), schema=myschema)
df1.show()
+----+------+
|indx|newCol|
+----+------+
|  p1|     1|
|  p2|     2|
|  p3|     3|
+----+------+
join this to the original dataframe, using the index created:
dfnew = df.join(df1, df.product == df1.indx, how='left')\
    .drop(df1.indx)\
    .sort("product")
to get:
dfnew.show()
+-------+----+------+
|product|name|newCol|
+-------+----+------+
|     p1|   a|     1|
|     p2|   b|     2|
|     p3|   c|     3|
+-------+----+------+
This is achievable via RDDs.
1. Convert the dataframes to indexed RDDs:
df_rdd = df.rdd.zipWithIndex().map(lambda row: (row[1], (row[0][0], row[0][1])))
l_rdd = sc.parallelize(l).zipWithIndex().map(lambda row: (row[1], row[0]))
2. Join the two RDDs on the index, drop the index, and rearrange the elements:
res_rdd = df_rdd.join(l_rdd).map(lambda row: [row[1][0][0], row[1][0][1], row[1][1]])
3. Convert the result to a DataFrame:
res_df = res_rdd.toDF(['product', 'name', 'new_col'])
res_df.show()
+-------+----+-------+
|product|name|new_col|
+-------+----+-------+
|     p1|   a|      1|
|     p2|   b|      2|
|     p3|   c|      3|
+-------+----+-------+
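For reference, a DataFrame-only sketch of the same idea using row_number on both sides (my addition; it assumes the DataFrame's current row order is the order you want, since monotonically_increasing_id guarantees increasing ids but not contiguous ones):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# materialize an increasing id, then turn it into a contiguous 1-based index
df_i = df.withColumn("mid", F.monotonically_increasing_id())
df_i = df_i.withColumn("idx", F.row_number().over(Window.orderBy("mid"))).drop("mid")
l_df = spark.createDataFrame([(i + 1, v) for i, v in enumerate(l)], ["idx", "new_col"])
new_df = df_i.join(l_df, "idx").drop("idx")
new_df.show()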
