How do you create merge_asof functionality in PySpark?

Table A has many columns, including a date column. Table B has a datetime and a value. The data in both tables is generated sporadically with no regular interval. Table A is small, table B is massive.
I need to join B to A under the condition that a given element a of A.datetime corresponds to
B[B['datetime'] <= a]['datetime'].max()
There are a couple ways to do this, but I would like the most efficient way.
Option 1
Broadcast the small dataset as a Pandas DataFrame. Set up a Spark UDF that creates a pandas DataFrame for each row and merges it with the large dataset using merge_asof.
Option 2
Use the broadcast join functionality of Spark SQL: set up a theta join on the following condition
B['datetime'] <= A['datetime']
Then eliminate all the superfluous rows.
Option 2 seems pretty terrible... but please let me know if the first way is efficient or if there is another way.
EDIT: Here is the sample input and expected output:
A =
+---------+----------+
| Column1 | Datetime |
+---------+----------+
| A |2019-02-03|
| B |2019-03-14|
+---------+----------+
B =
+---------+----------+
| Key | Datetime |
+---------+----------+
| 0 |2019-01-01|
| 1 |2019-01-15|
| 2 |2019-02-01|
| 3 |2019-02-15|
| 4 |2019-03-01|
| 5 |2019-03-15|
+---------+----------+
custom_join(A,B) =
+---------+----------+
| Column1 | Key |
+---------+----------+
| A | 2 |
| B | 4 |
+---------+----------+
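For reference, here is a rough sketch of what Option 2 could look like on the sample data above (illustrative only: it assumes an existing SparkSession named spark, that Column1 uniquely identifies the rows of A, and that the ISO-formatted date strings compare correctly as strings):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

A = spark.createDataFrame([("A", "2019-02-03"), ("B", "2019-03-14")],
                          ["Column1", "Datetime"])
B = spark.createDataFrame([(0, "2019-01-01"), (1, "2019-01-15"), (2, "2019-02-01"),
                           (3, "2019-02-15"), (4, "2019-03-01"), (5, "2019-03-15")],
                          ["Key", "Datetime"])

# non-equi (theta) join with the small table broadcast explicitly
B2 = B.withColumnRenamed("Datetime", "B_Datetime")
joined = B2.join(F.broadcast(A), B2["B_Datetime"] <= A["Datetime"])

# "eliminate all the superfluous rows": keep only the latest B row per A row
w = Window.partitionBy("Column1").orderBy(F.col("B_Datetime").desc())
(joined
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .select("Column1", "Key")
    .show())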

You could solve it with Spark by using union and last together with a window function. Ideally you have something to partition your window by.
from pyspark.sql import functions as f
from pyspark.sql.window import Window
df1 = df1.withColumn('Key', f.lit(None))
df2 = df2.withColumn('Column1', f.lit(None))
df3 = df1.unionByName(df2)
w = Window.orderBy('Datetime', 'Column1').rowsBetween(Window.unboundedPreceding, -1)
df3.withColumn('Key', f.last('Key', True).over(w)).filter(~f.isnull('Column1')).show()
Which gives
+-------+----------+---+
|Column1| Datetime|Key|
+-------+----------+---+
| A|2019-02-03| 2|
| B|2019-03-14| 4|
+-------+----------+---+
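If the real data does have something to partition by, the same window can be scoped to it so the sort does not happen in a single partition. A minimal sketch, assuming a hypothetical group column present in both frames before the union:
w = (Window.partitionBy('group')
           .orderBy('Datetime', 'Column1')
           .rowsBetween(Window.unboundedPreceding, -1))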

Anyone trying to do this in PySpark 3.x can use applyInPandas.
For example:
from pyspark.sql import SparkSession, Row, DataFrame
import pandas as pd
spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.createDataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
    ("time", "id", "v1"))
df2 = spark.createDataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

def asof_join(l, r):
    return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string"
).show()
Output:
+--------+---+---+---+
| time| id| v1| v2|
+--------+---+---+---+
|20000101| 1|1.0| x|
|20000102| 1|3.0| x|
|20000101| 2|2.0| y|
|20000102| 2|4.0| y|
+--------+---+---+---+
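One way to adapt the same cogroup pattern to the A/B tables from the original question is sketched below (assumptions: the spark session and the A/B dataframes from the Option 2 sketch near the top of this page, a constant grp column because there is no natural "by" key, and Datetime parsed to a timestamp so that merge_asof accepts it):
import pandas as pd
from pyspark.sql import functions as F

# no natural "by" key here, so add a constant grouping column
A_ts = A.withColumn("Datetime", F.to_timestamp("Datetime")).withColumn("grp", F.lit(1))
B_ts = B.withColumn("Datetime", F.to_timestamp("Datetime")).withColumn("grp", F.lit(1))

def asof_join(left, right):
    # merge_asof needs both sides sorted by the key; the default
    # direction="backward" picks the latest B row with Datetime <= A's Datetime
    return pd.merge_asof(left.sort_values("Datetime"),
                         right.sort_values("Datetime"),
                         on="Datetime", by="grp")

A_ts.groupby("grp").cogroup(B_ts.groupby("grp")).applyInPandas(
    asof_join, schema="Column1 string, Datetime timestamp, grp int, Key long"
).select("Column1", "Key").show()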

Figured out a fast (but perhaps not the most efficient) method to complete this. I built a helper function:
def get_close_record(df, key_column, datetime_column, record_time):
    """
    Takes in ordered dataframe and returns the closest
    record that is higher than the datetime given.
    """
    filtered_df = df[df[datetime_column] >= record_time][0:1]
    [key] = filtered_df[key_column].values.tolist()
    return key
Instead of joining B to A, I set up a pandas_udf from the above code, ran it on the columns of table B, then ran groupBy on B with primary key A_key and aggregated B_key by max.
The issue with this method is that it requires monotonically increasing keys in B.
Better solution:
I developed the following helper function that should work:
import pandas as pd
from pyspark.sql import functions as F

other_df['_0'] = other_df['Datetime']
bdf = sc.broadcast(other_df)

# merge_asof udf
@F.pandas_udf('long')
def join_asof(v, other=bdf.value):
    f = pd.DataFrame(v)
    j = pd.merge_asof(f, other, on='_0', direction='forward')
    return j['Key']

joined = df.withColumn('Key', join_asof(F.col('Datetime')))

Related

Replace values of each array in pyspark dataframe array column by their corresponding ids

I have a pyspark.sql dataframe that looks like this:
| id | name | refs    |
| 1  | A    | B, C, D |
| 2  | B    | A       |
| 3  | C    | A, B    |
I'm trying to build a function that replaces the values of each array in refs with the id of the name it references; if there's no matching name in the name column, it would ideally filter that value out or set it to null. The results would ideally look something like this:
| id | name | refs |
| 1  | A    | 2, 3 |
| 2  | B    | 1    |
| 3  | C    | 1, 2 |
I tried doing this by defining a UDF that collects all names from the table and then obtains the indices of the intersection between each refs array and the set of all names. It works but is extremely slow; I'm sure there are better ways to do this using Spark and/or SQL.
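For reference, a rough reconstruction of the kind of UDF described above (the helper name, the collected mapping, and the long id type are assumptions, not the original code):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

# collect all name -> id pairs to the driver, as described above
name_to_id = {row["name"]: row["id"] for row in df.select("id", "name").collect()}

@F.udf(ArrayType(LongType()))
def refs_to_ids(refs):
    # keep only refs that match a known name, mapped to that name's id
    return [name_to_id[n] for n in (refs or []) if n in name_to_id]

df.withColumn("refs", refs_to_ids("refs")).show()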
You can explode the arrays, do a self-join using the exploded ref and name, and collect the joined ids back to an array using collect_list.
import pyspark.sql.functions as F
df1 = df.select('id', 'name', F.explode('refs').alias('refs'))
df2 = df.toDF('id2', 'name2', 'refs2')
result = df1.join(df2, df1.refs == df2.name2) \
    .select('id', 'name', 'id2') \
    .groupBy('id', 'name') \
    .agg(F.collect_list('id2').alias('refs'))
result.show()
+---+----+------+
| id|name| refs|
+---+----+------+
| 1| A|[2, 3]|
| 2| B| [1]|
| 3| C|[1, 2]|
+---+----+------+

pyspark dataframe to extract each distinct word from a column of string and put them into a new dataframe

I am trying to find all the distinct words in a string column of a pyspark dataframe.
the input df:
id val
1 "book bike car"
15 "car TV bike"
I need an output df like: (the word_index value is auto-increment index and the order of values in "val_new" is random.)
val_new word_index
TV 1
car 2
bike 3
book 4
My code :
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType
import re
def my_f(col):
    if not col:
        return
    s = ''
    if isinstance(col, str):
        s = re.sub('[^a-zA-Z0-9]+', ' ', col).split()
    return s
my_udf = F.udf(my_f, ArrayType(StringType()))
df = spark.createDataFrame([(1, 'book bike car'), (18, 'car TV bike')], ['id', 'val'])
df = df.withColumn('val_new', my_udf(F.col('val')))
I have converted the string to an array, but how do I extract the words from each row, remove duplicates, and create a new dataframe with the two new columns?
I do not want to use groupBy and aggregate because the dataframe may be large and I do not need the "id" column and any duplicates of "val".
thanks
This can be a working solution for you - use Spark built-in functions instead of a UDF, which can eventually make your application slow. The functions
explode() and groupBy() with collect_set() will help you achieve the desired result.
Create the DF Here
import pyspark.sql.functions as F
from pyspark.sql import Window as W

df = spark.createDataFrame([(1, 'book bike car'), (18, 'car TV bike')], ['id', 'val'])
df = df.withColumn("dummy_col", F.lit(1))
df.show()
+---+-------------+---------+
| id| val|dummy_col|
+---+-------------+---------+
| 1|book bike car| 1|
| 18| car TV bike| 1|
+---+-------------+---------+
Logic Here
# split the string column into an array of words
df = df.withColumn("array_col", F.split("val", " "))
# collect_set returns an array (of word arrays here) without duplicates
df_grp = df.groupBy("dummy_col").agg(F.collect_set("array_col").alias("array_col"))
# explode twice: first the collected set of arrays, then the words themselves
df_grp = df_grp.withColumn("explode_col", F.explode("array_col"))
df_grp = df_grp.withColumn("explode_col", F.explode("explode_col"))
# distinct to remove the duplicate words
df_grp = df_grp.select("explode_col").distinct()
#another dummy column to create the row number
df_grp = df_grp.withColumn("dummy_col", F.lit("A"))
_w = W.partitionBy("dummy_col").orderBy("dummy_col")
df_grp = df_grp.withColumn("rnk", F.row_number().over(_w))
df_grp.show(truncate=False)
Final Output
+-----------+---------+---+
|explode_col|dummy_col|rnk|
+-----------+---------+---+
|TV |A |1 |
|car |A |2 |
|bike |A |3 |
|book |A |4 |
+-----------+---------+---+
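A more compact variant of the same idea is sketched below (illustrative; it still sorts all distinct words in a single partition for the row_number, and orders them alphabetically rather than randomly):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

words = df.select(F.explode(F.split("val", " ")).alias("val_new")).distinct()
words = words.withColumn("word_index",
                         F.row_number().over(Window.orderBy("val_new")))
words.show()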

row_number ranking function to filter the latest records in DF

I want to apply a Window function to a DataFrame to get only the latest metrics for every Id. For the following data I expect the df to contain only the first two records after applying a Window function.
| id | metric | transaction_date |
| 1 | 0.5 | 05-10-2019 |
| 2 | 15.9 | 07-22-2020 |
| 2 | 4.7 | 11-03-2017 |
Is it a correct approach to use the row_number ranking function? My current implementation looks like this:
(df
    .withColumn(
        "_row_number",
        F.row_number().over(
            Window.partitionBy("id").orderBy(F.desc("transaction_date"))))
    .filter(F.col("_row_number") == 1)
    .drop("_row_number"))
You need to first sort the dataframe by id and date (descending). Then you group by id. The first() method on the groupby object will return the first row of each group (which has the latest date).
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'id': [1, 2, 2],
                   'metric': [0.5, 15.9, 4.7],
                   'date': [datetime(2019,5,10), datetime(2020,7,22), datetime(2017,11,3)]})
## sort df by id and date
df = df.sort_values(['id','date'], ascending= [True, False])
## return the first row of each group
df.groupby('id').first()
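For completeness, a PySpark sketch of the same sort-then-first idea (illustrative; note that first() is documented as non-deterministic once a shuffle happens, so the row_number approach from the question is the safer choice):
from pyspark.sql import functions as F

result = (df
    .withColumn("transaction_date", F.to_date("transaction_date", "MM-dd-yyyy"))
    .orderBy(F.col("id").asc(), F.col("transaction_date").desc())
    .groupBy("id")
    .agg(F.first("transaction_date").alias("transaction_date"),
         F.first("metric").alias("metric")))
result.show()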
val fDF = Seq((1, 0.5, "05-10-2019"),
              (2, 15.9, "07-22-2020"),
              (2, 4.7, "11-03-2017"))
  .toDF("id", "metric", "transaction_date")

val f1DF = fDF
  .withColumn("transaction_date", to_date('transaction_date, "MM-dd-yyyy"))
  .orderBy('id.asc, 'transaction_date.desc)

val f2DF = f1DF.groupBy("id")
  .agg(first('transaction_date).alias("transaction_date"),
       first('metric).alias("metric"))

f2DF.show(false)
// +---+----------------+------+
// |id |transaction_date|metric|
// +---+----------------+------+
// |1 |2019-05-10 |0.5 |
// |2 |2020-07-22 |15.9 |
// +---+----------------+------+

Selecting row based on column value in duplicated entries on different column in PySpark

I have a PySpark DataFrame which I group on a field (column) with the purpose of eliminating, per each group, the records which have a certain value in another field.
So for instance, the table looks like
colA colB
'a' 1
'b' 1
'a' 0
'c' 0
Here what I'd like is to remove the records where colA is duplicated and colB is 0, so as to obtain
colA colB
'a' 1
'b' 1
'c' 0
The row for 'c' remains because I want to remove the 0s only for the rows duplicated on colA.
I can't think of a way to achieve this because I'm not familiar with how to use agg after a groupBy when the expression is not one of "avg", "max", etc.
How about simple max?
from pyspark.sql.functions import max as max_
df = sc.parallelize([
('a', 1), ('b', 1), ('a', 0), ('c', 0)
]).toDF(('colA', 'colB'))
df.groupBy('colA').agg(max_('colB')).show()
## +----+---------+
## |colA|max(colB)|
## +----+---------+
## | a| 1|
## | b| 1|
## | c| 0|
## +----+---------+
This approach should work for any column which supports ordering and uses binary labels, with an optional adjustment of the aggregate function (min / max).
It is possible to implement more advanced rules using window functions, but it will be more expensive.
Nevertheless, here is an example:
from pyspark.sql.functions import col, sum as sum_, when
from pyspark.sql import Window
import sys
w = Window.partitionBy("colA").rowsBetween(-sys.maxsize, sys.maxsize)
this_non_zero = col("colB") != 0
any_non_zero = sum_(this_non_zero.cast("long")).over(w) != 0
(df
.withColumn("this_non_zero", this_non_zero)
.withColumn("any_non_zero", any_non_zero)
.where(
(col("this_non_zero") & col("any_non_zero")) |
~col("any_non_zero")
))

What's the most efficient way to accumulate dataframes in pyspark?

I have a dataframe (or it could be any RDD) containing several million rows in a well-known schema like this:
Key | FeatureA | FeatureB
--------------------------
U1 | 0 | 1
U2 | 1 | 1
I need to load a dozen other datasets from disk that contain different features for the same set of keys. Some datasets are up to a dozen or so columns wide. Imagine:
Key | FeatureC | FeatureD | FeatureE
-------------------------------------
U1 | 0 | 0 | 1
Key | FeatureF
--------------
U2 | 1
It feels like a fold or an accumulation where I just want to iterate all the datasets and get back something like this:
Key | FeatureA | FeatureB | FeatureC | FeatureD | FeatureE | FeatureF
---------------------------------------------------------------------
U1 | 0 | 1 | 0 | 0 | 1 | 0
U2 | 1 | 1 | 0 | 0 | 0 | 1
I've tried loading each dataframe then joining but that takes forever once I get past a handful of datasets. Am I missing a common pattern or efficient way of accomplishing this task?
Assuming there is at most one row per key in each DataFrame and all keys are of primitive types, you can try a union with an aggregation. Let's start with some imports and example data:
from itertools import chain
from functools import reduce
from pyspark.sql.types import StructType
from pyspark.sql.functions import col, lit, max
from pyspark.sql import DataFrame
df1 = sc.parallelize([
("U1", 0, 1), ("U2", 1, 1)
]).toDF(["Key", "FeatureA", "FeatureB"])
df2 = sc.parallelize([
("U1", 0, 0, 1)
]).toDF(["Key", "FeatureC", "FeatureD", "FeatureE"])
df3 = sc.parallelize([("U2", 1)]).toDF(["Key", "FeatureF"])
dfs = [df1, df2, df3]
Next we can extract common schema:
output_schema = StructType(
[df1.schema.fields[0]] + list(chain(*[df.schema.fields[1:] for df in dfs]))
)
and transform all DataFrames:
transformed_dfs = [df.select(*[
lit(None).cast(c.dataType).alias(c.name) if c.name not in df.columns
else col(c.name)
for c in output_schema.fields
]) for df in dfs]
Finally, a union and a dummy aggregation:
combined = reduce(DataFrame.unionAll, transformed_dfs)
exprs = [max(c).alias(c) for c in combined.columns[1:]]
result = combined.repartition(col("Key")).groupBy(col("Key")).agg(*exprs)
If there is more than one row per key but individual columns are still atomic you can try to replace max with collect_list / collect_set followed by explode.
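A sketch of that collect_list variant, reusing combined from the snippet above (illustrative only):
from pyspark.sql.functions import col, collect_list, explode

exprs = [collect_list(c).alias(c) for c in combined.columns[1:]]
nested = combined.groupBy(col("Key")).agg(*exprs)

# each feature column is now an array per key; explode whichever column you
# need back into rows, for example:
nested.select("Key", explode("FeatureA").alias("FeatureA")).show()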
