What's the most efficient way to accumulate dataframes in pyspark?

I have a dataframe (or it could be any RDD) containing several million rows in a well-known schema like this:
Key | FeatureA | FeatureB
--------------------------
U1 | 0 | 1
U2 | 1 | 1
I need to load a dozen other datasets from disk that contain different features for the same number of keys. Some datasets are up to a dozen or so columns wide. Imagine:
Key | FeatureC | FeatureD | FeatureE
-------------------------------------
U1  | 0        | 0        | 1

Key | FeatureF
--------------
U2  | 1
It feels like a fold or an accumulation where I just want to iterate over all the datasets and get back something like this:
Key | FeatureA | FeatureB | FeatureC | FeatureD | FeatureE | FeatureF
---------------------------------------------------------------------
U1 | 0 | 1 | 0 | 0 | 1 | 0
U2 | 1 | 1 | 0 | 0 | 0 | 1
I've tried loading each dataframe and then joining, but that takes forever once I get past a handful of datasets. Am I missing a common pattern or an efficient way of accomplishing this task?

Assuming there is at most one row per key in each DataFrame and all keys are of primitive types, you can try a union with an aggregation. Let's start with some imports and example data:
from itertools import chain
from functools import reduce
from pyspark.sql.types import StructType
from pyspark.sql.functions import col, lit, max
from pyspark.sql import DataFrame
df1 = sc.parallelize([
    ("U1", 0, 1), ("U2", 1, 1)
]).toDF(["Key", "FeatureA", "FeatureB"])

df2 = sc.parallelize([
    ("U1", 0, 0, 1)
]).toDF(["Key", "FeatureC", "FeatureD", "FeatureE"])

df3 = sc.parallelize([("U2", 1)]).toDF(["Key", "FeatureF"])

dfs = [df1, df2, df3]
Next we can extract a common schema:
output_schema = StructType(
    [df1.schema.fields[0]] +
    list(chain(*[df.schema.fields[1:] for df in dfs]))
)
and transform all DataFrames:
transformed_dfs = [df.select(*[
    lit(None).cast(c.dataType).alias(c.name) if c.name not in df.columns
    else col(c.name)
    for c in output_schema.fields
]) for df in dfs]
Finally, a union and a dummy aggregation:
combined = reduce(DataFrame.unionAll, transformed_dfs)
exprs = [max(c).alias(c) for c in combined.columns[1:]]
result = combined.repartition(col("Key")).groupBy(col("Key")).agg(*exprs)
If there is more than one row per key but the individual columns are still atomic, you can try replacing max with collect_list / collect_set followed by explode.
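As a rough sketch of that variation, reusing the combined DataFrame built above (collect_set and the per-column explode here are illustrative, not the only way to do it):
from pyspark.sql.functions import collect_set, explode

# one array of observed values per key and feature column
set_exprs = [collect_set(c).alias(c) for c in combined.columns[1:]]
collected = combined.groupBy("Key").agg(*set_exprs)

# explode one array column back into rows when you need its individual values
# (Spark SQL allows only one generator such as explode per select)
collected.select("Key", explode("FeatureA").alias("FeatureA")).show()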

Related

Use one data-frame (used as a dictionary) to fill in the main data-frame (Python, Pandas)

I have a central DataFrame called "cases" (5,000,000 rows × 5 columns) and a secondary DataFrame called "relevant information" (300 rows × 6 columns), which acts as a kind of dictionary for the central DataFrame.
I am trying to fill in the central DataFrame based on a common column called "Verdict_type".
If a value does not appear in the secondary DataFrame, the corresponding rows should be filled with "not_relevant".
I have tried all sorts of approaches without success and would appreciate a pointer in the right direction.
The DataFrames
import pandas as pd

# this is a mockup of the raw data
cases = [
    [1, "1", "v1"],
    [2, "2", "v2"],
    [3, "3", "v3"]
]
relevant_info = [
    ["v1", "info1"],
    ["v3", "info3"]
]

# these are the data from the screenshot
df_cases = pd.DataFrame(cases, columns=["id", "verdict_name", "verdict_type"]).set_index("id")
df_relevant_info = pd.DataFrame(relevant_info, columns=["verdict_type", "features"])
Input:
df_cases <-- note here the index marked as 'id'
df_relevant_info
# first, flatten the index of the cases (this is probably what you were missing)
df_cases = df_cases.reset_index()
# then, merge the two sets on the verdict_type
df_merge = pd.merge(df_cases, df_relevant_info, on="verdict_type", how="outer")
# finally, mark missing values as non relevant
df_merge["features"] = df_merge["features"].fillna(value="not_relevant")
Output:
merged set:
+----+------+----------------+----------------+--------------+
| | id | verdict_name | verdict_type | features |
|----+------+----------------+----------------+--------------|
| 0 | 1 | 1 | v1 | info1 |
| 1 | 2 | 2 | v2 | not_relevant |
| 2 | 3 | 3 | v3 | info3 |
+----+------+----------------+----------------+--------------+
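As an alternative sketch that leans on the "dictionary" framing of the question, you can map the secondary DataFrame directly. This starts again from the mockup df_cases / df_relevant_info defined above (before the reset_index step) and is illustrative rather than part of the original answer:
# build a verdict_type -> features lookup Series
lookup = df_relevant_info.set_index("verdict_type")["features"]

df_alt = df_cases.reset_index()
df_alt["features"] = df_alt["verdict_type"].map(lookup).fillna("not_relevant")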

row_number ranking function to filter the latest records in DF

I want to apply a Window function to a DataFrame to get only the latest metrics for every Id. For the following data I expect the df to contain only the first two records after applying a Window function.
| id | metric | transaction_date |
|----|--------|------------------|
| 1  | 0.5    | 05-10-2019       |
| 2  | 15.9   | 07-22-2020       |
| 2  | 4.7    | 11-03-2017       |
Is it a correct approach to use row_number ranking function? My current implementation looks like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = (
    df.withColumn(
        "_row_number",
        F.row_number().over(
            Window.partitionBy("id").orderBy(F.desc("transaction_date"))))
    .filter(F.col("_row_number") == 1)
    .drop("_row_number")
)
You need to first sort the dataframe by id and date (descending), then group by id. The first method on the groupby object returns the first row of each group (which has the latest date).
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'id': [1, 2, 2],
                   'metric': [0.5, 15.9, 4.7],
                   'date': [datetime(2019, 5, 10), datetime(2020, 7, 22), datetime(2017, 11, 3)]})

## sort df by id and date (descending)
df = df.sort_values(['id', 'date'], ascending=[True, False])

## return the first row of each group
df.groupby('id').first()
val fDF = Seq((1, 0.5, "05-10-2019"),
              (2, 15.9, "07-22-2020"),
              (2, 4.7, "11-03-2017"))
  .toDF("id", "metric", "transaction_date")

val f1DF = fDF
  .withColumn("transaction_date", to_date('transaction_date, "MM-dd-yyyy"))
  .orderBy('id.asc, 'transaction_date.desc)

val f2DF = f1DF.groupBy("id")
  .agg(first('transaction_date).alias("transaction_date"),
       first('metric).alias("metric"))

f2DF.show(false)
// +---+----------------+------+
// |id |transaction_date|metric|
// +---+----------------+------+
// |1 |2019-05-10 |0.5 |
// |2 |2020-07-22 |15.9 |
// +---+----------------+------+
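To answer the PySpark part of the question directly, here is a minimal self-contained sketch of the row_number approach (assuming a SparkSession named spark; the date string is parsed with to_date, following the Scala answer, so the ordering is chronological rather than lexicographic):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, 0.5, "05-10-2019"), (2, 15.9, "07-22-2020"), (2, 4.7, "11-03-2017")],
    ["id", "metric", "transaction_date"])

w = Window.partitionBy("id").orderBy(F.to_date("transaction_date", "MM-dd-yyyy").desc())

latest = (
    df.withColumn("_row_number", F.row_number().over(w))
      .filter(F.col("_row_number") == 1)
      .drop("_row_number")
)
latest.show()  # keeps (1, 0.5, 05-10-2019) and (2, 15.9, 07-22-2020)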

How do you create merge_asof functionality in PySpark?

Table A has many columns, including a date column; Table B has a datetime and a value. The data in both tables is generated sporadically with no regular interval. Table A is small; Table B is massive.
I need to join B to A under the condition that a given element a of A.datetime corresponds to
B[B['datetime'] <= a]['datetime'].max()
There are a couple ways to do this, but I would like the most efficient way.
Option 1
Broadcast the small dataset as a Pandas DataFrame. Set up a Spark UDF that creates a pandas DataFrame for each row and merges it with the large dataset using merge_asof.
Option 2
Use the broadcast join functionality of Spark SQL: set up a theta join on the following condition
B['datetime'] <= A['datetime']
Then eliminate all the superfluous rows.
Option 2 seems pretty terrible... but please let me know if the first way is efficient or if there is another way.
EDIT: Here is the sample input and expected output:
A =
+---------+----------+
| Column1 | Datetime |
+---------+----------+
| A |2019-02-03|
| B |2019-03-14|
+---------+----------+
B =
+---------+----------+
| Key | Datetime |
+---------+----------+
| 0 |2019-01-01|
| 1 |2019-01-15|
| 2 |2019-02-01|
| 3 |2019-02-15|
| 4 |2019-03-01|
| 5 |2019-03-15|
+---------+----------+
custom_join(A,B) =
+---------+----------+
| Column1 | Key |
+---------+----------+
| A | 2 |
| B | 4 |
+---------+----------+
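For reference, a rough sketch of what Option 2 could look like on the sample A and B above (assuming A and B are DataFrames with the columns shown and Datetime stored as a date or ISO-formatted string; the pruning step uses a max-of-struct aggregation, which is one way to keep only the latest B row per A row, not necessarily the most efficient):
from pyspark.sql import functions as F

# theta join: every B row whose Datetime is <= the A row's Datetime;
# broadcast() hints that the small table A should be shipped to every executor
joined = B.join(F.broadcast(A), B["Datetime"] <= A["Datetime"])

# eliminate the superfluous rows: per A row, keep the B row with the greatest Datetime
result = (
    joined.groupBy(A["Column1"], A["Datetime"])
          .agg(F.max(F.struct(B["Datetime"], B["Key"]))["Key"].alias("Key"))
          .select("Column1", "Key")
)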
You could solve it with Spark by using union and last together with a window function. Ideally you have something to partition your window by.
from pyspark.sql import functions as f
from pyspark.sql.window import Window
df1 = df1.withColumn('Key', f.lit(None))
df2 = df2.withColumn('Column1', f.lit(None))
df3 = df1.unionByName(df2)
w = Window.orderBy('Datetime', 'Column1').rowsBetween(Window.unboundedPreceding, -1)
df3.withColumn('Key', f.last('Key', True).over(w)).filter(~f.isnull('Column1')).show()
Which gives
+-------+----------+---+
|Column1| Datetime|Key|
+-------+----------+---+
| A|2019-02-03| 2|
| B|2019-03-14| 4|
+-------+----------+---+
Anyone trying to do this in PySpark 3.x can use applyInPandas. For example:
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.master("local").getOrCreate()

df1 = spark.createDataFrame(
    [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)],
    ("time", "id", "v1"))
df2 = spark.createDataFrame(
    [(20000101, 1, "x"), (20000101, 2, "y")],
    ("time", "id", "v2"))

def asof_join(l, r):
    # cogrouped pandas DataFrames for one id arrive as l (from df1) and r (from df2)
    return pd.merge_asof(l, r, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time int, id int, v1 double, v2 string"
).show()
Output:
+--------+---+---+---+
| time| id| v1| v2|
+--------+---+---+---+
|20000101| 1|1.0| x|
|20000102| 1|3.0| x|
|20000101| 2|2.0| y|
|20000102| 2|4.0| y|
+--------+---+---+---+
Figured out a fast (but perhaps not the most efficient) method to complete this. I built a helper function:
def get_close_record(df, key_column, datetime_column, record_time):
    """
    Takes in ordered dataframe and returns the closest
    record that is higher than the datetime given.
    """
    filtered_df = df[df[datetime_column] >= record_time][0:1]
    [key] = filtered_df[key_column].values.tolist()
    return key
Instead of joining B to A, I set up a pandas_udf of the above code and ran it on the columns of table B then ran groupBy on B with primary key A_key and aggregated B_key by max.
The issue with this method is that it requires monotonically increasing keys in B.
Better solution:
I developed the following helper function, which should work:
import pandas as pd
from pyspark.sql import functions as F

# duplicate the datetime into '_0' so merge_asof can join on it
# (note: merge_asof expects both frames to be sorted on the join key)
other_df['_0'] = other_df['Datetime']
bdf = sc.broadcast(other_df)

# merge asof udf
@F.pandas_udf('long')
def join_asof(v, other=bdf.value):
    f = pd.DataFrame(v)
    j = pd.merge_asof(f, other, on='_0', direction='forward')
    return j['Key']

joined = df.withColumn('Key', join_asof(F.col('Datetime')))

Apply function with string and integer from multiple columns not working

I want to create a combined string based on two columns, one is an integer and the other is a string. I need to combine them to create a string.
I've already tried using the solution from this answer here (Apply function to create string with multiple columns as argument) but it doesn't give the required output.
I have two columns: prod_no which is an integer and PROD which is a string. So something like
| prod_no | PROD  | out           |
|---------|-------|---------------|
| 1       | PRODA | #Item=1=PRODA |
| 2       | PRODB | #Item=2=PRODB |
| 3       | PRODC | #Item=3=PRODC |
To get the last column, I used the following code:
prod_list['out'] = prod_list.apply(lambda x: "#ITEM={}=={}"
                                   .format(prod_list.prod_no.astype(str), prod_list.PROD), axis=1)
I'm trying to produce the column "out", but the result of that code is weird: the output is #Item=0 1 22 3... which is very odd. I'm specifically trying to implement this using apply and lambda. However, I am biased toward efficient implementations, since I am trying to learn how to write optimized code. Please help :)
This works. The issue in your version is that the lambda references the whole columns (prod_list.prod_no and prod_list.PROD) instead of the current row's values (x["prod_no"] and x["PROD"]), so the entire Series gets formatted into every string.
import pandas as pd
df= pd.DataFrame({"prod_no": [1,2,3], "PROD": [ "PRODA", "PRODB", "PRODC" ]})
df["out"] = df.apply(lambda x: "#ITEM={}=={}".format(x["prod_no"], x["PROD"]), axis=1)
print(df)
Output:
PROD prod_no out
0 PRODA 1 #ITEM=1==PRODA
1 PRODB 2 #ITEM=2==PRODB
2 PRODC 3 #ITEM=3==PRODC
You can also try it with zip:
df=df.assign(out=['#ITEM={}=={}'.format(a,b) for a,b in zip(df.prod_no,df.PROD)])
#or directly : df.assign(out='#Item='+df.prod_no.astype(str)+'=='+df.PROD)
prod_no PROD out
0 1 PRODA #ITEM=1==PRODA
1 2 PRODB #ITEM=2==PRODB
2 3 PRODC #ITEM=3==PRODC

Maintaining column order when adding two dataframes with similar formats

I have two dataframes with similar formats. Both have 3 index/header levels. Most of the headers are the same, but df2 has a few additional ones. When I add them up, the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df format (multi-level column header; remaining values elided):
         | HeaderA             | Header2             | Header3             |
         | xxx1|xxx2|xxx3|xxx4 | xxx1|xxx2|xxx3|xxx4 | xxx1|xxx2|xxx3|xxx4 |
ColX|ColY| ColA|ColB|ColC|ColD | ColD|ColE|ColF|ColG | ColH|ColI|ColJ|ColDK|
1   | ds | 1   |    | +1 | -1  | ...
2   | dh | ...
3   | ge | ...
4   | ew | ...
5   | er | ...
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of the Oslav-only columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
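A minimal self-contained sketch of the same idea on toy single-level frames (the column names here are made up for illustration):
import pandas as pd

# toy stand-ins for Global and Oslav
global_df = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
oslav_df = pd.DataFrame({"A": [10, 20], "B": [30, 40], "C": [50, 60]})

# add() aligns on the union of columns, which comes back sorted (A, B, C)
summed = global_df.add(oslav_df, fill_value=0)

# restore Global's order (B, A) and append the Oslav-only columns (C)
ordered_cols = list(global_df.columns) + [c for c in oslav_df.columns if c not in global_df.columns]
summed = summed.reindex(columns=ordered_cols)

print(summed.columns.tolist())   # ['B', 'A', 'C']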
