Good Afternoon.
I am trying to perform a join in Pyspark that uses a complex set of conditions to produce a single value.
A minimum example of what I am trying to achieve could look like the following. Imagine a set of events that can occur at discrete times (between t=0 and t=40). Each event has a set of three independent boolean properties that describe the nature of the event. There is some time-dependent value associated with the triggering of each property, contained in a lookup table. For each event, I would like to determine the sum of all the relevant values for that event.
My first dataframe, df_1, is a list of the events, the time at which the event occured, and had a selection of boolean properties associated with it:
+-------------+------------+------------+------------+------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 |
+-------------+------------+------------+------------+------------+
| Event_1 | 13 | 1 | 0 | 1 |
| Event_2 | 24 | 0 | 1 | 1 |
| Event_3 | 35 | 1 | 0 | 0 |
+-------------+------------+------------+------------+------------+
The second dataframe, df_2, is the lookup table that describes the associated value of having a TRUE for a particular property at a particular time. Since there are many repeated values across all the time buckets, the format of this dataframe is an inclusive range of times for which the property has the specific value. The time ranges are not consistently sized and can vary wildly between different properties:
+------------+----------+---------------+-------+
| START_TIME | END_TIME | PROPERTY_NAME | VALUE |
+------------+----------+---------------+-------+
| 0 | 18 | PROPERTY_1 | 0.1 |
| 19 | 40 | PROPERTY_1 | 0.8 |
| 0 | 20 | PROPERTY_2 | 0.7 |
| 20 | 24 | PROPERTY_2 | 0.3 |
| 25 | 40 | PROPERTY_2 | 0.7 |
| 0 | 40 | PROPERTY_3 | 0.5 |
+------------+----------+---------------+-------+
Desired Output:
Since Event_1 occured at time t=13, with PROPERTY_1 and PROPERTY_3 triggered, the expected sum of the values according to df_2 should be 0.1 (from the PROPERTY_1 0-18 bucket) + 0.5 (from the PROPERTY_3 0-40 bucket) = 0.6. In the same way, Event_2 should have a value of 0.3 (remember that bucket start/end times are inclusive, so this comes from the 20-24 bucket) + 0.5 = 0.8. Finally, Event_3 = 0.8.
+-------------+------------+------------+------------+------------+-------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 | TOTAL_VALUE |
+-------------+------------+------------+------------+------------+-------------+
| Event_1 | 13 | 1 | 0 | 1 | 0.6 |
| Event_2 | 24 | 0 | 1 | 1 | 0.8 |
| Event_3 | 35 | 1 | 0 | 0 | 0.8 |
+-------------+------------+------------+------------+------------+-------------+
For my initial test dataset, in the event dataframe df_1 there are ~20,000 events spread over 2000 time buckets. Each event has ~44 properties and the length of the lookup table df_2 is ~53,000. As I would like to expand this process out to significantly more data (a couple of orders of magnitude potentially), I am very interested in a parallelisable solution to this problem. For instance, I feel like summarising df_2 as a python dictionary and broadcasting that to my executors will not be possible given the volume of data.
Since I'm trying to add a single column to each row in df_1, I have tried to accomplish the task using a nested map that looks similar to the following:
def calculate_value(df_2):
def _calculate_value(row):
row_dict = row.asDict()
rolling_value = 0.0
for property_name in [key for key in row_dict.keys() if "PROPERTY" in key]:
additional_value = (
df_2
.filter(
(pyspark.sql.functions.col("PROPERTY_NAME") == property_name)
& (pyspark.sql.functions.col("START_BUCKET") <= row_dict["EVENT_TIME"])
& (pyspark.sql.functions.col("END_BUCKET") >= row_dict["EVENT_TIME"])
)
.select("VALUE")
.collect()
)[0][0]
rolling_value += additional_value
return pyspark.sql.Row(**row_dict)
return _calculate_value
This code is able to perform the join on the driver (by running calculate_value(df_2)(df_1.rdd.take(1)[0])), however when I try to perform the parallelised map:
(
df_1
.rdd
.map(calculate_value(df_2))
)
I receive a Py4JError indicating that it could not seralize the dataframe object df_2. This is verified elsewhere in StackOverflow, e.g. Pyspark: PicklingError: Could not serialize object:.
I opted to use a map rather than a join because I am adding a single column to each row in df_1, and given the difficulty in encoding the complex logic required to identify the correct rows in df_2 to add up for each given event (First, check which properties fired and were TRUE in df_1, then select those properties in df_2, downselect to only the properties and values that are relevant given the event time, and then add up all the events).
I am trying to think of a way to reconfigure df_2 in a sustainable, scalable manner to allow for a more simple join/map, but I am not sure how best to go about doing it.
Any advice would be greatly appreciated.
Sample DataFrames:
df1.show()
+-----------+----------+----------+----------+----------+
|EVENT_INDEX|EVENT_TIME|PROPERTY_1|PROPERTY_2|PROPERTY_3|
+-----------+----------+----------+----------+----------+
| Event_1| 13| 1| 0| 1|
| Event_2| 24| 0| 1| 1|
| Event_3| 35| 1| 0| 0|
+-----------+----------+----------+----------+----------+
df2.show()
+----------+--------+-------------+-----+
|START_TIME|END_TIME|PROPERTY_NAME|VALUE|
+----------+--------+-------------+-----+
| 0| 18| PROPERTY_1| 0.1|
| 19| 40| PROPERTY_1| 0.8|
| 0| 20| PROPERTY_2| 0.7|
| 20| 24| PROPERTY_2| 0.3|
| 25| 40| PROPERTY_2| 0.7|
| 0| 40| PROPERTY_3| 0.5|
+----------+--------+-------------+-----+
This works for Spark2.4+ using DataframeAPI.(very scalable as it only uses in-built functions and it is dynamic for as many property columns)
It will work for as many properties as it is dynamic for them as long as the properties columns start with 'PROPERTY_'. First I will use arrays_zip and array and explode to collapse all Property columns into rows with 2 columns using element_at to give us PROPERY_NAME,PROPERTY_VALUE. Before join, I will filter to only keep all rows where the PROPERY_VALUE=1. The join will take place on the range of time and where PROPERTY(with all collapsed rows of properties)=PROPERTY_NAMES(of df2). This will ensure that we only get all the rows needed for our sum. Then I perform a groupBy with agg to select all our required columns and to get our total sum as TOTAL_VALUE.
from pyspark.sql import functions as F
df1.withColumn("PROPERTIES",\
F.explode(F.arrays_zip(F.array([F.array(F.lit(x),F.col(x)) for x in df1.columns if x.startswith("PROPERTY_")]))))\
.select("EVENT_INDEX", "EVENT_TIME","PROPERTIES.*",\
*[x for x in df1.columns if x.startswith("PROPERTY_")]).withColumn("PROPERTY", F.element_at("0",1))\
.withColumn("PROPERTY_VALUE", F.element_at("0",2)).drop("0")\
.filter('PROPERTY_VALUE=1').join(df2, (df1.EVENT_TIME>=df2.START_TIME) & (df1.EVENT_TIME<=df2.END_TIME)& \
(F.col("PROPERTY")==df2.PROPERTY_NAME)).groupBy("EVENT_INDEX").agg(F.first("EVENT_TIME").alias("EVENT_TIME"),\
*[F.first(x).alias(x) for x in df1.columns if x.startswith("PROPERTY_")],\
(F.sum("VALUE").alias("TOTAL_VALUE"))).orderBy("EVENT_TIME").show()
+-----------+----------+----------+----------+----------+-----------+
|EVENT_INDEX|EVENT_TIME|PROPERTY_1|PROPERTY_2|PROPERTY_3|TOTAL_VALUE|
+-----------+----------+----------+----------+----------+-----------+
| Event_1| 13| 1| 0| 1| 0.6|
| Event_2| 24| 0| 1| 1| 0.8|
| Event_3| 35| 1| 0| 0| 0.8|
+-----------+----------+----------+----------+----------+-----------+
Related
For a project with table features, I try to create a new table with pivot_table.
Small problem however, one of my columns contains an array. Here an ex
| House | Job |
| ----- | --- |
| Gryffindor | ["Head of Auror Office", " Minister for Magic"]|
| Gryffindor | ["Auror"]|
| Slytherin | ["Auror","Student"] |
Ideally, I would like with a pivot table to create a table that looks like this
| House | Head of Auror Office | Minister for Magic | Auror | Student |
|:----- |:--------------------:|:------------------:|:-----:|:-------:|
| Gryffindor | 1 | 1| 1| 0|
| Slytherin | 0 | 0| 1| 1|
Of course I can have a value like 2,3 or 4 in the array so something that is not fixed. Anyone have a solution? Maybe the pivot_table is not the best solution :/
Sorry for the arrays, It's not working :(
suppose your table is df with two columns:
(df.explode('Job')
.groupby(['House', 'Job']).size().reset_index()
.pivot(index = 'House', columns = 'Job').fillna(0))
the code first expand the list into rows, then do the count, and finally do the pivot table
Problem Statement
I have two corresponding DataFrames, one is employee table, one is job catalog table, one of their columns is filled with array, I want to find and intersection of two array in the skill_set column from two DataFrames (I've using np.intersect1d) and return the value to employee DataFrame for each id in employee DataFrame.
So 1 id in employee DataFrame will be looped to find intersection of all job_name in job catalog DataFrame in same job rank with the current employee job rank. Final output is meant to find 5 job with highest amount of intersect (using len since np.intersect1d returns a list) from job DataFrames.
employee_data
+----+--------+----------+----------+
| id|emp_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 3|[a1,a2,a3]|
| 1| j | 4|[a1,a2,a3]|
| 3| k | 5|[a1,a2,a3]|
| 1| l | 6|[a1,a2,a3]|
+----+--------+----------+----------+
job_data
+----+--------+----------+----------+
| id|job_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 1|[a1,a2,a3]|
| 1| b | 4|[a1,a2,a3]|
| 3| r | 3|[a1,a2,a3]|
| 1| a | 6|[a1,a2,a3]|
| 1| m | 2|[a1,a2,a3]|
| 1| g | 4|[a1,a2,a3]|
+----+--------+----------+----------+
I can give you an idea how you can solve this, considering the emp data and job data are not too big.
Do a full join (or inner join as you need) on employee_data and job_data. So your new joined data will have len(employee_data) * len(job_data) rows and will have skills from both tables including employee details
| emp_details | emp_skills | job_details | job_skills |
Operate on this table to find which of emp_skills matches with job_skills with (lambda) functions. With functions you are easily operate on array objects.
Select the emp details from the row
I got a dataframe (df1), where I have listed some time frames:
| start | end | event name |
|-------|-----|------------|
| 1 | 3 | name_1 |
| 3 | 5 | name_2 |
| 2 | 6 | name_3 |
In these time frames, I would like to extract some data from another dataframe (df2). For example, I want to extend df1 with the average measurementn from df2 inside the specified time range.
| timestamp | measurement |
|-----------|-------------|
| 1 | 5 |
| 2 | 7 |
| 3 | 5 |
| 4 | 9 |
| 5 | 2 |
| 6 | 7 |
| 7 | 8 |
I was thinking about an UDF function which filters df2 by timestamp and evaluates the average. But in a UDF I can not reference two dataframes:
def get_avg(start, end):
return df2.filter(df2.timestamp > start & df2.timestamp < end).agg({"average": "avg"})
udf_1 = f.udf(get_avg)
df1.select(udf_1('start', 'end').show()
This will throw an error TypeError: cannot pickle '_thread.RLock' object.
How would I solve this issue efficiently?
In this case there is no need to use UDFs, you can simply use join over a range interval determined by the timestamps
import pyspark.sql.functions as F
df1.join(df2, on=[(df2.timestamp > df1.start) & (df2.timestamp < df1.end)]) \
.groupby('start', 'end', 'event_name') \
.agg(F.mean('measurement').alias('avg')) \
.show()
+-----+---+----------+-----------------+
|start|end|event_name| avg|
+-----+---+----------+-----------------+
| 1| 3| name_1| 7.0|
| 3| 5| name_2| 9.0|
| 2| 6| name_3|5.333333333333333|
+-----+---+----------+-----------------+
I have a table like so:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 0 | 5 |
| 0 | 4 |
| 0 | 0 |
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
| 3 | -4 |
--------------------------------------------
I would like to remove all IDs which have any Value <= 0, so the result would be:
--------------------------------------------
| Id | Value | Some Other Columns Here
| 1 | 3 |
| 2 | 1 |
| 2 | 8 |
--------------------------------------------
I tried doing this by filtering to only rows with Value<=0, selecting the distinct IDs from this, converting that to a list, and then removing any rows in the original table that have an ID in that list using df.filter(~df.Id.isin(mylist))
However, I have a huge amount of data, and this ran out of memory making the list, so I need to come up with a pure pyspark solution.
As Gordon mentions, you may need a window for this, here is a pyspark version:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("Id")
(df.withColumn("flag",F.when(F.col("Value")<=0,0).otherwise(1))
.withColumn("Min",F.min("flag").over(w)).filter(F.col("Min")!=0)
.drop("flag","Min")).show()
+---+-----+
| Id|Value|
+---+-----+
| 1| 3|
| 2| 1|
| 2| 8|
+---+-----+
Brief summary of approach taken:
Set a flag when Value<=0 then 0 else `1
get min over a partition of id (will return 0 if any of prev cond is
met)
filter only when this Min value is not 0
`
You can use window functions:
select t.*
from (select t.*, min(value) over (partition by id) as min_value
from t
) t
where min_value > 0
I'm working with Pyspark, I have Spark 1.6. And I would like to group some values together.
+--------+-----+
| Item |value|
+--------+-----+
| A | 187|
| B | 200|
| C | 3|
| D | 10|
I would to group all item with less of 10% total value together (in this case C and D would be group into new value "Other")
So, the new table look like
+--------+-----+
| Item |value|
+--------+-----+
| A | 187|
| B | 200|
| Other | 13|
Does some one know some function or simple way to do that?
Many thanks for your help
You can filter the dataframe twice to get a dataframe with just the values you want to keep, and one with just the others. Perform an aggregation on the others dataframe to sum them, then union the two dataframe back together. Depending on the data you may want to persist the original dataframe before all this so that it doesn't need to be evaluated twice.