Problem Statement
I have two related DataFrames, an employee table and a job catalog table. Each has a skill_set column that holds an array. I want to find the intersection of the two skill_set arrays (I am using np.intersect1d) and return the result to the employee DataFrame for each id in it.
For each id in the employee DataFrame, I loop over every job_name in the job catalog DataFrame that has the same job rank as the employee and compute the intersection. The final output should be the 5 jobs with the largest intersection (measured with len, since np.intersect1d returns an array) from the job DataFrame.
employee_data
+----+--------+----------+----------+
| id|emp_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 3|[a1,a2,a3]|
| 1| j | 4|[a1,a2,a3]|
| 3| k | 5|[a1,a2,a3]|
| 1| l | 6|[a1,a2,a3]|
+----+--------+----------+----------+
job_data
+----+--------+----------+----------+
| id|job_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 1|[a1,a2,a3]|
| 1| b | 4|[a1,a2,a3]|
| 3| r | 3|[a1,a2,a3]|
| 1| a | 6|[a1,a2,a3]|
| 1| m | 2|[a1,a2,a3]|
| 1| g | 4|[a1,a2,a3]|
+----+--------+----------+----------+
Here is an idea for how to solve this, assuming the employee and job data are not too big.
Do a cross join (or an inner join on job_rank, as you need) of employee_data and job_data. With a cross join, the joined data will have len(employee_data) * len(job_data) rows and will carry the skills from both tables along with the employee details:
| emp_details | emp_skills | job_details | job_skills |
Operate on this table with (lambda) functions to find which emp_skills match job_skills. Functions make it easy to operate on array objects.
Finally, select the employee details from the matching rows.
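The steps above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the toy values, the suffixes `_emp` and `_job`, and the column name `n_common` are made up for the example and are not the tables from the question. It also assumes the data fits in memory and joins on job_rank rather than doing a full cross join.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, chosen so the intersections differ)
employee = pd.DataFrame({
    "id": [1, 2],
    "emp_name": ["c", "a"],
    "job_rank": [1, 2],
    "skill_set": [["a1", "a2", "a3"], ["a2", "a4"]],
})
job = pd.DataFrame({
    "job_name": ["c", "d", "a", "m"],
    "job_rank": [1, 1, 2, 2],
    "skill_set": [["a1", "a2"], ["a3", "a9"], ["a4"], ["a2", "a4", "a7"]],
})

# Pair each employee only with the jobs that share its job_rank
pairs = employee.merge(job, on="job_rank", suffixes=("_emp", "_job"))

# Count the shared skills for every (employee, job) pair
pairs["n_common"] = [
    len(np.intersect1d(emp_skills, job_skills))
    for emp_skills, job_skills in zip(pairs["skill_set_emp"], pairs["skill_set_job"])
]

# Keep at most the 5 best-matching jobs per employee
top_jobs = (
    pairs.sort_values(["id", "n_common"], ascending=[True, False])
    .groupby("id")
    .head(5)[["id", "emp_name", "job_name", "n_common"]]
    .reset_index(drop=True)
)
print(top_jobs)
```

Joining on job_rank first keeps the intermediate table much smaller than a cross join, which matters because the per-pair intersection is computed in plain Python.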
For a project with tabular features, I am trying to create a new table with pivot_table.
One small problem, however: one of my columns contains an array. Here is an example:
| House | Job |
| ----- | --- |
| Gryffindor | ["Head of Auror Office", " Minister for Magic"]|
| Gryffindor | ["Auror"]|
| Slytherin | ["Auror","Student"] |
Ideally, I would like with a pivot table to create a table that looks like this
| House | Head of Auror Office | Minister for Magic | Auror | Student |
|:----- |:--------------------:|:------------------:|:-----:|:-------:|
| Gryffindor | 1 | 1| 1| 0|
| Slytherin | 0 | 0| 1| 1|
Of course I can have a value like 2,3 or 4 in the array so something that is not fixed. Anyone have a solution? Maybe the pivot_table is not the best solution :/
Sorry for the arrays, It's not working :(
Suppose your table is df with the two columns House and Job:
(df.explode('Job')
 .groupby(['House', 'Job']).size().reset_index(name='count')
 .pivot(index='House', columns='Job', values='count').fillna(0).astype(int))
The code first expands each list into rows, then counts each (House, Job) pair, and finally pivots the counts into a table.
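For reference, here is a self-contained sketch of the same explode-then-count idea on the sample data. It swaps the groupby/pivot step for pd.crosstab, which is an alternative and not what the answer above uses. The variable names `exploded` and `counts` are made up for the example, and the stray leading space in " Minister for Magic" is assumed to be a typo and dropped.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "House": ["Gryffindor", "Gryffindor", "Slytherin"],
    "Job": [
        ["Head of Auror Office", "Minister for Magic"],
        ["Auror"],
        ["Auror", "Student"],
    ],
})

# One row per (House, single job)
exploded = df.explode("Job")

# Count how often each job occurs per house; missing pairs become 0
counts = pd.crosstab(exploded["House"], exploded["Job"])
print(counts)
```

Because crosstab counts occurrences, a job that appears several times for one house shows up as 2, 3, and so on rather than a fixed 1.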
I have tried different code on these 2 DataFrames, from join (which requires a common column) to union and some other ways to merge, but I can't get the result I want. I also tried the straightforward approach:
data.join(tdf, how='inner').select('*')
data.join(tdf, how='outer').select('*')
Neither of the 2 snippets above gave me the DataFrame I wanted.
data.show()
|_c0| description| medical_specialty| sample_name| transcription| keywords|
+---+--------------------+--------------------+-------------------------------------------------------------+
| 1| Consult for lapa...| Bariatrics|Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
| 2| Consult for lapa...| Bariatrics|Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
| 3| 2-D M-Mode. Dopp...| Cardiovascular /...|2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
| 4| 2-D Echocardiogram| Cardiovascular /...|2-D Echocardiogr...|1. The left vent...|cardiovascular / ...|
How can I add the age column to the df above, or otherwise merge these DataFrames?
tdf.show()
|age|
+---+
| |
| 42|
| |
| |
| 30|
| |
| |
goal:
+---+--------------------+--------------------+-------------------+------------------+----------------------+---+
|_c0| description| medical_specialty| sample_name| transcription| keywords|age|
+---+--------------------+--------------------+-------------------------------------------------------------+---+
| 1| Consult for lapa...| Bariatrics|Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...| |
| 2| Consult for lapa...| Bariatrics|Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...| 42|
| 3| 2-D M-Mode. Dopp...| Cardiovascular /...|2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...| |
| 4| 2-D Echocardiogram| Cardiovascular /...|2-D Echocardiogr...|1. The left vent...|cardiovascular / ...| |
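As I understand it, you want to join the two DataFrames by row number: your data already contains a row number (_c0), but your age DataFrame doesn't.
So you can add a row number to tdf and then join the DataFrames:
from pyspark.sql import functions as f
from pyspark.sql import Window
tdf = tdf.withColumn('Row_id', f.row_number().over(Window.orderBy(f.lit('a'))))
joined_data = data.join(tdf, data._c0 == tdf.Row_id).select('*')
This will join the age column to your existing DataFrame. Note that ordering the window by a constant does not by itself guarantee that tdf keeps its original row order, so check the result on your data.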
Good Afternoon.
I am trying to perform a join in Pyspark that uses a complex set of conditions to produce a single value.
A minimum example of what I am trying to achieve could look like the following. Imagine a set of events that can occur at discrete times (between t=0 and t=40). Each event has a set of three independent boolean properties that describe the nature of the event. There is some time-dependent value associated with the triggering of each property, contained in a lookup table. For each event, I would like to determine the sum of all the relevant values for that event.
My first dataframe, df_1, is a list of the events, the time at which each event occurred, and a selection of boolean properties associated with it:
+-------------+------------+------------+------------+------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 |
+-------------+------------+------------+------------+------------+
| Event_1 | 13 | 1 | 0 | 1 |
| Event_2 | 24 | 0 | 1 | 1 |
| Event_3 | 35 | 1 | 0 | 0 |
+-------------+------------+------------+------------+------------+
The second dataframe, df_2, is the lookup table that describes the associated value of having a TRUE for a particular property at a particular time. Since there are many repeated values across all the time buckets, the format of this dataframe is an inclusive range of times for which the property has the specific value. The time ranges are not consistently sized and can vary wildly between different properties:
+------------+----------+---------------+-------+
| START_TIME | END_TIME | PROPERTY_NAME | VALUE |
+------------+----------+---------------+-------+
| 0 | 18 | PROPERTY_1 | 0.1 |
| 19 | 40 | PROPERTY_1 | 0.8 |
| 0 | 20 | PROPERTY_2 | 0.7 |
| 20 | 24 | PROPERTY_2 | 0.3 |
| 25 | 40 | PROPERTY_2 | 0.7 |
| 0 | 40 | PROPERTY_3 | 0.5 |
+------------+----------+---------------+-------+
Desired Output:
Since Event_1 occurred at time t=13, with PROPERTY_1 and PROPERTY_3 triggered, the expected sum of the values according to df_2 should be 0.1 (from the PROPERTY_1 0-18 bucket) + 0.5 (from the PROPERTY_3 0-40 bucket) = 0.6. In the same way, Event_2 should have a value of 0.3 (remember that bucket start/end times are inclusive, so this comes from the 20-24 bucket) + 0.5 = 0.8. Finally, Event_3 = 0.8.
+-------------+------------+------------+------------+------------+-------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 | TOTAL_VALUE |
+-------------+------------+------------+------------+------------+-------------+
| Event_1 | 13 | 1 | 0 | 1 | 0.6 |
| Event_2 | 24 | 0 | 1 | 1 | 0.8 |
| Event_3 | 35 | 1 | 0 | 0 | 0.8 |
+-------------+------------+------------+------------+------------+-------------+
For my initial test dataset, in the event dataframe df_1 there are ~20,000 events spread over 2000 time buckets. Each event has ~44 properties and the length of the lookup table df_2 is ~53,000. As I would like to expand this process out to significantly more data (a couple of orders of magnitude potentially), I am very interested in a parallelisable solution to this problem. For instance, I feel like summarising df_2 as a python dictionary and broadcasting that to my executors will not be possible given the volume of data.
Since I'm trying to add a single column to each row in df_1, I have tried to accomplish the task using a nested map that looks similar to the following:
def calculate_value(df_2):
    def _calculate_value(row):
        row_dict = row.asDict()
        rolling_value = 0.0
        for property_name in [key for key in row_dict.keys() if "PROPERTY" in key]:
            additional_value = (
                df_2
                .filter(
                    (pyspark.sql.functions.col("PROPERTY_NAME") == property_name)
                    & (pyspark.sql.functions.col("START_BUCKET") <= row_dict["EVENT_TIME"])
                    & (pyspark.sql.functions.col("END_BUCKET") >= row_dict["EVENT_TIME"])
                )
                .select("VALUE")
                .collect()
            )[0][0]
            rolling_value += additional_value
        return pyspark.sql.Row(**row_dict)
    return _calculate_value
This code is able to perform the join on the driver (by running calculate_value(df_2)(df_1.rdd.take(1)[0])), however when I try to perform the parallelised map:
(
df_1
.rdd
.map(calculate_value(df_2))
)
I receive a Py4JError indicating that it could not serialize the dataframe object df_2. This is verified elsewhere in StackOverflow, e.g. Pyspark: PicklingError: Could not serialize object:.
I opted to use a map rather than a join because I am adding a single column to each row in df_1, and because it is difficult to encode the logic required to identify the correct rows in df_2 to add up for each given event: first, check which properties fired and were TRUE in df_1, then select those properties in df_2, downselect to only the properties and values that are relevant given the event time, and then add up all the values.
I am trying to think of a way to reconfigure df_2 in a sustainable, scalable manner to allow for a simpler join/map, but I am not sure how best to go about doing it.
Any advice would be greatly appreciated.
Sample DataFrames:
df1.show()
+-----------+----------+----------+----------+----------+
|EVENT_INDEX|EVENT_TIME|PROPERTY_1|PROPERTY_2|PROPERTY_3|
+-----------+----------+----------+----------+----------+
| Event_1| 13| 1| 0| 1|
| Event_2| 24| 0| 1| 1|
| Event_3| 35| 1| 0| 0|
+-----------+----------+----------+----------+----------+
df2.show()
+----------+--------+-------------+-----+
|START_TIME|END_TIME|PROPERTY_NAME|VALUE|
+----------+--------+-------------+-----+
| 0| 18| PROPERTY_1| 0.1|
| 19| 40| PROPERTY_1| 0.8|
| 0| 20| PROPERTY_2| 0.7|
| 20| 24| PROPERTY_2| 0.3|
| 25| 40| PROPERTY_2| 0.7|
| 0| 40| PROPERTY_3| 0.5|
+----------+--------+-------------+-----+
This works for Spark 2.4+ using the DataFrame API. It is very scalable because it only uses built-in functions, and it is dynamic in the number of property columns.
It will work for any number of properties, as long as the property columns start with 'PROPERTY_'. First I use arrays_zip, array and explode to collapse all property columns into rows, then element_at to split each row into the 2 columns PROPERTY and PROPERTY_VALUE. Before the join, I filter to keep only the rows where PROPERTY_VALUE=1. The join takes place on the time range and on PROPERTY (from the collapsed property rows) = PROPERTY_NAME (of df2). This ensures that we only get the rows needed for our sum. Then I perform a groupBy with agg to select all the required columns and to get the total sum as TOTAL_VALUE.
from pyspark.sql import functions as F

prop_cols = [x for x in df1.columns if x.startswith("PROPERTY_")]

(df1
 # Collapse the property columns into one [name, value] pair per row
 .withColumn("PROPERTIES", F.explode(F.arrays_zip(F.array([F.array(F.lit(x), F.col(x)) for x in prop_cols]))))
 .select("EVENT_INDEX", "EVENT_TIME", "PROPERTIES.*", *prop_cols)
 .withColumn("PROPERTY", F.element_at("0", 1))
 .withColumn("PROPERTY_VALUE", F.element_at("0", 2))
 .drop("0")
 # Keep only the properties that fired
 .filter("PROPERTY_VALUE=1")
 # Range join on the event time plus equality on the property name
 .join(df2, (df1.EVENT_TIME >= df2.START_TIME) & (df1.EVENT_TIME <= df2.END_TIME) & (F.col("PROPERTY") == df2.PROPERTY_NAME))
 .groupBy("EVENT_INDEX")
 .agg(F.first("EVENT_TIME").alias("EVENT_TIME"), *[F.first(x).alias(x) for x in prop_cols], F.sum("VALUE").alias("TOTAL_VALUE"))
 .orderBy("EVENT_TIME")
 .show())
+-----------+----------+----------+----------+----------+-----------+
|EVENT_INDEX|EVENT_TIME|PROPERTY_1|PROPERTY_2|PROPERTY_3|TOTAL_VALUE|
+-----------+----------+----------+----------+----------+-----------+
| Event_1| 13| 1| 0| 1| 0.6|
| Event_2| 24| 0| 1| 1| 0.8|
| Event_3| 35| 1| 0| 0| 0.8|
+-----------+----------+----------+----------+----------+-----------+
I have two tables like the following:
First Table:
+---+------+----------+----------+
| id|sub_id| startDate| endDate|
+---+------+----------+----------+
| 2| a|2018-11-15|2018-12-01|
| 2| b|2018-10-15|2018-11-01|
| 3| a|2018-09-15|2018-10-01|
+---+------+----------+----------+
Second Table:
+---+----------+----+
| id| date|time|
+---+----------+----+
| 2|2018-10-15|1200|
| 2|2018-10-16|1200|
| 2|2018-10-18|1200|
| 3|2018-09-28|1200|
| 3|2018-09-29|1200|
+---+----------+----+
For a particular id and a given startDate and endDate, I need to filter the second table to the rows whose date falls within that timeframe.
From the filtered table I need the sum of the time column, and the output should look like the following:
+---+------+----------+----------+---------+
| id|sub_id| startDate| endDate|totalTime|
+---+------+----------+----------+---------+
| 2| a|2018-11-15|2018-12-01| 0|
| 2| b|2018-10-15|2018-11-01| 3600|
| 3| a|2018-09-15|2018-10-01| 2400|
+---+------+----------+----------+---------+
My objective is to avoid using a for loop together with filter. I tried using pandas_udf, but it works with only one DataFrame.
I am trying to take a column in Spark (using pyspark) that has string values like 'A1', 'C2', and 'B9' and create new columns with each element in the string. How can I extract values from strings to create a new column?
How do I turn this:
| id | col_s |
|----|-------|
| 1 | 'A1' |
| 2 | 'C2' |
into this:
| id | col_s | col_1 | col_2 |
|----|-------|-------|-------|
| 1 | 'A1' | 'A' | '1' |
| 2 | 'C2' | 'C' | '2' |
I have been looking through the docs unsuccessfully.
You can use expr and substring to extract the substrings you want. In the substring() function, the first argument is the column, the second is the position at which to start extracting, and the third is the length of the substring to extract. Note: it uses 1-based indexing, not 0-based.
from pyspark.sql.functions import substring, length, expr
df = df.withColumn('col_1',expr('substring(col_s, 1, 1)'))
df = df.withColumn('col_2',expr('substring(col_s, 2, 1)'))
df.show()
+---+-----+-----+-----+
| id|col_s|col_1|col_2|
+---+-----+-----+-----+
| 1| A1| A| 1|
| 2| C1| C| 1|
| 3| G8| G| 8|
| 4| Z6| Z| 6|
+---+-----+-----+-----+
I was able to answer my own question 5 minutes after posting it here...
import pyspark.sql.functions
split_col = pyspark.sql.functions.split(df['COL_NAME'], "")
df = df.withColumn('COL_NAME_CHAR', split_col.getItem(0))
df = df.withColumn('COL_NAME_NUM', split_col.getItem(1))