I'm working with Pyspark, I have Spark 1.6. And I would like to group some values together.
+--------+-----+
| Item |value|
+--------+-----+
| A | 187|
| B | 200|
| C | 3|
| D | 10|
I would to group all item with less of 10% total value together (in this case C and D would be group into new value "Other")
So, the new table look like
+--------+-----+
| Item |value|
+--------+-----+
| A | 187|
| B | 200|
| Other | 13|
Does some one know some function or simple way to do that?
Many thanks for your help
You can filter the dataframe twice to get a dataframe with just the values you want to keep, and one with just the others. Perform an aggregation on the others dataframe to sum them, then union the two dataframe back together. Depending on the data you may want to persist the original dataframe before all this so that it doesn't need to be evaluated twice.
Related
For a project with table features, I try to create a new table with pivot_table.
Small problem however, one of my columns contains an array. Here an ex
| House | Job |
| ----- | --- |
| Gryffindor | ["Head of Auror Office", " Minister for Magic"]|
| Gryffindor | ["Auror"]|
| Slytherin | ["Auror","Student"] |
Ideally, I would like with a pivot table to create a table that looks like this
| House | Head of Auror Office | Minister for Magic | Auror | Student |
|:----- |:--------------------:|:------------------:|:-----:|:-------:|
| Gryffindor | 1 | 1| 1| 0|
| Slytherin | 0 | 0| 1| 1|
Of course I can have a value like 2,3 or 4 in the array so something that is not fixed. Anyone have a solution? Maybe the pivot_table is not the best solution :/
Sorry for the arrays, It's not working :(
suppose your table is df with two columns:
(df.explode('Job')
.groupby(['House', 'Job']).size().reset_index()
.pivot(index = 'House', columns = 'Job').fillna(0))
the code first expand the list into rows, then do the count, and finally do the pivot table
What I'm trying to do is make a pyspark dataframe with item and date and another column "3_avg" that's the average of the last three same day-of-week from the given date back. Said another way, if 2022-5-5 is a thursday, I want the 3_avg value for that row to be the average sales for that item for the last three thursdays, so 4/28, 4/21, and 4/14.
I've got this thus far, but it just averages the whole column for that day of week... I can't figure out how to get it to be distinct by item and date and only use the last three? I was trying to get it to work with day_of_week, but my brain can't connect that to what I need to happen.
df_fcst_dow = (
df_avg
.withColumn("day_of_week", F.dayofweek(F.col("trn_dt")))
.groupBy("item", "date", "day_of_week")
.agg(
F.sum(F.col("sales") / 3).alias("3_avg")
)
)
You can do this with a window or you can do it with a groupby. Here I'd encourage group by as it will distribute the work better amongst the worker nodes. We create an array of the current date and the next two dates. We then explode that array, give us data duplicated accross all the dates we want so we can then group it up to make an average.
import pyspark.sql.functions as F
>>> spark.table("trn_dt").show()
+----+----------+-----+
|item| date|sales|
+----+----------+-----+
| 1|2016-01-03| 16.0|
| 1|2016-01-02| 15.0|
| 1|2016-01-05| 9.0|
| 1|2016-01-04| 10.0|
| 1|2016-01-01| 11.0|
| 1|2016-01-07| 10.0|
| 1|2016-01-06| 7.0|
+----+----------+-----+
df_avg.withColumn( "dates",
F.array( #building array of dates
df_avg["date"],
F.date_add( df_avg["date"], 1),
F.date_add( df_avg["date"], 2)
)).select(
F.col("item"),
F.explode("dates") ).alias("ThreeDayAve"), # tripling our data
F.col("sales")
).groupBy( "item","ThreeDayAve")
.agg( F.avg("sales").alias("3_avg")).show()
+----+-----------+------------------+
|item|ThreeDayAve| 3_avg|
+----+-----------+------------------+
| 1| 2016-01-05|11.666666666666666|
| 1| 2016-01-04|13.666666666666666|
| 1| 2016-01-07| 8.666666666666666|
| 1| 2016-01-01| 11.0|
| 1| 2016-01-03| 14.0|
| 1| 2016-01-02| 13.0|
| 1| 2016-01-09| 10.0|
| 1| 2016-01-06| 8.666666666666666|
| 1| 2016-01-08| 8.5|
+----+-----------+------------------+
You likely could use window on this but it wouldn't perform as well on large data sets.
Problem Statement
I have two corresponding DataFrames, one is employee table, one is job catalog table, one of their columns is filled with array, I want to find and intersection of two array in the skill_set column from two DataFrames (I've using np.intersect1d) and return the value to employee DataFrame for each id in employee DataFrame.
So 1 id in employee DataFrame will be looped to find intersection of all job_name in job catalog DataFrame in same job rank with the current employee job rank. Final output is meant to find 5 job with highest amount of intersect (using len since np.intersect1d returns a list) from job DataFrames.
employee_data
+----+--------+----------+----------+
| id|emp_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 3|[a1,a2,a3]|
| 1| j | 4|[a1,a2,a3]|
| 3| k | 5|[a1,a2,a3]|
| 1| l | 6|[a1,a2,a3]|
+----+--------+----------+----------+
job_data
+----+--------+----------+----------+
| id|job_name| job_rank| skill_set|
+----+--------+----------+----------+
| 2| c | 1|[a1,a2,a3]|
| 2| a | 2|[a1,a2,a3]|
| 1| c | 1|[a1,a2,a3]|
| 1| b | 4|[a1,a2,a3]|
| 3| r | 3|[a1,a2,a3]|
| 1| a | 6|[a1,a2,a3]|
| 1| m | 2|[a1,a2,a3]|
| 1| g | 4|[a1,a2,a3]|
+----+--------+----------+----------+
I can give you an idea how you can solve this, considering the emp data and job data are not too big.
Do a full join (or inner join as you need) on employee_data and job_data. So your new joined data will have len(employee_data) * len(job_data) rows and will have skills from both tables including employee details
| emp_details | emp_skills | job_details | job_skills |
Operate on this table to find which of emp_skills matches with job_skills with (lambda) functions. With functions you are easily operate on array objects.
Select the emp details from the row
I have two dataframes players and striker_details. The striker_details df is sorted by striker_grade and looks like this:
+-------------+-------------+
|player_api_id|striker_grade|
+-------------+-------------+
| 20276| 89.25|
| 37412| 89.0|
| 38817| 88.75|
| 32118| 88.25|
| 31921| 87.0|
+-------------+-------------+
The other dataframe players is an unordered dataframe and looks like this:
+---+-------------+------------------+------------------+-------------------+------+------+
| id|player_api_id| player_name|player_fifa_api_id| birthday|height|weight|
+---+-------------+------------------+------------------+-------------------+------+------+
| 1| 505942|Aaron Appindangoye| 218353|1992-02-29 00:00:00|182.88| 187|
| 2| 155782| Aaron Cresswell| 189615|1989-12-15 00:00:00|170.18| 146|
| 3| 162549| Aaron Doran| 186170|1991-05-13 00:00:00|170.18| 163|
| 4| 30572| Aaron Galindo| 140161|1982-05-08 00:00:00|182.88| 198|
| 5| 23780| Aaron Hughes| 17725|1979-11-08 00:00:00|182.88| 154|
+---+-------------+------------------+------------------+-------------------+------+------+
When I am trying to join the two dataframes using pyspark join, it is returning an unordered data:
+-------------+-----------------+------------------+-------------------+------+------+
|player_api_id| striker_grade| player_name| birthday|height|weight|
+-------------+-----------------+------------------+-------------------+------+------+
| 309726|75.38888888888889| Andrea Belotti|1993-12-20 00:00:00|180.34| 159|
| 38433| 72.5625| Borja Valero|1985-01-12 00:00:00|175.26| 161|
| 41157| 82.0|Giovani dos Santos|1989-05-11 00:00:00|175.26| 163|
| 40740| 70.375| Jeremy Morel|1984-04-02 00:00:00|172.72| 157|
| 109653| 73.5| John Goossens|1988-07-25 00:00:00|175.26| 150|
+-------------+-----------------+------------------+-------------------+------+------+
I used the command: ss = striker_details.join(players, ["player_api_id"], "inner")
How can I achieve sorted data?
Try chaining orderBy('striker_grade,ascending=False) at the end of your join:
ss = striker_details.join(players, ["player_api_id"], "inner").orderBy('striker_grade',ascending=False)
ss.display()
Also, the default join is inner, so you don't need to specify it. It is good to show it for explicitness though.
Good Afternoon.
I am trying to perform a join in Pyspark that uses a complex set of conditions to produce a single value.
A minimum example of what I am trying to achieve could look like the following. Imagine a set of events that can occur at discrete times (between t=0 and t=40). Each event has a set of three independent boolean properties that describe the nature of the event. There is some time-dependent value associated with the triggering of each property, contained in a lookup table. For each event, I would like to determine the sum of all the relevant values for that event.
My first dataframe, df_1, is a list of the events, the time at which the event occured, and had a selection of boolean properties associated with it:
+-------------+------------+------------+------------+------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 |
+-------------+------------+------------+------------+------------+
| Event_1 | 13 | 1 | 0 | 1 |
| Event_2 | 24 | 0 | 1 | 1 |
| Event_3 | 35 | 1 | 0 | 0 |
+-------------+------------+------------+------------+------------+
The second dataframe, df_2, is the lookup table that describes the associated value of having a TRUE for a particular property at a particular time. Since there are many repeated values across all the time buckets, the format of this dataframe is an inclusive range of times for which the property has the specific value. The time ranges are not consistently sized and can vary wildly between different properties:
+------------+----------+---------------+-------+
| START_TIME | END_TIME | PROPERTY_NAME | VALUE |
+------------+----------+---------------+-------+
| 0 | 18 | PROPERTY_1 | 0.1 |
| 19 | 40 | PROPERTY_1 | 0.8 |
| 0 | 20 | PROPERTY_2 | 0.7 |
| 20 | 24 | PROPERTY_2 | 0.3 |
| 25 | 40 | PROPERTY_2 | 0.7 |
| 0 | 40 | PROPERTY_3 | 0.5 |
+------------+----------+---------------+-------+
Desired Output:
Since Event_1 occured at time t=13, with PROPERTY_1 and PROPERTY_3 triggered, the expected sum of the values according to df_2 should be 0.1 (from the PROPERTY_1 0-18 bucket) + 0.5 (from the PROPERTY_3 0-40 bucket) = 0.6. In the same way, Event_2 should have a value of 0.3 (remember that bucket start/end times are inclusive, so this comes from the 20-24 bucket) + 0.5 = 0.8. Finally, Event_3 = 0.8.
+-------------+------------+------------+------------+------------+-------------+
| EVENT_INDEX | EVENT_TIME | PROPERTY_1 | PROPERTY_2 | PROPERTY_3 | TOTAL_VALUE |
+-------------+------------+------------+------------+------------+-------------+
| Event_1 | 13 | 1 | 0 | 1 | 0.6 |
| Event_2 | 24 | 0 | 1 | 1 | 0.8 |
| Event_3 | 35 | 1 | 0 | 0 | 0.8 |
+-------------+------------+------------+------------+------------+-------------+
For my initial test dataset, in the event dataframe df_1 there are ~20,000 events spread over 2000 time buckets. Each event has ~44 properties and the length of the lookup table df_2 is ~53,000. As I would like to expand this process out to significantly more data (a couple of orders of magnitude potentially), I am very interested in a parallelisable solution to this problem. For instance, I feel like summarising df_2 as a python dictionary and broadcasting that to my executors will not be possible given the volume of data.
Since I'm trying to add a single column to each row in df_1, I have tried to accomplish the task using a nested map that looks similar to the following:
def calculate_value(df_2):
def _calculate_value(row):
row_dict = row.asDict()
rolling_value = 0.0
for property_name in [key for key in row_dict.keys() if "PROPERTY" in key]:
additional_value = (
df_2
.filter(
(pyspark.sql.functions.col("PROPERTY_NAME") == property_name)
& (pyspark.sql.functions.col("START_BUCKET") <= row_dict["EVENT_TIME"])
& (pyspark.sql.functions.col("END_BUCKET") >= row_dict["EVENT_TIME"])
)
.select("VALUE")
.collect()
)[0][0]
rolling_value += additional_value
return pyspark.sql.Row(**row_dict)
return _calculate_value
This code is able to perform the join on the driver (by running calculate_value(df_2)(df_1.rdd.take(1)[0])), however when I try to perform the parallelised map:
(
df_1
.rdd
.map(calculate_value(df_2))
)
I receive a Py4JError indicating that it could not seralize the dataframe object df_2. This is verified elsewhere in StackOverflow, e.g. Pyspark: PicklingError: Could not serialize object:.
I opted to use a map rather than a join because I am adding a single column to each row in df_1, and given the difficulty in encoding the complex logic required to identify the correct rows in df_2 to add up for each given event (First, check which properties fired and were TRUE in df_1, then select those properties in df_2, downselect to only the properties and values that are relevant given the event time, and then add up all the events).
I am trying to think of a way to reconfigure df_2 in a sustainable, scalable manner to allow for a more simple join/map, but I am not sure how best to go about doing it.
Any advice would be greatly appreciated.
Sample DataFrames:
df1.show()
+-----------+----------+----------+----------+----------+
|EVENT_INDEX|EVENT_TIME|PROPERTY_1|PROPERTY_2|PROPERTY_3|
+-----------+----------+----------+----------+----------+
| Event_1| 13| 1| 0| 1|
| Event_2| 24| 0| 1| 1|
| Event_3| 35| 1| 0| 0|
+-----------+----------+----------+----------+----------+
df2.show()
+----------+--------+-------------+-----+
|START_TIME|END_TIME|PROPERTY_NAME|VALUE|
+----------+--------+-------------+-----+
| 0| 18| PROPERTY_1| 0.1|
| 19| 40| PROPERTY_1| 0.8|
| 0| 20| PROPERTY_2| 0.7|
| 20| 24| PROPERTY_2| 0.3|
| 25| 40| PROPERTY_2| 0.7|
| 0| 40| PROPERTY_3| 0.5|
+----------+--------+-------------+-----+
This works for Spark2.4+ using DataframeAPI.(very scalable as it only uses in-built functions and it is dynamic for as many property columns)
It will work for as many properties as it is dynamic for them as long as the properties columns start with 'PROPERTY_'. First I will use arrays_zip and array and explode to collapse all Property columns into rows with 2 columns using element_at to give us PROPERY_NAME,PROPERTY_VALUE. Before join, I will filter to only keep all rows where the PROPERY_VALUE=1. The join will take place on the range of time and where PROPERTY(with all collapsed rows of properties)=PROPERTY_NAMES(of df2). This will ensure that we only get all the rows needed for our sum. Then I perform a groupBy with agg to select all our required columns and to get our total sum as TOTAL_VALUE.
from pyspark.sql import functions as F
df1.withColumn("PROPERTIES",\
F.explode(F.arrays_zip(F.array([F.array(F.lit(x),F.col(x)) for x in df1.columns if x.startswith("PROPERTY_")]))))\
.select("EVENT_INDEX", "EVENT_TIME","PROPERTIES.*",\
*[x for x in df1.columns if x.startswith("PROPERTY_")]).withColumn("PROPERTY", F.element_at("0",1))\
.withColumn("PROPERTY_VALUE", F.element_at("0",2)).drop("0")\
.filter('PROPERTY_VALUE=1').join(df2, (df1.EVENT_TIME>=df2.START_TIME) & (df1.EVENT_TIME<=df2.END_TIME)& \
(F.col("PROPERTY")==df2.PROPERTY_NAME)).groupBy("EVENT_INDEX").agg(F.first("EVENT_TIME").alias("EVENT_TIME"),\
*[F.first(x).alias(x) for x in df1.columns if x.startswith("PROPERTY_")],\
(F.sum("VALUE").alias("TOTAL_VALUE"))).orderBy("EVENT_TIME").show()
+-----------+----------+----------+----------+----------+-----------+
|EVENT_INDEX|EVENT_TIME|PROPERTY_1|PROPERTY_2|PROPERTY_3|TOTAL_VALUE|
+-----------+----------+----------+----------+----------+-----------+
| Event_1| 13| 1| 0| 1| 0.6|
| Event_2| 24| 0| 1| 1| 0.8|
| Event_3| 35| 1| 0| 0| 0.8|
+-----------+----------+----------+----------+----------+-----------+