Trips
id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10
prompts
id,timestamp ,mode
1008,2003-11-03 15:18:49,car
1009,2003-11-03 22:04:20,metro
1009,2003-11-14 10:04:20,bike
Read csv file:
coordinates = pd.read_csv('coordinates.csv')
mode = pd.read_csv('prompts.csv')
I have to assign each mode at the end of the trip
Results:
id, timestamp, mode
1008, 2003-11-03 15:00:31, null
1008, 2003-11-03 15:02:38, null
1008, 2003-11-03 15:03:04, null
1008, 2003-11-03 15:18:00, car
1009, 2003-11-03 22:00:00, null
1009, 2003-11-03 22:02:53, null
1009, 2003-11-03 22:03:44, metro
1009, 2003-11-14 10:00:00, null
1009, 2003-11-14 10:02:02, null
1009, 2003-11-14 10:03:10, bike
Note
I use a large dataset for trips (4GB) and a small dataset for modes (500MB)
Based on your updated example, you can denote a trip by finding the first prompt timestamp that is greater than the trip timestamp. All rows with the same prompt timestamp will then correspond to the same trip. Then you want to set the mode for the greatest of the trip timestamps for each group.
One way to do this is by using 2 pyspark.sql.Windows.
Suppose you start with the following two PySpark DataFrames, trips and prompts:
trips.show(truncate=False)
#+----+-------------------+
#|id |timestamp |
#+----+-------------------+
#|1008|2003-11-03 15:00:31|
#|1008|2003-11-03 15:02:38|
#|1008|2003-11-03 15:03:04|
#|1008|2003-11-03 15:18:00|
#|1009|2003-11-03 22:00:00|
#|1009|2003-11-03 22:02:53|
#|1009|2003-11-03 22:03:44|
#|1009|2003-11-14 10:00:00|
#|1009|2003-11-14 10:02:02|
#|1009|2003-11-14 10:03:10|
#|1009|2003-11-15 10:00:00|
#+----+-------------------+
prompts.show(truncate=False)
#+----+-------------------+-----+
#|id |timestamp |mode |
#+----+-------------------+-----+
#|1008|2003-11-03 15:18:49|car |
#|1009|2003-11-03 22:04:20|metro|
#|1009|2003-11-14 10:04:20|bike |
#+----+-------------------+-----+
Join these two tables together using the id column with the condition that the prompt timestamp is greater than or equal to the trip timestamp. For some trip timestamps, this will result in multiple prompt timestamps. We can eliminate this by selecting the minimum prompt timestamp for each ('id', 'trip.timestamp') group- I call this temporary column indicator, and I used the Window w1 to compute it.
Next do a window over ('id', 'indicator') and find the maximum trip timestamp for each group. Set this value equal to the mode. All other rows will be set to pyspark.sql.functions.lit(None).
Finally you can compute all of the entries in trips where the trip timestamp was greater than the max prompt timestamp. These would be trips that did not match to a prompt. Union the matched and the unmatched together.
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('id', 'trips.timestamp')
w2 = Window.partitionBy('id', 'indicator')
matched = trips.alias('trips').join(prompts.alias('prompts'), on='id', how='left')\
.where('prompts.timestamp >= trips.timestamp' )\
.select(
'id',
'trips.timestamp',
'mode',
f.when(
f.col('prompts.timestamp') == f.min('prompts.timestamp').over(w1),
f.col('prompts.timestamp'),
).otherwise(f.lit(None)).alias('indicator')
)\
.where(~f.isnull('indicator'))\
.select(
'id',
f.col('trips.timestamp').alias('timestamp'),
f.when(
f.col('trips.timestamp') == f.max(f.col('trips.timestamp')).over(w2),
f.col('mode')
).otherwise(f.lit(None)).alias('mode')
)
unmatched = trips.alias('t').join(prompts.alias('p'), on='id', how='left')\
.withColumn('max_prompt_time', f.max('p.timestamp').over(Window.partitionBy('id')))\
.where('t.timestamp > max_prompt_time')\
.select('id', 't.timestamp', f.lit(None).alias('mode'))\
.distinct()
Output:
matched.union(unmatched).sort('id', 'timestamp').show()
+----+-------------------+-----+
| id| timestamp| mode|
+----+-------------------+-----+
|1008|2003-11-03 15:00:31| null|
|1008|2003-11-03 15:02:38| null|
|1008|2003-11-03 15:03:04| null|
|1008|2003-11-03 15:18:00| car|
|1009|2003-11-03 22:00:00| null|
|1009|2003-11-03 22:02:53| null|
|1009|2003-11-03 22:03:44|metro|
|1009|2003-11-14 10:00:00| null|
|1009|2003-11-14 10:02:02| null|
|1009|2003-11-14 10:03:10| bike|
|1009|2003-11-15 10:00:00| null|
+----+-------------------+-----+
This would be a naive solution which assumes that your coordinates DataFrame already is sorted by timestamp, that ids are unique and that your data set fits into memory. If the latter is not the case, I recommend using dask and partition your DataFrames by id.
Imports:
import pandas as pd
import numpy as np
First we join the two DataFrames. This will fill the whole mode column for each id. We join on the index because that will speed up the operation, see also "Improve Pandas Merge performance".
mode = mode.set_index('id')
coordinates = coordinates.set_index('id')
merged = coordinates.join(mode, how='left')
We need the index to be unique values in order for our groupby operation to work.
merged = merged.reset_index()
Then we apply a function that will replace all but the last row in the mode column for each id.
def clean_mode_col(df):
cleaned_mode_col = df['mode'].copy()
cleaned_mode_col.iloc[:-1] = np.nan
df['mode'] = cleaned_mode_col
return df
merged = merged.groupby('id').apply(clean_mode_col)
As mentioned above, you can use dask to parallelize the execution of the merge code like this:
import dask.dataframe as dd
dd_coordinates = dd.from_pandas(coordinates).set_index('id')
dd_mode = dd.from_pandas(mode).set_index('id')
merged = dd.merge(dd_coordinates, dd_mode, left_index=True, right_index=True)
merged = merged.compute() #returns pandas DataFrame
The set_index operations are slow but make the merge way faster.
I did not test this code. Please provide copy-pasteable code that includes your DataFrames so that I don't have to copy and paste all those files you have in your description (hint: use pd.DataFrame.to_dict to export your DataFrame as a dictionary and copy and paste that into your code).
Related
I'm working on some data preparation for a project I'm involved in. We do most of the work in Databricks, using the underlying Apache Spark for computations on large datasets. Everything is done in PySpark.
My goal is to convert a date variable to a variable yearperiod, which divides the year into 13 periods of 4 weeks (with some exceptions). The value is a concatenation of the year and the period, e.g. yearperiod = 201513 would be the year 2015, period 13.
I have two tables: yp_table which contains start and end dates (Edit: type DateType()) for yearperiods (between 2012 and now, Edit: ~120 rows):
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
....
And I have the actual data table, which contains a Date column (Edit: type StringType()):
+--------+--------+--------+-----+
| Var1| Var2| Date| Var3|
+--------+--------+--------+-----+
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
...
My question: how do I compute a column yearperiod for the data table, by comparing data.Date with both yp_table.start and yp_table.end?
So far I've been able to make it work with regular Python (a solution with list comprehensions), but it proves to be too slow for large datasets. Any help is greatly appreciated!
Edit: for privacy reasons I can't give the actual schemas of the dataframes. I've edited above to include the types of the relevant columns.
Add a column to your data df that contains the dates in the matching format to the yp_table and then join them filtering by date intervals. Since the yp_table is small, you can use a broadcast join to speed things up.
import pyspark.sql.functions as fun
# Date lookup
start_dates = ["2012-01-16", "2012-01-30", "2012-02-27", "2012-03-26", "2012-04-23", "2012-05-21"]
end_dates = ["2012-01-29", "2012-02-26", "2012-03-25", "2012-04-22", "2012-05-20", "2012-06-17"]
yearperiod = ["201201", "201202", "201203", "201204", "201205", "201206"]
yp_table = spark.createDataFrame(pd.DataFrame({'start': start_dates, 'end': end_dates, 'yearperiod': yearperiod}))
# Data df
dates = ["20120116", "20120130", "20120228", "20120301", "20200101", "20200101", "20200101"]
vals = range(0, len(dates))
data = spark.createDataFrame(pd.DataFrame({'Dates':dates, 'vals': vals}))
# Add formatted data_str column for joining
data = data.withColumn("date_str", fun.concat_ws("-", data.Dates.substr(0,4), data.Dates.substr(5,2), data.Dates.substr(7,2))) # + "-" + data.Dates.substr(6,8))
# Broadcase join small yp_table into the data table using conditional
joined = data.join(fun.broadcast(yp_table), (data.date_str >= yp_table.start) & (data.date_str < yp_table.end))
yp_table.show()
data.show()
joined.show()
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
+----------+----------+----------+
+--------+----+----------+
| Dates|vals| date_str|
+--------+----+----------+
|20120116| 0|2012-01-16|
|20120130| 1|2012-01-30|
|20120228| 2|2012-02-28|
|20120301| 3|2012-03-01|
|20200101| 4|2020-01-01|
|20200101| 5|2020-01-01|
|20200101| 6|2020-01-01|
+--------+----+----------+
+--------+----+----------+----------+----------+----------+
| Dates|vals| date_str| start| end|yearperiod|
+--------+----+----------+----------+----------+----------+
|20120116| 0|2012-01-16|2012-01-16|2012-01-29| 201201|
|20120130| 1|2012-01-30|2012-01-30|2012-02-26| 201202|
|20120228| 2|2012-02-28|2012-02-27|2012-03-25| 201203|
|20120301| 3|2012-03-01|2012-02-27|2012-03-25| 201203|
+--------+----+----------+----------+----------+----------+
I am trying to select data for multiple ID's between a time range using pyspark.
I have four columns in a spark dataframe 'event_df'
ID
Time
Event_Start_Date
Event_End_Date
241856
2020-10-18T09:16:49.000+0000
2020-11-12T20:15:00.000+0000
2020-11-12T20:45:00.000+0000
In 'Time' there is data worth 2 months for individual ID's. Different ID's have different event start and end datetimes However, I want to select data only between 'event start date' and 'event end date'.
I have tried the following but it doesn't seem to return what I want
refined_df = event_df.where(( col ('Time') >= col ('Event_Start_Date')) & ( col ('Time') <= col ('Event_End_Date ')) )
Not sure why your line isn't working for you, but you can also try using between:
import pyspark.sql.functions as F
data = [(241856, '2020-10-18T09:16:49.000+0000', '2019-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000'),
(241857, '2020-10-18T09:16:49.000+0000', '2020-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000')]
df = spark.sparkContext.parallelize(data).toDF(['ID','Time','Event_Start_Date','Event_End_Date'])
df.show()
df.filter(F.col('Time').between(F.col('Event_Start_Date'), F.col('Event_End_Date'))).show()
returns
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
|241857|2020-10-18T09:16:...|2020-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
I would like to compare 2 dataframes in pyspark.
Below is my test case dataset (from google).
So I have 2 df's
Base DF
Secondary DF
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF with 2 keys (No and Name), if those fields match (I only need the matched records from secDF)then I have to update the salary and Dept field of baseDF with the value from secDF
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using pyspark I can use subtract() to find the values of table1 not present in table2, and consequently use unionAll of the two tables or should I use withcolumn to overwrite values satisfying the condition.
Could someone suggest a good way of doing this?
Update ---
I have to compare secDF and baseDF with 2 keys (No and Name), if those fields match (I only need the matched records from secDF)then I have to update the salary and Dept field of baseDF with the value from secDF.
You can do a left join and coalesce the resulting Sal column, with secdf taking precedence over basedf:
import pyspark.sql.functions as F
result = basedf.alias('basedf').join(
secdf.alias('secdf'),
['No', 'Name'],
'left'
).select(
[F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal')
if c == 'Sal'
else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept')
if c == 'Dept'
else f'basedf.{c}'
for c in basedf.columns]
)
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
Given the following example dataframe:
advertiser_id| name | amount | total |max_total_advertiser|
4061 |source1|-434.955284|-354882.75336200005| -355938.53950700007
4061 |source2|-594.012216|-355476.76557800005| -355938.53950700007
4061 |source3|-461.773929|-355938.53950700007| -355938.53950700007
I need to sum the amount and the max_total_advertiser field in order to get the correct total value in each row. Taking into account that I need this total value for every group partitioned by advertiser_id. (The total column in the initial dataframe is incorrect, that's why I want to calculate correctly)
Something like that should be:
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")
df.withColumn("total_aux", when( lag("advertiser_id").over(w) == col("advertiser_id"), lag("total_aux").over(w) + col("amount") ).otherwise( col("max_total_advertiser") + col("amount") ))
This lag("total_aux") is not working because the column is not generated yet, that's what I want to achieve, if it is the first row in the group, sum the columns in the same row if not sum the previous obtained value with the current amount field.
Example output:
advertiser_id| name | amount | total_aux |
4061 |source1|-434.955284|-356373.494791 |
4061 |source2|-594.012216|-356967.507007 |
4061 |source3|-461.773929|-357429.280936 |
Thanks.
I assume that name is a distinct value for each advertiser_id and your dataset is therefore sortable by name. I also assume that max_total_advertiser contains the same value for each advertiser_id. If one of those is not the case, please add a comment.
What you need is a rangeBetween window which gives you all preceding and following rows within the specified range. We will use Window.unboundedPreceding as we want to sum up all the previous values.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
(4061, 'source1',-434.955284,-354882.75336200005, -355938.53950700007),
(4061, 'source2',-594.012216,-355476.76557800005, -345938.53950700007),
(4062, 'source1',-594.012216,-355476.76557800005, -5938.53950700007),
(4062, 'source2',-594.012216,-355476.76557800005, -5938.53950700007),
(4061, 'source3',-461.773929,-355938.53950700007, -355938.53950700007)
]
columns = ['advertiser_id','name' ,'amount', 'total', 'max_total_advertiser']
df=spark.createDataFrame(l, columns)
w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
You might be looking for the orderBy() function. Does this work?
from pyspark.sql.window import *
df.withColumn("cumulativeSum", sum(df("amount"))
.over( Window.partitionBy("advertiser_id").orderBy("amount")))
I have the following pyspark df:
+------------------+--------+-------+
| ID| Assets|Revenue|
+------------------+--------+-------+
|201542399349300619| 1633944| 32850|
|201542399349300629| 3979760| 850914|
|201542399349300634| 3402687|1983568|
|201542399349300724| 1138291|1097553|
|201522369349300122| 1401406|1010828|
|201522369349300137| 16948| 171534|
|201522369349300142|13474056|2285323|
|201522369349300202| 481045| 241788|
|201522369349300207| 700861|1185640|
|201522369349300227| 178479| 267976|
+------------------+--------+-------+
For each row, I want to be able to get the rows that are within 20% of the Assets amount. For example, for the first row (ID=201542399349300619), I want to be able to get all the rows where Assets are within 20% +/- of 1,633,944 (so between 1,307,155 to 1,960,732):
+------------------+--------+-------+
| ID| Assets|Revenue|
+------------------+--------+-------+
|201542399349300619| 1633944| 32850|
|201522369349300122| 1401406|1010828|
Using this subsetted table, I want to get the average assets and add it as a new column. So for the above example, it would be the average assets of (1633944+1401406) = 1517675
+------------------+--------+-------+---------+
| ID| Assets|Revenue|AvgAssets|
+------------------+--------+-------+---------+
|201542399349300619| 1633944| 32850| 1517675|
Assuming your DataFrame has a schema similar to the following (i.e. Assets and Revenue are numeric):
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Assets: integer (nullable = true)
# |-- Revenue: integer (nullable = true)
You can join the DataFrame to itself on the condition that you've set forth. After the join, you can group and aggregate by taking the average of the Assets column.
For example:
from pyspark.sql.functions import avg, expr
df.alias("l")\
.join(
df.alias("r"),
on=expr("r.assets between l.assets*0.8 and l.assets*1.2")
)\
.groupBy("l.ID", "l.Assets", "l.Revenue")\
.agg(avg("r.Assets").alias("AvgAssets"))\
.show()
#+------------------+--------+-------+------------------+
#| ID| Assets|Revenue| AvgAssets|
#+------------------+--------+-------+------------------+
#|201542399349300629| 3979760| 850914| 3691223.5|
#|201522369349300202| 481045| 241788| 481045.0|
#|201522369349300207| 700861|1185640| 700861.0|
#|201522369349300137| 16948| 171534| 16948.0|
#|201522369349300142|13474056|2285323| 1.3474056E7|
#|201522369349300227| 178479| 267976| 178479.0|
#|201542399349300619| 1633944| 32850| 1517675.0|
#|201522369349300122| 1401406|1010828|1391213.6666666667|
#|201542399349300724| 1138291|1097553| 1138291.0|
#|201542399349300634| 3402687|1983568| 3691223.5|
#+------------------+--------+-------+------------------+
Since we are joining the DataFrame to itself, we can use aliases to refer to the left table ("l") and the right table ("r"). The logic above says join l to r on the condition that the assets in r is +/20% of the assets in l.
There are multiple ways to express the +/20% condition, but I am using the spark-sql between expression to find rows that are between Assets * 0.8 and Assets * 1.2.
Then we aggregate on all of the columns (groupBy) of the left table and average over the assets in the right table.
The resulting AvgAssets column is a FloatType column, but you can easily convert it to IntegerType by adding a .cast("int") before the .alias("AvgAssets") if that's what you prefer.
See also:
What are the various join types in Spark?