I am trying to select data for multiple IDs within a time range using PySpark.
I have four columns in a Spark DataFrame, 'event_df':
ID     |Time                         |Event_Start_Date             |Event_End_Date
241856 |2020-10-18T09:16:49.000+0000 |2020-11-12T20:15:00.000+0000 |2020-11-12T20:45:00.000+0000
'Time' holds about two months of data for each ID. Different IDs have different event start and end datetimes. I want to select only the rows where 'Time' falls between 'Event_Start_Date' and 'Event_End_Date'.
I have tried the following, but it doesn't return what I want:
refined_df = event_df.where(( col ('Time') >= col ('Event_Start_Date')) & ( col ('Time') <= col ('Event_End_Date ')) )
Not sure why your line isn't working for you, but you can also try using between:
import pyspark.sql.functions as F
data = [(241856, '2020-10-18T09:16:49.000+0000', '2019-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000'),
        (241857, '2020-10-18T09:16:49.000+0000', '2020-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000')]
df = spark.sparkContext.parallelize(data).toDF(['ID','Time','Event_Start_Date','Event_End_Date'])
df.show()
df.filter(F.col('Time').between(F.col('Event_Start_Date'), F.col('Event_End_Date'))).show()
returns
+------+--------------------+--------------------+--------------------+
|    ID|                Time|    Event_Start_Date|      Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
|241857|2020-10-18T09:16:...|2020-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
+------+--------------------+--------------------+--------------------+
|    ID|                Time|    Event_Start_Date|      Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
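Note that the example above compares the columns as strings, which works here because every value uses the same ISO 8601 layout and the same UTC offset. If layouts or offsets can differ, parsing the values first is safer. The sketch below shows the same inclusive range check in plain Python on the two sample rows; `parse` and `in_event_window` are hypothetical helper names introduced for illustration and are not part of PySpark.

```python
from datetime import datetime

# Layout of the sample values, e.g. 2020-10-18T09:16:49.000+0000
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse(ts):
    """Parse one timestamp string into a timezone-aware datetime."""
    return datetime.strptime(ts, FMT)


def in_event_window(row):
    """True when Time lies between the start and end dates, both ends inclusive."""
    _id, time, start, end = row
    return parse(start) <= parse(time) <= parse(end)


data = [
    (241856, "2020-10-18T09:16:49.000+0000", "2019-11-12T20:15:00.000+0000", "2020-11-12T20:45:00.000+0000"),
    (241857, "2020-10-18T09:16:49.000+0000", "2020-11-12T20:15:00.000+0000", "2020-11-12T20:45:00.000+0000"),
]

selected = [row[0] for row in data if in_event_window(row)]
print(selected)  # [241856]
```

Only the first row is kept, which matches the filtered output shown above.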
Problem
Is there a way in PySpark/Spark to select the first element over a window, subject to a condition?
Examples
Let's have an example input dataframe
+---------+----------+----+----+----------------+
|       id| timestamp|  f1|  f2|        computed|
+---------+----------+----+----+----------------+
|        1|2020-01-02|null|c1f2|            [f2]|
|        1|2020-01-01|c1f1|null|            [f1]|
|        2|2020-01-01|c2f1|null|            [f1]|
+---------+----------+----+----+----------------+
For each id, I want to select the latest value of each column (f1, f2, ...) that was computed, i.e. whose name appears in the computed array.
So the "code" would look something like this:
# Pseudocode: f.first() has no `condition` parameter; this only expresses the intent
cols = ["f1", "f2"]
w = Window().partitionBy("id").orderBy(f.desc("timestamp")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
output_df = (
    input_df.select(
        "id",
        *[f.first(col, condition=f.array_contains(f.col("computed"), col)).over(w).alias(col) for col in cols]
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
And output should be
+---------+----+----+
|       id|  f1|  f2|
+---------+----+----+
|        1|c1f1|c1f2|
|        2|c2f1|null|
+---------+----+----+
If the input looks like this
+---------+----------+----+----+----------------+
|       id| timestamp|  f1|  f2|        computed|
+---------+----------+----+----+----------------+
|        1|2020-01-02|null|c1f2|        [f1, f2]|
|        1|2020-01-01|c1f1|null|            [f1]|
|        2|2020-01-01|c2f1|null|            [f1]|
+---------+----------+----+----+----------------+
Then the output should be
+---------+----+----+
|       id|  f1|  f2|
+---------+----+----+
|        1|null|c1f2|
|        2|c2f1|null|
+---------+----+----+
As you can see, simply using f.first(..., ignorenulls=True) is not enough, because in this case the null must not be skipped: it counts as the computed value.
Current solution
Step 1
Save original data types
cols = ["f1", "f2"]
orig_dtypes = [field.dataType for field in input_df.schema if field.name in cols]
Step 2
For each column, create a new column holding its value if the column is computed, replacing an original null with our "synthetic" <NULL> string
output_df = input_df.select(
    "id", "timestamp", "computed",
    *[
        f.when(f.array_contains(f.col("computed"), col) & f.col(col).isNotNull(), f.col(col))
        .when(f.array_contains(f.col("computed"), col) & f.col(col).isNull(), "<NULL>")
        .alias(col)
        for col in cols
    ]
)
Step 3
Select first non null value over window because now we know that <NULL> won't be skipped
output_df = (
    output_df.select(
        "id",
        *[f.first(col, ignorenulls=True).over(w).alias(col) for col in cols],
    )
    .groupBy("id")
    .agg(*[f.first(col).alias(col) for col in cols])
)
Step 4
Replace our "synthetic" <NULL> for original nulls.
output_df = output_df.replace("<NULL>", None)
Step 5
Cast the columns back to their original types, because they might have been retyped to string in step 2
output_df = output_df.select("id", *[f.col(col).cast(type_) for col, type_ in zip(cols, orig_dtypes)])
This solution works but it does not seem to be the right way to do it. Besides it's pretty heavy and it's taking too long to get computed.
Is there any other more "sparkish" way to do it?
Here's one way, using the struct ordering trick.
Group by id and, for each column in the cols list, collect a list of structs of the form struct<col_exists_in_computed, timestamp, col_value>. Structs are compared field by field, so array_max on the resulting array returns the struct from the latest row in which the column was computed, which holds the value you want:
from pyspark.sql import functions as F
output_df = input_df.groupBy("id").agg(
    *[F.array_max(
        F.collect_list(
            F.struct(F.array_contains("computed", c), F.col("timestamp"), F.col(c))
        )
    )[c].alias(c) for c in cols]
)
# applied to your second dataframe example, it gives
output_df.show()
#+---+----+----+
#| id| f1| f2|
#+---+----+----+
#| 1|null|c1f2|
#| 2|c2f1|null|
#+---+----+----+
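To see why the ordering works, here is a rough plain-Python analogue of the same idea, using tuples in place of structs. It is only a sketch: `latest_computed` is a hypothetical helper, the rows are the second example input written as dicts, and the comparison is restricted to the first two tuple fields because Python, unlike Spark, cannot order None against a string.

```python
from collections import defaultdict

rows = [
    {"id": 1, "timestamp": "2020-01-02", "f1": None, "f2": "c1f2", "computed": ["f1", "f2"]},
    {"id": 1, "timestamp": "2020-01-01", "f1": "c1f1", "f2": None, "computed": ["f1"]},
    {"id": 2, "timestamp": "2020-01-01", "f1": "c2f1", "f2": None, "computed": ["f1"]},
]
cols = ["f1", "f2"]


def latest_computed(rows, cols):
    """For each id and column, take the value from the max (is_computed, timestamp) row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["id"]].append(row)

    result = {}
    for id_, group in groups.items():
        result[id_] = {}
        for c in cols:
            # True sorts above False, so computed rows win; ties go to the later timestamp
            candidates = [(c in r["computed"], r["timestamp"], r[c]) for r in group]
            best = max(candidates, key=lambda t: t[:2])
            result[id_][c] = best[2]
    return result


output = latest_computed(rows, cols)
print(output)  # {1: {'f1': None, 'f2': 'c1f2'}, 2: {'f1': 'c2f1', 'f2': None}}
```

One consequence of this ordering is that when no row of a group has the column in computed, the value from the latest row is returned as-is.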
I'm working on some data preparation for a project I'm involved in. We do most of the work in Databricks, using the underlying Apache Spark for computations on large datasets. Everything is done in PySpark.
My goal is to convert a date variable to a variable yearperiod, which divides the year into 13 periods of 4 weeks (with some exceptions). The value is a concatenation of the year and the period, e.g. yearperiod = 201513 would be the year 2015, period 13.
I have two tables: yp_table which contains start and end dates (Edit: type DateType()) for yearperiods (between 2012 and now, Edit: ~120 rows):
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
....
And I have the actual data table, which contains a Date column (Edit: type StringType()):
+--------+--------+--------+-----+
| Var1| Var2| Date| Var3|
+--------+--------+--------+-----+
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
...
My question: how do I compute a column yearperiod for the data table, by comparing data.Date with both yp_table.start and yp_table.end?
So far I've been able to make it work with regular Python (a solution with list comprehensions), but it proves to be too slow for large datasets. Any help is greatly appreciated!
Edit: for privacy reasons I can't give the actual schemas of the dataframes. I've edited above to include the types of the relevant columns.
Add a column to your data df that contains the dates in the matching format to the yp_table and then join them filtering by date intervals. Since the yp_table is small, you can use a broadcast join to speed things up.
import pandas as pd
import pyspark.sql.functions as fun
# Date lookup
start_dates = ["2012-01-16", "2012-01-30", "2012-02-27", "2012-03-26", "2012-04-23", "2012-05-21"]
end_dates = ["2012-01-29", "2012-02-26", "2012-03-25", "2012-04-22", "2012-05-20", "2012-06-17"]
yearperiod = ["201201", "201202", "201203", "201204", "201205", "201206"]
yp_table = spark.createDataFrame(pd.DataFrame({'start': start_dates, 'end': end_dates, 'yearperiod': yearperiod}))
# Data df
dates = ["20120116", "20120130", "20120228", "20120301", "20200101", "20200101", "20200101"]
vals = range(0, len(dates))
data = spark.createDataFrame(pd.DataFrame({'Dates':dates, 'vals': vals}))
# Add formatted date_str column for joining (substr positions are 1-based)
data = data.withColumn("date_str", fun.concat_ws("-", data.Dates.substr(1,4), data.Dates.substr(5,2), data.Dates.substr(7,2)))
# Broadcast join the small yp_table into the data table; end dates are inclusive, hence <=
joined = data.join(fun.broadcast(yp_table), (data.date_str >= yp_table.start) & (data.date_str <= yp_table.end))
yp_table.show()
data.show()
joined.show()
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
+----------+----------+----------+
+--------+----+----------+
| Dates|vals| date_str|
+--------+----+----------+
|20120116| 0|2012-01-16|
|20120130| 1|2012-01-30|
|20120228| 2|2012-02-28|
|20120301| 3|2012-03-01|
|20200101| 4|2020-01-01|
|20200101| 5|2020-01-01|
|20200101| 6|2020-01-01|
+--------+----+----------+
+--------+----+----------+----------+----------+----------+
| Dates|vals| date_str| start| end|yearperiod|
+--------+----+----------+----------+----------+----------+
|20120116| 0|2012-01-16|2012-01-16|2012-01-29| 201201|
|20120130| 1|2012-01-30|2012-01-30|2012-02-26| 201202|
|20120228| 2|2012-02-28|2012-02-27|2012-03-25| 201203|
|20120301| 3|2012-03-01|2012-02-27|2012-03-25| 201203|
+--------+----+----------+----------+----------+----------+
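To sanity-check the boundary behaviour of the range condition on a few dates without a Spark session, the lookup can be sketched in plain Python. This is not a replacement for the broadcast join on large data; `yearperiod_for` is a hypothetical helper, and it assumes the periods are sorted by start date, do not overlap, and have inclusive end dates (as in the yp_table shown in the question).

```python
from bisect import bisect_right
from datetime import date, datetime

# (start, end, yearperiod), sorted by start date; same rows as the sample yp_table
periods = [
    (date(2012, 1, 16), date(2012, 1, 29), "201201"),
    (date(2012, 1, 30), date(2012, 2, 26), "201202"),
    (date(2012, 2, 27), date(2012, 3, 25), "201203"),
    (date(2012, 3, 26), date(2012, 4, 22), "201204"),
    (date(2012, 4, 23), date(2012, 5, 20), "201205"),
    (date(2012, 5, 21), date(2012, 6, 17), "201206"),
]
starts = [p[0] for p in periods]


def yearperiod_for(date_str):
    """Return the yearperiod for a 'YYYYMMDD' string, or None if no period covers it."""
    d = datetime.strptime(date_str, "%Y%m%d").date()
    i = bisect_right(starts, d) - 1  # last period whose start is <= d
    if i >= 0 and d <= periods[i][1]:
        return periods[i][2]
    return None


print(yearperiod_for("20120129"))  # 201201 (the end date itself belongs to the period)
print(yearperiod_for("20120301"))  # 201203
print(yearperiod_for("20200101"))  # None (outside the sample lookup)
```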
I would like to compare 2 dataframes in pyspark.
Below is my test case dataset (from google).
So I have 2 df's
Base DF
Secondary DF
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF on 2 keys (No and Name). If those fields match (I only need the matched records from secDF), then I have to update the Sal and Dept fields of baseDF with the values from secDF.
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using PySpark, I could use subtract() to find the rows of table1 not present in table2 and then unionAll the two tables, or should I use withColumn to overwrite the values that satisfy the condition?
Could someone suggest a good way of doing this?
You can do a left join and coalesce the resulting Sal and Dept columns, with secdf taking precedence over basedf:
import pyspark.sql.functions as F
result = basedf.alias('basedf').join(
    secdf.alias('secdf'),
    ['No', 'Name'],
    'left'
).select(
    [F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal')
     if c == 'Sal'
     else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept')
     if c == 'Dept'
     else f'basedf.{c}'
     for c in basedf.columns]
)
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
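The left join plus coalesce amounts to "look up each base row by its key and prefer the secondary value when one exists". A plain-Python sketch of that rule on the sample data follows; `update_from_secondary` is a hypothetical helper used only to illustrate the logic, not a PySpark function.

```python
base = [
    (11, "Sam", 1000, "ind", "IT", "2/11/2019"),
    (22, "Tom", 2000, "usa", "HR", "2/11/2019"),
    (33, "Kom", 3500, "uk", "IT", "2/11/2019"),
    (44, "Nom", 4000, "can", "HR", "2/11/2019"),
    (55, "Vom", 5000, "mex", "IT", "2/11/2019"),
    (66, "XYZ", 5000, "mex", "IT", "2/11/2019"),
]
sec = [
    (11, "Sam", 1000, "ind", "ITA", "2/11/2019"),
    (22, "Tom", 2500, "usa", "HRA", "2/11/2019"),
    (33, "Kom", 3000, "uk", "ITA", "2/11/2019"),
    (44, "Nom", 4600, "can", "HRA", "2/11/2019"),
    (55, "Vom", 8000, "mex", "ITA", "2/11/2019"),
    (77, "XYZ", 5000, "mex", "ITA", "2/11/2019"),
]


def update_from_secondary(base, sec):
    """Keep every base row; take Sal and Dept from sec when (No, Name) matches."""
    sec_by_key = {(no, name): (sal, dept) for no, name, sal, _addr, dept, _date in sec}
    result = []
    for no, name, sal, addr, dept, join_date in base:
        sec_sal, sec_dept = sec_by_key.get((no, name), (None, None))
        # Same fallback as coalesce: use the base value when the secondary one is missing
        new_sal = sec_sal if sec_sal is not None else sal
        new_dept = sec_dept if sec_dept is not None else dept
        result.append((no, name, new_sal, addr, new_dept, join_date))
    return result


updated = update_from_secondary(base, sec)
for row in updated:
    print(row)
```

Row 66 stays unchanged because secDF has no (66, XYZ) key, and the (77, XYZ) row from secDF is not added.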
Cog in the Machine:
The data contains the current 12 months and is stacked horizontally. Each month, the earlier months are revised and a new month is appended.
ID |Date |Month1_a |Month1_b |Month1_c |Month2_a |Month2_b |Month2_c |Month3_a |Month3_b |Month3_c
## |MM/DD/YYYY |abc |zxy |123 |NULL |zxy |122 |abc |zxy |123
The actual data file has no headers and is ingested downstream as a distinct file per month (File Month 1, etc.):
ID | Date |Month1_a |Month1_b |Month1_c |New Column
## |MM/DD/YYYY |abc |zxy |123 | #
ID | Date |Month2_a |Month2_b |Month2_c |New Column
## |MM/DD/YYYY |NULL |zxy |122 | #
Other than copying the file 12 times, is there any suggestion for reading it once and looping through to create my outputs? I've worked out the logic for Month 1, but I'm stuck on how to move on to Month 2+.
I was originally thinking Read File > Drop Month 3+ > Drop Month 1 > Run Logic, but I'm not sure whether there is a better or best practice.
Thanks.
This will output n CSV files, where n is the number of months in your input data. Hopefully this is what you are after.
import pandas as pd
df = pd.read_csv('my_data.csv', sep='|')
# Strip whitespace from column names
df.columns = [x.strip() for x in df.columns]
# Get the set of month numbers by splitting on _ and removing 'Month' from
# the first part
months = {x.split('_')[0].replace('Month', '') for x in df.columns if x.startswith('Month')}
# For each month number, add the columns for exactly that month to the ID and
# Date columns and write them to a csv with that month number in the file name
for month in months:
    base_columns = ['ID', 'Date']
    # Match the full 'MonthN_' prefix so that Month1 does not also pick up Month10-12
    base_columns.extend([x for x in df.columns if x.startswith(f'Month{month}_')])
    df[base_columns].to_csv(f'Month_{month}.csv', index=False)
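When grouping columns by month number with 12 months of data, it helps to match the whole `MonthN_` prefix, since a plain substring test would treat Month1 as part of Month10, Month11 and Month12. The sketch below shows one way to do the grouping with a regular expression; `columns_by_month` is a hypothetical helper and the column names are made up to follow the question's naming pattern.

```python
import re

MONTH_PREFIX = re.compile(r"Month(\d+)_")


def columns_by_month(columns):
    """Map each month number to the list of columns carrying that exact MonthN_ prefix."""
    groups = {}
    for name in columns:
        match = MONTH_PREFIX.match(name)
        if match:
            groups.setdefault(int(match.group(1)), []).append(name)
    return groups


columns = ["ID", "Date", "Month1_a", "Month1_b", "Month2_a", "Month10_a", "Month10_b"]
groups = columns_by_month(columns)
print(groups)
# {1: ['Month1_a', 'Month1_b'], 2: ['Month2_a'], 10: ['Month10_a', 'Month10_b']}
```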
Trips
id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10
prompts
id,timestamp ,mode
1008,2003-11-03 15:18:49,car
1009,2003-11-03 22:04:20,metro
1009,2003-11-14 10:04:20,bike
Read csv file:
coordinates = pd.read_csv('coordinates.csv')
mode = pd.read_csv('prompts.csv')
I have to assign each mode to the last row of its trip.
Results:
id, timestamp, mode
1008, 2003-11-03 15:00:31, null
1008, 2003-11-03 15:02:38, null
1008, 2003-11-03 15:03:04, null
1008, 2003-11-03 15:18:00, car
1009, 2003-11-03 22:00:00, null
1009, 2003-11-03 22:02:53, null
1009, 2003-11-03 22:03:44, metro
1009, 2003-11-14 10:00:00, null
1009, 2003-11-14 10:02:02, null
1009, 2003-11-14 10:03:10, bike
Note
I use a large dataset for trips (4GB) and a small dataset for modes (500MB)
Based on your updated example, you can denote a trip by finding the first prompt timestamp that is greater than the trip timestamp. All rows with the same prompt timestamp will then correspond to the same trip. Then you want to set the mode for the greatest of the trip timestamps for each group.
One way to do this is by using 2 pyspark.sql.Windows.
Suppose you start with the following two PySpark DataFrames, trips and prompts:
trips.show(truncate=False)
#+----+-------------------+
#|id |timestamp |
#+----+-------------------+
#|1008|2003-11-03 15:00:31|
#|1008|2003-11-03 15:02:38|
#|1008|2003-11-03 15:03:04|
#|1008|2003-11-03 15:18:00|
#|1009|2003-11-03 22:00:00|
#|1009|2003-11-03 22:02:53|
#|1009|2003-11-03 22:03:44|
#|1009|2003-11-14 10:00:00|
#|1009|2003-11-14 10:02:02|
#|1009|2003-11-14 10:03:10|
#|1009|2003-11-15 10:00:00|
#+----+-------------------+
prompts.show(truncate=False)
#+----+-------------------+-----+
#|id |timestamp |mode |
#+----+-------------------+-----+
#|1008|2003-11-03 15:18:49|car |
#|1009|2003-11-03 22:04:20|metro|
#|1009|2003-11-14 10:04:20|bike |
#+----+-------------------+-----+
Join these two tables together using the id column with the condition that the prompt timestamp is greater than or equal to the trip timestamp. For some trip timestamps, this will result in multiple prompt timestamps. We can eliminate these by selecting the minimum prompt timestamp for each ('id', 'trips.timestamp') group. I call this temporary column indicator, and I use the Window w1 to compute it.
Next do a window over ('id', 'indicator') and find the maximum trip timestamp for each group. Set this value equal to the mode. All other rows will be set to pyspark.sql.functions.lit(None).
Finally you can compute all of the entries in trips where the trip timestamp was greater than the max prompt timestamp. These would be trips that did not match to a prompt. Union the matched and the unmatched together.
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('id', 'trips.timestamp')
w2 = Window.partitionBy('id', 'indicator')
matched = trips.alias('trips').join(prompts.alias('prompts'), on='id', how='left')\
    .where('prompts.timestamp >= trips.timestamp' )\
    .select(
        'id',
        'trips.timestamp',
        'mode',
        f.when(
            f.col('prompts.timestamp') == f.min('prompts.timestamp').over(w1),
            f.col('prompts.timestamp'),
        ).otherwise(f.lit(None)).alias('indicator')
    )\
    .where(~f.isnull('indicator'))\
    .select(
        'id',
        f.col('trips.timestamp').alias('timestamp'),
        f.when(
            f.col('trips.timestamp') == f.max(f.col('trips.timestamp')).over(w2),
            f.col('mode')
        ).otherwise(f.lit(None)).alias('mode')
    )
unmatched = trips.alias('t').join(prompts.alias('p'), on='id', how='left')\
    .withColumn('max_prompt_time', f.max('p.timestamp').over(Window.partitionBy('id')))\
    .where('t.timestamp > max_prompt_time')\
    .select('id', 't.timestamp', f.lit(None).alias('mode'))\
    .distinct()
Output:
matched.union(unmatched).sort('id', 'timestamp').show()
+----+-------------------+-----+
| id| timestamp| mode|
+----+-------------------+-----+
|1008|2003-11-03 15:00:31| null|
|1008|2003-11-03 15:02:38| null|
|1008|2003-11-03 15:03:04| null|
|1008|2003-11-03 15:18:00| car|
|1009|2003-11-03 22:00:00| null|
|1009|2003-11-03 22:02:53| null|
|1009|2003-11-03 22:03:44|metro|
|1009|2003-11-14 10:00:00| null|
|1009|2003-11-14 10:02:02| null|
|1009|2003-11-14 10:03:10| bike|
|1009|2003-11-15 10:00:00| null|
+----+-------------------+-----+
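The matching rule used above (each trip row belongs to the first prompt at or after it, and only the last trip row of each group receives the mode) can be checked on the sample data without Spark. This is a plain-Python sketch rather than the PySpark code itself; `assign_modes` and `next_prompt` are hypothetical helper names, and it assumes timestamps are strings in the 'YYYY-MM-DD HH:MM:SS' layout so that they sort chronologically.

```python
from bisect import bisect_left
from collections import defaultdict


def assign_modes(trips, prompts):
    """trips: list of (id, ts); prompts: list of (id, ts, mode). Returns (id, ts, mode) rows."""
    prompts_by_id = defaultdict(list)
    for id_, ts, mode in prompts:
        prompts_by_id[id_].append((ts, mode))
    for plist in prompts_by_id.values():
        plist.sort()

    def next_prompt(id_, ts):
        # First prompt for this id whose timestamp is >= the trip timestamp, if any
        plist = prompts_by_id.get(id_, [])
        i = bisect_left(plist, (ts,))
        return plist[i] if i < len(plist) else None

    # Latest trip timestamp within each (id, prompt timestamp) group
    last_trip = {}
    for id_, ts in trips:
        prompt = next_prompt(id_, ts)
        if prompt is not None:
            key = (id_, prompt[0])
            last_trip[key] = max(ts, last_trip.get(key, ts))

    result = []
    for id_, ts in trips:
        prompt = next_prompt(id_, ts)
        is_last = prompt is not None and last_trip[(id_, prompt[0])] == ts
        result.append((id_, ts, prompt[1] if is_last else None))
    return result


trips = [
    (1008, "2003-11-03 15:00:31"), (1008, "2003-11-03 15:02:38"),
    (1008, "2003-11-03 15:03:04"), (1008, "2003-11-03 15:18:00"),
    (1009, "2003-11-03 22:00:00"), (1009, "2003-11-03 22:02:53"),
    (1009, "2003-11-03 22:03:44"), (1009, "2003-11-14 10:00:00"),
    (1009, "2003-11-14 10:02:02"), (1009, "2003-11-14 10:03:10"),
    (1009, "2003-11-15 10:00:00"),
]
prompts = [
    (1008, "2003-11-03 15:18:49", "car"),
    (1009, "2003-11-03 22:04:20", "metro"),
    (1009, "2003-11-14 10:04:20", "bike"),
]

labelled = assign_modes(trips, prompts)
for row in labelled:
    print(row)
```

The last row (2003-11-15) has no later prompt, so it keeps a null mode, as in the unmatched branch above.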
This is a naive solution that assumes your coordinates DataFrame is already sorted by timestamp, that ids are unique, and that your data set fits into memory. If the latter is not the case, I recommend using dask and partitioning your DataFrames by id.
Imports:
import pandas as pd
import numpy as np
First we join the two DataFrames. This will fill the whole mode column for each id. We join on the index because that will speed up the operation, see also "Improve Pandas Merge performance".
mode = mode.set_index('id')
coordinates = coordinates.set_index('id')
merged = coordinates.join(mode, how='left')
We need the index to be unique values in order for our groupby operation to work.
merged = merged.reset_index()
Then we apply a function that will replace all but the last row in the mode column for each id.
def clean_mode_col(df):
    cleaned_mode_col = df['mode'].copy()
    cleaned_mode_col.iloc[:-1] = np.nan
    df['mode'] = cleaned_mode_col
    return df
merged = merged.groupby('id').apply(clean_mode_col)
As mentioned above, you can use dask to parallelize the execution of the merge code like this:
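import dask.dataframe as dd
# Assumes coordinates and mode are the freshly loaded frames, with 'id' still a regular column;
# npartitions=8 is an arbitrary example value
dd_coordinates = dd.from_pandas(coordinates, npartitions=8).set_index('id')
dd_mode = dd.from_pandas(mode, npartitions=8).set_index('id')
merged = dd.merge(dd_coordinates, dd_mode, left_index=True, right_index=True)
merged = merged.compute()  # returns pandas DataFrame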
The set_index operations are slow but make the merge way faster.
I did not test this code. Please provide copy-pasteable code that includes your DataFrames so that I don't have to copy and paste all those files you have in your description (hint: use pd.DataFrame.to_dict to export your DataFrame as a dictionary and copy and paste that into your code).