I have two dataframes containing on-equipment and equipment-surrounding information. A row from the on-equipment dataframe has the following schema:
on_equipment_df.head()
# Output of notebook:
Row(
on_eq_equipid='1',
on_eq_timestamp=datetime.datetime(2020, 10, 7, 15, 27, 42, 866098),
on_eq_charge=3.917107463423725, on_eq_electric_consumption=102.02754516792204, on_eq_force=10.551710736897613, on_eq_humidity=22.663245200558457, on_eq_pressure=10.813417893943944, on_eq_temperature=29.80448721128125, on_eq_vibration=4.376662536641158,
measurement_status='Bad')
And a row from the equipment-surrounding dataframe looks like:
equipment_surrounding_df.head()
# Output of notebook
Row(
eq_surrounding_equipid='1',
eq_surrounding_timestamp=datetime.datetime(2020, 10, 7, 15, 27, 42, 903198),
eq_surrounding_dust=24.0885168316774, eq_surrounding_humidity=16.949569353381793, eq_surrounding_noise=12.256649392702574, eq_surrounding_temperature=8.141877435145844,
measurement_status='Good')
Notice that both tables have IDs identifying the equipment and a timestamp indicating when the measurement was taken.
Problem: I want to perform a join between these two dataframes based on the equipment id and the timestamps. The issue is that the timestamps are recorded with very high precision, which makes an exact join on the timestamp impossible (unless I round the timestamp, which I would like to leave as a last resort). The on-equipment and equipment-surrounding readings are also recorded at different frequencies. I therefore want to join only on the equipment id, but restricted to windows between certain timestamp values. This is similar to what is done in structured streaming using watermarking.
To do this I tried to use the equivalent operation from structured streaming mentioned above, watermarking. Here is the code:
from pyspark.sql.functions import expr

# Add watermark defined by the timestamp column of each df
watermarked_equipment_surr = equipment_surrounding_df.withWatermark("eq_surrounding_timestamp", "0.2 seconds")
watermarked_on_equipment = on_equipment_df.withWatermark("on_eq_timestamp", "0.2 seconds")

# Define new equipment dataframe based on the watermarking conditions described
equipment_df = watermarked_on_equipment.join(
    watermarked_equipment_surr,
    expr("""
        on_eq_equipid = eq_surrounding_equipid AND
        on_eq_timestamp >= eq_surrounding_timestamp AND
        on_eq_timestamp <= eq_surrounding_timestamp + interval 0.2 seconds
    """))

# Perform a count (error appears here)
print("Equipment size:", equipment_df.count())
I get an error when performing this action. Based on this, I have two questions:
Is this the right way to solve such a use case / problem?
If so, why do I get an error in my code?
Thank you in advance
UPDATE:
So I believe I found half the solution, inspired by:
Joining two spark dataframes on time (TimestampType) in python
Essentially the solution works by creating two columns in one of the dataframes, representing an upper and a lower limit for the timestamp, using UDFs. The code for that:
from datetime import timedelta

from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def lower_range_func(x, offset_milli=250):
    """
    From a timestamp and offset, get the timestamp obtained from subtracting the offset.
    """
    return x - timedelta(seconds=offset_milli/1000)

def upper_range_func(x, offset_milli=250):
    """
    From a timestamp and offset, get the timestamp obtained from adding the offset.
    """
    return x + timedelta(seconds=offset_milli/1000)

# Register the two range functions as UDFs
lower_range = udf(lower_range_func, TimestampType())
upper_range = udf(upper_range_func, TimestampType())
# Add these columns to the on_equipment dataframe
on_equipment_df = on_equipment_df\
    .withColumn('lower_on_eq_timestamp', lower_range(on_equipment_df["on_eq_timestamp"]))\
    .withColumn('upper_on_eq_timestamp', upper_range(on_equipment_df["on_eq_timestamp"]))
Once we have those columns, we can perform a filtered join using these new columns.
# Join dataframes based on a filtered join
equipment_df = on_equipment_df.join(equipment_surrounding_df)\
    .filter(equipment_surrounding_df.eq_surrounding_timestamp > on_equipment_df.lower_on_eq_timestamp)\
    .filter(equipment_surrounding_df.eq_surrounding_timestamp < on_equipment_df.upper_on_eq_timestamp)
The problem is that, as soon as I also try to join on the equipment id, like so:
# Join dataframes based on a filtered join
equipment_df = on_equipment_df.join(
    equipment_surrounding_df, on_equipment_df.on_eq_equipid == equipment_surrounding_df.eq_surrounding_equipid)\
    .filter(equipment_surrounding_df.eq_surrounding_timestamp > on_equipment_df.lower_on_eq_timestamp)\
    .filter(equipment_surrounding_df.eq_surrounding_timestamp < on_equipment_df.upper_on_eq_timestamp)
I get an error. Any thoughts on this approach?
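For reference, here is the same bounded join written with Spark's built-in interval arithmetic instead of UDFs, with the id and window conditions combined into a single join expression (just a sketch assuming the column names from the Row outputs above; I have not verified it against my data):
from pyspark.sql.functions import expr
# Build the lower/upper bounds directly with interval arithmetic instead of UDFs
on_equipment_bounded = on_equipment_df\
    .withColumn('lower_on_eq_timestamp', expr('on_eq_timestamp - interval 250 milliseconds'))\
    .withColumn('upper_on_eq_timestamp', expr('on_eq_timestamp + interval 250 milliseconds'))
# Join on the equipment id and the timestamp window in one condition
equipment_df = on_equipment_bounded.join(
    equipment_surrounding_df,
    (on_equipment_bounded.on_eq_equipid == equipment_surrounding_df.eq_surrounding_equipid) &
    (equipment_surrounding_df.eq_surrounding_timestamp > on_equipment_bounded.lower_on_eq_timestamp) &
    (equipment_surrounding_df.eq_surrounding_timestamp < on_equipment_bounded.upper_on_eq_timestamp))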
Related
I'm struggling with a problem at the moment.
Basically I have two DataFrames.
One is an export from my ERP system and gives me the current physical stock level, which should be enhanced with stock reservations per sales channel, e.g.
import numpy as np
import pandas as pd

Stock = pd.DataFrame(data={'SKU': [1,2,3], 'PhysicalStock': [100,1,2], 'FirstSeenInStock': [2,5,200], 'SafetyStock_Platform1': [np.nan,np.nan,np.nan], 'SafetyStock_Platform2': [np.nan,np.nan,np.nan]})
The columns SKU, Physical Stock and First Seen in Stock (which is days since this product was first seen with stock) come from the ERP system. The columns for Safety stock should be derived from another DataFrame, which is maintained by someone for all marketplaces and looks like this:
SafetyStock = pd.DataFrame(data={'FromAgeDays': [0,2,9], 'ToAgeDays': [3,10,999], 'SafetyStock_Platform1': [10,1,0], 'SafetyStock_Platform2': [5,3,0]})
What I tried with iloc is to identify the values from the dataframe SafetyStock and copy them into the Stock dataframe, considering the following logic:
Stock['FirstSeenInStock'] >= SafetyStock['FromAgeDays']
Stock['FirstSeenInStock'] <= SafetyStock['ToAgeDays']
The right column for each platform, which is why I named the columns the same in both dataframes
The desired outcome would be the following:
DesiredOutcome = pd.DataFrame(data={'SKU': [1,2,3], 'PhysicalStock': [100, 1, 2], 'FirstSeenInStock': [2,5,200], 'SafetyStock_Platform1': [10,1,0], 'SafetyStock_Platform2': [5,3,0]})
You should use the merge function in pandas, which is essentially what is known as a "join" in the database world.
Merging based on range conditions (known as a "non-equi join") is not well supported in pandas, so you typically merge first and then filter.
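As a rough sketch of the idea (assuming pandas 1.2+ for how='cross'; Stock is built here without the placeholder NaN columns so the merge does not add _x/_y suffixes, and note that your sample age bands overlap at 2 and 3 days, so you may need a tie-breaking rule there):
import pandas as pd
Stock = pd.DataFrame(data={'SKU': [1, 2, 3], 'PhysicalStock': [100, 1, 2], 'FirstSeenInStock': [2, 5, 200]})
SafetyStock = pd.DataFrame(data={'FromAgeDays': [0, 2, 9], 'ToAgeDays': [3, 10, 999], 'SafetyStock_Platform1': [10, 1, 0], 'SafetyStock_Platform2': [5, 3, 0]})
# Cross join every SKU with every age band, then keep only the bands the age falls into
merged = Stock.merge(SafetyStock, how='cross')
in_band = merged[(merged['FirstSeenInStock'] >= merged['FromAgeDays']) & (merged['FirstSeenInStock'] <= merged['ToAgeDays'])]
# Drop the helper range columns to get the desired layout
result = in_band.drop(columns=['FromAgeDays', 'ToAgeDays'])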
So I have two different tables right now. These tables contain various pieces of information, including one column that is a specific date.
Example:
[Table 1]
Unique Identifier (Primary Key) / Date / Piece of Information
0001 / December 1, 2020 / Apples
[Table 2]
Unique Identifier (Primary Key) / Date / Piece of Information
0001 / December 5, 2020 / Oranges
I am trying to compare the two tables: if the second table has a date that is AFTER the date in the first table (for the same unique identifier), I would like to write that row to a new table. There are a lot of rows in these tables, and I need to keep going through all of them. However, I can't seem to get this to work. This is what I am doing:
import pandas as pd
from pyspark.sql.functions import desc
from pyspark.sql import functions as F
def fluvoxamine_covid_hospital_progression(fluvoxamine_covids_that_were_also_in_hospital_at_any_time, fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk):
    df_fluvoxamine_covid_outpatients = pd.DataFrame(fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk)
    df_fluvoxamine_covid_outpatients.dropDuplicates(['visit_occurrence_id'])
    df_fluvoxamine_covid_outpatients.sort(desc('visit_start_date'))
    df_fluvoxamine_converted_hospital = pd.DataFrame(fluvoxamine_covids_that_were_also_in_hospital_at_any_time)
    df_fluvoxamine_converted_hospital.dropDuplicates(['visit_occurrence_id'])
    df_fluvoxamine_converted_hospital.sort(desc('visit_start_date'))
    i = 0
    if df_fluvoxamine_covid_outpatients.sort('visit_start_date') < df_fluvoxamine_converted_hospital.sort('visit_start_date'):
        i = i + 1
Try to break it down into steps. I renamed your variables for readability.
# renamed variables
converted = fluvoxamine_covids_that_were_also_in_hospital_at_any_time
outpatients = fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk
For the first step, load and de-duplicate the data much as you already do. Note that in pandas the method is drop_duplicates rather than Spark's dropDuplicates, and it returns a new DataFrame rather than modifying the existing one in place.
# Load and clean the data
covid_outpatients = pd.DataFrame(outpatients)
converted_hospital = pd.DataFrame(converted)
covid_outpatients = covid_outpatients.drop_duplicates(subset=['visit_occurrence_id'])
converted_hospital = converted_hospital.drop_duplicates(subset=['visit_occurrence_id'])
Next, join the data using the unique identifier column.
all_data = covid_outpatients.set_index('Unique Identifier (Primary Key)').join(
    converted_hospital.set_index('Unique Identifier (Primary Key)'),
    lsuffix='_outpatients', rsuffix='_converted')
Reset the index with the unique identifier column.
all_data['Unique Identifier (Primary Key)'] = all_data.index
all_data.reset_index(drop=True, inplace=True)
Generate a mask based on the date comparison. A mask is a series of boolean values with the same size/shape as the DataFrame. In this case, the mask is True if the outpatients date is less than the converted date, otherwise, the value in the series is False.
filtered_data = all_data[all_data['visit_start_date_outpatients'] < all_data['visit_start_date_converted']]
Note, if your data is not in a date format, it might need to be converted or cast for the mask to work properly.
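For example (a sketch using the suffixed column names produced by the join above):
# Cast the string dates to real datetimes so the comparison is chronological rather than string-based
all_data['visit_start_date_outpatients'] = pd.to_datetime(all_data['visit_start_date_outpatients'])
all_data['visit_start_date_converted'] = pd.to_datetime(all_data['visit_start_date_converted'])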
Lastly, save the output as a comma-separated values (CSV) file.
# Generate an output. For example, it's easy to save it as a CSV file.
filtered_data.to_csv('outpatient_dates_less_than_converted_dates.csv')
In addition to the official documentation for pandas dataframes, the website https://towardsdatascience.com/ has many good tips. I hope this helps!
I am using Spark to do exploratory data analysis on a user log file. One of the analyses I am doing is the average number of requests per host on a daily basis. So in order to figure out the average, I need to divide the total requests column of one DataFrame by the unique hosts count column of the other DataFrame.
from pyspark.sql.functions import dayofmonth

total_req_per_day_df = logs_df.select('host', dayofmonth('time').alias('day')).groupby('day').count()
avg_daily_req_per_host_df = total_req_per_day_df.select("day", (total_req_per_day_df["count"] / daily_hosts_df["count"]).alias("count"))
This is what I have written using PySpark to determine the average, and here is the error that I get:
AnalysisException: u'resolved attribute(s) count#1993L missing from day#3628,count#3629L in operator !Project [day#3628,(cast(count#3629L as double) / cast(count#1993L as double)) AS count#3630];
Note: daily_hosts_df and logs_df are cached in memory. How do you divide the count columns of the two data frames?
It is not possible to reference a column from another table. If you want to combine the data, you'll have to join first, using something similar to this:
from pyspark.sql.functions import col
(total_req_per_day_df.alias("total")
.join(daily_hosts_df.alias("host"), ["day"])
.select(col("day"), (col("total.count") / col("host.count")).alias("count")))
It's a question from an edX Spark course assignment. Since the solution is public now, I take the opportunity to share another, slower one and to ask whether its performance could be improved or whether it is totally anti-Spark.
import numpy as np

daily_hosts_list = (daily_hosts_df.map(lambda r: (r[0], r[1])).take(30))
days_with_hosts, hosts = zip(*daily_hosts_list)
requests = (total_req_per_day_df.map(lambda r: (r[1])).take(30))
average_requests = [(days_with_hosts[n], float(l)) for n, l in enumerate(list(np.array(requests, dtype=float) / np.array(hosts)))]
avg_daily_req_per_host_df = sqlContext.createDataFrame(average_requests, ('day', 'avg_reqs_per_host_per_day'))
Join the two data frames on the day column, and then select the day and the ratio of the count columns.
total_req_per_day_df = logs_df.select(dayofmonth('time').alias('day')).groupBy('day').count()

avg_daily_req_per_host_df = (
    total_req_per_day_df.join(daily_hosts_df,
                              total_req_per_day_df.day == daily_hosts_df.day)
    .select(daily_hosts_df['day'],
            (total_req_per_day_df['count'] / daily_hosts_df['count'])
            .alias('avg_reqs_per_host_per_day'))
    .cache()
)
A solution based on zero323's answer, but implemented as an OUTER join so it works correctly.
avg_daily_req_per_host_df = (
    total_req_per_day_df.join(
        daily_hosts_df, daily_hosts_df['day'] == total_req_per_day_df['day'], 'outer'
    ).select(
        total_req_per_day_df['day'],
        (total_req_per_day_df['count'] / daily_hosts_df['count']).alias('avg_reqs_per_host_per_day')
    )
).cache()
Without the 'outer' parameter you lose data for days missing from one of the dataframes. This is not critical for the PySpark Lab2 task, because both dataframes contain the same dates, but it can create some pain in other tasks :)
I have two tables: one contains SCHEDULE_DATE (over 300,000 records) and WORK_WEEK_CODE, and the second table contains WORK_WEEK_CODE, START_DATE, and END_DATE. The first table has duplicate schedule dates, and the second table has 3,200 unique records. I need to populate the WORK_WEEK_CODE in table one with the WORK_WEEK_CODE from table two, based on the range into which the schedule date falls. Samples of the two tables are below.
I was able to accomplish the task using arcpy.da.UpdateCursor with a nested arcpy.da.SearchCursor, but with the volume of records, it takes a long time. Any suggestions on a better (and less time consuming) method would be greatly appreciated.
Note: The date fields are formatted as strings.
Table 1
SCHEDULE_DATE,WORK_WEEK_CODE
20160219
20160126
20160219
20160118
20160221
20160108
20160129
20160201
20160214
20160127
Table 2
WORK_WEEK_CODE,START_DATE,END_DATE
1601,20160104,20160110
1602,20160111,20160117
1603,20160118,20160124
1604,20160125,20160131
1605,20160201,20160207
1606,20160208,20160214
1607,20160215,20160221
You can use pandas dataframes as a more efficient method. Here is the approach using pandas; hope this helps:
import pandas as pd
# First you need to convert your data to pandas DataFrames; here I read them from CSV
Table1 = pd.read_csv('Table1.csv')
Table2 = pd.read_csv('Table2.csv')
# Then you need to add a shared key for the join
Table1['key'] = 1
Table2['key'] = 1
# The following line joins the two tables
mergeddf = pd.merge(Table1, Table2, how='left', on='key')
# The following lines convert the string dates to actual dates
mergeddf['SCHEDULE_DATE'] = pd.to_datetime(mergeddf['SCHEDULE_DATE'], format='%Y%m%d')
mergeddf['START_DATE'] = pd.to_datetime(mergeddf['START_DATE'], format='%Y%m%d')
mergeddf['END_DATE'] = pd.to_datetime(mergeddf['END_DATE'], format='%Y%m%d')
# The following line filters and keeps only the rows you need
result = mergeddf[(mergeddf['SCHEDULE_DATE'] >= mergeddf['START_DATE']) & (mergeddf['SCHEDULE_DATE'] <= mergeddf['END_DATE'])]
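If you then want table one back in its original shape, you can drop the helper columns afterwards (a sketch; assuming Table1.csv includes the empty WORK_WEEK_CODE column as shown, pandas will have suffixed the duplicate column names with _x and _y in the merge, the populated one coming from Table2 as WORK_WEEK_CODE_y):
# Keep only the schedule date plus the work week code that came from Table2
table1_filled = result[['SCHEDULE_DATE', 'WORK_WEEK_CODE_y']].rename(columns={'WORK_WEEK_CODE_y': 'WORK_WEEK_CODE'})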
I am trying to add and update multiple columns in a pandas dataframe using a second dataframe. The problem is that when the number of columns I want to add doesn't match the number of columns in the base dataframe, I get the following error: "Shape of passed values is (2, 3), indices imply (2, 2)"
A simplified version of the problem is below
tst = DataFrame({"One":[1,2],"Two":[2,4]})
def square(row):
"""
for each row in the table return multiple calculated values
"""
a = row["One"]
b = row["Two"]
return a ** 2, b ** 2, b ** 3
#create three new fields from the data
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
If the number of fields being added matches the number already in the table, the operation works as expected.
tst = DataFrame({"One":[1,2],"Two":[2,4]})
def square(row):
"""
for each row in the table return multiple calculated values
"""
a = row["One"]
b = row["Two"]
return a ** 2, b ** 2
#create three new fields from the data
tst[["One^2", "Two^2"]] = tst.apply(square, axis=1)
I realise I could do each field separately, but in the actual problem I am trying to solve I perform a join between the table being updated and an external table within the "updater" (i.e. square) and want to be able to grab all the required information at once.
Below is how I would do it in SQL. Unfortunately the two dataframes contain data from different database technologies, hence why I have to perform the operation in pandas.
update tu
set tu.a_field = upd.the_field_i_want,
    tu.another_field = upd.the_second_required_field
from to_update tu
inner join the_updater upd
on tu.item_id = upd.item_id
and tu.date between upd.date_from and upd.date_to
Here you can see the exact details of what I am trying to do. I have a table "to_update" that contains point-in-time information against an item_id. The other table "the_updater" contains date range information against the item_id. For example a particular item_id may sit with customer_1 from DateA to DateB and with customer_2 between DateB and DateC etc. I want to be able to align information from the table containing the date ranges against the point-in-time table.
Please note a merge won't work due to problems with the data (this is actually being written as part of a data quality test). I really need to be able to replicate the functionality of the update statement above.
I could obviously do it as a loop but I was hoping to use the pandas framework where possible.
Declare an empty column in the dataframe and assign it zero:
tst["Two^3"] = 0
Then do the respective operations for that column, along with the other columns:
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
Try printing it:
print(tst.head(5))
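Alternatively, on a recent pandas version (0.23 or newer) you can let apply expand the returned tuple into columns directly, which avoids pre-declaring the extra column. A sketch, starting again from the original tst with just the One and Two columns:
# Expand the tuple returned by square into a three-column frame, then attach it by index
expanded = tst.apply(square, axis=1, result_type='expand')
expanded.columns = ['One^2', 'Two^2', 'Two^3']
tst = tst.join(expanded)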