Computing yearperiod from date by comparing date column with two reference columns - python

I'm working on some data preparation for a project I'm involved in. We do most of the work in Databricks, using the underlying Apache Spark for computations on large datasets. Everything is done in PySpark.
My goal is to convert a date variable to a variable yearperiod, which divides the year into 13 periods of 4 weeks (with some exceptions). The value is a concatenation of the year and the period, e.g. yearperiod = 201513 would be the year 2015, period 13.
I have two tables: yp_table which contains start and end dates (Edit: type DateType()) for yearperiods (between 2012 and now, Edit: ~120 rows):
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
....
And I have the actual data table, which contains a Date column (Edit: type StringType()):
+--------+--------+--------+-----+
| Var1| Var2| Date| Var3|
+--------+--------+--------+-----+
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
...
My question: how do I compute a column yearperiod for the data table, by comparing data.Date with both yp_table.start and yp_table.end?
So far I've been able to make it work with regular Python (a solution with list comprehensions), but it proves to be too slow for large datasets. Any help is greatly appreciated!
Edit: for privacy reasons I can't give the actual schemas of the dataframes. I've edited above to include the types of the relevant columns.

Add a column to your data df that contains the dates in a format matching the yp_table, then join the two tables, filtering on the date intervals. Since the yp_table is small (~120 rows), you can broadcast it to speed up the join.
import pandas as pd
import pyspark.sql.functions as fun
# Date lookup
start_dates = ["2012-01-16", "2012-01-30", "2012-02-27", "2012-03-26", "2012-04-23", "2012-05-21"]
end_dates = ["2012-01-29", "2012-02-26", "2012-03-25", "2012-04-22", "2012-05-20", "2012-06-17"]
yearperiod = ["201201", "201202", "201203", "201204", "201205", "201206"]
yp_table = spark.createDataFrame(pd.DataFrame({'start': start_dates, 'end': end_dates, 'yearperiod': yearperiod}))
# Data df
dates = ["20120116", "20120130", "20120228", "20120301", "20200101", "20200101", "20200101"]
vals = range(0, len(dates))
data = spark.createDataFrame(pd.DataFrame({'Dates':dates, 'vals': vals}))
# Add a formatted date_str column (yyyy-MM-dd) for joining
data = data.withColumn("date_str", fun.concat_ws("-", data.Dates.substr(1, 4), data.Dates.substr(5, 2), data.Dates.substr(7, 2)))
# Broadcast join the small yp_table into the data table using a range condition
# (end dates are inclusive: each period starts the day after the previous period's end)
joined = data.join(fun.broadcast(yp_table), (data.date_str >= yp_table.start) & (data.date_str <= yp_table.end))
yp_table.show()
data.show()
joined.show()
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
+----------+----------+----------+
+--------+----+----------+
| Dates|vals| date_str|
+--------+----+----------+
|20120116| 0|2012-01-16|
|20120130| 1|2012-01-30|
|20120228| 2|2012-02-28|
|20120301| 3|2012-03-01|
|20200101| 4|2020-01-01|
|20200101| 5|2020-01-01|
|20200101| 6|2020-01-01|
+--------+----+----------+
+--------+----+----------+----------+----------+----------+
| Dates|vals| date_str| start| end|yearperiod|
+--------+----+----------+----------+----------+----------+
|20120116| 0|2012-01-16|2012-01-16|2012-01-29| 201201|
|20120130| 1|2012-01-30|2012-01-30|2012-02-26| 201202|
|20120228| 2|2012-02-28|2012-02-27|2012-03-25| 201203|
|20120301| 3|2012-03-01|2012-02-27|2012-03-25| 201203|
+--------+----+----------+----------+----------+----------+
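Applied to the original tables from the question (where start/end are already DateType and Date is a yyyyMMdd string), a slightly cleaner variant of the same join is to parse the string with to_date instead of concatenating substrings. A minimal sketch, assuming those column names:
import pyspark.sql.functions as fun

# Parse the yyyyMMdd string once, then range-join against the DateType
# start/end columns of the broadcast lookup table.
data = data.withColumn("date_parsed", fun.to_date(fun.col("Date"), "yyyyMMdd"))
joined = data.join(
    fun.broadcast(yp_table),
    (fun.col("date_parsed") >= fun.col("start")) & (fun.col("date_parsed") <= fun.col("end")),
    "left",
)
The "left" join keeps rows whose date falls outside every known period (yearperiod comes back null there); drop the third argument for the inner-join behaviour shown above.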

Related

Selecting data between a time range in pyspark dataframe

I am trying to select data for multiple IDs within a time range using pyspark.
I have four columns in a spark dataframe 'event_df':
ID     | Time                         | Event_Start_Date             | Event_End_Date
241856 | 2020-10-18T09:16:49.000+0000 | 2020-11-12T20:15:00.000+0000 | 2020-11-12T20:45:00.000+0000
In 'Time' there are two months' worth of data for individual IDs. Different IDs have different event start and end datetimes. However, I want to select data only between 'Event_Start_Date' and 'Event_End_Date'.
I have tried the following but it doesn't seem to return what I want:
refined_df = event_df.where(( col ('Time') >= col ('Event_Start_Date')) & ( col ('Time') <= col ('Event_End_Date ')) )
Not sure why your line isn't working for you, but you can also try using between:
import pyspark.sql.functions as F
data = [(241856, '2020-10-18T09:16:49.000+0000', '2019-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000'),
(241857, '2020-10-18T09:16:49.000+0000', '2020-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000')]
df = spark.sparkContext.parallelize(data).toDF(['ID','Time','Event_Start_Date','Event_End_Date'])
df.show()
df.filter(F.col('Time').between(F.col('Event_Start_Date'), F.col('Event_End_Date'))).show()
returns
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
|241857|2020-10-18T09:16:...|2020-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
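For what it's worth, the line from the question looks correct apart from the stray space inside col('Event_End_Date '); if that space is really present in the code (and not just in the post), Spark will not be able to resolve the column. A sketch of the same filter with exact column names:
from pyspark.sql.functions import col

# Same comparison as in the question, with the trailing space removed from the column name
refined_df = event_df.where(
    (col('Time') >= col('Event_Start_Date')) & (col('Time') <= col('Event_End_Date'))
)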

How to update Spark DataFrame Column Values of a table from another table based on a condition using Pyspark

I would like to compare 2 dataframes in pyspark.
Below is my test case dataset (from google).
So I have 2 df's
Base DF
Secondary DF
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF using 2 keys (No and Name). If those fields match (I only need the matched records from secDF), then I have to update the Sal and Dept fields of baseDF with the values from secDF.
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using pyspark I could use subtract() to find the values of table1 not present in table2 and then unionAll the two tables, or should I use withColumn to overwrite the values satisfying the condition?
Could someone suggest a good way of doing this?
You can do a left join and coalesce the resulting Sal and Dept columns, with secdf taking precedence over basedf:
import pyspark.sql.functions as F
result = basedf.alias('basedf').join(
    secdf.alias('secdf'),
    ['No', 'Name'],
    'left'
).select([
    F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal') if c == 'Sal'
    else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept') if c == 'Dept'
    else f'basedf.{c}'
    for c in basedf.columns
])
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
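If the column list is fixed, the same join can also be written with an explicit select instead of the list comprehension; a sketch assuming exactly the six columns shown above:
import pyspark.sql.functions as F

result = basedf.alias('b').join(
    secdf.alias('s'), ['No', 'Name'], 'left'
).select(
    'No',
    'Name',
    F.coalesce('s.Sal', 'b.Sal').alias('Sal'),
    F.col('b.Address').alias('Address'),
    F.coalesce('s.Dept', 'b.Dept').alias('Dept'),
    F.col('b.Join_Date').alias('Join_Date'),
)
The left join is what keeps unmatched baseDF rows (No=66 here) with their original Sal and Dept, matching the expected output.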

Best Practice for repetitive computations

Cog in the Machine:
The data contains the current 12 months of data, stacked horizontally, with each month's values revised on update and a new month appended.
ID |Date |Month1_a |Month1_b |Month1_c |Month2_a |Month2_b |Month2_c |Month3_a |Month3_b |Month3_c
## |MM/DD/YYYY |abc |zxy |123 |NULL |zxy |122 |abc |zxy |123
The actual data file has no headers and is ingested downstream as a distinct file per month (File Month 1, etc.):
ID | Date |Month1_a |Month1_b |Month1_c |New Column
## |MM/DD/YYYY |abc |zxy |123 | #
ID | Date |Month2_a |Month2_b |Month2_c |New Column
## |MM/DD/YYYY |NULL |zxy |122 | #
Other than copying the file 12 times, is there any suggestion for reading it once and looping through to create my outputs? I've worked out the logic for Month 1; I'm stuck on how to move to Month 2+.
I was originally thinking Read File > Drop Month 3+ > Drop Month 1 > Run Logic, but I'm not sure if there is a better/best practice.
Thanks.
This will output n csv files, where n is the number of months in your input data. Hopefully this is what you are after.
import pandas as pd
df = pd.read_csv('my_data.csv', sep='|')
# Strip whitespace from column names
df.columns = [x.strip() for x in df.columns]
# Get a set of months in the data by splitting on _ and removing 'Month' from
# the first part
months = set([x.split('_')[0].replace('Month','') for x in df.columns if 'Month' in x])
# For each numeric month in months, add those columns with that number in it to
# the ID and Date columns and write to a csv with that month number in the csv title
for month in months:
    base_columns = ['ID', 'Date']
    # Match on 'Month<N>_' so that e.g. Month1 does not also pick up Month10-12
    base_columns.extend([x for x in df.columns if x.startswith('Month' + month + '_')])
    df[base_columns].to_csv(f'Month_{month}.csv', index=False)
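Since the rest of this page is PySpark, roughly the same split could be done without pandas. A sketch, assuming the file has a header row and the Month<N>_<x> naming from the example (both assumptions, since the real file reportedly has no headers):
# Hypothetical Spark variant: read once, then write one output per month.
df = spark.read.csv('my_data.csv', sep='|', header=True)
df = df.toDF(*[c.strip() for c in df.columns])  # strip whitespace from column names

months = {c.split('_')[0].replace('Month', '') for c in df.columns if c.startswith('Month')}
for month in sorted(months):
    cols = ['ID', 'Date'] + [c for c in df.columns if c.startswith(f'Month{month}_')]
    df.select(cols).write.mode('overwrite').csv(f'Month_{month}', header=True)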

Pyspark -Convert String to TimeStamp - Getting Nulls

I have the following column as a string on a dataframe df:
+----------------+
|            date|
+----------------+
| 4/23/2019 23:59|
|05/06/2019 23:59|
| 4/16/2019 19:00|
+----------------+
I am trying to convert this to a Timestamp but I am only getting NULL values.
My statement is:
from pyspark.sql.functions import col, unix_timestamp
df.withColumn('date',unix_timestamp(df['date'], "MM/dd/yyyy hh:mm").cast("timestamp"))
Why am I getting only NULL values? Is it because of the month format (since I have an additional 0 on 05)?
Thanks!
The pattern for the 24-hour format is HH; hh is for the 12-hour (am/pm) clock.
https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
import pyspark.sql.functions as psf

df \
    .withColumn('converted_date', psf.to_timestamp('date', format='MM/dd/yyyy HH:mm')) \
    .show()
+----------------+-------------------+
| date| converted_date|
+----------------+-------------------+
| 4/23/2019 23:59|2019-04-23 23:59:00|
|05/06/2019 23:59|2019-05-06 23:59:00|
| 4/16/2019 19:00|2019-04-16 19:00:00|
+----------------+-------------------+
Whether or not there is a leading 0 does not matter.
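Equivalently, the unix_timestamp line from the question should also work once hh is swapped for HH; a sketch against the same df:
from pyspark.sql.functions import unix_timestamp

df = df.withColumn('date', unix_timestamp(df['date'], 'MM/dd/yyyy HH:mm').cast('timestamp'))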

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that the imdbRating field has a mix of data such as random strings, movie titles, movie urls and actual ratings. So the dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF as:
from pyspark.sql.functions import udf
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating', ratings_udf(ratings.imdbRating))
The problem with the above is that, since the udf is called on each row, each row of the dataframe gets mapped to a Row type, hence None is returned for all the values.
Is there any straightforward way to filter out this data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that there was some corrupt data with not all fields present. Firstly, I tried using pandas, reading the csv file as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than the rest. I then tried to load the above pandas dataframe, pd_frame, into spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the spark csv reader has something similar that drops corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
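If you would rather clean the already-loaded dataframe instead of re-reading the file, one option (a sketch, assuming the same imdb_data) is to cast the column to double: non-numeric strings become NULL and can then be filtered out without a UDF.
from pyspark.sql.functions import col

# Non-numeric strings (titles, urls, ...) become NULL after the cast,
# so only genuine ratings survive the filter.
ratings = (imdb_data
           .withColumn('imdbRating', col('imdbRating').cast('double'))
           .filter(col('imdbRating').isNotNull())
           .sort('imdbRating')
           .select('imdbRating'))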
