I am trying to select data for multiple ID's between a time range using pyspark.
I have four columns in a spark dataframe 'event_df'
ID
Time
Event_Start_Date
Event_End_Date
241856
2020-10-18T09:16:49.000+0000
2020-11-12T20:15:00.000+0000
2020-11-12T20:45:00.000+0000
In 'Time' there is data worth 2 months for individual ID's. Different ID's have different event start and end datetimes However, I want to select data only between 'event start date' and 'event end date'.
I have tried the following but it doesn't seem to return what I want
refined_df = event_df.where(( col ('Time') >= col ('Event_Start_Date')) & ( col ('Time') <= col ('Event_End_Date ')) )
Not sure why your line isn't working for you, but you can also try using between:
import pyspark.sql.functions as F
data = [(241856, '2020-10-18T09:16:49.000+0000', '2019-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000'),
(241857, '2020-10-18T09:16:49.000+0000', '2020-11-12T20:15:00.000+0000', '2020-11-12T20:45:00.000+0000')]
df = spark.sparkContext.parallelize(data).toDF(['ID','Time','Event_Start_Date','Event_End_Date'])
df.show()
df.filter(F.col('Time').between(F.col('Event_Start_Date'), F.col('Event_End_Date'))).show()
returns
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
|241857|2020-10-18T09:16:...|2020-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
+------+--------------------+--------------------+--------------------+
| ID| Time| Event_Start_Date| Event_End_Date|
+------+--------------------+--------------------+--------------------+
|241856|2020-10-18T09:16:...|2019-11-12T20:15:...|2020-11-12T20:45:...|
+------+--------------------+--------------------+--------------------+
I would like to compare 2 dataframes in pyspark.
Below is my test case dataset (from google).
So I have 2 df's
Base DF
Secondary DF
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF with 2 keys (No and Name), if those fields match (I only need the matched records from secDF)then I have to update the salary and Dept field of baseDF with the value from secDF
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using pyspark I can use subtract() to find the values of table1 not present in table2, and consequently use unionAll of the two tables or should I use withcolumn to overwrite values satisfying the condition.
Could someone suggest a good way of doing this?
Update ---
I have to compare secDF and baseDF with 2 keys (No and Name), if those fields match (I only need the matched records from secDF)then I have to update the salary and Dept field of baseDF with the value from secDF.
You can do a left join and coalesce the resulting Sal column, with secdf taking precedence over basedf:
import pyspark.sql.functions as F
result = basedf.alias('basedf').join(
secdf.alias('secdf'),
['No', 'Name'],
'left'
).select(
[F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal')
if c == 'Sal'
else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept')
if c == 'Dept'
else f'basedf.{c}'
for c in basedf.columns]
)
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
Cog in the Machine:
Data contains Current 12 months of data and is stacked Horizontally. With each month having updates revised and new month appended to.
ID |Date |Month1_a |Month1_b |Month1_c |Month2_a |Month2_b |Month2_c |Month3_a |Month3_b |Month3_c
## |MM/DD/YYYY |abc |zxy |123 |NULL |zxy |122 |abc |zxy |123
Actual data file has no headers and is ingested downstream as distinct File per Month
File Month 1, etc.
ID | Date |Month1_a |Month1_b |Month1_c |New Column
## |MM/DD/YYYY |abc |zxy |123 | #
ID | Date |Month2_a |Month2_b |Month2_c |New Column
## |MM/DD/YYYY |NULL |zxy |122 | #
Other than copying the file 12 times. Is there any suggestion for reading once and looping through to create my outputs. I've worked out the logic for Month 1, I'm stuck as to how to move to month 2+.
Was originally thinking Read File > Drop Month 3+ > Drop Month 1 > Run Logic, but I'm not sure if there is a better/best practice.
Thanks.
This will output n number of csv files where n is the number of months in your input data. Hopefully this is what you are after.
import pandas as pd
df = pd.read_csv('my_data.csv', sep='|')
# Strip whitespace from column names
df.columns = [x.strip() for x in df.columns]
# Get a set of months in the data by splitting on _ and removing 'Month' from
# the first part
months = set([x.split('_')[0].replace('Month','') for x in df.columns if 'Month' in x])
# For each numeric month in months, add those columns with that number in it to
# the ID and Date columns and write to a csv with that month number in the csv title
for month in months:
base_columns = ['ID','Date']
base_columns.extend([x for x in df.columns if 'Month'+month in x])
df[base_columns].to_csv(f'Month_{month}.csv', index=False)
I've the following column as string on a dataframe df:
date|
+----------------+
|4/23/2019 23:59|
|05/06/2019 23:59|
|4/16/2019 19:00
I am trying to convert this to Timestamp but I only getting NULL values.
My statement is:
from pyspark.sql.functions import col, unix_timestamp
df.withColumn('date',unix_timestamp(df['date'], "MM/dd/yyyy hh:mm").cast("timestamp"))
Why I am getting only Null values? Is It because the Month format (since I hive an additional 0 on 05)?
Thanks!
The pattern for 24 hour format is HH, hh is for am./pm.
https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
df \
.withColumn('converted_date', psf.to_timestamp('date', format='MM/dd/yyyy HH:mm')) \
.show()
+----------------+-------------------+
| date| converted_date|
+----------------+-------------------+
| 4/23/2019 23:59|2019-04-23 23:59:00|
|05/06/2019 23:59|2019-05-06 23:59:00|
| 4/16/2019 19:00|2019-04-16 19:00:00|
+----------------+-------------------+
Whether there is or not a leading 0 does not matter
I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, i can see that
the imdbRating field has a mixed type of data such as random strings, movie title, movie url and actual ratings. So the dirty data looks this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there anyway i can filter out the unwanted strings and all just get the ratings ? I tried using UDF as:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = rating.withColumn('imdbRating',ratings_udf(ratings.imdbRating))
The problem with above is, since it tried calling the udf on each row, each row of the dataframe mapped to a Row type and hence returning None on all the values.
Is there any straightforward way to filter out those data ?
Any help will be much appreciated. Thank you
Finally, i was able to resolve it.The problem was there was some corrupt data with not all fields present. Firstly, i tried is using pandas by reading the csv files in pandas as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had less columns than the actual. I tried to read the above panda dataframe, pd_frame, to spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got some error because of mismatch while inferring schema. Turns out spark csv reader has something similar which drops the corrupt rows as:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')