I have the following column as a string in a dataframe df:
+----------------+
|            date|
+----------------+
| 4/23/2019 23:59|
|05/06/2019 23:59|
| 4/16/2019 19:00|
+----------------+
I am trying to convert this to a timestamp, but I am only getting NULL values.
My statement is:
from pyspark.sql.functions import col, unix_timestamp
df.withColumn('date',unix_timestamp(df['date'], "MM/dd/yyyy hh:mm").cast("timestamp"))
Why am I getting only NULL values? Is it because of the month format (since I have an additional leading 0 on 05)?
Thanks!
The pattern for the 24-hour format is HH; hh is for hours in the 1-12 (am/pm) range.
https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
import pyspark.sql.functions as psf

df \
.withColumn('converted_date', psf.to_timestamp('date', format='MM/dd/yyyy HH:mm')) \
.show()
+----------------+-------------------+
| date| converted_date|
+----------------+-------------------+
| 4/23/2019 23:59|2019-04-23 23:59:00|
|05/06/2019 23:59|2019-05-06 23:59:00|
| 4/16/2019 19:00|2019-04-16 19:00:00|
+----------------+-------------------+
Whether or not there is a leading 0 does not matter.
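For reference, the original unix_timestamp approach from the question should also work once the pattern uses HH; a minimal sketch:

from pyspark.sql.functions import unix_timestamp

# hh expects hours in the 1-12 (am/pm) range; HH parses 23:59 correctly
df = df.withColumn('date', unix_timestamp(df['date'], "MM/dd/yyyy HH:mm").cast("timestamp"))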
I'm working on some data preparation for a project I'm involved in. We do most of the work in Databricks, using the underlying Apache Spark for computations on large datasets. Everything is done in PySpark.
My goal is to convert a date variable to a variable yearperiod, which divides the year into 13 periods of 4 weeks (with some exceptions). The value is a concatenation of the year and the period, e.g. yearperiod = 201513 would be the year 2015, period 13.
I have two tables: yp_table which contains start and end dates (Edit: type DateType()) for yearperiods (between 2012 and now, Edit: ~120 rows):
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
....
And I have the actual data table, which contains a Date column (Edit: type StringType()):
+--------+--------+--------+-----+
| Var1| Var2| Date| Var3|
+--------+--------+--------+-----+
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20191231| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
| xxxxxx| xxxx|20200101| x,xx|
...
My question: how do I compute a column yearperiod for the data table, by comparing data.Date with both yp_table.start and yp_table.end?
So far I've been able to make it work with regular Python (a solution with list comprehensions), but it proves to be too slow for large datasets. Any help is greatly appreciated!
Edit: for privacy reasons I can't give the actual schemas of the dataframes. I've edited above to include the types of the relevant columns.
Add a column to your data df that contains the dates in the same format as the yp_table, then join the two, filtering by date intervals. Since the yp_table is small, you can use a broadcast join to speed things up.
import pandas as pd
import pyspark.sql.functions as fun

# Date lookup
start_dates = ["2012-01-16", "2012-01-30", "2012-02-27", "2012-03-26", "2012-04-23", "2012-05-21"]
end_dates = ["2012-01-29", "2012-02-26", "2012-03-25", "2012-04-22", "2012-05-20", "2012-06-17"]
yearperiod = ["201201", "201202", "201203", "201204", "201205", "201206"]
yp_table = spark.createDataFrame(pd.DataFrame({'start': start_dates, 'end': end_dates, 'yearperiod': yearperiod}))

# Data df
dates = ["20120116", "20120130", "20120228", "20120301", "20200101", "20200101", "20200101"]
vals = range(0, len(dates))
data = spark.createDataFrame(pd.DataFrame({'Dates': dates, 'vals': vals}))

# Add a formatted date_str column (yyyy-MM-dd) for joining
data = data.withColumn("date_str", fun.concat_ws("-", data.Dates.substr(1, 4), data.Dates.substr(5, 2), data.Dates.substr(7, 2)))

# Broadcast join the small yp_table onto the data table using a range condition
# (start and end dates are inclusive in yp_table)
joined = data.join(fun.broadcast(yp_table), (data.date_str >= yp_table.start) & (data.date_str <= yp_table.end))
yp_table.show()
data.show()
joined.show()
+----------+----------+----------+
| start| end|yearperiod|
+----------+----------+----------+
|2012-01-16|2012-01-29| 201201|
|2012-01-30|2012-02-26| 201202|
|2012-02-27|2012-03-25| 201203|
|2012-03-26|2012-04-22| 201204|
|2012-04-23|2012-05-20| 201205|
|2012-05-21|2012-06-17| 201206|
+----------+----------+----------+
+--------+----+----------+
| Dates|vals| date_str|
+--------+----+----------+
|20120116| 0|2012-01-16|
|20120130| 1|2012-01-30|
|20120228| 2|2012-02-28|
|20120301| 3|2012-03-01|
|20200101| 4|2020-01-01|
|20200101| 5|2020-01-01|
|20200101| 6|2020-01-01|
+--------+----+----------+
+--------+----+----------+----------+----------+----------+
| Dates|vals| date_str| start| end|yearperiod|
+--------+----+----------+----------+----------+----------+
|20120116| 0|2012-01-16|2012-01-16|2012-01-29| 201201|
|20120130| 1|2012-01-30|2012-01-30|2012-02-26| 201202|
|20120228| 2|2012-02-28|2012-02-27|2012-03-25| 201203|
|20120301| 3|2012-03-01|2012-02-27|2012-03-25| 201203|
+--------+----+----------+----------+----------+----------+
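If only the yearperiod is needed on the data table, the helper and interval columns can be dropped after the join; a small follow-up sketch under the same column names:

result = joined.drop("date_str", "start", "end")
result.show()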
I would like to compare 2 dataframes in pyspark.
Below is my test case dataset (from google).
So I have 2 DFs: a base DF and a secondary DF.
baseDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
secDF
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
77,XYZ,5000,mex,ITA,2/11/2019
I have to compare secDF and baseDF on 2 keys (No and Name). If those fields match (I only need the matched records from secDF), then I have to update the Sal and Dept fields of baseDF with the values from secDF.
Expected output
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,ITA,2/11/2019
22,Tom,2500,usa,HRA,2/11/2019
33,Kom,3000,uk,ITA,2/11/2019
44,Nom,4600,can,HRA,2/11/2019
55,Vom,8000,mex,ITA,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
Using PySpark I could use subtract() to find the values of table1 not present in table2 and then unionAll the two tables, or should I use withColumn to overwrite values satisfying the condition?
Could someone suggest a good way of doing this?
You can do a left join and coalesce the resulting Sal and Dept columns, with secdf taking precedence over basedf:
import pyspark.sql.functions as F

result = basedf.alias('basedf').join(
    secdf.alias('secdf'),
    ['No', 'Name'],
    'left'
).select(
    # take Sal and Dept from secdf when a match exists, otherwise keep basedf's values
    [F.coalesce('secdf.Sal', 'basedf.Sal').alias('Sal')
     if c == 'Sal'
     else F.coalesce('secdf.Dept', 'basedf.Dept').alias('Dept')
     if c == 'Dept'
     else f'basedf.{c}'
     for c in basedf.columns]
)
result.show()
+---+----+----+-------+----+---------+
| No|Name| Sal|Address|Dept|Join_Date|
+---+----+----+-------+----+---------+
| 11| Sam|1000| ind| ITA|2/11/2019|
| 22| Tom|2500| usa| HRA|2/11/2019|
| 33| Kom|3000| uk| ITA|2/11/2019|
| 44| Nom|4600| can| HRA|2/11/2019|
| 55| Vom|8000| mex| ITA|2/11/2019|
| 66| XYZ|5000| mex| IT|2/11/2019|
+---+----+----+-------+----+---------+
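A more explicit equivalent that spells out each output column instead of using a list comprehension (a sketch assuming the same basedf/secdf schemas as above):

import pyspark.sql.functions as F

result = basedf.alias('b').join(secdf.alias('s'), ['No', 'Name'], 'left').select(
    'No',
    'Name',
    F.coalesce('s.Sal', 'b.Sal').alias('Sal'),     # take secDF's salary when a match exists
    'b.Address',
    F.coalesce('s.Dept', 'b.Dept').alias('Dept'),  # take secDF's department when a match exists
    'b.Join_Date',
)
result.show()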
The conversion of the string to datetime is failing.
The data in the dataframe has the following format: "2020-08-05T12:34:10.800046".
I used the pattern yyyy-MM-dd'T'HH:mm:ss.SSSSSS:
from pyspark.sql import functions as F

config_df.withColumn(
    "modifiedDate",
    F.to_timestamp(config_df["modifiedDate"], "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"),
).show()
+------------+
|modifiedDate|
+------------+
| null|
+------------+
The execution completes without errors, but all values in the updated column are NULL. Which format should I use?
According to this post, SSS is for milliseconds. Therefore, it matches only the first 3 digits (800) of your 800046, no matter how many S you add.
I couldn't find any pattern that matches your date, so you first need to update your string to keep only 3 digits at the end, with a regex for example:
from pyspark.sql import functions as F

a = [
    ("2020-08-05T12:34:10.800123",),
]
b = ["modifiedDate"]
df = spark.createDataFrame(a, b)

df.withColumn(
    "modifiedDate",
    F.to_timestamp(
        F.regexp_extract(
            "modifiedDate", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}", 0
        ),
        "yyyy-MM-dd'T'HH:mm:ss.SSS",
    ),
).show()
+-------------------+
| modifiedDate|
+-------------------+
|2020-08-05 12:34:10|
+-------------------+
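If the input always has this fixed-width ISO layout, a simple substring truncation avoids the regex; a minimal sketch under that assumption:

# Keep the first 23 characters ("yyyy-MM-ddTHH:mm:ss.SSS"), dropping the extra fraction digits
df.withColumn(
    "modifiedDate",
    F.to_timestamp(F.substring("modifiedDate", 1, 23), "yyyy-MM-dd'T'HH:mm:ss.SSS"),
).show()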
I have a dataframe like so:
+-------+-------------------+
|id |scandatetime |
+-------+-------------------+
|1234567|2020-03-13 10:56:18|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-12 15:42:25|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 15:42:25|
+-------+-------------------+
And I would like to calculate the minimum and maximum timestamps for this id. To do so, I have used the following code:
dfScans = datasource1.toDF()
dfScans = dfScans.withColumn('scandatetime',f.unix_timestamp(f.col('scandatetime'), "yyyy-MM-dd hh:mm:ss").cast("timestamp"))
dfDateAgg = dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
f.max('scandatetime').alias('LastScanDate'))
But I am getting the following result:
+-------+-------------------+-------------------+
|id |FirstScanDate |LastScanDate |
+-------+-------------------+-------------------+
|1234567|2020-03-13 10:56:18|2020-03-13 10:56:18|
+-------+-------------------+-------------------+
Why is the min function not returning the right value?
Your timestamps have hours in the 0-23 range, and thus you are using the wrong date format. You should be using "yyyy-MM-dd HH:mm:ss" (capital H) (See docs for SimpleDateFormat).
The lowercase h refers to hours in the 1-12 range, and thus all values except "2020-03-13 10:56:18" become null upon conversion to timestamp.
from pyspark.sql import functions as f
dfScans = dfScans.withColumn(
'scandatetime',
f.unix_timestamp(
f.col('scandatetime'),
"yyyy-MM-dd HH:mm:ss"
).cast("timestamp")
)
dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
f.max('scandatetime').alias('LastScanDate')).show()
#+-------+-------------------+-------------------+
#| id| FirstScanDate| LastScanDate|
#+-------+-------------------+-------------------+
#|1234567|2020-03-09 16:30:22|2020-03-13 10:56:18|
#+-------+-------------------+-------------------+
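As a side note, f.to_timestamp can do the parse-and-cast in one step; an equivalent sketch with the same format:

dfScans = dfScans.withColumn(
    'scandatetime',
    f.to_timestamp(f.col('scandatetime'), "yyyy-MM-dd HH:mm:ss")
)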
I use pyspark and work with the following dataframe:
+---------+----+--------------------+-------------------+
| id| sid| values| ratio|
+---------+----+--------------------+-------------------+
| 6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...| 0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...| 1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...| 0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+
What I want is to turn each sid (e.g. 4178) into a column and put the rounded ratio as its row value. The result should look as follows (if an id has that sid, fill the row with its ratio; if not, fill with 0):
+---------+-----+-----+-----+
|       id| 4178| 4385| 4390|
+---------+-----+-----+-----+
|  6052791| 0.32|    0|    0|
...
The number of columns is the number of distinct sids.
If a sid does not exist for a given id, then that sid's column has to contain 0.
You need a column to group by, so I am adding a new column called sNo.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(List((6052791, 4178, 0.42673267326732675),
(6052791, 4178, 0.22673267326732675),
(6052791, 4179, 0.62673267326732675),
(6052791, 4180, 0.72673267326732675),
(6052791, 4179, 0.82673267326732675),
(6052791, 4179, 0.92673267326732675))).toDF("id", "sid", "ratio")
df.withColumn("sNo", lit(1))
.groupBy("sNo")
.pivot("sid")
.agg(min("ratio"))
.show
This would return the following output:
+---+-------------------+------------------+------------------+
|sNo| 4178| 4179| 4180|
+---+-------------------+------------------+------------------+
| 1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
That sounds like a pivot, which could be done in Spark SQL (Scala version) as follows:
scala> ratios.
groupBy("id").
pivot("sid").
agg(first("ratio")).
show
+-------+-------------------+
| id| 4178|
+-------+-------------------+
|6052791|0.32673267326732675|
+-------+-------------------+
I'm still unsure how to fill the other columns (4385 and 4390 in your example). It seems that you round the ratio and search for other sids that would match.
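Since the question is in PySpark, here is a rough PySpark equivalent of the pivot above, rounding the ratio and filling missing sids with 0 (a sketch assuming the dataframe is named df with columns id, sid, and ratio):

import pyspark.sql.functions as F

result = (
    df.groupBy("id")
      .pivot("sid")
      .agg(F.round(F.first("ratio"), 2))  # one rounded ratio per (id, sid)
      .na.fill(0)                         # sids missing for an id become 0
)
result.show()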