PySpark using Min function on timestamp selecting the wrong value - python

I have a dataframe like so:
+-------+-------------------+
|id |scandatetime |
+-------+-------------------+
|1234567|2020-03-13 10:56:18|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-12 15:42:25|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 17:09:48|
|1234567|2020-03-09 16:30:22|
|1234567|2020-03-12 15:42:25|
+-------+-------------------+
And I would like to calculate the minimum and maximum timestamps for this id. To do so, I have used the following code:
dfScans = datasource1.toDF()
dfScans = dfScans.withColumn('scandatetime',f.unix_timestamp(f.col('scandatetime'), "yyyy-MM-dd hh:mm:ss").cast("timestamp"))
dfDateAgg = dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
f.max('scandatetime').alias('LastScanDate'))
But I am getting the following return:
+-------+-------------------+-------------------+
|id |FirstScanDate |LastScanDate |
+-------+-------------------+-------------------+
|1234567|2020-03-13 10:56:18|2020-03-13 10:56:18|
+-------+-------------------+-------------------+
Why is the min function not returning the right value?

Your timestamps have hours in the 0-23 range, so you are using the wrong date format. You should be using "yyyy-MM-dd HH:mm:ss" (capital H); see the docs for SimpleDateFormat.
The lowercase h refers to hours in the 1-12 range, so every value except "2020-03-13 10:56:18" becomes null upon conversion to timestamp. Since min and max ignore nulls, both aggregates return that single surviving value.
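A quick way to see this for yourself (a small check that is not part of the original answer, assuming the same Spark behavior as in the question, where unparseable values silently become null) is to count the nulls produced by the lowercase-h pattern:
from pyspark.sql import functions as f

# with "hh", every hour outside 1-12 fails to parse and becomes null --
# per the explanation above, that is six of the seven rows in this example
dfScans.select(
    f.unix_timestamp('scandatetime', "yyyy-MM-dd hh:mm:ss").cast("timestamp").alias('ts')
).filter(f.col('ts').isNull()).count()
With the capital-H pattern the conversion and aggregation behave as expected: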
from pyspark.sql import functions as f
dfScans = dfScans.withColumn(
'scandatetime',
f.unix_timestamp(
f.col('scandatetime'),
"yyyy-MM-dd HH:mm:ss"
).cast("timestamp")
)
dfScans.groupBy("id").agg(f.min('scandatetime').alias('FirstScanDate'),
f.max('scandatetime').alias('LastScanDate')).show()
#+-------+-------------------+-------------------+
#| id| FirstScanDate| LastScanDate|
#+-------+-------------------+-------------------+
#|1234567|2020-03-09 16:30:22|2020-03-13 10:56:18|
#+-------+-------------------+-------------------+

Related

String to datetime conversion is failing

The conversion of the string to datetime is failing.
The data in the dataframe has the following format: "2020-08-05T12:34:10.800046".
I used the pattern "yyyy-MM-dd'T'HH:mm:ss.SSSSSS":
config_df.withColumn(
"modifiedDate",
F.to_timestamp(config_df["modifiedDate"], "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"),
).show()
+------------+
|modifiedDate|
+------------+
| null|
+------------+
The execution works without problem but all values in the updated column are NULL. Which format should I use?
According to this post, SSS is for milliseconds. Therefore it matches only the first 3 digits (800) of your 800046, no matter how many S characters you add.
I couldn't find any pattern that matches your date, so you first need to update your string to keep only 3 digits at the end, with a regex for example:
a = [
("2020-08-05T12:34:10.800123",),
]
b = ["modifiedDate"]
df = spark.createDataFrame(a, b)
df.withColumn(
"modifiedDate",
F.to_timestamp(
F.regexp_extract(
"modifiedDate", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}", 0
),
"yyyy-MM-dd'T'HH:mm:ss.SSS",
),
).show()
+-------------------+
| modifiedDate|
+-------------------+
|2020-08-05 12:34:10|
+-------------------+
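If the string always has the fixed layout shown above, a regex is not strictly necessary; here is a minimal alternative sketch that simply truncates the fractional part to 3 digits with substring (same idea, just a different way of dropping the extra digits):
df.withColumn(
    "modifiedDate",
    F.to_timestamp(
        # keep 'yyyy-MM-ddTHH:mm:ss.SSS', i.e. the first 23 characters
        F.substring("modifiedDate", 1, 23),
        "yyyy-MM-dd'T'HH:mm:ss.SSS",
    ),
).show()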

PySpark Cum Sum of two values

Given the following example dataframe:
advertiser_id| name | amount | total |max_total_advertiser|
4061 |source1|-434.955284|-354882.75336200005| -355938.53950700007
4061 |source2|-594.012216|-355476.76557800005| -355938.53950700007
4061 |source3|-461.773929|-355938.53950700007| -355938.53950700007
I need to sum the amount and the max_total_advertiser fields in order to get the correct total value in each row, and I need this total value for every group partitioned by advertiser_id. (The total column in the initial dataframe is incorrect, which is why I want to recalculate it.)
Something like that should be:
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")
df.withColumn("total_aux", when( lag("advertiser_id").over(w) == col("advertiser_id"), lag("total_aux").over(w) + col("amount") ).otherwise( col("max_total_advertiser") + col("amount") ))
This lag("total_aux") is not working because the column is not generated yet, that's what I want to achieve, if it is the first row in the group, sum the columns in the same row if not sum the previous obtained value with the current amount field.
Example output:
advertiser_id| name | amount | total_aux |
4061 |source1|-434.955284|-356373.494791 |
4061 |source2|-594.012216|-356967.507007 |
4061 |source3|-461.773929|-357429.280936 |
Thanks.
I assume that name is a distinct value for each advertiser_id and that your dataset is therefore sortable by name. I also assume that max_total_advertiser contains the same value for each advertiser_id. If either of those is not the case, please add a comment.
What you need is a rangeBetween window, which gives you all preceding and following rows within the specified range. We will use Window.unboundedPreceding since we want to sum up all previous values.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
(4061, 'source1',-434.955284,-354882.75336200005, -355938.53950700007),
(4061, 'source2',-594.012216,-355476.76557800005, -345938.53950700007),
(4062, 'source1',-594.012216,-355476.76557800005, -5938.53950700007),
(4062, 'source2',-594.012216,-355476.76557800005, -5938.53950700007),
(4061, 'source3',-461.773929,-355938.53950700007, -355938.53950700007)
]
columns = ['advertiser_id','name' ,'amount', 'total', 'max_total_advertiser']
df=spark.createDataFrame(l, columns)
w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
You might be looking for the orderBy() function. Does this work?
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.withColumn("cumulativeSum", F.sum(df["amount"])
    .over(Window.partitionBy("advertiser_id").orderBy("amount")))

Pyspark -Convert String to TimeStamp - Getting Nulls

I have the following column as a string on a dataframe df:
+----------------+
|            date|
+----------------+
| 4/23/2019 23:59|
|05/06/2019 23:59|
| 4/16/2019 19:00|
+----------------+
I am trying to convert this to a timestamp, but I am only getting NULL values.
My statement is:
from pyspark.sql.functions import col, unix_timestamp
df.withColumn('date',unix_timestamp(df['date'], "MM/dd/yyyy hh:mm").cast("timestamp"))
Why am I getting only null values? Is it because of the month format (since I have an additional 0 on 05)?
Thanks!
The pattern for the 24-hour format is HH; hh is for the 12-hour (am/pm) clock.
https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html
from pyspark.sql import functions as psf

df \
    .withColumn('converted_date', psf.to_timestamp('date', format='MM/dd/yyyy HH:mm')) \
    .show()
+----------------+-------------------+
| date| converted_date|
+----------------+-------------------+
| 4/23/2019 23:59|2019-04-23 23:59:00|
|05/06/2019 23:59|2019-05-06 23:59:00|
| 4/16/2019 19:00|2019-04-16 19:00:00|
+----------------+-------------------+
Whether or not there is a leading 0 does not matter.

How could I attach a prompt at the end of trips

Trips
id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10
prompts
id,timestamp ,mode
1008,2003-11-03 15:18:49,car
1009,2003-11-03 22:04:20,metro
1009,2003-11-14 10:04:20,bike
Read csv file:
coordinates = pd.read_csv('coordinates.csv')
mode = pd.read_csv('prompts.csv')
I have to assign each mode to the end of its trip.
Results:
id, timestamp, mode
1008, 2003-11-03 15:00:31, null
1008, 2003-11-03 15:02:38, null
1008, 2003-11-03 15:03:04, null
1008, 2003-11-03 15:18:00, car
1009, 2003-11-03 22:00:00, null
1009, 2003-11-03 22:02:53, null
1009, 2003-11-03 22:03:44, metro
1009, 2003-11-14 10:00:00, null
1009, 2003-11-14 10:02:02, null
1009, 2003-11-14 10:03:10, bike
Note
I use a large dataset for trips (4GB) and a small dataset for modes (500MB)
Based on your updated example, you can denote a trip by finding the first prompt timestamp that is greater than the trip timestamp. All rows with the same prompt timestamp will then correspond to the same trip. Then you want to set the mode for the greatest of the trip timestamps for each group.
One way to do this is by using 2 pyspark.sql.Windows.
Suppose you start with the following two PySpark DataFrames, trips and prompts:
trips.show(truncate=False)
#+----+-------------------+
#|id |timestamp |
#+----+-------------------+
#|1008|2003-11-03 15:00:31|
#|1008|2003-11-03 15:02:38|
#|1008|2003-11-03 15:03:04|
#|1008|2003-11-03 15:18:00|
#|1009|2003-11-03 22:00:00|
#|1009|2003-11-03 22:02:53|
#|1009|2003-11-03 22:03:44|
#|1009|2003-11-14 10:00:00|
#|1009|2003-11-14 10:02:02|
#|1009|2003-11-14 10:03:10|
#|1009|2003-11-15 10:00:00|
#+----+-------------------+
prompts.show(truncate=False)
#+----+-------------------+-----+
#|id |timestamp |mode |
#+----+-------------------+-----+
#|1008|2003-11-03 15:18:49|car |
#|1009|2003-11-03 22:04:20|metro|
#|1009|2003-11-14 10:04:20|bike |
#+----+-------------------+-----+
Join these two tables together on the id column, with the condition that the prompt timestamp is greater than or equal to the trip timestamp. For some trip timestamps this will result in multiple prompt timestamps; we can eliminate the extras by selecting the minimum prompt timestamp for each ('id', 'trip.timestamp') group. I call this temporary column indicator, and I used the Window w1 to compute it.
Next, do a window over ('id', 'indicator') and find the maximum trip timestamp for each group. Set this value equal to the mode; all other rows are set to pyspark.sql.functions.lit(None).
Finally, you can compute all of the entries in trips where the trip timestamp is greater than the max prompt timestamp. These are trips that did not match any prompt. Union the matched and unmatched results together.
import pyspark.sql.functions as f
from pyspark.sql import Window
w1 = Window.partitionBy('id', 'trips.timestamp')
w2 = Window.partitionBy('id', 'indicator')
matched = trips.alias('trips').join(prompts.alias('prompts'), on='id', how='left')\
.where('prompts.timestamp >= trips.timestamp' )\
.select(
'id',
'trips.timestamp',
'mode',
f.when(
f.col('prompts.timestamp') == f.min('prompts.timestamp').over(w1),
f.col('prompts.timestamp'),
).otherwise(f.lit(None)).alias('indicator')
)\
.where(~f.isnull('indicator'))\
.select(
'id',
f.col('trips.timestamp').alias('timestamp'),
f.when(
f.col('trips.timestamp') == f.max(f.col('trips.timestamp')).over(w2),
f.col('mode')
).otherwise(f.lit(None)).alias('mode')
)
unmatched = trips.alias('t').join(prompts.alias('p'), on='id', how='left')\
.withColumn('max_prompt_time', f.max('p.timestamp').over(Window.partitionBy('id')))\
.where('t.timestamp > max_prompt_time')\
.select('id', 't.timestamp', f.lit(None).alias('mode'))\
.distinct()
Output:
matched.union(unmatched).sort('id', 'timestamp').show()
+----+-------------------+-----+
| id| timestamp| mode|
+----+-------------------+-----+
|1008|2003-11-03 15:00:31| null|
|1008|2003-11-03 15:02:38| null|
|1008|2003-11-03 15:03:04| null|
|1008|2003-11-03 15:18:00| car|
|1009|2003-11-03 22:00:00| null|
|1009|2003-11-03 22:02:53| null|
|1009|2003-11-03 22:03:44|metro|
|1009|2003-11-14 10:00:00| null|
|1009|2003-11-14 10:02:02| null|
|1009|2003-11-14 10:03:10| bike|
|1009|2003-11-15 10:00:00| null|
+----+-------------------+-----+
This would be a naive solution which assumes that your coordinates DataFrame is already sorted by timestamp, that ids are unique, and that your data set fits into memory. If the latter is not the case, I recommend using dask and partitioning your DataFrames by id.
Imports:
import pandas as pd
import numpy as np
First we join the two DataFrames. This will fill the whole mode column for each id. We join on the index because that will speed up the operation, see also "Improve Pandas Merge performance".
mode = mode.set_index('id')
coordinates = coordinates.set_index('id')
merged = coordinates.join(mode, how='left')
We need the index to be unique values in order for our groupby operation to work.
merged = merged.reset_index()
Then we apply a function that will replace all but the last row in the mode column for each id.
def clean_mode_col(df):
cleaned_mode_col = df['mode'].copy()
cleaned_mode_col.iloc[:-1] = np.nan
df['mode'] = cleaned_mode_col
return df
merged = merged.groupby('id').apply(clean_mode_col)
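Since groupby().apply() can be slow on a frame this size, a vectorized alternative (a sketch with the same effect, assuming the rows within each id are already ordered by timestamp) is to blank out mode everywhere except the last row of each id:
# True only for the last row of each id
last_row_mask = ~merged.duplicated(subset='id', keep='last')
merged['mode'] = merged['mode'].where(last_row_mask, np.nan)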
As mentioned above, you can use dask to parallelize the execution of the merge code like this:
import dask.dataframe as dd

# give from_pandas an explicit partition count (required in older dask versions;
# tune it to your data size)
dd_coordinates = dd.from_pandas(coordinates, npartitions=8).set_index('id')
dd_mode = dd.from_pandas(mode, npartitions=1).set_index('id')
merged = dd.merge(dd_coordinates, dd_mode, left_index=True, right_index=True)
merged = merged.compute()  # returns a pandas DataFrame
The set_index operations are slow but make the merge way faster.
I did not test this code. Please provide copy-pasteable code that includes your DataFrames so that I don't have to copy and paste all those files you have in your description (hint: use pd.DataFrame.to_dict to export your DataFrame as a dictionary and copy and paste that into your code).

pySpark - Add list to a dataframe as a column [duplicate]

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:
dt.withColumn('new_column', 10).head(5)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-50-a6d0257ca2be> in <module>()
1 dt = (messages
2 .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt")))
----> 3 dt.withColumn('new_column', 10).head(5)
/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
1166 [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
1167 """
-> 1168 return self.select('*', col.alias(colName))
1169
1170     @ignore_unicode_prefix
AttributeError: 'int' object has no attribute 'alias'
It seems that I can trick the function into working as I want by adding and subtracting one of the other columns (so they add to zero) and then adding the number I want (10 in this case):
dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5)
[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10),
Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]
This is supremely hacky, right? I assume there is a more legit way to do this?
Spark 2.2+
Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala):
import org.apache.spark.sql.functions.typedLit
df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))
Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):
The second argument for DataFrame.withColumn should be a Column so you have to use a literal:
from pyspark.sql.functions import lit
df.withColumn('new_column', lit(10))
If you need complex columns you can build these using blocks like array:
from pyspark.sql.functions import array, create_map, struct
df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))
Exactly the same methods can be used in Scala.
import org.apache.spark.sql.functions.{array, lit, map, struct}
df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))
To provide names for structs use either alias on each field:
df.withColumn(
"some_struct",
struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
)
or cast on the whole object
df.withColumn(
"some_struct",
struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
)
It is also possible, although slower, to use a UDF.
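For completeness, a minimal sketch of the UDF route in PySpark (a zero-argument UDF that always returns the constant; the names are illustrative, not from the original answer):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# a UDF that takes no columns and always returns 10
ten_udf = F.udf(lambda: 10, IntegerType())
df.withColumn('new_column', ten_udf())
Prefer lit, though: it is evaluated directly in the JVM, whereas the UDF has to be invoked in Python for every row.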
Note:
The same constructs can be used to pass constant arguments to UDFs or SQL functions.
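For example (a PySpark sketch with illustrative names, not taken from the original answer), a constant wrapped in lit can be passed to a UDF alongside regular columns:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

add_n = F.udf(lambda x, n: x + n, IntegerType())

# lit(10) turns the Python int into a Column, so it can be passed as a UDF argument
df.withColumn('id_plus_10', add_n(F.col('id'), F.lit(10)))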
In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame:
1) Using lit
2) Using typedLit.
The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map.
Sample DataFrame:
val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1")
+---+----+
| id|col1|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
+---+----+
1) Using lit: adding a constant string value in a new column named newcol:
import org.apache.spark.sql.functions.lit
val newdf = df.withColumn("newcol",lit("myval"))
Result:
+---+----+------+
| id|col1|newcol|
+---+----+------+
| 0| a| myval|
| 1| b| myval|
+---+----+------+
2) Using typedLit:
import org.apache.spark.sql.functions.typedLit
df.withColumn("newcol", typedLit(("sample", 10, .044)))
Result:
+---+----+-----------------+
| id|col1| newcol|
+---+----+-----------------+
| 0| a|[sample,10,0.044]|
| 1| b|[sample,10,0.044]|
| 2| c|[sample,10,0.044]|
+---+----+-----------------+
As the other answers have described, lit and typedLit are how to add constant columns to DataFrames. lit is an important Spark function that you will use frequently, but not for adding constant columns to DataFrames.
You'll commonly be using lit to create org.apache.spark.sql.Column objects because that's the column type required by most of the org.apache.spark.sql.functions.
Suppose you have a DataFrame with a some_date DateType column and would like to add a column with the days between December 31, 2020 and some_date.
Here's your DataFrame:
+----------+
| some_date|
+----------+
|2020-09-23|
|2020-01-05|
|2020-04-12|
+----------+
Here's how to calculate the days till the year end:
import java.sql.Date
import org.apache.spark.sql.functions.{col, datediff, lit}

val diff = datediff(lit(Date.valueOf("2020-12-31")), col("some_date"))
df
  .withColumn("days_till_yearend", diff)
  .show()
+----------+-----------------+
| some_date|days_till_yearend|
+----------+-----------------+
|2020-09-23| 99|
|2020-01-05| 361|
|2020-04-12| 263|
+----------+-----------------+
You could also use lit to create a yearend column and compute days_till_yearend like so:
import java.sql.Date
df
.withColumn("yearend", lit(Date.valueOf("2020-12-31")))
.withColumn("days_till_yearend", datediff(col("yearend"), col("some_date")))
.show()
+----------+----------+-----------------+
| some_date| yearend|days_till_yearend|
+----------+----------+-----------------+
|2020-09-23|2020-12-31| 99|
|2020-01-05|2020-12-31| 361|
|2020-04-12|2020-12-31| 263|
+----------+----------+-----------------+
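For reference, a rough PySpark equivalent of the same idea (a sketch assuming a DataFrame df with a some_date column, mirroring the Scala example above):
from pyspark.sql import functions as F

df.withColumn(
    'days_till_yearend',
    F.datediff(F.lit('2020-12-31').cast('date'), F.col('some_date')),
).show()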
Most of the time, you don't need to use lit to append a constant column to a DataFrame. You just need lit to convert a Scala type to an org.apache.spark.sql.Column object, because that's what the function requires.
Look at the datediff function signature: def datediff(end: Column, start: Column): Column.
As you can see, datediff requires two Column arguments.
