PySpark weird behaviour of to_timestamp() - python

I am noticing some weird behaviour in PySpark's (and possibly Spark's) to_timestamp function. It looks like it converts some strings to timestamps correctly while turning other strings of the exact same format into null. Consider the following example I worked out:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

times = [['2030-03-10 02:56:07'], ['2030-03-11 02:56:07']]
df_test = spark.createDataFrame(times, schema=StructType([
    StructField("time_string", StringType())
]))
df_test = df_test.withColumn('timestamp',
                             F.to_timestamp('time_string',
                                            format='yyyy-MM-dd HH:mm:ss'))
df_test.show(2, False)
This is what I get:
+-------------------+-------------------+
|time_string |timestamp |
+-------------------+-------------------+
|2030-03-10 02:56:07|null |
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+
What's the reason behind the second string being converted correctly but not the first one? I have tried the unix_timestamp() function as well and the result is the same.
Stranger still, if I don't use the format parameter, I no longer get the null, but the timestamp's hour is incremented by one.
df_test2 = df_test.withColumn('timestamp', F.to_timestamp('time_string'))
df_test2.show(2, False)
Result:
+-------------------+-------------------+
|time_string |timestamp |
+-------------------+-------------------+
|2030-03-10 02:56:07|2030-03-10 03:56:07|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+
Any idea what's going on?
UPDATE:
I have tried this in Scala as well via spark-shell and the result is the same:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions
val times = Seq(Row("2030-03-10 02:56:07"), Row("2030-03-11 02:56:07"))
val schema = List(StructField("time_string", StringType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(times),
                               StructType(schema))
val df_test = df.withColumn("timestamp",
                            functions.to_timestamp(functions.col("time_string"),
                                                   fmt = "yyyy-MM-dd HH:mm:ss"))
df_test.show()
And the result:
+-------------------+-------------------+
| time_string| timestamp|
+-------------------+-------------------+
|2030-03-10 02:56:07| null|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+
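One thing that may be worth checking (this is a hunch, not a confirmed explanation): the symptoms look consistent with a daylight-saving-time gap in the session time zone. March 10, 2030 falls on the second Sunday of March, when clocks in US time zones jump from 02:00 straight to 03:00, so the local time 02:56:07 does not exist that day, while March 11 has no such gap. Since to_timestamp parses strings in the session time zone, a quick experiment is to inspect that setting and re-run the conversion under UTC (the sketch below reuses df_test and F from the PySpark snippet above):
# Which time zone is the session using to interpret the timestamp strings?
print(spark.conf.get("spark.sql.session.timeZone"))

# Purely as an experiment: UTC has no DST gap, so both rows should parse there.
spark.conf.set("spark.sql.session.timeZone", "UTC")
df_test.withColumn('timestamp',
                   F.to_timestamp('time_string', format='yyyy-MM-dd HH:mm:ss')).show(2, False)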

Related

pyspark extracting a string using python

I have a Spark dataframe with a column emailID, e.g. ram.shyam.78uy#testing.com. I would like to extract the string between "." and "#", i.e. 78uy, and store it in a column.
I tried:
split_for_alias = split(rs_csv['emailID'],'[.]')
rs_csv_alias= rs_csv.withColumn('alias',split_for_alias.getItem(size(split_for_alias) -2))
It adds 78uy#testing as the alias. I could add another column and chop off the extra values, but is it possible to do this in a single statement?
Extract the alphanumeric token immediately to the left of the special character . and immediately followed by the special character #.
DataFrame
data = [
    (1, "am.shyam.78uy#testing.com"),
    (2, "j.k.kilo#jom.com")
]
df = spark.createDataFrame(data, ("id", "emailID"))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy#tes...|
| 2| j.k.kilo#jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract
df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=#)', 1)).show()
outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy#tes...|78uy|
| 2| j.k.kilo#jom.com|kilo|
+---+--------------------+----+
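If you would rather not use a regex, a possible alternative is the built-in substring_index: take everything before the '#', then the last '.'-separated token of that prefix. A minimal sketch on the same df as above:
from pyspark.sql.functions import substring_index

# prefix before '#'  ->  last '.'-separated token of that prefix
df.withColumn('name', substring_index(substring_index('emailID', '#', 1), '.', -1)).show()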
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python; Fugue can then port it to Spark for you with one function call.
First we set up a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy#testing.com", "j.k.kilo#jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any
def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        # take the part before '#', then the last '.'-separated token of it
        email = row["email"].split("#")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy#tes...| 78uy|
| 2| j.k.kilo#jom.com| kilo|
+---+--------------------+-------+
Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.

How to convert string in DataFrame to date in pyspark

I would like to convert strings in a column of a DataFrame to dates in pyspark.
l = [("10/14/2000","12/4/1991","5/8/1991"), ("11/3/1391","1/26/1992","9/5/1992")]
spark.createDataFrame(l).collect()
df = spark.createDataFrame(l, ["first", 'second',"third"])
df2 = df.select(col("first"),to_date(col("first"),"MM/dd/yyyy").alias("date"))
df3 = df.select(col("first"),to_date(col("first"),"%M/%d/%y").alias("date"))
I tried the code above, but neither attempt worked.
Could somebody help me solve this issue?
The code snippet you are using is correct; however, the date format you are using for parsing is not in line with Spark 3.x.
Furthermore, to handle inconsistent cases like 10/14/2000 and 11/3/1391 with MM/dd/yyyy, you can set timeParserPolicy=LEGACY; this applies to Spark 3.x.
The available datetime patterns for parsing can be found at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
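For reference, the policy is controlled by the spark.sql.legacy.timeParserPolicy configuration; switching it for the current session (one option among several, e.g. it can also go into spark-defaults) looks like this:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")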
Data Preparation
l = [("10/14/2000","12/4/1991","5/8/1991"), ("11/3/1391","1/26/1992","9/5/1992")]
df = sql.createDataFrame(l, ["first", 'second',"third"])
df.show()
+----------+---------+--------+
| first| second| third|
+----------+---------+--------+
|10/14/2000|12/4/1991|5/8/1991|
| 11/3/1391|1/26/1992|9/5/1992|
+----------+---------+--------+
To Date
df.select(F.col("first"),F.to_date(F.col("first"),"MM/dd/yyyy").alias("date")).show()
+----------+----------+
| first| date|
+----------+----------+
|10/14/2000|2000-10-14|
| 11/3/1391|1391-11-03|
+----------+----------+

Get last / delimited value from Dataframe column in PySpark

I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessarily only 3 /), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable and I am not even sure Spark allows me to do rsplit at all.
Do you have any suggestion on how can I solve this?
Adding another solution even though Pav3k's answer is great: element_at, which gets an item at a specific position out of a list:
from pyspark.sql import functions as F
df = df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))\
       .select('MyColumn', F.element_at(F.col('my_col_split'), -1).alias('rsplit'))
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
Pav3k's DF used.
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
    df
    .withColumn("NewCol",
                F.split("MyColumn", "/")
    )
    .withColumn("NewCol", F.col("NewCol")[F.size("NewCol") - 1])
    .show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, then the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
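If you prefer not to build an intermediate array at all, another sketch uses the built-in substring_index, where a negative count keeps everything after the last occurrence of the delimiter (rows without a '/' come back unchanged; the alias name is just illustrative):
from pyspark.sql import functions as F

# everything after the last '/'
df = df.select(F.substring_index(F.col("MyColumn"), "/", -1).alias("last_part"))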

PySpark - Date 0000.00.00 imported differently via function .to_date() and .csv() module

I am importing data which has a date column in yyyy.MM.dd format. Missing values have been marked as 0000.00.00. This 0000.00.00 is treated differently depending upon the function/module employed to bring the data into the dataframe.
The .csv file looks like this:
2016.12.23,2016.12.23
0000.00.00,0000.00.00
Method 1: .csv()
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField('date', StringType()),
    StructField('date1', DateType()),
])
df = spark.read.schema(schema)\
    .format('csv')\
    .option('header', 'false')\
    .option('sep', ',')\
    .option('dateFormat', 'yyyy.MM.dd')\
    .load(path + 'file.csv')
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00|0002-11-30|
+----------+----------+
Method 2: .to_date()
from pyspark.sql.functions import to_date, col
df = sqlContext.createDataFrame([('2016.12.23','2016.12.23'),('0000.00.00','0000.00.00')],['date','date1'])
df = df.withColumn('date1',to_date(col('date1'),'yyyy.MM.dd'))
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00| null|
+----------+----------+
Question: Why do the two methods give different results? I would have expected to get null for both; in the first case I instead get 0002-11-30. Can anyone explain this anomaly?
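If the goal is simply to get null for the sentinel in both paths, one workaround sketch (assuming you can afford to read the column as a plain string first) is to let the CSV reader load strings and do the conversion with to_date afterwards, exactly as Method 2 does:
from pyspark.sql.functions import to_date, col
from pyspark.sql.types import StructType, StructField, StringType

str_schema = StructType([
    StructField('date', StringType()),
    StructField('date1', StringType()),   # read as string, convert below
])
df = spark.read.schema(str_schema).format('csv')\
    .option('sep', ',')\
    .load(path + 'file.csv')\
    .withColumn('date1', to_date(col('date1'), 'yyyy.MM.dd'))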

datatype for handling big numbers in pyspark

I am using Spark with Python. After uploading a csv file, I needed to parse a column which has numbers that are 22 digits long. For parsing that column I used LongType(). I used the map() function for defining the column.
Following are my commands in pyspark.
>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> from pyspark.sql.types import StructType, StructField, StringType, LongType
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(
...     lambda p: (p[0], p[1], p[2], p[3], p[4], float(p[5].strip('"')), p[6], p[7]))
>>> test_temp.top(2)
Note: I have also tried 'long' and 'bigint' in place of 'float' in my test_temp variable, but the error in Spark was 'keyword not found'.
And the following is the output:
[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]
The value in my csv file is as follows:
8.27370028700801e+21 is 8273700287008010012345
8.37670028702205e+21 is 8376700287022050054321
When I create a data frame out of it and then query it,
>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()
the test_column gives the value null for all the records.
So, how do I solve this problem of parsing big numbers in Spark? I'd really appreciate your help.
Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.
Also, 8273700287008010012345 is too large to be represented as LongType which can represent only the values between -9223372036854775808 and 9223372036854775807.
If you want to convert your data to a DataFrame you'll have to use DoubleType:
from pyspark.sql.types import *
rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
Typically it is a better idea to handle this with DataFrames directly:
from pyspark.sql.functions import col
str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()
## +-------------------+
## | x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
If you don't want to use Double you can cast to Decimal with specified precision:
str_df.select(col("x").cast(DecimalType(38))).show(1, False)
## +----------------------+
## |x |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
Use decimal(precision, scale), and make sure the scale is appropriate for your data.
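For instance, a minimal sketch of the same cast written with a SQL type string and an explicit precision and scale (38,0 here because the sample values are whole numbers; pick a scale that matches your data):
# same str_df as above; decimal(38,0) keeps all 22 digits with no fractional part
str_df.select(col("x").cast("decimal(38,0)")).show(1, False)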
