I would like to convert a string column of a DataFrame to a date in PySpark.
l = [("10/14/2000","12/4/1991","5/8/1991"), ("11/3/1391","1/26/1992","9/5/1992")]
spark.createDataFrame(l).collect()
df = spark.createDataFrame(l, ["first", 'second',"third"])
df2 = df.select(col("first"),to_date(col("first"),"MM/dd/yyyy").alias("date"))
df3 = df.select(col("first"),to_date(col("first"),"%M/%d/%y").alias("date"))
I tried the code above, but neither approach worked.
Could somebody help me to solve this issue?
The code snippet you are using is correct; however, the date format you are using for parsing is not in line with Spark 3.x.
Furthermore, to handle inconsistent cases such as 10/14/2000 and 11/3/1391 with MM/dd/yyyy, you can set spark.sql.legacy.timeParserPolicy=LEGACY. This applies to Spark 3.x; more information about the policy is available in the Spark documentation.
The available datetime patterns for parsing are documented at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
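For reference, a minimal sketch of switching the parser policy at runtime (assuming a Spark 3.x session named spark; the same setting can also be passed via --conf when submitting the job):
# Fall back to the legacy (Spark 2.x) datetime parser for this session;
# per the note above, this tolerates mixed one- and two-digit day/month
# values such as 11/3/1391 when parsing with MM/dd/yyyy.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")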
Data Preparation
l = [("10/14/2000","12/4/1991","5/8/1991"), ("11/3/1391","1/26/1992","9/5/1992")]
df = sql.createDataFrame(l, ["first", 'second',"third"])
df.show()
+----------+---------+--------+
| first| second| third|
+----------+---------+--------+
|10/14/2000|12/4/1991|5/8/1991|
| 11/3/1391|1/26/1992|9/5/1992|
+----------+---------+--------+
To Date
df.select(F.col("first"),F.to_date(F.col("first"),"MM/dd/yyyy").alias("date")).show()
+----------+----------+
| first| date|
+----------+----------+
|10/14/2000|2000-10-14|
| 11/3/1391|1391-11-03|
+----------+----------+
Related
I have a Spark DataFrame with a column emailID holding values like ram.shyam.78uy#testing.com. I would like to extract the string between the "." and the "#", i.e. 78uy, and store it in a new column.
I tried:
from pyspark.sql.functions import split, size

split_for_alias = split(rs_csv['emailID'], '[.]')
rs_csv_alias = rs_csv.withColumn('alias', split_for_alias.getItem(size(split_for_alias) - 2))
It adds 78uy#testing as the alias. Another column could be added to chop off the extra characters, but is it possible to do this in a single statement?
Extract the alphanumeric token that is immediately preceded by the special character "." and immediately followed by the special character "#".
DataFrame
data = [
    (1, "am.shyam.78uy#testing.com"),
    (2, "j.k.kilo#jom.com")
]
df = spark.createDataFrame(data, ("id", "emailID"))
df.show()
+---+--------------------+
| id| emailID|
+---+--------------------+
| 1|am.shyam.78uy#tes...|
| 2| j.k.kilo#jom.com|
+---+--------------------+
Code
from pyspark.sql.functions import regexp_extract

df.withColumn('name', regexp_extract('emailID', r'(?<=\.)(\w+)(?=#)', 1)).show()
Outcome
+---+--------------------+----+
| id| emailID|name|
+---+--------------------+----+
| 1|am.shyam.78uy#tes...|78uy|
| 2| j.k.kilo#jom.com|kilo|
+---+--------------------+----+
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First we setup a Pandas DataFrame to test:
import pandas as pd
df = pd.DataFrame({"id":[1,2],"email": ["am.shyam.78uy#testing.com", "j.k.kilo#jom.com"]})
Next, we make a native Python function. The logic is clear this way.
from typing import List, Dict, Any

def extract(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        email = row["email"].split("#")[0].split(".")[-1]
        row["new_col"] = email
    return df
Then we can test on the Pandas engine:
from fugue import transform
transform(df, extract, schema="*, new_col:str")
Because it works, we can bring it to Spark by supplying an engine:
import fugue_spark
transform(df, extract, schema="*, new_col:str", engine="spark").show()
+---+--------------------+-------+
| id| email|new_col|
+---+--------------------+-------+
| 1|am.shyam.78uy#tes...| 78uy|
| 2| j.k.kilo#jom.com| kilo|
+---+--------------------+-------+
Note .show() is needed because Spark evaluates lazily. This transform can take in both Pandas and Spark DataFrames and will output a Spark DataFrame if using the Spark engine.
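As a small illustration of that last point, here is a minimal sketch (reusing the extract function above and assuming an active SparkSession named spark) that feeds a Spark DataFrame into the same transform call:
sdf = spark.createDataFrame(df)  # df is the Pandas DataFrame built earlier
res = transform(sdf, extract, schema="*, new_col:str", engine="spark")
res.show()  # res is a Spark DataFrame, so .show() forces the lazy evaluation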
I am trying to get the last string after '/'.
The column can look like this: "lala/mae.da/rg1/zzzzz" (not necessarily only 3 /), and I'd like to return: zzzzz
In SQL and Python it's very easy, but I would like to know if there is a way to do it in PySpark.
Solving it in Python:
original_string = "lala/mae.da/rg1/zzzzz"
last_char_index = original_string.rfind("/")
new_string = original_string[last_char_index+1:]
or directly:
new_string = original_string.rsplit('/', 1)[1]
And in SQL:
RIGHT(MyColumn, CHARINDEX('/', REVERSE(MyColumn))-1)
For PySpark I was thinking something like this:
df = df.select(col("MyColumn").rsplit('/', 1)[1])
but I get the following error: TypeError: 'Column' object is not callable, and I am not even sure Spark allows me to use rsplit at all.
Do you have any suggestion on how can I solve this?
Adding another solution, even though @Pav3k's answer is great: element_at, which gets an item at a specific position out of a list:
from pyspark.sql import functions as F
df = df.withColumn('my_col_split', F.split(df['MyColumn'], '/'))\
       .select('MyColumn', F.element_at(F.col('my_col_split'), -1).alias('rsplit'))
>>> df.show(truncate=False)
+---------------------+------+
|MyColumn |rsplit|
+---------------------+------+
|lala/mae.da/rg1/zzzzz|zzzzz |
|fefe |fefe |
|fe/fe/frs/fs/fe32/4 |4 |
+---------------------+------+
The DataFrame from @Pav3k's answer is used.
import pandas as pd
from pyspark.sql import functions as F
df = pd.DataFrame({"MyColumn": ["lala/mae.da/rg1/zzzzz", "fefe", "fe/fe/frs/fs/fe32/4"]})
df = spark.createDataFrame(df)
df.show(truncate=False)
# output
+---------------------+
|MyColumn |
+---------------------+
|lala/mae.da/rg1/zzzzz|
|fefe |
|fe/fe/frs/fs/fe32/4 |
+---------------------+
(
    df
    .withColumn("NewCol",
                F.split("MyColumn", "/")
    )
    .withColumn("NewCol", F.col("NewCol")[F.size("NewCol") - 1])
    .show()
)
# output
+--------------------+------+
| MyColumn|NewCol|
+--------------------+------+
|lala/mae.da/rg1/z...| zzzzz|
| fefe| fefe|
| fe/fe/frs/fs/fe32/4| 4|
+--------------------+------+
Since Spark 2.4, you can use the split built-in function to split your string, and then the element_at built-in function to get the last element of the resulting array, as follows:
from pyspark.sql import functions as F
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1))
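If you want a friendlier column name than the auto-generated one, you can attach an alias to the expression (alias is the standard Column method; last_part is just an example name):
df = df.select(F.element_at(F.split(F.col("MyColumn"), '/'), -1).alias("last_part"))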
I am new to pyspark.
I am trying to extract columns of a dataframe using a config file which contains a UDF.
If I define the select columns as a list in the client code it works, but if I import the list from a config file, the column list entries are of type string.
Is there an alternative way?
Opening the Spark shell using pyspark:
*******************************************************************
version 2.2.0
Using Python version 2.7.16 (default, Mar 18 2019 18:38:44)
SparkSession available as 'spark'
*******************************************************************
jsonDF = spark.read.json("/tmp/people.json")
jsonDF.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
jsonDF.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
jsonCurDF = jsonDF.filter(jsonDF.age.isNotNull()).cache()
# Define the UDF
from pyspark.sql.functions import udf
#udf("long")
def squared_udf(s):
return s * s
# Selecting the columns from a list.
colSelList = ['age', 'name', squared_udf('age')]
jsonCurDF.select(colSelList).show()
+---+------+----------------+
|age| name|squared_udf(age)|
+---+------+----------------+
| 30| Andy| 900|
| 19|Justin| 361|
+---+------+----------------+
# If I use an external config file
colSelListStr = ["age", "name" , "squared_udf('age')"]
jsonCurDF.select(colSelListStr).show()
The above command fails with "cannot resolve '`squared_udf('age')`'".
I tried registering the function, and tried selectExpr and the column function.
In the colSelList the udf call is translated to a column type.
print colSelList[2]
Column<squared_udf(age)>
print colSelListStr[2]
squared_udf('age')
print column(colSelListStr[2])
Column<squared_udf('age')>
What am I doing wrong here? Or is there an alternative solution?
It's because squared_age is treated as a string, not a function, when you pass it in from a list.
There is a roundabout way you can do this, and you don't need to import a UDF for it.
Assume the list shown below is the list of columns you need to select.
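For concreteness, a plausible version of that list, inferred from the select/toDF code further down (the name squared_age is an assumption):
col_list = ["age", "name", "squared_age"]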
Directly passing this list will result in an error, because squared_age is not a column of the DataFrame.
So first take all the columns of the existing DataFrame into a list:
existing_cols = df.columns
and col_list (defined above) holds the columns you need.
Now take the intersection of both lists; it gives you the list of common elements:
intersection = list(set(existing_cols) & set(col_list))
Now try this:
newDF = df.select(intersection).rdd.map(lambda x: (x["age"], x["name"], x["age"] * x["age"])).toDF(col_list)
which will give you the squared ages alongside the original columns.
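A sketch of the expected output, computed from the sample ages 30 and 19 above (the squared_age header follows the assumed col_list):
+---+------+-----------+
|age|  name|squared_age|
+---+------+-----------+
| 30|  Andy|        900|
| 19|Justin|        361|
+---+------+-----------+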
Hope this helps.
I am importing data that has a date column in yyyy.MM.dd format. Missing values are marked as 0000.00.00. This 0000.00.00 is treated differently depending on the function/module used to bring the data into the DataFrame.
The .csv file looks like this:
2016.12.23,2016.12.23
0000.00.00,0000.00.00
Method 1: .csv()
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField('date', StringType()),
    StructField('date1', DateType()),
])
df = spark.read.schema(schema)\
    .format('csv')\
    .option('header', 'false')\
    .option('sep', ',')\
    .option('dateFormat', 'yyyy.MM.dd')\
    .load(path + 'file.csv')
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00|0002-11-30|
+----------+----------+
Method 2: .to_date()
from pyspark.sql.functions import to_date, col
df = sqlContext.createDataFrame([('2016.12.23','2016.12.23'),('0000.00.00','0000.00.00')],['date','date1'])
df = df.withColumn('date1',to_date(col('date1'),'yyyy.MM.dd'))
df.show()
+----------+----------+
| date| date1|
+----------+----------+
|2016.12.23|2016-12-23|
|0000.00.00| null|
+----------+----------+
Question: Why do the two methods give different results? I would have expected to get null in both cases; instead, in the first case I get 0002-11-30. Can anyone explain this anomaly?
I am noticing slightly weird behaviour in PySpark's (and possibly Spark's) to_timestamp function. It looks like it converts some strings to timestamps correctly, while other strings of the exact same format are converted to null. Consider the following example I worked out:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

times = [['2030-03-10 02:56:07'], ['2030-03-11 02:56:07']]
df_test = spark.createDataFrame(times, schema=StructType([
    StructField("time_string", StringType())
]))
df_test = df_test.withColumn('timestamp',
                             F.to_timestamp('time_string',
                                            format='yyyy-MM-dd HH:mm:ss'))
df_test.show(2, False)
This is what I get:
+-------------------+-------------------+
|time_string |timestamp |
+-------------------+-------------------+
|2030-03-10 02:56:07|null |
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+
What's the reason behind the second string being converted correctly but not the first one? I have tried the unix_timestamp() function as well, and the result is the same.
Stranger still, if I don't use the format parameter, I don't get the null anymore, but the timestamp's hour is incremented by one.
df_test2 = df_test.withColumn('timestamp', F.to_timestamp('time_string'))
df_test2.show(2, False)
Result:
+-------------------+-------------------+
|time_string |timestamp |
+-------------------+-------------------+
|2030-03-10 02:56:07|2030-03-10 03:56:07|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+
Any idea what's going on?
UPDATE:
I have tried in Scala as well via spark-shell, and the result is the same:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions
val times = Seq(Row("2030-03-10 02:56:07"), Row("2030-03-11 02:56:07"))
val schema = List(StructField("time_string", StringType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(times),
                               StructType(schema))

val df_test = df.withColumn("timestamp",
                            functions.to_timestamp(functions.col("time_string"),
                                                   fmt = "yyyy-MM-dd HH:mm:ss"))
df_test.show()
And the result:
+-------------------+-------------------+
| time_string| timestamp|
+-------------------+-------------------+
|2030-03-10 02:56:07| null|
|2030-03-11 02:56:07|2030-03-11 02:56:07|
+-------------------+-------------------+