Pyspark: How to Apply UDF only on Rows with NotNull Values - python

I have a pyspark dataframe and would like to apply a UDF to a column with null values.
Below is my dataframe:
+----+----+
| a| b|
+----+----+
|null| 00|
|.Abc|null|
|/5ee| 11|
|null| 0|
+----+----+
Below is the desired dataframe (remove punctuation and change string values in column a to upper case if the row value is not null):
+----+----+
| a| b|
+----+----+
|null| 00|
| ABC|null|
| 5EE| 11|
|null| 0|
+----+----+
Below is my UDF and code:
import pyspark.sql.functions as F
import re
remove_punct = F.udf(lambda x: re.sub('[^\w\s]', '', x))
df = df.withColumn('a', F.when(F.col("a").isNotNull(), F.upper(remove_punct(F.col("a")))))
Below is the error:
TypeError: expected string or bytes-like object
Can you please suggest what would be the optimal solution to get the desired DF?
Thanks in advance!

Use regexp_replace. No need for UDF.
df = df.withColumn('a', F.upper(F.regexp_replace(F.col('a'), r'[^\w\s]', '')))
If you insist on using a UDF, you need to do this:
remove_punct = F.udf(lambda x: re.sub(r'[^\w\s]', '', x) if x is not None else None)
df = df.withColumn('a', F.upper(remove_punct(F.col("a"))))
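For reference, a minimal end-to-end sketch of the regexp_replace approach (assuming an active SparkSession named spark and the sample data from the question):
import pyspark.sql.functions as F
# Hypothetical reconstruction of the question's dataframe
df = spark.createDataFrame(
    [(None, "00"), (".Abc", None), ("/5ee", "11"), (None, "0")],
    ["a", "b"])
# regexp_replace and upper both pass nulls through, so no explicit null check is needed
df = df.withColumn("a", F.upper(F.regexp_replace(F.col("a"), r"[^\w\s]", "")))
df.show()
Rows where a is null stay null, so this reproduces the desired dataframe above.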

Related

Remove any row with at least 1 NA with PySpark

I have a pyspark dataframe and I would like to remove any row containing at least one NA.
I know how to do so only for one column (code below).
How to do the same for all columns of the dataframe?
Reproducible example
# Import modules
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col
from pyspark.sql import Row
# Defining SparkContext
SparkContext.getOrCreate()
# Defining SparkSession
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Introduction au DataFrame") \
    .getOrCreate()
# Initiating DataFrame
values = [("1", "2", "3"),
          ("NA", "1", "2"),
          ("4", "NA", "1")]
columns = ['var1', 'var2', 'var3']
df = spark.createDataFrame(values, columns)
# Initial dataframe
df.show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
| NA| 1| 2|
| 4| NA| 1|
+----+----+----+
# Subset rows without NAs (column 'var1')
df.where(~col('var1').contains('NA')).show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
| 4| NA| 1|
+----+----+----+
My expected output
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
+----+----+----+
What I also tried
I have tried the following but it seems that PySpark doesn't recognize NAs as in pandas.
It only recognizes null values.
df.na.drop().show()
df.select([count(when(isnan('var1'), True))]).show()  # needs count, when, isnan from pyspark.sql.functions
df.filter(df['var1'].isNotNull()).show()
new = (df.na.replace({'NA': None})  # Replace string NA with null
          .dropna()                 # Drop rows containing null
       ).show()
Try this one:
df.dropna().show()
You can also specify the how parameter in the dropna method:
how='any' drops a row if any of its columns is null (your case);
how='all' drops a row only if all of its columns are null.
The default for how is 'any'.
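Note that dropna only removes real nulls; since the 'NA' values here are strings, you still need the replace step from the question first. A minimal sketch combining both (using the dataframe from the question):
# Replace the literal string 'NA' with null, then drop any row containing a null
df_clean = df.na.replace({'NA': None}).dropna(how='any')
df_clean.show()
# Only the row ("1", "2", "3") remains.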

Pyspark data frame Converting false and true to 0 and 1

I have a data frame in Pyspark
df.show()
+-----+-----+
|test1|test2|
+-----+-----+
|false| true|
| true| true|
| true|false|
|false| true|
|false|false|
|false|false|
|false|false|
| true| true|
|false|false|
+-----+-----+
I want to convert all the false values in data frame to 0 and true to 1.
I am doing like below
df1 = df.withColumn('test1', F.when(df.test1 == 'false', 0).otherwise(1)).withColumn('test2', F.when(df.test2 == 'false', 0).otherwise(1))
I got my result. But I think there might be a better way to do this.
Using CASE ... WHEN (when(...).otherwise(...)) is unnecessarily verbose. Instead you can just cast to integer:
from pyspark.sql.functions import col
df.select([col(c).cast("integer") for c in ["test1", "test2"]])
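If you need to keep other columns while overwriting just these two, a small variant of the same cast (still assuming the columns are Boolean):
# cast each Boolean column to integer in place
for c in ["test1", "test2"]:
    df = df.withColumn(c, col(c).cast("integer"))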
One way to avoid repeating withColumn for every column, especially when you have a lot of columns, is to use functools.reduce, so withColumn appears only once:
import pyspark.sql.functions as F
from functools import reduce
cols = ['test1', 'test2']
reduce(lambda df, c: df.withColumn(c, F.when(df[c] == 'false', 0).otherwise(1)), cols, df).show()
+-----+-----+
|test1|test2|
+-----+-----+
| 0| 1|
| 1| 1|
| 1| 0|
| 0| 1|
| 0| 0|
| 0| 0|
| 0| 0|
| 1| 1|
| 0| 0|
+-----+-----+
I am assuming that the datatypes of the two columns (test1, test2) are Boolean. You can try the suggestion below:
import pyspark.sql.functions as F
df = df.withColumn( "test1" , F.when( F.col("test1") , F.lit(1) ).otherwise(0) ).withColumn( "test2" , F.when( F.col("test2") , F.lit(1) ).otherwise(0) )
The columns "test1" and "test2" are Boolean in nature. So, you do not need to equate them using ==True (or ==False).
The use of Pyspark functions makes this route faster (and more scalable) as compared to approaches which use udfs (user defined functions).
Perhaps this helps to do it in a clear way, and for other cases too:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
def fromBooleanToInt(s):
    """
    This is just a simple python function to map booleans to integers.
    >>> fromBooleanToInt(None)
    >>> fromBooleanToInt(True)
    1
    >>> fromBooleanToInt(False)
    0
    """
    if s == True:
        return 1
    elif s == False:
        return 0
    else:
        return None
This is to create a simple dataframe to test
df_with_doubles = spark.createDataFrame([(True, False), (None,True)], ['A', 'B'])
df_with_doubles.show()
+----+-----+
| A| B|
+----+-----+
|true|false|
|null| true|
+----+-----+
This is to define the udf
fromBooleanToInt_udf = F.udf(lambda x: fromBooleanToInt(x), IntegerType())
Now let's do the casting/transformation:
column_to_change = 'A'
df_with_doubles_ = df_with_doubles.withColumn(column_to_change,fromBooleanToInt_udf(df_with_doubles[column_to_change]))
df_with_doubles_.show()
+----+-----+
| A| B|
+----+-----+
| 1|false|
|null| true|
+----+-----+
For Scala users:
df.withColumn("new", col("test1").isNotNull.cast(IntegerType))
I hope it helps.

Get the top two elements in a nested list - pyspark

Let's say I have a list L=[[a,2],[a,3],[a,4],[b,4],[b,8],[b,9]]
Using pyspark, I want to be able to remove the third element of each group so that it will look like this:
[a,2]
[a,3]
[b,4]
[b,8]
I am new to pyspark and not sure what I should do here.
You can try something like this.
The first step is to group by the key column and aggregate the values into a list. Then use a udf to keep the first two values of the list, and finally explode that column.
from pyspark.sql.functions import collect_list, udf, explode
from pyspark.sql.types import ArrayType, IntegerType

df = sc.parallelize([('a', 2), ('a', 3), ('a', 4),
                     ('b', 4), ('b', 8), ('b', 9)]).toDF(['key', 'value'])

# keep only the first two collected values per key
foo = udf(lambda x: x[0:2], ArrayType(IntegerType()))

df_list = (df.groupby('key').agg(collect_list('value'))
             .withColumn('values', foo('collect_list(value)'))
             .withColumn('value', explode('values'))
             .drop('values', 'collect_list(value)'))
df_list.show()
result
+---+-----+
|key|value|
+---+-----+
| b| 4|
| b| 8|
| a| 2|
| a| 3|
+---+-----+
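An alternative that avoids the udf and makes the ordering explicit is a row_number window; a sketch assuming the same df as above:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
# number the rows within each key by ascending value and keep the first two
w = Window.partitionBy('key').orderBy('value')
df_top2 = (df.withColumn('rn', row_number().over(w))
             .filter('rn <= 2')
             .drop('rn'))
df_top2.show()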

Python: How to convert Pyspark column to date type if there are null values

In pyspark, I have a dataframe that has dates that get imported as strings. There are null values in these dates-as-strings columns. I'm trying to convert these columns into date type columns, but I keep getting errors. Here's a small example of the dataframe:
+--------+----------+----------+
|DeviceId| Created| EventDate|
+--------+----------+----------+
| 1| null|2017-03-09|
| 1| null|2017-03-09|
| 1|2017-03-09|2017-03-09|
| 1|2017-03-15|2017-03-15|
| 1| null|2017-05-06|
| 1|2017-05-06|2017-05-06|
| 1| null| null|
+--------+----------+----------+
When there are no null values, I have found that this code below will work to convert the data types:
dt_func = udf (lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
df = df.withColumn('Created', dt_func(col('Created')))
Once I add null values it crashes. I've tried to modify the udf to account for nulls as follows:
import numpy as np
def convertDatetime(x):
    return sf.when(x.isNull(), 'null').otherwise(datetime.strptime(x, '%Y-%m-%d'))
dt_func = udf(convertDatetime, DateType())
I also tried filling the nulls with an arbitrary date-string, converting the columns to dates, and then trying to replace the arbitrary fill date with nulls as below:
def dt_conv(df, cols, form='%Y-%m-%d', temp_plug='1900-01-01'):
    df = df.na.fill(temp_plug)
    dt_func = udf(lambda x: datetime.strptime(x, form), DateType())
    for col_ in cols:
        df = df.withColumn(col_, dt_func(col(col_)))
    df = df.replace(datetime.strptime(temp_plug, form), 'null')
    return df
However, this method gives me this error
ValueError: to_replace should be a float, int, long, string, list, tuple, or dict
Can someone help me figure this out?
try this -
# Some data, I added empty strings and nulls both
data = [(1,'','2017-03-09'),(1,None,'2017-03-09'),(1,'2017-03-09','2017-03-09')]
df = spark.createDataFrame(data).toDF('id','Created','EventDate')
df.show()
+---+----------+----------+
| id| Created| EventDate|
+---+----------+----------+
| 1| |2017-03-09|
| 1| null|2017-03-09|
| 1|2017-03-09|2017-03-09|
+---+----------+----------+
from pyspark.sql.functions import when, unix_timestamp

df\
    .withColumn('Created-formatted', when((df.Created.isNull() | (df.Created == '')), '0')\
    .otherwise(unix_timestamp(df.Created, 'yyyy-MM-dd')))\
    .withColumn('EventDate-formatted', when((df.EventDate.isNull() | (df.EventDate == '')), '0')\
    .otherwise(unix_timestamp(df.EventDate, 'yyyy-MM-dd')))\
    .drop('Created', 'EventDate')\
    .show()
+---+-----------------+-------------------+
| id|Created-formatted|EventDate-formatted|
+---+-----------------+-------------------+
| 1| 0| 1489035600|
| 1| 0| 1489035600|
| 1| 1489035600| 1489035600|
+---+-----------------+-------------------+
I used unix_timestamp, which returns a BigInt, but you can format those columns however you like.
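If you specifically need DateType columns rather than unix timestamps, a sketch using to_date, which handles nulls without a UDF (assuming Spark 2.2+, where to_date accepts a format argument):
from pyspark.sql.functions import to_date, col
# null inputs simply stay null, so no special handling is needed
df = df.withColumn('Created', to_date(col('Created'), 'yyyy-MM-dd')) \
       .withColumn('EventDate', to_date(col('EventDate'), 'yyyy-MM-dd'))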

How can I enumerate rows in groups with Spark/Python?

I'd like to enumerate grouped values just like with Pandas:
Enumerate each row for each group in a DataFrame
What is a way in Spark/Python?
With row_number window function:
from pyspark.sql.functions import row_number
from pyspark.sql import Window
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
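As a quick illustration with hypothetical data (the column names below are placeholders, not from the question):
df = spark.createDataFrame(
    [('a', 10), ('a', 20), ('b', 5)], ['some_column', 'some_other_column'])
w = Window.partitionBy('some_column').orderBy('some_other_column')
df.withColumn('rn', row_number().over(w)).show()
# within each some_column group, rn counts 1, 2, ... in some_other_column order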
You can achieve this at the RDD level by doing:
rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()
It will result in:
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
If you only need a unique ID, not a real continuous index, you may also use
zipWithUniqueId(), which is more efficient since it is done locally on each partition.
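A minimal sketch of the zipWithUniqueId() variant, reusing the rdd from the previous example:
df_uid = spark.createDataFrame(rdd.zipWithUniqueId(), ['value', 'uid'])
df_uid.show()
# the ids are unique but not guaranteed to be consecutive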
