Remove any row with at least 1 NA with PySpark

Remove any row with at least 1 NA with PySpark - python

I have a pyspark dataframe and I would like to remove any row countaining at least one NA.
I know how to do so only for one column (code below).
How to do the same for all columns of the dataframe?
Reproducible example
# Import modules
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.functions import col
from pyspark.sql import Row
# Defining SparkContext
SparkContext.getOrCreate()
# Defining SparkSession
spark = SparkSession \
.builder \
.master("local") \
.appName("Introduction au DataFrame") \
.getOrCreate()
# Initiating DataFrame
values = [("1","2","3"),
("NA","1", "2"),
("4", "NA", "1")]
columns = ['var1',
'var2',
'var3']
df = spark.createDataFrame(values, columns)
# Initial dataframe
df.show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
| NA| 1| 2|
| 4| NA| 1|
+----+----+----+
# Subset rows without NAs (column 'var1')
df.where(~col('var1').contains('NA')).show()
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
| 4| NA| 1|
+----+----+----+
My expected output
+----+----+----+
|var1|var2|var3|
+----+----+----+
| 1| 2| 3|
+----+----+----+
What I also tried
I have tried the following but it seems that PySpark doesn't recognize NAs as in pandas.
It only recognizes null values.
df.na.drop().show()
df.select([count(when(isnan('var1'), True))]).show()
df.filter(df['var1'].isNotNull()).show()

new = (df.na.replace({'NA': None})#Replace string NA with null
.dropna()#Drop NA
).show()

try this one :
df.dropna().show()
you can specify the paramter in dropna method also:
if how = 'any' , then it's your case
or how = 'all' the row will be removed if all columns are null
how by default is 'any'

Related

how to get the output in pyspark same as got in pandas

I am trying some basic functions in pyspark like min max ect.
while using pandas df.min() I got all the separate columns and their minimum values like the image I have attached.
I need the same output using pyspark code.
but I don't know how to do that
Please help me on this

You can try with below code,
# sample data
data = [(1,10,"a"), (2,10,"c"), (0, 100, "t")]
cols = ["col1", "col2", "col3"]
df = spark.createDataFrame(data, cols)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 10| a|
| 2| 10| c|
| 0| 100| t|
+----+----+----+
df.selectExpr([f"min({x}) as {x}" for x in cols]).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 0| 10| a|
+----+----+----+

Pyspark: How to Apply UDF only on Rows with NotNull Values

I have a pyspark dataframe and would like to apply an UDF on a column with Null values.
Below is my dataframe:
+----+----+
| a| b|
+----+----+
|null| 00|
|.Abc|null|
|/5ee| 11|
|null| 0|
+----+----+
Below is the desired dataframe (remove punctuations and change string values to upper case in column a if row values are not Null):
+----+----+
| a| b|
+----+----+
|null| 00|
| ABC|null|
| 5EE| 11|
|null| 0|
+----+----+
Below is my UDF and code:
import pyspark.sql.functions as F
import re
remove_punct = F.udf(lambda x: re.sub('[^\w\s]', '', x))
df = df.withColumn('a', F.when(F.col("a").isNotNull(), F.upper(remove_punct(F.col("a")))))
Below is the error:
TypeError: expected string or bytes-like object
Can you please suggest what would be the optimal solution the get the desired DF?
Thanks in advance!

Use regexp_replace. No need for UDF.
df = df.withColumn('a', F.upper(F.regexp_replace(F.col('a'), '[^\w\s]', '')))
If you insist on using UDF, you need to do this:
remove_punct = F.udf(lambda x: re.sub('[^\w\s]', '', x) if x is not None else None)
df = df.withColumn('a', F.upper(remove_punct(F.col("a"))))

Remove all rows that are duplicates with respect to some rows

I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
| id|value|type|
+------+-----+----+
|283924| 1.5| 0|
|283924| 1.5| 1|
|982384| 3.0| 0|
|982384| 3.0| 1|
|892383| 2.0| 0|
|892383| 2.5| 1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
Rows 1 and 2 are duplicates (again we are ignoring the "type" column)
Rows 3 and 4 are duplicates, and therefore only rows 5 & 6 should remain:
The output would be:
+------+-----+----+
| id|value|type|
+------+-----+----+
|892383| 2.5| 1|
|892383| 2.0| 0|
+------+-----+----+
I've tried
df.dropDuplicates(subset = ['id', 'value'], keep = False)
But the "keep" feature isn't in PySpark (as it is in pandas.DataFrame.drop_duplicates.
How else could I do this?

You can do that using the window functions
from pyspark.sql import Window, functions as F
df.withColumn(
'fg',
F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()

You can groupBy the id and type to get the count. Then use join to filter out the rows in your DataFrame where the count is not 1:
df.join(
df.groupBy('id', 'value').count().where('count = 1').drop('count'), on=['id', 'value']
).show()
#+------+-----+----+
#| id|value|type|
#+------+-----+----+
#|892383| 2.5| 1|
#|892383| 2.0| 0|
#+------+-----+----+

How to add a column in pyspark if two column values is in another dataframe?

I'm very new to pyspark. I have two dataframes like this:
df1:
enter image description here
df2:
enter image description here
label column in df1 does not exist at first. I added it later. If [user_id, sku_id] pair of df1 is in df2, then I want to add a column in df1 and set it to 1, otherwise 0, just like df1 shows. How can I do it in pyspark? I'm using py2.7.

Its possible by doing left outer join on two dataframes first, and then using when and otherwise functions on one of columns of right dataframe. here is complete solution I tried -
from pyspark.sql import functions as F
from pyspark.sql.functions import col
# this is just data input
data1 = [[4,3,3],[2,4,3],[4,2,4],[4,3,3]]
data2 = [[4,3,3],[2,3,3],[4,1,4]]
# create dataframes
df1 = spark.createDataFrame(data1,schema=['userId','sku_id','type'])
df2 = spark.createDataFrame(data2,schema=['userId','sku_id','type'])
# condition for join
cond=[df1.userId==df2.userId,df1.sku_id==df2.sku_id,df1.type==df2.type]
# magic
df1.join(df2,cond,how='left_outer')\
.select(df1.userId,df1.sku_id,df1.type,df2.userId.alias('uid'))\
.withColumn('label',F.when(col('uid')>0 ,1).otherwise(0))\
.drop(col('uid'))\
.show()
output :
+------+------+----+-----+
|userId|sku_id|type|label|
+------+------+----+-----+
| 2| 4| 3| 0|
| 4| 3| 3| 1|
| 4| 3| 3| 1|
| 4| 2| 4| 0|
+------+------+----+-----+

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?

This was not obvious. I see no row-based sum of the columns defined in the spark Dataframes API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that names list to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With python's reduce, some knowledge of how operator overloading works, and the pyspark code for columns here that becomes:
def column_add(a,b):
return a.__add__(b)
newdf = df.withColumn('total_col',
reduce(column_add, ( df[col] for col in df.columns ) ))
Note this is a python reduce, not a spark RDD reduce, and the parenthesis term in the second parameter to reduce requires the parenthesis because it is a list generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
... return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]

The most straight forward way of doing it is to use the expr function
from pyspark.sql.functions import *
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))

The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by #Paul works. Nevertheless I was getting the error, as many other as I have seen,
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so the line imported the sum pyspark command while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal python sum function.
You can delete the reference of the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.

Summing multiple columns from a list into one column
PySpark's sum function doesn't support column addition.
This can be achieved using expr function.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.

My problem was similar to the above (bit more complex) as i had to add consecutive column sums as new columns in PySpark dataframe. This approach uses code from Paul's Version 1 above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df=spark.createDataFrame(data=[(1,2,3),(4,5,6),(3,2,1)\
,(6,1,-4),(0,2,-2),(6,4,1)\
,(4,5,2),(5,-3,-5),(6,4,-1)]\
,schema=['x1','x2','x3'])
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 3| 2| 1|
| 6| 1| -4|
| 0| 2| -2|
| 6| 4| 1|
| 4| 5| 2|
| 5| -3| -5|
| 6| 4| -1|
+---+---+---+
colnames=df.columns
add new columns that are cumulative sums (consecutive):
for i in range(0,len(colnames)):
colnameLst= colnames[0:i+1]
colname = 'cm'+ str(i+1)
df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 3| 6|
| 4| 5| 6| 4| 9| 15|
| 3| 2| 1| 3| 5| 6|
| 6| 1| -4| 6| 7| 3|
| 0| 2| -2| 0| 2| 0|
| 6| 4| 1| 6| 10| 11|
| 4| 5| 2| 4| 9| 11|
| 5| -3| -5| 5| 2| -3|
| 6| 4| -1| 6| 10| 9|
+---+---+---+---+---+---+
'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3

df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
| linha1| valor1| 2|
| linha2| valor2| 5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
| linha1| valor1| 2| 1.0|
| linha2| valor2| 5| 2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
| linha1| valor1| 2| 1.0| 3.0|
| linha2| valor2| 5| 2.5| 7.5|
+--------+--------+--------+--------------+------------+

A very simple approach would be to just use select instead of withcolumn as below:
df = df.select('*', (col("a")+col("b")+col('c).alias("total"))
This should give you required sum with minor changes based on requirements

The following approach works for me:
Import pyspark sql functions
from pyspark.sql import functions as F
Use F.expr(list_of_columns) data_frame.withColumn('Total_Sum',F.expr('col_name1+col_name2+..col_namen)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Remove any row with at least 1 NA with PySpark - python

new = (df.na.replace({'NA': None})#Replace string NA with null .dropna()#Drop NA ).show()

try this one : df.dropna().show() you can specify the paramter in dropna method also: if how = 'any' , then it's your case or how = 'all' the row will be removed if all columns are null how by default is 'any'

Related

how to get the output in pyspark same as got in pandas

Pyspark: How to Apply UDF only on Rows with NotNull Values

Remove all rows that are duplicates with respect to some rows

How to add a column in pyspark if two column values is in another dataframe?

Add column sum as new column in PySpark dataframe

Categories

Resources