Pyspark data frame Converting false and true to 0 and 1 - python

I have a data frame in Pyspark
df.show()
+-----+-----+
|test1|test2|
+-----+-----+
|false| true|
| true| true|
| true|false|
|false| true|
|false|false|
|false|false|
|false|false|
| true| true|
|false|false|
+-----+-----+
I want to convert all the false values in data frame to 0 and true to 1.
I am doing like below
df1 = df.withColumn('test1', F.when(df.test1 == 'false', 0).otherwise(1)).withColumn('test2', F.when(df.test2 == 'false', 0).otherwise(1))
I got my result. But I think there might be a better way to do this.

Using CASE ... WHEN (when(...).otherwise(...)) is unnecessarily verbose. Instead you can just cast to integer:
from pyspark.sql.functions import col
df.select([col(c).cast("integer") for c ["test1", "test2"]])

One way to avoid the multiple withColumn especially when you have a lot of columns could be to use functools.reduce and you only use withColumn once here:
import pyspark.sql.functions as F
from functools import reduce
cols = ['test1', 'test2']
reduce(lambda df, c: df.withColumn(c, F.when(df[c] == 'false', 0).otherwise(1)), cols, df).show()
+-----+-----+
|test1|test2|
+-----+-----+
| 1| 0|
| 0| 1|
+-----+-----+

I am assuming that the datatypes of the two columns (test1, test2) are Boolean. You can try the below mentioned suggestion:
import pyspark.sql.functions as F
df = df.withColumn( "test1" , F.when( F.col("test1") , F.lit(1) ).otherwise(0) ).withColumn( "test2" , F.when( F.col("test2") , F.lit(1) ).otherwise(0) )
The columns "test1" and "test2" are Boolean in nature. So, you do not need to equate them using ==True (or ==False).
The use of Pyspark functions makes this route faster (and more scalable) as compared to approaches which use udfs (user defined functions).

Perhaps this help to do it in a clear way and for other cases too:
from pyspark.sql.functions
import col from pyspark.sql.types
import IntegerType
def fromBooleanToInt(s):
"""
This is just a simple python function to move boolean to integers.
>>> fromBooleanToInt(None)
>>> fromBooleanToInt(True)
1
>>> fromBooleanToInt(False)
1
"""
if s == True:
return 1
elif s==False:
return 0
else:
return None
This is to create a simple dataframe to test
df_with_doubles = spark.createDataFrame([(True, False), (None,True)], ['A', 'B'])
df_with_doubles.show()
+----+-----+
| A| B|
+----+-----+
|true|false|
|null| true|
+----+-----+
This is to define the udf
fromBooleanToInt_udf = F.udf(lambda x: fromBooleanToInt(x), IntegerType())
Now let do the casting/transformation:
column_to_change = 'A'
df_with_doubles_ = df_with_doubles.withColumn(column_to_change,fromBooleanToInt_udf(df_with_doubles[column_to_change]))
df_with_doubles_.show()
+----+-----+
| A| B|
+----+-----+
| 1|false|
|null| true|
+----+-----+

For Scala Users :
df.withColumn('new', col("test1").isNotNull.cast(IntegerType))
I Hope it helps.

Related

Pyspark: How to Apply UDF only on Rows with NotNull Values

I have a pyspark dataframe and would like to apply an UDF on a column with Null values.
Below is my dataframe:
+----+----+
| a| b|
+----+----+
|null| 00|
|.Abc|null|
|/5ee| 11|
|null| 0|
+----+----+
Below is the desired dataframe (remove punctuations and change string values to upper case in column a if row values are not Null):
+----+----+
| a| b|
+----+----+
|null| 00|
| ABC|null|
| 5EE| 11|
|null| 0|
+----+----+
Below is my UDF and code:
import pyspark.sql.functions as F
import re
remove_punct = F.udf(lambda x: re.sub('[^\w\s]', '', x))
df = df.withColumn('a', F.when(F.col("a").isNotNull(), F.upper(remove_punct(F.col("a")))))
Below is the error:
TypeError: expected string or bytes-like object
Can you please suggest what would be the optimal solution the get the desired DF?
Thanks in advance!
Use regexp_replace. No need for UDF.
df = df.withColumn('a', F.upper(F.regexp_replace(F.col('a'), '[^\w\s]', '')))
If you insist on using UDF, you need to do this:
remove_punct = F.udf(lambda x: re.sub('[^\w\s]', '', x) if x is not None else None)
df = df.withColumn('a', F.upper(remove_punct(F.col("a"))))

Is there a way to add a column with range of values to a Spark Dataframe?

I have a spark dataframe: df1 as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
None of the columns in my dataframe contains unique values.
I tried to use F.monotonically_increasing_id()) but it is just producing random numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am worried about using windowing functions and generate row_numbers.
So, is there a way I can add a column with row_count to the dataframe that gives:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
If windowing function is the only way to implement, how can I make sure all the data comes under a single partition ?
or if there is a way to implement the same without using windowing functions, how to implement it ?
Any help is appreciated.
Use zipWithIndex.
I could not find code I did myself in the past yesterday as I was busy working on issues, but here is a good post that explains it. https://sqlandhadoop.com/pyspark-zipwithindex-example/
pyspark different to Scala.
Other answer not good for performance - going to single Executor. zipWithIndex is narrow transformation so it works per partition.
Here goes, you can tailor accordingly:
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F
df1 = spark.createDataFrame([ ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4') ], StringType())
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
| abc| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| abc| 4|
| 2| 5|
| 3| 6|
| 4| 7|
| abc| 8|
| 2| 9|
| 3| 10|
| 4| 11|
+-----+-----+
Assumption: This answer is based on the assumption that the order of col_id should depend on the age column. If the assumption does not hold true the other suggested solution is the in the questions comments mentioned zipWithIndex. An example usage of zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and the the row number to get the expected numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
'col_id',
F.row_number().over(windowSpec)
)
[EDIT] Add assumption of requirements and reference to alternative solution.

How to explode multiple columns of a dataframe in pyspark

I have a dataframe which consists lists in columns similar to the following. The length of the lists in all columns is not same.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
I want to explode the dataframe in such a way that i get the following output-
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
How can I achieve this?
PySpark has added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
.withColumn("new", F.explode("new"))\
.select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
This works,
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df.show()
+-----+----+--------------------+---------+
| Name| Age| Subjects| Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Use udf with zip. Those columns needed to explode have to be merged before exploding.
combine = F.udf(lambda x, y: list(zip(x, y)),
ArrayType(StructType([StructField("subs", StringType()),
StructField("grades", StringType())])))
df = df.withColumn("new", combine("Subjects", "Grades"))\
.withColumn("new", F.explode("new"))\
.select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Arriving late to the party :-)
The simplest way to go is by using inline that doesn't have python API but is supported by selectExpr.
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
Have you tried this
df.select(explode(split(col("Subjects"))).alias("Subjects")).show()
you can convert the data frame to an RDD.
For an RDD you can use a flatMap function to separate the Subjects.
Copy/paste function if you need to repeat this quickly and easily across a large number of columns in a dataset
cols = ["word", "stem", "pos", "ner"]
def explode_cols(self, data, cols):
data = data.withColumn('exp_combo', f.arrays_zip(*cols))
data = data.withColumn('exp_combo', f.explode('exp_combo'))
for col in cols:
data = data.withColumn(col, f.col('exp_combo.' + col))
return data.drop(f.col('exp_combo'))
result = explode_cols(data, cols)
Your welcome :)
When Exploding multiple columns, the above solution comes in handy only when the length of array is same, but if they are not.
It is better to explode them separately and take distinct values each time.
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df = df.withColumn('Subjects',F.explode('Subjects')).select('Name','Age','Subjects', 'Grades').distinct()
df = df.withColumn('Grades',F.explode('Grades')).select('Name','Age','Subjects', 'Grades').distinct()
df.show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
Thanks #nasty for saving the day.
Just small tweaks to get the code working.
def explode_cols( df, cl):
df = df.withColumn('exp_combo', arrays_zip(*cl))
df = df.withColumn('exp_combo', explode('exp_combo'))
for colm in cl:
final_col = 'exp_combo.'+ colm
df = df.withColumn(final_col, col(final_col))
#print col
#print ('exp_combo.'+ colm)
return df.drop(col('exp_combo'))

Get the top two elements in a nested list - pyspark

Let's say I have a list L=[[a,2],[a,3],[a,4],[b,4],[b,8],[b,9]]
Using pyspark I want to be able to remove the third element so that it will look like this:
[a,2]
[a,3]
[b,4]
[b,8]
I am new to pyspark and not sure what I should do here.
You can try something like this.
The first step is groupby key column and aggregate values in a list. Then use a udf to get the first two values of the list and then explode that column.
df = sc.parallelize([('a',2),('a',3),('a',4),
('b',4),('b',8),('b',9)]).toDF(['key', 'value'])
from pyspark.sql.functions import collect_list, udf, explode
from pyspark.sql.types import *
foo = udf(lambda x:x[0:2], ArrayType(IntegerType()))
df_list = (df.groupby('key').agg(collect_list('value')).
withColumn('values',foo('collect_list(value)')).
withColumn('value', explode('values')).
drop('values', 'collect_list(value)'))
df_list.show()
result
+---+-----+
|key|value|
+---+-----+
| b| 4|
| b| 8|
| a| 2|
| a| 3|
+---+-----+

How can I enumerate rows in groups with Spark/Python?

I'd like to enumerate grouped values just like with Pandas:
Enumerate each row for each group in a DataFrame
What is a way in Spark/Python?
With row_number window function:
from pyspark.sql.functions import row_number
from pyspark.sql import Window
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
You can achieve this on rdd level by doing:
rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()
It will result:
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
If you only need unique ID, not real continuous indexing, you may also use
zipWithUniqueId() which is more efficient, since done locally on each partition.

Categories