How to explode multiple columns of a dataframe in pyspark - python

I have a dataframe whose columns contain lists, similar to the following. The lists are not all the same length across columns.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
I want to explode the dataframe in such a way that I get the following output:
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
How can I achieve this?

Spark 2.4 added an arrays_zip function to PySpark, which eliminates the need for a Python UDF to zip the arrays.
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age",
               F.col("new.Subjects").alias("Subjects"),
               F.col("new.Grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
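As an aside, arrays_zip pads the shorter array with null when the input arrays differ in length, so no rows are silently dropped. If you also want Name and Age as scalars rather than single-element arrays, you can index into them while selecting; a small variation on the snippet above, assuming those arrays always hold exactly one element:
df = df.select(F.col("Name")[0].alias("Name"),
               F.col("Age")[0].alias("Age"),
               "Subjects", "Grades")
df.show()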

This works:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df.show()
+-----+----+--------------------+---------+
| Name| Age| Subjects| Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Use a udf with zip. The columns to be exploded have to be merged into one column of structs before exploding.
combine = F.udf(lambda x, y: list(zip(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))

df = df.withColumn("new", combine("Subjects", "Grades"))\
       .withColumn("new", F.explode("new"))\
       .select("Name", "Age",
               F.col("new.subs").alias("Subjects"),
               F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
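One caveat: Python's zip truncates to the shorter list, so if Subjects and Grades can differ in length, entries are silently dropped. A hedged variation using itertools.zip_longest pads the shorter list with None instead (same schema as above):
from itertools import zip_longest

combine = F.udf(lambda x, y: list(zip_longest(x, y)),
                ArrayType(StructType([StructField("subs", StringType()),
                                      StructField("grades", StringType())])))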

Arriving late to the party :-)
The simplest way to go is to use inline, which has no Python API before Spark 3.4 but is supported through selectExpr.
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
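For what it's worth, Spark 3.4 added inline to the Python API, so on recent versions (assuming pyspark >= 3.4) the same thing works without selectExpr:
import pyspark.sql.functions as F

df.select(F.col('Name')[0].alias('Name'),
          F.col('Age')[0].alias('Age'),
          F.inline(F.arrays_zip('Subjects', 'Grades'))).show()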

Have you tried this?
from pyspark.sql.functions import col, explode, split
df.select(explode(split(col("Subjects"), ",")).alias("Subjects")).show()
(split needs a delimiter pattern and applies when Subjects is a comma-separated string; if it is already an array, explode(col("Subjects")) alone is enough. This also only handles one column at a time.)
Alternatively, you can convert the data frame to an RDD and use a flatMap function to separate the Subjects.
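A rough sketch of that RDD route, assuming the question's dataframe (array-typed Name, Age, Subjects, Grades) and that Subjects and Grades pair up positionally:
rows = df.rdd.flatMap(
    lambda r: [(r.Name[0], r.Age[0], s, g)        # flatten each row into one
               for s, g in zip(r.Subjects, r.Grades)])  # tuple per subject/grade
rows.toDF(['Name', 'Age', 'Subjects', 'Grades']).show()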

Copy/paste function if you need to repeat this quickly and easily across a large number of columns in a dataset:
import pyspark.sql.functions as f

cols = ["word", "stem", "pos", "ner"]

def explode_cols(data, cols):
    data = data.withColumn('exp_combo', f.arrays_zip(*cols))
    data = data.withColumn('exp_combo', f.explode('exp_combo'))
    for col in cols:
        data = data.withColumn(col, f.col('exp_combo.' + col))
    return data.drop('exp_combo')

result = explode_cols(data, cols)
You're welcome :)

When exploding multiple columns, the solutions above come in handy only when the arrays have the same length. When they do not, it is better to explode them separately and take distinct values each time.
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(['Bob'], [16], ['Maths', 'Physics', 'Chemistry'], ['A', 'B', 'C'])],
    ['Name', 'Age', 'Subjects', 'Grades'])
df = df.withColumn('Subjects', F.explode('Subjects')).select('Name', 'Age', 'Subjects', 'Grades').distinct()
df = df.withColumn('Grades', F.explode('Grades')).select('Name', 'Age', 'Subjects', 'Grades').distinct()
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]|    Maths|     A|
|[Bob]|[16]|    Maths|     B|
|[Bob]|[16]|    Maths|     C|
|[Bob]|[16]|  Physics|     A|
|[Bob]|[16]|  Physics|     B|
|[Bob]|[16]|  Physics|     C|
|[Bob]|[16]|Chemistry|     A|
|[Bob]|[16]|Chemistry|     B|
|[Bob]|[16]|Chemistry|     C|
+-----+----+---------+------+
(Row order may vary. Note that exploding each column independently produces the cross product of the two arrays, so the positional pairing between Subjects and Grades is lost; Name and Age also remain single-element arrays unless flattened separately.)
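If the arrays can differ in length and you still need to keep the positional pairing (Maths with A, and so on), a hedged alternative is posexplode, which also emits each element's position so the two explosions can be joined back together. A sketch, assuming df is the original un-exploded frame and Name identifies a row:
import pyspark.sql.functions as F

subs = df.select(F.col('Name')[0].alias('Name'), F.col('Age')[0].alias('Age'),
                 F.posexplode('Subjects').alias('pos', 'Subjects'))
grds = df.select(F.col('Name')[0].alias('Name'),
                 F.posexplode('Grades').alias('pos', 'Grades'))
# full join on position keeps unmatched elements as nulls
subs.join(grds, ['Name', 'pos'], 'full').drop('pos').show()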

Thanks @nasty for saving the day.
Just small tweaks to get the code working:
from pyspark.sql.functions import arrays_zip, explode, col

def explode_cols(df, cl):
    df = df.withColumn('exp_combo', arrays_zip(*cl))
    df = df.withColumn('exp_combo', explode('exp_combo'))
    for colm in cl:
        df = df.withColumn(colm, col('exp_combo.' + colm))
    return df.drop('exp_combo')
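Hypothetical usage, mirroring the earlier answer (assumes a dataframe df whose listed columns are all arrays of equal length):
cols = ['word', 'stem', 'pos', 'ner']
result = explode_cols(df, cols)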

Related

PySpark - Filter dataframe columns based on list

I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith", "36636"),
         ("Michael", "Rose", "Boots", "40288")]
schema1 = StructType([StructField("A", StringType(), True),
                      StructField("B", StringType(), True),
                      StructField("C", StringType(), True),
                      StructField("D", StringType(), True)])
df1 = spark.createDataFrame(data=data1, schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns: if a column is in df1.columns, add it as-is; if it's not, use lit to add it with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable.
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
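For reference, the for-loop version mentioned above looks like this:
from pyspark.sql.functions import lit

selected = []
for c in final_columns:
    if c in df1.columns:
        selected.append(c)                    # existing column: keep as-is
    else:
        selected.append(lit(None).alias(c))   # missing column: fill with null
df1.select(selected).show()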
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want.
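One caveat with this approach: drop() can only remove columns, so a requested column that is absent from the dataframe (E in the original question) will not be created. A hedged way to cover that case is to add the missing columns as nulls before dropping the extras:
from pyspark.sql.functions import lit

# add requested columns that don't exist yet, then drop the unwanted ones
for c in set(final_cols) - set(df.columns):
    df = df.withColumn(c, lit(None))
df2 = df.drop(*(set(df.columns) - set(final_cols)))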
Based on your requirement, I have written dynamic code. It selects columns based on the list provided, and also creates placeholder columns when a column is not present in the source/original dataframe.
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith", "36636"),
         ("Michael", "Rose", "Boots", "40288")]
schema1 = StructType([StructField("A", StringType(), True),
                      StructField("B", StringType(), True),
                      StructField("C", StringType(), True),
                      StructField("D", StringType(), True)])
df1 = spark.createDataFrame(data=data1, schema=schema1)

actual_columns = df1.schema.names
final_columns = ['A', 'C', 'E']

def Diff(li1, li2):
    # columns requested but not present in the source
    return list(set(li2) - set(li1))

def Same(li1, li2):
    # columns present in both lists, in sorted order
    return list(sorted(set(li1).intersection(li2)))

df1 = df1.select(*Same(actual_columns, final_columns))
for i in Diff(actual_columns, final_columns):
    df1 = df1.withColumn(i, lit(''))

display(df1)  # Databricks; use df1.show() elsewhere
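For the sample data this should display something like the following (E filled with empty strings):
+-------+-----+---+
|      A|    C|  E|
+-------+-----+---+
|  James|Smith|   |
|Michael|Boots|   |
+-------+-----+---+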

Multiply column of PySpark dataframe with scalar

I want to multiply a column (say x3) of a PySpark dataframe (say df) with a scalar (say 0.1). Below is an example of a dataframe that I have:
import pyspark.sql.functions as F

df = sqlContext.createDataFrame(
    [(1, "a", 5.0), (3, "B", 21.0)], ("x1", "x2", "x3"))
df.show()
+---+---+----+
| x1| x2| x3|
+---+---+----+
| 1| a| 5.0|
| 3| B|21.0|
+---+---+----+
Below is what I am trying at present:
df_new = df.withColumn("norm_x3", 0.1 * F.col("x3"))
df_new = df_new.select([c for c in df_new.columns if c not in {'x3'}])
The method which I am trying above gives the expected output which is:
+---+---+-------+
| x1| x2|norm_x3|
+---+---+-------+
| 1| a| 0.5|
| 3| B| 2.1|
+---+---+-------+
Is there a more elegant and short way of doing the same thing? Thanks.
The most elegant way would be simply to use drop:
df_new = df.withColumn("norm_x3", 0.1 * F.col("x3")).drop("x3")
Alternatively, you can also use withColumnRenamed, but it is less preferable because you're overloading "x3", which could cause confusion in the future:
df_new = df.withColumn("x3", 0.1 * F.col("x3")).withColumnRenamed("x3", "norm_x3")
Here's one way to do it in one line:
df.select([(df[c] * 0.1).alias('norm_x3') if c == 'x3' else df[c] for c in df.columns])
Or:
df.selectExpr('*', 'x3 * 0.1 as norm_x3').drop('x3')

Sum Product in PySpark

I have a pyspark dataframe like this
data = [(("ID1", 10, 30)), (("ID2", 20, 60))]
df1 = spark.createDataFrame(data, ["ID", "colA", "colB"])
df1.show()
df1:
+---+-----------+
| ID| colA| colB|
+---+-----------+
|ID1| 10| 30|
|ID2| 20| 60|
+---+-----------+
I have Another dataframe like this
data = [(("colA", 2)), (("colB", 5))]
df2 = spark.createDataFrame(data, ["Column", "Value"])
df2.show()
df2:
+-------+------+
| Column| Value|
+-------+------+
| colA| 2|
| colB| 5|
+-------+------+
I want to divide every column in df1 by the respective value in df2. Hence df3 will look like
df3:
+---+------------+------------+
| ID|        colA|        colB|
+---+------------+------------+
|ID1|    10/2 = 5|    30/5 = 6|
|ID2|   20/2 = 10|   60/5 = 12|
+---+------------+------------+
Ultimately, I want to add colA and colB to get the final df4 per ID
df4:
+---+-------------+
| ID|     finalSum|
+---+-------------+
|ID1|   5 + 6 = 11|
|ID2| 10 + 12 = 22|
+---+-------------+
The idea is to join both DataFrames together and then apply the division operation. Since df2 contains the column names and the respective values, we need to pivot() it first and then join it with the main table df1. (Pivoting is an expensive operation, but it should be fine as long as the DataFrame is small.)
# Loading the requisite packages
from pyspark.sql.functions import col
from functools import reduce
from operator import add
# Creating the DataFrames
df1 = sqlContext.createDataFrame([('ID1', 10, 30), ('ID2', 20, 60)],('ID','ColA','ColB'))
df2 = sqlContext.createDataFrame([('ColA', 2), ('ColB', 5)],('Column','Value'))
The code is fairly generic, so that we need not specify the column names ourselves. We find the column names we need to operate on: except for ID, we need them all.
# This contains the list of columns where we apply mathematical operations
columns_to_be_operated = df1.columns
columns_to_be_operated.remove('ID')
print(columns_to_be_operated)
['ColA', 'ColB']
Pivoting the df2, which we will join to df1.
# Pivoting the df2 to get the rows in column form
df2 = df2.groupBy().pivot('Column').sum('Value')
df2.show()
+----+----+
|ColA|ColB|
+----+----+
| 2| 5|
+----+----+
We change the column names so that we don't end up with duplicate names after the join. We do so by adding the suffix _x to all the names.
# Dynamically changing the name of the columns in df2
df2 = df2.select([col(c).alias(c+'_x') for c in df2.columns])
df2.show()
+------+------+
|ColA_x|ColB_x|
+------+------+
| 2| 5|
+------+------+
Next we join the tables with a Cartesian join. (Note that you may run into memory issues if df2 is large.)
df = df1.crossJoin(df2)
df.show()
+---+----+----+------+------+
| ID|ColA|ColB|ColA_x|ColB_x|
+---+----+----+------+------+
|ID1| 10| 30| 2| 5|
|ID2| 20| 60| 2| 5|
+---+----+----+------+------+
Finally, we add the columns after dividing each by its corresponding value. reduce() applies the two-argument function add() cumulatively to the items of the sequence.
df = df.withColumn(
    'finalSum',
    reduce(add, [col(c) / col(c + '_x') for c in columns_to_be_operated])
).select('ID', 'finalSum')
df.show()
+---+--------+
| ID|finalSum|
+---+--------+
|ID1| 11.0|
|ID2| 22.0|
+---+--------+
Note: the OP has to be careful with division by 0. The snippet above can be altered to take this condition into account.
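For example, a hedged way to guard against zero divisors is to alter the reduce step so each term is wrapped in when(), here treating a zero divisor's contribution as 0 (adjust the fallback to your own business rule):
from pyspark.sql.functions import col, when
from functools import reduce
from operator import add

df = df.withColumn(
    'finalSum',
    reduce(add, [when(col(c + '_x') != 0, col(c) / col(c + '_x')).otherwise(0)
                 for c in columns_to_be_operated])
).select('ID', 'finalSum')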

Pyspark data frame Converting false and true to 0 and 1

I have a data frame in Pyspark
df.show()
+-----+-----+
|test1|test2|
+-----+-----+
|false| true|
| true| true|
| true|false|
|false| true|
|false|false|
|false|false|
|false|false|
| true| true|
|false|false|
+-----+-----+
I want to convert all the false values in data frame to 0 and true to 1.
I am doing like below
df1 = df.withColumn('test1', F.when(df.test1 == 'false', 0).otherwise(1)).withColumn('test2', F.when(df.test2 == 'false', 0).otherwise(1))
I got my result. But I think there might be a better way to do this.
Using CASE ... WHEN (when(...).otherwise(...)) is unnecessarily verbose. Instead you can just cast to integer:
from pyspark.sql.functions import col
df.select([col(c).cast("integer") for c in ["test1", "test2"]])
One way to avoid multiple withColumn calls, especially when you have a lot of columns, is to use functools.reduce, so that withColumn appears only once here:
import pyspark.sql.functions as F
from functools import reduce
cols = ['test1', 'test2']
reduce(lambda df, c: df.withColumn(c, F.when(df[c] == 'false', 0).otherwise(1)), cols, df).show()
+-----+-----+
|test1|test2|
+-----+-----+
|    0|    1|
|    1|    1|
|    1|    0|
|    0|    1|
|    0|    0|
|    0|    0|
|    0|    0|
|    1|    1|
|    0|    0|
+-----+-----+
I am assuming that the datatypes of the two columns (test1, test2) are Boolean. You can try the suggestion below:
import pyspark.sql.functions as F
df = df.withColumn("test1", F.when(F.col("test1"), F.lit(1)).otherwise(0))\
       .withColumn("test2", F.when(F.col("test2"), F.lit(1)).otherwise(0))
The columns "test1" and "test2" are Boolean in nature. So, you do not need to equate them using ==True (or ==False).
The use of Pyspark functions makes this route faster (and more scalable) as compared to approaches which use udfs (user defined functions).
Perhaps this helps to do it in a clear way, and it covers other cases too:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

def fromBooleanToInt(s):
    """
    This is just a simple python function to map booleans to integers.
    >>> fromBooleanToInt(None)
    >>> fromBooleanToInt(True)
    1
    >>> fromBooleanToInt(False)
    0
    """
    if s == True:
        return 1
    elif s == False:
        return 0
    else:
        return None
This is to create a simple dataframe to test
df_with_doubles = spark.createDataFrame([(True, False), (None,True)], ['A', 'B'])
df_with_doubles.show()
+----+-----+
| A| B|
+----+-----+
|true|false|
|null| true|
+----+-----+
This is to define the udf
fromBooleanToInt_udf = F.udf(lambda x: fromBooleanToInt(x), IntegerType())
Now let's do the casting/transformation:
column_to_change = 'A'
df_with_doubles_ = df_with_doubles.withColumn(column_to_change,fromBooleanToInt_udf(df_with_doubles[column_to_change]))
df_with_doubles_.show()
+----+-----+
| A| B|
+----+-----+
| 1|false|
|null| true|
+----+-----+
For Scala users:
df.withColumn("new", col("test1").cast(IntegerType))
I hope it helps.

How can I enumerate rows in groups with Spark/Python?

I'd like to enumerate grouped values just like with Pandas:
Enumerate each row for each group in a DataFrame
What is a way in Spark/Python?
With the row_number window function:
from pyspark.sql.functions import row_number
from pyspark.sql import Window
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
You can achieve this at the RDD level by doing:
rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()
It will result in:
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
If you only need a unique ID, not a real continuous index, you may also use zipWithUniqueId(), which is more efficient since it is computed locally on each partition.
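A quick sketch (the IDs are unique but not necessarily consecutive; they depend on how the data is partitioned):
rdd = sc.parallelize(['a', 'b', 'c'])
spark.createDataFrame(rdd.zipWithUniqueId(), ['value', 'uid']).show()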
