Related
I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns, if a column is in df.colums then add it, if its not then use lit to add it with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable.
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want.
Based on your requirement have written a dynamic code. This will select columns based on the list provided and also create column with null values if that column is not present in the source/original dataframe.
data1 = [("James", "Lee", "Smith","36636"),
("Michael","Rose","Boots","40288")]
schema1 = StructType([StructField("A",StringType(),True),
StructField("B",StringType(),True),
StructField("C",StringType(),True),
StructField("D",StringType(),True)])
df1 = spark.createDataFrame(data=data1,schema=schema1)
actual_columns = df1.schema.names
final_columns = ['A','C','E']
def Diff(li1, li2):
diff = list(set(li2) - set(li1))
return diff
def Same(li1, li2):
same = list(sorted(set(li1).intersection(li2)))
return same
df1 = df1.select(*Same(actual_columns,final_columns))
for i in Diff(actual_columns,final_columns):
df1 = df1.withColumn(""+i+"",lit(''))
display(df1)
I have these tables:
df1 df2
+---+------------+ +---+---------+
| id| many_cols| | id|criterion|
+---+------------+ +---+---------+
| 1|lots_of_data| | 1| false|
| 2|lots_of_data| | 1| true|
| 3|lots_of_data| | 1| true|
+---+------------+ | 3| false|
+---+---------+
I intend to create additional column in df1:
+---+------------+------+
| id| many_cols|result|
+---+------------+------+
| 1|lots_of_data| 1|
| 2|lots_of_data| null|
| 3|lots_of_data| 0|
+---+------------+------+
result should be 1 if there is a corresponding true in df2
result should be 0 if there's no corresponding true in df2
result should be null if there is no corresponding id in df2
I cannot think of an efficient way to do it. I am stuck with only the 3rd condition working after a join:
df = df1.join(df2, 'id', 'full')
df.show()
# +---+------------+---------+
# | id| many_cols|criterion|
# +---+------------+---------+
# | 1|lots_of_data| false|
# | 1|lots_of_data| true|
# | 1|lots_of_data| true|
# | 3|lots_of_data| false|
# | 2|lots_of_data| null|
# +---+------------+---------+
PySpark dataframes are created like this:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df1cols = ['id', 'many_cols']
df1data = [(1, 'lots_of_data'),
(2, 'lots_of_data'),
(3, 'lots_of_data')]
df2cols = ['id', 'criterion']
df2data = [(1, False),
(1, True),
(1, True),
(3, None)]
df1 = spark.createDataFrame(df1data, df1cols)
df2 = spark.createDataFrame(df2data, df2cols)
A simple way would be to groupby df2 to get the max criterion by id the join with df1, this way you reduce the number of lines to join. The max of a boolean column is true if there is at least one corresponding true value:
from pyspark.sql import functions as F
df2_group = df2.groupBy("id").agg(F.max("criterion").alias("criterion"))
result = df1.join(df2_group, ["id"], "left").withColumn(
"result",
F.col("criterion").cast("int")
).drop("criterion")
result.show()
#+---+------------+------+
#| id| many_cols|result|
#+---+------------+------+
#| 1|lots_of_data| 1|
#| 3|lots_of_data| 0|
#| 2|lots_of_data| null|
#+---+------------+------+
You can try a correlated subquery to get the maximum Boolean from df2, and cast that to an integer.
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
df = spark.sql("""
select
df1.*,
(select int(max(criterion)) from df2 where df1.id = df2.id) as result
from df1
""")
df.show()
+---+------------+------+
| id| many_cols|result|
+---+------------+------+
| 1|lots_of_data| 1|
| 3|lots_of_data| 0|
| 2|lots_of_data| null|
+---+------------+------+
check out this solution. After joining. you can use multiple condition checks based on your requirement and assign the value accordingly using when clause and then take the max value of result grouping by id and other columns. you can use window function as well to calculate the max of result if you are just using just id for the partition.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df1cols = ['id', 'many_cols']
df1data = [(1, 'lots_of_data'),
(2, 'lots_of_data'),
(3, 'lots_of_data')]
df2cols = ['id', 'criterion']
df2data = [(1, False),
(1, True),
(1, True),
(3, False)]
df1 = spark.createDataFrame(df1data, df1cols)
df2 = spark.createDataFrame(df2data, df2cols)
df2_mod =df2.withColumnRenamed("id", "id_2")
df3=df1.join(df2_mod, on=df1.id== df2_mod.id_2, how='left')
cond1 = (F.col("id")== F.col("id_2"))& (F.col("criterion")==1)
cond2 = (F.col("id")== F.col("id_2"))& (F.col("criterion")==0)
cond3 = (F.col("id_2").isNull())
df3.select("id", "many_cols", F.when(cond1, 1).when(cond2,0).when(cond3, F.lit(None)).alias("result"))\
.groupBy("id", "many_cols").agg(F.max(F.col("result")).alias("result")).orderBy("id").show()
Result:
------
+---+------------+------+
| id| many_cols|result|
+---+------------+------+
| 1|lots_of_data| 1|
| 2|lots_of_data| null|
| 3|lots_of_data| 0|
+---+------------+------+
Using window function
w=Window().partitionBy("id")
df3.select("id", "many_cols", F.when(cond1, 1).when(cond2,0).when(cond3, F.lit(None)).alias("result"))\
.select("id", "many_cols", F.max("result").over(w).alias("result")).drop_duplicates().show()
I had to merge the ideas of proposed answers to get the solution which suited me most.
# The `cond` variable is very useful, here it represents several complex conditions
cond = F.col('criterion') == True
df2_grp = df2.select(
'id',
F.when(cond, 1).otherwise(0).alias('c')
).groupBy('id').agg(F.max(F.col('c')).alias('result'))
df = df1.join(df2_grp, 'id', 'left')
df.show()
#+---+------------+------+
#| id| many_cols|result|
#+---+------------+------+
#| 1|lots_of_data| 1|
#| 3|lots_of_data| 0|
#| 2|lots_of_data| null|
#+---+------------+------+
I have a pyspark dataframe like this
data = [(("ID1", 10, 30)), (("ID2", 20, 60))]
df1 = spark.createDataFrame(data, ["ID", "colA", "colB"])
df1.show()
df1:
+---+-----------+
| ID| colA| colB|
+---+-----------+
|ID1| 10| 30|
|ID2| 20| 60|
+---+-----------+
I have Another dataframe like this
data = [(("colA", 2)), (("colB", 5))]
df2 = spark.createDataFrame(data, ["Column", "Value"])
df2.show()
df2:
+-------+------+
| Column| Value|
+-------+------+
| colA| 2|
| colB| 5|
+-------+------+
I want to divide every column in df1 by the respective value in df2. Hence df3 will look like
df3:
+---+-------------------------+
| ID| colA| colB|
+---+------------+------------+
|ID1| 10/2 = 5| 30/5 = 6|
|ID2| 20/2 = 10| 60/5 = 12|
+---+------------+------------+
Ultimately, I want to add colA and colB to get the final df4 per ID
df4:
+---+---------------+
| ID| finalSum|
+---+---------------+
|ID1| 5 + 6 = 11|
|ID2| 10 + 12 = 22|
+---+---------------+
The idea is to join both the DataFrames together and then apply the division operation. Since, df2 contains the column names and the respective value, so we need to pivot() it first and then join with the main table df1. (Pivoting is an expensive operation, but it should be fine as long as the DataFrame is small.)
# Loading the requisite packages
from pyspark.sql.functions import col
from functools import reduce
from operator import add
# Creating the DataFrames
df1 = sqlContext.createDataFrame([('ID1', 10, 30), ('ID2', 20, 60)],('ID','ColA','ColB'))
df2 = sqlContext.createDataFrame([('ColA', 2), ('ColB', 5)],('Column','Value'))
The code is fairly generic, so that we need not need to specify the column names on our own. We find the column names we need to operate on. Except ID we need all.
# This contains the list of columns where we apply mathematical operations
columns_to_be_operated = df1.columns
columns_to_be_operated.remove('ID')
print(columns_to_be_operated)
['ColA', 'ColB']
Pivoting the df2, which we will join to df1.
# Pivoting the df2 to get the rows in column form
df2 = df2.groupBy().pivot('Column').sum('Value')
df2.show()
+----+----+
|ColA|ColB|
+----+----+
| 2| 5|
+----+----+
We can change the column names, so that we don't have a duplicate name for every column. We do so, by adding a suffix _x on all the names.
# Dynamically changing the name of the columns in df2
df2 = df2.select([col(c).alias(c+'_x') for c in df2.columns])
df2.show()
+------+------+
|ColA_x|ColB_x|
+------+------+
| 2| 5|
+------+------+
Next we join the tables with a Cartesian join. (Note that you may run into memory issues if df2 is large.)
df = df1.crossJoin(df2)
df.show()
+---+----+----+------+------+
| ID|ColA|ColB|ColA_x|ColB_x|
+---+----+----+------+------+
|ID1| 10| 30| 2| 5|
|ID2| 20| 60| 2| 5|
+---+----+----+------+------+
Finally adding the columns by dividing them with the corresponding value first. reduce() applies function add() of two arguments, cumulatively, to the items of the sequence.
df = df.withColumn(
'finalSum',
reduce(add, [col(c)/col(c+'_x') for c in columns_to_be_operated])
).select('ID','finalSum')
df.show()
+---+--------+
| ID|finalSum|
+---+--------+
|ID1| 11.0|
|ID2| 22.0|
+---+--------+
Note: OP has to be careful with the division with 0. The snippet just above can be altered to take this condition into account.
I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.
pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.
Here's a solution to the general case that doesn't involve needing to know the length of the array ahead of time, using collect, or using udfs. Unfortunately this only works for spark version 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
[
[1, 'A, B, C, D'],
[2, 'E, F, G'],
[3, 'H, I'],
[4, 'J']
]
, ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. First one is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr which allows us use column values as parameters.
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
"num",
f.concat(f.lit("letter"),f.col("pos").cast("string")).alias("name"),
f.expr("letters[pos]").alias("val")
)\
.show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
"num",
f.concat(f.lit("letter"),f.col("pos").cast("string")).alias("name"),
f.expr("letters[pos]").alias("val")
)\
.groupBy("num").pivot("name").agg(f.first("val"))\
.show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+
Here's another approach, in case you want split a string with a delimiter.
import pyspark.sql.functions as f
df = spark.createDataFrame([("1:a:2001",),("2:b:2002",),("3:c:2003",)],["value"])
df.show()
+--------+
| value|
+--------+
|1:a:2001|
|2:b:2002|
|3:c:2003|
+--------+
df_split = df.select(f.split(df.value,":")).rdd.flatMap(
lambda x: x).toDF(schema=["col1","col2","col3"])
df_split.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
I don't think this transition back and forth to RDDs is going to slow you down...
Also don't worry about last schema specification: it's optional, you can avoid it generalizing the solution to data with unknown column size.
I understand your pain. Using split() can work, but can also lead to breaks.
Let's take your df and make a slight change to it:
df = spark.createDataFrame([('1:"a:3":2001',),('2:"b":2002',),('3:"c":2003',)],["value"])
df.show()
+------------+
| value|
+------------+
|1:"a:3":2001|
| 2:"b":2002|
| 3:"c":2003|
+------------+
If you try to apply split() to this as outlined above:
df_split = df.select(split(df.value,":")).rdd.flatMap(
lambda x: x).toDF(schema=["col1","col2","col3"]).show()
you will get
IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 3 values are provided.
So, is there a more elegant way of addressing this? I was so happy to have it pointed out to me. pyspark.sql.functions.from_csv() is your friend.
Taking my above example df:
from pyspark.sql.functions import from_csv
# Define a column schema to apply with from_csv()
col_schema = ["col1 INTEGER","col2 STRING","col3 INTEGER"]
schema_str = ",".join(col_schema)
# define the separator because it isn't a ','
options = {'sep': ":"}
# create a df from the value column using schema and options
df_csv = df.select(from_csv(df.value, schema_str, options).alias("value_parsed"))
df_csv.show()
+--------------+
| value_parsed|
+--------------+
|[1, a:3, 2001]|
| [2, b, 2002]|
| [3, c, 2003]|
+--------------+
Then we can easily flatten the df to put the values in columns:
df2 = df_csv.select("value_parsed.*").toDF("col1","col2","col3")
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a:3|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
No breaks. Data correctly parsed. Life is good. Have a beer.
Instead of Column.getItem(i) we can use Column[i].
Also, enumerate is useful in big dataframes.
from pyspark.sql import functions as F
Keep parent column:
for i, c in enumerate(['new_1', 'new_2']):
df = df.withColumn(c, F.split('my_str_col', '-')[i])
or
new_cols = ['new_1', 'new_2']
df = df.select('*', *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)])
Replace parent column:
for i, c in enumerate(['new_1', 'new_2']):
df = df.withColumn(c, F.split('my_str_col', '-')[i])
df = df.drop('my_str_col')
or
new_cols = ['new_1', 'new_2']
df = df.select(
*[c for c in df.columns if c != 'my_str_col'],
*[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)]
)
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is choose the column with max values in it.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Ouput :
col_4 = max(col1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark or should I change convert my PySpark df to Pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
return reduce(
lambda x, y: when(x > y, x).otherwise(y),
[col(c) if isinstance(c, str) else c for c in cols]
)
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
.toDF(["a", "b", "c"]))
df.select(row_max("a", "b", "c").alias("max")))
Spark 1.5+ also provides least, greatest
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep name of the max you can use `structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use above to find select "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
.groupBy(col("maxs")["col"].alias("col"))
.count()
.agg(max(struct(col("count"), col("col"))))
.first())
df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
[[1,2,3], [2,1,2], [3,4,5]],
['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desited results
from pyspark.sql.functions import lit, min
df.select( lit('c1').alias('cn1'), min(df.c1).alias('c1'),
lit('c2').alias('cn2'), min(df.c2).alias('c2'),
lit('c3').alias('cn3'), min(df.c3).alias('c3')
)\
.rdd.flatMap(lambda r: [ (r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
.toDF(['Columnn', 'Min']).show()
+-------+---+
|Columnn|Min|
+-------+---+
| c1| 3|
| c2| 2|
| c3| 1|
+-------+---+
Scala solution:
df = sc.parallelize(Seq((10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3"))
df.rdd.map(row=>List[String](row(0).toString,row(1).toString,row(2).toString)).map(x=>(x(0),x(1),x(2),x.min)).toDF("c1","c2","c3","min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+