Merge multiple columns into one column in PySpark

Input DataFrame:
id,page,location,trlmonth
1,mobile,chn,08/2018
2,product,mdu,09/2018
3,product,mdu,09/2018
4,mobile,chn,08/2018
5,book,delhi,10/2018
7,music,ban,11/2018
Output DataFrame:
userdetail,count
mobile-chn-08/2018,2
product-mdu-09/2018,2
book-delhi-10/2018,1
music-ban-11/2018,1
I tried merging a single column's values into one, but how do I merge multiple columns into one?
from pyspark.sql import functions as F

df2 = (df
       .groupby("id")
       .agg(F.concat_ws("-", F.sort_array(F.collect_list("product"))).alias("products"))
       .groupby("products")
       .agg(F.count("id").alias("count")))

We can simply group by the three columns that make up userdetail and take the count. Try this:
>>> (df.groupby('page', 'location', 'trlmonth')
...    .count()
...    .select(F.concat_ws('-', 'page', 'location', 'trlmonth').alias('user_detail'), 'count')
...    .orderBy('trlmonth')
...    .show())
+-------------------+-----+
| user_detail|count|
+-------------------+-----+
| mobile-chn-08/2018| 2|
|product-mdu-09/2018| 2|
| book-delhi-10/2018| 1|
| music-ban-11/2018| 1|
+-------------------+-----+

Related

PySpark - Filter dataframe columns based on list

I have a dataframe with some column names and I want to filter out some columns based on a list.
I have a list of columns I would like to have in my final dataframe:
final_columns = ['A','C','E']
My dataframe is this:
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith", "36636"),
         ("Michael", "Rose", "Boots", "40288")]
schema1 = StructType([StructField("A", StringType(), True),
                      StructField("B", StringType(), True),
                      StructField("C", StringType(), True),
                      StructField("D", StringType(), True)])
df1 = spark.createDataFrame(data=data1, schema=schema1)
I would like to transform df1 in order to have the columns of this final_columns list.
So, basically, I expect the resulting dataframe to look like this
+--------+------+------+
| A | C | E |
+--------+------+------+
| James |Smith | |
|Michael |Boots | |
+--------+------+------+
Is there any smart way to do this?
Thank you in advance
You can do so with select and a list comprehension. The idea is to loop through final_columns: if a column is in df1.columns, select it; if it's not, use lit to add it with the proper alias.
You can write similar logic with a for loop if you find list comprehensions less readable.
from pyspark.sql.functions import lit
df1.select([c if c in df1.columns else lit(None).alias(c) for c in final_columns]).show()
+-------+-----+----+
| A| C| E|
+-------+-----+----+
| James|Smith|null|
|Michael|Boots|null|
+-------+-----+----+
Here is one way: use the DataFrame drop() method with a list which represents the symmetric difference between the DataFrame's current columns and your list of final columns.
df = spark.createDataFrame([(1, 1, "1", 0.1),(1, 2, "1", 0.2),(3, 3, "3", 0.3)],('a','b','c','d'))
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 1|0.1|
| 1| 2| 1|0.2|
| 3| 3| 3|0.3|
+---+---+---+---+
# list of desired final columns
final_cols = ['a', 'c', 'd']
df2 = df.drop( *set(final_cols).symmetric_difference(df.columns) )
Note an alternate syntax for the symmetric difference operation:
df2 = df.drop( *(set(final_cols) ^ set(df.columns)) )
This gives me:
+---+---+---+
| a| c| d|
+---+---+---+
| 1| 1|0.1|
| 1| 1|0.2|
| 3| 3|0.3|
+---+---+---+
Which I believe is what you want. (Note that drop() can only remove existing columns; it cannot create a missing column such as E, so for that case combine it with the lit(None) approach above.)
Based on your requirement, I have written dynamic code. It selects columns based on the list provided and also creates a column with null values if that column is not present in the source/original dataframe.
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType

data1 = [("James", "Lee", "Smith", "36636"),
         ("Michael", "Rose", "Boots", "40288")]
schema1 = StructType([StructField("A", StringType(), True),
                      StructField("B", StringType(), True),
                      StructField("C", StringType(), True),
                      StructField("D", StringType(), True)])
df1 = spark.createDataFrame(data=data1, schema=schema1)

actual_columns = df1.schema.names
final_columns = ['A', 'C', 'E']

def Diff(li1, li2):
    return list(set(li2) - set(li1))

def Same(li1, li2):
    return list(sorted(set(li1).intersection(li2)))

df1 = df1.select(*Same(actual_columns, final_columns))
for i in Diff(actual_columns, final_columns):
    df1 = df1.withColumn(i, lit(None))

display(df1)

PySpark: Moving rows from one dataframe into another if column values are not found in second dataframe

I have two spark dataframes with similar schemas:
DF1:
id category flag
123abc type 1 1
456def type 1 1
789ghi type 2 0
101jkl type 3 0
Df2:
id category flag
123abc type 1 1
456def type 1 1
789ghi type 2 1
101xyz type 3 0
DF1 has more data than DF2, so I cannot replace it. However, DF2 will have ids not found in DF1, as well as several ids with more accurate flag data. This means there are two situations that I need resolved:
789ghi has a different flag and needs to overwrite the 789ghi in
DF1.
101xyz is not found in DF1 and needs to be moved over
Each dataframe is millions of rows, so I am looking for an efficient way to perform this operation. I am not sure if this is a situation that requires an outer join or anti-join.
You can union the two dataframes and keep the first record for each id (listing df2 first so that its rows win any tie):
from functools import reduce
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import monotonically_increasing_id, rank, col

# df2 first, so its rows outrank df1's for any shared id
df = reduce(DataFrame.unionByName, [df2, df1])
df = df.withColumn('row_num', monotonically_increasing_id())

window = Window.partitionBy("id").orderBy('row_num')
df = (df.withColumn('rank', rank().over(window))
        .filter(col('rank') == 1)
        .drop('rank', 'row_num'))
Output
+------+--------+----+
| id|category|flag|
+------+--------+----+
|101jkl| type 3| 0|
|101xyz| type 3| 0|
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
+------+--------+----+
Option 1:
I would find ids in df1 not in df2 and put them into a subset df
I would then union the subset with df2.
Or
Option 2:
Find elements in df1 that are in df2 and drop those rows and then union df2.
The approach I take would obviously be based on which is less expensive computationally.
Option 1 code (generalized with isin, since collect()[0][0] would only pick up a single missing id):
from pyspark.sql.functions import col

# ids present in df1 but not in df2 (there may be more than one)
missing_ids = [r['id'] for r in df1.select('id').subtract(df2.select('id')).collect()]
df2.union(df1.filter(col('id').isin(missing_ids))).show()
Outcome
+------+--------+----+
| id|category|flag|
+------+--------+----+
|123abc| type 1| 1|
|456def| type 1| 1|
|789ghi| type 2| 1|
|101xyz| type 3| 0|
|101jkl| type 3| 0|
+------+--------+----+

Sum Product in PySpark

I have a pyspark dataframe like this
data = [("ID1", 10, 30), ("ID2", 20, 60)]
df1 = spark.createDataFrame(data, ["ID", "colA", "colB"])
df1.show()
df1:
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|  10|  30|
|ID2|  20|  60|
+---+----+----+
I have Another dataframe like this
data = [("colA", 2), ("colB", 5)]
df2 = spark.createDataFrame(data, ["Column", "Value"])
df2.show()
df2:
+-------+------+
| Column| Value|
+-------+------+
| colA| 2|
| colB| 5|
+-------+------+
I want to divide every column in df1 by the respective value in df2. Hence df3 will look like
df3:
+---+------------+------------+
| ID|        colA|        colB|
+---+------------+------------+
|ID1|    10/2 = 5|    30/5 = 6|
|ID2|   20/2 = 10|   60/5 = 12|
+---+------------+------------+
Ultimately, I want to add colA and colB to get the final df4 per ID
df4:
+---+---------------+
| ID| finalSum|
+---+---------------+
|ID1| 5 + 6 = 11|
|ID2| 10 + 12 = 22|
+---+---------------+
The idea is to join both DataFrames together and then apply the division. Since df2 contains the column names and their respective values, we need to pivot() it first and then join it with the main table df1. (Pivoting is an expensive operation, but it is fine as long as the DataFrame is small.)
# Loading the requisite packages
from pyspark.sql.functions import col
from functools import reduce
from operator import add
# Creating the DataFrames
df1 = sqlContext.createDataFrame([('ID1', 10, 30), ('ID2', 20, 60)],('ID','ColA','ColB'))
df2 = sqlContext.createDataFrame([('ColA', 2), ('ColB', 5)],('Column','Value'))
The code is fairly generic, so we need not specify the column names ourselves. We find the column names we need to operate on: everything except ID.
# This contains the list of columns where we apply mathematical operations
columns_to_be_operated = df1.columns
columns_to_be_operated.remove('ID')
print(columns_to_be_operated)
['ColA', 'ColB']
Pivoting the df2, which we will join to df1.
# Pivoting the df2 to get the rows in column form
df2 = df2.groupBy().pivot('Column').sum('Value')
df2.show()
+----+----+
|ColA|ColB|
+----+----+
| 2| 5|
+----+----+
We rename the pivoted columns so they don't collide with df1's column names after the join, by adding a suffix _x to each.
# Dynamically changing the name of the columns in df2
df2 = df2.select([col(c).alias(c+'_x') for c in df2.columns])
df2.show()
+------+------+
|ColA_x|ColB_x|
+------+------+
| 2| 5|
+------+------+
Next we join the tables with a Cartesian join. (Note that you may run into memory issues if df2 is large.)
df = df1.crossJoin(df2)
df.show()
+---+----+----+------+------+
| ID|ColA|ColB|ColA_x|ColB_x|
+---+----+----+------+------+
|ID1| 10| 30| 2| 5|
|ID2| 20| 60| 2| 5|
+---+----+----+------+------+
Finally, we divide each column by its corresponding value and add the results. reduce() applies the two-argument function add() cumulatively to the items of the sequence.
df = df.withColumn(
    'finalSum',
    reduce(add, [col(c) / col(c + '_x') for c in columns_to_be_operated])
).select('ID', 'finalSum')
df.show()
+---+--------+
| ID|finalSum|
+---+--------+
|ID1| 11.0|
|ID2| 22.0|
+---+--------+
Note: the OP has to be careful with division by 0; the snippet above can be adapted to handle that case.

Get the top two elements in a nested list - pyspark

Let's say I have a list L=[[a,2],[a,3],[a,4],[b,4],[b,8],[b,9]]
Using pyspark I want to be able to remove the third element so that it will look like this:
[a,2]
[a,3]
[b,4]
[b,8]
I am new to pyspark and not sure what I should do here.
You can try something like this.
The first step is to group by the key column and aggregate the values into a list. Then use a udf to take the first two values of that list, and explode that column back into rows. (Note that collect_list does not guarantee ordering; sort the list first if the order of values matters.)
from pyspark.sql.functions import collect_list, udf, explode
from pyspark.sql.types import ArrayType, IntegerType

df = sc.parallelize([('a', 2), ('a', 3), ('a', 4),
                     ('b', 4), ('b', 8), ('b', 9)]).toDF(['key', 'value'])

foo = udf(lambda x: x[0:2], ArrayType(IntegerType()))

df_list = (df.groupby('key').agg(collect_list('value'))
             .withColumn('values', foo('collect_list(value)'))
             .withColumn('value', explode('values'))
             .drop('values', 'collect_list(value)'))
df_list.show()
df_list.show()
result
+---+-----+
|key|value|
+---+-----+
| b| 4|
| b| 8|
| a| 2|
| a| 3|
+---+-----+

How to add a column in pyspark if two column values is in another dataframe?

I'm very new to pyspark. I have two dataframes like this:
df1 and df2 were shown as images in the original post; the sample data in the answer below reproduces their structure.
The label column does not exist in df1 at first; I added it to illustrate the desired result. If a [user_id, sku_id] pair of df1 is in df2, I want to add a label column to df1 set to 1, otherwise 0. How can I do this in pyspark? I'm using Python 2.7.
It's possible by doing a left outer join of the two dataframes first, and then using the when and otherwise functions on a column from the right dataframe. Here is the complete solution I tried:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
# this is just data input
data1 = [[4, 3, 3], [2, 4, 3], [4, 2, 4], [4, 3, 3]]
data2 = [[4, 3, 3], [2, 3, 3], [4, 1, 4]]

# create dataframes
df1 = spark.createDataFrame(data1, schema=['userId', 'sku_id', 'type'])
df2 = spark.createDataFrame(data2, schema=['userId', 'sku_id', 'type'])

# condition for join
cond = [df1.userId == df2.userId, df1.sku_id == df2.sku_id, df1.type == df2.type]

# magic: a non-null uid means the row found a match in df2
df1.join(df2, cond, how='left_outer')\
   .select(df1.userId, df1.sku_id, df1.type, df2.userId.alias('uid'))\
   .withColumn('label', F.when(col('uid').isNotNull(), 1).otherwise(0))\
   .drop(col('uid'))\
   .show()
output :
+------+------+----+-----+
|userId|sku_id|type|label|
+------+------+----+-----+
| 2| 4| 3| 0|
| 4| 3| 3| 1|
| 4| 3| 3| 1|
| 4| 2| 4| 0|
+------+------+----+-----+
