Let's say I have a list L=[[a,2],[a,3],[a,4],[b,4],[b,8],[b,9]]
Using PySpark, I want to remove the third element of each group (i.e. keep only the first two values per key) so that the result looks like this:
[a,2]
[a,3]
[b,4]
[b,8]
I am new to pyspark and not sure what I should do here.
You can try something like this.
The first step is to group by the key column and aggregate the values into a list. Then use a UDF to take the first two values of the list, and finally explode that column.
from pyspark.sql.functions import collect_list, udf, explode
from pyspark.sql.types import ArrayType, IntegerType

df = sc.parallelize([('a', 2), ('a', 3), ('a', 4),
                     ('b', 4), ('b', 8), ('b', 9)]).toDF(['key', 'value'])

# UDF that keeps only the first two values of the collected list
foo = udf(lambda x: x[0:2], ArrayType(IntegerType()))

df_list = (df.groupby('key')
             .agg(collect_list('value'))  # aggregated column is named 'collect_list(value)'
             .withColumn('values', foo('collect_list(value)'))
             .withColumn('value', explode('values'))
             .drop('values', 'collect_list(value)'))
df_list.show()
Result:
+---+-----+
|key|value|
+---+-----+
| b| 4|
| b| 8|
| a| 2|
| a| 3|
+---+-----+
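If you are on Spark 2.4 or later, you could probably avoid the Python UDF entirely with the built-in slice function. Here is a minimal sketch of that variant, reusing the df defined above (note that, as with the UDF version, collect_list does not guarantee element order):
from pyspark.sql import functions as F

# collect_list gathers the values per key, F.slice keeps the first two,
# and explode turns them back into one row per value
df_sliced = (df.groupby('key')
               .agg(F.slice(F.collect_list('value'), 1, 2).alias('values'))
               .withColumn('value', F.explode('values'))
               .drop('values'))
df_sliced.show()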
Related
I have a pyspark dataframe and would like to apply a UDF to a column with null values.
Below is my dataframe:
+----+----+
| a| b|
+----+----+
|null| 00|
|.Abc|null|
|/5ee| 11|
|null| 0|
+----+----+
Below is the desired dataframe (remove punctuation and convert the string values in column a to upper case where the values are not null):
+----+----+
| a| b|
+----+----+
|null| 00|
| ABC|null|
| 5EE| 11|
|null| 0|
+----+----+
Below is my UDF and code:
import pyspark.sql.functions as F
import re
remove_punct = F.udf(lambda x: re.sub('[^\w\s]', '', x))
df = df.withColumn('a', F.when(F.col("a").isNotNull(), F.upper(remove_punct(F.col("a")))))
Below is the error:
TypeError: expected string or bytes-like object
Can you please suggest what would be the optimal solution to get the desired DF?
Thanks in advance!
Use regexp_replace. No need for a UDF.
df = df.withColumn('a', F.upper(F.regexp_replace(F.col('a'), r'[^\w\s]', '')))
If you insist on using a UDF, you need to handle the null values inside it:
remove_punct = F.udf(lambda x: re.sub(r'[^\w\s]', '', x) if x is not None else None)
df = df.withColumn('a', F.upper(remove_punct(F.col("a"))))
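For completeness, here is a minimal self-contained sketch (reconstructing the example dataframe from the question, with both columns assumed to be strings) that shows the regexp_replace approach end to end:
from pyspark.sql import functions as F

# Rebuild the example dataframe from the question (both columns as strings)
df = spark.createDataFrame(
    [(None, '00'), ('.Abc', None), ('/5ee', '11'), (None, '0')],
    ['a', 'b'])

# regexp_replace and upper both pass nulls through, so no null handling is needed
df = df.withColumn('a', F.upper(F.regexp_replace(F.col('a'), r'[^\w\s]', '')))
df.show()  # should match the desired dataframe shown above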
I have a pyspark dataframe that looks like this:
+---------------+
|           list|
+---------------+
|        1,1,1,1|
|New,Upgrade,Old|
+---------------+
How can I generate a field that counts the comma-separated elements? The ideal dataframe looks like this:
+---------------+-----+
|           list|count|
+---------------+-----+
|        1,1,1,1|    4|
|New,Upgrade,Old|    3|
+---------------+-----+
Use the split and size functions.
Example:
df = spark.createDataFrame([('1,1,1,1',), ('New,Upgrade,Old',)], ['list'])
df.show()
#+---------------+
#| list|
#+---------------+
#| 1,1,1,1|
#|New,Upgrade,Old|
#+---------------+
from pyspark.sql.functions import *
df.withColumn("count",size(split(col("list"),","))).show()
#+---------------+-----+
#| list|count|
#+---------------+-----+
#| 1,1,1,1| 4|
#|New,Upgrade,Old| 3|
#+---------------+-----+
Input DataFrame:
id,page,location,trlmonth
1,mobile,chn,08/2018
2,product,mdu,09/2018
3,product,mdu,09/2018
4,mobile,chn,08/2018
5,book,delhi,10/2018
7,music,ban,11/2018
Output DataFrame:
userdetail,count
mobile-chn-08/2018,2
product-mdu-09/2018,2
book-delhi-10/2018,1
music-ban-11/2018,1
I tried merging a single column into one, but how do I merge multiple columns into one?
from pyspark.sql import functions as F

df2 = (df
       .groupby("id")
       .agg(F.concat_ws("-", F.sort_array(F.collect_list("product"))).alias("products"))
       .groupby("products")
       .agg(F.count("id").alias("count")))
We can just group by the user-detail columns and get the count. Try this:
>>> df.orderBy('trlmonth').groupby('page','location','trlmonth').count().select(F.concat_ws('-','page','location','trlmonth').alias('user_detail'),'count').show()
+-------------------+-----+
| user_detail|count|
+-------------------+-----+
| mobile-chn-08/2018| 2|
|product-mdu-09/2018| 2|
| book-delhi-10/2018| 1|
| music-ban-11/2018| 1|
+-------------------+-----+
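For reference, here is a sketch that rebuilds the input DataFrame from the CSV-style data above (column names and types are assumed) and runs the same aggregation:
from pyspark.sql import functions as F

# Rebuild the input DataFrame from the CSV-style data in the question
data = [(1, 'mobile', 'chn', '08/2018'),
        (2, 'product', 'mdu', '09/2018'),
        (3, 'product', 'mdu', '09/2018'),
        (4, 'mobile', 'chn', '08/2018'),
        (5, 'book', 'delhi', '10/2018'),
        (7, 'music', 'ban', '11/2018')]
df = spark.createDataFrame(data, ['id', 'page', 'location', 'trlmonth'])

# Group on the three columns, count, then concatenate them into userdetail
(df.groupby('page', 'location', 'trlmonth')
   .count()
   .select(F.concat_ws('-', 'page', 'location', 'trlmonth').alias('userdetail'),
           'count')
   .show())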
I'm very new to pyspark. I have two dataframes like this:
df1: [image in the original post]
df2: [image in the original post]
The label column in df1 does not exist at first; I added it later. If a [user_id, sku_id] pair from df1 is also in df2, then I want to add a label column to df1 and set it to 1, otherwise 0, just as df1 shows. How can I do this in PySpark? I'm using Python 2.7.
It's possible by doing a left outer join on the two dataframes first, and then using the when and otherwise functions on one of the columns of the right dataframe. Here is the complete solution I tried:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# this is just data input
data1 = [[4, 3, 3], [2, 4, 3], [4, 2, 4], [4, 3, 3]]
data2 = [[4, 3, 3], [2, 3, 3], [4, 1, 4]]

# create dataframes
df1 = spark.createDataFrame(data1, schema=['userId', 'sku_id', 'type'])
df2 = spark.createDataFrame(data2, schema=['userId', 'sku_id', 'type'])

# condition for join
cond = [df1.userId == df2.userId, df1.sku_id == df2.sku_id, df1.type == df2.type]

# magic: left outer join, then flag rows that found a match in df2
df1.join(df2, cond, how='left_outer')\
   .select(df1.userId, df1.sku_id, df1.type, df2.userId.alias('uid'))\
   .withColumn('label', F.when(col('uid') > 0, 1).otherwise(0))\
   .drop(col('uid'))\
   .show()
Output:
+------+------+----+-----+
|userId|sku_id|type|label|
+------+------+----+-----+
| 2| 4| 3| 0|
| 4| 3| 3| 1|
| 4| 3| 3| 1|
| 4| 2| 4| 0|
+------+------+----+-----+
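As a small side note, the col('uid') > 0 test only works here because the ids are positive; a slightly safer variant is to check for null after the left outer join. A sketch, reusing df1, df2 and cond from above:
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Same join as above, but flag matches by checking whether the right side is null
(df1.join(df2, cond, how='left_outer')
    .select(df1.userId, df1.sku_id, df1.type, df2.userId.alias('uid'))
    .withColumn('label', F.when(col('uid').isNotNull(), 1).otherwise(0))
    .drop('uid')
    .show())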
I'd like to enumerate grouped values just like with Pandas:
Enumerate each row for each group in a DataFrame
What is a way to do this in Spark/Python?
With the row_number window function:
from pyspark.sql.functions import row_number
from pyspark.sql import Window
w = Window.partitionBy("some_column").orderBy("some_other_column")
df.withColumn("rn", row_number().over(w))
You can achieve this at the RDD level by doing:
rdd = sc.parallelize(['a', 'b', 'c'])
df = spark.createDataFrame(rdd.zipWithIndex())
df.show()
It will result in:
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
If you only need a unique ID, not truly consecutive indexing, you may also use zipWithUniqueId(), which is more efficient since it is done locally on each partition.
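For example (a quick sketch; the IDs from zipWithUniqueId are unique but generally not consecutive):
rdd = sc.parallelize(['a', 'b', 'c'])
# Items in the k-th partition get ids k, n+k, 2n+k, ... (n = number of partitions),
# so the ids are unique but not necessarily 0, 1, 2, ...
df = spark.createDataFrame(rdd.zipWithUniqueId())
df.show()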