PySpark and SQL : join and null values - python

I have a PySpark dataframe consisting of two columns, named input and target. It is the crossJoin of two single-column dataframes. Below is an example of what such a dataframe looks like.
input | target
A     | Voigt.
A     | Leica
A     | Zeiss
B     | Voigt.
B     | Leica
B     | Zeiss
C     | Voigt.
C     | Leica
C     | Zeiss
Then I have another dataframe that provides a number describing the relation between input and target. However, it is not guaranteed that every input-target pair has this numerical value. For example, A-Voigt. may have 2 as its relational value, but A-Leica may not have a value at all. Below is an example:
input | target | val
A     | Voigt. | 2
A     | Zeiss  | 1
B     | Leica  | 3
C     | Zeiss  | 5
C     | Leica  | 2
Now I want a dataframe that combines these two and looks like this:
input | target | val
A     | Voigt. | 2
A     | Leica  | null
A     | Zeiss  | 1
B     | Voigt. | null
B     | Leica  | 3
B     | Zeiss  | null
C     | Voigt. | null
C     | Leica  | 2
C     | Zeiss  | 5
I tried a left join of these two dataframes and then tried to filter the result, but I've had trouble completing it in this form.
result = input_target.join(input_target_w_val, (input_target.input == input_target_w_val.input) & (input_target.target == input_target_w_val.target), 'left')
How should I put a filter from this point, or is there another way I can achieve this?

Try using it as below -
Input DataFrames
df1 = spark.createDataFrame(data=[("A","Voigt."), ("A","Leica"), ("A","Zeiss"), ("B","Voigt."), ("B","Leica"), ("B","Zeiss"), ("C","Voigt."), ("C","Leica"), ("C","Zeiss")], schema=["input", "target"])
df1.show()
+-----+------+
|input|target|
+-----+------+
| A|Voigt.|
| A| Leica|
| A| Zeiss|
| B|Voigt.|
| B| Leica|
| B| Zeiss|
| C|Voigt.|
| C| Leica|
| C| Zeiss|
+-----+------+
df2 = spark.createDataFrame(data=[("A","Voigt.",2), ("A","Zeiss",1), ("B","Leica",3), ("C","Zeiss",5), ("C","Leica",2)], schema=["input", "target", "val"])
df2.show()
+-----+------+---+
|input|target|val|
+-----+------+---+
| A|Voigt.| 2|
| A| Zeiss| 1|
| B| Leica| 3|
| C| Zeiss| 5|
| C| Leica| 2|
+-----+------+---+
Required Output
df1.join(df2, on = ["input", "target"], how = "left_outer").select(df1["input"], df1["target"], df2["val"]).show(truncate=False)
+-----+------+----+
|input|target|val |
+-----+------+----+
|A |Leica |null|
|A |Voigt.|2 |
|A |Zeiss |1 |
|B |Leica |3 |
|B |Voigt.|null|
|B |Zeiss |null|
|C |Leica |2 |
|C |Voigt.|null|
|C |Zeiss |5 |
+-----+------+----+

You can simply specify a list of join column names.
df = df1.join(df2, ['input', 'target'], 'left')
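If you want to keep the explicit join condition from the question instead, one way to avoid the ambiguous column references is to alias the two dataframes first. A minimal sketch, assuming the same input_target and input_target_w_val dataframes as in the question:
from pyspark.sql.functions import col

a = input_target.alias("a")
b = input_target_w_val.alias("b")
# left join keeps every (input, target) pair; missing relations come back as null
result = a.join(b, (col("a.input") == col("b.input")) & (col("a.target") == col("b.target")), "left") \
          .select(col("a.input"), col("a.target"), col("b.val"))
No extra filter is needed after the left join; the missing pairs simply come back with val = null.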



Pyspark: Is there a function to split dataframe column values on the basis of comma [duplicate]

input
+--------------+-----------------------+-----------------------+
|ID |Subject |Marks |
+--------------+-----------------------+-----------------------+
|1 |maths,physics |80,90 |
|2 |Computer |73 |
|3 |music,sports,chemistry |76,89,85 |
+--------------+-----------------------+-----------------------+
Expected output
+--------------+-----------------------+-----------------------+
|ID |Subject |Marks |
+--------------+-----------------------+-----------------------+
|1 |maths |80 |
|1 |physics |90 |
|2 |Computer |73 |
|3 |music |76 |
|3 |sports |89 |
|3 |chemistry |85 |
+--------------+-----------------------+-----------------------+
I need help getting this expected output.
I have already tried the explode function, but that only works on a single column.
Another way: split the columns on "," to form arrays, zip the arrays, and leverage PySpark's inline function to achieve what you want.
from pyspark.sql.functions import col, split

df.withColumn('Subject', split(col('Subject'), ',')) \
  .withColumn('Marks', split(col('Marks'), ',')) \
  .selectExpr('ID', 'inline(arrays_zip(Subject, Marks))') \
  .show()
+---+---------+-----+
| ID| Subject|Marks|
+---+---------+-----+
| 1| maths| 80|
| 1| physics| 90|
| 2| Computer| 73|
| 3| music| 76|
| 3| sports| 89|
| 3|chemistry| 85|
+---+---------+-----+
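For completeness, a minimal sketch of how the question's sample dataframe could be created to try the expression above (this construction is assumed, not part of the original post):
# hypothetical reconstruction of the question's sample data
df = spark.createDataFrame(
    [(1, "maths,physics", "80,90"),
     (2, "Computer", "73"),
     (3, "music,sports,chemistry", "76,89,85")],
    ["ID", "Subject", "Marks"])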
You can explode multiple columns at once (note this answer uses the pandas API rather than PySpark):
cols = ['Subject', 'Marks']
df[cols] = df[cols].apply(lambda x: x.str.split(','))
df.explode(cols)
Output:
ID Subject Marks
0 1 maths 80
0 1 physics 90
1 2 Computer 73
2 3 music 76
2 3 sports 89
2 3 chemistry 85

PySpark pattern matching and assigning associated values

df1:
campaign_name               | campaign_team
einsurancep09               | other
estoreemicardcdwpnov06      | other
estoreemicardwmnov06        | other
estoreemicardgenericspnov06 | other
df2:
terms     | product_category | product
insurance | insurance        | null
def       | emi              | store
ab        | bhi              | asd
de        | lic              | cards
a         | credit           | cards
Below are my scenarios:
The 'terms' column (df2) is sorted by string length in descending order.
It should be compared against the campaign_name of df1 with a contains/like check.
Whichever terms string matches the campaign_name first, its product_category and product should be picked up and added as new columns to df1.
For the campaign_name value "einsurancep09", the terms value "insurance" is contained in the campaign_name, so its product_category and product are picked up and added to df1 in the output.
Another example: consider records whose campaign_name contains "def", "ab" and "de"; we pick the product_category and product of "def" because it appears first and is the longest compared with "ab" and "de".
Below is my code:
df1 = df1.withColumn("product_category",when(df1.campaign_name.contains(df2.terms),df2.product_category).otherwise('other'))
But, it gives me below error:
raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) terms#37,product_category#38 missing from campaign_name#16,campaign_team#17 in operator !Project [campaign_name#16, campaign_team#17, CASE WHEN Contains(campaign_name#16, terms#37) THEN product_category#38 ELSE other END AS product_category#44].;
!Project [campaign_name#16, campaign_team#17, CASE WHEN Contains(campaign_name#16, terms#37) THEN product_category#38 ELSE other END AS product_category#44]
+- Relation[campaign_name#16,campaign_team#17] csv
So where am I going wrong?
As per stack's answer, I am getting below output:
+---------------+-------------+---------+----------------+-------+
|campaign_name |campaign_team|terms |product_category|product|
+---------------+-------------+---------+----------------+-------+
|einsurancepnm06|other |insurance|Insurance |NaN |
+---------------+-------------+---------+----------------+-------+
Expected output:
If the volume of dataframe df2 is high, Spark will not handle such operations efficiently. If it is small (< 200 records), there are two feasible approaches we can use to solve this:
CROSS Join
UDF
Approach 1: CROSS Join
Steps:
Add a new column rw to df2 and assign a row number in the required order.
Cross join df1 and the sorted df2 to create a new dataframe df3.
Use the contains column function to add a new column match.
Select the first row after partitioning by all df1 columns and ordering by the match flag (descending) and the df2 row number.
Code
>>> df1.show(truncate=False)
+---------------------------+-------------+
|campaign_name |campaign_team|
+---------------------------+-------------+
|einsurancep09 |other |
|estoreemicardcdwpnov06 |other |
|estoreemicardwmnov06 |other |
|estoreemicardgenericspnov06|other |
|abcdefcdwpnov06 |other |
|abcdefwmnov06 |other |
|abcdefgenericspnov06 |other |
+---------------------------+-------------+
>>> df2.show(truncate=False)
+---------+----------------+-------+---+
|terms |product_category|product|rw |
+---------+----------------+-------+---+
|insurance|insurance |null |1 |
|def |emi |store |2 |
|ab |bhi |asd |3 |
|de |lic |cards |4 |
|a |credit |cards |5 |
+---------+----------------+-------+---+
>>> from pyspark.sql.functions import col, row_number, lit
>>> from pyspark.sql.window import Window
>>> df3 = df1.crossJoin(df2)
>>> df4 = df3.withColumn("match", col("campaign_name").contains(col("terms")))
>>> W = Window.partitionBy(col("campaign_name"), col("campaign_team")).orderBy(col("match").desc(), col("rw"))
>>> finalDF = df4.withColumn("rn", row_number().over(W)).filter(col("rn") == lit(1)).drop("rn", "terms","rw","match")
>>> finalDF.show(truncate=False)
+---------------------------+-------------+----------------+-------+
|campaign_name |campaign_team|product_category|product|
+---------------------------+-------------+----------------+-------+
|estoreemicardgenericspnov06|other |credit |cards |
|abcdefwmnov06 |other |emi |store |
|estoreemicardcdwpnov06 |other |credit |cards |
|estoreemicardwmnov06 |other |credit |cards |
|einsurancep09 |other |insurance |null |
|abcdefgenericspnov06 |other |emi |store |
|abcdefcdwpnov06 |other |emi |store |
+---------------------------+-------------+----------------+-------+
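Since df2 is assumed to be small here (< 200 records), you can also hint Spark to broadcast it in the cross join; the rest of the window logic above stays the same. A minimal sketch, not part of the original answer:
>>> from pyspark.sql.functions import broadcast
>>> # replicate the small terms table to every executor instead of shuffling the large df1
>>> df3 = df1.crossJoin(broadcast(df2))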
Approach 2: UDF
Steps:
Add a new column rw to df2 and assign a row number in the required order.
Based on the row number, collect all df2 rows into a single row and create a new dataframe df3.
Use either of the approaches below to create a new dataframe df4:
convert the df3 row into a string variable and add it as a string literal column to df1,
or
cross join df1 and df3.
Declare the UDF as below.
Call the UDF and add a new column with the return value.
Create the required columns from the returned output.
Code
>>> from pyspark.sql.functions import lit, col, concat, concat_ws, collect_list, sort_array, split, udf
>>> from pyspark.sql.types import StringType
>>> df3 = df2.na.fill("").groupBy(lit(1)).agg(
...         sort_array(collect_list(concat(col("rw"), lit(":"), col("terms"), lit(":"),
...                                        col("product_category"), lit(":"), col("product")))).alias("Check")
...     ).withColumn("Check", concat_ws(",", col("Check"))).drop("1")
>>> df3.show(truncate=False)
+-----------------------------------------------------------------------------------+
|Check |
+-----------------------------------------------------------------------------------+
|1:insurance:insurance:,2:def:emi:store,3:ab:bhi:asd,4:de:lic:cards,5:a:credit:cards|
+-----------------------------------------------------------------------------------+
>>> df4 = df1.crossJoin(df3)
>>> def categoryFunction(name, Check):
...     checkList = Check.lower().split(",")
...     out = ""
...     match = False
...     for Key in checkList:
...         keyword = Key.split(":", 2)
...         terms = keyword[1]
...         tempOut = keyword[2]
...         if terms in name.lower():
...             out = tempOut
...             match = True
...         if match:
...             break
...     return out
>>> categoryUDF = udf(categoryFunction, StringType())
>>> finalDF = df4.withColumn("out", categoryUDF(col("campaign_name"), col("Check"))).drop("Check") \
...     .withColumn("out", split(col("out"), ":")) \
...     .withColumn("product_category", col("out")[0]) \
...     .withColumn("product", col("out")[1]).drop("out")
>>> finalDF.show(truncate=False)
+---------------------------+-------------+----------------+-------+
|campaign_name |campaign_team|product_category|product|
+---------------------------+-------------+----------------+-------+
|einsurancep09 |other |insurance | |
|estoreemicardcdwpnov06 |other |credit |cards |
|estoreemicardwmnov06 |other |credit |cards |
|estoreemicardgenericspnov06|other |credit |cards |
|abcdefcdwpnov06 |other |emi |store |
|abcdefwmnov06 |other |emi |store |
|abcdefgenericspnov06 |other |emi |store |
+---------------------------+-------------+----------------+-------+
Assumption
Dataset df1 (the terms table in this answer) needs an order to meet the OP's requirement, so I introduce the rec_no column.
df = spark.sql("""
select 'abcdefcdwpnovo6' campaign_name, 'other' campaign_team union all
select 'abcdefdwpnovo6' , 'other' union all
select 'abcdefgenericpnovo6' , 'other'
""")
df.createOrReplaceTempView("df")
df.show()
+-------------------+-------------+
| campaign_name|campaign_team|
+-------------------+-------------+
| abcdefcdwpnovo6| other|
| abcdefdwpnovo6| other|
|abcdefgenericpnovo6| other|
+-------------------+-------------+
df1 = spark.sql("""
select 1 rec_no, 'def' terms, 'emi' product_category, 'store' product union all
select 2, 'abc' ,'bhi' ,'asd' union all
select 3, 'de' ,'lic' ,'cards' union all
select 4, 'a' ,'credit' ,'cards'
""")
df1.createOrReplaceTempView("df1")
df1.show()
+------+-----+----------------+-------+
|rec_no|terms|product_category|product|
+------+-----+----------------+-------+
| 1| def| emi| store|
| 2| abc| bhi| asd|
| 3| de| lic| cards|
| 4| a| credit| cards|
+------+-----+----------------+-------+
Output:
You can drop the ps, rk and rec_no columns
spark.sql("""
with t1 ( select * from df a cross join df1 b ),
t2 ( select rec_no, campaign_name,campaign_team,terms,product_category,product x, position(terms,campaign_name) ps,
rank() over(order by rec_no) rk from t1 where position(terms,campaign_name)>0 )
select * from t2 where rk=1
""").show()
+------+-------------------+-------------+-----+----------------+-----+---+---+
|rec_no| campaign_name|campaign_team|terms|product_category| x| ps| rk|
+------+-------------------+-------------+-----+----------------+-----+---+---+
| 1| abcdefcdwpnovo6| other| def| emi|store| 4| 1|
| 1| abcdefdwpnovo6| other| def| emi|store| 4| 1|
| 1|abcdefgenericpnovo6| other| def| emi|store| 4| 1|
+------+-------------------+-------------+-----+----------------+-----+---+---+
Update-1:
The OP's question is still not clear. Try below.
spark.sql("""
with t1 ( select * from df a cross join df1 b ),
t2 ( select rec_no, campaign_name,campaign_team,terms,product_category,product, position(product_category,campaign_name) ps,
rank() over(partition by product_category order by rec_no) rk from t1 where position(product_category,campaign_name)>0
)
select * from t2 where rk=1 order by rec_no
""").show(truncate=False)
+------+---------------------------+-------------+---------+----------------+-------+---+---+
|rec_no|campaign_name |campaign_team|terms |product_category|product|ps |rk |
+------+---------------------------+-------------+---------+----------------+-------+---+---+
|1 |einsurancep09 |other |insurance|insurance |null |2 |1 |
|2 |estoreemicardcdwpnov06 |other |def |emi |store |7 |1 |
|2 |estoreemicardgenericspnov06|other |def |emi |store |7 |1 |
|2 |estoreemicardwmnov06 |other |def |emi |store |7 |1 |
+------+---------------------------+-------------+---------+----------------+-------+---+---+

Pyspark pivot on multiple column names

I currently have a dataframe df
id | c1 | c2 | c3 |
1 | diff | same | diff
2 | same | same | same
3 | diff | same | same
4 | same | same | same
I want my output to look like
name| diff | same
c1 | 2 | 2
c2 | 0 | 4
c3 | 1 | 3
When I try :
df.groupby('c2').pivot('c2').count() -> transformation A
|c2 | diff | same |
|same | null | 2
|diff | 2 | null
I'm assuming I need to write a loop for each column and pass it through transformation A?
But I'm having issues getting transformation A right.
Please help
Pivot is an expensive shuffle operation and should be avoided if possible. Try this logic with arrays_zip and explode to dynamically collapse the columns, then group by and aggregate.
from pyspark.sql import functions as F
df.withColumn("cols", F.explode(F.arrays_zip(F.array([F.array(F.col(x),F.lit(x))\
for x in df.columns if x!='id']))))\
.withColumn("name", F.col("cols.0")[1]).withColumn("val", F.col("cols.0")[0]).drop("cols")\
.groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()
#+----+----+----+
#|name|diff|same|
#+----+----+----+
#| c1| 2| 2|
#| c2| 0| 4|
#| c3| 1| 3|
#+----+----+----+
You can also do this by creating a map dynamically and exploding the map-type column.
from pyspark.sql import functions as F
from itertools import chain
df.withColumn("cols", F.create_map(*(chain(*[(F.lit(name), F.col(name))\
for name in df.columns if name!='id']))))\
.select(F.explode("cols").alias("name","val"))\
.groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()
#+----+----+----+
#|name|diff|same|
#+----+----+----+
#| c1| 2| 2|
#| c2| 0| 4|
#| c3| 1| 3|
#+----+----+----+
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,'diff','same','diff'),(2,'same','same','same'),(3,'diff','same','same'),(4,'same','same','same')],['idcol','C1','C2','C3'])
df.createOrReplaceTempView("MyTable")
#spark.sql("select * from MyTable").collect()
x1=spark.sql("select idcol, 'C1' AS col, C1 from MyTable union all select idcol, 'C2' AS col, C2 from MyTable union all select idcol, 'C3' AS col, C3 from MyTable")
#display(x1)
x2=x1.groupBy('col').pivot('C1').agg(count('C1')).orderBy('col')
display(x2)
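As a variant of the union-based reshaping above (not from the original answers), Spark's stack() expression can unpivot the three columns in one pass before the pivot; a sketch using the same df as in this answer:
# unpivot C1..C3 into (col, val) pairs, then pivot on the values
x1 = df.selectExpr("idcol", "stack(3, 'C1', C1, 'C2', C2, 'C3', C3) as (col, val)")
x2 = x1.groupBy('col').pivot('val', ['diff', 'same']).count().na.fill(0).orderBy('col')
x2.show()
The na.fill(0) turns the missing diff count for c2 into 0, matching the expected output.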

Insert a column from one df into another df as a column. Stuck for hours! merge (join) with no common column

For these two dataframes I have tried different code, from join (which requires a common column) to union and some other merge code, but I can't get the result I want. I also tried the straightforward:
data.join(tdf, how='inner').select('*')
data.join(tdf, how='outer').select('*')
Neither of the two gave me the dataframe I want.
data.show()
|_c0| description| medical_specialty| sample_name| transcription| keywords|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1| Consult for lapa...| Bariatrics|Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
| 2| Consult for lapa...| Bariatrics|Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
| 3| 2-D M-Mode. Dopp...| Cardiovascular /...|2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
| 4| 2-D Echocardiogram| Cardiovascular /...|2-D Echocardiogr...|1. The left vent...|cardiovascular / ...|
How do I add the age column as a column of the above df? Or how do I merge these dataframes?
tdf.show()
|age|
+---+
| |
| 42|
| |
| |
| 30|
| |
| |
goal:
+---+--------------------+--------------------+-------------------+------------------+----------------------+---+
|_c0| description| medical_specialty| sample_name| transcription| keywords|age|
+---+--------------------+--------------------+-------------------+------------------+----------------------+---+
| 1| Consult for lapa...| Bariatrics|Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...| |
| 2| Consult for lapa...| Bariatrics|Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...| 42|
| 3| 2-D M-Mode. Dopp...| Cardiovascular /...|2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...| |
| 4| 2-D Echocardiogram| Cardiovascular /...|2-D Echocardiogr...|1. The left vent...|cardiovascular / ...| |
As per my understanding, you want to join the two dataframes based on their row number; your data dataframe contains a row number (_c0) but your age dataframe doesn't.
So you can add a row number and then join the dataframes like:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

tdf = tdf.withColumn('Row_id', f.row_number().over(Window.orderBy(f.lit('a'))))
joined_data = data.join(tdf, data._c0 == tdf.Row_id).select('*')
This will join the age column to your existing dataframe.
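If neither dataframe had a usable key, the same idea can be applied to both sides. A hedged sketch (note that ordering by a constant pulls all rows into a single partition, so this is only suitable for small data):
import pyspark.sql.functions as f
from pyspark.sql.window import Window

w = Window.orderBy(f.lit('a'))  # constant ordering: acceptable for small data only
data_idx = data.withColumn('Row_id', f.row_number().over(w))
tdf_idx = tdf.withColumn('Row_id', f.row_number().over(w))
joined_data = data_idx.join(tdf_idx, on='Row_id', how='left').drop('Row_id')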

Pyspark : order/sort by then group by and concat string

I have a dataframe like this:
usr sec scrpt
0 1 5 This
1 2 10 is
2 3 12 a
3 1 7 string
4 2 4 oreo
I am trying to order/sort by usr and sec, then group by usr and concatenate the strings there. This table records, for every call by a user, in which second he spoke which word. The resulting dataframe should look like:
user concated
1 this string
2 oreo is
3 a
I have tried the below in Python (pandas) and it works fine:
df.sort_values(by=['usr','sec'], ascending=[True, True]).groupby('usr')['scrpt'].apply(lambda x: ','.join(x)).reset_index()
Could anyone give me the equivalent in PySpark?
From Spark 2.4+, use the array_join, sort_array and transform functions for this case.
#sample dataframe
from pyspark.sql.functions import expr, array_join, concat_ws, lower

df=spark.createDataFrame([(1,5,"This"),(2,10,"is"),(3,12,"a"),(1,7,"string"),(2,4,"oreo")],["usr","sec","scrpt"])
df.show()
#+---+---+------+
#|usr|sec| scrpt|
#+---+---+------+
#| 1| 5| This|
#| 2| 10| is|
#| 3| 12| a|
#| 1| 7|string|
#| 2| 4| oreo|
#+---+---+------+
df.groupBy("usr").agg(array_join(expr("""transform(sort_array(collect_list(struct(sec,scrpt)),True), x -> x.scrpt)""")," ").alias("concated")).orderBy("usr").show(10,False)
df.groupBy("usr").agg(concat_ws(" ",expr("""transform(sort_array(collect_list(struct(sec,scrpt)),True), x -> x.scrpt)""")).alias("concated")).orderBy("usr").show(10,False)
#+---+-----------+
#|usr|concated |
#+---+-----------+
#|1 |This string|
#|2 |oreo is |
#|3 |a |
#+---+-----------+
#lower case
df.groupBy("usr").agg(lower(array_join(expr("""transform(sort_array(collect_list(struct(sec,scrpt)),True), x -> x.scrpt)""")," ")).alias("concated")).orderBy("usr").show(10,False)
#+---+-----------+
#|usr|concated |
#+---+-----------+
#|1 |this string|
#|2 |oreo is |
#|3 |a |
#+---+-----------+
You can use Window functionality to accomplish what you want in PySpark.
import pyspark.sql.functions as sf
from pyspark.sql.window import Window
# Construct a window to construct sentences
sentence_window = Window.partitionBy('usr').orderBy(sf.col('sec').asc())
# Construct a window to get the last sentence. The others will be sentence fragments spoken by the user.
rank_window = Window.partitionBy('usr').orderBy(sf.col('sec').desc())
user_sentences = spark_data_df.select(
    'usr',
    sf.collect_list(sf.col('scrpt')).over(sentence_window).alias('sentence'),
    sf.rank().over(rank_window).alias('rank'))
user_sentences = user_sentences.filter("rank = 1").drop('rank')
user_sentences = user_sentences.withColumn('sentence', sf.concat_ws(' ', sf.col('sentence')))
user_sentences.show(10, False)
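With the sample dataframe from the first answer (assuming spark_data_df is that same df), this should give the same result as the array_join approach: usr 1 -> "This string", usr 2 -> "oreo is", usr 3 -> "a".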
