Calculate values from two dataframes in PySpark - python

I'm trying to group and sum for a PySpark (2.4) Dataframe but can't only get values one by one.
I've the following dataframe :
data.groupBy("card_scheme", "failed").count().show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
I'm trying to calculate the formula X = false / (false + true) for each card_scheme and still get one dataframe in the end.
I'm expecting something like:
| card_scheme | X |
|-------------|---|
| jcb | 1 |
| .... | . |
| visa | 0.9846| (which is 126372 / (126372 + 1975)
| ... | . |

Creating the dataset
myValues = [('jcb',False,4),('american express', False, 22084),('AMEX',False,4),('mastercard',True,1122),('visa',True,1975),('visa',False,126372),('CB',False,6),('discover',False,2219),('maestro',False,2),('VISA',False,13),('mastercard',False,40856),('MASTERCARD',False,9)]
df = sqlContext.createDataFrame(myValues,['card_Scheme','failed','count'])
df.show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
Method 1: This method will be slower, as it involves a traspose via pivot.
df=df.groupBy("card_Scheme").pivot("failed").sum("count")
df=df.withColumn('X',when((col('True').isNotNull()),(col('false')/(col('false')+col('true')))).otherwise(1))
df=df.select('card_Scheme','X')
df.show()
+----------------+------------------+
| card_Scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
Method 2: Use SQL - you can do so the via windows function. This will be a lot faster.
from pyspark.sql.window import Window
df = df.groupBy("card_scheme", "failed").agg(sum("count"))\
.withColumn("X", col("sum(count)")/sum("sum(count)").over(Window.partitionBy(col('card_scheme'))))\
.where(col('failed')== False).drop('failed','sum(count)')
df.show()
+----------------+------------------+
| card_scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+

First split root dataframe into two dataframes:
df_true = data.filter(data.failed == True).alias("df1")
df_false =data.filter(data.failed == False).alias("df2")
Then doing full outer join we can get final result:
from pyspark.sql.functions import col,when
df_result = df_true.join(df_false,df_true.card_scheme == df_false.card_scheme, "outer") \
.select(when(col("df1.card_scheme").isNotNull(), col("df1.card_scheme")).otherwise(col("df2.card_scheme")).alias("card_scheme") \
, when(col("df1.failed").isNotNull(), (col("df2.count")/(col("df1.count") + col("df2.count")))).otherwise(1).alias("X"))
No need to do groupby, just extra two dataframes and joining.

data.groupBy("card_scheme").pivot("failed").agg(count("card_scheme")) should work. I am not sure about the agg(count(any_column)), but the clue is pivot function. In result you'll get two new columns: false and true. Then you can easily calculate the x = false / (false + true).

A simple solution would be to do a second groupby:
val grouped_df = data.groupBy("card_scheme", "failed").count() // your dataframe
val with_countFalse = grouped_df.withColumn("countfalse", when($"failed" === "false", $"count").otherwise(lit(0)))
with_countFalse.groupBy("card_scheme").agg(when($"failed" === "false", $"count").otherwise(lit(0)))) / sum($"count")).show()
The idea is that you can create a second column which has the failed in the failed=false and 0 otherwise. This means that the sum of the count column gives you false + true while the sum of the countfalse gives just the false. Then simply do a second groupby
Note: Some of the other answers use pivot. I believe the pivot solution would be slower (it does more), however, if you do choose to use it, add the specific values to the pivot call, i.e. pivot("failed", ["true", "false"]) to improve performance, otherwise spark would have to do a two path (the first to find the values)

from pyspark.sql import functions as func
from pyspark.sql.functions import col
data = data.groupby("card_scheme", "failed").count()
Create 2 new dataframes:
a = data.filter(col("failed") == "false").groupby("card_scheme").agg(func.sum("count").alias("num"))
b = data.groupby("card_scheme").agg(func.sum("count").alias("den"))
Join both the dataframes:
c = a.join(b, a.card_scheme == b.card_scheme).drop(b.card_scheme)
Divide one column with another:
c.withColumn('X', c.num/c.den)

Related

How to use ALS with multiple implicit feedbacks?

In the ALS example given in PySpark as per this documentation - http://spark.apache.org/docs/latest/ml-collaborative-filtering.html) the data used has explicit feedback in one column. The data is like this:
| User | Item | Rating |
| --- | --- | --- |
| First | A | 2 |
| Second | B | 3|
However, in my case I have implicit feedbacks in multiple columns like this:
| User | Item | Clicks | Views | Purchase |
| --- | --- | --- | --- | --- |
| First | A | 20 | 35 | 3 |
| Second | B | 3| 12 | 0 |
I know we can use implicit feedback by setting implicitPrefs as False. However, it only accepts a single column. How to use multiple columns?
I found this question: How to manage multiple positive implicit feedbacks? However, it is not related with Spark and Alternating Least Square method. Do I have to manually assign a weighting scheme as per that answer? or is there a better solution in PySpark?
I have thoroughly Researched your issue, i haven't found passing multiple columns in ALS, most of the such problems are being solved by manually weighing and creating Rating column.
Below is my solution
Create indexing for Views, Clicks and Purchase value as below
Extract Smallest value (except 0) and devide all ements for same colmn by it
example : min value for Purchase col is 3
so 3/3, 10/3, 20/3 .. etc
Now after getting indexed value for these columns calculate Rating
Below is the formula for Rating
Rating = 60% of Purchase + 30% of Clicks + 10% of Views
data.show()
+------+----+------+-----+--------+
| User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First| A| 20| 35| 3|
|Second| B| 3| 12| 0|
| Three| C| 4| 15| 20|
| Four| D| 5| 16| 10|
+------+----+------+-----+--------+
df1 = data.sort('Purchase').select('Purchase')
df= df1.filter(df1.Purchase >0)
purch_index = df.first()['Purchase']
df2 = data.sort('Views').select('Views')
df2= df2.filter(df2.Views >0)
Views_index = df2.first()['Views']
f3 = data.sort('Clicks').select('Clicks')
df3= df3.filter(df3.Clicks >0)
CLicks_index = df3.first()['Clicks']
semi_rawdf = data.withColumn('Clicks',round(col('Clicks')/CLicks_index)).withColumn('Views',round(col('Views')/Views_index)).withColumn('Purchase',round(col('Purchase')/purch_index))
semi_rawdf.show()
+------+----+------+-----+--------+
| User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First| A| 7.0| 3.0| 1.0|
|Second| B| 1.0| 1.0| 0.0|
| Three| C| 1.0| 1.0| 7.0|
| Four| D| 2.0| 1.0| 3.0|
+------+----+------+-----+--------+
from pyspark.sql.types import DecimalType
from decimal import Decimal
refined_df = semi_rawdf.withColumn('Rating',((col('Clicks')*0.3)+round(col('Views')*0.1)+round(col('Purchase')*0.6)))
refined_df = refined_df.withColumn('Rating', col('Rating').cast(DecimalType(6,2)))
refined_df.select('User','Item','Rating').show()
+------+----+------+
| User|Item|Rating|
+------+----+------+
| First| A| 3.10|
|Second| B| 0.30|
| Three| C| 4.30|
| Four| D| 2.60|
+------+----+------+

How to join two Spark DataFrame and operate their share column?

I have 2 DataFrame like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foo|
| b| bar|
| c| egg|
| d| fog|
+--+-----------+
and this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| hoi|
| b| hei|
| c| hai|
| e| hui|
+--+-----------+
I want to join them to be like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foohoi|
| b| barhei|
| c| egghai|
| d| fog|
| e| hui|
+--+-----------+
so, the column some_string from the first dataframe is concantenated to the column some_string from the second dataframe. If I am using
df_join = df1.join(df2,on='id',how='outer')
it would return
+--+-----------+-----------+
|id|some_string|some_string|
+--+-----------+-----------+
| a| foo| hoi|
| b| bar| hei|
| c| egg| hai|
| d| fog| null|
| e| null| hui|
+--+-----------+-----------+
Is there any way to do it?
You need to use when in order to achieve proper concatenation. Other than that the way you were using outer join was almost correct.
You need to check if anyone of these two columns is Null or not Null and then do the concatenation.
from pyspark.sql.functions import col, when, concat
df1 = sqlContext.createDataFrame([('a','foo'),('b','bar'),('c','egg'),('d','fog')],['id','some_string'])
df2 = sqlContext.createDataFrame([('a','hoi'),('b','hei'),('c','hai'),('e','hui')],['id','some_string'])
df_outer_join=df1.join(df2.withColumnRenamed('some_string','some_string_x'), ['id'], how='outer')
df_outer_join.show()
+---+-----------+-------------+
| id|some_string|some_string_x|
+---+-----------+-------------+
| e| null| hui|
| d| fog| null|
| c| egg| hai|
| b| bar| hei|
| a| foo| hoi|
+---+-----------+-------------+
df_outer_join = df_outer_join.withColumn('some_string_concat',
when(col('some_string').isNotNull() & col('some_string_x').isNotNull(),concat(col('some_string'),col('some_string_x')))
.when(col('some_string').isNull() & col('some_string_x').isNotNull(),col('some_string_x'))
.when(col('some_string').isNotNull() & col('some_string_x').isNull(),col('some_string')))\
.drop('some_string','some_string_x')
df_outer_join.show()
+---+------------------+
| id|some_string_concat|
+---+------------------+
| e| hui|
| d| fog|
| c| egghai|
| b| barhei|
| a| foohoi|
+---+------------------+
Considering you want to perform an outer join you can try the following:
from pyspark.sql.functions import concat, col, lit, when
df_join= df1.join(df2,on='id',how='outer').when(isnull(df1.some_string1), ''). when(isnull(df2.some_string2),'').withColumn('new_column',concat(col('some_string1'),lit(''),col('some_string2'))).select('id','new_column')
(Please note that the some_string1 and 2 refer to the some_string columns from the df1 and df2 dataframes. I would advise you to name them differently instead of giving the same name some_string, so that you can call them)

Most performant way to perform custom one-hot-encoding on a PySpark dataframe?

Let's say we have this PySpark dataframe:
+----+-------------+
| id | string_data |
+----+-------------+
| 1 | "test" |
+----+-------------+
| 2 | null |
+----+-------------+
| 3 | "9" |
+----+-------------+
| 4 | "deleted__" |
I want to perform some operation on this that will result in this dataframe:
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1 | "test" | 0 | 0 | 1 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2 | null | 1 | 0 | 0 | 0 |im
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3 | "9" | 0 | 1 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4 | "deleted__" | 0 | 0 | 0 | 1 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| | | | | | |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
Where each of the new columns has either a 1 or a 0 depending on the truth value. I have currently implemented this using a custom UDF that checks the value of the string_data column, but this is incredibly slow. I have also tried implementing a UDF that does not create new columns but instead overwrites the original one with an encoded vector [1, 0, 0...], etc. This is also too slow because we have to apply this to millions of rows and thousands of columns.
Is there any better way of doing this? I understand UDFs are not the most efficient way to solve things in PySpark but I can't seem to find any built-in PySpark functions that work.
Any thoughts would be appreciated!
Edit: Sorry, from mobile I didn't see the full expected output so my previous answer was very incomplete.
Anyway, your operation has to be done in two steps, starting with this DataFrame:
>>> df.show()
+---+-----------+
| id|string_data|
+---+-----------+
| 1| test|
| 2| null|
| 3| 9|
| 4| deleted__|
+---+-----------+
Create the boolean fields based on the conditions in the string_data field:
>>> df = (df
.withColumn('is_string_data_null', df.string_data.isNull())
.withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
.withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
.withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| false| false| true| false|
| 2| null| true| false| false| false|
| 3| 9| false| true| false| false|
| 4| deleted__| false| false| false| true|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
Now that we have our columns, we can cast them to integers:
>>> df = (df
.withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
.withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
.withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
.withColumn('is_string_normal', df.is_string_normal.cast('integer'))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| 0| 0| 1| 0|
| 2| null| 1| 0| 0| 0|
| 3| 9| 0| 1| 0| 0|
| 4| deleted__| 0| 0| 0| 1|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
This should be far more performant than an UDF, as all the operations are done by Spark itself so there's no context switch from Spark to Python.

Subtract 2 pyspark dataframes based on column

I have 2 pyspark dataframes,
i
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 2| 456|
| 3| 111|
| 4| 678|
+---+-----+
j
+----+-----+
|ID_B|COL_B|
+----+-----+
| 2| 456|
| 3| 111|
| 4| 876|
+----+-----+
I'm trying to subtract i from j based on values of a particular column i.e., values present in COL_A of i should not be present in COL_B of j.
Expected output should be,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 4| 678|
+---+-----+
This is my code,
common = i.join(j.withColumnRenamed('COL_B', 'COL_A'), ['COL_A'], 'leftsemi')
diff = i.subtract(common)
diff.show()
But the output is coming wrong,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 2| 456|
| 1| 123|
| 4| 678|
| 3| 111|
+---+-----+
Am I doing something wrong here? Thanks in advance.
Try:
left_join = i.join(j, j.COL_B == i.COL_A,how='left')
left_join.filter(left_join.COL_A.isNull()).show()
If you are having column names as args, you can do like:
left_join = i.join(j, j[colb] == i[cola],how='left')
left_join.filter(left_join[cola].isNull()).show()

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them:
from pyspark.sql.functions import randn, rand
df_1 = sqlContext.range(0, 10)
+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+
df_2 = sqlContext.range(11, 20)
+--+
|id|
+--+
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+--+
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))
and now I want to generate a third dataframe. I would like something like pandas concat:
df_1.show()
+---+--------------------+--------------------+
| id| uniform| normal|
+---+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714|
| 1| 0.8642043127063618| 0.3900018344856156|
| 2| 0.8292577771850476| 1.8077401259195247|
| 3| 0.198558705368724| -0.4270585782850261|
| 4|0.012661361966674889| 0.702634599720141|
| 5| 0.8535692890157796|-0.42355804115129153|
| 6| 0.3723296190171911| 1.3789648582622995|
| 7| 0.9529794127670571| 0.16238718777444605|
| 8| 0.9746632635918108| 0.02448061333761742|
| 9| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
df_2.show()
+---+--------------------+--------------------+
| id| uniform| normal_2|
+---+--------------------+--------------------+
| 11| 0.3221262660507942| 1.0269298899109824|
| 12| 0.4030672316912547| 1.285648175568798|
| 13| 0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876| -0.678915153834693|
| 15| 0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17| 0.6411908952297819| 0.9161177183227823|
| 18| 0.5669232696934479| 0.7270125277020573|
| 19| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
#do some concatenation here, how?
df_concat.show()
| id| uniform| normal| normal_2 |
+---+--------------------+--------------------+------------+
| 0| 0.8122802274304282| 1.2423430583597714| None |
| 1| 0.8642043127063618| 0.3900018344856156| None |
| 2| 0.8292577771850476| 1.8077401259195247| None |
| 3| 0.198558705368724| -0.4270585782850261| None |
| 4|0.012661361966674889| 0.702634599720141| None |
| 5| 0.8535692890157796|-0.42355804115129153| None |
| 6| 0.3723296190171911| 1.3789648582622995| None |
| 7| 0.9529794127670571| 0.16238718777444605| None |
| 8| 0.9746632635918108| 0.02448061333761742| None |
| 9| 0.513622008243935| 0.7626741803250845| None |
| 11| 0.3221262660507942| None | 0.123 |
| 12| 0.4030672316912547| None |0.12323 |
| 13| 0.9690555459609131| None |0.123 |
| 14|0.011913836266515876| None |0.18923 |
| 15| 0.9359607054250594| None |0.99123 |
| 16| 0.45680471157575453| None |0.123 |
| 17| 0.6411908952297819| None |1.123 |
| 18| 0.5669232696934479| None |0.10023 |
| 19| 0.513622008243935| None |0.916332123 |
+---+--------------------+--------------------+------------+
Is that possible?
Maybe you can try creating the unexisting columns and calling union (unionAll for Spark 1.6 or lower):
from pyspark.sql.functions import lit
cols = ['id', 'uniform', 'normal', 'normal_2']
df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
df_2_new = df_2.withColumn("normal", lit(None)).select(cols)
result = df_1_new.union(df_2_new)
# To remove the duplicates:
result = result.dropDuplicates()
df_concat = df_1.union(df_2)
The dataframes may need to have identical columns, in which case you can use withColumn() to create normal_1 and normal_2
unionByName is a built-in option available in spark which is available from spark 2.3.0.
with spark version 3.1.0, there is allowMissingColumns option with the default value set to False to handle missing columns. Even if both dataframes don't have the same set of columns, this function will work, setting missing column values to null in the resulting dataframe.
df_1.unionByName(df_2, allowMissingColumns=True).show()
+---+--------------------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+--------------------+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714| null|
| 1| 0.8642043127063618| 0.3900018344856156| null|
| 2| 0.8292577771850476| 1.8077401259195247| null|
| 3| 0.198558705368724| -0.4270585782850261| null|
| 4|0.012661361966674889| 0.702634599720141| null|
| 5| 0.8535692890157796|-0.42355804115129153| null|
| 6| 0.3723296190171911| 1.3789648582622995| null|
| 7| 0.9529794127670571| 0.16238718777444605| null|
| 8| 0.9746632635918108| 0.02448061333761742| null|
| 9| 0.513622008243935| 0.7626741803250845| null|
| 11| 0.3221262660507942| null| 1.0269298899109824|
| 12| 0.4030672316912547| null| 1.285648175568798|
| 13| 0.9690555459609131| null|-0.22986601831364423|
| 14|0.011913836266515876| null| -0.678915153834693|
| 15| 0.9359607054250594| null|-0.16557488664743034|
| 16| 0.45680471157575453| null| -0.3885563551710555|
| 17| 0.6411908952297819| null| 0.9161177183227823|
| 18| 0.5669232696934479| null| 0.7270125277020573|
| 19| 0.513622008243935| null| 0.7626741803250845|
+---+--------------------+--------------------+--------------------+
You can use unionByName to make this:
df = df_1.unionByName(df_2)
unionByName is available since Spark 2.3.0.
To make it more generic of keeping both columns in df1 and df2:
import pyspark.sql.functions as F
# Keep all columns in either df1 or df2
def outter_union(df1, df2):
# Add missing columns to df1
left_df = df1
for column in set(df2.columns) - set(df1.columns):
left_df = left_df.withColumn(column, F.lit(None))
# Add missing columns to df2
right_df = df2
for column in set(df1.columns) - set(df2.columns):
right_df = right_df.withColumn(column, F.lit(None))
# Make sure columns are ordered the same
return left_df.union(right_df.select(left_df.columns))
To concatenate multiple pyspark dataframes into one:
from functools import reduce
reduce(lambda x,y:x.union(y), [df_1,df_2])
And you can replace the list of [df_1, df_2] to a list of any length.
Here is one way to do it, in case it is still useful: I ran this in pyspark shell, Python version 2.7.12 and my Spark install was version 2.0.1.
PS: I guess you meant to use different seeds for the df_1 df_2 and the code below reflects that.
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))
def get_uniform(df1_uniform, df2_uniform):
if df1_uniform:
return df1_uniform
if df2_uniform:
return df2_uniform
u_get_uniform = F.udf(get_uniform, FloatType())
df_3 = df_1.join(df_2, on = "id", how = 'outer').select("id", u_get_uniform(df_1["uniform"], df_2["uniform"]).alias("uniform"), "normal", "normal_2").orderBy(F.col("id"))
Here are the outputs I get:
df_1.show()
+---+-------------------+--------------------+
| id| uniform| normal|
+---+-------------------+--------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.7311719281896606| 0.8645537008427937|
| 2| 0.1982919638208397| 0.06157382353970104|
| 3|0.12714181165849525| 0.3623040918178586|
| 4| 0.7604318153406678|-0.49575204523675975|
| 5|0.12030715258495939| 1.0854146699817222|
| 6|0.12131363910425985| -0.5284523629183004|
| 7|0.44292918521277047| -0.4798519469521663|
| 8| 0.8898784253886249| -0.8820294772950535|
| 9|0.03650707717266999| -2.1591956435415334|
+---+-------------------+--------------------+
df_2.show()
+---+-------------------+--------------------+
| id| uniform| normal_2|
+---+-------------------+--------------------+
| 11| 0.1982919638208397| 0.06157382353970104|
| 12|0.12714181165849525| 0.3623040918178586|
| 13|0.12030715258495939| 1.0854146699817222|
| 14|0.12131363910425985| -0.5284523629183004|
| 15|0.44292918521277047| -0.4798519469521663|
| 16| 0.8898784253886249| -0.8820294772950535|
| 17| 0.2731073068483362|-0.15116027592854422|
| 18| 0.7784518091224375| -0.3785563841011868|
| 19|0.43776394586845413| 0.47700719174464357|
+---+-------------------+--------------------+
df_3.show()
+---+-----------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+-----------+--------------------+--------------------+
| 0| 0.41371265| 0.5888539012978773| null|
| 1| 0.7311719| 0.8645537008427937| null|
| 2| 0.19829196| 0.06157382353970104| null|
| 3| 0.12714182| 0.3623040918178586| null|
| 4| 0.7604318|-0.49575204523675975| null|
| 5|0.120307155| 1.0854146699817222| null|
| 6| 0.12131364| -0.5284523629183004| null|
| 7| 0.44292918| -0.4798519469521663| null|
| 8| 0.88987845| -0.8820294772950535| null|
| 9|0.036507078| -2.1591956435415334| null|
| 11| 0.19829196| null| 0.06157382353970104|
| 12| 0.12714182| null| 0.3623040918178586|
| 13|0.120307155| null| 1.0854146699817222|
| 14| 0.12131364| null| -0.5284523629183004|
| 15| 0.44292918| null| -0.4798519469521663|
| 16| 0.88987845| null| -0.8820294772950535|
| 17| 0.27310732| null|-0.15116027592854422|
| 18| 0.7784518| null| -0.3785563841011868|
| 19| 0.43776396| null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+
Above answers are very elegant. I have written this function long back where i was also struggling to concatenate two dataframe with distinct columns.
Suppose you have dataframe sdf1 and sdf2
from pyspark.sql import functions as F
from pyspark.sql.types import *
def unequal_union_sdf(sdf1, sdf2):
s_df1_schema = set((x.name, x.dataType) for x in sdf1.schema)
s_df2_schema = set((x.name, x.dataType) for x in sdf2.schema)
for i,j in s_df2_schema.difference(s_df1_schema):
sdf1 = sdf1.withColumn(i,F.lit(None).cast(j))
for i,j in s_df1_schema.difference(s_df2_schema):
sdf2 = sdf2.withColumn(i,F.lit(None).cast(j))
common_schema_colnames = sdf1.columns
sdk = \
sdf1.select(common_schema_colnames).union(sdf2.select(common_schema_colnames))
return sdk
sdf_concat = unequal_union_sdf(sdf1, sdf2)
This should do it for you ...
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand, lit, coalesce, col
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 6)
df_2 = sqlContext.range(3, 10)
df_1 = df_1.select("id", lit("old").alias("source"))
df_2 = df_2.select("id")
df_1.show()
df_2.show()
df_3 = df_1.alias("df_1").join(df_2.alias("df_2"), df_1.id == df_2.id, "outer")\
.select(\
[coalesce(df_1.id, df_2.id).alias("id")] +\
[col("df_1." + c) for c in df_1.columns if c != "id"])\
.sort("id")
df_3.show()
I was trying to implement pandas append functionality in pyspark and what I created a custom function where we can concat 2 or more data frame even they are having different no. of columns only condition is if dataframes have identical name then their datatype should be same/match.
I have written a custom function to merge 2 dataframes.
def append_dfs(df1,df2):
list1 = df1.columns
list2 = df2.columns
for col in list2:
if(col not in list1):
df1 = df1.withColumn(col, F.lit(None))
for col in list1:
if(col not in list2):
df2 = df2.withColumn(col, F.lit(None))
return df1.unionByName(df2)
usage:
concate 2 dataframes
final_df = append_dfs(df1,df2)
concate more than 2(say3) dataframes
final_df = append_dfs(append_dfs(df1,df2),df3)
example:
df1:
df2:
result=append_dfs(df1,df2)
result :
Hope this will useful.
I would solve this in this way:
from pyspark.sql import SparkSession
df_1.createOrReplaceTempView("tab_1")
df_2.createOrReplaceTempView("tab_2")
df_concat=spark.sql("select tab_1.id,tab_1.uniform,tab_1.normal,tab_2.normal_2 from tab_1 tab_1 left join tab_2 tab_2 on tab_1.uniform=tab_2.uniform\
union\
select tab_2.id,tab_2.uniform,tab_1.normal,tab_2.normal_2 from tab_2 tab_2 left join tab_1 tab_1 on tab_1.uniform=tab_2.uniform")
df_concat.show()
Maybe, you want to concatenate more of two Dataframes.
I found a issue which use pandas Dataframe conversion.
Suppose you have 3 spark Dataframe who want to concatenate.
The code is the following:
list_dfs = []
list_dfs_ = []
df = spark.read.json('path_to_your_jsonfile.json',multiLine = True)
df2 = spark.read.json('path_to_your_jsonfile2.json',multiLine = True)
df3 = spark.read.json('path_to_your_jsonfile3.json',multiLine = True)
list_dfs.extend([df,df2,df3])
for df in list_dfs :
df = df.select([column for column in df.columns]).toPandas()
list_dfs_.append(df)
list_dfs.clear()
df_ = sqlContext.createDataFrame(pd.concat(list_dfs_))

Categories