Add columns from different dataframes to target dataframe in PySpark

I have a few dataframes like this:
rdd_1 = sc.parallelize([(0,10,"A",2), (1,20,"B",1), (2,30,"A",2)])
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,3213,"201602"),(1,20,3003,"201601"), (1,20,9872,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, 2321,"201601")])
df_tg = sqlContext.createDataFrame(rdd_1, ["id", "type", "route_a", "route_b"])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_tg.show()
+---+----+-------+-------+
| id|type|route_a|route_b|
+---+----+-------+-------+
|  0|  10|      A|      2|
|  1|  20|      B|      1|
|  2|  30|      A|      2|
+---+----+-------+-------+
df_data.show()
+---+----+----+------+
| id|type|cost|  date|
+---+----+----+------+
|  0|  10| 223|201603|
|  0|  10|  83|201602|
|  1|  20|3003|201601|
|  1|  20|3213|201602|
|  1|  20|9872|201603|
|  2|  30|  10|201602|
|  2|  30|  62|201601|
|  2|  40|2321|201601|
+---+----+----+------+
So I need to add the columns like this:
+---+----+-------+-------+-----------+-----------+-----------+
| id|type|route_a|route_b|cost_201603|cost_201602|cost_201601|
+---+----+-------+-------+-----------+-----------+-----------+
|  0|  10|      A|      2|        223|         83|       None|
|  1|  20|      B|      1|       9872|       3213|       3003|
|  2|  30|      A|      2|       None|         10|         62|
+---+----+-------+-------+-----------+-----------+-----------+
For this I would have to do a few joins:
df_tg = df_tg.join(df_data[df_data.date == "201603"], ["id", "type"])
and with that I would have to rename the columns too, to not overwrite them:
df_tg = df_tg.join(df_data[df_data.date == "201603"], ["id", "type"]).withColumnRenamed("cost","cost_201603")
I can write a function to do this, but I would have to loop through both the available dates and the columns, generating a ton of joins with full table scans:
def feature_add(df_target, df_feat, feat_cols, period):
    for ref_month in period:
        # keep only the feature rows for this month
        df_month = df_feat[df_feat.date == ref_month]
        for feat_col in feat_cols:
            df_target = df_target.join(df_month, ["id", "type"]).select(
                *[df_target[column] for column in df_target.columns] + [df_month[feat_col]]
            ).withColumnRenamed(feat_col, feat_col + '_' + ref_month)
    return df_target
df_tg = feature_add(df_tg, df_data, ["cost"], ["201602", "201603", "201601"])
This works, but it's terrible. How can I add these columns, including when I call the same function for other dataframes? Notice that the columns are not perfectly aligned and I need to do an inner join.

I would suggest using the pivot function, as follows:
from pyspark.sql.functions import *
rdd_1 = sc.parallelize([(0,10,"A",2), (1,20,"B",1), (2,30,"A",2)])
rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,3213,"201602"),(1,20,3003,"201601"), (1,20,9872,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, 2321,"201601")])
df_tg = sqlContext.createDataFrame(rdd_1, ["id", "type", "route_a", "route_b"])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
pivot_df_data = df_data.groupBy("id","type").pivot("date").agg({"cost" : "sum"})
pivot_df_data.join(df_tg, ['id','type'], 'inner').select('id','type','route_a','route_b','201601','201602','201603','2016032').show()
# +---+----+-------+-------+------+------+------+-------+
# | id|type|route_a|route_b|201601|201602|201603|2016032|
# +---+----+-------+-------+------+------+------+-------+
# |  0|  10|      A|      2|   223|  null|  null|     83|
# |  1|  20|      B|      1|  3003|  3213|  9872|   null|
# |  2|  30|      A|      2|  null|    10|  null|   null|
# +---+----+-------+-------+------+------+------+-------+
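To see what the groupBy/pivot/sum step computes, here is a plain-Python sketch run on the same sample rows. It is not Spark code and only models the logic; the helper name pivot_costs and the cost_ prefix (borrowed from the column names the question asks for) are assumptions for illustration.

```python
from collections import defaultdict

# Same sample rows as rdd_2 above: (id, type, cost, date)
rows = [
    (0, 10, 223, "201601"), (0, 10, 83, "2016032"),
    (1, 20, 3213, "201602"), (1, 20, 3003, "201601"), (1, 20, 9872, "201603"),
    (2, 40, 2321, "201601"), (2, 30, 10, "201602"), (2, 61, 2321, "201601"),
]

def pivot_costs(rows, dates):
    """Sum cost per (id, type) and date, with one 'cost_<date>' entry per requested date."""
    totals = defaultdict(dict)
    for id_, type_, cost, date in rows:
        per_date = totals[(id_, type_)]
        per_date[date] = per_date.get(date, 0) + cost
    # Dates with no rows for a key come out as None, like null in Spark
    return {
        key: {"cost_" + d: per_date.get(d) for d in dates}
        for key, per_date in totals.items()
    }

pivoted = pivot_costs(rows, ["201601", "201602", "201603"])
print(pivoted[(1, 20)])
```

In Spark itself, the pivoted columns could be renamed the same way with withColumnRenamed after the pivot, and passing the list of dates as the second argument of pivot spares Spark from having to discover the distinct values first.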

Related

How to join two Spark DataFrame and operate their share column?

I have two DataFrames like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foo|
| b| bar|
| c| egg|
| d| fog|
+--+-----------+
and this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| hoi|
| b| hei|
| c| hai|
| e| hui|
+--+-----------+
I want to join them to be like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foohoi|
| b| barhei|
| c| egghai|
| d| fog|
| e| hui|
+--+-----------+
So the column some_string from the first dataframe is concatenated with the column some_string from the second dataframe. If I use
df_join = df1.join(df2,on='id',how='outer')
it would return
+--+-----------+-----------+
|id|some_string|some_string|
+--+-----------+-----------+
| a| foo| hoi|
| b| bar| hei|
| c| egg| hai|
| d| fog| null|
| e| null| hui|
+--+-----------+-----------+
Is there any way to do it?

You need to use when to get the concatenation right. Other than that, your outer join was almost correct.
You need to check whether either of the two columns is null and then do the concatenation accordingly.
from pyspark.sql.functions import col, when, concat
df1 = sqlContext.createDataFrame([('a','foo'),('b','bar'),('c','egg'),('d','fog')],['id','some_string'])
df2 = sqlContext.createDataFrame([('a','hoi'),('b','hei'),('c','hai'),('e','hui')],['id','some_string'])
df_outer_join=df1.join(df2.withColumnRenamed('some_string','some_string_x'), ['id'], how='outer')
df_outer_join.show()
+---+-----------+-------------+
| id|some_string|some_string_x|
+---+-----------+-------------+
| e| null| hui|
| d| fog| null|
| c| egg| hai|
| b| bar| hei|
| a| foo| hoi|
+---+-----------+-------------+
df_outer_join = df_outer_join.withColumn('some_string_concat',
    when(col('some_string').isNotNull() & col('some_string_x').isNotNull(), concat(col('some_string'), col('some_string_x')))
    .when(col('some_string').isNull() & col('some_string_x').isNotNull(), col('some_string_x'))
    .when(col('some_string').isNotNull() & col('some_string_x').isNull(), col('some_string')))\
    .drop('some_string','some_string_x')
df_outer_join.show()
+---+------------------+
| id|some_string_concat|
+---+------------------+
| e| hui|
| d| fog|
| c| egghai|
| b| barhei|
| a| foohoi|
+---+------------------+
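The three when branches amount to treating a missing side as an empty string before concatenating. A plain-Python sketch of that rule on the sample data, where dicts stand in for the two dataframes and outer_join_concat is a hypothetical helper, not a Spark function:

```python
def outer_join_concat(left, right):
    """Full outer join on the key; a side with no value contributes an empty string."""
    result = {}
    for key in sorted(set(left) | set(right)):
        result[key] = (left.get(key) or "") + (right.get(key) or "")
    return result

# id -> some_string, as in the two sample dataframes
df1_rows = {"a": "foo", "b": "bar", "c": "egg", "d": "fog"}
df2_rows = {"a": "hoi", "b": "hei", "c": "hai", "e": "hui"}

joined = outer_join_concat(df1_rows, df2_rows)
print(joined)
```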

Considering you want to perform an outer join, you can try the following:
from pyspark.sql.functions import coalesce, concat, col, lit
df_join = (df1.withColumnRenamed('some_string', 'some_string1')
    .join(df2.withColumnRenamed('some_string', 'some_string2'), on='id', how='outer')
    .withColumn('new_column', concat(coalesce(col('some_string1'), lit('')), coalesce(col('some_string2'), lit(''))))
    .select('id', 'new_column'))
(Please note that some_string1 and some_string2 refer to the some_string columns from df1 and df2. I would advise you to give them different names instead of the same name some_string, so that you can refer to each one unambiguously. coalesce replaces a null with an empty string first, because concat returns null if any of its inputs is null.)

Groupby and create a new column in PySpark dataframe

I have a pyspark dataframe like this,
+----------+--------+
|id_ | p |
+----------+--------+
| 1 | A |
| 1 | B |
| 1 | B |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | C |
| 2 | C |
| 2 | C |
| 2 | A |
| 2 | A |
| 2 | C |
+----------+--------+
I want to create another column for each group of id_. At the moment the column is built with pandas, using this code:
sample.groupby(by=['id_'], group_keys=False).apply(lambda grp : grp['p'].ne(grp['p'].shift()).cumsum())
How can I do this on a PySpark dataframe?
Currently I am doing this with the help of a pandas UDF, which runs very slowly.
What are the alternatives?
The expected column looks like this:
1
2
2
3
3
4
1
1
1
2
2
3

You can use a combination of a udf and window functions to achieve your results:
# required imports
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
# define a window, which we will use to calculate lag values
w = Window().partitionBy().orderBy(F.col('id_'))
# define user defined function (udf) to perform calculation on each row
def f(lag_val, current_val):
    if lag_val != current_val:
        return 1
    return 0
# register udf so we can use with our dataframe
func_udf = F.udf(f, IntegerType())
# read csv file
df = spark.read.csv('/path/to/file.csv', header=True)
# create new column with lag on window we created earlier, apply udf on lagged
# and current value and then apply window function again to calculate cumsum
df.withColumn("new_column", func_udf(F.lag("p").over(w), df['p'])) \
    .withColumn('cumsum', F.sum('new_column').over(
        w.partitionBy(F.col('id_')).rowsBetween(Window.unboundedPreceding, 0))) \
    .show()
+---+---+----------+------+
|id_| p|new_column|cumsum|
+---+---+----------+------+
| 1| A| 1| 1|
| 1| B| 1| 2|
| 1| B| 0| 2|
| 1| A| 1| 3|
| 1| A| 0| 3|
| 1| B| 1| 4|
| 2| C| 1| 1|
| 2| C| 0| 1|
| 2| C| 0| 1|
| 2| A| 1| 2|
| 2| A| 0| 2|
| 2| C| 1| 3|
+---+---+----------+------+
# where:
# w.partitionBy : to partition by id_ column
# w.rowsBetween : to specify frame boundaries
# ref https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/expressions/Window.html#rowsBetween-long-long-
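As a cross-check of the expected column, the rule behind the pandas expression in the question (the counter goes up whenever p differs from the previous row within the same id_) can be written in plain Python. This is a sketch for small in-memory data, not a Spark solution; change_counter is a hypothetical name, and the rows are assumed to already be in the order shown in the question.

```python
from itertools import groupby

def change_counter(rows):
    """rows: list of (id_, p) in their original order; returns one counter per row."""
    counters = []
    # groupby splits on consecutive runs of the same id_
    for _, group in groupby(rows, key=lambda r: r[0]):
        count = 0
        prev = object()  # sentinel that never equals a real value
        for _, p in group:
            if p != prev:
                count += 1
            prev = p
            counters.append(count)
    return counters

sample = [(1, "A"), (1, "B"), (1, "B"), (1, "A"), (1, "A"), (1, "B"),
          (2, "C"), (2, "C"), (2, "C"), (2, "A"), (2, "A"), (2, "C")]
print(change_counter(sample))
```

Because the counter restarts for every id_, the first row of a group always gets 1, whatever the last value of the previous group was.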

Find count of distinct values between two same values in a csv file using pyspark

I'm working with PySpark to process big CSV files of more than 50 GB.
Now I need to find the number of distinct values between two references to the same value.
For example,
input dataframe:
+----+
|col1|
+----+
| a|
| b|
| c|
| c|
| a|
| b|
| a|
+----+
output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
| a| null|
| b| null|
| c| null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+-----+
I have been struggling with this for the past week. I tried window functions and many other things in Spark, but couldn't get anywhere. It would be a great help if someone knows how to solve this. Thank you.
Comment if you need any clarification of the question.

I am providing a solution with some assumptions.
It assumes that the previous reference can be found within the previous n rows. If n is reasonably small, I think this is a good solution.
Here I assumed you can find the previous reference within 5 rows.
from pyspark.sql import Window
from pyspark.sql.functions import array, col, lag, monotonically_increasing_id, udf
from pyspark.sql.types import IntegerType
def get_distincts(values, current_value):
    # values holds the previous rows, nearest first
    cnt = {}
    flag = False
    for i in values:
        if current_value == i:
            flag = True
            break
        else:
            cnt[i] = "some_value"
    if flag:
        return len(cnt)
    else:
        return None
get_distincts_udf = udf(get_distincts, IntegerType())
df = spark.createDataFrame([["a"],["b"],["c"],["c"],["a"],["b"],["a"]]).toDF("col1")
# You can replace this if you have some unique id column
df = df.withColumn("seq_id", monotonically_increasing_id())
window = Window.orderBy("seq_id")
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))
df = df.withColumn("col2", get_distincts_udf(col('list'), col('col1'))).drop('seq_id', 'list')
df.show()
which results in:
+----+----+
|col1|col2|
+----+----+
| a|null|
| b|null|
| c|null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+----+

You can try the following approach:
add a monotonically increasing id column to keep track of the order of the rows
find prev_id for each col1 value and save the result to a new DF
for the new DF (alias 'd1'), do a LEFT JOIN to the DF itself (alias 'd2') with the condition (d2.id > d1.prev_id) & (d2.id < d1.id)
then groupby('d1.col1', 'd1.id') and aggregate with countDistinct('d2.col1')
The code based on the above logic and your sample data is shown below:
from pyspark.sql import functions as F, Window
df1 = spark.createDataFrame([ (i,) for i in list("abccaba")], ["col1"])
# create a WinSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')
# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id())\
          .withColumn('prev_id', F.lag('id').over(win))
# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# | a| 0| null|
# | b| 1| null|
# | c| 2| null|
# | c| 3| 2|
# | a| 4| 0|
# | b| 5| 1|
# | a| 6| 4|
# +----+---+-------+
# let's cache the new dataframe
df11.persist()
# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
    .join(df11.alias('d2')
        , (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')), how='left') \
    .select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
    .groupBy('col1','id') \
    .agg(F.countDistinct('ids').alias('distinct_values'))
# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# | a| 0| 0|
# | b| 1| 0|
# | c| 2| 0|
# | c| 3| 0|
# | a| 4| 2|
# | b| 5| 2|
# | a| 6| 1|
# +----+---+---------------+
# release the cached df11
df11.unpersist()
Note you will need to keep this id column to sort rows, otherwise your resulting rows will be totally messed up each time you collect them.
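The join condition above boils down to this: for each row, look at the rows strictly between the previous occurrence of the same value and the current row, and count the distinct values there. A plain-Python sketch of that rule for small in-memory data follows. distinct_between is a hypothetical helper, and it returns None where there is no previous occurrence, as in the question's expected output.

```python
def distinct_between(values):
    """For each position, count distinct values since the previous occurrence of the same value."""
    last_seen = {}  # value -> index of its most recent occurrence
    result = []
    for i, value in enumerate(values):
        prev = last_seen.get(value)
        if prev is None:
            result.append(None)  # first time this value appears
        else:
            # rows strictly between the previous occurrence and this one
            result.append(len(set(values[prev + 1:i])))
        last_seen[value] = i
    return result

print(distinct_between(list("abccaba")))
```

The slice makes this quadratic in the worst case, so it is only meant as a reference for checking results on small samples.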

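Here block is each value read from the CSV column, in order, and blocks is assumed to be the sequence of those values. For each value, stack_dist is the number of distinct values seen since its previous occurrence, and reuse_dist is the number of rows in between.
# blocks: assumed name for the values read from the CSV, e.g. list("abccaba")
reuse_distance = []
block_dict = {}      # value -> position of its most recent occurrence
stack_list = []      # distinct values, least recently seen first
counter_reuse = 0    # number of values processed so far
counter_stack = 0    # number of distinct values seen so far
for block in blocks:
    stack_dist = -1  # stays -1 when there is no previous occurrence
    reuse_dist = -1
    if block in block_dict:
        reuse_dist = counter_reuse - block_dict[block] - 1
        block_dict[block] = counter_reuse
        counter_reuse += 1
        stack_dist_ind = stack_list.index(block)
        stack_dist = counter_stack - stack_dist_ind - 1
        del stack_list[stack_dist_ind]
        stack_list.append(block)
    else:
        block_dict[block] = counter_reuse
        counter_reuse += 1
        counter_stack += 1
        stack_list.append(block)
    reuse_distance.append([block, stack_dist, reuse_dist])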

PySpark Dataframe take mean of list within column and create new column with 1 & 0 depending on a condition

I am trying to calculate the mean of a list (cost) within a PySpark Dataframe column. Values that are less than the mean should get the value 1, and values above the mean a 0.
This is the current dataframe:
+----------+--------------------+--------------------+
| id| collect_list(p_id)|collect_list(cost) |
+----------+--------------------+--------------------+
| 7|[10, 987, 872] |[12.0, 124.6, 197.0]|
| 6|[11, 858, 299] |[15.0, 167.16, 50.0]|
| 17| [2]| [65.4785]|
| 1|[34359738369, 343...|[16.023384, 104.9...|
| 3|[17179869185, 0, ...|[48.3255, 132.025...|
+----------+--------------------+--------------------+
This is the desired output:
+----------+--------------------+--------------------+-----------+
| id| p_id |cost | result |
+----------+--------------------+--------------------+-----------+
| 7|10 |12.0 | 1 |
| 7|987 |124.6 | 0 |
| 7|872 |197.0 | 0 |
| 6|11 |15.0 | 1 |
| 6|858 |167.16 | 0 |
| 6|299 |50.0 | 1 |
| 17|2 |65.4785 | 1 |
+----------+--------------------+--------------------+-----------+

from pyspark.sql.functions import col, mean
#sample data
df = sc.parallelize([(7,[10, 987, 872],[12.0, 124.6, 197.0]),
(6,[11, 858, 299],[15.0, 167.16, 50.0]),
(17,[2],[65.4785])]).toDF(["id", "collect_list(p_id)","collect_list(cost)"])
#unpack collect_list in desired output format
df = df.rdd.flatMap(lambda row: [(row[0], x, y) for x,y in zip(row[1],row[2])]).toDF(["id", "p_id","cost"])
df1 = df.\
join(df.groupBy("id").agg(mean("cost").alias("mean_cost")), "id", 'left').\
withColumn("result",(col("cost") <= col("mean_cost")).cast("int")).\
drop("mean_cost")
df1.show()
Output is :
+---+----+-------+------+
| id|p_id| cost|result|
+---+----+-------+------+
| 7| 10| 12.0| 1|
| 7| 987| 124.6| 0|
| 7| 872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6| 858| 167.16| 0|
| 6| 299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+----+-------+------+

You can create a result list for every row and then zip the pid, cost and result lists. After that, use explode on the zipped column.
import numpy as np
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import *
def zip_cols(pid_list, cost_list):
    mean = np.mean(cost_list)
    res_list = list(map(lambda cost: 1 if mean >= cost else 0, cost_list))
    return [(x, y, z) for x, y, z in zip(pid_list, cost_list, res_list)]
udf_zip = udf(zip_cols, ArrayType(StructType([StructField("pid", IntegerType()),
                                              StructField("cost", DoubleType()),
                                              StructField("result", IntegerType())])))
df1 = (df.withColumn("temp", udf_zip("collect_list(p_id)", "collect_list(cost)")).
       drop("collect_list(p_id)", "collect_list(cost)"))
df2 = (df1.withColumn("temp", explode(df1.temp)).
       select("id", col("temp.pid").alias("pid"),
              col("temp.cost").alias("cost"),
              col("temp.result").alias("result")))
df2.show()
Output:
+---+---+-------+------+
| id|pid|   cost|result|
+---+---+-------+------+
|  7| 10|   12.0|     1|
|  7|987|  124.6|     0|
|  7|872|  197.0|     0|
|  6| 11|   15.0|     1|
|  6|858| 167.16|     0|
|  6|299|   50.0|     1|
| 17|  2|65.4785|     1|
+---+---+-------+------+
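The expected result column can be checked on the sample data without Spark. The sketch below flattens each row's lists and flags costs that are at or below the row's mean, which is the comparison both answers above use. explode_with_flag is a hypothetical helper, not part of PySpark.

```python
def explode_with_flag(rows):
    """rows: (id, p_ids, costs) tuples; returns one (id, p_id, cost, result) tuple per list element."""
    out = []
    for id_, p_ids, costs in rows:
        avg = sum(costs) / len(costs)  # mean of this row's cost list
        for p_id, cost in zip(p_ids, costs):
            out.append((id_, p_id, cost, 1 if cost <= avg else 0))
    return out

sample = [
    (7, [10, 987, 872], [12.0, 124.6, 197.0]),
    (6, [11, 858, 299], [15.0, 167.16, 50.0]),
    (17, [2], [65.4785]),
]
for row in explode_with_flag(sample):
    print(row)
```

A row with a single cost always gets 1, because the only value equals its own mean.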

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them:
from pyspark.sql.functions import randn, rand
df_1 = sqlContext.range(0, 10)
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
df_2 = sqlContext.range(11, 20)
+---+
| id|
+---+
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))
Now I want to generate a third dataframe. I would like something like pandas concat:
df_1.show()
+---+--------------------+--------------------+
| id| uniform| normal|
+---+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714|
| 1| 0.8642043127063618| 0.3900018344856156|
| 2| 0.8292577771850476| 1.8077401259195247|
| 3| 0.198558705368724| -0.4270585782850261|
| 4|0.012661361966674889| 0.702634599720141|
| 5| 0.8535692890157796|-0.42355804115129153|
| 6| 0.3723296190171911| 1.3789648582622995|
| 7| 0.9529794127670571| 0.16238718777444605|
| 8| 0.9746632635918108| 0.02448061333761742|
| 9| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
df_2.show()
+---+--------------------+--------------------+
| id| uniform| normal_2|
+---+--------------------+--------------------+
| 11| 0.3221262660507942| 1.0269298899109824|
| 12| 0.4030672316912547| 1.285648175568798|
| 13| 0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876| -0.678915153834693|
| 15| 0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17| 0.6411908952297819| 0.9161177183227823|
| 18| 0.5669232696934479| 0.7270125277020573|
| 19| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
#do some concatenation here, how?
df_concat.show()
+---+--------------------+--------------------+------------+
| id| uniform| normal| normal_2 |
+---+--------------------+--------------------+------------+
| 0| 0.8122802274304282| 1.2423430583597714| None |
| 1| 0.8642043127063618| 0.3900018344856156| None |
| 2| 0.8292577771850476| 1.8077401259195247| None |
| 3| 0.198558705368724| -0.4270585782850261| None |
| 4|0.012661361966674889| 0.702634599720141| None |
| 5| 0.8535692890157796|-0.42355804115129153| None |
| 6| 0.3723296190171911| 1.3789648582622995| None |
| 7| 0.9529794127670571| 0.16238718777444605| None |
| 8| 0.9746632635918108| 0.02448061333761742| None |
| 9| 0.513622008243935| 0.7626741803250845| None |
| 11| 0.3221262660507942| None | 0.123 |
| 12| 0.4030672316912547| None |0.12323 |
| 13| 0.9690555459609131| None |0.123 |
| 14|0.011913836266515876| None |0.18923 |
| 15| 0.9359607054250594| None |0.99123 |
| 16| 0.45680471157575453| None |0.123 |
| 17| 0.6411908952297819| None |1.123 |
| 18| 0.5669232696934479| None |0.10023 |
| 19| 0.513622008243935| None |0.916332123 |
+---+--------------------+--------------------+------------+
Is that possible?

Maybe you can try creating the missing columns and calling union (unionAll for Spark 1.6 or lower):
from pyspark.sql.functions import lit
cols = ['id', 'uniform', 'normal', 'normal_2']
df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
df_2_new = df_2.withColumn("normal", lit(None)).select(cols)
result = df_1_new.union(df_2_new)
# To remove the duplicates:
result = result.dropDuplicates()

df_concat = df_1.union(df_2)
union matches columns by position, so the dataframes need to have identical columns; if they don't, you can use withColumn() to create the missing normal and normal_2 columns first.

unionByName is built into Spark and has been available since Spark 2.3.0.
Since Spark 3.1.0 it also has an allowMissingColumns option, False by default, to handle missing columns. With allowMissingColumns=True the function works even if the two dataframes don't have the same set of columns, setting the missing column values to null in the resulting dataframe.
df_1.unionByName(df_2, allowMissingColumns=True).show()
+---+--------------------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+--------------------+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714| null|
| 1| 0.8642043127063618| 0.3900018344856156| null|
| 2| 0.8292577771850476| 1.8077401259195247| null|
| 3| 0.198558705368724| -0.4270585782850261| null|
| 4|0.012661361966674889| 0.702634599720141| null|
| 5| 0.8535692890157796|-0.42355804115129153| null|
| 6| 0.3723296190171911| 1.3789648582622995| null|
| 7| 0.9529794127670571| 0.16238718777444605| null|
| 8| 0.9746632635918108| 0.02448061333761742| null|
| 9| 0.513622008243935| 0.7626741803250845| null|
| 11| 0.3221262660507942| null| 1.0269298899109824|
| 12| 0.4030672316912547| null| 1.285648175568798|
| 13| 0.9690555459609131| null|-0.22986601831364423|
| 14|0.011913836266515876| null| -0.678915153834693|
| 15| 0.9359607054250594| null|-0.16557488664743034|
| 16| 0.45680471157575453| null| -0.3885563551710555|
| 17| 0.6411908952297819| null| 0.9161177183227823|
| 18| 0.5669232696934479| null| 0.7270125277020573|
| 19| 0.513622008243935| null| 0.7626741803250845|
+---+--------------------+--------------------+--------------------+
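The behaviour of allowMissingColumns=True can be pictured with rows as plain dicts: take the union of the column names, in first-seen order, and fill anything a row lacks with None. A minimal sketch, not Spark code; union_by_name is a hypothetical helper and the row values are shortened from the tables above.

```python
def union_by_name(rows_a, rows_b):
    """Concatenate two lists of row dicts, filling columns a row lacks with None."""
    all_rows = rows_a + rows_b
    columns = []
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    return [{name: row.get(name) for name in columns} for row in all_rows]

rows_1 = [{"id": 0, "uniform": 0.81, "normal": 1.24}]
rows_2 = [{"id": 11, "uniform": 0.32, "normal_2": 1.03}]

for row in union_by_name(rows_1, rows_2):
    print(row)
```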

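You can use unionByName for this:
df = df_1.unionByName(df_2)
unionByName is available since Spark 2.3.0. Without allowMissingColumns=True (Spark 3.1.0 and later) it requires both dataframes to have the same set of column names.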

To make it more generic and keep all columns from both df1 and df2:
import pyspark.sql.functions as F
# Keep all columns in either df1 or df2
def outer_union(df1, df2):
    # Add missing columns to df1
    left_df = df1
    for column in set(df2.columns) - set(df1.columns):
        left_df = left_df.withColumn(column, F.lit(None))
    # Add missing columns to df2
    right_df = df2
    for column in set(df1.columns) - set(df2.columns):
        right_df = right_df.withColumn(column, F.lit(None))
    # Make sure columns are ordered the same
    return left_df.union(right_df.select(left_df.columns))

To concatenate multiple pyspark dataframes into one:
from functools import reduce
reduce(lambda x,y:x.union(y), [df_1,df_2])
You can replace the list [df_1, df_2] with a list of any length.

Here is one way to do it, in case it is still useful: I ran this in pyspark shell, Python version 2.7.12 and my Spark install was version 2.0.1.
PS: I guess you meant to use different seeds for df_1 and df_2, and the code below reflects that.
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))
def get_uniform(df1_uniform, df2_uniform):
    # compare against None so that a legitimate 0.0 is not skipped
    if df1_uniform is not None:
        return df1_uniform
    if df2_uniform is not None:
        return df2_uniform
u_get_uniform = F.udf(get_uniform, FloatType())
df_3 = df_1.join(df_2, on = "id", how = 'outer').select("id", u_get_uniform(df_1["uniform"], df_2["uniform"]).alias("uniform"), "normal", "normal_2").orderBy(F.col("id"))
Here are the outputs I get:
df_1.show()
+---+-------------------+--------------------+
| id| uniform| normal|
+---+-------------------+--------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.7311719281896606| 0.8645537008427937|
| 2| 0.1982919638208397| 0.06157382353970104|
| 3|0.12714181165849525| 0.3623040918178586|
| 4| 0.7604318153406678|-0.49575204523675975|
| 5|0.12030715258495939| 1.0854146699817222|
| 6|0.12131363910425985| -0.5284523629183004|
| 7|0.44292918521277047| -0.4798519469521663|
| 8| 0.8898784253886249| -0.8820294772950535|
| 9|0.03650707717266999| -2.1591956435415334|
+---+-------------------+--------------------+
df_2.show()
+---+-------------------+--------------------+
| id| uniform| normal_2|
+---+-------------------+--------------------+
| 11| 0.1982919638208397| 0.06157382353970104|
| 12|0.12714181165849525| 0.3623040918178586|
| 13|0.12030715258495939| 1.0854146699817222|
| 14|0.12131363910425985| -0.5284523629183004|
| 15|0.44292918521277047| -0.4798519469521663|
| 16| 0.8898784253886249| -0.8820294772950535|
| 17| 0.2731073068483362|-0.15116027592854422|
| 18| 0.7784518091224375| -0.3785563841011868|
| 19|0.43776394586845413| 0.47700719174464357|
+---+-------------------+--------------------+
df_3.show()
+---+-----------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+-----------+--------------------+--------------------+
| 0| 0.41371265| 0.5888539012978773| null|
| 1| 0.7311719| 0.8645537008427937| null|
| 2| 0.19829196| 0.06157382353970104| null|
| 3| 0.12714182| 0.3623040918178586| null|
| 4| 0.7604318|-0.49575204523675975| null|
| 5|0.120307155| 1.0854146699817222| null|
| 6| 0.12131364| -0.5284523629183004| null|
| 7| 0.44292918| -0.4798519469521663| null|
| 8| 0.88987845| -0.8820294772950535| null|
| 9|0.036507078| -2.1591956435415334| null|
| 11| 0.19829196| null| 0.06157382353970104|
| 12| 0.12714182| null| 0.3623040918178586|
| 13|0.120307155| null| 1.0854146699817222|
| 14| 0.12131364| null| -0.5284523629183004|
| 15| 0.44292918| null| -0.4798519469521663|
| 16| 0.88987845| null| -0.8820294772950535|
| 17| 0.27310732| null|-0.15116027592854422|
| 18| 0.7784518| null| -0.3785563841011868|
| 19| 0.43776396| null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+

The above answers are very elegant. I wrote this function a long time ago, when I was also struggling to concatenate two dataframes with distinct columns.
Suppose you have dataframes sdf1 and sdf2:
from pyspark.sql import functions as F
from pyspark.sql.types import *
def unequal_union_sdf(sdf1, sdf2):
    s_df1_schema = set((x.name, x.dataType) for x in sdf1.schema)
    s_df2_schema = set((x.name, x.dataType) for x in sdf2.schema)
    # add the columns that only sdf2 has to sdf1, as typed nulls
    for i, j in s_df2_schema.difference(s_df1_schema):
        sdf1 = sdf1.withColumn(i, F.lit(None).cast(j))
    # add the columns that only sdf1 has to sdf2, as typed nulls
    for i, j in s_df1_schema.difference(s_df2_schema):
        sdf2 = sdf2.withColumn(i, F.lit(None).cast(j))
    common_schema_colnames = sdf1.columns
    sdk = \
        sdf1.select(common_schema_colnames).union(sdf2.select(common_schema_colnames))
    return sdk
sdf_concat = unequal_union_sdf(sdf1, sdf2)

This should do it for you ...
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand, lit, coalesce, col
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 6)
df_2 = sqlContext.range(3, 10)
df_1 = df_1.select("id", lit("old").alias("source"))
df_2 = df_2.select("id")
df_1.show()
df_2.show()
df_3 = df_1.alias("df_1").join(df_2.alias("df_2"), df_1.id == df_2.id, "outer")\
.select(\
[coalesce(df_1.id, df_2.id).alias("id")] +\
[col("df_1." + c) for c in df_1.columns if c != "id"])\
.sort("id")
df_3.show()

I was trying to implement the pandas append functionality in PySpark, so I wrote a custom function that can concatenate two or more dataframes even if they have different numbers of columns. The only condition is that columns with the same name must have the same datatype.
import pyspark.sql.functions as F
def append_dfs(df1, df2):
    list1 = df1.columns
    list2 = df2.columns
    # add the columns that exist only in df2 to df1, filled with nulls
    for c in list2:
        if c not in list1:
            df1 = df1.withColumn(c, F.lit(None))
    # add the columns that exist only in df1 to df2, filled with nulls
    for c in list1:
        if c not in list2:
            df2 = df2.withColumn(c, F.lit(None))
    return df1.unionByName(df2)
Usage:
Concatenate 2 dataframes:
final_df = append_dfs(df1, df2)
Concatenate more than 2 (say 3) dataframes:
final_df = append_dfs(append_dfs(df1, df2), df3)
Hope this is useful.

I would solve it this way:
from pyspark.sql import SparkSession
df_1.createOrReplaceTempView("tab_1")
df_2.createOrReplaceTempView("tab_2")
df_concat = spark.sql("""
    select tab_1.id, tab_1.uniform, tab_1.normal, tab_2.normal_2
    from tab_1 left join tab_2 on tab_1.uniform = tab_2.uniform
    union
    select tab_2.id, tab_2.uniform, tab_1.normal, tab_2.normal_2
    from tab_2 left join tab_1 on tab_1.uniform = tab_2.uniform
""")
df_concat.show()

Maybe you want to concatenate more than two dataframes.
I found a solution that goes through a pandas DataFrame conversion.
Suppose you have 3 Spark dataframes that you want to concatenate.
The code is the following:
import pandas as pd
list_dfs = []
list_dfs_ = []
df = spark.read.json('path_to_your_jsonfile.json', multiLine=True)
df2 = spark.read.json('path_to_your_jsonfile2.json', multiLine=True)
df3 = spark.read.json('path_to_your_jsonfile3.json', multiLine=True)
list_dfs.extend([df, df2, df3])
for df in list_dfs:
    # toPandas() collects the whole dataframe to the driver
    df = df.select([column for column in df.columns]).toPandas()
    list_dfs_.append(df)
list_dfs.clear()
df_ = sqlContext.createDataFrame(pd.concat(list_dfs_))
