Perform dataframe join before another one - python

What would be a nice and fast way to perform a DataFrame join between 3 tables?
I also need to keep the duplicate columns in my DataFrame.
SELECT
    tbl_1.id,
    tbl_2.id,
    tbl_3.id
FROM tbl_1
JOIN tbl_2
    ON tbl_1.id = tbl_2.id
JOIN tbl_3
    ON tbl_3.id = tbl_2.id;

First thing: if you want to keep all the duplicated columns in the output DataFrame, you should prefix them with something so that they can be told apart. Example:
df = spark.createDataFrame([(1, 2), (1, 1)], "id: int, val: int")
df.show()
+---+---+
| id|val|
+---+---+
| 1| 2|
| 1| 1|
+---+---+
from pyspark.sql.functions import col
df.select([col(c).alias("df1_" + c) for c in df.columns]).show()
+------+-------+
|df1_id|df1_val|
+------+-------+
| 1| 2|
| 1| 1|
+------+-------+
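As an aside, if the SQL syntax is available to you, you can register the DataFrames as temporary views and run the query from the question almost as-is, aliasing the id columns so they stay distinguishable. A minimal sketch, assuming tbl_1, tbl_2 and tbl_3 already exist as DataFrames:
tbl_1.createOrReplaceTempView("tbl_1")
tbl_2.createOrReplaceTempView("tbl_2")
tbl_3.createOrReplaceTempView("tbl_3")
spark.sql("""
    SELECT tbl_1.id AS tbl_1_id,
           tbl_2.id AS tbl_2_id,
           tbl_3.id AS tbl_3_id
    FROM tbl_1
    JOIN tbl_2 ON tbl_1.id = tbl_2.id
    JOIN tbl_3 ON tbl_3.id = tbl_2.id
""").show()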
Second thing: if the SQL syntax is not available for any reason, then you will have to join the DataFrames one by one - the join method takes only one DataFrame as the right side of the join, so you can chain the calls, for example. Snippet:
df1 = spark.createDataFrame([(1, 1), (1, 4)], "df1_id: int, df1_val: int")
df2 = spark.createDataFrame([(1, 2), (1, 5)], "df2_id: int, df2_val: int")
df3 = spark.createDataFrame([(1, 3), (1, 6)], "df3_id: int, df3_val: int")
df1.join(df2, df1.df1_id == df2.df2_id, "inner").join(df3, df1.df1_id == df3.df3_id, "inner").show()
+------+-------+------+-------+------+-------+
|df1_id|df1_val|df2_id|df2_val|df3_id|df3_val|
+------+-------+------+-------+------+-------+
| 1| 4| 1| 5| 1| 3|
| 1| 4| 1| 2| 1| 3|
| 1| 1| 1| 5| 1| 3|
| 1| 1| 1| 2| 1| 3|
| 1| 4| 1| 5| 1| 6|
| 1| 4| 1| 2| 1| 6|
| 1| 1| 1| 5| 1| 6|
| 1| 1| 1| 2| 1| 6|
+------+-------+------+-------+------+-------+
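Putting both points together for the three tables from the question, a rough sketch of the chained DataFrame join with prefixed columns could look like this (tbl_1, tbl_2 and tbl_3 are assumed to be existing DataFrames that each have an id column):
from pyspark.sql.functions import col

# prefix every column so the three id columns stay distinguishable
t1 = tbl_1.select([col(c).alias("tbl_1_" + c) for c in tbl_1.columns])
t2 = tbl_2.select([col(c).alias("tbl_2_" + c) for c in tbl_2.columns])
t3 = tbl_3.select([col(c).alias("tbl_3_" + c) for c in tbl_3.columns])

# chain the joins, mirroring the SQL from the question
result = t1.join(t2, t1.tbl_1_id == t2.tbl_2_id, "inner") \
           .join(t3, t3.tbl_3_id == t2.tbl_2_id, "inner") \
           .select("tbl_1_id", "tbl_2_id", "tbl_3_id")
result.show()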

Related

pyspark dataframe doesn't change after some processing

I create a dataframe and use a window function to get a cumulative value, but after applying the function, df and df.select() show the rows in a different order.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[8]').appName("SparkByExample.com").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", '50')
# create dataframe
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, LongType, ArrayType
from pyspark.sql import functions as F
data = [(1, 1, 2),
        (1, 2, 2),
        (1, 3, 2),
        (1, 2, 1),
        (2, 1, 2),
        (2, 3, 2),
        (2, 2, 1),
        (3, 1, 2),
        (3, 3, 2),
        (3, 2, 1)]
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", IntegerType(), True)])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 2|
| 1| 2| 2|
| 1| 3| 2|
| 1| 2| 1|
| 2| 1| 2|
| 2| 3| 2|
| 2| 2| 1|
| 3| 1| 2|
| 3| 3| 2|
| 3| 2| 1|
+----+----+----+
# this is the window function used to get the cumulative value
from pyspark.sql.window import Window
w = Window().partitionBy(F.col('col1')).orderBy(F.col('col2'))
def f(lag_val, current_val):
    value = 0
    if lag_val != current_val:
        value = 1
    return value
# register the udf so we can use it with our dataframe
func_udf = F.udf(f, IntegerType())
print(id(df))
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3'])).withColumn('new_column2', F.sum('new_column').over(w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)))
df.show()
+----+----+----+----------+-----------+
|col1|col2|col3|new_column|new_column2|
+----+----+----+----------+-----------+
| 3| 1| 2| 1| 1|
| 3| 2| 1| 1| 2|
| 3| 3| 2| 1| 3|
| 2| 1| 2| 1| 1|
| 2| 2| 1| 1| 2|
| 2| 3| 2| 1| 3|
| 1| 1| 2| 1| 1|
| 1| 2| 2| 0| 1|
| 1| 2| 1| 1| 2|
| 1| 3| 2| 1| 3|
+----+----+----+----------+-----------+
test = df.select('col1','col2','col3')
test.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 2|
| 1| 2| 2|
| 1| 3| 2|
| 1| 2| 1|
| 2| 1| 2|
| 2| 3| 2|
| 2| 2| 1|
| 3| 1| 2|
| 3| 3| 2|
| 3| 2| 1|
+----+----+----+
We can see that the row order of 'col1', 'col2', 'col3' differs between df and test: test shows the original order of df. This could be because of actions and transformations, but I am not sure.
Python dataframe packages will sometimes alter the row order automatically after certain functions are executed. If you are trying to debug this, I suggest you break your code down from the following:
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3'])).withColumn('new_column2', F.sum('new_column').over(w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)))
to something like this
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3']))
df_partition = w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('new_column2', F.sum('new_column').over(df_partition))
You should further break those lines into smaller pieces until you understand what each function does.
If you do explain() to check the query plans,
test.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[col1#255,col2#256,col3#257]
df.explain()
== Physical Plan ==
Window [sum(cast(new_column#288 as bigint)) windowspecdefinition(col1#255, col2#256 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_column2#295L], [col1#255], [col2#256 ASC NULLS FIRST]
+- *(4) Sort [col1#255 ASC NULLS FIRST, col2#256 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#255, 50), true, [id=#553]
      +- *(3) Project [col1#255, col2#256, col3#257, pythonUDF0#389 AS new_column#288]
         +- BatchEvalPython [f(_we0#289, col3#257)], [pythonUDF0#389]
            +- Window [lag(col3#257, 1, null) windowspecdefinition(col1#255, col2#256 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#289], [col1#255], [col2#256 ASC NULLS FIRST]
               +- *(2) Sort [col1#255 ASC NULLS FIRST, col2#256 ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col1#255, 50), true, [id=#544]
                     +- *(1) Scan ExistingRDD[col1#255,col2#256,col3#257]
You can see that the query optimizer knew that the test dataframe only requires columns from the original dataframe, so it skipped all the transformations for the new columns in df because they are irrelevant. That's why the dataframe had the same order as the original dataframe.
But for the new df with additional columns, sorting was performed in the window functions, so the order of the dataframe changed.
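If a stable, reproducible row order matters for the comparison, it is safer not to rely on whatever order Spark happens to produce and to add an explicit orderBy before showing. A minimal sketch using the df from above:
# row order in Spark is only guaranteed when you ask for it explicitly
df.select('col1', 'col2', 'col3').orderBy('col1', 'col2', 'col3').show()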

Add distinct count of a column to each row in PySpark

I need to add the distinct count of a column to each row in a PySpark DataFrame.
Example:
If the original dataframe is this:
+----+----+
|col1|col2|
+----+----+
|abc | 1|
|xyz | 1|
|dgc | 2|
|ydh | 3|
|ujd | 1|
|ujx | 3|
+----+----+
Then I want something like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|abc | 1| 3|
|xyz | 1| 3|
|dgc | 2| 3|
|ydh | 3| 3|
|ujd | 1| 3|
|ujx | 3| 3|
+----+----+----+
I tried df.withColumn('total_count', f.countDistinct('col2')) but it gives an error.
You can count the distinct elements in the column and create a new column with that value:
from pyspark.sql import functions as f
distincts = df.dropDuplicates(["col2"]).count()
df = df.withColumn("col3", f.lit(distincts))
Cross join with the distinct count, as below:
from pyspark.sql import functions as F
df2 = df.crossJoin(df.select(F.countDistinct('col2').alias('col3')))
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
You can use Window, collect_set and size:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([("abc", 1), ("xyz", 1), ("dgc", 2), ("ydh", 3), ("ujd", 1), ("ujx", 3)], ['col1', 'col2'])
window = Window.orderBy("col2").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("col3", F.size(F.collect_set(F.col("col2")).over(window))).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| 1| 3|
| xyz| 1| 3|
| dgc| 2| 3|
| ydh| 3| 3|
| ujd| 1| 3|
| ujx| 3| 3|
+----+----+----+
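If an approximate value is acceptable, another option worth knowing is approx_count_distinct, which (unlike countDistinct) is allowed over a window, so you can skip the cross join and the collected set. A sketch, assuming the same df as above:
from pyspark.sql import functions as F, Window

# approximate distinct count computed over a window spanning all rows
window = Window.orderBy("col2").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("col3", F.approx_count_distinct("col2").over(window)).show()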

PySpark making dataframe with three columns from RDD with tuple and int

I have an RDD of the form:
[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]
What I have done is
df = spark.createDataFrame(rdd, ["src", "rp"])
which makes one column from the tuple and one from the int, and looks like this:
+-------+-----+
| src|rp |
+-------+-----+
|[1, 10]| 1|
|[10, 1]| 1|
|[1, 12]| 1|
|[12, 1]| 1|
+-------+-----+
But I can't figure out how to make a src column from the first element of [x, y] and a dst column from the second element, so that I would have a dataframe with three columns src, dst and rp:
+-------+-----+-----+
| src|dst |rp |
+-------+-----+-----+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+-------+-----+-----+
You need an intermediate transformation on your RDD to make it a flat list of three elements:
spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"])
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+
You can just do a simple select on the dataframe to separate out the columns. No need to do an intermediate transformation as the other answer suggests.
from pyspark.sql.functions import col
df = sqlContext.createDataFrame(rdd, ["src", "rp"])
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"),col("rp"))
df.show()
Here's the result
+---+---+---+
|src|dst| rp|
+---+---+---+
| 1| 10| 1|
| 10| 1| 1|
| 1| 12| 1|
| 12| 1| 1|
+---+---+---+

Sum of variable number of columns in PySpark

I have a Spark DataFrame like this one:
+-----+--------+-------+-------+-------+-------+-------+
| Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|
+-----+--------+-------+-------+-------+-------+-------+
| Cat| 1| 1| 2| 3| 4| 5|
| Dog| 2| 1| 2| 3| 4| 5|
|Mouse| 4| 1| 2| 3| 4| 5|
| Fox| 5| 1| 2| 3| 4| 5|
+-----+--------+-------+-------+-------+-------+-------+
You can reproduce it with the next code:
data = [('Cat', 1, 1, 2, 3, 4, 5),
('Dog', 2, 1, 2, 3, 4, 5),
('Mouse', 4, 1, 2, 3, 4, 5),
('Fox', 5, 1, 2, 3, 4, 5)]
columns = ['Type', 'Criteria', 'Value#1', 'Value#2', 'Value#3', 'Value#4', 'Value#5']
df = spark.createDataFrame(data, schema=columns)
df.show()
My task is to add a Total column that is the sum of all Value columns whose # is no greater than Criteria for that row.
In this example:
For row 'Cat': Criteria is 1, so Total is just Value#1.
For row 'Dog': Criteria is 2, so Total is the sum of Value#1 and Value#2.
For row 'Fox': Criteria is 5, so Total is the sum of all columns (Value#1 through Value#5).
Result should look like this:
+-----+--------+-------+-------+-------+-------+-------+-----+
| Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|Total|
+-----+--------+-------+-------+-------+-------+-------+-----+
| Cat| 1| 1| 2| 3| 4| 5| 1|
| Dog| 2| 1| 2| 3| 4| 5| 3|
|Mouse| 4| 1| 2| 3| 4| 5| 10|
| Fox| 5| 1| 2| 3| 4| 5| 15|
+-----+--------+-------+-------+-------+-------+-------+-----+
I can do it using a Python UDF, but my datasets are large, and Python UDFs are slow because of serialization. I'm looking for a pure Spark solution.
I'm using PySpark and Spark 2.1.
You can easily adjust to PySpark the solution from "compute row maximum of the subset of columns and add to an existing dataframe" by user6910411:
from pyspark.sql.functions import col, when
total = sum([
    when(col("Criteria") >= i, col("Value#{}".format(i))).otherwise(0)
    for i in range(1, 6)
])
df.withColumn("total", total).show()
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|total|
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Cat| 1| 1| 2| 3| 4| 5| 1|
# | Dog| 2| 1| 2| 3| 4| 5| 3|
# |Mouse| 4| 1| 2| 3| 4| 5| 10|
# | Fox| 5| 1| 2| 3| 4| 5| 15|
# +-----+--------+-------+-------+-------+-------+-------+-----+
For an arbitrary set of ordered columns, define a list:
cols = df.columns[2:]
and redefine total as:
total_ = sum([
    when(col("Criteria") > i, col(cols[i])).otherwise(0)
    for i in range(len(cols))
])
df.withColumn("total", total_).show()
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Type|Criteria|Value#1|Value#2|Value#3|Value#4|Value#5|total|
# +-----+--------+-------+-------+-------+-------+-------+-----+
# | Cat| 1| 1| 2| 3| 4| 5| 1|
# | Dog| 2| 1| 2| 3| 4| 5| 3|
# |Mouse| 4| 1| 2| 3| 4| 5| 10|
# | Fox| 5| 1| 2| 3| 4| 5| 15|
# +-----+--------+-------+-------+-------+-------+-------+-----+
Important:
Here sum is __builtin__.sum not pyspark.sql.functions.sum.
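Concretely, the risk is that a wildcard import from pyspark.sql.functions would shadow Python's built-in sum and break the snippet above. A small sketch of one way to keep the two apart (the alias names are just for illustration):
from builtins import sum as builtin_sum               # Python's sum: folds the list of Columns with +
from pyspark.sql.functions import sum as spark_sum    # Spark's aggregate sum, kept under a distinct name
from pyspark.sql.functions import col, when

total = builtin_sum([
    when(col("Criteria") >= i, col("Value#{}".format(i))).otherwise(0)
    for i in range(1, 6)
])
df.withColumn("Total", total).show()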

Pyspark: how to duplicate a row n time in dataframe?

I've got a dataframe like this, and I want to duplicate each row n times when the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
>>> new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use udf function to transform the n value to n times
n_to_array = udf(lambda n : [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
df2.show()
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
|  A|  B|  n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
I think the udf answer by @Ahmed is the best way to go, but here is an alternative method that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
from pyspark.sql import functions as f
max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array for each row of length max_n, containing numbers in range(max_n). The output of this intermediate step will result in a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column, and filter to keep only the values in the array that are less than n. This will ensure that we have n copies of each row. Finally we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
    .select('A', 'B', 'n', f.explode('n_array').alias('col'))\
    .where(f.col('col') < f.col('n'))\
    .drop('col')\
    .show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating a max_n-length array for each row, as opposed to just an n-length array in the udf solution. It's not immediately clear to me how this will scale vs. the udf for large max_n, but I suspect the udf will win out.
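For what it's worth, on Spark 2.4+ you can sidestep the trade-off entirely by exploding a per-row sequence of length n, which needs neither a udf nor a max_n-sized array (essentially the same idea as the array_repeat answer above). A sketch:
from pyspark.sql import functions as f

# build [1..n] per row, explode it into n rows, then drop the helper column
df.withColumn('i', f.explode(f.sequence(f.lit(1), f.col('n'))))\
    .drop('i')\
    .show()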
