PySpark making dataframe with three columns from RDD with tuple and int - python

I have an RDD of the form:
[(('1', '10'), 1), (('10', '1'), 1), (('1', '12'), 1), (('12', '1'), 1)]
What I have done is
df = spark.createDataFrame(rdd, ["src", "rp"])
which gives me one column holding the tuple and one holding the int, looking like this:
+-------+---+
|    src| rp|
+-------+---+
|[1, 10]|  1|
|[10, 1]|  1|
|[1, 12]|  1|
|[12, 1]|  1|
+-------+---+
But I can't figure out how to make a src column from the first element of [x, y] and a dst column from the second element, so that I end up with a dataframe with the three columns src, dst and rp:
+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+

You need an intermediate transformation on your RDD to make it a flat list of three elements:
spark.createDataFrame(rdd.map(lambda l: [l[0][0], l[0][1], l[1]]), ["src", "dst", "rp"])
+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+

You can just do a simple select on the dataframe to separate out the columns. There is no need for an intermediate transformation as the other answer suggests.
from pyspark.sql.functions import col
df = sqlContext.createDataFrame(rdd, ["src", "rp"])
df = df.select(col("src._1").alias("src"), col("src._2").alias("dst"), col("rp"))
df.show()
Here's the result:
+---+---+---+
|src|dst| rp|
+---+---+---+
|  1| 10|  1|
| 10|  1|  1|
|  1| 12|  1|
| 12|  1|  1|
+---+---+---+

Related

Perform dataframe join before another one

What would be a nice and fast way to perform a dataframe join between three tables?
I also need to keep the duplicate columns in my dataframe.
SELECT
tbl_1.id,
tbl_2.id,
tbl_3.id
FROM tbl_1
JOIN tbl_2
ON tbl_1.id = tbl_2.id
JOIN tbl_3
ON tbl_3.id = tbl_2.id ;
First thing: if you want to keep all the duplicated columns in the output DataFrame, you should prefix them with something so that they can be differentiated. Example:
df = spark.createDataFrame([(1, 2), (1, 1)], "id: int, val: int")
df.show()
+---+---+
| id|val|
+---+---+
|  1|  2|
|  1|  1|
+---+---+
from pyspark.sql.functions import col
df.select([col(c).alias("df1_" + c) for c in df.columns]).show()
+------+-------+
|df1_id|df1_val|
+------+-------+
|     1|      2|
|     1|      1|
+------+-------+
Second thing: if the SQL syntax is not available for any reason, then you will have to join the DataFrames one by one - the join method takes only one DataFrame as the right side of the join, so you can chain the calls. Snippet:
df1 = spark.createDataFrame([(1, 1), (1, 4)], "df1_id: int, df1_val: int")
df2 = spark.createDataFrame([(1, 2), (1, 5)], "df2_id: int, df2_val: int")
df3 = spark.createDataFrame([(1, 3), (1, 6)], "df3_id: int, df3_val: int")
df1.join(df2, df1.df1_id == df2.df2_id, "inner").join(df3, df1.df1_id == df3.df3_id, "inner").show()
+------+-------+------+-------+------+-------+
|df1_id|df1_val|df2_id|df2_val|df3_id|df3_val|
+------+-------+------+-------+------+-------+
|     1|      4|     1|      5|     1|      3|
|     1|      4|     1|      2|     1|      3|
|     1|      1|     1|      5|     1|      3|
|     1|      1|     1|      2|     1|      3|
|     1|      4|     1|      5|     1|      6|
|     1|      4|     1|      2|     1|      6|
|     1|      1|     1|      5|     1|      6|
|     1|      1|     1|      2|     1|      6|
+------+-------+------+-------+------+-------+
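If you do want to run the SQL from the question as written, one option (a minimal sketch, not from the original answer; the view names simply mirror the query) is to register the DataFrames as temporary views and go through spark.sql:
df1.createOrReplaceTempView("tbl_1")
df2.createOrReplaceTempView("tbl_2")
df3.createOrReplaceTempView("tbl_3")
# the prefixed column names already keep the three id columns apart
spark.sql("""
    SELECT tbl_1.df1_id, tbl_2.df2_id, tbl_3.df3_id
    FROM tbl_1
    JOIN tbl_2 ON tbl_1.df1_id = tbl_2.df2_id
    JOIN tbl_3 ON tbl_3.df3_id = tbl_2.df2_id
""").show()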

pyspark dataframe doesn't change after some processing

I create a dataframe and use a window function to get a cumulative value, but after applying the function, df and df.select() show different row orders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.master('local[8]').appName("SparkByExample.com").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", '50')

# create dataframe
data = [(1, 1, 2),
        (1, 2, 2),
        (1, 3, 2),
        (1, 2, 1),
        (2, 1, 2),
        (2, 3, 2),
        (2, 2, 1),
        (3, 1, 2),
        (3, 3, 2),
        (3, 2, 1)]
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", IntegerType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   2|
|   1|   2|   2|
|   1|   3|   2|
|   1|   2|   1|
|   2|   1|   2|
|   2|   3|   2|
|   2|   2|   1|
|   3|   1|   2|
|   3|   3|   2|
|   3|   2|   1|
+----+----+----+
# this is the window function to get the cumulative value
from pyspark.sql.window import Window

w = Window.partitionBy(F.col('col1')).orderBy(F.col('col2'))

def f(lag_val, current_val):
    value = 0
    if lag_val != current_val:
        value = 1
    return value

# register the udf so we can use it with our dataframe
func_udf = F.udf(f, IntegerType())
print(id(df))
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3'])) \
       .withColumn('new_column2', F.sum('new_column').over(w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)))
df.show()
+----+----+----+----------+-----------+
|col1|col2|col3|new_column|new_column2|
+----+----+----+----------+-----------+
|   3|   1|   2|         1|          1|
|   3|   2|   1|         1|          2|
|   3|   3|   2|         1|          3|
|   2|   1|   2|         1|          1|
|   2|   2|   1|         1|          2|
|   2|   3|   2|         1|          3|
|   1|   1|   2|         1|          1|
|   1|   2|   2|         0|          1|
|   1|   2|   1|         1|          2|
|   1|   3|   2|         1|          3|
+----+----+----+----------+-----------+
test = df.select('col1','col2','col3')
test.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   1|   2|
|   1|   2|   2|
|   1|   3|   2|
|   1|   2|   1|
|   2|   1|   2|
|   2|   3|   2|
|   2|   2|   1|
|   3|   1|   2|
|   3|   3|   2|
|   3|   2|   1|
+----+----+----+
We can see that the row order of 'col1', 'col2', 'col3' differs between df and test, and that test shows the original order of df. This could be because of actions and transformations, but I am not sure.
Some dataframe operations alter the row order as a side effect of how they execute. If you want to debug what is happening, I suggest you break down your code from
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3'])).withColumn('new_column2', F.sum('new_column').over(w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0)))
to something like this
df = df.withColumn("new_column", func_udf(F.lag("col3").over(w), df['col3']))
df_partition = w.partitionBy(F.col('col1')).rowsBetween(Window.unboundedPreceding, 0))
df = df.withColumn('new_column2', F.sum('new_column').over(df_partition)
You should further break those lines into smaller function until you understand what each function does.
If you use explain() to check the query plans:
test.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[col1#255,col2#256,col3#257]
df.explain()
== Physical Plan ==
Window [sum(cast(new_column#288 as bigint)) windowspecdefinition(col1#255, col2#256 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS new_column2#295L], [col1#255], [col2#256 ASC NULLS FIRST]
+- *(4) Sort [col1#255 ASC NULLS FIRST, col2#256 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#255, 50), true, [id=#553]
      +- *(3) Project [col1#255, col2#256, col3#257, pythonUDF0#389 AS new_column#288]
         +- BatchEvalPython [f(_we0#289, col3#257)], [pythonUDF0#389]
            +- Window [lag(col3#257, 1, null) windowspecdefinition(col1#255, col2#256 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#289], [col1#255], [col2#256 ASC NULLS FIRST]
               +- *(2) Sort [col1#255 ASC NULLS FIRST, col2#256 ASC NULLS FIRST], false, 0
                  +- Exchange hashpartitioning(col1#255, 50), true, [id=#544]
                     +- *(1) Scan ExistingRDD[col1#255,col2#256,col3#257]
You can see that the query optimizer knew the test dataframe only requires columns from the original dataframe, so it skipped all the transformations that produce the new columns in df, because they are irrelevant to test. That's why test has the same row order as the original dataframe.
But for the new df with the additional columns, sorting was performed inside the window functions, so the order of the dataframe changed.
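As an aside (not from the original answer): Spark DataFrames guarantee no particular row order unless you sort explicitly, so if a stable display order matters, the simplest fix is an explicit orderBy. A minimal sketch:
# force a deterministic order for display, independent of the physical plan chosen
test = df.select('col1', 'col2', 'col3').orderBy('col1', 'col2')
test.show()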

Add distinct count of a column to each row in PySpark

I need to add the distinct count of a column to each row in a PySpark dataframe.
Example:
If the original dataframe is this:
+----+----+
|col1|col2|
+----+----+
| abc|   1|
| xyz|   1|
| dgc|   2|
| ydh|   3|
| ujd|   1|
| ujx|   3|
+----+----+
Then I want something like this:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc|   1|   3|
| xyz|   1|   3|
| dgc|   2|   3|
| ydh|   3|   3|
| ujd|   1|   3|
| ujx|   3|   3|
+----+----+----+
I tried df.withColumn('total_count', f.countDistinct('col2')) but it gives an error.
You can count the distinct elements in the column and create a new column with that value:
from pyspark.sql import functions as f
distincts = df.dropDuplicates(["col2"]).count()
df = df.withColumn("col3", f.lit(distincts))
Cross join with the distinct count, as below:
from pyspark.sql import functions as F
df2 = df.crossJoin(df.select(F.countDistinct('col2').alias('col3')))
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc|   1|   3|
| xyz|   1|   3|
| dgc|   2|   3|
| ydh|   3|   3|
| ujd|   1|   3|
| ujx|   3|   3|
+----+----+----+
You can use Window, collect_set and size:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame([("abc", 1), ("xyz", 1), ("dgc", 2), ("ydh", 3), ("ujd", 1), ("ujx", 3)], ['col1', 'col2'])
window = Window.orderBy("col2").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("col3", F.size(F.collect_set(F.col("col2")).over(window))).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc|   1|   3|
| xyz|   1|   3|
| dgc|   2|   3|
| ydh|   3|   3|
| ujd|   1|   3|
| ujx|   3|   3|
+----+----+----+

Explode nested JSON with Index using posexplode

I'm using the function below to explode a deeply nested JSON (it has nested structs and arrays).
from pyspark.sql import functions as F

# Flatten nested df
def flatten_df(nested_df):
    for col in nested_df.columns:
        array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
        for col in array_cols:
            nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))
        nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
        if len(nested_cols) == 0:
            return nested_df
        flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
        flat_df = nested_df.select(flat_cols +
                                   [F.col(nc + '.' + c).alias(nc + '_' + c)
                                    for nc in nested_cols
                                    for c in nested_df.select(nc + '.*').columns])
        return flatten_df(flat_df)
I'm successfully able to explode it. But I also want to add the order, i.e. the index, of the elements in the exploded dataframe. So in the above code I replaced the explode_outer function with posexplode_outer, but I get the error below:
An error was encountered:
'The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases'
I tried changing nested_df.withColumn to nested_df.select but I wasn't successful. Can anyone help me explode the nested JSON while keeping the order of the array elements as a column in the exploded dataframe?
Read the JSON data as a dataframe and create a view or table. In Spark SQL you can then use any number of LATERAL VIEW EXPLODE clauses with alias references. If part of the JSON structure is a struct type, you can use dot notation to reach into it, e.g. level1.level2.
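For example, a minimal sketch of that approach, using LATERAL VIEW POSEXPLODE so the element index is kept as the question asks (the view and column names here are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical dataframe standing in for the parsed JSON
df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "vals"])
df.createOrReplaceTempView("json_view")

# POSEXPLODE yields both the index and the element for each array entry
spark.sql("""
    SELECT id, pos, val
    FROM json_view
    LATERAL VIEW POSEXPLODE(vals) t AS pos, val
""").show()
# +---+---+---+
# | id|pos|val|
# +---+---+---+
# |  1|  0|  a|
# |  1|  1|  b|
# |  2|  0|  c|
# +---+---+---+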
Replace nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col])) with nested_df = nested_df.selectExpr("*", f"posexplode({col}) as (position, col)").drop(col).
You might need to write some logic to rename the columns back to the originals, but that should be simple.
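A minimal sketch of that rename-back step, assuming a hypothetical array column named vals (not the original code):
col_name = "vals"  # hypothetical array column being exploded
nested_df = spark.createDataFrame([(1, ["a", "b"])], ["id", col_name])

exploded = (nested_df
            .selectExpr("*", f"posexplode({col_name}) as (position, col)")
            .drop(col_name)
            .withColumnRenamed("col", col_name)                # restore the original column name
            .withColumnRenamed("position", col_name + "_pos")) # keep the element index alongside it
exploded.show()
# +---+--------+----+
# | id|vals_pos|vals|
# +---+--------+----+
# |  1|       0|   a|
# |  1|       1|   b|
# +---+--------+----+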
The error is because posexplode_outer returns two columns, pos and col, so you cannot use it with withColumn(). It can be used in a select, as shown in the code below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst = sqlContext.createDataFrame([(1, 7, 80), (1, 8, 40), (1, 5, 100), (5, 8, 90), (7, 6, 50), (0, 3, 60)], schema=['col1', 'col2', 'col3'])
tst_new = tst.withColumn("arr", F.array(tst.columns))
expr = tst.columns
expr.append(F.posexplode_outer('arr'))
#%%
tst_explode = tst_new.select(*expr)
results:
tst_explode.show()
+----+----+----+---+---+
|col1|col2|col3|pos|col|
+----+----+----+---+---+
| 1| 7| 80| 0| 1|
| 1| 7| 80| 1| 7|
| 1| 7| 80| 2| 80|
| 1| 8| 40| 0| 1|
| 1| 8| 40| 1| 8|
| 1| 8| 40| 2| 40|
| 1| 5| 100| 0| 1|
| 1| 5| 100| 1| 5|
| 1| 5| 100| 2|100|
| 5| 8| 90| 0| 5|
| 5| 8| 90| 1| 8|
| 5| 8| 90| 2| 90|
| 7| 6| 50| 0| 7|
| 7| 6| 50| 1| 6|
| 7| 6| 50| 2| 50|
| 0| 3| 60| 0| 0|
| 0| 3| 60| 1| 3|
| 0| 3| 60| 2| 60|
+----+----+----+---+---+
If you need to rename the columns, you can use the .withColumnRenamed() function:
df_final = tst_explode.withColumnRenamed('pos', 'position').withColumnRenamed('col', 'column')
You can try select with a list comprehension to posexplode the ArrayType columns in your existing code:
for col in array_cols:
    nested_df = nested_df.select([F.posexplode_outer(col).alias(col + '_pos', col) if c == col else c for c in nested_df.columns])
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,"n1", ["a", "b", "c"]),(2,"n2", ["foo", "bar"])],["id", "name", "vals"])
#+---+----+----------+
#| id|name| vals|
#+---+----+----------+
#| 1| n1| [a, b, c]|
#| 2| n2|[foo, bar]|
#+---+----+----------+
col = "vals"
df.select([F.posexplode_outer(col).alias(col+'_pos', col) if c == col else c for c in df.columns]).show()
#+---+----+--------+----+
#| id|name|vals_pos|vals|
#+---+----+--------+----+
#| 1| n1| 0| a|
#| 1| n1| 1| b|
#| 1| n1| 2| c|
#| 2| n2| 0| foo|
#| 2| n2| 1| bar|
#+---+----+--------+----+

Pyspark: how to duplicate a row n time in dataframe?

I've got a dataframe like this, and I want to duplicate each row n times when the column n is bigger than one:
A B n
1 2 1
2 9 1
3 8 2
4 1 1
5 3 3
And transform it like this:
A B n
1 2 1
2 9 1
3 8 2
3 8 2
4 1 1
5 3 3
5 3 3
5 3 3
I think I should use explode, but I don't understand how it works...
Thanks
With Spark 2.4.0+, this is easier with builtin functions: array_repeat + explode:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)], ["A", "B", "n"])
new_df = df.withColumn('n', expr('explode(array_repeat(n,int(n)))'))
>>> new_df.show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
The explode function returns a new row for each element in the given array or map.
One way to exploit this function is to use a udf to create a list of size n for each row. Then explode the resulting array.
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType
df = spark.createDataFrame([(1,2,1), (2,9,1), (3,8,2), (4,1,1), (5,3,3)] ,["A", "B", "n"])
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
+---+---+---+
# use a udf to turn the value n into an array containing n copies of n
n_to_array = udf(lambda n: [n] * n, ArrayType(IntegerType()))
df2 = df.withColumn('n', n_to_array(df.n))
+---+---+---------+
| A| B| n|
+---+---+---------+
| 1| 2| [1]|
| 2| 9| [1]|
| 3| 8| [2, 2]|
| 4| 1| [1]|
| 5| 3|[3, 3, 3]|
+---+---+---------+
# now use explode
df2.withColumn('n', explode(df2.n)).show()
+---+---+---+
| A| B| n|
+---+---+---+
| 1| 2| 1|
| 2| 9| 1|
| 3| 8| 2|
| 3| 8| 2|
| 4| 1| 1|
| 5| 3| 3|
| 5| 3| 3|
| 5| 3| 3|
+---+---+---+
I think the udf answer by #Ahmed is the best way to go, but here is an alternative method that may be as good or better for small n:
First, collect the maximum value of n over the whole DataFrame:
from pyspark.sql import functions as f

max_n = df.select(f.max('n').alias('max_n')).first()['max_n']
print(max_n)
#3
Now create an array of length max_n for each row, containing the numbers in range(max_n). This intermediate step results in a DataFrame like:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)])).show()
#+---+---+---+---------+
#| A| B| n| n_array|
#+---+---+---+---------+
#| 1| 2| 1|[0, 1, 2]|
#| 2| 9| 1|[0, 1, 2]|
#| 3| 8| 2|[0, 1, 2]|
#| 4| 1| 1|[0, 1, 2]|
#| 5| 3| 3|[0, 1, 2]|
#+---+---+---+---------+
Now we explode the n_array column and filter to keep only the values in the array that are less than n. This ensures that we end up with n copies of each row. Finally, we drop the exploded column to get the end result:
df.withColumn('n_array', f.array([f.lit(i) for i in range(max_n)]))\
.select('A', 'B', 'n', f.explode('n_array').alias('col'))\
.where(f.col('col') < f.col('n'))\
.drop('col')\
.show()
#+---+---+---+
#| A| B| n|
#+---+---+---+
#| 1| 2| 1|
#| 2| 9| 1|
#| 3| 8| 2|
#| 3| 8| 2|
#| 4| 1| 1|
#| 5| 3| 3|
#| 5| 3| 3|
#| 5| 3| 3|
#+---+---+---+
However, we are creating an array of length max_n for each row, as opposed to just an array of length n in the udf solution. It's not immediately clear to me how this will scale vs. the udf for large max_n, but I suspect the udf will win out.
