Dividing complex rows of dataframe to simple rows in Pyspark - python

I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
Row(id=2, title=[Row(value=u'horse bus',max_dist=50), Row(value=u'normal bus',max_dist=100)]),
Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),Row(value=u'Motorbikes', max_dist=80)]),
Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
I need to split all compound rows (e.g. 2 & 4) to multiple rows while retaining the 'id', to get a result like this:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 |[5000,Airplane] |
#|4 |[20,Bicycles] |
#|4 |[80,Motorbikes] |
#|5 |[15,Trams] |
#+---+----------------------------------+

Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title"))
## +---+----------------+
## | id| title|
## +---+----------------+
## | 1| [1000,cars]|
## | 2| [50,horse bus]|
## | 2|[100,normal bus]|
## | 3| [5000,Airplane]|
## | 4| [20,Bicycles]|
## | 4| [80,Motorbikes]|
## | 5| [15,Trams]|
## +---+----------------+
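For intuition, here is a plain-Python sketch of what explode does to each row. It is only a model, not Spark code: the explode_rows helper, the (id, list) tuple layout, and the extra row with id 6 are assumptions made for illustration. Each array element becomes its own output row with the other columns repeated. Rows whose array is empty or null produce no output with explode, while Spark's explode_outer keeps them with a null.

```python
def explode_rows(rows, outer=False):
    """Model of explode on (id, list) rows; outer=True mimics explode_outer."""
    out = []
    for row_id, items in rows:
        if not items:
            # explode drops rows with a null/empty array; explode_outer keeps them
            if outer:
                out.append((row_id, None))
            continue
        for item in items:
            # one output row per array element, with the id repeated
            out.append((row_id, item))
    return out


rows = [
    (1, [(1000, "cars")]),
    (2, [(50, "horse bus"), (100, "normal bus")]),
    (6, []),  # hypothetical row with an empty array
]
exploded = explode_rows(rows)
exploded_outer = explode_rows(rows, outer=True)
```

If ids with empty title arrays must survive, explode_outer is probably the function to reach for instead of explode.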

OK, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects, because I couldn't find a way to append to a Row object.
That means this method is a bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
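def add_id(row):
    # Build one [value, max_dist, id] list per element of the title array
    it_list = []
    for item in row.title:
        # Read the struct fields by name so the result does not depend on Row field order
        it_list.append([item.value, item.max_dist, row.id])
    return it_list
# DataFrame has no flatMap in Spark 2.0+, so go through the underlying RDD
with_id = documents.rdd.flatMap(add_id)
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()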
When I run df.show(), I get:
+---+----------------+
| id| title|
+---+----------------+
| 1| [cars,1000]|
| 2| [horse bus,50]|
| 2|[normal bus,100]|
| 3| [Airplane,5000]|
| 4| [Bicycles,20]|
| 4| [Motorbikes,80]|
| 5| [Trams,15]|
+---+----------------+

I am using the Spark Dataset API (Java), and the following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID", "explode(finished_chunk) as chunks");
Note: The explode method of the Dataset API is deprecated in Spark 2.4.5, and the documentation suggests using select (shown above) or flatMap instead.

Related

Is there a way to add a column with range of values to a Spark Dataframe?

I have a spark dataframe: df1 as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
None of the columns in my dataframe contains unique values.
I tried to use F.monotonically_increasing_id(), but it just produces seemingly random numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am worried about using window functions to generate row numbers.
So, is there a way I can add a column with row_count to the dataframe that gives:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
If a window function is the only way to implement this, how can I make sure all the data ends up in a single partition?
Or, if there is a way to do the same without window functions, how do I implement it?
Any help is appreciated.
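A side note on the rowId1 values in the question: they are not random. According to the Spark documentation, the current implementation of monotonically_increasing_id puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal sketch that decodes the three values shown above, assuming that layout (the decode_monotonic_id helper is made up for illustration):

```python
def decode_monotonic_id(value):
    """Split an ID into (partition_id, record_number), assuming the documented bit layout."""
    partition_id = value >> 33                 # upper bits: partition ID
    record_number = value & ((1 << 33) - 1)    # lower 33 bits: row number within the partition
    return partition_id, record_number


ids = [17179869184, 42949672960, 60129542144]
decoded = [decode_monotonic_id(v) for v in ids]
```

Each of the three rows is the first record of a different partition, which is why the values are large and far apart rather than 1, 2, 3.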
Use zipWithIndex.
I could not find the code I wrote for this in the past, but here is a good post that explains it: https://sqlandhadoop.com/pyspark-zipwithindex-example/
Note that the PySpark usage differs from the Scala one.
The other answer is not good for performance, because the window moves all the data to a single executor. zipWithIndex is a narrow transformation, so it works per partition.
Here goes, you can tailor it accordingly:
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F
df1 = spark.createDataFrame([ ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4') ], StringType())
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
| abc| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| abc| 4|
| 2| 5|
| 3| 6|
| 4| 7|
| abc| 8|
| 2| 9|
| 3| 10|
| 4| 11|
+-----+-----+
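To see why zipWithIndex can number rows without pulling everything into one partition, here is a plain-Python model of the idea. It is not Spark's implementation: the zip_with_index helper and the sample partitions are assumptions for illustration. The partition sizes are counted first (Spark runs an extra job for this when the RDD has more than one partition), then every partition numbers its own rows starting from the total size of the partitions before it.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model of RDD.zipWithIndex on a list of partitions (each a list of items)."""
    # Pass 1: count the items in each partition.
    sizes = [len(p) for p in partitions]
    # Starting index of each partition = total size of the partitions before it.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: each partition numbers its own items independently from its offset.
    return [
        [(item, offset + i) for i, item in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


partitions = [["abc", "2", "3"], ["4", "abc"], [], ["2", "3", "4"]]
indexed = zip_with_index(partitions)
flat = [pair for part in indexed for pair in part]
```

The indices come out consecutive and follow partition order, so they reflect the physical order of the data rather than any sort key.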
Assumption: This answer assumes that the order of col_id should depend on the age column. If that assumption does not hold, the alternative is zipWithIndex, as mentioned in the comments on the question. An example of using zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and the row number to get the expected numbers. Note that an empty partitionBy moves all rows into a single partition.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
    'col_id',
    F.row_number().over(windowSpec)
)

Looking to convert String Column to Integer Column in PySpark. What happens to strings that can't be converted?

I'm trying to convert a column in a dataframe to IntegerType. Here is an example of the dataframe:
+----+-------+
|From| To|
+----+-------+
| 1|1664968|
| 2| 3|
| 2| 747213|
| 2|1664968|
| 2|1691047|
| 2|4095634|
+----+-------+
I'm using the following code:
exploded_df = exploded_df.withColumn('From', exploded_df['To'].cast(IntegerType()))
However, I wanted to know what happens to strings that are not digits, for example, what happens if I have a string with several spaces? The reason is that I want to filter the dataframe in order to get the values of the column 'From' that don't have numbers in column 'To'.
Is there a simpler way to filter by this condition without converting the columns to IntegerType?
Thank you!
Values which cannot be cast are set to null, and the column will be considered a nullable column of that type. Here's a simple example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1",),
                            ("2",),
                            ("3",),
                            ("4",),
                            ("hello world",)], schema=['id'])
# show() prints the table itself and returns None, so it is not wrapped in print()
df.show()
# Strings that cannot be parsed as integers become null
df = df.withColumn("id", F.col("id").cast(IntegerType()))
df.show()
Output:
+-----------+
| id|
+-----------+
| 1|
| 2|
| 3|
| 4|
|hello world|
+-----------+
+----+
| id|
+----+
| 1|
| 2|
| 3|
| 4|
|null|
+----+
And to verify the schema is correct:
df.printSchema()
Output:
root
|-- id: integer (nullable = true)
Hope this helps!
We can use a regex to check whether the To column contains letters or whitespace, using the .rlike function in Spark to filter the matching rows.
Example:
from pyspark.sql.functions import col
df=spark.createDataFrame([("1","1664968"),("2","3"),("2","742a7"),("2"," "),("2","a")],["From","To"])
df.show()
#+----+-------+
#|From| To|
#+----+-------+
#| 1|1664968|
#| 2| 3|
#| 2| 742a7|
#| 2| |
#| 2| a|
#+----+-------+
# get the rows that contain a lowercase letter or whitespace
df.filter(col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-----+
#|From|To |
#+----+-----+
#|2 |742a7|
#|2 | |
#|2 |a |
#+----+-----+
# get the rows that do not contain any lowercase letter or whitespace
df.filter(~col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-------+
#|From|To |
#+----+-------+
#|1 |1664968|
#|2 |3 |
#+----+-------+
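One caveat with the pattern above: it only looks for lowercase letters and whitespace, so values such as "12A", "4-5" or an empty string would still pass as numeric. A small sketch with Python's re module comparing it to a stricter digits-only pattern. The extra sample values and the choice of "ASCII digits only" as the definition of numeric are my assumptions, not part of the question.

```python
import re

# Pattern from the answer above: matches a lowercase letter or whitespace anywhere.
NON_NUMERIC_ANSWER = re.compile(r"([a-z]|\s+)")
# Stricter alternative (assumption): the whole value must consist of ASCII digits.
DIGITS_ONLY = re.compile(r"^[0-9]+$")

values = ["1664968", "3", "742a7", " ", "a", "12A", "4-5", ""]

# Values the original pattern flags as non-numeric
flagged_by_answer = [v for v in values if NON_NUMERIC_ANSWER.search(v)]
# Values that are not made up of digits only
not_all_digits = [v for v in values if not DIGITS_ONLY.search(v)]
```

The same stricter pattern can be passed to rlike, for example df.filter(~col("To").rlike("^[0-9]+$")) to keep the rows whose To value is not purely digits.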

Pyspark- Assign each group in groupBy an ID [duplicate]

I would like to assign each group in a groupby a unique id number starting from 0 or 1 and incrementing by 1 for each group using pyspark.
I have done this previously using pandas with python with the command:
df['id_num'] = (df
.groupby('column_name')
.grouper
.group_info[0])
A toy example of the input and desired output is:
Input
+------+
|object|
+------+
|apple |
|orange|
|pear |
|berry |
|apple |
|pear |
|berry |
+------+
output:
+------+--+
|object|id|
+------+--+
|apple |1 |
|orange|2 |
|pear |3 |
|berry |4 |
|apple |1 |
|pear |3 |
|berry |4 |
+------+--+
I am not sure if the order is important. If not, you can use the dense_rank window function in this case:
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+------+
|object|
+------+
| apple|
|orange|
| pear|
| berry|
| apple|
| pear|
| berry|
+------+
>>>
>>> df.withColumn("id", F.dense_rank().over(Window.orderBy(df.object))).show()
+------+---+
|object| id|
+------+---+
| apple| 1|
| apple| 1|
| berry| 2|
| berry| 2|
|orange| 3|
| pear| 4|
| pear| 4|
+------+---+
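Note that the ids above follow the sort order of the values, not the order of first appearance, so orange gets 3 here instead of the 2 in the desired output; each group still gets a unique, consecutive id. A plain-Python model of that behaviour (the dense_rank_ids helper is hypothetical, just to show the logic):

```python
def dense_rank_ids(values):
    """Map each distinct value to its 1-based dense rank in sorted order."""
    rank_of = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    # Look up the rank for every row, keeping the original row order
    return [rank_of[v] for v in values]


objects = ["apple", "orange", "pear", "berry", "apple", "pear", "berry"]
ids = dense_rank_ids(objects)
```

Also keep in mind that Window.orderBy without a partitionBy moves all rows to a single partition, so this approach is best suited to data that fits comfortably on one executor.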
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
values = [('apple',),('orange',),('pear',),('berry',),('apple',),('pear',),('berry',)]
df = sqlContext.createDataFrame(values,['object'])
#Creating a column of distinct elements and converting them into dictionary with unique indexes.
df1 = df.distinct()
distinct_list = list(df1.select('object').toPandas()['object'])
dict_with_index = {distinct_list[i]:i+1 for i in range(len(distinct_list))}
#Applying the mapping of dictionary.
mapping_expr = create_map([lit(x) for x in chain(*dict_with_index.items())])
df=df.withColumn("id", mapping_expr.getItem(col("object")))
df.show()
+------+---+
|object| id|
+------+---+
| apple| 2|
|orange| 1|
| pear| 3|
| berry| 4|
| apple| 2|
| pear| 3|
| berry| 4|
+------+---+

Joining Dataframes with same column name in pyspark

I have two dataframes which have been read from two CSV files.
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 30|
| 2|9090909093| 30|
| 3|9090909090| 30|
| 4|9090909094| 30|
+---+----------+-----------------+
and
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 40|
| 2|9090909093| 50|
| 3|9090909090| 60|
| 4|9090909094| 70|
+---+----------+-----------------+
I am trying to join these two dataframes on the NUMBER column using the PySpark code dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner'), and a new dataframe is generated as follows.
+----------+---+-----------------+---+-----------------+
| NUMBER | ID| RECHARGE_AMOUNT| ID| RECHARGE_AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092| 1| 30| 1| 40|
|9090909093| 2| 30| 2| 50|
|9090909090| 3| 30| 3| 60|
|9090909094| 4| 30| 4| 70|
+----------+---+-----------------+---+-----------------+
But I am not able to write this dataframe to a file, since the dataframe after joining has duplicate columns. I am using the following code: dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output',header = 'true'). Is there any way to avoid duplicate columns after joining in Spark? Given below is my PySpark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("test1").getOrCreate()
files = ["/home/user/test1.txt", "/home/user/test2.txt"]
dfFinal = spark.read.load(files[0],format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
dfFinal.show()
for i in range(1,len(files)):
    df2 = spark.read.load(files[i],format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
    df2.show()
    dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner')
dfFinal.show()
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output',header = 'true')
I need to generate unique column names, i.e. if I give two files in the files array with the same columns, it should generate the following:
+----------+---+----------------+---+----------------+
|    NUMBER|IDx|RECHARGE_AMOUNTx|IDy|RECHARGE_AMOUNTy|
+----------+---+----------------+---+----------------+
|9090909092|  1|              30|  1|              40|
|9090909093|  2|              30|  2|              50|
|9090909090|  3|              30|  3|              60|
|9090909094|  4|              30|  4|              70|
+----------+---+----------------+---+----------------+
In pandas I can use the suffixes argument as shown below: dfFinal = dfFinal.merge(df2,left_on='NUMBER',right_on='NUMBER',how='inner',suffixes=('x', 'y'),sort=True), which will generate the above dataframe. Is there any way I can replicate this in PySpark?
You can select the columns from each dataframe and alias them, like this:
dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner') \
    .select('NUMBER',
            dfFinal.ID.alias('ID_1'),
            dfFinal.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_1'),
            df2.ID.alias('ID_2'),
            df2.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_2'))
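Since the question loops over an arbitrary list of files, listing every alias by hand may not scale. One possible way to imitate the pandas suffixes behaviour is to compute the new column names up front and rename each dataframe before joining. The suffixed_names helper below is a hypothetical name of mine and works on plain lists of column names, so it is a sketch of the renaming logic only:

```python
def suffixed_names(columns, keys, suffix):
    """Return the column names with `suffix` appended to every non-key column."""
    return [c if c in keys else c + suffix for c in columns]


columns = ["ID", "NUMBER", "RECHARGE_AMOUNT"]
left_names = suffixed_names(columns, ["NUMBER"], "x")
right_names = suffixed_names(columns, ["NUMBER"], "y")
```

The resulting list can be handed to DataFrame.toDF, for example df2 = df2.toDF(*suffixed_names(df2.columns, ['NUMBER'], 'y')) before the join, so that the joined dataframe has no duplicate column names and can be written to CSV.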

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them:
from pyspark.sql.functions import randn, rand
df_1 = sqlContext.range(0, 10)
+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+
df_2 = sqlContext.range(11, 20)
+---+
| id|
+---+
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))
and now I want to generate a third dataframe. I would like something like pandas concat:
df_1.show()
+---+--------------------+--------------------+
| id| uniform| normal|
+---+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714|
| 1| 0.8642043127063618| 0.3900018344856156|
| 2| 0.8292577771850476| 1.8077401259195247|
| 3| 0.198558705368724| -0.4270585782850261|
| 4|0.012661361966674889| 0.702634599720141|
| 5| 0.8535692890157796|-0.42355804115129153|
| 6| 0.3723296190171911| 1.3789648582622995|
| 7| 0.9529794127670571| 0.16238718777444605|
| 8| 0.9746632635918108| 0.02448061333761742|
| 9| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
df_2.show()
+---+--------------------+--------------------+
| id| uniform| normal_2|
+---+--------------------+--------------------+
| 11| 0.3221262660507942| 1.0269298899109824|
| 12| 0.4030672316912547| 1.285648175568798|
| 13| 0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876| -0.678915153834693|
| 15| 0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17| 0.6411908952297819| 0.9161177183227823|
| 18| 0.5669232696934479| 0.7270125277020573|
| 19| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
#do some concatenation here, how?
df_concat.show()
+---+--------------------+--------------------+------------+
| id| uniform| normal| normal_2 |
+---+--------------------+--------------------+------------+
| 0| 0.8122802274304282| 1.2423430583597714| None |
| 1| 0.8642043127063618| 0.3900018344856156| None |
| 2| 0.8292577771850476| 1.8077401259195247| None |
| 3| 0.198558705368724| -0.4270585782850261| None |
| 4|0.012661361966674889| 0.702634599720141| None |
| 5| 0.8535692890157796|-0.42355804115129153| None |
| 6| 0.3723296190171911| 1.3789648582622995| None |
| 7| 0.9529794127670571| 0.16238718777444605| None |
| 8| 0.9746632635918108| 0.02448061333761742| None |
| 9| 0.513622008243935| 0.7626741803250845| None |
| 11| 0.3221262660507942| None | 0.123 |
| 12| 0.4030672316912547| None |0.12323 |
| 13| 0.9690555459609131| None |0.123 |
| 14|0.011913836266515876| None |0.18923 |
| 15| 0.9359607054250594| None |0.99123 |
| 16| 0.45680471157575453| None |0.123 |
| 17| 0.6411908952297819| None |1.123 |
| 18| 0.5669232696934479| None |0.10023 |
| 19| 0.513622008243935| None |0.916332123 |
+---+--------------------+--------------------+------------+
Is that possible?
Maybe you can try creating the missing columns and calling union (unionAll for Spark 1.6 or lower):
from pyspark.sql.functions import lit
cols = ['id', 'uniform', 'normal', 'normal_2']
df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
df_2_new = df_2.withColumn("normal", lit(None)).select(cols)
result = df_1_new.union(df_2_new)
# To remove the duplicates:
result = result.dropDuplicates()
df_concat = df_1.union(df_2)
union matches columns by position, so the dataframes need to have identical columns in the same order; you can use withColumn() to add the missing normal and normal_2 columns first.
unionByName is a built-in method, available since Spark 2.3.0.
Since Spark 3.1.0 it also has an allowMissingColumns option (default False) to handle missing columns. With allowMissingColumns=True, the function works even if the two dataframes don't have the same set of columns, setting the missing column values to null in the resulting dataframe.
df_1.unionByName(df_2, allowMissingColumns=True).show()
+---+--------------------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+--------------------+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714| null|
| 1| 0.8642043127063618| 0.3900018344856156| null|
| 2| 0.8292577771850476| 1.8077401259195247| null|
| 3| 0.198558705368724| -0.4270585782850261| null|
| 4|0.012661361966674889| 0.702634599720141| null|
| 5| 0.8535692890157796|-0.42355804115129153| null|
| 6| 0.3723296190171911| 1.3789648582622995| null|
| 7| 0.9529794127670571| 0.16238718777444605| null|
| 8| 0.9746632635918108| 0.02448061333761742| null|
| 9| 0.513622008243935| 0.7626741803250845| null|
| 11| 0.3221262660507942| null| 1.0269298899109824|
| 12| 0.4030672316912547| null| 1.285648175568798|
| 13| 0.9690555459609131| null|-0.22986601831364423|
| 14|0.011913836266515876| null| -0.678915153834693|
| 15| 0.9359607054250594| null|-0.16557488664743034|
| 16| 0.45680471157575453| null| -0.3885563551710555|
| 17| 0.6411908952297819| null| 0.9161177183227823|
| 18| 0.5669232696934479| null| 0.7270125277020573|
| 19| 0.513622008243935| null| 0.7626741803250845|
+---+--------------------+--------------------+--------------------+
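You can use unionByName for this:
df = df_1.unionByName(df_2, allowMissingColumns=True)
unionByName is available since Spark 2.3.0; the allowMissingColumns argument, needed here because the column sets differ, requires Spark 3.1.0 or later.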
To make it more generic and keep all columns from both df1 and df2:
import pyspark.sql.functions as F
# Keep all columns in either df1 or df2
def outter_union(df1, df2):
    # Add missing columns to df1
    left_df = df1
    for column in set(df2.columns) - set(df1.columns):
        left_df = left_df.withColumn(column, F.lit(None))
    # Add missing columns to df2
    right_df = df2
    for column in set(df1.columns) - set(df2.columns):
        right_df = right_df.withColumn(column, F.lit(None))
    # Make sure columns are ordered the same
    return left_df.union(right_df.select(left_df.columns))
To concatenate multiple pyspark dataframes into one:
from functools import reduce
reduce(lambda x,y:x.union(y), [df_1,df_2])
And you can replace the list [df_1, df_2] with a list of any length.
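One thing to watch with a chain of union calls: union matches columns by position, not by name, so dataframes with the same columns in a different order get silently mixed up. A plain-Python model of the difference from unionByName (both helper functions and the sample rows are made up for illustration; real Spark would additionally check column counts and coerce types):

```python
def union_by_position(cols_a, rows_a, rows_b):
    """Model of DataFrame.union: rows are appended as-is, names come from the left side."""
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Model of DataFrame.unionByName: right rows are reordered to the left's column order."""
    order = [cols_b.index(c) for c in cols_a]
    return cols_a, rows_a + [tuple(r[i] for i in order) for r in rows_b]


cols_1, rows_1 = ["id", "uniform"], [(0, 0.81)]
cols_2, rows_2 = ["uniform", "id"], [(0.32, 11)]  # same columns, different order

_, positional = union_by_position(cols_1, rows_1, rows_2)
_, by_name = union_by_name(cols_1, rows_1, cols_2, rows_2)
```

In the positional result the second row has its id and uniform values swapped. If the column order of the inputs is not guaranteed, reduce(lambda x, y: x.unionByName(y), dfs) is the safer variant.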
Here is one way to do it, in case it is still useful: I ran this in pyspark shell, Python version 2.7.12 and my Spark install was version 2.0.1.
PS: I guess you meant to use different seeds for the df_1 df_2 and the code below reflects that.
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))
def get_uniform(df1_uniform, df2_uniform):
    # Compare against None so that a legitimate 0.0 is not treated as missing
    if df1_uniform is not None:
        return df1_uniform
    if df2_uniform is not None:
        return df2_uniform
u_get_uniform = F.udf(get_uniform, FloatType())
df_3 = df_1.join(df_2, on = "id", how = 'outer').select("id", u_get_uniform(df_1["uniform"], df_2["uniform"]).alias("uniform"), "normal", "normal_2").orderBy(F.col("id"))
Here are the outputs I get:
df_1.show()
+---+-------------------+--------------------+
| id| uniform| normal|
+---+-------------------+--------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.7311719281896606| 0.8645537008427937|
| 2| 0.1982919638208397| 0.06157382353970104|
| 3|0.12714181165849525| 0.3623040918178586|
| 4| 0.7604318153406678|-0.49575204523675975|
| 5|0.12030715258495939| 1.0854146699817222|
| 6|0.12131363910425985| -0.5284523629183004|
| 7|0.44292918521277047| -0.4798519469521663|
| 8| 0.8898784253886249| -0.8820294772950535|
| 9|0.03650707717266999| -2.1591956435415334|
+---+-------------------+--------------------+
df_2.show()
+---+-------------------+--------------------+
| id| uniform| normal_2|
+---+-------------------+--------------------+
| 11| 0.1982919638208397| 0.06157382353970104|
| 12|0.12714181165849525| 0.3623040918178586|
| 13|0.12030715258495939| 1.0854146699817222|
| 14|0.12131363910425985| -0.5284523629183004|
| 15|0.44292918521277047| -0.4798519469521663|
| 16| 0.8898784253886249| -0.8820294772950535|
| 17| 0.2731073068483362|-0.15116027592854422|
| 18| 0.7784518091224375| -0.3785563841011868|
| 19|0.43776394586845413| 0.47700719174464357|
+---+-------------------+--------------------+
df_3.show()
+---+-----------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+-----------+--------------------+--------------------+
| 0| 0.41371265| 0.5888539012978773| null|
| 1| 0.7311719| 0.8645537008427937| null|
| 2| 0.19829196| 0.06157382353970104| null|
| 3| 0.12714182| 0.3623040918178586| null|
| 4| 0.7604318|-0.49575204523675975| null|
| 5|0.120307155| 1.0854146699817222| null|
| 6| 0.12131364| -0.5284523629183004| null|
| 7| 0.44292918| -0.4798519469521663| null|
| 8| 0.88987845| -0.8820294772950535| null|
| 9|0.036507078| -2.1591956435415334| null|
| 11| 0.19829196| null| 0.06157382353970104|
| 12| 0.12714182| null| 0.3623040918178586|
| 13|0.120307155| null| 1.0854146699817222|
| 14| 0.12131364| null| -0.5284523629183004|
| 15| 0.44292918| null| -0.4798519469521663|
| 16| 0.88987845| null| -0.8820294772950535|
| 17| 0.27310732| null|-0.15116027592854422|
| 18| 0.7784518| null| -0.3785563841011868|
| 19| 0.43776396| null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+
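A detail worth checking in a helper like get_uniform is how "missing" is detected. A plain truthiness test such as if df1_uniform: treats 0.0 the same as None, which matters for a uniform random column that can legitimately contain zero. A minimal pure-Python illustration (the two function names are mine, not part of the answer):

```python
def first_truthy(a, b):
    """Truthiness-based pick: treats 0.0 the same as a missing value."""
    if a:
        return a
    if b:
        return b


def first_non_null(a, b):
    """None-based pick: only a missing value (None) falls through to b."""
    return a if a is not None else b


truthy_result = first_truthy(0.0, 0.5)       # 0.0 is skipped
non_null_result = first_non_null(0.0, 0.5)   # 0.0 is kept
both_missing = first_non_null(None, None)
```

In Spark itself the built-in coalesce function from pyspark.sql.functions picks the first non-null column without a Python UDF, which should also avoid the loss of precision from FloatType that is visible in the df_3 output above.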
The above answers are very elegant. I wrote this function a long time ago, when I was also struggling to concatenate two dataframes with distinct columns.
Suppose you have dataframes sdf1 and sdf2:
from pyspark.sql import functions as F
from pyspark.sql.types import *
def unequal_union_sdf(sdf1, sdf2):
    s_df1_schema = set((x.name, x.dataType) for x in sdf1.schema)
    s_df2_schema = set((x.name, x.dataType) for x in sdf2.schema)
    for i,j in s_df2_schema.difference(s_df1_schema):
        sdf1 = sdf1.withColumn(i,F.lit(None).cast(j))
    for i,j in s_df1_schema.difference(s_df2_schema):
        sdf2 = sdf2.withColumn(i,F.lit(None).cast(j))
    common_schema_colnames = sdf1.columns
    sdk = \
        sdf1.select(common_schema_colnames).union(sdf2.select(common_schema_colnames))
    return sdk
sdf_concat = unequal_union_sdf(sdf1, sdf2)
This should do it for you ...
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand, lit, coalesce, col
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 6)
df_2 = sqlContext.range(3, 10)
df_1 = df_1.select("id", lit("old").alias("source"))
df_2 = df_2.select("id")
df_1.show()
df_2.show()
df_3 = df_1.alias("df_1").join(df_2.alias("df_2"), df_1.id == df_2.id, "outer")\
.select(\
[coalesce(df_1.id, df_2.id).alias("id")] +\
[col("df_1." + c) for c in df_1.columns if c != "id"])\
.sort("id")
df_3.show()
I was trying to implement pandas' append functionality in PySpark, so I wrote a custom function that can concatenate two or more dataframes even if they have a different number of columns. The only condition is that columns with identical names must have the same datatype.
from pyspark.sql import functions as F
def append_dfs(df1,df2):
    list1 = df1.columns
    list2 = df2.columns
    # Add the columns that exist only in df2 to df1 as nulls
    for col in list2:
        if(col not in list1):
            df1 = df1.withColumn(col, F.lit(None))
    # Add the columns that exist only in df1 to df2 as nulls
    for col in list1:
        if(col not in list2):
            df2 = df2.withColumn(col, F.lit(None))
    return df1.unionByName(df2)
Usage:
# concatenate 2 dataframes
final_df = append_dfs(df1,df2)
# concatenate more than 2 (say 3) dataframes
final_df = append_dfs(append_dfs(df1,df2),df3)
Hope this is useful.
I would solve this in this way:
from pyspark.sql import SparkSession
df_1.createOrReplaceTempView("tab_1")
df_2.createOrReplaceTempView("tab_2")
df_concat=spark.sql("select tab_1.id,tab_1.uniform,tab_1.normal,tab_2.normal_2 from tab_1 tab_1 left join tab_2 tab_2 on tab_1.uniform=tab_2.uniform\
union\
select tab_2.id,tab_2.uniform,tab_1.normal,tab_2.normal_2 from tab_2 tab_2 left join tab_1 tab_1 on tab_1.uniform=tab_2.uniform")
df_concat.show()
Maybe you want to concatenate more than two dataframes.
I found a solution that uses conversion to pandas dataframes.
Suppose you have 3 Spark dataframes that you want to concatenate.
The code is the following:
import pandas as pd
list_dfs = []
list_dfs_ = []
df = spark.read.json('path_to_your_jsonfile.json',multiLine = True)
df2 = spark.read.json('path_to_your_jsonfile2.json',multiLine = True)
df3 = spark.read.json('path_to_your_jsonfile3.json',multiLine = True)
list_dfs.extend([df,df2,df3])
for df in list_dfs :
    # toPandas() collects the whole dataframe to the driver
    df = df.select([column for column in df.columns]).toPandas()
    list_dfs_.append(df)
list_dfs.clear()
df_ = sqlContext.createDataFrame(pd.concat(list_dfs_))
