Below is my DF:
deviceDict = {'TABLET' : 'MOBILE', 'PHONE':'MOBILE', 'PC':'Desktop', 'CEDEX' : '', 'ST' : 'SAINT', 'AV' : 'AVENUE', 'BD': 'BOULEVARD'}
df = spark.createDataFrame([('TABLET', 'DAF ST PAQ BD'), ('PHONE', 'AVOTHA'), ('PC', 'STPA CEDEX'), ('OTHER', 'AV DAF'), (None, None)], ["device_type", 'City'])
df.show()
Output:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| TABLET|DAF ST PAQ BD|
| PHONE| AVOTHA|
| PC| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
The aim is to replace keys with their values, using the solution from Pyspark: Replacing value in a column by searching a dictionary:
tests = df.na.replace(deviceDict, 1)
Result:
+-----------+-------------+
|device_type| City|
+-----------+-------------+
| MOBILE|DAF ST PAQ BD|
| MOBILE| AVOTHA|
| Desktop| STPA CEDEX|
| OTHER| AV DAF|
| null| null|
+-----------+-------------+
It worked for device_type but I wasn't able to change the city (even when using subset)
Expected output:
+-----------+------------------------+
|device_type| City|
+-----------+------------------------+
| MOBILE| DAF SAINT PAQ BOULEVARD|
| MOBILE| AVOTHA|
| Desktop| STPA|
| OTHER| AVENUE DAF|
| null| null|
+-----------+------------------------+
The replacement doesn't occur for the column City because you're trying to do a partial replacement inside the column values, whereas DataFrame.replace matches the entire cell value against the mapping.
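For illustration, a quick check of that behaviour (a minimal sketch reusing the same df and deviceDict; the 'SOMEWHERE' value is just a placeholder): a key that equals a whole cell value is replaced, a key that only appears inside a value is not.
# 'ST' is only a substring of the City values, so nothing changes:
df.na.replace({'ST': 'SAINT'}, subset=['City']).show()
# 'AVOTHA' equals an entire cell value, so it would be replaced:
df.na.replace({'AVOTHA': 'SOMEWHERE'}, subset=['City']).show()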
To achieve what you want for the column City, you can use multiple nested regexp_replace expressions, which you can generate dynamically with Python's functools.reduce, for example:
from functools import reduce
import pyspark.sql.functions as F
m = list(deviceDict.items())
df1 = df.na.replace(deviceDict, 1).withColumn(
    "City",
    reduce(
        lambda acc, x: F.regexp_replace(acc, rf"\b{x[0]}\b", x[1]),
        m[1:],
        F.regexp_replace(F.col("City"), rf"\b{m[0][0]}\b", m[0][1]),
    )
)
df1.show(truncate=False)
#+-----------+-----------------------+
#|device_type|City |
#+-----------+-----------------------+
#|MOBILE |DAF SAINT PAQ BOULEVARD|
#|MOBILE |AVOTHA |
#|Desktop |STPA |
#|OTHER |AVENUE DAF |
#|null |null |
#+-----------+-----------------------+
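As a small design note (a sketch, not part of the original answer): reduce accepts an initial value, so the first dictionary entry doesn't need to be split off; passing F.col("City") as the initializer lets one lambda handle every entry.
from functools import reduce
import pyspark.sql.functions as F

city_expr = reduce(
    lambda acc, kv: F.regexp_replace(acc, rf"\b{kv[0]}\b", kv[1]),
    deviceDict.items(),
    F.col("City"),
)
df1 = df.na.replace(deviceDict, 1).withColumn("City", city_expr)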
I have performed data cleaning on my dataframe with pyspark, including the removal of stop-words.
Removing the stop-words produces, for each line, a list containing the words that are NOT stop-words.
Now I would like to count all the words left in that column, in order to build a word cloud or word-frequency table.
This is my pyspark dataframe:
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
| content|score|label|classWeigth| words| filtered| terms_stemmed|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|absolutely love d...| 5| 1| 0.48|[absolutely, love...|[absolutely, love...|[absolut, love, d...|
|absolutely love t...| 5| 1| 0.48|[absolutely, love...|[absolutely, love...|[absolut, love, g...|
|absolutely phenom...| 5| 1| 0.48|[absolutely, phen...|[absolutely, phen...|[absolut, phenome...|
|absolutely shocki...| 1| 0| 0.52|[absolutely, shoc...|[absolutely, shoc...|[absolut, shock, ...|
|accept the phone ...| 1| 0| 0.52|[accept, the, pho...|[accept, phone, n...|[accept, phone, n...|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
terms_stemmed is the final column, from which I would like to get a new data frame like the following:
+-------------+--------+
|terms_stemmed| count |
+-------------+--------+
|app | 592059|
|use | 218178|
|good | 187671|
|like | 155304|
|game | 149941|
|.... | .... |
Can someone help me?
One option is to use explode
import pyspark.sql.functions as F
new_df = df\
    .withColumn('terms_stemmed', F.explode('terms_stemmed'))\
    .groupby('terms_stemmed')\
    .count()
Example
import pyspark.sql.functions as F
df = spark.createDataFrame([
(1, ["Apple", "Banana"]),
(2, ["Banana", "Orange", "Banana"]),
(3, ["Orange"])
], ("id", "terms_stemmed"))
df.show(truncate=False)
+---+------------------------+
|id |terms_stemmed |
+---+------------------------+
|1 |[Apple, Banana] |
|2 |[Banana, Orange, Banana]|
|3 |[Orange] |
+---+------------------------+
new_df = df\
    .withColumn('terms_stemmed', F.explode('terms_stemmed'))\
    .groupby('terms_stemmed')\
    .count()
new_df.show()
+-------------+-----+
|terms_stemmed|count|
+-------------+-----+
| Banana| 3|
| Apple| 1|
| Orange| 2|
+-------------+-----+
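To get the frequency table from the question ordered with the most frequent terms first, you can sort the counts; a minimal sketch reusing new_df from above:
import pyspark.sql.functions as F

new_df.orderBy(F.desc('count')).show()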
I have here an example DF:
+---+---------+----+---------+---------+-------------+-------------+
|id | company |type|rev2016 | rev2017 | main2016 | main2017 |
+---+---------+----+---------+---------+-------------+-------------+
| 1 | google |web | 100 | 200 | 55 | 66 |
+---+---------+----+---------+---------+-------------+-------------+
And I want this output:
+---+---------+----+-------------+------+------+
|id | company |type| Metric | 2016 | 2017 |
+---+---------+----+-------------+------+------+
| 1 | google |web | rev | 100 | 200 |
| 1 | google |web | main | 55 | 66 |
+---+---------+----+-------------+------+------+
What I am trying to achieve is transposing the revenue and maintenance columns to rows with a new column 'Metric'. I have tried pivoting, but with no luck so far.
You can construct an array of structs from the columns, and then explode the arrays and expand the structs to get the desired output.
import pyspark.sql.functions as F
struct_list = [
    F.struct(
        F.lit('rev').alias('Metric'),
        F.col('rev2016').alias('2016'),
        F.col('rev2017').alias('2017')
    ),
    F.struct(
        F.lit('main').alias('Metric'),
        F.col('main2016').alias('2016'),
        F.col('main2017').alias('2017')
    )
]

df2 = df.withColumn(
    'arr',
    F.explode(F.array(*struct_list))
).select('id', 'company', 'type', 'arr.*')
df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
Or you can use stack:
df2 = df.selectExpr(
    'id', 'company', 'type',
    "stack(2, 'rev', rev2016, rev2017, 'main', main2016, main2017) as (Metric, `2016`, `2017`)"
)
df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
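If there were many metric/year combinations, the stack expression could also be generated instead of written by hand. A sketch, assuming every column follows the <metric><year> naming used here (the metrics and years lists are assumptions):
metrics = ['rev', 'main']   # assumed metric prefixes
years = ['2016', '2017']    # assumed year suffixes

# Builds "'rev', rev2016, rev2017, 'main', main2016, main2017"
pairs = ', '.join(f"'{m}', " + ', '.join(f"{m}{y}" for y in years) for m in metrics)
cols = ', '.join(f"`{y}`" for y in years)
expr = f"stack({len(metrics)}, {pairs}) as (Metric, {cols})"

df2 = df.selectExpr('id', 'company', 'type', expr)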
I would like to search through a Pyspark DataFrame containing string fields and determine which keyword strings appear in each. Say I have the following DataFrame of keywords:
+-----------+----------+
| city| state|
+-----------+----------+
| Seattle|Washington|
|Los Angeles|California|
+-----------+----------+
which I would like to search for in this DataFrame:
+----------------------------------------+------+
|body |source|
+----------------------------------------+------+
|Seattle is in Washington. |a |
|Los Angeles is in California |b |
|Banana is a fruit |c |
|Seattle is not in New Hampshire |d |
|California is home to Los Angeles |e |
|Seattle, California is not a real place.|f |
+----------------------------------------+------+
I want to create a new DataFrame that identifies which keywords of which type appear in each source. So the desired end result would be:
+-----------+------+-----+
|name |source|type |
+-----------+------+-----+
|Seattle |a |city |
|Washington |a |state|
|Los Angeles|b |city |
|California |b |state|
|Seattle |d |city |
|Los Angeles|e |city |
|California |e |state|
|Seattle |f |city |
|California |f |state|
+-----------+------+-----+
How can I obtain this result? I could use join to isolate the body strings that contain these keywords, but I'm not sure how to track which specific keyword was a match and use that information to create new columns.
First, let's create and modify the dataframes:
import pyspark.sql.functions as psf
keywords_df = sc.parallelize([["Seattle", "Washington"], ["Los Angeles", "California"]])\
    .toDF(["city", "state"])

keywords_df = keywords_df\
    .withColumn("struct", psf.explode(psf.array(
        psf.struct(psf.col("city").alias("word"), psf.lit("city").alias("type")),
        psf.struct(psf.col("state").alias("word"), psf.lit("state").alias("type"))
    )))\
    .select("struct.*")
keywords_df.show()
+-----------+-----+
| word| type|
+-----------+-----+
| Seattle| city|
| Washington|state|
|Los Angeles| city|
| California|state|
+-----------+-----+
If your keywords didn't contain spaces, you could have split your sentences into words, exploded them to get one word per line, and then joined with your keywords dataframe. That's not possible here because of Los Angeles.
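For reference, a minimal sketch of that split-and-explode approach on a toy example with single-word keywords only (the data below is illustrative, not the question's):
import pyspark.sql.functions as psf

kw = spark.createDataFrame([("Seattle", "city"), ("Washington", "state")], ["word", "type"])
sentences = spark.createDataFrame([("Seattle is in Washington.", "a")], ["body", "source"])

# One word per row, then an exact join on the word
words = sentences.withColumn("word", psf.explode(psf.split(psf.col("body"), r"\W+")))
words.join(kw, on="word").select("word", "source", "type").show()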
text_df = sc.parallelize([["Seattle is in Washington.", "a"], ["Los Angeles is in California", "b"],
                          ["Banana is a fruit", "c"], ["Seattle is not in New Hampshire", "d"],
                          ["California is home to Los Angeles", "e"], ["Seattle, California is not a real place.", "f"]])\
    .toDF(["body", "source"])
Instead, we'll use a join with a string contains condition:
res = text_df.join(keywords_df, text_df.body.contains(keywords_df.word)).drop("body")
res.show()
+------+-----------+-----+
|source| word| type|
+------+-----------+-----+
| a| Seattle| city|
| a| Washington|state|
| b|Los Angeles| city|
| b| California|state|
| d| Seattle| city|
| f| Seattle| city|
| e|Los Angeles| city|
| e| California|state|
| f| California|state|
+------+-----------+-----+
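If you want the exact column names and order of the desired output, you can rename word to name and reorder the columns (reusing res and the psf alias from above):
res.select(psf.col("word").alias("name"), "source", "type").show(truncate=False)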
I'm a newbie in PySpark.
I have a Spark DataFrame df that has a column 'device_type'.
I want to replace every value of "Tablet" or "Phone" with "Mobile", and "PC" with "Desktop".
In Python I can do the following,
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict,inplace=False)
How can I achieve this using PySpark? Thanks!
You can use either na.replace:
df = spark.createDataFrame([
('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
df.na.replace(deviceDict, 1).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
or map literal:
from itertools import chain
from pyspark.sql.functions import create_map, lit
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])
df.select(mapping[df['device_type']].alias('device_type'))
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| null|
| null|
+-----------+
Please note that the latter solution will convert values not present in the mapping to NULL. If this is not a desired behavior you can add coalesce:
from pyspark.sql.functions import coalesce
df.select(
coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
)
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
After a lot of searching and trying alternatives, I think the simplest way to replace values using a Python dict is with the PySpark DataFrame method replace:
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df_replace = df.replace(deviceDict,subset=['device_type'])
This will replace all values using the dict. You can get the same results with df.na.replace() if you pass a dict argument combined with a subset argument. This isn't clear enough in the docs: if you search for the replace function you get two references, one in pyspark.sql.DataFrame.replace and the other in pyspark.sql.DataFrameNaFunctions.replace, but the sample code for both uses df.na.replace, so it is not obvious that you can actually use df.replace.
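For comparison, both spellings should behave the same here; a minimal sketch with the same deviceDict (unmatched values such as 'Other' and null are left untouched):
deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}

df.replace(deviceDict, subset=['device_type']).show()
df.na.replace(deviceDict, subset=['device_type']).show()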
Here is a little helper function, inspired by the R recode function, that abstracts the previous answers. As a bonus, it adds the option for a default value.
from itertools import chain
from pyspark.sql.functions import col, create_map, lit, when, isnull
from pyspark.sql.column import Column
df = spark.createDataFrame([
('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
df.show()
+-----------+
|device_type|
+-----------+
| Tablet|
| Phone|
| PC|
| Other|
| null|
+-----------+
Here is the definition of recode.
def recode(col_name, map_dict, default=None):
    if not isinstance(col_name, Column):  # Allows either column name string or column instance to be passed
        col_name = col(col_name)
    mapping_expr = create_map([lit(x) for x in chain(*map_dict.items())])
    if default is None:
        return mapping_expr.getItem(col_name)
    else:
        return when(~isnull(mapping_expr.getItem(col_name)), mapping_expr.getItem(col_name)).otherwise(default)
Creating a column without a default gives null/None in all unmatched values.
df.withColumn("device_type", recode('device_type', deviceDict)).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| null|
| null|
+-----------+
On the other hand, specifying a value for default replaces all unmatched values with this default.
df.withColumn("device_type", recode('device_type', deviceDict, default='Other')).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| Other|
+-----------+
You can do this using df.withColumn too:
from itertools import chain
from pyspark.sql.functions import create_map, lit
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
mapping_expr = create_map([lit(x) for x in chain(*deviceDict.items())])
df = df.withColumn('device_type', mapping_expr[df['device_type']])
df.show()
The simplest way to do it is to apply a udf to your dataframe:
from pyspark.sql.functions import col , udf
deviceDict = {'Tablet':'Mobile','Phone':'Mobile','PC':'Desktop'}
map_func = udf(lambda row : deviceDict.get(row,row))
df = df.withColumn("device_type", map_func(col("device_type")))
Another way of solving this is to use a CASE WHEN expression in traditional SQL, using an f-string together with the Python dictionary and .join to generate the CASE WHEN statement automatically:
import pyspark.sql.functions as F

column = 'device_type'  # column to replace

e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'"
       for k, v in deviceDict.items()])} ELSE {column} END"""

df.withColumn(column, F.expr(e)).show()
+-----------+
|device_type|
+-----------+
| Mobile|
| Mobile|
| Desktop|
| Other|
| null|
+-----------+
Note: if you want to return NULL where the keys do not match, just change ELSE {column} END to ELSE NULL END in the CASE statement for variable e:
column = 'device_type'  # column to replace

e = f"""CASE {' '.join([f"WHEN {column}='{k}' THEN '{v}'"
       for k, v in deviceDict.items()])} ELSE NULL END"""

df.withColumn('New_Col', F.expr(e)).show()
+-----------+-------+
|device_type|New_Col|
+-----------+-------+
| Tablet| Mobile|
| Phone| Mobile|
| PC|Desktop|
| Other| null|
| null| null|
+-----------+-------+
I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them:
from pyspark.sql.functions import randn, rand
df_1 = sqlContext.range(0, 10)
+--+
|id|
+--+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+--+
df_2 = sqlContext.range(11, 20)
+--+
|id|
+--+
|11|
|12|
|13|
|14|
|15|
|16|
|17|
|18|
|19|
+--+
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal_2"))
and now I want to generate a third dataframe. I would like something like pandas concat:
df_1.show()
+---+--------------------+--------------------+
| id| uniform| normal|
+---+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714|
| 1| 0.8642043127063618| 0.3900018344856156|
| 2| 0.8292577771850476| 1.8077401259195247|
| 3| 0.198558705368724| -0.4270585782850261|
| 4|0.012661361966674889| 0.702634599720141|
| 5| 0.8535692890157796|-0.42355804115129153|
| 6| 0.3723296190171911| 1.3789648582622995|
| 7| 0.9529794127670571| 0.16238718777444605|
| 8| 0.9746632635918108| 0.02448061333761742|
| 9| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
df_2.show()
+---+--------------------+--------------------+
| id| uniform| normal_2|
+---+--------------------+--------------------+
| 11| 0.3221262660507942| 1.0269298899109824|
| 12| 0.4030672316912547| 1.285648175568798|
| 13| 0.9690555459609131|-0.22986601831364423|
| 14|0.011913836266515876| -0.678915153834693|
| 15| 0.9359607054250594|-0.16557488664743034|
| 16| 0.45680471157575453| -0.3885563551710555|
| 17| 0.6411908952297819| 0.9161177183227823|
| 18| 0.5669232696934479| 0.7270125277020573|
| 19| 0.513622008243935| 0.7626741803250845|
+---+--------------------+--------------------+
#do some concatenation here, how?
df_concat.show()
+---+--------------------+--------------------+------------+
| id| uniform| normal| normal_2 |
+---+--------------------+--------------------+------------+
| 0| 0.8122802274304282| 1.2423430583597714| None |
| 1| 0.8642043127063618| 0.3900018344856156| None |
| 2| 0.8292577771850476| 1.8077401259195247| None |
| 3| 0.198558705368724| -0.4270585782850261| None |
| 4|0.012661361966674889| 0.702634599720141| None |
| 5| 0.8535692890157796|-0.42355804115129153| None |
| 6| 0.3723296190171911| 1.3789648582622995| None |
| 7| 0.9529794127670571| 0.16238718777444605| None |
| 8| 0.9746632635918108| 0.02448061333761742| None |
| 9| 0.513622008243935| 0.7626741803250845| None |
| 11| 0.3221262660507942| None | 0.123 |
| 12| 0.4030672316912547| None |0.12323 |
| 13| 0.9690555459609131| None |0.123 |
| 14|0.011913836266515876| None |0.18923 |
| 15| 0.9359607054250594| None |0.99123 |
| 16| 0.45680471157575453| None |0.123 |
| 17| 0.6411908952297819| None |1.123 |
| 18| 0.5669232696934479| None |0.10023 |
| 19| 0.513622008243935| None |0.916332123 |
+---+--------------------+--------------------+------------+
Is that possible?
Maybe you can try creating the missing columns and then calling union (unionAll for Spark 1.6 or lower):
from pyspark.sql.functions import lit
cols = ['id', 'uniform', 'normal', 'normal_2']
df_1_new = df_1.withColumn("normal_2", lit(None)).select(cols)
df_2_new = df_2.withColumn("normal", lit(None)).select(cols)
result = df_1_new.union(df_2_new)
# To remove the duplicates:
result = result.dropDuplicates()
df_concat = df_1.union(df_2)
The dataframes need to have the same set of columns for this to work, so you may have to use withColumn() first to create normal and normal_2 in both dataframes.
unionByName is a built-in option in Spark, available since Spark 2.3.0.
Since Spark 3.1.0 there is also an allowMissingColumns option (default False) to handle missing columns. With it, the union works even if the two dataframes don't have the same set of columns, setting the missing column values to null in the resulting dataframe.
df_1.unionByName(df_2, allowMissingColumns=True).show()
+---+--------------------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+--------------------+--------------------+--------------------+
| 0| 0.8122802274304282| 1.2423430583597714| null|
| 1| 0.8642043127063618| 0.3900018344856156| null|
| 2| 0.8292577771850476| 1.8077401259195247| null|
| 3| 0.198558705368724| -0.4270585782850261| null|
| 4|0.012661361966674889| 0.702634599720141| null|
| 5| 0.8535692890157796|-0.42355804115129153| null|
| 6| 0.3723296190171911| 1.3789648582622995| null|
| 7| 0.9529794127670571| 0.16238718777444605| null|
| 8| 0.9746632635918108| 0.02448061333761742| null|
| 9| 0.513622008243935| 0.7626741803250845| null|
| 11| 0.3221262660507942| null| 1.0269298899109824|
| 12| 0.4030672316912547| null| 1.285648175568798|
| 13| 0.9690555459609131| null|-0.22986601831364423|
| 14|0.011913836266515876| null| -0.678915153834693|
| 15| 0.9359607054250594| null|-0.16557488664743034|
| 16| 0.45680471157575453| null| -0.3885563551710555|
| 17| 0.6411908952297819| null| 0.9161177183227823|
| 18| 0.5669232696934479| null| 0.7270125277020573|
| 19| 0.513622008243935| null| 0.7626741803250845|
+---+--------------------+--------------------+--------------------+
You can use unionByName to do this:
df = df_1.unionByName(df_2)
unionByName is available since Spark 2.3.0.
To make it more generic and keep the columns of both df1 and df2:
import pyspark.sql.functions as F

# Keep all columns in either df1 or df2
def outter_union(df1, df2):
    # Add missing columns to df1
    left_df = df1
    for column in set(df2.columns) - set(df1.columns):
        left_df = left_df.withColumn(column, F.lit(None))

    # Add missing columns to df2
    right_df = df2
    for column in set(df1.columns) - set(df2.columns):
        right_df = right_df.withColumn(column, F.lit(None))

    # Make sure columns are ordered the same
    return left_df.union(right_df.select(left_df.columns))
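A quick usage sketch with the dataframes from the question:
result = outter_union(df_1, df_2)
result.show()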
To concatenate multiple pyspark dataframes into one:
from functools import reduce
reduce(lambda x,y:x.union(y), [df_1,df_2])
And you can replace [df_1, df_2] with a list of any length.
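Note that plain union requires the frames to share the same columns in the same order. If they don't, the same reduce pattern can be combined with unionByName and allowMissingColumns (Spark 3.1+); a sketch:
from functools import reduce

dfs = [df_1, df_2]  # any number of dataframes
df_concat = reduce(lambda left, right: left.unionByName(right, allowMissingColumns=True), dfs)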
Here is one way to do it, in case it is still useful: I ran this in the pyspark shell with Python 2.7.12 and Spark 2.0.1.
PS: I guess you meant to use different seeds for df_1 and df_2, and the code below reflects that.
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 10)
df_2 = sqlContext.range(11, 20)
df_1 = df_1.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"))
df_2 = df_2.select("id", rand(seed=11).alias("uniform"), randn(seed=28).alias("normal_2"))
def get_uniform(df1_uniform, df2_uniform):
    if df1_uniform:
        return df1_uniform
    if df2_uniform:
        return df2_uniform

u_get_uniform = F.udf(get_uniform, FloatType())

df_3 = df_1.join(df_2, on="id", how="outer")\
    .select("id",
            u_get_uniform(df_1["uniform"], df_2["uniform"]).alias("uniform"),
            "normal", "normal_2")\
    .orderBy(F.col("id"))
Here are the outputs I get:
df_1.show()
+---+-------------------+--------------------+
| id| uniform| normal|
+---+-------------------+--------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.7311719281896606| 0.8645537008427937|
| 2| 0.1982919638208397| 0.06157382353970104|
| 3|0.12714181165849525| 0.3623040918178586|
| 4| 0.7604318153406678|-0.49575204523675975|
| 5|0.12030715258495939| 1.0854146699817222|
| 6|0.12131363910425985| -0.5284523629183004|
| 7|0.44292918521277047| -0.4798519469521663|
| 8| 0.8898784253886249| -0.8820294772950535|
| 9|0.03650707717266999| -2.1591956435415334|
+---+-------------------+--------------------+
df_2.show()
+---+-------------------+--------------------+
| id| uniform| normal_2|
+---+-------------------+--------------------+
| 11| 0.1982919638208397| 0.06157382353970104|
| 12|0.12714181165849525| 0.3623040918178586|
| 13|0.12030715258495939| 1.0854146699817222|
| 14|0.12131363910425985| -0.5284523629183004|
| 15|0.44292918521277047| -0.4798519469521663|
| 16| 0.8898784253886249| -0.8820294772950535|
| 17| 0.2731073068483362|-0.15116027592854422|
| 18| 0.7784518091224375| -0.3785563841011868|
| 19|0.43776394586845413| 0.47700719174464357|
+---+-------------------+--------------------+
df_3.show()
+---+-----------+--------------------+--------------------+
| id| uniform| normal| normal_2|
+---+-----------+--------------------+--------------------+
| 0| 0.41371265| 0.5888539012978773| null|
| 1| 0.7311719| 0.8645537008427937| null|
| 2| 0.19829196| 0.06157382353970104| null|
| 3| 0.12714182| 0.3623040918178586| null|
| 4| 0.7604318|-0.49575204523675975| null|
| 5|0.120307155| 1.0854146699817222| null|
| 6| 0.12131364| -0.5284523629183004| null|
| 7| 0.44292918| -0.4798519469521663| null|
| 8| 0.88987845| -0.8820294772950535| null|
| 9|0.036507078| -2.1591956435415334| null|
| 11| 0.19829196| null| 0.06157382353970104|
| 12| 0.12714182| null| 0.3623040918178586|
| 13|0.120307155| null| 1.0854146699817222|
| 14| 0.12131364| null| -0.5284523629183004|
| 15| 0.44292918| null| -0.4798519469521663|
| 16| 0.88987845| null| -0.8820294772950535|
| 17| 0.27310732| null|-0.15116027592854422|
| 18| 0.7784518| null| -0.3785563841011868|
| 19| 0.43776396| null| 0.47700719174464357|
+---+-----------+--------------------+--------------------+
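As a design note (not part of the original answer), the UDF can probably be avoided with coalesce, which keeps everything in native Spark expressions and preserves the original column types; a sketch using the same df_1 and df_2:
import pyspark.sql.functions as F

df_3 = df_1.join(df_2, on="id", how="outer")\
    .select("id",
            F.coalesce(df_1["uniform"], df_2["uniform"]).alias("uniform"),
            "normal", "normal_2")\
    .orderBy("id")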
The answers above are very elegant. I wrote this function a while back, when I was also struggling to concatenate two dataframes with distinct columns.
Suppose you have dataframes sdf1 and sdf2:
from pyspark.sql import functions as F
from pyspark.sql.types import *
def unequal_union_sdf(sdf1, sdf2):
    s_df1_schema = set((x.name, x.dataType) for x in sdf1.schema)
    s_df2_schema = set((x.name, x.dataType) for x in sdf2.schema)
    for i, j in s_df2_schema.difference(s_df1_schema):
        sdf1 = sdf1.withColumn(i, F.lit(None).cast(j))
    for i, j in s_df1_schema.difference(s_df2_schema):
        sdf2 = sdf2.withColumn(i, F.lit(None).cast(j))
    common_schema_colnames = sdf1.columns
    sdk = sdf1.select(common_schema_colnames).union(sdf2.select(common_schema_colnames))
    return sdk
sdf_concat = unequal_union_sdf(sdf1, sdf2)
This should do it for you ...
from pyspark.sql.types import FloatType
from pyspark.sql.functions import randn, rand, lit, coalesce, col
import pyspark.sql.functions as F
df_1 = sqlContext.range(0, 6)
df_2 = sqlContext.range(3, 10)
df_1 = df_1.select("id", lit("old").alias("source"))
df_2 = df_2.select("id")
df_1.show()
df_2.show()
df_3 = df_1.alias("df_1").join(df_2.alias("df_2"), df_1.id == df_2.id, "outer")\
    .select([coalesce(df_1.id, df_2.id).alias("id")] +
            [col("df_1." + c) for c in df_1.columns if c != "id"])\
    .sort("id")
df_3.show()
I was trying to implement the pandas append functionality in pyspark, so I created a custom function that can concatenate 2 or more data frames even if they have a different number of columns. The only condition is that if the dataframes share a column name, its datatype should match.
I have written a custom function to merge 2 dataframes.
import pyspark.sql.functions as F

def append_dfs(df1, df2):
    list1 = df1.columns
    list2 = df2.columns
    for col in list2:
        if col not in list1:
            df1 = df1.withColumn(col, F.lit(None))
    for col in list1:
        if col not in list2:
            df2 = df2.withColumn(col, F.lit(None))
    return df1.unionByName(df2)
Usage:
Concatenate 2 dataframes:
final_df = append_dfs(df1, df2)
Concatenate more than 2 (say 3) dataframes:
final_df = append_dfs(append_dfs(df1, df2), df3)
Hope this is useful.
I would solve it this way:
from pyspark.sql import SparkSession
df_1.createOrReplaceTempView("tab_1")
df_2.createOrReplaceTempView("tab_2")
df_concat = spark.sql("""
    select tab_1.id, tab_1.uniform, tab_1.normal, tab_2.normal_2
    from tab_1 tab_1 left join tab_2 tab_2 on tab_1.uniform = tab_2.uniform
    union
    select tab_2.id, tab_2.uniform, tab_1.normal, tab_2.normal_2
    from tab_2 tab_2 left join tab_1 tab_1 on tab_1.uniform = tab_2.uniform
""")
df_concat.show()
Maybe you want to concatenate more than two DataFrames.
I found an approach that uses pandas DataFrame conversion.
Suppose you have 3 Spark DataFrames that you want to concatenate.
The code is the following:
import pandas as pd

list_dfs = []
list_dfs_ = []

df = spark.read.json('path_to_your_jsonfile.json', multiLine=True)
df2 = spark.read.json('path_to_your_jsonfile2.json', multiLine=True)
df3 = spark.read.json('path_to_your_jsonfile3.json', multiLine=True)

list_dfs.extend([df, df2, df3])

for df in list_dfs:
    df = df.select([column for column in df.columns]).toPandas()
    list_dfs_.append(df)

list_dfs.clear()

df_ = sqlContext.createDataFrame(pd.concat(list_dfs_))