PySpark: How to explode two columns of arrays - python

I have a DF in PySpark where I'm trying to explode two columns of arrays. Here's my DF:
+--------+-----+--------------------+--------------------+
| id| zip_| values| time|
+--------+-----+--------------------+--------------------+
|56434459|02138|[1.033990484, 1.0...|[1.624322475139E9...|
|56434508|02138|[1.04760919, 1.07...|[1.624322475491E9...|
|56434484|02138|[1.047177758, 1.0...|[1.62432247655E9,...|
|56434495|02138|[0.989590562, 1.0...|[1.624322476937E9...|
|56434465|02138|[1.051481754, 1.1...|[1.624322477275E9...|
|56434469|02138|[1.026476497, 1.1...|[1.624322477605E9...|
|56434463|02138|[1.10024864, 1.31...|[1.624322478085E9...|
|56434458|02138|[1.011091305, 1.0...|[1.624322478462E9...|
|56434464|02138|[1.038230333, 1.0...|[1.62432247882E9,...|
|56434474|02138|[1.041924752, 1.1...|[1.624322479386E9...|
|56434452|02138|[1.044482358, 1.1...|[1.624322479919E9...|
|56434445|02138|[1.050144598, 1.1...|[1.624322480344E9...|
|56434499|02138|[1.047851812, 1.0...|[1.624322480785E9...|
|56434449|02138|[1.044700917, 1.1...|[1.6243224811E9, ...|
|56434461|02138|[1.03341455, 1.07...|[1.624322481443E9...|
|56434526|02138|[1.04779412, 1.07...|[1.624322481861E9...|
|56434433|02138|[1.0498406, 1.139...|[1.624322482181E9...|
|56434507|02138|[1.0013894403, 1....|[1.624322482419E9...|
|56434488|02138|[1.047270063, 1.0...|[1.624322482716E9...|
|56434451|02138|[1.043182727, 1.1...|[1.624322483061E9...|
+--------+-----+--------------------+--------------------+
only showing top 20 rows
My current solution is to do a posexplode on each column, combined with a concat_ws for a unique ID, creating two DFs.
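Roughly, each posexplode + concat_ws step looks like this (the aliases here are illustrative, not the exact code I ran):
from pyspark.sql import functions as F
values_df = (df.select("id", "zip_", F.posexplode("values").alias("pos", "values_new"))
               .withColumn("new_id", F.concat_ws("_", F.col("id").cast("string"), F.col("pos").cast("string")))
               .select("new_id", "zip_", "values_new"))
time_df = (df.select("id", "zip_", F.posexplode("time").alias("pos", "time_new"))
             .withColumn("new_id", F.concat_ws("_", F.col("id").cast("string"), F.col("pos").cast("string")))
             .select("new_id", "zip_", "time_new"))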
First DF:
+-----------+-----+-----------+
| new_id| zip_| values_new|
+-----------+-----+-----------+
| 56434459_0|02138|1.033990484|
| 56434459_1|02138| 1.07805057|
| 56434459_2|02138| 1.09000133|
| 56434459_3|02138| 1.07009546|
| 56434459_4|02138|1.102403015|
| 56434459_5|02138| 1.1291009|
| 56434459_6|02138|1.088399924|
| 56434459_7|02138|1.047513142|
| 56434459_8|02138|1.010418795|
| 56434459_9|02138| 1.0|
|56434459_10|02138| 1.0|
|56434459_11|02138| 1.0|
|56434459_12|02138| 0.99048968|
|56434459_13|02138|0.984854524|
|56434459_14|02138| 1.0|
| 56434508_0|02138| 1.04760919|
| 56434508_1|02138| 1.07858897|
| 56434508_2|02138| 1.09084267|
| 56434508_3|02138| 1.07627785|
| 56434508_4|02138| 1.13778706|
+-----------+-----+-----------+
only showing top 20 rows
Second DF:
+-----------+-----+----------------+
| new_id| zip_| values_new|
+-----------+-----+----------------+
| 56434459_0|02138|1.624322475139E9|
| 56434459_1|02138|1.592786475139E9|
| 56434459_2|02138|1.561164075139E9|
| 56434459_3|02138|1.529628075139E9|
| 56434459_4|02138|1.498092075139E9|
| 56434459_5|02138|1.466556075139E9|
| 56434459_6|02138|1.434933675139E9|
| 56434459_7|02138|1.403397675139E9|
| 56434459_8|02138|1.371861675139E9|
| 56434459_9|02138|1.340325675139E9|
|56434459_10|02138|1.308703275139E9|
|56434459_11|02138|1.277167275139E9|
|56434459_12|02138|1.245631275139E9|
|56434459_13|02138|1.214095275139E9|
|56434459_14|02138|1.182472875139E9|
| 56434508_0|02138|1.624322475491E9|
| 56434508_1|02138|1.592786475491E9|
| 56434508_2|02138|1.561164075491E9|
| 56434508_3|02138|1.529628075491E9|
| 56434508_4|02138|1.498092075491E9|
+-----------+-----+----------------+
only showing top 20 rows
I then join the DFs on new_id, resulting in:
+------------+-----+----------------+-----+------------------+
| new_id| zip_| values_new| zip_| values_new|
+------------+-----+----------------+-----+------------------+
| 123957783_3|02138|1.527644029268E9|02138| 1.0|
| 125820702_3|02138|1.527643636531E9|02138| 1.013462378|
|165689784_12|02138|1.243647038288E9|02138|0.9283950599999999|
|165689784_14|02138|1.180488638288E9|02138| 1.011595547|
| 56424973_12|02138|1.245630256025E9|02138|0.9566622300000001|
| 56424989_14|02138|1.182471866886E9|02138| 1.0|
| 56425304_7|02138|1.403398444955E9|02138| 1.028527131|
| 56425386_6|02138|1.432949752808E9|02138| 1.08516484|
| 56430694_17|02138|1.087866094991E9|02138| 1.120045416|
| 56430700_20|02138| 9.61635686239E8|02138| 1.099920854|
| 56430856_13|02138|1.214097787512E9|02138| 0.989263804|
| 56430866_12|02138|1.245633801277E9|02138| 0.990684134|
| 56430875_10|02138|1.308705777269E9|02138| 1.0|
| 56430883_3|02138|1.529630585921E9|02138| 1.06920212|
| 56430987_13|02138|1.214100806414E9|02138| 0.978794644|
| 56431009_1|02138|1.592792025664E9|02138| 1.07923349|
| 56431013_9|02138|1.340331235566E9|02138| 1.0|
| 56431025_8|02138|1.371860189767E9|02138| 0.9477155|
| 56432373_13|02138|1.214092187852E9|02138| 0.994825498|
| 56432421_2|02138|1.561161037707E9|02138| 1.11343257|
+------------+-----+----------------+-----+------------------+
only showing top 20 rows
My question: Is there a more effective way to get the resultant DF? I tried doing two posexplodes in parallel but PySpark allows only one.

You can achieve it as follows:
from pyspark.sql.functions import col, explode, monotonically_increasing_id

df = (df.withColumn("values_new", explode(col("values")))
        .withColumn("times_new", explode(col("time")))
        .withColumn("id_new", monotonically_increasing_id()))

Related

Transpose specific columns to rows using python pyspark

I have here an example DF:
+---+---------+----+---------+---------+-------------+-------------+
|id | company |type|rev2016 | rev2017 | main2016 | main2017 |
+---+---------+----+---------+---------+-------------+-------------+
| 1 | google |web | 100 | 200 | 55 | 66 |
+---+---------+----+---------+---------+-------------+-------------+
And I want this output:
+---+---------+----+-------------+------+------+
|id | company |type| Metric | 2016 | 2017 |
+---+---------+----+-------------+------+------+
| 1 | google |web | rev | 100 | 200 |
| 1 | google |web | main | 55 | 66 |
+---+---------+----+-------------+------+------+
What I am trying to achieve is transposing the revenue and maintenance columns into rows with a new column 'Metric'. I have been trying pivot, but no luck so far.
You can construct an array of structs from the columns, and then explode the arrays and expand the structs to get the desired output.
import pyspark.sql.functions as F

struct_list = [
    F.struct(
        F.lit('rev').alias('Metric'),
        F.col('rev2016').alias('2016'),
        F.col('rev2017').alias('2017')
    ),
    F.struct(
        F.lit('main').alias('Metric'),
        F.col('main2016').alias('2016'),
        F.col('main2017').alias('2017')
    )
]

df2 = df.withColumn(
    'arr',
    F.explode(F.array(*struct_list))
).select('id', 'company', 'type', 'arr.*')

df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
Or you can use stack:
df2 = df.selectExpr(
    'id', 'company', 'type',
    "stack(2, 'rev', rev2016, rev2017, 'main', main2016, main2017) as (Metric, `2016`, `2017`)"
)
df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
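If there are more metrics or years than in this small example, the stack() expression can also be built from the column names instead of written by hand. A minimal sketch, assuming the metric prefixes and year suffixes below:
metrics = ['rev', 'main']   # assumed metric prefixes
years = ['2016', '2017']    # assumed year suffixes
stack_args = ", ".join(f"'{m}', " + ", ".join(f"{m}{y}" for y in years) for m in metrics)
out_cols = ", ".join(f"`{y}`" for y in years)
df2 = df.selectExpr(
    'id', 'company', 'type',
    f"stack({len(metrics)}, {stack_args}) as (Metric, {out_cols})"
)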

Compare two pyspark dataframes by joining

I have two PySpark dataframes with different numbers of rows. I am trying to compare the values in all the columns by joining these two dataframes on multiple keys, so I can find the records that have different values and the records that have the same values in these columns.
#df1:
+---+----+----+-----+
| id| age| sex|value|
+---+----+----+-----+
|  1|  23|   M|  8.4|
|  2|   4|   M|    2|
|  3|  16|   F|  4.1|
|  4|  60|   M|    4|
|  5|null|   F|    5|
+---+----+----+-----+
#df2:
+---+----+----+-----+
| id| age| sex|value|
+---+----+----+-----+
|  1|  23|   M|  8.4|
|  2|   4|null|    2|
|  4|  13|   M|  3.1|
|  5|  34|   F|  6.2|
+---+----+----+-----+
#joining df1 and df2 on multiple keys
same=df1.join(df2, on=['id','age','sex','value'], how='inner')
Please note that the dataframes above are just samples. My real data has around 25 columns and 100k+ rows, so when I tried to do the join, the Spark job took a long time and didn't finish.
Does anyone have good advice on comparing two dataframes and finding the records that have different values in these columns, either by joining or by other methods?
Use hashing.
from pyspark.sql.functions import hash
df1 = spark.createDataFrame(
    [('312312', '151132'), ('004312', '12232'), ('', '151132'),
     ('013vjhrr134', '111232'), (None, '151132'), ('0fsgdhfjgk', '151132')],
    ("Fruits", "Meat"))
df1 = df1.withColumn('hash_value', hash("Fruits", "Meat"))

df = spark.createDataFrame(
    [('312312', '151132'), ('000312', '151132'), ('', '151132'),
     ('013vjh134134', '151132'), (None, '151132'), ('0fsgdhfjgk', '151132')],
    ("Fruits", "Meat"))
df = df.withColumn('hash_value', hash("Fruits", "Meat"))
df.show()
+------------+------+-----------+
| Fruits| Meat| hash_value|
+------------+------+-----------+
| 312312|151132| -344340697|
| 000312|151132| -548650515|
| |151132|-2105905448|
|013vjh134134|151132| 2052362224|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+------------+------+-----------+
df1.show()
+-----------+------+-----------+
| Fruits| Meat| hash_value|
+-----------+------+-----------+
| 312312|151132| -344340697|
| 004312| 12232| 76821046|
| |151132|-2105905448|
|013vjhrr134|111232| 1289730088|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+-----------+------+-----------+
Or you can use SHA2 for the same purpose:
from pyspark.sql.functions import sha2, concat_ws
df1.withColumn("row_sha2", sha2(concat_ws("||", *df1.columns), 256)).show(truncate=False)
+-----------+------+----------------------------------------------------------------+
|Fruits |Meat |row_sha2 |
+-----------+------+----------------------------------------------------------------+
|312312 |151132|7be3824bcaa5fa29ad58df2587d392a1cc9ca5511ef01005be6f97c9558d1eed|
|004312 |12232 |c7fcf8031a17e5f3168297579f6dc8a6f17d7a4a71939d6b989ca783f30e21ac|
| |151132|68ea989b7d33da275a16ff897b0ab5a88bc0f4545ec22d90cee63244c1f00fb0|
|013vjhrr134|111232|9c9df63553d841463a803c64e3f4a8aed53bcdf78bf4a089a88af9e91406a226|
|null |151132|83de2d466a881cb4bb16b83665b687c01752044296079b2cae5bab8af93db14f|
|0fsgdhfjgk |151132|394631bbd1ccee841d3ba200806f8d0a51c66119b13575cf547f8cc91066c90d|
+-----------+------+----------------------------------------------------------------+
This will create a unique code for all your rows. Now join the two data frames and compare the hash values; if the values in a row are the same, the row will have the same hash value as well. You can then compare using joins:
df1.join(df, "hash_value", "inner").show()
+-----------+----------+------+----------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+----------+------+----------+------+
|-2105905448| |151132| |151132|
| -344340697| 312312|151132| 312312|151132|
| 598159392| null|151132| null|151132|
| 951458223|0fsgdhfjgk|151132|0fsgdhfjgk|151132|
+-----------+----------+------+----------+------+
df1.join(df, "hash_value", "outer").show()
+-----------+-----------+------+------------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+-----------+------+------------+------+
|-2105905448| |151132| |151132|
| -548650515| null| null| 000312|151132|
| -344340697| 312312|151132| 312312|151132|
| 76821046| 004312| 12232| null| null|
| 598159392| null|151132| null|151132|
| 951458223| 0fsgdhfjgk|151132| 0fsgdhfjgk|151132|
| 1289730088|013vjhrr134|111232| null| null|
| 2052362224| null| null|013vjh134134|151132|
+-----------+-----------+------+------------+------+
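To go from these joins to the records that actually differ, one option is an anti join on the hash column. A minimal sketch, assuming both frames carry the hash_value column computed above:
# rows of df1 whose full-row hash has no match in df (changed or missing rows)
diff_in_df1 = df1.join(df, "hash_value", "left_anti")
# rows of df whose full-row hash has no match in df1
diff_in_df = df.join(df1, "hash_value", "left_anti")
# rows that are identical in both frames
same_rows = df1.join(df, "hash_value", "left_semi")
diff_in_df1.show()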

New column with previous rows' values

I'm working with PySpark and I have a frame like this. This is my frame:
+---+-----+
| id|value|
+---+-----+
| 1| 65|
| 2| 66|
| 3| 65|
| 4| 68|
| 5| 71|
+---+-----+
and I want to generate a frame with PySpark like this:
+---+-----+-------------+
| id|value| prev_value |
+---+-----+-------------+
| 1 | 65 | null |
| 2 | 66 | 65 |
| 3 | 65 | 66,65 |
| 4 | 68 | 65,66,65 |
| 5 | 71 | 68,65,66,65 |
+---+-----+-------------+
Here is one way:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# define window and calculate "running total" of lagged value
win = Window.partitionBy().orderBy(f.col('id'))
df = df.withColumn('prev_value', f.collect_list(f.lag('value').over(win)).over(win))

# now define a udf to concatenate the lists
concat = f.udf(lambda x: 'null' if len(x) == 0 else ','.join([str(elt) for elt in x[::-1]]))
df = df.withColumn('prev_value', concat('prev_value'))
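A udf-free variant of the same idea, assuming Spark 2.4+ where array_join and reverse work on array columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# collect all strictly preceding values, newest first, and join them into one string
win = Window.orderBy('id').rowsBetween(Window.unboundedPreceding, -1)
prev = F.reverse(F.collect_list('value').over(win))
df = df.withColumn('prev_value', F.when(F.size(prev) == 0, F.lit(None)).otherwise(F.array_join(prev, ',')))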

Column with avg over previous day pyspark

I want to create a new column which is the mean of the previous day's sales, using PySpark.
Consider that these values are at different timestamps.
For example, convert this:
| Date | value |
|------------|-------|
| 2019/02/11 | 30 |
| 2019/02/11 | 40 |
| 2019/02/11 | 20 |
| 2019/02/12 | 10 |
| 2019/02/12 | 15 |
to this
| Date | value | avg |
|------------|-------|------|
| 2019/02/11 | 30 | null |
| 2019/02/11 | 40 | null |
| 2019/02/11 | 20 | null |
| 2019/02/12 | 10 | 30 |
| 2019/02/12 | 15 | 30 |
My thinking:
Use filter and an aggregation function to obtain the average, but it's throwing an error. Not sure where I am going wrong.
df = df.withColumn("avg",lit((df.filter(df["date"] == date_sub("date",1)).agg({"value": "avg"}))))
You can do that using window functions, but you have to create a new column to handle the dates.
I added a few lines to your example:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df.withColumn(
    "rnk",
    F.dense_rank().over(Window.partitionBy().orderBy("date"))
).withColumn(
    "avg",
    F.avg("value").over(Window.partitionBy().orderBy("rnk").rangeBetween(-1, -1))
).show()
+----------+-----+---+----+
| date|value|rnk| avg|
+----------+-----+---+----+
|2018-01-01| 20| 1|null|
|2018-01-01| 30| 1|null|
|2018-01-01| 40| 1|null|
|2018-01-02| 40| 2|30.0|
|2018-01-02| 30| 2|30.0|
|2018-01-03| 40| 3|35.0|
|2018-01-03| 40| 3|35.0|
+----------+-----+---+----+
You can also do that using aggregation:
agg_df = df.withColumn("date", F.date_add("date", 1)).groupBy('date').avg("value")
df.join(agg_df, how="full_outer", on="date").orderBy("date").show()
+----------+-----+----------+
| date|value|avg(value)|
+----------+-----+----------+
|2018-01-01| 20| null|
|2018-01-01| 30| null|
|2018-01-01| 40| null|
|2018-01-02| 30| 30.0|
|2018-01-02| 40| 30.0|
|2018-01-03| 40| 35.0|
|2018-01-03| 40| 35.0|
|2018-01-04| null| 40.0|
+----------+-----+----------+
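If the trailing 2018-01-04 row (which exists only because of the shifted date) is not wanted, a small variation is a left join from the original frame, which keeps only the original dates:
df.join(agg_df, how="left", on="date").orderBy("date").show()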
Step 0: Creating a DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg, lag
df = sqlContext.createDataFrame(
    [('2019/02/11', 30), ('2019/02/11', 40), ('2019/02/11', 20),
     ('2019/02/12', 10), ('2019/02/12', 15),
     ('2019/02/13', 10), ('2019/02/13', 20)],
    ['Date', 'value'])
Step 1: First calculate the average, then use a window function to get the lag by 1 day.
my_window = Window.partitionBy().orderBy('Date')
df_avg_previous = df.groupBy('Date').agg(avg(col('value')).alias('avg'))
df_avg_previous = df_avg_previous.withColumn('avg', lag(col('avg'),1).over(my_window))
df_avg_previous.show()
+----------+----+
| Date| avg|
+----------+----+
|2019/02/11|null|
|2019/02/12|30.0|
|2019/02/13|12.5|
+----------+----+
Step 2: Finally, join the two dataframes using a left join.
df = df.join(df_avg_previous, ['Date'],how='left').orderBy('Date')
df.show()
+----------+-----+----+
| Date|value| avg|
+----------+-----+----+
|2019/02/11| 40|null|
|2019/02/11| 20|null|
|2019/02/11| 30|null|
|2019/02/12| 10|30.0|
|2019/02/12| 15|30.0|
|2019/02/13| 10|12.5|
|2019/02/13| 20|12.5|
+----------+-----+----+

How do I flatten a PySpark dataframe?

I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3 | Device | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is the example dataframe from above. I use the following approach to solve your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can first define a udf to zip the two lists together for each row.
from pyspark.sql.types import *
from pyspark.sql.functions import col, udf, explode
zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, you can zip the two columns together and then explode the resulting column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
    withColumn("tmp", explode("tmp")).\
    select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
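As an aside for readers on newer Spark versions (the question targets 2.1.0, where this is not available): since Spark 2.4, arrays_zip can replace the udf, assuming the zipped struct fields are named after the input columns:
import pyspark.sql.functions as F
df_out = (df.withColumn("tmp", F.explode(F.arrays_zip("l1", "l2")))
            .select("id", F.col("tmp.l1").alias("Operation"), F.col("tmp.l2").alias("Value")))
df_out.show()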
If using the DataFrame API, then try this (only one generator is allowed per select clause, so posexplode each column and join on the position):
import pyspark.sql.functions as F
your_df.select("id", F.posexplode("Operation").alias("pos", "Operation")) \
    .join(your_df.select("id", F.posexplode("Value").alias("pos", "Value")), ["id", "pos"]) \
    .drop("pos").show()
