I have here an example DF:
+---+---------+----+---------+---------+-------------+-------------+
|id | company |type|rev2016 | rev2017 | main2016 | main2017 |
+---+---------+----+---------+---------+-------------+-------------+
| 1 | google |web | 100 | 200 | 55 | 66 |
+---+---------+----+---------+---------+-------------+-------------+
And I want this output:
+---+---------+----+-------------+------+------+
|id | company |type| Metric | 2016 | 2017 |
+---+---------+----+-------------+------+------+
| 1 | google |web | rev | 100 | 200 |
| 1 | google |web | main | 55 | 66 |
+---+---------+----+-------------+------+------+
What I am trying to achieve is transposing the revenue and maintenance columns into rows, with a new column 'Metric'. I have been trying pivoting, with no luck so far.
You can construct an array of structs from the columns, and then explode the array and expand the structs to get the desired output.
import pyspark.sql.functions as F

# One struct per metric, with the year columns renamed to '2016' and '2017'
struct_list = [
    F.struct(
        F.lit('rev').alias('Metric'),
        F.col('rev2016').alias('2016'),
        F.col('rev2017').alias('2017')
    ),
    F.struct(
        F.lit('main').alias('Metric'),
        F.col('main2016').alias('2016'),
        F.col('main2017').alias('2017')
    )
]

# Explode the array of structs into rows, then expand each struct into columns
df2 = df.withColumn(
    'arr',
    F.explode(F.array(*struct_list))
).select('id', 'company', 'type', 'arr.*')

df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
Or you can use stack:
df2 = df.selectExpr(
    'id', 'company', 'type',
    "stack(2, 'rev', rev2016, rev2017, 'main', main2016, main2017) as (Metric, `2016`, `2017`)"
)
df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
I have a DF in PySpark where I'm trying to explode two columns of arrays. Here's my DF:
+--------+-----+--------------------+--------------------+
| id| zip_| values| time|
+--------+-----+--------------------+--------------------+
|56434459|02138|[1.033990484, 1.0...|[1.624322475139E9...|
|56434508|02138|[1.04760919, 1.07...|[1.624322475491E9...|
|56434484|02138|[1.047177758, 1.0...|[1.62432247655E9,...|
|56434495|02138|[0.989590562, 1.0...|[1.624322476937E9...|
|56434465|02138|[1.051481754, 1.1...|[1.624322477275E9...|
|56434469|02138|[1.026476497, 1.1...|[1.624322477605E9...|
|56434463|02138|[1.10024864, 1.31...|[1.624322478085E9...|
|56434458|02138|[1.011091305, 1.0...|[1.624322478462E9...|
|56434464|02138|[1.038230333, 1.0...|[1.62432247882E9,...|
|56434474|02138|[1.041924752, 1.1...|[1.624322479386E9...|
|56434452|02138|[1.044482358, 1.1...|[1.624322479919E9...|
|56434445|02138|[1.050144598, 1.1...|[1.624322480344E9...|
|56434499|02138|[1.047851812, 1.0...|[1.624322480785E9...|
|56434449|02138|[1.044700917, 1.1...|[1.6243224811E9, ...|
|56434461|02138|[1.03341455, 1.07...|[1.624322481443E9...|
|56434526|02138|[1.04779412, 1.07...|[1.624322481861E9...|
|56434433|02138|[1.0498406, 1.139...|[1.624322482181E9...|
|56434507|02138|[1.0013894403, 1....|[1.624322482419E9...|
|56434488|02138|[1.047270063, 1.0...|[1.624322482716E9...|
|56434451|02138|[1.043182727, 1.1...|[1.624322483061E9...|
+--------+-----+--------------------+--------------------+
only showing top 20 rows
My current solution is to do a posexplode on each column, combined with a concat_ws for a unique ID, creating two DFs.
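In code, that per-column step looks roughly like the following. This is only a sketch of what is described above; the column and id names are taken from the outputs shown below.
from pyspark.sql import functions as F

# Sketch of the per-column step: posexplode one array, then build new_id
# from the original id and the element position.
values_df = (df.select("id", "zip_",
                       F.posexplode("values").alias("pos", "values_new"))
               .withColumn("new_id", F.concat_ws("_", F.col("id").cast("string"),
                                                 F.col("pos").cast("string")))
               .select("new_id", "zip_", "values_new"))

times_df = (df.select("id", "zip_",
                      F.posexplode("time").alias("pos", "times_new"))
              .withColumn("new_id", F.concat_ws("_", F.col("id").cast("string"),
                                                F.col("pos").cast("string")))
              .select("new_id", "zip_", "times_new"))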
First DF:
+-----------+-----+-----------+
| new_id| zip_| values_new|
+-----------+-----+-----------+
| 56434459_0|02138|1.033990484|
| 56434459_1|02138| 1.07805057|
| 56434459_2|02138| 1.09000133|
| 56434459_3|02138| 1.07009546|
| 56434459_4|02138|1.102403015|
| 56434459_5|02138| 1.1291009|
| 56434459_6|02138|1.088399924|
| 56434459_7|02138|1.047513142|
| 56434459_8|02138|1.010418795|
| 56434459_9|02138| 1.0|
|56434459_10|02138| 1.0|
|56434459_11|02138| 1.0|
|56434459_12|02138| 0.99048968|
|56434459_13|02138|0.984854524|
|56434459_14|02138| 1.0|
| 56434508_0|02138| 1.04760919|
| 56434508_1|02138| 1.07858897|
| 56434508_2|02138| 1.09084267|
| 56434508_3|02138| 1.07627785|
| 56434508_4|02138| 1.13778706|
+-----------+-----+-----------+
only showing top 20 rows
Second DF:
+-----------+-----+----------------+
| new_id| zip_| values_new|
+-----------+-----+----------------+
| 56434459_0|02138|1.624322475139E9|
| 56434459_1|02138|1.592786475139E9|
| 56434459_2|02138|1.561164075139E9|
| 56434459_3|02138|1.529628075139E9|
| 56434459_4|02138|1.498092075139E9|
| 56434459_5|02138|1.466556075139E9|
| 56434459_6|02138|1.434933675139E9|
| 56434459_7|02138|1.403397675139E9|
| 56434459_8|02138|1.371861675139E9|
| 56434459_9|02138|1.340325675139E9|
|56434459_10|02138|1.308703275139E9|
|56434459_11|02138|1.277167275139E9|
|56434459_12|02138|1.245631275139E9|
|56434459_13|02138|1.214095275139E9|
|56434459_14|02138|1.182472875139E9|
| 56434508_0|02138|1.624322475491E9|
| 56434508_1|02138|1.592786475491E9|
| 56434508_2|02138|1.561164075491E9|
| 56434508_3|02138|1.529628075491E9|
| 56434508_4|02138|1.498092075491E9|
+-----------+-----+----------------+
only showing top 20 rows
I then join the DFs on new_id, resulting in:
+------------+-----+----------------+-----+------------------+
| new_id| zip_| values_new| zip_| values_new|
+------------+-----+----------------+-----+------------------+
| 123957783_3|02138|1.527644029268E9|02138| 1.0|
| 125820702_3|02138|1.527643636531E9|02138| 1.013462378|
|165689784_12|02138|1.243647038288E9|02138|0.9283950599999999|
|165689784_14|02138|1.180488638288E9|02138| 1.011595547|
| 56424973_12|02138|1.245630256025E9|02138|0.9566622300000001|
| 56424989_14|02138|1.182471866886E9|02138| 1.0|
| 56425304_7|02138|1.403398444955E9|02138| 1.028527131|
| 56425386_6|02138|1.432949752808E9|02138| 1.08516484|
| 56430694_17|02138|1.087866094991E9|02138| 1.120045416|
| 56430700_20|02138| 9.61635686239E8|02138| 1.099920854|
| 56430856_13|02138|1.214097787512E9|02138| 0.989263804|
| 56430866_12|02138|1.245633801277E9|02138| 0.990684134|
| 56430875_10|02138|1.308705777269E9|02138| 1.0|
| 56430883_3|02138|1.529630585921E9|02138| 1.06920212|
| 56430987_13|02138|1.214100806414E9|02138| 0.978794644|
| 56431009_1|02138|1.592792025664E9|02138| 1.07923349|
| 56431013_9|02138|1.340331235566E9|02138| 1.0|
| 56431025_8|02138|1.371860189767E9|02138| 0.9477155|
| 56432373_13|02138|1.214092187852E9|02138| 0.994825498|
| 56432421_2|02138|1.561161037707E9|02138| 1.11343257|
+------------+-----+----------------+-----+------------------+
only showing top 20 rows
My question: Is there a more effective way to get the resultant DF? I tried doing two posexplodes in parallel, but PySpark allows only one generator per select clause.
You can achieve it as follows:
from pyspark.sql.functions import col, explode, monotonically_increasing_id

# Explode each array column and add an increasing id
df = (df.withColumn("values_new", explode(col("values")))
        .withColumn("times_new", explode(col("time")))
        .withColumn("id_new", monotonically_increasing_id()))
**Dataframe 1**
+----+--------+------------+
|key |dc_count|dc_day_count|
+----+--------+------------+
| 123|      13|          66|
| 124|      13|          12|
+----+--------+------------+
**rule Dataframe**
+----+-------------+--------------+--------+
|key |rule_dc_count|rule_day_count|rule_out|
+----+-------------+--------------+--------+
| 123|            2|            30|     139|
| 123|         null|          null|      64|
| 124|            2|            30|     139|
| 124|         null|          null|      64|
+----+-------------+--------------+--------+
if dc_count > rule_dc_count and dc_day_count > rule_day_count
    populate the corresponding rule_out
else
    populate the other rule_out
expected output
+----+--------+
|key |rule_out|
+----+--------+
| 123|     139|
| 124|      64|
+----+--------+
PySpark Version
The challenge here is to get the second row's value for a key in the same column. The LEAD() analytic function can be used to resolve this.
Create the DataFrames:
from pyspark.sql import functions as F

df = spark.createDataFrame([(123, 13, 66), (124, 13, 12)],
                           ["key", "dc_count", "dc_day_count"])
df1 = spark.createDataFrame([(123, 2, 30, 139), (123, 0, 0, 64), (124, 2, 30, 139), (124, 0, 0, 64)],
                            ["key", "rule_dc_count", "rule_day_count", "rule_out"])
Logic to get the Desired Result
from pyspark.sql import Window as W

# Pull the next row's rule_out (the fallback value) alongside each rule row
_w = W.partitionBy('key').orderBy(F.col('key').desc())
df1 = df1.withColumn('rn', F.lead('rule_out').over(_w))

df1 = df1.join(df, 'key', 'left')

# Pick rule_out when both conditions hold, otherwise fall back to the next row's value
df1 = df1.withColumn(
    'condition_col',
    F.when(
        (F.col('dc_count') > F.col('rule_dc_count')) &
        (F.col('dc_day_count') > F.col('rule_day_count')),
        F.col('rule_out')
    ).otherwise(F.col('rn'))
)
# Keep only the rows that actually have a lead value (one per key)
df1 = df1.filter(F.col('rn').isNotNull())
Output
df1.show()
+---+-------------+--------------+--------+---+--------+------------+-------------+
|key|rule_dc_count|rule_day_count|rule_out| rn|dc_count|dc_day_count|condition_col|
+---+-------------+--------------+--------+---+--------+------------+-------------+
|124| 2| 30| 139| 64| 13| 12| 64|
|123| 2| 30| 139| 64| 13| 66| 139|
+---+-------------+--------------+--------+---+--------+------------+-------------+
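To reduce this to just the requested key and rule_out columns, a final select (a hypothetical last step, not part of the original answer) would be:
df1.select('key', F.col('condition_col').alias('rule_out')).show()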
Assuming the expected output is:
+---+--------+
|key|rule_out|
+---+--------+
|123|139 |
+---+--------+
The below query should work:
spark.sql(
  """
    |SELECT
    |  t1.key, t2.rule_out
    |FROM table1 t1 JOIN table2 t2
    |  ON t1.key = t2.key
    |  AND t1.dc_count > t2.rule_dc_count
    |  AND t1.dc_day_count > t2.rule_day_count
  """.stripMargin)
  .show(false)
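Note that the query refers to table1 and table2, so the two dataframes need to be registered as temp views first. The call is the same in both the Scala and Python APIs; the dataframe names here are assumed from the PySpark answer above:
df.createOrReplaceTempView("table1")    # key, dc_count, dc_day_count
df1.createOrReplaceTempView("table2")   # key, rule_dc_count, rule_day_count, rule_out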
I'm working with PySpark and I have a dataframe like this:
+---+-----+
| id|value|
+---+-----+
| 1| 65|
| 2| 66|
| 3| 65|
| 4| 68|
| 5| 71|
+---+-----+
and I want to generate a dataframe like this with PySpark:
+---+-----+-------------+
| id|value| prev_value |
+---+-----+-------------+
| 1 | 65 | null |
| 2 | 66 | 65 |
| 3 | 65 | 66,65 |
| 4 | 68 | 65,66,65 |
| 5 | 71 | 68,65,66,65 |
+---+-----+-------------+
Here is one way:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

# define window and calculate "running total" of lagged value
win = Window.partitionBy().orderBy(f.col('id'))
df = df.withColumn('prev_value', f.collect_list(f.lag('value').over(win)).over(win))

# now define udf to concatenate the lists (most recent value first)
concat = f.udf(lambda x: 'null' if len(x) == 0 else ','.join(str(elt) for elt in x[::-1]))
df = df.withColumn('prev_value', concat('prev_value'))
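If you are on Spark 2.4 or later, the udf step can be replaced with built-in functions; a sketch, applied to the collected array column produced above (before the udf conversion):
# Assumes Spark 2.4+ (reverse on arrays). Reverse the collected list, cast the
# elements to strings, and join them; empty lists become the literal 'null'.
df = df.withColumn(
    'prev_value',
    f.when(f.size('prev_value') == 0, f.lit('null'))
     .otherwise(f.concat_ws(',', f.reverse(f.col('prev_value').cast('array<string>'))))
)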
I want to create a new column which is the mean of the previous day's sales, using PySpark.
Consider that these values are at different timestamps.
For example, convert this:
| Date | value |
|------------|-------|
| 2019/02/11 | 30 |
| 2019/02/11 | 40 |
| 2019/02/11 | 20 |
| 2019/02/12 | 10 |
| 2019/02/12 | 15 |
to this
| Date | value | avg |
|------------|-------|------|
| 2019/02/11 | 30 | null |
| 2019/02/11 | 40 | null |
| 2019/02/11 | 20 | null |
| 2019/02/12 | 10 | 30 |
| 2019/02/12 | 15 | 30 |
My thinking: use filter and an aggregation function to obtain the average, but it's throwing an error. Not sure where I am going wrong.
df = df.withColumn("avg", lit((df.filter(df["date"] == date_sub("date", 1)).agg({"value": "avg"}))))
You can do that using window functions, but you have to create a new column to handle the dates.
I added a few lines to your example:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "rnk",
    F.dense_rank().over(Window.partitionBy().orderBy("date"))
).withColumn(
    "avg",
    F.avg("value").over(Window.partitionBy().orderBy("rnk").rangeBetween(-1, -1))
).show()
+----------+-----+---+----+
| date|value|rnk| avg|
+----------+-----+---+----+
|2018-01-01| 20| 1|null|
|2018-01-01| 30| 1|null|
|2018-01-01| 40| 1|null|
|2018-01-02| 40| 2|30.0|
|2018-01-02| 30| 2|30.0|
|2018-01-03| 40| 3|35.0|
|2018-01-03| 40| 3|35.0|
+----------+-----+---+----+
You can also do that using aggregation:
# Shift each day's average forward by one day so it lines up with the next day's rows
agg_df = df.withColumn("date", F.date_add("date", 1)).groupBy('date').avg("value")
df.join(agg_df, how="full_outer", on="date").orderBy("date").show()
+----------+-----+----------+
| date|value|avg(value)|
+----------+-----+----------+
|2018-01-01| 20| null|
|2018-01-01| 30| null|
|2018-01-01| 40| null|
|2018-01-02| 30| 30.0|
|2018-01-02| 40| 30.0|
|2018-01-03| 40| 35.0|
|2018-01-03| 40| 35.0|
|2018-01-04| null| 40.0|
+----------+-----+----------+
Step 0: Creating a DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg, lag

df = sqlContext.createDataFrame([('2019/02/11', 30), ('2019/02/11', 40), ('2019/02/11', 20),
                                 ('2019/02/12', 10), ('2019/02/12', 15),
                                 ('2019/02/13', 10), ('2019/02/13', 20)], ['Date', 'value'])
Step 1: First calculate the average, then use a window function to get the lag by one day.
my_window = Window.partitionBy().orderBy('Date')
df_avg_previous = df.groupBy('Date').agg(avg(col('value')).alias('avg'))
df_avg_previous = df_avg_previous.withColumn('avg', lag(col('avg'),1).over(my_window))
df_avg_previous.show()
+----------+----+
| Date| avg|
+----------+----+
|2019/02/11|null|
|2019/02/12|30.0|
|2019/02/13|12.5|
+----------+----+
Step 2: Finally, join the two dataframes using a left join.
df = df.join(df_avg_previous, ['Date'],how='left').orderBy('Date')
df.show()
+----------+-----+----+
| Date|value| avg|
+----------+-----+----+
|2019/02/11| 40|null|
|2019/02/11| 20|null|
|2019/02/11| 30|null|
|2019/02/12| 10|30.0|
|2019/02/12| 15|30.0|
|2019/02/13| 10|12.5|
|2019/02/13| 20|12.5|
+----------+-----+----+
I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3 | Device | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is an example dataframe from above; I use this solution to answer your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can define a udf to zip the two lists together for each row first.
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import col, udf, explode

zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, you can zip the two columns together and then explode the resulting column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2'))\
    .withColumn("tmp", explode("tmp"))\
    .select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
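For what it's worth, on Spark 2.4 or later (the question targets 2.1.0, where this is not available) the udf can be avoided with arrays_zip; a minimal sketch on the same example dataframe:
import pyspark.sql.functions as F

# Assumes Spark 2.4+: arrays_zip pairs l1[i] with l2[i], so a single explode suffices.
df_out = (df.withColumn("tmp", F.explode(F.arrays_zip("l1", "l2")))
            .select("id",
                    F.col("tmp.l1").alias("Operation"),
                    F.col("tmp.l2").alias("Value")))
df_out.show()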
If using the DataFrame API, then try this:
import pyspark.sql.functions as F
your_df.select("id", F.explode("Operation"), F.explode("Value")).show()