How do I flatten a PySpark dataframe? - python

I have a spark dataframe like this:
id | Operation                            | Value
---|--------------------------------------|------------------------------
1  | [Date_Min, Date_Max, Device]         | [148590, 148590, iphone]
2  | [Date_Min, Date_Max, Review]         | [148590, 148590, Good]
3  | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad, samsung]
The result that I expect:
id | Operation | Value
----------------------
1  | Date_Min  | 148590
1  | Date_Max  | 148590
1  | Device    | iphone
2  | Date_Min  | 148590
2  | Date_Max  | 148590
2  | Review    | Good
3  | Date_Min  | 148590
3  | Date_Max  | 148590
3  | Review    | Bad
3  | Device    | samsung
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks

Here is an example dataframe built from the data above, which I use to work through your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can first define a UDF that zips the two lists together for each row.
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import col, udf, explode

# UDF that pairs the two arrays element-wise into an array of structs
zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, you can zip the two columns together and then explode the resulting column.
df_out = (df
    .withColumn("tmp", zip_list('l1', 'l2'))
    .withColumn("tmp", explode("tmp"))
    .select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value')))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+

If you are using the DataFrame API, then try this:
import pyspark.sql.functions as F
your_df.select("id", F.explode("Operation"), F.explode("Value")).show()
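Note that Spark generally allows only one generator (explode) per select clause, so the snippet above can raise an AnalysisException. On Spark 2.4+ (the question targets 2.1.0, hence the UDF answer above), a UDF-free sketch using arrays_zip would look roughly like this:
import pyspark.sql.functions as F

# a sketch assuming Spark 2.4+: arrays_zip pairs the two arrays element-wise,
# so a single explode keeps Operation and Value aligned
df_out = (df
    .select("id", F.explode(F.arrays_zip("l1", "l2")).alias("tmp"))
    .select("id",
            F.col("tmp.l1").alias("Operation"),
            F.col("tmp.l2").alias("Value")))
df_out.show()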

Related

PySpark: How to explode two columns of arrays

I have a DF in PySpark where I'm trying to explode two columns of arrays. Here's my DF:
+--------+-----+--------------------+--------------------+
| id| zip_| values| time|
+--------+-----+--------------------+--------------------+
|56434459|02138|[1.033990484, 1.0...|[1.624322475139E9...|
|56434508|02138|[1.04760919, 1.07...|[1.624322475491E9...|
|56434484|02138|[1.047177758, 1.0...|[1.62432247655E9,...|
|56434495|02138|[0.989590562, 1.0...|[1.624322476937E9...|
|56434465|02138|[1.051481754, 1.1...|[1.624322477275E9...|
|56434469|02138|[1.026476497, 1.1...|[1.624322477605E9...|
|56434463|02138|[1.10024864, 1.31...|[1.624322478085E9...|
|56434458|02138|[1.011091305, 1.0...|[1.624322478462E9...|
|56434464|02138|[1.038230333, 1.0...|[1.62432247882E9,...|
|56434474|02138|[1.041924752, 1.1...|[1.624322479386E9...|
|56434452|02138|[1.044482358, 1.1...|[1.624322479919E9...|
|56434445|02138|[1.050144598, 1.1...|[1.624322480344E9...|
|56434499|02138|[1.047851812, 1.0...|[1.624322480785E9...|
|56434449|02138|[1.044700917, 1.1...|[1.6243224811E9, ...|
|56434461|02138|[1.03341455, 1.07...|[1.624322481443E9...|
|56434526|02138|[1.04779412, 1.07...|[1.624322481861E9...|
|56434433|02138|[1.0498406, 1.139...|[1.624322482181E9...|
|56434507|02138|[1.0013894403, 1....|[1.624322482419E9...|
|56434488|02138|[1.047270063, 1.0...|[1.624322482716E9...|
|56434451|02138|[1.043182727, 1.1...|[1.624322483061E9...|
+--------+-----+--------------------+--------------------+
only showing top 20 rows
My current solution is to do a posexplode on each column, combined with a concat_ws for a unique ID, creating two DFs.
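Roughly, that step looks like this for the values column (a sketch, with column names taken from the DF above):
import pyspark.sql.functions as F

# sketch of the posexplode + concat_ws step described above
first_df = (df
    .select("id", "zip_", F.posexplode("values").alias("pos", "values_new"))
    .withColumn("new_id", F.concat_ws("_", "id", "pos"))
    .select("new_id", "zip_", "values_new"))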
First DF:
+-----------+-----+-----------+
| new_id| zip_| values_new|
+-----------+-----+-----------+
| 56434459_0|02138|1.033990484|
| 56434459_1|02138| 1.07805057|
| 56434459_2|02138| 1.09000133|
| 56434459_3|02138| 1.07009546|
| 56434459_4|02138|1.102403015|
| 56434459_5|02138| 1.1291009|
| 56434459_6|02138|1.088399924|
| 56434459_7|02138|1.047513142|
| 56434459_8|02138|1.010418795|
| 56434459_9|02138| 1.0|
|56434459_10|02138| 1.0|
|56434459_11|02138| 1.0|
|56434459_12|02138| 0.99048968|
|56434459_13|02138|0.984854524|
|56434459_14|02138| 1.0|
| 56434508_0|02138| 1.04760919|
| 56434508_1|02138| 1.07858897|
| 56434508_2|02138| 1.09084267|
| 56434508_3|02138| 1.07627785|
| 56434508_4|02138| 1.13778706|
+-----------+-----+-----------+
only showing top 20 rows
Second DF:
+-----------+-----+----------------+
| new_id| zip_| values_new|
+-----------+-----+----------------+
| 56434459_0|02138|1.624322475139E9|
| 56434459_1|02138|1.592786475139E9|
| 56434459_2|02138|1.561164075139E9|
| 56434459_3|02138|1.529628075139E9|
| 56434459_4|02138|1.498092075139E9|
| 56434459_5|02138|1.466556075139E9|
| 56434459_6|02138|1.434933675139E9|
| 56434459_7|02138|1.403397675139E9|
| 56434459_8|02138|1.371861675139E9|
| 56434459_9|02138|1.340325675139E9|
|56434459_10|02138|1.308703275139E9|
|56434459_11|02138|1.277167275139E9|
|56434459_12|02138|1.245631275139E9|
|56434459_13|02138|1.214095275139E9|
|56434459_14|02138|1.182472875139E9|
| 56434508_0|02138|1.624322475491E9|
| 56434508_1|02138|1.592786475491E9|
| 56434508_2|02138|1.561164075491E9|
| 56434508_3|02138|1.529628075491E9|
| 56434508_4|02138|1.498092075491E9|
+-----------+-----+----------------+
only showing top 20 rows
I then join the DFs on new_id, resulting in:
+------------+-----+----------------+-----+------------------+
| new_id| zip_| values_new| zip_| values_new|
+------------+-----+----------------+-----+------------------+
| 123957783_3|02138|1.527644029268E9|02138| 1.0|
| 125820702_3|02138|1.527643636531E9|02138| 1.013462378|
|165689784_12|02138|1.243647038288E9|02138|0.9283950599999999|
|165689784_14|02138|1.180488638288E9|02138| 1.011595547|
| 56424973_12|02138|1.245630256025E9|02138|0.9566622300000001|
| 56424989_14|02138|1.182471866886E9|02138| 1.0|
| 56425304_7|02138|1.403398444955E9|02138| 1.028527131|
| 56425386_6|02138|1.432949752808E9|02138| 1.08516484|
| 56430694_17|02138|1.087866094991E9|02138| 1.120045416|
| 56430700_20|02138| 9.61635686239E8|02138| 1.099920854|
| 56430856_13|02138|1.214097787512E9|02138| 0.989263804|
| 56430866_12|02138|1.245633801277E9|02138| 0.990684134|
| 56430875_10|02138|1.308705777269E9|02138| 1.0|
| 56430883_3|02138|1.529630585921E9|02138| 1.06920212|
| 56430987_13|02138|1.214100806414E9|02138| 0.978794644|
| 56431009_1|02138|1.592792025664E9|02138| 1.07923349|
| 56431013_9|02138|1.340331235566E9|02138| 1.0|
| 56431025_8|02138|1.371860189767E9|02138| 0.9477155|
| 56432373_13|02138|1.214092187852E9|02138| 0.994825498|
| 56432421_2|02138|1.561161037707E9|02138| 1.11343257|
+------------+-----+----------------+-----+------------------+
only showing top 20 rows
My question: Is there a more effective way to get the resultant DF? I tried doing two posexplodes in parallel but PySpark allows only one.
You can achieve it as follows:
from pyspark.sql.functions import col, explode, monotonically_increasing_id

df = (df.withColumn("values_new", explode(col("values")))
        .withColumn("times_new", explode(col("time")))
        .withColumn("id_new", monotonically_increasing_id()))
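If the two arrays need to stay aligned element by element (as in the posexplode approach above), one alternative on Spark 2.4+ is to zip them before exploding; the position also reproduces the new_id suffix:
import pyspark.sql.functions as F

# a sketch assuming Spark 2.4+: zip the arrays so values and time stay paired
# by position, explode once, and rebuild new_id from the position
df_out = (df
    .select("id", "zip_", F.posexplode(F.arrays_zip("values", "time")).alias("pos", "z"))
    .select(F.concat_ws("_", "id", "pos").alias("new_id"),
            "zip_",
            F.col("z.values").alias("values_new"),
            F.col("z.time").alias("times_new")))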

Transpose specific columns to rows using python pyspark

I have here an example DF:
+---+---------+----+---------+---------+-------------+-------------+
|id | company |type|rev2016 | rev2017 | main2016 | main2017 |
+---+---------+----+---------+---------+-------------+-------------+
| 1 | google |web | 100 | 200 | 55 | 66 |
+---+---------+----+---------+---------+-------------+-------------+
And I want this output:
+---+---------+----+-------------+------+------+
|id | company |type| Metric | 2016 | 2017 |
+---+---------+----+-------------+------+------+
| 1 | google |web | rev | 100 | 200 |
| 1 | google |web | main | 55 | 66 |
+---+---------+----+-------------+------+------+
What I am trying to achieve is transposing the revenue and maintenance columns to rows, with a new column 'Metric'. I have tried pivoting, with no luck so far.
You can construct an array of structs from the columns, and then explode the arrays and expand the structs to get the desired output.
import pyspark.sql.functions as F
struct_list = [
    F.struct(
        F.lit('rev').alias('Metric'),
        F.col('rev2016').alias('2016'),
        F.col('rev2017').alias('2017')
    ),
    F.struct(
        F.lit('main').alias('Metric'),
        F.col('main2016').alias('2016'),
        F.col('main2017').alias('2017')
    )
]
df2 = df.withColumn(
    'arr',
    F.explode(F.array(*struct_list))
).select('id', 'company', 'type', 'arr.*')
df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+
Or you can use stack:
df2 = df.selectExpr(
'id', 'company', 'type',
"stack(2, 'rev', rev2016, rev2017, 'main', main2016, main2017) as (Metric, `2016`, `2017`)"
)
df2.show()
+---+-------+----+------+----+----+
| id|company|type|Metric|2016|2017|
+---+-------+----+------+----+----+
| 1| google| web| rev| 100| 200|
| 1| google| web| main| 55| 66|
+---+-------+----+------+----+----+

How to loop one dataframe over another dataframe and get the single matching record in PySpark

Dataframe 1
+-----+--------+------------+
| key |dc_count|dc_day_count|
+-----+--------+------------+
| 123 |      13|          66|
| 124 |      13|          12|
+-----+--------+------------+
Rule Dataframe
+-----+-------------+--------------+--------+
| key |rule_dc_count|rule_day_count|rule_out|
+-----+-------------+--------------+--------+
| 123 |            2|            30|     139|
| 123 |         null|          null|      64|
| 124 |            2|            30|     139|
| 124 |         null|          null|      64|
+-----+-------------+--------------+--------+
if dc_count > rule_dc_count and dc_day_count > rule_day_count:
    populate the corresponding rule_out
else:
    populate the other rule_out
Expected output:
+-----+--------+
| key |rule_out|
+-----+--------+
| 123 |     139|
| 124 |      64|
+-----+--------+
PySpark Version
The challenge here is to get the second row's value for a key in the same column. To resolve this, the LEAD() analytical function can be used.
Create the DataFrames here:
from pyspark.sql import functions as F

df = spark.createDataFrame([(123, 13, 66), (124, 13, 12)],
                           ["key", "dc_count", "dc_day_count"])
df1 = spark.createDataFrame([(123, 2, 30, 139), (123, 0, 0, 64), (124, 2, 30, 139), (124, 0, 0, 64)],
                            ["key", "rule_dc_count", "rule_day_count", "rule_out"])
Logic to get the Desired Result
from pyspark.sql import Window as W

_w = W.partitionBy('key').orderBy(F.col('key').desc())
df1 = df1.withColumn('rn', F.lead('rule_out').over(_w))
df1 = df1.join(df, 'key', 'left')
df1 = df1.withColumn(
    'condition_col',
    F.when(
        (F.col('dc_count') > F.col('rule_dc_count')) &
        (F.col('dc_day_count') > F.col('rule_day_count')), F.col('rule_out'))
     .otherwise(F.col('rn')))
df1 = df1.filter(F.col('rn').isNotNull())
Output
df1.show()
+---+-------------+--------------+--------+---+--------+------------+-------------+
|key|rule_dc_count|rule_day_count|rule_out| rn|dc_count|dc_day_count|condition_col|
+---+-------------+--------------+--------+---+--------+------------+-------------+
|124| 2| 30| 139| 64| 13| 12| 64|
|123| 2| 30| 139| 64| 13| 66| 139|
+---+-------------+--------------+--------+---+--------+------------+-------------+
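To reduce this to the expected key/rule_out shape, a final projection along these lines should work (a sketch based on the columns above):
# keep one row per key and rename the chosen value to rule_out
result = df1.select('key', F.col('condition_col').alias('rule_out')).dropDuplicates(['key'])
result.show()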
Assuming the expected output is:
+---+--------+
|key|rule_out|
+---+--------+
|123|139 |
+---+--------+
The query below should work:
# assuming df and df1 are registered as the temp views table1 and table2
df.createOrReplaceTempView("table1")
df1.createOrReplaceTempView("table2")

spark.sql("""
    SELECT t1.key, t2.rule_out
    FROM table1 t1
    JOIN table2 t2
      ON t1.key = t2.key
     AND t1.dc_count > t2.rule_dc_count
     AND t1.dc_day_count > t2.rule_day_count
""").show(truncate=False)

New column with previous rows value

I'm working with PySpark and I have a dataframe like this:
+---+-----+
| id|value|
+---+-----+
| 1| 65|
| 2| 66|
| 3| 65|
| 4| 68|
| 5| 71|
+---+-----+
and I want to generate a frame with PySpark like this:
+---+-----+-------------+
| id|value| prev_value |
+---+-----+-------------+
| 1 | 65 | null |
| 2 | 66 | 65 |
| 3 | 65 | 66,65 |
| 4 | 68 | 65,66,65 |
| 5 | 71 | 68,65,66,65 |
+---+-----+-------------+
Here is one way:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# define window and calculate "running total" of lagged value
win = Window.partitionBy().orderBy(f.col('id'))
df = df.withColumn('prev_value', f.collect_list(f.lag('value').over(win)).over(win))

# now define udf to concatenate the lists (most recent value first)
concat = f.udf(lambda x: 'null' if len(x) == 0 else ','.join([str(elt) for elt in x[::-1]]))
df = df.withColumn('prev_value', concat('prev_value'))
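On Spark 2.4+, a UDF-free variant of the same idea uses the built-in reverse and array_join functions (a sketch; note the first row gets a real null instead of the string 'null'):
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# a sketch assuming Spark 2.4+: reverse and array_join replace the Python UDF
win = Window.partitionBy().orderBy('id')
prev = f.collect_list(f.lag('value').over(win)).over(win)
df = df.withColumn(
    'prev_value',
    f.when(f.size(prev) == 0, f.lit(None).cast('string'))
     .otherwise(f.array_join(f.reverse(prev), ',')))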

Python Spark implementing map-reduce algorithm to create (column, value) tuples

UPDATE(04/20/17):
I am using Apache Spark 2.1.0 and I will be using Python.
I have narrowed down the problem and hopefully someone more knowledgeable with Spark can answer. I need to create an RDD of tuples from the header of the values.csv file:
values.csv (main collected data, very large):
+--------+---+---+---+---+---+----+
| ID | 1 | 2 | 3 | 4 | 9 | 11 |
+--------+---+---+---+---+---+----+
| | | | | | | |
| abc123 | 1 | 2 | 3 | 1 | 0 | 1 |
| | | | | | | |
| aewe23 | 4 | 5 | 6 | 1 | 0 | 2 |
| | | | | | | |
| ad2123 | 7 | 8 | 9 | 1 | 0 | 3 |
+--------+---+---+---+---+---+----+
output (RDD):
+----------+----------+----------+----------+----------+----------+----------+
| abc123 | (1;1) | (2;2) | (3;3) | (4;1) | (9;0) | (11;1) |
| | | | | | | |
| aewe23 | (1;4) | (2;5) | (3;6) | (4;1) | (9;0) | (11;2) |
| | | | | | | |
| ad2123 | (1;7) | (2;8) | (3;9) | (4;1) | (9;0) | (11;3) |
+----------+----------+----------+----------+----------+----------+----------+
What happened was I paired each value with the column name of that value in the format:
(column_number, value)
raw format (if you are interested in working with it):
id,1,2,3,4,9,11
abc123,1,2,3,1,0,1
aewe23,4,5,6,1,0,2
ad2123,7,8,9,1,0,3
The Problem:
The example values.csv file contains only a few columns, but in the actual file there are thousands of columns. I can extract the header and broadcast it to every node in the distributed environment, but I am not sure if that is the most efficient way to solve the problem. Is it possible to achieve the output with a parallelized header?
I think you can achieve the solution using a PySpark DataFrame too. However, my solution is not optimal yet. I use split to get the new column names and the corresponding columns to sum. This depends on how large your key_list is. If it's too large, this might not work well because you have to load key_list into memory (using collect).
import pandas as pd
import pyspark.sql.functions as func

# example data
values = spark.createDataFrame(pd.DataFrame(
    [['abc123', 1, 2, 3, 1, 0, 1],
     ['aewe23', 4, 5, 6, 1, 0, 2],
     ['ad2123', 7, 8, 9, 1, 0, 3]],
    columns=['id', '1', '2', '3', '4', '9', '11']))
key_list = spark.createDataFrame(pd.DataFrame(
    [['a', '1'],
     ['b', '2;4'],
     ['c', '3;9;11']],
    columns=['key', 'cols']))
# use values = spark.read.csv(path_to_csv, header=True) for your data

# split the cols string into a list of column names, then collect it to the driver
key_list_df = key_list.select('key', func.split('cols', ';').alias('col'))
key_list_rdd = key_list_df.rdd.collect()

# for each key, sum the referenced columns that exist in values
for row in key_list_rdd:
    values = values.withColumn(row.key, sum(values[c] for c in row.col if c in values.columns))
keys = [row.key for row in key_list_rdd]
output_df = values.select(keys)
Output
output_df.show(n=3)
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 4|
| 4| 6| 8|
| 7| 9| 12|
+---+---+---+
