I currently have a dataframe df
id | c1 | c2 | c3 |
1 | diff | same | diff
2 | same | same | same
3 | diff | same | same
4 | same | same | same
I want my output to look like
name| diff | same
c1 | 2 | 2
c2 | 0 | 4
c3 | 1 | 3
When I try:
df.groupby('c2').pivot('c2').count() -> transformation A
|f2 | diff | same |
|same | null | 2
|diff | 2 | null
I'm assuming I need to write a loop for each column and pass it through transformation A?
But I'm having issues getting transformation A right.
Please help
Pivot is an expensive shuffle operation and should be avoided if possible. Try using this logic with arrays_zip and explode to dynamically collapse columns and groupby-aggregate.
from pyspark.sql import functions as F
df.withColumn("cols", F.explode(F.arrays_zip(F.array([F.array(F.col(x),F.lit(x))\
for x in df.columns if x!='id']))))\
.withColumn("name", F.col("cols.0")[1]).withColumn("val", F.col("cols.0")[0]).drop("cols")\
.groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()
#+----+----+----+
#|name|diff|same|
#+----+----+----+
#| c1| 2| 2|
#| c2| 0| 4|
#| c3| 1| 3|
#+----+----+----+
You can also do this by building a map column dynamically (column name -> value) and exploding it.
from pyspark.sql import functions as F
from itertools import chain
df.withColumn("cols", F.create_map(*(chain(*[(F.lit(name), F.col(name))\
for name in df.columns if name!='id']))))\
.select(F.explode("cols").alias("name","val"))\
.groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()
#+----+----+----+
#|name|diff|same|
#+----+----+----+
#| c1| 2| 2|
#| c2| 0| 4|
#| c3| 1| 3|
#+----+----+----+
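Both answers above follow the same two steps: turn every non-id column into (name, val) rows, then count the 'diff' and 'same' values per name. As a rough illustration of that logic only, here is a plain-Python sketch on the sample data from the question. It does not use Spark, and the helper name unpivot_counts is made up for this example.

```python
from collections import Counter

# Sample rows from the question: (id, c1, c2, c3)
columns = ["id", "c1", "c2", "c3"]
rows = [
    (1, "diff", "same", "diff"),
    (2, "same", "same", "same"),
    (3, "diff", "same", "same"),
    (4, "same", "same", "same"),
]

def unpivot_counts(columns, rows, id_col="id"):
    """Hypothetical helper: count 'diff'/'same' per column name."""
    counts = {name: Counter() for name in columns if name != id_col}
    for row in rows:
        for name, val in zip(columns, row):
            if name != id_col:
                # each (name, val) pair corresponds to one exploded row
                counts[name][val] += 1
    # a Counter yields 0 for a value that never occurred
    return [(name, c["diff"], c["same"]) for name, c in sorted(counts.items())]

result = unpivot_counts(columns, rows)
print(result)  # [('c1', 2, 2), ('c2', 0, 4), ('c3', 1, 3)]
```

The zero for c2/diff appears here for the same reason it does in the Spark versions: the conditional count simply finds no matching rows, whereas the pivot attempt in the question produced null.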
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,'diff','same','diff'),(2,'same','same','same'),(3,'diff','same','same'),(4,'same','same','same')],['idcol','C1','C2','C3'])
df.createOrReplaceTempView("MyTable")
#spark.sql("select * from MyTable").collect()
x1=spark.sql("select idcol, 'C1' AS col, C1 from MyTable union all select idcol, 'C2' AS col, C2 from MyTable union all select idcol, 'C3' AS col, C3 from MyTable")
#display(x1)
x2=x1.groupBy('col').pivot('C1').agg(count('C1')).orderBy('col')
display(x2)
I have a Spark DataFrame, say df, to which I need to apply a GroupBy col1, aggregate by maximum value of col2 and pass the corresponding value of col3 (which has nothing to do with the groupBy or the aggregation). It is best to illustrate it with an example.
df.show()
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| 1| 500| 10 |
| 1| 600| 11 |
| 1| 700| 12 |
| 2| 600| 14 |
| 2| 800| 15 |
| 2| 650| 17 |
+-----+-----+-----+
I can easily perform the groupBy and the aggregation to obtain the maximum value of each group in col2, using
import pyspark.sql.functions as F
df1 = df.groupBy("col1").agg(
    F.max("col2").alias('Max_col2'))
# show() returns None, so call it separately to keep df1 as a DataFrame
df1.show()
+-----+---------+
| col1| Max_col2|
+-----+---------+
| 1| 700|
| 2| 800|
+-----+---------+
However, what I am struggling with is how to also carry along the value of col3 from the row that holds the maximum, to obtain the following table:
+-----+---------+-----+
| col1| Max_col2| col3|
+-----+---------+-----+
| 1| 700| 12 |
| 2| 800| 15 |
+-----+---------+-----+
Does anyone know how this can be done?
Many thanks in advance,
Marioanzas
You can aggregate the maximum of a struct, and then expand the struct:
import pyspark.sql.functions as F
df2 = df.groupBy('col1').agg(
    F.max(F.struct('col2', 'col3')).alias('col')
).select('col1', 'col.*')
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 700| 12|
| 2| 800| 15|
+----+----+----+
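The reason this works, as far as I understand it, is that structs are compared field by field from left to right, so the maximum struct is the one with the largest col2, and col3 just comes along with it. Python tuples compare the same way, so the idea can be sketched without Spark. The function name below is made up for illustration.

```python
# Rows from the question: (col1, col2, col3)
rows = [(1, 500, 10), (1, 600, 11), (1, 700, 12),
        (2, 600, 14), (2, 800, 15), (2, 650, 17)]

def max_struct_per_group(rows):
    """Hypothetical helper: per col1, keep the greatest (col2, col3) pair."""
    best = {}
    for col1, col2, col3 in rows:
        pair = (col2, col3)  # plays the role of struct('col2', 'col3')
        if col1 not in best or pair > best[col1]:
            best[col1] = pair
    return [(key, *pair) for key, pair in sorted(best.items())]

result = max_struct_per_group(rows)
print(result)  # [(1, 700, 12), (2, 800, 15)]
```

One consequence of the left-to-right comparison: if two rows in a group tie on col2, the one with the larger col3 wins. The field you want to maximise has to come first in the struct.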
df1:
+---+------+
| id| code|
+---+------+
| 1|[A, F]|
| 2| [G]|
| 3| [A]|
+---+------+
df2:
+--------+----+
| col1|col2|
+--------+----+
| Apple| A|
| Google| G|
|Facebook| F|
+--------+----+
I want df3 to look like this, built from the columns of df1 and df2:
+---+------+-----------------+
| id| code| changed|
+---+------+-----------------+
| 1|[A, F]|[Apple, Facebook]|
| 2| [G]| [Google]|
| 3| [A]| [Apple]|
+---+------+-----------------+
I know this can be achieved if the code column is NOT an ARRAY. I don't know how to iterate over the code array for this purpose.
Try:
import pyspark.sql.functions as f
res=(df1
    # one row per (id, code) pair
    .select(f.col("id"), f.explode(f.col("code")).alias("code"))
    # look up the name for each code
    .join(df2, f.col("code")==df2.col2)
    .groupBy("id")
    # rebuild the arrays per id
    .agg(f.collect_list(f.col("code")).alias("code"), f.collect_list(f.col("col1")).alias("changed"))
)
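To make the three steps (explode, join, collect) easier to follow, here is a plain-Python sketch of the same logic on the sample data. It is only a model of the idea, not Spark code, and map_codes is a made-up name.

```python
df1_rows = [(1, ["A", "F"]), (2, ["G"]), (3, ["A"])]
df2_rows = [("Apple", "A"), ("Google", "G"), ("Facebook", "F")]

def map_codes(df1_rows, df2_rows):
    """Hypothetical helper: translate each code array via the lookup rows."""
    lookup = {code: name for name, code in df2_rows}  # col2 -> col1
    out = []
    for id_, codes in df1_rows:
        # "explode" + inner join: codes without a match are dropped
        matched = [(code, lookup[code]) for code in codes if code in lookup]
        if matched:  # an id with no matching code vanishes, as in an inner join
            out.append((id_, [c for c, _ in matched], [n for _, n in matched]))
    return out

result = map_codes(df1_rows, df2_rows)
print(result)
```

Note that this sketch keeps the original element order, while in Spark the order of values returned by collect_list after a shuffle is not guaranteed, so the arrays may come back in a different order than in df1.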
I have a pyspark dataframe like this,
+----------+--------+
|id_ | p |
+----------+--------+
| 1 | A |
| 1 | B |
| 1 | B |
| 1 | A |
| 1 | A |
| 1 | B |
| 2 | C |
| 2 | C |
| 2 | C |
| 2 | A |
| 2 | A |
| 2 | C |
+----------+--------+
I want to create another column, computed within each group of id_. At the moment the column is built in pandas with this code:
sample.groupby(by=['id_'], group_keys=False).apply(lambda grp : grp['p'].ne(grp['p'].shift()).cumsum())
How can I do this on a PySpark DataFrame?
Currently I am doing this with the help of a pandas UDF, which runs very slowly.
What are the alternatives?
Expected column will be like this,
1
2
2
3
3
4
1
1
1
2
2
3
You can use a combination of a UDF and window functions to achieve your result:
# required imports
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
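# define a window per id_, which we will use to calculate lag values, so that
# the first row of each group is never compared with the previous group.
# Note: ordering by id_ alone does not fix the row order inside a group; order
# by a column that reflects the intended row order if your data has one.
w = Window.partitionBy('id_').orderBy(F.col('id_'))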
# define user defined function (udf) to perform calculation on each row
def f(lag_val, current_val):
    if lag_val != current_val:
        return 1
    return 0
# register udf so we can use with our dataframe
func_udf = F.udf(f, IntegerType())
# read csv file
df = spark.read.csv('/path/to/file.csv', header=True)
# create new column with lag on window we created earlier, apply udf on lagged
# and current value and then apply window function again to calculate cumsum
df.withColumn("new_column", func_udf(F.lag("p").over(w), df['p'])).withColumn('cumsum', F.sum('new_column').over(w.partitionBy(F.col('id_')).rowsBetween(Window.unboundedPreceding, 0))).show()
+---+---+----------+------+
|id_| p|new_column|cumsum|
+---+---+----------+------+
| 1| A| 1| 1|
| 1| B| 1| 2|
| 1| B| 0| 2|
| 1| A| 1| 3|
| 1| A| 0| 3|
| 1| B| 1| 4|
| 2| C| 1| 1|
| 2| C| 0| 1|
| 2| C| 0| 1|
| 2| A| 1| 2|
| 2| A| 0| 2|
| 2| C| 1| 3|
+---+---+----------+------+
# where:
# w.partitionBy : to partition by id_ column
# w.rowsBetween : to specify frame boundaries
# ref https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/expressions/Window.html#rowsBetween-long-long-
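For reference, the lag-then-cumulative-sum idea can be checked in plain Python against the expected column from the question. This is only a sketch of the logic; it assumes the rows are already in the intended order, and change_counter is a made-up name.

```python
from itertools import groupby

# (id_, p) rows from the question, in the order shown there
rows = [(1, "A"), (1, "B"), (1, "B"), (1, "A"), (1, "A"), (1, "B"),
        (2, "C"), (2, "C"), (2, "C"), (2, "A"), (2, "A"), (2, "C")]

_MISSING = object()  # stands in for the null lag value on a group's first row

def change_counter(rows):
    """Hypothetical helper: running count of value changes within each id_."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        prev, total = _MISSING, 0
        for _, p in group:
            if p != prev:      # the "new_column" flag: 1 when the value changed
                total += 1     # the cumulative sum over the group
            prev = p
            out.append(total)
    return out

result = change_counter(rows)
print(result)  # [1, 2, 2, 3, 3, 4, 1, 1, 1, 2, 2, 3]
```

The counter restarts for every id_, which is why the lag has to be taken within each group rather than across the whole DataFrame.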
I would like to assign each group in a groupby a unique id number starting from 0 or 1 and incrementing by 1 for each group using pyspark.
I have done this previously using pandas with python with the command:
df['id_num'] = (df
.groupby('column_name')
.grouper
.group_info[0])
A toy example of the input and desired output is:
Input
+------+
|object|
+------+
|apple |
|orange|
|pear |
|berry |
|apple |
|pear |
|berry |
+------+
Output
+------+--+
|object|id|
+------+--+
|apple |1 |
|orange|2 |
|pear |3 |
|berry |4 |
|apple |1 |
|pear |3 |
|berry |4 |
+------+--+
I am not sure whether the order of the ids is important. If it is not, you can use the dense_rank window function in this case:
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+------+
|object|
+------+
| apple|
|orange|
| pear|
| berry|
| apple|
| pear|
| berry|
+------+
>>>
>>> df.withColumn("id", F.dense_rank().over(Window.orderBy(df.object))).show()
+------+---+
|object| id|
+------+---+
| apple| 1|
| apple| 1|
| berry| 2|
| berry| 2|
|orange| 3|
| pear| 4|
| pear| 4|
+------+---+
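What dense_rank does here can be modelled in a few lines of plain Python: sort the distinct values and number them from 1. This is only a sketch of the logic (dense_rank_ids is a made-up name), but it shows why the ids follow alphabetical order rather than order of first appearance as in the question's desired output.

```python
objects = ["apple", "orange", "pear", "berry", "apple", "pear", "berry"]

def dense_rank_ids(values):
    """Hypothetical helper: id = position of the value among sorted distinct values."""
    rank = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [(v, rank[v]) for v in values]

result = dense_rank_ids(objects)
print(result)
# [('apple', 1), ('orange', 3), ('pear', 4), ('berry', 2),
#  ('apple', 1), ('pear', 4), ('berry', 2)]
```

One thing to keep in mind on the Spark side: a window with orderBy but no partitionBy moves all rows into a single partition, so this may be slow on large data.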
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
values = [('apple',),('orange',),('pear',),('berry',),('apple',),('pear',),('berry',)]
df = sqlContext.createDataFrame(values,['object'])
#Creating a column of distinct elements and converting them into dictionary with unique indexes.
df1 = df.distinct()
distinct_list = list(df1.select('object').toPandas()['object'])
dict_with_index = {distinct_list[i]:i+1 for i in range(len(distinct_list))}
#Applying the mapping of dictionary.
mapping_expr = create_map([lit(x) for x in chain(*dict_with_index.items())])
df=df.withColumn("id", mapping_expr.getItem(col("object")))
df.show()
+------+---+
|object| id|
+------+---+
| apple| 2|
|orange| 1|
| pear| 3|
| berry| 4|
| apple| 2|
| pear| 3|
| berry| 4|
+------+---+
I have two DataFrames that were read from two CSV files.
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 30|
| 2|9090909093| 30|
| 3|9090909090| 30|
| 4|9090909094| 30|
+---+----------+-----------------+
and
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 40|
| 2|9090909093| 50|
| 3|9090909090| 60|
| 4|9090909094| 70|
+---+----------+-----------------+
I am trying to join these two DataFrames on the NUMBER column using the PySpark code dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner'), and a new DataFrame is generated as follows.
+----------+---+-----------------+---+-----------------+
| NUMBER | ID| RECHARGE_AMOUNT| ID| RECHARGE_AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092| 1| 30| 1| 40|
|9090909093| 2| 30| 2| 50|
|9090909090| 3| 30| 3| 60|
|9090909094| 4| 30| 4| 70|
+----------+---+-----------------+---+-----------------+
But I am not able to write this DataFrame to a file, because the joined DataFrame has duplicate column names. I am using the following code:
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output',header = 'true')
Is there any way to avoid duplicate columns after a join in Spark? My PySpark code is given below.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("test1").getOrCreate()
files = ["/home/user/test1.txt", "/home/user/test2.txt"]
dfFinal = spark.read.load(files[0],format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
dfFinal.show()
for i in range(1,len(files)):
    df2 = spark.read.load(files[i],format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
    df2.show()
    dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner')
dfFinal.show()
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output',header = 'true')
I need to generate unique column names, i.e. if I give two files in the files array with the same columns, it should generate the following.
+----------+---+----------------+---+----------------+
|  NUMBER  |IDx|RECHARGE_AMOUNTx|IDy|RECHARGE_AMOUNTy|
+----------+---+----------------+---+----------------+
|9090909092|  1|              30|  1|              40|
|9090909093|  2|              30|  2|              50|
|9090909090|  3|              30|  3|              60|
|9090909094|  4|              30|  4|              70|
+----------+---+----------------+---+----------------+
In pandas I can use the suffixes argument, as shown below:
dfFinal = dfFinal.merge(df2,left_on='NUMBER',right_on='NUMBER',how='inner',suffixes=('x', 'y'),sort=True)
which generates the above DataFrame. Is there any way I can replicate this in PySpark?
You can select the columns from each DataFrame and alias them, like this:
dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner') \
.select('NUMBER',
dfFinal.ID.alias('ID_1'),
dfFinal.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_1'),
df2.ID.alias('ID_2'),
df2.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_2'))
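Since the question joins files in a loop, listing every alias by hand may not scale. One possible variation is to compute the suffixed names up front and rename each DataFrame before joining. The helper below is plain Python and only builds the name lists; suffixed_names is a made-up name, and the column lists are taken from the question.

```python
def suffixed_names(columns, key, suffix):
    """Hypothetical helper: append suffix to every column except the join key."""
    return [c if c == key else f"{c}{suffix}" for c in columns]

columns = ["ID", "NUMBER", "RECHARGE_AMOUNT"]
left = suffixed_names(columns, "NUMBER", "_1")
right = suffixed_names(columns, "NUMBER", "_2")

# Columns of the joined result: the key once, then the rest of both sides
joined = ["NUMBER"] + [c for c in left + right if c != "NUMBER"]
print(joined)
# ['NUMBER', 'ID_1', 'RECHARGE_AMOUNT_1', 'ID_2', 'RECHARGE_AMOUNT_2']
```

With real DataFrames, something like df2 = df2.toDF(*suffixed_names(df2.columns, 'NUMBER', '_2')) before the join should give the same effect, with the suffix derived from the loop index for each file. I have not run that against the asker's files, so treat it as a sketch.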