PySpark: Add a new column with a tuple created from columns

PySpark: Add a new column with a tuple created from columns - python

Here I have a dateframe created as follow,
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
["Id","V1","V2","V3"])
It looks like
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
| a| 5| R| X|
| b| 7| G| S|
| c| 8| G| S|
+---+---+---+---+
I'm looking to add a column that is a tuple consisting of V1,V2,V3.
The result should look like
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
| a| 5| R| X|(5,R,X)|
| b| 7| G| S|(7,G,S)|
| c| 8| G| S|(8,G,S)|
+---+---+---+---+-------+
I've tried to use similar syntex as in Python but it didn't work:
df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
TypeError: zip argument #1 must support iteration.
Any help would be appreciated!

I'm coming from scala but I do believe that there's a similar way in python :
Using sql.functions package mehtod :
If you want to get a StructType with this three column use the struct(cols: Column*): Column method like this :
from pyspark.sql.functions import struct
df.withColumn("V_tuple",struct(df.V1,df.V2,df.V3))
but if you want to get it as a String you can use the concat(exprs: Column*): Column method like this :
from pyspark.sql.functions import concat
df.withColumn("V_tuple",concat(df.V1,df.V2,df.V3))
With this second method you may have to cast the columns into Strings
I'm not sure about the python syntax, Just edit the answer if there's a syntax error.
Hope this help you. Best Regards

Use struct:
from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))

Related

How to do a cummsum in a lambda call using PySpark

I am trying to replicate a code in Python using PySpark, and I found myself in a problem. So this is the code I am trying to replicate:
df_act = (df_act.assign(n_cycles = (lambda x: (x.cycles_bol != x.cycles_bol.shift(1)).cumsum())))
Keep in mind that I am working with a dataframe, and that cycles_bol is a column of dataframe "df_act".
and I simply can't. The closest I think I have gotten to the solution is the following:
df_act=df_act.withColumn(
"grp",
when(df_act['cycles_bol'] == lead("cycles_bol").over(Window.partitionBy("user_id").orderBy("timestamp")),0).otherwise(1).over(Window.orderBy("timestamp"))
).drop("grp").show()
Can anyone please help me?
Thanks in advance!

You dindt give much information
You have to orderBy,use lag to check if cycles_bol consecutives are the same and conditionally add. Use an existing column to orderBy if it wont change the order of cycles_bol. If you dont have such a column, generate one using monotonically_increasing function like I did.
df_act.withColumn('id', monotonically_increasing_id()).withColumn('n_cycles',sum(when(lag('cycles_bol').over(Window.orderBy('id'))!=col('cycles_bol'),1).otherwise(0)).over(Window.orderBy('id'))).drop('id').show()
+----------+--------+
|cycles_bol|n_cycles|
+----------+--------+
| A| 0|
| B| 1|
| B| 1|
| A| 2|
| B| 3|
| A| 4|
| A| 4|
| B| 5|
| C| 6|
+----------+--------+

Can the pySpark lag function reference itself?

I am looking for a way to grow a cumulative value in a column using the lag function in pySpark to first fetch the previous value in the column then add to it but tis is failing as presumably it can't find itself before it exists. Is there a way around this?

Maybe something like this you are looking for?
df = spark.createDataFrame(
[
('1',20),
('2',34),
('3',12)
], ['id','value'])
from pyspark.sql import Window as W
w = W.orderBy('id').rowsBetween(W.unboundedPreceding, 0)
df\
.withColumn('cumul_sum', F.sum(F.col('value')).over(w))\
.show()
+---+-----+---------+
| id|value|cumul_sum|
+---+-----+---------+
| 1| 20| 20|
| 2| 34| 54|
| 3| 12| 66|
+---+-----+---------+

Is there a way to add a column with range of values to a Spark Dataframe?

I have a spark dataframe: df1 as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
None of the columns in my dataframe contains unique values.
I tried to use F.monotonically_increasing_id()) but it is just producing random numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am worried about using windowing functions and generate row_numbers.
So, is there a way I can add a column with row_count to the dataframe that gives:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
If windowing function is the only way to implement, how can I make sure all the data comes under a single partition ?
or if there is a way to implement the same without using windowing functions, how to implement it ?
Any help is appreciated.

Use zipWithIndex.
I could not find code I did myself in the past yesterday as I was busy working on issues, but here is a good post that explains it. https://sqlandhadoop.com/pyspark-zipwithindex-example/
pyspark different to Scala.
Other answer not good for performance - going to single Executor. zipWithIndex is narrow transformation so it works per partition.
Here goes, you can tailor accordingly:
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F
df1 = spark.createDataFrame([ ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4') ], StringType())
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
| abc| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| abc| 4|
| 2| 5|
| 3| 6|
| 4| 7|
| abc| 8|
| 2| 9|
| 3| 10|
| 4| 11|
+-----+-----+

Assumption: This answer is based on the assumption that the order of col_id should depend on the age column. If the assumption does not hold true the other suggested solution is the in the questions comments mentioned zipWithIndex. An example usage of zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and the the row number to get the expected numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
'col_id',
F.row_number().over(windowSpec)
)
[EDIT] Add assumption of requirements and reference to alternative solution.

work with result of pyspark.sql.SQLContext in pyspark [duplicate]

Let's say I have spark dataframe
+--------+-----+
| letter|count|
+--------+-----+
| a| 2|
| b| 2|
| c| 1|
+--------+-----+
Then I wanted to find mean. So, I did
df = df.groupBy().mean('letter')
which give a dataframe
+------------------+
| avg(letter)|
+------------------+
|1.6666666666666667|
+------------------+
how can I hash it to get only value 1.6666666666666667 like df["avg(letter)"][0] in Pandas dataframe? Or any workaround to get 1.6666666666666667
Note: I need a float returned. Not a list nor dataframe.
Thank you

Take first:
>>> df.groupBy().mean('letter').first()[0]

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?

This was not obvious. I see no row-based sum of the columns defined in the spark Dataframes API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that names list to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With python's reduce, some knowledge of how operator overloading works, and the pyspark code for columns here that becomes:
def column_add(a,b):
return a.__add__(b)
newdf = df.withColumn('total_col',
reduce(column_add, ( df[col] for col in df.columns ) ))
Note this is a python reduce, not a spark RDD reduce, and the parenthesis term in the second parameter to reduce requires the parenthesis because it is a list generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
... return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]

The most straight forward way of doing it is to use the expr function
from pyspark.sql.functions import *
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))

The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by #Paul works. Nevertheless I was getting the error, as many other as I have seen,
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so the line imported the sum pyspark command while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal python sum function.
You can delete the reference of the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.

Summing multiple columns from a list into one column
PySpark's sum function doesn't support column addition.
This can be achieved using expr function.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.

My problem was similar to the above (bit more complex) as i had to add consecutive column sums as new columns in PySpark dataframe. This approach uses code from Paul's Version 1 above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df=spark.createDataFrame(data=[(1,2,3),(4,5,6),(3,2,1)\
,(6,1,-4),(0,2,-2),(6,4,1)\
,(4,5,2),(5,-3,-5),(6,4,-1)]\
,schema=['x1','x2','x3'])
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 3| 2| 1|
| 6| 1| -4|
| 0| 2| -2|
| 6| 4| 1|
| 4| 5| 2|
| 5| -3| -5|
| 6| 4| -1|
+---+---+---+
colnames=df.columns
add new columns that are cumulative sums (consecutive):
for i in range(0,len(colnames)):
colnameLst= colnames[0:i+1]
colname = 'cm'+ str(i+1)
df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 3| 6|
| 4| 5| 6| 4| 9| 15|
| 3| 2| 1| 3| 5| 6|
| 6| 1| -4| 6| 7| 3|
| 0| 2| -2| 0| 2| 0|
| 6| 4| 1| 6| 10| 11|
| 4| 5| 2| 4| 9| 11|
| 5| -3| -5| 5| 2| -3|
| 6| 4| -1| 6| 10| 9|
+---+---+---+---+---+---+
'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3

df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
| linha1| valor1| 2|
| linha2| valor2| 5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
| linha1| valor1| 2| 1.0|
| linha2| valor2| 5| 2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
| linha1| valor1| 2| 1.0| 3.0|
| linha2| valor2| 5| 2.5| 7.5|
+--------+--------+--------+--------------+------------+

A very simple approach would be to just use select instead of withcolumn as below:
df = df.select('*', (col("a")+col("b")+col('c).alias("total"))
This should give you required sum with minor changes based on requirements

The following approach works for me:
Import pyspark sql functions
from pyspark.sql import functions as F
Use F.expr(list_of_columns) data_frame.withColumn('Total_Sum',F.expr('col_name1+col_name2+..col_namen)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

PySpark: Add a new column with a tuple created from columns - python

Use struct: from pyspark.sql.functions import struct df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))

Related

How to do a cummsum in a lambda call using PySpark

Can the pySpark lag function reference itself?

Is there a way to add a column with range of values to a Spark Dataframe?

work with result of pyspark.sql.SQLContext in pyspark [duplicate]

Add column sum as new column in PySpark dataframe

Categories

Resources