Compare two columns to create a new column in Spark DataFrame - python

I have a Spark DataFrame that has 2 columns, I am trying to create a new column using the other two columns with the when otherwise operation.
df_newcol = df.withColumn("Flag", when(col("a") <= lit(ratio1) | col("b") <= lit(ratio1), 1).otherwise(2))
But this throws an error
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I have used when and otherwise previously with one column, while using it with multiple columns do we have to write the logic differently.
Thanks.

You have an operator precedence issue, make sure you put comparison operators in parenthesis when the comparison is mixed with logical operators such as & and |, with which being fixed, you don't even need lit, a scalar should work as well:
import pyspark.sql.functions as F
df = spark.createDataFrame([[1, 2], [2, 3], [3, 4]], ['a', 'b'])
Both of the following should work:
df.withColumn('flag', F.when((F.col("a") <= F.lit(2)) | (F.col("b") <= F.lit(2)), 1).otherwise(2)).show()
+---+---+----+
| a| b|flag|
+---+---+----+
| 1| 2| 1|
| 2| 3| 1|
| 3| 4| 2|
+---+---+----+
df.withColumn('flag', F.when((F.col("a") <= 2) | (F.col("b") <= 2), 1).otherwise(2)).show()
+---+---+----+
| a| b|flag|
+---+---+----+
| 1| 2| 1|
| 2| 3| 1|
| 3| 4| 2|
+---+---+----+

Related

How to iterate over a pyspark dataframe and create a dictionary out of it

I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
'time': [1,2,3,4,1,2,3,4],
'col': ['1','2','1','2','3','2','3','2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
| a| 1| 1|
| a| 2| 2|
| a| 3| 1|
| a| 4| 2|
| b| 1| 3|
| b| 2| 2|
| b| 3| 3|
| b| 4| 2|
+---+----+---+
I would like to iterate over all ids and obtain a python dictionary that would have as keys the id and as values the col and would look like this:
foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']})
I have in total 10k ids and around 10m rows in foo, so I am looking for an efficient implementation.
Any ideas ?
It's a pandas dataframe. You should checkout the documentaton. The dataframe object has inbuilt methods to help iterate, slice and dice your data. There is also this fun tool to help you visualize what is going on.
pandas has a ready-made method to convert a dataframe to a dict.

Enumerate groups of consecutive equal values in PySpark

I'm trying to uniquely label consecutive rows with equal values in a PySpark dataframe. In Pandas, one could do this quite simply with:
s = pd.Series([1,1,1,2,2,1,1,3])
s.ne(s.shift()).cumsum()
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 4
dtype: int64
How could this be done in PySpark? Setup -
from pyspark.sql.types import IntegerType
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
mySchema = StructType([StructField("col1", IntegerType(), True)])
df_sp = spark.createDataFrame(s.to_frame(), schema=mySchema)
I've found slightly related questions such as this one, but none of them about this same scenario.
I'm thinking a good starting point could be to find the first differences as in this answer
I've come up with a solution. The idea is similar to what is done in Pandas. We start by adding an unique identifier column, over which we'll compute the lagged column (using over here is necessary since it is a window function).
We then compare the column of interest with the lagged column and take the cumulative sum of the result cast to int:
mySchema = StructType([StructField("col1", IntegerType(), True)])
df_sp = spark.createDataFrame(s.to_frame(), schema=mySchema)
win = Window.orderBy("id")
df_sp = (df_sp.withColumn("id", f.monotonically_increasing_id())
.withColumn("col1_shift", f.lag("col1", offset=1, default=0).over(win))
.withColumn("col1_shift_ne", (f.col("col1") != f.col("col1_shift")).cast("int"))
.withColumn("col1_shift_ne_cumsum", f.sum("col1_shift_ne").over(win))
.drop(*['id','col1_shift', 'col1_shift_ne']))
df_sp.show()
---+--------------------+
|col1|col1_shift_ne_cumsum|
+----+--------------------+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 2|
| 2| 2|
| 1| 3|
| 1| 3|
| 3| 4|
+----+--------------------+
Another way of solving this would be using a rangebetween and using unbounded preceeding sum after comparing the lag:
from pyspark.sql import functions as F, Window as W
w1 = W.orderBy(F.monotonically_increasing_id())
w2 = W.orderBy(F.monotonically_increasing_id()).rangeBetween(W.unboundedPreceding,0)
cond = F.col("col1") != F.lag("col1").over(w1)
df_sp.withColumn("col1_shift_ne_cumsum",F.sum(F.when(cond,1).otherwise(0)).over(w2)+1).show()
+----+--------------------+
|col1|col1_shift_ne_cumsum|
+----+--------------------+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 2|
| 2| 2|
| 1| 3|
| 1| 3|
| 3| 4|
+----+--------------------+

Summing values across each row as boolean (PySpark)

I currently have a PySpark dataframe that has many columns populated by integer counts. Many of these columns have counts of zero. I would like to find a way to sum how many columns have counts greater than zero.
In other words, I would like an approach that sums values across a row, where all the columns for a given row are effectively boolean (although the datatype conversion may not be necessary). Several columns in my table are datetime or string, so ideally I would have an approach that first selects the numeric columns.
Current Dataframe example and Desired Output
+---+---------- +----------+------------
|USER| DATE |COUNT_COL1| COUNT_COL2|... DESIRED COLUMN
+---+---------- +----------+------------
| b | 7/1/2019 | 12 | 1 | 2 (2 columns are non-zero)
| a | 6/9/2019 | 0 | 5 | 1
| c | 1/1/2019 | 0 | 0 | 0
Pandas: As an example, in pandas this can be accomplished by selecting the numeric columns,converting to bool and summing with the axis=1. I am looking for a PySpark equivalent.
test_cols=list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)
For numericals, you can do it by creating an array of all the columns with the integer values(using df.dtypes), and then use higher order functions. In this case I used filter to get rid of all 0s, and then used size to get the number of all non zero elements per row.(spark2.4+)
from pyspark.sql import functions as F
df.withColumn("arr", F.array(*[F.col(i[0]) for i in df.dtypes if i[1] in ['int','bigint']]))\
.withColumn("DESIRED COLUMN", F.expr("""size(filter(arr,x->x!=0))""")).drop("arr").show()
#+----+--------+----------+----------+--------------+
#|USER| DATE|COUNT_COL1|COUNT_COL2|DESIRED COLUMN|
#+----+--------+----------+----------+--------------+
#| b|7/1/2019| 12| 1| 2|
#| a|6/9/2019| 0| 5| 1|
#| c|1/1/2019| 0| 0| 0|
#+----+--------+----------+----------+--------------+
Let's say you have below df:
df.show()
df.printSchema()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| a| 1| 2| 3|
| a| 0| 2| 1|
| a| 0| 0| 1|
| a| 0| 0| 0|
+---+---+---+---+
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Using case when statement you can check if column is numeric and then if it is larger than 0. In the next step f.size will return count thanks to f.array_remove which left only cols with True value.
from pyspark.sql import functions as f
cols = [f.when(f.length(f.regexp_replace(f.col(x), '\\d+', '')) > 0, False).otherwise(f.col(x).cast('int') > 0) for x in df2.columns]
df.select("*", f.size(f.array_remove(f.array(*cols), False)).alias("count")).show()
+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|count|
+---+---+---+---+-----+
| a| 1| 2| 3| 3|
| a| 0| 2| 1| 2|
| a| 0| 0| 1| 1|
| a| 0| 0| 0| 0|
+---+---+---+---+-----+

Pyspark - Using two time indices for window function

I have a dataframe where each row has two date columns. I would like to create a window function with a range between that counts the number of rows in a particular range, where BOTH date columns are within the range. In the case below, both timestamps of a row must be before the timestamp of the current row, to be included in the count.
Example df including the count column:
+---+-----------+-----------+-----+
| ID|Timestamp_1|Timestamp_2|Count|
+---+-----------+-----------+-----+
| a| 0| 3| 0|
| b| 2| 5| 0|
| d| 5| 5| 3|
| c| 5| 9| 3|
| e| 8| 10| 4|
+---+-----------+-----------+-----+
I tried creating two windows and creating the new column over both of these:
w_1 = Window.partitionBy().orderBy('Timestamp_1').rangeBetween(Window.unboundedPreceding, 0)
w_2 = Window.partitionBy().orderBy('Timestamp_2').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('count', F.count('ID').over(w_1).over(w_2))
However, this is not allowed in Pyspark and therefore gives an error.
Any ideas? Solutions in SQL are also fine!
Would a self-join work?
from pyspark.sql import functions as F
df_count = (
df.alias('a')
.join(
df.alias('b'),
(F.col('b.Timestamp_1') <= F.col('a.Timestamp_1')) &
(F.col('b.Timestamp_2') <= F.col('a.Timestamp_2')),
'left'
)
.groupBy(
'a.ID'
)
.agg(
F.count('b.ID').alias('count')
)
)
df = df.join(df_count, 'ID')

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?
This was not obvious. I see no row-based sum of the columns defined in the spark Dataframes API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that names list to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With python's reduce, some knowledge of how operator overloading works, and the pyspark code for columns here that becomes:
def column_add(a,b):
return a.__add__(b)
newdf = df.withColumn('total_col',
reduce(column_add, ( df[col] for col in df.columns ) ))
Note this is a python reduce, not a spark RDD reduce, and the parenthesis term in the second parameter to reduce requires the parenthesis because it is a list generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
... return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
The most straight forward way of doing it is to use the expr function
from pyspark.sql.functions import *
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by #Paul works. Nevertheless I was getting the error, as many other as I have seen,
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so the line imported the sum pyspark command while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal python sum function.
You can delete the reference of the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.
Summing multiple columns from a list into one column
PySpark's sum function doesn't support column addition.
This can be achieved using expr function.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.
My problem was similar to the above (bit more complex) as i had to add consecutive column sums as new columns in PySpark dataframe. This approach uses code from Paul's Version 1 above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df=spark.createDataFrame(data=[(1,2,3),(4,5,6),(3,2,1)\
,(6,1,-4),(0,2,-2),(6,4,1)\
,(4,5,2),(5,-3,-5),(6,4,-1)]\
,schema=['x1','x2','x3'])
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 3| 2| 1|
| 6| 1| -4|
| 0| 2| -2|
| 6| 4| 1|
| 4| 5| 2|
| 5| -3| -5|
| 6| 4| -1|
+---+---+---+
colnames=df.columns
add new columns that are cumulative sums (consecutive):
for i in range(0,len(colnames)):
colnameLst= colnames[0:i+1]
colname = 'cm'+ str(i+1)
df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 3| 6|
| 4| 5| 6| 4| 9| 15|
| 3| 2| 1| 3| 5| 6|
| 6| 1| -4| 6| 7| 3|
| 0| 2| -2| 0| 2| 0|
| 6| 4| 1| 6| 10| 11|
| 4| 5| 2| 4| 9| 11|
| 5| -3| -5| 5| 2| -3|
| 6| 4| -1| 6| 10| 9|
+---+---+---+---+---+---+
'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3
df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
| linha1| valor1| 2|
| linha2| valor2| 5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
| linha1| valor1| 2| 1.0|
| linha2| valor2| 5| 2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
| linha1| valor1| 2| 1.0| 3.0|
| linha2| valor2| 5| 2.5| 7.5|
+--------+--------+--------+--------------+------------+
A very simple approach would be to just use select instead of withcolumn as below:
df = df.select('*', (col("a")+col("b")+col('c).alias("total"))
This should give you required sum with minor changes based on requirements
The following approach works for me:
Import pyspark sql functions
from pyspark.sql import functions as F
Use F.expr(list_of_columns) data_frame.withColumn('Total_Sum',F.expr('col_name1+col_name2+..col_namen)

Categories