PySpark - Using two time indices for a window function - python

I have a dataframe where each row has two date columns. I would like to create a window with a rangeBetween frame that counts the number of rows in a particular range, where BOTH date columns are within the range. In the case below, both timestamps of a row must be before the timestamps of the current row for it to be included in the count.
Example df including the count column:
+---+-----------+-----------+-----+
| ID|Timestamp_1|Timestamp_2|Count|
+---+-----------+-----------+-----+
| a| 0| 3| 0|
| b| 2| 5| 0|
| d| 5| 5| 3|
| c| 5| 9| 3|
| e| 8| 10| 4|
+---+-----------+-----------+-----+
I tried creating two windows and creating the new column over both of these:
w_1 = Window.partitionBy().orderBy('Timestamp_1').rangeBetween(Window.unboundedPreceding, 0)
w_2 = Window.partitionBy().orderBy('Timestamp_2').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('count', F.count('ID').over(w_1).over(w_2))
However, chaining .over() calls like this is not allowed in PySpark, so it raises an error.
Any ideas? Solutions in SQL are also fine!

Would a self-join work?
from pyspark.sql import functions as F

df_count = (
    df.alias('a')
    .join(
        df.alias('b'),
        (F.col('b.Timestamp_1') <= F.col('a.Timestamp_1')) &
        (F.col('b.Timestamp_2') <= F.col('a.Timestamp_2')),
        'left'
    )
    .groupBy('a.ID')
    .agg(F.count('b.ID').alias('count'))
)
df = df.join(df_count, 'ID')
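Since SQL solutions are also welcome, here is a sketch of the same counting self-join expressed in Spark SQL; it assumes the DataFrame is registered as a temporary view named df and that spark is the active SparkSession:
df.createOrReplaceTempView('df')
df_with_count = spark.sql("""
    SELECT a.ID, a.Timestamp_1, a.Timestamp_2, COUNT(b.ID) AS Count
    FROM df a
    LEFT JOIN df b
      ON b.Timestamp_1 <= a.Timestamp_1
     AND b.Timestamp_2 <= a.Timestamp_2
    GROUP BY a.ID, a.Timestamp_1, a.Timestamp_2
""")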

Related

How to fill gaps between two rows having the difference expressed in days

I have the following dataframe, where diff_days is the difference in days between each row and the previous row:
+----------+--------+---------+
| fx_date| col_1 |diff_days|
+----------+--------+---------+
|2020-01-05| A| null|
|2020-01-09| B| 4|
|2020-01-11| C| 2|
+----------+--------+---------+
I want to get a dataframe that adds rows for the missing dates, replicating the value of col_1 from the first row of each gap.
It should be:
+----------+--------+
| fx_date| col_1 |
+----------+--------+
|2020-01-05| A|
|2020-01-06| A|
|2020-01-07| A|
|2020-01-08| A|
|2020-01-09| B|
|2020-01-10| B|
|2020-01-11| C|
+----------+--------+
You can use lag + sequence functions to generate the dates between previous and current row dates, then explode the list like this:
from pyspark.sql import functions as F, Window

df1 = df.withColumn(
    "previous_dt",
    F.date_add(F.lag("fx_date", 1).over(Window.orderBy("fx_date")), 1)
).withColumn(
    "fx_date",
    F.expr("sequence(coalesce(previous_dt, fx_date), fx_date, interval 1 day)")
).withColumn(
    "fx_date",
    F.explode("fx_date")
).drop("previous_dt", "diff_days")

df1.show()
#+----------+-----+
#| fx_date|col_1|
#+----------+-----+
#|2020-01-05| A|
#|2020-01-06| B|
#|2020-01-07| B|
#|2020-01-08| B|
#|2020-01-09| B|
#|2020-01-10| C|
#|2020-01-11| C|
#+----------+-----+
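Note that this fills each gap with the current row's col_1 (B for 2020-01-06 through 2020-01-08). If, as in the expected output above, each gap should instead carry the earlier row's value, a sketch of the same idea using lead instead of lag (this reversal is an assumption about the intended behaviour) would be:
from pyspark.sql import functions as F, Window

w = Window.orderBy("fx_date")
df2 = df.withColumn(
    "next_dt",
    F.date_sub(F.lead("fx_date", 1).over(w), 1)  # the day before the next row's date
).withColumn(
    "fx_date",
    F.expr("sequence(fx_date, coalesce(next_dt, fx_date), interval 1 day)")
).withColumn(
    "fx_date",
    F.explode("fx_date")
).drop("next_dt", "diff_days")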

Enumerate groups of consecutive equal values in PySpark

I'm trying to uniquely label consecutive rows with equal values in a PySpark dataframe. In Pandas, one could do this quite simply with:
s = pd.Series([1,1,1,2,2,1,1,3])
s.ne(s.shift()).cumsum()
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 4
dtype: int64
How could this be done in PySpark? Setup -
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
mySchema = StructType([StructField("col1", IntegerType(), True)])
df_sp = spark.createDataFrame(s.to_frame(), schema=mySchema)
I've found slightly related questions such as this one, but none of them covers this same scenario.
I'm thinking a good starting point could be to find the first differences as in this answer
I've come up with a solution. The idea is similar to what is done in Pandas. We start by adding a unique identifier column, over which we compute the lagged column (using over here is necessary since lag is a window function).
We then compare the column of interest with the lagged column and take the cumulative sum of the result cast to int:
from pyspark.sql import Window
from pyspark.sql import functions as f

mySchema = StructType([StructField("col1", IntegerType(), True)])
df_sp = spark.createDataFrame(s.to_frame(), schema=mySchema)

win = Window.orderBy("id")
df_sp = (df_sp.withColumn("id", f.monotonically_increasing_id())
              .withColumn("col1_shift", f.lag("col1", offset=1, default=0).over(win))
              .withColumn("col1_shift_ne", (f.col("col1") != f.col("col1_shift")).cast("int"))
              .withColumn("col1_shift_ne_cumsum", f.sum("col1_shift_ne").over(win))
              .drop(*['id', 'col1_shift', 'col1_shift_ne']))
df_sp.show()
+----+--------------------+
|col1|col1_shift_ne_cumsum|
+----+--------------------+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 2|
| 2| 2|
| 1| 3|
| 1| 3|
| 3| 4|
+----+--------------------+
Another way of solving this would be using rangeBetween with an unbounded preceding sum after comparing the lag:
from pyspark.sql import functions as F, Window as W
w1 = W.orderBy(F.monotonically_increasing_id())
w2 = W.orderBy(F.monotonically_increasing_id()).rangeBetween(W.unboundedPreceding,0)
cond = F.col("col1") != F.lag("col1").over(w1)
df_sp.withColumn("col1_shift_ne_cumsum",F.sum(F.when(cond,1).otherwise(0)).over(w2)+1).show()
+----+--------------------+
|col1|col1_shift_ne_cumsum|
+----+--------------------+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 2|
| 2| 2|
| 1| 3|
| 1| 3|
| 3| 4|
+----+--------------------+

PySpark - iterate over N rows of a DataFrame per execution

def fun_1(csv):
    # returns int[] of length = number of new lines in the string csv
    ...

def fun_2(csv):  # my workaround to pass one CSV line at a time
    return fun_1(csv)[0]
The input DataFrame is df:
+----+----+-----+
|col1|col2|CSVs |
+----+----+-----+
| 1| a|2,0,1|
| 2| b|2,0,2|
| 3| c|2,0,3|
| 4| a|2,0,1|
| 5| b|2,0,2|
| 6| c|2,0,3|
| 7| a|2,0,1|
+----+----+-----+
Below is a code snippet which works but takes a long time:
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf
funudf = udf(fun_2) # wish it could be fun_1
df = df.withColumn('pred', funudf(sf.col('csv')))
fun_1 has a memory issue and can only handle a maximum of 50000 rows at a time. I wish to use funudf = udf(fun_1).
Hence, how can I split the PySpark DF into segments of 50000 rows and call funudf -> fun_1 on each?
The output should have two columns: 'col1' from the input and the funudf return value.
You can achieve the desired result of forcing PySpark to operate on fixed batches of rows by using the groupByKey method exposed in the RDD API. Using groupByKey will force PySpark to shuffle all the data for a single key to a single executor.
NOTE: for this very same reason using groupByKey is often discouraged because of the network cost.
Strategy:
1. Add a column that groups your data into the desired batch size, and groupByKey on it.
2. Define a function that reproduces the logic of your UDF (and also returns an id for joining later on). It operates on a pyspark.resultiterable.ResultIterable, the result of groupByKey. Apply the function to your groups using mapValues.
3. Convert the resulting RDD into a DataFrame and join it back in.
Example:
import pandas as pd
from pyspark.sql import types as t

# Synthesize a DF
data = {'_id': range(9), 'group': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'], 'vals': [2.0*i for i in range(9)]}
df = spark.createDataFrame(pd.DataFrame(data))
df.show()

##
# Step 1 - Convert to an RDD and groupByKey to force each group onto a separate executor
##
kv = df.rdd.map(lambda r: (r.group, [r._id, r.group, r.vals]))
groups = kv.groupByKey()

##
# Step 2 - Calculate the function
##
# Dummy function operating on the ResultIterable of rows for one key
def mult3(ditr):
    data = ditr.data
    ids = [v[0] for v in data]
    vals = [3*v[2] for v in data]
    return zip(ids, vals)

# run mult3 and flatten the results
mv = groups.mapValues(mult3).map(lambda r: r[1]).flatMap(lambda r: r)  # rdd[(id, val)]

##
# Step 3 - Join results back into the base DF
##
# convert the results into a DF and join back in
schema = t.StructType([t.StructField('_id', t.LongType()), t.StructField('vals_x_3', t.FloatType())])
df_vals = spark.createDataFrame(mv, schema)
joined = df.join(df_vals, '_id')
joined.show()
>>>
+---+-----+----+
|_id|group|vals|
+---+-----+----+
| 0| a| 0.0|
| 1| b| 2.0|
| 2| c| 4.0|
| 3| a| 6.0|
| 4| b| 8.0|
| 5| c|10.0|
| 6| a|12.0|
| 7| b|14.0|
| 8| c|16.0|
+---+-----+----+
+---+-----+----+--------+
|_id|group|vals|vals_x_3|
+---+-----+----+--------+
| 0| a| 0.0| 0.0|
| 7| b|14.0| 42.0|
| 6| a|12.0| 36.0|
| 5| c|10.0| 30.0|
| 1| b| 2.0| 6.0|
| 3| a| 6.0| 18.0|
| 8| c|16.0| 48.0|
| 2| c| 4.0| 12.0|
| 4| b| 8.0| 24.0|
+---+-----+----+--------+
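The example above batches on an existing 'group' column. To meet the original requirement of fixed batches of at most 50000 rows, one possible sketch (an assumption, since the question's DataFrame has no natural batch column) derives a batch id from a global row number on the question's df and uses it as the groupByKey key:
from pyspark.sql import functions as sf, Window

batch_size = 50000
# number the rows globally, then derive a batch id so each batch holds at most batch_size rows
w = Window.orderBy(sf.monotonically_increasing_id())
df_batched = df.withColumn("batch_id", sf.floor((sf.row_number().over(w) - 1) / batch_size))

# group by the synthetic batch_id instead of a data column
kv = df_batched.rdd.map(lambda r: (r.batch_id, [r.col1, r.CSVs]))
groups = kv.groupByKey()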

Find the count of distinct values between two occurrences of the same value in a CSV file using pyspark

I'm working in PySpark to deal with big CSV files of more than 50 GB.
Now I need to find the number of distinct values between two references to the same value.
For example,
input dataframe:
+----+
|col1|
+----+
| a|
| b|
| c|
| c|
| a|
| b|
| a|
+----+
output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
| a| null|
| b| null|
| c| null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+-----+
I've been struggling with this for the past week. I tried window functions and many other things in Spark, but couldn't get anywhere. It would be a great help if someone knows how to solve this. Thank you.
Comment if you need any clarification on the question.
I am providing a solution with some assumptions.
Assuming the previous reference can be found within at most the previous 'n' rows: if 'n' is a reasonably small value, I think this is a good solution.
Here I assumed you can find the previous reference within 5 rows.
from pyspark.sql import Window
from pyspark.sql.functions import array, col, lag, monotonically_increasing_id, udf
from pyspark.sql.types import IntegerType

def get_distincts(values, current_value):
    # count the distinct values seen before the previous occurrence of current_value
    cnt = {}
    flag = False
    for i in values:
        if current_value == i:
            flag = True
            break
        else:
            cnt[i] = "some_value"
    if flag:
        return len(cnt)
    else:
        return None

get_distincts_udf = udf(get_distincts, IntegerType())

df = spark.createDataFrame([["a"], ["b"], ["c"], ["c"], ["a"], ["b"], ["a"]]).toDF("col1")
# You can replace this if you have some unique id column
df = df.withColumn("seq_id", monotonically_increasing_id())
window = Window.orderBy("seq_id")
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))
df = df.withColumn("col2", get_distincts_udf(col('list'), col('col1'))).drop('seq_id', 'list')
df.show()
which results in:
+----+----+
|col1|col2|
+----+----+
| a|null|
| b|null|
| c|null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+----+
You can try the following approach:
add a monotonically increasing id column to keep track of the order of rows
find prev_id for each col1 and save the result to a new df
for the new DF (alias 'd1'), make a LEFT JOIN to the DF itself (alias 'd2') with a condition (d2.id > d1.prev_id) & (d2.id < d1.id)
then groupby('d1.col1', 'd1.id') and aggregate on the countDistinct('d2.col1')
The code based on the above logic and your sample data is shown below:
from pyspark.sql import functions as F, Window

df1 = spark.createDataFrame([(i,) for i in list("abccaba")], ["col1"])

# create a WindowSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')

# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id()) \
          .withColumn('prev_id', F.lag('id').over(win))

# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# | a| 0| null|
# | b| 1| null|
# | c| 2| null|
# | c| 3| 2|
# | a| 4| 0|
# | b| 5| 1|
# | a| 6| 4|
# +----+---+-------+
# let's cache the new dataframe
df11.persist()
# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
    .join(df11.alias('d2'),
          (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')),
          how='left') \
    .select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
    .groupBy('col1', 'id') \
    .agg(F.countDistinct('ids').alias('distinct_values'))

# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# | a| 0| 0|
# | b| 1| 0|
# | c| 2| 0|
# | c| 3| 0|
# | a| 4| 2|
# | b| 5| 2|
# | a| 6| 1|
# +----+---+---------------+
# release the cached df11
df11.unpersist()
Note you will need to keep this id column to sort rows, otherwise your resulting rows will be totally messed up each time you collect them.
Here, block is nothing but the character you want to read and search for from the CSV file. The snippet below computes the reuse distance and stack distance for a block:
reuse_distance = []
block_dict = {}
stack_dict = {}
counter_reuse = 0
counter_stack = 0
reuse_list = []
stack_list = []

# for each block read from the CSV:
stack_dist = -1
reuse_dist = -1
if block in block_dict:
    reuse_dist = counter_reuse - block_dict[block] - 1
    block_dict[block] = counter_reuse
    counter_reuse += 1
    stack_dist_ind = stack_list.index(block)
    stack_dist = counter_stack - stack_dist_ind - 1
    del stack_list[stack_dist_ind]
    stack_list.append(block)
else:
    block_dict[block] = counter_reuse
    counter_reuse += 1
    counter_stack += 1
    stack_list.append(block)
reuse_distance.append([block, stack_dist, reuse_dist])

Add column sum as new column in PySpark dataframe

I'm using PySpark and I have a Spark dataframe with a bunch of numeric columns. I want to add a column that is the sum of all the other columns.
Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
df.withColumn('total_col', df.a + df.b + df.c)
The problem is that I don't want to type out each column individually and add them, especially if I have a lot of columns. I want to be able to do this automatically or by specifying a list of column names that I want to add. Is there another way to do this?
This was not obvious. I see no row-wise sum of columns defined in the Spark DataFrames API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
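For example, a sketch that sums only a subset of the columns (assuming columns 'a' and 'b' exist in df):
newdf = df.withColumn('partial_total', sum(df[col] for col in ['a', 'b']))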
Version 1
This is overly complicated, but works as well.
You can do this:
use df.columns to get a list of the names of the columns
use that names list to make a list of the columns
pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With Python's reduce, some knowledge of how operator overloading works, and the pyspark code for columns, that becomes:
from functools import reduce  # not needed in Python 2

def column_add(a, b):
    return a.__add__(b)

newdf = df.withColumn('total_col',
                      reduce(column_add, (df[col] for col in df.columns)))
Note this is Python's reduce, not a Spark RDD reduce, and the second argument to reduce needs the parentheses because it is a generator expression.
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
... return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
The most straightforward way of doing it is to use the expr function:
from pyspark.sql.functions import *
data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
The solution
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
posted by @Paul works. Nevertheless I was getting the error, as many others have:
TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I previously imported some pyspark functions with the line
from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min
so that line imported PySpark's sum function, while df.withColumn('total', sum(df[col] for col in df.columns)) is supposed to use the normal Python sum function.
You can delete the reference of the pyspark function with del sum.
Otherwise in my case I changed the import to
import pyspark.sql.functions as F
and then referenced the functions as F.sum.
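A small sketch of keeping both in scope without clashes (assuming df contains only numeric columns):
import builtins
import pyspark.sql.functions as F

# builtins.sum is always Python's own sum, even if a bare `sum` import from pyspark shadows it
newdf = df.withColumn('total', builtins.sum(df[c] for c in df.columns))
# F.sum stays available for aggregations
newdf.agg(F.sum('total')).show()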
Summing multiple columns from a list into one column
PySpark's sum is an aggregate function, so it doesn't do row-wise column addition.
This can be achieved using the expr function instead.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns.
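If every column of the DataFrame should go into the sum, a sketch that builds the expression from df.columns (assuming all columns are numeric):
expression = ' + '.join(df.columns)
df = df.withColumn('sum_cols', expr(expression))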
My problem was similar to the above (a bit more complex), as I had to add consecutive column sums as new columns in a PySpark dataframe. This approach uses code from Paul's Version 2 above:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('addColAsCumulativeSUM').getOrCreate()
df = spark.createDataFrame(data=[(1, 2, 3), (4, 5, 6), (3, 2, 1),
                                 (6, 1, -4), (0, 2, -2), (6, 4, 1),
                                 (4, 5, 2), (5, -3, -5), (6, 4, -1)],
                           schema=['x1', 'x2', 'x3'])
df.show()
df.show()
+---+---+---+
| x1| x2| x3|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 3| 2| 1|
| 6| 1| -4|
| 0| 2| -2|
| 6| 4| 1|
| 4| 5| 2|
| 5| -3| -5|
| 6| 4| -1|
+---+---+---+
colnames = df.columns
Add new columns that are cumulative sums (consecutive):
for i in range(0, len(colnames)):
    colnameLst = colnames[0:i+1]
    colname = 'cm' + str(i+1)
    df = df.withColumn(colname, sum(df[col] for col in colnameLst))
df.show()
+---+---+---+---+---+---+
| x1| x2| x3|cm1|cm2|cm3|
+---+---+---+---+---+---+
| 1| 2| 3| 1| 3| 6|
| 4| 5| 6| 4| 9| 15|
| 3| 2| 1| 3| 5| 6|
| 6| 1| -4| 6| 7| 3|
| 0| 2| -2| 0| 2| 0|
| 6| 4| 1| 6| 10| 11|
| 4| 5| 2| 4| 9| 11|
| 5| -3| -5| 5| 2| -3|
| 6| 4| -1| 6| 10| 9|
+---+---+---+---+---+---+
The 'cumulative sum' columns added are as follows:
cm1 = x1
cm2 = x1 + x2
cm3 = x1 + x2 + x3
df = spark.createDataFrame([("linha1", "valor1", 2), ("linha2", "valor2", 5)], ("Columna1", "Columna2", "Columna3"))
df.show()
+--------+--------+--------+
|Columna1|Columna2|Columna3|
+--------+--------+--------+
| linha1| valor1| 2|
| linha2| valor2| 5|
+--------+--------+--------+
df = df.withColumn('DivisaoPorDois', df[2]/2)
df.show()
+--------+--------+--------+--------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|
+--------+--------+--------+--------------+
| linha1| valor1| 2| 1.0|
| linha2| valor2| 5| 2.5|
+--------+--------+--------+--------------+
df = df.withColumn('Soma_Colunas', df[2]+df[3])
df.show()
+--------+--------+--------+--------------+------------+
|Columna1|Columna2|Columna3|DivisaoPorDois|Soma_Colunas|
+--------+--------+--------+--------------+------------+
| linha1| valor1| 2| 1.0| 3.0|
| linha2| valor2| 5| 2.5| 7.5|
+--------+--------+--------+--------------+------------+
A very simple approach would be to just use select instead of withColumn, as below:
df = df.select('*', (col("a") + col("b") + col("c")).alias("total"))
This should give you the required sum, with minor changes based on your requirements.
The following approach works for me:
Import the PySpark SQL functions:
from pyspark.sql import functions as F
Then use F.expr with the column names:
data_frame.withColumn('Total_Sum', F.expr('col_name1 + col_name2 + .. + col_namen'))
