Pyspark divide column by its subtotals grouped by another column - python

My problem is similar to this and this. Both posts show how to divide a column value by the total sum of the same column. In my case I want to divide the values of a column by the sum of subtotals. Subtotal is calculated by grouping the column values depending on another column. I am slightly modifying the example mentioned in the links shared above.
Here is my dataframe
df = [[1,'CAT1',10], [2, 'CAT1', 11], [3, 'CAT2', 20], [4, 'CAT2', 22], [5, 'CAT3', 30]]
df = spark.createDataFrame(df, ['id', 'category', 'consumption'])
df.show()
+---+--------+-----------+
| id|category|consumption|
+---+--------+-----------+
| 1| CAT1| 10|
| 2| CAT1| 11|
| 3| CAT2| 20|
| 4| CAT2| 22|
| 5| CAT3| 30|
+---+--------+-----------+
I want to divide "consumption" value by the total of grouped "category" and put the value in a column "normalized" as below.
The subtotals doesn't need to be in the output(number 21, 42 and 30 in column consumption)
What I've achieved so far
df.crossJoin(
df.groupby('category').agg(F.sum('consumption').alias('sum_'))
).withColumn("normalized", F.col("consumption")/F.col("sum_"))\
.show()
+---+--------+-----------+--------+----+-------------------+
| id|category|consumption|category|sum_| normalized|
+---+--------+-----------+--------+----+-------------------+
| 1| CAT1| 10| CAT2| 42|0.23809523809523808|
| 2| CAT1| 11| CAT2| 42| 0.2619047619047619|
| 1| CAT1| 10| CAT1| 21|0.47619047619047616|
| 2| CAT1| 11| CAT1| 21| 0.5238095238095238|
| 1| CAT1| 10| CAT3| 30| 0.3333333333333333|
| 2| CAT1| 11| CAT3| 30|0.36666666666666664|
| 3| CAT2| 20| CAT2| 42|0.47619047619047616|
| 4| CAT2| 22| CAT2| 42| 0.5238095238095238|
| 5| CAT3| 30| CAT2| 42| 0.7142857142857143|
| 3| CAT2| 20| CAT1| 21| 0.9523809523809523|
| 4| CAT2| 22| CAT1| 21| 1.0476190476190477|
| 5| CAT3| 30| CAT1| 21| 1.4285714285714286|
| 3| CAT2| 20| CAT3| 30| 0.6666666666666666|
| 4| CAT2| 22| CAT3| 30| 0.7333333333333333|
| 5| CAT3| 30| CAT3| 30| 1.0|
+---+--------+-----------+--------+----+-------------------+

You can do basically the same as in the links you have already mentioned. The only difference is that you have to calculate the subtotals before with groupby and sum:
import pyspark.sql.functions as F
df = df.join(df.groupby('category').sum('consumption'), 'category')
df = df.select('id', 'category', F.round(F.col('consumption')/F.col('sum(consumption)'), 2).alias('normalized'))
df.show()
Output:
+---+--------+----------+
| id|category|normalized|
+---+--------+----------+
| 3| CAT2| 0.48|
| 4| CAT2| 0.52|
| 1| CAT1| 0.48|
| 2| CAT1| 0.52|
| 5| CAT3| 1.0|
+---+--------+----------+

This is another way of solving the problem as proposed by the OP, but without using joins().
joins() in general are costly operations and should be avoided when ever possible.
# We first register our DataFrame as temporary SQL view
df.registerTempTable('table_view')
df = sqlContext.sql("""select id, category,
consumption/sum(consumption) over (partition by category) as normalize
from table_view""")
df.show()
+---+--------+-------------------+
| id|category| normalize|
+---+--------+-------------------+
| 3| CAT2|0.47619047619047616|
| 4| CAT2| 0.5238095238095238|
| 1| CAT1|0.47619047619047616|
| 2| CAT1| 0.5238095238095238|
| 5| CAT3| 1.0|
+---+--------+-------------------+
Note: """ has been used to have multiline statements for the sake of visibility and neatness. With simple 'select id ....' that wouldn't work if you try to spread your statement over multiple lines. Needless to say, the final result will be the same.

Related

Merge Rows in Apache spark by eliminating null values

I have a spark data frame like below
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 2|null|null|null| 102| 202| 302|
| 4|null|null|null| 104| 204| 304|
| 1|null|null|null| 101| 201| 301|
| 3|null|null|null| 103| 203| 303|
| 1| 11| 21| 31|null|null|null|
| 2| 12| 22| 32|null|null|null|
| 4| 14| 24| 34|null|null|null|
| 3| 13| 23| 33|null|null|null|
+---+----+----+----+----+----+----+
I wanted to transform data frame like below by merging null rows
+---+----+----+----+----+----+----+
| id| 1| 2| 3|sf_1|sf_2|sf_3|
+---+----+----+----+----+----+----+
| 1| 11| 21| 31| 101| 201| 301|
| 2| 12| 22| 32| 102| 202| 302|
| 4| 14| 24| 34| 104| 204| 304|
| 3| 13| 23| 33| 103| 203| 303|
+---+----+----+----+----+----+----+
preferably in scala.
You can group on id and aggregate using first with ignorenulls for other columns:
import pyspark.sql.functions as F
(df.groupBy('id').agg(*[F.first(x,ignorenulls=True) for x in df.columns if x!='id'])
.show())
+---+----+----+----+-----+-----+-----+
| id| 1| 2| 3| sf_1| sf_2| sf_3|
+---+----+----+----+-----+-----+-----+
| 1|11.0|21.0|31.0|101.0|201.0|301.0|
| 3|13.0|23.0|33.0|103.0|203.0|303.0|
| 2|12.0|22.0|32.0|102.0|202.0|302.0|
| 4|14.0|24.0|34.0|104.0|204.0|304.0|
+---+----+----+----+-----+-----+-----+
scala way of doing.
val inputColumns = inputLoadDF.columns.toList.drop(0)
val exprs = inputColumns.map(x => first(x,true))
inputLoadDF.groupBy("id").agg(exprs.head,exprs.tail:_*).show()

Joining on the previous month data for each month in the PySpark dataset

I have a dataset which is on a monthly basis and each month has N-number of accounts. Some months will have new accounts and some accounts will disappear after a certain month(this is randomly done).
I need to get an account's current month balance and deduct it from the previous month's balance (if this account existed in the previous month) otherwise have it as the current month's balance.
I was suggested to do a join on each month. i.e. join month1 to month2, month2 to month3, etc. But I am not exactly sure how that would go...
Here is an example dataset:
|date |account |balance |
----------------------------------
|01.01.2019|1 |40 |
|01.01.2019|2 |33 |
|01.01.2019|3 |31 |
|01.02.2019|1 |32 |
|01.02.2019|2 |56 |
|01.02.2019|4 |89 |
|01.03.2019|2 |12 |
|01.03.2019|4 |35 |
|01.03.2019|5 |76 |
|01.03.2019|6 |47 |
----------------------------------
The account id is unique for each gone, current and new-coming account.
I initially used f.lag, but now that there are accounts that dissapear and come in new, the number of accounts per month is not constant, so I cannot lag. As I said I was suggested to use join. I.e. join Jan onto Feb, Feb onto March, etc.
But I am not really sure how that would go. Anyone has any ideas ?
P.S. I created this table with example of an account that stays, an account that is new and an account that is removed from later months.
The end goal is:
|date |account |balance | balance_diff_with_previous_month |
--------------------------------------------------------------------|
|01.01.2019|1 |40 |na |
|01.01.2019|2 |33 |na |
|01.01.2019|3 |31 |na |
|01.02.2019|1 |32 |-8 |
|01.02.2019|2 |56 |23 |
|01.02.2019|4 |89 |89 |
|01.03.2019|2 |12 |-44 |
|01.03.2019|4 |35 |-54 |
|01.03.2019|5 |76 |76 |
|01.03.2019|6 |47 |47 |
--------------------------------------------------------------------|
As I said, f.lag cannot be used because the number of accounts per month is not constant and I do not control the number of accounts, therefore cannot f.lag a constant amount of rows.
Anyone has any ideas about how joining on account and/or date(current month) with date-1 (previous month)?
Thanks for reading and helping :)
alternate solution using joins ....
df = spark.createDataFrame([
("01.01.2019", 1, 40),("01.01.2019", 2, 33),("01.01.2019", 3, 31),
("01.02.2019", 1, 32), ("01.02.2019", 2, 56),("01.02.2019", 4, 89),
("01.03.2019", 2, 12),("01.03.2019", 4, 35),("01.03.2019", 5, 76),("01.03.2019", 6, 47)],
["date","account","balance"])
df.alias("current").join(
df.alias("previous"),
[F.to_date(F.col("previous.date"), "dd.MM.yyyy") == F.to_date(F.add_months(F.to_date(F.col("current.date"), "dd.MM.yyyy"),-1),"dd.MM.yyyy"), F.col("previous.account") == F.col("current.account")],
"left"
).select(
F.col("current.date").alias("date"),
F.coalesce("current.account", "previous.account").alias("account"),
F.col("current.balance").alias("balance"),
(F.col("current.balance") - F.coalesce(F.col("previous.balance"), F.lit(0))).alias("balance_diff_with_previous_month")
).orderBy("date","account").show()
which results
+----------+-------+-------+--------------------------------+
| date|account|balance|balance_diff_with_previous_month|
+----------+-------+-------+--------------------------------+
|01.01.2019| 1| 40| 40|
|01.01.2019| 2| 33| 33|
|01.01.2019| 3| 31| 31|
|01.02.2019| 1| 32| -8|
|01.02.2019| 2| 56| 23|
|01.02.2019| 4| 89| 89|
|01.03.2019| 2| 12| -44|
|01.03.2019| 4| 35| -54|
|01.03.2019| 5| 76| 76|
|01.03.2019| 6| 47| 47|
+----------+-------+-------+--------------------------------+
F.lag works perfectly for what you want if you partition by account and
partition = Window.partitionBy("account") \
.orderBy(F.col("date").cast("timestamp").cast("long"))
previousAmount = data.withColumn("balance_diff_with_previous_month", F.lag("balance").over(partition))
.show(10, False)
>>> from pyspark.sql.functions import *
>>> from pyspark.sql import Window
>>> df.show()
+----------+-------+-------+
| date|account|balance|
+----------+-------+-------+
|01.01.2019| 1| 40|
|01.01.2019| 2| 33|
|01.01.2019| 3| 31|
|01.02.2019| 1| 32|
|01.02.2019| 2| 56|
|01.02.2019| 4| 89|
|01.03.2019| 2| 12|
|01.03.2019| 4| 35|
|01.03.2019| 5| 76|
|01.03.2019| 6| 47|
+----------+-------+-------+
>>> df1 = df.withColumn("date", expr("to_date(date, 'dd.MM.yyyy')"))
>>> W = Window.partitionBy("account").orderBy("date")
>>> df1.withColumn("balance_diff_with_previous_month", col("balance") - lag(col("balance"),1,0).over(W)).show()
+----------+-------+-------+--------------------------------+
| date|account|balance|balance_diff_with_previous_month|
+----------+-------+-------+--------------------------------+
|2019-01-01| 1| 40| 40.0|
|2019-01-01| 2| 33| 33.0|
|2019-01-01| 3| 31| 31.0|
|2019-02-01| 1| 32| -8.0|
|2019-02-01| 2| 56| 23.0|
|2019-02-01| 4| 89| 89.0|
|2019-03-01| 2| 12| -44.0|
|2019-03-01| 4| 35| -54.0|
|2019-03-01| 5| 76| 76.0|
|2019-03-01| 6| 47| 47.0|
+----------+-------+-------+--------------------------------+

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking
+--------+----+-----------+--------------+
| colA|colB|colA_rank |colA_rank_mean|
+--------+----+-----------+--------------+
| 21| 50| 1| 1|
| 9| 23| 2| 2.5|
| 9| 21| 3| 2.5|
| 8| 21| 4| 3|
| 2| 21| 5| 5.5|
| 2| 5| 6| 5.5|
| 1| 5| 7| 7.5|
| 1| 4| 8| 7.5|
| 0| 4| 9| 11|
| 0| 3| 10| 11|
| 0| 3| 11| 11|
| 0| 2| 12| 11|
| 0| 2| 13| 11|
+--------+----+-----------+--------------+
colA_rank is a normal ranking, while with colA_rank_mean I would like to resolve ties by replacing the ranking with the mean rank of the ties. Is it achievable with a single pass and some particular ranking method ?
Currently I am thinking of 2 passes but that would seem to require ordering the dataset twice on colA, one without partition and one with partition.
#Step 1: normal rank
df = df.withColumn("colA_rank",F.row_number().over(Window.orderBy("colA")))
#Step 2 : solve ties :
df = df.withColumn("colA_rank_mean",F.mean(colA_rank).over(Window.partitionBy("colA"))

cumulative sum function in pyspark grouping on multiple columns based on condition

I need to create a event_id basically a counter grouping on multiple columns(v_id,d_id,ip,l_id) and incrementing it when delta > 40 to get
the output like this
v_id d_id ip l_id delta event_id last_event_flag
1 20 30 40 1 1 N
1 20 30 40 2 1 N
1 20 30 40 3 1 N
1 20 30 40 4 1 Y
1 20 20 40 1 1 Y
1 30 30 40 2 1 N
1 30 30 40 3 1 N
1 30 30 40 4 1 N
1 30 30 40 5 1 Y
i was able to achieve this using pandas data frame
df['event_id'] = (df.delta >=40.0).groupby([df.l_id,df.v_id,d_id,ip]).cumsum() + 1
df.append(df['event_id'], ignore_index=True
but seeing memory error when executing it on a larger data .
How to do similar thing in pyspark.
In pyspark you can do it using a window function:
First let's create the dataframe. Note that you can also directly load it as a dataframe from a csv:
df = spark.createDataFrame(
sc.parallelize(
[[1,20,30,40,1,1],
[1,20,30,40,2,1],
[1,20,30,40,3,1],
[1,20,30,40,4,1],
[1,20,30,40,45,2],
[1,20,30,40,1,2],
[1,30,30,40,2,1],
[1,30,30,40,3,1],
[1,30,30,40,4,1],
[1,30,30,40,5,1]]
),
["v_id","d_id","ip","l_id","delta","event_id"]
)
You have an implicit ordering in your table, we need to create a monotonically increasing id so that we don't end up shuffling it around:
import pyspark.sql.functions as psf
df = df.withColumn(
"rn",
psf.monotonically_increasing_id()
)
+----+----+---+----+-----+--------+----------+
|v_id|d_id| ip|l_id|delta|event_id| rn|
+----+----+---+----+-----+--------+----------+
| 1| 20| 30| 40| 1| 1| 0|
| 1| 20| 30| 40| 2| 1| 1|
| 1| 20| 30| 40| 3| 1| 2|
| 1| 20| 30| 40| 4| 1| 3|
| 1| 20| 30| 40| 45| 2| 4|
| 1| 20| 30| 40| 1| 2|8589934592|
| 1| 30| 30| 40| 2| 1|8589934593|
| 1| 30| 30| 40| 3| 1|8589934594|
| 1| 30| 30| 40| 4| 1|8589934595|
| 1| 30| 30| 40| 5| 1|8589934596|
+----+----+---+----+-----+--------+----------+
Now to compute event_id and last_event_flag:
from pyspark.sql import Window
w1 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy("rn")
w2 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy(psf.desc("rn"))
df.withColumn(
"event_id",
psf.sum((df.delta >= 40).cast("int")).over(w1) + 1
).withColumn(
"last_event_flag",
psf.row_number().over(w2) == 1
).drop("rn")
+----+----+---+----+-----+--------+---------------+
|v_id|d_id| ip|l_id|delta|event_id|last_event_flag|
+----+----+---+----+-----+--------+---------------+
| 1| 20| 30| 40| 1| 1| false|
| 1| 20| 30| 40| 2| 1| false|
| 1| 20| 30| 40| 3| 1| false|
| 1| 20| 30| 40| 4| 1| false|
| 1| 20| 30| 40| 45| 2| false|
| 1| 20| 30| 40| 1| 2| true|
| 1| 30| 30| 40| 2| 1| false|
| 1| 30| 30| 40| 3| 1| false|
| 1| 30| 30| 40| 4| 1| false|
| 1| 30| 30| 40| 5| 1| true|
+----+----+---+----+-----+--------+---------------+
Perhaps you should calculate df = df[df.delta>=40] before running the groupby- I'm not sure if that matters.
Also you can look into chunksize to perform calculations based on chunks of the csv for memory efficiency. So you might break up the data into chunks of 10000 lines and then run the calculations to avoid memory error.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
How to read a 6 GB csv file with pandas

Window timeseries with step in Spark/Scala

I have this input :
timestamp,user
1,A
2,B
5,C
9,E
12,F
The result wanted is :
timestampRange,userList
1 to 2,[A,B]
3 to 4,[] Or null
5 to 6,[C]
7 to 8,[] Or null
9 to 10,[E]
11 to 12,[F]
I tried using Window, but the problem, it doesn't include the empty timestamp range.
Any hints would be helpful.
Don't know if widowing function will cover the gaps between ranges, but you can take the following approach :
Define a dataframe, df_ranges:
val ranges = List((1,2), (3,4), (5,6), (7,8), (9,10))
val df_ranges = sc.parallelize(ranges).toDF("start", "end")
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 3| 4|
| 5| 6|
| 7| 8|
| 9| 10|
+-----+---+
Data with the timestamp column, df_data :
val data = List((1,"A"), (2,"B"), (5,"C"), (9,"E"))
val df_data = sc.parallelize(data).toDF("timestamp", "user")
+---------+----+
|timestamp|user|
+---------+----+
| 1| A|
| 2| B|
| 5| C|
| 9| E|
+---------+----+
Join the two dataframe on the start, end, timestamp columns:
df_ranges.join(df_data, df_ranges.col("start").equalTo(df_data.col("timestamp")).or(df_ranges.col("end").equalTo(df_data.col("timestamp"))), "left")
+-----+---+---------+----+
|start|end|timestamp|user|
+-----+---+---------+----+
| 1| 2| 1| A|
| 1| 2| 2| B|
| 5| 6| 5| C|
| 9| 10| 9| E|
| 3| 4| null|null|
| 7| 8| null|null|
+-----+---+---------+----+
Now do a simple aggregation with collect_list function :
res4.groupBy("start", "end").agg(collect_list("user")).orderBy("start")
+-----+---+------------------+
|start|end|collect_list(user)|
+-----+---+------------------+
| 1| 2| [A, B]|
| 3| 4| []|
| 5| 6| [C]|
| 7| 8| []|
| 9| 10| [E]|
+-----+---+------------------+

Categories