pyspark recursive row chaining conversion to scala - python

I wrote a PySpark implementation that reads row over row to incrementally (and recursively) multiply a column value in sequence. Due to platform limitations on our side, I need to convert this to Scala now, without a UDAF. I looked at this implementation, but that one takes longer and longer as the number of year_months grows, since it needs as many temp tables as there are year_months.
There are around 100 year_months and 70 departments, giving about 7000 rows in this dataframe. We need to take the starting value (the first year_month in the sequence) for each department and multiply it by the next row's value. The resulting product then needs to be multiplied by the next row's value, and so on.
Example data:
department, productivity_ratio, year_month
101,1.00,2013-01-01
101,0.98,2013-02-01
101,1.01,2013-03-01
101,0.99,2013-04-01
...
102,1.00,2013-01-01
102,1.02,2013-02-01
102,0.96,2013-03-01
...
Expected result:
department,productivity_ratio,year_month,chained_productivity_ratio
101,1.00,2013-01-01,1.00
101,0.98,2013-02-01,0.98 (1.00*0.98)
101,1.01,2013-03-01,0.9898 (1.00*0.98*1.01)
101,0.99,2013-04-01,0.9799 (1.00*0.98*1.01*0.99)
...
102,1.00,2013-01-01,1.00 (reset to 1.00 as starting point as department name changed in sequence)
102,1.02,2013-02-01,1.02 (1.00*1.02)
102,0.96,2013-03-01,0.9792 (1.00*1.02*0.96)
...
Is there any way to implement this faster in Scala, either by converting this into a loop over departments and treating productivity_ratio as a sequence to multiply with the previous value, or by changing the dataframe into a different data structure to avoid running into distributed sequencing problems?
Existing pyspark code:
%pyspark
import pandas as pd

inputParquet = "s3://path/to/parquet/files/"
inputData = spark.read.parquet(inputParquet)
inputData.printSchema()
# root
#  |-- department: string
#  |-- productivity_ratio: double
#  |-- year_month: date

inputSorted = inputData.sort('department', 'year_month')
inputSortedNotnull = inputSorted.dropna()
finalInput = inputSortedNotnull.toPandas()

# Walk the sorted rows, carrying the running product; reset it to 1 whenever the
# department changes.
prev_dept = 999
prev_productivity_ratio = 1
new_productivity_chained = []
for t in finalInput.itertuples():
    if prev_dept == t[1]:
        new_productivity_chained.append(t[2] * prev_productivity_ratio)
        prev_productivity_ratio = t[2] * prev_productivity_ratio
    else:
        prev_productivity_ratio = 1
        new_productivity_chained.append(prev_productivity_ratio)
    prev_dept = t[1]

productivityChained = finalInput.assign(chained_productivity=new_productivity_chained)

You can use a window function and compute exp(sum(log(<column>))) over it to calculate the chained_productivity_ratio, and since all the functions we are using are Spark built-in functions, the performance will be great!
Example:
In Pyspark:
df.show()
#+----------+------------------+----------+
#|department|productivity_ratio|year_month|
#+----------+------------------+----------+
#| 101| 1.00|2013-01-01|
#| 101| 0.98|2013-02-01|
#| 101| 1.01|2013-03-01|
#| 101| 0.99|2013-04-01|
#| 102| 1.00|2013-01-01|
#| 102| 1.02|2013-02-01|
#| 102| 0.96|2013-03-01|
#+----------+------------------+----------+
from pyspark.sql.functions import *
from pyspark.sql import Window
w = Window.partitionBy("department").orderBy("year_month")
df.withColumn("chained_productivity_ratio",exp(sum(log(col("productivity_ratio"))).over(w))).show()
#+----------+------------------+----------+--------------------------+
#|department|productivity_ratio|year_month|chained_productivity_ratio|
#+----------+------------------+----------+--------------------------+
#| 101| 1.00|2013-01-01| 1.0|
#| 101| 0.98|2013-02-01| 0.98|
#| 101| 1.01|2013-03-01| 0.9898|
#| 101| 0.99|2013-04-01| 0.9799019999999999|
#| 102| 1.00|2013-01-01| 1.0|
#| 102| 1.02|2013-02-01| 1.02|
#| 102| 0.96|2013-03-01| 0.9792|
#+----------+------------------+----------+--------------------------+
In Scala:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val w = Window.partitionBy("department").orderBy("year_month")
df.withColumn("chained_productivity_ratio",exp(sum(log(col("productivity_ratio"))).over(w))).show()
//+----------+------------------+----------+--------------------------+
//|department|productivity_ratio|year_month|chained_productivity_ratio|
//+----------+------------------+----------+--------------------------+
//| 101| 1.00|2013-01-01| 1.0|
//| 101| 0.98|2013-02-01| 0.98|
//| 101| 1.01|2013-03-01| 0.9898|
//| 101| 0.99|2013-04-01| 0.9799019999999999|
//| 102| 1.00|2013-01-01| 1.0|
//| 102| 1.02|2013-02-01| 1.02|
//| 102| 0.96|2013-03-01| 0.9792|
//+----------+------------------+----------+--------------------------+
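One caveat with the exp(sum(log(...))) trick: it breaks down if any productivity_ratio is 0 or negative, because log is undefined there. A hedged alternative (my sketch, shown in PySpark for brevity; the same SQL expression can be used from Scala via expr), assuming Spark 2.4+, is to collect the ratios over the same running window and fold them into a product with the aggregate higher-order function:
from pyspark.sql import functions as F, Window

# running frame: all rows from the start of the department up to the current row
w = (Window.partitionBy("department")
           .orderBy("year_month")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df_chained = (df
    # gather the ratios seen so far within the department
    .withColumn("ratios", F.collect_list("productivity_ratio").over(w))
    # fold them into a running product (start value 1.0)
    .withColumn("chained_productivity_ratio",
                F.expr("aggregate(ratios, CAST(1.0 AS DOUBLE), (acc, x) -> acc * x)"))
    .drop("ratios"))
This computes the running product directly, at the cost of materializing a small array per row (at most around 100 elements per department here).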

Related

Efficient nested loops with bootstrap resampling on DataFrame in PySpark

I have a DataFrame with the following schema:
+--------------------+--------+-------+-------+-----------+--------------------+
| userid|datadate| runid|variant|device_type| prediction|
+--------------------+--------+-------+-------+-----------+--------------------+
|0001d15b-e2da-4f4...|20220111|1196752| 1| Mobile| 0.8827571312010658|
|00021723-2a0d-497...|20220111|1196752| 1| Mobile| 0.30763173370229735|
|00021723-2a0d-497...|20220111|1196752| 0| Mobile| 0.5336206154783815|
I would like to perform the following operation:
For each "runid" and each "device_type", I want to do some calculations comparing variant==1 with variant==0, including a resampling loop.
The ultimate goal is to store these calculations in another DF.
So in a naive approach the code would look like this:
for runid in df.select('runid').distinct().rdd.flatMap(list).collect():
    for device in ["Mobile", "Desktop"]:
        a_variant = df.filter((df.runid == runid) & (df.device_type == device) & (df.variant == 0))
        b_variant = df.filter((df.runid == runid) & (df.device_type == device) & (df.variant == 1))
        ## do some more calculations here
        # bootstrap loop:
        for samp in range(100):
            sampled_vector_a = a_variant.select("prediction").sample(withReplacement=True, fraction=1.0, seed=123)
            sampled_vector_b = b_variant.select("prediction").sample(withReplacement=True, fraction=1.0, seed=123)
            ## do some more calculations here
        ## do some more calculations here
        ## store calculations in a new DataFrame
Currently the process is too slow.
How can I optimize this process by utilizing spark in the best way?
Thanks!
Here is a way to sample from each group in a dataframe after applying groupBy.
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("Demo").getOrCreate()

data = [["uid1","runid1",1,"Mobile",0.8],["uid2","runid1",1,"Mobile",0.3],
        ["uid3","runid1",0,"Mobile",0.5],["uid4","runid2",0,"Mobile",0.7],
        ["uid5","runid2",0,"Mobile",0.9]]
columns = ["userid","runid","variant","device_type","prediction"]
df = spark.createDataFrame(data, columns)
df.show()
# +------+------+-------+-----------+----------+
# |userid| runid|variant|device_type|prediction|
# +------+------+-------+-----------+----------+
# | uid1|runid1| 1| Mobile| 0.8|
# | uid2|runid1| 1| Mobile| 0.3|
# | uid3|runid1| 0| Mobile| 0.5|
# | uid4|runid2| 0| Mobile| 0.7|
# | uid5|runid2| 0| Mobile| 0.9|
# +------+------+-------+-----------+----------+
Define a sampling function that is going to be called by applyInPandas. The function my_sample extracts one sample for each input dataframe:
def my_sample(key, df):
    x = df.sample(n=1)
    return x
applyInPandas also needs a schema for its output; since it returns the whole dataframe, the schema will have the same fields as df:
from pyspark.sql.types import *

schema = StructType([StructField('userid', StringType()),
                     StructField('runid', StringType()),
                     StructField('variant', LongType()),
                     StructField('device_type', StringType()),
                     StructField('prediction', DoubleType())])
Just to check, try grouping the data, there are three groups:
df.groupby("runid", "device_type", "variant").mean("prediction").show()
# +------+-----------+-------+---------------+
# | runid|device_type|variant|avg(prediction)|
# +------+-----------+-------+---------------+
# |runid1| Mobile| 0| 0.5|
# |runid1| Mobile| 1| 0.55|
# |runid2| Mobile| 0| 0.8|
# +------+-----------+-------+---------------+
Now apply my_sample to each group using applyInPandas:
df.groupby("runid","device_type","variant").applyInPandas(my_sample, schema=schema).show()
# +------+------+-------+-----------+----------+
# |userid| runid|variant|device_type|prediction|
# +------+------+-------+-----------+----------+
# | uid3|runid1| 0| Mobile| 0.5|
# | uid2|runid1| 1| Mobile| 0.3|
# | uid4|runid2| 0| Mobile| 0.7|
# +------+------+-------+-----------+----------+
Note: I used applyInPandas since pyspark.sql.GroupedData.apply is deprecated.
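Going back to the bootstrap loop in the question: a possible extension of the same idea (my sketch, not part of the answer above) is to push the whole 100-iteration resampling into applyInPandas, so each (runid, device_type) group is resampled on the executors instead of in a driver-side loop. The metric below (difference of means between variants) and the output schema are assumptions for illustration only:
import pandas as pd

def bootstrap_stats(key, pdf):
    # key is the tuple of grouping values, pdf holds the group's rows as a pandas DataFrame
    runid, device = key
    a = pdf.loc[pdf.variant == 0, "prediction"]
    b = pdf.loc[pdf.variant == 1, "prediction"]
    diffs = []
    for i in range(100):  # bootstrap resamples, with replacement
        sa = a.sample(frac=1.0, replace=True, random_state=i)
        sb = b.sample(frac=1.0, replace=True, random_state=i)
        diffs.append(sb.mean() - sa.mean())
    return pd.DataFrame([{"runid": runid, "device_type": device,
                          "mean_diff": float(pd.Series(diffs).mean())}])

out_schema = "runid string, device_type string, mean_diff double"
result = df.groupby("runid", "device_type").applyInPandas(bootstrap_stats, schema=out_schema)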

Translating a SAS Ranking with Tie set to HIGH into PySpark

I'm trying to replicate the following SAS code in PySpark:
PROC RANK DATA = aud_baskets OUT = aud_baskets_ranks GROUPS=10 TIES=HIGH;
BY customer_id;
VAR expenditure;
RANKS basket_rank;
RUN;
The idea is to rank all expenditures under each customer_id block. The data would look like this:
+-----------+--------------+-----------+
|customer_id|transaction_id|expenditure|
+-----------+--------------+-----------+
| A| 1| 34|
| A| 2| 90|
| B| 1| 89|
| A| 3| 6|
| B| 2| 8|
| B| 3| 7|
| C| 1| 96|
| C| 2| 9|
+-----------+--------------+-----------+
In PySpark, I tried this:
spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = (aud_baskets_ranks.withColumn('basket_rank', ntile(10).over(spendWindow)))
The problem is that PySpark doesn't let the user change the way it handles ties, like SAS does (that I know of). I need to set this behavior in PySpark so that values are moved up to the next tier each time one of those edge cases occurs, as opposed to dropping them to the rank below.
Or is there a way to custom write this approach?
Use dense_rank; it will give the same rank in case of ties, and the next rank will not be skipped.
The ntile function splits the records of each partition into n parts, which in your case is 10.
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank, col

spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = aud_baskets_ranks.withColumn('basket_rank', dense_rank().over(spendWindow))
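If the goal is to reproduce SAS TIES=HIGH exactly rather than just avoid skipped ranks, one option (an assumption of mine, not part of the answers here) is a cumulative count over a RANGE frame: with orderBy on expenditure, the frame includes all tied peers, so each row gets the count of rows less than or equal to it, which is exactly the highest rank among ties:
from pyspark.sql import functions as F, Window

w = (Window.partitionBy('customer_id')
           .orderBy(F.col('expenditure'))
           .rangeBetween(Window.unboundedPreceding, Window.currentRow))

# count of rows with expenditure <= current row's expenditure == TIES=HIGH rank
aud_baskets = aud_baskets_ranks.withColumn('ties_high_rank', F.count(F.lit(1)).over(w))
The GROUPS=10 bucketing can then be applied on top of that rank, for example with the FLOOR(rank*10/(n+1)) formula used in the answer below.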
Try the following code. It is generated by an automated tool called SPROCKET. It should take care of ties.
df = (aud_baskets)
for (colToRank, rankedName) in zip(['expenditure'], ['basket_rank']):
    wA = Window.orderBy(asc(colToRank))
    df_w_rank = (df.withColumn('raw_rank', rank().over(wA)))
    ties = df_w_rank.groupBy('raw_rank').count().filter("""count > 1""")
    df_w_rank = (df_w_rank.join(ties, ['raw_rank'], 'left')
                 .withColumn(rankedName, expr("""case when count is not null
                     then (raw_rank + count - 1) else
                     raw_rank end""")))
    rankedNameGroup = rankedName
    n = df_w_rank.count()
    df_with_rank_groups = (df_w_rank.withColumn(rankedNameGroup,
                           expr("""FLOOR({rankedName}*{k}/({n}+1))""".format(k=10, n=n, rankedName=rankedName))))
    df = df_with_rank_groups
aud_baskets_ranks = df_with_rank_groups.drop('raw_rank', 'count')

Pyspark lazy evaluation in loops too slow

First of all I want to let you know that I am still very new to Spark and am getting used to the lazy-evaluation concept.
Here my issue:
I have two spark DataFrames that I load from reading CSV.GZ files.
What I am trying to do is to merge both tables in order to split the first table according to keys that I have in the second one.
For example:
Table A
+----------+---------+--------+---------+------+
| Date| Zone| X| Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010010000| B| A| 684|
|2019-01-16|010020000| B| A| 21771|
|2019-01-16|010030000| B| A| 7497|
|2019-01-16|010040000| B| A| 74852|
Table B
+----+---------+
|Dept| Zone|
+----+---------+
| 01|010010000|
| 02|010020000|
| 01|010030000|
| 02|010040000|
Then when I merge both tables I have:
+---------+----------+--------+---------+------+----+
| Zone| Date| X| Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010010000|2019-01-16| B| A| 684| 01|
|010020000|2019-01-16| B| A| 21771| 02|
|010030000|2019-01-16| B| A| 7497| 01|
|010040000|2019-01-16| B| A| 74852| 02|
So what I want to do is to split this table into Y disjoint tables, where Y is the number of different 'Dept' values that I find in my merged table.
So for example:
Result1:
+---------+----------+--------+---------+------+----+
| Zone| Date| X| Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010010000|2019-01-16| B| A| 684| 01|
|010030000|2019-01-16| B| A| 7497| 01|
Result2:
+---------+----------+--------+---------+------+----+
| Zone| Date| X| Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010020000|2019-01-16| B| A| 21771| 02|
|010040000|2019-01-16| B| A| 74852| 02|
My code looks like this:
sp_df_A = spark.read.csv(file_path_A, header=True, sep=';', encoding='cp1252')
sp_df_B = spark.read.csv(file_path_B, header=True, sep=';', encoding='cp1252')
sp_merged_df = sp_df_A.join(sp_df_B, on=['Zone'], how='left')

# list of unique 'Dept' values on the merged DataFrame
unique_buckets = [x.__getitem__('Dept') for x in sp_merged_df.select('Dept').distinct().collect()]

# Iterate over all 'Dept' found
for zone_bucket in unique_buckets:
    print(zone_bucket)
    bucket_dir = os.path.join(output_dir, 'Zone_%s' % zone_bucket)
    if not os.path.exists(bucket_dir):
        os.mkdir(bucket_dir)
    # Filter target 'Dept'
    tmp_df = sp_merged_df.filter(sp_merged_df['Dept'] == zone_bucket)
    # write result
    tmp_df.write.format('com.databricks.spark.csv').option('codec', 'org.apache.hadoop.io.compress.GzipCodec').save(bucket_dir, header='true')
The thing is that this very simple code is taking too much time to write a result. So my guess is that the lazy evaluation is loading, merging and filtering on every cycle of the loop.
Can this be the case?
Your guess is correct. Your code reads, joins and filters all the data for each of the buckets. This is indeed caused by the lazy evaluation of spark.
Spark defers all data transformations until an action is performed. When an action is called, Spark looks at all the transformations and creates a plan for how to efficiently get the result of the action. While Spark executes this plan the program waits. When Spark is done the program continues, and Spark "forgets" about everything it has done until the next action is called.
In your case spark "forgets" the joined dataframe sp_merged_df and each time a .collect() or .save() is called it reconstructs it.
If you want Spark to "remember" an RDD or DataFrame you can .cache() it (see docs).
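As a concrete illustration (a minimal sketch, assuming it is acceptable for the per-Dept outputs to land in sub-directories named Dept=<value> instead of Zone_<value>): caching the joined frame avoids recomputing the read and join on every iteration, and write.partitionBy can replace the Python loop entirely:
sp_merged_df = sp_df_A.join(sp_df_B, on=['Zone'], how='left').cache()

(sp_merged_df
    .write
    .partitionBy('Dept')                 # one sub-directory per distinct Dept value
    .option('header', 'true')
    .option('compression', 'gzip')
    .csv(output_dir))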

sampling with weight using pyspark

I have an unbalanced dataframe on spark using PySpark.
I want to resample it to make it balanced.
I only find the sample function in PySpark
sample(withReplacement, fraction, seed=None)
but I want to sample the dataframe with weight of unitvolume
in Python, I can do it like
df.sample(n, False, weights=log(unitvolume))
is there any method I could do the same using PySpark?
Spark provides tools for stratified sampling, but this works only on categorical data. You could try to bucketize it:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, log

df_log = df.withColumn("log_unitvolume", log(col("unitvolume")))
splits = ... # A list of splits
bucketizer = Bucketizer(splits=splits, inputCol="log_unitvolume", outputCol="bucketed_log_unitvolume")
df_log_bucketed = bucketizer.transform(df_log)
Compute statistics:
counts = df_log_bucketed.groupBy("bucketed_log_unitvolume").count()
fractions = ... # Define fractions from each bucket:
and use these for sampling:
df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions)
You can also try to rescale log_unitvolume to [0, 1] range and then:
from pyspark.sql.functions import rand
df_log_rescaled.where(col("log_unitvolume_rescaled") < rand())
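To tie the sampleBy fragments above together (the splits and the target count below are assumptions of mine), the bucket counts can be turned into per-bucket fractions so that every bucket contributes roughly the same number of rows to the sample:
target_per_bucket = 1000  # assumed sample size per bucket

counts = {row['bucketed_log_unitvolume']: row['count']
          for row in df_log_bucketed.groupBy('bucketed_log_unitvolume').count().collect()}
fractions = {k: min(1.0, float(target_per_bucket) / v) for k, v in counts.items()}

df_balanced = df_log_bucketed.sampleBy('bucketed_log_unitvolume', fractions, seed=42)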
I think it will be better to simply ignore the .sample() function altogether. Sampling without replacement can be implemented with a uniform random number generator:
import pyspark.sql.functions as F
n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]
df.filter(F.rand(seed=843) < F.col('weight') / total_weight * n_samples_appx)
This will randomly include/exclude rows from your dataset, which is typically comparable to sampling with replacement. You should be careful about interpretation if the right-hand side exceeds 1 -- weighted sampling is a nuanced process that, rigorously speaking, should only be performed with replacement.
So if you want to sample with replacement instead, you can use F.rand() to get samples of the poisson distribution which will tell you how many copies of the row to include, and you can either treat that value as a weight, or do some annoying joins & unions to duplicate your rows. But I find that this is typically not required.
You can also do this in a portable repeatable way with the hash:
import pyspark.sql.functions as F
n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]
df.filter(F.hash(F.col('id')) % (total_weight / n_samples_appx * F.col('weight')).astype('int') == 0)
This will sample at a rate of 1-in-modulo, which incorporates your weight. hash() should be a consistent and deterministic function, but the sampling will behave as if random.
One way to do it is to use a udf to make a sampling column. This column holds a random number multiplied by the desired weight. Then we sort by the sampling column and take the top N.
Consider the following illustrative example:
Create Dummy Data
import numpy as np
import string
import pyspark.sql.functions as f
index = range(100)
weights = [i%26 for i in index]
labels = [string.ascii_uppercase[w] for w in weights]
df = sqlCtx.createDataFrame(
    zip(index, labels, weights),
    ('index', 'label', 'weight')
)
df.show(n=5)
#+-----+-----+------+
#|index|label|weight|
#+-----+-----+------+
#| 0| A| 0|
#| 1| B| 1|
#| 2| C| 2|
#| 3| D| 3|
#| 4| E| 4|
#+-----+-----+------+
#only showing top 5 rows
Add Sampling Column
In this example, we want to sample the DataFrame using the column weight as the weight. We define a udf using numpy.random.random() to generate uniform random numbers and multiply by the weight. Then we use sort() on this column and use limit() to get the desired number of samples.
from pyspark.sql.types import FloatType

N = 10  # the number of samples

def get_sample_value(x):
    return np.random.random() * x

get_sample_value_udf = f.udf(get_sample_value, FloatType())

df_sample = df.withColumn('sampleVal', get_sample_value_udf(f.col('weight')))\
    .sort('sampleVal', ascending=False)\
    .select('index', 'label', 'weight')\
    .limit(N)
Result
As expected, the DataFrame df_sample has 10 rows, and its contents tend to have letters near the end of the alphabet (higher weights).
df_sample.count()
#10
df_sample.show()
#+-----+-----+------+
#|index|label|weight|
#+-----+-----+------+
#| 23| X| 23|
#| 73| V| 21|
#| 46| U| 20|
#| 25| Z| 25|
#| 19| T| 19|
#| 96| S| 18|
#| 75| X| 23|
#| 48| W| 22|
#| 51| Z| 25|
#| 69| R| 17|
#+-----+-----+------+
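A udf-free variant of the same idea (my suggestion, not part of the answer above) is the Efraimidis-Spirakis "A-Res" scheme: use rand() ** (1/weight) as the sort key, so that keeping the N largest keys gives each row an inclusion probability proportional to its weight:
import pyspark.sql.functions as f

N = 10
df_sample = (df
    .withColumn('sampleKey', f.pow(f.rand(seed=42), 1.0 / f.col('weight')))
    .orderBy(f.col('sampleKey').desc())
    .limit(N)
    .select('index', 'label', 'weight'))
Rows with weight 0 produce a null key (division by zero under the default, non-ANSI settings), which sorts last under desc and is therefore never sampled, matching the intended behaviour.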

Reading and grouping data to get count using python spark

I'm new to spark using python and I'm trying to do some basic stuff to get an understanding of python and spark.
I have a file like below -
empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100
I want to write a small Python Spark program to read the file and print the count of employees in each department.
I've been working with databases, and this is quite simple in SQL, but I'm trying to do it using Python and Spark. I don't have code to share as I'm completely new to Python and Spark, but I want to understand how it works with a simple hands-on example.
I've installed pyspark and did some quick reading here https://spark.apache.org/docs/latest/quick-start.html
From my understanding there are dataframes on which one can perform SQL-like operations such as group by, but I'm not sure how to write the proper code.
You can read the text file as a dataframe using :
df = spark.createDataFrame(
    sc.textFile("path/to/my/file").map(lambda l: l.split('||')),
    ["empid", "deptid", "salary"]
)
textFile loads the data sample as an RDD with only one column. Then we split each line through a map and convert it to a dataframe.
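Alternatively (a sketch of mine, assuming Spark 3.0+, which accepts multi-character delimiters), the '||'-separated file can be read directly with the CSV reader, which also takes care of the header row and of inferring the numeric types:
df = (spark.read
      .option('header', 'true')
      .option('sep', '||')
      .option('inferSchema', 'true')
      .csv('path/to/my/file'))

df.groupBy('deptid').count().show()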
Starting from a python list of lists:
df = spark.createDataFrame(
    sc.parallelize([[1,10,500],
                    [2,10,200],
                    [3,20,300],
                    [4,20,400],
                    [5,20,100]]),
    ["empid", "deptid", "salary"]
)
df.show()
+-----+------+------+
|empid|deptid|salary|
+-----+------+------+
| 1| 10| 500|
| 2| 10| 200|
| 3| 20| 300|
| 4| 20| 400|
| 5| 20| 100|
+-----+------+------+
Now to count the number of employees by department we'll use a groupBy and then use the count aggregation function:
df_agg = df.groupBy("deptid").count()
df_agg.show()
+------+-----+
|deptid|count|
+------+-----+
| 10| 2|
| 20| 3|
+------+-----+
For the max:
import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()
