Column with avg over previous day in PySpark

I want to create a new column which is the mean of the previous day's sales, using PySpark.
Consider that these values are at different timestamps.
For example, convert this:
| Date | value |
|------------|-------|
| 2019/02/11 | 30 |
| 2019/02/11 | 40 |
| 2019/02/11 | 20 |
| 2019/02/12 | 10 |
| 2019/02/12 | 15 |
to this:
| Date | value | avg |
|------------|-------|------|
| 2019/02/11 | 30 | null |
| 2019/02/11 | 40 | null |
| 2019/02/11 | 20 | null |
| 2019/02/12 | 10 | 30 |
| 2019/02/12 | 15 | 30 |
My thinking: use filter and an aggregation function to obtain the average, but it throws an error and I am not sure where I am going wrong.
df = df.withColumn("avg",lit((df.filter(df["date"] == date_sub("date",1)).agg({"value": "avg"}))))

You can do that using window functions, but you have to create a new column to handle the dates.
I added a few lines to your example:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
    "rnk",
    F.dense_rank().over(Window.partitionBy().orderBy("date"))
).withColumn(
    "avg",
    F.avg("value").over(Window.partitionBy().orderBy("rnk").rangeBetween(-1, -1))
).show()
+----------+-----+---+----+
| date|value|rnk| avg|
+----------+-----+---+----+
|2018-01-01| 20| 1|null|
|2018-01-01| 30| 1|null|
|2018-01-01| 40| 1|null|
|2018-01-02| 40| 2|30.0|
|2018-01-02| 30| 2|30.0|
|2018-01-03| 40| 3|35.0|
|2018-01-03| 40| 3|35.0|
+----------+-----+---+----+
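Note that Window.partitionBy() with no partition columns moves all the rows to a single partition (Spark logs a warning about this), which is fine for an example this size but can hurt performance on large data.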
You can also do that using aggregation:
agg_df = df.withColumn("date", F.date_add("date", 1)).groupBy('date').avg("value")
df.join(agg_df, how="full_outer", on="date").orderBy("date").show()
+----------+-----+----------+
| date|value|avg(value)|
+----------+-----+----------+
|2018-01-01| 20| null|
|2018-01-01| 30| null|
|2018-01-01| 40| null|
|2018-01-02| 30| 30.0|
|2018-01-02| 40| 30.0|
|2018-01-03| 40| 35.0|
|2018-01-03| 40| 35.0|
|2018-01-04| null| 40.0|
+----------+-----+----------+
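Note that the full_outer join also produces a 2018-01-04 row with a null value, carrying the last day's average forward; if you do not want that extra row, use a left join from df instead.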

Step 0: Creating a DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg, lag
df = sqlContext.createDataFrame([('2019/02/11', 30), ('2019/02/11', 40), ('2019/02/11', 20),
                                 ('2019/02/12', 10), ('2019/02/12', 15),
                                 ('2019/02/13', 10), ('2019/02/13', 20)], ['Date', 'value'])
Step 1: First calculate the average per date, then use the lag window function to pull in the previous date's average.
my_window = Window.partitionBy().orderBy('Date')
df_avg_previous = df.groupBy('Date').agg(avg(col('value')).alias('avg'))
df_avg_previous = df_avg_previous.withColumn('avg', lag(col('avg'),1).over(my_window))
df_avg_previous.show()
+----------+----+
| Date| avg|
+----------+----+
|2019/02/11|null|
|2019/02/12|30.0|
|2019/02/13|12.5|
+----------+----+
Step 2: Finally, join the two DataFrames using a left join.
df = df.join(df_avg_previous, ['Date'],how='left').orderBy('Date')
df.show()
+----------+-----+----+
| Date|value| avg|
+----------+-----+----+
|2019/02/11| 40|null|
|2019/02/11| 20|null|
|2019/02/11| 30|null|
|2019/02/12| 10|30.0|
|2019/02/12| 15|30.0|
|2019/02/13| 10|12.5|
|2019/02/13| 20|12.5|
+----------+-----+----+
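Keep in mind that lag(col('avg'), 1) shifts by one row of the per-date averages, i.e. the previous date present in the data rather than the previous calendar day; if your dates can have gaps and you need strict previous-day semantics, the date_add/join approach from the other answer handles that.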

Related

How to use ALS with multiple implicit feedbacks?

In the ALS example given in the PySpark documentation (http://spark.apache.org/docs/latest/ml-collaborative-filtering.html), the data used has explicit feedback in one column. The data is like this:
| User | Item | Rating |
| --- | --- | --- |
| First | A | 2 |
| Second | B | 3|
However, in my case I have implicit feedbacks in multiple columns like this:
| User | Item | Clicks | Views | Purchase |
| --- | --- | --- | --- | --- |
| First | A | 20 | 35 | 3 |
| Second | B | 3| 12 | 0 |
I know we can use implicit feedback by setting implicitPrefs to True. However, it only accepts a single column. How can I use multiple columns?
I found this question: How to manage multiple positive implicit feedbacks? However, it is not related to Spark or the Alternating Least Squares method. Do I have to manually assign a weighting scheme as per that answer, or is there a better solution in PySpark?
I have researched your issue thoroughly and haven't found a way to pass multiple columns to ALS; most such problems are solved by manually weighting the signals and creating a single Rating column.
Below is my solution.
Create an index for the Views, Clicks and Purchase values as follows: extract the smallest non-zero value of each column and divide all elements of that column by it.
Example: the minimum value of the Purchase column is 3, so compute 3/3, 10/3, 20/3, and so on.
After getting the indexed values for these columns, calculate the Rating.
The formula for the Rating is:
Rating = 60% of Purchase + 30% of Clicks + 10% of Views
data.show()
+------+----+------+-----+--------+
| User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First| A| 20| 35| 3|
|Second| B| 3| 12| 0|
| Three| C| 4| 15| 20|
| Four| D| 5| 16| 10|
+------+----+------+-----+--------+
from pyspark.sql.functions import col, round

df1 = data.sort('Purchase').select('Purchase')
df1 = df1.filter(df1.Purchase > 0)
purch_index = df1.first()['Purchase']

df2 = data.sort('Views').select('Views')
df2 = df2.filter(df2.Views > 0)
views_index = df2.first()['Views']

df3 = data.sort('Clicks').select('Clicks')
df3 = df3.filter(df3.Clicks > 0)
clicks_index = df3.first()['Clicks']

semi_rawdf = (data
    .withColumn('Clicks', round(col('Clicks') / clicks_index))
    .withColumn('Views', round(col('Views') / views_index))
    .withColumn('Purchase', round(col('Purchase') / purch_index)))
semi_rawdf.show()
+------+----+------+-----+--------+
| User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First| A| 7.0| 3.0| 1.0|
|Second| B| 1.0| 1.0| 0.0|
| Three| C| 1.0| 1.0| 7.0|
| Four| D| 2.0| 1.0| 3.0|
+------+----+------+-----+--------+
from pyspark.sql.types import DecimalType

# note: the Views and Purchase terms are rounded before being summed
refined_df = semi_rawdf.withColumn(
    'Rating',
    (col('Clicks') * 0.3) + round(col('Views') * 0.1) + round(col('Purchase') * 0.6))
refined_df = refined_df.withColumn('Rating', col('Rating').cast(DecimalType(6, 2)))
refined_df.select('User', 'Item', 'Rating').show()
+------+----+------+
| User|Item|Rating|
+------+----+------+
| First| A| 3.10|
|Second| B| 0.30|
| Three| C| 4.30|
| Four| D| 2.60|
+------+----+------+
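If you then want to feed this weighted Rating into ALS, here is a minimal sketch on top of the answer above (the indexing step and the user_idx/item_idx column names are assumptions for illustration, not part of the original answer; ALS expects integer ids, and implicitPrefs=True treats the score as implicit feedback):
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# ALS expects numeric user/item ids, so index the string columns first
indexed = StringIndexer(inputCol="User", outputCol="user_idx").fit(refined_df).transform(refined_df)
indexed = StringIndexer(inputCol="Item", outputCol="item_idx").fit(indexed).transform(indexed)
indexed = indexed.withColumn("Rating", indexed["Rating"].cast("float"))

als = ALS(userCol="user_idx", itemCol="item_idx", ratingCol="Rating",
          implicitPrefs=True,          # treat the weighted score as implicit feedback strength
          coldStartStrategy="drop")
model = als.fit(indexed)
model.recommendForAllUsers(2).show()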

How to pass a third-party column after a GroupBy and aggregation in PySpark DataFrame?

I have a Spark DataFrame, say df, on which I need to apply a groupBy on col1, aggregate by the maximum value of col2, and pass through the corresponding value of col3 (which has nothing to do with the groupBy or the aggregation). It is best to illustrate it with an example.
df.show()
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| 1| 500| 10 |
| 1| 600| 11 |
| 1| 700| 12 |
| 2| 600| 14 |
| 2| 800| 15 |
| 2| 650| 17 |
+-----+-----+-----+
I can easily perform the groupBy and the aggregation to obtain the maximum value of each group in col2, using
import pyspark.sql.functions as F
df1 = df.groupBy("col1").agg(
F.max("col2").alias('Max_col2')).show()
+-----+---------+
| col1| Max_col2|
+-----+---------+
| 1| 700|
| 2| 800|
+-----+---------+
However, what I am struggling with and what I would like to do is to, additionally, pass the corresponding value of col3, thus obtaining the following table:
+-----+---------+-----+
| col1| Max_col2| col3|
+-----+---------+-----+
| 1| 700| 12 |
| 2| 800| 15 |
+-----+---------+-----+
Does anyone know how this can be done?
Many thanks in advance,
Marioanzas
You can aggregate the maximum of a struct, and then expand the struct:
import pyspark.sql.functions as F
df2 = df.groupBy('col1').agg(
    F.max(F.struct('col2', 'col3')).alias('col')
).select('col1', 'col.*')
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 700| 12|
| 2| 800| 15|
+----+----+----+
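For reference, this works because Spark compares structs field by field, so F.max picks the row with the largest col2 (ties broken by col3), and selecting 'col.*' unpacks that row's col2 and col3 back into ordinary columns.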

Most performant way to perform custom one-hot-encoding on a PySpark dataframe?

Let's say we have this PySpark dataframe:
+----+-------------+
| id | string_data |
+----+-------------+
| 1 | "test" |
+----+-------------+
| 2 | null |
+----+-------------+
| 3 | "9" |
+----+-------------+
| 4 | "deleted__" |
I want to perform some operation on this that will result in this dataframe:
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1 | "test" | 0 | 0 | 1 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2 | null | 1 | 0 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3 | "9" | 0 | 1 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4 | "deleted__" | 0 | 0 | 0 | 1 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
Where each of the new columns has either a 1 or a 0 depending on the truth value. I have currently implemented this using a custom UDF that checks the value of the string_data column, but this is incredibly slow. I have also tried implementing a UDF that does not create new columns but instead overwrites the original one with an encoded vector [1, 0, 0...], etc. This is also too slow because we have to apply this to millions of rows and thousands of columns.
Is there any better way of doing this? I understand UDFs are not the most efficient way to solve things in PySpark but I can't seem to find any built-in PySpark functions that work.
Any thoughts would be appreciated!
Edit: Sorry, from mobile I didn't see the full expected output so my previous answer was very incomplete.
Anyway, your operation has to be done in two steps, starting with this DataFrame:
>>> df.show()
+---+-----------+
| id|string_data|
+---+-----------+
| 1| test|
| 2| null|
| 3| 9|
| 4| deleted__|
+---+-----------+
Create the boolean fields based on the conditions in the string_data field:
>>> from pyspark.sql.functions import coalesce, col, lit
>>> df = (df
        .withColumn('is_string_data_null', df.string_data.isNull())
        .withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
        .withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
        .withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
    )
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| false| false| true| false|
| 2| null| true| false| false| false|
| 3| 9| false| true| false| false|
| 4| deleted__| false| false| false| true|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
Now that we have our columns, we can cast them to integers:
>>> df = (df
        .withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
        .withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
        .withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
        .withColumn('is_string_normal', df.is_string_normal.cast('integer'))
    )
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| 0| 0| 1| 0|
| 2| null| 1| 0| 0| 0|
| 3| 9| 0| 1| 0| 0|
| 4| deleted__| 0| 0| 0| 1|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
This should be far more performant than a UDF, as all the operations are done by Spark itself, so there is no context switch from Spark to Python.
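If you need the same treatment for many string columns at once (the question mentions thousands of columns), here is a sketch of the same idea built in a single select; the string_cols list and the flags_for helper are assumptions for illustration, not part of the answer above:
from pyspark.sql import functions as F

string_cols = ["string_data"]  # assumed: extend with every column you want to encode

def flags_for(c):
    # build the four integer flag columns for one string column
    is_null = F.col(c).isNull()
    is_number = F.col(c).cast("integer").isNotNull()
    has_test = F.coalesce(F.col(c), F.lit("")).contains("test")
    is_normal = ~(is_null | is_number | has_test)
    return [
        is_null.cast("integer").alias("is_" + c + "_null"),
        is_number.cast("integer").alias("is_" + c + "_a_number"),
        has_test.cast("integer").alias("does_" + c + "_contain_keyword_test"),
        is_normal.cast("integer").alias("is_" + c + "_normal"),
    ]

encoded = df.select("id", *[flag for c in string_cols for flag in flags_for(c)])
encoded.show()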

Joining DataFrames with the same column name in PySpark

I have two DataFrames which have been read from two CSV files.
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 30|
| 2|9090909093| 30|
| 3|9090909090| 30|
| 4|9090909094| 30|
+---+----------+-----------------+
and
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 40|
| 2|9090909093| 50|
| 3|9090909090| 60|
| 4|9090909094| 70|
+---+----------+-----------------+
I am trying to join these two DataFrames on the NUMBER column using the PySpark code dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner'), and the new DataFrame is generated as follows.
+----------+---+-----------------+---+-----------------+
| NUMBER | ID| RECHARGE_AMOUNT| ID| RECHARGE_AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092| 1| 30| 1| 40|
|9090909093| 2| 30| 2| 50|
|9090909090| 3| 30| 3| 60|
|9090909094| 4| 30| 4| 70|
+----------+---+-----------------+---+-----------------+
But I am not able to write this DataFrame to a file, since the DataFrame after joining has duplicate columns. I am using the following code: dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output', header='true'). Is there any way to avoid duplicate columns after joining in Spark? Given below is my PySpark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("test1").getOrCreate()
files = ["/home/user/test1.txt", "/home/user/test2.txt"]
dfFinal = spark.read.load(files[0], format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
dfFinal.show()
for i in range(1, len(files)):
    df2 = spark.read.load(files[i], format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
    df2.show()
    dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner')
    dfFinal.show()
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output', header='true')
I need to generate unique column names, i.e. if I give two files in the files array with the same columns, it should generate output like the following:
+----------+----+-------------------+-----+-------------------+
| NUMBER |IDx | RECHARGE_AMOUNTx | IDy | RECHARGE_AMOUNTy |
+----------+----+-------------------+-----+-------------------+
|9090909092| 1 | 30 | 1 | 40 |
|9090909093| 2 | 30 | 2 | 50 |
|9090909090| 3 | 30 | 3 | 60 |
|9090909094| 4 | 30 | 4 | 70 |
+----------+----+-------------------+-----+-------------------+
In pandas I can use the suffixes argument as shown below: dfFinal = dfFinal.merge(df2, left_on='NUMBER', right_on='NUMBER', how='inner', suffixes=('x', 'y'), sort=True), which will generate the above DataFrame. Is there any way I can replicate this in PySpark?
You can select the columns from each DataFrame and alias them, like this:
dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner') \
    .select('NUMBER',
            dfFinal.ID.alias('ID_1'),
            dfFinal.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_1'),
            df2.ID.alias('ID_2'),
            df2.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_2'))
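If you want something closer to the pandas suffixes behaviour without listing every column by hand, here is a sketch (the with_suffix helper is an assumption for illustration; dfFinal and df2 are the two DataFrames read from the files, and NUMBER is the only join key):
def with_suffix(df, suffix, keys):
    # append a suffix to every column that is not a join key
    for c in df.columns:
        if c not in keys:
            df = df.withColumnRenamed(c, c + suffix)
    return df

keys = ['NUMBER']
joined = with_suffix(dfFinal, 'x', keys).join(with_suffix(df2, 'y', keys), on=keys, how='inner')
joined.show()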

PySpark Dataframe take mean of list within column and create new column with 1 & 0 depending on a condition

I am trying to calculate the mean of a list (cost) within a PySpark DataFrame column; the values that are less than the mean get the value 1, and those above the mean get 0.
This is the current dataframe:
+----------+--------------------+--------------------+
| id| collect_list(p_id)|collect_list(cost) |
+----------+--------------------+--------------------+
| 7|[10, 987, 872] |[12.0, 124.6, 197.0]|
| 6|[11, 858, 299] |[15.0, 167.16, 50.0]|
| 17| [2]| [65.4785]|
| 1|[34359738369, 343...|[16.023384, 104.9...|
| 3|[17179869185, 0, ...|[48.3255, 132.025...|
+----------+--------------------+--------------------+
This is the desired output:
+----------+--------------------+--------------------+-----------+
| id| p_id |cost | result |
+----------+--------------------+--------------------+-----------+
| 7|10 |12.0 | 1 |
| 7|987 |124.6 | 0 |
| 7|872 |197.0 | 0 |
| 6|11 |15.0 | 1 |
| 6|858 |167.16 | 0 |
| 6|299 |50.0 | 1 |
| 17|2 |65.4785 | 1 |
+----------+--------------------+--------------------+-----------+
from pyspark.sql.functions import col, mean

# sample data
df = sc.parallelize([(7, [10, 987, 872], [12.0, 124.6, 197.0]),
                     (6, [11, 858, 299], [15.0, 167.16, 50.0]),
                     (17, [2], [65.4785])]).toDF(["id", "collect_list(p_id)", "collect_list(cost)"])

# unpack collect_list into the desired output format
df = df.rdd.flatMap(lambda row: [(row[0], x, y) for x, y in zip(row[1], row[2])]).toDF(["id", "p_id", "cost"])

df1 = df.\
    join(df.groupBy("id").agg(mean("cost").alias("mean_cost")), "id", 'left').\
    withColumn("result", (col("cost") <= col("mean_cost")).cast("int")).\
    drop("mean_cost")
df1.show()
Output is:
+---+----+-------+------+
| id|p_id| cost|result|
+---+----+-------+------+
| 7| 10| 12.0| 1|
| 7| 987| 124.6| 0|
| 7| 872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6| 858| 167.16| 0|
| 6| 299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+----+-------+------+
You can create a result list for every row and then zip the pid, cost and result lists. After that, use explode on the zipped column.
import numpy as np
from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import *

def zip_cols(pid_list, cost_list):
    # 1 if the cost is at or below the mean of the list, else 0
    mean = np.mean(cost_list)
    res_list = [1 if mean >= cost else 0 for cost in cost_list]
    return [(x, y, z) for x, y, z in zip(pid_list, cost_list, res_list)]

udf_zip = udf(zip_cols, ArrayType(StructType([StructField("pid", IntegerType()),
                                              StructField("cost", DoubleType()),
                                              StructField("result", IntegerType())])))

df1 = (df.withColumn("temp", udf_zip("collect_list(p_id)", "collect_list(cost)")).
       drop("collect_list(p_id)", "collect_list(cost)"))
df2 = (df1.withColumn("temp", explode(df1.temp)).
       select("id", col("temp.pid").alias("pid"),
              col("temp.cost").alias("cost"),
              col("temp.result").alias("result")))
df2.show()
Output:
+---+---+-------+------+
| id|pid| cost|result|
+---+---+-------+------+
| 7| 10| 12.0| 1|
| 7| 987| 124.6| 0|
| 7|872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6|858| 167.16| 0|
| 6|299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+---+-------+------+
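If you would rather avoid the Python UDF entirely, here is a sketch of the same logic with built-in functions only, starting from the original df with the collect_list columns (requires Spark 2.4+ for arrays_zip; the renamed p_ids/costs columns are assumptions just to keep the struct field names simple):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rename the aggregate columns so the struct fields produced by arrays_zip get simple names
df_r = (df.withColumnRenamed("collect_list(p_id)", "p_ids")
          .withColumnRenamed("collect_list(cost)", "costs"))

# one row per (p_id, cost) pair
exploded = (df_r
    .withColumn("pair", F.explode(F.arrays_zip("p_ids", "costs")))
    .select("id",
            F.col("pair.p_ids").alias("p_id"),
            F.col("pair.costs").alias("cost")))

# flag costs that are at or below the mean cost of their id group
w = Window.partitionBy("id")
result = exploded.withColumn("result", (F.col("cost") <= F.avg("cost").over(w)).cast("int"))
result.show()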
