In the ALS example given in the PySpark documentation (http://spark.apache.org/docs/latest/ml-collaborative-filtering.html), the data used has explicit feedback in a single column. The data is like this:
| User | Item | Rating |
| --- | --- | --- |
| First | A | 2 |
| Second | B | 3|
However, in my case I have implicit feedbacks in multiple columns like this:
| User | Item | Clicks | Views | Purchase |
| --- | --- | --- | --- | --- |
| First | A | 20 | 35 | 3 |
| Second | B | 3| 12 | 0 |
I know we can use implicit feedback by setting implicitPrefs to True. However, ALS only accepts a single rating column. How can I use multiple columns?
I found this question: How to manage multiple positive implicit feedbacks? However, it is not related to Spark or the Alternating Least Squares method. Do I have to manually assign a weighting scheme as in that answer, or is there a better solution in PySpark?
I have researched your issue thoroughly; I haven't found a way to pass multiple columns to ALS, and most such problems are solved by manually weighting the columns and creating a single Rating column.
Below is my solution
Create an index for the Views, Clicks and Purchase values as follows:
Extract the smallest non-zero value in each column and divide all elements of that column by it.
Example: the minimum value of the Purchase column is 3,
so the indexed values are 3/3, 10/3, 20/3, etc.
After getting the indexed values for these columns, calculate the Rating.
Below is the formula for Rating:
Rating = 60% of Purchase + 30% of Clicks + 10% of Views
data.show()
+------+----+------+-----+--------+
| User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First| A| 20| 35| 3|
|Second| B| 3| 12| 0|
| Three| C| 4| 15| 20|
| Four| D| 5| 16| 10|
+------+----+------+-----+--------+
from pyspark.sql.functions import col, round

df1 = data.sort('Purchase').select('Purchase')
df1 = df1.filter(df1.Purchase > 0)
purch_index = df1.first()['Purchase']

df2 = data.sort('Views').select('Views')
df2 = df2.filter(df2.Views > 0)
views_index = df2.first()['Views']

df3 = data.sort('Clicks').select('Clicks')
df3 = df3.filter(df3.Clicks > 0)
clicks_index = df3.first()['Clicks']

semi_rawdf = (data
    .withColumn('Clicks', round(col('Clicks') / clicks_index))
    .withColumn('Views', round(col('Views') / views_index))
    .withColumn('Purchase', round(col('Purchase') / purch_index)))
semi_rawdf.show()
+------+----+------+-----+--------+
| User|Item|Clicks|Views|Purchase|
+------+----+------+-----+--------+
| First| A| 7.0| 3.0| 1.0|
|Second| B| 1.0| 1.0| 0.0|
| Three| C| 1.0| 1.0| 7.0|
| Four| D| 2.0| 1.0| 3.0|
+------+----+------+-----+--------+
from pyspark.sql.types import DecimalType
from decimal import Decimal
refined_df = semi_rawdf.withColumn('Rating',((col('Clicks')*0.3)+round(col('Views')*0.1)+round(col('Purchase')*0.6)))
refined_df = refined_df.withColumn('Rating', col('Rating').cast(DecimalType(6,2)))
refined_df.select('User','Item','Rating').show()
+------+----+------+
| User|Item|Rating|
+------+----+------+
| First| A| 3.10|
|Second| B| 0.30|
| Three| C| 4.30|
| Four| D| 2.60|
+------+----+------+
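Once the weighted Rating column exists, it can be fed to ALS. Below is a minimal sketch, assuming the string User/Item columns are first indexed to integer ids (which ALS requires) and treating the weighted score as implicit feedback via implicitPrefs=True:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# ALS needs numeric user/item ids, so index the string columns first
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in ("User", "Item")]

als = ALS(userCol="User_idx", itemCol="Item_idx", ratingCol="Rating",
          implicitPrefs=True,            # the weighted Rating acts as an implicit-confidence score
          coldStartStrategy="drop")

# cast the decimal Rating to float before fitting
ratings = refined_df.selectExpr("User", "Item", "cast(Rating as float) as Rating")
model = Pipeline(stages=indexers + [als]).fit(ratings)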
Let's say we have this PySpark dataframe:
+----+-------------+
| id | string_data |
+----+-------------+
| 1 | "test" |
+----+-------------+
| 2 | null |
+----+-------------+
| 3 | "9" |
+----+-------------+
| 4 | "deleted__" |
+----+-------------+
I want to perform some operation on this that will result in this dataframe:
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| id | string_data | is_string_data_null | is_string_data_a_number | does_string_data_contain_keyword_test | is_string_data_normal |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 1 | "test" | 0 | 0 | 1 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 2 | null | 1 | 0 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 3 | "9" | 0 | 1 | 0 | 0 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
| 4 | "deleted__" | 0 | 0 | 0 | 1 |
+----+-------------+---------------------+-------------------------+---------------------------------------+-----------------------+
Where each of the new columns has either a 1 or a 0 depending on the truth value. I have currently implemented this using a custom UDF that checks the value of the string_data column, but this is incredibly slow. I have also tried implementing a UDF that does not create new columns but instead overwrites the original one with an encoded vector [1, 0, 0...], etc. This is also too slow because we have to apply this to millions of rows and thousands of columns.
Is there any better way of doing this? I understand UDFs are not the most efficient way to solve things in PySpark but I can't seem to find any built-in PySpark functions that work.
Any thoughts would be appreciated!
Edit: Sorry, from mobile I didn't see the full expected output so my previous answer was very incomplete.
Anyway, your operation has to be done in two steps, starting with this DataFrame:
>>> df.show()
+---+-----------+
| id|string_data|
+---+-----------+
| 1| test|
| 2| null|
| 3| 9|
| 4| deleted__|
+---+-----------+
Create the boolean fields based on the conditions in the string_data field:
>>> from pyspark.sql.functions import coalesce, col, lit
>>> df = (df
.withColumn('is_string_data_null', df.string_data.isNull())
.withColumn('is_string_data_a_number', df.string_data.cast('integer').isNotNull())
.withColumn('does_string_data_contain_keyword_test', coalesce(df.string_data, lit('')).contains('test'))
.withColumn('is_string_normal', ~(col('is_string_data_null') | col('is_string_data_a_number') | col('does_string_data_contain_keyword_test')))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| false| false| true| false|
| 2| null| true| false| false| false|
| 3| 9| false| true| false| false|
| 4| deleted__| false| false| false| true|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
Now that we have our columns, we can cast them to integers:
>>> df = (df
.withColumn('is_string_data_null', df.is_string_data_null.cast('integer'))
.withColumn('is_string_data_a_number', df.is_string_data_a_number.cast('integer'))
.withColumn('does_string_data_contain_keyword_test', df.does_string_data_contain_keyword_test.cast('integer'))
.withColumn('is_string_normal', df.is_string_normal.cast('integer'))
)
>>> df.show()
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| id|string_data|is_string_data_null|is_string_data_a_number|does_string_data_contain_keyword_test|is_string_normal|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
| 1| test| 0| 0| 1| 0|
| 2| null| 1| 0| 0| 0|
| 3| 9| 0| 1| 0| 0|
| 4| deleted__| 0| 0| 0| 1|
+---+-----------+-------------------+-----------------------+-------------------------------------+----------------+
This should be far more performant than a UDF, as all the operations are done by Spark itself, so there is no context switching between Spark and Python.
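If you prefer, the same flags can be built and cast in one projection instead of chained withColumn calls. This is just a sketch of the equivalent logic, starting from the original two-column dataframe:
from pyspark.sql import functions as F

is_null = F.col('string_data').isNull()
is_number = F.col('string_data').cast('integer').isNotNull()
has_test = F.coalesce(F.col('string_data'), F.lit('')).contains('test')
is_normal = ~(is_null | is_number | has_test)

df = df.select(
    'id', 'string_data',
    is_null.cast('integer').alias('is_string_data_null'),
    is_number.cast('integer').alias('is_string_data_a_number'),
    has_test.cast('integer').alias('does_string_data_contain_keyword_test'),
    is_normal.cast('integer').alias('is_string_normal'),
)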
Still new to Spark and I'm trying to do this final transformation as cleanly and efficiently as possible.
Say I have a dataframe that looks like the following
+------+--------+
|ID | Hit |
+------+--------+
|123 | 0 |
|456 | 1 |
|789 | 0 |
|123 | 1 |
|123 | 0 |
|789 | 1 |
|1234 | 0 |
| 1234 | 0 |
+------+--------+
I'm trying to end up with a new dataframe (or two, depending on what's more efficient) where, if an ID has any row with a 1 in Hit, none of that ID's rows with a 0 in Hit are kept, and IDs that only have 0s are reduced to distinct rows based on the ID column.
Here's one of the methods I tried but I'm not sure if this is
1. The most efficient way possible
2. The cleanest way possible
dfhits = df.filter(df.Hit == 1)
dfnonhits = df.filter(df.Hit == 0)
hit_ids = [row['ID'] for row in dfhits.select('ID').distinct().collect()]
dfnonhitsdistinct = dfnonhits.filter(~dfnonhits['ID'].isin(hit_ids)).dropDuplicates()
The end dataset would look like the following:
+------+--------+
|ID | Hit |
+------+--------+
|456 | 1 |
|123 | 1 |
|789 | 1 |
|1234 | 0 |
+------+--------+
# Creating the Dataframe.
from pyspark.sql.functions import col
df = sqlContext.createDataFrame([(123,0),(456,1),(789,0),(123,1),(123,0),(789,1),(500,0),(500,0)],
['ID','Hit'])
df.show()
+---+---+
| ID|Hit|
+---+---+
|123| 0|
|456| 1|
|789| 0|
|123| 1|
|123| 0|
|789| 1|
|500| 0|
|500| 0|
+---+---+
The idea is to find the total of Hit per ID; if it is more than 0, there is at least one 1 present in Hit for that ID. When this condition is true, we remove all of that ID's rows with a Hit value of 0.
# Registering the dataframe as a temporary view.
df.createOrReplaceTempView('table_view')
df=sqlContext.sql(
'select ID, Hit, sum(Hit) over (partition by ID) as sum_Hit from table_view'
)
df.show()
+---+---+-------+
| ID|Hit|sum_Hit|
+---+---+-------+
|789| 0| 1|
|789| 1| 1|
|500| 0| 0|
|500| 0| 0|
|123| 0| 1|
|123| 1| 1|
|123| 0| 1|
|456| 1| 1|
+---+---+-------+
df = df.filter(~((col('Hit')==0) & (col('sum_Hit')>0))).drop('sum_Hit').dropDuplicates()
df.show()
+---+---+
| ID|Hit|
+---+---+
|789| 1|
|500| 0|
|123| 1|
|456| 1|
+---+---+
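The same logic can also be written with the DataFrame API instead of a temporary view and SQL. A sketch, starting again from the dataframe created at the top of this answer:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID')

result = (df
    .withColumn('sum_Hit', F.sum('Hit').over(w))              # total Hits per ID
    .filter(~((F.col('Hit') == 0) & (F.col('sum_Hit') > 0)))  # drop 0-rows for IDs that have a 1
    .drop('sum_Hit')
    .dropDuplicates())
result.show()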
I want to create a new column which is the mean of the previous day's sales, using PySpark.
Consider that these values are recorded at different timestamps.
For example, convert this:
| Date | value |
|------------|-------|
| 2019/02/11 | 30 |
| 2019/02/11 | 40 |
| 2019/02/11 | 20 |
| 2019/02/12 | 10 |
| 2019/02/12 | 15 |
to this
| Date | value | avg |
|------------|-------|------|
| 2019/02/11 | 30 | null |
| 2019/02/11 | 40 | null |
| 2019/02/11 | 20 | null |
| 2019/02/12 | 10 | 30 |
| 2019/02/12 | 15 | 30 |
My thinking:
Use filter and an aggregation function to obtain the average, but it's throwing an error. Not sure where I'm going wrong.
df = df.withColumn("avg",lit((df.filter(df["date"] == date_sub("date",1)).agg({"value": "avg"}))))
You can do that using window functions, but you have to create a new column to handle the dates.
I added a few lines to your example:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn(
"rnk",
F.dense_rank().over(Window.partitionBy().orderBy("date"))
).withColumn(
"avg",
F.avg("value").over(Window.partitionBy().orderBy("rnk").rangeBetween(-1,-1))
).show()
+----------+-----+---+----+
| date|value|rnk| avg|
+----------+-----+---+----+
|2018-01-01| 20| 1|null|
|2018-01-01| 30| 1|null|
|2018-01-01| 40| 1|null|
|2018-01-02| 40| 2|30.0|
|2018-01-02| 30| 2|30.0|
|2018-01-03| 40| 3|35.0|
|2018-01-03| 40| 3|35.0|
+----------+-----+---+----+
You can also do that using aggregation:
agg_df = df.withColumn("date", F.date_add("date", 1)).groupBy('date').avg("value")
df.join(agg_df, how="full_outer", on="date").orderBy("date").show()
+----------+-----+----------+
| date|value|avg(value)|
+----------+-----+----------+
|2018-01-01| 20| null|
|2018-01-01| 30| null|
|2018-01-01| 40| null|
|2018-01-02| 30| 30.0|
|2018-01-02| 40| 30.0|
|2018-01-03| 40| 35.0|
|2018-01-03| 40| 35.0|
|2018-01-04| null| 40.0|
+----------+-----+----------+
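If the trailing date with no sales (2018-01-04 above) is not wanted, a left join keeps only the dates present in the original dataframe; a small variation of the same idea:
df.join(agg_df, how="left", on="date").orderBy("date").show()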
Step 0: Creating a DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg, lag
df = sqlContext.createDataFrame([('2019/02/11',30),('2019/02/11',40),('2019/02/11',20),
('2019/02/12',10),('2019/02/12',15),
('2019/02/13',10),('2019/02/13',20)],['Date','value'])
Step 1: First calculate the average, then use a window function to get the lag by one day.
my_window = Window.partitionBy().orderBy('Date')
df_avg_previous = df.groupBy('Date').agg(avg(col('value')).alias('avg'))
df_avg_previous = df_avg_previous.withColumn('avg', lag(col('avg'),1).over(my_window))
df_avg_previous.show()
+----------+----+
| Date| avg|
+----------+----+
|2019/02/11|null|
|2019/02/12|30.0|
|2019/02/13|12.5|
+----------+----+
Step 2: Finally, join the two dataframes using a left join.
df = df.join(df_avg_previous, ['Date'],how='left').orderBy('Date')
df.show()
+----------+-----+----+
| Date|value| avg|
+----------+-----+----+
|2019/02/11| 40|null|
|2019/02/11| 20|null|
|2019/02/11| 30|null|
|2019/02/12| 10|30.0|
|2019/02/12| 15|30.0|
|2019/02/13| 10|12.5|
|2019/02/13| 20|12.5|
+----------+-----+----+
I'm trying to group and sum for a PySpark (2.4) DataFrame, but I can only get the values one by one.
I have the following dataframe:
data.groupBy("card_scheme", "failed").count().show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
I'm trying to calculate the formula X = false / (false + true) for each card_scheme and still get one dataframe in the end.
I'm expecting something like:
| card_scheme | X |
|-------------|---|
| jcb | 1 |
| .... | . |
| visa | 0.9846| (which is 126372 / (126372 + 1975))
| ... | . |
Creating the dataset
myValues = [('jcb',False,4),('american express', False, 22084),('AMEX',False,4),('mastercard',True,1122),('visa',True,1975),('visa',False,126372),('CB',False,6),('discover',False,2219),('maestro',False,2),('VISA',False,13),('mastercard',False,40856),('MASTERCARD',False,9)]
df = sqlContext.createDataFrame(myValues,['card_Scheme','failed','count'])
df.show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
Method 1: This method will be slower, as it involves a transpose via pivot.
from pyspark.sql.functions import col, when

df = df.groupBy("card_Scheme").pivot("failed").sum("count")
df = df.withColumn('X', when(col('true').isNotNull(), col('false') / (col('false') + col('true'))).otherwise(1))
df = df.select('card_Scheme', 'X')
df.show()
+----------------+------------------+
| card_Scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
Method 2: Use a window function. This will be a lot faster.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum

df = df.groupBy("card_scheme", "failed").agg(sum("count"))\
    .withColumn("X", col("sum(count)") / sum("sum(count)").over(Window.partitionBy(col('card_scheme'))))\
    .where(col('failed') == False).drop('failed', 'sum(count)')
df.show()
+----------------+------------------+
| card_scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
First split the root dataframe into two dataframes:
df_true = data.filter(data.failed == True).alias("df1")
df_false = data.filter(data.failed == False).alias("df2")
Then, doing a full outer join, we can get the final result:
from pyspark.sql.functions import col,when
df_result = df_true.join(df_false,df_true.card_scheme == df_false.card_scheme, "outer") \
.select(when(col("df1.card_scheme").isNotNull(), col("df1.card_scheme")).otherwise(col("df2.card_scheme")).alias("card_scheme") \
, when(col("df1.failed").isNotNull(), (col("df2.count")/(col("df1.count") + col("df2.count")))).otherwise(1).alias("X"))
No need to do a groupBy; just two extra dataframes and a join.
data.groupBy("card_scheme").pivot("failed").agg(count("card_scheme")) should work. I am not sure about agg(count(any_column)), but the key is the pivot function. As a result you'll get two new columns: false and true. Then you can easily calculate x = false / (false + true).
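A sketch of how that could look end to end, assuming data is the raw per-transaction dataframe from the question (so counting rows gives the same numbers as the grouped table shown there):
from pyspark.sql import functions as F

pivoted = (data.groupBy("card_scheme")
               .pivot("failed")              # one column per failed value: true / false
               .agg(F.count("card_scheme"))
               .na.fill(0))

result = pivoted.withColumn("X", F.col("false") / (F.col("false") + F.col("true")))
result.select("card_scheme", "X").show()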
A simple solution would be to do a second groupby:
val grouped_df = data.groupBy("card_scheme", "failed").count() // your dataframe
val with_countFalse = grouped_df.withColumn("countfalse", when($"failed" === "false", $"count").otherwise(lit(0)))
with_countFalse.groupBy("card_scheme").agg((sum($"countfalse") / sum($"count")).alias("X")).show()
The idea is that you can create a second column which has the count when failed=false and 0 otherwise. This means that the sum of the count column gives you false + true, while the sum of countfalse gives just the false. Then simply do a second groupBy.
Note: Some of the other answers use pivot. I believe the pivot solution would be slower (it does more); however, if you do choose to use it, add the specific values to the pivot call, i.e. pivot("failed", ["true", "false"]), to improve performance, otherwise Spark would have to do two passes (the first to find the distinct values).
from pyspark.sql import functions as func
from pyspark.sql.functions import col
data = data.groupby("card_scheme", "failed").count()
Create 2 new dataframes:
a = data.filter(col("failed") == "false").groupby("card_scheme").agg(func.sum("count").alias("num"))
b = data.groupby("card_scheme").agg(func.sum("count").alias("den"))
Join both the dataframes:
c = a.join(b, a.card_scheme == b.card_scheme).drop(b.card_scheme)
Divide one column with another:
c.withColumn('X', c.num/c.den)
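Note that withColumn returns a new DataFrame rather than modifying c in place, so assign the result before selecting the final columns, for example:
c = c.withColumn('X', c.num / c.den)
c.select('card_scheme', 'X').show()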
I am trying to calculate the mean of a list (cost) within a PySpark DataFrame column; the values that are less than the mean get the value 1, and those above the mean get 0.
This is the current dataframe:
+----------+--------------------+--------------------+
| id| collect_list(p_id)|collect_list(cost) |
+----------+--------------------+--------------------+
| 7|[10, 987, 872] |[12.0, 124.6, 197.0]|
| 6|[11, 858, 299] |[15.0, 167.16, 50.0]|
| 17| [2]| [65.4785]|
| 1|[34359738369, 343...|[16.023384, 104.9...|
| 3|[17179869185, 0, ...|[48.3255, 132.025...|
+----------+--------------------+--------------------+
This is the desired output:
+----------+--------------------+--------------------+-----------+
| id| p_id |cost | result |
+----------+--------------------+--------------------+-----------+
| 7|10 |12.0 | 1 |
| 7|987 |124.6 | 0 |
| 7|872 |197.0 | 0 |
| 6|11 |15.0 | 1 |
| 6|858 |167.16 | 0 |
| 6|299 |50.0 | 1 |
| 17|2 |65.4785 | 1 |
+----------+--------------------+--------------------+-----------+
from pyspark.sql.functions import col, mean
#sample data
df = sc.parallelize([(7,[10, 987, 872],[12.0, 124.6, 197.0]),
(6,[11, 858, 299],[15.0, 167.16, 50.0]),
(17,[2],[65.4785])]).toDF(["id", "collect_list(p_id)","collect_list(cost)"])
#unpack collect_list in desired output format
df = df.rdd.flatMap(lambda row: [(row[0], x, y) for x,y in zip(row[1],row[2])]).toDF(["id", "p_id","cost"])
df1 = df.\
join(df.groupBy("id").agg(mean("cost").alias("mean_cost")), "id", 'left').\
withColumn("result",(col("cost") <= col("mean_cost")).cast("int")).\
drop("mean_cost")
df1.show()
Output is:
+---+----+-------+------+
| id|p_id| cost|result|
+---+----+-------+------+
| 7| 10| 12.0| 1|
| 7| 987| 124.6| 0|
| 7| 872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6| 858| 167.16| 0|
| 6| 299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+----+-------+------+
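Since Spark 2.4, the same unpacking can also be done without dropping to the RDD API, using arrays_zip and explode. A minimal sketch, starting from the original nested dataframe (called nested_df here) and renaming the collect_list columns so the zipped struct gets plain field names:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# nested_df has the collect_list(p_id) / collect_list(cost) columns from the question
df2 = (nested_df.withColumnRenamed("collect_list(p_id)", "p_ids")
                .withColumnRenamed("collect_list(cost)", "costs"))

exploded = (df2
    .withColumn("z", F.explode(F.arrays_zip("p_ids", "costs")))
    .select("id", F.col("z.p_ids").alias("p_id"), F.col("z.costs").alias("cost")))

result = (exploded
    .withColumn("mean_cost", F.avg("cost").over(Window.partitionBy("id")))
    .withColumn("result", (F.col("cost") <= F.col("mean_cost")).cast("int"))
    .drop("mean_cost"))
result.show()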
You can create a result list for every row and then zip the pid, cost and result lists. After that, use explode on the zipped column.
import numpy as np
from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import *
def zip_cols(pid_list, cost_list):
    mean = np.mean(cost_list)
    res_list = list(map(lambda cost: 1 if mean >= cost else 0, cost_list))
    return [(x, y, z) for x, y, z in zip(pid_list, cost_list, res_list)]
udf_zip = udf(zip_cols, ArrayType(StructType([StructField("pid",IntegerType()),
StructField("cost", DoubleType()),
StructField("result",IntegerType())])))
df1 = (df.withColumn("temp",udf_zip("collect_list(p_id)","collect_list(cost)")).
drop("collect_list(p_id)","collect_list(cost)"))
df2 = (df1.withColumn("temp",explode(df1.temp)).
select("id",col("temp.pid").alias("pid"),
col("temp.cost").alias("cost"),
col("temp.result").alias("result")))
df2.show()
output
+---+---+-------+------+
| id|pid| cost|result|
+---+---+-------+------+
| 7| 10| 12.0| 1|
| 7|987| 124.6| 0|
| 7|872| 197.0| 0|
| 6| 11| 15.0| 1|
| 6|858| 167.16| 0|
| 6|299| 50.0| 1|
| 17| 2|65.4785| 1|
+---+---+-------+------+