I have a sales list like this (PySpark):
+---------+----------+
|productId| date|
+---------+----------+
| 868|2020-11-01|
| 878|2020-11-01|
| 878|2020-11-01|
| 913|2020-11-01|
| 746|2020-11-01|
| 878|2020-11-01|
| 657|2020-11-02|
| 746|2020-11-02|
| 101|2020-11-02|
+---------+----------+
So, I want to get a new column: the item's position by number of purchases. The most popular item of the day will have rank 1, and so on. I've tried to implement a window function, but can't figure out how to do it correctly. What is the best way to do it? Thanks.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Number of purchases of each product on each day.
count_per_day = F.count('productId').over(Window.partitionBy('date', 'productId')).alias('count_per_day')
df = df.select('*', count_per_day)

# Rank products within each day by that count; the most purchased product gets rank 1.
rank = F.rank().over(Window.partitionBy('date').orderBy(F.col('count_per_day').desc())).alias('rank')
df = df.select('*', rank)
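An equivalent sketch (assuming the same column names, and not part of the original answer) is to aggregate first and join the ranks back; this keeps the per-product counts in a small ranked table instead of carrying them on every sales row. dense_rank is used here so positions stay consecutive when counts tie; swap in rank to match the snippet above:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Purchases per product per day, ranked within each day.
daily_rank = (
    df.groupBy('date', 'productId')
      .agg(F.count('*').alias('count_per_day'))
      .withColumn('rank', F.dense_rank().over(
          Window.partitionBy('date').orderBy(F.col('count_per_day').desc())))
)

# Attach the rank back to the original sales rows.
df_ranked = df.join(daily_rank.select('date', 'productId', 'rank'),
                    on=['date', 'productId'], how='left')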
I have a dataframe which has two columns, unique_id and id_string. The dataframe looks like:
| unique_id| id_string |
| -------- | --------- |
| 123 | abc |
| 456 | pqr |
| 789 | xyz |
| 000 | lmn |
I want to compare the id_string of each unique_id with all the other id_string values in the column.
I want the output to look like the table below:
| unique_id| id_string | duplicate_id|duplicate_string|score|
| -------- | --------- |-------------|----------------|-----|
| 123 | abc |456 |pqr |91 |
| 123 | abc |789 |xyz |92 |
| 123 | abc |000 |lmn |93 |
I have written code using a for loop, shown below:
import pandas as pd
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

out_put_df = pd.DataFrame()
for i in input_df.index:
    unique_id = input_df.at[i, 'unique_id']
    id_string = input_df.at[i, 'id_string']
    # Compare with every subsequent row.
    for j in range(i + 1, len(input_df.index)):
        duplicate_id = input_df.at[j, 'unique_id']
        duplicate_string = input_df.at[j, 'id_string']
        score = fuzz.token_set_ratio(id_string, duplicate_string)
        out_put_df = out_put_df.append(
            pd.DataFrame({'unique_id': unique_id,
                          'id_string': id_string,
                          'duplicate_id': duplicate_id,
                          'duplicate_string': duplicate_string,
                          'score': score}, index=[0]),
            ignore_index=True)
The original dataframe has half a million rows, so this takes an extremely long time. Can someone please tell me the optimal way to do this? I came across itertools.combinations, but I am unable to use it either.
Thanks in advance.
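For reference, here is a minimal sketch of the itertools.combinations idea mentioned in the question, assuming fuzz is the same token_set_ratio scorer. It still compares every pair (so it is O(n²) and will remain slow on half a million rows), but it builds the result in one go instead of appending to a DataFrame inside the loop:

from itertools import combinations

import pandas as pd
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

# All unordered pairs of (unique_id, id_string) rows.
pairs = combinations(input_df[['unique_id', 'id_string']].itertuples(index=False), 2)
rows = [
    (a.unique_id, a.id_string, b.unique_id, b.id_string,
     fuzz.token_set_ratio(a.id_string, b.id_string))
    for a, b in pairs
]
out_put_df = pd.DataFrame(
    rows, columns=['unique_id', 'id_string', 'duplicate_id', 'duplicate_string', 'score'])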
You should use sorting. Sort by the id_string column and iterate over all rows; whenever the next value of id_string is equal to the current one, you have a duplicate. It is easiest to look back at the previous row's id_string:
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

prev_id_string = None  # or some other invalid value that does not exist in your df
prev_unique_id = None
rows = []
for _, (unique_id, id_string) in input_df.sort_values(by=['id_string', 'unique_id']).iterrows():
    if prev_id_string == id_string:
        score = fuzz.token_set_ratio(prev_id_string, id_string)  # or calculate it however you wish
        rows.append((prev_unique_id, prev_id_string, unique_id, id_string, score))
    prev_unique_id = unique_id
    prev_id_string = id_string
out_put_df = pd.DataFrame(
    columns="unique_id, id_string, duplicate_id, duplicate_string, score".split(", "),
    data=rows)
I am trying to remove rows from a dataframe where the first sequence of letters in the Provision Ref column is equal to the Product column.
For example, for the input:
+---------+---------------+
| Product | Provision Ref |
+---------+---------------+
| DVX | DVX9251 |
+---------+---------------+
| CDV | 22CDV95 |
+---------+---------------+
| TV | TV12369 |
+---------+---------------+
| TV | 992TV15 |
+---------+---------------+
Desired output:
+---------+---------------+
| Product | Provision Ref |
+---------+---------------+
| CDV | 22CDV95 |
+---------+---------------+
| TV | 992TV15 |
+---------+---------------+
I have tried both of the following pieces of code, but neither works:
df = df.loc[df['Provision Ref'].str[0:df['Product'].map(len)] != df['Product']]
df = df.loc[df['Provision Ref'].str[0:int(df['Product'].map(len))] != df['Product']]
Try this (note that filtered selects the rows whose Provision Ref does start with the Product; invert the boolean mask if you instead want to drop those rows, as in the desired output):
filtered = df[df.groupby('Product', sort=False).apply(lambda g: g['Provision Ref'].str.startswith(g['Product'].iloc[0])).tolist()]
Output:
>>> filtered
Product Provision Ref
0 DVX DVX9251
2 TV TV12369
More readable but less efficient:
filtered = df[df.apply(lambda x: x['Provision Ref'].startswith(x['Product']), axis=1)]
Another method, probably more efficient if the items of Product have few unique lengths (e.g. most are either 2, 3 or 4 chars long, etc.):
filtered = df[df.groupby(df['Product'].str.len(), sort=False).apply(lambda x: x['Provision Ref'].str[:len(x['Product'].iloc[0])] == x['Product']).tolist()]
We can use a row-wise .apply(), because .str.startswith() only accepts a fixed string (or tuple of strings), not a Series of per-row prefixes like df['Product'] (as #anarchy wrote).
df[~df.apply(lambda row: row['Provision Ref'].startswith(row['Product']), axis=1)]
Product Provision Ref
1 CDV 22CDV95
3 TV 992TV15
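As a further option (a sketch using the same column names, not from the original answers), the boolean mask can be built with a plain list comprehension over the two columns, which avoids the per-row apply machinery:

# Keep rows whose Provision Ref does NOT start with its Product code.
mask = [not ref.startswith(prod) for prod, ref in zip(df['Product'], df['Provision Ref'])]
result = df[mask]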
My goal is to replace all negative elements in a column of a PySpark.DataFrame with zero.
input data
+------+
| col1 |
+------+
| -2 |
| 1 |
| 3 |
| 0 |
| 2 |
| -7 |
| -14 |
| 3 |
+------+
desired output data
+------+
| col1 |
+------+
| 0 |
| 1 |
| 3 |
| 0 |
| 2 |
| 0 |
| 0 |
| 3 |
+------+
Basically I can do this as below:
df = df.withColumn('col1', F.when(F.col('col1') < 0, 0).otherwise(F.col('col1')))
or a UDF can be defined as
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

smooth = F.udf(lambda x: x if x > 0 else 0, IntegerType())
df = df.withColumn('col1', smooth(F.col('col1')))
or
df = df.withColumn('col1', (F.col('col1') + F.abs('col1')) / 2)
or
df = df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0)))
My question is, which one is the most efficient way of doing this? A UDF has optimization issues, so it's definitely not the right way to do this. But I don't know how to compare the other cases. One answer would of course be to run experiments and compare mean running times and so on, but I want to compare these approaches (and new approaches) theoretically.
Thanks in advance...
You can simply make a column where you say, if x > 0: x else 0. This would be the best approach.
The question has already been addressed, theoretically: Spark functions vs UDF performance?
import pyspark.sql.functions as F
df = df.withColumn("only_positive", F.when(F.col("col1") > 0, F.col("col1")).otherwise(0))
You can overwrite col1 in the original dataframe by passing 'col1' as the output column name to withColumn().
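As a hedged aside (not part of the original answer), one way to compare the non-UDF variants yourself is to look at their physical plans; all of them compile to plain Catalyst column expressions, so the plans tend to be nearly identical:

import pyspark.sql.functions as F

# Each of these stays inside Catalyst/Tungsten, unlike a Python UDF.
df.withColumn('col1', F.when(F.col('col1') < 0, 0).otherwise(F.col('col1'))).explain()
df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0))).explain()
df.withColumn('col1', (F.col('col1') + F.abs('col1')) / 2).explain()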
I am trying to select records from df1 where df1.date1 lies between df2.date2 and df2.date3 (there are only three date2/date3 ranges, taken row-wise).
In my case there is no common variable to establish a join criterion. I tried different pyspark.sql functions such as filter, when, withColumn, date_sub, date_add, etc., but was unable to find a solution.
I did go through several SO posts; however, most of them propose using join, which might not fit my problem!
df1
+----------+-----------+
| emp_id | date1 |
+----------+-----------+
| 67891 | 11-13-2015|
| 12345 | 02-28-2017|
| 34567 | 04-07-2017|
+----------+-----------+
df2
+------------+------------+
| date2 | date3 |
+------------+------------+
|01-28-2017 | 03-15-2017 |
|07-13-2017 | 11-13-2017 |
|06-07-2018 | 09-07-2018 |
+------------+------------+
Expected record:
+----------+-----------+
| emp_id | date1 |
+----------+-----------+
| 12345 | 02-28-2017|
+----------+-----------+
You can do non-equi joins in Spark. You don't necessarily need matching keys.
This is in Scala; I'm pretty sure it's almost the same in Python. Let me know if it doesn't work, and I will add a PySpark version as well.
scala> df1.join(df2 , 'date1 > 'date2 && 'date1 < 'date3).show
+------+----------+----------+----------+
|emp_id| date1| date2| date3|
+------+----------+----------+----------+
| 12345|02-28-2017|01-28-2017|03-15-2017|
+------+----------+----------+----------+
Pyspark solution:
>>> from pyspark.sql.functions import unix_timestamp
>>> from pyspark.sql.functions import from_unixtime
>>> x = [(67891 ,'11-13-2015'),(12345, '02-28-2017'),(34567,'04-07-2017')]
>>> df1 = spark.createDataFrame(x,['emp_id','date1'])
>>> y = [('01-28-2017','03-15-2017'),('07-13-2017','11-13-2017'),('06-07-2018','09-07-2018')]
>>> df2 = spark.createDataFrame(y,['date2','date3'])
>>> df1a = df1.select('emp_id', from_unixtime(unix_timestamp('date1', 'MM-dd-yyyy')).alias('date1'))
>>> df2a = df2.select(from_unixtime(unix_timestamp('date2', 'MM-dd-yyyy')).alias('date2'), from_unixtime(unix_timestamp('date3', 'MM-dd-yyyy')).alias('date3'))
>>> df1a.join(df2a, on=[df1a['date1'] > df2a['date2'], df1a['date1'] < df2a['date3']]).show()
+------+-------------------+-------------------+-------------------+
|emp_id| date1| date2| date3|
+------+-------------------+-------------------+-------------------+
| 12345|2017-02-28 00:00:00|2017-01-28 00:00:00|2017-03-15 00:00:00|
+------+-------------------+-------------------+-------------------+
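As a sketch of an alternative (assuming Spark 2.2+, where to_date accepts a format string, and not part of the original answer), the same non-equi join can be written with to_date and between, keeping only the df1 columns as in the expected output:

from pyspark.sql import functions as F

# Parse the string columns into proper dates first.
df1b = df1.withColumn('date1', F.to_date('date1', 'MM-dd-yyyy'))
df2b = (df2.withColumn('date2', F.to_date('date2', 'MM-dd-yyyy'))
           .withColumn('date3', F.to_date('date3', 'MM-dd-yyyy')))

# Non-equi join on the date range, then keep only df1's columns.
result = (df1b.join(df2b, F.col('date1').between(F.col('date2'), F.col('date3')))
              .select('emp_id', 'date1'))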
This question already has answers here:
How do I get a SQL row_number equivalent for a Spark RDD?
(4 answers)
Closed 4 years ago.
I would like to know the most efficient way to generate an index column that uniquely identifies each record within each group of label:
+-------+-------+-------+
| label | value | index |
+-------+-------+-------+
| a | v1 | 0 |
+-------+-------+-------+
| a | v2 | 1 |
+-------+-------+-------+
| a | v3 | 2 |
+-------+-------+-------+
| a | v4 | 3 |
+-------+-------+-------+
| b | v5 | 0 |
+-------+-------+-------+
| b | v6 | 1 |
+-------+-------+-------+
My actual data is very large and each group of label has the same number of records. Column index will be used for Pivot.
I could do the usual sort + for loop with an incrementing counter + a check that resets the index whenever the current label differs from the previous one, etc., but a faster and more efficient way is always welcome.
EDIT: got my answer from the suggested question:
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
df = df.withColumn("index",
F.row_number().over(
Window.partitionBy("label").orderBy("value"))
)
Thank you for all your help!
You can use window functions to create a rank-based column while partitioning on the label column. However, this requires an ordering, in this case on value:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window = Window.partitionBy(df['label']).orderBy(df['value'])
df.withColumn('index', row_number().over(window))
This will give a new column index with values starting from 1 (to start from 0, as in your example, simply subtract 1 from the expression above). The values will be ordered by the value column.
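For completeness, a minimal sketch of the 0-based variant mentioned above, using the same column names:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy('label').orderBy('value')
df = df.withColumn('index', F.row_number().over(window) - 1)  # 0-based index per label group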