Pyspark calculate a field on a grouped table - python

I've got a data frame that looks like this:
+-------+-----+-------------+------------+
|startID|endID|trip_distance|total_amount|
+-------+-----+-------------+------------+
|      1|    3|            5|          12|
|      1|    3|            0|           4|
+-------+-----+-------------+------------+
I need to create a new table that groups the trips by start and end ID and then figures out the average trip rate.
The trip rate is computed over all trips with the same start and end IDs. In my case, startID 1 and endID 3 had a total of 2 trips; for those 2 trips the average trip_distance was 2.5 and the average total_amount was 8, so the trip_rate should be 8/2.5 = 3.2.
So the end result should look like this:
+-------+-----+-----+----------+
|startID|endID|count| trip_rate|
+-------+-----+-----+----------+
|      1|    3|    2|       3.2|
+-------+-----+-----+----------+
Here is what I'm trying to do:
from pyspark.shell import spark
from pyspark.sql.functions import avg
df = spark.createDataFrame(
    [
        (1, 3, 5, 12),
        (1, 3, 0, 4)
    ],
    ['startID', 'endID', 'trip_distance', 'total_amount']  # add your column labels here
)
df.show()
grouped_table = df.groupBy('startID', 'endID').count().alias('count')
grouped_table.show()
grouped_table = df.withColumn('trip_rate', (avg('total_amount') / avg('trip_distance')))
grouped_table.show()
But I'm getting the following error:
pyspark.sql.utils.AnalysisException: "grouping expressions sequence is empty, and '`startID`' is not an aggregate function. Wrap '((avg(`total_amount`) / avg(`trip_distance`)) AS `trip_rate`)' in windowing function(s) or wrap '`startID`' in first() (or first_value) if you don't care which value you get.;;\nAggregate [startID#0L, endID#1L, trip_distance#2L, total_amount#3L, (avg(total_amount#3L) / avg(trip_distance#2L)) AS trip_rate#44]\n+- LogicalRDD [startID#0L, endID#1L, trip_distance#2L, total_amount#3L], false\n"
I tried wrapping the calculation in an AS function, but I kept getting syntax errors.

Group by, sum, and divide. count and sum can be used inside agg():
from pyspark.sql import functions as F
df.groupBy('startID', 'endID').agg(
    F.count(F.lit(1)).alias('count'),
    (F.sum('total_amount') / F.sum('trip_distance')).alias('trip_rate')
).show()
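Note that avg(total_amount) / avg(trip_distance) gives the same per-group ratio as sum/sum, since both averages divide by the same group count. A minimal sketch of that equivalent formulation, using the df built in the question:
from pyspark.sql import functions as F

# avg/avg equals sum/sum within each group, because both averages are
# taken over the same number of rows per (startID, endID) group.
df.groupBy('startID', 'endID').agg(
    F.count(F.lit(1)).alias('count'),
    (F.avg('total_amount') / F.avg('trip_distance')).alias('trip_rate')
).show()
# +-------+-----+-----+---------+
# |startID|endID|count|trip_rate|
# +-------+-----+-----+---------+
# |      1|    3|    2|      3.2|
# +-------+-----+-----+---------+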

Related

How can I show the highest and lowest numbers in two rows?

I want the highest number and the lowest number in two rows, but I am getting the whole output. Should I use the dense_rank or rank window function? My query is like so:
popular_eco_move = spark.sql("""
    select a.eco, b.eco_name, count(b.eco_name) as number_of_occurance
    from chess_game as a, chess_eco_codes as b
    where a.eco = b.eco
    group by a.eco, b.eco_name
    order by number_of_occurance desc
""")
popular_eco_move.show(10)
+---+--------------------+-------------------+
|eco| eco_name|number_of_occurance|
+---+--------------------+-------------------+
|C42| Petrov Defense| 64|
|E15| Queen's Indian| 56|
|C88| Ruy Lopez| 46|
|D37|Queen's Gambit De...| 44|
|B90| Sicilian, Najdorf| 38|
|C67| Ruy Lopez| 37|
|B12| Caro-Kann Defense| 37|
|C11| French| 35|
|C45| Scotch Game| 34|
|D27|Queen's Gambit Ac...| 32|
+---+--------------------+-------------------+
only showing top 10 rows
Result attributes: eco, eco_name, number_of_occurance
Final result will have only two rows
Hello, try a WITH clause to store a query and re-use it, like this:
with my_select as (
    select a.eco, b.eco_name, count(b.eco_name) as occurance
    from `game`.`chess_game` as a, `game`.`chess_eco_codes` as b
    where a.eco = b.eco
    group by a.eco, b.eco_name
)
select * from my_select
where occurance = (select max(occurance) from my_select)
   or occurance = (select min(occurance) from my_select)
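If you are running this from PySpark, the same query can be passed straight to spark.sql (a sketch, assuming the tables are registered under the names used in the question):
extremes = spark.sql("""
    with my_select as (
        select a.eco, b.eco_name, count(b.eco_name) as occurance
        from chess_game as a, chess_eco_codes as b
        where a.eco = b.eco
        group by a.eco, b.eco_name
    )
    select * from my_select
    where occurance = (select max(occurance) from my_select)
       or occurance = (select min(occurance) from my_select)
""")
extremes.show()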
If you're using PySpark, you should learn how to write it in its Pythonic way instead of just SQL.
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
(df
    .withColumn('rank_asc', F.dense_rank().over(W.orderBy(F.asc('number_of_occurance'))))
    .withColumn('rank_desc', F.dense_rank().over(W.orderBy(F.desc('number_of_occurance'))))
    .where((F.col('rank_asc') == 1) | (F.col('rank_desc') == 1))
    # .drop('rank_asc', 'rank_desc')  # to drop these two temp columns
    .show()
)
+---+--------------------+-------------------+--------+---------+
|eco| eco_name|number_of_occurance|rank_asc|rank_desc|
+---+--------------------+-------------------+--------+---------+
|C42| Petrov Defense| 64| 9| 1|
|D27|Queen's Gambit Ac...| 32| 1| 9|
+---+--------------------+-------------------+--------+---------+
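An equivalent DataFrame-only route, sketched under the same column names, is to aggregate the global min and max once and filter on them, with no window needed:
from pyspark.sql import functions as F

# Collect the global min/max of the count column, then keep only the matching rows.
bounds = df.agg(F.min('number_of_occurance').alias('mn'),
                F.max('number_of_occurance').alias('mx')).collect()[0]
df.where(F.col('number_of_occurance').isin(bounds['mn'], bounds['mx'])).show()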

pandas: apply filters taking into account timestamp

I have the following test data:
import pandas as pd
import datetime
data = {'date': ['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04', '2014-01-05', '2014-01-06', '2014-01-07'],
        'id': [1, 2, 2, 3, 4, 4, 5],
        'name': ['Darren', 'Sabrina', 'Steve', 'Sean', 'Ray', 'Stef', 'Dany']}
data = pd.DataFrame(data)
data['date'] = pd.to_datetime(data['date'])
The question is: Going back in time x days (viewed from every single entry), are there more than y distinct names which share the same id?
Here is the code I have written. In my example I go back x=2 days and check for at least two distinct names (y=1) sharing the same id. If at least two distinct names exist, I save 1 in the list "result_store", otherwise 0. Of course, in this example, going back x days is not possible if i is smaller than x, but this small inaccuracy is not a problem for me.
def rule(data, x=2, y=1):
    result_store = []
    for i in range(data.shape[0]):
        id = data['id'][i]
        end_time = data['date'][i]
        start_time = end_time - datetime.timedelta(days=x)
        time_frame = data[(data['date'] >= start_time) & (data['date'] <= end_time)]
        time_frame = time_frame.loc[time_frame['id'] == id]
        distinct_names = time_frame['name'].nunique()
        if distinct_names > y:
            result_store.append(1)
        else:
            result_store.append(0)
    return result_store
The result is
[0, 0, 1, 0, 0, 1, 0]
In reality, I have thousands of rows and my solution is extremely slow. I have also tried to parallelize over the index i using parmap, but the speedup also is not satisfactory. Is there any more efficient way to do this? Maybe by using pyspark?
Thank you!
This will work for Spark 2.4+ (array_distinct is only available in 2.4). I used the DataFrame you provided, and Spark inferred the column date to be of type TimestampType; my code requires the date column to be of that type. The window function travels back 2 days within the same id and collects a list of names. If the number of distinct names is > 1, it outputs 1, otherwise 0.
The code below uses rangeBetween(-(86400*2), Window.currentRow), which means: include the current row, then go back 2 days, so if the current row's date is day 3, the frame covers days [3, 2, 1]. If you only want the current row's date and 1 day before, replace 86400*2 with 86400*1.
# If you can't use spark2.4 or get stuck, please leave a comment.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data)
w = Window().partitionBy("id").orderBy(F.col("date").cast("long")).rangeBetween(-(86400*2), Window.currentRow)
df.withColumn("no_distinct", F.size(F.array_distinct(F.collect_list("name").over(w))))\
  .withColumn("no_distinct", F.when(F.col("no_distinct") > 1, F.lit(1)).otherwise(F.lit(0)))\
  .orderBy(F.col("date")).show()
+-------------------+---+-------+-----------+
| date| id| name|no_distinct|
+-------------------+---+-------+-----------+
|2014-01-01 00:00:00| 1| Darren| 0|
|2014-01-02 00:00:00| 2|Sabrina| 0|
|2014-01-03 00:00:00| 2| Steve| 1|
|2014-01-04 00:00:00| 3| Sean| 0|
|2014-01-05 00:00:00| 4| Ray| 0|
|2014-01-06 00:00:00| 4| Stef| 1|
|2014-01-07 00:00:00| 5| Dany| 0|
+-------------------+---+-------+-----------+
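A slightly shorter variant, sketched with the same window w and DataFrame as above, collects a set over the window instead of de-duplicating a list, so it does not need the 2.4-only array_distinct:
# collect_set already returns only distinct names within the window frame
df.withColumn("no_distinct",
              F.when(F.size(F.collect_set("name").over(w)) > 1, F.lit(1)).otherwise(F.lit(0)))\
  .orderBy(F.col("date")).show()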

pyspark - how to select exact number of records per strata using (df.sampleByKey()) in stratified random sampling

I have a Spark dataframe 'orders' (I am using PySpark). It has the following columns:
['id', 'orderdate', 'customerid', 'status']
I am trying to do stratified random sampling using 'status' as the key column. My objective is as below:
>> create a new dataframe with exactly 5 random records per status
The method I have chosen is .sampleBy('strata_key', {fraction_dict}). The challenge is choosing the exact fraction value for each status so that I always get exactly 5 random records per status. I have followed the method below.
1. Created a dictionary of total counts per status, as below:
#Total count of records for each order 'status' in 'ORDERS' dataframe is as below
d=dict([(x['status'],x['count']) for x in orders.groupBy("status").count().collect()])
print(d)
OUTPUT:
{'PENDING_PAYMENT': 15030, 'COMPLETE': 22899, 'ON_HOLD': 3798, 'PAYMENT_REVIEW': 729, 'PROCESSING': 8275, 'CLOSED': 7556, 'SUSPECTED_FRAUD': 1558,
'PENDING': 7610, 'CANCELED': 1428}
2. Created a function that generates the fraction values needed to fetch exactly N records:
# Exact number of records needed per status
N = 5

# function that calculates the fraction per status
def fraction_calc(count_dict, N):
    d_mod = {}
    for i in count_dict:
        d_mod[i] = N / count_dict[i]
    return d_mod

# creating a dictionary of fractions using the above function
fraction = fraction_calc(d, 5)
print(fraction)
OUTPUT:
{'PENDING_PAYMENT': 0.00033266799733865603, 'COMPLETE': 0.000218350146294598, 'ON_HOLD': 0.0013164823591363876, 'PAYMENT_REVIEW': 0.006858710562414266, 'PROCESSING': 0.0006042296072507553, 'CLOSED': 0.0006617257808364214, 'SUSPECTED_FRAUD': 0.003209242618741977, 'PENDING': 0.000657030223390276, 'CANCELED': 0.0035014005602240898}
3. Created the final dataframe, sampled using the stratified sampling API .sampleBy():
#creating final sampled dataframe
df_sample=orders.sampleBy("status",fraction)
But I am still not getting exactly 5 records per status. A sample output is below:
#Checking count per status of resultant sample dataframe
df_sample.groupBy("status").count().show()
+---------------+-----+
| status|count|
+---------------+-----+
|PENDING_PAYMENT| 3|
| COMPLETE| 6|
| ON_HOLD| 7|
| PAYMENT_REVIEW| 4|
| PROCESSING| 6|
| CLOSED| 6|
|SUSPECTED_FRAUD| 7|
| PENDING| 9|
| CANCELED| 5|
+---------------+-----+
What should I do here to achieve my objective?
Found a workaround:
from pyspark.sql.window import Window
from pyspark.sql.functions import rand,row_number
1. Use the rand() built-in function to generate a column 'key' of random numbers, then assign a row number to each element of the partition window created over the "status" column, ordered by 'key'. The code is as below:
df_sample = df.withColumn("key", rand()).\
    withColumn("rnk", row_number().\
        over(Window.partitionBy("status").\
            orderBy("key"))).\
    where("rnk <= 5").drop("key", "rnk")
2. Now I am getting exactly 5 random records per status. A sample output is below; it will change with every Spark session.
#Checking count per status of resultant sample dataframe
df_sample.groupBy("status").count().show()
+---------------+-----+
| status |count|
+---------------+-----+
|PENDING_PAYMENT| 5|
| COMPLETE| 5|
| ON_HOLD| 5|
| PAYMENT_REVIEW| 5|
| PROCESSING| 5|
| CLOSED| 5|
|SUSPECTED_FRAUD| 5|
| PENDING| 5|
| CANCELED| 5|
+---------------+-----+
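If the sample needs to be repeatable across runs, a small variation on the snippet above is to seed the random column (a sketch; the seed value 42 is arbitrary):
from pyspark.sql.window import Window
from pyspark.sql.functions import rand, row_number

# rand(seed=...) makes the 'key' column, and therefore the chosen rows,
# reproducible for the same data and partitioning.
df_sample = df.withColumn("key", rand(seed=42)).\
    withColumn("rnk", row_number().over(Window.partitionBy("status").orderBy("key"))).\
    where("rnk <= 5").drop("key", "rnk")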

pyspark `substr` without length

Is there a way, in pyspark, to perform the substr function on a DataFrame column, without specifying the length? Namely, something like df["my-col"].substr(begin).
I am not sure why this function is not exposed as an API in the pyspark.sql.functions module.
Spark SQL supports calling the substring function without the len argument (the full signature is substring(str, pos, len)).
You can use it through the expr api of the functions module, like below, to achieve the same:
df.withColumn('substr_name', f.expr("substring(name, 2)")).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
How Spark does it internally:
Now, if you look at the physical plan of the above statement, you will notice that when we don't pass len, Spark automatically adds 2147483647.
As @pault said in a comment, 2147483647 is the maximum positive value of a 32-bit signed integer (2^31 - 1).
df.withColumn('substr_name', f.expr("substring(name, 2)")).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 2147483647) AS substr_name#169]
+- Scan ExistingRDD[name#140,id#141L] --> 2147483647 is automatically added
The substring api of the functions module expects us to pass the length explicitly. If you want, you can pass any larger number as len that covers the maximum length of your column:
df.withColumn('substr_name', f.substring('name', 2, 100)).show()
+----------+---+-----------+
| name| id|substr_name|
+----------+---+-----------+
|Alex Shtof| 1| lex Shtof|
| SMaZ| 2| MaZ|
+----------+---+-----------+
>>> df.withColumn('substr_name', f.substring('name', 2, 100)).explain()
== Physical Plan ==
*Project [name#140, id#141L, substring(name#140, 2, 100) AS substr_name#189]
+- Scan ExistingRDD[name#140,id#141L] --> 100 is what we passed
If the objective is to make a substring from a position given by a parameter begin to the end of the string, then you can do it as follows:
import pyspark.sql.functions as f
l = [(1, 'Prague'), (2, 'New York')]
df = spark.createDataFrame(l, ['id', 'city'])
begin = 2
l = (f.length('city') - f.lit(begin) + 1)
(
    df
    .withColumn('substr', f.col('city').substr(f.lit(begin), l))
).show()
+---+--------+-------+
| id| city| substr|
+---+--------+-------+
| 1| Prague| rague|
| 2|New York|ew York|
+---+--------+-------+
I'd create a UDF.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import StringType
>>> df = spark.createDataFrame([('Alice', 23), ('Brian', 25)], schema=["name", "age"])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 23|
|Brian| 25|
+-----+---+
>>> @F.udf(returnType=StringType())
... def substr_udf(col):
... return str(col)[2:]
>>> df = df.withColumn('substr', substr_udf('name'))
>>> df.show()
+-----+---+------+
| name|age|substr|
+-----+---+------+
|Alice| 23| ice|
|Brian| 25| ian|
+-----+---+------+
No, we need to specify both parameters, pos and len.
But make sure that both are of the same type, otherwise it will give an error:
Error: Column is not iterable.
You can do in this way:
df = df.withColumn("new", F.col("previous").substr(F.lit(5), F.length("previous")-5))

sampling with weight using pyspark

I have an unbalanced dataframe on spark using PySpark.
I want to resample it to make it balanced.
I only find the sample function in PySpark
sample(withReplacement, fraction, seed=None)
but I want to sample the dataframe with weight of unitvolume
in Python, I can do it like
df.sample(n, False, weights=log(unitvolume))
is there any method I could do the same using PySpark?
Spark provides tools for stratified sampling, but this works only on categorical data. You could try to bucketize it:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import col, log
df_log = df.withColumn("log_unitvolume", log(col("unitvolume")))
splits = ... # A list of splits
bucketizer = Bucketizer(splits=splits, inputCol="log_unitvolume", outputCol="bucketed_log_unitvolume")
df_log_bucketed = bucketizer.transform(df_log)
Compute statistics:
counts = df_log_bucketed.groupBy("bucketed_log_unitvolume").count()
fractions = ... # Define fractions from each bucket:
and use these for sampling:
df_log_bucketed.sampleBy("bucketed_log_unitvolume", fractions)
You can also try to rescale log_unitvolume to [0, 1] range and then:
from pyspark.sql.functions import rand
df_log_rescaled.where(col("log_unitvolume_rescaled") < rand())
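The rescaling step is not spelled out above; a minimal min-max sketch, reusing the df_log and column names from the snippet above, could look like this:
from pyspark.sql.functions import col, min as min_, max as max_

# Min-max scale log_unitvolume into [0, 1] so it can be compared against rand().
stats = df_log.agg(min_("log_unitvolume").alias("mn"),
                   max_("log_unitvolume").alias("mx")).collect()[0]
df_log_rescaled = df_log.withColumn(
    "log_unitvolume_rescaled",
    (col("log_unitvolume") - stats["mn"]) / (stats["mx"] - stats["mn"]))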
I think it will be better to simply ignore the .sample() function altogether. Sampling without replacement can be implemented with a uniform random number generator:
import pyspark.sql.functions as F
n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]
df.filter(F.rand(seed=843) < F.col('weight') / total_weight * n_samples_appx)
This will randomly include/exclude rows from your dataset, which is typically comparable to sampling with replacement. You should be careful about interpretation if you have RHS that exceeds 1 -- weighted sampling is a nuanced process that, rigorously speaking, should only be performed with-replacement.
So if you want to sample with replacement instead, you can use F.rand() to get samples of the poisson distribution which will tell you how many copies of the row to include, and you can either treat that value as a weight, or do some annoying joins & unions to duplicate your rows. But I find that this is typically not required.
You can also do this in a portable repeatable way with the hash:
import pyspark.sql.functions as F
n_samples_appx = 100
total_weight = df.agg(F.sum('weight')).collect()[0][0]
df.filter(F.hash(F.col('id')) % (total_weight / n_samples_appx * F.col('weight')).astype('int') == 0)
This will sample at a rate of 1-in-modulo, which incorporates your weight. hash() should be a consistent and deterministic function, but the sampling will occur like random.
One way to do it is to use udf to make a sampling column. This column will have a random number multiplied by your desired weight. Then we sort by the sampling column, and take the top N.
Consider the following illustrative example:
Create Dummy Data
import numpy as np
import string
import pyspark.sql.functions as f
index = range(100)
weights = [i%26 for i in index]
labels = [string.ascii_uppercase[w] for w in weights]
df = sqlCtx.createDataFrame(
    zip(index, labels, weights),
    ('index', 'label', 'weight')
)
df.show(n=5)
#+-----+-----+------+
#|index|label|weight|
#+-----+-----+------+
#| 0| A| 0|
#| 1| B| 1|
#| 2| C| 2|
#| 3| D| 3|
#| 4| E| 4|
#+-----+-----+------+
#only showing top 5 rows
Add Sampling Column
In this example, we want to sample the DataFrame using the column weight as the weight. We define a udf using numpy.random.random() to generate uniform random numbers and multiply by the weight. Then we use sort() on this column and use limit() to get the desired number of samples.
from pyspark.sql.types import FloatType

N = 10  # the number of samples

def get_sample_value(x):
    return np.random.random() * x

get_sample_value_udf = f.udf(get_sample_value, FloatType())

df_sample = df.withColumn('sampleVal', get_sample_value_udf(f.col('weight')))\
    .sort('sampleVal', ascending=False)\
    .select('index', 'label', 'weight')\
    .limit(N)
Result
As expected, the DataFrame df_sample has 10 rows, and its contents tend to have letters near the end of the alphabet (higher weights).
df_sample.count()
#10
df_sample.show()
#+-----+-----+------+
#|index|label|weight|
#+-----+-----+------+
#| 23| X| 23|
#| 73| V| 21|
#| 46| U| 20|
#| 25| Z| 25|
#| 19| T| 19|
#| 96| S| 18|
#| 75| X| 23|
#| 48| W| 22|
#| 51| Z| 25|
#| 69| R| 17|
#+-----+-----+------+
