I have the following csv file.
Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand
I have to create an RDD where User, Model and gt form the primary key; I don't know whether I should use them as a tuple.
Then, once I have the primary key, I have to calculate AVG, MAX and MIN of 'x', 'y' and 'z'.
Here is an output:
User,Model,gt,media(x,y,z),desviacion(x,y,z),max(x,y,z),min(x,y,z)
a, nexus4,stand,-3.0,0.7,8.2,2.8,0.14,0.0,-1.0,0.8,8.2,-5.0,0.6,8.2
Any idea about how to group them and, for example, get the mean ("media") values from "x"?
With my current code I get the following.
# Data loading
lectura = sc.textFile("Phones_accelerometer.csv")
datos = lectura.map(lambda x: ((x.split(",")[6], x.split(",")[7], x.split(",")[9]),(x.split(",")[3], x.split(",")[4], x.split(",")[5])))
sumCount = datos.combineByKey(lambda value: (value, 1), lambda x, value: (x[0] + value, x[1] + 1), lambda x, y: (x[0] + y[0], x[1] + y[1]))
An example of my tuples:
[(('a', 'nexus4', 'stand'), ('-5.958191', '0.6880646', '8.135345'))]
If you have the csv data in a file as given in the question, then you can use sqlContext to read it as a dataframe and cast the appropriate types, as
df = sqlContext.read.format("com.databricks.spark.csv").option("header", True).load("path to csv file")
import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.select(F.col('User'), F.col('Model'), F.col('gt'), F.col('x').cast('float'), F.col('y').cast('float'), F.col('z').cast('float'))
I have selected only the primary keys and the necessary columns, which should give you
+----+------+-----+----------+---------+--------+
|User|Model |gt   |x         |y        |z       |
+----+------+-----+----------+---------+--------+
|a   |nexus4|stand|-5.958191 |0.6880646|8.135345|
|a   |nexus4|stand|-5.95224  |0.6702118|8.136536|
|a   |nexus4|stand|-5.9950867|0.6535492|8.204376|
|a   |nexus4|stand|-5.9427185|0.6761627|8.128204|
+----+------+-----+----------+---------+--------+
All of your requirements - the mean ("media" in the question), deviation, max and min - depend on the list of x, y and z values grouped by the primary keys User, Model and gt.
So you would need groupBy, the built-in collect_list function and a udf function to calculate all of your requirements. The final step is to separate the results into different columns, as given below.
from math import sqrt
def calculation(array):
    num_items = len(array)
    mean = sum(array) / num_items
    differences = [x - mean for x in array]
    sq_differences = [d ** 2 for d in differences]
    ssd = sum(sq_differences)
    # sample variance; guard against single-element groups
    variance = ssd / (num_items - 1) if num_items > 1 else 0.0
    sd = sqrt(variance)
    return [mean, sd, max(array), min(array)]
calcUdf = F.udf(calculation, T.ArrayType(T.FloatType()))
df.groupBy('User', 'Model', 'gt')\
    .agg(calcUdf(F.collect_list(F.col('x'))).alias('x'),
         calcUdf(F.collect_list(F.col('y'))).alias('y'),
         calcUdf(F.collect_list(F.col('z'))).alias('z'))\
    .select(F.col('User'), F.col('Model'), F.col('gt'),
            F.col('x')[0].alias('mean_x'), F.col('y')[0].alias('mean_y'), F.col('z')[0].alias('mean_z'),
            F.col('x')[1].alias('deviation_x'), F.col('y')[1].alias('deviation_y'), F.col('z')[1].alias('deviation_z'),
            F.col('x')[2].alias('max_x'), F.col('y')[2].alias('max_y'), F.col('z')[2].alias('max_z'),
            F.col('x')[3].alias('min_x'), F.col('y')[3].alias('min_y'), F.col('z')[3].alias('min_z'))\
    .show(truncate=False)
So finally you should have
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
|User|Model |gt   |mean_x   |mean_y   |mean_z  |deviation_x|deviation_y|deviation_z|max_x     |max_y    |max_z   |min_x     |min_y    |min_z   |
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
|a   |nexus4|stand|-5.962059|0.6719971|8.151115|0.022922019|0.01436464 |0.0356973  |-5.9427185|0.6880646|8.204376|-5.9950867|0.6535492|8.128204|
+----+------+-----+---------+---------+--------+-----------+-----------+-----------+----------+---------+--------+----------+---------+--------+
I hope the answer is helpful.
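As an aside (not part of the original answer), if mean, deviation, max and min are all you need, Spark's built-in aggregate functions can replace the collect_list plus udf combination entirely; a minimal sketch using the same df:
stats = df.groupBy('User', 'Model', 'gt').agg(
    F.avg('x').alias('mean_x'), F.stddev_samp('x').alias('deviation_x'),
    F.max('x').alias('max_x'), F.min('x').alias('min_x'),
    F.avg('y').alias('mean_y'), F.stddev_samp('y').alias('deviation_y'),
    F.max('y').alias('max_y'), F.min('y').alias('min_y'),
    F.avg('z').alias('mean_z'), F.stddev_samp('z').alias('deviation_z'),
    F.max('z').alias('max_z'), F.min('z').alias('min_z'))
stats.show(truncate=False)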
You'll have to use groupByKey to get the median. While generally not preferred for performance reasons, finding the median of a list of numbers cannot be parallelized easily: the logic to compute a median requires the entire list of numbers. groupByKey is the aggregation method to use when you need to process all the values for a key at the same time.
Also, as mentioned in the comments, this task would be easier using Spark DataFrames.
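A minimal sketch of that groupByKey route, assuming the datos pair RDD from the question (string values cast to float, median computed per axis):
import statistics

medianas = (datos
    .mapValues(lambda v: (float(v[0]), float(v[1]), float(v[2])))
    .groupByKey()
    .mapValues(lambda vals: tuple(statistics.median(axis) for axis in zip(*vals))))
medianas.collect()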
I want to compute by hand some custom summary statistics of a large dataframe on PySpark. For the sake of simplicity, let me use a simpler dummy dataset, as the following:
from typing import List, Tuple

from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import DataType, NumericType, DateType, TimestampType
import pyspark.sql.types as t
import pyspark.sql.functions as f
from datetime import datetime
spark = (
SparkSession.builder
.appName("pyspark")
.master("local[*]")
.getOrCreate()
)
dd = [
("Alice", 18.0, datetime(2022, 1, 1)),
("Bob", None, datetime(2022, 2, 1)),
("Mark", 33.0, None),
(None, 80.0, datetime(2022, 4, 1)),
]
schema = t.StructType(
[
t.StructField("T", t.StringType()),
t.StructField("C", t.DoubleType()),
t.StructField("D", t.DateType()),
]
)
df = spark.createDataFrame(dd, schema)
OK, the thing is, I want to compute some aggregations - missing counts, stddev, max and min - for all the columns, and of course I'd want to do it in parallel. Well, I can take two approaches for this:
Approach 1: One select query
This way, I let the Spark engine make the parallel computing by making one big select query. Let's see:
def df_dtypes(df: DataFrame) -> List[Tuple[str, DataType]]:
"""
Like df.dtypes attribute of Spark DataFrame, but returning DataType objects instead
of strings.
"""
    return [(str(field.name), field.dataType) for field in df.schema.fields]
def get_missing(df: DataFrame) -> Tuple:
suffix = "__missing"
result = (
*(
(
f.count(
f.when(
(f.isnan(c) | f.isnull(c)),
c,
)
)
/ f.count("*")
* 100
if isinstance(t, NumericType) # isnan only works for numeric types
else f.count(
f.when(
f.isnull(c),
c,
)
)
/ f.count("*")
* 100
)
.cast("double")
.alias(c + suffix)
for c, t in df_dtypes(df)
),
)
return result
def get_min(df: DataFrame) -> Tuple:
suffix = "__min"
result = (
*(
(f.min(c) if isinstance(t, (NumericType, DateType, TimestampType)) else f.lit(None))
.cast(t)
.alias(c + suffix)
for c, t in df_dtypes(df)
),
)
return result
def get_max(df: DataFrame) -> Tuple:
suffix = "__max"
result = (
*(
(f.max(c) if isinstance(t, (NumericType, DateType, TimestampType)) else f.lit(None))
.cast(t)
.alias(c + suffix)
for c, t in df_dtypes(df)
),
)
return result
def get_std(df: DataFrame) -> Tuple:
suffix = "__std"
result = (
*(
(f.stddev(c) if isinstance(t, NumericType) else f.lit(None)).cast(t).alias(c + suffix)
for c, t in df_dtypes(df)
),
)
return result
# build the big query
query = get_min(df) + get_max(df) + get_missing(df) + get_std(df)
# run the job
df.select(*query).show()
As far as I know, this job will run in parallel because of how Spark's internals work. Is this approach efficient? The problem might be the huge number of suffixed columns it creates; could that be a bottleneck?
Approach 2: Using threads
In this approach, I can make use of Python threads to try to perform each calculation concurrently.
from pyspark import InheritableThread
from queue import Queue
def get_min(df: DataFrame, q: Queue) -> None:
result = df.select(
f.lit("min").alias("summary"),
*(
(f.min(c) if isinstance(t, (NumericType, DateType, TimestampType)) else f.lit(None))
.cast(t)
.alias(c)
for c, t in df_dtypes(df)
),
).collect()
q.put(result)
def get_max(df: DataFrame, q: Queue) -> None:
result = df.select(
f.lit("max").alias("summary"),
*(
(f.max(c) if isinstance(t, (NumericType, DateType, TimestampType)) else f.lit(None))
.cast(t)
.alias(c)
for c, t in df_dtypes(df)
),
).collect()
q.put(result)
def get_std(df: DataFrame, q: Queue) -> None:
result = df.select(
f.lit("std").alias("summary"),
*(
(f.stddev(c) if isinstance(t, NumericType) else f.lit(None)).cast(t).alias(c)
for c, t in df_dtypes(df)
),
).collect()
q.put(result)
def get_missing(df: DataFrame, q: Queue) -> None:
result = df.select(
f.lit("missing").alias("summary"),
*(
(
f.count(
f.when(
(f.isnan(c) | f.isnull(c)),
c,
)
)
/ f.count("*")
* 100
if isinstance(t, NumericType) # isnan only works for numeric types
else f.count(
f.when(
f.isnull(c),
c,
)
)
/ f.count("*")
* 100
)
.cast("double")
.alias(c)
for c, t in df_dtypes(df)
),
).collect()
q.put(result)
# caching the dataframe to reuse it for all the jobs?
df.cache()
# I use a queue to retrieve the results from the threads
q = Queue()
threads = [
    InheritableThread(target=fun, args=(df, q))
    for fun in (get_min, get_max, get_missing, get_std)
]
for thread in threads:
    thread.start()
# and then some code to recover the results from the queue
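A minimal sketch of that recovery step (an addition for completeness, assuming the four threads created above):
for thread in threads:
    thread.join()

results = [q.get() for _ in range(len(threads))]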
This way has the advantage of not ending up with dozens of suffixed columns, just the original ones. But I'm not sure how this approach deals with the GIL; is it actually parallel?
Could you tell me which one you prefer? Or do you have suggestions about other ways to compute them?
At the end I want to build a JSON with all of these aggregated statistics. The structure of the JSON is not relevant; it depends on the approach taken. For the first one, I'd get something like {"T__min": None, "T__max": None, "T__missing": 1, "T__std": None, "C__min": 18.0, "C__max": 80.0, ...}, so I end up with tons of fields and the select query is huge. With the second approach I'd get one JSON per variable with those statistics.
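For the first approach, turning the single result row into that JSON-ready dict is straightforward; a sketch using the query built above:
row = df.select(*query).first()
stats = row.asDict()  # e.g. {"T__min": None, ..., "C__min": 18.0, ...}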
I'm not really familiar with InheritableThread and Queue, but as far as I can see, you want to create one thread per statistic. This doesn't look optimal by design: some statistics will likely be calculated faster than others, and the processing power reserved for those threads will then sit idle.
As you know, Spark is a distributed computing system which performs all the parallelism for you. I highly doubt you can outperform Spark's optimization using Python's tools; if we could do that, it would already be integrated into Spark.
The first approach is very nicely written: conditional statements based on data types, inclusion of isnan, type hints - well done. It will probably perform about as well as is possible; it's definitely written efficiently. The biggest drawback is that it has to run over the whole dataframe, but you can't really escape that. Regarding the number of columns, you shouldn't be worried: the select statement will be very long, but it's just one operation, and the logical/physical plan should be efficient. In the worst case, you could persist/cache the dataframe before this operation, as you may have problems if the dataframe is created by some complex code. Other than that you should be fine.
As an alternative, for some statistics you may consider using summary:
df.summary().show()
# +-------+-----+------------------+
# |summary|    T|                 C|
# +-------+-----+------------------+
# |  count|    3|                 3|
# |   mean| null|43.666666666666664|
# | stddev| null| 32.34707611722168|
# |    min|Alice|              18.0|
# |    25%| null|              18.0|
# |    50%| null|              33.0|
# |    75%| null|              80.0|
# |    max| Mark|              80.0|
# +-------+-----+------------------+
This approach only works for numeric and string columns; Date/Timestamp columns (e.g. "D") are automatically excluded. But I'm not sure it would be more efficient, and it would definitely be less clear, as it adds extra logic to code that is currently quite straightforward.
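Note that summary also accepts an explicit subset of statistics if the quartiles are not needed, for example:
df.summary("count", "mean", "stddev", "min", "max").show()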
I am trying to divide columns in PySpark by their respective sums. My dataframe (using only one column here) looks like this:
event_rates = [[1,10.461016949152542], [2, 10.38953488372093], [3, 10.609418282548477]]
event_rates = spark.createDataFrame(event_rates, ['cluster_id','mean_encoded'])
event_rates.show()
+----------+------------------+
|cluster_id|      mean_encoded|
+----------+------------------+
|         1|10.461016949152542|
|         2| 10.38953488372093|
|         3|10.609418282548477|
+----------+------------------+
I tried two methods to do this but both failed to produce results.
from pyspark.sql.functions import sum as spark_sum
cols = event_rates.columns[1:]
for each in cols:
event_rates = event_rates.withColumn(each+"_scaled", event_rates[each]/spark_sum(event_rates[each]))
This gives me the following error
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`cluster_id`' is not an aggregate function. Wrap '((`mean_encoded` / sum(`mean_encoded`)) AS `mean_encoded_scaled`)' in windowing function(s) or wrap '`cluster_id`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [cluster_id#22356L, mean_encoded#22357, (mean_encoded#22357 / sum(mean_encoded#22357)) AS mean_encoded_scaled#2
and following the question here I tried the following
stats = (event_rates.agg([spark_sum(x).alias(x + '_sum') for x in cols]))
event_rates = event_rates.join(broadcast(stats))
exprs = [event_rates[x] / event_rates[event_rates + '_sum'] for x in cols]
event_rates.select(exprs)
But I get an error from the first line stating
AssertionError: all exprs should be Column
How do I get around this?
Here is an example of how to divide the column mean_encoded by its sum. You need to compute the sum of the column first, then crossJoin it back to the previous dataframe. Then you can divide any column by its sum.
import pyspark.sql.functions as fn
from pyspark.sql.types import *
event_rates = event_rates.crossJoin(event_rates.groupby().agg(fn.sum('mean_encoded').alias('sum_mean_encoded')))
event_rates_div = event_rates.select('cluster_id',
'mean_encoded',
fn.col('mean_encoded') / fn.col('sum_mean_encoded'))
Output
+----------+------------------+---------------------------------+
|cluster_id|      mean_encoded|(mean_encoded / sum_mean_encoded)|
+----------+------------------+---------------------------------+
|         1|10.461016949152542|               0.3325183371367686|
|         2| 10.38953488372093|               0.3302461777809474|
|         3|10.609418282548477|               0.3372354850822839|
+----------+------------------+---------------------------------+
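As an alternative sketch (an addition, not part of the original answer), an unpartitioned window avoids the explicit crossJoin; Spark warns that no partitioning is defined, which is acceptable for small data:
from pyspark.sql import Window

w = Window.partitionBy()  # single global partition
event_rates_div = event_rates.select(
    'cluster_id',
    'mean_encoded',
    (fn.col('mean_encoded') / fn.sum('mean_encoded').over(w)).alias('mean_encoded_scaled'))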
Try this:
from pyspark.sql import functions as F
total = event_rates.groupBy().agg(F.sum("mean_encoded"),F.sum("cluster_id")).collect()
total
The answer will be:
[Row(sum(mean_encoded)=31.459970115421946, sum(cluster_id)=6)]
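To finish the scaling from that collected total, a small follow-up sketch (an addition; the Row values are accessed by their generated column names):
sum_mean_encoded = total[0]["sum(mean_encoded)"]
event_rates = event_rates.withColumn(
    "mean_encoded_scaled", F.col("mean_encoded") / F.lit(sum_mean_encoded))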
My raw data comes in a tabular format. It contains observations from different variables; each observation has the variable name, the timestamp and the value at that time.
Variable [string], Time [datetime], Value [float]
The data is stored as Parquet in HDFS and loaded into a Spark DataFrame (df). From that dataframe, I now want to calculate default statistics like Mean, Standard Deviation and others for each variable. Afterwards, once the Mean has been retrieved, I want to filter/count those values for that variable that are closely around the Mean.
Based on the answer to my other question, I came up with this code:
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
w1 = Window().partitionBy("Variable")
w2 = Window.partitionBy("Variable").orderBy("Time")
def stddev_pop_w(col, w):
#Built-in stddev doesn't support windowing
return sqrt(avg(col * col).over(w) - pow(avg(col).over(w), 2))
def isInRange(value, mean, stddev, radius):
try:
if (abs(value - mean) < radius * stddev):
return 1
else:
return 0
except AttributeError:
return -1
delta = col("Time").cast("long") - lag("Time", 1).over(w2).cast("long")
#f = udf(lambda (value, mean, stddev, radius): abs(value - mean) < radius * stddev, IntegerType())
#f2 = udf(lambda value, mean, stddev: isInRange(value, mean, stddev, 2), IntegerType())
#f3 = udf(lambda value, mean, stddev: isInRange(value, mean, stddev, 3), IntegerType())
df_ = df_all \
.withColumn("mean", mean("Value").over(w1)) \
.withColumn("std_deviation", stddev_pop_w(col("Value"), w1)) \
.withColumn("delta", delta) \
# .withColumn("stddev_2", f2("Value", "mean", "std_deviation")) \
# .withColumn("stddev_3", f3("Value", "mean", "std_deviation")) \
#df2.show(5, False)
Question: The last two commented-out lines won't work. They give an AttributeError because the incoming values for stddev and mean are null. I guess this happens because I'm referring to columns that are also calculated on the fly and have no value at that moment. But is there a way to achieve this?
Currently I'm doing a second run like this:
df = df_.select("*", \
abs(df_.Value - df_.mean).alias("max_deviation_mean"), \
when(abs(df_.Value - df_.mean) < 2 * df_.std_deviation, 1).otherwise(1).alias("std_dev_mean_2"), \
when(abs(df_.Value - df_.mean) < 3 * df_.std_deviation, 1).otherwise(1).alias("std_dev_mean_3"))
The solution is to use the RDD aggregateByKey function, which aggregates the values per partition and node before shuffling those aggregates to the computing nodes, where they are combined into one resulting value.
Pseudo-code looks like this. It is inspired by this tutorial, but it uses two instances of StatCounter, since we are summarizing two different statistics at once:
from pyspark.statcounter import StatCounter

# value[0] is the timestamp (assumed numeric, e.g. epoch seconds) and value[1] is the float value;
# we use two instances of StatCounter to sum up two different statistics at once
def mergeValues(s, value):
    s[0].merge(value[0])
    s[1].merge(value[1])
    return s

def combineStats(s1, s2):
    s1[0].mergeStats(s2[0])
    s1[1].mergeStats(s2[1])
    return s1

# keyedRdd is an RDD of (Variable, (Time, Value)) pairs derived from df
(keyedRdd
    .aggregateByKey((StatCounter(), StatCounter()), mergeValues, combineStats)
    .mapValues(lambda s: (s[0].min(), s[0].max(),
                          s[1].max(), s[1].min(), s[1].mean(),
                          s[1].variance(), s[1].stdev(), s[1].count()))
    .collect())
This cannot work because when you execute
from pyspark.sql.functions import *
you shadow the built-in abs with pyspark.sql.functions.abs, which expects a column, not a local Python value, as input.
Also, the UDF you created doesn't handle NULL entries.
Don't use import * unless you're aware of exactly what is imported. Instead, alias
from pyspark.sql.functions import abs as abs_
or import module
from pyspark.sql import functions as sqlf
sqlf.col("x")
Always check the input inside a UDF, or even better, avoid UDFs unless necessary.
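For example, a UDF-free sketch of the in-range flags from the question, using the sqlf alias above (assuming the df_ dataframe built earlier; note otherwise(0), since the otherwise(1) in the question would flag every row):
df2 = df_.withColumn(
    "std_dev_mean_2",
    sqlf.when(sqlf.abs(sqlf.col("Value") - sqlf.col("mean")) < 2 * sqlf.col("std_deviation"), 1).otherwise(0)
).withColumn(
    "std_dev_mean_3",
    sqlf.when(sqlf.abs(sqlf.col("Value") - sqlf.col("mean")) < 3 * sqlf.col("std_deviation"), 1).otherwise(0)
)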
How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median.
This question is similar to this question; however, that answer uses Scala, which I do not know.
How can I calculate exact median with Apache Spark?
Using the thinking for the Scala answer, I am trying to write a similar answer in Python.
I know I first want to sort the RDD, but I do not know how. I see the sortBy (sorts this RDD by the given keyfunc) and sortByKey (sorts this RDD, which is assumed to consist of (key, value) pairs) methods. I think both use key-value pairs, and my RDD only has integer elements.
First, I was thinking of doing myrdd.sortBy(lambda x: x)?
Next I will find the length of the rdd (rdd.count()).
Finally, I want to find the element or 2 elements at the center of the rdd. I need help with this method too.
EDIT:
I had an idea. Maybe I can index my RDD and then use key = index and value = element. Then I can try to sort by value? I don't know if this is possible because there is only a sortByKey method.
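For reference, a minimal sketch of that sort-plus-index idea (assuming myrdd is the integer RDD from the question and a couple of lookups are acceptable):
sorted_rdd = (myrdd.sortBy(lambda x: x)
                   .zipWithIndex()
                   .map(lambda xi: (xi[1], xi[0]))
                   .cache())
n = sorted_rdd.count()
if n % 2 == 1:
    median = sorted_rdd.lookup(n // 2)[0]
else:
    median = sum(sorted_rdd.lookup(i)[0] for i in (n // 2 - 1, n // 2)) / 2.0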
Ongoing work
SPARK-30569 - Add DSL functions invoking percentile_approx
Spark 2.0+:
You can use approxQuantile method which implements Greenwald-Khanna algorithm:
Python:
df.approxQuantile("x", [0.5], 0.25)
Scala:
df.stat.approxQuantile("x", Array(0.5), 0.25)
where the last parameter is the relative error: the lower the number, the more accurate the result and the more expensive the computation.
Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:
df.approxQuantile(["x", "y", "z"], [0.5], 0.25)
and
df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)
The underlying methods can also be used in SQL aggregation (both global and grouped) using the approx_percentile function:
> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
[10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
10.0
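For the grouped case the same function can be used per key, e.g. (a sketch assuming a table df with columns key and x):
> SELECT key, approx_percentile(x, 0.5, 100) FROM df GROUP BY key;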
Spark < 2.0
Python
As I've mentioned in the comments, it is most likely not worth all the fuss. If the data is relatively small, like in your case, then simply collect it and compute the median locally:
import numpy as np
np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))
%time np.median(rdd.collect())
np.array(rdd.collect()).nbytes
It takes around 0.01 second on my few years old computer and around 5.5MB of memory.
If the data is much larger, sorting will be a limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess up anything):
from numpy import floor
import time
def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile (between 0 and 1)
    :sample fraction of an rdd to use. If not provided we use the whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1

    seed = seed if seed is not None else time.time()
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)

    rddSortedWithIndex = (rdd.
        sortBy(lambda x: x).
        zipWithIndex().
        map(lambda xi: (xi[1], xi[0])).
        cache())

    n = rddSortedWithIndex.count()
    h = (n - 1) * p

    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(x)[0]
        for x in int(floor(h)) + np.array([0, 1]))

    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)
And some tests:
np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)
Finally, let's define the median:
from functools import partial
median = partial(quantile, p=0.5)
So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?
Language independent (Hive UDAF):
If you use HiveContext you can also use Hive UDAFs. With integral values:
rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")
With continuous values:
sqlContext.sql("SELECT percentile(x, 0.5) FROM df")
In percentile_approx you can pass an additional argument which determines the number of records to use.
Here is the approach I used, using window functions (with pyspark 2.2.0).
from pyspark.sql import DataFrame
class median():
""" Create median class with over method to pass partition """
def __init__(self, df, col, name):
assert col
self.column=col
self.df = df
self.name = name
def over(self, window):
from pyspark.sql.functions import percent_rank, pow, first
first_window = window.orderBy(self.column) # first, order by column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window)) # add percent_rank column; percent_rank = 0.5 corresponds to the median
second_window = window.orderBy(pow(df.percent_rank-0.5, 2)) # order by (percent_rank - 0.5)^2 ascending
return df.withColumn(self.name, first(self.column).over(second_window)) # the first row of the window corresponds to median
def addMedian(self, col, median_name):
""" Method to be added to spark native DataFrame class """
return median(self, col, median_name)
# Add method to DataFrame class
DataFrame.addMedian = addMedian
Then call the addMedian method to calculate the median of col2:
from pyspark.sql import Window
median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)
Finally you can group by if needed.
df.groupby("col1", "median")
Adding a solution if you want an RDD-only method and don't want to move to DataFrames.
This snippet can get you a percentile for an RDD of doubles.
If you input 50 as the percentile, you should obtain your required median.
Let me know if there are any corner cases not accounted for.
/**
 * Gets the nth percentile entry for an RDD of doubles
 *
 * @param inputScore : Input scores consisting of an RDD of doubles
 * @param percentile : The percentile cutoff required (between 0 to 100), e.g. 90%ile of [1,4,5,9,19,23,44] = ~23.
 *                     It prefers the higher value when the desired quantile lies between two data points
 * @return : The number best representing the percentile in the RDD of doubles
 */
def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
val numEntries = inputScore.count().toDouble
val retrievedEntry = (percentile * numEntries / 100.0 ).min(numEntries).max(0).toInt
inputScore
.sortBy { case (score) => score }
.zipWithIndex()
.filter { case (score, index) => index == retrievedEntry }
.map { case (score, index) => score }
.collect()(0)
}
There are two methods that can be used: the approxQuantile method and the percentile_approx function. However, neither may give accurate results when there is an even number of records.
import pyspark.sql.functions as F

# df.select(F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5).alias("MEDIAN"))
# might not give proper results when there is an even number of records
df.select(
    ((F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5)
      + F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.51)) * 0.5
    ).alias("MEDIAN")
)
I have written a function that takes a dataframe as input and returns a dataframe with the median computed over a partition; order_col is the column for which we want to calculate the median, and part_col is the level at which we want to calculate it:
from pyspark.sql import Window
import pyspark.sql.functions as F
def calculate_median(dataframe, part_col, order_col):
win = Window.partitionBy(*part_col).orderBy(order_col)
# count_row = dataframe.groupby(*part_col).distinct().count()
dataframe.persist()
dataframe.count()
temp = dataframe.withColumn("rank", F.row_number().over(win))
temp = temp.withColumn(
"count_row_part",
F.count(order_col).over(Window.partitionBy(part_col))
)
temp = temp.withColumn(
"even_flag",
F.when(
F.col("count_row_part") %2 == 0,
F.lit(1)
).otherwise(
F.lit(0)
)
).withColumn(
"mid_value",
F.floor(F.col("count_row_part")/2)
)
temp = temp.withColumn(
"avg_flag",
F.when(
(F.col("even_flag")==1) &
(F.col("rank") == F.col("mid_value"))|
((F.col("rank")-1) == F.col("mid_value")),
F.lit(1)
).otherwise(
F.when(
F.col("rank") == F.col("mid_value")+1,
F.lit(1)
)
)
)
temp.show(10)
return temp.filter(
F.col("avg_flag") == 1
).groupby(
part_col + ["avg_flag"]
).agg(
F.avg(F.col(order_col)).alias("median")
).drop("avg_flag")
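A hedged usage example (assuming a dataframe sdf with a grouping column grp and a numeric column val):
medians = calculate_median(sdf, ["grp"], "val")
medians.show()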
For exact median computation you can use the following function with the PySpark DataFrame API:
from typing import Union

from pyspark.sql import Column
import pyspark.sql.functions as F


def median_exact(col: Union[Column, str]) -> Column:
"""
For grouped aggregations, Spark provides a way via pyspark.sql.functions.percentile_approx("col", .5) function,
since for large datasets, computing the median is computationally expensive.
This function manually computes the median and should only be used for small to mid sized datasets / groupings.
:param col: Column to compute the median for.
:return: A pyspark `Column` containing the median calculation expression
"""
list_expr = F.filter(F.collect_list(col), lambda x: x.isNotNull())
sorted_list_expr = F.sort_array(list_expr)
size_expr = F.size(sorted_list_expr)
even_num_elements = (size_expr % 2) == 0
odd_num_elements = ~even_num_elements
return F.when(size_expr == 0, None).otherwise(
F.when(odd_num_elements, sorted_list_expr[F.floor(size_expr / 2)]).otherwise(
(
sorted_list_expr[(size_expr / 2 - 1).cast("long")]
+ sorted_list_expr[(size_expr / 2).cast("long")]
)
/ 2
)
)
Apply it like this:
output_df = input_spark_df.groupby("group").agg(
median_exact("elems").alias("elems_median")
)
We can calculate the median and quantiles in Spark using the following code:
df.stat.approxQuantile(col, [quantiles], error)
For example, finding the median of the following dataframe [1,2,3,4,5]:
df.stat.approxQuantile(col, [0.5], 0)
The lesser the error, the more accurate the results.
From version 3.4+ (and also already in 3.3.1) the median function is directly available
https://github.com/apache/spark/blob/e170a2eb236a376b036730b5d63371e753f1d947/python/pyspark/sql/functions.py#L633
import pyspark.sql.functions as f
df.groupBy("grp").agg(f.median("val"))
I guess the respective documentation will be added if the version is finally released.