How to calculate Category Utility - python

I am currently trying to implement Category Utility (as defined here) in Python, using Pandas. I have a rough draft of the code, but I am fairly sure it is wrong, specifically in the part that loops over the clusters and computes the inner_sum, since Category Utility requires going through every possible value of each attribute. Could anyone help me figure out how to improve this draft so that it correctly calculates the category utility for the given clusters?
Here is the code:
import pandas as pd
from typing import List


def probability(df: pd.DataFrame, clause: str) -> pd.DataFrame:
    """
    Gets the probabilities of the values within the given data frame
    of the provided clause.
    """
    return df.groupby(clause).size().div(len(df))


def conditional_probability(df: pd.DataFrame, clause: str, given: str) -> pd.DataFrame:
    """
    Gets the conditional probability of the values within the provided data
    frame of the provided clause, assuming that the given is true.
    """
    base_probabilities: pd.DataFrame = probability(df, clause=given)
    return (df.groupby([clause, given]).size().div(len(df))
              .div(base_probabilities, axis=0, level=given))
def category_utility(clusters: pd.DataFrame) -> float:
    # k is the number of clusters.
    k: int = len(clusters)

    # probabilities of all clusters. To be used to get P(C_l)
    probs_of_clusters: pd.DataFrame = probability(clusters, 'clusters')

    # probabilities of all attributes being any possible value.
    # To be used to get P(a_i = v_ij)
    probs_of_attr_vals: pd.DataFrame = probability(clusters, 'attributes')

    # Probabilities of all attributes being any possible value given the cluster they're in.
    # To be used to get P(a_i = v_ij | C_l)
    cond_prob_of_attr_vals: pd.DataFrame = conditional_probability(clusters, clause='attributes', given='clusters')

    tracked_cu: List[float] = []
    for cluster in clusters['clusters']:
        # The probability of the current cluster.
        # P(C_l)
        prob_of_curr_cluster: float = probs_of_clusters[cluster]

        # The summation of the square difference between an attribute being in a cluster and just overall existing in the data.
        # E (P(a_i = v_ij | C_l) ^ 2 - P(a_i = v_ij) ^ 2)
        inner_sum: float = sum([cond_prob_of_attr_vals[attr] ** 2 - probs_of_attr_vals[attr] for attr in clusters['attributes']])
        tracked_cu += inner_sum * prob_of_curr_cluster
    return sum(tracked_cu) / k
Any help with correctly implementing this would be appreciated.
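For reference, here is a minimal sketch of how the double sum over attributes and their values could be computed. It assumes one row per item, a 'clusters' column holding the cluster label, and every other column being a categorical attribute; those assumptions are mine, not from the post.
import pandas as pd


def category_utility_sketch(df: pd.DataFrame, cluster_col: str = 'clusters') -> float:
    """Sketch of CU = (1/k) * sum_l P(C_l) * sum_i sum_j (P(a_i=v_ij|C_l)^2 - P(a_i=v_ij)^2)."""
    attributes = [c for c in df.columns if c != cluster_col]
    cluster_sizes = df[cluster_col].value_counts()
    k = len(cluster_sizes)

    cu = 0.0
    for cluster, size in cluster_sizes.items():
        p_cluster = size / len(df)
        in_cluster = df[df[cluster_col] == cluster]
        inner_sum = 0.0
        for attr in attributes:
            # P(a_i = v_ij | C_l) for every value v_ij seen in this cluster
            cond_probs = in_cluster[attr].value_counts(normalize=True)
            # P(a_i = v_ij) over the whole data set
            overall_probs = df[attr].value_counts(normalize=True)
            inner_sum += (cond_probs ** 2).sum() - (overall_probs ** 2).sum()
        cu += p_cluster * inner_sum
    return cu / k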

In featuretools, how to create Custom Primitives of 2 columns?

I created Custom Primitives like below.
class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric
    commutative = True
    compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS]

    def get_function(self):
        def correlate(column1, column2):
            return np.correlate(column1, column2, "same")
        return correlate
Then I checked the calculation like below just in case.
np.correlate(feature_matrix["alcohol"], feature_matrix["chlorides"],mode="same")
However, the result of the primitive above and the result of the check below were different.
Do you know why they are different?
If my code is basically wrong, please correct me.
Thanks for the question! You can create a custom primitive with a fixed argument to calculate that kind of correlation by using the TransformPrimitive as a base class. I will go through an example using this data.
import pandas as pd

data = [
    [0.40168819, 0.0857946],
    [0.06268886, 0.27811651],
    [0.16931269, 0.96509497],
    [0.15123022, 0.80546244],
    [0.58610794, 0.56928692],
]

df = pd.DataFrame(data=data, columns=list('ab'))
df.reset_index(inplace=True)
df
index a b
0 0.401688 0.085795
1 0.062689 0.278117
2 0.169313 0.965095
3 0.151230 0.805462
4 0.586108 0.569287
The function np.correlate is a transform when the parameter mode=same, so define a custom primitive by using the TransformPrimitive as a base class.
from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric
import numpy as np


class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric

    def get_function(self):
        def correlate(a, b):
            return np.correlate(a, b, mode='same')
        return correlate
The DFS call requires the data to be structured into an EntitySet, then you can use the custom primitive.
import featuretools as ft

es = ft.EntitySet()

es.entity_from_dataframe(
    entity_id='data',
    dataframe=df,
    index='index',
)

fm, fd = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=[Correlate],
    max_depth=1,
)

fm[['CORRELATE(a, b)']]
CORRELATE(a, b)
index
0 0.534548
1 0.394685
2 0.670774
3 0.670506
4 0.622236
You should get the same values between the feature matrix and np.correlate.
actual = fm['CORRELATE(a, b)'].values
expected = np.correlate(df['a'], df['b'], mode='same')
np.testing.assert_array_equal(actual, expected)
You can learn more about defining simple custom primitives and advanced custom primitives in the linked pages. Let me know if you found this helpful.

Specify function

I have the following function. It calculates the euclidean distances from some financial figures between companies and gives me the closest company. Unfortunately, sometimes the closest company is the same company. Does anyone know how I can adjust the function so that it does not return the same company?
# Calculating the closest distances
records = df_ipos.to_dict('records')  # converting dataframe to a list of dictionaries


def return_closest(df, inp_record):
    """returns the closest euclidean distanced record"""
    filtered_records = df.to_dict('records')  # converting dataframe to a list of dictionaries
    for record in filtered_records:  # iterating through dictionaries
        params = ['z_SA', 'z_LEV', 'z_AT', 'z_PM', 'z_RG']  # parameters to calculate euclidean distance
        distance = []
        for param in params:
            d1, d2 = record.get(param, 0), inp_record.get(param, 0)  # fetching value of these parameters. default is 0 if not found
            if d1 != d1:  # checking isNaN
                d1 = 0
            if d2 != d2:
                d2 = 0
            distance.append((d1 - d2) ** 2)
        euclidean = math.sqrt(sum(distance))
        record['Euclidean distance'] = round(euclidean, 6)  # assigning to a new key
    distance_records = sorted(filtered_records, key=lambda x: x['Euclidean distance'])  # sorting in increasing order
    return next(filter(lambda x: x['Euclidean distance'], distance_records), None)  # returning the lowest value which is not zero. Default None


for record in records:
    ipo_year = record.get('IPO Year')
    sic_code = record.get('SIC-Code')
    df = df_fundamentals[df_fundamentals['Year'] == ipo_year]
    df = df[df['SIC-Code'] == sic_code]  # filtering dataframe
    closest_record = return_closest(df, record)
    if closest_record:
        record['Closest Company'] = closest_record.get('Name')  # adding new columns
        record['Actual Distance'] = closest_record.get('Euclidean distance')

df_dist = pd.DataFrame(records)  # changing list of dictionaries back to dataframe
thanks in advance!
Based on your question, it is not exactly clear to me what your inputs are.
But as a simple fix, I would suggest you check before your function's for loop, whether the record you are comparing is identical to the one which you check against, i.e., add:
...
filtered_records = [rec for rec in filtered_records if rec['Name'] != inp_record['Name']]
for record in filtered_records: #iterating through dictionaries
...
This only applies if 'Name' really contains the company name. Also, for your function to return the same company at all, there has to be an absolute distance greater than zero when comparing your parameters, i.e. the company's figures differ from themselves. I am not sure if this is intended; maybe you are looking at data from different years? I cannot really tell, due to the limited amount of information.
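For completeness, here is a minimal sketch of how that filter could be folded into the function, assuming 'Name' uniquely identifies each company (that assumption is mine). Note that it returns the smallest distance among the other companies rather than the first non-zero one:
import math


def return_closest(df, inp_record, params=('z_SA', 'z_LEV', 'z_AT', 'z_PM', 'z_RG')):
    """Sketch: closest record by euclidean distance, never the company itself."""
    candidates = [rec for rec in df.to_dict('records')
                  if rec.get('Name') != inp_record.get('Name')]  # exclude the same company
    for rec in candidates:
        sq_diffs = []
        for p in params:
            d1, d2 = rec.get(p, 0), inp_record.get(p, 0)
            d1 = 0 if d1 != d1 else d1  # treat NaN as 0, as in the original draft
            d2 = 0 if d2 != d2 else d2
            sq_diffs.append((d1 - d2) ** 2)
        rec['Euclidean distance'] = round(math.sqrt(sum(sq_diffs)), 6)
    return min(candidates, key=lambda r: r['Euclidean distance'], default=None)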

Error when using .between() for checking a long/lat location pulling it from a dictionary

I'm scanning through a data frame which is grouped by a specific id and am trying to return a locations surface type depending on certain long/lat locations which I have in a dictionary. The problem with the data set is that it's created at 100 frames per second so I am trying to find the median value as values before and after this point are incorrect.
I am using pandas in a Jupyter notebook.
This is the function with which I want to pull the locations from the dictionary. The location is just a made-up example:
pitch_boundaries = {
    'Astro': {'max_long': -6.123456, 'min_long': -6.123456,
              'max_lat': 53.123456, 'min_lat': 53.123456},
}


def get_loc_name(loc_df, pitch_boundaries):
    for pitch_name, coord_limits in pitch_boundaries.items():
        between_long_limits = loc_df['longitude'].median().between(coord_limits['min_long'], coord_limits['max_long'])
        between_lat_limits = loc_df['latitude'].median().between(coord_limits['min_lat'], coord_limits['max_lat'])
        if between_long_limits.any() and between_lat_limits.any():
            return pitch_name
    # If we get here then there is no pitch.
I call it here:
def makeAverageDataFrame(df):
    pitchBounds = get_loc_name(df, pitch_boundaries)
    i = len(df_average.index)
    df_average.loc[i] = [pitchBounds]
Finally, here is where the error occurs:
for region, df_region in df_Will.groupby('session_id'):
    makeAverageDataFrame(df_region)
Actual results
# AttributeError: 'float' object has no attribute 'between'
or if I remove .median(): None
What I want is a new dataframe with something like
|surface|
|Astro|
|Grass|
|Astro|
loc_df['longitude'] is a Series, and loc_df['longitude'].median() gives you a float, which does not have a between method. Try loc_df[['longitude']] instead.
def get_loc_name(loc_df, pitch_boundaries):
    for pitch_name, coord_limits in pitch_boundaries.items():
        between_long_limits = loc_df[['longitude']].median().between(coord_limits['min_long'], coord_limits['max_long'])
        between_lat_limits = loc_df[['latitude']].median().between(coord_limits['min_lat'], coord_limits['max_lat'])
        if between_long_limits.any() and between_lat_limits.any():
            return pitch_name
And your problem with returning None is that your makeAverageDataFrame does not return anything (None). Try:
def makeAverageDataFrame(df):
    pitchBounds = get_loc_name(df, pitch_boundaries)
    return pitchBounds
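To illustrate the difference between the two (a small self-contained example, not from the original post):
import pandas as pd

df = pd.DataFrame({'longitude': [-6.2, -6.1, -6.15], 'latitude': [53.1, 53.2, 53.15]})

df['longitude'].median()                          # a scalar float: no .between() available
df[['longitude']].median()                        # a Series indexed by column name
df[['longitude']].median().between(-6.3, -6.0)    # a Series of booleans, so .any() works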

Creating Function for Pandas that takes arguments (df, columnname) and calculates null percentage

I am learning Python's Pandas library using kaggle's titanic tutorial. I am trying to create a function which will calculate the nulls in a column.
My attempt below appears to print the entire dataframe, instead of null values in the specified column:
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))

null_percentage_calculator(train, "Age")
My previous (and very first stack overflow question) was a similar problem, and it was explained to me that the .index method in pandas is undesirable and I should try and use other methods like [ ] and .loc to explicitly refer to the column.
So I have tried this:
df_column_null=[df[nullcolumn]].isnull().sum()
I have also tried
df_column_null=df[nullcolumn]df[nullcolumn].isnull().sum()
I am struggling to understand this aspect of Pandas. My non-function method works fine:
Train_Age_Nulls = train["Age"].isnull().sum()
Train_Age_Nulls_percentage = (Train_Age_Nulls/traintotal)*100
Train_Age_Nulls_percentage_rounded = np.ceil(Train_Age_Nulls_percentage)
print("{} percent of Train's Age are NaN values".format(Train_Age_Nulls_percentage_rounded))
Could anyone let me know where I am going wrong?
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    # what is testtotal?
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))
I would do this with:
def null_percentage_calculator(df, nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])  # need float because of python division
    # if you must you can * 100
    print "{} percent of column {} are null".format(pct * 100, nullcolumn)
Beware of Python 2 integer division, where 63/180 = 0.
If you want a float out, you have to put a float in.
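For anyone on Python 3, where / already returns a float, a roughly equivalent sketch (the np.ceil rounding mirrors the original question):
import numpy as np
import pandas as pd


def null_percentage_calculator(df: pd.DataFrame, nullcolumn: str) -> float:
    """Return the percentage of NaN values in the given column."""
    nulls = df[nullcolumn].isnull().sum()
    pct = nulls / len(df[nullcolumn]) * 100   # true division in Python 3
    print("{} percent of column {} are NaN values".format(np.ceil(pct), nullcolumn))
    return pct


# Example with a toy frame: 2 of 5 ages are missing, so this prints 40.0 percent.
train = pd.DataFrame({"Age": [22, None, 38, None, 26]})
null_percentage_calculator(train, "Age")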

How to find median and quantiles using Spark

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median.
This question is similar to this question. However, the answer to that question uses Scala, which I do not know.
How can I calculate exact median with Apache Spark?
Using the thinking for the Scala answer, I am trying to write a similar answer in Python.
I know I first want to sort the RDD. I do not know how. I see the sortBy (Sorts this RDD by the given keyfunc) and sortByKey (Sorts this RDD, which is assumed to consist of (key, value) pairs.) methods. I think both use key-value pairs, and my RDD only has integer elements.
First, I was thinking of doing myrdd.sortBy(lambda x: x)?
Next I will find the length of the rdd (rdd.count()).
Finally, I want to find the element or 2 elements at the center of the rdd. I need help with this method too.
EDIT:
I had an idea. Maybe I can index my RDD and then key = index and value = element. And then I can try to sort by value? I don't know if this is possible because there is only a sortByKey method.
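A rough sketch of that idea (sort the values, attach an index, then look up the middle element or elements); this is an illustration of the approach rather than a tested solution:
def exact_median(rdd):
    """Median of a numeric RDD via sortBy + zipWithIndex."""
    sorted_indexed = (rdd.sortBy(lambda x: x)
                         .zipWithIndex()
                         .map(lambda xi: (xi[1], xi[0]))   # key = index, value = element
                         .cache())
    n = sorted_indexed.count()
    if n % 2 == 1:
        return sorted_indexed.lookup(n // 2)[0]
    lower = sorted_indexed.lookup(n // 2 - 1)[0]
    upper = sorted_indexed.lookup(n // 2)[0]
    return (lower + upper) / 2.0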
Ongoing work
SPARK-30569 - Add DSL functions invoking percentile_approx
Spark 2.0+:
You can use approxQuantile method which implements Greenwald-Khanna algorithm:
Python:
df.approxQuantile("x", [0.5], 0.25)
Scala:
df.stat.approxQuantile("x", Array(0.5), 0.25)
where the last parameter is the relative error. The lower the number, the more accurate the results and the more expensive the computation.
Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:
df.approxQuantile(["x", "y", "z"], [0.5], 0.25)
and
df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)
The underlying methods can also be used in SQL aggregation (both global and grouped) using the approx_percentile function:
> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
[10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
10.0
Spark < 2.0
Python
As I've mentioned in the comments, it is most likely not worth all the fuss. If the data is relatively small, as in your case, then simply collect and compute the median locally:
import numpy as np
np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))
%time np.median(rdd.collect())
np.array(rdd.collect()).nbytes
It takes around 0.01 second on my few years old computer and around 5.5MB of memory.
If the data is much larger, sorting will be a limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess up anything):
from numpy import floor
import time


def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile (between 0 and 1)
    :sample fraction of an rdd to use. If not provided we use a whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1

    seed = seed if seed is not None else time.time()
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)

    rddSortedWithIndex = (rdd.
        sortBy(lambda x: x).
        zipWithIndex().
        map(lambda (x, i): (i, x)).
        cache())

    n = rddSortedWithIndex.count()
    h = (n - 1) * p

    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(x)[0]
        for x in int(floor(h)) + np.array([0L, 1L]))

    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)
And some tests:
np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)
Finally let's define median:
from functools import partial
median = partial(quantile, p=0.5)
So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?
Language independent (Hive UDAF):
If you use HiveContext you can also use Hive UDAFs. With integral values:
rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")
With continuous values:
sqlContext.sql("SELECT percentile(x, 0.5) FROM df")
In percentile_approx you can pass an additional argument which determines the number of records to use.
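For example, a call with an explicit third argument might look like this (the 10000 here is just an illustrative value, not one from the original answer):
sqlContext.sql("SELECT percentile_approx(x, 0.5, 10000) FROM df")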
Here is the method I used using window functions (with pyspark 2.2.0).
from pyspark.sql import DataFrame
class median():
    """ Create median class with over method to pass partition """
    def __init__(self, df, col, name):
        assert col
        self.column = col
        self.df = df
        self.name = name

    def over(self, window):
        from pyspark.sql.functions import percent_rank, pow, first

        first_window = window.orderBy(self.column)                                  # first, order by the column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window))  # add percent_rank column, percent_rank = 0.5 corresponds to median
        second_window = window.orderBy(pow(df.percent_rank - 0.5, 2))               # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window))     # the first row of the window corresponds to the median


def addMedian(self, col, median_name):
    """ Method to be added to spark native DataFrame class """
    return median(self, col, median_name)


# Add method to DataFrame class
DataFrame.addMedian = addMedian
Then call the addMedian method to calculate the median of col2:
from pyspark.sql import Window
median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)
Finally you can group by if needed.
df.groupby("col1", "median")
Adding a solution if you want an RDD method only and don't want to move to DataFrames.
This snippet can get you a percentile for an RDD of double.
If you input percentile as 50, you should obtain your required median.
Let me know if there are any corner cases not accounted for.
/**
 * Gets the nth percentile entry for an RDD of doubles
 *
 * @param inputScore : Input scores consisting of a RDD of doubles
 * @param percentile : The percentile cutoff required (between 0 to 100), e.g 90%ile of [1,4,5,9,19,23,44] = ~23.
 *                     It prefers the higher value when the desired quantile lies between two data points
 * @return : The number best representing the percentile in the Rdd of double
 */
def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
  val numEntries = inputScore.count().toDouble
  val retrievedEntry = (percentile * numEntries / 100.0).min(numEntries).max(0).toInt

  inputScore
    .sortBy { case (score) => score }
    .zipWithIndex()
    .filter { case (score, index) => index == retrievedEntry }
    .map { case (score, index) => score }
    .collect()(0)
}
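Since the question asked for Python, here is a rough PySpark translation of that Scala function; this is my own sketch of the same logic, not code from the original answer:
def get_rdd_percentile(input_score, percentile):
    """Nth percentile of a numeric RDD; prefers the higher value between two data points."""
    num_entries = float(input_score.count())
    retrieved_entry = int(min(max(percentile * num_entries / 100.0, 0), num_entries))

    return (input_score
            .sortBy(lambda score: score)
            .zipWithIndex()
            .filter(lambda score_index: score_index[1] == retrieved_entry)
            .map(lambda score_index: score_index[0])
            .collect()[0])


# median_value = get_rdd_percentile(rdd, 50)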
There are two ways that can be used: one is the approxQuantile method and the other is the percentile_approx function. However, both methods might not give accurate results when there is an even number of records.
import pyspark.sql.functions as F

# df.select(F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5).alias("MEDIAN"))  # might not give proper results when there are even number of records

df.select(
    (
        (
            F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5)
            + F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.51)
        ) * 0.5
    ).alias("MEDIAN")
)
I have written a function which takes a data frame as input and returns a dataframe with the median computed over a partition, where order_col is the column we want to calculate the median for and part_col is the level at which we want to calculate the median:
from pyspark.sql import Window
import pyspark.sql.functions as F
def calculate_median(dataframe, part_col, order_col):
    win = Window.partitionBy(*part_col).orderBy(order_col)
    # count_row = dataframe.groupby(*part_col).distinct().count()
    dataframe.persist()
    dataframe.count()
    temp = dataframe.withColumn("rank", F.row_number().over(win))
    temp = temp.withColumn(
        "count_row_part",
        F.count(order_col).over(Window.partitionBy(part_col))
    )
    temp = temp.withColumn(
        "even_flag",
        F.when(
            F.col("count_row_part") % 2 == 0,
            F.lit(1)
        ).otherwise(
            F.lit(0)
        )
    ).withColumn(
        "mid_value",
        F.floor(F.col("count_row_part") / 2)
    )
    temp = temp.withColumn(
        "avg_flag",
        F.when(
            (F.col("even_flag") == 1) &
            (F.col("rank") == F.col("mid_value")) |
            ((F.col("rank") - 1) == F.col("mid_value")),
            F.lit(1)
        ).otherwise(
            F.when(
                F.col("rank") == F.col("mid_value") + 1,
                F.lit(1)
            )
        )
    )
    temp.show(10)
    return temp.filter(
        F.col("avg_flag") == 1
    ).groupby(
        part_col + ["avg_flag"]
    ).agg(
        F.avg(F.col(order_col)).alias("median")
    ).drop("avg_flag")
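A hypothetical call, assuming a dataframe df with a grouping column col1 and a numeric column col2 (the names are mine, for illustration only):
median_df = calculate_median(df, ["col1"], "col2")
median_df.show()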
For exact median computation you can use the following function with the PySpark DataFrame API:
from typing import Union

import pyspark.sql.functions as F
from pyspark.sql import Column


def median_exact(col: Union[Column, str]) -> Column:
    """
    For grouped aggregations, Spark provides a way via pyspark.sql.functions.percentile_approx("col", .5) function,
    since for large datasets, computing the median is computationally expensive.
    This function manually computes the median and should only be used for small to mid sized datasets / groupings.

    :param col: Column to compute the median for.
    :return: A pyspark `Column` containing the median calculation expression
    """
    list_expr = F.filter(F.collect_list(col), lambda x: x.isNotNull())
    sorted_list_expr = F.sort_array(list_expr)
    size_expr = F.size(sorted_list_expr)

    even_num_elements = (size_expr % 2) == 0
    odd_num_elements = ~even_num_elements

    return F.when(size_expr == 0, None).otherwise(
        F.when(odd_num_elements, sorted_list_expr[F.floor(size_expr / 2)]).otherwise(
            (
                sorted_list_expr[(size_expr / 2 - 1).cast("long")]
                + sorted_list_expr[(size_expr / 2).cast("long")]
            )
            / 2
        )
    )
Apply it like this:
output_df = input_spark_df.groupby("group").agg(
    median_exact("elems").alias("elems_median")
)
We can calculate the median and quantiles in Spark using the following code:
df.stat.approxQuantile(col,[quantiles],error)
For example, finding the median in the following dataframe [1,2,3,4,5]:
df.stat.approxQuantile(col,[0.5],0)
The lesser the error, the more accurate the results.
From version 3.4+ (and also already in 3.3.1) the median function is directly available
https://github.com/apache/spark/blob/e170a2eb236a376b036730b5d63371e753f1d947/python/pyspark/sql/functions.py#L633
import pyspark.sql.functions as f
df.groupBy("grp").agg(f.median("val"))
I guess the respective documentation will be added once the version is finally released.
