In featuretools, how to create a custom primitive over 2 columns? - python

I created Custom Primitives like below.
class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric
    commutative = True
    compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS]

    def get_function(self):
        def correlate(column1, column2):
            return np.correlate(column1, column2, "same")
        return correlate
Then I checked the calculation like below just in case.
np.correlate(feature_matrix["alcohol"], feature_matrix["chlorides"],mode="same")
However, the result produced by the custom primitive and the result of this direct np.correlate call were different.
Do you know why they differ?
If my code is basically wrong, please correct me.

Thanks for the question! You can create a custom primitive with a fixed argument to calculate that kind of correlation by using the TransformPrimitive as a base class. I will go through an example using this data.
import pandas as pd
data = [
    [0.40168819, 0.0857946],
    [0.06268886, 0.27811651],
    [0.16931269, 0.96509497],
    [0.15123022, 0.80546244],
    [0.58610794, 0.56928692],
]
df = pd.DataFrame(data=data, columns=list('ab'))
df.reset_index(inplace=True)
df
index a b
0 0.401688 0.085795
1 0.062689 0.278117
2 0.169313 0.965095
3 0.151230 0.805462
4 0.586108 0.569287
The function np.correlate acts as a transform when the parameter mode='same' (it returns one value per input row), so define a custom primitive using TransformPrimitive as the base class.
from featuretools.primitives import TransformPrimitive
from featuretools.variable_types import Numeric
import numpy as np
class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric

    def get_function(self):
        def correlate(a, b):
            return np.correlate(a, b, mode='same')
        return correlate
The DFS call requires the data to be structured into an EntitySet, then you can use the custom primitive.
import featuretools as ft
es = ft.EntitySet()
es.entity_from_dataframe(
    entity_id='data',
    dataframe=df,
    index='index',
)
fm, fd = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=[Correlate],
    max_depth=1,
)
fm[['CORRELATE(a, b)']]
CORRELATE(a, b)
index
0 0.534548
1 0.394685
2 0.670774
3 0.670506
4 0.622236
You should get the same values between the feature matrix and np.correlate.
actual = fm['CORRELATE(a, b)'].values
expected = np.correlate(df['a'], df['b'], mode='same')
np.testing.assert_array_equal(actual, expected)
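If you want the mode to be configurable rather than hard-coded, one option (a sketch only; CorrelateWithMode is a hypothetical name) is to accept it as a constructor argument, since instantiated primitives can be passed to trans_primitives:
class CorrelateWithMode(TransformPrimitive):
    name = 'correlate_with_mode'
    input_types = [Numeric, Numeric]
    return_type = Numeric

    def __init__(self, mode='same'):
        # 'mode' is forwarded to np.correlate; 'same' keeps one output value per input row
        self.mode = mode

    def get_function(self):
        def correlate(a, b):
            return np.correlate(a, b, mode=self.mode)
        return correlate

# pass an instance rather than the class, e.g. trans_primitives=[CorrelateWithMode(mode='same')]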
You can learn more about defining simple custom primitives and advanced custom primitives in the linked pages. Let me know if you found this helpful.

Related

How to calculate Category Utility

I am currently trying to implement Category Utility (as defined here) in Python. I am trying to use Pandas to accomplish this. I have a rough draft of code currently, however I am pretty sure that it's wrong. I think it's wrong specifically in the code that deals with looping over the clusters and calculating the inner_sum, seeing as Category Utility requires going through all possible values for each attribute. Would anyone be able to help me figure out how to improve this draft such that it properly calculates the category utility for the given clusters?
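For reference, the definition of category utility I am working from (assuming the linked definition is the standard formulation) is
$$ CU = \frac{1}{k}\sum_{l=1}^{k} P(C_l)\sum_{i}\sum_{j}\Big[P(a_i = v_{ij}\mid C_l)^2 - P(a_i = v_{ij})^2\Big] $$
where the inner double sum runs over every attribute a_i and every value v_ij that attribute takes.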
Here is the code:
import pandas as pd
from typing import List

def probability(df: pd.DataFrame, clause: str) -> pd.DataFrame:
    """
    Gets the probabilities of the values within the given data frame
    of the provided clause
    """
    return df.groupby(clause).size().div(len(df))

def conditional_probability(df: pd.DataFrame, clause: str, given: str) -> pd.DataFrame:
    """
    Gets the conditional probability of the values within the provided data
    frame of the provided clause assuming that the given is true.
    """
    base_probabilities: pd.DataFrame = probability(df, clause=given)
    return (
        df.groupby([clause, given]).size().div(len(df))
        .div(base_probabilities, axis=0, level=given)
    )

def category_utility(clusters: pd.DataFrame) -> float:
    # k is the number of clusters.
    k: int = len(clusters)
    # probabilities of all clusters. To be used to get P(C_l)
    probs_of_clusters: pd.DataFrame = probability(clusters, 'clusters')
    # probabilities of all attributes being any possible value.
    # To be used to get P(a_i = v_ij)
    probs_of_attr_vals: pd.DataFrame = probability(clusters, 'attributes')
    # Probabilities of all attributes being any possible value given the cluster they're in.
    # To be used to get P(a_i = v_ij | C_l)
    cond_prob_of_attr_vals: pd.DataFrame = conditional_probability(clusters, clause='attributes', given='clusters')
    tracked_cu: List[float] = []
    for cluster in clusters['clusters']:
        # The probability of the current cluster.
        # P(C_l)
        prob_of_curr_cluster: float = probs_of_clusters[cluster]
        # The summation of the square difference between an attribute being in a cluster and just overall existing in the data.
        # E (P(a_i = v_ij | C_l) ^ 2 - P(a_i = v_ij) ^ 2)
        inner_sum: float = sum([cond_prob_of_attr_vals[attr] ** 2 - probs_of_attr_vals[attr] for attr in clusters['attributes']])
        tracked_cu += inner_sum * prob_of_curr_cluster
    return sum(tracked_cu) / k
Any help with correctly implementing this would be appreciated.

How to multiprocess finding closest geographic point in two pandas dataframes?

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool

import pandas as pd
from geopy import distance  # assuming geopy, which provides great_circle(...).feet

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol

def combine_yield(field):
    # field is a list of the files for the field I'm working with
    # lots of pre-processing
    # dfs in this case is a list of the dataframes for the current field
    # mdf is the dataframe with the most points, which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]), callback=return_columns_cb)
    for col in return_columns:
        mdf = mdf.append(col)
    '''I unzip my points back to longitude and latitude here in the final
    dataframe so I can write to csv without tuples'''
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf

def multiprocess_combine_yield():
    '''do stuff to get dictionary below with each field name as key and values
    as all the files for that field'''
    yield_by_field = {'C01': ('files...'), ...}
    # The farm I'm working on has 30 fields and below is too slow
    for k, v in yield_by_field.items():
        combine_yield(v)
I guess what I need help with is this: I envision something like using a pool to imap or apply_async on each tuple of files in the dictionary. Then, within the combine_yield function applied to that tuple of files, I want to be able to run the distance function in parallel. That function bogs the program down because it calculates the distance between every point in each of the dataframes for each year of yield. The files average around 1200 data points, and multiplying that by 30 fields means I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB and the distance, though, because of what I do later on when selecting which rows to use from the 'mdf' dataframe.
Thanks to #ALollz comment, I figured this out. I went back to my getnearestpoint function and instead of doing a bunch of Series.apply I am now using cKDTree from scipy.spatial to find the closest point, and then using a vectorized haversine distance to calculate the true distances on each of these matched points. Much much quicker. Here are the basics of the code below:
import re
from itertools import zip_longest

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    distances, indices = tree.query(gdA_coordinates, k=1)
    # These column names are done as so due to formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA

def combine_yield(field):
    # same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    # convert to radians up front so the haversine terms below use consistent units
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do Pool.map(combine_yield, (v for k,v in yield_by_field.items()))
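For completeness, a minimal sketch of that driver (assuming yield_by_field, the field-to-files dictionary from earlier, is available at module scope; the __main__ guard matters on platforms that spawn worker processes):
from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:
        # one combine_yield call per field, distributed across worker processes
        combined = pool.map(combine_yield, [files for field, files in yield_by_field.items()])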
This has made a substantial difference. Hope it helps anyone else in a similar predicament.

Create custom primitive function of a list type using custom variable types

I have a question about featuretools's make_agg_primitive function.
In my data there are values that consist of list format.
For example,
id products
a ['a', 'b', 'c']
b ['a','c']
a ['a','c']
I want to aggregate the products columns by using various custom functions:
def len_lists(values):
    return len(values)

len_ = make_agg_primitive(function=len_lists,
                          input_types=[?????],
                          return_type=ft.variable_types.Numeric,
                          description="length of a list related instance")
You can use featuretools to create a custom variable type that can be used with a custom primitive to generate the transform feature that you want.
Note: The operation that you want to do is actually a transform primitive, not an aggregation primitive.
Using your example, let's create a custom List type:
from featuretools.variable_types import Variable

class List(Variable):
    type_string = "list"
Now let’s use our new List type to create a custom transform primitive, and generate features for a simple entityset that contains a List variable type.
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Numeric
import pandas as pd
import featuretools as ft

def len_list(values):
    return values.str.len()

LengthList = make_trans_primitive(function=len_list,
                                  input_types=[List],
                                  return_type=Numeric,
                                  description="length of a list related instance")

# Create a simple entityset containing list data
data = pd.DataFrame({"id": [1, 2, 3],
                     "products": [['a', 'b', 'c'], ['a', 'c'], ['b']]})

es = ft.EntitySet(id="data")
es = es.entity_from_dataframe(entity_id="customers",
                              dataframe=data,
                              index="id",
                              variable_types={
                                  'products': List  # Use the custom List type
                              })

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="customers",
                                  agg_primitives=[],
                                  trans_primitives=[LengthList],
                                  max_depth=2)
You can now view the features generated, which include features that used the custom transform primitive
feature_matrix.head()
LEN_LIST(products)
id
1 3
2 2
3 1
Thank you! Thanks to you, I'm going to make various functions!
In the code
def len_list(values):
    return values.str.len()
is the values argument a DataFrame?
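One quick way to see for yourself what featuretools passes into the primitive function (just an inspection sketch, not part of the original answer) is to print the type from inside it:
def len_list(values):
    print(type(values))  # shows the actual object type featuretools passes in
    return values.str.len()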

h2o frame from pandas casting

I am using h2o to perform predictive modeling from python.
I have loaded some data from a csv using pandas, specifying some column types:
import os
import pandas as pd

dtype_dict = {'SIT_SSICCOMP': 'object',
              'SIT_CAPACC': 'object',
              'PTT_SSIRMPOL': 'object',
              'PTT_SPTCLVEI': 'object',
              'cap_pad': 'object',
              'SIT_SADNS_RESP_PERC': 'object',
              'SIT_GEOCODE': 'object',
              'SIT_TIPOFIRMA': 'object',
              'SIT_TPFRODESI': 'object',
              'SIT_CITTAACC': 'object',
              'SIT_INDIRACC': 'object',
              'SIT_NUMCIVACC': 'object'}
date_cols = ["SIT_SSIDTSIN", "SIT_SSIDTDEN", "PTT_SPTDTEFF", "PTT_SPTDTSCA", "SIT_DTANTIFRODE", "PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI', 'SIT_CITTAACC',
                   'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
                   'SIT_LATITACC', 'cap_pad', 'SIT_DTANTIFRODE']
comp = 'mycomp'
file_completo = os.path.join(dataDir, "db4modelrisk_" + comp + ".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo, sep=";", encoding='latin1',
                         header=0, infer_datetime_format=True, na_values=[''], keep_default_na=False,
                         parse_dates=date_cols, dtype=dtype_dict, nrows=500e3)
db4scoring.drop(labels=columns_to_drop, axis=1, inplace=True)
Then, after I set up an h2o cluster, I import the data into h2o using db4scoring_h2o = H2OFrame(db4scoring) and convert the categorical predictors to factors, for example:
db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check the data types using db4scoring.dtypes, I notice that they are properly set, but when I import the frame into h2o, the H2OFrame performs some unwanted conversions to enum (e.g. from float or from int). I wonder if there is a way to specify the column types in H2OFrame.
Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types argument when you cast.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd

# create small random pandas df
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)),
                  columns=list('AB'))
print(df)
#    A  B
# 0  5  0
# 1  1  3
# 2  4  8
# 3  3  9
# ...

# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A': 'numeric', 'B': 'enum'})
h2o_df.describe()  # you should now see the desired data types
#       A     B
# type  int   enum
# ...
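Applied to the question's data, a sketch of one way to build that dict (assuming you want every column declared as 'object' in dtype_dict to become a factor, filtered to the columns still present after the drop):
# hypothetical helper: force the remaining 'object' columns to enum, let h2o infer the rest
col_types = {col: 'enum' for col, t in dtype_dict.items()
             if t == 'object' and col in db4scoring.columns}
db4scoring_h2o = h2o.H2OFrame(db4scoring, column_types=col_types)
db4scoring_h2o.describe()  # verify the resulting types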

xarray groupby: Apply different reducers to variables

I'm using xarray's groupby + reducer to perform spatial overlay/aggregation on spatial rasters. I'm wondering if there is a way to use a different reducer for certain data variables. In the code below for instance, I would like categorical_variable to be reduced with first() (or mode but that doesn't seem to be implemented), and continuous_variable to be reduced with mean()
import xarray as xr
import numpy as np

categorical_variable = np.array([[1, 1, 1, 1, 1],
                                 [1, 1, 1, 1, 2],
                                 [1, 1, 1, 2, 2],
                                 [1, 1, 2, 2, 2],
                                 [1, 2, 2, 2, 2]], dtype='int16')
grouping_variable = np.array([[1, 1, 1, 2, 2],
                              [1, 1, 3, 2, 2],
                              [1, 3, 3, 3, 3],
                              [3, 3, 3, 3, 3],
                              [4, 4, 4, 4, 4]], dtype='int16')
continuous_variable = np.random.rand(5, 5)

xr_dataset = xr.Dataset({'grouping_variable': xr.DataArray(grouping_variable, dims=['x', 'y']),
                         'categorical_variable': xr.DataArray(categorical_variable, dims=['x', 'y']),
                         'continuous_variable': xr.DataArray(continuous_variable, dims=['x', 'y'])})

xr_grouped = xr_dataset.groupby('grouping_variable')
xr_reduced = xr_grouped.mean()
This isn't currently possible in one go in xarray AFAIK, but since you're losing the spatial structure anyway, you can go via pandas quite simply and use agg:
>>> df = xr_dataset.to_dataframe()
>>> df.groupby('grouping_variable').agg({"categorical_variable": "first",
...                                      "continuous_variable": "mean"})
categorical_variable continuous_variable
grouping_variable
1 1 0.458534
2 1 0.822294
3 1 0.539483
4 1 0.515586
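If you need the result back as an xarray object, the aggregated frame converts directly (a follow-up sketch, not part of the original answer; to_xarray is a standard pandas method):
agg = df.groupby('grouping_variable').agg({"categorical_variable": "first",
                                           "continuous_variable": "mean"})
xr_result = agg.to_xarray()  # Dataset indexed by grouping_variable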
The performance is not optimal but this is what I ended up doing:
xr_dataset = xr.merge([
    xr_dataset.categorical_variable.groupby('grouping_variable').first(),
    xr_dataset.continuous_variable.groupby('grouping_variable').mean(),
    ...
])
