Error in FeatureUnion Sklearn Pipeline

Error in FeatureUnion Sklearn Pipeline - python

I have the following dataframe:
ID Text
1 qwerty
2 asdfgh
I am trying to create md5 hash for Text field and remove ID field from the dataframe above. To achieve that i have created a simple pipeline with custom transformers from sklearn.
Here is the code I have used:
class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
def __init__(self, key):
self.key = key
def fit(self, x, y=None):
return self
def hash_generate(self, txt):
m = hashlib.md5()
text = str(txt)
long_text = ' '.join(text.split())
m.update(long_text.encode('utf-8'))
text_hash= m.hexdigest()
return text_hash
def transform(self, x):
return x[self.key].apply(lambda z: self.hash_generate(z)).values
class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
def fit(self, x, y=None):
return self
def transform(self, x):
x = x.drop(['Gene', 'Variation','ID','Text'], axis=1)
return x.values
fp = pipeline.Pipeline([
('union', pipeline.FeatureUnion([
('hash', cust_txt_col('Text')), # can pass in either a pipeline
('normalized', cust_regression_vals()) # or a transformer
]))
])
When I run this I receive the follwoing error:
ValueError: all the input arrays must have same number of dimensions
Can you, please, tell me what is wrong with my code?
if i run the classes one by one :
for cust_txt_col i got below o/p
['3e909f222a1e06098ec7ca1ea7e84540' '1691bdba3b75df145169e0501369fce3'
'1691bdba3b75df145169e0501369fce3' ..., 'e11ec9863aaeb93f77a231319021e14d'
'851c517b2af0a46cb9bc9373b748b6ff' '0ffe46fc75d21a5347b1f1a5a84526ad']
for cust_regression_vals i got below o/p
[[qwerty],
[asdfgh]]

cust_txt_col is returning a 1d array. FeatureUnion demands that each constituent transformer returns a 2d array.

Related

Importing an externally defined function to pytorch Dataset

I have the following problem. I am defining a dataset class for my pytorch dataloader. In the class, I would like to use an externally defined function in the __getitem__ method.
Here is the example of what I mean
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
def readOneElement(df, idx):
# some fancy code which returns one element with index i based on full data set df (pandas)
...
# And here the Dataset class
class ModelDataset(Dataset):
def __init__(self, dataPath, transform = None):
def listUniqueExamples(self, df):
listExamples = df[['criterionId', 'adGroupId', 'campaignId', 'accountId']].drop_duplicates().reset_index();
return listExamples
self.transform = transform
# Load data
df = pd.read_csv(dataPath + '/trainData.csv')
# Transform constants to self
self.df = df
self.listExamples = listUniqueExamples(self, self.df)
self.length = len(self.listExamples)
def __len__(self):
return self.length
def __getitem__(self, idx):
# Create one example data
sample = readOneElement(df = self.df, idx = idx) # !!!
return sample
The line marked with #!!! will not work, because the function was defined outside this Dataset object. I get the following error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
in
----> 1 data[20]
in __getitem__(self, idx)
31 def __getitem__(self, idx):
32 # Create one example data
---> 33 sample = readOneElement(df = self.df, idx = idx)
NameError: name 'readOneElement' is not defined
If I would have defined this function inside the Dataset object (as I have done for the listUniqueExamples function) then it would have worked. However, in this exact case I want this function to be external.
Is there any way to import external functions to the Dataset class in pytorch?
Thank you in advance!

Apply multiple StandardScaler's to individual groups?

Is there a pythonic way to chain together sklearn's StandardScaler instances to independently scale data with groups? I.e., if I wanted to find independently scale the features of the iris dataset; I could use the following code:
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['class'] = data['target']
means = df.groupby('class').mean()
stds = df.groupby('class').std()
df_rescaled = (
(df.drop(['class'], 1) - means.reindex(df['class']).values) /
stds.reindex(df['class']).values)
Here, I'm subtracting by the mean and dividing by the stdev of each group independently. But Its somewhat hard to carry around these means and stdev's, and essentially, replicate the behavior of StandardScaler when I have a categorical variable I'd like to control for.
Is there a more pythonic / sklearn-friendly way to implement this type of scaling?

Sure, you can use any sklearn operation and apply it to a groupby object.
First, a little convenience wrapper:
import typing
import pandas as pd
class SklearnWrapper:
def __init__(self, transform: typing.Callable):
self.transform = transform
def __call__(self, df):
transformed = self.transform.fit_transform(df.values)
return pd.DataFrame(transformed, columns=df.columns, index=df.index)
This one will apply any sklearn transform you pass into it to a group.
And finally simple usage:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]
df_rescaled = (
df.groupby("class")
.apply(SklearnWrapper(StandardScaler()))
.drop("class", axis="columns")
)
EDIT: You can pretty much do anything with SklearnWrapper.
Here is an example of transforming and reversing this operation for each group (e.g. do not overwrite the transformation object) - just fit the object anew each time a new group is seen (and add it to list).
I have kinda replicated a bit of sklearn's functionality for easier usage (you can extend it with any function you want by passing appropriate string to _call_with_function internal method):
class SklearnWrapper:
def __init__(self, transformation: typing.Callable):
self.transformation = transformation
self._group_transforms = []
# Start with -1 and for each group up the pointer by one
self._pointer = -1
def _call_with_function(self, df: pd.DataFrame, function: str):
# If pointer >= len we are making a new apply, reset _pointer
if self._pointer >= len(self._group_transforms):
self._pointer = -1
self._pointer += 1
return pd.DataFrame(
getattr(self._group_transforms[self._pointer], function)(df.values),
columns=df.columns,
index=df.index,
)
def fit(self, df):
self._group_transforms.append(self.transformation.fit(df.values))
return self
def transform(self, df):
return self._call_with_function(df, "transform")
def fit_transform(self, df):
self.fit(df)
return self.transform(df)
def inverse_transform(self, df):
return self._call_with_function(df, "inverse_transform")
Usage (group transform, inverse operation and apply it again):
data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]
# Create scaler outside the class
scaler = SklearnWrapper(StandardScaler())
# Fit and transform data (holding state)
df_rescaled = df.groupby("class").apply(scaler.fit_transform)
# Inverse the operation
df_inverted = df_rescaled.groupby("class").apply(scaler.inverse_transform)
# Apply transformation once again
df_transformed = (
df_inverted.groupby("class")
.apply(scaler.transform)
.drop("class", axis="columns")
)

I updated #Szymon Maszke code:
class SklearnWrapper:
def __init__(self, transformation: typing.Callable):
self.transformation = transformation
self._group_transforms = []
# Start with -1 and for each group up the pointer by one
self._pointer = -1
def _call_with_function(self, df: pd.DataFrame, function: str):
# If pointer >= len we are making a new apply, reset _pointer
if self._pointer == len(self._group_transforms)-1 and function=="inverse_transform":
self._pointer = -1
self._pointer += 1
print(self._pointer)
return pd.DataFrame(
getattr(self._group_transforms[self._pointer], function)(df.values),
columns=df.columns,
index=df.index,
)
def fit(self, df):
scaler = copy(self.transformation)
self._group_transforms.append(scaler.fit(df.values))
return self
def transform(self, df):
return self._call_with_function(df, "transform")
def fit_transform(self, df):
self.fit(df)
return self.transform(df)
def inverse_transform(self, df):
return self._call_with_function(df, "inverse_transform")
StandardScaler() was not stored properly in _group_transforms, so I created a copy (using the copy lib) and store it (maybe there's a better way to do it using OOP).

Custom Layer Keras: transform output at certain positions

I am trying to write a custom layer in Keras (with tensorflow backend) that makes certain positions binary.
For example suppose that I have [0.6,0.8,0.9,0.2] and that position 1 and 3 must be binary, I would like to have a layer that outputs [0.6,1,0.9,0]
e.g. output[pos] > 0.5 then output[pos] = 1, else output[pos] = 0
I wrote this but it is not working at all...
...Layers of the net...
x = Lambda(self.adjust_positions)(x)
here the functions I wrote
def update_1(self, x, pos):
with tf.control_dependencies([tf.assign(x[pos],[1])]):
return tf.identity(x)
def update_0(self, x, pos):
with tf.control_dependencies([tf.assign(x[pos],[0])]):
return tf.identity(x)
def adjust_positions(self, x):
for pos in indexes:
tf.cond(tf.gather(x, pos)<[0.5], self.update_0(x, pos), self.update_1(x,pos))
return x
The error I get is:
ValueError: Sliced assignment is only supported for variables
55 def update_0(self, x, pos):
---> 56 with tf.control_dependencies([tf.assign(x[pos],[0])]):
57 return tf.identity(x)
How can I implement this functionality? What I have done is reasonable?

How to impute columns with categorial datatype in scikit-learn

I have a dataset that includes both numeric and object in the features. Additionally some of the features with object datatype have missing values. I created a modified version of Imputer (following the instructions on another post) to take care of the missing value for both numeric and categorial datatype but when I apply to my dataset it returns AttributeError. I believe I am making a silly mistake in the definition of fit method for the impute and i appreciate your insight. Here is the my code and the error:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
#load the data
path='~/Desktop/ML/Hands_on/housing_train.csv'
path=os.path.expanduser(path)
data=pd.read_csv(path)
#select the columns_names including dtype=object && missing data
object_data=data.select_dtypes(include=['object'])
object_data_null=[]
for col in object_data.columns:
if object_data[col].isnull().any():
object_data_null.append(col)
class GeneralImputer(Imputer):
def __init__(self, **kwargs):
Imputer.__init__(self, **kwargs)
def fit(self, X, y=None):
if self.strategy == 'most_frequent':
self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
self.statistics_ = self.fills.values
return self
else:
return Imputer.fit(self, X, y=y)
def transform(self, X):
if hasattr(self, 'fills'):
return pd.DataFrame(X).fillna(self.fills).values.astype(str)
else:
return Imputer.transform(self, X)
imputer=GeneralImputer(strategy='most_frequent', axis=1)
for i in object_data_null:
imputer.fit(data[i])
data[i]=imputer.transform(data[i])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-989e78355872> in <module>()
38 object_data_null
39 for i in object_data_null:
---> 40 imputer.fit(data[i])
41 data[i]=imputer.transform(data[i])
42
<ipython-input-29-989e78355872> in fit(self, X, y)
23 if self.strategy == 'most_frequent':
24 self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
---> 25 self.statistics_ = self.fills.values
26 return self
27 else:
AttributeError: 'str' object has no attribute 'values'

For a 1-sized object the squeeze() method will return a scaler object as mentioned in the documentation
So that means, for most of the time (which happens for all columns here), the mode of a column will be a single object and then the squeeze() will return just the string.
So no need to get .values after it. Change your fit() method to remove that:
def fit(self, X, y=None):
if self.strategy == 'most_frequent':
self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
# Removed .values from the below line
self.statistics_ = self.fills
return self

Custom transformer for sklearn Pipeline that alters both X and y

I want to create my own transformer for use with the sklearn Pipeline.
I am creating a class that implements both fit and transform methods. The purpose of the transformer will be to remove rows from the matrix that have more than a specified number of NaNs.
The issue I am facing is how can I change both the X and y matrices that are passed to the transformer?
I believe this has to be done in the fit method since it has access to both X and y. Since python passes arguments by assignment once I reassign X to a new matrix with fewer rows the reference to the original X is lost (and of course the same is true for y). Is it possible to maintain this reference?
I’m using a pandas DataFrame to easily drop the rows that have too many NaNs, this may not be the right way to do it for my use case. The current code looks like this:
class Dropna():
# thresh is max number of NaNs allowed in a row
def __init__(self, thresh=0):
self.thresh = thresh
def fit(self, X, y):
total = X.shape[1]
# +1 to account for 'y' being added to the dframe
new_thresh = total + 1 - self.thresh
df = pd.DataFrame(X)
df['y'] = y
df.dropna(thresh=new_thresh, inplace=True)
X = df.drop('y', axis=1).values
y = df['y'].values
return self
def transform(self, X):
return X

Modifying the sample axis, e.g. removing samples, does not (yet?) comply with the scikit-learn transformer API. So if you need to do this, you should do it outside any calls to scikit learn, as preprocessing.
As it is now, the transformer API is used to transform the features of a given sample into something new. This can implicitly contain information from other samples, but samples are never deleted.
Another option is to attempt to impute the missing values. But again, if you need to delete samples, treat it as preprocessing before using scikit learn.

You have to modify the internal code of sklearn Pipeline.
We define a transformer that removes samples where at least the value of a feature or the target is NaN during fitting (fit_transform). While it removes the samples where at least the value of a feature is NaN during inference (transform). Important to note that our transformer returns X and y in fit_transform so we need to handle this behaviour in the sklearn Pipeline.
class Dropna():
def fit(self, X, y):
return self
def fit_transform(self, X, y):
mask = (np.isnan(X).any(-1) | np.isnan(y))
if hasattr(X, 'loc'):
X = X.loc[~mask]
else:
X = X[~mask]
if hasattr(y, 'loc'):
y = y.loc[~mask]
else:
y = y[~mask]
return X, y ###### make fit_transform return X and y
def transform(self, X):
mask = np.isnan(X).any(-1)
if hasattr(X, 'loc'):
X = X.loc[~mask]
else:
X = X[~mask]
return X
We only have to modify the original sklearn Pipeline in only two specific points in fit and in _fit method. The rest remains unchanged.
from sklearn import pipeline
from sklearn.base import clone
from sklearn.utils import _print_elapsed_time
from sklearn.utils.validation import check_memory
class Pipeline(pipeline.Pipeline):
def _fit(self, X, y=None, **fit_params_steps):
self.steps = list(self.steps)
self._validate_steps()
memory = check_memory(self.memory)
fit_transform_one_cached = memory.cache(pipeline._fit_transform_one)
for (step_idx, name, transformer) in self._iter(
with_final=False, filter_passthrough=False
):
if transformer is None or transformer == "passthrough":
with _print_elapsed_time("Pipeline", self._log_message(step_idx)):
continue
try:
# joblib >= 0.12
mem = memory.location
except AttributeError:
mem = memory.cachedir
finally:
cloned_transformer = clone(transformer) if mem else transformer
X, fitted_transformer = fit_transform_one_cached(
cloned_transformer,
X,
y,
None,
message_clsname="Pipeline",
message=self._log_message(step_idx),
**fit_params_steps[name],
)
if isinstance(X, tuple): ###### unpack X if is tuple X = (X,y)
X, y = X
self.steps[step_idx] = (name, fitted_transformer)
return X, y
def fit(self, X, y=None, **fit_params):
fit_params_steps = self._check_fit_params(**fit_params)
Xt = self._fit(X, y, **fit_params_steps)
if isinstance(Xt, tuple): ###### unpack X if is tuple X = (X,y)
Xt, y = Xt
with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
if self._final_estimator != "passthrough":
fit_params_last_step = fit_params_steps[self.steps[-1][0]]
self._final_estimator.fit(Xt, y, **fit_params_last_step)
return self
This is required in order to unpack the values generated by Dropna().fit_transform(X, y) in the new X and y.
Here is the full pipeline at work:
from sklearn.linear_model import Ridge
X = np.random.uniform(0,1, (100,3))
y = np.random.uniform(0,1, (100,))
X[np.random.uniform(0,1, (100)) < 0.1] = np.nan
y[np.random.uniform(0,1, (100)) < 0.1] = np.nan
pipe = Pipeline([('dropna', Dropna()), ('model', Ridge())])
pipe.fit(X, y)
pipe.predict(X).shape
Another trial with a further intermediate preprocessing step:
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([('dropna', Dropna()), ('scaler', StandardScaler()), ('model', Ridge())])
pipe.fit(X, y)
pipe.predict(X).shape
More complex behaviors can be achieved with other simple modifications according to the needs. If you are interested also in Pipeline().fit_transform or Pipeline().fit_predict you need to operate the same changes.

The package imblearn, which is built on top of sklearn, contains an estimator FunctionSampler that allows manipulating both the features array, X, and target array, y, in a pipeline step.
Note that using it in a pipeline step requires using the Pipeline class in imblearn that inherits from the one in sklearn. Furthermore, by default, in the context of Pipeline, the method resample does nothing when it is not called immediately after fit (as in fit_resample). So, read the documentation ahead of time.

Adding to #João Matias response:
Here's an example of using imblearn to define a pipeline step that drops rows with missing values:
from imblearn import FunctionSampler
def drop_rows_with_any_nan(X, y):
return X[~np.isnan(X).any(axis=1), :], y[~np.isnan(X).any(axis=1)]
drop_rows_with_any_nan_sampler = FunctionSampler(func=drop_rows_with_any_nan, validate=False)
model_clf2 = pipeline.Pipeline(
[
('preprocess', column_transformer),
('drop_na', drop_rows_with_any_nan_sampler),
('smote', SMOTE()),
('xgb', xgboost.XGBClassifier()),
]
)
Note, you have to use the imblearn pipeline.

You can solve this easily by using the sklearn.preprocessing.FunctionTransformer method (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)
You just need to put your alternations to X in a function
def drop_nans(X, y=None):
total = X.shape[1]
new_thresh = total - thresh
df = pd.DataFrame(X)
df.dropna(thresh=new_thresh, inplace=True)
return df.values
then you get your transformer by calling
transformer = FunctionTransformer(drop_nans, validate=False)
which you can use in the pipeline. The threshold can be set outside the drop_nans function.

#eickenberg is the proper and clean answer. Nevertheless, I like to keep everything into one Pipeline, so if you are interested, I created a library (not yet deployed on pypi) that allow to apply transformation on Y:
https://gitlab.com/thibaultB/transformers/
Usage is the following:
df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
df.columns = ["a", "b", "target"]
spliter = SplitXY("target") # Create a new step and give it name of column target
pipe = Pipeline([
("imputer", SklearnPandasWrapper(KNNImputer())),
("spliter", spliter),
("scaler", StandardScaler()),
("rf",
EstimatorWithoutYWrapper(RandomForestRegressor(random_state=45),
spliter)) # EstimatorWithoutYWrapper overwrite RandomForestRegressor to get y from spliter just before calling fit or transform
])
pipe.fit(df)
res = pipe.predict(df)
Using this code, you can alter the number of rows if you put all the transformer that modify the numbers of rows before the "SplitXY" transformer. Transformer before the SplitXY transformer should keep columns name, it is why I also added a SklearnPandasWrapper that wrap sklearn transformer (that usually return numpy array) to keep columns name.

You can use function transformer
df=pd.DataFrame([[1,2,3],[4,5,6],[np.NaN,np.NaN,9],[7,np.NaN,9]])
from sklearn.pipeline import FunctionTransformer,make_pipeline
def remove_na(df_,thresh=2):
return df.dropna(thresh=2)
pipe=make_pipeline(FunctionTransformer(func=remove_na,
validate=False,kw_args={"thresh":2}))
pipe.fit_transform(df)

Use "deep-copies" further on, down the pipeline and X, y remain protected
.fit() can first assign on each call deep-copy to new class-variables
self.X_without_NaNs = X.copy()
self.y_without_NaNs = y.copy()
and then reduce / transform these not to have more NaN-s than ordered by self.treshold

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Error in FeatureUnion Sklearn Pipeline - python

cust_txt_col is returning a 1d array. FeatureUnion demands that each constituent transformer returns a 2d array.

Related

Importing an externally defined function to pytorch Dataset

Apply multiple StandardScaler's to individual groups?

Custom Layer Keras: transform output at certain positions

How to impute columns with categorial datatype in scikit-learn

Custom transformer for sklearn Pipeline that alters both X and y

Categories

Resources