I have the following problem: I am defining a Dataset class for my PyTorch DataLoader. Inside the class, I would like to use an externally defined function in the __getitem__ method.
Here is an example of what I mean:
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd

def readOneElement(df, idx):
    # some fancy code which returns one element with index idx based on the full data set df (pandas)
    ...

# And here the Dataset class
class ModelDataset(Dataset):
    def __init__(self, dataPath, transform=None):
        def listUniqueExamples(self, df):
            listExamples = df[['criterionId', 'adGroupId', 'campaignId', 'accountId']].drop_duplicates().reset_index()
            return listExamples
        self.transform = transform
        # Load data
        df = pd.read_csv(dataPath + '/trainData.csv')
        # Transform constants to self
        self.df = df
        self.listExamples = listUniqueExamples(self, self.df)
        self.length = len(self.listExamples)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Create one example data
        sample = readOneElement(df=self.df, idx=idx)  # !!!
        return sample
The line marked with # !!! does not work, because the function is defined outside the Dataset class. I get the following error:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 data[20]

<ipython-input-...> in __getitem__(self, idx)
     31     def __getitem__(self, idx):
     32         # Create one example data
---> 33         sample = readOneElement(df = self.df, idx = idx)

NameError: name 'readOneElement' is not defined
If I had defined this function inside the Dataset class (as I did for the listUniqueExamples function), it would have worked. However, in this exact case I want the function to stay external.
Is there any way to import external functions into a PyTorch Dataset class?
Thank you in advance!
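For what it's worth, here is a minimal sketch of the usual fix, assuming readOneElement lives in its own module (the file name data_utils.py below is made up purely for illustration). The function only has to be defined or imported in the global namespace of the module (or notebook session) where the Dataset class is defined, because __getitem__ resolves the name through normal Python scoping, not through the class:

# data_utils.py  (hypothetical module name)
def readOneElement(df, idx):
    # some fancy code which returns one element with index idx from the full data set df
    return df.iloc[idx]

# the file / notebook cell that defines the Dataset
import pandas as pd
from torch.utils.data import Dataset

from data_utils import readOneElement  # the plain import is all that is needed

class ModelDataset(Dataset):
    def __init__(self, dataPath, transform=None):
        self.transform = transform
        self.df = pd.read_csv(dataPath + '/trainData.csv')

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # readOneElement is looked up in this module's global namespace at call time,
        # so any imported (or previously defined) function works here
        return readOneElement(df=self.df, idx=idx)

The original NameError typically means the cell or module defining readOneElement was never executed or imported in the session that runs the DataLoader.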
I want to implement my own data loader in Python. The goal is to traverse the dataset randomly in mini-batches, and I wonder if there is a more elegant way to achieve it.
For example, given the dataset dataset=[1,2,3,4,5,6,7,8,9] and batch_size=3, the three returned batches could look like [1,5,7], [2,3,4], [6,8,9]. Here is my implementation:
import numpy as np

class DataLoader:
    def __init__(self, data: list, batch_size: int):
        self.data = data
        self.batch_size = batch_size
        self.samples_reserve = None

    def _reset(self):
        self.samples_reserve = np.arange(len(self.data)).tolist()

    def __iter__(self):
        self._reset()
        return self

    def __next__(self):
        if len(self.samples_reserve) == 0:
            raise StopIteration
        samples_choice = set(np.random.choice(self.samples_reserve, self.batch_size, replace=False))
        self.samples_reserve = list(set(self.samples_reserve) - samples_choice)
        return list(samples_choice)

    def __len__(self):
        return int(len(self.data) / self.batch_size)

if __name__ == '__main__':
    for i in DataLoader([1, 2, 3, 4, 5, 6, 7, 8, 9], 3):
        print(i)
The __next__ method has to maintain the remaining samples by converting back and forth between set and list. I wonder if I can write the following code in a more elegant way, e.g. are there some API functions I can use directly, such as sample?
samples_choice = set(np.random.choice(self.samples_reserve, self.batch_size, replace=False))
self.samples_reserve = list(set(self.samples_reserve) - samples_choice)
My solution, based on numpy.random.Generator.choice:
import numpy as np

def get_samples(dataset, size):
    if not dataset:
        return []
    rng = np.random.default_rng()
    if size >= len(dataset):
        size = len(dataset)
    sd = set(dataset)
    samples = []
    while sd:
        sample = rng.choice(list(sd), min(size, len(sd)), replace=False)
        samples.append(list(sample))
        sd -= set(sample)
    return samples

print(get_samples(list(range(20)), 7))
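As a side note (my own sketch, not part of the answer above): if the goal is just to avoid the set/list bookkeeping, a common simplification is to shuffle the indices once with numpy.random.permutation and slice the permutation into consecutive chunks:

import numpy as np

def get_batches(dataset, batch_size):
    # shuffle all indices once, then cut the permutation into consecutive slices;
    # no per-step bookkeeping of which samples are still unused is needed
    order = np.random.permutation(len(dataset))
    return [[dataset[i] for i in order[start:start + batch_size]]
            for start in range(0, len(dataset), batch_size)]

print(get_batches([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))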
import pandas as pd
import numpy as np

class CLF:
    Weights = 0

    def fit(DF_input, DF_output, eta=0.1, drop=1000):
        X, y = DF_input.to_numpy(copy=True), DF_output.to_numpy(copy=True)
        N, d = X.shape
        m = len(np.unique(y))
        self.Weights = np.random.normal(0, 1, size=(d, m))

INPUT = pd.read_csv(path_input)
OUTPUT = pd.read_csv(path_output)
clf = CLF()
clf.fit(INPUT, OUTPUT)
I defined a .fit() method for the class I wrote. The first step is to convert the two dataframes into numpy arrays. However, I get the following error when I try to use the method, even though INPUT.to_numpy(copy=True) and OUTPUT.to_numpy(copy=True) both work fine on their own. Can somebody help me out here? Why is to_numpy recognized as an attribute rather than a method of the dataframes?
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-22-a3d455104534> in <module>
1 clf = CLF()
----> 2 clf.fit(INPUT, OUTPUT)
<ipython-input-16-57babd738b2d> in fit(DF_input, DF_output, eta, drop)
4
5 def fit(DF_input, DF_output, eta=0.1,drop=1000):
----> 6 X, y = DF_input.to_numpy(copy=True), DF_output.to_numpy(copy=True)
7 N,d = X.shape
8 m = len(np.unique(y)) # number of classes
AttributeError: 'CLF' object has no attribute 'to_numpy'
Your problem is that the first parameter of an instance method is reserved for the instance itself, conventionally named self. The correct syntax is:
class CLF:
    Weights = 0

    # notice the `self`
    def fit(self, DF_input, DF_output, eta=0.1, drop=1000):
        X, y = DF_input.to_numpy(copy=True), DF_output.to_numpy(copy=True)
        N, d = X.shape
        m = len(np.unique(y))
        self.Weights = np.random.normal(0, 1, size=(d, m))

INPUT = pd.read_csv(path_input)
OUTPUT = pd.read_csv(path_output)
clf = CLF()
clf.fit(INPUT, OUTPUT)
An instance method is a kind of attribute; the error message is phrased in those more general terms because attribute lookup keys on the . (dot) operator alone, rather than reading ahead to the left parenthesis to work out how you intended to use the name.
The problem is that you defined fit as an instance method, so its first parameter receives the instance, but you named that parameter DF_input. I think you simply forgot the usual self name for the implicit instance parameter.
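To make the binding concrete (this small demo is mine, not part of either answer): clf.fit(INPUT, OUTPUT) is equivalent to CLF.fit(clf, INPUT, OUTPUT), so with the original signature the CLF instance lands in DF_input, and DF_input.to_numpy(...) then looks for to_numpy on the CLF object, which is exactly what the AttributeError reports.

class Demo:
    def fit(DF_input, DF_output=None):
        # DF_input actually receives the Demo instance here, because the first
        # positional argument of an instance method is always the instance itself
        print(type(DF_input))

d = Demo()
d.fit("some dataframe")  # prints <class '__main__.Demo'>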
Is there a pythonic way to chain together sklearn's StandardScaler instances to independently scale data within groups? I.e., if I wanted to independently scale the features of the iris dataset per class, I could use the following code:
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['class'] = data['target']

means = df.groupby('class').mean()
stds = df.groupby('class').std()

df_rescaled = (
    (df.drop(['class'], axis=1) - means.reindex(df['class']).values) /
    stds.reindex(df['class']).values)
Here, I'm subtracting the mean and dividing by the stdev of each group independently. But it's somewhat awkward to carry these means and stdevs around and, essentially, replicate the behavior of StandardScaler when I have a categorical variable I'd like to control for.
Is there a more pythonic / sklearn-friendly way to implement this type of scaling?
Sure, you can use any sklearn operation and apply it to a groupby object.
First, a little convenience wrapper:
import typing
import pandas as pd

class SklearnWrapper:
    def __init__(self, transform: typing.Callable):
        self.transform = transform

    def __call__(self, df):
        transformed = self.transform.fit_transform(df.values)
        return pd.DataFrame(transformed, columns=df.columns, index=df.index)
This wrapper applies whatever sklearn transform you pass into it to each group.
And finally, simple usage:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]

df_rescaled = (
    df.groupby("class")
    .apply(SklearnWrapper(StandardScaler()))
    .drop("class", axis="columns")
)
EDIT: You can pretty much do anything with SklearnWrapper.
Here is an example of transforming and then reversing the operation for each group (i.e. without overwriting the transformation object) - just fit a new transformer each time a new group is seen (and add it to a list).
I have replicated a bit of sklearn's interface for easier usage (you can extend it with any function you want by passing the appropriate string to the internal _call_with_function method):
class SklearnWrapper:
    def __init__(self, transformation: typing.Callable):
        self.transformation = transformation
        self._group_transforms = []
        # Start with -1 and for each group up the pointer by one
        self._pointer = -1

    def _call_with_function(self, df: pd.DataFrame, function: str):
        # If pointer >= len we are making a new apply, reset _pointer
        if self._pointer >= len(self._group_transforms):
            self._pointer = -1
        self._pointer += 1
        return pd.DataFrame(
            getattr(self._group_transforms[self._pointer], function)(df.values),
            columns=df.columns,
            index=df.index,
        )

    def fit(self, df):
        self._group_transforms.append(self.transformation.fit(df.values))
        return self

    def transform(self, df):
        return self._call_with_function(df, "transform")

    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)

    def inverse_transform(self, df):
        return self._call_with_function(df, "inverse_transform")
Usage (group transform, inverse operation and apply it again):
data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]

# Create scaler outside the class
scaler = SklearnWrapper(StandardScaler())

# Fit and transform data (holding state)
df_rescaled = df.groupby("class").apply(scaler.fit_transform)

# Inverse the operation
df_inverted = df_rescaled.groupby("class").apply(scaler.inverse_transform)

# Apply transformation once again
df_transformed = (
    df_inverted.groupby("class")
    .apply(scaler.transform)
    .drop("class", axis="columns")
)
I updated @Szymon Maszke's code:
from copy import copy

class SklearnWrapper:
    def __init__(self, transformation: typing.Callable):
        self.transformation = transformation
        self._group_transforms = []
        # Start with -1 and for each group up the pointer by one
        self._pointer = -1

    def _call_with_function(self, df: pd.DataFrame, function: str):
        # If the pointer is at the last group and we start inverting, reset _pointer
        if self._pointer == len(self._group_transforms) - 1 and function == "inverse_transform":
            self._pointer = -1
        self._pointer += 1
        print(self._pointer)
        return pd.DataFrame(
            getattr(self._group_transforms[self._pointer], function)(df.values),
            columns=df.columns,
            index=df.index,
        )

    def fit(self, df):
        scaler = copy(self.transformation)
        self._group_transforms.append(scaler.fit(df.values))
        return self

    def transform(self, df):
        return self._call_with_function(df, "transform")

    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)

    def inverse_transform(self, df):
        return self._call_with_function(df, "inverse_transform")
StandardScaler() was not stored properly in _group_transforms (every group ended up referencing the same fitted object), so I create a copy of the transformation (using the copy module) and store that instead (maybe there's a better way to do this using OOP).
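On the "maybe there's a better way" remark, a side note of my own rather than something from the thread: sklearn ships a helper for exactly this, sklearn.base.clone, which returns a fresh unfitted estimator configured with the same parameters, and it could replace copy(self.transformation) inside fit:

from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

template = StandardScaler()
per_group_scaler = clone(template)   # fresh, unfitted copy with the same parameters
print(per_group_scaler is template)  # False: each group can get its own fitted instance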
I have a dataset whose features include both numeric and object dtypes. Additionally, some of the features with object dtype have missing values. I created a modified version of Imputer (following the instructions in another post) to take care of missing values for both numeric and categorical dtypes, but when I apply it to my dataset it raises an AttributeError. I believe I am making a silly mistake in the definition of the fit method for the imputer, and I'd appreciate your insight. Here are my code and the error:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

#load the data
path = '~/Desktop/ML/Hands_on/housing_train.csv'
path = os.path.expanduser(path)
data = pd.read_csv(path)

#select the column names with dtype=object && missing data
object_data = data.select_dtypes(include=['object'])
object_data_null = []
for col in object_data.columns:
    if object_data[col].isnull().any():
        object_data_null.append(col)

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

imputer = GeneralImputer(strategy='most_frequent', axis=1)

for i in object_data_null:
    imputer.fit(data[i])
    data[i] = imputer.transform(data[i])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-989e78355872> in <module>()
38 object_data_null
39 for i in object_data_null:
---> 40 imputer.fit(data[i])
41 data[i]=imputer.transform(data[i])
42
<ipython-input-29-989e78355872> in fit(self, X, y)
23 if self.strategy == 'most_frequent':
24 self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
---> 25 self.statistics_ = self.fills.values
26 return self
27 else:
AttributeError: 'str' object has no attribute 'values'
For a 1-sized object, the squeeze() method returns a scalar, as mentioned in the documentation.
That means that most of the time (and for all the columns here), the mode of a column is a single value, so squeeze() returns just the string.
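A quick illustration of that behaviour (my own snippet, not from the original answer):

import pandas as pd

# mode() of a single text column yields a one-element DataFrame;
# squeeze() then collapses it to the bare string, which has no .values attribute
most_frequent = pd.DataFrame({'col': ['a', 'a', 'b']}).mode(axis=0).squeeze()
print(most_frequent)        # 'a'
print(type(most_frequent))  # <class 'str'>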
So no need to get .values after it. Change your fit() method to remove that:
def fit(self, X, y=None):
    if self.strategy == 'most_frequent':
        self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
        # Removed .values from the below line
        self.statistics_ = self.fills
        return self
I have the following dataframe:
ID Text
1 qwerty
2 asdfgh
I am trying to create an md5 hash of the Text field and remove the ID field from the dataframe above. To achieve that I have created a simple pipeline with custom transformers from sklearn.
Here is the code I have used:
import hashlib
import sklearn.base
from sklearn import pipeline

class cust_txt_col(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def hash_generate(self, txt):
        m = hashlib.md5()
        text = str(txt)
        long_text = ' '.join(text.split())
        m.update(long_text.encode('utf-8'))
        text_hash = m.hexdigest()
        return text_hash

    def transform(self, x):
        return x[self.key].apply(lambda z: self.hash_generate(z)).values

class cust_regression_vals(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = x.drop(['Gene', 'Variation', 'ID', 'Text'], axis=1)
        return x.values

fp = pipeline.Pipeline([
    ('union', pipeline.FeatureUnion([
        ('hash', cust_txt_col('Text')),         # can pass in either a pipeline
        ('normalized', cust_regression_vals())  # or a transformer
    ]))
])
When I run this, I receive the following error:
ValueError: all the input arrays must have same number of dimensions
Can you please tell me what is wrong with my code?
If I run the classes one by one, for cust_txt_col I get the output below:
['3e909f222a1e06098ec7ca1ea7e84540' '1691bdba3b75df145169e0501369fce3'
'1691bdba3b75df145169e0501369fce3' ..., 'e11ec9863aaeb93f77a231319021e14d'
'851c517b2af0a46cb9bc9373b748b6ff' '0ffe46fc75d21a5347b1f1a5a84526ad']
and for cust_regression_vals I get the output below:
[[qwerty],
[asdfgh]]
cust_txt_col is returning a 1-d array, while FeatureUnion requires that each constituent transformer return a 2-d array.
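A minimal sketch of the usual fix (my addition, not part of the answer above): have cust_txt_col.transform return a column vector via reshape(-1, 1), so that both transformers hand FeatureUnion arrays with the same number of dimensions:

def transform(self, x):
    # reshape(-1, 1) turns the flat array of hashes into an (n_samples, 1) column,
    # matching the 2-d output of cust_regression_vals for horizontal stacking
    return x[self.key].apply(lambda z: self.hash_generate(z)).values.reshape(-1, 1)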