How to encode large number of unique categorical values?

How to encode large number of unique categorical values? - python

I have a dataset with a few columns that have around 18000 unique values each.
It is impossible to use one_hot because it blows up in dimensionality and also runs out of memory.
A simple label_encoder will still have values from {0, 18000} so it not ideal. Perhaps, this can be normalized between two values i.e.: {-1, 1}.
How would one handle this issue?
Edit
Came up with this - don't know if its correct
class OrdinalEncoderAndStandardScalerTransformer(BaseEstimator, TransformerMixin):
def __init__(self, mean=None, var=None, encoding_dict=None):
self.mean = mean
self.var = var
self.encoding_dict = encoding_dict
def fit(self, x, y=None):
self.ordinal_encoder = OrdinalEncoder()
self.scaler = StandardScaler()
return self
def transform(self, x, y=None):
series_name = x.name
_x = x.to_numpy().reshape(-1, 1)
_x = self.ordinal_encoder.fit_transform(_x)
categories = self.ordinal_encoder.categories_
self.encoding_dict = dict(zip((categories[0]), range(len(categories[0]))))
_x = np.squeeze(self.scaler.fit_transform(_x))
self.mean = self.scaler.mean_[0]
self.var = self.scaler.var_[0]
return pd.Series(_x, name=series_name)

Assuming the order is not meaningful, two possible ways come to mind:
leave one out variety of mean target encoding (with cardinality this high, target leakage is a serious concern, so other varieties might be unsuitable);
hashing (allows for memory conservation vs some information loss tradeoff).
There's a nice flowchart for encoding method selection at https://innovation.alteryx.com/encode-smarter/

Related

Is it possible to use train and test set with different features?

I'm new to ML and I have a little technical/methodology question. I had to create a custom transformer for my dataset. Basically I had a surface column and an independent 'type of use' (of a building) column and to bind them I've made a transformer which passes the type of use into a column and the corresponding surface into a variable :
class AreaPerUse(BaseEstimator, TransformerMixin):
# initializer
def __init__(self, columns = None):
# save the features list internally in the class
self.columns = columns
def fit(self, X, y = None):
return self
def transform(self, X, y = None):
#get all the 'uses' in a list
UsesList= X[['LargestPropertyUseType','SecondLargestPropertyUseType', 'ThirdLargestPropertyUseType' ]].to_numpy()
UsesList= np.unique(UsesList)
#creation of new features from this list
for newFeature in UsesList:
X[newFeature ]=0 #0 kbu by default
#loop which will allocate in variables the surface devoted to a use (column) and then delete the not useful columns
for i in UsesList:
X[i] = np.where(X['LargestPropertyUseType'] == i , X['LargestPropertyUseTypeGFA'], X[i])
X[i] = np.where(X['SecondLargestPropertyUseType'] == i , X['SecondLargestPropertyUseTypeGFA'], X[i])
X[i] = np.where(X['ThirdLargestPropertyUseType'] == i , X['ThirdLargestPropertyUseTypeGFA'], X[i])
#drop the no more needed columns
X.drop(['SecondLargestPropertyUseType', 'SecondLargestPropertyUseTypeGFA', 'LargestPropertyUseType', 'LargestPropertyUseTypeGFA', 'ThirdLargestPropertyUseType', 'ThirdLargestPropertyUseTypeGFA'], axis = 1, inplace = True)
return X
def get_features_names_out(self):
return X[self.columns].tolist()
It works well except that the number of columns generated is random (it depends on the number of 'type of use' in the train and the test set) and I can end up with a different number of columns between the train and the test . It makes sense at the dataset level but I don't know if it's a practice I can keep.
Sklearn doesn't return errors but several warnings about number of columns, 'Feature names seen at fit time but now missing' and 'Feature names not visible at fit time: "

Should we scale before the KElbowVisualizer method for clustering in python

I know that before any clustering we need to scale the data.
But I want to ask if the KElbowVisualizer method do the scaling by itself or before giving it the data I should scale it.
I already searched in the documentation of this method but I did not find an answer please can you share it with me if you find it. Thank you;

I looked at the implementation of KElbowVisualizer inyellowbrick/cluster/elbow.py at github and I havn't found any code under function fit (line 306) for scaling the X variables.
# https://github.com/DistrictDataLabs/yellowbrick/blob/main/yellowbrick/cluster/elbow.py
#...
def fit(self, X, y=None, **kwargs):
"""
Fits n KMeans models where n is the length of ``self.k_values_``,
storing the silhouette scores in the ``self.k_scores_`` attribute.
The "elbow" and silhouette score corresponding to it are stored in
``self.elbow_value`` and ``self.elbow_score`` respectively.
This method finishes up by calling draw to create the plot.
"""
self.k_scores_ = []
self.k_timers_ = []
self.kneedle = None
self.knee_value = None
if self.locate_elbow:
self.elbow_value_ = None
self.elbow_score_ = None
for k in self.k_values_:
# Compute the start time for each model
start = time.time()
# Set the k value and fit the model
self.estimator.set_params(n_clusters=k)
self.estimator.fit(X, **kwargs)
# Append the time and score to our plottable metrics
self.k_timers_.append(time.time() - start)
self.k_scores_.append(self.scoring_metric(X, self.estimator.labels_))
#...
So, you may need to scale your data (X parameters) before passing to KElbowVisualizer().fit()

Using standard and custom transformers in sklearn with pipeline and featureUnion throws error, TFIDF not iterating through the documents?

I have a DataFrame with 14 columns. I am using custom transformers to
Select desired columns from my DataFrame. Five columns from those fourteen.
Select column of certain data type (categorical, object, int etc)
Perform preprocessing on the column based on the type.
My custom ColumnSelector transformer is:
class ColumnSelector(BaseEstimator, TransformerMixin):
def __init__(self, columns):
self.columns = columns
def fit(self, X, y=None):
return self
def transform(self, X):
assert isinstance(X, pd.DataFrame)
try:
return X[self.columns]
except KeyError:
cols_error = list(set(self.columns) - set(X.columns))
raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
Followed by custom TypeSelector:
class TypeSelector(BaseEstimator, TransformerMixin):
def __init__(self, dtype):
self.dtype = dtype
def fit(self, X, y=None):
return self
def transform(self, X):
assert isinstance(X, pd.DataFrame)
return X.select_dtypes(include=[self.dtype])
The original DataFrame, from which I select desired columns is
df_with_types and has 981 rows. The columns I wish to extract are listed below along with there respective data types;
meeting_subject_stem_sentence : 'object',
priority_label_stem_sentence : 'object',
attendees: 'category',
day_of_week: 'category',
meeting_time_mins: 'int64'
I then proceed to construct my pipeline the following way
preprocess_pipeline = make_pipeline(
ColumnSelector(columns=['meeting_subject_stem_sentence',
'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence']),
FeatureUnion(transformer_list=[
("integer_features", make_pipeline(
TypeSelector('int64'),
StandardScaler()
)),
("categorical_features", make_pipeline(
TypeSelector("category"),
OneHotEnc()
)),
("text_features", make_pipeline(
TypeSelector("object"),
TfidfVectorizer(stop_words=stopWords)
))
])
)
The error thrown when I fit the pipeline to data is:
preprocess_pipeline.fit_transform(df_with_types)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 2, expected 981.
I have a hunch this is happening because of the TFIDF vectorizer. Fitting things only on the TFIDF vectorizer without FeatureUnion...
the_pipe = Pipeline([('col_sel', ColumnSelector(columns=['meeting_subject_stem_sentence',
'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence'])),
('type_selector', TypeSelector('object')), ('tfidf', TfidfVectorizer())])
When I fit the_pipe:
a = the_pipe.fit_transform(df_with_types)
This gives me a 2*2 matrix instead of 981.
(0, 0) 1.0
(1, 1) 1.0
Calling the feature names attribute using named_steps, I get
the_pipe.named_steps['tfidf'].get_feature_names()
[u'meeting_subject_stem_sentence', u'priority_label_stem_sentence']
It seems to be fitting only on the column names and not iterating through the documents. How do I achieve this in a Pipeline like the above one. Also, if I wanted to apply a pairwise distance/similarity function to each feature as part of the pipeline after ColumnSelector and TypeSelector, what must I do.
An example would be...
preprocess_pipeline = make_pipeline(
ColumnSelector(columns=['meeting_subject_stem_sentence',
'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence']),
FeatureUnion(transformer_list=[
("integer_features", make_pipeline(
TypeSelector('int64'),
StandardScaler(),
'Pairwise manhattan distance between each element of the integer feature'
)),
("categorical_features", make_pipeline(
TypeSelector("category"),
OneHotEnc(),
'Pairwise dice coefficient here'
)),
("text_features", make_pipeline(
TypeSelector("object"),
TfidfVectorizer(stop_words=stopWords),
'Pairwise cosine similarity here'
))
])
)
Please help. Being a beginner, I have been wracking my head with this to no avail. I have gone through zac_stewart's blog and many other similar ones but none seem to talk about how to use TFIDF with TypeSelector or ColumnSelector.
Thank you so much for all the help. Hope I formulated the question clearly.
EDIT 1:
If I use a TextSelector transformer, like the following...
class TextSelector(BaseEstimator, TransformerMixin):
""" Transformer that selects text column from DataFrame by key."""
def __init__(self, key):
self.key = key
def fit(self, X, y=None):
'''Create X attribute to be transformed'''
return self
def transform(self, X, y=None):
'''the key passed here indicates column name'''
return X[self.key]
text_processing_pipe_line_1 = Pipeline([('selector', TextSelector(key='meeting_subject')),
('text_1', TfidfVectorizer(stop_words=stopWords))])
t = text_processing_pipe_line_1.fit_transform(df_with_types)
(0, 656) 0.378616399898
(0, 75) 0.378616399898
(0, 117) 0.519159384271
(0, 545) 0.512337545421
(0, 223) 0.425773433566
(1, 154) 0.5
(1, 137) 0.5
(1, 23) 0.5
(1, 355) 0.5
(2, 656) 0.497937369182
This works and it is iterating through the documents, thus if I could make TypeSelector return a series, that would right? Thanks again for the help.

Question 1
You have 2 columns that hold text:
meeting_subject_stem_sentence
priority_label_stem_sentence
Either apply TfidfVectorizer separately on each of them and then apply FeatureUnion or just concatenate the strings into 1 column and view this concatenation as one document.
I guess this is the root of your problem, since TfidfVectorizer.fit() inputs raw_documents and it needs to be an iterable that yields str. In your case it is an iterable that yields another iterable (holding 2 strings - one per each text columns).
Read the official docs for more info.
Question 2
You cannot use pairwise similarity/distances as a part of the pipeline because it is not a transformer. Transformers transform each sample independently of each other whereas a pairwise metric needs 2 samples at the same time. However, you can simply compute it after you fit_transform the pipeline via metrics.pairwise.pairwise_distances.

Printing remaining features in Feature Reduction

I am running a feature reduction (from 500 to around 30) for a randomforest classifier algo. I can reduce the number of features, but I want to see what features are left at every point in the reduction.As you can see below, I have made an attempt, but does not work.
X does not contain the ColumnNames. Ideally, it could be possible to also have the columnnames in X but only fit from row, then printing X would be possible I think.
I am sure there is a much better way though...
Anybody know how to do this?
FEATURES = []
readThisFile = r'C:\ManyFeatures.txt'
featuresFile = open(readThisFile)
AllFeatures = featuresFile.read()
FEATURES = AllFeatures.split('\n')
featuresFile.close()
Location = r'C:\MASSIVE.xlsx'
data = pd.read_excel(Location)
X = np.array(data[FEATURES])
y = data['_MiniTARGET'].values
for x in range(533, 10,-100):
X = SelectKBest(f_classif, k=x).fit_transform(X, y)
#U=pd.DataFrame(X)
#print (U.feature_importances_)

Bayesian Statistics

I need to know how to find the Bayesian probability of two discrete distributions. For example the distributions are given as follows:
hypo_A=[ 0.1,0.4,0.5,0.0,0.0,0.0]
hypo_B=[ 0.1,0.1,0.1,0.3,0.3,0.1]
With a prior of both of them being equally likely
The Bayesian formula is given p(x/H) = (p(H/x)*p(x))/(summation(p(H/x`)*p(x`))).
Basically I need to know how to multiply these unequal distribution in python.

I highly recommended to read Think Bayes book.
Here is a simple implantation of Bayesian statistics with python I wrote:
from collections import namedtuple
hypothesis=namedtuple('hypothesis',['likelihood','belief'])
class DiscreteBayes:
def __init__(self):
"""initiates the hypothesis list"""
self.hypo=dict()
def normalize(self):
"""normalizes the sum of all beliefs to 1"""
s=sum([float(h.belief) for h in self.hypo.values()])
self.hypo=dict([(k,hypothesis(likelihood=h.likelihood,belief=h.belief/s)) for k,h in self.hypo.items()])
def update(self,data):
"""updates beliefs based on new data"""
if type(data)!=list:
data=[data]
for datum in data:
self.hypo=dict([(k,hypothesis(likelihood=h.likelihood,belief=h.belief*h.likelihood(datum))) for k,h in self.hypo.items()])
self.normalize()
def predict(self,x):
"""predict new data based on previously seen"""
return sum([float(h.belief)*float(h.likelihood(x)) for h in self.hypo.values()])
In your case:
hypo_A = [ 0.1,0.4,0.5,0.0,0.0,0.0]
hypo_B = [ 0.1,0.1,0.1,0.3,0.3,0.1]
d = DiscreteBayes()
d.hypo['hypo_A'] = hypothesis(likelihood=hypo_A.get ,belief=1)
d.hypo['hypo_B'] = hypothesis(likelihood=hypo_B.get ,belief=1)
d.normalize()
x = 1
d.update(x) #updating beliefs after seeing x
d.predict(x) #the probability of seeing x in the future
print (d.hypo)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to encode large number of unique categorical values? - python

Related

Is it possible to use train and test set with different features?

Should we scale before the KElbowVisualizer method for clustering in python

Using standard and custom transformers in sklearn with pipeline and featureUnion throws error, TFIDF not iterating through the documents?

Printing remaining features in Feature Reduction

Bayesian Statistics

Categories

Resources