I have a dataset with a few columns that have around 18000 unique values each.
It is impossible to use one_hot because it blows up in dimensionality and also runs out of memory.
A simple label encoder will still produce values in the range [0, 18000], which is not ideal. Perhaps this could be normalized to a fixed interval, e.g. [-1, 1].
How would one handle this issue?
Edit
Came up with this - I don't know if it's correct:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

class OrdinalEncoderAndStandardScalerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mean = None
        self.var = None
        self.encoding_dict = None

    def fit(self, x, y=None):
        # Learn the encoding and scaling here, so transform() can be reused
        # on unseen data without re-fitting.
        _x = x.to_numpy().reshape(-1, 1)
        self.ordinal_encoder = OrdinalEncoder()
        self.scaler = StandardScaler()
        _x = self.ordinal_encoder.fit_transform(_x)
        self.scaler.fit(_x)
        categories = self.ordinal_encoder.categories_[0]
        self.encoding_dict = dict(zip(categories, range(len(categories))))
        self.mean = self.scaler.mean_[0]
        self.var = self.scaler.var_[0]
        return self

    def transform(self, x, y=None):
        # x is expected to be a pandas Series holding one categorical column.
        _x = x.to_numpy().reshape(-1, 1)
        _x = self.ordinal_encoder.transform(_x)
        _x = np.squeeze(self.scaler.transform(_x))
        return pd.Series(_x, name=x.name)
Assuming the order is not meaningful, two possible ways come to mind:
a leave-one-out variety of mean target encoding (with cardinality this high, target leakage is a serious concern, so other varieties might be unsuitable);
hashing (trades some information loss for memory savings; see the sketch below).
There's a nice flowchart for encoding method selection at https://innovation.alteryx.com/encode-smarter/
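As a minimal sketch of the hashing option, here is scikit-learn's FeatureHasher; the column name and the n_features value are illustrative, and n_features is the knob for the memory vs. collision trade-off:

from sklearn.feature_extraction import FeatureHasher
import pandas as pd

# Toy data: one high-cardinality string column (stand-in for your 18000-value columns).
df = pd.DataFrame({"city": ["london", "paris", "tokyo", "paris"]})

# Each row is passed as a list with a single token; the hasher maps it into a
# fixed-width sparse vector, so memory stays bounded regardless of cardinality.
hasher = FeatureHasher(n_features=256, input_type="string")
hashed = hasher.transform([[v] for v in df["city"]])
print(hashed.shape)  # (4, 256), a scipy sparse matrix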
I have a DataFrame with 14 columns. I am using custom transformers to:
Select the desired columns from my DataFrame (five of the fourteen).
Select columns of a certain data type (categorical, object, int, etc.).
Perform preprocessing on the columns based on their type.
My custom ColumnSelector transformer is:
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
Followed by a custom TypeSelector:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
The original DataFrame, from which I select the desired columns, is df_with_types and has 981 rows. The columns I wish to extract are listed below along with their respective data types:
meeting_subject_stem_sentence : 'object',
priority_label_stem_sentence : 'object',
attendees: 'category',
day_of_week: 'category',
meeting_time_mins: 'int64'
I then construct my pipeline as follows:
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence',
                            'attendees', 'day_of_week', 'meeting_time_mins',
                            'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
        ))
    ])
)
The error thrown when I fit the pipeline to data is:
preprocess_pipeline.fit_transform(df_with_types)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 2, expected 981.
I have a hunch this is happening because of the TFIDF vectorizer. Fitting a pipeline with only the TFIDF vectorizer, without FeatureUnion:
the_pipe = Pipeline([('col_sel', ColumnSelector(columns=['meeting_subject_stem_sentence',
                                                         'attendees', 'day_of_week', 'meeting_time_mins',
                                                         'priority_label_stem_sentence'])),
                     ('type_selector', TypeSelector('object')),
                     ('tfidf', TfidfVectorizer())])
When I fit the_pipe:
a = the_pipe.fit_transform(df_with_types)
This gives me a 2x2 matrix instead of one with 981 rows:
(0, 0) 1.0
(1, 1) 1.0
Calling the feature names attribute using named_steps, I get
the_pipe.named_steps['tfidf'].get_feature_names()
[u'meeting_subject_stem_sentence', u'priority_label_stem_sentence']
It seems to be fitting only on the column names and not iterating through the documents. How do I achieve this in a Pipeline like the one above? Also, if I wanted to apply a pairwise distance/similarity function to each feature as part of the pipeline, after ColumnSelector and TypeSelector, what would I do?
An example would be...
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence',
                            'attendees', 'day_of_week', 'meeting_time_mins',
                            'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler(),
            'Pairwise manhattan distance between each element of the integer feature'
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc(),
            'Pairwise dice coefficient here'
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords),
            'Pairwise cosine similarity here'
        ))
    ])
)
Please help. Being a beginner, I have been racking my brain over this to no avail. I have gone through zac_stewart's blog and many other similar ones, but none seem to talk about how to use TFIDF with TypeSelector or ColumnSelector.
Thank you so much for all the help. Hope I formulated the question clearly.
EDIT 1:
If I use a TextSelector transformer, like the following...
class TextSelector(BaseEstimator, TransformerMixin):
    """Transformer that selects a text column from a DataFrame by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        '''the key passed here indicates the column name'''
        return X[self.key]
text_processing_pipe_line_1 = Pipeline([('selector', TextSelector(key='meeting_subject')),
                                        ('text_1', TfidfVectorizer(stop_words=stopWords))])
t = text_processing_pipe_line_1.fit_transform(df_with_types)
(0, 656) 0.378616399898
(0, 75) 0.378616399898
(0, 117) 0.519159384271
(0, 545) 0.512337545421
(0, 223) 0.425773433566
(1, 154) 0.5
(1, 137) 0.5
(1, 23) 0.5
(1, 355) 0.5
(2, 656) 0.497937369182
This works and it is iterating through the documents, so if I could make TypeSelector return a Series, that would work, right? Thanks again for the help.
Question 1
You have 2 columns that hold text:
meeting_subject_stem_sentence
priority_label_stem_sentence
Either apply TfidfVectorizer separately to each of them and then apply FeatureUnion, or just concatenate the strings into one column and treat that concatenation as a single document per row.
I guess this is the root of your problem, since TfidfVectorizer.fit() takes raw_documents, which needs to be an iterable that yields str. In your case it is an iterable that yields another iterable (holding 2 strings, one per text column).
Read the official docs for more info.
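A minimal sketch of the first option (one TfidfVectorizer per text column, combined with FeatureUnion), reusing the TextSelector, stopWords and df_with_types from your question:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

text_features = FeatureUnion(transformer_list=[
    ("subject_tfidf", Pipeline([
        ("selector", TextSelector(key="meeting_subject_stem_sentence")),  # yields a Series of str
        ("tfidf", TfidfVectorizer(stop_words=stopWords)),
    ])),
    ("priority_tfidf", Pipeline([
        ("selector", TextSelector(key="priority_label_stem_sentence")),
        ("tfidf", TfidfVectorizer(stop_words=stopWords)),
    ])),
])

tfidf_matrix = text_features.fit_transform(df_with_types)  # sparse matrix with 981 rows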
Question 2
You cannot use pairwise similarity/distances as part of the pipeline because they are not transformers. Transformers transform each sample independently of the others, whereas a pairwise metric needs 2 samples at the same time. However, you can simply compute it after you fit_transform the pipeline, via sklearn.metrics.pairwise.pairwise_distances.
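For example, a small sketch assuming the preprocess_pipeline and df_with_types from your question:

from sklearn.metrics.pairwise import pairwise_distances

X_transformed = preprocess_pipeline.fit_transform(df_with_types)
# 981 x 981 matrix of cosine distances between all pairs of rows
cosine_dist = pairwise_distances(X_transformed, metric="cosine")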
How do I select specific rows from a python dataframe for an ols regression in PANDAS?
I have a pandas dataframe with 1,000 rows. I want to regress column A on columns B + C, for the first 10 rows. When I type:
mod = pd.ols(y=df['A'], x=df[['B','C']], window=10)
I get the regression results for rows 991-1000. How do I specify that I want the FIRST (or second, etc.) 10 rows?
Thanks in advance.
I think you can use iloc:
mod = pd.ols(y=df['A'].iloc[2:12], x=df[['B','C']].iloc[2:12], window=10)
Or ix:
mod = pd.ols(y=df.ix[2:12, 'A'], x=df.ix[2:12, ['B', 'C']], window=10)
If you need all groups use range:
for i in range(10):
#print i, i+10
mod = pd.ols(y=df['A'].iloc[i:i + 10], x=df[['B','C']].iloc[i:i + 10], window=10)
If you need help with ols, try help(pd.ols) in IPython, because this function is missing from the pandas docs:
In [79]: help(pd.ols)
Help on function ols in module pandas.stats.interface:
ols(**kwargs)
Returns the appropriate OLS object depending on whether you need
simple or panel OLS, and a full-sample or rolling/expanding OLS.
Will be a normal linear regression or a (pooled) panel regression depending
on the type of the inputs:
y : Series, x : DataFrame -> OLS
y : Series, x : dict of DataFrame -> OLS
y : DataFrame, x : DataFrame -> PanelOLS
y : DataFrame, x : dict of DataFrame/Panel -> PanelOLS
y : Series with MultiIndex, x : Panel/DataFrame + MultiIndex -> PanelOLS
Parameters
----------
y: Series or DataFrame
See above for types
x: Series, DataFrame, dict of Series, dict of DataFrame, Panel
weights : Series or ndarray
The weights are presumed to be (proportional to) the inverse of the
variance of the observations. That is, if the variables are to be
transformed by 1/sqrt(W) you must supply weights = 1/W
intercept: bool
True if you want an intercept. Defaults to True.
nw_lags: None or int
Number of Newey-West lags. Defaults to None.
nw_overlap: bool
Whether there are overlaps in the NW lags. Defaults to False.
window_type: {'full sample', 'rolling', 'expanding'}
'full sample' by default
window: int
size of window (for rolling/expanding OLS). If window passed and no
explicit window_type, 'rolling' will be used as the window_type
Panel OLS options:
pool: bool
Whether to run pooled panel regression. Defaults to true.
entity_effects: bool
Whether to account for entity fixed effects. Defaults to false.
time_effects: bool
Whether to account for time fixed effects. Defaults to false.
x_effects: list
List of x's to account for fixed effects. Defaults to none.
dropped_dummies: dict
Key is the name of the variable for the fixed effect.
Value is the value of that variable for which we drop the dummy.
For entity fixed effects, key equals 'entity'.
By default, the first dummy is dropped if no dummy is specified.
cluster: {'time', 'entity'}
cluster variances
Examples
--------
# Run simple OLS.
result = ols(y=y, x=x)
# Run rolling simple OLS with window of size 10.
result = ols(y=y, x=x, window_type='rolling', window=10)
print(result.beta)
result = ols(y=y, x=x, nw_lags=1)
# Set up LHS and RHS for data across all items
y = A
x = {'B' : B, 'C' : C}
# Run panel OLS.
result = ols(y=y, x=x)
# Run expanding panel OLS with window 10 and entity clustering.
result = ols(y=y, x=x, cluster='entity', window_type='expanding', window=10)
Returns
-------
The appropriate OLS object, which allows you to obtain betas and various
statistics, such as std err, t-stat, etc.
I'm loading a CSV with NumPy as the dataset for a decision tree model in Python. The extract below places columns 0-7 in X and the last column in Y as the target.
import numpy as np
from sklearn import tree

# load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:, 0:7]  # identify columns as features
Y = data[:, 8]    # identify last column as target

# create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
What I'd like to know is whether it's possible to have the target in any column. For example, if it's in the fourth column, would the following code still fit the model correctly, or would it produce errors when it comes to predicting?
# load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:, 0:8]  # identify columns as features
Y = data[:, 3]    # identify fourth column as target

# create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
If you have >4 columns, and the 4th one is the target and the others are features, here's one way (out of many) to load them:
# load data
X = np.hstack([data[:, :4], data[:, 5:]])  # features: every column except column 4
Y = data[:, 4]                             # target: column 4
# process X & Y
(with belated thanks to @omerbp for reminding me hstack takes a tuple/list, not naked arguments!)
First of all, as suggested by @mescalinum in a comment to the question, think of this situation:
.... 4th_feature ... label
.... 1 ... 1
.... 0 ... 0
.... 1 ... 1
............................
In this example, the classifier (any classifier, not DecisionTreeClassifier in particular) will learn that the 4th feature can best predict the label, since the 4th feature is the label. Unfortunately, this issue happens a lot (by accident, I mean).
Secondly, if you want the 4th feature to be the label, you can just swap the columns:
arr[:,[frm, to]] = arr[:,[to, frm]]
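For instance, a small sketch of that approach (the toy array and column indices are illustrative):

import numpy as np

def swap_cols(arr, frm, to):
    # swap two columns of a 2-D array in place
    arr[:, [frm, to]] = arr[:, [to, frm]]

data = np.arange(20.0).reshape(4, 5)  # toy data with 5 columns
swap_cols(data, 3, 4)                 # move the target (4th column, index 3) to the end
X, Y = data[:, :-1], data[:, -1]      # then split features and target as usual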
@Ahmed Fasih's answer can also do the trick; however, it's around 10 times slower:
import timeit

setup_code = """
import numpy as np
i, j = 400000, 200
my_array = np.arange(i*j).reshape(i, j)
"""

swap_cols = """
def swap_cols(arr, frm, to):
    arr[:,[frm, to]] = arr[:,[to, frm]]
"""

stack = "np.hstack([my_array[:, :3], my_array[:, 5:]])"
swap = "swap_cols(my_array, 4, 8)"

print("hstack - total time:", min(timeit.repeat(stmt=stack, setup=setup_code, number=20, repeat=3)))
# hstack - total time: 3.29988478635
print("swap - total time:", min(timeit.repeat(stmt=swap, setup=setup_code + swap_cols, number=20, repeat=3)))
# swap - total time: 0.372791106328
How do I perform stepwise regression in Python? There are methods for OLS in SciPy, but I am not able to do stepwise regression. Any help in this regard would be greatly appreciated. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
You may try mlxtend, which has various selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)
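After fitting, the selected column indices and the cross-validated score can be inspected (attribute names as in recent mlxtend releases, so check them against your installed version):

print(sfs1.k_feature_idx_)  # tuple of selected column indices
print(sfs1.k_score_)        # cross-validated r2 of the selected subset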
You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use them with Pipeline and GridSearchCV.
The essential part of my code is as follows:
# The methods below belong to the stepwise-selection class in the repository;
# they assume `import pandas as pd` and `import statsmodels.api as sm`.

################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq,
            "predictors_index": feature_index}

################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(), 'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(), 'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check in your kernel the p-values of your variables, shown as 'P>|t|'. Then look for the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed all the columns whose p-value is higher than the significance level (e.g. 0.05). In the end, your variable X_opt will contain only the optimal variables, all with p-values below the significance level.
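A minimal sketch of that loop, assuming X is a NumPy array of the candidate columns (including the intercept column) and y is the target:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance=0.05):
    # Repeatedly drop the column with the highest p-value until every
    # remaining p-value is at or below the significance level.
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        pvals = np.asarray(model.pvalues)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= significance:
            return cols, model
        cols.pop(worst)
    raise ValueError("All columns were eliminated.")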
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm - a fitted statsmodels OLS model, sm.OLS(Y, X).fit(), where X is an array of n ones (n being the number of data points) and Y is the response in the training data
curr_preds - a list containing ['const']
potential_preds - a list of all potential predictors
tol - optional; the maximum p-value, .05 if not specified
There also needs to be a pandas DataFrame X_mix that has all of the data, including 'const', and all of the data corresponding to the potential predictors.
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    # assumes `import statsmodels.api as sm` and the globals `y` and `X_mix`
    # described above
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1        # this will record the current index
        best_r_squared = lm.rsquared_adj  # record the r squared of the current model
        # loop to determine if any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            # create a model with the current predictors plus one additional potential predictor
            lm_new = sm.OLS(y, X_mix[preds]).fit()
            new_r_sq = lm_new.rsquared_adj  # record r squared for new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:
            # a potential predictor improved the r-squared;
            # remove it from potential_preds and add it to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:
            # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # fit a new lm using the new predictors, look at the p-values
        pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
        pval_too_big = []
        # make a list of all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':
                # the p-value is too large, so add the feature to the list
                pval_too_big.append(feat)
        # now remove all the features from curr_preds that have a p-value that is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)