Semi-supervised Naive Bayes with NLTK [closed] - python

Closed 10 years ago.
I have built a semi-supervised version of NLTK's Naive Bayes in Python based on the EM (expectation-maximization) algorithm. However, in some iterations of EM I am getting negative log-likelihoods (the log-likelihoods of EM must be positive in every iteration), so I believe there must be some mistake in my code. After carefully reviewing my code, I have no idea why this is happening. It would be really appreciated if someone could spot any mistakes in the code below:
(Reference material of semi-supervised Naive Bayes)
EM-algorithm main loop
#initial assumptions:
#Bernoulli NB: only feature presence (value 1) or absence (value None) is computed
#initial data:
#C: classifier trained with labeled data
#labeled_data: an array of tuples (feature dic, label)
#features: dictionary that outputs feature dictionary for a given document id
for iteration in range(1, self.maxiter):

    #Expectation: compute probabilities for each class for each unlabeled document
    #An array of tuples (feature dictionary, probability dist) is built
    unlabeled_data = [(features[id], C.prob_classify(features[id])) for id in U]

    #Maximization: given the probability distributions of previous step,
    #update label, feature-label counts and update classifier C
    #gen_freqdists is a custom function, see below
    #gen_probdists is the original NLTK function
    l_freqdist_act, ft_freqdist_act, ft_values_act = self.gen_freqdists(labeled_data, unlabeled_data)
    l_probdist_act, ft_probdist_act = self.gen_probdists(l_freqdist_act, ft_freqdist_act, ft_values_act, ELEProbDist)
    C = nltk.NaiveBayesClassifier(l_probdist_act, ft_probdist_act)

    #Compute log-likelihood
    #NLTK Naive Bayes classifier prob_classify gives logprob(class) + logprob(doc|class)
    #for labeled data, sum logprobs output by the classifier for the label
    #for unlabeled data, sum logprobs output by the classifier for each label
    log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic, label) in labeled_data])
    log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic, ignore) in unlabeled_data for label in l_freqdist_act.samples()])

    #Continue until convergence
    if log_lh_old == "first":
        if self.debug: print "\tM: #iteration 1", log_lh, "(FIRST)"
        log_lh_old = log_lh
    else:
        log_lh_diff = log_lh - log_lh_old
        if self.debug: print "\tM: #iteration", iteration, log_lh_old, "->", log_lh, "(", log_lh_diff, ")"
        if log_lh_diff < self.log_lh_diff_min: break
        log_lh_old = log_lh
Custom function gen_freqdists, used to create the needed frequency distributions
def gen_freqdists(self, instances_l, instances_ul):
    l_freqdist = FreqDist()              #frequency distrib. of labels
    ft_freqdist = defaultdict(FreqDist)  #dictionary of freq. distrib. for ft-label pairs
    ft_values = defaultdict(set)         #dictionary of possible values for each ft (only 1/None)
    fts = set()                          #set of all fts

    #counts for labeled data
    for (ftdic, label) in instances_l:
        l_freqdist.inc(label, 1)
        for f in ftdic.keys():
            fts.add(f)
            ft_freqdist[label, f].inc(1, 1)
            ft_values[f].add(1)

    #counts for unlabeled data
    #we must compute the maximum a posteriori label estimate
    #and update label/ft occurrences accordingly
    for (ftdic, probs) in instances_ul:
        map_l = probs.max()        #label with highest probability
        map_p = probs.prob(map_l)  #probability of map_l
        l_freqdist.inc(map_l, count=map_p)
        for f in ftdic.keys():
            fts.add(f)
            ft_freqdist[map_l, f].inc(1, count=map_p)
            ft_values[f].add(1)

    #features not appearing in documents get implicit None values
    for l in l_freqdist.samples():
        num_samples = l_freqdist[l]
        for f in fts:
            count = ft_freqdist[l, f].N()
            ft_freqdist[l, f].inc(None, num_samples - count)
            ft_values[f].add(None)

    #return the computed frequency distributions
    return l_freqdist, ft_freqdist, ft_values

I think you're summing the wrong values.
This is your code that is supposed to compute the sum of the log probs:
#Compute log-likelihood
#NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class))
#for labeled data, sum logprobs output by the classifier for the label
#for unlabeled data, sum logprobs output by the classifier for each label
log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic,label) in labeled_data])
log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic,ignore) in unlabeled_data for label in l_freqdist_act.samples()])
According to the NLTK documentation, prob_classify (on NaiveBayesClassifier) returns a ProbDistI object (not logprob(class) + logprob(doc|class)). When you get this object, you're calling its prob method for a given label. You probably want to call logprob, and negate that return as well.
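For reference, a minimal sketch of that change, keeping the original loop structure and only swapping prob for logprob (in NLTK, logprob returns the base-2 log of P(label|doc), so these sums will be less than or equal to zero):
# sketch only: same structure as in the question, but summing log-probabilities
log_lh = sum(C.prob_classify(ftdic).logprob(label)
             for (ftdic, label) in labeled_data)
log_lh += sum(C.prob_classify(ftdic).logprob(label)
              for (ftdic, ignore) in unlabeled_data
              for label in l_freqdist_act.samples())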

Related

Strange Behavior in Emukit When Combining Multifidelity and Experimental Design

I have a Python code that aims to build several multi-fidelity models (one for each of several variables) and use Emukit's experimental design functions to update them iteratively. I am using simple uncertainty acquisition (ModelVariance) and the multi-fidelity-wrapped gradient optimizer, as shown in the examples here and here.
I started by applying this technique to only one of my several variables. When doing that I noticed that 1) all update points (x_new) seemed to be selected from the LF model, and 2) the variance dropped precipitously everywhere after adding only a single update point. I shrugged this off initially and applied the technique to all my variables (using a loop over a dictionary to do each variable in turn). When I did that, I discovered that the mean predictions (new model points) seemed perfectly reasonable, but the reported variances from .predict() for ALL the models of ALL the variables were exactly the same, and were in fact what the program had given me when doing just the single variable.
Something seems to be going very wrong in finding and updating the variances after adding a new training point and using .set_data to update the model, and I am not sure what or where the problem is. Is there an Emukit bug? Am I using an incorrect setting? Is the problem with my dictionaries or for-loops? I am at a loss. Can anyone offer some insight?
Here is the code I currently have, somewhat redacted. I am sorry that it's such a long read....
# SKIPPING GENERAL IMPORTS
def make_mf(x, y, kernel, fidels):
    # Generic multifidelity model builder.
    # Returns a multifidelity model built based on the training points (x and y),
    # kernels, and number of fidelities
    mf_lin_model = GPyLinearMultiFidelityModel(x, y, kernel, n_fidelities=fidels)
    # set up loop to fix noise to 0 for all fidelities, indicating training points are exact
    for i in range(fidels):
        if i == 0:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise.fix(0)"
        else:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise_" + str(i) + ".fix(0)"
        eval(caller)
    ## Wrap the model using the given 'GPyMultiOutputWrapper'
    mf_model = model = GPyMultiOutputWrapper(mf_lin_model, 2, n_optimization_restarts=5, verbose_optimization=False)
    # Fit the model
    mf_model.optimize()
    # Return the final model to the calling procedure
    return mf_model
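# Note: a possible eval-free alternative for the noise-fixing loop in make_mf
# above (a sketch only, relying on the same GPy attribute names as in the code):
#     name = "Gaussian_noise" if i == 0 else "Gaussian_noise_" + str(i)
#     getattr(mf_lin_model.mixed_noise, name).fix(0)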
np.random.seed(20)
# list of y (result variables)
yvars=["VAR1","VAR2","VAR3"]
#list of x (input) variables
xvars=["XVAR"]
# list of fidelity levels. levels should be in order of ascending fidelity (0=lowest)
levels=["lf","hf"]
# list of what we'll need to store for each variable and level
# these are the model itself, the predicted values for plotting,
# and the predicted values at the training points
contents=['surrogate','y_plot','y_train']
# list of medium_fidelity variables
# these are the training coordinates, the model, predicted values for plotting,
# predicted variances, the maximum and mean variance, and predicted
# values at the training points
multifivars=['y_plot','variance','varmax','varmean','pl_train']
mainvars=['model','x_train','y_train']
# set up a dictionary to store the models and related results for each y-variable
# and each fidelity
MyModels={key:{lkey:{ckey:None for ckey in contents} for lkey in levels} for key in yvars}
# Set up a dictionary for the multi-fidelity models
MultiFidelity={key:{vkey: None for vkey in mainvars}for key in yvars}
for key in MultiFidelity.keys():
    for level in levels:
        MultiFidelity[key][level] = {mkey: None for mkey in multifivars}
#set up a dictionary to easily access data
MyData={key:None for key in levels}
# set up a dictionaries to easily access training and plotting points
x_train={key:None for key in levels}
Y_plot={key:None for key in levels}
T_plot={key:None for key in levels}
# Number of initial points evaluated at each fidelity level
npoints=[5,2]
MyPoints={levels[i]:npoints[i] for i in range(len(levels))}
## SKIPPED THE SECTION WHERE I READ IN THE RAW DATA
# High sampling of models for plotting of functions
x_plot = np.linspace(2, 16, 200)[:, None]
# set up points for plotting and retrieving MF model
X_plot = convert_x_list_to_array([x_plot, x_plot])
for i in range(len(levels)):
    Y_plot[levels[i]] = X_plot[i*len(x_plot):(i+1)*len(x_plot)]
Y_plot_h = X_plot[len(x_plot):]
# Sampling for training for multi-fidelity analysis
x_train[levels[0]] = np.atleast_2d(np.random.rand(MyPoints[levels[0]])*14+2).T
for i in range(1, len(levels)):
    x_train[levels[i]] = np.atleast_2d(np.random.permutation(x_train[levels[i-1]])[:MyPoints[levels[i]]])
#x_train_h = np.atleast_2d([3, 9.5, 11, 15]).T
# set up points for plotting mf result at training points
X_train = convert_x_list_to_array([x_train[levels[0]], x_train[levels[0]]])
for i in range(len(levels)):
    T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
#print(X_train)
# combine the training points of all fidelity levels into a list of arrays
xtemp = []
for level in levels:
    xtemp.append(x_train[level])
kernels = [GPy.kern.RBF(1), GPy.kern.RBF(1)]
lin_mf_kernel = emukit.multi_fidelity.kernels.LinearMultiFidelityKernel(kernels)
for var in MyModels.keys():
    ytemp = []
    for level in levels:
        # use SciPy interpolate to build surrogate for given variable and fidelity level
        MyModels[var][level]['surrogate'] = interpolate.interp1d(MyData[level]['Coll'], MyData[level][var])
        # find y-values for training MF points and append to a list of arrays
        MyModels[var][level]['y_train'] = MyModels[var][level]['surrogate'](x_train[level])
        ytemp.append(MyModels[var][level]['y_train'])
        MyModels[var][level]['y_plot'] = MyModels[var][level]['surrogate'](x_plot)
    ## Convert lists of arrays to ndarrays augmented with fidelity indicators
    MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'] = convert_xy_lists_to_arrays(xtemp, ytemp)
    # Build the multi-fidelity model
    ## Construct a linear multi-fidelity model
    MultiFidelity[var]['model'] = make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel, len(levels))
    # Get multifidelity model values and variances at plotting points
    for level in levels:
        MultiFidelity[var][level]['y_plot'], MultiFidelity[var][level]['variance'] = MultiFidelity[var]['model'].predict(Y_plot[level])
        # find maximum and average variance to measure the accuracy of the MF model
        MultiFidelity[var][level]['varmax'] = np.amax(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['varmean'] = np.mean(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
for key in MyModels.keys():
    for level in levels:
        print(key, level, MultiFidelity[key][level]['varmax'], MultiFidelity[key][level]['varmean'])
# set up the parameter space. we are scanning in x between 2 and 16 to match the range of my input
parameter_space = ParameterSpace([ContinuousParameter('x', 2, 16), InformationSourceParameter(len(levels))])
# set up how we will look for the target of our search
optimizer = MultiSourceAcquisitionOptimizer(GradientAcquisitionOptimizer(parameter_space), parameter_space)
# Plot each variable vs X for BEFORE any new points are added
for var in yvars:
    plot_vars(var, 0)
# Note: right now I am basing the acquisition function on the first variable ONLY. I intend to
# build a more complex function later when I get these bugs worked out.
acquisition = ModelVariance(MultiFidelity[yvars[0]]['model'])
# perform optimization to find the target point
x_new, val = optimizer.optimize(acquisition)
# x_new=np.atleast_2d(0)
# x_new[0][0]=np.random.rand()*14+2
print('first update points is',x_new)
# I want to manually specify that I add one HF training point and 4 LF training points,
# hence the way the following code is built. This could be a source of problems?
# construct our own version of the new data point because we will want it from the HF surrogate model
# (hence the value 1 in the final column)
new_point_x_hi = [[x_new[0][0],1.]]
# also, since this is an HF point, we include it as a training point in the LF model
new_point_x_lo = [[x_new[0][0],0.]]
# # we also append the new x-value to the training point x-array
x_train[levels[0]]=np.append(x_train[levels[0]],[[x_new[0][0]]],axis=0)
x_train[levels[1]]=np.append(x_train[levels[1]],[[x_new[0][0]]],axis=0)
# next, prepare points to allow the plotting of the training points on each model
X_train=convert_x_list_to_array([x_train[levels[0]],x_train[levels[0]]])
for i in range(len(levels)):
    T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
for var in yvars:
    # Now, for every variable in our list we add training points and update the models
    # find the corresponding y-values from the respective surrogates
    new_point_y_hi = np.atleast_2d(MyModels[var]['hf']['surrogate'](x_new[0][0]))
    new_point_y_lo = np.atleast_2d(MyModels[var]['lf']['surrogate'](x_new[0][0]))
    # Note that, as usual, we make these into 2D arrays to match EMUKit's formatting
    # now append the new point to our model's training data arrays
    MultiFidelity[var]['x_train'] = np.append(MultiFidelity[var]['x_train'], new_point_x_hi, axis=0)
    MultiFidelity[var]['y_train'] = np.append(MultiFidelity[var]['y_train'], new_point_y_hi, axis=0)
    MultiFidelity[var]['x_train'] = np.append(MultiFidelity[var]['x_train'], new_point_x_lo, axis=0)
    MultiFidelity[var]['y_train'] = np.append(MultiFidelity[var]['y_train'], new_point_y_lo, axis=0)
    # now we use .set_data to update the model based on the extended training data
    # MultiFidelity[var]['model'] = make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel, len(levels))
    MultiFidelity[var]['model'].set_data(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'])
    # and finally, re-calculate the values and variances at our plotting points to create an updated plot
    # MultiFidelity[var]['lf']['y_plot'], MultiFidelity[var]['lf']['variance'] = MultiFidelity[var]['model'].predict(Y_plot['lf'])
    # MultiFidelity[var]['hf']['y_plot'], MultiFidelity[var]['hf']['variance'] = MultiFidelity[var]['model'].predict(Y_plot['hf'])
    # MultiFidelity[var]['hf']['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot['hf'])
    # not forgetting to update the maximum and average variances
    for level in levels:
        # get new plotting points
        MultiFidelity[var][level]['y_plot'], MultiFidelity[var][level]['variance'] = MultiFidelity[var]['model'].predict(Y_plot[level])
        MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
        # find maximum and average variance to measure the accuracy of the MF model
        MultiFidelity[var][level]['varmax'] = np.amax(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['varmean'] = np.mean(MultiFidelity[var][level]['variance'])
        # report maximum and average variance
        print(var, level, 'max = ', MultiFidelity[var][level]['varmax'], 'mean = ', MultiFidelity[var][level]['varmean'])
    # Plot each variable vs Coll for rcas, helios and the low and high-fidelity models after the HF point is added
    plot_vars(var, 1)
# NOW DID THE SAME THING FOR A SEQUENCE OF 4 LF POINTS
I have tried using different acquisition functions and got the same behavior. I have also tried rebuilding the model from scratch using model.optimize() and only got stranger behavior.

Bayesian modeling of repeated binary measurements in PyMC3 (Python)

I am going to run a study in which multiple raters have to evaluate whether each of a number of papers is '1' or '0'. The reason I use multiple raters is that I suspect that each individual rater is likely to make mistakes, and I hope that by using multiple raters I can control for that.
My aim is to estimate the true proportion of '1' in the population of papers, and I want to do this using a Bayesian model in PyMC3. More general answers about model specification without the concrete implementation in PyMC3 are of course also welcome.
This is how I've simulated some data:
# imports assumed by the snippets below
import numpy as np
import pandas as pd
from scipy.stats import binom

n = 250  # number of papers we sample
p = 0.3  # true rate
true_sample = binom.rvs(1, 0.3, size=n)

# add error
def rating(array, error_rate):
    scores = []
    for i in array:
        scores.append(np.random.binomial(i, error_rate))
    return np.array(scores)

r = 10  # number of raters
r_error = np.random.uniform(0.7, 0.99, 10)  # how often does each rater rate a paper correctly

# get the data
rated_data = {}
for i in range(r):
    rated_data[f'rater_{i}'] = rating(true_sample, r_error[i])
df = pd.DataFrame(rated_data, index=[f'abstract_{i}' for i in range(250)])
This is the model I have tried:
import pymc3 as pm

with pm.Model() as binom_model2:
    p = pm.Beta('p', 0.5, 0.5)  # this is the proportion of '1' in the population
    for i in range(10):  # error_r and p for each rater separately
        er = pm.Beta(f'er{i}', 10, 3)
        prob = pm.Binomial(f'prob{i}', p=(p * er), n=n, observed=df.iloc[:, i].sum())
This seems to work fine, in that it gives good estimates of p and error_r (but do tell me if you think there are problems with the model!). However, it doesn't use all the information that is available, namely the fact that the ratings on each row of the dataframe are ratings of the same paper. I presume that a model that could incorporate this would give even more accurate estimates of p and of the error rates. I'm not sure how to do this, and any help would be appreciated.
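One possible way to encode the per-paper structure is a latent true label per paper, with each rater observing a noisy copy of it. The sketch below is untested; it reuses n, r and df from the simulation above, the variable names z, er and y are illustrative, and it assumes a symmetric error model (which differs slightly from the one-sided errors in the simulated data):
import pymc3 as pm

with pm.Model() as paper_model:
    p = pm.Beta('p', 0.5, 0.5)           # population rate of '1'
    er = pm.Beta('er', 10, 3, shape=r)   # per-rater accuracy
    z = pm.Bernoulli('z', p=p, shape=n)  # latent true label for each paper
    # probability that rater j reports '1' for paper i
    p_obs = z[:, None] * er[None, :] + (1 - z[:, None]) * (1 - er[None, :])
    y = pm.Bernoulli('y', p=p_obs, observed=df.values)
    trace = pm.sample()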

How is the most informative feature percentage calculated in Naive Bayes nltk python?

We usually get the following results when we run the command below:
classifier.show_most_informative_features(10)
Results:
Most Informative Features
outstanding = 1 pos : neg = 13.9 : 1.0
insulting = 1 neg : pos = 13.7 : 1.0
vulnerable = 1 pos : neg = 13.0 : 1.0
ludicrous = 1 neg : pos = 12.6 : 1.0
uninvolving = 1 neg : pos = 12.3 : 1.0
astounding = 1 pos : neg = 11.7 : 1.0
Does anybody know how the 13.9, 13.7, etc. are calculated?
Also, we can get the most informative features with classifier.show_most_informative_features(10) for Naive Bayes, but if we want the same results using logistic regression, could somebody please suggest a way to get that? I saw one post on Stack Overflow, but it requires a vector, which I am not using to create features.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Original Naive bayes accuracy percent: ", nltk.classify.accuracy(classifier,dev_set)* 100)
classifier.show_most_informative_features(10)
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(train_set)
print("LogisticRegression accuracy percent: ", nltk.classify.accuracy(LogisticRegression_classifier, dev_set)*100)
The most informative features for the Naive Bayes classifier in NLTK are documented as follows:
def most_informative_features(self, n=100):
    """
    Return a list of the 'most informative' features used by this
    classifier. For the purpose of this function, the
    informativeness of a feature ``(fname,fval)`` is equal to the
    highest value of P(fname=fval|label), for any label, divided by
    the lowest value of P(fname=fval|label), for any label:

    |  max[ P(fname=fval|label1) / P(fname=fval|label2) ]
    """
    # The set of (fname, fval) pairs used by this classifier.
    features = set()
    # The max & min probability associated w/ each (fname, fval)
    # pair.  Maps (fname,fval) -> float.
    maxprob = defaultdict(lambda: 0.0)
    minprob = defaultdict(lambda: 1.0)

    for (label, fname), probdist in self._feature_probdist.items():
        for fval in probdist.samples():
            feature = (fname, fval)
            features.add(feature)
            p = probdist.prob(fval)
            maxprob[feature] = max(p, maxprob[feature])
            minprob[feature] = min(p, minprob[feature])
            if minprob[feature] == 0:
                features.discard(feature)

    # Convert features to a list, & sort it by how informative
    # features are.
    features = sorted(features,
                      key=lambda feature_: minprob[feature_] / maxprob[feature_])
    return features[:n]
In the case of binary classification ('pos' vs 'neg'), where your features come from a unigram bag-of-words (BoW) model, the "information value" returned by the most_informative_features() function for the word outstanding is equal to:
p('outstanding'|'pos') / p('outstanding'|'neg')
The function iterates through all features (in the case of unigram BoW models, features are words) and then takes the top 100 words with the highest "information value".
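If you want to verify one of the printed ratios yourself, the per-label probability distributions are stored on the classifier. A quick sketch, assuming (as in the output above) that the labels are 'pos'/'neg' and that the feature name is the word itself with value 1:
# _feature_probdist maps (label, fname) -> a probability distribution over fvals
p_pos = classifier._feature_probdist['pos', 'outstanding'].prob(1)
p_neg = classifier._feature_probdist['neg', 'outstanding'].prob(1)
print(p_pos / p_neg)  # should be close to the reported 13.9 : 1.0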
The probability of a word given the tag is computed in the train() function using Expected Likelihood Estimation from ELEProbDist, which is a LidstoneProbDist object under the hood with the gamma argument set to 0.5, and it does:
class LidstoneProbDist(ProbDistI):
    """
    The Lidstone estimate for the probability distribution of the
    experiment used to generate a frequency distribution. The
    "Lidstone estimate" is parameterized by a real number *gamma*,
    which typically ranges from 0 to 1. The Lidstone estimate
    approximates the probability of a sample with count *c* from an
    experiment with *N* outcomes and *B* bins as
    ``(c+gamma)/(N+B*gamma)``. This is equivalent to adding
    *gamma* to the count for each bin, and taking the maximum
    likelihood estimate of the resulting frequency distribution.
    """

How to get predictive attributes of each target in `Random Forest`?

I've been messing around with Random Forest models lately, and they are really useful with the feature_importances_ attribute!
It would be useful to know which variables are more predictive of particular targets.
For example, what if the 1st and 2nd attributes were more predictive of distinguishing target 0, but the 3rd and 4th attributes were more predictive of target 1?
Is there a way to get the feature_importances_ array for each target separately? With sklearn, scipy, pandas, or numpy preferably.
# imports assumed by the snippet below
import pandas as pd
from sklearn.datasets import load_iris

# Iris dataset
DF_iris = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
Se_iris = pd.Series(load_iris().target,
                    index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                    name="Species")
# Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
# Split Data
X_tr, X_te, y_tr, y_te = train_test_split(DF_iris, Se_iris, test_size=0.3, random_state=0)
# Create model
Mod_rf = RandomForestClassifier(random_state=0)
Mod_rf.fit(X_tr, y_tr)
# Variable Importance
Mod_rf.feature_importances_
# array([ 0.14334485, 0.0264803 , 0.40058315, 0.42959169])
# Target groups
Se_iris.unique()
# array([0, 1, 2])
This is not really how RF works. Since there is no simple "feature voting" (which takes place in linear models), it is really hard to say what "feature X is more predictive for target Y" even means. What feature_importance of RF captures is "how probable it is, in general, to use this feature in the decision process".
The problem with addressing your question is that if you ask "how probable it is, in general, to use this feature in a decision process leading to label Y", you would have to run pretty much the same procedure but remove all subtrees which do not contain label Y in a leaf; this way you remove the parts of the decision process which do not address the problem "is it Y or not Y", but rather try to answer which "not Y" it is. However, in practice, due to the very stochastic nature of RF, cutting its depth, etc., this might barely change anything.
The bad news is also that I have never seen it implemented in any standard RF library. You could do it on your own, just the way I said:
for i = 1 to K (K is the number of distinct labels)
    tmp_RF = deepcopy(RF)
    for tree in tmp_RF:
        tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
    for x in X (X is your dataset)
        features_importance[i] += how_many_times_each_feature_is_used(tree, x) / |X|
    features_importance[i] /= |tmp_RF|
return features_importance
In particular, you could reuse the existing feature_importance code, simply by doing:
for i = 1 to K (K is the number of distinct labels)
    tmp_RF = deepcopy(RF)
    for tree in tmp_RF:
        tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
    features_importance[i] = run_regular_feature_importance(tmp_RF)
return features_importance
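If a rough approximation is acceptable, a much cruder alternative (not the subtree-pruning procedure above) is a one-vs-rest proxy: train one binary "is it class k or not" forest per class and read its feature_importances_. A minimal sketch with sklearn; per_class_importances is a name chosen here for illustration:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def per_class_importances(X, y, **rf_kwargs):
    importances = {}
    for k in np.unique(y):
        # binary target: class k vs everything else
        rf_k = RandomForestClassifier(**rf_kwargs).fit(X, (y == k).astype(int))
        importances[k] = rf_k.feature_importances_
    return importances

# e.g. per_class_importances(X_tr, y_tr, random_state=0)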

Stepwise Regression in Python

How do I perform stepwise regression in Python? There are methods for OLS in SciPy but I am not able to do stepwise selection. Any help in this regard would be a great help. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
You may try mlxtend, which has various selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)
# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)
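Once fitted, the selected columns can be read back. A small follow-up sketch, assuming X_train is a pandas DataFrame (k_feature_idx_ and k_score_ are attributes of the fitted mlxtend selector):
selected = list(sfs1.k_feature_idx_)       # indices of the chosen columns
X_train_selected = X_train.iloc[:, selected]
print(selected, sfs1.k_score_)             # chosen features and their CV score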
You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use Pipeline and GridSearchCV with my classes.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq, "predictors_index": feature_index}

################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq_adj
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(), 'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(), 'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check in your kernel the p-values of your variables, written as 'P>|t|'. Then check for the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed all the columns whose p-value is higher than the significance level (e.g. 0.05). In the end, your variable X_opt will contain only the optimal variables, with p-values below the significance level.
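A minimal loop automating that manual procedure, as a sketch (it assumes X is a NumPy array that already includes the column of ones and y is the response; backward_elimination is a name chosen here, not a statsmodels function):
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    cols = list(range(X.shape[1]))
    while cols:
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
        pvals = np.asarray(model.pvalues)
        if pvals.max() <= sl:
            return model, cols
        # drop the predictor with the highest p-value and refit
        cols.pop(int(pvals.argmax()))
    return None, cols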
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm, a statsmodels.OLS.fit(Y, X), where X is an array of n ones (n being the number of data points) and Y is the response in the training data
curr_preds, a list containing ['const']
potential_preds, a list of all potential predictors.
There also needs to be a pandas dataframe X_mix that has all of the data, including 'const', and all of the data corresponding to the potential predictors
tol, optional: the maximum p-value, .05 if not specified
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1        # this will record the current index
        best_r_squared = lm.rsquared_adj  # record the r-squared of the current model
        # loop to determine if any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            lm_new = sm.OLS(y, X_mix[preds]).fit()  # create a model with the current predictors plus one additional potential predictor
            new_r_sq = lm_new.rsquared_adj  # record r-squared for the new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:  # a potential predictor improved the r-squared; remove it from potential_preds and add it to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:  # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # fit a new lm using the new predictors, look at the p-values
        pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
        pval_too_big = []
        # make a list of all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':  # if the p-value is too large, add it to the list of big p-values
                pval_too_big.append(feat)
        # now remove all the features from curr_preds that have a p-value that is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)
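A hypothetical call, just to show the expected inputs (X_mix with a 'const' column and y must already exist; 'x1', 'x2' and 'x3' are placeholder column names):
import statsmodels.api as sm

curr = ['const']
candidates = ['x1', 'x2', 'x3']
base_lm = sm.OLS(y, X_mix[curr]).fit()
mixed_selection(base_lm, curr, candidates, tol=0.05)
print(curr)  # curr_preds is updated in place with the selected predictors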
