How to get predictive attributes of each target in `Random Forest`? - python

I've been messing around with Random Forest models lately, and the feature_importances_ attribute is really useful!
It would be helpful to know which variables are more predictive of particular targets.
For example, what if the 1st and 2nd attributes were more predictive of distinguishing target 0, but the 3rd and 4th attributes were more predictive of target 1?
Is there a way to get the feature_importances_ array for each target separately? Preferably with sklearn, scipy, pandas, or numpy.
import pandas as pd
from sklearn.datasets import load_iris

# Iris dataset
DF_iris = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
Se_iris = pd.Series(load_iris().target,
                    index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                    name="Species")
# Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Split Data
X_tr, X_te, y_tr, y_te = train_test_split(DF_iris, Se_iris, test_size=0.3, random_state=0)
# Create model
Mod_rf = RandomForestClassifier(random_state=0)
Mod_rf.fit(X_tr,y_tr)
# Variable Importance
Mod_rf.feature_importances_
# array([ 0.14334485, 0.0264803 , 0.40058315, 0.42959169])
# Target groups
Se_iris.unique()
# array([0, 1, 2])

This is not really how RF works. Since there is no simple "feature voting" (as there is in linear models), it is hard to even define what "feature X is more predictive for target Y" means. What the feature_importances_ of an RF captures is, roughly, "how likely this feature is, in general, to be used in the decision process". The problem with your question is that if you ask "how likely is this feature, in general, to be used in a decision process leading to label Y", you would have to run essentially the same procedure but remove all subtrees that do not contain label Y in a leaf; this way you discard the parts of the decision process that do not address "is it Y or not Y" but instead try to decide which "not Y" it is. In practice, due to the very stochastic nature of RF, its limited depth, etc., this might barely remove anything. The bad news is that I have never seen this implemented in any standard RF library, but you could do it yourself, just the way I said:
for i = 1 to K    (K is the number of distinct labels)
    tmp_RF = deepcopy(RF)
    for tree in tmp_RF:
        tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
        for x in X    (X is your dataset)
            features_importance[i] += how_many_times_each_feature_is_used(tree, x) / |X|
    features_importance[i] /= |tmp_RF|
return features_importance
In particular, you could reuse the existing feature_importance code, simply by doing:
for i = 1 to K    (K is the number of distinct labels)
    tmp_RF = deepcopy(RF)
    for tree in tmp_RF:
        tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
    features_importance[i] = run_regular_feature_importance(tmp_RF)
return features_importance
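If a rough approximation is enough, a much simpler workaround than the subtree pruning described above is to fit one binary one-vs-rest forest per class and read off its regular feature_importances_. To be clear, this is not the pruning procedure above, just a crude proxy; a minimal sketch with sklearn (the helper name per_class_importance is mine):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def per_class_importance(X, y, **rf_kwargs):
    # one "class k vs. rest" forest per class; each forest's regular
    # feature_importances_ serves as a proxy for that class
    importances = {}
    for label in np.unique(y):
        rf = RandomForestClassifier(**rf_kwargs)
        rf.fit(X, y == label)
        importances[label] = rf.feature_importances_
    return importances

# e.g. per_class_importance(X_tr, y_tr, random_state=0)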

Related

Squared Error Relevance Area (SERA) implementation in Python as custom evaluation metric

I'm facing an imbalanced regression problem and I've already tried several ways to solve it. Eventually I came across this new metric called SERA (Squared Error Relevance Area), a custom scoring function for imbalanced regression described in this paper: https://link.springer.com/article/10.1007/s10994-020-05900-9
In order to calculate SERA you have to compute the relevance function phi, which is varied from 0 to 1 in small steps. For each value of relevance (phi) (e.g. 0.45), a subset of the training dataset is selected where the relevance is greater than or equal to that value. For that subset, the sum of squared errors is calculated, i.e. sum((y_true - y_pred)**2), which is known as the squared error relevance (SER). Then SER is plotted against phi and the area under the curve is computed, i.e. SERA.
Here is my implementation, inspired by this other question here in StackOverflow:
import pandas as pd
from scipy.integrate import simps
from sklearn.metrics import make_scorer

def calc_sera(y_true, y_pred, x_relevance=None):
    # create a list from 0 to 1 with 0.001 intervals
    start_range = 0
    end_range = 1
    interval_size = 0.001
    list_1 = [round(val * interval_size, 3) for val in range(1, 1000)]
    list_1.append(start_range)
    list_1.append(end_range)
    epsilon = sorted(list_1, key=lambda x: float(x))
    df = pd.concat([y_true, y_pred, x_relevance], axis=1, keys=['true', 'pred', 'phi'])
    # lists to store relevance (phi) and squared-error relevance (SER)
    relevance = []
    ser = []
    # for each phi, compute the sum of squared errors over the subset with relevance >= phi
    for phi in epsilon:
        relevance.append(phi)
        error_squared_sum = sum((df[df.phi >= phi]['true'] - df[df.phi >= phi]['pred'])**2)
        ser.append(error_squared_sum)
    # squared-error relevance area (SERA): numerical integration using simps(y, x)
    sera = simps(ser, relevance)
    return sera

sera = make_scorer(calc_sera, x_relevance=X['relevance'], greater_is_better=False)
I implemented a simple GridSearch using this score as an evaluation metric to select the best model:
model = CatBoostRegressor(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
parameters = {'depth': [6, 8, 10],
              'learning_rate': [0.01, 0.05, 0.1],
              'iterations': [100, 200, 500, 1000]}
clf = GridSearchCV(estimator=model,
                   param_grid=parameters,
                   scoring=sera,
                   verbose=0, cv=cv)
clf.fit(X=X.drop(columns=['relevance']),
        y=y,
        sample_weight=X['relevance'])
print("Best parameters:", clf.best_params_)
print("Lowest SERA: ", clf.best_score_)
I also added the relevance function as weights to the model so it could apply these weights in the learning task. However, the output I am getting is this:
Best parameters: {'depth': 6, 'iterations': 100, 'learning_rate': 0.01}
Lowest SERA: nan
Any clue why the SERA value is returning nan? Should I implement this another way?
Whenever you get unexpected NaN scores in a grid search, you should set the parameter error_score="raise" to get an error traceback, and debug from there.
In this case I think I see the problem though: sera is defined with x_relevance=X['relevance'], which includes all the rows of X. But in the search, you're cross-validating: each testing set has fewer rows, and those are what sera will be called on. I can think of a couple of options; I haven't tested either, so let me know if something doesn't work.
Use pandas index
In your pd.concat, just set join="inner". If y_true is a slice of the original pandas series (I think this is how GridSearchCV will pass it...), then the concatenation will join on those row indices, so keeping the whole of X['relevance'] is fine: it will just drop the irrelevant rows. y_pred may well be a numpy array, so you might need to set its index appropriately first?
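Here is an untested sketch of that first option, assuming y_true is passed as a slice of the original series (so it carries its row index) and y_pred as a plain array:
import pandas as pd
from scipy.integrate import simps

def calc_sera(y_true, y_pred, x_relevance=None):
    y_pred = pd.Series(y_pred, index=y_true.index)      # give y_pred the fold's row index
    # join="inner" keeps only the rows of x_relevance that belong to this fold
    df = pd.concat([y_true, y_pred, x_relevance], axis=1, join="inner",
                   keys=['true', 'pred', 'phi'])
    epsilon = [i / 1000 for i in range(1001)]            # 0.000, 0.001, ..., 1.000
    ser = [((df[df.phi >= phi]['true'] - df[df.phi >= phi]['pred']) ** 2).sum()
           for phi in epsilon]
    return simps(ser, epsilon)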
Keep relevance in X
Then your scorer can reference the relevance column directly from the sliced X. For this, you will want to drop that column from the fitting data, which could be difficult to do for the training but not the testing set; however, CatBoost has an ignored_features parameter that I think ought to work.
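An equally untested sketch of this second option, using a plain scorer callable (which receives the estimator and the sliced X, so it can pull out the relevance column itself) together with CatBoost's ignored_features:
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV, KFold
import pandas as pd

def sera_scorer(estimator, X, y):
    # X is the test-fold slice and still contains the 'relevance' column;
    # the model never trained on it because of ignored_features below
    y_pred = pd.Series(estimator.predict(X), index=X.index)
    return -calc_sera(pd.Series(y, index=X.index), y_pred, X['relevance'])  # negate: lower SERA is better

model = CatBoostRegressor(random_state=0, ignored_features=['relevance'])
clf = GridSearchCV(estimator=model, param_grid=parameters, scoring=sera_scorer, cv=cv)
clf.fit(X, y)   # no need to drop the relevance column yourself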

sklearn cross validation : The least populated class in y has only 1 members, which is less than n_splits=10

I'm working on a machine learning project and I'm stuck with this warning when I try to use cross-validation to find out how many neighbours I need to achieve the best accuracy in KNN. Here's the warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
The dataset I'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance
The dataset has several attributes, but we'll be using only "G1", "G2", "G3", "studytime", "freetime", "health", "famrel". All the values in those columns are integers.
https://i.stack.imgur.com/sirSl.png <- dataset example
Next, here's my first chunk of code, where I assign the train and test groups:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/gdrive')
import sklearn.model_selection
data=pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")
#print(data.head())
data = data[["G1", "G2", "G3", "studytime","freetime","health","famrel"]]
print(data)
predict = "G3"
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
print(y)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))
That's how I assign x and y. With len, I can see that x and y both have 649 rows, representing 649 students.
Here's the second chunk of code, where I do the cross-validation:
# CROSS-VALIDATION
from sklearn.neighbors import KNeighborsClassifier
neighbors = list(range(2, 30))
cv_scores = []
#print(y_train)
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=11, scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(cv_scores)
plt.show()
The code is pretty self-explanatory, as you can tell.
The warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
happens in every iteration of the for-loop
Although this warning happens every time, plt.show() still plots a graph of which number of neighbours gives the best accuracy; I don't know whether the plot, or the values in cv_scores, are accurate.
My question is:
How can a "class in y" have only 1 member, when len(y) clearly says y has 649 instances, more than enough to be split into 59 groups of 11 members each? By "members", does it mean instances in my dataset, or columns/labels in the y group?
I'm not using stratify=y when I do the train/test split; it seems to be the #1 suggested solution to this warning, but it's useless in my case.
I've tried everything I've seen on Google/Stack Overflow and nothing helped; the dataset seems to be the problem, but I can't understand what's wrong.
I think your main mistake is that you are using KNeighborsClassifier while the target you are predicting is continuous (G3 - final grade, numeric from 0 to 20) rather than categorical.
In this case, every single value of y is taken as a distinct class or label. The message is saying that in your dataset (in y) there are values that appear only once; for example, the value 3 appears only one time. This is not an error, but it indicates that the model won't work correctly or accurately.
I therefore strongly recommend using sklearn.neighbors.KNeighborsRegressor.
This is the KNN for continuous targets rather than classes. Using this model, you shouldn't have this problem anymore. The predicted value will be the mean of the values of the nearest neighbours you defined.
With this simple change, your problem will be solved.
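For example, the cross-validation loop from the question could be adapted along these lines (an untested sketch; note that 'accuracy' only makes sense for classifiers, so a regression metric is used instead):
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

neighbors = list(range(2, 30))
cv_scores = []
for k in neighbors:
    knn = KNeighborsRegressor(n_neighbors=k)
    # accuracy is a classification metric; score the regressor with (negated) MSE instead
    scores = cross_val_score(knn, x_train, y_train, cv=10, scoring='neg_mean_squared_error')
    cv_scores.append(scores.mean())
plt.plot(neighbors, cv_scores)
plt.show()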

Scikit-learn SVC always giving accuracy 0 on random data cross validation

In the following code I create a random sample set of size 50, with 20 features each. I then generate a random target vector composed of half True and half False values.
All of the values are stored in Pandas objects, since this simulates a real scenario in which the data will be given in that way.
I then perform a manual leave-one-out inside a loop, each time selecting an index, dropping its respective data, fitting the rest of the data using a default SVC, and finally running a prediction on the left-out data.
import random
import numpy as np
import pandas as pd
from sklearn.svm import SVC

n_samp = 50
m_features = 20

X_val = np.random.rand(n_samp, m_features)
X = pd.DataFrame(X_val, index=range(n_samp))
# print X_val

y_val = [True] * (n_samp/2) + [False] * (n_samp/2)
random.shuffle(y_val)
y = pd.Series(y_val, index=range(n_samp))
# print y_val

seccess_count = 0
for idx in y.index:
    clf = SVC()  # Can be inside or outside loop. Result is the same.

    # Leave-one-out for the fitting phase
    loo_X = X.drop(idx)
    loo_y = y.drop(idx)
    clf.fit(loo_X.values, loo_y.values)

    # Make a prediction on the sample that was left out
    pred_X = X.loc[idx:idx]
    pred_result = clf.predict(pred_X.values)
    print y.loc[idx], pred_result[0]  # Actual value vs. predicted value - always opposite!
    is_success = y.loc[idx] == pred_result[0]
    seccess_count += 1 if is_success else 0

print '\nSeccess Count:', seccess_count  # Almost always 0!
Now here's the strange part - I expect to get an accuracy of about 50%, since this is random data, but instead I almost always get exactly 0! I say almost always, since about every 10 runs of this exact code I get a few correct hits.
What's really crazy to me is that if I choose the answers opposite to those predicted, I will get 100% accuracy. On random data!
What am I missing here?
Ok, I think I just figured it out! It all comes down to our old machine learning foe - the majority class.
In more detail: I chose a target comprising 25 True and 25 False values - perfectly balanced. When performing the leave-one-out, this caused a class imbalance, say 24 True and 25 False. Since the SVC was set to default parameters, and run on random data, it probably couldn't find any way to predict the result other than choosing the majority class, which in this iteration would be False! So in every iteration the imbalance was turned against the currently-left-out sample.
All in all - a good lesson in machine learning, and an excellent mathematical riddle to share with your friends :)
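If you want to convince yourself, here is a small sketch (not the original code) that reproduces the effect with an explicit majority-class baseline under leave-one-out, using sklearn's DummyClassifier:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(50, 20)
y = np.array([True] * 25 + [False] * 25)
rng.shuffle(y)

# A model that always predicts the training fold's majority class: leaving one
# sample out makes its own class the minority, so every prediction is wrong.
scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=LeaveOneOut())
print(scores.mean())  # 0.0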

pyspark Linear Regression Example from official documentation - Bad results?

I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here)
I also found this question on Stack Overflow, which is essentially the same as mine. The answer there suggests tweaking the step size, which I also tried; however, the results are still just as random as without tweaking it. The code I'm using looks like this:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)

# Build the model
model = LinearRegressionWithSGD.train(parsedData, 100000, 0.01)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
The results look as follows:
(Expected Label, Predicted Label)
(-0.4307829, -0.7824231588143065)
(-0.1625189, -0.6234287565006766)
(-0.1625189, -0.41979307020176226)
(-0.1625189, -0.6517649080382241)
(0.3715636, -0.38543073492870156)
(0.7654678, -0.7329426818746223)
(0.8544153, -0.33273378445315)
(1.2669476, -0.36663240056848917)
(1.2669476, -0.47541427992967517)
(1.2669476, -0.1887811811672498)
(1.3480731, -0.28646712964591936)
(1.446919, -0.3425075015127807)
(1.4701758, -0.14055275401870437)
(1.4929041, -0.06819303631450688)
(1.5581446, -0.772558163357755)
(1.5993876, -0.19251656391040356)
(1.6389967, -0.38105697301968594)
(1.6956156, -0.5409707504639943)
(1.7137979, 0.14914490255841997)
(1.8000583, -0.0008818203337740971)
(1.8484548, 0.06478505759587616)
(1.8946169, -0.0685096804502884)
(1.9242487, -0.14607596025743624)
(2.008214, -0.24904211817187422)
(2.0476928, -0.4686214015035236)
(2.1575593, 0.14845590638215034)
(2.1916535, -0.5140996125798528)
(2.2137539, 0.6278134417345228)
(2.2772673, -0.35049969044209983)
(2.2975726, -0.06036824276546304)
(2.3272777, -0.18585219083806218)
(2.5217206, -0.03167349168036536)
(2.5533438, -0.1611040092884861)
(2.5687881, 1.1032200139582564)
(2.6567569, 0.04975777739217784)
(2.677591, -0.01426285133724671)
(2.7180005, 0.07853368755223371)
(2.7942279, -0.4071930969456503)
(2.8063861, 0.000492545291049501)
(2.8124102, -0.019947344959659177)
(2.8419982, 0.03023139779978133)
(2.8535925, 0.5421291261646886)
(2.9204698, 0.3923068894170366)
(2.9626924, 0.21639267973240908)
(2.9626924, -0.22540434628281075)
(2.9729753, 0.2363938458250126)
(3.0130809, 0.35136961387278565)
(3.0373539, 0.013876918415846595)
(3.2752562, 0.49970959078043126)
(3.3375474, 0.5436323480304672)
(3.3928291, 0.48746004196839055)
(3.4355988, 0.3350764608584778)
(3.4578927, 0.6127634045652381)
(3.5160131, -0.03781697409079157)
(3.5307626, 0.2129806543371961)
(3.5652984, 0.5528805608876549)
(3.5876769, 0.06299042506665305)
(3.6309855, 0.5648082098866389)
(3.6800909, -0.1588172848952902)
(3.7123518, 0.1635062564072022)
(3.9843437, 0.7827244309795267)
(3.993603, 0.6049246406551748)
(4.029806, 0.06372113813964088)
(4.1295508, 0.24281029469705093)
(4.3851468, 0.5906868686740623)
(4.6844434, 0.4055055537895428)
(5.477509, 0.7335244827296759)
Mean Squared Error = 6.83550847274
So, what am I missing? Since the data is from the official Spark documentation, I would guess that it should be well suited for linear regression (and give at least a reasonably good prediction)?
For starters you're missing an intercept. While mean values of the independent variables are close to zero:
parsedData.map(lambda lp: lp.features).mean()
## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
##               -0.0294, 0.0669])
the mean of the dependent variable is pretty far from it:
parsedData.map(lambda lp: lp.label).mean()
## 2.452345085074627
Forcing the regression line to go through the origin in a case like this doesn't make sense. So let's see how LinearRegressionWithSGD performs with default arguments and an added intercept:
model = LinearRegressionWithSGD.train(parsedData, intercept=True)
valuesAndPreds = (parsedData.map(lambda p: (p.label, model.predict(p.features))))
valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
## 0.44005904185432504
Let's compare it to the analytical solution:
import numpy as np
from sklearn import linear_model
features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
labels = np.array(parsedData.map(lambda lp: lp.label).collect())
lm = linear_model.LinearRegression()
lm.fit(features, labels)
np.mean((lm.predict(features) - labels) ** 2)
## 0.43919976805833411
As you can see, the results obtained using LinearRegressionWithSGD are almost optimal.
You could add a grid search, but in this particular case there is probably nothing to gain.

Stepwise Regression in Python

How do I perform stepwise regression in Python? There are methods for OLS in SciPy, but I am not able to do stepwise selection. Any help in this regard would be greatly appreciated. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I aim to select variables so that my model has the lowest p-value. The following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
You may try mlxtend, which has various selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)
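After fitting, the selected features can be inspected; if I recall the mlxtend API correctly, the attributes are:
print(sfs1.k_feature_idx_)   # indices of the selected features
print(sfs1.k_score_)         # cross-validated score of the selected subset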
You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use them with Pipeline and GridSearchCV.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq,
            "predictors_index": feature_index}

################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(), 'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(), 'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check the p-values of your variables, shown as 'P>|t|'. Then find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Remove that column from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed all the columns whose p-value is higher than the significance level (e.g. 0.05). In the end, X_opt will contain only the variables with p-values below the significance level.
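If you'd rather not repeat the steps by hand, here is a minimal, untested sketch that automates the loop described above; it assumes X is a NumPy array that already includes the constant column:
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    cols = list(range(X.shape[1]))
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X[:, cols]).fit()
        worst = int(np.argmax(regressor_OLS.pvalues))
        if regressor_OLS.pvalues[worst] <= sl:
            return regressor_OLS, cols   # every remaining p-value is below the significance level
        del cols[worst]                  # drop the predictor with the highest p-value and refit

# model, kept_columns = backward_elimination(X, y)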
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm, a fitted statsmodels OLS model, sm.OLS(Y, X).fit(), where X is an array of n ones (n being the number of data points) and Y is the response in the training data
curr_preds - a list containing ['const']
potential_preds - a list of all potential predictors
There also needs to be a pandas dataframe X_mix that has all of the data, including 'const', plus the columns corresponding to the potential predictors
tol, optional - the maximum p-value, 0.05 if not specified
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    while len(potential_preds) > 0:
        index_best = -1                       # records the index of the best predictor
        curr = -1                             # records the current index
        best_r_squared = lm.rsquared_adj      # adjusted r-squared of the current model
        # loop to determine whether any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1                         # increment current
            preds = curr_preds.copy()         # grab the current predictors
            preds.append(pred)
            # model with the current predictors plus one additional potential predictor
            lm_new = sm.OLS(y, X_mix[preds]).fit()
            new_r_sq = lm_new.rsquared_adj    # adjusted r-squared of the new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:
            # a potential predictor improved the r-squared; move it from potential_preds to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:
            # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # refit using the accepted predictors (so the baseline updates) and look at the p-values
        lm = sm.OLS(y, X_mix[curr_preds]).fit()
        pvals = lm.pvalues
        pval_too_big = []
        # collect all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':
                pval_too_big.append(feat)
        # now remove from curr_preds all the features whose p-value is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)
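A hypothetical call, just to illustrate the expected setup (df, the column names, and the target are placeholders, not from the original answer):
import pandas as pd
import statsmodels.api as sm

# df is assumed to hold the predictors and the response
X_mix = df[['x1', 'x2', 'x3']].copy()
X_mix['const'] = 1.0                              # the constant column the function expects
y = df['target']

lm0 = sm.OLS(y, X_mix[['const']]).fit()           # intercept-only starting model
curr_preds = ['const']
potential_preds = ['x1', 'x2', 'x3']
mixed_selection(lm0, curr_preds, potential_preds)
print(curr_preds)                                 # predictors kept by mixed selection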
