Order of priors in sklearn LinearDiscriminantAnalysis - python

I'm fitting a Linear Discriminant Analysis model using the stock market data (Smarket.csv) from here. I'm trying to predict Direction with columns Lag1 and Lag2. Direction has two values: Up or Down.
Here is my reproducible code and the result:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
url="https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Smarket.csv"
Smarket=pd.read_csv(url, usecols=range(1,10), index_col=0, parse_dates=True)
X_train = Smarket[:'2004'][['Lag1', 'Lag2']]
y_train = Smarket[:'2004']['Direction']
LDA = LinearDiscriminantAnalysis()
model = LDA.fit(X_train, y_train)
print(model.priors_)
[0.49198397 0.50801603]
How do I know which prior value corresponds to which class (Up or Down)? I looked at the documentation but there seems to be nothing.
Can someone explain it to me or point me to a resource that explains this?

Although I cannot find an explicit reference in the documentation (I'm sure there is a general one somewhere), in such cases the classes are ordered alphabetically, i.e. in your case the order is ['Down', 'Up'].
You can easily verify that this is consistent with your results. According to the documentation, when priors=None (as here) the priors_ attribute holds the class proportions inferred from the training data:
y_train.value_counts(normalize=True)
gives:
Up 0.508016
Down 0.491984
Name: Direction, dtype: float64
and
model.priors_[0] == y_train.value_counts(normalize=True)['Down']
# True
model.priors_[1] == y_train.value_counts(normalize=True)['Up']
# True
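The class order can also be read directly from the fitted model: scikit-learn classifiers expose a classes_ attribute, and priors_ follows the same order. A minimal sketch on synthetic data (the random features and labels below are placeholders, not the Smarket data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the two lag features and the Up/Down target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(rng.random(100) < 0.4, "Down", "Up")

model = LinearDiscriminantAnalysis().fit(X, y)

# classes_ holds the sorted class labels; priors_ is aligned with it
prior_by_class = dict(zip(model.classes_, model.priors_))
print(model.classes_)
print(prior_by_class)
```

Pairing classes_ with priors_ this way avoids relying on a remembered ordering rule.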


export SHAP waterfall plot to dataframe

I am working on a binary classification problem using a random forest model and neural networks, and I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get a waterfall plot:
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the waterfall plot.
However, I want to export the values behind it to a dataframe. How can I do that?
I expect the output to be a table of these values, and I want to export it for the full dataframe, not just one row. Can you help me please?
Let's do a small experiment:
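import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer
# as_frame=True returns X as a DataFrame, so X.columns is available below
X, y = load_breast_cancer(return_X_y=True, as_frame=True)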
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you run dir(explainer), you'll find that it has a number of methods and attributes, among them:
explainer.expected_value
This is of interest to you because it is the base value to which the SHAP values are added.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
This hints that sv is a list of 2 objects, which are most probably the SHAP values for classes 0 and 1. These must be symmetric (whatever moves the prediction towards 1 moves it by exactly the same amount, with opposite sign, towards 0).
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
Q: How do I know?
A: Read docs and source code.
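If a per-row table in the spirit of the waterfall plot is wanted, the base value and one row of SHAP values can be combined with pandas alone. The sketch below assumes you already have those numbers; waterfall_frame is a hypothetical helper, and the values are placeholders rather than output from a real model:

```python
import numpy as np
import pandas as pd

def waterfall_frame(base_value, shap_row, feature_values, feature_names):
    """Tabulate one row's SHAP contributions, largest magnitude first."""
    df = pd.DataFrame({
        "feature": feature_names,
        "feature_value": feature_values,
        "shap_value": shap_row,
    })
    # Order rows by absolute contribution, largest first
    order = df["shap_value"].abs().sort_values(ascending=False).index
    df = df.loc[order].reset_index(drop=True)
    # Running total starting from the base value; the last entry is the model output
    df["cumulative"] = base_value + df["shap_value"].cumsum()
    return df

# Placeholder numbers, not output from a real explainer
names = ["f0", "f1", "f2"]
row_shap = np.array([0.10, -0.30, 0.05])
row_vals = np.array([1.0, 2.0, 3.0])
table = waterfall_frame(0.5, row_shap, row_vals, names)
print(table)
```

With real data, base_value would come from explainer.expected_value for the class of interest and shap_row from the matching row of the SHAP value array.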
If I recall correctly, you can do something like this with pandas:
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values)
To get the feature names, you can do something like this (if data_for_prediction is a DataFrame):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values, columns=feature_names)
I'm currently using this:
def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP values for the model, then for each prediction, and finally it returns the dataframe for the positive class (I'm working in an imbalanced context).
It is for a Tree explainer and not a waterfall, but it is basically the same.

Is there a way to print a list of the most important features of a LightGBM classifier model?

I'm running an LGBM classifier model and I'm able to use lgbm.plot_importance to plot the most important features, but I would prefer to have a list of these features instead. Does anybody know how to do this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
    params={"objective": "binary"},
    train_set=data,
    num_boost_round=10
)
# compute importances
importance_df = (
    pd.DataFrame({
        'feature_name': bst.feature_name(),
        'importance_gain': bst.feature_importance(importance_type='gain'),
        'importance_split': bst.feature_importance(importance_type='split'),
    })
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
Here's an example of the output.
  feature_name  importance_gain  importance_split
0    Column_22      1051.204456                 8
1    Column_23       862.363854                10
2    Column_27       262.272097                19
3     Column_7       161.842017                13
4    Column_21        66.431762                24
This says that, for example, feature Column_21 was used in more splits than the other top features, but the improvement those splits provided was much smaller than that of the 8 splits using Column_22.
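Since the question asks for a plain list, a table like this can be reduced to one with pandas. The sketch below rebuilds the example output above by hand (so it runs without training a model) and extracts the top names under each importance type:

```python
import pandas as pd

# Values copied from the example output above
importance_df = pd.DataFrame({
    "feature_name": ["Column_22", "Column_23", "Column_27", "Column_7", "Column_21"],
    "importance_gain": [1051.204456, 862.363854, 262.272097, 161.842017, 66.431762],
    "importance_split": [8, 10, 19, 13, 24],
})

# Top 3 feature names by each importance type, as ordinary Python lists
top_by_gain = importance_df.nlargest(3, "importance_gain")["feature_name"].tolist()
top_by_split = importance_df.nlargest(3, "importance_split")["feature_name"].tolist()

print(top_by_gain)
print(top_by_split)
```

The two lists differ, which is the same point made above: the ranking depends on which importance type you choose.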
Seems like you are using Sklearn API for Lightgbm. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = pd.DataFrame({'importance':importances,'features':features}).sort_values('importance', ascending=False).reset_index(drop=True)
feature_importance
Also you can plot importances:
lgb.plot_importance(model_name)

Partial fit or incremental learning for autoregressive model

I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not directly provide this functionality. As Kevin S mentioned, though, pmdarima has a wrapper that does, specifically its update method. Per its documentation: "Update the model fit with additional observed endog/exog values."
See example below around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the fitted model with the new observations
model.update(data_2)
I don't know how to achieve that with AutoReg. I think it can be done somehow, but you would need to evaluate the results manually or add the data yourself.
In ARIMA and SARIMAX, however, it's already implemented and simple to use.
For incremental learning there are three related methods on the fitted results object, all covered in the statsmodels documentation. apply uses the fitted parameters on new data unrelated to the original data. extend builds results for new data that continues on from the end of the original data. append builds results for the original data with the new data appended, and can optionally refit the parameters (refit=True).
Here is my example, which is different but similar:
import numpy as np
from statsmodels.tsa.api import ARIMA
data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)
print(prediction) # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction) # [800. 801. 802. 803. 804. 805. 806.]
This replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast compared to fit.

sklearn cross validation : The least populated class in y has only 1 members, which is less than n_splits=10

I'm working on a machine learning project and I'm stuck with this warning when I use cross-validation to find how many neighbours I need to get the best accuracy with KNN. Here's the warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
The dataset I'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this dataset we have several attributes, but we'll be using only "G1", "G2", "G3", "studytime", "freetime", "health", "famrel". All the values in those columns are integers.
https://i.stack.imgur.com/sirSl.png <- dataset example
Next, here's my first chunk of code, where I assign the train and test groups:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/gdrive')
import sklearn.model_selection
data=pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")
#print(data.head())
data = data[["G1", "G2", "G3", "studytime","freetime","health","famrel"]]
print(data)
predict = "G3"
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
print(y)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))
That's how I assign x and y. With len, I can see that x and y both have 649 rows, representing 649 students.
Here's the second chunk of code, where I run the cross-validation:
#CROSSVALIDATION
from sklearn.neighbors import KNeighborsClassifier
neighbors = list (range(2,30))
cv_scores=[]
#print(y_train)
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn,x_train,y_train,cv=11,scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(cv_scores)
plt.show()
The code is pretty self-explanatory, as you can tell.
The warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
happens in every iteration of the for loop.
Although this warning appears every time, plt.show() still plots a graph showing which number of neighbours gives the best accuracy. I don't know whether the plot, or the values in cv_scores, are accurate.
My question is:
How can my "class in y" have only 1 member? len(y) clearly says y has 649 instances, more than enough to be split into 59 groups of 11 members each. By "members", is it referring to instances in my dataset, or to columns/labels in the y group?
I'm not using stratify=y when I do the train/test split. It seems to be the number one solution to this warning, but it's useless in my case.
I've tried everything I've seen on Google/Stack Overflow and nothing helped. The dataset seems to be the problem, but I can't understand what's wrong.
I think your main mistake is that you are using KNeighborsClassifier while the target you want to predict appears to be numeric rather than categorical (G3 - final grade (numeric: from 0 to 20, output target)).
In this case, every distinct value of y is treated as a separate class or label. The message is telling you that some values of y appear only once in your dataset. For example, if the value 3 appears only once, that class has a single member. This is not an error, but it indicates that the model won't work correctly or accurately.
I therefore strongly recommend that you use sklearn.neighbors.KNeighborsRegressor.
This is the KNN variant for continuous targets rather than classes. With this model you shouldn't have the problem anymore. The predicted value will be the mean of the target values of the nearest neighbors (as many as you specified).
With these simple changes, your problem should be solved.
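A rough sketch of both points, on synthetic integer grades rather than the student dataset (the data and the choice of mean absolute error as the score are assumptions for illustration): counting how often each target value occurs shows what "members" means in the warning, and the regressor is cross-validated without stratification:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in: two grade-like features and a grade-like target in 0..20
rng = np.random.default_rng(42)
X = rng.integers(0, 21, size=(200, 2)).astype(float)
y = np.clip(np.round(X.mean(axis=1) + rng.normal(0, 1, 200)), 0, 20)

# "Members" of a class = number of rows sharing that exact target value
values, counts = np.unique(y, return_counts=True)
print(dict(zip(values, counts)))

# Same search over k as in the question, but with a regressor and an error-based score
neighbors = list(range(2, 30))
cv_scores = []
for k in neighbors:
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring="neg_mean_absolute_error")
    cv_scores.append(scores.mean())

best_k = neighbors[int(np.argmax(cv_scores))]
print(best_k)
```

The scores are negated errors, so the largest value (closest to zero) marks the best k.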

How to find the features names of the coefficients using scikit linear regression?

I use scikit-learn linear regression, and if I change the order of the features, the coefficients are still printed in the same order, so I would like to know the mapping between the features and the coefficients.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
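That check can be sketched as follows, on synthetic data with made-up column names a, b and c: the same model is fitted with two column orders, and the name-to-coefficient mapping comes out identical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known, noise-free linear relationship
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
y = 2.0 * X["a"] - 1.0 * X["b"] + 0.5 * X["c"]

cols_1 = ["a", "b", "c"]
cols_2 = ["c", "a", "b"]  # same features, shuffled

# coef_ follows the column order of the data passed to fit
coef_1 = dict(zip(cols_1, LinearRegression().fit(X[cols_1], y).coef_))
coef_2 = dict(zip(cols_2, LinearRegression().fit(X[cols_2], y).coef_))

print(coef_1)
print(coef_2)
```

Both dictionaries map a, b and c to approximately 2.0, -1.0 and 0.5, regardless of the order used for fitting.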
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
Robin posted a great answer, but I had to make one tweak to get it to work the way I wanted: selecting the dimension of the coef_ array that I needed, namely model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
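Whether the [0,:] indexing is needed depends on the shape of y passed to fit: a 1D target gives a 1D coef_, while a 2D target (for example a single-column DataFrame) gives a 2D coef_ with one row per target. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]

# 1D target: coef_ has shape (n_features,)
coef_1d = LinearRegression().fit(X, y).coef_
# 2D target with one column: coef_ has shape (n_targets, n_features) = (1, 2)
coef_2d = LinearRegression().fit(X, y.reshape(-1, 1)).coef_

print(coef_1d.shape, coef_2d.shape)
```

Calling np.ravel on coef_ is one way to handle both cases when there is a single target.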
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip does not check that its inputs are of equal length, so it is especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the values in its extra positions are silently dropped. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
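One way to guard against this is to check the lengths before pairing. The helper below, paired_coefficients, is a hypothetical name used for illustration. On Python 3.10 and later, zip(names, values, strict=True) raises a ValueError on a mismatch as well:

```python
def paired_coefficients(feature_names, coefficients):
    """Return {feature: coefficient}, refusing inputs of different lengths."""
    names = list(feature_names)
    values = list(coefficients)
    if len(names) != len(values):
        raise ValueError(
            f"{len(names)} feature names but {len(values)} coefficients"
        )
    return dict(zip(names, values))

# Matching lengths: works as expected
ok = paired_coefficients(["a", "b", "c"], [4, 5, 6])
print(ok)

# Mismatched lengths: raises instead of silently dropping the 7
try:
    paired_coefficients(["a", "b", "c"], [4, 5, 6, 7])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```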
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, as the feature names I needed were the columns of my train_data dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in model.coef_[0]. We can iterate over the column names and store each column name and its coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except the last one are used in training
columns = data.iloc[:,:-1].columns
coef_dict = {}
for i in range(0,len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
    data=model_1.coef_,
    index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
    data=np.random.random((10, 3)),
    columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coefficients and their respective feature names
coef_series = pd.Series(
    data=lr.coef_,
    index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64
