I’m running LDA on a dataset and the outcome was good across all metrics. However I can’t seem to extract the top features or loadings like I can for PCA.
Is anyone familiar with extracting top features / loadings from LDA when using sklearn python3?
Try this:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
X = training_input
y = training_label.ravel()
clf = LDA(n_components=1)
clf.fit(X, y)
clf.coef_
beste_Merkmal = np.argsort(clf.coef_)[0][::-1][0:25]
print('beste_Merkmal =', beste_Merkmal)
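One caveat with sorting the raw coefficients is that strongly negative weights end up at the bottom of the ranking, even though they matter as much as strongly positive ones. A minimal sketch of ranking by absolute value instead is shown here. The synthetic data from make_classification is only a stand-in (an assumption) for training_input and training_label, and the weights are only comparable across features when the features are on similar scales.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for training_input / training_label (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

clf = LinearDiscriminantAnalysis(n_components=1)
clf.fit(X, y)

# For a two-class problem coef_ has shape (1, n_features).
weights = clf.coef_[0]

# Rank by magnitude so that large negative weights are not pushed to the end.
top_k = 5
top_features = np.argsort(np.abs(weights))[::-1][:top_k]
print("top features:", top_features)
print("their weights:", weights[top_features])

# scalings_ holds the projection vectors applied by transform(); this is the
# closest analogue to PCA loadings.
loadings = clf.scalings_[:, 0]
print("loadings shape:", loadings.shape)
```

With more than two classes coef_ has one row per class, so the ranking would have to be done per row or on scalings_ instead.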
I am trying to build a linear SVC model from scikit-learn following the methods laid out in a paper by Hyun et al. (source: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007608#sec010). In the paper it states:
SVMs were implemented in scikit-learn, using square hinge loss weighted by class frequency to address class imbalance issues. L1 regularization was included to enforce sparsity for feature selection
I've tried to implement this myself using the following code:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.svm import LinearSVC
from numpy import mean, std
model = LinearSVC(penalty="l1", class_weight='balanced', loss='squared_hinge')
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=127)
n_scores = cross_val_score(model, X_data, Y_data, scoring="accuracy", cv=cv, n_jobs=-1)
Where the X data involved is a binary matrix of presence/absence of genes; the y data are binary phenotype classifiers (resistant = 1, susceptible = 0). Unfortunately, I cannot give access to the dataset.
However, upon return of my results (n_scores) all values are "nan". When I perform the same task again but set the penalty to l2, I get accuracy scores.
What is happening? And why doesn't it work?
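dual must be set to False. LinearSVC does not support penalty="l1" together with loss="squared_hinge" when dual=True, so every fit raises an error and cross_val_score records nan for each fold. Example: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html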
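A minimal sketch of the fix, assuming a synthetic binary matrix in place of the gene presence/absence data (X_demo and y_demo are made-up names, not the original X_data and Y_data). Passing error_score="raise" makes cross_val_score show the underlying error instead of silently returning nan, which helps when diagnosing this kind of problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=5, weights=[0.7, 0.3],
                                     random_state=0)
X_demo = (X_demo > 0).astype(int)  # mimic a presence/absence matrix

# The L1 penalty with squared hinge loss is only solved in the primal form.
model = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                  class_weight="balanced", max_iter=5000)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=127)
n_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy",
                           cv=cv, error_score="raise")
print("accuracy: %.3f (+/- %.3f)" % (n_scores.mean(), n_scores.std()))

# L1 regularization drives some coefficients to exactly zero.
model.fit(X_demo, y_demo)
n_selected = np.count_nonzero(model.coef_)
print("features with non-zero weight:", n_selected, "of", X_demo.shape[1])
```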
So I have this small dataset and I want to perform multiple linear regression on it.
First I dropped the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed, I kept it so that I can perform multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the (column name, coefficient) pairs and print them
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I get different coefficients every time I print them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
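A rough sketch of that comparison on the iris data, under a few assumptions: the first two iris columns play the role of miles and gasprice, the third plays the role of hours, and np.linalg.lstsq with an explicit constant column stands in for sm.add_constant plus sm.OLS (both solve the same least-squares problem).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]   # two predictors (stand-ins for miles, gasprice)
y = iris.data[:, 2]    # target (stand-in for hours)

# Case 1: fit on 80% of the rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
split_model = LinearRegression().fit(X_train, y_train)

# Case 3: fit on all rows.
full_model = LinearRegression().fit(X, y)

# Case 2 equivalent: ordinary least squares with an explicit constant column.
X_const = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("80% split coefficients:", split_model.coef_)
print("full data coefficients:", full_model.coef_)
print("OLS with constant     :", beta[1:], "intercept:", beta[0])

# Held-out R^2 is what tells you how well the split model generalizes.
print("test R^2:", split_model.score(X_test, y_test))
```

The two full-data fits agree up to floating point error, while the 80% fit differs slightly because it saw different rows.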
I'm trying to find any python library or package which implements newgrnn (Generalized Regression Neural Network) using python.
Is there any package or library available where I can use neural network for regression. I'm trying to find python equivalent of the newgrnn (Generalized Regression Neural Network) which is described here.
I found the library neupy which solved my problem:
import numpy as np
from neupy import algorithms
from neupy.algorithms.rbfn.utils import pdf_between_data
grnn = algorithms.GRNN(std=0.003)
grnn.train(X, y)
# In this part of the code you can make any modifications you want
ratios = pdf_between_data(grnn.input_train, X, grnn.std)
predicted = (np.dot(grnn.target_train.T, ratios) / ratios.sum(axis=0)).T
This is the link for the library: http://neupy.com/apidocs/neupy.algorithms.rbfn.grnn.html
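If installing a library is not an option, the prediction step above (a ratio of kernel-weighted sums) is small enough to write in plain NumPy. The following is only a sketch of the textbook GRNN formula with a Gaussian kernel, not neupy's actual implementation; grnn_predict and the demo data are made-up names for illustration.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_new, std=0.1):
    """Textbook GRNN: Gaussian-kernel weighted average of training targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)

    # Squared Euclidean distances, shape (n_new, n_train).
    diff = X_new[:, None, :] - X_train[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)

    # Pattern layer: one Gaussian activation per training sample.
    weights = np.exp(-sq_dist / (2.0 * std ** 2))

    # Summation and output layers: weighted targets divided by total weight.
    return weights @ y_train / weights.sum(axis=1)

# Toy data (assumption): a noiseless sine curve on [0, 1].
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(50, 1))
y_demo = np.sin(2 * np.pi * X_demo[:, 0])

X_query = np.linspace(0.1, 0.9, 5).reshape(-1, 1)
print(grnn_predict(X_demo, y_demo, X_query, std=0.05))
```

The only hyperparameter is std (the kernel bandwidth). Very small values make the model reproduce the nearest training target, while large values smooth towards the global mean.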
A more recent alternative is pyGRNN, which offers, in addition to the standard GRNN, the Anisotropic GRNN, which optimizes the hyperparameters automatically:
from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from pyGRNN import GRNN
# get the data set
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(preprocessing.minmax_scale(X),
                                                    preprocessing.minmax_scale(y.reshape((-1, 1))),
                                                    test_size=0.25)
# use Anisotropic GRNN with Limited-Memory BFGS algorithm
# to select the optimal bandwidths
AGRNN = GRNN(calibration = 'gradient_search')
AGRNN.fit(X_train, y_train.ravel())
sigma = AGRNN.sigma
y_pred = AGRNN.predict(X_test)
mse_AGRNN = MSE(y_test, y_pred)
mse_AGRNN ## 0.030437040
From what I know, Linear Discriminant Analysis (LDA) is a technique to reduce the number of input features. Wikipedia states the same:
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
However, when I try to use LinearDiscriminantAnalysis from sklearn.discriminant_analysis, I fail to get the data with reduced features.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = make_blobs(40000,600,2,cluster_std=20,random_state=101)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
model = LinearDiscriminantAnalysis(n_components=100)
model.fit(X_train,y_train)
X_train_new = model.transform(X_train)
print(X_train_new.shape)
>>> (28000, 1)
My original data has 600 features, and I would like to reduce it to only 100 features with LDA. But the LDA from sklearn gave me the shape (28000, 1) instead.
Why is there only 1 feature after the LDA transformation? What am I doing wrong?
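Your LDA transforms your dataset to only one feature because LDA cannot produce more than min(n_features, n_classes - 1) components, so a larger n_components is cut down to that limit (recent scikit-learn versions raise an error instead of reducing it silently).
Here you have two classes => 2 - 1 = 1 feature.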
Please refer to LDA for two classes on Wikipedia
Change your number of centers to 200, for example, and you'll see the difference:
Xx, yy = make_blobs(40000, 600, centers=200, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(Xx, yy, test_size=0.3)
model = LinearDiscriminantAnalysis(n_components=100)
model.fit(X_train, y_train)
X_train_new = model.transform(X_train)
print(X_train_new.shape)
>>> (28000, 100)
Otherwise, use PCA or SVD:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=100)
X_train_new = svd.fit_transform(X_train)
svd.explained_variance_ratio_.sum() # should be > 0.90
print(X_train_new.shape)
>>> (28000, 100)
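The limit on the number of LDA components can be checked quickly on a much smaller problem. This is just a sketch: lda_output_dim is a made-up helper, and n_components is left at its default of None so that scikit-learn picks the maximum it allows.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_output_dim(n_classes, n_features=20):
    """Return how many columns LDA produces for a given number of classes."""
    X, y = make_blobs(n_samples=600, n_features=n_features,
                      centers=n_classes, cluster_std=5, random_state=0)
    # n_components=None lets LDA use as many components as it can.
    lda = LinearDiscriminantAnalysis()
    return lda.fit(X, y).transform(X).shape[1]

for n_classes in (2, 5, 30):
    print(n_classes, "classes ->", lda_output_dim(n_classes), "components")
```

With 2 and 5 classes the output has n_classes - 1 columns. With 30 classes and only 20 features it cannot exceed the number of features.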
I just completed a logistic regression. The data can be downloaded from the link below:
Please click this link to download the data
Below is the code for the logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lr = LogisticRegression()
lr.fit(X_train,y_train)
# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])
print(lr.coef_)
I can print the coefficients of the logistic regression and I can compute the probability of an event being 1 or 0.
When I write a Python function using those coefficients to compute the probability of class 1, I do not get the same answer as from lr.predict_proba(X_test).
The function I wrote is as follows:
def xG(bodyPart,shotQuality,defPressure,numDefPlayers,numAttPlayers,shotdist,angle,chanceRating,type):
    coeff = [0.09786083,2.30523761, -0.05875112,0.07905136,
             -0.1663424 ,-0.73930942,-0.10385882,0.98845481,0.13175622]
    return (coeff[0]*bodyPart+ coeff[1]*shotQuality+coeff[2]*defPressure+coeff[3]*numDefPlayers+coeff[4]*numAttPlayers+coeff[5]*shotdist+ coeff[6]*angle+coeff[7]*chanceRating+coeff[8]*type)
I got a weird answer, so I know something is wrong in the function's calculation.
May I ask for your advice, as I am new to machine learning and statistics?
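I think you missed the intercept_ in your xG, and the linear combination also has to be passed through the logistic (sigmoid) function to become a probability. You can retrieve the intercept from lr.intercept_ and add it in the final formula (math.exp comes from the standard library). Note that the model was fit on the output of scaler.fit_transform, so the inputs to xG must be standardized the same way:
return 1/(1+math.exp(-(intercept + coeff[0]*bodyPart+ coeff[1]*shotQuality+coeff[2]*defPressure+coeff[3]*numDefPlayers+coeff[4]*numAttPlayers+coeff[5]*shotdist+ coeff[6]*angle+coeff[7]*chanceRating+coeff[8]*type)))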
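To check a hand-written formula against predict_proba, something along these lines can be used. It is a sketch on synthetic data (make_classification stands in for data.csv, and manual_proba is a made-up helper). It reads the coefficients and scaling parameters from the fitted objects instead of hard-coding them, and applies the same standardization that StandardScaler applied during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv with 9 features (assumption).
X_raw, y = make_classification(n_samples=500, n_features=9,
                               n_informative=4, random_state=33)

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)
lr = LogisticRegression().fit(X, y)

def manual_proba(raw_row):
    """P(class 1) for one unscaled row, computed by hand."""
    # Same standardization the model saw during training.
    scaled = (raw_row - scaler.mean_) / scaler.scale_
    # Linear part: intercept plus weighted sum of the scaled features.
    z = lr.intercept_[0] + np.dot(lr.coef_[0], scaled)
    # Logistic (sigmoid) function turns the score into a probability.
    return 1.0 / (1.0 + np.exp(-z))

manual = np.array([manual_proba(row) for row in X_raw[:3]])
from_sklearn = lr.predict_proba(scaler.transform(X_raw[:3]))[:, 1]

print("manual      :", manual)
print("predict_proba:", from_sklearn)
```

The second column of predict_proba is the probability of class 1, which is the one the manual formula reproduces.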