I'm currently doing a project in machine learning. My objective is to use the following dataset to train (and test) a classification algorithm by means of multinomial logistic regression in Python: https://archive.ics.uci.edu/ml/datasets/wine. I'm using Google Colab, since it's a group project.
Since I would like a nice summary of the regression results, I prefer to use the package statsmodels over scikit-learn. However, when I try to print the summary, I get the following error: ValueError: need covariance of parameters for computing (unnormalized) covariances
Unfortunately, I can't upload any images due to lack of reputation. I've looked all over the internet, but I couldn't find any solution. Any help would be highly appreciated!
My code is as follows:
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from urllib.request import urlretrieve
#Import dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
urlretrieve(url, 'wine.csv')
wine = pd.read_csv('wine.csv', header = None)
wine.columns = ['Wine', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
print(wine)
#Normalizing the data
X = wine.drop(['Wine'], axis=1)
y = wine['Wine']
X_norm = sk.preprocessing.normalize(X.transpose()).transpose()
X_norm = pd.DataFrame(X_norm)
X_norm.columns = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
#Partitioning the dataset into a training and a test set
X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(X_norm, y, test_size = 0.30, random_state = 0)
#Printing the coefficients by statsmodels
model_sm = sm.MNLogit(y_train, X_train).fit(method = 'bfgs', maxiter = 1000, full_output = True, disp = True)
print(model_sm.summary())
Thanks in advance!
Related
Paul needs a laptop that is fast enough. One of the main parameters of a computer that he must focus on is the CPU. In this project we need to forecast CPU performance, which is characterized in terms of cycle time, memory capacity and so on.
It is a linear regression problem, and you should predict the Estimated Relative Performance column.
I am new to Python. Could anybody help me with the code for this task?
CSV file (on Google Drive)
This is what I have done so far, but I probably did not understand the task.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv("Computer_Hardware.csv")
data
data.describe()
y = data["Machine Cycle Time in nanoseconds"]
x1 = data["Estimated Relative Performance"]
plt.scatter(x1,y)
plt.xlabel("Estimated Relative Performance", fontsize = 20)
plt.ylabel("Machine Cycle Time in nanoseconds", fontsize = 20)
plt.show()
x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
results.summary()
In any fitted statsmodels model you can extract the predicted values with the predict() method and then add them to your data frame.
data['predicted'] = results.predict()
Your model may need more work. For now it uses only one variable, and you may get better predictions from a model that uses more variables.
y = b0 + b1 * x1
The task description says "... CPU which is characterized in terms of cycle time and memory capacity and so on", so relying on a single predictor is probably the problem.
One option is to extend your model using the statsmodels formula API. In your case I would first remove all spaces from the column names.
# Rename columns without spaces
old_columns = data.columns
new_columns = [col.replace(' ', '_') for col in old_columns]
data = data.rename(columns={old:new for old, new in zip(old_columns, new_columns)})
# Fit a model using more variables
import statsmodels.formula.api as sm2
formula = ('Estimated_Relative_Performance ~ ',
'Machine_Cycle_Time_in_nanoseconds + ',
'Maximum_Main_Memory_in_kilobytes + ',
'Cache_Memory_in_Kilobytes + ',
'Maximum_Channels_in_Units')
formula = ' '.join(formula)
print(formula)
results2 = sm2.ols(formula, data).fit()
results2.summary()
data['predicted2'] = results2.predict()
Keras is one of the easiest libraries to use if you don't have much experience in Python. This is a good tutorial: https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
I studied some blog posts on the topic and came up with this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
raw_data = pd.read_csv("Computer_Hardware.csv")
x = raw_data[['Machine Cycle Time in nanoseconds',
'Minimum Main Memory in Kilobytes', 'Maximum Main Memory in kilobytes',
'Cache Memory in Kilobytes', 'Minimum Channels in Units',
'Maximum Channels in Units', 'Published Relative Performance']]
y = raw_data['Estimated Relative Performance']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
print(model.coef_)
print(model.intercept_)
pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])
predictions = model.predict(x_test)
plt.hist(y_test - predictions)
from sklearn import metrics
metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)
np.sqrt(metrics.mean_squared_error(y_test, predictions))
I'm currently trying the following concept:
I applied np.log1p() to the independent variables and dependent variable (price)
Assuming X = independent variables and Y = dependent variable, I train_test_split X & Y
Then I trained the LinearRegression(), Ridge(), Lasso(), and ElasticNet() models
Given that the labels I used to train the model were also log1p(Y), I'm assuming the model predictions are also log values?
If the predictions are log values, how come np.expm1 doesn't return a value that is on a similar scale?
Linear Regression Code for reference
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import skew
from scipy import stats
from scipy.stats import norm
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
df_num = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df_cat = pd.DataFrame(np.random.randint(0,2,size=(10000, 2)), columns=['cat1', 'cat2'])
price = pd.DataFrame(np.random.randint(0,100,size=(10000, 1)), columns=['price'])
y = price
skewness = df_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
df_num[skewed_features] = np.log1p(df_num[skewed_features])
y = np.log1p(y)
train = pd.concat([df_num, df_cat], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
def predict_price(A, B, C, D, cat1):
    cat1_index = np.where(train.columns == cat1)[0][0]
    x = np.zeros(len(train.columns))
    x[0] = np.log1p(A)
    x[1] = np.log1p(B)
    x[2] = np.log1p(C)
    x[3] = np.log1p(D)
    if cat1_index >= 0:
        x[cat1_index] = 1
    return np.expm1(lr_clf.predict([x])[0])
predict_price(20, 30, 15, 55, 'cat2')
EDIT1: I tried to recreate an example from scratch, but I can't seem to replicate the issue I'm running into. The issue I run into in my real data is that:
predictions work totally fine if I DON'T log-normalize inputs when training and DON'T log normalize inputs when predicting.
HOWEVER when I do log-normalize when training and log normalize inputs and np.expm1 the prediction, the value is totally off.
Please let me know if there is anything I can explain more clearly.
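One way to narrow this down is to check the round trip on noise-free synthetic data. If a LinearRegression is trained on log1p(y), then np.expm1 of its predictions should land back on the original scale of y, provided the inputs at prediction time receive exactly the same transform as during training. The sketch below is illustrative only: the data, coefficients and variable names are assumptions and are not taken from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical raw features and a target that is exactly linear in log1p(X)
X_raw = rng.integers(1, 100, size=(200, 2)).astype(float)
X_log = np.log1p(X_raw)
y_log_true = 1.0 + 0.5 * X_log[:, 0] + 0.2 * X_log[:, 1]
y = np.expm1(y_log_true)  # target on the original (price-like) scale

# Train on the log-transformed target, as in the question
model = LinearRegression().fit(X_log, np.log1p(y))

# Predictions come back on the log1p scale; expm1 undoes the transform
pred = np.expm1(model.predict(X_log))

# Feeding untransformed inputs to the same model lands on a different scale
pred_wrong = np.expm1(model.predict(X_raw))

print(np.allclose(pred, y))
print(np.allclose(pred_wrong, y))
```

If the first check holds on the real data but individual predictions are still off, a mismatch between the transform used for training and the one used when building the prediction vector is a likely place to look.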
I have fitted an XGBoost model for binary classification. I am trying to understand the fitted model and to use SHAP to explain its predictions.
However, I am confused by the force plot generated by SHAP. I expected the output value to be smaller than 0, since the predicted probability is less than 0.5. However, the plot shows a value of 8.12.
Below is the code I used to generate the result.
import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz
print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))
Version of SHAP: 0.39.0
Version of XGBoost: 1.4.1
# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)
# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]
# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')
# Model prediction
model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape
(7887, 501)
# Randomly select a case
xid=4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))
Predict proba: 0.2292
# Doing SHAP way (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())
shap.plots.force(shap_values[xid])
However, I get a different plot, one that looks similar to what I expected, if I use the SHAP values from the XGBoost library.
shap.force_plot(
model_pred_detail[xid, -1], # From XGBoost.Booster.predict with pred_contribs=True
model_pred_detail[xid, 0:-1], # From XGBoost.Booster.predict with pred_contribs=True
feature_names=feature_names,
features=X[xid].toarray()
)
Why does this happen? Which one should be the correct SHAP value to explain the XGBoost model?
Thank you for your help.
Follow-up to the reply from @sergey-bushmanov
Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.
Here is the code for model training:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz
# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20
# Convert to fix the problem of encoding
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
    csv_file2 = csv_file.decode('utf-8', errors='replace')
# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)
df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)
# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
min_df=min_df,
binary=binary,
ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)
last_params ={
'lambda': 0.00016096144192346114,
'alpha': 0.057770973181367063,
'eta': 0.19258319097144733,
'gamma': 0.40032424821976653,
'max_depth': 9,
'min_child_weight': 5,
'subsample': 0.31304772813494836,
'colsample_bytree': 0.4214452441229869,
'objective': 'binary:logistic',
'verbosity': 0,
'n_estimators': 400
}
classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
classifierCV.fit(X_train, y_train, sample_weight=w_train)
# Get the features
feature_names = vectorizer.get_feature_names()
# save model
classifierCV.get_booster().save_model('xgboost.json')
# Save features
import json
with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y:x for x, y in enumerate(feature_names)}))
# save data
save_npz('test_data.npz', X_train)
The problem is still here with this model.
Which one should be the correct SHAP value to explain the XGBoost model?
Assuming you have a binary classification problem at hand, what you get in your second example is indeed the correct decomposition into raw SHAP values (in log-odds space):
In [1]: from scipy.special import expit
In [2]: expit(-1.21)
Out[2]: 0.22970105095339813
Note, .2297 is close to what you see in your:
Predict proba: 0.2292
As for:
Why does this happen?
most probably there is a typo somewhere, but to be sure you would have to provide a fully reproducible example, including your data, because code-wise both ways of calculating SHAP values are correct.
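For a binary:logistic objective, the contributions returned with pred_contribs=True live in log-odds (margin) space. The per-feature values plus the last column (the bias) sum to the raw margin, and the logistic function maps that margin to the predicted probability. Here is a minimal sketch of that relationship, using made-up contribution values rather than output from the actual model:

```python
import numpy as np
from scipy.special import expit

# Hypothetical row shaped like one row of predict(..., pred_contribs=True):
# three feature contributions followed by the bias term in the last position.
contribs = np.array([0.35, -0.90, -0.16, -0.50])

feature_part = contribs[:-1]  # per-feature SHAP values (log-odds units)
bias = contribs[-1]           # base value (log-odds units)

margin = feature_part.sum() + bias  # raw model output in log-odds
prob = expit(margin)                # probability after the logistic link

print(round(margin, 2), round(prob, 4))
```

A negative margin corresponds to a probability below 0.5, which is why a force plot drawn in log-odds units is expected to show a negative output value for a case like this one.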
For the same dataset (here Bupa) and the same parameters, I get different accuracies.
What did I overlook?
R implementation:
data_file = "bupa.data"
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7~ .,data=dataset[train,],
type="C-classification",
kernel="linear",
cost=1,
cross=10)
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get accuracy: 0.94
but when I do the following in Python (scikit-learn)
import numpy as np
from sklearn import datasets
import pandas as pd
from sklearn import svm
# sklearn.cross_validation and sklearn.grid_search were removed in later
# scikit-learn releases; the same functions live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
f = open("data/bupa.data")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get accuracy 0.67
Please help me.
I came across this post while having the same issue: wildly different accuracy between scikit-learn and the e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters to use when predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling must be applied to both training and test data. I only thought to check this after reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling parameters (mean and standard deviation of the columns of Bupa). You can use these to scale your data appropriately in Python (see below for an idea). Alternatively, you can scale the entire dataset together in Python and then do the train/test split; either way should now give you identical predictions.
def manual_scale(a, means, sds):
    a1 = a - means
    a1 = a1/sds
    return a1
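The same idea can be expressed with scikit-learn's StandardScaler: fit it on the training data only, then reuse the stored mean and standard deviation for the test data. The sketch below uses synthetic data (an assumption, since the Bupa file is not available here) and checks that the result matches a manual_scale-style helper fed with the training parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def manual_scale(a, means, sds):
    # Center and scale with externally supplied parameters
    return (a - means) / sds

# Synthetic stand-in for the real data: columns on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0], size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler on the training set only ...
scaler = StandardScaler().fit(X_train)
# ... and apply the stored parameters to the test set
X_test_scaled = scaler.transform(X_test)

# Same result as scaling by hand with the training mean and standard deviation
X_test_manual = manual_scale(X_test, scaler.mean_, scaler.scale_)
print(np.allclose(X_test_scaled, X_test_manual))
```

The fitted parameters are available as scaler.mean_ and scaler.scale_, so they can also be compared against the values reported by R.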
When using Support Vector Regression in Python/sklearn and R/e1071 both x and y variables need to be scaled/unscaled.
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.svm
import sklearn.preprocessing
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):
    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow = len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow = len(x_test))
    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
                     y=ry_train,
                     kernel='radial',
                     cost=cost,
                     epsilon=epsilon,
                     gamma=gamma,
                     scale=scale)
    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))
    return ry_pred
# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.title(title)
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    plt.xlabel('observed')
    plt.ylabel('predicted')
    plt.legend(loc=0)
    return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1)... in R vs. 1/N... in Python where N is the sample size). However, for typical sample sizes, this should be negligible.
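The difference can be seen directly in NumPy. np.std defaults to the 1/N definition (ddof=0), which is also what StandardScaler uses, while ddof=1 gives the 1/(N-1) definition that R's sd() computes. Here is a small sketch with arbitrary numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)

sd_population = np.std(x)      # 1/N definition (NumPy and StandardScaler default)
sd_sample = np.std(x, ddof=1)  # 1/(N-1) definition (what R's sd() computes)

# The two differ by a factor of sqrt(N / (N - 1)), which tends to 1 as N grows
ratio = sd_sample / sd_population
print(sd_population, sd_sample, ratio)
print(np.isclose(ratio, np.sqrt(n / (n - 1))))
```

For N = 5 the factor is about 1.12, while for N = 1000 it is about 1.0005, which is why the effect is usually negligible in practice.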
I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I simply used the StandardScaler instance fitted on the training set. It is important to use it only for transforming, not for fitting and transforming (as above), i.e.:
X_test = sc_X.transform(X_test)
This gave substantial agreement between the R and scikit-learn results.
I'm new to machine learning and am trying to use the linear model estimators that scikit-learn provides to predict the price of a used car. I tried different linear models, such as LinearRegression, Ridge, Lasso and ElasticNet, but in most cases all of them return a negative score (-0.6 <= score <= 0.1).
Someone told me that this is because of a multicollinearity problem, but I don't know how to solve it.
My sample code:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sqlalchemy import create_engine
from sklearn.linear_model import Ridge
engine = create_engine('sqlite:///path-to-db')
query = "SELECT mileage, carcass, engine, transmission, state, drive, customs_cleared, price FROM cars WHERE mark='some mark' AND model='some model' AND year='some year'"
df = pd.read_sql_query(query, engine)
df = df.dropna()
df = df.reindex(np.random.permutation(df.index))
X_full = df[['mileage', 'carcass', 'engine', 'transmission', 'state', 'drive', 'customs_cleared']]
y_full = df['price']
n_train = -(len(X_full) // 5)  # integer division so the value can be used as a slice index; the last fifth is the test set
X_train = X_full[:n_train]
X_test = X_full[n_train:]
y_train = y_full[:n_train]
y_test = y_full[n_train:]
predict = [200000, 0, 2.5, 0, 0, 2, 0] # parameters of the car to predict
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_estimate = model.predict(X_test)
print("Residual sum of squares: %.2f" % np.mean((y_estimate - y_test) ** 2))
print("Variance score: %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict([predict]))  # predict() expects a 2D array: one row per car
Carcass, state, drive and customs_cleared are numeric codes that represent types.
What is the correct way to implement the prediction? Maybe some data preprocessing or a different algorithm?
Thanks in advance!
Given that you are using Ridge Regression, you should scale your variables using StandardScaler, or MinMaxScaler:
http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
Perhaps using a Pipeline:
http://scikit-learn.org/stable/modules/pipeline.html#pipeline-chaining-estimators
With ordinary (unregularized) linear regression, scaling wouldn't matter; with Ridge regression, however, the regularization penalty (alpha) treats differently scaled variables differently. See this discussion on Cross Validated:
https://stats.stackexchange.com/questions/29781/when-should-you-center-your-data-when-should-you-standardize
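Here is a minimal sketch of such a pipeline, with synthetic data standing in for the car table (the column meanings and values are made up). It also illustrates the point about the penalty: without scaling, changing the unit of one feature changes the Ridge predictions, while with a StandardScaler in front they stay the same.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: "mileage" in km and "engine" volume in litres
rng = np.random.default_rng(0)
mileage = rng.uniform(10_000, 300_000, size=200)
engine = rng.uniform(1.0, 4.0, size=200)
price = 20_000 - 0.05 * mileage + 3_000 * engine + rng.normal(0, 500, size=200)

X = np.column_stack([mileage, engine])
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1000.0  # same data, engine volume in millilitres

# Plain Ridge: the penalty depends on the units of each column
plain_a = Ridge(alpha=1.0).fit(X, price).predict(X)
plain_b = Ridge(alpha=1.0).fit(X_rescaled, price).predict(X_rescaled)

# Ridge behind a StandardScaler: units no longer matter
piped_a = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, price).predict(X)
piped_b = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_rescaled, price).predict(X_rescaled)

diff_plain = np.abs(plain_a - plain_b).max()  # noticeably different
diff_piped = np.abs(piped_a - piped_b).max()  # essentially zero
print(diff_plain, diff_piped)
```

Because the pipeline stores the scaler fitted on the training data, calling its predict() on new rows applies the same scaling automatically.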