I am trying to predict my server load, but I am getting below 10% accuracy. I am using Linear Regression to predict the data; can anyone help me out?
PS: the CSV file contains date and time, so I converted both to integers. I'm not sure if I'm doing that right.
This is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits
import imp
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
data = pd.read_csv(".....\\Machine_Learning_Serious\\Server_Prediction\\testing_server.csv")
describe = data.describe()
data_cleanup = {"Timestamp":{'AM': 0, 'PM': 1},
"Function":{'ccpl_db01': 0, 'ccpl_fin01': 1, 'ccpl_web01': 2},
"Type": {'% Disk Time': 0, 'CPU Load': 1, 'DiskFree%_C:': 2, 'DiskFree%_D:': 3, 'DiskFree%_E:': 4, 'FreeMemory': 5, 'IIS Current Connections': 6, 'Processor Queue Length': 7, 'SQL_Buffer cache hit ratio': 8, 'SQL_User Connections': 9}}
data.replace(data_cleanup,inplace = True)
final_data = data.head()
#print(final_data)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
labels = data['Data']
train1 = data.drop(['Data'], axis = 1)
from sklearn.model_selection import train_test_split
from sklearn import ensemble
x_train , x_test , y_train , y_test = train_test_split(train1, labels, test_size = 0.25, random_state = 2)
#clf = ensemble.GradientBoostingRegressor(n_estimators= 400 , max_depth = 5,min_samples_split = 2, learning_rate = 0.1,loss='ls')
fitting = reg.fit(x_train,y_train)
score = reg.score(x_test,y_test)
The main objective is to predict the correct load, but right now I am way off.
Maybe do some exploratory data analysis first to see if you can find a pattern between your target variable and your features.
It would also be good to extract some features from your date/time variables rather than using them as raw integers (e.g. weekday or not, season, hour of day).
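A minimal sketch of what that could look like, assuming a hypothetical raw 'Date' column that pandas can parse (adjust the column name to whatever your CSV actually uses):
# hypothetical example: derive calendar features instead of raw integer timestamps
data['Date'] = pd.to_datetime(data['Date'])             # assumes a 'Date' column exists
data['hour'] = data['Date'].dt.hour
data['weekday'] = data['Date'].dt.weekday               # 0 = Monday, 6 = Sunday
data['is_weekend'] = (data['weekday'] >= 5).astype(int)
data['month'] = data['Date'].dt.month                   # rough proxy for seasonality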
You can also try transforming your features (log, sqrt) to see if the score improves.
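For instance (a sketch with a hypothetical numeric column name):
# log-transform a hypothetical skewed numeric feature before fitting
train1['some_numeric_col'] = np.log1p(train1['some_numeric_col'])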
I would also suggest trying a simple random forest/XGBoost model to check how it performs against the linear regression model.
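A quick baseline along those lines, reusing the split and the fitted reg from your code above (a sketch, not a tuned model):
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=200, random_state=2)
rf.fit(x_train, y_train)
print('Linear Regression R^2:', reg.score(x_test, y_test))
print('Random Forest R^2:', rf.score(x_test, y_test))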
I'm currently trying the following concept:
I applied np.log1p() to the independent variables and dependent variable (price)
Assuming X = independent variables and Y = dependent variable, I train_test_split X & Y
Then I trained the LinearRegression(), Ridge(), Lasso(), and ElasticNet() models
Given that the labels I used to train the model were also log1p(Y), I'm assuming the model predictions are also log values?
If the predictions are log values, how come np.expm1 doesn't return a value that is on a similar scale?
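To illustrate the round trip I expect, here is a minimal sketch on toy data (not my real data; the multiplicative relationship is only there so a linear model fits well in log space):
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
X_toy = rng.uniform(1, 100, size=(500, 2))
y_toy = X_toy[:, 0] * X_toy[:, 1]
model = LinearRegression().fit(np.log1p(X_toy), np.log1p(y_toy))
pred_log = model.predict(np.log1p(X_toy[:5]))   # predictions are on the log1p scale
print(np.expm1(pred_log))                       # back on (roughly) the original scale
print(y_toy[:5])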
Linear Regression Code for reference
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import skew
from scipy import stats
from scipy.stats import norm
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
df_num = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df_cat = pd.DataFrame(np.random.randint(0,2,size=(10000, 2)), columns=['cat1', 'cat2'])
price = pd.DataFrame(np.random.randint(0,100,size=(10000, 1)), columns=['price'])
y = price
skewness = df_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
df_num[skewed_features] = np.log1p(df_num[skewed_features])
y = np.log1p(y)
train = pd.concat([df_num, df_cat], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
def predict_price(A, B, C, D, cat1):
    cat1_index = np.where(train.columns == cat1)[0][0]
    x = np.zeros(len(train.columns))
    x[0] = np.log1p(A)
    x[1] = np.log1p(B)
    x[2] = np.log1p(C)
    x[3] = np.log1p(D)
    if cat1_index >= 0:
        x[cat1_index] = 1
    return np.expm1(lr_clf.predict([x])[0])
predict_price(20, 30, 15, 55, 'cat2')
EDIT1: I tried to recreate an example from scratch, but I can't seem to replicate the issue I'm running into. The issue I run into in my real data is that:
Predictions work fine if I DON'T log-normalize the inputs when training and DON'T log-normalize the inputs when predicting.
HOWEVER, when I DO log-normalize when training, log-normalize the inputs when predicting, and then np.expm1 the prediction, the value is totally off.
Please let me know if there is anything I can explain more clearly.
I have fitted an XGBoost model for binary classification. I am trying to understand the fitted model and am trying to use SHAP to explain the predictions.
However, I am confused by the force plot generated by SHAP. I expected the output value to be smaller than 0, since the predicted probability is less than 0.5. However, the SHAP value shows 8.12.
Below is the code I used to generate the result.
import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz
print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))
Version of SHAP: 0.39.0
Version of XGBoost: 1.4.1
# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)
# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]
# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')
# Model prediction
model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape
(7887, 501)
# Random select a case
xid=4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))
Predict proba: 0.2292
# Doing SHAP way (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())
shap.plots.force(shap_values[xid])
However, I get a different plot if I use the SHAP values from the XGBoost library, and that one looks similar to my expectation.
shap.force_plot(
    model_pred_detail[xid, -1],    # from xgb.Booster.predict with pred_contribs=True
    model_pred_detail[xid, 0:-1],  # from xgb.Booster.predict with pred_contribs=True
    feature_names=feature_names,
    features=X[xid].toarray()
)
Why does this happen? Which one should be the correct SHAP value to explain the XGBoost model?
Thank you for your help.
Follow-up to the reply from @sergey-bushmanov:
Since I cannot share my own data, I reproduced the situation with an open dataset from Kaggle.
Here is the code for model training:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz
# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20
# Convert to fix the problem of encoding
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')
# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)
df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)
# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
                             min_df=min_df,
                             binary=binary,
                             ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)
last_params ={
'lambda': 0.00016096144192346114,
'alpha': 0.057770973181367063,
'eta': 0.19258319097144733,
'gamma': 0.40032424821976653,
'max_depth': 9,
'min_child_weight': 5,
'subsample': 0.31304772813494836,
'colsample_bytree': 0.4214452441229869,
'objective': 'binary:logistic',
'verbosity': 0,
'n_estimators': 400
}
classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
# w_train was missing from the snippet; assuming class_weight is meant to up-weight class 0
w_train = np.where(y_train == 0, class_weight, 1)
classifierCV.fit(X_train, y_train, sample_weight=w_train)
# Get the features
feature_names = vectorizer.get_feature_names()
# save model
classifierCV.get_booster().save_model('xgboost.json')
# Save features
import json
with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y: x for x, y in enumerate(feature_names)}))
# save data
save_npz('test_data.npz', X_train)
The problem is still here with this model.
Which one should be the correct SHAP value to explain the XGBoost model?
Let's guess that you have a binary classification problem at hand. Then what you're getting in your 2nd example is indeed the right decomposition of raw SHAP values:
In [1]: from scipy.special import expit
In [2]: expit(-1.21)
Out[2]: 0.22970105095339813
Note, .2297 is close to what you see in your:
Predict proba: 0.2292
As for:
Why does this happen?
most probably you have a typo somewhere, but to be sure you would have to provide a fully reproducible example, including your data, because code-wise both ways of calculating SHAP values are correct.
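A quick sanity check you can run with the variables already defined in your snippet (a sketch; for binary:logistic the raw SHAP contributions plus the bias column sum to the log-odds margin):
from scipy.special import expit
margin = model_pred_detail[xid].sum()   # per-feature contributions + bias column
print(expit(margin))                    # should be close to model_pred_prob[xid]
print(model_pred_prob[xid])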
I'm trying to use random forest with grid search, but this error shows up:
ValueError: Invalid parameter classifier for estimator Pipeline(steps=[('tfidf_vectorizer', TfidfVectorizer()),
('rf_classifier', RandomForestClassifier())]).
Check the list of available parameters with `estimator.get_params().keys()`.
import numpy as np # linear algebra
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import pipeline,ensemble,preprocessing,feature_extraction,metrics
train=pd.read_json('cleaned_data1')
#split dataset into X , Y
X=train.iloc[:,0]
Y=train.iloc[:,2]
estimators = pipeline.Pipeline([
    ('tfidf_vectorizer', feature_extraction.text.TfidfVectorizer(lowercase=True)),
    ('rf_classifier', ensemble.RandomForestClassifier())
])
print(estimators.get_params().keys())
params = {"classifier__max_depth": [3, None],
"classifier__max_features": [1, 3, 10],
"classifier__min_samples_split": [1, 3, 10],
"classifier__min_samples_leaf": [1, 3, 10],
# "bootstrap": [True, False],
"classifier__criterion": ["gini", "entropy"]}
X_train,X_test,y_train,y_test=train_test_split(X,Y, test_size=0.2)
rf_classifier=GridSearchCV(estimators,params, cv=10 , n_jobs=-1 ,scoring='accuracy',iid=True)
rf_classifier.fit(X_train,y_train)
y_pred=rf_classifier.predict(X_test)
metrics.confusion_matrix(y_test,y_pred)
print(metrics.accuracy_score(y_test,y_pred))
I've also tried adding these params:
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
but I still get the same error.
Make sure that when you reference a step in the pipeline, you use the exact same step name when you initialize the parameter grid.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
X_digits, y_digits = datasets.load_digits(return_X_y=True)
# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_grid = {
'pca__n_components': [5, 15, 30, 45, 64],
'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
In this example, we reference the LogisticRegression model as 'logistic', so its parameters in the grid are prefixed with 'logistic__'. Also, on a side note, please note that for RandomForestClassifier a value of min_samples_split = 1 is not possible and will result in an error.
This is from the sklearn documentation
Where you have named the random forest ensemble 'rf_classifier' within the pipeline, you should rename it to 'classifier', which should solve the issue.
The params look for something named 'classifier' in the pipeline so that they can apply themselves; however, nothing currently has that name, which is why this error is thrown.
Alternatively (I'm not sure if this will work, but it is worth testing), you could change "classifier__" in the params list to "rf_classifier__" to see if the params will then recognise the passed classifier.
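A sketch of what that renamed grid could look like, keeping the 'rf_classifier' step name from the pipeline above (note that min_samples_split must be at least 2 for RandomForestClassifier):
params = {"rf_classifier__max_depth": [3, None],
          "rf_classifier__max_features": [1, 3, 10],
          "rf_classifier__min_samples_split": [2, 3, 10],  # 1 is not a valid value
          "rf_classifier__min_samples_leaf": [1, 3, 10],
          "rf_classifier__criterion": ["gini", "entropy"]}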
After using the H2O Python AutoML module, I found that XGBoost is at the top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate it with the XGBoost sklearn API. However, the performance differs between these 2 approaches:
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report
import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd
h2o.init()
iris = datasets.load_iris()
X = iris.data
y = iris.target
data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1))
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)
# data.shape
train_df = data[:120]
test_df = data[120:]
# Import a sample binary outcome train/test set into H2O
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)
# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
keep_cross_validation_predictions=True,
exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
Performance of H2O Xgboost
skplt.metrics.plot_confusion_matrix(test_df['target'],
m.predict(test).as_data_frame()['predict'],
normalize=False)
Replicate in XGBoost Sklearn API:
mapping_dict = {
"booster": "booster",
"colsample_bylevel": "col_sample_rate",
"colsample_bytree": "col_sample_rate_per_tree",
"gamma": "min_split_improvement",
"learning_rate": "learn_rate",
"max_delta_step": "max_delta_step",
"max_depth": "max_depth",
"min_child_weight": "min_rows",
"n_estimators": "ntrees",
"nthread": "nthread",
"reg_alpha": "reg_alpha",
"reg_lambda": "reg_lambda",
"subsample": "sample_rate",
"seed": "seed",
# "max_delta_step": "score_tree_interval",
# 'missing': None,
# 'objective': 'binary:logistic',
# 'scale_pos_weight': 1,
# 'silent': 1,
# 'base_score': 0.5,
}
parameter_from_water = {}
for item in mapping_dict.items():
    parameter_from_water[item[0]] = m.params[item[1]]['actual']
# parameter_from_water
xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
Performance of Sklearn XGBoost:
(always worse than H2O in all examples I tried.)
skplt.metrics.plot_confusion_matrix(test_df['target'],
xgb_clf.predict(test_df.drop('target', axis=1) ),
normalize=False);
Anything obvious that I missed?
When you use H2O AutoML with the following lines of code:
aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
keep_cross_validation_predictions=True,
exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
you use the option nfolds = 3, which means each algorithm is trained three times, each time using two thirds of the data for training and one third for validation. This makes the algorithm more stable and can sometimes give better performance than handing over the entire training dataset in one go.
Handing over the whole training set in one go is what you do when you train your XGBoost using fit(). So even though you have the same algorithm (XGBoost) with the same hyperparameters, you don't use the training set the same way H2O does. Hence the difference in your confusion matrices!
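For intuition, here is a rough sklearn-side analogue of what nfolds = 3 does, sketched with the estimator and data frame from the question (it only mimics the cross-validation scheme, not H2O's internals):
from sklearn.model_selection import cross_val_score
# three fits, each validated on the held-out third of the training data
scores = cross_val_score(xgb_clf, train_df.drop('target', axis=1), train_df['target'], cv=3)
print(scores)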
If you want to have the same performance when copying the best model, you can change the parameter H2OAutoML(..., nfolds = 0)
Furthermore, H2O's XGBoost takes into account approximately 60 different parameters, and you missed a few important ones in your dictionary, such as min_child_weight. So your XGBoost is not exactly the same as your H2O model, which could explain the differences in performance.
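One way to see what you did not carry over is to dump every parameter H2O actually used and compare it against your mapping (a sketch reusing the m handle and mapping_dict from the question):
# list the H2O parameters that were never mapped to the sklearn XGBClassifier
mapped = set(mapping_dict.values())
unmapped = {k: v['actual'] for k, v in m.params.items() if k not in mapped}
print(unmapped)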
I am using Python 3.6.1 | Anaconda 4.4.0
I am a novice in ML and am practicing while learning. I picked up a Kaggle dataset to practice LDA for dimensionality reduction. Two points of confusion arose:
I started getting the warning "Variables are collinear."
Even though I set n_components to 2, the output matrix x_train shows only 1 feature.
code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
datasets = pd.read_csv('mushrooms.csv')
X_df = datasets.iloc[:, 1:] # Independent variables
y_df = datasets.iloc[:, 0] # Dependent variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
X_df = X_df.apply(LabelEncoder().fit_transform)
x = OneHotEncoder(sparse=False).fit_transform(X_df.values)
y = LabelEncoder().fit_transform(y_df.values)
# Splitting dataset in to training set and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
#---------------------------------------------
# Applying LDA (Linear Discriminant Analysis)
#---------------------------------------------
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
x_train = lda.fit_transform(x_train, y_train)
x_test = lda.transform(x_test)
This is just what the warning message says: some of your variables are collinear. In other words, the elements of one vector are a linear function of the elements of another, such as
0, 1, 2, 3
3, 5, 7, 9
In this case, LDA can't separate their individual influences on the outcome.
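A minimal sketch of that situation (and, as a likely explanation for your second point, note that with a binary target LDA can return at most n_classes - 1 = 1 discriminant component, regardless of the n_components you request):
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
a = np.array([0, 1, 2, 3], dtype=float)
b = 2 * a + 3                       # b is a linear function of a -> collinear
X_demo = np.column_stack([a, b])
y_demo = np.array([0, 0, 1, 1])     # binary target, like edible/poisonous
lda_demo = LDA().fit(X_demo, y_demo)        # may warn: "Variables are collinear."
print(lda_demo.transform(X_demo).shape)     # (4, 1): at most n_classes - 1 components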
I can't diagnose anything more specific, since you did not provide the suggested MCVE.