I have written two programs that are supposed to follow the same logic, but they give different answers.
First-
# Imports assumed from context (not shown in the original post):
import numpy as np
from sklearn import preprocessing as pps
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_data = train_features[:1710][:]
train_label = label_features[:1710][:].ravel()
test_data = train_features[1710:][:]
test_label = label_features[1710:][:].ravel()

def getAccuracy(ans):
    d = 0
    for i in range(np.size(ans, 0)):
        if ans[i] == test_label[i]:
            d += 1
    return (d * 100) / float(np.size(ans, 0))

estimators = [('pps', pps.RobustScaler()), ('clf', LogisticRegression())]
pipe = Pipeline(estimators)
pipe = pipe.fit(train_data, train_label)
ans = pipe.predict(test_data)
getAccuracy(ans)
Second-
train_data = train_features[:1710][:]
train_label = label_features[:1710][:].ravel()
test_data = train_features[1710:][:]
test_label = label_features[1710:][:].ravel()

def getAccuracy(ans):
    d = 0
    for i in range(np.size(ans, 0)):
        if ans[i] == test_label[i]:
            d += 1
    return (d * 100) / float(np.size(ans, 0))

def preprocess(features):
    return pps.RobustScaler().fit_transform(features)

train_data = preprocess(train_data)
clf = LogisticRegression().fit(train_data, train_label)
test_data = preprocess(test_data)
ans = clf.predict(test_data)
getAccuracy(ans)
The first one gives 80.81 and the second one gives 84.92. Why are they different?
Your second snippet is invalid, because your preprocess function fits the scaler to the test set, which should not happen. The Pipeline, on the other hand, fits the RobustScaler only on your training data and then calls transform on the test data.
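For reference, here is a minimal sketch of what the Pipeline does under the hood, assuming the train_data/test_data arrays defined in the question: the scaler is fitted on the training data only and then reused to transform the test data.

scaler = pps.RobustScaler()
train_scaled = scaler.fit_transform(train_data)   # fit the scaler on the training data only
test_scaled = scaler.transform(test_data)         # reuse the fitted scaler on the test data

clf = LogisticRegression().fit(train_scaled, train_label)
ans = clf.predict(test_scaled)
getAccuracy(ans)  # should reproduce the Pipeline result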
def CreateData(self, n_samples, seed_in=5,
               train_prop=0.9, bound_limit=6., n_std_devs=1.96, **kwargs):
    np.random.seed(seed_in)
    scale_c = 1.0  # default
    shift_c = 1.0
    # for the ideal boundary
    X_ideal = np.linspace(start=-bound_limit, stop=bound_limit, num=50000)
    y_ideal_U = np.ones_like(X_ideal) + 1.  # default
    y_ideal_L = np.ones_like(X_ideal) - 1.
    y_ideal_mean = np.ones_like(X_ideal) + 0.5
    if self.type_in[:1] == '~':
        if self.type_in == "~boston":
            path = 'boston_housing_data.csv'
            data = np.loadtxt(path, skiprows=0)
        elif self.type_in == "~concrete":
            path = 'Concrete_Data.csv'
            data = np.loadtxt(path, delimiter=',', skiprows=1)
        elif self.type_in == "~wind":
            path = '/content/Deep_Learning_Prediction_Intervals/code/canada_CSV.csv'
            data = np.loadtxt(path, delimiter=',', skiprows=1, usecols=(1, 2))  # check whether loadtxt is the right reader here
        # work out normalisation constants (needed when unnormalising later)
        scale_c = np.std(data[:, -1])
        shift_c = np.mean(data[:, -1])
        # normalise data for ALL columns
        for i in range(0, data.shape[1]):  # iterate over the columns one by one
            # avoid zero-variance features (one or two exist)
            # nonlocal X_train, y_train, X_val, y_val  # attempted fix
            sdev_norm = np.std(data[:, i])
            sdev_norm = 0.001 if sdev_norm == 0 else sdev_norm
            data[:, i] = (data[:, i] - np.mean(data[:, i])) / sdev_norm
        # split into train/test
        perm = np.random.permutation(data.shape[0])  # shuffle all the rows
        train_size = int(round(train_prop * data.shape[0]))
        train = data[perm[:train_size], :]
        test = data[perm[train_size:], :]
        y_train = train[:, -1].reshape(-1, 1)  # last column is the target, reshaped into a column vector
        X_train = train[:, :-1]  # inputs are all columns except the last
        y_val = test[:, -1].reshape(-1, 1)
        X_val = test[:, :-1]
    # save important stuff
    self.X_train = X_train
    self.y_train = y_train
    self.X_val = X_val
    self.y_val = y_val
    self.X_ideal = X_ideal
    self.y_ideal_U = y_ideal_U
    self.y_ideal_L = y_ideal_L
    self.y_ideal_mean = y_ideal_mean
    self.scale_c = scale_c
    self.shift_c = shift_c
    return X_train, y_train, X_val, y_val
It gives me the error
'UnboundLocalError: local variable 'X_train' referenced before assignment'
Any help would be appreciated; I am stuck on this. I have tried initialising X_train with X_train = [] and also tried making the variables global, but neither worked. Please help me so that I can move forward.
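For what it's worth, a minimal sketch of the likely cause (assuming the train/test split really does sit inside the `if self.type_in[:1] == '~':` branch, as reconstructed above): X_train is only assigned on that branch, so when type_in does not start with '~' the later references fail.

# Hypothetical minimal reproduction of the error (not the author's real code):
def create_data(type_in):
    if type_in[:1] == '~':
        X_train = [1, 2, 3]   # only assigned on this path
    return X_train            # UnboundLocalError if the branch was skipped

create_data('boston')         # raises UnboundLocalError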
I have a problem with LightGBM. When I write
lgb.train(.......)
it finishes in less than a millisecond (for a dataset of shape (10000, 25)), and when I call predict, all the output values are the same.
train = pd.read_csv('data/train.csv', dtype = dtypes)
test = pd.read_csv('data/test.csv')
test.head()
X = train.iloc[:10000, 3:-1].values
y = train.iloc[:10000, -1].values
sc = StandardScaler()
X = sc.fit_transform(X)
#pca = PCA(0.95)
#X = pca.fit_transform(X)
d_train = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
num_round = 10
clf = lgb.train(params, d_train, num_round, verbose_eval=1000)
X_test = sc.transform(test.iloc[:100,3:].values)
pred = clf.predict(X_test, num_iteration = clf.best_iteration)
When I print pred, all the values are 0.49.
It's my first time using the lightgbm module. Do I have an error in the code, or should I look for some mismatch in the dataset?
Your num_round is too small: the model just starts to learn and then stops. Also make verbose_eval smaller so you can see the results during training. My suggestion is to try the lgb.train call below:
clf = lgb.train(params, d_train, num_boost_round=5000, verbose_eval=10, early_stopping_rounds = 3500)
Always use early_stopping_rounds, so the model stops when there is no evident learning or when it starts to overfit.
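Note that early stopping needs a validation set to evaluate against; here is a fuller sketch, assuming a held-out split named X_valid/y_valid (which is not in the original code):

d_train = lgb.Dataset(X, label=y)
d_valid = lgb.Dataset(X_valid, label=y_valid, reference=d_train)  # hypothetical hold-out split

clf = lgb.train(params,
                d_train,
                num_boost_round=5000,
                valid_sets=[d_valid],
                verbose_eval=10,
                early_stopping_rounds=100)
# On recent LightGBM versions, pass
# callbacks=[lgb.early_stopping(100), lgb.log_evaluation(10)]
# instead of the verbose_eval / early_stopping_rounds keywords.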
Do not hesitate to ask more. Have fun.
I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.
I built a forward feature selection routine in Python, which iteratively builds the optimal set of features (the ones leading to the best score; the metric here is binary classification error).
On my data set, using the xgb.cv routine, I can get down to an error rate of around 0.21 by increasing max_depth (of the trees) up to 40.
But if I then do a custom cross-validation, using the same XGBoost parameters, same folds, same metric and same data set, the best score I reach is 0.70 with a max_depth of 4; if I use the optimal max_depth obtained by my xgb.cv routine, my score drops to 0.65. I just don't understand what is happening.
My best guess is that xgb.cv uses different folds (i.e. shuffles the data before partitioning), but I think I also pass the folds as an input to xgb.cv (with shuffle=False), so it might be something completely different.
Here is the code of the forward_feature_selection (using xgb.cv):
def Forward_Feature_Selection(train, y_train, params, num_round=30, threshold=0,
                              initial_score=0.5, to_exclude=[], nfold=5):
    k_fold = KFold(n_splits=13)
    selected_features = []
    gain = threshold + 1
    previous_best_score = initial_score
    train = train.drop(train.columns[to_exclude], axis=1)  # df.columns is a zero-based pd.Index
    features = train.columns.values
    selected = np.zeros(len(features))
    scores = np.zeros(len(features))
    while gain > threshold:  # start an add-a-feature loop
        for i in range(0, len(features)):
            if selected[i] == 0:  # take only features not yet selected
                selected_features.append(features[i])
                new_train = train.iloc[:][selected_features]
                selected_features.remove(features[i])
                dtrain = xgb.DMatrix(new_train, y_train, missing=None)
                # dtrain = xgb.DMatrix(pd.DataFrame(new_train), y_train, missing=None)
                if i % 10 == 0:
                    print("Launching XGBoost for feature " + str(i))
                xgb_cv = xgb.cv(params, dtrain, num_round, nfold=13, folds=k_fold, shuffle=False)
                if params['objective'] == 'binary:logistic':
                    scores[i] = xgb_cv.tail(1)["test-error-mean"]  # classification
                else:
                    scores[i] = xgb_cv.tail(1)["test-rmse-mean"]  # regression
            else:
                scores[i] = initial_score  # discard already-selected variables from the candidates
        best = np.argmin(scores)
        gain = previous_best_score - scores[best]
        if gain > 0:
            previous_best_score = scores[best]
            selected_features.append(features[best])
            selected[best] = 1
            print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score))
    return (selected_features, previous_best_score)
and here is my "custom" cross validation:
mean_error_rate = 0
for train, test in k_fold.split(ds):
    dtrain = xgb.DMatrix(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"], missing=None)
    gbm = xgb.train(params, dtrain, 30)
    dtest = xgb.DMatrix(pd.DataFrame(ds.iloc[test]), dc.iloc[test]["bin_spread"], missing=None)
    res.ix[test, "pred"] = gbm.predict(dtest)
    cv_reg = reg.fit(pd.DataFrame(ds.iloc[train]), dc.iloc[train]["bin_spread"])
    res.ix[test, "lasso"] = cv_reg.predict(pd.DataFrame(ds.iloc[test]))
    res.ix[test, "y_xgb"] = res.loc[test, "pred"] > 0.5
    res.ix[test, "xgb_right"] = (res.loc[test, "y_xgb"] == res.loc[test, "bin_spread"])
    print(str(100 * np.sum(res.loc[test, "xgb_right"]) / (N / 13)))
    mean_error_rate += 100 * (np.sum(res.loc[test, "xgb_right"]) / (N / 13))
print("mean_error_rate is : " + str(mean_error_rate / 13))
using the following parameters:
params = {"objective": "binary:logistic",
"booster":"gbtree",
"max_depth":4,
"eval_metric" : "error",
"eta" : 0.15}
res = pd.DataFrame(dc["bin_spread"])
k_fold = KFold(n_splits=13)
N = dc.shape[0]
num_trees = 30
And finally the call to my forward feature selection:
selfeat = Forward_Feature_Selection(dc,
dc["bin_spread"],
params,
num_round = num_trees,
threshold = 0,
initial_score=999,
to_exclude = [0,1,5,30,31],
nfold = 13)
Any help to understand what is happening would be greatly appreciated! Thanks in advance for any tips!
This is normal; I have experienced the same. Firstly, KFold splits differently each time. You have specified the folds in XGBoost, but KFold does not split consistently, which is normal.
Next, the initial state of the model is different each time.
There are also internal random states within XGBoost that can cause this; try changing the eval metric to see if the variance reduces. If a particular metric suits your needs, try averaging the best parameters and use that as your optimal parameters.
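A possible sketch of pinning the randomness down, assuming the variable names from the question (this removes run-to-run variation, not necessarily the whole gap between the two scores):

from sklearn.model_selection import KFold

# With shuffle=False, KFold is deterministic; if you shuffle, fix random_state
# so every run (both the xgb.cv path and the custom loop) sees the same folds.
k_fold = KFold(n_splits=13, shuffle=True, random_state=42)

# Pin XGBoost's internal randomness (subsampling, etc.) as well.
params["seed"] = 42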
For some reason, the features of this dataset are being interpreted as rows: "Model n_features is 16 and input n_features is 18189", where 18189 is the number of rows and 16 is the correct number of features.
The suspect code is here:
for var in cat_cols:
    num = LabelEncoder()
    train[var] = num.fit_transform(train[var].astype('str'))
train['output'] = num.fit_transform(train['output'].astype('str'))

for var in cat_cols:
    num = LabelEncoder()
    test[var] = num.fit_transform(test[var].astype('str'))
test['output'] = num.fit_transform(test['output'].astype('str'))

clf = RandomForestClassifier(n_estimators=10)
xTrain = train[list(features)].values
yTrain = train["output"].values
xTest = test[list(features)].values
xTest = test["output"].values
clf.fit(xTrain, yTrain)
clfProbs = clf.predict(xTest)  # the error happens here
Anyone got any ideas?
Sample training data CSV:
tr4,42,"JobCat4","divorced","tertiary","yes",2,"yes","no","unknown",5,"may",0,1,-1,0,"unknown","TypeA"
Sample test data CSV:
tst2,47,"JobCat3","married","unknown","no",1506,"yes","no","unknown",5,"may",0,1,-1,0,"unknown",?
You have a small typo: you created the variable xTest and then immediately overwrote it with something incorrect. Change the offending lines to:
xTest = test[list(features)].values
yTest = test["output"].values
I am using decision stumps with a BaggingClassifier to classify some data:
def fit_ensemble(attributes, class_val, n_estimators):
    # max depth is 1
    decisionStump = DecisionTreeClassifier(criterion='entropy', max_depth=1)
    ensemble = BaggingClassifier(base_estimator=decisionStump, n_estimators=n_estimators, verbose=3)
    return ensemble.fit(attributes, class_val)

def predict_all(fitted_classifier, instances):
    for i, instance in enumerate(instances):
        instances[i] = fitted_classifier.predict([instances[i]])
    return list(itertools.chain(*instances))

def main(filename, n_estimators):
    df_ = read_csv(filename)
    col_names = df_.columns.values.tolist()
    attributes = col_names[0:-1]  # 0..n-1
    class_val = col_names[-1]  # n
    fitted = fit_ensemble(df_[attributes].values, df_[class_val].values, n_estimators)
    fitted_classifiers = fitted.estimators_  # get the three decision stumps
    compared_ = DataFrame(index=range(0, len(df_.index)), columns=range(0, n_estimators + 1))
    compared_ = compared_.fillna(0)
    compared_.ix[:, n_estimators] = df_[class_val].values
    for i, fitted_classifier in enumerate(fitted_classifiers):
        compared_.ix[:, i] = predict_all(fitted_classifier, df_[attributes].values)
I would like to inspect the random subset used to train each decision stump. I have looked at the documentation for both the ensemble and decision tree class, but haven't found any attributes or methods that yield the training subset. Is this a futile task? Or is there some way, perhaps while the tree is training, to output the training subset?
I am very new to pandas but come from an R background. My code is definitely not optimized, though I can assure you that the dataset is very small for my task. Thanks for the help.
It looks like I have answered my own question: the estimators_samples_ attribute of BaggingClassifier is what I want.
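A minimal sketch of using it, assuming fitted is the BaggingClassifier returned by fit_ensemble and X is the attribute matrix it was trained on:

# Each entry of estimators_samples_ describes the bootstrap sample used to fit
# the corresponding decision stump (index arrays in recent scikit-learn
# versions, boolean masks in some older ones).
for i, sample in enumerate(fitted.estimators_samples_):
    X_subset = X[sample]  # rows seen by stump i
    print("stump %d was trained on %d sampled rows" % (i, len(X_subset)))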