I have to discretize a continuous target variable into at least 5 bins in order to lower the complexity of a classification model using the sklearn library.
To do this I've used KBinsDiscretizer, but now that I've discretized the target variable I don't know how to split the dataset into balanced parts.
This is my code:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X = df.copy()
y = X.pop('shares')
# scale the features so all columns are in the same range
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)
# discretize the continuous target into 5 ordinal bins
discretizer = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
y_discretized = discretizer.fit_transform(y.values.reshape(-1, 1))
# is this correct?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, stratify=y_discretized)
For completeness, I'm trying to recreate a less complex model than the one shown in: [1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal
Your y_train and y_test are parts of y, which has (it seems) the original continuous values. So you're ending up fitting multiclass classification models, with probably lots of different classes, which likely causes the crashes.
I assume what you wanted is
X_train, X_test, y_train, y_test = train_test_split(X, y_discretized, test_size=0.33, shuffle=True, stratify=y_discretized)
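If it helps, here is a minimal sketch of that corrected flow (reusing the X, y and discretizer setup from your question) which also checks that the bin proportions come out similar in the two splits:
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
discretizer = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
# flatten to a 1-D label array for classification
y_discretized = discretizer.fit_transform(y.values.reshape(-1, 1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y_discretized, test_size=0.33, shuffle=True, stratify=y_discretized)
# stratification keeps the bin proportions (not the bin sizes) similar in train and test
print(np.bincount(y_train.astype(int)) / len(y_train))
print(np.bincount(y_test.astype(int)) / len(y_test))
Note that strategy='uniform' gives equal-width bins, so the five classes may still be imbalanced; if "balanced parts" means equally populated bins, strategy='quantile' is the option to look at.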
Whether discretizing a continuous target to turn a regression into a classification is a good idea is a topic for another site; see e.g. https://datascience.stackexchange.com/q/90297/55122
I am having trouble searching for materials on this matter; I don't know exactly what to search for.
I am trying to take the output of my Logistic Regression classifier (which predicts a binary class over a time series) and apply a filter that looks at a window of the last X predictions: only if the number of positives in that window is greater than a given threshold does the sample get a positive.
My input is a time series with many features of a company process, such as currents, pressures and so on. I am trying to build a fault detection algorithm, and because the raw output is so noisy I want to make it more consistent over time.
Classifier Pattern
Solved.
import numpy as np
import pandas as pd

# logo (a group-wise CV splitter), model, groups, filtro (window size)
# and treshold are defined earlier in my code
for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    # rolling sum of the predictions over a window of size filtro
    y_pred_filtrado = pd.Series(y_pred).rolling(filtro, min_periods=1).sum()
    # output is positive only if the sum exceeds the threshold
    y_pred_filtrado = np.where(y_pred_filtrado > treshold, 1, 0)
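For completeness, here is a standalone toy version of the same rolling-window filter; the window size and threshold below are made-up illustration values, not the ones from my pipeline:
import numpy as np
import pandas as pd
# hypothetical noisy binary predictions from the classifier
y_pred = np.array([0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0])
window = 5      # look at the last 5 predictions (example value)
threshold = 3   # require at least 3 positives in the window (example value)
positives_in_window = pd.Series(y_pred).rolling(window, min_periods=1).sum()
y_filtered = np.where(positives_in_window >= threshold, 1, 0)
print(y_filtered)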
I have a dataframe with 36540 rows. The objective is to predict the target y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model but model doesn't seem to learn much.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x, label=y)
xg_reg = xgb.XGBRegressor(learning_rate=0.1, objectif='reg:linear', max_depth=5,
                          n_estimators=1000)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)
df = pd.DataFrame({'ACTUAL': y_test, 'PREDICTED': preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter, which doesn't exist in xgboost). However, you have to consider how xgboost works: it builds trees, and trees split on the values of the features. From the plot you show here, it looks like very few samples go above 0, so a random train/test split will likely give you a test set with virtually no samples above 0 (hence the horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
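Since parameter tuning was mentioned above, here is a minimal sketch of what a grid search over a couple of XGBRegressor parameters could look like, reusing X_train and y_train from the code above; the grid values are arbitrary examples, not recommendations:
import xgboost
from sklearn.model_selection import GridSearchCV
param_grid = {
    "max_depth": [3, 5, 7],              # example values
    "learning_rate": [0.05, 0.1, 0.3],   # example values
}
search = GridSearchCV(
    xgboost.XGBRegressor(objective="reg:squaredlogerror"),
    param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)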
I have a dataset, and I split it 60/20/20: 60% for training, 20% for testing and the remaining 20% for validation (I made the split because I need it).
I used the following code to split it (I think I found it on Stack Overflow) and applied a Naive Bayes classifier...
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20

# train is now 60% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio)

# split the remaining 40% in half: test and validation are each 20% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(
    x_test, y_test, test_size=test_ratio / (test_ratio + validation_ratio))

print(x_train, x_val, x_test)

gnb = GaussianNB()
gnb.fit(x_train, y_train)
predict = gnb.predict(x_test)
print(predict)
print("Accuracy: ", accuracy_score(y_test, predict))
I tried to use scikit-learn to make a learning curve, but according to the scikit-learn documentation the learning_curve function gives you train_sizes, train_scores and valid_scores.
This is kind of confusing; I'm new to scikit-learn and I don't have a clue how to make a learning curve with each percentage of the data I split.
Does anyone know how to use split data in scikit-learn's learning curves?
Thanks in advance.
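A minimal sketch of how scikit-learn's learning_curve could be pointed at the 60% training portion from the code above (assuming the same x_train/y_train and a GaussianNB model); learning_curve does its own internal cross-validation, so the separate validation split isn't needed for the curve itself:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), x_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")
# average over the CV folds and plot both curves
plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, valid_scores.mean(axis=1), label="validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()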
I'm trying to learn scikit-learn and Machine Learning by using the Boston Housing Data Set.
# I split the initial dataset ('housing_X' and 'housing_y')
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_X, housing_y, test_size=0.25, random_state=33)
# I scaled those two datasets
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train)
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
# I created the model
from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
train_and_evaluate(clf_sgd,X_train,y_train)
Based on this new model clf_sgd, I am trying to predict y for the first instance of X_train.
X_new_scaled = X_train[0]
print (X_new_scaled)
y_new = clf_sgd.predict(X_new_scaled)
print (y_new)
However, the result is quite odd to me (1.34032174, instead of 20-30, the range of house prices):
[-0.32076092 0.35553428 -1.00966618 -0.28784917 0.87716097 1.28834383
0.4759489 -0.83034371 -0.47659648 -0.81061061 -2.49222645 0.35062335
-0.39859013]
[ 1.34032174]
I guess that this 1.34032174 value should be scaled back, but I am trying to figure out how to do it with no success. Any tip is welcome. Thank you very much.
You can use inverse_transform using your scalery object:
y_new_inverse = scalery.inverse_transform(y_new)
Bit late to the game:
Just don't scale your y. When you scale y you actually lose your units, and the regression or loss optimization is really determined by the relative differences between the features anyway. By the way, for house prices (or any other monetary value) it is common practice to take the logarithm; then you obviously need a numpy.exp() to get back to the actual dollars/euros/yen...
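A minimal sketch of that log-target approach, reusing the unscaled X_train/X_test/y_train/y_test from the question; numpy's log1p/expm1 pair is used here instead of plain log/exp so zero values don't blow up:
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
# scale the features only; leave the target in its original units
scaler_X = StandardScaler().fit(X_train)
X_train_s = scaler_X.transform(X_train)
X_test_s = scaler_X.transform(X_test)
# train on log-prices, predict, then map back to prices
model = SGDRegressor(random_state=42)
model.fit(X_train_s, np.log1p(y_train))
y_pred = np.expm1(model.predict(X_test_s))  # back to dollars/euros/yen
print(y_pred[:5])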
I'm having a weird problem that may surprise you all: my classification rate is too high on my test set. I'm using scikit-learn packages, and I'm very suspicious of these classification rates, as they are very close to 1.
import numpy as np
from sklearn import metrics
from sklearn.cross_validation import train_test_split, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier

x = []
for nums in range(1, 100):
    X = maxDataset[:, 1:]
    y = maxDataset[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    cvIter = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2)
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    clf.fit(X, y)
    avg = metrics.accuracy_score(y_test, clf.predict(X_test))
    x.append(avg)
xMean = np.average(x)
print(xMean)
Is something wrong with my evaluation? I was suspicious that this could be due to fitting the model on the entire data set. If this is the case, or something else is the problem, how can I fix it to get an accurate evaluation of the classifier? The xMean ranges from 98.3 to 99.
Thanks
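If the suspicion about the second fit is right (clf.fit(X, y) refits the forest on all of the data, so the test rows are seen during training), a minimal leak-free sketch of the same loop would simply drop that line, for example:
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
scores = []
for _ in range(100):
    X = maxDataset[:, 1:]
    y = maxDataset[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)  # fit on the training split only
    # no clf.fit(X, y) here: refitting on the full data lets the model see the test rows
    scores.append(metrics.accuracy_score(y_test, clf.predict(X_test)))
print(np.average(scores))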