Post-processing in a machine learning classifier - python

I am having trouble finding material on this topic, mostly because I don't know exactly what to search for.
I am trying to take the output of my Logistic Regression classifier (a binary class for each sample of a time series) and filter it: the filter looks at a window of X consecutive predictions, and a sample is marked positive only if the number of positives in that window is greater than a given threshold.
My input is a time series with many features of a company process, such as currents, pressures and so on. I am trying to build a fault detection algorithm. Because my output is so noisy, I want to make it more consistent over time.
[Image: classifier pattern]

Solved.
import numpy as np
import pandas as pd
for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    y_pred = model.predict(X_test)
    # count the positives in each window of `filtro` consecutive predictions
    y_pred_filtrado = pd.Series(y_pred).rolling(filtro, min_periods=1).sum()
    # a sample is positive only if that count is greater than the threshold
    y_pred_filtrado = np.where(y_pred_filtrado > treshold, 1, 0)
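To see what the window filter does in isolation, here is a minimal sketch on a hand-made prediction vector. The helper name `smooth_predictions`, the toy array `noisy`, and the values `window=3` and `threshold=1` are assumptions for illustration only; they are not part of the original code.

```python
import numpy as np
import pandas as pd


def smooth_predictions(y_pred, window, threshold):
    """Mark a sample positive only if the trailing window holds more than `threshold` positives."""
    # Trailing rolling sum: each position counts the positives among the
    # current prediction and the (window - 1) predictions before it.
    counts = pd.Series(y_pred).rolling(window, min_periods=1).sum()
    return np.where(counts > threshold, 1, 0)


# Hypothetical noisy classifier output (not real process data)
noisy = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0])
smoothed = smooth_predictions(noisy, window=3, threshold=1)

print(noisy.tolist())
print(smoothed.tolist())
```

Isolated positives are suppressed and the sustained run in the middle survives. Because the window is trailing, the filtered positives start and end a little later than the raw ones.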

Related

Discretize continuous target variable using sklearn

I have to discretize a continuous target variable into at least 5 bins in order to lower the complexity of a classification model built with the sklearn library.
To do this I've used KBinsDiscretizer, but I don't know how to split the dataset into balanced parts now that I've discretized the target variable.
This is my code:
X = df.copy()
y = X.pop('shares')
# scaling the dataset so all data in the same range
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)
discretizer = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
y_discretized = discretizer.fit_transform(y.values.reshape(-1, 1))
# is this correct?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, stratify=y_discretized)
For completeness, I'm trying to recreate a less complex model than the one shown in: [1] K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal
Your y_train and y_test are parts of y, which has (it seems) the original continuous values. So you're ending up fitting multiclass classification models, with probably lots of different classes, which likely causes the crashes.
I assume what you wanted is
X_train, X_test, y_train, y_test = train_test_split(X, y_discretized, test_size=0.33, shuffle=True, stratify=y_discretized)
Whether discretizing a continuous target to turn a regression into a classification is a good idea is a topic for another site, see e.g. https://datascience.stackexchange.com/q/90297/55122
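A minimal sketch of the corrected split on synthetic data, to check that the binned labels end up in both the train and test targets with similar proportions. The arrays `X_demo` and `y_continuous` are made up for illustration (the original `shares` data is not used here), and the uniformly distributed target is an assumption chosen so that every bin is populated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-ins for the real features and continuous target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_continuous = rng.uniform(0, 100, size=1000)

# Bin the target into 5 ordinal classes (0..4)
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
y_binned = discretizer.fit_transform(y_continuous.reshape(-1, 1)).ravel().astype(int)

# Split the *binned* target and stratify on it as well
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_binned, test_size=0.33, shuffle=True, stratify=y_binned, random_state=0
)

# Class shares should be close to identical in both parts
train_share = np.bincount(y_tr, minlength=5) / len(y_tr)
test_share = np.bincount(y_te, minlength=5) / len(y_te)
print(train_share.round(3))
print(test_share.round(3))
```

With a heavily skewed target, `strategy="uniform"` can leave some bins with very few samples, and stratification fails when a class has fewer than two members, so it is worth checking the bin counts first.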

Why am I getting a very high test accuracy even when I test my dataset with a single feature

I am writing a small program in which I train a random forest to predict a binary value. My dataset has around 20,000 entries, and each entry has 25 features (continuous and categorical) with a binary target value to predict.
I am getting over 99% test accuracy which is surprisingly high. I tried to reduce the number of my features, even with two features I am still getting such high accuracy. I just want to make sure I am not doing anything wrong in my code, such as the training set leaking into my test set.
Here is the code snippet
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
data = pd.read_csv(r'test.csv')
data = data.drop_duplicates()
# splitting the data
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# preprocessing the dataset by one-hot encoding
l1 = OneHotEncoder(handle_unknown='ignore')
l1.fit(X_train)
X_train = l1.transform(X_train)
X_test = l1.transform(X_test)
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train.to_numpy())
# evaluation
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
Additionally, I forgot to add that my dataset is balanced, and the precision and recall scores are 100%!
This is quite a big dataset. How balanced is it? It might be that your test split consists mostly of entries with one label, and that the model fails whenever an entry has the other label. If so, accuracy is not a good measure to rely on here.
Have a look at this:
Difference of model accuracy and performance
Have a look at your confusion matrix and inspect your splits.
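One possible way to do both checks, sketched on synthetic data because the original test.csv is not available. `X_demo` and `y_demo` are generated stand-ins, and `stratify=y_demo` is an addition that the question's code does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real dataset
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)

# stratify keeps the label proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Inspect the splits: how many samples of each label ended up where?
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train label counts:", train_counts)
print("test label counts:", test_counts)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_te, y_hat)
print(cm)
```

If one row of the confusion matrix is almost empty, or one label dominates the test counts, a high accuracy says little about how the model handles the other label.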

Non-linear regression using XGBoost

I have a dataframe with 36540 rows. The objective is to predict y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate=0.1, objectif='reg:linear', max_depth=5,
                          n_estimators=1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist). However, you have to consider how xgboost works: it builds trees, and trees split on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So a random train/test split will likely produce a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485

Why does my PCA change every time I run the code in python?

I imputed the missing values in my dataframe with the median of each feature and scaled it using StandardScaler(). I ran regular KNeighbors with n=3 and the accuracy stays consistent.
Now I need to run PCA on the resulting dataset with n_components=4 and apply KNeighbors with 3 neighbors. However, the PCA dataset and the KNeighbors accuracy change every time I run the program, even though the master dataset itself doesn't change. I even tried using only the first 4 features of the dataset when applying KNeighbors, and even that is inconsistent.
data = pd.read_csv('dataset.csv')
y = merged['Life expectancy at birth (years)']
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=200)
for i in range(len(features)):
    featuredata = X_train.iloc[:,i]
    fulldata = data.iloc[:,i]
    fulldata.fillna(featuredata.median(), inplace=True)
    data.iloc[:,i] = fulldata
scaler = preprocessing.StandardScaler().fit(X_train)
data = scaler.transform(data)
If I apply KNeighbors here, it runs fine, and my accuracy score remains the same.
pcatest = PCA(n_components=4)
pca_data = pcatest.fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(pca_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
pca = neighbors.KNeighborsClassifier(n_neighbors=3)
pca.fit(X_train, y_train)
y_pred_pca = pca.predict(X_test)
pca_accuracy = accuracy_score(y_test, y_pred_pca)
However, my pca_accuracy score changes every time I run the code. What can I do to make it set and consistent?
first4_data = data[:,:4]
X_train, X_test, y_train, y_test = train_test_split(first4_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
first4 = neighbors.KNeighborsClassifier(n_neighbors=3)
first4.fit(X_train, y_train)
y_pred_first4 = first4.predict(X_test)
first4_accuracy = accuracy_score(y_test, y_pred_first4)
I am only taking the first 4 features/columns and the data should remain the same, but for some reason the accuracy score changes every time I run it.
You need to give random_state a value in train_test_split. Without it, the data is shuffled differently on every call, so each run trains and tests on a different split and you get a different result. In your code only the first train_test_split call sets random_state; the two later calls do not. Setting it is the equivalent of set.seed() in R.
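A small sketch of the effect, using made-up arrays (`X_demo`, `y_demo`) rather than the dataset from the question. The seed value 200 is simply the one already used in the question's first split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state select exactly the same test samples
_, _, _, y_te_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=200)
_, _, _, y_te_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=200)
same_split = np.array_equal(y_te_a, y_te_b)
print("identical test sets with a fixed seed:", same_split)

# Without random_state the selection is drawn afresh on each call,
# so the test set (and any score computed on it) can change between runs
_, _, _, y_te_c = train_test_split(X_demo, y_demo, test_size=0.3)
print("unseeded test set:", y_te_c)
```

Passing the same random_state to the PCA split and the first-4-features split should make both accuracy scores repeatable.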

Found input variables with inconsistent numbers of samples error

I wrote the following code to compute the score of a machine learning method, but I get the following error. What could be the reason?
veri = pd.read_csv("deneme2.csv")
veri = veri.drop(['id'], axis=1)
y = veri[['Rating']]
x = veri.drop(['Rating','Genres'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
ytahmin = DTR.predict(x)
DTR.fit(veri[['Reviews','Size','Installs','Type','Price','Content Rating','Category_c']],veri.Rating)
basari_DTR = DTR.score(X_test,y_test)
#print("DecisionTreeRegressor: accuracy of", basari_DTR*100, "percent")
a = np.array([159,19000000.0,10000,0,0.0,0,0]).reshape(1, -1)
predict_DTR = DTR.predict(a)
print(f1_score(y_train, y_test, average='macro'))
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]
There are at least two issues with your code.
The first error you report
print(f1_score(y_train, y_test, average='macro'))
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]
is due to your y_train and y_test having different lengths, as already pointed out in the other answer.
But this is not the main issue here, because, even if you change y_train to y_pred, as suggested, you get a new error:
print(f1_score(y_pred, y_test, average='macro'))
Error: continuous is not supported
This is simply because you are in a regression setting, while the f1 score is a classification metric and, as such, it does not work with continuous predictions.
In other words, f1 score is inappropriate for your (regression) problem, hence the error.
Check the list of metrics available in scikit-learn, where you can confirm that f1 score is used only in classification, and pick up another metric suitable for regression problems.
For a more detailed exposition about what happens when choosing inappropriate metrics in scikit-learn, see Accuracy Score ValueError: Can't Handle mix of binary and continuous target
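As an illustration of swapping in regression metrics, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are generated for the example and have nothing to do with deneme2.csv, and the choice of mean absolute error and R2 is just one reasonable option among the regression metrics scikit-learn provides.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

# Fit once on the training part, predict on the test part only
reg = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
y_hat = reg.predict(X_te)

# Both metrics compare true test targets with predictions of the same length
mae = mean_absolute_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)
print(f"MAE: {mae:.3f}")
print(f"R2: {r2:.3f}")
```

Note that both arguments come from the test split, so their lengths match and the inconsistent-samples error does not arise.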
f1_score needs to take true y from test and the one you predicted on test set, hence last lines should be:
DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
y_pred = DTR.predict(X_test)
print(f1_score(y_pred, y_test, average='macro'))
You shouldn't call fit twice and the shape of your predictions has to be of the same length as test, see some sklearn basic tutorials for more info.
