I have a variable, and I need to predict its value as closely as possible, but never above it. For example, given y_true = 9000, I want y_pred to be any value in the range [0, 9000], as close to 9000 as possible. Likewise, if y_true = 8000, y_pred should lie in [0, 8000]. In other words, I want to impose an upper bound on each predicted value, and that bound is individual to each prediction/target pair in the sample: if y_true = [8750,9200,8900,7600], then y_pred should be [<=8750,<=9200,<=8900,<=7600]. The task is to never overshoot and to get as close as possible. Zero always counts as a "correct" answer, but I need to get as close to the target as I can.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

data, target = np.array(data), np.array(df_tar)
X_train, X_test, y_train, y_test = train_test_split(data, target)
gbr = GradientBoostingRegressor(max_depth=1, n_estimators=100)
%time gbr.fit(X_train, np.ravel(y_train))
print(gbr.score(X_test, y_test), gbr.score(X_train, y_train))
Because building such a constraint directly into sklearn's models would be quite involved, I strongly suggest you apply a filter after the prediction: replace every predicted value that exceeds its corresponding target with that target. Afterwards, compute the score manually, which I believe is MSE in this scenario.
Here is a full working example of my approach:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as mse
import numpy as np
X = [[8500,9500],[9200,8700],[8500,8250],[5850,8800]]
y = [8750,9200,8900,7600]
data, target = np.array(X),np.array(y)
gbr = GradientBoostingRegressor(max_depth=1,n_estimators=100)
gbr.fit(data,np.ravel(target))
predictions = gbr.predict(data)
print(predictions) ## The original predictions
Output:
[8750.14958301 9199.23464805 8899.87846735 7600.73730159]
Perform the replacement:
fixed_predictions = np.array([z if y>z else y for y,z in zip(target,predictions)])
print(fixed_predictions)
[8750. 9199.23464805 8899.87846735 7600. ]
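As a side note, the same per-sample clipping can be written in vectorized form with NumPy's element-wise minimum (a small sketch, equivalent to the list comprehension above):
fixed_predictions = np.minimum(predictions, target)  # cap each prediction at its own target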
Compute the new score:
score = mse(target, fixed_predictions)
print(score)
Output (approximately, given the predictions above):
0.1501
So I'm trying to find a simple approach (not Dijkstra's algorithm) to a shortest-path problem.
Without reproducing everything: I have 3 paths and 50 samples of them (i.e. shape (50,3)), and I have identified the shortest path for each sample using the min function.
For x_train I have
newx_train = np.zeros((50,3))
newx_train[:,0] = p1_train
newx_train[:,1] = p2_train
newx_train[:,2] = p3_train
(p1_train, p2_train and p3_train are just randomly generated numbers.)
and subsequently y_train (since I'm generating it, I pass the min through it):
newy_train = np.zeros((50,3))  # initialise, then flag the shortest path in each row
newy_train[np.arange(newx_train.shape[0]),newx_train.argmin(axis=1)] = 1
print(newy_train)
(newy_train has a 1 in each row at the position of that row's minimum value.)
So I get something like
[[1,0,0],
[0,1,0],
[1,0,0],
[0,0,1]]
Based on the generated x_train and y_train, I am trying to fit SVM and logistic regression models to see how well they perform for multi-class prediction, and then I'll compute the confusion matrix and accuracy.
My question is: how do I go about multi-class classification with logistic regression? When I fit it on x_train and y_train, Python understandably throws an error that y should be a 1-D array but got shape (50,3) instead.
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial')
LogReg.fit(newx_train,newy_train[:,0])
ylog_pred = LogReg.predict(newx_test)
print(ylog_pred)
The above code works for the binary case (only 2 paths), since predicting '1' for one column (index 0) implies the other column is '0'. But this does not work for multi-class. Could anyone help with it?
I think you're just missing how to interpret the y.
LogisticRegression expects the y column to not be one-hot encoded and to actually be the target labels, so you need something like
newy_train = np.argmax(newy_train, axis=1) # index of max across each row
Then you should be able to fit something with
LogReg.fit(newx_train,newy_train)
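For completeness, here is a minimal self-contained sketch of that workflow; the randomly generated data is just an assumption to make it runnable end to end:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
newx_train = rng.rand(50, 3)                                 # 50 samples, 3 candidate path lengths
newy_train = np.zeros((50, 3))
newy_train[np.arange(50), newx_train.argmin(axis=1)] = 1     # one-hot: 1 marks the shortest path
labels = np.argmax(newy_train, axis=1)                       # convert one-hot rows to class indices 0, 1, 2

LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
LogReg.fit(newx_train, labels)
print(LogReg.predict(newx_train[:5]))                        # predictions come back as class indices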
I've been working with LSTMs to make stock price predictions for my machine learning class and I'm quite new to programming so I'd appreciate your patience!
Anyway, I have this function that generates an accuracy score, and I'm trying to better understand some of its components. Namely, what purpose do the data transformations and the lambda functions serve?
Here's the function:
def accuracy(model, data):
    y_test = data["y_test"]
    X_test = data["X_test"]
    y_pred = model.predict(X_test)
    y_test = np.squeeze(data["column_scaler"]["Close"].inverse_transform(np.expand_dims(y_test, axis=0)))
    y_pred = np.squeeze(data["column_scaler"]["Close"].inverse_transform(y_pred))
    y_pred = list(map(lambda current, future: int(float(future) > float(current)), y_test[:-LOOKUP_STEP], y_pred[LOOKUP_STEP:]))
    y_test = list(map(lambda current, future: int(float(future) > float(current)), y_test[:-LOOKUP_STEP], y_test[LOOKUP_STEP:]))
    return accuracy_score(y_test, y_pred)
I'm curious what the difference would be if I just used a function like this:
def accuracy(model, data):
    y_test = data["y_test"]
    X_test = data["X_test"]
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)
Well, you should describe your data and model output better. I suppose the data is a pandas dataframe and you are using sklearn to preprocess your data.
First, the data needed to be normalized, so I suppose somewhere in your code there is a MinMaxScaler or some similar sklearn transformation that mapped an entire column of your dataframe to values between 0 and 1; from your code it looks like that is the column_scaler. The function therefore has to un-normalize those values back into real (cash) values using inverse_transform. So a y_pred value of 0 becomes the lowest value you originally had in your data, say $10.29.
Then the price predictions are converted to boolean (0/1) values by checking whether the price in the future is higher than the price in the present (looking LOOKUP_STEP ticks ahead).
With these arrays, which only say whether the price goes up or not, the function calculates the accuracy score (accuracy_score from sklearn), i.e. the fraction of up/down movements you predicted correctly.
If you skipped those post-processing steps, accuracy_score would not work as intended: it expects discrete class labels, so passing it continuous (float) price values would raise an error.
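To make those two steps concrete, here is a minimal sketch on toy data; the scaler, the price values and LOOKUP_STEP below are assumptions for illustration, not your actual pipeline:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

LOOKUP_STEP = 1
prices = np.array([10.29, 11.0, 10.5, 12.3, 12.1]).reshape(-1, 1)
scaler = MinMaxScaler().fit(prices)            # maps prices into [0, 1]

scaled = scaler.transform(prices)
unscaled = scaler.inverse_transform(scaled)    # back to cash values, e.g. 0.0 -> 10.29

# direction labels: 1 if the price LOOKUP_STEP steps ahead is higher, else 0
y = unscaled.ravel()
direction = (y[LOOKUP_STEP:] > y[:-LOOKUP_STEP]).astype(int)
print(direction)                               # discrete labels of the kind accuracy_score expects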
NOTE: I appreciate the massive quantity of comments suggesting that this is an inappropriate way to quantify model performance. However, this is irrelevant to my error, and the error occurs for a variety of other metrics. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question".
I have an sklearn logistic regression model for which I am attempting to get the RMSE. However, when I call .predict_proba, I get a matrix of probabilities (one column per class), while my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)
predict_proba predicts the probability that a sample belongs to each class. The argmax of those probabilities is the predicted class (the categorical form). RMSE is not a standard metric for classification; if you want to evaluate your model, consider a different metric such as accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
The Brier score, which is essentially the mean squared error of the predicted probabilities, is a known and valid loss function for classification models that output probability scores; I would take a look at that as well.
For your particular issue, you want to compare the probabilities returned for your target class, e.g. for a binary classification problem:
from sklearn.metrics import brier_score_loss
probs = your_model.predict_proba(X_test)
brier_score_loss(y_true, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, one-hot encode your y_true, i.e. give each class its own column, and call
sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
Here is how you can calculate RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error
x = np.arange(10)
y = x
rmse = np.sqrt(mean_squared_error(x, y))
One can transform the y_test into a format compatible with the predict_proba output as follows:
model = sklearn.linear_model.LogisticRegression().fit(X,y) # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(label_encoder.transform(y_test).reshape((-1,1)))
You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
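For example, a quick sketch of getting an RMSE-style number out of this (pred_proba is an illustrative name; the model, X_test and y_test_onehot are assumed from the snippet above):
import numpy as np
import sklearn.metrics

pred_proba = model.predict_proba(X_test)   # shape (n_samples, n_classes), columns ordered like model.classes_
rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_test_onehot.toarray(), pred_proba))
print(rmse)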
You can access the data set at this link: https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing
My task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter; I only want to know whether it's going up or down, so I frame it as a classification problem.
Since this data set is time-series data, I have run into many problems. I have read articles about them, for example that I can't use ordinary k-fold cross-validation on time series because the order of the data can't be ignored.
My code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
lag1 = pd.read_csv(..., parse_dates=['Date'])  # local file path goes here
# Trend: True if the price is going up, otherwise False
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()
train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]
variable_to_use= ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]
#SVM Lag1
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
#Check prediction results
clf.predict(X_test)
First of all, is my approach right: generating a column of True/False labels first? I am afraid the model can't make sense of this column if I simply feed it in. Should I instead first perform a regression and then compare the numeric results to generate a list of up/down labels?
The accuracy on the training set is very low, at 0.58. Also, clf.predict(X_test) gives me an array of all Trues, and I don't know why.
I'm also not sure how the accuracy is actually calculated: I suspect my current accuracy only counts the number of Trues and Falses while ignoring their order? Since this is time-series data, ignoring the order would be wrong and would tell me nothing about predicting price movement. Say I have 40 examples in the test set and I got 20 Trues; I would get 50% accuracy, but I suspect those Trues are not in the same positions as in the ground truth. (Tell me if I am wrong.)
I am also considering using Gradient Boosted Tree to do the classification, would it be better?
Some preprocessing of this data would probably be helpful. Step one might go something like:
import pandas as pd
import numpy as np

df = pd.read_csv('YOURLOCALFILEPATH', header=0)
# more code than your method, but it labels each row as 0 or 1 and is easy to write out to a new file for later reference
df['Date'] = pd.to_datetime(df['date'], unit='d')
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] = np.where(df['XLF'] > df['compare'], 1, 0)
df.drop('compare', axis=1, inplace=True)
Step two can use one of sklearn's built in scalers, such as the MinMax scaler, to preprocess the data by scaling your feature inputs before feeding it into your model.
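For step two, a minimal sketch of what that scaling could look like, reusing the train/test split and feature columns from your question (treat the column list as an assumption):
from sklearn.preprocessing import MinMaxScaler

feature_cols = ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(train[feature_cols])   # fit the scaler on the training data only
X_test_scaled = scaler.transform(test[feature_cols])         # apply the same scaling to the test data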
I'm doing a simple linear model. I have
fire = load_data()
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, fire.data, fire.target, cv=10, scoring='r2')
print scores
which yields
[ 0.00000000e+00 0.00000000e+00 -8.27299054e+02 -5.80431382e+00
-1.04444147e-01 -1.19367785e+00 -1.24843536e+00 -3.39950443e-01
1.95018287e-02 -9.73940970e-02]
How is this possible? When I do the same thing with the built in diabetes data, it works perfectly fine, but for my data, it returns these seemingly absurd results. Have I done something wrong?
There is no reason r^2 can't be negative (despite the ^2 in its name). This is also stated in the docs. You can think of r^2 as a comparison of your model's fit (in the context of linear regression, e.g. a model of order 1, i.e. affine) against a model of order 0 (just fitting a constant), both fitted by minimizing a squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross-validation with held-out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can make your model's squared prediction error much larger than that of simply predicting the mean of the test data, which results in a negative r^2 score.
In the worst case, if your data do not explain your target at all, these scores can become very strongly negative. Try
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(100, 80)
y = rng.randn(100) # y has nothing to do with X whatsoever
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
This should result in negative r^2 values.
In [23]: scores
Out[23]:
array([-240.17927358, -5.51819556, -14.06815196, -67.87003867,
-64.14367035])
The important question now is whether this is due to the fact that linear models just do not find anything in your data, or to something else that may be fixed in the preprocessing of your data. Have you tried scaling your columns to have mean 0 and variance 1? You can do this using sklearn.preprocessing.StandardScaler. As a matter of fact, you should create a new estimator by concatenating a StandardScaler and the LinearRegression into a pipeline using sklearn.pipeline.Pipeline.
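A minimal sketch of such a pipeline, reusing the X and y from the snippet above (the cross_val_score call mirrors the one already shown; in recent sklearn versions it lives in sklearn.model_selection rather than sklearn.cross_validation):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

model = Pipeline([
    ('scale', StandardScaler()),    # centre each feature and scale it to unit variance
    ('regr', LinearRegression()),
])
scores = cross_val_score(model, X, y, cv=5, scoring='r2')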
Next you may want to try Ridge regression.
Just because R^2 can be negative does not mean it should be.
Possibility 1: a bug in your code.
A common bug that you should double check is that you are passing in parameters correctly:
r2_score(y_true, y_pred) # Correct!
r2_score(y_pred, y_true) # Incorrect!!!!
Possibility 2: small datasets
If you are getting a negative R^2, you could also check for overfitting. Keep in mind that cross_validation.cross_val_score() does not randomly shuffle your inputs, so if your samples are inadvertently sorted (by date, for example) then you might build models on each fold that are not predictive for the other folds.
Try reducing the number of features, increasing the number of samples, and decreasing the number of folds (if you are using cross_validation). While there is no official rule here, your m x n dataset (where m is the number of samples and n is the number of features) should have a shape where
m > n^2
and when you are using cross-validation with f folds, you should aim for
m/f > n^2
(For example, with n = 10 features and 5 folds, this rule of thumb suggests at least m = 500 samples.)
R² = 1 - RSS / TSS, where RSS is the residual sum of squares ∑(y - f(x))² and TSS is the total sum of squares ∑(y - mean(y))². Now for R² ≥ -1, it is required that RSS/TSS ≤ 2, but it's easy to construct a model and dataset for which this is not true:
>>> x = np.arange(50, dtype=float)
>>> y = x
>>> def f(x): return -100
...
>>> rss = np.sum((y - f(x)) ** 2)
>>> tss = np.sum((y - y.mean()) ** 2)
>>> 1 - rss / tss
-74.430972388955581
If you are getting negative regression r^2 scores, make sure to remove any unique identifier (e.g. "id" or "rownum") from your dataset before fitting/scoring the model. Simple check, but it'll save you some headache time.
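For instance, a tiny sketch, assuming your features live in a pandas DataFrame named df (the column names are hypothetical):
# drop identifier-like columns before fitting; 'id' and 'rownum' are just example names
X = df.drop(columns=['id', 'rownum'], errors='ignore')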