I am trying to predict car prices using the following DataFrame. Data types: model, transmission, and fuelType are object; the rest are float/int.
The full dataset contains ~6500 samples (only data.head() is shown here):
model year price transmission mileage fuelType tax mpg engineSize
0 Bolt 2016 16000 Manual 24089 Petrol 265 36.2 2.0
1 Bolt 2017 15995 Manual 18615 Petrol 145 36.2 2.0
2 Bolt 2015 13998 Manual 27469 Petrol 265 36.2 2.0
3 Bolt 2017 18998 Manual 14736 Petrol 150 36.2 2.0
4 Bolt 2017 17498 Manual 36284 Petrol 145 36.2 2.0
I start by dropping all duplicates and encoding the categorical variables:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Drop duplicates
data = data.drop_duplicates(keep="first")

# Categorical variable encoding
cat_features = ["model", "transmission", "fuelType"]
encoder = LabelEncoder()
encoded = data[cat_features].apply(encoder.fit_transform)
data = data.drop(cat_features, axis=1)
data = pd.concat([encoded, data], axis=1)
Output:
model transmission fuelType year price mileage tax mpg engineSize
0 1 1 3 2016 16000 24089 265 36.2 2.0
1 1 1 3 2017 15995 18615 145 36.2 2.0
2 1 1 3 2015 13998 27469 265 36.2 2.0
3 1 1 3 2017 18998 14736 150 36.2 2.0
4 1 1 3 2017 17498 36284 145 36.2 2.0
Following the scikit-learn documentation (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html), I tried regression using Lasso, ElasticNet, Ridge, and SVR.
I got the best results using Ridge regression (see code below), with an R^2 of 0.79 and an RMSE of 2941.73. However, my success criterion is predicting the price within a certain range of the actual price (e.g. +/- 1000); a small check of this criterion is sketched after the Ridge output below.
Even with the Ridge model below, most predictions don't make the cut. Do you have any ideas on how I could optimize the regression? Have I made any mistakes in the Ridge regression below or with the hyperparameters? Are there more appropriate models for this case?
Ridge:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import decomposition, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# All columns except price (column 4) are used as features
X = data.iloc[:, [0, 1, 2, 3, 5, 6, 7, 8]]
y = data.iloc[:, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

std_slc = StandardScaler()
pca = decomposition.PCA()
ridge = linear_model.Ridge()
pipe = Pipeline(steps=[("std_slc", std_slc),
                       ("pca", pca),
                       ("ridge", ridge)])

n_components = list(range(1, X.shape[1] + 1, 1))
parameters = dict(pca__n_components=n_components,
                  ridge__solver=["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"],
                  ridge__alpha=np.linspace(0, 1, 11),
                  ridge__fit_intercept=[True, False])

clf = GridSearchCV(pipe, parameters, scoring='r2', verbose=1)
clf.fit(X_train, y_train)
y_pred_ridge = clf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, y_pred_ridge)))  # RMSE
print(r2_score(y_test, y_pred_ridge))
Output:
Fitting 5 folds for each of 56 candidates, totalling 280 fits
2941.734786303254
0.7909623313908631
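To measure the +/- 1000 success criterion directly (rather than only RMSE/R^2), here is a minimal sketch, assuming y_test and y_pred_ridge from the Ridge code above and an illustrative threshold of 1000:
import numpy as np

# Share of test predictions that land within +/- 1000 of the actual price
threshold = 1000
within_range = np.abs(np.asarray(y_test) - y_pred_ridge) <= threshold
print(f"{within_range.mean():.1%} of predictions are within +/- {threshold} of the actual price")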
Voting Regressor:
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import RandomizedSearchCV

eclf = VotingRegressor(estimators=[
    ('ridge', linear_model.Ridge()),
    ('lasso', linear_model.Lasso()),
    ('elasticnet', linear_model.ElasticNet())
])

# Use the estimator's key followed by __ and the parameter name
params_eclf = {'ridge__solver': ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"],
               'lasso__selection': ["cyclic", "random"],
               'elasticnet__selection': ['cyclic', 'random'],
               'ridge__alpha': np.linspace(0, 1, 10),
               'ridge__fit_intercept': [True, False]}

grid_eclf = RandomizedSearchCV(estimator=eclf, param_distributions=params_eclf, cv=3, n_iter=250, verbose=1, scoring='r2')
grid_eclf.fit(X_train, y_train)
y_pred_eclf = grid_eclf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,y_pred_eclf)))
print(r2_score(y_test, y_pred_eclf))
Output:
Fitting 3 folds for each of 250 candidates, totalling 750 fits
3082.2257067911637
0.7705191776922907
I've been dealing with a similar task before, and judging by my personal experience:
RMSE is outlier-sensitive. You can cheat a bit by excluding extreme prices (and/or check out MAE instead).
Categorical features like model aren't ordinal, so label encoding is not so great in this case. Mean target encoding might improve the results; as a side effect, the fuel type/gearbox features might turn out to be redundant. (A sketch of this follows below.)
Linear models, KNN and SVR weren't performing so well; random forest and gradient boosting were the best (as they often are), and a single decision tree had an acceptable result as well.
(This was not the exact same dataset, however, so YMMV.)
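A minimal sketch of the mean-target-encoding plus gradient-boosting suggestion, assuming data still holds the raw (un-encoded) columns shown at the top of the question; the choice of GradientBoostingRegressor and MAE is illustrative, not the only option:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = data.drop(columns="price")
y = data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Mean target encoding: replace each category by the mean price seen for it in the training split
for col in ["model", "transmission", "fuelType"]:
    means = y_train.groupby(X_train[col]).mean()
    X_train[col] = X_train[col].map(means)
    X_test[col] = X_test[col].map(means).fillna(y_train.mean())  # unseen categories fall back to the global mean

gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)
pred = gbr.predict(X_test)
print(mean_absolute_error(y_test, pred))  # MAE is less outlier-sensitive than RMSE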
Related
I am using a random forest regression model on the following data.
However, the R-squared value I get is -0.22, which seems very low.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rsquaredscores = []
x = correcteddf[['sleepiness_resolution_index']]
y = correcteddf[['distance']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, train_size=0.7, random_state=0)
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
rsquaredscore = r2_score(y_test, y_pred)
rsquaredscores.append(rsquaredscore)
print(rsquaredscores)
When I use the same data in a linear regression model:
import statsmodels.api as sm

resultmodeldistancevariation2sleep = sm.OLS(y, x).fit()
resultmodeldistancevariation2sleepsummary = resultmodeldistancevariation2sleep.summary()
print(resultmodeldistancevariation2sleepsummary)
I get a much higher R-squared value: 0.027.
Is it normal for one of these values to be positive and one to be negative? Am I right in saying that the negative value basically indicates very poor model fit? Or am I going wrong somewhere?
I would be so grateful for a helping hand!
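As background on the metric itself (not an answer about this particular dataset): scikit-learn's r2_score compares the model against a baseline that always predicts the mean of the test targets, so it can go negative when the predictions are worse than that baseline. A small standalone sketch with toy numbers:
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0: identical to always predicting the mean
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0: worse than the mean baseline, hence negative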
correcteddf['sleepiness_resolution_index'].head(10)
sleepiness_resolution_index
0 0.555556
1 1.500000
3 1.000000
6 0.900000
7 0.857143
8 0.875000
9 0.300000
10 0.750000
12 1.400000
13 2.333333
correcteddf[['distance']].head(10)
distance
0 -0.251893
1 0.072180
3 -0.555438
6 0.199593
7 -0.378295
8 -0.162809
9 -0.108010
10 0.162275
12 -0.007996
13 4.637838
I have some data that contains difficulty scores for tests plus some features. Example (the numbers are random, my real data has about 800 rows and 8 columns):
question time_needed media_existent frequency_changed_answers score
abc 3545 0 1.25 0.79
dff 3574 0 2.80 0.03
xyz 1123 0 4.50 0.60
mno 7000 1 3.77 1.00
pqr 4656 0 1.00 0.99
stv 4367 0 2.73 0.33
The score is between 0 and 1. The closer to 1, the easier the question. The frequency of changed answers is how many times the answers have been changed before submission (the student was undecided) divided by how many times the question was answered (some questions are more popular).
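A one-line sketch of that ratio, with hypothetical column names (n_answer_changes and n_times_answered are not in the real data, just illustrative):
# frequency_changed_answers = changes before submission / times the question was answered
df["frequency_changed_answers"] = df["n_answer_changes"] / df["n_times_answered"]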
Just like in this example, I applied three methods (random forest feature importances, permutation importance, and SHAP) to figure out which features are the most important. All three rank this frequency as the most important, then the time, then whether the test contains media.
For random forest:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

list_of_columns = ['time_needed', 'media_existent', 'frequency_changed_answers']
X = df_random_forest[list_of_columns]
target_column = 'score'
y = df_random_forest[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
rf.feature_importances_
But the following score is only 0.2932189613132453:
rf.score(X_test, y_test)
Also:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf, X, y, cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
>>0.25 accuracy with a standard deviation of 0.05.
What could be the problem?
I want to run CatBoost on my Titanic dataset, which consists mostly of categorical data and has a binary target.
My data looks like:
train.head()
Embarked Pclass Sex Survived IsCabin Deck IsAlone IsChild Title AgeBin FareBin
0 S 3 male 0.0 0 Unknown 0 1 Mr Young Low
1 C 1 female 1.0 1 C 0 1 Mrs Adult High
2 S 3 female 1.0 0 Unknown 1 1 Miss Young Mid low
3 S 1 female 1.0 1 C 0 1 Mrs Adult High
4 S 3 male 0.0 0 Unknown 1 1 Mr Adult Mid low
I did:
import numpy as np
import catboost
from sklearn.model_selection import train_test_split

# Get train and validation sub-datasets
x = train.drop(["Survived"], axis=1)
y = train["Survived"]

# Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Get indices of categorical features
cat_features_indices = np.where(x.dtypes != float)[0]

model = catboost.CatBoostClassifier(
    one_hot_max_size=7,
    iterations=100,
    random_seed=42,
    verbose=False,
    eval_metric='Accuracy'
)

pool = catboost.Pool(X_train, y_train, cat_features_indices)
cv_scores = catboost.cv(pool, model.get_params(), fold_count=10, plot=True)
...which returns:
CatBoostError: catboost/libs/metrics/metric.cpp:4617: loss [RMSE] is
incompatible with metric [Accuracy] (no classification support)
Help would be appreciated. I'm a bit confused by the error. Thanks!
Looks like CatBoost is referring to the default loss_function parameter.
In your code, model.get_params() will not contain a value for loss_function, which then seems to default to RMSE (it shouldn't for a classifier, but apparently does for some reason).
If you look at the classification loss functions, there are only two valid choices for optimization - Logloss and CrossEntropy; the rest are metrics that only get reported. See https://catboost.ai/docs/concepts/loss-functions-classification.html
If you add the parameter loss_function='Logloss' to your CatBoostClassifier initialization, it should then work.
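A minimal sketch of that fix, reusing the pool from the question; only loss_function='Logloss' is new, everything else is as posted:
import catboost

model = catboost.CatBoostClassifier(
    loss_function='Logloss',  # explicit classification loss, so catboost.cv no longer falls back to RMSE
    eval_metric='Accuracy',
    one_hot_max_size=7,
    iterations=100,
    random_seed=42,
    verbose=False
)
cv_scores = catboost.cv(pool, model.get_params(), fold_count=10, plot=True)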
TCatBoostOptions.LossFunctionDescription is initialized with RMSE as the default value.
catboost.cv() internally triggers an assertion in CheckMetrics if loss_function is not set.
It seems to be a bug in catboost.
Dataset
Columns 0-9: float features (parameters of a product)
Column 10: int labels (products)
Goal
Calculate a 0-1 classification certainty score for the labels (this is what my current code should do)
Calculate the same certainty score for each "product_name" (300 columns) for each of the 22,000 rows
ERROR
I use sklearn.tree.DecisionTreeClassifier and am trying to use predict_proba, but it gives an error.
Python CODE
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

data_train = pd.read_csv('data.csv')
features = data_train.columns[:-1]
labels = data_train.columns[-1]
x_features = data_train[features]
x_label = data_train[labels]
X_train, X_test, y_train, y_test = train_test_split(x_features, x_label, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
class_probabilitiesDec = clf.predict_proba(y_train)
# ERROR: ValueError: Number of features of the model must match the input. Model n_features is 10 and input n_features is 16722
print('Decision Tree Classification Accuracy Training Score (max_depth=3): {:.2f}'.format(clf.score(X_train, y_train)*100) + ('%'))
print('Decision Tree Classification Accuracy Test Score (max_depth=3): {:.2f}'.format(clf.score(X_test, y_test)*100) + ('%'))
print(class_probabilitiesDec[:10])
# If I use X_train instead, it just prints out a bunch of 41-element vectors:
[[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
[....
FEATURES (COLUMNS)
(the last column contains the labels)
0 1 1 1 1.0 1462293561 1462293561 0 0 0.0 0.0 1
1 2 2 2 8.0 1460211580 1461091152 1 1 0.0 0.0 2
2 3 3 3 1.0 1469869039 1470560880 1 1 0.0 0.0 3
3 4 4 4 1.0 1461482675 1461482675 0 0 0.0 0.0 4
4 5 5 5 5.0 1462173043 1462386863 1 1 0.0 0.0 5
CLASSES COLUMNS (300 COLUMNS OF ITEMS)
HEADER ROW: apple gameboy battery ....
SCORE in 1st row: 0.763 0.346 0.345 ....
SCORE in 2nd row: 0.256 0.732 0.935 ....
E.g. similar to the confidence scores produced when an image classifier distinguishes cat vs. dog.
You cannot predict the probability of your labels.
predict_proba predicts the probability of each label from your X data, thus:
class_probabilitiesDec = clf.predict_proba(X_test)
What you posted as "when I use X_train":
[[ 0.00490808 0.00765327 0.01123035 0.00332751 0.00665502 0.00357707
0.05182597 0.03169453 0.04267532 0.02761833 0.01988187 0.01281091
0.02936528 0.03934781 0.02329257 0.02961484 0.0353548 0.02503951
0.03577073 0.04700108 0.07661592 0.04433907 0.03019715 0.02196157
0.0108976 0.0074869 0.0291989 0.03951418 0.01372598 0.0176358
0.02345895 0.0169703 0.02487314 0.01813493 0.0482489 0.01988187
0.03252641 0.01572249 0.01455786 0.00457533 0.00083188]
is a list with one probability for each possible label.
EDIT
After reading your comments, predict_proba is exactly what you want.
Let's make an example. In the following code we have a classifier with 3 classes: 11, 12, or 13.
If the input is 1 the classifier should predict 11
If the input is 2 the classifier should predict 12
...
If the input is 7 the classifier should predict 13
clf = DecisionTreeClassifier()
clf.fit([[1],[2],[3],[4],[5],[6],[7]], [[11],[12],[13],[13],[12],[11],[13]])
Now, if you have test data with a single row, e.g. 5, the classifier should predict 12. So let's try that.
clf.predict([[5]])
And voila: the result is array([12])
If we want a probability, then predict_proba is the way to go:
clf.predict_proba([[5]])
and we get [array([0., 1., 0.])]
In that case the array [0., 1., 0.] means :
0% probability for class 11
100% probability for class 12
0% probability for class 13
If I'm correct, that's exactly what you want.
You can even map that to the names of your classes with:
probabilities = clf.predict_proba([[5]])[0]
{clf.classes_[i] : probabilities[i] for i in range(len(probabilities))}
which gives you a dictionary with probabilities for class names:
{11: 0.0, 12: 1.0, 13: 0.0}
Now, in your case you have many more classes than just [11, 12, 13], so the array gets longer. And for every row in your dataset, predict_proba creates an array, so for more than a single row of data your output becomes a matrix (see the sketch below).
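A minimal sketch of that multi-row case, reusing the toy classifier fitted above (the three input rows are just examples):
# Several test rows at once: predict_proba returns one row of class probabilities per input row
probs = clf.predict_proba([[1], [5], [7]])
for row in probs:
    print({cls: p for cls, p in zip(clf.classes_, row)})
# e.g. {11: 1.0, 12: 0.0, 13: 0.0} for input 1, {11: 0.0, 12: 1.0, 13: 0.0} for input 5, ...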
I'm learning how to use Lasso and Ridge with sklearn in Python. I am given the folds in a column. I want to find the best parameter based on 5-fold cross-validation.
My data looks like the following:
mpg cylinders displacement horsepower weight acceleration origin fold
0 18 8 307 130 3504 12.0 1 3
1 15 8 350 165 3693 11.5 1 0
2 18 8 318 150 3436 11.0 1 2
3 16 8 304 150 3433 12.0 1 2
4 17 8 302 140 3449 10.5 1 3
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
mpg is the y/target variable and the other columns are the predictors; the last column contains the folds. I want to run Lasso and Ridge and find the best parameter. The problem I am having is incorporating the specified folds into the cross-validation. Here is what I have so far (for Lasso):
from sklearn.linear_model import Lasso, LassoCV
lasso_model = LassoCV(cv=5, alphas=reg_para)
lasso_fit = lasso_model.fit(X,y)
Is there a simple way to incorporate the fold splits? Any help is greatly appreciated
If your data are in a pandas dataframe, then all you need to do is access that column
fold_labels = df["fold"]
from sklearn.cross_validation import LeaveOneLabelOut
cv = LeaveOneLabelOut(fold_labels)
lasso_model = LassoCV(cv=cv, alphas=reg_para)
So if you obtain the fold labels in an array fold_labels, you can just use LeaveOneLabelOut (sorry for the non-functional code; it should be sufficient to elucidate the idea, though).
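The LeaveOneLabelOut API above comes from an older scikit-learn release; a minimal sketch of the same idea with current scikit-learn uses PredefinedSplit, so each row is tested in the fold given by its fold column (df and the column layout are assumed from the question):
from sklearn.linear_model import LassoCV
from sklearn.model_selection import PredefinedSplit

# df: the DataFrame shown in the question, assumed to be loaded already
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
X = df.drop(columns=["mpg", "fold"])  # predictors
y = df["mpg"]                         # target

# Rows with fold == k form the test set of split k
cv = PredefinedSplit(test_fold=df["fold"].to_numpy())

lasso_model = LassoCV(cv=cv, alphas=reg_para)
lasso_fit = lasso_model.fit(X, y)
print(lasso_fit.alpha_)  # best alpha found across the predefined folds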