cross validation in sklearn with given fold splits

I'm learning how to use Lasso and Ridge with sklearn in Python. I am given the folds in a column. I want to find the best parameter based on a 5 fold cross validation.
My data looks like the following:
mpg cylinders displacement horsepower weight acceleration origin fold
0 18 8 307 130 3504 12.0 1 3
1 15 8 350 165 3693 11.5 1 0
2 18 8 318 150 3436 11.0 1 2
3 16 8 304 150 3433 12.0 1 2
4 17 8 302 140 3449 10.5 1 3
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]
mpg is the target variable (y) and the other columns are the predictors. The last column contains the fold assignments. I want to run Lasso and Ridge and find the best regularization parameter for each. The problem I am having is incorporating the specified folds into the cross-validation. Here is what I have so far (for Lasso):
from sklearn.linear_model import Lasso, LassoCV
lasso_model = LassoCV(cv=5, alphas=reg_para)
lasso_fit = lasso_model.fit(X,y)
Is there a simple way to incorporate the fold splits? Any help is greatly appreciated

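If your data are in a pandas DataFrame, all you need to do is access that column and hand it to a splitter that respects predefined groups:
fold_labels = df["fold"]
from sklearn.model_selection import LeaveOneGroupOut
# LassoCV accepts an iterable of (train, test) index pairs as cv
cv = list(LeaveOneGroupOut().split(X, y, groups=fold_labels))
lasso_model = LassoCV(cv=cv, alphas=reg_para)
So once you have the fold labels in an array fold_labels, you can just use LeaveOneGroupOut, which holds out one fold at a time. It was called LeaveOneLabelOut in the old sklearn.cross_validation module, which has since been removed. (The snippet assumes X, y and reg_para from the question; it should be sufficient to convey the idea.)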
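An alternative that may be worth trying is PredefinedSplit, which takes the fold column directly as test_fold. The sketch below is only an illustration under stated assumptions: the data frame is synthetic (made-up values standing in for the mpg data), and the column names other than fold and mpg are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import PredefinedSplit

# Synthetic stand-in for the question's data (values are made up).
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "displacement": rng.normal(size=n),
    "weight": rng.normal(size=n),
    "fold": np.arange(n) % 5,  # predefined fold labels 0..4
})
df["mpg"] = 2.0 * df["displacement"] - df["weight"] + rng.normal(scale=0.1, size=n)

X = df.drop(columns=["mpg", "fold"])
y = df["mpg"]
reg_para = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]

# Each distinct value in the fold column becomes the test set of one split.
cv = PredefinedSplit(test_fold=df["fold"].to_numpy())

lasso_model = LassoCV(cv=cv, alphas=reg_para).fit(X, y)
ridge_model = RidgeCV(cv=cv, alphas=reg_para).fit(X, y)

print("number of splits:", cv.get_n_splits())
print("best Lasso alpha:", lasso_model.alpha_)
print("best Ridge alpha:", ridge_model.alpha_)
```

The same cv object can be reused for both models, so Lasso and Ridge are compared on identical folds.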

Related

GridSearchCV for DecisionTreeClassifier in sklearn randomly generates UndefinedMetricWarning on macOS and Linux but not Windows

I am running into an issue after transferring a grid search workflow for decision tree classifiers from Windows to our Linux cluster. It works every time on Windows, but on Linux it generates an error/warning more often than not.
The code is as follows:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer,accuracy_score,recall_score,f1_score, precision_score, roc_auc_score
import numpy as np
from sklearn.model_selection import GridSearchCV
data = pd.read_csv("Dtree_test_data.csv")
d_tree = DecisionTreeClassifier()
scoring = {'accuracy' : make_scorer(accuracy_score),
'precision' : make_scorer(precision_score, average = 'weighted'),
'recall' : make_scorer(recall_score, average = 'weighted'),
'f1_score' : make_scorer(f1_score, average = 'weighted')}
features = data[["H-Bonds","Mass"]]
labels = data["Class"]
parameters = {'max_depth':np.arange(1,3,1),
'min_samples_leaf': np.arange(0.01,0.05,0.01),
'min_impurity_decrease':np.arange(0.01,0.04,0.01),
'criterion':['gini','entropy']}
dtree = DecisionTreeClassifier()
clf = GridSearchCV(dtree,parameters,verbose=1,n_jobs=12,scoring=scoring,refit=False,return_train_score=True,cv=2)
clf.fit(features,labels)
The data looks like this:
H-Bonds Mass Class
0 2 123 0
1 3 45 1
2 1 153 2
3 4 90 0
4 6 300 1
5 1 40 2
6 2 200 0
7 3 245 1
8 4 87 2
9 1 126 1
The error/warning is the following (repeated multiple times):
Fitting 2 folds for each of 48 candidates, totalling 96 fits
/../miniconda3/envs/ml-debug-current/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1334: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Any idea why this is?
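The warning text itself points at the zero_division parameter. As a minimal sketch (not an explanation of the platform difference), the snippet below shows the situation the message describes, a label that is never predicted, and how zero_division could be passed through make_scorer. The toy arrays are made up for illustration. Note that a tree with max_depth=1 has only two leaves, so with three classes at least one label necessarily goes unpredicted.

```python
import warnings

import numpy as np
from sklearn.metrics import make_scorer, precision_score

y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 1, 0, 0, 1, 1])  # class 2 is never predicted

with warnings.catch_warnings():
    # Turn any warning into an exception to show that none is raised here.
    warnings.simplefilter("error")
    p = precision_score(y_true, y_pred, average="weighted", zero_division=0)
print(p)

# Extra keyword arguments to make_scorer are forwarded to the metric.
precision_scorer = make_scorer(precision_score, average="weighted", zero_division=0)
```

A scorer built this way could replace the 'precision' entry in the scoring dictionary above; whether silencing the warning is appropriate depends on how the per-class scores are meant to be interpreted.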

Predicting car prices with machine learning - advice on best model [closed]

I am trying to predict the price for cars using the following DataFrame data. Data types: model, transmission, and fuel type as obj, the rest as float/int.
The full dataset contains ~6500 samples (only data.head() listed here)
model year price transmission mileage fuelType tax mpg engineSize
0 Bolt 2016 16000 Manual 24089 Petrol 265 36.2 2.0
1 Bolt 2017 15995 Manual 18615 Petrol 145 36.2 2.0
2 Bolt 2015 13998 Manual 27469 Petrol 265 36.2 2.0
3 Bolt 2017 18998 Manual 14736 Petrol 150 36.2 2.0
4 Bolt 2017 17498 Manual 36284 Petrol 145 36.2 2.0
I start by dropping all duplicates and encoding the categorical variables:
# Drop duplicates
data = data.drop_duplicates(keep="first")
# Categorical variable encoding
cat_features = ["model", "transmission", "fuelType"]
encoder = LabelEncoder()
encoded = data[cat_features].apply(encoder.fit_transform)
data = data.drop(cat_features, axis=1)
data = pd.concat([encoded, data], axis=1)
Output:
model transmission fuelType year price mileage tax mpg engineSize
0 1 1 3 2016 16000 24089 265 36.2 2.0
1 1 1 3 2017 15995 18615 145 36.2 2.0
2 1 1 3 2015 13998 27469 265 36.2 2.0
3 1 1 3 2017 18998 14736 150 36.2 2.0
4 1 1 3 2017 17498 36284 145 36.2 2.0
Following the scikit-learn documentation (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html), I tried regression using Lasso, ElasticNet, Ridge, and SVR.
I got the best results using Ridge regression (see code below), with an R^2 of 0.79 and an RMSE of 2941.73. However, my success criterion is predicting the price within a certain range of the actual price (e.g. +/- 1000).
Even with the Ridge model below, most predictions don't make the cut. Do you have any ideas how I could optimize the regression? Have I made any mistakes in the Ridge regression below or with the hyperparameters? Are there more appropriate models for this case?
Ridge:
X = data.iloc[:, [0, 1, 2, 3, 5, 6, 7, 8]]
y = data.iloc[:, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
std_slc = StandardScaler()
pca = decomposition.PCA()
ridge = linear_model.Ridge()
pipe = Pipeline(steps=[("std_slc", std_slc),
("pca", pca),
("ridge", ridge)])
n_components = list(range(1,X.shape[1]+1,1))
parameters = dict(pca__n_components=n_components,
ridge__solver=["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"],
ridge__alpha=np.linspace(0, 1, 11),
ridge__fit_intercept=[True, False])
clf = GridSearchCV(pipe, parameters, scoring='r2', verbose=1)
clf.fit(X_train, y_train)
y_pred_ridge = clf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,y_pred_ridge)))
print(r2_score(y_test, y_pred_ridge))
Output:
Fitting 5 folds for each of 56 candidates, totalling 280 fits
2941.734786303254
0.7909623313908631
Voting Regressor:
eclf = VotingRegressor(estimators=[
('ridge', linear_model.Ridge()),
('lasso', linear_model.Lasso()),
('elasticnet', linear_model.ElasticNet())
])
#Use the key for the classifier followed by __ and the attribute
params_eclf = {'ridge__solver': ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"],
'lasso__selection': ["cyclic", "random"],
'elasticnet__selection': ['cyclic', 'random'],
'ridge__alpha': np.linspace(0, 1, 10),
'ridge__fit_intercept': [True, False]}
grid_eclf = RandomizedSearchCV(estimator=eclf, param_distributions=params_eclf, cv=3, n_iter=250, verbose=1, scoring='r2')
grid_eclf.fit(X_train, y_train)
y_pred_eclf = grid_eclf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,y_pred_eclf)))
print(r2_score(y_test, y_pred_eclf))
Output:
Fitting 3 folds for each of 250 candidates, totalling 750 fits
3082.2257067911637
0.7705191776922907

I've dealt with a similar task before. Judging by my personal experience:
RMSE is sensitive to outliers. You may cheat a bit by excluding extreme prices (and/or check MAE instead).
Categorical features like model aren't ordinal, so label encoding is not a great fit here. Mean target encoding might improve the results. As a side effect, the fuel type/gearbox features might turn out to be redundant.
Linear models, KNN and SVR didn't perform well; random forest and gradient boosting were the best (as they often are), and a single decision tree gave an acceptable result as well.
(This was not the exact same dataset, however, so YMMV.)
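Two of the points above can be sketched in a few lines. This is only an illustration with made-up numbers: within_tolerance is a hypothetical helper that measures the question's "+/- 1000" criterion directly, and the encoding part shows one simple form of mean target encoding (category means computed on training rows only, with the global mean as fallback for unseen categories).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error


def within_tolerance(y_true, y_pred, tol=1000.0):
    """Fraction of predictions whose absolute error is at most tol."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= tol))


# Made-up prices and predictions; the last prediction is an outlier.
y_true = np.array([16000, 15995, 13998, 18998, 17498])
y_pred = np.array([15500, 17500, 14200, 18000, 30000])

rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
mae = float(mean_absolute_error(y_true, y_pred))
hit_rate = within_tolerance(y_true, y_pred, tol=1000.0)
print(rmse, mae, hit_rate)  # RMSE is pulled up by the single outlier

# Mean target encoding: replace each category by its mean training price.
train = pd.DataFrame({
    "model": ["Bolt", "Bolt", "Volt", "Volt", "Spark"],
    "price": [16000, 15000, 22000, 24000, 9000],
})
model_means = train.groupby("model")["price"].mean()
global_mean = train["price"].mean()

test = pd.DataFrame({"model": ["Bolt", "Spark", "Trax"]})  # "Trax" is unseen
test["model_enc"] = test["model"].map(model_means).fillna(global_mean)
print(test)
```

Computing the category means on the training split only avoids leaking test prices into the features.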

Extracting results from gridsearchcv

I just finished a GridSearchCV run on a tree-based model and, after looking into the results, I managed to access the results of each iteration.
I need each run in a separate row and each parameter in a separate column.
I can run a loop or list comprehension over the rows, but I am unable to separate each run into columns.
df = grid.grid_scores_
df[0]
mean: 0.57114, std: 0.00907, params: {'criterion': 'gini', 'max_depth': 10,
'max_features': 8, 'min_samples_leaf': 2, 'min_samples_split': 2, 'splitter': 'best'}
I tried tuple and dict accessors but ended up with errors. Essentially I need every parameter in a new column, like below.
mean | std | criterion | ..... | splitter
0.57 0.009 'gini' ..... | 'best'
0.58 0.029 'entropy' ..... | 'random'
.
.
.
.

You could use the pre-made class to generate a DataFrame with a report of the parameters (see stackoverflow post using this code here)
Imports and settings
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from gridsearchcv_helper import EstimatorSelectionHelper
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
Generate some data
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
Define model and hyper-parameter grid
models = {'RandomForestClassifier': RandomForestClassifier()}
params = {'RandomForestClassifier': { 'n_estimators': [16, 32],
'max_features': ['auto', 'sqrt', 'log2'],
'criterion' : ['gini', 'entropy'] }}
Perform gridsearch (with CV) and report results
helper = EstimatorSelectionHelper(models, params)
helper.fit(X_iris, y_iris, n_jobs=4)
df_gridsearchcv_summary = helper.score_summary()
Here is the output
print(type(df_gridsearchcv_summary))
print(df_gridsearchcv_summary.iloc[:,1:])
RandomForestClassifier
<class 'pandas.core.frame.DataFrame'>
min_score mean_score max_score std_score criterion max_features n_estimators
0 0.941176 0.973856 1 0.0244553 gini auto 16
1 0.921569 0.96732 1 0.0333269 gini auto 32
8 0.921569 0.96732 1 0.0333269 entropy sqrt 16
10 0.921569 0.96732 1 0.0333269 entropy log2 16
2 0.941176 0.966912 0.980392 0.0182045 gini sqrt 16
3 0.941176 0.966912 0.980392 0.0182045 gini sqrt 32
4 0.941176 0.966912 0.980392 0.0182045 gini log2 16
5 0.901961 0.960784 1 0.0423578 gini log2 32
6 0.921569 0.960376 0.980392 0.0274454 entropy auto 16
7 0.921569 0.960376 0.980392 0.0274454 entropy auto 32
11 0.901961 0.95384 0.980392 0.0366875 entropy log2 32
9 0.921569 0.953431 0.980392 0.0242635 entropy sqrt 32
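Without a helper class, a similar table can be built from the cv_results_ attribute, which replaced grid_scores_ in later scikit-learn releases. The following is a minimal sketch on the iris data; the estimator and the parameter grid are chosen here purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4],
    "splitter": ["best", "random"],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
grid.fit(X_iris, y_iris)

# cv_results_ is a dict of arrays; one row per parameter combination.
results = pd.DataFrame(grid.cv_results_)

# Expand the list of parameter dicts into one column per parameter.
params_df = pd.DataFrame(results["params"].tolist())
summary = pd.concat(
    [results[["mean_test_score", "std_test_score"]], params_df], axis=1
)
print(summary.sort_values("mean_test_score", ascending=False))
```

The results frame also carries ready-made param_<name> columns, so selecting those is another way to get one column per parameter.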

sklearn Linear Regression (Final Prediction always 0)

I'm trying to do simple linear regression using this small Dataset (Screenshot).
The dataset is records divided into small time blocks of 4 years each (Except for the 2nd to the last time block of 2016-2018).
What I'm trying to do is try to predict the output of records for the timeblock of 2019-2022. To do this, I placed a 2019-2022 time block with all its rows containing the value of 0 (Since there's nothing made during that time since it's the future). I did that to accommodate the syntax of sklearn's train_test_split and went with this code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv("TCO.csv")
df = df[['2000-2003', '2004-2007', '2008-2011','2012-2015','2016-2018','2019-2022']]
linreg = LinearRegression()
X1_train, X1_test, y1_train, y1_test = train_test_split(df[['2000-2003','2004-2007','2008-2011',
'2012-2015','2016-2018']],df['2019-2022'],test_size=0.4,random_state = 42)
linreg.fit(X1_train, y1_train)
linreg.intercept_
list( zip( ['2000-2003','2004-2007','2008-2011','2012-2015','2016-2018'],list(linreg.coef_)))
y1_pred = linreg.predict(X1_test)
print(y1_pred)
test_pred_df = pd.DataFrame({'actual': y1_test,
'predicted': np.round(y1_pred, 2),
'residuals': y1_test - y1_pred})
print(test_pred_df[0:10].to_string())
For some reason, the algorithm would always return a 0 as the final prediction for all rows with 0 residuals (This is due to the timeblock of 2019-2022 having all rows of zero.)
I think I did something wrong but I can't tell what it is. (I'm a beginner in this topic.) Can someone point out what went wrong and how to fix it?
Edit: I added a copy-able version of the data:
df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
'Malaysia','Myanmar','Philippines','Singaore',
'Thailand','Vietnam'],
'2000-2003': [0,0,14,1,6,0,25,8,26,8],
'2004-2007': [0,3,15,6,21,0,37,11,44,36],
'2008-2011': [0,5,31,9,75,0,58,27,96,61],
'2012-2015': [5,11,129,35,238,3,99,65,170,96],
'2016-2018': [6,22,136,17,211,10,66,89,119,88]})

Based on your data, I think this is what you ask for [Edit: see updated version below]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
'Malaysia','Myanmar','Philippines','Singaore',
'Thailand','Vietnam'],
'2000-2003': [0,0,14,1,6,0,25,8,26,8],
'2004-2007': [0,3,15,6,21,0,37,11,44,36],
'2008-2011': [0,5,31,9,75,0,58,27,96,61],
'2012-2015': [5,11,129,35,238,3,99,65,170,96],
'2016-2018': [6,22,136,17,211,10,66,89,119,88]})
# create a transposed version with country in header
df_T = df.T
df_T.columns = df_T.loc["Country:"]
df_T = df_T.drop("Country:")
# create a new column for the target
df["2019-2022"] = np.nan
# now fit a model per country and add the prediction
for country in df_T:
    y = df_T[country].values.astype(float)
    X = np.arange(0, len(y))
    m = LinearRegression()
    m.fit(X.reshape(-1, 1), y)
    # predict expects a 2D array; index 5 is the next time block
    df.loc[df["Country:"] == country, "2019-2022"] = m.predict([[5]])[0]
This prints:
Country: 2000-2003 2004-2007 2008-2011 2012-2015 2016-2018 2019-2022
Brunei 0 0 0 5 6 7.3
Cambodia 0 3 5 11 22 23.8
Indonesia 14 15 31 129 136 172.4
Laos 1 6 9 35 17 31.9
Malaysia 6 21 75 238 211 298.3
Myanmar 0 0 0 3 10 9.5
Philippines 25 37 58 99 66 100.2
Singaore 8 11 27 65 89 104.8
Thailand 26 44 96 170 119 184.6
Vietnam 8 36 61 96 88 123.8
Forget about my comment about shift(). I thought about it, but I think it makes no sense for this small amount of data. Still, considering time series methods and treating each country's values as a time series may be worthwhile for you.
Edit:
Excuse me. The above code is unnecessarily complicated; it was just the result of me going through it step by step. Of course it can simply be done row by row like this:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame( {'Country:':['Brunei','Cambodia','Indonesia','Laos',
'Malaysia','Myanmar','Philippines','Singaore',
'Thailand','Vietnam'],
'2000-2003': [0,0,14,1,6,0,25,8,26,8],
'2004-2007': [0,3,15,6,21,0,37,11,44,36],
'2008-2011': [0,5,31,9,75,0,58,27,96,61],
'2012-2015': [5,11,129,35,238,3,99,65,170,96],
'2016-2018': [6,22,136,17,211,10,66,89,119,88]})
# create a new column for the target
df["2019-2022"] = np.nan
for idx, row in df.iterrows():
    y = row.drop(["Country:", "2019-2022"]).values.astype(float)
    X = np.arange(0, len(y))
    m = LinearRegression()
    m.fit(X.reshape(-1, 1), y)
    # X runs 0..len(y)-1, so the next time block sits at index len(y)
    df.loc[idx, "2019-2022"] = m.predict([[len(y)]])[0]
1500 rows should be no problem.
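As a sanity check of the per-row fit, the Brunei row can be worked through with np.polyfit, which fits the same least-squares line. This is only a small sketch; it should reproduce the 7.3 shown for Brunei in the table above.

```python
import numpy as np

# Brunei, time blocks 2000-2003 .. 2016-2018
y = np.array([0, 0, 0, 5, 6], dtype=float)
x = np.arange(len(y))  # time blocks indexed 0..4

# Degree-1 polynomial fit: y is approximated by slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# The 2019-2022 block is the next index, len(y) == 5
next_block = slope * len(y) + intercept
print(round(float(slope), 2), round(float(intercept), 2), round(float(next_block), 1))
```

With only five points per country, the extrapolation is a straight trend line and says nothing about uncertainty.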

Adding sparse matrix from CountVectorizer into dataframe with complementary information for classifier - keep it in sparse format

I have the following problem. I am building a classifier that takes text plus some additional complementary information as input. I store the complementary information in a pandas DataFrame. I transform the text using CountVectorizer and get a sparse matrix. To train the classifier, I need both inputs in the same DataFrame. The problem is that when I merge the DataFrame with the output of CountVectorizer, I get a dense matrix, which means I run out of memory really fast. Is there any way to avoid this and merge the two inputs into a single DataFrame without getting a dense matrix?
Example code:
import datetime
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
#how many most popular words we consider
n_features = 5000
df = pd.read_csv('DataWithSentimentAndTopics.csv')
#vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
max_features=n_features,
stop_words='english')
#getting the TF matrix
tf = tf_vectorizer.fit_transform(df['reviewText'])
df = pd.concat([df.drop(['reviewText', 'Summary'], axis=1), pd.DataFrame(tf.A)], axis=1)
#binning target variable into 4 bins.
df['helpful'] = pd.cut(df['helpful'],[-1,0,10,50,100000], labels = [0,1,2,3])
#creating X and Y variables
train = df.drop(['helpful'], axis=1)
Y = df['helpful']
#splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.1)
#creating GBR
gbc = GradientBoostingClassifier(max_depth = 7, n_estimators=1500, min_samples_leaf=10)
print('Training GBC')
print(datetime.datetime.now())
#fit classifier, look for best
gbc.fit(X_train, y_train)
As you can see, I set up my CountVectorizer to keep 5000 words. I have just 50000 rows in my original DataFrame, but I already get a matrix of 50000x5000 cells, which is 250 million entries. That already requires a lot of memory.

You don't need to use a DataFrame.
Convert the numerical features from the DataFrame to a NumPy array and stack them next to the sparse text features:
num_feats = df[cols].values  # cols: list of the numeric column names
from scipy import sparse
training_data = sparse.hstack((count_vectorizer_features, num_feats))
Then you can use any scikit-learn algorithm that supports sparse data.
For GBM, you can use xgboost, which supports sparse input.
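A runnable sketch of the stacking step, under the assumption of a tiny made-up frame (the column names reviewText, price and LenReview are borrowed from the question; the values are invented):

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "reviewText": ["a difficult read", "wordy and drags on", "a quick read"],
    "price": [8.48, 8.48, 5.00],
    "LenReview": [199, 114, 77],
})

# Text features stay in scipy's sparse CSR format.
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(df["reviewText"])

# Numeric columns are converted to a sparse matrix as well before stacking.
num_feats = sparse.csr_matrix(df[["price", "LenReview"]].values)

# Column-wise concatenation; the result is still sparse.
X = sparse.hstack([text_features, num_feats], format="csr")
print(X.shape, sparse.issparse(X))
```

The combined matrix keeps the row order of the DataFrame, so the target column can be passed alongside it unchanged.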

As @AbhishekThakur has already said, you don't have to put your one-hot-encoded data into the DataFrame.
But if you want to do so, you can add each feature as a sparse column (pd.arrays.SparseArray in current pandas; older releases used pd.SparseSeries):
#vectorizing text
tf_vectorizer = CountVectorizer(max_df=0.5, min_df=2,
                                max_features=n_features,
                                stop_words='english')
#getting the TF matrix
tf = tf_vectorizer.fit_transform(df.pop('reviewText'))
# adding "features" columns as sparse columns
for i, col in enumerate(tf_vectorizer.get_feature_names_out()):
    df[col] = pd.arrays.SparseArray(tf[:, i].toarray().ravel(), fill_value=0)
Result:
In [107]: df.head(3)
Out[107]:
asin price reviewerID LenReview Summary LenSummary overall helpful reviewSentiment 0 \
0 151972036 8.48 A14NU55NQZXML2 199 really a difficult read 23 3 2 -0.7203 0.002632
1 151972036 8.48 A1CSBLAPMYV8Y0 77 wha 3 4 0 -0.1260 0.005556
2 151972036 8.48 A1DDECXCGHDYZK 114 wordy and drags on 18 1 4 0.5707 0.004545
... think thought trailers trying wanted words worth wouldn writing young
0 ... 0 0 0 0 1 0 0 0 0 0
1 ... 0 0 0 1 0 0 0 0 0 0
2 ... 0 0 0 0 1 0 1 0 0 0
[3 rows x 78 columns]
Note the memory usage:
In [108]: df.memory_usage()
Out[108]:
Index 80
asin 112
price 112
reviewerID 112
LenReview 112
Summary 112
LenSummary 112
overall 112
helpful 112
reviewSentiment 112
0 112
1 112
2 112
3 112
4 112
5 112
6 112
7 112
8 112
9 112
10 112
11 112
12 112
13 112
14 112
...
parts 16 # memory used: # of ones multiplied by 8 (np.int64)
peter 16
picked 16
point 16
quick 16
rating 16
reader 16
reading 24
really 24
reviews 16
stars 16
start 16
story 32
tedious 16
things 16
think 16
thought 16
trailers 16
trying 16
wanted 24
words 16
worth 16
wouldn 16
writing 24
young 16
dtype: int64

Pandas also supports importing sparse matrices directly, which it stores using its SparseDtype:
import pandas as pd
import scipy.sparse
pd.DataFrame.sparse.from_spmatrix(Your_Sparse_Data)
You can then concatenate the result with the rest of your DataFrame.
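A minimal sketch of that round trip, with a small hand-made count matrix standing in for the CountVectorizer output (the column names and values are invented for illustration):

```python
import pandas as pd
from scipy import sparse

# Dense side information, e.g. one numeric column per review.
meta = pd.DataFrame({"price": [8.48, 5.00, 3.25]})

# Stand-in for the sparse term-count matrix.
counts = sparse.csr_matrix([[0, 1, 0], [2, 0, 0], [0, 0, 1]])
words = ["read", "quick", "wordy"]

# Each column of the resulting frame has a SparseDtype.
sparse_df = pd.DataFrame.sparse.from_spmatrix(counts, columns=words)

# Concatenating keeps the sparse columns sparse.
combined = pd.concat([meta, sparse_df], axis=1)
print(combined.dtypes)

# The sparse block can be turned back into a scipy matrix for an estimator.
back = combined[words].sparse.to_coo()
print(back.shape)
```

Only the sparse columns can go through the .sparse accessor, which is why the word columns are selected before converting back.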
