Speed Improvements to Leave One Group Out in Large Datasets [closed] - python

I am performing classification with LogisticRegression over a large dataset (1.5 million observations) using LeaveOneGroupOut cross-validation, implemented with scikit-learn. My code takes around 2 days to run, and I would appreciate your input on how to make it faster. A snippet of my code is shown below:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

# One cross-validation fold per unique value of id_x
grp = data['id_x'].values
logo = LeaveOneGroupOut()
LogReg = LogisticRegression()
params_grid = {'C': [0.78287388, 1.19946909, 1.0565957, 0.69874106, 0.88427995,
                     1.33028731, 0.51466415, 0.91421747, 1.25318725, 0.82665192, 1, 10],
               'penalty': ['l1', 'l2']}
random_search = RandomizedSearchCV(LogReg, param_distributions=params_grid,
                                   n_iter=3, cv=logo, scoring='accuracy')
random_search.fit(X, y, groups=grp)
print(random_search.best_params_)
print(random_search.best_score_)

I am going to make the following assumptions:
1- you are using scikit-learn.
2- you need your code to run faster.
To get your final results faster, you can train multiple models at once by running them in parallel. To do so, set the n_jobs parameter in scikit-learn. A sensible value for n_jobs is the number of CPU cores (if nothing else is running on your machine), or the number of cores minus one to leave a core free for other work; n_jobs=-1 uses all available cores.
Examples:
RandomizedSearchCV in parallel:
random_search = RandomizedSearchCV(LogReg, n_jobs=3, param_distributions=params_grid, n_iter=3, cv=logo, scoring='accuracy')
LogisticRegression in parallel:
LogisticRegression(n_jobs=3)
I recommend parallelizing only RandomizedSearchCV: setting n_jobs on both the search and the estimator can oversubscribe your cores, because each search worker would spawn its own pool of workers.
It might be helpful to also look at the original scikit-learn documentation:
sklearn.linear_model.LogisticRegression
sklearn.model_selection.RandomizedSearchCV
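Putting it together, here is a minimal sketch of the parallel search. It assumes X, y, and grp are defined as in the question; the 'saga' solver is my suggestion (not part of the original code), since it supports both 'l1' and 'l2' penalties and tends to scale better to large datasets:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

logo = LeaveOneGroupOut()
# 'saga' handles both penalties and large sample counts (an assumption,
# not part of the original question)
LogReg = LogisticRegression(solver='saga', max_iter=1000)
params_grid = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
random_search = RandomizedSearchCV(LogReg, param_distributions=params_grid,
                                   n_iter=3, cv=logo, scoring='accuracy',
                                   n_jobs=-1)  # use all available cores
random_search.fit(X, y, groups=grp)
One more thing to keep in mind: LeaveOneGroupOut trains one model per unique value of id_x, so with many groups the number of fits explodes. If strict leave-one-group-out is not required, GroupKFold with a small number of splits preserves the grouping at a fraction of the cost.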

Related

How to scale datasets correctly [closed]

Which one is more correct, or is there another way to scale data? (I've used StandardScaler as an example.)
I've tried every way and computed the accuracy of every model; there is no meaningful difference, but I want to know which way is more correct.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv("wine.csv")
x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)  # refits the scaler on the test set
or
dataset = pd.read_csv("wine.csv")
x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]
sc = StandardScaler()
x = sc.fit_transform(x)  # scales before splitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
or
dataset = pd.read_csv("wine.csv")
x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)  # reuses the training-set statistics
Test data should not be seen or used during the training of a model, as it is used to assess the model's performance.
Therefore the last option is the correct one. The scaling parameters should be computed solely on the training set, as follows:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # learn mean and std from the training set only
x_test = sc.transform(x_test)        # apply the same transformation to the test set
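To make this pattern harder to get wrong, the scaler and the model can be chained in a Pipeline, which fits the scaler on the training portion only. A minimal sketch, assuming the wine.csv setup above (LogisticRegression is just a placeholder model, since the question does not name one):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# fit() fits the scaler on x_train only; score() applies the learned
# scaling to x_test before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))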

What is causing this discrepancy between the metric displayed at Catboost.select_features's plot and the actual predictions of the fitted final model? [duplicate]

This question already has answers here:
CatBoost precision imbalanced classes
(3 answers)
I'm performing feature selection with CatBoost. This is the training code:
import catboost as ctb
from catboost import Pool, EFeaturesSelectionAlgorithm, EShapCalcType

# Parameter grid
params = {
    'auto_class_weights': 'Balanced',
    'boosting_type': 'Ordered',
    'thread_count': -1,
    'random_seed': 24,
    'loss_function': 'MultiClass',
    'eval_metric': 'TotalF1:average=Macro',
    'verbose': 0,
    'classes_count': 3,
    'num_boost_round': 500,
    'early_stopping_rounds': EARLY_STOPPING_ROUNDS
}
# Datasets
train_pool = Pool(train, y_train)
test_pool = Pool(test, y_test)
# Model Constructor
ctb_model = ctb.CatBoostClassifier(**params)
# Run feature selection
summary = ctb_model.select_features(
    train_pool,
    eval_set=test_pool,
    features_for_select='0-{0}'.format(train.shape[1] - 1),
    num_features_to_select=10,
    steps=1,
    algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
    shap_calc_type=EShapCalcType.Exact,
    train_final_model=True,
    logging_level='Silent',
    plot=True
)
After the run has ended, a plot of the evaluation metric per iteration is displayed (not reproduced here).
It is clear from the plot that the evaluation metric is TotalF1 with macro average, and that the best iteration of the model achieved 0.6153 as the best score for this metric. According to the documentation, when the train_final_model argument is set to True, a model is finally fitted using the selected features that gave the best score for the specified evaluation metric during the feature selection process, so one would expect to get the same result when using the fitted model to make predictions and evaluating it. However, this is not the case.
When running:
from sklearn.metrics import f1_score
predictions = ctb_model.predict(test[summary['selected_features_names']], prediction_type='Class')
f1_score(y_test, predictions, average='macro')
The result I'm getting is:
0.41210319323424227
The difference is huge, and I can't figure out what is causing it.
How can I resolve this issue?
The solution to this question can be found at: CatBoost precision imbalanced classes
After setting the sample_weight parameter of sklearn's f1_score(), I got the same F1 score that CatBoost was reporting.
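A minimal sketch of the comparison. The weight computation below is my reconstruction of what auto_class_weights='Balanced' implies (weights inversely proportional to class frequency); the relevant f1_score argument is sample_weight:
import numpy as np
from sklearn.metrics import f1_score

# Assumed reconstruction of CatBoost's 'Balanced' class weights:
# weight of class c = n_samples / (n_classes * count(c))
classes, counts = np.unique(y_test, return_counts=True)
class_weight = {c: len(y_test) / (len(classes) * n) for c, n in zip(classes, counts)}
sample_weight = np.array([class_weight[c] for c in np.ravel(y_test)])

predictions = ctb_model.predict(test[summary['selected_features_names']],
                                prediction_type='Class')
print(f1_score(y_test, predictions, average='macro', sample_weight=sample_weight))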

I want to know the resulting label for all of the input data [closed]

Training was carried out using the random forest algorithm. I want to append the prediction results to the existing input data; how do I do that? scikit-learn provides model evaluation metrics such as accuracy, precision, recall, and F1 score, but I am not sure whether there is a function that returns the label of the prediction result, like in Keras. I don't know where to start with the code, so I'll just ask a question.
Usually, you have something like this when using sklearn:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

input_data = pd.read_csv("path/to/data")
features = ["area", "location", "rooms"]
y = input_data["Price"]
X = input_data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestRegressor()
model.fit(train_X, train_y)
Now your model is trained. As you mentioned you could get different metrics from the model using sklearn on your validation set.
Getting the output label from the model means getting predictions (inference):
output_label = model.predict(val_X)
# This is an ndarray with the same length as val_y
results = val_X.copy()
results["output_label"] = output_label
Or, if you are working with raw NumPy arrays, you could use numpy.concatenate to append the labels directly to your input data, as in the sketch below.
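A minimal sketch of both options, using the hypothetical column names from the example above:
import numpy as np

# Option 1: attach a prediction for every row of the original input data
input_data["predicted_price"] = model.predict(X)

# Option 2: numpy.concatenate on raw arrays (predictions become the last column)
combined = np.concatenate([val_X.to_numpy(), output_label.reshape(-1, 1)], axis=1)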

Appropriate f1 scoring for highly imbalanced data [closed]

I am confused by the different f1 scoring options below. Which f1 scoring should I use for severely imbalanced data? I am working on a severely imbalanced binary classification problem.
‘f1’
‘f1_micro’
‘f1_macro’
‘f1_weighted’
Also, I want to use balanced_accuracy_score(y_true, y_pred, adjusted=True) as the balanced_accuracy scoring argument. How can I incorporate this into my code?
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from imblearn.metrics import geometric_mean_score

X, y = load_breast_cancer(return_X_y=True)
gm_scorer = make_scorer(geometric_mean_score, greater_is_better=True)
scores = cross_validate(LogisticRegression(max_iter=100000), X, y, cv=5,
                        scoring={'gm_scorer': gm_scorer, 'F1': 'f1',
                                 'Balanced Accuracy': 'balanced_accuracy'})
scores
f1_micro computes a single global f1 over all instances, while f1_macro computes the class-wise f1 scores and then takes their unweighted average.
It's similar to precision and its micro, macro, and weighted variants in sklearn. Do check the SO post Type of precision where I explain the difference. The f1 score is basically a way to consider both precision and recall at the same time.
Also, as per the documentation:
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
For your specific case, you might want to use f1_macro (the unweighted average of class-wise f1) or f1_weighted (the support-weighted average of class-wise f1), as f1_micro is dominated by the majority class and hides the class-wise contribution to the score. Note that for binary classification the plain 'f1' scorer reports the f1 of the positive class, which is often what you want when the positive class is the rare one.
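As for the adjusted balanced accuracy: the built-in 'balanced_accuracy' scoring string uses adjusted=False, but make_scorer forwards extra keyword arguments to the metric, so you can build your own scorer. A sketch based on the question's code (X, y, and gm_scorer as defined there):
from sklearn.metrics import balanced_accuracy_score, make_scorer

# adjusted=True rescales the score so that random guessing maps to 0
adjusted_ba_scorer = make_scorer(balanced_accuracy_score, adjusted=True)

scores = cross_validate(LogisticRegression(max_iter=100000), X, y, cv=5,
                        scoring={'gm_scorer': gm_scorer, 'F1': 'f1',
                                 'Adjusted Balanced Accuracy': adjusted_ba_scorer})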

Multiple non-linear regression in Python [closed]

I am looking for any libraries or methods that can help me find a regression equation. The equation is in this format:
Y = a1*x^a + a2*y^b + a3*z^c + D
where:
Y is the dependent variable
x, y, z are independent variables
D is constant
a1, a2, a3 are the coefficients
a, b, c are the exponents of the independent variables respectively.
I have values of Y and x, y, z stored in a data frame.
You can use the Random Forest Regressor implementation from scikit-learn. It's quite easy to use, you simply do:
from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor()
# train the model
reg.fit(df[['x', 'y', 'z']], df['Y'])
# predict on test data
predict = reg.predict(test_data[['x', 'y', 'z']])
Make sure the train and test data have the same number of independent variables.
For more non-linear regressors, check the scikit-learn ensemble module.
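Note, though, that a random forest will not recover the coefficients and exponents of the stated equation. If those are what you need, a non-linear least-squares fit is one option. A minimal sketch with scipy.optimize.curve_fit, assuming the data is in a DataFrame df with columns x, y, z, Y and that the independent variables are positive (fractional exponents of negative values are undefined):
import numpy as np
from scipy.optimize import curve_fit

def model(X, a1, a2, a3, a, b, c, D):
    # Y = a1*x^a + a2*y^b + a3*z^c + D
    x, y, z = X
    return a1 * x**a + a2 * y**b + a3 * z**c + D

xdata = (df['x'].values, df['y'].values, df['z'].values)
popt, pcov = curve_fit(model, xdata, df['Y'].values, p0=np.ones(7), maxfev=10000)
a1, a2, a3, a, b, c, D = popt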
