I want to know the resulting label for all of the input data [closed] - python

Training was done with a random forest. I want to append the model's predicted labels to the existing input data; how do I do that? scikit-learn provides evaluation metrics such as accuracy, precision, recall, and F1 score for the results, but I am not sure whether there is a function that returns the predicted label for each input, the way Keras does. I don't know where to start with the code, so I'll just ask the question.

Usually, you have something like this when using sklearn:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

input_data = pd.read_csv("path/to/data")
features = ["area", "location", "rooms"]
y = input_data["Price"]
X = input_data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestRegressor()
model.fit(train_X, train_y)
Now your model is trained. As you mentioned, you can compute various metrics with sklearn on your validation set.
Getting output labels from the model means getting predictions (inference):
output_label = model.predict(val_X)
# output_label is a 1-D ndarray with one prediction per row of val_X
results = val_X.copy()
results["output_label"] = output_label
Or you could use numpy.concatenate to append the labels directly to your input data.
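A minimal sketch of that numpy route, assuming val_X is still the DataFrame and output_label the 1-D prediction array from above:
import numpy as np

# reshape(-1, 1) turns the 1-D predictions into a column so it can be
# stacked next to the feature columns
combined = np.concatenate([val_X.to_numpy(), output_label.reshape(-1, 1)], axis=1)
Note that the result is a plain ndarray, so you lose the column names; the DataFrame approach above keeps them.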

Related

How to scale datasets correctly [closed]

Which one is more correct, or is there another way to scale data? (I've used StandardScaler as an example.)
I've tried each approach and computed the accuracy of every model; there is no meaningful difference, but I want to know which way is correct.
dataset = pd.read_csv("wine.csv")
x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)
or
dataset = pd.read_csv("wine.csv")
x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]
sc = StandardScaler()
x = sc.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
or
dataset = pd.read_csv("wine.csv")
x = dataset.iloc[:, :13]
y = dataset.iloc[:, 13]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
Test data should not be seen or used during the training of a model, as it is used to assess the model's performance.
Therefore the last option is the correct one. The scaling parameters should be computed solely on the training set, as follows:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # learn mean and std on the training set only
x_test = sc.transform(x_test)        # reuse those statistics on the test set
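If you want to make this mistake impossible, a minimal sketch using a scikit-learn Pipeline, starting again from the unscaled x_train/x_test split (the classifier here is just a stand-in, since the question does not name one): the pipeline fits the scaler only on the data passed to fit() and merely transforms the data passed to score() or predict().
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# StandardScaler is fitted on x_train inside fit(); x_test is only transformed
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))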

What is causing this discrepancy between the metric displayed at Catboost.select_features's plot and the actual predictions of the fitted final model? [duplicate]

This question already has answers here:
CatBoost precision imbalanced classes
(3 answers)
Closed 1 year ago.
I'm performing feature selection with CatBoost. This is the training code:
# Parameter grid
params = {
    'auto_class_weights': 'Balanced',
    'boosting_type': 'Ordered',
    'thread_count': -1,
    'random_seed': 24,
    'loss_function': 'MultiClass',
    'eval_metric': 'TotalF1:average=Macro',
    'verbose': 0,
    'classes_count': 3,
    'num_boost_round': 500,
    'early_stopping_rounds': EARLY_STOPPING_ROUNDS
}

# Datasets
train_pool = Pool(train, y_train)
test_pool = Pool(test, y_test)

# Model constructor
ctb_model = ctb.CatBoostClassifier(**params)

# Run feature selection
summary = ctb_model.select_features(
    train_pool,
    eval_set=test_pool,
    features_for_select='0-{0}'.format(train.shape[1] - 1),
    num_features_to_select=10,
    steps=1,
    algorithm=EFeaturesSelectionAlgorithm.RecursiveByShapValues,
    shap_calc_type=EShapCalcType.Exact,
    train_final_model=True,
    logging_level='Silent',
    plot=True
)
After the run has ended, a plot of the evaluation metric per step is displayed (not reproduced here).
It is clear that according to the plot the evaluation metric is TotalF1 with macro average, and the best iteration of the model achieved 0.6153 as the best score for this metric. According to the documentation, when the train_final_model argument is set to True, a model is finally fitted using the selected features that gave the best score for the specified evaluation metric during the feature selection process, so one would expect to get the same results when using the fitted model to make predictions and evaluate it. However, this is not the case.
When running:
from sklearn.metrics import f1_score
predictions = ctb_model.predict(test[summary['selected_features_names']], prediction_type='Class')
f1_score(y_test, predictions, average='macro')
The result I'm getting is:
0.41210319323424227
The difference is huge, and I can't figure out what is causing it.
How can I resolve this issue?
The solution to this question can be found at: CatBoost precision imbalanced classes
After setting the sample_weight parameter of sklearn's f1_score(), I got the same F1 score that CatBoost was reporting.
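A minimal sketch of that fix, assuming the same y_test and predictions as above; compute_sample_weight('balanced') mirrors the auto_class_weights='Balanced' setting used during training, which is presumably why the scores then agree:
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_sample_weight

# weight each test sample inversely to its class frequency
weights = compute_sample_weight(class_weight='balanced', y=y_test)
print(f1_score(y_test, predictions, average='macro', sample_weight=weights))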

sklearn Logistic Regression has too little accuracy even if I try to predict with the train data

I am currently trying to use logistic regression on some vectors, and I use the sklearn library.
Here is my code. I first load the files that contain the data and then assign the values to arrays.
import numpy as np
import kaldiio
from sklearn.linear_model import LogisticRegression

# load files
xvectors_train = kaldiio.load_scp('train/xvector.scp')
# create empty arrays where to store the data
x_train = np.empty(shape=(len(xvectors_train.keys()), len(xvectors_train[list(xvectors_train.keys())[0]])))
y_train = np.empty(len(xvectors_train.keys()), dtype=object)
# assign values to the empty arrays
i = 0
for file_id in xvectors_train:
    x_train[i] = xvectors_train[file_id]
    label = file_id.split('_')
    y_train[i] = label[0]
    i += 1
# create a model and train it
model = LogisticRegression(max_iter=200, solver='liblinear')
model.fit(x_train, y_train)
# predict
model.predict(x_train)
# score
score = model.score(x_train, y_train)
For some reason even if I use the x_train data for my predictions the score is about 0.32. Shouldn't it be 1.0, because the model already knows the answers for those? If I use my test data the score is still like 0.32.
Does anyone know what the problem could be?
There isn't any obvious problem, and the result looks normal: your test score is very similar to your training score.
Most models try to learn rules/parameters that generalize to new data, NOT to memorize the existing training data, which means "Shouldn't it be 1.0, because the model already knows the answers for those?" is not true...
If you are actually seeing that your test set score is significantly lower than your training score (e.g., 0.32 vs 1.0), then it means your model is badly overfitting and needs to be fixed.
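If you want to verify this yourself, a minimal sketch comparing both scores (assuming x_test and y_test are built the same way as the training arrays):
train_score = model.score(x_train, y_train)
test_score = model.score(x_test, y_test)
# a large gap (train much higher than test) would indicate overfitting;
# two similarly low scores, as reported here, just mean the model is not very accurate yet
print(train_score, test_score)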

Semi-supervised sentiment analysis in Python?

I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analysis in Python. However, here's what I don't understand: it seems to me that the data they use is already labeled. So, how do I use the training I did on the labeled data and then apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data, df2 is a big one with unlabeled data. I just finished training with df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question seems stupid, but I don't have a lot of experience with machine learning and NLTK...
EDIT:
Here is the code I'm working on:
enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:, 1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken directly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because otherwise I got a ValueError telling me it can't convert from string to float.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is really the problem that supervised ML tries to solve: given known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns will hopefully be useful for predicting the labels of unseen, unlabeled data.
For example, in a (sad, happy) sentiment-analysis problem, a model might discover patterns like these during training:
Existence of one or more of these words means sad:
("misery", 'sad', 'displaced people', 'homeless'...)
Existence of one or more of these words means happy:
("win", "delightful", "wedding", ...)
If a new textual document is given, we search for these patterns inside it and label it accordingly.
As a side note: we usually do not use the whole labeled dataset in the training process; instead, we take a small portion of the dataset (separate from the training set) to validate our model and verify that it discovered really generic patterns, not ones tailored specifically to the training data.
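To make that concrete for the posted code: the usual pattern is to fit the vectorizer once on the labeled training text and afterwards only transform new text, so the feature columns line up with what the classifier learned. A minimal sketch reusing the names from the question, assuming the fix is to call transform instead of fit_transform on the unlabeled data:
# reuse the vectorizer already fitted on chat_data; do NOT refit it here
unlabeled = chatData.iloc[:, 1].values
unlabeled_features = vectorizer.transform(unlabeled.astype('U')).toarray()
predictions = text_classifier.predict(unlabeled_features)
# map the encoded labels back to their original string form
predicted_labels = enc.inverse_transform(predictions)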

Speed Improvements to Leave One Group Out in Large Datasets [closed]

I am performing classification by LogisticRegression over a large dataset (1.5 million observations) using LeaveOneGroupOut cross-validation. I am using scikit-learn for implementation. My code takes around 2 days to run and I would appreciate your inputs on how to make it faster. A snippet of my code is shown below:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

grp = data['id_x'].values
logo = LeaveOneGroupOut()
LogReg = LogisticRegression()
params_grid = {
    'C': [0.78287388, 1.19946909, 1.0565957, 0.69874106, 0.88427995,
          1.33028731, 0.51466415, 0.91421747, 1.25318725, 0.82665192, 1, 10],
    'penalty': ['l1', 'l2'],
}
random_search = RandomizedSearchCV(LogReg, param_distributions=params_grid,
                                   n_iter=3, cv=logo, scoring='accuracy')
random_search.fit(X, y, groups=grp)
print(random_search.best_params_)
print(random_search.best_score_)
I am going to make the following assumptions:
1- you are using scikit-learn.
2- you need your code to be faster.
To get your final results faster, you can train multiple models at once by running them in parallel. To do so, set the n_jobs parameter in scikit-learn. A reasonable value for n_jobs is the number of CPU cores if you are not running anything else on your computer while training the model, or the number of cores minus one otherwise.
Examples:
RandomizedSearchCV in parallel:
random_search = RandomizedSearchCV(LogReg, n_jobs=3, param_distributions=params_grid,
                                   n_iter=3, cv=logo, scoring='accuracy')
LogisticRegression in parallel:
LogisticRegression(n_jobs=3)
I recommend parallelizing only RandomizedSearchCV.
It might be helpful to also look at the original scikit-learn documentations:
sklearn.linear_model.LogisticRegression
sklearn.model_selection.RandomizedSearchCV
