How to apply regression models for separate test and train datasets? - python

I have test and train subset separately. I need to apply regression models on those datasets. Just tried to make an example but something is wrong. My prediction column for regression is 'survival-time'. Also i have 93 column on each datasets. Here is code.
NOTE: Specially, troubling on 'score = svr.score(X_train,y_train)', should i use test for train argument?
import pandas as pd
import numpy as np
from sklearn.svm import SVR
train = pd.read_csv('training.csv',sep=";")
test = pd.read_csv('test.csv',sep=";")
X_train = train.drop('survival-time',axis=1)
y_train = train['survival-time']
X_test = test.drop('survival-time',axis=1)
y_test = test['survival-time']
svr = SVR().fit(X_train, y_train)
print(svr)
yfit = svr.predict(X_train)
score = svr.score(X_train,y_train)
print("R-squared:", score)
print("MSE:", mean_squared_error(y_train, yfit))
print("RMSLE:", mean_squared_log_error(y_train, yfit))

Related

How to use a GradientBoostingRegressor in scikit-learn with 3 output dimensions

I am trying to map 13-dimensional input data to 3-dimensional output data by using RandomForest and GradientBoostingRegressor of scikit-learn. While for the RandomForest regressor this works fine, I get a ValueError for the GradientBoostingRegressor stating ValueError: y should be a 1d array, got an array of shape (16127, 3) instead.
I don't really understand why I get this error when using GradientBoostingRegressor and not when using the RandomForestRegressor. As far as I understand, both of them use decision trees as a weak learner and combine them to get a good result. Of course I know that I could transform the 3-dimensional output-labels to a 1-dimensional array but this does not make sense as i want to map to a 3-dimensional output-vector. Any idea how I can do this using the GradientBoostingRegressor?
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Read data from csv files
Input_data_features = pd.read_csv("C:/Users/wi9632/Desktop/TestData_InputFeatures.csv", sep=';')
Input_data_labels = pd.read_csv("C:/Users/wi9632/Desktop/TestData_OutputLabels.csv", sep=';')
Input_data_features = Input_data_features.values
Input_data_labels = Input_data_labels.values
# standardize input features X and output labels Y
scaler_standardized_X = StandardScaler()
Input_data_features = scaler_standardized_X.fit_transform(Input_data_features)
scaler_standardized_Y = StandardScaler()
Input_data_labels = scaler_standardized_Y.fit_transform(Input_data_labels)
# Split dataset into train, validation, an test
index_X_Train_End = int(0.7 * len(Input_data_features))
index_X_Validation_End = int(0.9 * len(Input_data_features))
X_train = Input_data_features[0: index_X_Train_End]
X_valid = Input_data_features[index_X_Train_End: index_X_Validation_End]
X_test = Input_data_features[index_X_Validation_End:]
Y_train = Input_data_labels[0: index_X_Train_End]
Y_valid = Input_data_labels[index_X_Train_End: index_X_Validation_End]
Y_test = Input_data_labels[index_X_Validation_End:]
#Define a random forest model and train it
model_randomForest = RandomForestRegressor( )
model_randomForest.fit(X_train, Y_train)
#Predict the test data with Random Forest
Y_pred_randomForest = model_randomForest.predict(X_test)
print(f"Random Forest Prediction: {Y_pred_randomForest}")
#Define a gradient boosting model and train it (-->Here I get the ValueError)
model_gradientBoosting = GradientBoostingRegressor( )
model_gradientBoosting.fit(X_train, Y_train)
#Predict the test data with Random Forest
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")
Here is the test data: https://filetransfer.io/data-package/ABCrGPzt#link
Reminder: As I could not solve my problem, I would like to remind you on this question. Does anybody have an idea how to tackle this problem?
RandomForestRegressor supports multi output regression, see docs. GradientBoostingRegressor does not.
You can use MultiOutputRegressor + GradientBoostingRegressor for the problem. See this answer.
from sklearn.multioutput import MultiOutputRegressor
params = {'n_estimators': 5000, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 2}
estimator = MultiOutputRegressor(ensemble.GradientBoostingRegressor(**params))
estimator.fit(train_data,train_targets)

Logistic regression sklearn - train and apply model

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.ix[:,:-1].values
# Target
df_target = df.ix[:,-1].values
# Values to predict
df_test = df_pred.ix[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
I general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. "Scale" is a simple function, but it is better to use something else, for example StandardScaler.
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How your code works? You split data 10 times into train and hold-out set; 10 times fit model on train set and calculate score on hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of data. So it would be better to fit model on the whole dataset and then make a prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
I want to notice that there are advanced technics like stacking and boosting, but if you learn using sklearn, then it is better to stick to the basics.

How to predict after training data using naive bayes with python?

I have got a dataset which contains just two useful columns for training my model, first is news heading and the second is category of news.
So, I got the following training command running successfully using python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
s = s.lower()
# remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
s = re.sub('\s\W',' ',s)
s = re.sub('\W\s',' ',s)
# make sure we didn't introduce any double spaces
s = re.sub('\s+',' ',s)
return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is, how can I give a new set of data (e.g. Just news heading) and tell the program to predict the news category using python sklearn command?
P.S. My training data is like:
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on the **accuracy you can do the following:**
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
Finally, we can see all the available metrics here !
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To project back these values and get the labels you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
You are very close. Just need two more lines of code. Use this link, explains Naives Bayes using Sci Kit,
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below, import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))

Scikit-learn MLPRegressor - How not to predict negative results?

I was trying to train and test my dataset using MLPRegressor. I have two datasets (train dataset and test dataset), both of them have the exact same columns of features and label. Here's the example of my datasets :
Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
................
here's my code :
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
randomseed = np.random.seed(0)
datatraining = pd.read_csv("datatrain.csv")
datatesting = pd.read_csv("datatest.csv")
columns = ["Full","Id","Id & PPDB","Id & Words Sequence","Id & Synonyms","Id & Hypernyms","Id & Hyponyms"]
labeltrain = datatraining["Gold Standard"].values
featurestrain = datatraining[list(columns)].values
labeltest = datatesting["Gold Standard"].values
featurestest = datatesting[list(columns)].values
X_train = featurestrain
y_train = labeltrain
X_test = featurestest
y_test = labeltest
mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=50, max_iter=1000, learning_rate='constant', random_state=randomseed)
mlp.fit(X_train, y_train)
print('Accuracy training : {:.3f}'.format(mlp.score(X_train, y_train)))
print
predicting = mlp.predict(X_test)
print predicting
print
And here's the result of the prediction :
[ 1.97553444 3.43401776 3.04097607 2.7015464 2.03777686 3.63274593
3.37826962 -0.60260337 0.41626517 3.5374289 3.66114929 3.244683
2.6313756 2.14243075 3.20841434 2.105238 4.9805092 4.00868273
2.45508505 4.53332828 3.41862096 3.35721078 3.23069344 3.72149434
4.9805092 2.61705563 1.55052494 -0.14135979 2.65875196 3.05328206
3.51127424 0.51076396 2.39947967 1.95916595 3.71520651 2.1526807
2.26438616 0.73249057 2.46888695 3.56976227 1.03109988 2.15894353
2.06396103 0.66133707 4.72861602 2.4592647 2.84176811 2.3157664
1.68426416 2.56022955 -0.00518545 1.67213609 0.6998739 3.25940136
3.25369266 3.88888542 1.9168694 2.26036302 3.97917769 2.00322903
3.03121106 3.29083723 0.6998739 4.33375678 0.6998739 2.71141538
-4.23755447 3.958574 2.67765274 2.68715423 2.32714117 2.6500056
........]
As we can see, there are some negative results. How not to predict negative results? Besides, my datasets are contain of all positive values.
Assuming you have no categorical variables. Also, you mentioned in the question that you have all positive values.
Try to standardize your data using SatandardSacler(). Use your X_train and y_train to standardize data.
from sklearn import preprocessing as pre
...
scaler = pre.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)
After initializing the model with best parameters based on your case, fit scaled data,
mlp.fit(X_train_scaled, y_train)
...
predicting = mlp.predict(X_test_scaled)
This should do it. Let me know how it goes.
Also, there are some good reads,
https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning
https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
Add a second hidden layer with one ReLU node.

How can I create a Linear Regression Model from a split dataset?

I've just split my data into a training and testing set and my plan is to train a Linear Regression model and be able to check what the performance is like using my testing split.
My current code is:
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('C:/Dataset.csv')
df['split'] = np.random.randn(df.shape[0], 1)
split = np.random.rand(len(df)) <= 0.75
training_set = df[split]
testing_set = df[~split]
Is there a proper method I should be using to plot a Linear Regression model from an external file such as a .csv?
Since you want to use scikit-learn, here's an approach using sklearn.linear_model.LinearRegression:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_train, y_train = training_set[x_vars], training_set[y_var]
X_test, y_test = testing_test[x_vars], testing_test[y_var]
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Depending on whether you need more descriptive output, you might also look into use statsmodels for linear regression.

Categories