Training a sklearn classifier with more than a single feature - python

I'm currently training a LinearSVC classifier with a single feature vectorizer. I'm processing news articles, which are stored in separate files. Those files originally had a title, a textual body, a date, an author and sometimes an image, but I ended up keeping only the textual body as a feature. I'm doing it this way:
# Loading the files (plain files with just the news content; no date, author or other features)
data_train = load_files(self.TRAIN_FOLDER, encoding=self.ENCODING) # data_train
data_test = load_files(self.TEST_FOLDER, encoding=self.ENCODING)
unlabeled = load_files(self.UNLABELED_FOLDER, encoding=self.ENCODING)
categories = data_train.target_names
# Get the sparse matrix of each dataset
y_train = data_train.target
y_test = data_test.target
# Vectorizing
vectorizer = TfidfVectorizer(encoding=self.ENCODING, use_idf=True, norm='l2', binary=False, sublinear_tf=True, min_df=0.001, max_df=1.0, ngram_range=(1, 2), analyzer='word')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
X_unlabeled = vectorizer.transform(unlabeled.data)
# Instantiating the classifier
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
# Fitting the model according to the training set and predicting
scaler = preprocessing.StandardScaler(with_mean=False)
scaler = scaler.fit(X_train)
normalized_X_train = scaler.transform(X_train)
clf.fit(normalized_X_train, y_train)
normalized_X_test = scaler.transform(X_test)
pred = clf.predict(normalized_X_test)
accuracy_score = metrics.accuracy_score(y_test, pred)
recall_score = metrics.recall_score(y_test, pred)
precision_score = metrics.precision_score(y_test, pred)
But now I would like to include other features, such as the date or the author, and all the simpler examples I found use a single feature, so I'm not really sure how to proceed. Should I have all the information in a single file? How do I differentiate authors from content? Should I use a vectorizer for each feature? If so, should I fit one model with the different vectorized features, or should I have a different classifier for each feature? Can you suggest something to read (explained for newbies)?
Thanks in advance,

The output of TfidfVectorizer is a scipy.sparse.csr.csr_matrix object. You may use hstack to add more features (like here). Alternatively, you may convert the feature space you already have above to a numpy array or pandas DataFrame and then add the new features (which you might have created from other vectorizers) as new columns to it. Either way, your final X_train and X_test should include all the features in one place. You may also need to standardize the combined features before training (here).
I do not have your data, so here is an example with some dummy data:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
X_train = pd.DataFrame(X_train.todense())
X_train['has_image'] = [1, 0, 0, 1] # just adding a dummy feature for demonstration
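If you would rather keep everything sparse (which matters once the vocabulary gets large), here is a minimal sketch of the hstack route mentioned above, reusing the same dummy corpus; has_image is again just a made-up extra feature:
import numpy as np
from scipy.sparse import hstack, csr_matrix
X_sparse = vectorizer.fit_transform(corpus)  # sparse tf-idf features
has_image = csr_matrix(np.array([[1], [0], [0], [1]]))  # extra feature as a sparse column
X_combined = hstack([X_sparse, has_image]).tocsr()  # all features end up in one sparse matrix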

Related

Unable to make prediction after loading sklearn model

I have created an ML model with Scikit-Learn and saved it. Now when I load the model, I have trouble with the transformation and prediction.
I have 4 features in a DataFrame. The first two features are textual, and the other 2 are numerical. The result column is 1 or 0.
In order to train my model, I used ColumnTransformer and CountVectorizer to transform and vectorize the textual features. I specified the NAMES of the columns that I want to transform/vectorize (columns text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' takes care of that.
Part of the code that works:
features = df.iloc[:, :-1]
results = df.iloc[:, -1]
transformerVectoriser = ColumnTransformer(transformers=[('vector word 1', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features = 12000, stop_words = 'english'), 'text1'),
('vector phrase 3', CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features = 2500, stop_words = 'english'), 'text2')],
remainder='passthrough') # Default is to drop untransformed columns, passthrough == leave columns as they are
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
model = clf.fit(x_train, y_train)
y_pred = model.predict(x_test)
filename = 'ml_model.sav'
pickle.dump(model, open(filename, 'wb'))
filename = 'ml_transformer.sav'
pickle.dump(transformerVectoriser, open(filename, 'wb'))
But when I load the model and make a prediction, I get an error:
# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))
# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
When I was training my model I used a pandas DataFrame, but when making a prediction I just put the values in a list. So data_for_prediction is a list that looks like this:
["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
I think that is the cause of the error: I used column names in the ColumnTransformer, but now, when I want to make a prediction, the vectorizer does not know which values to vectorize.
My final model and vectorizer will be used in an API, and the API should only take JSON, so I do not want to convert the JSON to a DataFrame and pass it to the model.
Is there a way to fix this error without using a pandas DataFrame in my final Flask app?
The training data is a DataFrame whose columns are given by:
x_train.columns
The function vectorizer.transform() wants data in the same format, so assuming that
data_f_p = ["text that should be vectorized", 4, 7, 0]
corresponds to the same four columns as x_train, you can turn it into a DataFrame with:
data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)
If you don't want to use pandas.DataFrame in your REST API endpoint, just don't train your model on the DataFrame; convert your data to a numpy array first:
>>> df
TEXT_1 TEXT_2 NUM_1 NUM_2
0 This is the first text. The second text. 300.000 23.3
1 Here is the third text. And the fourth text. 2.334 29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
dtype=object)
Then change how you define the model. I'd suggest combining the preprocessing and prediction steps into a single model using sklearn.pipeline.Pipeline, like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
model = Pipeline(steps=[
('transformer', ColumnTransformer(
transformers=[
('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
],
remainder='passthrough',
)),
('predictor', RandomForestClassifier()),
])
Note that here we use indices instead of names to reference the text columns when defining the transformers for the ColumnTransformer instance. Once the initial DataFrame has been transformed to a numpy array, the TEXT_1 feature is located at index 0 and TEXT_2 at index 1 in each data row. Here is how you can use the model:
from joblib import dump, load
X = df.to_numpy()
model.fit(X, y)
dump(model, 'model.joblib')
...
model = load('model.joblib')
results = model.predict(data)
As a result, you don't have to convert your incoming data to the DataFrame in order to make a prediction.
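For example, in the Flask view you could go straight from the JSON payload (a list of rows) to an object-dtype numpy array; the payload below is only an illustration:
import numpy as np
payload = [["text that should be vectorized", "more text to vectorize", 4, 7]]  # e.g. request.get_json()
data = np.array(payload, dtype=object)  # keep the same column order used during training
print(model.predict(data))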

Make predictions with a trained model on Python

I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.
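(For reference, the dimension mismatch usually comes from fitting a second CountVectorizer on the new data; a minimal sketch of this second step that reuses the vectorizer already fitted on the training data would look like this:)
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
classes = vectorizer.transform(test['TEXT'])  # transform only; do not fit a new vectorizer
classificacao = nb.predict(classes)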

Oversampling after splitting the dataset - Text classification

I am having some issues with the steps to follow for over-sampling a dataset.
What I have done is the following:
# Separate input features and target
y_up = df.Label
X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)
# setting up testing and training sets
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)
class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]
# upsample minority
class_1_upsampled = resample(class_1,
replace=True,
n_samples=len(class_0),
random_state=27) #
# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])
Since my dataset looks like:
Label Text
1 bla bla bla
0 once upon a time
1 some other sentences
1 a few sentences more
1 this is my dataset!
I applied a vectorizer to transform the strings into numbers:
X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]
X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)
Then I applied the logistic regression function:
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)
However, I have got the following error at this step:
X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
pred_up_log = upsampled_log.predict(X_test_up)
ValueError: X has 3021 features per sample; expecting 5542
Since I was told that I should apply the over-sampling after splitting my dataset into train and test, I have not vectorised the test set.
My doubts are then the following:
Is it right to vectorise the test set later on, as in: X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)?
Is it right to apply the over-sampling after splitting the dataset into training and test sets?
Alternatively, I tried the SMOTE function. The code below works, but, if possible, I would prefer plain over-sampling rather than SMOTE.
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df['Text'],df['Label'], test_size=0.2,random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_tfidf, y_train_up)
print("Shape after smote is:",X_train_res.shape,y_train_res.shape)
nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
y_pred = nb.predict(count_vect.transform(X_test_up))
print(accuracy_score(y_test_up,y_pred))
Any comments and suggestions will be appreciated.
Thanks
It is better to do the CountVectorizer and TfidfTransformer steps on the whole dataset, then split into test and train, and keep the result as a sparse matrix without converting it back into a DataFrame.
For example, here is a dataset:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
df = pd.DataFrame({'Text':['This is bill','This is mac','here’s an old saying',
'at least old','data scientist years','data science is data wrangling',
'This rings particularly','true for data science leaders',
'who watch their data','scientists spend days',
'painstakingly picking apart','ossified corporate datasets',
'arcane Excel spreadsheets','Does data science really',
'they just delegate the job','Data Is More Than Just Numbers',
'The reason that',
'data wrangling is so difficult','data is more than text and numbers'],
'Label':[0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})
We perform the vectorization and transformation, followed by split:
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)
X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_tfidf,df['Label'].values,
test_size=0.2,random_state=42)
Upsampling can be done by resampling the indices of the minority class:
class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
np.random.choice(class_1,len(class_0),replace=True)
))
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])
And the prediction will work:
upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])
You might have concerns about data leakage, that is, some information from the test set going into the training through the use of TfidfTransformer(). I have honestly yet to see concrete proof or a demonstration of this, but below is an alternative where you apply the tf-idf transformation separately:
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
X_train_up, X_test_up, y_train_up, y_test_up=train_test_split(df_counts,df['Label'].values,
test_size=0.2,random_state=42)
class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
np.random.choice(class_1,len(class_0),replace=True)
))
tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsampled_y = y_train_up[up_idx]
upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain, upsampled_y)
X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)
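For completeness, here is a minimal sketch of the same random over-sampling done with imbalanced-learn's RandomOverSampler (assuming the tf-idf split from the first snippet above; the package is already required for SMOTE), which works directly on the sparse matrix:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=27)
X_train_res, y_train_res = ros.fit_resample(X_train_up, y_train_up)  # randomly duplicates minority rows
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_res, y_train_res)
upsampled_log.predict(X_test_up)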

How to predict after training data using naive bayes with python?

I have got a dataset which contains just two useful columns for training my model: the first is the news heading and the second is the category of the news.
So, I got the following training code running successfully using Python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is: how can I give it a new set of data (e.g. just a news heading) and tell the program to predict the news category using Python and sklearn?
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on the accuracy, you can do the following:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
Finally, you can see all the available metrics here!
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To map these values back to the original labels, you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
You are very close. You just need two more lines of code. Use this link, which explains Naive Bayes with scikit-learn:
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below. Import the accuracy function:
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))
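To classify a brand-new headline, rather than the held-out test split, here is a minimal sketch that reuses the already-fitted vectorizer and encoder (the example headline is made up):
new_headlines = ["Fed raises interest rates again"]  # hypothetical new data
new_x = vectorizer.transform([normalize_text(s) for s in new_headlines])  # transform only, no refitting
new_pred = nb.predict(new_x)
print(encoder.inverse_transform(new_pred))  # map the numeric prediction back to the category label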

Scikit-learn MLPRegressor - How not to predict negative results?

I was trying to train and test my dataset using MLPRegressor. I have two datasets (a train dataset and a test dataset), both of which have exactly the same feature and label columns. Here's an example of my datasets:
Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
................
Here's my code:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
randomseed = 0  # np.random.seed(0) returns None, so pass the seed value itself to random_state
datatraining = pd.read_csv("datatrain.csv")
datatesting = pd.read_csv("datatest.csv")
columns = ["Full","Id","Id & PPDB","Id & Words Sequence","Id & Synonyms","Id & Hypernyms","Id & Hyponyms"]
labeltrain = datatraining["Gold Standard"].values
featurestrain = datatraining[list(columns)].values
labeltest = datatesting["Gold Standard"].values
featurestest = datatesting[list(columns)].values
X_train = featurestrain
y_train = labeltrain
X_test = featurestest
y_test = labeltest
mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=50, max_iter=1000, learning_rate='constant', random_state=randomseed)
mlp.fit(X_train, y_train)
print('Accuracy training : {:.3f}'.format(mlp.score(X_train, y_train)))  # score() returns R^2 for a regressor
print()
predicting = mlp.predict(X_test)
print(predicting)
print()
And here's the result of the prediction :
[ 1.97553444 3.43401776 3.04097607 2.7015464 2.03777686 3.63274593
3.37826962 -0.60260337 0.41626517 3.5374289 3.66114929 3.244683
2.6313756 2.14243075 3.20841434 2.105238 4.9805092 4.00868273
2.45508505 4.53332828 3.41862096 3.35721078 3.23069344 3.72149434
4.9805092 2.61705563 1.55052494 -0.14135979 2.65875196 3.05328206
3.51127424 0.51076396 2.39947967 1.95916595 3.71520651 2.1526807
2.26438616 0.73249057 2.46888695 3.56976227 1.03109988 2.15894353
2.06396103 0.66133707 4.72861602 2.4592647 2.84176811 2.3157664
1.68426416 2.56022955 -0.00518545 1.67213609 0.6998739 3.25940136
3.25369266 3.88888542 1.9168694 2.26036302 3.97917769 2.00322903
3.03121106 3.29083723 0.6998739 4.33375678 0.6998739 2.71141538
-4.23755447 3.958574 2.67765274 2.68715423 2.32714117 2.6500056
........]
As we can see, there are some negative results. How can I avoid predicting negative results? Besides, my datasets contain only positive values.
I am assuming you have no categorical variables. Also, you mentioned in the question that all your values are positive.
Try standardizing your data using StandardScaler(). Fit the scaler on X_train.
from sklearn import preprocessing as pre
...
scaler = pre.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the scaler fitted on X_train; do not refit it on the test data
After initializing the model with the best parameters for your case, fit it on the scaled data:
mlp.fit(X_train_scaled, y_train)
...
predicting = mlp.predict(X_test_scaled)
This should do it. Let me know how it goes.
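If you prefer, here is a minimal sketch of the same idea wrapped in a sklearn Pipeline, which guarantees the scaler is fitted on the training data only:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
pipe = make_pipeline(StandardScaler(), MLPRegressor(solver='lbfgs', hidden_layer_sizes=50, max_iter=1000, random_state=0))
pipe.fit(X_train, y_train)  # StandardScaler is fitted on X_train only
predicting = pipe.predict(X_test)  # X_test is scaled with the training statistics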
Also, there are some good reads,
https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning
https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
Add a second hidden layer with one ReLU node.
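One way to read that suggestion with MLPRegressor (an assumption about the intent: the last hidden layer becomes a single ReLU unit, which pushes the output towards non-negative values but does not strictly guarantee it, since the final linear layer can still flip the sign):
mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=(50, 1), activation='relu', max_iter=1000, random_state=0)
mlp.fit(X_train_scaled, y_train)
predicting = mlp.predict(X_test_scaled)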
