This is the first time I'm doing supervised machine learning. This is a pretty advanced topic (at least for me) and I find it hard to specify a question, since I'm not sure what is going wrong.
# Create a training list and test list (looks something like this):
train = [('this hostel was nice',2),('i hate this hostel',1)]
test = [('had a wonderful time',2),('terrible experience',1)]
# Loading modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
# Use a BOW representation of the reviews
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in train])
test_features = vectorizer.fit([r[0] for r in test])
# Fit a naive bayes model to the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
# Use the classifier to predict classification of test dataset
predictions = nb.predict(test_features)
actual=[r[1] for r in test]
Here I get the error:
float() argument must be a string or a number, not 'CountVectorizer'
This confuses me, since the original ratings that I zipped together with the reviews are integers:
type(ratings_new[0])
int
You should change the line
test_features = vectorizer.fit([r[0] for r in test])
to:
test_features = vectorizer.transform([r[0] for r in test])
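The reason is that you already fitted the vectorizer on your training data, so you don't need to fit it again on your test data. Instead, you need to transform the test data with the fitted vectorizer. Note that fit returns the vectorizer itself rather than a feature matrix, which is why predict complained about receiving a 'CountVectorizer'.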
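To make the fit/transform distinction concrete, here is a minimal sketch of the corrected flow. The toy reviews are made up for illustration (they are not the asker's data); the variable names follow the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: (review text, rating), 2 = positive, 1 = negative
train = [("this hostel was nice", 2), ("lovely clean hostel", 2),
         ("i hate this hostel", 1), ("dirty terrible hostel", 1)]
test = [("nice and clean", 2), ("terrible dirty room", 1)]

vectorizer = CountVectorizer(stop_words="english")
# Learn the vocabulary from the training reviews and build their count matrix
train_features = vectorizer.fit_transform([text for text, _ in train])
# Reuse the learned vocabulary for the test reviews: transform only, no fit
test_features = vectorizer.transform([text for text, _ in test])

nb = MultinomialNB()
nb.fit(train_features, [label for _, label in train])
predictions = nb.predict(test_features)
print(predictions)
```

Because the test matrix is built from the training vocabulary, it has the same number of columns as the training matrix, which is what the classifier expects. Words that appear only in the test reviews are simply ignored.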
Hi, I am new to machine learning with the sklearn library. I am trying to put a decision tree into a pipeline and then both make predictions and output the fitted tree, but when I run the following code I get the error:
'Pipeline' object has no attribute 'tree_'
So I wonder whether the pipeline does not support tree output, and how I can fix this problem. I have also tried using the decision_tree class directly, but then I get another error:
setting an array element with a sequence.
I know that this appears because I have vectors with different dimensions, but I still have no clue how to deal with the situation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_text
from sklearn import tree
# a function that reads the corpus, tokenizes it and returns the documents
# and their labels
def read_corpus(corpus_file, use_sentiment):
    documents = []
    labels = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            documents.append(tokens[3:])
            if use_sentiment:
                # 2-class problem: positive vs negative
                labels.append( tokens[1] )
            else:
                # 6-class problem: books, camera, dvd, health, music, software
                labels.append( tokens[0] )
    return documents, labels
# a dummy function that just returns its input
def identity(x):
    return x
# read the data and split it into train and test
X, Y = read_corpus('/Users/dengchenglong/Downloads/trainset', use_sentiment=False)
split_point = int(0.75*len(X))
Xtrain = X[:split_point]
Ytrain = Y[:split_point]
Xtest = X[split_point:]
Ytest = Y[split_point:]
# let's use the TF-IDF vectorizer
tfidf = False
# we use a dummy function as tokenizer and preprocessor,
# since the texts are already preprocessed and tokenized.
if tfidf:
    vec = TfidfVectorizer(preprocessor = identity,
                          tokenizer = identity)
else:
    vec = CountVectorizer(preprocessor = identity,
                          tokenizer = identity)
# combine the vectorizer with a decision tree classifier
classifier = Pipeline( [('vec', vec),
                        ('cls', tree.DecisionTreeClassifier())])
# train the classifier on the train dataset
decision_tree = classifier.fit(Xtrain, Ytrain)
# predict the labels of the test data
Yguess = classifier.predict(Xtest)
tree.plot_tree(classifier.fit(Xtest, Ytest))
# report performance of the classifier
print(accuracy_score(Ytest, Yguess))
print(classification_report(Ytest, Yguess))
What if you try this:
from sklearn.pipeline import make_pipeline
# combine the vectorizer with a decision tree classifier
clf = DecisionTreeClassifier()
classifier = make_pipeline(vec,clf)
It seems that you need to instantiate the model you want to apply before passing it to the pipeline. Let me know whether this works and, if not, which errors it returns.
From: Scikit-learn documentation
Example out of: Make pipeline example with trees
I have trained a Naive Bayes model using labeled data and tested the same model with another set of labeled data. I have calculated accuracy, precision and recall scores using the code below.
My code :
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from io import open
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename, encoding='utf-8') as file:
        file.readline()
        for line in file:
            line = line.strip().split(' ',1)
            labels.append(line[0])
            reviews.append(line[1])
    return reviews, labels
X_train, y_train = load_data('./train_data.txt')
X_test, y_test = load_data('./test_data.txt')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :" , score)
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))
print("Precision Score : ",precision_score(y_test, y_pred,average='micro'))
print("Recall Score : ",recall_score(y_test, y_pred,average='micro'))
But now I have another test set which contains unlabeled data. Can I test the model with this unlabeled data using the above code?
This is how I interpret your question.
You have trained a Naive Bayes model on the training data, tested it on the test data, and used the confusion matrix and accuracy as metrics to measure the model's performance.
Now your question may be:
Using this model, is it possible to predict the labels of unseen data that doesn't have any labels?
If that is your question, then yes, it is possible. Moreover, that is the reason you trained the model in the first place: to predict labels for unseen data.
Since the unseen data doesn't have labels, how do you know the predicted labels are correct? That is exactly why you tested the model on the test data and measured its performance. If the accuracy on the test data is 70%, you can expect the model to predict correctly about 70% of the time on similar unseen data.
I strongly suggest thinking about why you are doing each step before you start doing it!
If you want to automatically calculate the accuracy of the model on the unlabeled data, then the answer is no.
To create a confusion matrix and compute these metrics you need to pass the true labels (the y variable). Good practice is to split your labeled data into training and test sets.
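As a rough sketch of what predicting on unlabeled data looks like: the fitted vectorizer and classifier are reused, and only transform and predict are called on the new texts. The review texts and the name X_unlabeled are made up for illustration; vec and clf follow the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled training data
X_train = ["great product works well", "love it great value",
           "broke after one day", "awful waste of money"]
y_train = ["pos", "pos", "neg", "neg"]

# Hypothetical unlabeled data: texts only, no labels available
X_unlabeled = ["great value", "awful product broke"]

vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(X_train), y_train)

# Same fitted vectorizer, transform only; then predict the labels
X_unlabeled_transformed = vec.transform(X_unlabeled)
y_unlabeled_pred = clf.predict(X_unlabeled_transformed)

for text, label in zip(X_unlabeled, y_unlabeled_pred):
    print(label, "<-", text)
```

There is no score, confusion matrix, precision or recall here, because all of those compare predictions against true labels, and the unlabeled set has none.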
I want to start developing an application using machine learning. I want to classify text as spam or not spam. I have 2 files, spam.txt and ham.txt, that each contain thousands of sentences. Suppose I want to use a classifier, let's say LogisticRegression.
For example, as I saw on the Internet, to fit my model I need to do something like this:
`lr = LogisticRegression()
model = lr.fit(X_train, y_train)`
So here comes my question, what are actually X_train and y_train? How can I obtain them from my sentences? I searched on the Internet, I did not understand, here is my last call, I am pretty new to this topic. Thank you!
According to the documentation (see here):
X corresponds to your float feature matrix of shape (n_samples, n_features) (aka. the design matrix of your training set)
y is the float target vector of shape (n_samples,) (the label vector). In your case, label 0 could correspond to a spam example, and 1 to a ham one
The question is now about how to get a float feature matrix from text data.
A common scheme is to use a tf-idf vectorisation (more on this here), which is available in sklearn.
The vectorisation can be chained with the logistic regression via the Pipeline API of sklearn.
Roughly, this is what the code would look like:
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
# prepare string data
with open('spam.txt', 'r') as f:
    spam = f.readlines()
with open('ham.txt', 'r') as f:
    ham = f.readlines()
text_train = list(chain(spam, ham))
# prepare labels
labels_train = np.concatenate((np.zeros(len(spam)),np.ones(len(ham))))
# build pipeline
vectorizer = TfidfVectorizer()
regressor = LogisticRegression()
pipeline = Pipeline([('vectorizer', vectorizer), ('regressor', regressor)])
# fit pipeline
pipeline.fit(text_train, labels_train)
# test predict
test = ["Is this spam or ham?"]
pipeline.predict(test) # predicted label: 0 (spam) or 1 (ham)
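To see what X_train and y_train actually are, it can help to build them by hand once, without the pipeline. This is only a sketch on made-up sentences (the real ones would come from spam.txt and ham.txt); it shows that X_train has one row per sentence and one column per vocabulary word, and y_train has one label per sentence.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences standing in for the contents of spam.txt and ham.txt
spam = ["win a free prize now", "free money click here"]
ham = ["are we still meeting today", "see you at lunch"]

text_train = spam + ham
# One label per sentence: 0 for spam, 1 for ham (same convention as above)
y_train = np.concatenate((np.zeros(len(spam)), np.ones(len(ham))))

vectorizer = TfidfVectorizer()
# Sparse matrix of tf-idf weights, shape (n_samples, n_features)
X_train = vectorizer.fit_transform(text_train)

print(X_train.shape)
print(y_train)
```

These are the two objects you would pass to lr.fit(X_train, y_train). The pipeline above does the same vectorisation step internally.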
I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.iloc[:,:-1].values
# Target
df_target = df.iloc[:,-1].values
# Values to predict
df_test = df_pred.iloc[:,:-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training data and the data to predict independently, which isn't correct. Both datasets must be scaled with the same scaler, fitted on the training data. scale is a simple function that doesn't remember the parameters it used, so it is better to use something else, for example StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting
How does your code work? You split the data 10 times into a train set and a hold-out set, and 10 times you fit the model on the train set and calculate the score on the hold-out set. This way you get cross-validation scores, but the model left over at the end was fitted on only part of the data. So it would be better to fit the model on the whole dataset and then make a prediction:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
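Putting both points together, here is one possible sketch: the scaler and the logistic regression go into a single pipeline, cross_val_score produces the 10 fold scores, and a final fit on all training data is used for the new predictions. The data is synthetic (make_classification) and stands in for the two CSV files, so the numbers mean nothing; shuffle and random_state are my own choices, not something from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training CSV (first 200 rows)
# and the CSV with rows to predict (last 20 rows)
X_all, y_all = make_classification(n_samples=220, n_features=5, random_state=0)
X, y = X_all[:200], y_all[:200]
X_new = X_all[200:]

# Scaler and classifier in one estimator, so each CV fold
# fits the scaler on its own training part only
model = make_pipeline(StandardScaler(), LogisticRegression())

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)
print(scores.mean())

# Final model: fit on the whole training set, then predict the new rows
model.fit(X, y)
novel = model.predict(X_new)
print(novel)
```

cross_val_score clones the estimator for every fold, so the cross-validation step leaves model unfitted. That is why the explicit model.fit(X, y) at the end is needed before predicting.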
I'd note that there are more advanced techniques like stacking and boosting, but if you are still learning sklearn, it is better to stick to the basics.
I have a dataset which contains just two columns that are useful for training my model: the first is the news heading and the second is the news category.
So, I got the following training code running successfully in Python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W',' ',s)
    s = re.sub(r'\W\s',' ',s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+',' ',s)
    return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is: how can I give the program a new set of data (e.g. just news headings) and have it predict the news category using sklearn?
P.S. My training data is like:
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on accuracy, you can do the following:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics.
Finally, you can see all the available metrics here.
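For instance, a few of the other metrics in sklearn.metrics can be computed in the same way. The label arrays below are made up purely to keep the sketch self-contained; in the real code they would be the y_test and y_predicted from above.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical encoded true and predicted categories
y_test = [0, 1, 2, 2, 1, 0]
y_predicted = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_test, y_predicted)
# Macro-averaged F1: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_test, y_predicted, average="macro")

print(accuracy)
print(macro_f1)
# Per-class precision, recall and F1 in one table
print(classification_report(y_test, y_predicted))
```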
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To map these values back to the original labels you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
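For brand-new headings that were not part of the original dataset, the same idea applies: reuse the fitted vectorizer with transform (not fit_transform), predict, and decode with the encoder. Below is a minimal sketch with made-up headlines and category names; in the question's code the new headings would also go through normalize_text first, which is skipped here for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical training headlines and categories
titles = ["stocks rally as markets rise", "bank shares fall on earnings",
          "new phone launches with faster chip", "software update fixes phone bug"]
categories = ["business", "business", "technology", "technology"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(titles)
encoder = LabelEncoder()
y = encoder.fit_transform(categories)
nb = MultinomialNB().fit(x, y)

# New, unlabeled headings: only transform with the already fitted vectorizer
new_titles = ["markets fall as bank earnings disappoint",
              "phone chip gets software update"]
x_new = vectorizer.transform(new_titles)

# Predict encoded categories, then decode them back to the original names
predicted_labels = encoder.inverse_transform(nb.predict(x_new))
for title, label in zip(new_titles, predicted_labels):
    print(label, "<-", title)
```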
You are very close. You just need two more lines of code. This link explains Naive Bayes using scikit-learn:
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below. Import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))