What are X_train and y_train? - python

I want to start developing an application using machine learning. I want to classify text as spam or not spam. I have two files - spam.txt and ham.txt - each containing thousands of sentences. I want to use a classifier, let's say LogisticRegression.
As I saw on the Internet, to fit my model I need to do something like this:
`lr = LogisticRegression()
model = lr.fit(X_train, y_train)`
So here comes my question: what actually are X_train and y_train? How can I obtain them from my sentences? I searched on the Internet but did not understand, so this is my last resort. I am pretty new to this topic. Thank you!

According to the documentation (see here):
X corresponds to your feature matrix of shape (n_samples, n_features) (aka the design matrix of your training set)
y is the target vector of shape (n_samples,) (the label vector). In your case, label 0 could correspond to a spam example, and 1 to a ham one
The question is now how to get a float feature matrix from text data.
A common scheme is tf-idf vectorisation (more on this here), which is available in sklearn.
The vectorisation can be chained with the logistic regression via sklearn's Pipeline API.
This is roughly how the code would look:
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np

# prepare string data
with open('spam.txt', 'r') as f:
    spam = f.readlines()
with open('ham.txt', 'r') as f:
    ham = f.readlines()
text_train = list(chain(spam, ham))

# prepare labels: 0 for spam, 1 for ham
labels_train = np.concatenate((np.zeros(len(spam)), np.ones(len(ham))))

# build pipeline
vectorizer = TfidfVectorizer()
regressor = LogisticRegression()
pipeline = Pipeline([('vectorizer', vectorizer), ('regressor', regressor)])

# fit pipeline
pipeline.fit(text_train, labels_train)

# test predict
test = ["Is this spam or ham?"]
pipeline.predict(test)  # returns 0 (spam) or 1 (ham)
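If you also want explicit X_train and y_train variables, as in the snippet from the question, a common approach is to split the data before fitting. A minimal sketch, reusing text_train, labels_train and pipeline from above:

from sklearn.model_selection import train_test_split

# hold out 25% of the examples to evaluate on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    text_train, labels_train, test_size=0.25, random_state=0)

pipeline.fit(X_train, y_train)         # fit only on the training split
print(pipeline.score(X_test, y_test))  # mean accuracy on the held-out split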

Related

output decision tree in the pipeline manner

Hi, as I am new to machine learning with the sklearn library, I tried to incorporate a decision tree into a pipeline and then produce both the prediction and the output of the model, but when I run the following code I get the error:
'Pipeline' object has no attribute 'tree_'
So I wonder whether pipelines do not support tree output, and how I can fix this problem? I have also tried using the decision tree class directly, but I got another error:
setting an array element with a sequence.
I know this appears because I have vectors of different dimensions, but I still have no clue how to deal with the situation.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree.export import export_text
from sklearn import tree

# a function that reads the corpus, tokenizes it and returns the documents
# and their labels
def read_corpus(corpus_file, use_sentiment):
    documents = []
    labels = []
    with open(corpus_file, encoding='utf-8') as f:
        for line in f:
            tokens = line.strip().split()
            documents.append(tokens[3:])
            if use_sentiment:
                # 2-class problem: positive vs negative
                labels.append(tokens[1])
            else:
                # 6-class problem: books, camera, dvd, health, music, software
                labels.append(tokens[0])
    return documents, labels

# a dummy function that just returns its input
def identity(x):
    return x

# read the data and split it into train and test
X, Y = read_corpus('/Users/dengchenglong/Downloads/trainset', use_sentiment=False)
split_point = int(0.75 * len(X))
Xtrain = X[:split_point]
Ytrain = Y[:split_point]
Xtest = X[split_point:]
Ytest = Y[split_point:]

# let's use the TF-IDF vectorizer
tfidf = False

# we use a dummy function as tokenizer and preprocessor,
# since the texts are already preprocessed and tokenized.
if tfidf:
    vec = TfidfVectorizer(preprocessor=identity, tokenizer=identity)
else:
    vec = CountVectorizer(preprocessor=identity, tokenizer=identity)

# combine the vectorizer with a decision tree classifier
classifier = Pipeline([('vec', vec), ('cls', tree.DecisionTreeClassifier())])

# train the classifier on the train dataset
decision_tree = classifier.fit(Xtrain, Ytrain)

# predict the labels of the test data
Yguess = classifier.predict(Xtest)
tree.plot_tree(classifier.fit(Xtest, Ytest))

# report performance of the classifier
print(accuracy_score(Ytest, Yguess))
print(classification_report(Ytest, Yguess))
What if you try this:
from sklearn.pipeline import make_pipeline

# combine the vectorizer with a decision tree classifier
clf = DecisionTreeClassifier()
classifier = make_pipeline(vec, clf)

As it seems, before using a pipeline you must instantiate the model you are trying to apply. Let me know if this works and, if not, what errors it returns.
From: Scikit-learn documentation
Example out of: Make pipeline example with trees
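A note on the original 'Pipeline' object has no attribute 'tree_' error: plot_tree expects a fitted tree estimator, not the whole pipeline. A minimal sketch, assuming the classifier pipeline from the question (with the tree registered under the step name 'cls'):

import matplotlib.pyplot as plt
from sklearn import tree

# fit the whole pipeline as before
classifier.fit(Xtrain, Ytrain)

# pull the fitted DecisionTreeClassifier out of the pipeline by its step
# name, then pass that estimator (not the pipeline) to plot_tree
fitted_tree = classifier.named_steps['cls']
tree.plot_tree(fitted_tree)
plt.show()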

Text Classification Using Python

I have a list of texts in a text variable, with their labels. I would like to make a classifier that can predict the label of new input text.
I am thinking of using the scikit-learn package in Python with an SVM model.
I realize that the text needs to be converted to vector form, so I am trying TfidfVectorizer and CountVectorizer.
This is my code so far using TfidfVectorizer:
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
label = ['organisasi','organisasi','organisasi','organisasi','organisasi','lokasi','lokasi','lokasi','lokasi','lokasi']
text = ['Partai Anamat Nasional','Persatuan Sepak Bola', 'Himpunan Mahasiswa','Organisasi Sosial','Masyarakat Peduli','Malioboro','Candi Borobudur','Taman Pintar','Museum Sejarah','Monumen Mandala']
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(text)
y = label
klasifikasi = svm.SVC()
klasifikasi = klasifikasi.fit(X,y) #training
test_text = ['Partai Perjuangan']
test_vector = vectorizer.fit_transform(test_text)
prediksi = klasifikasi.predict([test_vector]) #test
print(prediksi)
I also tried CountVectorizer with the same code above.
Both show the same error:
ValueError: setting an array element with a sequence.
How do I solve this problem? Thanks
The error is due to this line:
prediksi = klasifikasi.predict([test_vector])
Most scikit estimators require an array of shape [n_samples, n_features]. The test_vector output from TfidfVectorizer is already in that shape, ready to use with estimators. You don't need to wrap it in square brackets ([ and ]); the wrapping makes it a list, which is unsuitable.
Try using it like this:
prediksi = klasifikasi.predict(test_vector)
But even then you will get an error, because of this line:
test_vector = vectorizer.fit_transform(test_text)
Here you are fitting the vectorizer differently from what the klasifikasi estimator learned. fit_transform() is just a shortcut for calling fit() (learning the data) and then transform()-ing it. For test data, always use the transform() method, never fit() or fit_transform().
So the correct code will be:
test_vector = vectorizer.transform(test_text)
prediksi = klasifikasi.predict(test_vector)
#Output: array(['organisasi'], dtype='|S10')
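A common way to avoid this fit/transform mismatch entirely is to wrap the vectorizer and the estimator in a Pipeline, which fits both on the training data and only transforms at predict time. A minimal sketch, reusing the text and label variables from the question:

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# the pipeline fits the vectorizer and the SVM together on the training text
pipeline = Pipeline([('tfidf', TfidfVectorizer(min_df=1)), ('svc', svm.SVC())])
pipeline.fit(text, label)

# at predict time, the pipeline only transforms (never refits) the vectorizer
prediksi = pipeline.predict(['Partai Perjuangan'])
print(prediksi)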

Supervised machine learning with scikit-learn

This is the first time I'm doing supervised machine learning. This is a pretty advanced topic (at least for me), and I find it hard to formulate a question, since I'm not sure what is going wrong.
# Create a training list and test list (looks something like this):
train = [('this hostel was nice',2),('i hate this hostel',1)]
test = [('had a wonderful time',2),('terrible experience',1)]
# Loading modules
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
# Use a BOW representation of the reviews
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in train])
test_features = vectorizer.fit([r[0] for r in test])
# Fit a naive bayes model to the training data
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in train])
# Use the classifier to predict classification of test dataset
predictions = nb.predict(test_features)
actual=[r[1] for r in test]
Here I get the error:
float() argument must be a string or a number, not 'CountVectorizer'
This confuses me, since the original ratings that I have zipped up with the reviews are:
type(ratings_new[0])
int
You should change the line
test_features = vectorizer.fit([r[0] for r in test])
to:
test_features = vectorizer.transform([r[0] for r in test])
The reason is that you have already used your training data to fit vectorizer, so you don't need to fit it again on your test data. Instead, you need to transform it.
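With that fix in place, the imported metrics module can finish the evaluation. A small sketch, reusing nb, test and the corrected test_features from above:

# transform (not fit) the test reviews, predict, and compare to the truth
test_features = vectorizer.transform([r[0] for r in test])
predictions = nb.predict(test_features)
actual = [r[1] for r in test]
# fraction of test reviews whose predicted rating matches the true rating
print(metrics.accuracy_score(actual, predictions))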

How to compute labels from predicted values of regression model?

I am working on a machine learning project. I need to create two Python scripts:
1) a classifier
2) a script that produces a text file of labels using that classifier
I am just saving the model in the first script. Then, in the second script, I apply that model to a different dataset containing text, to produce predicted labels (ham or spam), and save those predicted labels in a text file.
Basically, I have a list of texts with labels, ham or spam.
I created a classifier using the linear regression model. I had two different files of training data (texts_training and labels_training), so I loaded my training data into variables called texts and labels. Then I worked on the classifier. This is what I have for the classifier:
#classifier.py
def features(words):
    fe = np.ndarrary((len(tweets), 56)
    for t, text in enumerate(words):
        if "money" in text:
            money = 1
        else:
            money = 0
        ...(55 more features)
        fe = [i:] = [money, ...]
    return fe

fe = features(words)
feat.shape
>>> (1000, 56)

import sklearn
X = fe
label = preprocessing.LabelEncoder()
label.fit(labels)
label = lab.transform(labels)
y.shape
>>> (1000,)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

from sklearn.preprocessing import StandardScaler
scaler = preprocessing.StandardScaler().fit(X_train)

#Model
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf = lreg.fit(X, y)

import pickle
f = open("clf.pkl", "w")
pickle.dump((clf, f)
f.close()
Now, I load this into a different script; both scripts are saved in the same folder. This script basically has to use that classifier and save the produced labels in a txt file.
#system.py
def features(words):
    fe = np.ndarrary((len(tweets), 56)
    for t, text in enumerate(words):
        if "money" in text:
            money = 1
        else:
            money = 0
        ...(55 more features)
        feat = [t, :] = [money, ...]
    return fe

fe = features(words)
X = feat

from sklearn import preprocessing
label = preprocessing.LabelEncoder()
label.fit(labels)
label = label.transform(labels)
y = label

from sklearn.preprocessing import StandardScaler
scaler = preprocessing.StandardScaler().fit(X)

import pickle
#class_output = pickle.load (open('clf.pkl', 'r'))
loaded_model = pickle.load(open('clf.pkl', 'r'))
class_output = loaded_model.predict(X)
print class_output
>>> array([ 0.06140778,  0.053107  ,  0.14343903, ...,  0.05701325,
            0.18738435, -0.08788421])

f = open("labels_produced.txt", "w")
for output in class_output:
    if output == 0:
        f.write("ham\n")
    else:
        f.write("spam\n")
f.close()
However, how do I compute spam or ham for the new dataset, since none of the values in class_output are equal to 0? My features were set to be either 0 or 1.
I am a beginner and have been struggling with this all day. I do not understand why I get this error or how to fix it. If someone helps, I would really appreciate it.
You are trying to iterate over an object. That is what the error means:
'LinearRegression' object is not iterable
This can be seen by doing:
type(clf)
# sklearn.linear_model.base.LinearRegression
clf is a LinearRegression object, with its own set of attributes. You cannot iterate over it as you try to do in the lines:
for output in class_output:
    if output == 0:
        # etc
You need to extract the required attributes from your LinearRegression object clf before you save them to pickle or try to iterate over them.
There are several attributes contained in the LinearRegression object.
For example, the following would give you the coefficients of your LinearRegression fit:
coefficients = clf.coef_
Once you decide which attribute of clf you actually want to iterate over, you can extract it in the way shown above.
Edit: A list of attributes in the LinearRegression object is available here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
coef_ : array, shape (n_features, ) or (n_targets, n_features)
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
residues_ : array, shape (n_targets,) or (1,) or empty
Sum of residuals. Squared Euclidean 2-norm for each target passed during the fit. If the linear regression problem is under-determined (the number of linearly independent rows of the training matrix is less than its number of linearly independent columns), this is an empty array. If the target vector passed during the fit is 1-dimensional, this is a (1,) shape array.
New in version 0.18.
intercept_ : array
Independent term in the linear model.
Edit: Great example available here:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
Further question:
In your function features, it looks like you pass in your texts and then record whether a feature is 0 or 1 in the array feat. But at the end of the function you return fe instead of returning feat; if you return feat you may get the information you want. Also, you iterate with the variables t, text, but then in fe = [i:] = [money, ...] you assign the values in your array using the variable i. Should i be replaced with t?
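On the title question itself (how to turn continuous regression outputs into labels): LinearRegression predicts real values, so an exact comparison with 0 will almost never match. One hedged option, assuming 0 encodes ham and 1 encodes spam as in the question, is to threshold the predictions at 0.5; using a true classifier such as LogisticRegression would avoid the issue altogether. A minimal sketch:

# class_output holds continuous predictions from LinearRegression,
# e.g. array([0.061, 0.053, 0.143, ...]); threshold them at 0.5
with open("labels_produced.txt", "w") as f:
    for output in class_output:
        # predictions below 0.5 are treated as ham (0), the rest as spam (1)
        if output < 0.5:
            f.write("ham\n")
        else:
            f.write("spam\n")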

Predicting Classifications with some Words not in the training set (Naive Bayes)

I have created a Naive Bayes model to predict whether the outcome is 'negative' or 'positive'. The problem I am having is running the model on a new set of data with some of the words not in the model. The error I receive when predicting on a new dataset is:
ValueError: Expected input with 6 features, got 4 instead
I read that I would have to put a Laplace smoother in my model, and BernoulliNB() already has a default alpha of 1. What else can I do to fix my error? Thank you
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn import cross_validation
from sklearn.metrics import classification_report
import numpy as np
from sklearn.metrics import accuracy_score
import textblob as TextBlob
#scikit
comments = list(['happy','sad','this is negative','this is positive', 'i like this', 'why do i hate this'])
classes = list(['positive','negative','negative','positive','positive','negative'])
# preprocess creates the term frequency matrix for the review data set
stop = stopwords.words('english')
count_vectorizer = CountVectorizer(analyzer =u'word',stop_words = stop, ngram_range=(1, 3))
comments = count_vectorizer.fit_transform(comments)
tfidf_comments = TfidfTransformer(use_idf=True).fit_transform(comments)
# preparing data for split validation. 60% training, 40% test
data_train,data_test,target_train,target_test = cross_validation.train_test_split(tfidf_comments,classes,test_size=0.2,random_state=43)
classifier = BernoulliNB().fit(data_train,target_train)
#new data
comments_new = list(['positive','zebra','george','nothing'])
comments_new = count_vectorizer.fit_transform(comments_new)
tfidf_comments_new = TfidfTransformer(use_idf=True).fit_transform(comments_new)
classifier.predict(tfidf_comments_new)
You should not fit a new estimator on the new data using fit_transform; use the previously built count_vectorizer, calling just transform. That will ignore all words that were not in the training vocabulary.
I disagree with Maxim: while this doesn't make a difference for CountVectorizer, using TfidfTransformer on the joined dataset will leak information from the test set into the training set, which you need to avoid.
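To make the first suggestion concrete, here is a minimal sketch, assuming comments and comments_new are the raw string lists from the question (fitting on all of comments for brevity, without the train/test split): the vocabulary and idf weights are fitted on the training comments only, and the new comments are merely transformed, so the feature dimensions match what the classifier was trained on.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# fit the vocabulary and the idf weights on the training comments only
count_vectorizer = CountVectorizer(analyzer=u'word', stop_words=stop, ngram_range=(1, 3))
counts_train = count_vectorizer.fit_transform(comments)
tfidf = TfidfTransformer(use_idf=True).fit(counts_train)
classifier = BernoulliNB().fit(tfidf.transform(counts_train), classes)

# for new data, only transform with the already-fitted objects;
# unseen words are simply ignored, so the feature count still matches
counts_new = count_vectorizer.transform(comments_new)
print(classifier.predict(tfidf.transform(counts_new)))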
You are creating a count matrix from the 'comments' words. When creating the count matrix, you must use all of the words you will encounter in your problem. Imagine the simpler case where you create a membership matrix: each column stands for a specific word, each row for a specific example from the dataset (for example, an email text). The matrix holds 0 if the specific word is not in the example and 1 if it is. Obviously, if you have built such a matrix for emails holding, say, 100 different words, the matrix will have 100 columns. But if you then try to use the trained classifier on new data containing a new word that wasn't in the training set, it will simply fail, since the original matrix had no column to hold values for this new word. So, once again, during vectorization of text you must provide all the terms you will ever face in the train and test datasets.
So instead of calling CountVectorizer and TfidfTransformer on 'comments' alone, you must join comments and comments_new into one list and call CountVectorizer and TfidfTransformer on the joined list.
