I want to use a sklearn classifier using n-gram features. Furthermore, I want to do cross-validation to find the best order of the n-grams. However, I am a bit stuck on how I can fit all the pieces together.
For now, I have the following code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
text = ... # This is the input text. A list of strings
labels = ... # These are the labels of each sentence
# Find the optimal order of the ngrams by cross-validation
scores = pd.Series(index=range(1,6), dtype=float)
folds = KFold(n_splits=3)
for n in range(1,6):
    count_vect = CountVectorizer(ngram_range=(n,n), stop_words='english')
    X = count_vect.fit_transform(text)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
    clf = MultinomialNB()
    score = cross_val_score(clf, X_train, y_train, cv=folds, n_jobs=-1)
    scores.loc[n] = np.mean(score)
# Evaluate the classifier using the best order found
order = scores.idxmax()
count_vect = CountVectorizer(ngram_range=(order,order), stop_words='english')
X = count_vect.fit_transform(text)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf = clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print('Accuracy is {}'.format(acc))
However, I feel like this is the wrong way to do it, since I create a train-test split in every loop.
If I do a train-test split beforehand and apply the CountVectorizer to both parts separately, then these parts have different shapes, which causes problems when using clf.fit and clf.score.
How can I solve this?
EDIT: If I try to build a vocabulary first, I still have to build several vocabularies, since the vocabulary for unigrams is different from that of bigrams, etc.
To give an example:
# unigram vocab
vocab = set()
for sentence in text:
    for word in sentence:
        if word not in vocab:
            vocab.add(word)
len(vocab) # 47291
# bigram vocab
vocab = set()
for sentence in text:
    bigrams = nltk.ngrams(sentence, 2)
    for bigram in bigrams:
        if bigram not in vocab:
            vocab.add(bigram)
len(vocab) # 326044
This again leads me to the same problem of needing to apply the CountVectorizer for every n-gram size.
You need to set the vocabulary parameter first. One way or another you have to provide the entire vocabulary, otherwise the dimensions can never match. If you do the train/test split first and vectorize each part separately, there may be words in one set that are not present in the other, and that is where your dimension mismatch comes from.
The documentation says:
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
Further down you'll find a description for vocabulary.
vocabulary:
Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
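One way to avoid the vocabulary bookkeeping altogether is to split the raw text first and wrap the vectorizer and the classifier in a Pipeline, so that each cross-validation fold learns its vocabulary from its own training part only. The following is a minimal sketch under that assumption, not the original poster's code; the toy text and labels are invented for illustration, and stop_words is left out only so that the tiny corpus keeps a non-empty vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
text = [
    "the cat sat on the mat", "a cat chased the mouse", "my cat likes warm milk",
    "the cat slept all day", "a small cat climbed the tree", "the cat purred on my lap",
    "the dog barked at night", "a dog fetched the ball", "my dog likes long walks",
    "the dog slept all day", "a small dog dug a hole", "the dog wagged its tail",
]
labels = [0] * 6 + [1] * 6

# Split the raw strings, not the count matrix.
text_train, text_test, y_train, y_test = train_test_split(
    text, labels, test_size=0.33, random_state=42, stratify=labels
)

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Try n-gram orders 1..3; each fold refits the vectorizer on its training part.
param_grid = {"vect__ngram_range": [(n, n) for n in range(1, 4)]}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(text_train, y_train)

best_order = search.best_params_["vect__ngram_range"][0]
test_accuracy = search.score(text_test, y_test)
print("Best order:", best_order, "test accuracy:", test_accuracy)
```

Because the vectorizer is refit inside the pipeline, the held-out test strings are transformed with the vocabulary learned from the training strings, so the shapes always agree.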
I'm very new to programming and machine learning but I've been trying to create a prediction model to tag product reviews. I found the following model:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
dataset = pd.read_csv('dataset.csv')
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
dataset['TEXT'] = [normalize_text(s) for s in dataset['texto']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(dataset['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(dataset['codigo'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
So far so good. But then, I tried to use that trained model to predict another set of data like this:
#new data
test = pd.read_csv('testset.csv')
test['TEXT'] = [normalize_text(s) for s in test['respostas']]
# pull the data into vectors
vectorizer = CountVectorizer()
classes = vectorizer.fit_transform(test['TEXT'])
classificacao = nb.predict(classes)
However, I got a "ValueError: dimension mismatch"
I'm not sure how to do this second step, which is using the model to predict the category of a fresh data set.
Thanks in advance for your assistance.
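The mismatch most likely comes from fitting a second CountVectorizer on the new data: it learns a different vocabulary, so the new matrix has a different number of columns than the one the model was trained on. Below is a minimal sketch of the usual remedy, which is to reuse the already fitted vectorizer and call transform only. The toy reviews and tags are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration.
train_texts = ["great product works well", "terrible quality broke fast",
               "works great love it", "broke after one day terrible"]
train_tags = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary
nb = MultinomialNB().fit(x_train, train_tags)

new_texts = ["love it works great", "terrible it broke"]

# transform (not fit_transform) maps new text onto the training vocabulary,
# so the column count matches what the model expects.
x_new = vectorizer.transform(new_texts)
predictions = nb.predict(x_new)
print(x_train.shape[1], x_new.shape[1], list(predictions))
```

Words in the new data that were never seen during training are simply ignored by transform.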
I'm really new to machine learning. I'm reviewing code that classifies emails as spam or ham, and I have a problem adapting it to another data set. My dataset doesn't have just ham or spam values; it has two different classification targets (age and gender). When I try to use both targets in the code block below, I get an error: too many values to unpack. How can I pass all of my values?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, import_data['age'], import_data['gender'], test_size = 0.20, random_state = 0)
Full code:
import numpy as np
import pandas
import nltk
from nltk.corpus import stopwords
import string
# Import Data.
import_data = pandas.read_csv('/root/Desktop/%20/%100.csv' , encoding='cp1252')
# To See Columns Headers.
print(import_data.columns)
# To Remove Duplications.
import_data.drop_duplicates(inplace = True)
# To Find Data Size.
print(import_data.shape)
#Tokenization (a list of tokens), will be used as the analyzer
#1.Punctuations are [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
#2.Stop words in natural language processing, are useless words (data).
def process_text(text):
    '''
    What will be covered:
    1. Remove punctuation
    2. Remove stopwords
    3. Return list of clean text words
    '''
    #1
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    #2
    clean_words = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    #3
    return clean_words
#Show the Tokenization (a list of tokens )
print(import_data['text'].head().apply(process_text))
# Convert the text into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer(analyzer=process_text).fit_transform(import_data['text'])
#Split data into 80% training & 20% testing data sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, import_data['gender'], import_data['frequency'], test_size = 0.20, random_state = 0)
#Get the shape of messages_bow
print(messages_bow.shape)
train_test_split splits each argument you pass to it into train and test sets. Since you are splitting three separate types of data, you need 6 variables:
X_train, X_test, age_train, age_test, gender_train, gender_test = train_test_split(messages_bow, import_data['age'], import_data['gender'], test_size=0.20, random_state=0)
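As a quick check of that rule, here is a small sketch with made-up arrays standing in for messages_bow and the two label columns (the names features, age and gender are placeholders, not the original data). It shows that three inputs come back as six outputs and that the rows stay aligned across them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for messages_bow and the two label columns.
features = np.arange(20).reshape(10, 2)   # row i is [2*i, 2*i + 1]
age = np.array([18, 25, 31, 40, 22, 35, 28, 50, 19, 45])
gender = np.array(["f", "m", "f", "m", "f", "m", "f", "m", "f", "m"])

# Three input arrays produce six outputs: a train/test pair per input.
parts = train_test_split(features, age, gender, test_size=0.20, random_state=0)
X_train, X_test, age_train, age_test, gender_train, gender_test = parts

print(len(parts), X_train.shape, age_test.shape, gender_test.shape)
```

Each label array can then be used to train its own classifier on the same X_train.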
I am learning sentiment analysis and I have a data frame of reviews, which I have to evaluate given a list of words, and get the weights assigned to those words. Unfortunately, when I try to fit the regression I get the following error:
"ValueError: Found input variables with inconsistent numbers of samples: [11, 133401]"
What am I missing?
import re
import pandas
import sklearn
import numpy as np
products = pandas.read_csv('amazon_baby.csv')
selected_words=["awesome", "great", "fantastic", "amazing", "love", "horrible", "bad", "terrible", "awful", "wow", "hate"]
#ignore all 3* reviews
products = products[products['rating'] != 3]
#positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >=4
#create a separate column for each word
for word in selected_words:
    products[word]=[len(re.findall(word,x)) for x in products['review'].tolist()]
# Define X and y
X = products[selected_words]
y = products['sentiment']
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train) #here is where I get the error
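CountVectorizer() expects an iterable of strings and returns vectors that represent the word counts. You have already implemented this with the for loop, and are now trying to fit CountVectorizer() to the counts of your selected words. Because iterating over a DataFrame yields its column names, the vectorizer sees only 11 "documents" (one per selected word), while y_train has 133401 rows, hence the error.
Assuming you just want to use your selected words as features,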
logreg.fit(X_train, y_train)
without the transformation will be fine.
Or if you would like to use all the words as features you could change your X to include the full review
X = products['review'].astype(str)
and then fit the CountVectorizer() and then use
logreg.fit(X_train_dtm, y_train)
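To make the second option concrete, here is a minimal sketch with a made-up four-row DataFrame (the reviews and sentiments are invented, and the train/test split is skipped for brevity). It first shows what a vectorizer actually iterates over when handed a DataFrame, then fits on the review column as a Series of strings.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews, invented for illustration.
products = pd.DataFrame({
    "review": ["I love it, great stroller", "awful and terrible quality",
               "great value, love the colour", "bad zipper, I hate it"],
    "sentiment": [True, False, True, False],
})

# Iterating over a DataFrame yields its column names, not its rows,
# which is why a vectorizer fed a DataFrame sees one "document" per column.
columns_seen = list(products[["review", "sentiment"]])

# Passing a Series of strings gives one row per review instead.
vect = CountVectorizer()
X_dtm = vect.fit_transform(products["review"].astype(str))

logreg = LogisticRegression()
logreg.fit(X_dtm, products["sentiment"])
print(columns_seen, X_dtm.shape[0])
```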
I am working with a large dataset of tweets. I have manually classified a small subset into four categories and trained a model on it. Each category has about twenty tweets, while the full dataset has tens of thousands. Here is the code I used to train the model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
tweets = []
labels_list = []
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                        encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(tweets).toarray()
labels = labels_list
X_train, X_test, y_train, y_test = train_test_split(tweets, labels,
                                                    random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
Whenever I type
print(clf.predict(count_vect.transform(["Some random content"])))
the model correctly outputs the label that the tweet belongs to, as long as the content matches the training data. However, if I type in total nonsense, it still outputs some category that I know the text doesn't belong to.
My goal is to find the 100 tweets that are most likely to belong to a given category. However, the four categories mentioned above are not representative of the entire dataset, so I need to know whether there is some sort of probability threshold I could use to exclude a tweet from the 100 if its probability is too low.
I tried looking into multinomial logistic regression but could not find any sort of probability output. If I am doing something wrong, or if there is another way, I would like to know!
You can use the .predict_proba() method of your clf to get the probability of every class for every tweet. Then, to get the top 100 tweets for, say, class 0, sort all your tweets by the probability of class 0 and take the first 100.
You can do this easily with pandas, for instance:
import pandas as pd
probsd = pd.DataFrame(clf.predict_proba(Xtest_tfidf))
top_100_class_0_tweets = probsd.sort_values(0, ascending=False).head(100).index
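The same idea can be combined with a probability cut-off. The sketch below is one possible way to do it, not a definitive recipe: the tweets, the 0.6 threshold and top_k = 2 are all made up for illustration, and the vectorizer and classifier are wrapped in a pipeline so that the same TF-IDF weighting is applied at training and prediction time. Note that a tweet containing only unseen words is scored with the class priors alone, which is why nonsense input still gets a label.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled tweets and unlabelled pool, invented for illustration.
train_tweets = ["goal scored in the match", "the team won the match",
                "new phone released today", "phone battery lasts longer"]
train_labels = ["sports", "sports", "tech", "tech"]
pool = ["what a match and what a goal", "my phone battery died",
        "zzz qqq xxx", "the team released a phone"]

# The pipeline applies the same TF-IDF weighting at fit and predict time.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_tweets, train_labels)

proba = clf.predict_proba(pool)              # shape: (n_tweets, n_classes)
class_index = list(clf.classes_).index("sports")
sports_proba = proba[:, class_index]

threshold = 0.6   # hypothetical cut-off; tune it on held-out labelled data
top_k = 2         # stands in for the 100 in the question
ranked = np.argsort(sports_proba)[::-1]      # most probable first
selected = [pool[i] for i in ranked[:top_k] if sports_proba[i] >= threshold]
print(selected)
```

With only four training categories, a threshold mainly filters out tweets the model is unsure about; it cannot by itself detect tweets that belong to none of the categories.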
I have a dataset that contains just two columns useful for training my model: the first is the news heading and the second is the news category.
I got the following training code running successfully in Python:
import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder
# grab the data
news = pd.read_csv("/Users/helloworld/Downloads/NewsAggregatorDataset/newsCorpora.csv",encoding='latin-1')
news.head()
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(news['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
nb = MultinomialNB()
nb.fit(x_train, y_train)
So my question is: how can I give the program a new set of data (e.g. just news headings) and have it predict the news category using Python and sklearn?
P.S. My training data is like:
You should train the model using the training data (as you did) and then you should predict using new data (the test data).
Do the following:
nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)
Now, if you want to evaluate the predictions based on accuracy, you can do the following:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
Similarly, you can calculate other metrics; the sklearn.metrics module lists all the available ones.
EDIT 1
When you type:
y_predicted = nb.predict(x_test)
y_predicted will contain numerical values that correspond to your categories.
To project back these values and get the labels you can do:
y_predicted_labels = encoder.inverse_transform(y_predicted)
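For headlines that were not part of the original dataset, the same decoding step applies once they have been passed through the vectorizer that was fitted on the training data. A minimal sketch, with invented headlines and category codes standing in for the real news data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy headlines and category codes, invented for illustration.
titles = ["stocks rally as markets open", "new smartphone unveiled today",
          "markets fall on weak earnings", "smartphone sales hit record"]
categories = ["b", "t", "b", "t"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(titles)
encoder = LabelEncoder()
y = encoder.fit_transform(categories)   # "b" -> 0, "t" -> 1

nb = MultinomialNB().fit(x, y)

# New headlines go through the SAME fitted vectorizer (transform only),
# then the integer predictions are mapped back to category labels.
new_titles = ["markets rally on earnings", "record smartphone sales"]
y_new = nb.predict(vectorizer.transform(new_titles))
labels_new = encoder.inverse_transform(y_new)
print(list(labels_new))
```

In the original code the new headings would also need to go through normalize_text before being vectorized, so that they are preprocessed the same way as the training titles.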
You are very close; you just need two more lines of code. The following link explains Naive Bayes using scikit-learn:
https://www.digitalocean.com/community/tutorials/how-to-build-a-machine-learning-classifier-in-python-with-scikit-learn
The short answer to your question is below. Import the accuracy function,
from sklearn.metrics import accuracy_score
test the model using the predict function,
preds = nb.predict(x_test)
and then test the accuracy
print(accuracy_score(y_test, preds))