CountVectorizer reading and writing vocabulary - python

I am currently working on a fairly trivial sentiment classification program. Everything works well in the training phase. However, I am having trouble using CountVectorizer to test new strings of text that contain unseen words.
For this reason I am trying to write the vocabulary to disk during training so it can be used as a lookup for vectorization in the testing phase. However, I don't know how to create and retrieve the vocabulary object to pass as a parameter.
My two methods currently appear as follows:
def trainingVectorTransformation(messages):
    # --> ReviewText to vectors
    vect = CountVectorizer(analyzer=split_into_lemmas).fit(messages['reviewText'])
    messages_bow = vect.transform(messages['reviewText'])
    feature_list = vect.get_feature_names()
    # NOT SURE HOW TO CREATE VOCABULARY
    with open("vocab.txt", "w") as text_file:
        text_file.write(str(feature_list))
    tfidf_transformer = TfidfTransformer().fit(messages_bow)
    messages_tfidf = tfidf_transformer.transform(messages_bow)
    return messages_tfidf
and
def testingVectorTransformation(messages):
    # --> ReviewText to vectors
    # NOT SURE HOW TO READ THE CREATED VOCABULARY AND USE IT APPROPRIATELY
    txt = open("vocab.txt")
    vocabulary = txt.read()
    vect = CountVectorizer(analyzer=split_into_lemmas, vocabulary=vocabulary).fit(messages['reviewText'])
    messages_bow = vect.transform(messages['reviewText'])
    tfidf_transformer = TfidfTransformer().fit(messages_bow)
    messages_tfidf = tfidf_transformer.transform(messages_bow)
    return messages_tfidf
If anyone has any advice on how to properly create and use the vocabulary I would very much appreciate it.

You need to save a copy of your fitted vectorizer with some serializer, e.g. pickle, and load it in the test phase. You can also get the vocabulary via the vocabulary_ attribute; see the CountVectorizer documentation for more details.
Also, looking at your code: in training you should call vect.fit_transform, not just transform.
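As a concrete illustration of this answer, here is a minimal sketch of the pickle round trip. The toy texts and the vectorizer.pkl filename are made up for the example, and the custom split_into_lemmas analyzer is left out to keep it self-contained:

import pickle
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["great product", "terrible quality"]   # stand-in for messages['reviewText']
test_texts = ["great but with unseen words"]

# Training phase: fit once, then persist the whole fitted vectorizer.
vect = CountVectorizer().fit(train_texts)   # pass analyzer=split_into_lemmas in the real code
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vect, f)

print(vect.vocabulary_)   # the learned term -> column index mapping, if you only need the vocab

# Testing phase: load the fitted vectorizer and only call transform, never fit again.
with open("vectorizer.pkl", "rb") as f:
    vect_loaded = pickle.load(f)
test_bow = vect_loaded.transform(test_texts)   # unseen words are simply ignored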

Related

Text vocabulary coverage using sklearn

I have an sklearn CountVectorizer trained on some corpus. When vectorizing a new document, the resulting vector contains only the tokens that appear in the vectorizer's vocabulary.
I'd like to add another feature to the vector which is the vocabulary coverage, or in other words, the percentage of tokens that are in the vocabulary.
Here's my code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good morning sunshine",
    "hello world",
    "hello sunshine",
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)
def get_vocab_coverage(vectorizer, sent):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    count = sum(w in vectorizer.vocabulary_ for w in tokenized_license)
    return count / len(tokenized_license)

get_vocab_coverage(vectorizer, "hello world")   # => 1.0
get_vocab_coverage(vectorizer, "hello to you")  # => 0.333
The problem with this code is that it's not very pythonic: it relies on an internal sklearn attribute and doesn't scale well. Any idea how I can improve it, or is there an existing method that does the same?
The transform method of CountVectorizer may be useful:
def get_vocab_coverage(vectorizer, sent: str):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    if len(tokenized_license):
        return vectorizer.transform([sent]).sum() / len(tokenized_license)
    return 0
Note the wrapping of the sample string in a list before passing it to transform (it expects a batch of documents, not a single string), and the handling of the len == 0 case (no need to catch a division-by-zero error). A few things to point out:
- transform returns a sparse vector, so sum is fast;
- CountVectorizer gains no advantage from passing documents to transform in batches (internally it is an ordinary for loop), so a get_vocab_coverage written to process one example at a time is enough;
- transform expects raw documents, so the function tokenizes each example twice (explicitly while creating tokenized_license and implicitly inside transform), and there seems to be no simple way to avoid that short of returning to the previous version; if that is critical, consider rewriting the _count_vocab method with a preprocessed_documents argument instead of raw_documents.
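As a quick sanity check of the point above (reusing the toy corpus from the question): transform counts only in-vocabulary tokens, so summing the sparse row gives the number of known tokens in a sentence.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["good morning sunshine", "hello world", "hello sunshine"])

row = vectorizer.transform(["hello to you"])   # sparse 1 x |vocabulary| matrix
print(row.sum())                               # 1 -> only "hello" is in the vocabulary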

Understanding introductory example on transformers in Trax

My goal is to understand the introductory example on transformers in Trax, which can be found at https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html:
import trax

# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512, d_ff=2048,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# Tokenize a sentence.
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]),  # Operates on streams.
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]

# Decode from the Transformer.
tokenized = tokenized[None, :]  # Add batch dimension.
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)  # Higher temperature: more diverse results.

# De-tokenize.
tokenized_translation = tokenized_translation[0][:-1]  # Remove batch and EOS.
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
The example works just fine. However, when I try to translate another example with the initialised model, e.g.
sentence = 'I would like to try another example.'
tokenized = list(trax.data.tokenize(iter([sentence]),
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]
tokenized = tokenized[None, :]
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)
tokenized_translation = tokenized_translation[0][:-1]
translation = trax.data.detokenize(tokenized_translation,
                                   vocab_dir='gs://trax-ml/vocabs/',
                                   vocab_file='ende_32k.subword')
print(translation)
I get just the output "!", on my local machine as well as on Google Colab. The same happens with other examples.
When I build and initialise a new model, everything works fine.
Is this a bug? If not, what is happening here and how can I avoid/fix that behaviour?
Tokenization and detokenization seem to work fine; I debugged that. Things seem to go wrong, or at least unexpectedly, in trax.supervised.decoding.autoregressive_sample.
I found it out myself: one needs to reset the model's state between translations. The following code works for me:
def translate(model, sentence, vocab_dir, vocab_file):
    empty_state = model.state  # save the initial (empty) state
    tokenized_sentence = next(trax.data.tokenize(iter([sentence]), vocab_dir=vocab_dir,
                                                 vocab_file=vocab_file))
    tokenized_translation = trax.supervised.decoding.autoregressive_sample(
        model, tokenized_sentence[None, :], temperature=0.0)[0][:-1]
    translation = trax.data.detokenize(tokenized_translation, vocab_dir=vocab_dir,
                                       vocab_file=vocab_file)
    model.state = empty_state  # reset the state
    return translation

# Create a Transformer model.
# Pre-trained model config in gs://trax-ml/models/translation/ende_wmt32k.gin
model = trax.models.Transformer(input_vocab_size=33300, d_model=512, d_ff=2048, n_heads=8,
                                n_encoder_layers=6, n_decoder_layers=6, max_len=2048,
                                mode='predict')

# Initialize using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

print(translate(model, 'It is nice to learn new things today!',
                vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))
print(translate(model, 'I would like to try another example.',
                vocab_dir='gs://trax-ml/vocabs/', vocab_file='ende_32k.subword'))

How to call a corpus file in python?

I am currently working on the gensim doc2vec model to implement sentence similarity.
I came across this sample code by William Bert, where he mentions that to train this model I need to provide my own background corpus. The code is copied below for convenience:
import logging, sys, pprint
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

### Generating a training/background corpus from your own source of documents
from gensim.corpora import TextCorpus, MmCorpus, Dictionary

# gensim docs: "Provide a filename or a file-like object as input and TextCorpus will be initialized with a
# dictionary in `self.dictionary` and will support the `iter` corpus method. For other kinds of corpora, you only
# need to override `get_texts` and provide your own implementation."
background_corpus = TextCorpus(input=YOUR_CORPUS)

# Important -- save the dictionary generated by the corpus, or future operations will not be able to map results
# back to original words.
background_corpus.dictionary.save("my_dict.dict")

# Uses numpy to persist wiki corpus in Matrix Market format. File will be several GBs.
MmCorpus.serialize("background_corpus.mm", background_corpus)

### Generating a large training/background corpus using Wikipedia
from gensim.corpora import WikiCorpus, wikicorpus

articles = "enwiki-latest-pages-articles.xml.bz2"  # available from http://en.wikipedia.org/wiki/Wikipedia:Database_download

# This will take many hours! Output is Wikipedia in bucket-of-words (BOW) sparse matrix.
wiki_corpus = WikiCorpus(articles)
wiki_corpus.dictionary.save("wiki_dict.dict")
MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # File will be several GBs.

### Working with persisted corpus and dictionary
bow_corpus = MmCorpus("wiki_corpus.mm")        # Revive a corpus
dictionary = Dictionary.load("wiki_dict.dict")  # Load a dictionary

### Transformations among vector spaces
from gensim.models import LsiModel, LogEntropyModel

# Log Entropy weights frequencies of all document features in the corpus
logent_transformation = LogEntropyModel(wiki_corpus, id2word=dictionary)

tokenize_func = wikicorpus.tokenize  # The tokenizer used to create the Wikipedia corpus
document = "Some text to be transformed."

# First, tokenize document using the same tokenization as was used on the background corpus, and then convert it to
# BOW representation using the dictionary created when generating the background corpus.
bow_document = dictionary.doc2bow(tokenize_func(document))

# Converts a single document to log entropy representation. document must be in the same vector space as corpus.
logent_document = logent_transformation[[bow_document]]

# Transform arbitrary documents by getting them into the same BOW vector space created by your training corpus
documents = ["Some iterable", "containing multiple", "documents", "..."]
bow_documents = (dictionary.doc2bow(tokenize_func(document)) for document in documents)  # use a generator expression because...
logent_documents = logent_transformation[bow_documents]  # ...transformation is done during iteration of documents using generators, so this uses constant memory

### Chained transformations
# This builds a new corpus from iterating over documents of bow_corpus as transformed to log entropy representation.
# Will also take many hours if bow_corpus is the Wikipedia corpus created above.
logent_corpus = MmCorpus(corpus=logent_transformation[bow_corpus])

# Creates LSI transformation model from log entropy corpus representation. Takes several hours with Wikipedia corpus.
lsi_transformation = LsiModel(corpus=logent_corpus, id2word=dictionary, num_features=400)

# Alternative way of performing same operation as above, but with implicit chaining
# lsi_transformation = LsiModel(corpus=logent_transformation[bow_corpus], id2word=dictionary,
#                               num_features=400)

# Can persist transformation models, too.
logent_transformation.save("logent.model")
lsi_transformation.save("lsi.model")

### Similarities (the best part)
from gensim.similarities import Similarity

# This index corpus consists of what you want to compare future queries against
index_documents = ["A bear walked in the dark forest.",
                   "Tall trees have many more leaves than short bushes.",
                   "A starship may someday travel across vast reaches of space to other stars.",
                   "Difference is the concept of how two or more entities are not the same."]

# A corpus can be anything, as long as iterating over it produces a representation of the corpus documents as vectors.
corpus = (dictionary.doc2bow(tokenize_func(document)) for document in index_documents)

index = Similarity(corpus=lsi_transformation[logent_transformation[corpus]], num_features=400, output_prefix="shard")

print "Index corpus:"
pprint.pprint(documents)

print "Similarities of index corpus documents to one another:"
pprint.pprint([s for s in index])

query = "In the face of ambiguity, refuse the temptation to guess."
sims_to_query = index[lsi_transformation[logent_transformation[dictionary.doc2bow(tokenize_func(query))]]]
print "Similarities of index corpus documents to '%s'" % query
pprint.pprint(sims_to_query)

best_score = max(sims_to_query)
index = sims_to_query.tolist().index(best_score)
most_similar_doc = documents[index]
print "The document most similar to the query is '%s' with a score of %.2f." % (most_similar_doc, best_score)
Where and how should I provide my own corpus in the code?
Thanks in advance for your help.
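For illustration only (this is not from the original thread): going by the gensim docstring quoted in the sample, YOUR_CORPUS is where a filename or file-like object goes, so a minimal sketch could look like the following. Here my_corpus.txt is a hypothetical plain-text file with one document per line.

from gensim.corpora import TextCorpus, MmCorpus

# Hypothetical file, one document per line; replace with your own source of documents.
background_corpus = TextCorpus(input="my_corpus.txt")

# Save the dictionary and serialize the corpus exactly as the sample does.
background_corpus.dictionary.save("my_dict.dict")
MmCorpus.serialize("background_corpus.mm", background_corpus)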

Python nltk classify with large feature set (Replicate Go Et Al 2009)

I'm trying to replicate the Go et al. Twitter sentiment analysis, which can be found here: http://help.sentiment140.com/for-students
The problem I'm having is that the number of features is 364,464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds the 1,600,000 tweets and their polarity:
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Nothing takes very long apart from the extract_features function:
def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
This is because for each tweet it creates a dictionary with 364,464 entries to represent whether each feature is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find here: http://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function but it now works with the huge sizes without memory issues.
Here's the lowdown from the function's description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
    tweet_words = set(tweet)
    global featureList
    features = {}
    for word in featureList:
        features[word] = False
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features
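One further tweak that is not part of the original answer: if featureList is a plain Python list, the `word in featureList` test inside the loop is linear in the number of features, so with 364,464 features it pays to convert it to a set once and get constant-time lookups. A minimal sketch (the tiny featureList here is a placeholder for the real one):

featureList = set(['awesome', 'terrible', 'movie'])   # placeholder; build the set once from the real list

def extract_features(tweet):
    tweet_words = set(tweet)
    # one boolean per feature, True only for words that occur in this tweet
    features = dict((word, False) for word in featureList)
    for word in tweet_words:
        if word in featureList:   # set membership: O(1) instead of O(n)
            features[word] = True
    return features

print(extract_features(['what', 'an', 'awesome', 'movie']))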

Processing a corpus so big I'm getting runtime errors

I am trying to process a big corpus of tweets (1,600,000, can be found here) with the following code to train a Naive Bayes classifier in order to play around with sentiment analysis.
My problem is that I have never coded anything that had to handle this much memory or such big variables.
At the moment the script runs for a while and then, after a couple of hours, I get a runtime error (I'm on a Windows machine). I believe I'm not managing the list objects properly.
I can run the program successfully when I limit the for loop, but that means limiting the training set and quite likely getting worse sentiment analysis results.
How can I process the whole corpus? How can I better manage those lists? Are they really what's causing the problem?
These are the imports
import pickle
import re
import os, errno
import csv
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
Here I load the corpus and create the lists where I want to store the features I extract from it:
inpTweets = csv.reader(open('datasets/training.1600000.processed.noemoticon.csv', 'rb'), delimiter=',', quotechar='"')
tweets = []
featureList = []
n=0
This for loop walks through the corpus and, using processTweet() (a long helper algorithm), extracts the features from each row of the CSV:
for row in inpTweets:
    sentiment = row[0]
    status_text = row[5]
    featureVector = processTweet(status_text.decode('utf-8'))
    # to know it's doing something
    n = n + 1
    print n
    # we'll need both the featureList and the tweets variable, carrying tweets and sentiments
Here, still inside the for loop, I extend/append the feature vector and the tweet (with its sentiment) to the lists:
    featureList.extend(featureVector)
    tweets.append((featureVector, sentiment))
When the loop ends I get rid of duplicates in featureList and save it to a pickle:
featureList = list(set(featureList))
flist = open('fList.pickle', 'w')
pickle.dump(featureList, flist)
flist.close()
I get the features ready for the classifier.
training_set = nltk.classify.util.apply_features(extract_features, tweets)
Then I train the classifier and save it to a pickle.
# Train the Naive Bayes classifier
print "\nTraining the classifier.."
NBClassifier = nltk.NaiveBayesClassifier.train(training_set)
fnbc = open('nb_classifier.pickle', 'w')
pickle.dump(NBClassifier, fnbc)
fnbc.close()
Edit (19:45 GMT+1): forgot to add n=0 in this post.
Edit 1: Due to lack of time and computing power limitations, I chose to reduce the corpus like this:
.....
n = 0
i = 0
for row in inpTweets:
    i = i + 1
    if (i == 160):  # limiter
        i = 0
        sentiment = row[0]
        status_text = row[5]
        n = n + 1
.....
In the end the classifier was taking ages to train anyway. Regarding the runtime error, please see the comments. Thanks everyone for the help.
You could use csv.field_size_limit(int)
For example:
f = open('datasets/training.1600000.processed.noemoticon.csv', 'rb')
csv.field_size_limit(100000)
inpTweets = csv.reader(f, delimiter=',', quotechar='"')
You can try changing the value 100000 to something that works better for your data.
+1 on the comment about Pandas.
Also, you might want to check out cPickle here (1000x faster).
Check out this question / answer too!
Another relevant blog post here.
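For the cPickle suggestion, a minimal sketch in the Python 2 style of the question's code (the featureList below is just a placeholder for the real list):

import cPickle as pickle   # same interface as pickle, implemented in C

featureList = ['word_%d' % i for i in range(10)]   # placeholder for the real feature list

# Binary mode plus a binary protocol keeps the file smaller and the dump faster.
with open('fList.pickle', 'wb') as flist:
    pickle.dump(featureList, flist, 2)   # 2 = binary pickle protocol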
