I have an sklearn CountVectorizer object trained on some corpus. When vectorizing a new document, the resulting vector contains only the tokens that appear in the vectorizer's vocabulary.
I'd like to add another feature to the vector: the vocabulary coverage, in other words the percentage of the document's tokens that are in the vocabulary.
Here's my code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good morning sunshine",
    "hello world",
    "hello sunshine",
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)

def get_vocab_coverage(vectorizer, sent):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    count = sum(w in vectorizer.vocabulary_ for w in tokenized_license)
    return count / len(tokenized_license)
get_vocab_coverage(vectorizer, "hello world") # => 1.0
get_vocab_coverage(vectorizer, "hello to you") # => 0.333
The problem with this code is that it's not very Pythonic: it relies on sklearn's internal vocabulary_ attribute and doesn't scale well. Any idea how I can improve it, or is there an existing method that does the same thing?
The transform method of CountVectorizer may be useful:
def get_vocab_coverage(vectorizer, sent: str):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    if len(tokenized_license):
        return vectorizer.transform([sent]).sum() / len(tokenized_license)
    return 0
Note the wrapping of the sample string in a list before passing it to transform (because it expects a batch of documents, not a single string), and the handling of the len == 0 case (which avoids having to catch a division-by-zero error). A few things to point out:
transform returns a sparse vector, so sum works fast.
CountVectorizer gains no advantage from receiving documents in batches in transform; internally there is an ordinary for-loop, so a get_vocab_coverage function that processes one example at a time is enough.
transform expects raw documents, so the function tokenizes each example twice (explicitly while creating tokenized_license and implicitly inside transform), and there seems to be no simple way to avoid it (other than going back to the previous version); if that's critical, consider rewriting the _count_vocab method to take a preprocessed_documents argument instead of raw_documents, or see the single-pass sketch below.
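If the double tokenization is the main concern, one middle ground (just a sketch, not an official sklearn API for this) is to call build_analyzer() once, which bundles preprocessing and tokenization, and count vocabulary hits directly; it still reads vocabulary_, but tokenizes only once:
def get_vocab_coverage(vectorizer, sent: str) -> float:
    # build_analyzer() applies the same preprocessing/tokenization (and n-gram
    # settings) that transform uses internally
    analyzer = vectorizer.build_analyzer()
    tokens = analyzer(sent)
    if not tokens:
        return 0.0
    vocab = vectorizer.vocabulary_  # fitted token -> column-index mapping
    return sum(tok in vocab for tok in tokens) / len(tokens)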
With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus, size=100, min_count=5)
This gives me the output from gensim, but I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
(The output is a model of embeddings, too big to add here.)
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (the size parameter) or the minimum number of times a word must appear in the corpus (min_count) in SciSpacy (e.g. I looked here and here), but I'm not sure if I'm missing them.
A possible way to achieve your goal would be to:
parse your documents via nlp.pipe
collect all the words and their pairwise similarities
process the similarities to get the desired results
Let's prepare some data:
import spacy
import numpy as np  # used below for the similarity matrix

nlp = spacy.load("en_core_web_md", disable=['ner', 'tagger', 'parser'])
Then, to get a vector, like in model.wv['car'] one would do:
nlp("car").vector
To get the most similar words, as with model.most_similar(positive=['car']), let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence aboout train"
, "And this is a sentence about a bike"]
docs = nlp.pipe(corpus)
tokens = []
tokens_orth = []
for doc in docs:
for tok in doc:
if tok.orth_ not in tokens_orth:
tokens.append(tok)
tokens_orth.append(tok.orth_)
sims = np.zeros((len(tokens),len(tokens)))
for i, tok in enumerate(tokens):
sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth=tokens_orth, sims=sims, top=3):
    tokens_orth = np.array(tokens_orth)
    id_word = np.where(tokens_orth == word)[0][0]
    sim = sims[id_word]
    id_ms = np.argsort(sim)[:-top-1:-1]
    return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked about specifying the dimension and frequency. The embedding length is fixed when the model is initialized, so it can't be changed after that. You can start from a blank model if you wish and feed it embeddings you're comfortable with. As for the frequency, it's doable by counting all the words and throwing away anything below the desired threshold. But again, the underlying embeddings will have been trained on unfiltered text. SpaCy differs from Gensim in that it uses readily available embeddings, whereas Gensim trains them.
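For illustration only, a rough sketch of both ideas (the blank-model route and the frequency threshold); nlp and corpus are the objects defined above, the vector is just a zero placeholder, and exact behaviour may vary across spaCy versions:
import numpy as np
import spacy
from collections import Counter

# Blank pipeline: vectors of whatever width you choose can be attached to the vocab.
nlp_blank = spacy.blank("en")
nlp_blank.vocab.set_vector("car", np.zeros(100, dtype="float32"))  # 100-d, as in the gensim example

# Emulate min_count: count tokens over the corpus and keep only frequent ones.
counts = Counter(tok.orth_ for doc in nlp.pipe(corpus) for tok in doc)
min_count = 2  # plays the role of gensim's min_count
frequent_words = {w for w, c in counts.items() if c >= min_count}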
So I have built a model with sklearn's Naive Bayes classifier.
I need to know how to predict a sentence typed in by the user.
When I just hardcode the sentence it works fine, like this:
new_sentence = ['its so broken']
new_testdata_tfidf = tfidf.transform(new_sentence)
# transform it to a matrix to see the TF-IDF scores based on the training data
fit_feature_selection = selection.transform(new_testdata_tfidf)
# transform the new data to see whether features are removed or not, because after TF-IDF I use chi2 feature selection
predicted = classifier.predict(fit_feature_selection)
# then predict it. The classification comes out as class -1, which is the correct answer
I need to type the text data in by hand as input, so I use it like this:
new_sentence = input[('')]
# I input the same sentence: its so broken
new_testdata_tfidf = tfidf.transform(new_sentence)
# transform it to a matrix to see the TF-IDF scores based on the training data
fit_feature_selection = selection.transform(new_testdata_tfidf)
# transform the new data to see whether features are removed or not, because after TF-IDF I use chi2 feature selection
predicted = classifier.predict(fit_feature_selection)
but it gives me this output:
File "C:\Users\Myfile\OneDrive\Desktop\model.py", line 170, in <module>
new_testdata_tfidf= tfidf.transform(new_sentence)
File "E:\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1898, in transform
X = super().transform(raw_documents)
File "E:\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1265, in transform
"Iterable over raw text documents expected, "
ValueError: Iterable over raw text documents expected, string object received.
How do I resolve this?
Any help is really appreciated.
Have you tried passing the new sentence as a list? i.e.
new_testdata_tfidf = tfidf.transform([new_sentence])
In the first instance you are passing a list with one string element; in the other you are simply passing a string.
If you are trying to pass a list of strings with new_sentence = input[('')] in your code, then you might want to replace it with
new_sentence = [input()]
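For completeness, a minimal sketch of the full interactive path with that change, reusing the tfidf, selection and classifier objects from the question:
new_sentence = [input()]  # wrap the typed string in a list, since transform expects an iterable of documents
new_testdata_tfidf = tfidf.transform(new_sentence)
fit_feature_selection = selection.transform(new_testdata_tfidf)
predicted = classifier.predict(fit_feature_selection)
print(predicted)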
Hope this helps.
I wrote a couple of user defined functions to remove named entities (using NLTK) in Python from a list of text sentences/paragraphs. The problem I'm having is that my method is very slow, especially for large amounts of data. Does anyone have a suggestion for how to optimize this to make it run faster?
import nltk
import string
# Function to reverse tokenization
def untokenize(tokens):
    return "".join(
        [" " + i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]
    ).strip()

# Remove named entities
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    tokens = [leaf[0] for leaf in chunked if type(leaf) != nltk.Tree]
    return untokenize(tokens)
To use the code I typically have a text list and call the ne_removal function through a list comprehension. Example below:
text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']
UPDATE: I tried switching to a batch version with the code below, but it's only slightly faster. Will keep exploring. Thanks for the input so far.
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if type(leaf) != nltk.Tree]
    return untokenize(tokens)

def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return non_entities
Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk. Ditto for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:
all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)
This should give you a considerable speed-up. If that's still too slow for your needs, try profiling your code (a quick sketch follows below) and consider whether you need to switch to more high-powered tools, as @Lenz suggested.
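For the profiling step, a minimal cProfile run over the batch function from the question might look like this (the stats file name is arbitrary):
import cProfile
import pstats

# Profile the batch pipeline and dump raw stats to a file.
cProfile.run("fast_ne_removal(text_list)", "ne_removal.prof")
# Show the ten entries with the largest cumulative time.
pstats.Stats("ne_removal.prof").sort_stats("cumulative").print_stats(10)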
I am currently working on a gensim doc2vec model to implement sentence similarity.
I came across this sample code by William Bert where he has mentioned that to train this model I need to provide my own background corpus. The code is copied below for convenience:
import logging, sys, pprint
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
### Generating a training/background corpus from your own source of documents
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
# gensim docs: "Provide a filename or a file-like object as input and TextCorpus will be initialized with a
# dictionary in `self.dictionary`and will support the `iter` corpus method. For other kinds of corpora, you only
# need to override `get_texts` and provide your own implementation."
background_corpus = TextCorpus(input=YOUR_CORPUS)
# Important -- save the dictionary generated by the corpus, or future operations will not be able to map results
# back to original words.
background_corpus.dictionary.save("my_dict.dict")
MmCorpus.serialize("background_corpus.mm", background_corpus)  # Uses numpy to persist wiki corpus in Matrix Market format. File will be several GBs.
### Generating a large training/background corpus using Wikipedia
from gensim.corpora import WikiCorpus, wikicorpus
articles = "enwiki-latest-pages-articles.xml.bz2" # available from http://en.wikipedia.org/wiki/Wikipedia:Database_download
# This will take many hours! Output is Wikipedia in bucket-of-words (BOW) sparse matrix.
wiki_corpus = WikiCorpus(articles)
wiki_corpus.dictionary.save("wiki_dict.dict")
MmCorpus.serialize("wiki_corpus.mm", wiki_corpus) # File will be several GBs.
### Working with persisted corpus and dictionary
bow_corpus = MmCorpus("wiki_corpus.mm") # Revive a corpus
dictionary = Dictionary.load("wiki_dict.dict") # Load a dictionary
### Transformations among vector spaces
from gensim.models import LsiModel, LogEntropyModel
logent_transformation = LogEntropyModel(wiki_corpus, id2word=dictionary)  # Log Entropy weights frequencies of all document features in the corpus
tokenize_func = wikicorpus.tokenize # The tokenizer used to create the Wikipedia corpus
document = "Some text to be transformed."
# First, tokenize document using the same tokenization as was used on the background corpus, and then convert it to
# BOW representation using the dictionary created when generating the background corpus.
bow_document = dictionary.doc2bow(tokenize_func(document))
# converts a single document to log entropy representation. document must be in the same vector space as corpus.
logent_document = logent_transformation[[bow_document]]
# Transform arbitrary documents by getting them into the same BOW vector space created by your training corpus
documents = ["Some iterable", "containing multiple", "documents", "..."]
bow_documents = (dictionary.doc2bow(tokenize_func(document)) for document in documents)  # use a generator expression because...
logent_documents = logent_transformation[bow_documents]  # ...transformation is done during iteration of documents using generators, so this uses constant memory
### Chained transformations
# This builds a new corpus from iterating over documents of bow_corpus as transformed to log entropy representation.
# Will also take many hours if bow_corpus is the Wikipedia corpus created above.
logent_corpus = MmCorpus(corpus=logent_transformation[bow_corpus])
# Creates LSI transformation model from log entropy corpus representation. Takes several hours with Wikipedia corpus.
lsi_transformation = LsiModel(corpus=logent_corpus, id2word=dictionary, num_features=400)
# Alternative way of performing same operation as above, but with implicit chaining
# lsi_transformation = LsiModel(corpus=logent_transformation[bow_corpus], id2word=dictionary,
# num_features=400)
# Can persist transformation models, too.
logent_transformation.save("logent.model")
lsi_transformation.save("lsi.model")
### Similarities (the best part)
from gensim.similarities import Similarity
# This index corpus consists of what you want to compare future queries against
index_documents = ["A bear walked in the dark forest.",
"Tall trees have many more leaves than short bushes.",
"A starship may someday travel across vast reaches of space to other stars.",
"Difference is the concept of how two or more entities are not the same."]
# A corpus can be anything, as long as iterating over it produces a representation of the corpus documents as vectors.
corpus = (dictionary.doc2bow(tokenize_func(document)) for document in index_documents)
index = Similarity(corpus=lsi_transformation[logent_transformation[corpus]], num_features=400, output_prefix="shard")
print "Index corpus:"
pprint.pprint(documents)
print "Similarities of index corpus documents to one another:"
pprint.pprint([s for s in index])
query = "In the face of ambiguity, refuse the temptation to guess."
sims_to_query = index[lsi_transformation[logent_transformation[dictionary.doc2bow(tokenize_func(query))]]]
print "Similarities of index corpus documents to '%s'" % query
pprint.pprint(sims_to_query)
best_score = max(sims_to_query)
index = sims_to_query.tolist().index(best_score)
most_similar_doc = documents[index]
print "The document most similar to the query is '%s' with a score of %.2f." % (most_similar_doc, best_score)
Where and how should I provide my own corpus in the code?
Thanks in advance for your help.
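Not a full answer, but for what it's worth: based on the gensim docstring quoted in the snippet ("Provide a filename or a file-like object as input"), YOUR_CORPUS can simply be a path to (or file-like handle on) a plain-text file of your documents; a minimal sketch, with my_corpus.txt being a hypothetical filename and the one-document-per-line layout depending on your gensim version's default TextCorpus.get_texts:
from gensim.corpora import TextCorpus, MmCorpus

# Hypothetical plain-text file containing your own documents.
background_corpus = TextCorpus(input="my_corpus.txt")
background_corpus.dictionary.save("my_dict.dict")
MmCorpus.serialize("background_corpus.mm", background_corpus)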
I'm trying to replicate Go et al.'s Twitter sentiment analysis, which can be found here: http://help.sentiment140.com/for-students
The problem I'm having is that the number of features is 364,464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds a replication of the 1,600,000 tweets and their polarity:
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Nothing takes very long apart from the extract_features function:
def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
This is because for each tweet it's creating a dictionary of size 364,464 to represent whether something is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find here: http://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function but it now works with the huge sizes without memory issues.
Here's a lowdown of the function description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
    tweet_words = set(tweet)
    global featureList
    features = {}
    for word in featureList:
        features[word] = False
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features
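One further micro-optimization (my addition, not part of the original answer): if featureList is a plain list, the membership test inside the second loop is linear in the number of features; converting it to a set once makes the per-tweet work roughly proportional to the tweet length instead. A small sketch:
feature_set = set(featureList)  # build once, reuse for every tweet

def extract_features(tweet):
    tweet_words = set(tweet)
    # start with every feature absent, then flip the ones present in the tweet
    features = {word: False for word in feature_set}
    for word in tweet_words:
        if word in feature_set:
            features[word] = True
    return features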