I am trying to use the word2vec module from gensim natural language processing library in Python.
The docs say to initialize the model:
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
What format does gensim expect for the input sentences? I have raw text
"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."
etc.
What additional processing do I need to do to feed raw text like this into word2vec?
UPDATE: Here is what I have tried. When it loads the sentences, I get nothing.
>>> sentences = ['the quick brown fox jumps over the lazy dogs',
...              "Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split() for s in sentences])
>>> x.vocab
{}
A list of utf-8 sentences. You can also stream the data from the disk.
Make sure it's utf-8, and split it:
sentences = [ "the quick brown fox jumps over the lazy dogs",
"Then a cop quizzed Mick Jagger's ex-wives briefly." ]
word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences], size=100, window=5, min_count=5, workers=4)
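One more thing to watch out for: Word2Vec's default min_count=5 silently drops every word that appears fewer than five times, so on a tiny two-sentence corpus the vocabulary ends up empty even when the tokenization is fine. A minimal sketch with min_count=1 (attribute names assume gensim >= 4; older releases expose the vocabulary as model.wv.vocab or model.vocab instead of model.wv.key_to_index):

from gensim.models import Word2Vec

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
tokenized = [s.lower().split() for s in sentences]

# min_count=1 keeps words that occur only once; the default of 5 would discard
# every token in this corpus and leave the vocabulary empty.
model = Word2Vec(tokenized, min_count=1, window=5, workers=4)
print(len(model.wv.key_to_index))  # number of words kept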
Like alKid pointed out, make it utf-8. There are two additional things you might have to worry about: the input being too large to fit in memory when you load it from a file, and removing stop words from the sentences. Instead of loading a big list into memory, you can do something like:
import nltk, gensim
class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r'):
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll
And then,
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
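The class with __iter__ (rather than a plain generator) matters here: gensim iterates over the corpus more than once, first to build the vocabulary and then once per training epoch, and an object that re-opens the file on every __iter__ call can be replayed, whereas a one-shot generator would be exhausted after the first pass.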
Related
As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (not important what they predict, it's a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems common, though, my understanding is that the TF-IDF measures apply to a word in a document, not topics. Given that they have a case where they extract a single topic, I think it's impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (and how would one go about doing that using the below code?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights in the prediction? That does not seem plausible: with 1000+ documents the number of features would be very high, and almost certainly most of them would be useless.
Minimal reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
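One possible reading of converting topics to "suitable TF-IDF matrices" (my interpretation, not something the paper spells out) is to work either with the per-document topic distributions of the fitted model or with a TF-IDF weighting of the same bag-of-words corpus. A sketch building on the code above:

# Per-document topic distributions (one row per document, one column per topic):
doc_topics = [ldamodel.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
# doc_topics[i] is a list of (topic_id, probability) pairs for document i

# Plain TF-IDF weights over the same bag-of-words corpus:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = [tfidf[bow] for bow in corpus]
# corpus_tfidf[i] is a list of (term_id, tfidf_weight) pairs for document i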
I would like to remove stop words from the arraylist named arrayList1, which is stored in the data variable.
I tried the method below but it does not work. Please help me check the code below and improve it. Thanks.
import Retrieve_ED_Notes
from nltk.corpus import stopwords
data = Retrieve_ED_Notes.arrayList1
stop_words = set(stopwords.words('english'))
def remove_stopwords(data):
    data = [word for word in data if word not in stop_words]
    return data

for i in range(0, len(remove_stopwords(data))):
    print(remove_stopwords(data[i]))
Console output of the arrayList1:
1|I really love writing journals
2|The mat is very comfortable and I will buy it again likes
3|The mousepad is smooth
1|I really love writing journals
4|This pen is very special to me.
4|This pencil is very special to me.
5|Meaningful novels
4|It brights up my day like a lighter and makes me higher.
6|School foolscap
7|As soft as my heart.lovey
Convert each word to lowercase and check it against the stopword set. (In your code, remove_stopwords(data[i]) is called on a whole sentence string, so the list comprehension iterates over characters rather than words; split each sentence into words first.)
from nltk.corpus import stopwords
stopwords=set(stopwords.words('english'))
data =['I really love writing journals','The mat is very comfortable and I will buy it again likes','The mousepad is smooth']
def remove_stopwords(data):
    output_array = []
    for sentence in data:
        temp_list = []
        for word in sentence.split():
            if word.lower() not in stopwords:
                temp_list.append(word)
        output_array.append(' '.join(temp_list))
    return output_array
output=remove_stopwords(data)
print(output)
['really love writing journals','mat comfortable buy likes', 'mousepad smooth']
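If punctuation matters (the sample lines contain tokens like "me." with a trailing period), a variant that uses NLTK's word_tokenize instead of split() keeps punctuation separate from the words. A sketch, assuming the punkt tokenizer data is installed; remove_stopwords_tokenized is just an illustrative name:

from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords_tokenized(sentence):
    # word_tokenize splits off punctuation such as the trailing '.' in "me."
    return ' '.join(w for w in word_tokenize(sentence) if w.lower() not in stop_words)

print(remove_stopwords_tokenized('This pen is very special to me.'))
# expected: 'pen special .' (punctuation survives as its own token)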
What is the correct way to use gensim's Phrases and preprocess_string together? I am doing it this way, but it feels a little contrived.
from gensim.models.phrases import Phrases
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_tags
from gensim.parsing.preprocessing import strip_short
from gensim.parsing.preprocessing import strip_multiple_whitespaces
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import strip_numeric
import re
from gensim import utils
# removed "_" from regular expression
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^`{|}~"""
RE_PUNCT = re.compile(r'([%s])+' % re.escape(punctuation), re.UNICODE)
def strip_punctuation(s):
    """Replace punctuation characters with spaces in `s` using :const:`~gensim.parsing.preprocessing.RE_PUNCT`.

    Parameters
    ----------
    s : str

    Returns
    -------
    str
        Unicode string without punctuation characters.

    Examples
    --------
    >>> from gensim.parsing.preprocessing import strip_punctuation
    >>> strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
    u'A semicolon is a stronger break than a comma but not as much as a full stop '

    """
    s = utils.to_unicode(s)
    return RE_PUNCT.sub(" ", s)

my_filter = [
    lambda x: x.lower(), strip_tags, strip_punctuation,
    strip_multiple_whitespaces, strip_numeric,
    remove_stopwords, strip_short, stem_text
]
documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1, threshold=2)
sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
test = " ".join(bigram[sent])
print(preprocess_string(test))
print(preprocess_string(test, filters=my_filter))
The result is:
['mayor', 'new', 'york']
['mayor', 'new_york'] #correct
Part of the code was taken from: How to extract phrases from corpus using gensim
I would recommend using gensim.utils.tokenize() instead of gensim.parsing.preprocessing.preprocess_string() for your example.
In many cases tokenize() does a very good job as it will only return sequences of alphabetic characters (no digits). This saves you the extra cleaning steps for punctuation etc.
However, tokenize() does not include removal of stopwords, short tokens, or stemming. That has to be customized anyway if you are dealing with languages other than English.
Here is some code for your (already clean) example documents which gives you the desired bigrams.
documents = ["the mayor of new york was there",
"machine learning can be useful sometimes",
"new york mayor was present"]
import gensim, pprint
# tokenize documents with gensim's tokenize() function
tokens = [list(gensim.utils.tokenize(doc, lower=True)) for doc in documents]
# build bigram model
bigram_mdl = gensim.models.phrases.Phrases(tokens, min_count=1, threshold=2)
# do more pre-processing on tokens (remove stopwords, stemming etc.)
# NOTE: this can be done better
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, stem_text
CUSTOM_FILTERS = [remove_stopwords, stem_text]
tokens = [preprocess_string(" ".join(doc), CUSTOM_FILTERS) for doc in tokens]
# apply bigram model on tokens
bigrams = bigram_mdl[tokens]
pprint.pprint(list(bigrams))
Output:
[['mayor', 'new_york'],
['machin', 'learn', 'us'],
['new_york', 'mayor', 'present']]
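As a side note, once the phrase model is trained it can be frozen for faster, read-only application to new token lists (in gensim 4 the class is called FrozenPhrases; Phraser is kept as an alias):

from gensim.models.phrases import Phraser

bigram_frozen = Phraser(bigram_mdl)
print(bigram_frozen[['new', 'york', 'mayor']])  # expected for this toy corpus: ['new_york', 'mayor']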
I was testing the NLTK package's vocabulary. I used the following code and was hoping to see all True.
import nltk
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
print ('answered' in english_vocab)
print ('unanswered' in english_vocab)
print ('altered' in english_vocab)
print ('alter' in english_vocab)
print ('looks' in english_vocab)
print ('look' in english_vocab)
But my results are as follows: many words are missing, or rather some forms of the words are missing. Am I missing something?
False
True
False
True
False
True
Indeed, the corpus is not an exhaustive list of all English words, just a fixed word list. A more appropriate way of telling whether a word is a valid English word is to use WordNet:
from nltk.corpus import wordnet as wn
print wn.synsets('answered')
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')]
print wn.synsets('unanswered')
# [Synset('unanswered.s.01')]
print wn.synsets('notaword')
# []
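Wrapping that check in a small helper (is_english_word is just an illustrative name, not an NLTK function):

from nltk.corpus import wordnet as wn

def is_english_word(word):
    # WordNet applies morphological processing, so inflected forms like
    # 'answered' still resolve to a synset.
    return len(wn.synsets(word)) > 0

print(is_english_word('answered'))   # True
print(is_english_word('notaword'))   # False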
NLTK corpora do not actually store every word; a corpus is defined as "a large body of text".
For example, you were using the words corpus, and we can check its definition by using its readme() method:
>>> print(nltk.corpus.words.readme())
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
The Unix words list is not exhaustive, so it may indeed be missing some words. Corpora are, by their nature, incomplete (hence the emphasis on natural language).
That being said, you might want to try a corpus drawn from a large body of edited text, like brown:
>>> print(nltk.corpus.brown.readme())
BROWN CORPUS
A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder, redistribution permitted.
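For example, the same membership test can be run against the Brown vocabulary; the exact hits will differ, since Brown is running text rather than a word list, and it is not exhaustive either (assumes the brown corpus has been downloaded):

import nltk
# nltk.download('brown')  # once, if the corpus is not installed yet

brown_vocab = set(w.lower() for w in nltk.corpus.brown.words())
for w in ('answered', 'unanswered', 'altered', 'alter', 'looks', 'look'):
    print(w, w in brown_vocab)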
I'm having trouble with the NLTK under Python, specifically the .generate() method.
generate(self, length=100)
Print random text, generated using a trigram language model.
Parameters:
* length (int) - The length of text to generate (default=100)
Here is a simplified version of what I am attempting.
import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)
This will always generate
Building ngram index...
The quick brown
None
As opposed to building a random phrase out of the words.
Here is my output when I do
print text.generate()
Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None
Again starting out with the same text, but then varying it. I've also tried using the first chapter from Orwell's 1984. Again that always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.
What am I doing wrong here?
To generate random text, you need to use Markov chains.
Code to do that, taken from here:
import random

class Markov(object):
    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size-3)
        seed_word, next_word = self.words[seed], self.words[seed+1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in xrange(size):
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
Explanation:
Generating pseudo random text with Markov chains using Python
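Hypothetical usage of the class above, assuming the training text sits in a plain-text file called corpus.txt:

with open('corpus.txt') as f:      # hypothetical path to your training text
    markov = Markov(f)
    print(markov.generate_markov_text(size=25))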
You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.
In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
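A sketch of that preprocessing, assuming the Orwell text is in a file called 1984.txt and NLTK's punkt tokenizer data is installed:

import nltk

with open('1984.txt') as f:        # hypothetical filename
    raw = f.read()

# One token list per sentence, rather than a single flat token list:
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(raw)]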
Your sample corpus is most likely too small. I don't know exactly how nltk builds its trigram model, but it is common practice to handle the beginning and end of sentences in some special way. Since there is only one sentence beginning in your corpus, this might be why every generated sequence starts the same way.
Are you sure that using word_tokenize is the right approach?
This Google groups page has the example:
>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()
But I've never used nltk, so I can't say whether that works the way you want.
Maybe you can shuffle the tokens array before generating a sentence.
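A literal sketch of that idea (whether it actually helps depends on how generate() builds its trigram model):

import random
import nltk

words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
random.shuffle(tokens)          # shuffle in place before building the Text
text = nltk.Text(tokens)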