Given a single word such as "table", I want to identify its most common usage: noun, verb or adjective. I want to do this in Python. Is there anything besides WordNet that can do this? I would prefer not to use WordNet. Or, if I do use WordNet, how exactly would I do it?
import nltk
text = 'This is a table. We should table this offer. The table is in the center.'
text = nltk.word_tokenize(text)
result = nltk.pos_tag(text)
result = [i for i in result if i[0].lower() == 'table']
print(result) # [('table', 'JJ'), ('table', 'VB'), ('table', 'NN')]
If you have a word out of context and want to know its most common use, you could look at someone else's frequency table (e.g. WordNet), or you can do your own counts: just find a tagged corpus that's large enough for your purposes, and count the word's instances per tag. If you want to use a free corpus, the NLTK includes the Brown corpus (1 million words). The NLTK also provides methods for working with larger, non-free corpora (e.g., the British National Corpus).
import nltk
from nltk.corpus import brown
table = nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == 'table')
print(table.most_common())
[('NN', 147), ('NN-TL', 50), ('VB', 1)]
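The Brown tags are fine-grained (NN and NN-TL are both nouns), so to answer "noun, verb or adjective" the counts can be collapsed into coarse classes. The following is a minimal sketch, assuming Brown-style tags in which noun tags start with "N", verb tags with "V" and adjective tags with "J"; the helper name coarse_pos_counts and the prefix table are illustrative, not part of NLTK.

```python
from collections import Counter

# Assumption: Brown-style tags, where noun tags start with "N", verb tags
# with "V" and adjective tags with "J". Everything else is grouped as "other".
COARSE_PREFIXES = {"N": "noun", "V": "verb", "J": "adjective"}

def coarse_pos_counts(tag_counts):
    """Collapse fine-grained (tag, count) pairs into coarse word classes."""
    coarse = Counter()
    for tag, count in tag_counts:
        coarse[COARSE_PREFIXES.get(tag[:1], "other")] += count
    return coarse

# Counts for "table" taken from the Brown corpus output above.
table_counts = [("NN", 147), ("NN-TL", 50), ("VB", 1)]
coarse = coarse_pos_counts(table_counts)
most_common_class, _ = coarse.most_common(1)[0]
print(coarse)
print(most_common_class)
```

With a FreqDist from NLTK, the input would be the list returned by most_common().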
As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (what they predict is not important; it's a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems standard, though my understanding is that TF-IDF measures apply to a word in a document, not to topics. Given that they have a case where they extract a single topic, I think it's impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (and how would one go about doing that using the below code?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights in the prediction? That does not seem plausible: with 1000+ documents the number of features would be very high, and almost certainly most of them would be useless.
Minimally reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
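For reference, the word-level TF-IDF weighting that the question mentions can be sketched without any library. This is only an illustration under assumed definitions (tf as the relative frequency of a term in a document, idf as log(N / document frequency)), which is one common variant and not necessarily the one used in the paper. The function name tf_idf and the toy documents are hypothetical. Which terms to keep, for example only the top words of each LDA topic, is left open here.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    Assumed definitions (one common variant, not necessarily the paper's):
    tf = count of term in doc / doc length, idf = log(N / document frequency).
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy documents, already tokenized (hypothetical data for illustration).
docs = [["brocolli", "good", "eat"], ["mother", "drive", "brother"], ["health", "good"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document gets weight 0 under this definition, which is the usual motivation for the idf factor.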
I want to automatically extract some desirable concepts (noun phrases) from text. My plan is to extract all noun phrases, label them as one of two classes (desirable and non-desirable phrases), and then train a classifier on them. What I am trying to do now is extract all possible phrases to use as the training set. For example, take the sentence "Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described." I want to get all phrases such as "shoulder", "richer mix", "shoulder of richer mix", "junctions", "junctions of columns and beams", "columns and beams", "columns", "beams", or whatever else is possible. The desirable phrases are "shoulder", "junctions", and "junctions of columns and beams". I don't care about correctness at this step; I just want to get the training set first. Are there tools available for such a task?
I tried Rake in rake_nltk, but the results failed to include my desirable phrases (i.e., it did not extract all possible phrases)
from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)  # must be called before the phrases can be ranked
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams']
(Missed junctions of columns and beams here)
I also tried phrasemachine, the results also missed some desirable ones.
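import spacy
import phrasemachine
nlp = spacy.load('en_core_web_sm')  # assumption: the original does not say which model was loaded
doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])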
[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']
(Missed many noun phrases here)
You may wish to make use of the noun_chunks attribute:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')
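phrases = set()
for nc in doc.noun_chunks:
    # the base noun chunk, e.g. 'a shoulder'
    phrases.add(nc.text)
    # the chunk extended to the full subtree of its root, e.g. 'a shoulder of richer mix'
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)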
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
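If the goal is simply to over-generate candidates for a training set, another option is to enumerate every token span whose part-of-speech sequence matches a noun-phrase pattern. The sketch below is a rough illustration of that idea, not the method used by spaCy or phrasemachine. The one-letter tag codes, the pattern, and the hand-tagged input are all assumptions made for the example.

```python
import re

# Hypothetical one-letter codes for coarse POS tags (an assumption for this sketch).
TAG_CODES = {"ADJ": "A", "NOUN": "N", "ADP": "P", "DET": "D", "CCONJ": "C"}

# Noun-phrase pattern: optional modifiers, a noun, then optional groups of
# "preposition or conjunction + optional determiner + modifiers + noun".
NP_PATTERN = re.compile(r"[AN]*N(?:[PC]D?[AN]*N)*")

def candidate_phrases(tagged_tokens):
    """Return every contiguous span whose tag sequence matches NP_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    phrases = set()
    for start in range(len(codes)):
        for end in range(start + 1, len(codes) + 1):
            if NP_PATTERN.fullmatch(codes, start, end):
                phrases.add(" ".join(tok for tok, _ in tagged_tokens[start:end]))
    return phrases

# Hand-tagged fragment of the example sentence (tags supplied manually here;
# in practice they would come from a tagger such as spaCy's token.pos_).
tagged = [("junctions", "NOUN"), ("of", "ADP"), ("columns", "NOUN"),
          ("and", "CCONJ"), ("beams", "NOUN")]
phrases = candidate_phrases(tagged)
print(sorted(phrases))
```

Checking all spans is quadratic in sentence length, which is usually acceptable for single sentences.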
I'm new to Python. I have a big data set from Twitter and I want to tokenize it.
But I don't know how to tokenize phrasal verbs such as "look for", "take off", "grow up", etc., and this is important to me.
My code is:
>>> from nltk.tokenize import word_tokenize
>>> s = "I'm looking for the answer"
>>> word_tokenize(s)
['I', "'m", 'looking', 'for', 'the', 'answer']
My data set is big, so I can't use the code from this page:
Find multi-word terms in a tokenized text in Python
So, how can I solve my problem?
You need to use parts of speech tags for that, or actually dependency parsing would be more accurate. I haven't tried with nltk, but with spaCy you can do it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
def chunk_phrasal_verbs(lemmatized_sentence):
    ph_verbs = []
    for word in nlp(lemmatized_sentence):
        # a preposition attached to a verb is treated as a phrasal-verb particle
        if word.dep_ == 'prep' and word.head.pos_ == 'VERB':
            ph_verb = word.head.text + ' ' + word.text
            ph_verbs.append(ph_verb)
    return ph_verbs
I also suggest lemmatizing the sentence first to get rid of conjugations. If you also need noun phrases, you can use the compound dependency relation in a similar way.
I'm looking to get the similarity between a single word and each word in a sentence using NLTK.
NLTK can get the similarity between two specific words as shown below. This method requires that a specific reference to the word is given, in this case it is 'dog.n.01' where dog is a noun and we want to use the first (01) NLTK definition.
from nltk.corpus import wordnet
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.path_similarity(cat))
>> 0.2
The problem is that I need to get the part of speech information from each word in the sentence. The NLTK package has the ability to get the parts of speech for each word in a sentence as shown below. However, these speech parts ('NN', 'VB', 'PRP'...) don't match up with the format that the synset takes as a parameter.
from nltk import pos_tag, word_tokenize
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Is it possible to get synset-formatted data from the pos_tag() results in NLTK? By synset-formatted I mean a format like dog.n.01.
You can use a simple conversion function:
from nltk.corpus import wordnet as wn
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
After tagging a sentence you can tie a word inside the sentence with a SYNSET using this function. Here's an example:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
sentence = "I am going to buy some gifts"
tagged = pos_tag(word_tokenize(sentence))
synsets = []
lemmatzr = WordNetLemmatizer()
for token in tagged:
    wn_tag = penn_to_wn(token[1])
    if not wn_tag:
        continue  # skip tags with no WordNet equivalent (pronouns, determiners, ...)
    lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)
    synsets.append(wn.synsets(lemma, pos=wn_tag)[0])
print(synsets)
Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
You can use wordnet.synsets, which accepts a part-of-speech argument:
wordnet.synsets('dog', pos=wordnet.NOUN)
You'll still need to translate the tags offered by pos_tag into those supported by wordnet.synsets -- unfortunately, I don't know of a pre-built dictionary doing that, so (unless I'm missing the existence of such a correspondence table) you'll need to build your own (you can do that once and pickle it for subsequent reloading).
See the NLTK book's chapter on categorizing and tagging words, subchapter 1, on how to get help about a specific tagset -- e.g. nltk.help.upenn_tagset('N.*') will confirm that the UPenn tagset (which I believe is the default one used by pos_tag) uses 'N' followed by something to identify variants of what synset will see as a wordnet.NOUN.
I have not tried but it might be just what you require -- give it a try!
How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
NLTK has a function, bigrams, that returns word pairs.
>>> import nltk
>>> list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
What's left is to find bigrams that occur more often based on the frequency of individual words. Any ideas how to put it in the code?
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)  # advance the second iterator by one word
count = {}
for bigram in zip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1
print(sorted(((c, b) for b, c in count.items()), reverse=True))
(words_iter is introduced to avoid copying the whole list of words, as you would do in zip(words, words[1:]).)
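from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)  # skip the first word so that zip pairs each word with its successor
freq = Counter(zip(words, nextword))
print(freq)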
A collocation is a sequence of tokens that are better treated as a single token when parsing; for example, "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?
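The ranking step described in this answer can be sketched with the standard library alone. The example below scores adjacent word pairs by pointwise mutual information (PMI), one of the statistics mentioned. It is a minimal illustration on a toy word list, with an assumed min_count cutoff to suppress rare pairs; the function name rank_bigrams_by_pmi is hypothetical, and a real run would need a much larger corpus.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(words, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = max(len(words) - 1, 1)
    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI scores
        p_pair = count / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scored.append((math.log2(p_pair / (p_w1 * p_w2)), (w1, w2)))
    return sorted(scored, reverse=True)

# Toy input for illustration only; far too small for meaningful collocations.
words = "the red herring was a red herring and the cat saw the red herring".split()
ranked = rank_bigrams_by_pmi(words)
for score, bigram in ranked:
    print(bigram, round(score, 2))
```

PMI favours pairs whose words rarely occur apart, which is why a frequency cutoff is normally applied first.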
I agree with Tim McNamara on using NLTK and on the problems with Unicode. However, I like the Text class a lot, and there is a hack you can use to get the collocations as a list. I discovered it by looking at the source code: apparently, whenever you invoke the collocations method, it saves the result as an attribute on the object.
import nltk
def tokenize(sentences):
    for sent in nltk.sent_tokenize(sentences.lower()):
        for word in nltk.word_tokenize(sent):
            yield word
text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)  # the hack relies on this call having populated text._collocations
collocations = [" ".join(el) for el in list(text._collocations)]
enjoy !