I want to extract specific phrases from text with the help of NLTK's RegexpParser.
Is there a way to match exact words in the pos_tags?
For example, this is my text:
import nltk
text = "Samle Text and sample Text and text With University of California and Institute for Technology with SAPLE TEXT"
tokens = nltk.word_tokenize(text)
tagged_text = nltk.pos_tag(tokens)
regex = "ENTITY:{<University|Institute><for|of><NNP|NN>}"
# searching by regex that is defined
entity_search = nltk.RegexpParser(regex)
entity_result = entity_search.parse(tagged_text)
entity_result = list(entity_result)
print(entity_result)
Of course, I have a lot of different combinations of words that I want to use in my "ENTITY" regex, and I have a much longer text.
Is there a way to make it work?
FYI, I want to make this work with RegexpParser; I do not want to use plain regexes.
Here is a solution that doesn't require you to specify exact words and still extracts the entities of interest. The RegExp ({<N.*><IN><N.*>}) matches any noun-related tag <N.*>, followed by a preposition or subordinating conjunction tag <IN>, followed by another noun-related tag <N.*>. This is the general PoS tag pattern of strings like "University of ____" or "Institute for ____". You can make this stricter, matching only proper nouns, by changing <N.*> to <NNP>. For more information on PoS tags, see this tutorial.
Solution #1
from nltk import word_tokenize, pos_tag, RegexpParser
text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text) # Tokenize text
tagged_text = pos_tag(tokenized) # Tag tokenized text with PoS tags
# Create custom grammar rule to find occurrences of a noun followed proposition or subordinating conjunction, followed by another noun (e.g. University of ___)
my_grammar = r"""
ENTITY: {<N.*><IN><N.*>}"""
# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
cp = RegexpParser(grammar)
parse_tree = cp.parse(pos_tagged_text)
parse_tree.draw() # Visualise parse tree
return parse_tree
# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
labels = []
for line in grammar.splitlines()[1:]:
labels.append(line.split(":")[0])
return labels
# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
matching_phrases = []
for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
# Get phrases only, drop PoS tags
matching_phrases.append([leaf[0] for leaf in node.leaves()])
return matching_phrases
text_parse_tree = get_parse_tree(my_grammar, tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['University', 'of', 'California']
['Institute', 'for', 'Technology']
If you really require the ability to capture exact words, you can do this by defining custom tags for each of the words you need. One crude way to do this without training your own custom tagger is as follows:
Solution #2
from nltk import word_tokenize, pos_tag, RegexpParser
text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text) # Tokenize text
tagged_text = pos_tag(tokenized) # Tag tokenized text with PoS tags
# Define custom tags for specific words
my_specific_tagged_words = {
    "ORG_TYPE": ["University", "Institute"],
    "PREP": ["of", "for"]
}

# Create a copy of the tagged text to modify with custom tags
modified_tagged_text = tagged_text.copy()

# Iterate over tagged text, find the specified words and then modify the tags
for i, text_tag_tuple in enumerate(tagged_text):
    for tag in my_specific_tagged_words.keys():
        for word in my_specific_tagged_words[tag]:
            if text_tag_tuple[0] == word:
                modified_tagged_text[i] = (word, tag)  # Modify tag for specific word

# Create custom grammar rule to find occurrences of an ORG_TYPE tag, followed by a PREP tag, followed by another noun
my_grammar = r"""
ENTITY: {<ORG_TYPE><PREP><N.*>}"""

# Copy previously defined get_parse_tree, get_labels_from_grammar, get_phrases_using_custom_labels functions here...

text_parse_tree = get_parse_tree(my_grammar, modified_tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['University', 'of', 'California']
['Institute', 'for', 'Technology']
I'm using nltk via the following code to extract nouns from a sentence:
words = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(words)
And then I choose the words tagged with the NN and NNP Part of Speech (PoS) tags. However, it only extracts single nouns like "book" and "table", but ignores noun pairs like "basketball shoe". What should I do to expand the results to include such compound noun pairs?
Assuming you just want to find noun-noun compounds (e.g. "book store") and not other combinations like noun-verb (e.g. "snow fall") or adj-noun (e.g. "hot dog"), the following solution will capture 2 or more consecutive occurrences of either the NN, NNS, NNP or NNPS Part of Speech (PoS) tags.
Example
Using the NLTK RegExpParser with the custom grammar rule defined in the solution below, three compound nouns ("basketball shoe", "book store" and "peanut butter") are extracted from the following sentence:
John lost his basketball shoe in the book store while eating peanut butter
Solution
from nltk import word_tokenize, pos_tag, RegexpParser
text = "John lost his basketball shoe in the book store while eating peanut butter"
tokenized = word_tokenize(text) # Tokenize text
tagged = pos_tag(tokenized) # Tag tokenized text with PoS tags
# Create custom grammar rule to find consecutive occurrences of nouns
my_grammar = r"""
CONSECUTIVE_NOUNS: {<N.*><N.*>+}"""
# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    # parse_tree.draw() # Visualise parse tree
    return parse_tree

# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels

# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases
text_parse_tree = get_parse_tree(my_grammar, tagged)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['basketball', 'shoe']
['book', 'store']
['peanut', 'butter']
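If you also want to catch adjective-noun compounds like "hot dog", one option is to add a second pattern under the same label (NLTK grammars allow several rules per label); this is a rough variant, untested beyond the simple example below:
from nltk import word_tokenize, pos_tag, RegexpParser

# Variant grammar: adjective(s) followed by noun(s), OR two or more consecutive nouns
compound_grammar = r"""
COMPOUND: {<JJ>+<N.*>+}
          {<N.*><N.*>+}"""
tree = RegexpParser(compound_grammar).parse(pos_tag(word_tokenize("John ate a hot dog in the book store")))
for subtree in tree.subtrees(filter=lambda t: t.label() == "COMPOUND"):
    print([leaf[0] for leaf in subtree.leaves()])  # expected: ['hot', 'dog'] and ['book', 'store']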
This is the code where I define a noun phrase chunking method:
def np_chunking(sentence):
    import nltk
    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk import Tree
    grammer = "NP: {<JJ>*<NN.*>+}\n{<NN.*>+}"  # chunker rules: adjective+noun or one or more nouns
    sen = sentence
    cp = nltk.RegexpParser(grammer)
    mychunk = cp.parse(pos_tag(word_tokenize(sen)))
    result = mychunk
    return result.draw()
It works like this:
print(np_chunking("""I like to listen to music from musical genres,such as blues,rock and jazz."""))
But when I change the text to another sentence like:
print(np_chunking("""He likes to play basketball,football and other sports."""))
I do want to extract noun phrase chunks with a structure like adjective plus noun, or multiple nouns. But in the second example the word 'other' is part of the structure 'np_1, np_2 and other np_3', and after the 'and other' a hypernym usually follows.
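For reference, the tags show why 'other' ends up inside the chunk: nltk's default tagger normally tags 'other' as JJ, so the <JJ>*<NN.*>+ rule absorbs it into the following noun chunk. A quick check (not part of my original code):
from nltk import word_tokenize, pos_tag
# Inspect the tags of the problem sentence; 'other' is typically tagged JJ
print(pos_tag(word_tokenize("He likes to play basketball,football and other sports.")))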
In the second part:
import re

def hyponym_extract(prepared_text, hearst_patterns):
    text = merge_NP(prepared_text)
    hyponyms = []
    result = []
    if re.search(hearst_patterns[0][0], text) != None:
        matches = re.search(hearst_patterns[0][0], text)
        NP_match = re.findall(r"NP_\w+", matches.group(0))
        hyponyms = NP_match[1:]
        result = [(NP_match[0], x) for x in hyponyms]
    if re.search(hearst_patterns[1][0], text) != None:
        matches = re.search(hearst_patterns[1][0], text)
        NP_match = re.findall(r"NP_\w+", matches.group(0))
        hyponyms = NP_match[:-1]
        result = [(NP_match[-1], x) for x in hyponyms]
    return result

hearst_patterns = [(r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                   (r"((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)", "last")]  # two example Hearst patterns
print(hyponym_extract(prepare_chunks(np_chunking("I like to listen to music from musical genres,such as blues,rock and jazz.")),hearst_patterns))
print(hyponym_extract(prepare_chunks(np_chunking("He likes to play basketball,football and other sports.")),hearst_patterns))
The word 'other' is part of the Hearst pattern used to extract hypernyms and hyponyms.
So how could I improve my first code to let the second one work correctly?
The input word is standalone, not part of a sentence, but I would like to get all of its possible lemmas as if the input word appeared in different sentences with all possible POS tags. I would also like to get the lookup version of the word's lemma.
Why am I doing this?
I have extracted lemmas from all the documents and I have also calculated the number of dependency links between lemmas, both using en_core_web_sm. Now, given an input word, I would like to return the lemmas that are linked most frequently to all the possible lemmas of the input word.
So in short, I would like to replicate the behaviour of token.lemma_ for the input word with all possible POS tags, to maintain consistency with the lemma links I have counted.
I found it difficult to get lemmas and inflections directly out of spaCy without first constructing an example sentence to give it context. This wasn't ideal, so I looked further and found LemmInflect, which does this very well.
> from lemminflect import getAllLemmas, getInflection, getAllInflections, getAllInflectionsOOV
> getAllLemmas('watches')
{'NOUN': ('watch',), 'VERB': ('watch',)}
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}
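If you need the inflection for one specific Penn Treebank tag rather than the full table, getInflection (already imported above) covers that too, for example:
> getInflection('watch', tag='VBD')
('watched',)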
spaCy is just not designed to do this - it's made for analyzing text, not producing text.
The linked library looks good, but if you want to stick with spaCy or need languages besides English, you can look at spacy-lookups-data, which is the raw data used for lemmas. Generally there will be a dictionary for each part of speech that lets you look up the lemma for a word.
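For instance, a minimal sketch of pulling the English lemma table through spacy.lookups, assuming spacy-lookups-data is installed (table names vary by language and version, so treat the details as an assumption):
from spacy.lookups import load_lookups

# Load the raw lemma lookup table shipped with spacy-lookups-data
lookups = load_lookups("en", ["lemma_lookup"])
lemma_table = lookups.get_table("lemma_lookup")
print(lemma_table.get("watches", "watches"))  # expected to print "watch" if the word is in the table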
To get alternative lemmas, I am trying a combination of spaCy's rule_lemmatize and spaCy's lookup data. rule_lemmatize may produce more than one valid lemma, whereas the lookup data will only offer one lemma for a given word (in the files I have inspected). There are, however, cases where the lookup data produces a lemma while rule_lemmatize does not.
My examples are for Spanish:
import spacy
import spacy_lookups_data
import json
import pathlib
# text = "fui"
text = "seguid"
# text = "contenta"
print("Input text: \t\t" + text)
# Find lemmas using rules:
nlp = spacy.load("es_core_news_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
doc = nlp(text)
rule_lemmas = lemmatizer.rule_lemmatize(doc[0])
print("Lemmas using rules: " + ", ".join(rule_lemmas))
# Find lemma using lookup:
lookups_path = str(pathlib.Path(spacy_lookups_data.__file__).parent.resolve()) + "/data/es_lemma_lookup.json"
fileObject = open(lookups_path, "r")
lookup_json = fileObject.read()
lookup = json.loads(lookup_json)
print("Lemma from lookup: \t" + lookup[text])
Output:
Input text: fui # I went; I was (two verbs with same form in this tense)
Lemmas using rules: ir, ser # to go, to be (both possible lemmas returned)
Lemma from lookup: ser # to be
Input text: seguid # Follow! (imperative)
Lemmas using rules: seguid # Follow! (lemma not returned)
Lemma from lookup: seguir # to follow
Input text: contenta # (it) satisfies (verb); contented (adjective)
Lemmas using rules: contentar # to satisfy (verb but not adjective lemma returned)
Lemma from lookup: contento # contented (adjective, lemma form)
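To merge the two sources into a single candidate set (which is what I am ultimately after), something along these lines works on top of the code above; lookup.get falls back to the input word when it is missing from the lookup table:
# Combine rule-based lemmas and the lookup lemma into one candidate set
all_lemmas = set(rule_lemmas) | {lookup.get(text, text)}
print("All candidate lemmas: \t" + ", ".join(sorted(all_lemmas)))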
I am trying to define a grammar in order to retrieve the quantity and the fruit from a text with NLTK's RegexpParser. Apparently there is a problem in the grammar, because in the result I can only see the quantity. I paste below an example text and the code I am using. The HMM tagger was trained on the cess_esp corpus.
grammar = r"""
fruits: {<NCFP000>}
quantity:{<Z>}
"""
regex_parser = nltk.RegexpParser(grammar)
cp = nltk.RegexpParser(grammar)
example=['quiero 3 cervezas']
for sent in example:
    tokens = nltk.word_tokenize(sent)
    taggex = hmm_tagger.tag(tokens)
    print(taggex)
    result = cp.parse(taggex)
    result.draw()
Try using the NLTK tagger instead of the Markov one:
taggex = nltk.pos_tag(tokens)
I checked and it should work with your code as well.
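One caveat: nltk.pos_tag uses the English Penn Treebank tagset, so the EAGLES tags from cess_esp (NCFP000, Z) will no longer appear and the grammar's tag names need adjusting too. A rough sketch of what that could look like (the exact tags the English tagger assigns to the Spanish words are an assumption on my part):
import nltk

grammar = r"""
fruits: {<NN.*>}
quantity: {<CD>}
"""
cp = nltk.RegexpParser(grammar)
tokens = nltk.word_tokenize('quiero 3 cervezas')
taggex = nltk.pos_tag(tokens)  # Penn Treebank tags, e.g. ('3', 'CD')
result = cp.parse(taggex)
print(result)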
I have a list of sentences. Each sentence has to be converted to a JSON. There is a unique 'name' for each sentence that is also specified in that JSON. The problem is that the number of sentences is large, so manually assigning a name is tedious. The name should reflect the meaning of the sentence, e.g., if the sentence is "do you like cake?" then the name should be something like "likeCake". I want to automate the creation of a name for each sentence. I googled text summarization, but the results were about paragraph summarization rather than sentence summarization. How should I go about this?
This sort of task falls under natural language processing. You can get a result similar to what you want by removing stop words. Based on this article, you can use the Natural Language Toolkit (NLTK) for dealing with the stop words. After installing the library (pip install nltk), you can do something along the lines of:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# load data (may require nltk.download('punkt') and nltk.download('stopwords') on first run)
with open('yourFileWithSentences.txt', 'rt') as file:
    lines = file.readlines()

stop_words = set(stopwords.words('english'))

for line in lines:
    # split into words
    tokens = word_tokenize(line)
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # filter out stop words (and the empty strings left over after stripping punctuation)
    words = [w for w in stripped if w and w.lower() not in stop_words]
    print(f"Var name is {''.join(words)}")
Note that you can extend the stop_words set by adding any other words you might want to remove.
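If you want the camelCase style from the question ("likeCake" rather than "likecake"), a small tweak at the end of the loop does it; this is a sketch that reuses the filtered words list from the code above:
# Inside the loop above, replace the final print with a camelCase version:
if words:
    name = words[0].lower() + ''.join(w.capitalize() for w in words[1:])  # e.g. ['like', 'cake'] -> 'likeCake'
    print(f"Var name is {name}")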