Regex parser for a Spanish text - python

I am trying to define a grammar in order to retrieve the quantity and the fruit from a text with RegexpParser. Apparently there is a problem in the grammar, because in the result I can only see the quantity. Below are an example text and the code I am using. The HMM tagger was trained on the cess_esp corpus.
import nltk

grammar = r"""
fruits: {<NCFP000>}
quantity: {<Z>}
"""
cp = nltk.RegexpParser(grammar)

example = ['quiero 3 cervezas']
for sent in example:
    tokens = nltk.word_tokenize(sent)
    taggex = hmm_tagger.tag(tokens)
    print(taggex)
    result = cp.parse(taggex)
    result.draw()

Try using the NLTK tagger instead of the Markov one:
taggex = nltk.pos_tag(tokens)
I checked it and it should work with your code as well.
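Note that nltk.pos_tag uses the English Penn Treebank tag set, so the cess_esp tags in the grammar (Z, NCFP000) would then need to be replaced with their Penn counterparts. A minimal sketch of that adjustment (assuming CD for the quantity and NN/NNS for the noun, and using an English sentence purely for illustration):
import nltk
# Sketch only: Penn Treebank tags, since nltk.pos_tag is trained on English text
grammar = r"""
fruits: {<NN|NNS>+}
quantity: {<CD>}
"""
cp = nltk.RegexpParser(grammar)
tokens = nltk.word_tokenize("I want 3 apples")
taggex = nltk.pos_tag(tokens)  # e.g. [('I', 'PRP'), ('want', 'VBP'), ('3', 'CD'), ('apples', 'NNS')]
result = cp.parse(taggex)
print(result)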

Related

How to extract noun-based compound words from a sentence using Python?

I'm using nltk via the following code to extract nouns from a sentence:
words = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(words)
And then I choose the words tagged with the NN and NNP Part of Speech (PoS) tags. However, it only extracts single nouns like "book" and "table", yet ignores pairs of nouns like "basketball shoe". What should I do to expand the results to contain such compound noun pairs?
Assuming you just want to find noun-noun compounds (e.g. "book store") and not other combinations like noun-verb (e.g. "snow fall") or adj-noun (e.g. "hot dog"), the following solution will capture 2 or more consecutive occurrences of either the NN, NNS, NNP or NNPS Part of Speech (PoS) tags.
Example
Using the NLTK RegExpParser with the custom grammar rule defined in the solution below, three compound nouns ("basketball shoe", "book store" and "peanut butter") are extracted from the following sentence:
John lost his basketball shoe in the book store while eating peanut butter
Solution
from nltk import word_tokenize, pos_tag, RegexpParser

text = "John lost his basketball shoe in the book store while eating peanut butter"
tokenized = word_tokenize(text)  # Tokenize text
tagged = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Create custom grammar rule to find consecutive occurrences of nouns
my_grammar = r"""
CONSECUTIVE_NOUNS: {<N.*><N.*>+}"""

# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    # parse_tree.draw()  # Visualise parse tree
    return parse_tree

# Function to get labels from custom grammar:
# takes line-separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels

# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases

text_parse_tree = get_parse_tree(my_grammar, tagged)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['basketball', 'shoe']
['book', 'store']
['peanut', 'butter']
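If single-string phrases are preferred over token lists, the leaves can simply be joined (a small follow-up sketch, not part of the original answer):
compound_nouns = [" ".join(phrase) for phrase in phrases]
print(compound_nouns)  # ['basketball shoe', 'book store', 'peanut butter']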

How to use exact words in NLTK RegexpParser

I want to extract specific phrases from text with the help of NLTK RegexpParser.
Is there a way to combine exact words with pos_tags?
For example, this is my text:
import nltk
text = "Samle Text and sample Text and text With University of California and Institute for Technology with SAPLE TEXT"
tokens = nltk.word_tokenize(text)
tagged_text = nltk.pos_tag(tokens)
regex = "ENTITY:{<University|Institute><for|of><NNP|NN>}"
# searching by regex that is defined
entity_search = nltk.RegexpParser(regex)
entity_result = entity_search.parse(tagged_text)
entity_result = list(entity_result)
print(entity_result)
Of course, I have many different combinations of words that I want to use in my "ENTITY" regex, and I have a much longer text.
Is there a way to make it work?
FYI, I want to make it work with RegexpParser, I do not want to use regular regexes.
Here is a solution that doesn't require you to specify exact words and still extracts the entities of interest. The RegExp ({<N.*><IN><N.*>}) matches any noun-related tag <N.*>, followed by a preposition or subordinating conjunction tag <IN>, followed by another noun-related tag <N.*>. This is the general PoS tag pattern of strings like "University of ____" or "Institute for _____". You can make this stricter, so that it matches only proper nouns, by changing <N.*> to <NNP>. For more information on PoS tags, see this tutorial.
Solution #1
from nltk import word_tokenize, pos_tag, RegexpParser

text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text)  # Tokenize text
tagged_text = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Create custom grammar rule to find occurrences of a noun followed by a preposition
# or subordinating conjunction, followed by another noun (e.g. University of ___)
my_grammar = r"""
ENTITY: {<N.*><IN><N.*>}"""

# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    parse_tree.draw()  # Visualise parse tree
    return parse_tree

# Function to get labels from custom grammar:
# takes line-separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels

# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases

text_parse_tree = get_parse_tree(my_grammar, tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['University', 'of', 'California']
['Institute', 'for', 'Technology']
If you really require the ability to capture exact words, you can do this by defining custom tags for each of the words you require. One crude solution to do this without training your own custom tagger is as follows:
Solution #2
from nltk import word_tokenize, pos_tag, RegexpParser

text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text)  # Tokenize text
tagged_text = pos_tag(tokenized)  # Tag tokenized text with PoS tags

# Define custom tags for specific words
my_specific_tagged_words = {
    "ORG_TYPE": ["University", "Institute"],
    "PREP": ["of", "for"]
}

# Create a copy of the tagged text to modify with custom tags
modified_tagged_text = list(tagged_text)

# Iterate over tagged text, find the specified words and then modify the tags
for i, text_tag_tuple in enumerate(tagged_text):
    for tag in my_specific_tagged_words.keys():
        for word in my_specific_tagged_words[tag]:
            if text_tag_tuple[0] == word:
                modified_tagged_text[i] = (word, tag)  # Modify tag for specific word

# Create custom grammar rule to find occurrences of the ORG_TYPE tag,
# followed by the PREP tag, followed by another noun
my_grammar = r"""
ENTITY: {<ORG_TYPE><PREP><N.*>}"""

# Copy previously defined get_parse_tree, get_labels_from_grammar, get_phrases_using_custom_labels functions here...

text_parse_tree = get_parse_tree(my_grammar, modified_tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['University', 'of', 'California']
['Institute', 'for', 'Technology']

Keyphrase extraction in Python - How to preprocess the text to get better performance

I'm trying to extract keyphrases from some English texts but I think that the quality of my results is affected by how the sentences are formulated. For example:
Sentence 1
import pke
text = "Manufacture of equipment for the production and use of hydrogen."
# define the valid Part-of-Speeches to occur in the graph
pos = {'NOUN', 'PROPN', 'ADJ'}
# define the grammar for selecting the keyphrase candidates
grammar = "NP: {<ADJ>*<NOUN|PROPN>+}"
extractor = pke.unsupervised.PositionRank()
extractor.load_document(input=text, language='en')
extractor.grammar_selection(grammar = grammar)
extractor.candidate_selection(maximum_word_number = 5)
extractor.candidate_weighting(window = 5, pos = pos)
keyphrases = extractor.get_n_best(n = 10, redundancy_removal = True)
keyphrases
returns this:
[('equipment', 0.2712123844387682),
('production', 0.24805759926043025),
('manufacture', 0.20214941371717332),
('use', 0.14005307983173715),
('hydrogen', 0.1385275227518909)]
While:
Sentence 2
text = "Equipment manufacture for hydrogen production and hydrogen use"
with the same piece of code returns this:
[('hydrogen production', 0.5110246649313613),
('hydrogen use', 0.4067693357279659),
('equipment manufacture', 0.3619113634611547)]
which, in my opinion, is a better result since it allows me to understand what the text is about.
I wonder if there's a way to preprocess Sentence 1 to make it more similar to Sentence 2. I've already tried Neuralcoref but, in this particular case, it doesn't help.
Thank you in advance for any suggestion.
Francesca

Spacy tokenizer with only "Whitespace" rule

I would like to know whether the spaCy tokenizer can tokenize words using only the whitespace rule.
For example:
sentence = "(c/o Oxford University )"
Normally, using the following configuration of spaCy:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token)
the result would be:
(
c
/
o
Oxford
University
)
Instead, I would like an output like the following (using spacy):
(c/o
Oxford
University
)
Is it possible to obtain a result like this using spacy?
Let's replace nlp.tokenizer with a custom Tokenizer that uses a token_match regex:
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's"
print("Before:", [tok for tok in nlp(text)])
nlp.tokenizer = Tokenizer(nlp.vocab, token_match=re.compile(r'\S+').match)
print("After :", [tok for tok in nlp(text)])
Before: [This, is, it, 's]
After : [This, is, it's]
You can further adjust the Tokenizer by adding custom suffix, prefix, and infix rules.
An alternative, more fine-grained approach is to find out why the "it's" token is split the way it is, using nlp.tokenizer.explain():
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en_core_web_sm')
text = "This is it's. I'm fine"
nlp.tokenizer.explain(text)
You'll find out that the split is due to SPECIAL rules:
[('TOKEN', 'This'),
('TOKEN', 'is'),
('SPECIAL-1', 'it'),
('SPECIAL-2', "'s"),
('SUFFIX', '.'),
('SPECIAL-1', 'I'),
('SPECIAL-2', "'m"),
('TOKEN', 'fine')]
which can be updated to remove "it's" from the exceptions:
exceptions = nlp.Defaults.tokenizer_exceptions
filtered_exceptions = {k:v for k,v in exceptions.items() if k!="it's"}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I, 'm, fine]
or to remove splitting on apostrophes altogether:
filtered_exceptions = {k:v for k,v in exceptions.items() if "'" not in k}
nlp.tokenizer = Tokenizer(nlp.vocab, rules = filtered_exceptions)
[tok for tok in nlp(text)]
[This, is, it's., I'm, fine]
Note the dot attached to the token, which is because no suffix rules were specified.
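If you also want the trailing period split off while keeping the filtered exceptions, one option (a sketch, assuming spaCy's default suffix rules compiled via spacy.util.compile_suffix_regex) is to pass a suffix_search function as well:
from spacy.util import compile_suffix_regex
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)  # compile the default suffix rules
nlp.tokenizer = Tokenizer(nlp.vocab, rules=filtered_exceptions, suffix_search=suffix_re.search)
[tok for tok in nlp(text)]  # the '.' should now become its own token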
You can find the solution to this very question in the spaCy docs: https://spacy.io/usage/linguistic-features#custom-tokenizer-example. In a nutshell, you create a callable that takes a string and returns a Doc object, and then assign that callable to nlp.tokenizer:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])
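For reference, this should print the whitespace-delimited chunks with punctuation kept attached to its word, roughly: ["What's", 'happened', 'to', 'me?', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.'].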

CV Parser name matching

I am using NLP with Python to find names in a string. I am able to find a name if I have a full name (first name and last name), but if the string contains only a first name, my code does not recognize it as a PERSON. Below is my code.
import re
import nltk
from nltk.corpus import stopwords

stop = stopwords.words('english')

string = """
Sriram is working as a python developer
"""

def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    # print(sentences)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            # print("Out Side ", chunk)
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    print("In Side ", chunk)
                    names.append(' '.join([c[0] for c in chunk]))
    return names

if __name__ == '__main__':
    names = extract_names(string)
    print(names)
My advice is to use StanfordNLP or spaCy for NER; using nltk's ne_chunk is a little janky. StanfordNLP is more commonly used by researchers, but spaCy is easier to work with. Here is an example using spaCy to print the name of each named entity and its type:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> text = 'Sriram is working as a python developer'
>>> doc = nlp(text)
>>> for ent in doc.ents:
...     print(ent.text, ent.label_)
...
Sriram ORG
>>>
Note that it classifies Sriram as an organization, which may be because it is not a common English name and spaCy is trained on English corpora. Good luck!
