How to write code to merge punctuations and phrases using spaCy - python

What I would like to do
I would like to parse sentences and run dependency analysis using spaCy, an open-source library for natural language processing.
In particular, I would like to know how to enable, from Python code, the options to merge punctuation and phrases.
There are buttons to merge punctuation and phrases in the displaCy Dependency Visualizer web app.
However, I cannot find how to set these options when writing code in a local environment.
The current code returns the unmerged version.
The sample sentence is from your dictionary.
Current Code
It is from the sample code on the spaCy official website.
Please let me know how to change it to enable the punctuation and phrase merge options.
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
sentence = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(sentence)
displacy.render(doc, style="dep")
What I tried to do
I found one example of a merge implementation.
However, it didn't work when I applied it to my sentence.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Example Code
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

If you need to merge noun chunks, check out the built-in merge_noun_chunks pipeline component. When added to your pipeline using nlp.add_pipe, it will take care of merging the spans automatically.
You can just use the code from the displaCy Dependency Visualizer:
import spacy
nlp = spacy.load("en_core_web_sm")
def merge_phrases(doc):
    # Merge each noun chunk into a single token, keeping the root's attributes
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc
def merge_punct(doc):
    # Collect spans made of a non-punctuation token plus the punctuation that follows it
    spans = []
    for word in doc[:-1]:
        if word.is_punct or not word.nbor(1).is_punct:
            continue
        start = word.i
        end = word.i + 1
        while end < len(doc) and doc[end].is_punct:
            end += 1
        span = doc[start:end]
        spans.append((span, word.tag_, word.lemma_, word.ent_type_))
    with doc.retokenize() as retokenizer:
        for span, tag, lemma, ent_type in spans:
            attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
            retokenizer.merge(span, attrs=attrs)
    return doc
text = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(text)
# Merge noun phrases into one token.
doc = merge_phrases(doc)
# Attach punctuation to tokens
doc = merge_punct(doc)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)


How to use exact words in NLTK RegexpParser

I want to extract specific phrases from text with the help of NLTK RegexpParser.
Is there a way to combine exact words with POS tags in the pattern?
For example, this is my text:
import nltk
text = "Samle Text and sample Text and text With University of California and Institute for Technology with SAPLE TEXT"
tokens = nltk.word_tokenize(text)
tagged_text = nltk.pos_tag(tokens)
regex = "ENTITY:{<University|Institute><for|of><NNP|NN>}"
# searching by regex that is defined
entity_search = nltk.RegexpParser(regex)
entity_result = entity_search.parse(tagged_text)
entity_result = list(entity_result)
Of course, I have a lot of different combinations of words that I want to use in my "ENTITY" regex, and my text is much longer.
Is there a way to make it work?
FYI, I want to make it work with RegexpParser; I do not want to use regular regexes.
Here is a solution that doesn't require you to specify exact words and still extracts the entities of interest. The RegExp ({<N.*><IN><N.*>}) matches any noun-related tag <N.*>, followed by a preposition or subordinating conjunction tag <IN>, followed by another noun-related tag <N.*>. This is the general PoS tag pattern of strings like "University of ____" or "Institute for _____". You can make this stricter, matching only proper nouns, by changing <N.*> to <NNP>. For more information on PoS tags, see this tutorial.
Solution #1
from nltk import word_tokenize, pos_tag, RegexpParser
text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text) # Tokenize text
tagged_text = pos_tag(tokenized) # Tag tokenized text with PoS tags
# Create custom grammar rule to find occurrences of a noun followed by a preposition or subordinating conjunction, followed by another noun (e.g. University of ___)
my_grammar = r"""
ENTITY: {<N.*><IN><N.*>}"""
# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    parse_tree.draw() # Visualise parse tree
    return parse_tree
# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0]) # The label is the text before the colon
    return labels
# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases
text_parse_tree = get_parse_tree(my_grammar, tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
['University', 'of', 'California']
['Institute', 'for', 'Technology']
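To make the tag pattern concrete, here is a minimal pure-Python sketch of what <N.*><IN><N.*> looks for: three consecutive tokens whose tags are noun-like, IN, and noun-like. It does not use NLTK at all; the hand-written tags and the helper name find_noun_in_noun are assumptions for illustration only, and RegexpParser's real chunking is more general than this fixed-width scan.

```python
import re

# Hand-tagged sample (tags written out by hand so no tagger is needed)
tagged = [
    ("the", "DT"), ("University", "NNP"), ("of", "IN"), ("California", "NNP"),
    ("and", "CC"), ("Institute", "NNP"), ("for", "IN"), ("Technology", "NNP"),
]

# One regex per position, mirroring <N.*><IN><N.*>
TAG_PATTERN = [r"N.*", r"IN", r"N.*"]


def find_noun_in_noun(tagged_tokens, pattern=TAG_PATTERN):
    """Return non-overlapping word sequences whose tags match the pattern."""
    phrases = []
    i = 0
    while i + len(pattern) <= len(tagged_tokens):
        window = tagged_tokens[i:i + len(pattern)]
        if all(re.fullmatch(p, tag) for p, (_, tag) in zip(pattern, window)):
            phrases.append([word for word, _ in window])
            i += len(pattern)  # continue after the matched tokens
        else:
            i += 1
    return phrases


phrases = find_noun_in_noun(tagged)
for phrase in phrases:
    print(phrase)
```

Swapping r"N.*" for r"NNP" in TAG_PATTERN gives the stricter proper-noun-only variant mentioned above.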
If you really require the ability to capture exact words, you can do this by defining custom tags for each of the words you require. One crude solution to do this without training your own custom tagger is as follows:
Solution #2
from nltk import word_tokenize, pos_tag, RegexpParser
text = "This is sample text with the University of California and Institute for Technology with more sample text."
tokenized = word_tokenize(text) # Tokenize text
tagged_text = pos_tag(tokenized) # Tag tokenized text with PoS tags
# Define custom tags for specific words
my_specific_tagged_words = {
    "ORG_TYPE": ["University", "Institute"],
    "PREP": ["of", "for"],
}
# Create a copy of the tagged text to modify with custom tags
modified_tagged_text = list(tagged_text)
# Iterate over tagged text, find the specified words and then modify the tags
for i, text_tag_tuple in enumerate(tagged_text):
    for tag in my_specific_tagged_words.keys():
        for word in my_specific_tagged_words[tag]:
            if text_tag_tuple[0] == word:
                modified_tagged_text[i] = (word, tag) # Modify tag for specific word
# Create custom grammar rule to find occurrences of an ORG_TYPE tag, followed by a PREP tag, followed by another noun
my_grammar = r"""
ENTITY: {<ORG_TYPE><PREP><N.*>}"""
# Copy previously defined get_parse_tree, get_labels_from_grammar, get_phrases_using_custom_labels functions here...
text_parse_tree = get_parse_tree(my_grammar, modified_tagged_text)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
['University', 'of', 'California']
['Institute', 'for', 'Technology']

How to make spaCy PhraseMatcher match only when all the tokens in the list are found?

I have the following code:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
CAT = [nlp.make_doc(text) for text in ['pension', 'underwriter', 'health', 'client']]
phrase_matcher.add("CATEGORY 1",None, *CAT)
text = 'The client works as a marine assistant underwriter. He has recently opted to stop paying into his pension. '
doc = nlp(text)
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id] # get the unicode ID, i.e. 'CategoryID'
    span = doc[start : end] # get the matched slice of the doc
    print(rule_id, span.text)
# Output
CATEGORY 1 client
CATEGORY 1 underwriter
CATEGORY 1 pension
Can I ask it to return a result only when all the words can be found in the sentence? I expect not to see anything here, as 'health' is not part of the sentence.
Can I do this type of matching with PhraseMatcher, or do I need to switch to another type of rule-based matching? Thank you.
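One possible approach, sketched here rather than confirmed as a PhraseMatcher feature: keep using the matcher to find the individual terms, then report them only if the set of matched texts covers every required term. The sketch below is plain Python; matched_texts stands in for the span.text values collected in the loop above, and the helper name all_terms_found is hypothetical.

```python
def all_terms_found(required_terms, matched_texts):
    """Return True only if every required term appears among the matches."""
    required = {term.lower() for term in required_terms}
    found = {text.lower() for text in matched_texts}
    return required <= found  # subset test


required_terms = ["pension", "underwriter", "health", "client"]

# Stand-in for [doc[start:end].text for _, start, end in matches]
matched_texts = ["client", "underwriter", "pension"]

if all_terms_found(required_terms, matched_texts):
    for text in matched_texts:
        print("CATEGORY 1", text)
else:
    print("No result: not every term was found")
```

With the sample sentence, 'health' is missing from the matches, so nothing is reported for the category.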

Accessing out of range word in spaCy doc : why does it work?

I'm learning spaCy and am playing with Matchers.
I have:
a very basic sentence ("white shepherd dog")
a matcher object, searching for a pattern ("white shepherd")
a print to show the match, and the word and POS before that match
I just wanted to check how to handle the index out of range exception I'm expecting to get because there's nothing before the match. I didn't expect it to work, but it did and is returning 'dog', which is after the match... and now I'm confused.
It looks like spaCy uses a circular list (or deque I think) ?
This needs a language model to run, you can install it with the following command line, if you'd like to reproduce it:
python -m spacy download en_core_web_md
And this is the code
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding the pattern to the matcher
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}]) # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_) # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on ?
Thanks !
You are looking up the token at index 0 - 1, which evaluates to -1, and index -1 refers to the last token.
I recommend using the Token.nbor method to look for the first token before the span, and if no previous token exists make it None or an empty string.
import spacy
from spacy.matcher import Matcher
# Loading language model
nlp = spacy.load("en_core_web_md")
# Initialising with shared vocab
matcher = Matcher(nlp.vocab)
# Adding the pattern to the matcher
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}]) # searching for white shepherd
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        # nbor(-1) raises IndexError when there is no token before the span
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''
        print("Previous token: None None")
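The wrap-around is ordinary Python negative indexing rather than anything circular. Here is a minimal sketch with a plain list standing in for the Doc, showing an explicit guard as an alternative to catching IndexError. The helper name previous_item is hypothetical.

```python
tokens = ["white", "shepherd", "dog"]  # plain list standing in for the Doc

start = 0  # the match begins at the first token
print(tokens[start - 1])  # index -1 counts from the end of the sequence


def previous_item(seq, start):
    """Return the item just before `start`, or None when there is none."""
    if start <= 0:
        return None
    return seq[start - 1]


print(previous_item(tokens, 0))
print(previous_item(tokens, 2))
```

Checking start > 0 before indexing avoids the silent wrap-around without needing an exception handler.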

Spacy custom sentence segmentation on line break

I'm trying to split this document into paragraphs. Specifically, I would like to split the text whenever there is a line break (<br>).
This is the code I'm using, but it is not producing the results I hoped for:
import spacy
nlp = spacy.load("en_core_web_lg")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print([sent.text for sent in doc.sents])
A similar result could be achieved with NLTK's TextTilingTokenizer, but I wanted to check whether there is anything similar within spaCy.
You're almost there, but the problem is that the default Tokenizer splits on '<' and '>', hence the condition token.text == "<br>" is never true. I'd add a space before and after <br> and register <br> as a tokenizer special case, e.g.:
import spacy
from spacy.symbols import ORTH
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc
nlp = spacy.load("en_core_web_sm")
text = "the quick brown fox<br>jumps over the lazy dog"
text = text.replace('<br>', ' <br> ')
special_case = [{ORTH: "<br>"}]
nlp.tokenizer.add_special_case("<br>", special_case)
nlp.add_pipe(set_custom_boundaries, first=True)
doc = nlp(text)
print([sent.text for sent in doc.sents])
Also take a look at this PR; after it's merged to master, it will no longer be necessary to wrap <br> in spaces.
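If the goal is only paragraph boundaries, a different and simpler technique (not the custom-boundary approach above) is to split the raw string on <br> before handing each piece to spaCy. A minimal sketch in plain Python; the helper name split_paragraphs is hypothetical.

```python
def split_paragraphs(text, marker="<br>"):
    """Split raw text on a marker and drop empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]


text = "the quick brown fox<br>jumps over the lazy dog"
paragraphs = split_paragraphs(text)
print(paragraphs)
```

Each paragraph could then be passed to nlp separately, at the cost of losing a single Doc that spans the whole text.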

Differentiating between the two types of nouns using spacy

I am using spaCy to understand phrases, and I am trying to differentiate between nouns like food, beer, wine etc. and other nouns like yesterday and today.
I am not able to come up with an idea as to how to differentiate them.
query = input()
doc = nlp(query)
What can I do to differentiate between the first three nouns and yesterday?
The displaCy rendering is shown in the image.
Well, what you can do is to check its entity:
sent = "Yesterday, I drank a beer"
doc = nlp(sent)
for token in doc:
    print(token.text, token.pos_, token.tag_, token.ent_type_)
#Yesterday NOUN NN DATE
#, PUNCT ,
#drank VERB VBD
#beer NOUN NN
As you can see, dates like yesterday and today are recognized as a DATE entity. spaCy defines a number of entity types; here is a list.
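A sketch of how that check could be used to separate the two kinds of nouns. To stay self-contained it works on (text, pos, ent_type) tuples copied from the output above rather than on real tokens; with spaCy these values would come from token.text, token.pos_ and token.ent_type_. The helper name split_nouns is hypothetical.

```python
def split_nouns(rows):
    """Separate nouns tagged as DATE entities from all other nouns."""
    date_nouns, other_nouns = [], []
    for text, pos, ent_type in rows:
        if pos != "NOUN":
            continue  # only nouns are of interest here
        if ent_type == "DATE":
            date_nouns.append(text)
        else:
            other_nouns.append(text)
    return date_nouns, other_nouns


# Rows taken from the printed output above
rows = [
    ("Yesterday", "NOUN", "DATE"),
    (",", "PUNCT", ""),
    ("drank", "VERB", ""),
    ("beer", "NOUN", ""),
]

date_nouns, other_nouns = split_nouns(rows)
print(date_nouns, other_nouns)
```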
