I am looking at lots of sentences and want to extract the start and end indices of a word in a given sentence.
For example, the input is as follows:
"This is a sentence written in English by a native English speaker."
What I want is the span of the word 'English', which in this case is (30, 37) and (50, 57).
Note: I was pointed to this answer (Get position of word in sentence with spacy), but it doesn't solve my problem: it helps me get the start character of the token, but not the end index.
Any help appreciated.
You can do this with re in pure Python:
s="This is a sentence written in english by a native English speaker."
import re
[(i.start(), i.end()) for i in re.finditer('ENGLISH', s.upper())]
#output
[(30, 37), (50, 57)]
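If the substring match should not also hit words like "Englishman", a word-boundary pattern with re.IGNORECASE is a bit more precise (a variation on the snippet above, not part of the original answer):

```python
import re

s = "This is a sentence written in english by a native English speaker."
# \b anchors restrict the match to whole words; IGNORECASE covers both spellings
spans = [(m.start(), m.end()) for m in re.finditer(r'\benglish\b', s, re.IGNORECASE)]
print(spans)  # [(30, 37), (50, 57)]
```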
You can do it in spaCy as well:
import spacy
nlp=spacy.load("en_core_web_sm")
doc=nlp("This is a sentence written in english by a native English speaker.")
for ent in doc.ents:
    if ent.text.upper() == 'ENGLISH':
        print(ent.start_char, ent.end_char)
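Note that the NER route depends on the model tagging "english"/"English" as an entity, which is not guaranteed for the lowercase form. A tokenizer-only sketch that avoids NER entirely, using token.idx plus the token length (assuming spaCy is installed; no model download needed):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline
doc = nlp("This is a sentence written in english by a native English speaker.")
# token.idx is the start character; adding the token length gives the end
spans = [(t.idx, t.idx + len(t.text)) for t in doc if t.text.lower() == "english"]
print(spans)  # [(30, 37), (50, 57)]
```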
Using the idea from the answer you linked, you could do something like this:
from spacy.lang.en import English
nlp = English()
s = nlp("This is a sentence written in english by a native English speaker")
boundaries = []
for idx, i in enumerate(s[:-1]):
    if i.text.lower() == "english":
        boundaries.append((i.idx, s[idx+1].idx - 1))
You can simply do it like this using spaCy, which does not need any check for the last token (unlike @giovanni's solution):
def get_char_span(input_txt):
    doc = nlp(input_txt)
    for i, token in enumerate(doc):
        start_i = token.idx
        end_i = start_i + len(token.text)
        # token index and the token
        print(i, token)
        # character span
        print((start_i, end_i))
        # verifying it against the original input_txt
        print(input_txt[start_i:end_i])
inp = "My name is X, what's your name?"
get_char_span(inp)
The input word is standalone and not part of a sentence but I would like to get all of its possible lemmas as if the input word were in different sentences with all possible POS tags. I would also like to get the lookup version of the word's lemma.
Why am I doing this?
I have extracted lemmas from all the documents and I have also calculated the number of dependency links between lemmas. Both of which I have done using en_core_web_sm. Now, given an input word, I would like to return the lemmas that are linked most frequently to all the possible lemmas of the input word.
So in short, I would like to replicate the behaviour of token.lemma_ for the input word with all possible POS tags, to maintain consistency with the lemma links I have counted.
I found it difficult to get lemmas and inflections directly out of spaCy without first constructing an example sentence to give it context. This wasn't ideal, so I looked further and found LemmInflect, which does this very well.
> from lemminflect import getAllLemmas, getInflection, getAllInflections, getAllInflectionsOOV
> getAllLemmas('watches')
{'NOUN': ('watch',), 'VERB': ('watch',)}
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}
spaCy is just not designed to do this - it's made for analyzing text, not producing text.
The linked library looks good, but if you want to stick with spaCy or need languages besides English, you can look at spacy-lookups-data, which is the raw data used for lemmas. Generally there will be a dictionary for each part of speech that lets you look up the lemma for a word.
To get alternative lemmas, I am trying a combination of spaCy's rule_lemmatize and spaCy lookup data. rule_lemmatize may produce more than one valid lemma, whereas the lookup data offers only one lemma for a given word (in the files I have inspected). There are, however, cases where the lookup data produces a lemma while rule_lemmatize does not.
My examples are in Spanish:
import spacy
import spacy_lookups_data
import json
import pathlib
# text = "fui"
text = "seguid"
# text = "contenta"
print("Input text: \t\t" + text)
# Find lemmas using rules:
nlp = spacy.load("es_core_news_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
doc = nlp(text)
rule_lemmas = lemmatizer.rule_lemmatize(doc[0])
print("Lemmas using rules: " + ", ".join(rule_lemmas))
# Find lemma using lookup:
lookups_path = str(pathlib.Path(spacy_lookups_data.__file__).parent.resolve()) + "/data/es_lemma_lookup.json"
with open(lookups_path, "r") as file_object:
    lookup = json.load(file_object)
print("Lemma from lookup: \t" + lookup[text])
Output:
Input text: fui # I went; I was (two verbs with same form in this tense)
Lemmas using rules: ir, ser # to go, to be (both possible lemmas returned)
Lemma from lookup: ser # to be
Input text: seguid # Follow! (imperative)
Lemmas using rules: seguid # Follow! (lemma not returned)
Lemma from lookup: seguir # to follow
Input text: contenta # (it) satisfies (verb); contented (adjective)
Lemmas using rules: contentar # to satisfy (verb but not adjective lemma returned)
Lemma from lookup: contento # contented (adjective, lemma form)
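The two sources can be merged by taking the union of whatever each returns. A minimal sketch of that merge logic, with plain dicts standing in for lemmatizer.rule_lemmatize and the parsed es_lemma_lookup.json table above (the toy data mirrors the three Spanish examples):

```python
def candidate_lemmas(word, get_rule_lemmas, lookup_table):
    # Union of the rule-based lemmas and the (single) lookup lemma, if any.
    # The rule lemmatizer may echo the word back unchanged when it has no rule,
    # so a result equal to the input is treated as "no answer".
    lemmas = set(get_rule_lemmas(word)) - {word}
    if word in lookup_table:
        lemmas.add(lookup_table[word])
    return sorted(lemmas) or [word]

# Toy stand-ins for the rule lemmatizer and the lookup table
rules = {"fui": ["ir", "ser"], "seguid": ["seguid"], "contenta": ["contentar"]}
lookup = {"fui": "ser", "seguid": "seguir", "contenta": "contento"}
get_rules = lambda w: rules.get(w, [w])

print(candidate_lemmas("fui", get_rules, lookup))      # ['ir', 'ser']
print(candidate_lemmas("seguid", get_rules, lookup))   # ['seguir']
print(candidate_lemmas("contenta", get_rules, lookup)) # ['contentar', 'contento']
```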
I am executing a data extraction use case. To preprocess and tokenize my data, I am using both the spaCy English and German tokenizers, because the sentences are in both languages. Here's my code:
import spacy
from spacy.lang.de import German
from spacy.lang.en import English
from spacy.lang.de import STOP_WORDS as stp_wrds_de
from spacy.lang.en.stop_words import STOP_WORDS as stp_wrds_en
import string
punctuations = string.punctuation
# German Parser
parser_de = German()
# English Parser
parser_en = English()
def spacy_tokenizer_de(document):
    # Token object for splitting text into 'units'
    tokens = parser_de(document)
    # Lemmatization: grammatical conversion of words
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    # Remove punctuation
    tokens = [word for word in tokens if word not in punctuations]
    tokens_de_str = converttostr(tokens, ' ')
    tokens_en = spacy_tokenizer_en(tokens_de_str)
    print("Tokens EN: {}".format(tokens_en))
    tokens_en = converttostr(tokens_en, ' ')
    return tokens_en

def converttostr(input_seq, separator):
    # Join all the strings in the list
    final_str = separator.join(input_seq)
    return final_str

def spacy_tokenizer_en(document):
    tokens = parser_en(document)
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    return tokens
Here's a further elucidation of the above code:
1. spacy_tokenizer_de(): Method to parse and tokenize document in German
2. spacy_tokenizer_en(): Method to parse and tokenize document in English
3. converttostr(): Converts list of tokens to a string, so that the English spacy tokenizer can read the input (only accepts document/string format) and tokenize the data.
However, some sentences, when parsed, lead to the following error:
Why is a spaCy Token object coming up in these cases, while other sentences are processed successfully? Can anyone please help here?
token.lemma_.strip() if token.lemma_ != '-PRON-' else token.text for token in tokens
You're supposed to get a list of strings here, right? Instead, you sometimes return a string (when the lemma is not '-PRON-'), but other times a bare Token object, which is not a string.
You can get a string from token.text.
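A minimal corrected sketch of the German tokenizer (note '-PRON-' is a spaCy v2 artifact; the extra fallback on an empty lemma keeps this sketch working on a blank v3 pipeline, which has no lemmatizer):

```python
import string
from spacy.lang.de import German

parser_de = German()
punctuations = set(string.punctuation)

def spacy_tokenizer_de_fixed(document):
    tokens = parser_de(document)
    # Fall back to token.text (a str) instead of the bare Token object,
    # so the returned list always contains strings
    words = [t.lemma_.strip() if t.lemma_ not in ('', '-PRON-') else t.text
             for t in tokens]
    return [w for w in words if w not in punctuations]

print(spacy_tokenizer_de_fixed("Das ist ein Test."))
```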
I'm trying to split this document into paragraphs. Specifically, I would like to split the text whenever there is a line break (<br>)
This is the code I'm using but is not producing the results I hoped
nlp = spacy.load("en_core_web_lg")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print([sent.text for sent in doc.sents])
A similar solution could be achieved using NLTK's TextTilingTokenizer, but I wanted to check whether there is anything similar within spaCy.
You're almost there, but the problem is that the default Tokenizer splits on '<' and '>', hence the condition token.text == "<br>" is never true. I'd add a space before and after <br>. E.g.
import spacy
from spacy.symbols import ORTH
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc
nlp = spacy.load("en_core_web_sm")
text = "the quick brown fox<br>jumps over the lazy dog"
text = text.replace('<br>', ' <br> ')
special_case = [{ORTH: "<br>"}]
nlp.tokenizer.add_special_case("<br>", special_case)
nlp.add_pipe(set_custom_boundaries, first=True)
doc = nlp(text)
print([sent.text for sent in doc.sents])
Also take a look at this PR; after it's merged to master, it will no longer be necessary to wrap <br> in spaces.
https://github.com/explosion/spaCy/pull/4259
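Note that passing a bare function to nlp.add_pipe is spaCy v2 API; in v3 the component must be registered first. A sketch of the same idea for v3 (assuming spaCy v3 is installed; a blank pipeline is enough here, since the sentence boundaries are set manually):

```python
import spacy
from spacy.language import Language
from spacy.symbols import ORTH

@Language.component("br_boundaries")
def br_boundaries(doc):
    # start a new sentence after every <br> token
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.blank("en")
# keep <br> together as a single token
nlp.tokenizer.add_special_case("<br>", [{ORTH: "<br>"}])
nlp.add_pipe("br_boundaries")

doc = nlp("the quick brown fox <br> jumps over the lazy dog")
print([sent.text for sent in doc.sents])  # two sentences, split after <br>
```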
I want to do data augmentation for a sentiment analysis task by replacing words with their synonyms from WordNet. But the replacement is random: I want to loop over the synonyms and replace the word with each synonym, one at a time, to increase the data size.
sentences = []
for index, row in pos_df.iterrows():
    text = normalize(row['text'])
    words = tokenize(text)
    output = ""
    # Identify the parts of speech
    tagged = nltk.pos_tag(words)
    for i in range(0, len(words)):
        replacements = []
        # Only replace nouns with nouns, verbs with verbs, etc.
        for syn in wordnet.synsets(words[i]):
            # Do not attempt to replace proper nouns or determiners
            if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT':
                break
            # The tagger returns tags like NNP, VBP etc.,
            # but the WordNet synset names contain tags like .n.
            # So we take the first character of the tag (e.g. n from NNP)
            # and check whether the synset name contains ".n." etc.
            word_type = tagged[i][1][0].lower()
            if syn.name().find("." + word_type + ".") != -1:  # find() returns -1 when absent
                # extract the word only
                candidate = syn.name()[0:syn.name().find(".")]
                replacements.append(candidate)
        if len(replacements) > 0:
            # Choose a random replacement
            replacement = replacements[randint(0, len(replacements) - 1)]
            print(replacement)
            output = output + " " + replacement
        else:
            # If no replacement could be found, just keep the original word
            output = output + " " + words[i]
    sentences.append([output, 'positive'])
I'm working on a similar kind of project, generating new sentences from given input text without changing its context. While looking into this, I found a data augmentation technique that seems to work well: EDA (Easy Data Augmentation) [https://github.com/jasonwei20/eda_nlp].
Hope this helps you.
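To get every synonym instead of a random one, you can generate one new sentence per (position, synonym) pair. A minimal sketch of that loop, with a toy dictionary standing in for the WordNet lookup (in practice, swap in the replacements list built in the question's code):

```python
def augment(words, get_synonyms):
    # One variant per (position, synonym) pair; all other words stay fixed,
    # so each input sentence yields several augmented sentences.
    variants = []
    for i, word in enumerate(words):
        for syn in get_synonyms(word):
            if syn != word:
                variants.append(words[:i] + [syn] + words[i + 1:])
    return variants

# Toy stand-in for the WordNet synonym lookup
toy = {"good": ["nice", "fine"], "movie": ["film"]}
print(augment(["good", "movie"], lambda w: toy.get(w, [])))
# [['nice', 'movie'], ['fine', 'movie'], ['good', 'film']]
```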
I have imported all the books from the NLTK Book library, and I am just trying to figure out how to define a corpus then sentence to be printed.
For example, if I wanted to print sentence 1 of text 3, then sentence 2 of text 4:
import nltk
from nltk.book import *
print(???)
print(???)
I've tried the below combinations, which do not work:
print(text3.sent1)
print(text4.sent2)
print(sent1.text3)
print(sent2.text4)
print(text3(sent1))
print(text4(sent2))
I am new to python, so it is likely a v. basic question, but I cannot seem to find the solution elsewhere.
Many thanks, in advance!
A simple example:
from nltk.tokenize import sent_tokenize

# A string containing several sentences
sentences = "This is first sentence. This is second sentence. Let's try to tokenize the sentences. how are you? I am doing good"

# define function
def sentence_tokenizer(sentences):
    sentence_tokenize_list = sent_tokenize(sentences)
    print("tokenized sentences are =", sentence_tokenize_list)
    return sentence_tokenize_list

# call function
tokenized_sentences = sentence_tokenizer(sentences)

# print first sentence
print(tokenized_sentences[0])
Hope this helps.
You need to split the texts into lists of sentences first.
If you already have text3 and text4:
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(" ".join(text3))  # Text objects are token lists, so join them into a string first
print(sents[0]) # the first sentence in the list is at position 0
sents = sent_tokenize(" ".join(text4))
print(sents[1]) # the second sentence in the list is at position 1
print(text3[0]) # prints the first word of text3
You seem to need both an NLTK tutorial and a Python tutorial. Luckily, the NLTK book is both.
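The indexing itself is plain Python list indexing. A dependency-free illustration (the regex here is a crude stand-in for sent_tokenize, which handles many more cases):

```python
import re

text = "This is first sentence. This is second sentence. How are you?"
# split after ., ! or ? followed by whitespace (rough stand-in for sent_tokenize)
sents = re.split(r'(?<=[.!?])\s+', text)
print(sents[0])  # This is first sentence.
print(sents[1])  # This is second sentence.
```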