how to evaluate English words as something or someone - python

I'm writing python code to recognize idioms/phrases in any given English sentence.
here is an example of a phrase
phrase = 'land someone in something'
here are two sentence examples
sentence1 = 'His criminal activity finally landed him in jail.'
sentence2 = 'You really landed yourself in a fine mess!'
Is there a way to use NLTK (or any other Python NLP package) to return 'someone' if the input is 'him' or 'yourself', and 'something' if the input is 'jail' or 'fine mess'?
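One possible starting point (my own sketch, not a complete solution, assuming the WordNet corpus is available via nltk.download('wordnet')): look the head noun up in WordNet and check whether person.n.01 appears among its hypernyms. Pronouns such as 'him' or 'yourself' are not nouns in WordNet, so they are handled with a small lookup table first; the pronoun set and helper name below are only illustrative.
from nltk.corpus import wordnet as wn

PERSON_PRONOUNS = {"him", "her", "me", "you", "us", "them",
                   "himself", "herself", "yourself", "myself", "themselves"}

def someone_or_something(phrase):
    head = phrase.lower().split()[-1]  # treat the last word as the head noun
    if head in PERSON_PRONOUNS:
        return "someone"
    person = wn.synset("person.n.01")
    for synset in wn.synsets(head, pos=wn.NOUN):
        # Walk up the hypernym hierarchy; a hyponym of person.n.01 counts as "someone"
        if person in synset.closure(lambda s: s.hypernyms()):
            return "someone"
    return "something"

print(someone_or_something("him"))        # someone
print(someone_or_something("yourself"))   # someone
print(someone_or_something("jail"))       # something
print(someone_or_something("fine mess"))  # something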

Related

Autodetect and translate two or more languages in a sentence using python

I have the following example sentence
text_to_translate1=" I want to go for swimming however the 天气 (weather) is not good"
As you can see, there are two languages in the sentence (i.e., English and Chinese).
I want to translate the whole sentence. The result I want is the following:
I want to go for swimming however the weather(weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
The code I am using:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to send only the words that aren't English and translate them in place. You can do this by looping over all the words, checking whether each one is English, translating it if it isn't, and then joining the words back together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
# For each word, check if it is English
if not word.isascii():
new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
else:
new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)

How to get NLTK Punkt sentence tokenizer to recognize abbreviations that occur in the middle or end of a sentence?

I am parsing sentences with the NLTK Punkt tokenizer. But some specific abbreviations are causing sentences to split in the wrong locations.
For example:
"Hello, good day. Said the dog, all canines understood the dog(Wolfs,
etc.) the message."
The parser splits it up like this:
'Hello, good day.'
'Said the dog, all canines understood the dog(Wolfs, etc.)'
'the message.'
But I need it to be like this:
'Hello, good day.'
'Said the dog, all canines understood the dog(Wolfs, etc.) the message.'
My code:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

def parser(text):
    punkt_param = PunktParameters()
    abbreviation = ['u.s.a', 'e.g', 'u.s']
    punkt_param.abbrev_types = set(abbreviation)
    # Training a new model with the text.
    tokenizer = PunktSentenceTokenizer(punkt_param)
    tokenizer.train(text)
    # It automatically learns the abbreviations.
    tokenizer._params.abbrev_types
    # Use the customized tokenizer.
    sentences = tokenizer.tokenize(text)
    return sentences
I cannot simply add "etc" to the list of abbreviations, since it sometimes occurs at the end of sentences.
The Punkt tokenizer can be trained to recognize "etc." in the middle of a sentence, or at the end of a sentence.
Training example for identifying "etc."
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
corpus = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc".
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
Sometimes an abbreviation can occur at the end of a sentence, such as etc.
And then it needs to split at the end.
"""
trainer.train(corpus, finalize=False, verbose=True)
abbreviations = "u.s.a., e.g., u.s."
trainer.train(abbreviations, finalize=False, verbose=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())
text = "Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message. Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
Output
Abbreviation: [1.2711] e.g
Rare Abbrev: etc.
Abbreviation: [1.1134] u.s
Abbreviation: [0.6144] u.s.a
Abbreviation: [2.2269] e.g
Hello, good day.
Said the dog, all canines understood the dog(Wolfs, etc.) the message.
Abbreviations can be tricky at the end, or final position etc.
The question becomes whether or not the tokenizer can spot the difference.
Example of insufficient training data to recognize "etc." at the end of a sentence
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
corpus = """
It can take a few examples to learn a new abbreviation, e.g., when parsing a list like 1, 2, 3, etc., and then recognizing "etc".
Warehouses, cellars, and vaults, etc., may all be used for long-term storage.
"""
trainer.train(corpus, finalize=False, verbose=True)
abbreviations = "u.s.a., e.g., u.s."
trainer.train(abbreviations, finalize=False, verbose=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())
text = "Hello, good day. Said the dog, all canines understood the dog(Wolfs, etc.) the message. Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
Output
Abbreviation: [1.2410] e.g
Rare Abbrev: etc.
Abbreviation: [1.0382] u.s
Abbreviation: [0.5729] u.s.a
Abbreviation: [2.0764] e.g
Hello, good day.
Said the dog, all canines understood the dog(Wolfs, etc.) the message.
Abbreviations can be tricky at the end, or final position etc. The question becomes whether or not the tokenizer can spot the difference.

Is there a way to extract text after a word is stated?

So I am trying to make a virtual assistant and am working on a spelling feature like the one on a Google Home.
An example of what I am trying to achieve is when I say "hey google spell cat" it will say C A T
How would I get cat into a variable?
I know how to split it
If I understand you correctly, you have a string and wish to store its last word. This can be achieved with split, as you said, followed by an assignment:
text = 'hey google spell cat'
last_word = text.split()[-1]
If you instead want the word after "spell", you can find the index of "spell" and add one:
text = 'hi google spell cat for me'
split = text.split()
split[split.index('spell')+1]
Try this:
given_text="hey google spell cat"
last_word=given_text.split()[-1]
reply=' '.join(list(last_word)).upper()
print(reply)
C A T
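Putting the two ideas together (a small sketch of my own, assuming the command always contains the word "spell"): grab the word after "spell" and spell it out.
def spell_after(text, trigger="spell"):
    # Return the word following the trigger word, spelled out letter by letter.
    words = text.split()
    if trigger not in words:
        return None
    target = words[words.index(trigger) + 1]
    return " ".join(target.upper())

print(spell_after("hey google spell cat"))  # C A T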

Using a regular expression to find all noun phrases in a paragraph following the occurrence of a specific phrase

I have a data frame of paragraphs, which I have split (or can split) into word tokens and sentence tokens, and I am looking to find all the noun phrases that follow any instance of the phrase "contribute to" or "donate to".
Or really any form of those phrases, so:
"Contributions are welcome to be made to the charity of your choice."
---> would return: "the charity of your choice"
and
"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
---> would return: "ABC Foundation"
I've created a regular expression work-around that captures the correct phrase about 90% of the time... see below:
import nltk
from nltk.text import TokenSearcher

text = nltk.Text(nltk.word_tokenize(x))  # `x` is the paragraph string
donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
donation = [' '.join(tokens) for tokens in donation]
return donation
I'd like to clean up that regular expression to get rid of the "{,15}" requirement, because it is missing some of the values I need. However, I'm not too comfortable with greedy expressions and can't get it to work correctly.
so this phrase:
While she lived a full life , had many achievements and made many
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName
is returning "visit brother FirstName LastName" because of the earlier mention of "contributions", even though the word "to" comes more than 15 words later.
Looks like you're struggling with how to restrict your search to a single sentence. So just use NLTK to break your text into sentences (which it can do far better than just looking for periods), and your problem disappears.
import re
import nltk

sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())
For further work, I also recommend using a better tool than Text, which is intended for simple interactive exploration. If you want to do more with your texts, NLTK's PlaintextCorpusReader is your friend.
(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)
Example
If it works and does what you need, let me know and I will explain my regexp.
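A quick check of that pattern against the two example sentences from the question (my own test, not part of the original answer):
import re

pattern = r"(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)"
examples = [
    "Contributions are welcome to be made to the charity of your choice.",
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation",
]
for s in examples:
    m = re.search(pattern, s, flags=re.IGNORECASE)
    if m:
        print(m.group(1))
# the charity of your choice
# ABC Foundation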

How to identify the subject of a sentence?

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned so far, a sentence can be broken into a head and its dependents. For example, in "I shot an elephant", "I" and "elephant" are dependents of "shot". But how do I determine that the subject of this sentence is "I"?
You can use spaCy.
Code
import spacy

nlp = spacy.load('en_core_web_sm')  # older spaCy versions accepted the shortcut 'en'
sent = "I shot an elephant"
doc = nlp(sent)
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)  # [I]
As the NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."
Look at a tree example: indeed, "I" is the noun phrase that is the child of S and the sibling of VP, while "elephant" is not.
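To make that definition concrete, here is a toy example (my own illustration with a hand-written grammar, not from the original answer): parse the sentence and pick out the NP that is a direct child of S.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> PRP | DT NN
VP -> VBD NP
PRP -> 'I'
DT -> 'an'
NN -> 'elephant'
VBD -> 'shot'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("I shot an elephant".split()):
    # The subject is the NP that is a direct child of S (its sibling is the VP)
    subject = next(child for child in tree if child.label() == 'NP')
    print(subject.leaves())  # ['I']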
English has two voices: active and passive. Let's take the more common one, the active voice.
It follows the subject-verb-object pattern. To find the subject, write a rule set over POS tags. Tag the sentence: I[NOUN] shot[VERB] an elephant[NOUN]. The first noun is the subject, then comes the verb, and then the object.
If you want to make it more complicated, take the sentence "I shot an elephant with a gun". Here prepositions or subordinate conjunctions like with, at, in can be given roles. The sentence will be tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN], and you can say that the word "with" marks the instrumental role. You can build a rule-based system to assign a role to every word in the sentence.
Also look at the patterns in passive voice and write rules for the same.
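A rough sketch of that rule-based idea (my own illustration, assuming a simple active-voice subject-verb-object sentence and the standard NLTK tokenizer/tagger data): POS-tag the sentence and take the first noun or pronoun that appears before the first verb as the subject.
import nltk

def naive_subject(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    for word, tag in tagged:
        if tag.startswith('VB'):      # reached the verb without finding a subject
            break
        if tag.startswith('NN') or tag.startswith('PRP'):
            return word
    return None

print(naive_subject("I shot an elephant with a gun"))  # I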
rake_nltk (pip install rake_nltk) is a Python library that wraps NLTK and apparently uses the RAKE algorithm.
from rake_nltk import Rake

rake = Rake()
rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")
ranked_phrases = rake.get_ranked_phrases()
print(ranked_phrases)
# outputs the keywords ordered by rank:
# ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']
By default the stopword list from NLTK is used. You can provide your own stopword list and punctuation characters by passing them to the constructor:
rake = Rake(stopwords=['my', 'custom', 'stopwords'], punctuations=',;:!#$%^*/\\')
By default string.punctuation is used for punctuation.
The constructor also accepts a language keyword which can be any language supported by nltk.
Stanford CoreNLP can also be used to extract subject-relation-object information from a sentence. (Screenshot of the CoreNLP output not reproduced here.)
Code using spaCy (here doc is the sentence, and dep is 'nsubj' for the subject or 'dobj' for the object):
import spacy

nlp = spacy.load('en_core_web_lg')

def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(doc[start:end])
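A quick usage check (my own example, assuming the en_core_web_lg model is installed):
print(get_subject_object_phrase("I shot an elephant", "nsubj"))  # I
print(get_subject_object_phrase("I shot an elephant", "dobj"))   # an elephant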
You can paper over the issue by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in the future.
Credits: https://github.com/explosion/spaCy/issues/380
