Suppose I have a paragraph:
text = '''Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.[4][5] By the 1870s the scientific community and much of the general public had accepted evolution as a fact. However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.[6][7] In modified form, Darwin's scientific discovery is the unifying theory of the life sciences, explaining the diversity of life.[8][9]'''
If, say, I enter a word ("favoured"), how can I remove the entire sentence that the word is in?
The method I used earlier was tedious: I would use sent_tokenize to break the paragraph (which is over 13,000 words) into sentences, and since I had to check for more than 1,000 words, I would run a loop to check for each word in each sentence. This takes a lot of time, as there are over 400 sentences.
Instead, I want to search the paragraph for those 1,000 words directly, and when a word is found, select everything before it back to the previous full stop and everything after it up to the next full stop.
This removes every sentence (anything bounded by a full stop) that contains the word somewhere.
def remove_sentence(input, word):
    return ".".join(sentence for sentence in input.split(".")
                    if word not in sentence)
>>> remove_sentence(text, "published")
"[4][5] By the 1870s the scientific community and much of the general public had accepted evolution as a fact. However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.[6][7] In modified form, Darwin's scientific discovery is the unifying theory of the life sciences, explaining the diversity of life.[8][9]"
>>>
>>> remove_sentence(text, "favoured")
"Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.[4][5] By the 1870s the scientific community and much of the general public had accepted evolution as a fact.[6][7] In modified form, Darwin's scientific discovery is the unifying theory of the life sciences, explaining the diversity of life.[8][9]"
I'm not sure I understand your question, but you can do something like:
text = 'whatever....'
sentences = text.split('.')
good_sentences = [e for e in sentences if 'my_word' not in e]
Is that what you are looking for?
You might be interested in trying something similar to the following program:
import re

SENTENCES = ('This is a sentence.',
             'Hello, world!',
             'Where do you want to go today?',
             'The apple does not fall far from the tree.',
             'Sally sells sea shells by the sea shore.',
             'The Jungle Book has several stories in it.',
             'Have you ever been up to the moon?',
             'Thank you for helping with my problem!')

BAD_WORDS = frozenset(map(str.lower, ('to', 'sea')))

def main():
    for index, sentence in enumerate(SENTENCES):
        if frozenset(words(sentence.lower())) & BAD_WORDS:
            print('Delete:', repr(sentence))

words = lambda sentence: (m.group() for m in re.finditer(r'\w+', sentence))

if __name__ == '__main__':
    main()
Reason
You start out with the sentences that you want to filter and the words you want to find.
You compare each sentence's set of words with the set of words you are looking for.
If there was an intersection, the sentence you are looking at is one you will remove.
Output
Delete: 'Where do you want to go today?'
Delete: 'Sally sells sea shells by the sea shore.'
Delete: 'Have you ever been up to the moon?'
This is what my prof has given us for clues:
text = '''I make my own cheese. Cheese is a dairy product, derived from milk and produced in wide ranges of flavors, textures and forms by coagulation of the milk protein casein. I personally really love cheese. Casein: a family of related phosphoproteins. These proteins are commonly found in mammalian milk'''
import re

for r in re.finditer(r'\w+', text):  # Here, I would split my text into sentences
    word = r.group(0)
    if re.search(r'lly\b', word):  # Here, I would identify a type of sentence
        print(word)
    if re.search(r'tion\b', word):  # Here, I would identify another type of sentence
        print(word)
Basically, what I gathered from my own text are two types of definitions: one that is integrated into the sentence, usually introduced by a descriptive verb ("Cheese is..."), and one that is the defined word followed by a colon and its definitory (invented word?) sentence ("Casein: [...]"). I've been racking my brain all week trying to find a way to extract and print these sentences without any luck. As a Linguistics major who's just trying to get by, I would greatly appreciate any help. Thanks.
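One possible sketch (not a definitive solution; the regexes below are only illustrative guesses at the two patterns described) would be to split the text into sentences first and then test each sentence against both patterns:

import re

sentences = re.split(r'(?<=[.!?])\s+', text)

for sentence in sentences:
    # Pattern 1: definition integrated into the sentence, e.g. "Cheese is a dairy product..."
    if re.match(r'\w+\s+(is|are)\b', sentence):
        print('Integrated definition:', sentence)
    # Pattern 2: defined word followed by a colon, e.g. "Casein: a family..."
    elif re.match(r'\w+:', sentence):
        print('Colon definition:', sentence)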
I am using nltk PunktSentenceTokenizer for splitting paragraphs into sentences. I have paragraphs as follows:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
Output:
['1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work']
I tried to add sentence starters using the code below, but that didn't work out either.
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')
I would really appreciate it if anybody could point me in the right direction.
Thanks in advance :)
The use of regular expressions can provide a solution to this type of problem, as illustrated by the code below:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
import re
reSentenceEnd = re.compile("\.|$")
reAtLeastTwoLetters = re.compile("[a-zA-Z]{2}")
previousMatch = 0
sentenceStart = 0
end = len(paragraphs)
while(True):
candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)
# A sentence must contain at least two consecutive letters:
if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]) :
print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
sentenceStart = candidateSentenceEnd.end()
if candidateSentenceEnd.end() == end:
break
previousMatch=candidateSentenceEnd.start() + 1
The output is:
Candidate is very poor in mathematics.
Interpersonal skills are good.
Very enthusiastic about social work
Many tokenizers (including NLTK and spaCy) can handle regular expressions. Adapting this code to their frameworks might not be trivial, though.
I'm working on a project that takes headlines from newspapers' websites that I have stored in two text files (nyt.text and wapo.text) and compares them against each other; if the strings are determined to be similar by Python's built-in SequenceMatcher, it prints them to me along with their similarity rating:
from difflib import SequenceMatcher

f = open('nyt.text', 'r+')
w = open('wapo.text', 'r+')

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def compare():
    wapo = []
    times = []
    for line in w.readlines():
        wapo.append(line)
    for i in f.readlines():
        times.append(i)
    print(wapo[0], times[0])
    # Compare every Washington Post headline against every NYT headline.
    for i in wapo:
        for s in times:
            print(similar(i, s))
            if similar(i, s) > 0.35:
                print(i, s)
    return

compare()
The result that I'm getting looks something like this:
Attorney for San Bernardino gunman's family floats hoax theory
Op-Ed Contributor: A Battle in San Bernardino
San Bernardino attacker pledged allegiance to Islamic State leader, officials say
Sunday Routine: How Jamie Hodari, Workplace Entrepreneur, Spends His Sundays
Why some police departments let anyone listen to their scanner conversations - even criminals
White House Seeks Path to Executive Action on Gun Sales
Why the Pentagon opening all combat roles to women could subject them to a military draft
Scientists Seek Moratorium on Edits to Human Genome That Could Be Inherited
Destroying the Death Star was a huge mistake
Mark Zuckerberg Defends Structure of His Philanthropic Outfit
As you can see, they're not terribly similar apart from the first pair, despite being rated above 0.35 similarity by the SequenceMatcher. I have an inkling that this is because SequenceMatcher judges similarity by letter, not by word. Would anyone have ideas on how to tokenize the words in the titles so that SequenceMatcher reads them as whole words instead of as individual letters?
Your intuition here is likely spot on. You're seeing matching based on uninterrupted runs of matching letters, which is generally a pretty poor metric for headline similarity.
It's doing this because the sequence you're passing in is a string, or as the computer sees it, a really long list of letters.
If you want to judge on words instead I would suggest splitting the text using the .split() function, which will just split on whitespace.
There's a lot of cleaning you can and probably should do, such as removing punctuation, setting everything to lowercase ('.lower()'), as well as potentially stemming the words to get reasonable matches. That said, all of those pieces are well documented elsewhere and might not make sense for your particular use case.
You can also look at other tokenizers in sklearn, but they're unlikely to make a huge difference here.
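For instance, SequenceMatcher accepts any sequence of hashable elements, so one way (a sketch, not the only option) is to pass it lists of lower-cased words rather than the raw strings:

from difflib import SequenceMatcher

def similar_words(a, b):
    # Compare lists of words so matching happens token by token,
    # not character by character.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

print(similar_words(
    "Attorney for San Bernardino gunman's family floats hoax theory",
    "Op-Ed Contributor: A Battle in San Bernardino"))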
"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question, because there is no gold-standard set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to very precisely and completely unambiguously describe exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence with their POS tags, you can answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence?', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe finding the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make the algorithm a little more complicated and see if that does the job better.
Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (using, say, the Stanford parser), and identifying patterns within those trees that indicate where the important parts of the sentence tend to sit, etc.
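A minimal sketch of that idf heuristic, assuming you already have some tokenised corpus to build the idf scores from (the toy corpus below is purely illustrative):

import math
from collections import Counter

# Toy corpus standing in for your real data.
corpus = [["first", "thing", "we", "do", "lets", "go"],
          ["the", "first", "thing", "lets", "do"],
          ["lawyers", "kill", "time", "in", "court"]]

doc_freq = Counter(word for sentence in corpus for word in set(sentence))
n_docs = len(corpus)

def idf(word):
    # Rarer words get higher scores; the +1 avoids division by zero.
    return math.log(n_docs / (1 + doc_freq.get(word, 0)))

pos_tags = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

# Rarest noun and rarest verb in the sentence, per the heuristic above.
nouns = [w for w, t in pos_tags if t.startswith("NN")]
verbs = [w for w, t in pos_tags if t.startswith("VB")]
print(max(nouns, key=lambda w: idf(w.lower())),
      max(verbs, key=lambda w: idf(w.lower())))   # -> lawyers kill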
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
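Applied to the POS list from the question, the filter leaves just the two content words (a small usage example, assuming the stop_words dict above):

pos_list = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
            ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

print(filter_stop_words(pos_list))
# [['kill', 'VB'], ['lawyers', 'NNS']]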
In your case, you can simply use the RAKE package for Python (thanks to Fabian) to get what you need:
>>> import RAKE
>>> path = #your path
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
The path can be, for example, this file.
But in general, you are better off using the NLTK package for NLP tasks.
Ideally using regex, in Python. I'm making a simple chatbot, and it's currently having problems responding to phrases like "I love you" correctly (it will throw back "You love I" out of the grammar handler, when it should be giving back "You love me").
In addition, if you could think of good phrases to throw into this grammar handler, that'd be great. I'd love some testing data.
If there's a good list of transitive verbs out there (something like a "top 100 used") it may be acceptable to use that and special case the "transitive verb + you" pattern.
Well, what you're trying to implement is definitely challenging, and very difficult to get completely right.
Logic
As a starting point, I would look a bit into the grammar rules.
Basic sentence structure:
SUBJECT + TRANSITIVE VERB + OBJECT
SUBJECT + INTRANSITIVE VERB
(Of course, we could also talk about "Subject + Verb + Indirect Object + Direct Object" formats, etc. (e.g. "I give you the ball"), but this would get too complicated for now...)
Obviously, this scheme is VERY simplistic, but let's stick to that for now.
Then assume (another over-simplistic assumption) that each part is a single word.
So basically you have the following sentence scheme:
WORD WORD WORD
which could generally be matched using a regex like:
(\w+)\s+(\w+)(?:\s+(\w+))?
Explanation:
(\w+)          # first word (= subject)
\s+            # one or more spaces
(\w+)          # second word (= verb)
(?:\s+(\w+))?  # optional: one or more spaces plus a third word (= object, if the verb is transitive)
Now, obviously, to formulate sentences like "You love me" and not "You love I", your algorithm should also "understand" that:
The third part of the sentence has the role of the Object
Since "I" is a personal pronoun (used only in nominative case : "as a subject"), we should you its "accusative form" (=as an object); so, for this purpose, you may also need e.g. personal pronoun tables like :
I - my - me
You - your - you
He - his - him
etc...
Just a few ideas... (purely out of my enthusiasm for linguistics :-))
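To make the idea concrete, here is a deliberately tiny sketch along these lines; the pronoun mappings are illustrative and far from complete, and this is not meant as a full grammar handler:

import re

# Nominative and accusative forms swap roles when the sentence is mirrored
# back at the speaker ("I love you" -> "You love me").
SUBJECT_SWAP = {"i": "you", "you": "I", "we": "you"}
OBJECT_SWAP = {"me": "you", "you": "me", "us": "you"}

def mirror(sentence):
    # SUBJECT + VERB (+ optional OBJECT), each assumed to be a single word.
    m = re.match(r"(\w+)\s+(\w+)(?:\s+(\w+))?$", sentence.strip())
    if not m:
        return sentence
    subject, verb, obj = m.groups()
    new_subject = SUBJECT_SWAP.get(subject.lower(), subject)
    new_object = OBJECT_SWAP.get(obj.lower(), obj) if obj else ""
    return (new_subject.capitalize() + " " + verb + " " + new_object).strip()

print(mirror("I love you"))   # -> You love me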
Data
As for the wordlists you are interested in, just a few samples:
330 Most Common English Verbs (most, if not all of them, are transitive)
Personal Pronouns Chart
What you want is a syntactic analyser (a.k.a. parser). This can be done with a rule-based system, as described by Dr.Kameleon, or statistically. There are many implementations out there, one being the Stanford one. These will generally tell you what the syntactic role of a word is (e.g. the subject in "You are here", or the object in "She likes you"). How you use that information to turn statements into questions is a whole different can of worms. For English, you can get a fairly simple rule-based system to work OK.
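For illustration only, a dependency parser such as spaCy (used here instead of the Stanford parser mentioned above, purely as an example) will label the subject and object for you, assuming the small English model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model has been downloaded
doc = nlp("I love you")
for token in doc:
    # Prints each word with its dependency label, e.g. nsubj (subject),
    # ROOT (main verb), dobj (direct object).
    print(token.text, token.dep_)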