Search key phrases in text - python

I'm looking for a fast solution that allows me to find predefined phrases (1-5 words) in a (not very big) text.
There can be up to 1000 phrases. I suppose the simple find() function is not a good solution.
Could you advise what I should use?
Thanks in advance.
Update
Why I don't want to use brute-force search:
I believe it is not fast enough.
The text can contain extra words inside a phrase, i.e. the phrase can be "Bank America", but the text has "bank of America".
Phrases can be slightly changed: apostrophes, an -s ending, etc.

I'm not sure about your goal, but you can easily find predefined phrases in text like this:
predefined_phrases = ["hello", "unicorns with a big mouth!", "Sweet donats"]
isnt_big_text = "A big mouse fly by unicorns with a big mouth! with hello wold."
for phrase in predefined_phrases:
    if phrase in isnt_big_text:
        print("Phrase '%s' found in text" % phrase)


How do I extract text in complete sentences/paragraphs with Python?

I am currently trying to convert a PDF into text for the purposes of ML, but whenever I do so, it returns the text in broken lines, which is making the text less readable.
Here is what I am currently doing to convert the text:
import fitz, spacy
with fitz.open("asset/example2.pdf") as doc:
    text_ = ""
    for page in doc:
        text_ += page.getText()
and here are the results:
Animals - Animals have
always been near my
heart and it has led me to
involve myself in animal
rights events and
protests. It still stays as
a dream of mine to go
volunteer at an animal
sanctuary one day.
Food Travel - Through a
diet change, I have
found my love for food
and exploring different
food cultures across the
world. I feel confident
saying that I could write
an extensive
encyclopaedia for great
vegan restaurants.
What would be the best way to approach this?
I don't quite understand what result you are looking for, but if you would like all the text to be on one line you can use text.replace('\n', '') (or replace the newline with a space if words would otherwise be glued together). You may also find text.split(separator) and separator.join(list) useful for formatting your string, for example:
string = 'This is my \nfirst sentence. This \nsecond sentence\n.'
print(string)
string = string.replace('\n', '')
sentence_list = string.split('.')
string = '.\n'.join(sentence_list)
print(string)
I hope this answers your question.

How can I count specific bigram words?

I want to find and count the specific bigram words such as "red apple" in the text file.
I have already turned the text file into a word list, so I couldn't use a regex to count the whole phrase, i.e. the bigram (or can I?).
How can I count a specific bigram in the text file without using NLTK or another module? Can a regex be a solution?
Why have you turned the text file into a list? It's not memory efficient either.
Instead, you can read the text directly with the file.read() method.
import re

text = 'I like red apples and green apples but I like red apples more.'
bigrams = ['red apples', 'green apples']
for bigram in bigrams:
    print('Found', bigram, len(re.findall(bigram, text)))
Output:
Found red apples 2
Found green apples 1
Are you looking only for specific bigrams, or might you need to extend the search to detect any bigrams that are common in your text? In the latter case, have a look at the NLTK collocations module. You say you want to do this without using NLTK or other modules, but in practice that's a very bad idea: you'll miss what you are looking for when the text contains, e.g., 'red apple' rather than 'red apples'. NLTK, on the other hand, provides useful tools for lemmatization, calculating all kinds of statistics, and so on.
And think of this: why and how have you turned the lines into a list of words? Not only is this inefficient, but depending on exactly how you did it you may have lost word-order information, mishandled punctuation, messed up upper/lower case, or made any number of other mistakes. Which, again, is why NLTK is what you need.
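As a rough illustration of the point above, here is a minimal sketch using NLTK (assuming the punkt and wordnet data are downloaded): lemmatizing the tokens makes 'apples' count as 'apple', and the collocations module can rank the bigrams that occur in the text. The sample sentence is the same toy example as above.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = 'I like red apples and green apples but I like red apples more.'
tokens = nltk.word_tokenize(text.lower())

# Lemmatize so that 'apples' and 'apple' are counted together.
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(tok) for tok in tokens]

# Count one specific bigram over the lemmatized tokens.
target = ('red', 'apple')
print('red apple(s):', sum(1 for pair in nltk.bigrams(lemmas) if pair == target))

# Or let NLTK rank the most strongly associated bigrams in the text.
finder = BigramCollocationFinder.from_words(lemmas)
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.likelihood_ratio, 3))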

python nltk keyword extraction from sentence

"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question because there is no gold set of input/output mappings which everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to describe very precisely and completely unambiguously exactly what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence along with their POS tags, you answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe trying to find the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
Ways to expand this include trying to identify larger phrases using n-gram idfs, building a full parse tree of the sentence (using, say, the Stanford parser) and identifying patterns within these trees that help you figure out which parts of the tree the important things tend to sit in, etc.
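A minimal sketch of the idf heuristic above, under some loud assumptions: the corpus is a made-up toy list of tokenized sentences (in practice you would use a large collection), and the POS tags are the ones from the question. Each word is scored by its idf and the rarest noun and verb are returned.
import math
from collections import Counter

# Toy corpus of tokenized sentences; in practice use a large corpus.
corpus = [
    ['first', 'thing', 'we', 'do'],
    ['lets', 'do', 'the', 'first', 'thing'],
    ['we', 'do', 'things', 'first'],
    ['lets', 'kill', 'some', 'time'],
]

# idf: rarer words get higher scores; unseen words get the highest score.
doc_freq = Counter(word for sentence in corpus for word in set(sentence))
def idf(word):
    return math.log(len(corpus) / (1 + doc_freq[word]))

tagged = [('First', 'NNP'), ('thing', 'NN'), ('do', 'VBP'),
          ('lets', 'NNS'), ('kill', 'VB'), ('lawyers', 'NNS')]

def rarest(tagged, tag_prefix):
    candidates = [word.lower() for word, tag in tagged if tag.startswith(tag_prefix)]
    return max(candidates, key=idf) if candidates else None

print(rarest(tagged, 'NN'), rarest(tagged, 'VB'))   # rarest noun, rarest verb
On this toy corpus that prints 'lawyers kill'; with a real corpus the idf estimates would of course be far more reliable.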
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words[token_type]]
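Applied to the tag list from the question, the snippet above would keep only the informative tokens:
tags = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
        ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
print(filter_stop_words(tags))   # [['kill', 'VB'], ['lawyers', 'NNS']]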
In your case, you can simply use the Rake package for Python (thanks to Fabian) to get what you need:
>>> path = #your path
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
The path can be, for example, this file.
But in general, you'd be better off using the NLTK package for NLP tasks.

Detect whether the word "you" is a subject or object pronoun based on sentence context

Ideally using regex, in Python. I'm making a simple chatbot, and it's currently having problems responding to phrases like "I love you" correctly (it'll throw back "You love I" out of the grammar handler, when it should be giving back "You love me").
In addition, if you could think of good phrases to throw into this grammar handler, that'd be great; I'd love some testing data.
If there's a good list of transitive verbs out there (something like a "top 100 used") it may be acceptable to use that and special case the "transitive verb + you" pattern.
Well, what you're trying to implement is definitely very challenging.
Logic
As a starter, I would look a bit into the Grammar rules first.
Basic sentence structure :
SUBJECT + TRANSITIVE VERB + OBJECT
SUBJECT + INTRANSITIVE VERB
(Of course, we could also talk about "Subject+Verb+Indirect Object+Direct Object" formats, etc (e.g. I give you the ball) but this would get too complicated for now...)
Obviously, this scheme is VERY simplistic, but let's stick to that for now.
Then (another over-simplistic assumption), that each part is a single word.
so basically you have the following Sentence Scheme :
WORD WORD WORD
which could be generally matched using a regex like :
([\w]+)\s+([\w]+)\s+([\w]+)?
Explanation :
([\w]+) # first word (=subject)
\s+ # one or more spaces
([\w]+) # second word (=verb)
\s+ # one or more spaces
([\w]+)? # (optional) third word (=object - if the verb is transitive)
Now, obviously, to formulate sentences like "You love me" and not "You love I", your algorithm should also "understand" that:
The third part of the sentence has the role of the Object.
Since "I" is a personal pronoun used only in the nominative case (as a subject), we should use its accusative form (as an object); so, for this purpose, you may also need personal pronoun tables like:
I - my - me
You - your - you
He - his - him
etc...
Just a few ideas... (purely out of my enthusiasm for linguistics :-))
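A minimal, hedged sketch of the logic above: use the three-word regex together with a small nominative/accusative lookup table (only a few pronouns shown, chosen here for illustration) to swap the roles in a simple statement. Real sentences will break this very quickly; it is just the simplistic scheme made concrete.
import re

# Subject-form -> object-form pronouns, and the reverse.
to_object = {'i': 'me', 'we': 'us', 'he': 'him', 'she': 'her', 'they': 'them', 'you': 'you'}
to_subject = {'me': 'i', 'us': 'we', 'him': 'he', 'her': 'she', 'them': 'they', 'you': 'you'}

def flip(sentence):
    match = re.match(r'(\w+)\s+(\w+)\s+(\w+)', sentence)
    if not match:
        return sentence
    subject, verb, obj = match.groups()
    new_subject = to_subject.get(obj.lower(), obj)
    new_object = to_object.get(subject.lower(), subject)
    return '%s %s %s' % (new_subject.capitalize(), verb, new_object)

print(flip('I love you'))    # You love me
print(flip('You love me'))   # I love you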
Data
As for the wordlists you are interested in, just a few samples :
330 Most Common English Verbs (most, if not all of them, are transitive)
Personal Pronouns Chart
What you want is a syntactic analyser (aka parser) - this can be done by a rule-based system as described by #Dr.Kameleon, or statistically. There are many implementations out there, one being the Stanford one. These will generally tell you what the syntactic role of a word is (e.g. subject in "You are here", or object in "She likes you"). How you use that information to turn statements into questions is a whole different can of worms. For English, you can get a fairly simple rule-based system to work OK.
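As an illustration of what a parser gives you, here is a small sketch using spaCy's dependency parser (a swap-in for the Stanford parser mentioned above; it assumes the en_core_web_sm model is installed): the dependency label of "you" tells you whether it sits in subject position (nsubj) or object position (dobj).
import spacy

nlp = spacy.load('en_core_web_sm')

for sentence in ('I love you', 'You love me'):
    doc = nlp(sentence)
    for token in doc:
        if token.lower_ == 'you':
            # 'nsubj' = subject of the verb, 'dobj' = direct object of the verb.
            print(sentence, '->', token.text, token.dep_)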

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
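A minimal sketch of that letter-by-letter idea, with a plain Python set standing in for the hashtable of English words (the word list is obviously a made-up toy). It greedily takes the longest dictionary word starting at the current position, which already exposes the ambiguity problem described above.
english_words = {'like', 'we', 'said', 'will', 'do', 'what', 'can',
                 'try', 'to', 'reconnect', 'you', 'want'}   # toy dictionary

def segment(text):
    words, start = [], 0
    while start < len(text):
        # Greedily take the longest dictionary word starting at 'start'.
        for end in range(len(text), start, -1):
            if text[start:end] in english_words:
                words.append(text[start:end])
                start = end
                break
        else:
            start += 1   # no word found: skip the character (punctuation, apostrophes, etc.)
    return words

print(segment('likewesaidwelldowhatwecan'))
This prints ['like', 'we', 'said', 'we', 'do', 'what', 'we', 'can']; the 'll' left over from "we'll" is silently dropped, which is exactly the kind of ambiguity a real dictionary plus more sophisticated analysis has to resolve.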
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pair and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
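For the statistical approach sketched above, here is a minimal dynamic-programming illustration that scores candidate segmentations by word probabilities (a simplification of the pair/triple idea: it uses per-word log-probabilities rather than transition probabilities, and the frequencies are made-up toy numbers). It resolves the 'asshit' ambiguity from the other answer by picking the segmentation with the highest total score.
import math

# Toy relative frequencies; in practice estimate these from a large corpus.
word_logprob = {
    'as': math.log(0.30), 'ass': math.log(0.02), 'shit': math.log(0.01),
    'hit': math.log(0.05), 's': math.log(0.001),
}

def best_segmentation(text, max_word_len=20):
    # best[i] = (score, segmentation) for the prefix text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_logprob and best[start][1] is not None:
                score = best[start][0] + word_logprob[word]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[len(text)][1]

print(best_segmentation('asshit'))   # ['as', 'shit'] beats ['ass', 'hit'] with these toy numbers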
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave... It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution:
textstring = ('"likewesaid, we\'lldowhatwecan. Trytoreconnectyou, towhatyouwant," '
              'said the Sheep Man. "Butwecan\'tdoit-alone. Yougottaworktoo."')
# Placeholder dictionary: in practice, test against a real English word list / dictionary API.
english_dict = {'like', 'we', 'said', 'try', 'to', 'reconnect', 'you', 'want'}

indiv_characters = list(textstring)  # splits the string into individual characters
teststring = ''
sequential_indiv_word_list = []
for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # Test the accumulated teststring against the English dictionary.
    if teststring in english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''
# At the end, just assemble a sentence from the pieces of sequential_indiv_word_list
# by putting a space between each word.
There are some more issues to be worked out, such as the case where it never finds a match: it would just keep adding more characters forever. However, since your demo string has some spaces, you could have it recognize those too and automatically start over at each one.
Also, you need to account for punctuation; write conditionals like:
if cur_char == ',' or cur_char == '.':
    # do action to start a new "word" automatically
