Create Array of found words in Python

I have a message of text and a list of terms. I'd like to create an array that shows which terms are found in the message. For instance:
message = "the quick brown fox jumps over the lazy dog"
terms = ["quick", "fox", "horse", "Lorem", "Ipsum", "the"]
result = idealMethod(message, terms)
=> [1,1,0,0,0,1]
Because "quick" was the first item in the list of terms and was also in the message a 1 is placed in the first position in the result. Here's another example:
message2 = "Every Lorem has a fox"
result2 = idealMethod(message2, terms)
=> [0,1,0,1,0,0]
Update:
The terms need to be exact matches. For instance, if my search terms include sam, I don't want a match for same.

I think you want:
words = set(message.split(" "))
result = [int(term in words) for term in terms]
Note that split() splits on whitespace by default, so you could leave out the " ".
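Wrapped up as the idealMethod from the question, a quick check against both examples:
def idealMethod(message, terms):
    words = set(message.split())
    return [int(term in words) for term in terms]

terms = ["quick", "fox", "horse", "Lorem", "Ipsum", "the"]
print(idealMethod("the quick brown fox jumps over the lazy dog", terms))
# [1, 1, 0, 0, 0, 1]
print(idealMethod("Every Lorem has a fox", terms))
# [0, 1, 0, 1, 0, 0]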

You can use a list comprehension and leverage the fact that True evaluates to 1 in a numeric context:
words = set(message.split())
result = [int(term in words) for term in terms]
Out[24]: [1, 1, 0, 0, 0, 1]
EDIT: changed to look only for whole-word matches, after clarification.
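Note that set(message.split()) keeps punctuation attached to words, so "dog." would not match the term dog. If the messages can contain punctuation, a variant that tokenizes with a regex (a sketch, assuming \w+ word tokens are acceptable):
import re

words = set(re.findall(r"\w+", message))
result = [int(term in words) for term in terms]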

Related

Finding the singular or plural form of a word with regex

Let's assume I have the sentence:
sentence = "A cow runs on the grass"
If I want to replace the word cow with "some" special token, I can do:
to_replace = "cow"
# A <SPECIAL> runs on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace)})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Additionally, if I want to replace its plural form, I could do:
sentence = "The cows run on the grass"
to_replace = "cow"
# The <SPECIAL> run on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace) + 's?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
which performs the replacement even though the word to replace is given in its singular form cow; the s? makes the plural suffix optional.
My question is what happens if I want to apply the same in a more general way, i.e., find and replace words which can be singular, plural ending with s, or plural ending with es (note that I'm intentionally ignoring many edge cases that could appear, discussed in the comments of the question). Another way to frame the question would be: how can I add multiple optional ending suffixes to a word, so that it works for the following examples:
to_replace = "cow"
sentence1 = "The cow runs on the grass"
sentence2 = "The cows run on the grass"
# --------------
to_replace = "gas"
sentence3 = "There are many natural gases"
I suggest using regular Python logic; remember to avoid stretching regexes too much if you don't need to:
phrase = "There are many cows in the field cowes"
for word in phrase.split():
    if word == "cow" or word == "cow" + "s" or word == "cow" + "es":
        phrase = phrase.replace(word, "replacement")
print(phrase)
Output:
There are many replacement in the field replacement
Apparently, for the use case I posted, I can make the suffix optional. So it could go as:
re.sub(rf"(?!\B\w)({re.escape(to_replace) + '(s|es)?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Note that this would not work for many edge cases discussed in the comments!
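For the record, a quick check of that pattern against the three example sentences (the helper name is mine):
import re

def replace_word(to_replace, sentence):
    return re.sub(rf"(?!\B\w)({re.escape(to_replace) + '(s|es)?'})(?<!\w\B)",
                  "<SPECIAL>", sentence, count=1)

print(replace_word("cow", "The cow runs on the grass"))    # The <SPECIAL> runs on the grass
print(replace_word("cow", "The cows run on the grass"))    # The <SPECIAL> run on the grass
print(replace_word("gas", "There are many natural gases")) # There are many natural <SPECIAL>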

Extract words in a paragraph that are similar to words in list

I have the following string:
"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
List of words to be extracted:
["town","teddy","chicken","boy went"]
NB: town and teddy are wrongly spelt in the given sentence.
I have tried the following but I get other words that are not part of the answer:
import difflib
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town","teddy","chicken","boy went"]
[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]
I am getting the following result:
[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]
instead of:
'twn', 'tddy', 'chicken','boy went'
Notice in the documentation for difflib.get_close_matches():
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
At the moment, you are using the default n and cutoff arguments.
You can specify either (or both), to narrow down the returned matches.
For example, you could use a cutoff score of 0.75:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
Or, you could specify that only at most 1 match should be returned:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]
In either case, you could use a list comprehension to flatten the lists of lists (since difflib.get_close_matches() always returns a list):
matches = [r[0] for r in result]
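If one of the terms has no close match at all, difflib.get_close_matches() returns an empty list, so r[0] would raise an IndexError; a defensive variant (assuming None is an acceptable placeholder) could be:
matches = [r[0] if r else None for r in result]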
Since you also want to check for close matches of bigrams, you can do so by extracting pairings of adjacent "words", and pass them to difflib.get_close_matches() as part of the possibilities argument.
Here is a full working example of this in action:
import difflib
import re
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]
# this extracts overlapping pairings of "words"
# i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)
# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]
matches = [r[0] for r in result]
print(matches)
# ['twn', 'tddy', 'chicken.', 'boy went']
If you read the Python documentation for difflib.get_close_matches()
https://docs.python.org/3/library/difflib.html
it returns up to n of the best matches.
Method signature:
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Here n is the maximum number of close matches to return. So I think you can pass this as 1.
>>> [difflib.get_close_matches(x.lower().strip(), sent.split(),1)[0] for x in list1]
['twn', 'tddy', 'chicken.', 'went']

How to slice a string input at a certain unknown index

A string is given as an input (e.g. "What is your name?"). The input always contains a question which I want to extract. But the problem I am trying to solve is that the question always comes surrounded by unneeded text.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that at the third input, the question starts with the word "Given" and end with "yourself?")
The above input examples are generated by the pytesseract OCR library, by scanning an image and converting it into text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the find('?', 1) string method to get the index of the last part of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and get the first spotted \n in the input, but the question doesn't always have a \n before its first letter.
def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0  # TODO
    question = q[index_first_letter_of_q:index_end_q + 1]
One way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words, I suppose). You can do that using pyenchant:
#!/usr/bin/env python
import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    return GLOSSARY.check(word)
sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
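Building on that, a hedged sketch that combines the dictionary check with the find('?') idea from the question (the helper name is mine; it assumes the first '?' ends the question):
import enchant

GLOSSARY = enchant.Dict("en_US")

def extract_question(text):
    end = text.find('?')
    if end == -1:
        return None
    for word in text[:end].split():
        if GLOSSARY.check(word):
            # slice from the first recognised English word through the '?'
            return text[text.find(word):end + 1]
    return None

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
# What is your name?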
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this:
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase I'm looking for. So after I match it with my regex, I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION
The description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
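A quick check of that pattern on the example sentence (groups keep their surrounding whitespace, hence the strip()):
import re

pattern = r"(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)"
m = re.search(pattern, "Parking here is horrible, this shop sucks.")
print([g.strip() for g in m.groups()])
# ['Parking', '', 'horrible,', 'this']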
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re

line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

What is a simple fuzzy string matching algorithm in Python?

I'm trying to find some sort of a good fuzzy string matching algorithm. Direct matching doesn't work for me; unless my strings are 100% similar, the match fails. The Levenshtein method doesn't work too well for strings, as it works on a character level. I was looking for something along the lines of word-level matching, e.g.
String A: The quick brown fox.
String B: The quick brown fox jumped over the lazy dog.
These should match, as all words in string A are in string B.
Now, this is an oversimplified example, but would anyone know a good fuzzy string matching algorithm that works on a word level?
I like Drew's answer.
You can use difflib to find the longest match:
>>> a = 'The quick brown fox.'
>>> b = 'The quick brown fox jumped over the lazy dog.'
>>> import difflib
>>> s = difflib.SequenceMatcher(None, a, b)
>>> s.find_longest_match(0,len(a),0,len(b))
Match(a=0, b=0, size=19) # returns NamedTuple (new in v2.6)
Or pick some minimum matching threshold. Example:
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.61538461538461542
Take a look at this python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context dependent, but it might help you.
from fuzzywuzzy import fuzz
s1 = "the quick brown fox"
s2 = "the quick brown fox jumped over the lazy dog"
s3 = "the fast fox jumped over the hard-working dog"
fuzz.partial_ratio(s1, s2)
> 100
fuzz.token_set_ratio(s2, s3)
> 73
SeatGeek website
and Github repo
If all you want to do is to test whether or not all the words in a string match another string, that's a one-liner:
if not [word for word in b.split(' ') if word not in a.split(' ')]:
    print('Match!')
If you want to score them instead of a binary test, why not just do something like the following?
((# of matching words) / (# of words in bigger string)) * ((# of words in smaller string) / (# of words in bigger string))
If you wanted to, you could get fancier and do fuzzy match on each string.
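A minimal sketch of that score (the function name is mine; it counts exact word overlap, without the fancier per-word fuzzy match):
def word_overlap_score(a, b):
    words_a, words_b = a.split(), b.split()
    small, big = sorted((words_a, words_b), key=len)
    matching = sum(1 for w in small if w in big)
    return (matching / len(big)) * (len(small) / len(big))

print(word_overlap_score("the quick brown fox",
                         "the quick brown fox jumped over the lazy dog"))
# 0.19753... i.e. (4/9) * (4/9)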
You can try this python package which uses fuzzy name matching with machine learning.
pip install hmni
Initialize a Matcher Object
import hmni
matcher = hmni.Matcher(model='latin')
Single Pair Similarity
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
Note: I have not built this package. Sharing it here because it was quite useful for my use case.
GitHub
You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.
Levenshtein works by comparing two arrays of chars. There is no reason that the same logic could not be applied against two arrays of strings.
I did this some time ago with C#; my previous question is here. There is a starter algorithm for your interest, which you can easily transform to Python.
Ideas you should use when writing your own algorithm:
- Have a list with the original "titles" (words/sentences you want to match with).
- Each title item should have a minimal match score on the word/sentence, below which the title is ignored.
- You should also have a global minimal match percentage for the final result.
- You should calculate each word-to-word Levenshtein distance.
- You should increase the total match weight if words appear in the same order ("quick brown" vs. "quick brown" should have a definitively higher weight than "quick brown" vs. "brown quick").
A rough sketch of these ideas is given below.
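A rough sketch under those assumptions; difflib's SequenceMatcher.ratio() stands in for a word-to-word Levenshtein score, and the threshold and order bonus are illustrative values of mine:
import difflib

def word_score(w1, w2):
    # stand-in for a word-to-word Levenshtein similarity
    return difflib.SequenceMatcher(None, w1, w2).ratio()

def title_match_score(title, candidate, min_word_score=0.8):
    t_words = title.lower().split()
    c_words = candidate.lower().split()
    total = 0.0
    for i, tw in enumerate(t_words):
        best_score, best_pos = max(
            (word_score(tw, cw), j) for j, cw in enumerate(c_words))
        if best_score >= min_word_score:
            total += best_score
            if best_pos == i:  # bonus when matched words keep their order
                total += 0.1
    return total / len(t_words)

print(title_match_score("quick brown", "quick brown"))  # 1.1
print(title_match_score("quick brown", "brown quick"))  # 1.0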
You can try FuzzySearchEngine from https://github.com/frazenshtein/fastcd/blob/master/search.py.
This fuzzy search only supports searching for words and has a fixed admissible error per word (only one substitution or transposition of two adjacent characters).
For example, you can try something like:
import search

string = "Chapter I. The quick brown fox jumped over the lazy dog."
substr = "the qiuck broqn fox."

def fuzzy_search_for_sentences(substr, string):
    start = None
    pos = 0
    for word in substr.split(" "):
        if not word:
            continue
        match = search.FuzzySearchEngine(word).search(string, pos=pos)
        if not match:
            return None
        if start is None:
            start = match.start()
        pos = match.end()
    return start

print(fuzzy_search_for_sentences(substr, string))
11 will be printed (the index in string where "The" begins, i.e. where the fuzzy match of the sentence starts).
Levenshtein should work OK if you compare words (strings separated by sequences of stop characters) instead of individual letters.
def ld(s1, s2):  # Levenshtein Distance
    len1 = len(s1) + 1
    len2 = len(s2) + 1
    lt = [[0 for i2 in range(len2)] for i1 in range(len1)]  # lt - levenshtein_table
    lt[0] = list(range(len2))
    i = 0
    for l in lt:
        l[0] = i
        i += 1
    for i1 in range(1, len1):
        for i2 in range(1, len2):
            if s1[i1-1] == s2[i2-1]:
                v = 0
            else:
                v = 1
            lt[i1][i2] = min(lt[i1][i2-1]+1, lt[i1-1][i2]+1, lt[i1-1][i2-1]+v)
    return lt[-1][-1]

str1 = "The quick brown fox"
str2 = "The quick brown fox jumped over the lazy dog"
print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(), str2.split())))
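For str1 and str2 above this prints 5 words need to be added, deleted or replaced to convert string 1 into string 2, since str2 is str1 plus five appended words.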
