Python find all fuzzy matching sequences in a string - python

I have a large string and want to find every sequence in it that matches a given input phrase.
For example, I want to find all the possible matches of defensive rebound in:
Player xy had 10 defensive rebounds only in the 3rd quarter of a match that was a defensive battle between 2 teams that have a defensive rebound rate of over 80% and moreover the average number of rebounds in the defence by player was a staggering 3.5
I want to find all the bold words and after that extract them.
I managed to build a script that does the extraction but it only works for exact matches.
I was thinking of using difflib.SequenceMatcher but I got stuck.

You can use a regex:
import re
# Find [defensive...][space][rebound(s)][space][any word]
re.findall(r'defensive\w* rebound\w* \w+', s)
# Find [rebound(s)][space][any word][space][any word][space][any word]
re.findall(r'rebound\w* \w+ \w+ \w+', s)
findall returns a list of matches.
If all your matches have the same form as the bold words, you can extract them with:
re.findall(r'rebound[ \w]*defence', s)
# (?: rate)? optionally matches the literal word " rate"; [ rate]* would match any run of those characters
re.findall(r'defensive\w* rebound\w*(?: rate)?', s)
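The patterns above only find the spellings they were written for. The difflib.SequenceMatcher idea mentioned in the question can be sketched as below for approximate matches. This is only an illustration under assumptions: the helper name fuzzy_find, the sliding window of whole words and the 0.8 threshold are all made up for this example, and a reworded phrase such as "rebounds in the defence" would still need its own pattern.

```python
from difflib import SequenceMatcher


def fuzzy_find(phrase, text, threshold=0.8):
    """Return (window, ratio) pairs for word windows similar to phrase.

    Hypothetical helper: slides a window with as many words as the phrase
    over the text and keeps windows whose similarity ratio is high enough.
    """
    words = text.split()
    size = len(phrase.split())
    hits = []
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        ratio = SequenceMatcher(None, phrase.lower(), window.lower()).ratio()
        if ratio >= threshold:
            hits.append((window, ratio))
    return hits


sample = ("had 10 defensive rebounds in a defensive battle "
          "with a defensive rebound rate")
for window, ratio in fuzzy_find("defensive rebound", sample):
    print(window, round(ratio, 2))
```

On this shortened sample only the windows "defensive rebounds" and "defensive rebound" pass the threshold; lowering it would admit looser matches such as "defensive battle".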

Related

Python create acronym from first characters of each word and include the numbers

I have a string as follows:
theatre = 'Regal Crown Center Stadium 14'
I would like to break this into an acronym based on the first letter in each word but also include both numbers:
desired output = 'RCCS14'
My code attempts below:
acronym = "".join(word[0] for word in theatre.lower().split())
acronym = "".join(word[0].lower() for word in re.findall("(\w+)", theatre))
acronym = "".join(word[0].lower() for word in re.findall("(\w+ | \d{1,2})", theatre))
acronym = re.search(r"\b(\w+ | \d{1,2})", theatre)
In which I wind up with something like: rccs1 but can't seem to capture that last number. There could be instances when the number is in the middle of the name as well: 'Regal Crown Center 14 Stadium' as well. TIA!
You can use the following regex:
(?:(?<=\s)|^)(?:[a-z]|\d+)
(?:(?<=\s)|^) Ensure what precedes is either a space or the start of the line
(?:[a-z]|\d+) Match either a single letter or one or more digits
The i flag (re.I in python) allows [a-z] to match its uppercase variants.
In Python:
import re
r = re.compile(r"(?:(?<=\s)|^)(?:[a-z]|\d+)", re.I)
s = 'Regal Crown Center Stadium 14'
print(''.join(r.findall(s)))
The code above finds all instances where the regex matches and joins the list items into a single string.
Result: RCCS14
You can use re.sub() to remove all lowercase letters and spaces.
Regex: [a-z ]+
Details:
[a-z ]+ Match a single character present in the list (a lowercase letter or a space) between one and unlimited times
Python code:
re.sub(r'[a-z ]+', '', theatre)
Output: RCCS14
I can't comment since I don't have enough reputation, but S. Jovan's answer isn't satisfying, since it assumes that each word starts with a capital letter and that each word has one and only one capital letter.
re.sub(r'[a-z ]+', '', "Regal Crown Center Stadium YB FIEUBFB DBUUFG FUEH 14")
returns 'RCCSYBFIEUBFBDBUUFGFUEH14'
However, ctwheels' answer does work in this case:
r = re.compile(r"\b(?:[a-z]|\d+)", re.I)
s = 'Regal Crown Center Stadium YB FIEUBFB DBUUFG FUEH 14'
print(''.join(r.findall(s)))
will print
RCCSYFDF14
import re
theatre = 'Regal Crown Center Stadium 14'
r = re.findall(r"\s(\d+|\S)", ' ' + theatre)
print(''.join(r))
Gives me RCCS14
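For comparison, the same rule (first character of each word, but whole numbers kept intact) can be written without a regex. This is a minimal sketch, not taken from any of the answers above; the function name acronym is hypothetical and it assumes words are separated by whitespace.

```python
def acronym(name):
    """Keep whole numeric tokens and the first character of every other token."""
    parts = []
    for token in name.split():
        # "14" is kept in full; "Regal" contributes only "R"
        parts.append(token if token.isdigit() else token[0])
    return "".join(parts)


print(acronym('Regal Crown Center Stadium 14'))   # RCCS14
print(acronym('Regal Crown Center 14 Stadium'))   # RCC14S
```

Because it only looks at the first character of each token, it does not depend on how the rest of the word is capitalised.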

A Python regex to find soccer team fixtures in string

I am using the Requests module to access the HTML from my target website and then using Beautiful Soup to select a specific element on the website. The element in question is a table that contains the results thus far of the English Premier League 2016/2017 season. The table contains the match date, the teams involved, the full-time score and the half-time score. I want to use Python to parse the HTML of the table element and extract the fixtures listed on there. The teams are always listed as:
Team A - Team B
A team name can be 1-3 separate strings (e.g. Burnley, Manchester United, West Ham United).
My attempt so far is:
import re
teamsRegex = re.compile(r'((\w+\s)+-(\s\w+)+)')
My logic here is that the first team can be 1-3 separate strings in length and each string is always followed by a white space. Therefore, the pattern (\w+\s)+ represents a string of any length followed by a white space and can be repeated 1 or many times. The second team name will always begin with a white space following the "-" character and again can be a string of any length, repeated 1 or many times (\s\w+)+.
I'm sort of achieving the desired results but the above is not entirely correct. I am returned a list with my desired result at index 0 followed by the first string of index 0 as index 1, and the last string in index 0 as index 2.
Example string:
'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
Regex finds:
[('Burnley - Swansea City', 'Burnley ', ' City'), ('0 - 1', '0 ', ' 1')]
I would just like it to find [('Burnley - Swansea City')]
Many thanks in anticipation of any help!
r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+'
Here you have two non-capturing groups, (?:...), to match the teams' names, so you'll get only the full match. I chose to use letters explicitly, so the expressions only match words beginning with capital letters and exclude digits. You should change that if the teams' names can contain digits (like "BVB 09").
Depending on the HTML file's content, one could add a final lookahead (?= align) to increase specificity.
Edit:
To match up to three capitals and optional '&'s, try this :
r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+'
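The reason the original attempt returned tuples is that re.findall returns the contents of the capturing groups whenever a pattern has more than one. With only non-capturing groups it returns the full match. A short sketch of both patterns from this answer; the second input string is made up for illustration and is not from the question.

```python
import re

# Example string as given in the question
sample = 'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'

simple = re.compile(r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+')
extended = re.compile(r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+')

# No capturing groups, so findall yields plain strings
print(simple.findall(sample))  # ['Burnley - Swansea City']

# Hypothetical fixture containing an ampersand
print(extended.findall('Brighton & Hove Albion - West Ham United'))
```

The score "0 - 1" is skipped because both patterns require each word to start with a capital letter (or "&" in the extended one).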

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches an initial capital letter followed by any 2 to 4 characters and then a literal period. You can tighten that up if required, e.g. using [a-z] in the pattern: [A-Z][a-z]{2,4}\. would match an upper case character followed by 2 to 4 lowercase characters followed by a literal dot/period.
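The difference between the loose and the tightened pattern shows up as soon as a short phrase contains a space. A small sketch on a made-up string (not the text from the question):

```python
import re

text = "Go on. King. Done"  # hypothetical sample

loose = re.findall(r'[A-Z].{2,4}\.', text)
tight = re.findall(r'[A-Z][a-z]{2,4}\.', text)

print(loose)  # ['Go on.', 'King.']
print(tight)  # ['King.']
```

The loose pattern accepts "Go on." because . also matches the space, while the tightened one only accepts a single capitalised word.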
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # capitalised, ends with a full stop, 3 to 5 characters before the dot
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
    abstract: "...long abstract here...",
    highlights: [
        {
            concept: 'a word',
            start: 1,
            end: 10
        },
        {
            concept: 'cancer',
            start: 123,
            end: 135
        }
    ]
}
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter, as I just need a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option, but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it. Or could I somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to delimit almost all sentences, including ones that end with a "quotation." You can do something like this (the paragraph comes from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But with this method we effectively have a triply nested loop: we check each sentence, then each highlight, and then the substring test (highlight in sentence) scans the sentence itself.
We can get better performance since we know the start index for each highlight:
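highlightIndices = [100, 169]
sentenceStart = 0
for sentence in sent_tokenize(paragraph):
    # Locate the sentence in the original text so that the whitespace
    # between sentences is counted in the offsets as well.
    sentenceStart = paragraph.find(sentence, sentenceStart)
    sentenceEnd = sentenceStart + len(sentence)
    for index in highlightIndices:
        if sentenceStart <= index < sentenceEnd:
            sentencesWithHighlights.append(sentence)
            break
    sentenceStart = sentenceEnd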
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
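Since the question already has character offsets, another way to look at it is to keep each sentence's (start, end) span and look an offset up directly. The following is a minimal standard-library sketch under assumptions: the regex splitter is only a crude stand-in for sent_tokenize, and sentence_spans and sentence_at are hypothetical helper names.

```python
import bisect
import re


def sentence_spans(text):
    """Return (start, end) offsets of sentences.

    Crude stand-in for a real tokenizer (assumption): a sentence starts at a
    non-space character and runs up to and including the next . ! or ?
    """
    return [m.span() for m in re.finditer(r'\S[^.!?]*[.!?]+', text)]


def sentence_at(text, index):
    """Return the sentence that contains character offset index, or None."""
    spans = sentence_spans(text)
    starts = [start for start, _ in spans]
    pos = bisect.bisect_right(starts, index) - 1  # last sentence starting at or before index
    if pos < 0:
        return None
    start, end = spans[pos]
    return text[start:end] if index < end else None


paragraph = ("How does chickens harden over the acceptance? "
             "Chickens comprises coffee. "
             "Chickens crushes a popular vet next to the eater. "
             "Will chickens sweep beneath a project? "
             "Coffee funds chickens. "
             "Chickens abides against an ineffective drill.")

print(sentence_at(paragraph, paragraph.index("vet")))
print(sentence_at(paragraph, paragraph.index("funds")))
```

With NLTK available, PunktSentenceTokenizer's span_tokenize method yields (start, end) pairs of the same shape and could replace the stand-in splitter.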
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
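The pattern above is written as a JavaScript-style literal. A possible Python translation is sketched below, assuming the highlights are plain strings that should be matched literally (hence re.escape); the sample paragraph and highlight list are borrowed from the other answer purely for illustration.

```python
import re

paragraph = ("How does chickens harden over the acceptance? "
             "Chickens comprises coffee. "
             "Chickens crushes a popular vet next to the eater. "
             "Will chickens sweep beneath a project? "
             "Coffee funds chickens. "
             "Chickens abides against an ineffective drill.")
highlights = ["vet", "funds"]

# Build the (?:a|b|c) group from the highlight list
alternatives = "|".join(re.escape(h) for h in highlights)
pattern = re.compile(
    r'(?:[.!?]|^)\s*([^.!?]*(?:' + alternatives + r')[^.!?]*?)(?=\s*[.!?])',
    re.IGNORECASE,
)

# findall returns the first (only) capturing group of each match
sentences = pattern.findall(paragraph)
print(sentences)
```

Note that the captured sentences come back without their closing punctuation, and that a highlight is matched as a substring, so it can also hit inside a longer word.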
Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)

Match longest substring in Python

Consider the following string in a text file, with a tab between its left and right parts:
The dreams of REM (Geo) sleep The sleep paralysis
I want to match both the left part and the right part of the above string against each line of another file, such as the following:
The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
If the full string cannot be matched, then try to match a substring.
I want to search with the leftmost and rightmost patterns.
e.g. (leftmost cases):
The dreams of REM sleep paralysis
The dreams of REM sleep The sleep
e.g. (rightmost cases):
REM sleep The sleep paralysis
The dreams of The sleep paralysis
Thanks a lot again for any kind of help.
(Ok, you clarified most of what you want. Let me restate, then clarify the points I listed below as remaining unclear... Also take the starter code I show you, adapt it, post us the result.)
You want to search, line-by-line, case-insensitive, for the longest contiguous matches to each of a pair of match-patterns. All the patterns seem to be disjoint (impossible to get a match on both patternX and patternY, since they use different phrases, e.g. can't match both 'frontal lobe' and 'prefrontal cortex').
Your patterns are supplied as a sequence of pairs ('dom', 'rang') => let's just refer to them by their subscripts [0] and [1] (you can use string.split('\t') to get them).
The important thing is a matching line must match both the dom and rang patterns (fully or partially).
Order is independent, so we can match rang then dom, or vice versa => use 2 separate regexes per line, and test d and r matched.
Patterns have optional parts, in parentheses => so just write/convert them to regex syntax using the (optionaltext)? syntax, keeping the separating space inside the group, e.g.: re.compile('Frontallobes of (leftside )?the brain', re.IGNORECASE)
The return value should be the string buffer with the longest substring match so far.
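The conversion of optional parts could be automated along these lines. This is a sketch under assumptions, not the answer's own code: the helper name to_regex is hypothetical, optional parts are assumed to be written in parentheses as in "(Geo)", and an optional part is assumed not to be the first word of the phrase.

```python
import re


def to_regex(phrase):
    """Compile a phrase like 'dreams of REM (Geo) sleep' into a regex in
    which the parenthesised part, together with its leading space, is optional."""
    # Splitting on a capturing group keeps the optional text at odd indices
    parts = re.split(r'\s*\(([^)]*)\)', phrase)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))                     # literal text
        else:
            pieces.append(r'(?:\s+' + re.escape(part) + r')?')  # optional text
    return re.compile(''.join(pieces), re.IGNORECASE)


dom = to_regex('dreams of REM (Geo) sleep')
line = 'as well as generating the dreams of REM sleep.'
print(dom.search(line).group())  # dreams of REM sleep
```

Folding the space into the optional group avoids requiring two consecutive spaces when the optional word is absent.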
Now this is where several things remain to be clarified - please edit your question to explain the following:
If you find full matches to any pair of patterns, then return that.
If you can't find any full matches, then search for partial matches of both of the pair of patterns. Where 'partial match' means 'the most words' or 'the highest proportion(%) of words' from a pattern? Presumably we exclude spurious matches to words like 'the', in which case we lose nothing by simply omitting 'the' from your search patterns, then this guarantees that all partial matches to any pattern are significant.
We score the partial matches (somehow), e.g. 'contains most words from pattern X', or 'contains highest % of words from pattern X'. We should do this for all patterns, then return the pattern with the highest score. You'll need to think about this a little, is it better to match 2 words of a 5-word pattern (40%) e.g. 'dreams of', or 1 of 2 (50%) e.g. 'prefrontal BUT NOT cortex'? How do we break ties, etc? What happens if we match 'sleep' but nothing else?
Each of the above questions will affect the solution, so you need to answer them for us. There's no point in writing pages of code to solve the most general case when you only needed something simple.
In general this is called 'NLP' (natural language processing). You might end up using an NLP library.
The general structure of the code so far looks like this:
import re
# normally, read your input directly from file, but this allows us to test:
input_lines = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')
patterns = [
    ('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
    ('(frontal lobe)', '(inferior frontal gyrus)'),
    ('(prefrontal cortex)', '(frontallobes of (leftside )?(the )?brain)'),
    ('(modern tract)', '(probably mediodorsal nucleus)') ]
# Compile the patterns as regexes
patterns = [ (re.compile(dstr), re.compile(rstr)) for (dstr, rstr) in patterns ]
def longest(t):
    """Get the longest from a tuple of strings."""
    l = list(t) # tuples can't be sorted (immutable), so convert to list...
    l.sort(key=len, reverse=True)
    return l[0]
def custommatch(line):
    for (d, r) in patterns:
        # If got full match to both (d,r), return it immediately...
        (dm, rm) = (d.findall(line), r.findall(line))
        # Slight design problem: we get tuples like: [('frontallobes of the brain', '', 'the ')]
        #... so return the longest match strings for each of dm,rm
        if dm and rm: # must match both dom & rang
            return [longest(dm), longest(rm)]
        # else score any partial matches to (d,r) - how exactly?
        # TBD...
    else:
        # The for loop finished without returning, so no pattern pair matched fully:
        # we only have partial matches (or none)
        # TBD: return the 'highest-scoring' partial match
        return ('TBD... partial match')
for line in input_lines:
    print(custommatch(line))
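Running this on the 7 lines of input you supplied currently gives the following (the eighth result comes from the empty string that split('\n') leaves after the final newline):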
TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match
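One way the open "TBD" scoring could be filled in is sketched below. It is only an illustration of the "highest proportion of words" option discussed above, since the question leaves the actual criterion open. The names STOP_WORDS, pattern_words and partial_score are hypothetical, the stop-word list is an assumption, and the function takes the raw phrase from the pattern file rather than a compiled regex.

```python
import re

# Assumed list of words too common to count as a meaningful partial match
STOP_WORDS = {"the", "of", "a"}


def pattern_words(phrase):
    """Significant lowercase words of a phrase, ignoring optional parts in parentheses."""
    phrase = re.sub(r'\([^)]*\)', ' ', phrase)
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


def partial_score(phrase, line):
    """Fraction of the phrase's significant words that occur in the line."""
    words = pattern_words(phrase)
    if not words:
        return 0.0
    line_words = set(re.findall(r'\w+', line.lower()))
    return sum(1 for w in words if w in line_words) / len(words)


line = ("The pons also contains the sleep paralysis center of the brain "
        "as well as generating the dreams of REM sleep.")
print(partial_score("The dreams of REM (Geo) sleep", line))   # 1.0
print(partial_score("prefrontal cortex", line))                # 0.0
```

A line could then be scored against both halves of each pair and the best-scoring pair returned. How to combine the two scores and break ties is exactly the design decision the questions above ask about.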
