What Regex to use in this example - python

I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:
'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'
The string that I am parsing could contain everything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card'.
For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.
I have built the following Regex (where 'incident1' is the string being parsed), which I believed would return an unlimited number of all preceding Regexes, however all I am getting is single instances:
regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)
mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)
#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:
print "Goal = ", mysearch2.group()
if mysearch3 is not None:
print "Assist = ", mysearch3.group()
if mysearch4 is not None:
print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
print "Yellow Card = ", mysearch5.group()
This works as long as there is only one instance of every element encountered in a string, however if a player was for example to score more than one goal, this code only returns one instance of 'Goal'.
Can anyone see what I am doing wrong?

You can try something like this:
import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
'Man of the Match',
'Goal',
'Assist',
'Yellow Card',
'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print matches # ['Man of the Match', 'Red Card', 'Red Card']

Some general overview on regex methods in python:
re.search | re.match
In your previous attempt, you tried to use re.search. This only returned one result, and as you'll see this isn't unusual. These two functions are used to identify if a line contains a certain regex. You'd use these for something like:
s = subprocess.check_output('ipconfig') # calls ipconfig and sends output to s
for line in s.splitlines():
if re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(line)):
# if line contains an IP address...
print(line)
You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:
lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
if repattern.match(line):
print(line)
# Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note if we'd used re.search for Age: 16, it would have found both lines!
The take away is that you use these two functions to select lines in a longer document (or any iterable)
re.findall | re.finditer
It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.
s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567','(555)987-6543','(555)135-7924']
re.finditer returns a generator instead of a list. You'd use this the same way you'd use xrange instead of range in Python2. re.findall(some_pattern, some_string) can make a GIANT list if there are a TON of matches. re.finditer will not.
other methods: re.split | re.sub
re.split is great if you have a number of things you need to split by. Imagine you had the string:
s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''
There's no great way to do that with str.split like you're used to, so instead do:
separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
# special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
re.sub is great in cases where you know something will be formatted regularly, but you're not sure exactly how. However, you REALLY want to make sure they're all formatted the same! This will be a little advanced and use several methods, but stick with me....
dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
new_sep = "/"
match_pat = re.compile(r'''
\d{1,2} # two digits
(.) # followed by a separator (capture)
\d{1,2} # two more digits
\1 # a backreference to that separator
\d{2}(?:\d{2})? # two digits and optionally four digits''', re.X)
for idx,date in enumerate(dates):
match = match_pat.match(date)
if match:
sep = match.group(1) # the separator
separators.append(sep)
else:
dates.pop(idx) # this isn't really a date, is it?
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
A slightly less advanced example, you can use re.sub with any sort of formatted expression and pass it a function to return! For instance:
def get_department(dept_num):
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
if hasattr(dept_num, 'group'): # then it's a match, not a number
dept_num = dept_num.group(0)
return departments.get(dept_num, "Unknown Dept")
file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file
dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance
Without using regex here you could do:
replaced_lines = []
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
for line in file.splitlines():
the_split_line = line.split(',')
replaced_lines.append(','.join(the_split_line[:-1]+ \
departments.get(the_split_line[-1], "Unknown Dept")))
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!
Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier!
departments = {'1': 'I.T.',
'2': 'Administration',
'3': 'Human Resources',
'4': 'Maintenance'}
re.sub(r'''\d+$''', lambda x: departments.get(x, "Unknown Dept"), file, flags=re.M)
# DONE!

Related

Python RE. excluding some results

I'm new to RE and I'm trying to take song lyrics and isolate the verse titles, the backing vocals, and main vocals:
Here's an example of some lyrics:
[Intro]
D.A. got that dope!
[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)...
The verse titles include the square brackets and any words between them. They can be successfully isolated with
r'\[{1}.*?\]{1}'
The backing vocals are similar to the verse titles, but between (). They've been successfully isolated with:
r'\({1}.*?\){1}'
For the main vocals, I've used
r'\S+'
which does isolate the main_vocals, but also the verse titles and backing vocals. I cannot figure out how to isolate only the main vocals with simple REs.
Here's a python script that gets the output I desire, but I'd like to do it with REs (as a learning exercise) and cannot figure it out through documentation.
import re
file = 'D:/lyrics.txt'
with open(file, 'r') as f:
lyrics = f.read()
def find_spans(pattern, string):
pattern = re.compile(pattern)
return [match.span() for match in pattern.finditer(string)]
verses = find_spans(r'\[{1}.*?\]{1}', lyrics)
backing_vocals = find_spans(r'\({1}.*?\){1}', lyrics)
main_vocals = find_spans(r'\S+', lyrics)
exclude = verses
exclude.extend(backing_vocals)
not_main_vocals = []
for span in exclude:
start, stop = span
not_main_vocals.extend(list(range(start, stop)))
main_vocals_temp = []
for span in main_vocals:
append = True
start, stop = span
for i in range(start, stop):
if i in not_main_vocals:
append = False
continue
if append == True:
main_vocals_temp.append(span)
main_vocals = main_vocals_temp
Try this Demo:
pattern = r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(]+)'
You can use re.finditer to isolate the groups.
breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for p in pattern.finditer(song):
for key, item in p.groupdict().items():
if item: breakdown[key].append(item)
Result:
{
'Verse':
[
'[Intro]',
'[Chorus: Travis Scott]'
],
'Backing':
[
'(Freeze)',
'(Skrrt, Skrrt)'
],
'Lyrics':
[
'\nD.A. got that dope!\n\n',
'\nIce water, turned Atlantic ',
"\nNightcrawlin' in the Phantom ",
'...'
]
}
To elaborate a bit further on the pattern, it's using the named groups to separate the three distinct groups. Using [^\]+] and similar just means to find everything that is not ] (and likewise when \) means everything not )). In the Lyrics part we exclude anything that starts with [ and (. The link to the demo on regex101 would explain the components in more details if you need.
If you don't care for the newlines in the main lyrics, use (?P<Lyrics>[^\[\(\n]+) (which excludes the \n) to turn your Lyrics without newlines:
'Lyrics': [
'D.A. got that dope!',
'Ice water, turned Atlantic ',
"Nightcrawlin' in the Phantom ",
'...'
]
You could search for the text between close-brackets and open-brackets, using regex groups. If you have a single group (sub-pattern inside round-brackets) in your regex, re.findall will just return the contents of those brackets.
For example, "\[(.*?)\]" would find you just the section labels, not including the square brackets (since they're outside the group).
The regex "\)(.*?)\(" would find just the last line ("\nNightcrawlin' in the Phantom ").
Similarly, we could find the first line with "\](.*?)\[".
Combining the two types of brackets into a character class, the (significantly messier looking) regex "[\]\)](.*?)[\[\(]" captures all of the lyrics.
It will miss lines that don't have brackets before or after them (ie. a the very start before [Intro] if there are any, or at the end if there are no backing vocals afterwards). A possible workaround is to prepend a "]" character and append a "[" character to the end to force a match to start/end at the end of the string. Note we need to add the DOTALL option to make sure the wildcard "." will match the newline character "\n"
import re
lyrics = """[Intro]
D.A. got that dope!
[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""
matches = re.findall(r"[\]\)](.*?)[\[\(]", "]" + lyrics + "[", re.DOTALL)
main_vocals = '\n'.join(matches)

How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example but I have much longer output which is response from an api call.
I want to extract all names (ana, dab, grey) and put in a separate list.
how can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u',output)
but this didn't work because it says "incomplete sequence u"
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in loop to extract names
Thanks
The \u object is not a letter and it cannot be matched. It is a part of a Unicode sequence. The following regex works, but it is kind of quirky. It looks for the beginning of each line, except for the first one, until the first space.
output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = "\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.M | re.I)
#['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']
I use regexr.com and play around with the regular expression until I get it right and then covert that into Python.
https://regexr.com/
I'm assuming the \n is the newline character here and I'll bet your \u error is caused by a line break. To use the multiline match in Python, you need to use that flag when you compile.
\n(.*)\n - this will be greedy and grab as many matches as possible (In the example it would grab the entire \nAna through 54\n
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re
a = re.compile("\\n(.*)\\n", re.MULTILINE)
for responses in a.match(source):
match = responses.split("\n")
# match[0] should be " Ana \u00315 \u003163 47"
# match[1] should be " Deb \u00323 \u003155 60" etc.

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if(result):
print ("Trovato: "+result.group())
But what I need is to obtain the first 2 word before and after the match. For example if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTTENTION
The description cab be very long and the pattern "here is" can appear multiple times?
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
Demo
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
Corrected demo link
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

Python Error"TypeError: coercing to Unicode: need string or buffer, list found"

The purpose of this code is to make a program that searches a persons name (on Wikipedia, specifically) and uses keywords to come up with reasons why that person is significant.
I'm having issues with this specific line "if fact_amount < 5 and (terms in sentence.lower()):" because I get this error ("TypeError: coercing to Unicode: need string or buffer, list found")
If you could offer some guidance it would be greatly appreciated, thank you.
import requests
import nltk
import re
#You will need to install requests and nltk
terms = ['pronounced'
'was a significant'
'major/considerable influence'
'one of the (X) most important'
'major figure'
'earliest'
'known as'
'father of'
'best known for'
'was a major']
names = ["Nelson Mandela","Bill Gates","Steve Jobs","Lebron James"]
#List of people that you need to get info from
for name in names:
print name
print '==============='
#Goes to the wikipedia page of the person
r = requests.get('http://en.wikipedia.org/wiki/%s' % (name))
#Parses the raw html into text
raw = nltk.clean_html(r.text)
#Tries to split each sentence.
#sort of buggy though
#For example St. Mary will split after St.
sentences = re.split('[?!.][\s]*',raw)
fact_amount = 0
for sentence in sentences:
#I noticed that important things came after 'he was' and 'she was'
#Seems to work for my sample list
#Also there may be buggy sentences, so I return 5 instead of 3
if fact_amount < 5 and (terms in sentence.lower()):
#remove the reference notation that wikipedia has
#ex [ 33 ]
sentence = re.sub('[ [0-9]+ ]', '', sentence)
#removes newlines
sentence = re.sub('\n', '', sentence)
#removes trailing and leading whitespace
sentence = sentence.strip()
fact_amount += 1
#sentence is formatted. Print it out
print sentence + '.'
print
You should be checking it the other way
sentence.lower() in terms
terms is list and sentence.lower() is a string. You can check if a particular string is there in a list, but you cannot check if a list is there in a string.
you might mean if any(t in sentence_lower for t in terms), to check whether any terms from terms list is in the sentence string.

Find Pattern in Textfile From Several Elements In Several Lists?

I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']
noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\saor\s\b\w+\b'
with open('test_sentence.txt', 'r') as input_f:
read_input = input_f.read()
word = re.findall(noun_list_pattern1, read_input)
for w in word:
print w
else:
pass
So at this point you may be asking why are the lists in this code since they are not being used. Well, I have been racking my brains out, trying all sort of for loops and if statements in functions to try and find a why to replicate the regex pattern, but using the lists.
The limitation with regex is that the \b\w+\w\ code which is found a number of times in `noun_list_pattern' actually only finds words - any words - but not specific nouns. This could raise false positives. I want to narrow things down more by using the elements in the list above instead of the regex.
Since I actually have 4 different regex in the regex pattern (it contains 4 |), I will just go with 1 of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list
Obviously, the above code quoted line is not real python code, but is an experession of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through the noun_list; C in CC_list is an iteration through the CC_list; , is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no other. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
for line in lines:
... print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
#This assumes that file has been read into a list of newline delimited lines called `rawlines`
for line in rawlines:
matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
if len(matches) == 4:
for match in matches:
print match
The reason the match number is 4, is that 4 is the correct number of matches. (Note, that this could also be the case for repeated nouns or conjunctions).
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []
for l in lst:
matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
invalid = True
valid_count = 0
for match in matches:
if matches.count(match) == 1:
valid_count += 1
if valid_count == len(matches):
invalid = False
if not invalid:
words_matched.append(matches)
matching_lines.append(l)
for line, matches in zip(matching_lines, words_matched):
print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
#The number of permutations choices is 3 (as revealed from your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
matches = [noun for noun in nouns]
matches.append(conj)
#matches[:2] is the sublist containing the first 2 items, -1 is the last element, and matches[2:-1] is the element before the last element (if the number of nouns were more than 3, this would be the elements between the 2nd and last).
regex_string = '\s,\s'.join(matches[:2]) + '\s' + matches[-1] + '\s' + '\s,\s'.join(matches[2:-1])
print regex_string
#... do regex related matching here
The caveat of this method is that it is pure brute-force as it generates all the possible combinations (read permutations) of both lists which can then be tested to see if each line matches. Hence, it is horrendously slow, but in this example that matches the ones given (the non-comma before the conjunction), this will generate exact matches perfectly.
Adapt as required.

Categories