How to capture all regex groups in one regex? - python

Given a file like this:
# For more information about CC-CEDICT see:
# http://cc-cedict.org/wiki/
A A [A] /(slang) (Tw) to steal/
AA制 AA制 [A A zhi4] /to split the bill/to go Dutch/
AB制 AB制 [A B zhi4] /to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable/
A咖 A咖 [A ka1] /class "A"/top grade/
A圈兒 A圈儿 [A quan1 r5] /at symbol, #/
A片 A片 [A pian4] /adult movie/pornography/
I want to build a JSON object that:
skips lines that start with #
breaks each line into 4 parts:
traditional character (spans from the start ^ until the first space)
simplified character (spans from the first space to the second)
pinyin (spans between the square brackets [...])
the gloss (spans from the first / to the last /; note there are cases where there can be slashes within the gloss, e.g. /adult movie/pornography/)
I am currently doing it as such:
>>> for line in text.split('\n'):
...     if line.startswith('#'): continue
...     line = line.strip()
...     simple, _, line = line.partition(' ')
...     trad, _, line = line.partition(' ')
...     print simple, trad
...
A A
AA制 AA制
AB制 AB制
A咖 A咖
A圈兒 A圈儿
A片 A片
To get the [...], I had to do:
>>> import re
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> simple, _, line = line.partition(' ')
>>> trad, _, line = line.partition(' ')
>>> re.findall(r'\[.*\]', line)[0].strip('[]')
'A pian4'
And to find the /.../, I had to do:
>>> line = "A片 A片 [A pian4] /adult movie/pornography/"
>>> re.findall(r'\/.*\/$', line)[0].strip('/')
'adult movie/pornography'
How do I use regex groups to catch all of them at once, without doing multiple partitions/splits/findalls?

You could extract the info using regular expressions instead. This way, you can catch the blocks in groups and then handle them as desired:
import re

with open("myfile") as f:
    data = f.read().split('\n')

for line in data:
    if line.startswith('#'): continue
    m = re.search(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$", line)
    if m:
        print(m.groups())
That regular expression splits the string into the following groups:
^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$
 ^^^^^^^ ^^^^^^^   ^^^^^^^     ^^^^
   1)      2)        3)         4)
That is:
the first word.
the second word.
the text within [ and ].
the text from / up to the / before the end of the line.
It returns:
('A', 'A', 'A', '(slang) (Tw) to steal')
('AA制', 'AA制', 'A A zhi4', 'to split the bill/to go Dutch')
('AB制', 'AB制', 'A B zhi4', 'to split the bill (where the male counterpart foots the larger portion of the sum)/(theater) a system where two actors take turns in acting the main role, with one actor replacing the other if either is unavailable')
('A咖', 'A咖', 'A ka1', 'class "A"/top grade')
('A圈兒', 'A圈儿', 'A quan1 r5', 'at symbol, #')
('A片', 'A片', 'A pian4', 'adult movie/pornography')
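The question also asks for a JSON object rather than tuples. Here is a minimal sketch of going from those groups to JSON; this is not part of the original answer, the key names are my own, and it assumes the CC-CEDICT column order (traditional before simplified) and that the gloss is a /-separated list of senses:
import json
import re

entries = []
with open("myfile") as f:                # same placeholder file name as above
    for line in f:
        if line.startswith('#'):
            continue
        m = re.search(r"^([^ ]*) ([^ ]*) \[([^]]*)\] \/(.*)\/$", line.strip())
        if m:
            trad, simp, pinyin, gloss = m.groups()
            entries.append({'traditional': trad,
                            'simplified': simp,
                            'pinyin': pinyin,
                            'gloss': gloss.split('/')})

print(json.dumps(entries, ensure_ascii=False, indent=2))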

p = re.compile(r"(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/$")
m = p.match(line)
if m:
    simple, trad, pinyin, gloss = m.groups()
See https://docs.python.org/2/howto/regex.html#grouping for more details.
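A minimal usage sketch for that pattern (my addition), assuming line holds a single dictionary entry:
line = "A片 A片 [A pian4] /adult movie/pornography/"
m = p.match(line)
if m:
    simple, trad, pinyin, gloss = m.groups()
    print(pinyin)  # A pian4
    print(gloss)   # adult movie/pornography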

This might help:
preg = re.compile(r'^(?<!#)(\w+)\s(\w+)\s(\[.*?\])\s/(.+)/$',
                  re.MULTILINE | re.UNICODE)

with open('your_file') as f:
    for line in f:
        match = preg.match(line)
        if match:
            print(match.groups())
Take a look here for a detailed explanation of the used regular expression.

I created the following regex to match all four groups:
REGEX DEMO
^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)
This does assume that there is only one space between the groups; however, if there can be more, you can just adjust the whitespace quantifier (e.g. \s+).
Here is a demo of how this works with python with the lines provided in the question:
IDEONE DEMO
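Since the linked demos aren't reproduced here, a rough sketch (my addition) of applying that pattern in Python, assuming text holds the file contents as in the question; note that, unlike the other answers, this pattern keeps the brackets and slashes inside the captured groups:
import re

pattern = re.compile(r'^(.*)\s(.*)\s(\[.*\])\s(\/.*\/)')
for line in text.split('\n'):
    if line.startswith('#'):
        continue
    m = pattern.match(line)
    if m:
        print(m.groups())
# e.g. ('A片', 'A片', '[A pian4]', '/adult movie/pornography/')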

Related

Python RE. excluding some results

I'm new to RE and I'm trying to take song lyrics and isolate the verse titles, the backing vocals, and main vocals:
Here's an example of some lyrics:
[Intro]
D.A. got that dope!
[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)...
The verse titles include the square brackets and any words between them. They can be successfully isolated with
r'\[{1}.*?\]{1}'
The backing vocals are similar to the verse titles, but between (). They've been successfully isolated with:
r'\({1}.*?\){1}'
For the main vocals, I've used
r'\S+'
which does isolate the main_vocals, but also the verse titles and backing vocals. I cannot figure out how to isolate only the main vocals with simple REs.
Here's a python script that gets the output I desire, but I'd like to do it with REs (as a learning exercise) and cannot figure it out through documentation.
import re

file = 'D:/lyrics.txt'
with open(file, 'r') as f:
    lyrics = f.read()

def find_spans(pattern, string):
    pattern = re.compile(pattern)
    return [match.span() for match in pattern.finditer(string)]

verses = find_spans(r'\[{1}.*?\]{1}', lyrics)
backing_vocals = find_spans(r'\({1}.*?\){1}', lyrics)
main_vocals = find_spans(r'\S+', lyrics)

exclude = verses
exclude.extend(backing_vocals)
not_main_vocals = []
for span in exclude:
    start, stop = span
    not_main_vocals.extend(list(range(start, stop)))

main_vocals_temp = []
for span in main_vocals:
    append = True
    start, stop = span
    for i in range(start, stop):
        if i in not_main_vocals:
            append = False
            continue
    if append == True:
        main_vocals_temp.append(span)
main_vocals = main_vocals_temp
Try this Demo:
import re

pattern = re.compile(r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(]+)')
You can use re.finditer to isolate the groups.
breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
for p in pattern.finditer(song):
    for key, item in p.groupdict().items():
        if item: breakdown[key].append(item)
Result:
{
    'Verse': [
        '[Intro]',
        '[Chorus: Travis Scott]'
    ],
    'Backing': [
        '(Freeze)',
        '(Skrrt, Skrrt)'
    ],
    'Lyrics': [
        '\nD.A. got that dope!\n\n',
        '\nIce water, turned Atlantic ',
        "\nNightcrawlin' in the Phantom ",
        '...'
    ]
}
To elaborate a bit further on the pattern, it's using named groups to separate the three distinct categories. Using [^\]]+ and similar just means to match everything that is not ] (and likewise [^\)]+ matches everything that is not )). In the Lyrics part we exclude anything that starts with [ or (. The link to the demo on regex101 explains the components in more detail if you need it.
If you don't care for the newlines in the main lyrics, use (?P<Lyrics>[^\[\(\n]+) (which excludes the \n) to get your Lyrics without newlines:
'Lyrics': [
    'D.A. got that dope!',
    'Ice water, turned Atlantic ',
    "Nightcrawlin' in the Phantom ",
    '...'
]
You could search for the text between close-brackets and open-brackets, using regex groups. If you have a single group (sub-pattern inside round-brackets) in your regex, re.findall will just return the contents of those brackets.
For example, "\[(.*?)\]" would find you just the section labels, not including the square brackets (since they're outside the group).
The regex "\)(.*?)\(" would find just the last line ("\nNightcrawlin' in the Phantom ").
Similarly, we could find the first line with "\](.*?)\[".
Combining the two types of brackets into a character class, the (significantly messier looking) regex "[\]\)](.*?)[\[\(]" captures all of the lyrics.
It will miss lines that don't have brackets before or after them (i.e. at the very start before [Intro], if there are any, or at the end if there are no backing vocals afterwards). A possible workaround is to prepend a "]" character and append a "[" character to force matches at the start and end of the string. Note we need to add the DOTALL option to make sure the wildcard "." will match the newline character "\n".
import re
lyrics = """[Intro]
D.A. got that dope!
[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."""
matches = re.findall(r"[\]\)](.*?)[\[\(]", "]" + lyrics + "[", re.DOTALL)
main_vocals = '\n'.join(matches)
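The prepended "]" also produces a leading empty match, so if you want to drop that and the stray newlines, a small cleanup pass (my own addition, not part of the answer) could be:
main_vocals = '\n'.join(m.strip() for m in matches if m.strip())
print(main_vocals)
# D.A. got that dope!
# Ice water, turned Atlantic
# Nightcrawlin' in the Phantom
# ...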

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Trovato: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTTENTION
The description cab be very long and the pattern "here is" can appear multiple times?
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
Demo
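For example, a quick sketch (my addition) of how that pattern behaves on the sample sentence, splitting the captured context into word lists afterwards:
import re

line = 'Parking here is horrible, this shop sucks.'
pattern = r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})'
for before, after in re.findall(pattern, line, re.IGNORECASE):
    print(before.split())  # ['Parking']
    print(after.split())   # ['horrible,', 'this']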
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
Corrected demo link
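A short sketch (my addition) of how those four groups come out for the sample sentence, with the trimming mentioned above:
import re

line = 'Parking here is horrible, this shop sucks.'
pattern = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(pattern, line, re.IGNORECASE)
if m:
    print([g.strip() for g in m.groups()])
# ['Parking', '', 'horrible,', 'this']  (group 2 empty: only one word before)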
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]

What Regex to use in this example

I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:
'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'
The string that I am parsing could contain everything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card').
For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.
I have built the following regexes (where 'incident1' is the string being parsed), which I believed would return every occurrence of each of the preceding phrases; however, all I am getting is single instances:
regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)
mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)
#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
    print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:
    print "Goal = ", mysearch2.group()
if mysearch3 is not None:
    print "Assist = ", mysearch3.group()
if mysearch4 is not None:
    print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
    print "Yellow Card = ", mysearch5.group()
This works as long as there is only one instance of every element encountered in a string, however if a player was for example to score more than one goal, this code only returns one instance of 'Goal'.
Can anyone see what I am doing wrong?
You can try something like this:
import re
s = "here's an example Man of the Match match and a Red Card match, and another Red Card match"
patterns = [
    'Man of the Match',
    'Goal',
    'Assist',
    'Yellow Card',
    'Red Card',
]
repattern = '|'.join(patterns)
matches = re.findall(repattern, s, re.IGNORECASE)
print matches # ['Man of the Match', 'Red Card', 'Red Card']
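As a small follow-up that isn't part of the answer above: if you also need to know how many times each phrase occurred (e.g. multiple goals), collections.Counter over that findall() result gives the counts directly:
from collections import Counter

counts = Counter(matches)
print(counts)  # e.g. Counter({'Red Card': 2, 'Man of the Match': 1})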
Some general overview on regex methods in python:
re.search | re.match
In your previous attempt, you tried to use re.search. This only returned one result, and as you'll see this isn't unusual. These two functions are used to identify if a line contains a certain regex. You'd use these for something like:
import subprocess

s = subprocess.check_output('ipconfig') # calls ipconfig and sends output to s
for line in s.splitlines():
    if re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", str(line)):
        # if line contains an IP address...
        print(line)
You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:
lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
         'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
    if repattern.match(line):
        print(line)
# Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note if we'd used re.search for Age: 16, it would have found both lines!
The takeaway is that you use these two functions to select lines in a longer document (or any iterable).
re.findall | re.finditer
It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.
s = """Phone book:
Adam: (555)123-4567
Joe: (555)987-6543
Alice:(555)135-7924"""
pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567','(555)987-6543','(555)135-7924']
re.finditer returns a generator instead of a list. You'd use this the same way you'd use xrange instead of range in Python2. re.findall(some_pattern, some_string) can make a GIANT list if there are a TON of matches. re.finditer will not.
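For instance, a minimal sketch (my addition) reusing the phone-book pattern above with finditer instead of findall:
for m in re.finditer(pat, s):
    # each item is a match object rather than a string, so you pull the
    # matched text (and, if you want, its position) out of it explicitly
    print(m.group())  # the matched number
    print(m.span())   # its (start, end) position in s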
other methods: re.split | re.sub
re.split is great if you have a number of things you need to split by. Imagine you had the string:
s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''
There's no great way to do that with str.split like you're used to, so instead do:
separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
# special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
re.sub is great in cases where you know something will be formatted regularly, but you're not sure exactly how. However, you REALLY want to make sure they're all formatted the same! This will be a little advanced and use several methods, but stick with me....
dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
new_sep = "/"
match_pat = re.compile(r'''
    \d{1,2}          # one or two digits
    (.)              # followed by a separator (captured)
    \d{1,2}          # one or two more digits
    \1               # a backreference to that separator
    \d{2}(?:\d{2})?  # two digits, optionally followed by two more''', re.X)
for idx, date in enumerate(dates):
    match = match_pat.match(date)
    if match:
        sep = match.group(1) # the separator
        separators.append(sep)
    else:
        dates.pop(idx) # this isn't really a date, is it?
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
As a slightly less advanced example, you can use re.sub with any sort of formatted expression and pass it a function that returns the replacement! For instance:
def get_department(dept_num):
    departments = {'1': 'I.T.',
                   '2': 'Administration',
                   '3': 'Human Resources',
                   '4': 'Maintenance'}
    if hasattr(dept_num, 'group'): # then it's a match, not a number
        dept_num = dept_num.group(0)
    return departments.get(dept_num, "Unknown Dept")
file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file
dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance
Without using regex here you could do:
replaced_lines = []
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
for line in file.splitlines():
    the_split_line = line.split(',')
    replaced_lines.append(','.join(the_split_line[:-1] +
                          [departments.get(the_split_line[-1], "Unknown Dept")]))
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!
Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier!
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
re.sub(r'''\d+$''', lambda x: departments.get(x.group(0), "Unknown Dept"), file, flags=re.M)
# DONE!

How to extract a pattern from a file using regex in python

I have an input file like the one below, and I need to extract the patterns that start with nsubj, rcmod, ccomp or acomp and print them into two output files as shown below. I am new to Python and I am not getting how to use regex here.
Input file
nsubj(believe-4, i-1)
aux(believe-4, ca-2)
neg(believe-4, n't-3)
root(ROOT-0, believe-4)
acomp(believe-4, #mistamau-5)
aux(know-8, does-6)
neg(know-8, n't-7)
ccomp(#mistamau-5, know-8)
dobj(is-12, who-9)
amod(tatum-11, channing-10)
nsubj(is-12, tatum-11)
ccomp(know-8, is-12)
root(ROOT-0, What-1)
cop(What-1, is-2)
amod(people-4, worse-3)
xsubj(hear-9, I-5)
aux(talking-7, am-6)
rcmod(people-4, talking-7)
xcomp(talking-7, hear-9)
dobj(hear-9, me-10)
advmod(poorly-12, very-11)
Output file_1
nsubj(believe-4, i-1)
nsubj(is-12, tatum-11)
acomp(believe-4, #mistamau-5)
rcmod(people-4, talking-7)
ccomp(know-8, is-12)
ccomp(#mistamau-5, know-8)
Output file_2
believe, i
is, tatum
believe, #mistamau
people, talking
know, is
#mistamau, know
Here's a program that takes in words from stdin and prints 'matched' or 'not matched' depending on whether the word starts with 'Big' or 'Daddy'.
import re
import sys

prog = re.compile('((Big)|(Daddy))[a-z]*')

while True:
    line = sys.stdin.readline()
    if not line: break
    if prog.match(line):
        print 'matched'
    else:
        print 'not matched'
Just replace the regular expression pattern with your own, take the input from a file instead of standard input, and you should be set.
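For example, a hedged adaptation of that idea to the question's input (my addition; the file name here is just a placeholder):
import re

prog = re.compile(r'(?:nsubj|rcmod|ccomp|acomp)\(')
with open('input_file.txt') as f:
    for line in f:
        if prog.match(line):
            print(line.rstrip())  # keeps only the wanted dependency lines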
regex = re.compile(r"""
^ # Start of line (re.M modifier set!)
( # Start of capturing group 1:
(?:nsubj|rcmod|ccomp|acomp) # Match one of these
\( # Match (
([^-]*) # Match and capture in group 2 any no. of non-dash characters
-\d+,[ ] # Match a dash and a number, a comma and a space
([^-]*) # Match and capture in group 3 any no. of non-dash characters
-\d+ # Match a dash and a number
\) # Match )
) # End of group 1""", re.M|re.X)
This should work if I understand your requirements correctly.
When applied to the entire file (s = myfile.read()) you get the following result:
>>> regex.findall(s)
[('nsubj(believe-4, i-1)', 'believe', 'i'),
('acomp(believe-4, #mistamau-5)', 'believe', '#mistamau'),
('ccomp(#mistamau-5, know-8)', '#mistamau', 'know'),
('nsubj(is-12, tatum-11)', 'is', 'tatum'),
('ccomp(know-8, is-12)', 'know', 'is'),
('rcmod(people-4, talking-7)', 'people', 'talking')]
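To actually produce the two output files from the question, a sketch building on that findall() result (my addition; the file names and variable names are placeholders) might be:
matches = regex.findall(s)  # s = myfile.read(), as above

with open('output_1.txt', 'w') as f1, open('output_2.txt', 'w') as f2:
    for full_match, head, dependent in matches:
        f1.write(full_match + '\n')               # e.g. nsubj(believe-4, i-1)
        f2.write('%s, %s\n' % (head, dependent))  # e.g. believe, i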

Find Pattern in Textfile From Several Elements In Several Lists?

I am a beginner, been learning python for a few months as my very first programming language. I am looking to find a pattern from a text file. My first attempt has been using regex, which does work but has a limitation:
import re

noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
CC_list = ['and', 'or']

noun_list_pattern1 = r'\b\w+\b,\s\b\w+\b,\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\sor\s\b\w+\b|\b\w+\b,\s\b\w+\b\sand\s\b\w+\b|\b\w+\b,\s\b\w+\b,\saor\s\b\w+\b'

with open('test_sentence.txt', 'r') as input_f:
    read_input = input_f.read()
    word = re.findall(noun_list_pattern1, read_input)
    for w in word:
        print w
    else:
        pass
So at this point you may be asking why the lists are in this code since they are not being used. Well, I have been racking my brains out, trying all sorts of for loops and if statements in functions, to try and find a way to replicate the regex pattern but using the lists.
The limitation with the regex is that the \b\w+\b pattern, which appears a number of times in noun_list_pattern1, actually only finds words - any words - but not specific nouns. This could raise false positives. I want to narrow things down more by using the elements in the lists above instead of the regex.
Since I actually have 4 different regexes in the pattern (it contains 4 alternatives separated by |), I will just go with 1 of them here. So I would need to find a pattern such as:
'noun in noun_list' + ', ' + 'noun in noun_list' + ', ' + 'C in CC_list' + ' ' + 'noun in noun_list'
Obviously, the quoted line above is not real Python code, but an expression of my thoughts about the match needed. Where I say noun in noun_list I mean an iteration through noun_list; C in CC_list is an iteration through CC_list; ', ' is a literal string match for a comma and whitespace.
Hopefully I have made myself clear!
Here is the content of the test_sentence.txt file that I am using:
I need to buy are bacon, cheese and eggs.
I also need to buy milk, cheese, and bacon.
What's your favorite: milk, cheese or eggs.
What's my favorite: milk, bacon, or eggs.
Break your problem down a little. First, you need a pattern that will match the words from your list, but no other. You can accomplish that with the alternation operator | and the literal words. red|green|blue, for example, will match "red", "green", or "blue", but not "purple". Join the noun list with that character, and add the word boundary metacharacters along with parentheses to group the alternations:
noun_patt = r'\b(' + '|'.join(nouns) + r')\b'
Do the same for your list of conjunctions:
conj_patt = r'\b(' + '|'.join(conjunctions) + r')\b'
The overall match you want to make is "one or more noun_patt match, each optionally followed by a comma, followed by a match for the conj_patt and then one more noun_patt match". Easy enough for a regex:
patt = r'({0},? )+{1} {0}'.format(noun_patt, conj_patt)
You don't really want to use re.findall(), but re.search(), since you're only expecting one match per line:
>>> for line in lines:
...     print re.search(patt, line).group(0)
...
bacon, cheese and eggs
milk, cheese, and bacon
milk, cheese or eggs
milk, bacon, or eggs
As a note, you're close to, if not rubbing up against, the limits of regular expressions as far as parsing English. Any more complex than this, and you will want to look into actual parsing, perhaps with NLTK.
In actuality, you don't necessarily need regular expressions, as there are a number of ways to do this using just your original lists.
noun_list = ['bacon', 'cheese', 'eggs', 'milk', 'list', 'dog']
conjunctions = ['and', 'or']
# This assumes that the file has been read into a list of newline delimited lines called `rawlines`
for line in rawlines:
    matches = [noun for noun in noun_list if noun in line] + [conj for conj in conjunctions if conj in line]
    if len(matches) == 4:
        for match in matches:
            print match
The reason the match count is 4 is that a matching line contains three nouns plus one conjunction, which is the correct number of matches. (Note that this count could also be reached with repeated nouns or conjunctions.)
EDIT:
This version prints the lines that are matched and the words matched. Also fixed the possible multiple word match problem:
words_matched = []
matching_lines = []

for l in lst:
    matches = [noun for noun in noun_list if noun in l] + [conj for conj in conjunctions if conj in l]
    invalid = True
    valid_count = 0
    for match in matches:
        if matches.count(match) == 1:
            valid_count += 1
    if valid_count == len(matches):
        invalid = False
    if not invalid:
        words_matched.append(matches)
        matching_lines.append(l)

for line, matches in zip(matching_lines, words_matched):
    print line, matches
However, if this doesn't suit you, you can always build the regex as follows (using the itertools module):
import itertools

# The number of permutation choices is 3 (as revealed from your examples)
for nouns, conj in itertools.product(itertools.permutations(noun_list, 3), conjunctions):
    matches = [noun for noun in nouns]
    matches.append(conj)
    # matches[:2] is the sublist containing the first 2 items, -1 is the last element,
    # and matches[2:-1] is the element before the last element (if the number of nouns
    # were more than 3, this would be the elements between the 2nd and last).
    regex_string = '\s,\s'.join(matches[:2]) + '\s' + matches[-1] + '\s' + '\s,\s'.join(matches[2:-1])
    print regex_string
    #... do regex related matching here
Adapt as required.
