I'm looking for an efficient way to match two lists, one which contains complete information and one which contains wildcards. I've been able to do this with wildcards of fixed length, but am now trying to do it with wildcards of variable length.
Thus:
match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] )
would return True as long as all the elements are in the same order in both lists.
I'm working with lists of objects, but used strings above for simplicity.
[edited to justify no RE after OP comment on comparing objects]
It appears you are not using strings, but rather comparing objects. I am therefore giving an explicit algorithm — regular expressions provide a good solution tailored for strings, don't get me wrong, but from what you say as a comment to your questions, it seems an explicit, simple algorithm may make things easier for you.
It turns out that this can be solved with a much simpler algorithm than this previous answer:
def matcher(l1, l2):
    if (l1 == []):
        return (l2 == [] or l2 == ['*'])
    if (l2 == [] or l2[0] == '*'):
        return matcher(l2, l1)
    if (l1[0] == '*'):
        return (matcher(l1, l2[1:]) or matcher(l1[1:], l2))
    if (l1[0] == l2[0]):
        return matcher(l1[1:], l2[1:])
    else:
        return False
The key idea is that when you encounter a wildcard, you can explore two options :
either advance in the list that contains the wildcard (and consider the wildcard matched whatever there was until now)
or advance in the list that doesn't contain the wildcard (and consider that whatever is at the head of the list has to be matched by the wildcard).
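The two options above can be sketched with memoization, so that each (pattern position, list position) pair is explored only once. This is a minimal sketch rather than the answer's exact algorithm: it assumes wildcards appear only in the first argument, and the names wildcard_match and go are made up for illustration. Elements are compared only with ==, so it also works on lists of arbitrary objects.

```python
from functools import lru_cache


def wildcard_match(pattern, items, wildcard='*'):
    """Return True if items matches pattern; wildcard covers any run, even an empty one."""

    @lru_cache(maxsize=None)
    def go(p, i):
        # Pattern used up: succeed only if the items are used up as well.
        if p == len(pattern):
            return i == len(items)
        if pattern[p] == wildcard:
            # Option 1: move past the wildcard (it has matched enough).
            # Option 2: let the wildcard absorb items[i] and stay on it.
            return go(p + 1, i) or (i < len(items) and go(p, i + 1))
        # Ordinary element: it must equal the current item.
        return i < len(items) and pattern[p] == items[i] and go(p + 1, i + 1)

    return go(0, 0)


print(wildcard_match(['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D']))
```

Because it recurses once per element, very long lists can hit Python's recursion limit; an iterative rewrite would avoid that.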
How about the following:
import re
def match(pat, lst):
    regex = ''.join(term if term != '*' else '.*' for term in pat) + '$'
    s = ''.join(lst)
    return re.match(regex, s) is not None
print match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] )
It uses regular expressions. Wildcards (*) are changed to .* and all other search terms are kept as-is.
One caveat is that if your search terms could contain things that have special meaning in the regex language, those would need to be properly escaped. It's pretty easy to handle this in the match function, I just wasn't sure if this was something you required.
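The escaping mentioned above could look roughly like this. It is only a sketch: match_escaped is a hypothetical name, and it assumes every element is a string. re.escape neutralises regex metacharacters in the literal terms, and re.fullmatch replaces the trailing '$'.

```python
import re


def match_escaped(pat, lst):
    # Wildcards become '.*'; every other term is matched literally.
    regex = ''.join('.*' if term == '*' else re.escape(term) for term in pat)
    return re.fullmatch(regex, ''.join(lst)) is not None


print(match_escaped(['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D']))
print(match_escaped(['a.b', '*'], ['axb']))  # '.' is now literal, so no match
```

Since the elements are joined without a separator, element boundaries are lost: ['AB'] and ['A', 'B'] look the same to this function.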
I'd recommend converting ['A', 'B', '*', 'D'] to '^AB.*D$', ['A', 'B', 'C', 'C', 'C', 'D'] to 'ABCCCD', and then using the re module (regular expressions) to do the match.
This will be valid if the elements of your lists are only one character each, and if they're strings.
something like:
import re
def myMatch( patternList, stringList ):
    # convert pattern to flat string with wildcards
    # convert AB*D to valid regex ^AB.*D$
    pattern = ''.join(patternList)
    regexPattern = '^' + pattern.replace('*','.*') + '$'
    # perform matching
    against = ''.join(stringList) # convert ['A','B','C','C','D'] to ABCCCD
    # return whether there is a match
    return (re.match(regexPattern,against) is not None)
If the lists contain numbers, or words, choose a character that you wouldn't expect to be in either, for example #. Then ['Aa','Bs','Ce','Cc','CC','Dd'] can be converted to Aa#Bs#Ce#Cc#CC#Dd, the wildcard pattern ['Aa','Bs','*','Dd'] could be converted to ^Aa#Bs#.*#Dd$, and the match performed.
Practically speaking this just means all the ''.join(...) becomes '#'.join(...) in myMatch.
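A sketch of the separator variant, under the assumption that '#' never occurs inside an element. SEP and my_match are illustrative names, and re.escape is an addition here to keep literal terms from being read as regex syntax.

```python
import re

SEP = '#'  # assumption: this character never occurs inside any element


def my_match(pattern_list, string_list):
    # Join the terms with the separator; '*' becomes '.*', the rest is literal.
    regex = SEP.join('.*' if term == '*' else re.escape(term)
                     for term in pattern_list)
    return re.fullmatch(regex, SEP.join(string_list)) is not None


print(my_match(['Aa', 'Bs', '*', 'Dd'], ['Aa', 'Bs', 'Ce', 'Cc', 'CC', 'Dd']))
print(my_match(['Aa', 'Bs', '*', 'Dd'], ['Aa', 'Bs', 'Dd']))
```

In this form the wildcard has to stand for at least one element: the second call returns False, because the regex requires a separator on both sides of the '.*'.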
I agree with the comment that this could be done with regular expressions. For example:
import re
lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = ['A', 'B', 'C+', 'D']
print re.match(''.join(pattern), ''.join(lst)) # Will successfully match
Edit: As pointed out by a comment, it might be known in advance just that some character has to be matched, but not which one. In that case, regular expressions are useful still:
import re
lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = r'AB(\w)\1*D'
print re.match(pattern, ''.join(lst)).groups()
I agree, regular expressions are usually the way to go with this sort of thing. This algorithm works, but it just looks convoluted to me. It was fun to write though.
def match(listx, listy):
    listx, listy = map(iter, (listx, listy))
    while 1:
        try:
            x = next(listx)
        except StopIteration:
            # listx is exhausted; check whether listy is exhausted too.
            try:
                y = next(listy)
            except StopIteration:
                # This means there are no more values to be compared in either
                # listx or listy; since no exception was raised elsewhere, the
                # lists match.
                return True
            else:
                # This means that there are values in listy that are not in
                # listx.
                return False
        else:
            try:
                y = next(listy)
            except StopIteration:
                # Similarly, there are values in listx that aren't in listy.
                return False
        if x == y:
            pass
        elif x == '*':
            try:
                # Get the value in listx after '*'.
                x = next(listx)
            except StopIteration:
                # This means that listx terminates with '*'. If there are any
                # remaining values of listy, they will, by definition, match.
                return True
            while 1:
                if x == y:
                    # I didn't shift to the next value in listy because I
                    # assume that a '*' matches the empty string as well as
                    # any other.
                    break
                else:
                    try:
                        y = next(listy)
                    except StopIteration:
                        # This means there is at least one remaining value in
                        # listx that is not in listy, because listy has no
                        # more values.
                        return False
                    else:
                        pass
        # Same algorithm as above, given there is a '*' in listy.
        elif y == '*':
            try:
                y = next(listy)
            except StopIteration:
                return True
            while 1:
                if x == y:
                    break
                else:
                    try:
                        x = next(listx)
                    except StopIteration:
                        return False
                    else:
                        pass
I had this c++ piece of code which seems to be doing what you are trying to do (inputs are strings instead of arrays of characters but you'll have to adapt stuff anyway).
bool Utils::stringMatchWithWildcards (const std::string str, const std::string strWithWildcards)
{
PRINT("Starting in stringMatchWithWildcards('" << str << "','" << strWithWildcards << "')");
const std::string wildcard="*";
const bool startWithWildcard=(strWithWildcards.find(wildcard)==0);
int pos=strWithWildcards.rfind(wildcard);
const bool endWithWildcard = (pos!=std::string::npos) && (pos+wildcard.size()==strWithWildcards.size());
// Basically, the point is to split the string with wildcards in strings with no wildcard.
// Then search in the first string for the different chunks of the second in the correct order
std::vector<std::string> vectStr;
boost::split(vectStr, strWithWildcards, boost::is_any_of(wildcard));
// I expected all the chunks in vectStr to be non-empty. It doesn't seem to be the case so let's remove the empty ones.
vectStr.erase(std::remove_if(vectStr.begin(), vectStr.end(), std::mem_fun_ref(&std::string::empty)), vectStr.end());
// Check if at least one element (to have first and last element)
if (vectStr.empty())
{
const bool matchEmptyCase = (startWithWildcard || endWithWildcard || str.empty());
PRINT("Match " << (matchEmptyCase?"":"un") << "successful (empty case) : '" << str << "' and '" << strWithWildcards << "'");
return matchEmptyCase;
}
// First Element
std::vector<std::string>::const_iterator vectStrIt = vectStr.begin();
std::string aStr=*vectStrIt;
if (!startWithWildcard && str.find(aStr, 0)!=0) {
PRINT("Match unsuccessful (beginning) : '" << str << "' and '" << strWithWildcards << "'");
return false;
}
// "Normal" Elements
bool found(true);
pos=0;
std::vector<std::string>::const_iterator vectStrEnd = vectStr.end();
for ( ; vectStrIt!=vectStrEnd ; vectStrIt++)
{
aStr=*vectStrIt;
PRINT( "Searching '" << aStr << "' in '" << str << "' from " << pos);
pos=str.find(aStr, pos);
if (pos==std::string::npos)
{
PRINT("Match unsuccessful ('" << aStr << "' not found) : '" << str << "' and '" << strWithWildcards << "'");
return false;
} else
{
PRINT( "Found at position " << pos);
pos+=aStr.size();
}
}
// Last Element
const bool matchEnd = (endWithWildcard || str.rfind(aStr)+aStr.size()==str.size());
PRINT("Match " << (matchEnd?"":"un") << "successful (usual case) : '" << str << "' and '" << strWithWildcards);
return matchEnd;
}
/* Tested on these values :
assert( stringMatchWithWildcards("ABC","ABC"));
assert( stringMatchWithWildcards("ABC","*"));
assert( stringMatchWithWildcards("ABC","*****"));
assert( stringMatchWithWildcards("ABC","*BC"));
assert( stringMatchWithWildcards("ABC","AB*"));
assert( stringMatchWithWildcards("ABC","A*C"));
assert( stringMatchWithWildcards("ABC","*C"));
assert( stringMatchWithWildcards("ABC","A*"));
assert(!stringMatchWithWildcards("ABC","BC"));
assert(!stringMatchWithWildcards("ABC","AB"));
assert(!stringMatchWithWildcards("ABC","AB*D"));
assert(!stringMatchWithWildcards("ABC",""));
assert( stringMatchWithWildcards("",""));
assert( stringMatchWithWildcards("","*"));
assert(!stringMatchWithWildcards("","ABC"));
*/
It's not something I'm really proud of but it seems to be working so far. I hope you can find it useful.
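For plain strings in Python, the standard library's fnmatch module offers comparable shell-style matching. The sketch below replays a few of the test values listed above. Note that fnmatch also gives ? and [...] a special meaning, which the C++ code does not, so the two are not drop-in equivalents.

```python
from fnmatch import fnmatchcase

# (string, pattern, expected result), taken from the test values above
cases = [
    ("ABC", "ABC", True),
    ("ABC", "*****", True),
    ("ABC", "A*C", True),
    ("ABC", "BC", False),
    ("ABC", "AB*D", False),
    ("ABC", "", False),
    ("", "", True),
    ("", "*", True),
    ("", "ABC", False),
]

for text, pattern, expected in cases:
    # fnmatchcase compares case-sensitively on every platform
    assert fnmatchcase(text, pattern) == expected
```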
Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.
I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in, 1234.
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
How to do the same thing in Python?
Using regular expressions - documentation for further reference
import re
text = 'gfgfdAAA1234ZZZuijjk'
m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)
# found: 1234
or:
import re
text = 'gfgfdAAA1234ZZZuijjk'
try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling
# found: 1234
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
Then you can use regexps with the re module as well, if you want, but that's not necessary in your case.
regular expression
import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)
The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text
string methods
your_text.partition("AAA")[2].partition("ZZZ")[0]
The above will return an empty string if either "AAA" or "ZZZ" don't exist in your_text.
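If a missing marker needs to be told apart from an empty match, the partition approach can be wrapped in a small helper. This is a sketch only: between is a hypothetical name, and both markers are assumed to be non-empty strings (str.partition raises ValueError on an empty separator).

```python
def between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    _, found_left, rest = text.partition(left)
    if not found_left:        # left marker is absent
        return None
    middle, found_right, _ = rest.partition(right)
    if not found_right:       # right marker is absent after the left one
        return None
    return middle


print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))
print(between('no markers here', 'AAA', 'ZZZ'))
```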
PS Python Challenge?
Surprised that nobody has mentioned this which is my quick version for one-off scripts:
>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'
You can do it using just one line of code:
>>> import re
>>> re.findall(r'\d{1,5}','gfgfdAAA1234ZZZuijjk')
['1234']
The result is returned as a list.
import re
print re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1)
You can use re module for that:
>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)
In Python, extracting a substring from a string can be done using the findall method of the regular expression (re) module.
>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print ss
['1234']
text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'
print(text[text.index(left)+len(left):text.index(right)])
Gives
string
>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
With sed it is possible to do something like this with a string:
echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"
And this will give me 1234 as a result.
You could do the same with re.sub function using the same regex.
>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'
In basic sed, a capturing group is written \(..\), but in Python it is written (..).
You can find first substring with this function in your code (by character index). Also, you can find what is after a substring.
def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset == None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1
# Example:
Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"
print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")
print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")
print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))
# Your answer:
Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"
AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforText2 = FindSubString(Text, subText2, 0)
print("\nYour answer:\n%s" %(Text[AfterText1:BeforText2]))
Using PyParsing
import pyparsing as pp
word = pp.Word(pp.alphanums)
s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)
which yields:
[['1234']]
One liner with Python 3.8 if text is guaranteed to contain the substring:
text[text.find(start:='AAA')+len(start):text.find('ZZZ')]
Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:
regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'
I.e. you need to escape the parentheses with a backslash (\). Though this is more a question about regular expressions than about Python.
Also, in some cases you may see an 'r' prefix before a regex definition. Without the r prefix, you need to double the backslashes, as in C string literals. Here is more discussion on that.
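A small sketch of what the r prefix changes, using the example line from this answer. The variable names are illustrative. Both spellings produce the same pattern string; the raw form just avoids doubling every backslash.

```python
import re

line = 'US president (Barack Obama) met with ...'

raw_pattern = r'\((.*?)\)'        # raw string: backslashes are kept as typed
plain_pattern = '\\((.*?)\\)'     # ordinary string: each backslash is doubled
assert raw_pattern == plain_pattern

m = re.search(raw_pattern, line)
print(m.group(1) if m else None)  # Barack Obama
```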
Also, you can find all combinations with the functions below:
s = 'Part 1. Part 2. Part 3 then more text'
def find_all_places(text,word):
    word_places = []
    i=0
    while True:
        word_place = text.find(word,i)
        if word_place<0:
            break
        word_places.append(word_place)
        # continue searching right after this occurrence
        i = word_place+len(word)
    return word_places
def find_all_combination(text,start,end):
    start_places = find_all_places(text,start)
    end_places = find_all_places(text,end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place>=end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list
find_all_combination(s,"Part","Part")
result:
['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']
In case you want to look for multiple occurrences:
content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )
Or more concisely:
strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
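For comparison, the same multiple-occurrence extraction can be sketched with re.findall and a non-greedy group. This is an alternative to the split-based code, not a rewrite of it, and the two can differ on input where '_Suffix' appears before the first 'Prefix_'.

```python
import re

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"

# (.*?) is non-greedy, so each match stops at the nearest '_Suffix'
strings = re.findall(r'Prefix_(.*?)_Suffix', content)
print(strings)  # ['helloworld', '42']
```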
Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.
def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]
Another way of doing it is using lists (supposing the substring you are looking for is made of numbers, only) :
string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []
for char in string:
    if char in numbersList: output.append(char)
print(f"output: {''.join(output)}")
### output: 1234
Typescript. Gets string in between two other strings.
Searches shortest string between prefixes and postfixes
prefixes - string / array of strings / null (means search from the start).
postfixes - string / array of strings / null (means search until the end).
public getStringInBetween(str: string, prefixes: string | string[] | null,
postfixes: string | string[] | null): string {
if (typeof prefixes === 'string') {
prefixes = [prefixes];
}
if (typeof postfixes === 'string') {
postfixes = [postfixes];
}
if (!str || str.length < 1) {
throw new Error(str + ' should contain ' + prefixes);
}
let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);
let value = str.substring(start.pos + start.sub.length, end.pos);
if (!value || value.length < 1) {
throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
}
while (true) {
try {
start = this.indexOf(value, prefixes);
} catch (e) {
break;
}
value = value.substring(start.pos + start.sub.length);
if (!value || value.length < 1) {
throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
}
}
return value;
}
a simple approach could be the following:
string_to_search_in = 'could be anything'
start = string_to_search_in.find(str("sub string u want to identify"))
length = len("sub string u want to identify")
First_part_removed = string_to_search_in[start:]
end_coord = length
Extracted_substring=First_part_removed[:end_coord]
One liners that return other string if there was no match.
Edit: improved version uses next function, replace "not-found" with something else if needed:
import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )
My other method to do this, less optimal, uses regex 2nd time, still didn't found a shorter way:
import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
string = "Tes.t / &hi-&"
Expected Output - ["Tes" , "." , "t" , " " ," /" , "&" , "hi" ,"-", "&"]
or
Expected Output - ["Tes" , "." , "t" , " / &" , "hi" , "-&"]
The latter output would be preferable, but either would work.
Code
def splitnonalpha(s):
    """Split whenever the type of the following character is different (i.e. alpha or non-alpha)"""
    current = s[0]
    result = []
    for pos in range(1, len(s)):
        if s[pos].isalpha() and current[-1].isalpha():
            current += s[pos] # same type as previous
        elif not s[pos].isalpha() and not current[-1].isalpha():
            current += s[pos] # same type as previous
        else:
            # Different type-->store current, and reset to current character
            result.append(current)
            current = s[pos]
    if current:
        result.append(current)
    return result
Test
s = "Tes.t / &hi-&"
print(splitnonalpha(s))
Output
['Tes', '.', 't', ' / &', 'hi', '-&']
You could check whether each character is in ascii_letters, and either add it to the current substring or start a new one depending on whether the previous character was of the same kind. This could look like:
from string import ascii_letters
import sys
from typing import List
def main(input_string: str) -> List[str]:
    output = []
    sub_string = ''
    last_was_ascii = None
    for char in input_string:
        char_is_ascii = char in ascii_letters
        if last_was_ascii is None or char_is_ascii == last_was_ascii:
            sub_string += char
        else:
            output.append(sub_string)
            sub_string = char
        last_was_ascii = char_is_ascii
    output.append(sub_string)
    print(output)
    return output
if __name__ == "__main__":
    main(*sys.argv[1:])
Which given the command line input python example_file.py "Tes.t / &hi-&" will print ['Tes', '.', 't', ' / &', 'hi', '-&'], i.e. the second example you have listed.
It's a little verbose, but it does the trick.
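A shorter sketch of the same run-splitting idea uses itertools.groupby, which groups consecutive characters sharing the same key. The function name split_by_alpha is made up here. Note that str.isalpha is Unicode-aware, so it treats accented letters as letters, unlike the ascii_letters test above.

```python
from itertools import groupby


def split_by_alpha(s):
    # Consecutive characters with the same isalpha() result form one group.
    return [''.join(group) for _, group in groupby(s, key=str.isalpha)]


print(split_by_alpha("Tes.t / &hi-&"))  # ['Tes', '.', 't', ' / &', 'hi', '-&']
```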
one solution is to use regex:
find all alphanumerical:
an = re.findall("[a-zA-Z0-9]+", s)
find all non alphanumerical:
non_an = re.findall("[^a-zA-Z0-9]+", s)
zip them:
ziped = zip(an, non_an)
flatten the zip:
flat = sum(ziped, ())
or in a one liner:
sum(zip(re.findall("[a-zA-Z0-9]+", s), re.findall("[^a-zA-Z0-9]+", s)), ())
To cover cases that include more alphanumerical runs than non-alphanumerical ones (or vice versa), use itertools.zip_longest() and drop the None fill values:
from itertools import zip_longest
[x for x in sum(zip_longest(re.findall(r"\w+", s), re.findall(r"[\W]+", s)), ()) if x]
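Zipping two separate findall results assumes the string starts with an alphanumerical run; for a string such as "-&hi" the pieces come back in the wrong order. A single findall with an alternation keeps the original order regardless. The sketch below uses a made-up function name, split_alnum.

```python
import re


def split_alnum(s):
    # Each match is either a run of alphanumerics or a run of everything else.
    return re.findall(r'[A-Za-z0-9]+|[^A-Za-z0-9]+', s)


print(split_alnum("Tes.t / &hi-&"))  # ['Tes', '.', 't', ' / &', 'hi', '-&']
print(split_alnum("-&hi"))           # ['-&', 'hi']
```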
Suppose I have a string consisting of the 3 letters 'a', 'b', 'c', and I need to shorten it by replacing every two different adjacent characters with the 3rd character. What is the best way to do it?
example:
aa (no change, substitution only done with different characters)
bcab > aab > ac > b (NOT bbb, because b is the shortest)
aba > ca > b
a (no change)
I did the following, but I guess there is a better solution or algorithm:
def replaceChar(input_string):
    possibilites = {'ab':'c',
                    'bc':'a',
                    'ca':'b',
                    'cb':'a',
                    'ac':'b',
                    'ba':'c'
                    }
    for key, value in possibilites.items():
        input_string = input_string.replace(key, value)
    char_game(input_string)
def char_game(input_string):
    if len( list(set(input_string)) ) == 1: print(input_string)
    elif len( list(set(input_string)) ) >= 2 : replaceChar(input_string)
    else: print( input_string )
I agree with @gene that your solution might not give the best possible result.
But if you want to go with your approach, the following O(N) solution with an additional stack might do the trick:
def getchar(input_string):
    rep = {
        'ab':'c',
        'bc':'a',
        'ca':'b',
        'cb':'a',
        'ac':'b',
        'ba':'c'
    }
    stack = []
    for c in input_string:
        t = c
        while len(stack) and stack[-1] != t:
            t = rep[t+stack[-1]]
            stack.pop()
        stack.append(t)
    return stack
Here's my first Python program, a little utility that converts from a Unix octal code for file permissions to the symbolic form:
s=raw_input("Octal? ");
digits=[int(s[0]),int(s[1]),int(s[2])];
lookup=['','x','w','wx','r','rx','rw','rwx'];
uout='u='+lookup[digits[0]];
gout='g='+lookup[digits[1]];
oout='o='+lookup[digits[2]];
print(uout+','+gout+','+oout);
Are there ways to shorten this code that take advantage of some kind of "list processing"? For example, to apply the int function all at once to all three characters of s without having to do explicit indexing. And to index into lookup using the whole list digits at once?
digits=[int(s[0]),int(s[1]),int(s[2])];
can be written as:
digits = map(int,s)
or:
digits = [ int(x) for x in s ] #list comprehension
As it looks like you might be using python3.x (or planning on using it in the future based on your function-like print usage), you may want to opt for the list-comprehension unless you want to dig in further and use zip as demonstrated by one of the later answers.
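A Python 3 sketch of the whole utility along these lines; the function name symbolic is made up for illustration. The list comprehension is used because in Python 3 map returns a lazy iterator, which cannot be indexed like the original digits list.

```python
def symbolic(octal):
    """Convert a three-digit octal permission string to symbolic form."""
    lookup = ['', 'x', 'w', 'wx', 'r', 'rx', 'rw', 'rwx']
    digits = [int(ch) for ch in octal]   # one int per character
    return ','.join('{}={}'.format(who, lookup[d])
                    for who, d in zip('ugo', digits))


print(symbolic('755'))  # u=rwx,g=rx,o=rx
```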
Here is a slightly optimized version of your code:
s = raw_input("Octal? ")
digits = map(int, s)
lookup = ['','x','w','wx','r','rx','rw','rwx']
perms = [lookup[d] for d in digits]
rights = ['{}={}'.format(*x) for x in zip('ugo', perms)]
print ','.join(rights)
You can also do it with bitmasks:
masks = {
    0b100: 'r', # 4
    0b010: 'w', # 2
    0b001: 'x' # 1
}
octal = raw_input('Octal? ')
result = '-'
for digit in octal[1:]:
    for mask, letter in sorted(masks.items(), reverse=True):
        if int(digit, 8) & mask:
            result += letter
        else:
            result += '-'
print result
Here's my version, inspired by Blender's solution:
bits = zip([4, 2, 1], "rwx")
groups = "ugo"
s = raw_input("Octal? ");
digits = map(int, s)
parts = []
for group, digit in zip(groups, digits):
    letters = [letter for bit, letter in bits if digit & bit]
    parts.append("{0}={1}".format(group, "".join(letters)))
print ",".join(parts)
I think it's better not to have to explicitly enter the lookup list.
Here's my crack at it (including '-' for missing permissions):
lookup = {
0b000 : '---',
0b001 : '--x',
0b010 : '-w-',
0b011 : '-wx',
0b100 : 'r--',
0b101 : 'r-x',
0b110 : 'rw-',
0b111 : 'rwx'
}
s = raw_input('octal?: ')
print(','.join( # using ',' as the delimiter
r + '=' + lookup[int(n, 8)] # the letter followed by the permissions
for n, r in zip(tuple(s), 'ugo'))) # for each number/ letter pair
Apparently this problem comes up fairly often. After reading
Regular expression to detect semi-colon terminated C++ for & while loops
and thinking about the problem for a while, I wrote a function to return the content contained inside an arbitrary number of nested ().
The function could easily be extended to any regular expression object; I am posting it here for your thoughts and consideration.
Any refactoring advice would be appreciated.
(Note: I'm still new to Python, and didn't feel like figuring out how to raise exceptions or whatever, so I just had the function return 'fail' if it couldn't figure out what was going on.)
Edited function to take into account comments:
def ParseNestedParen(string, level):
    """
    Return string contained in nested (), indexing i = level
    """
    CountLeft = len(re.findall(r"\(", string))
    CountRight = len(re.findall(r"\)", string))
    if CountLeft == CountRight:
        LeftRightIndex = [x for x in zip(
            [Left.start()+1 for Left in re.finditer(r'\(', string)],
            reversed([Right.start() for Right in re.finditer(r'\)', string)]))]
    elif CountLeft > CountRight:
        return ParseNestedParen(string + ')', level)
    elif CountLeft < CountRight:
        return ParseNestedParen('(' + string, level)
    return string[LeftRightIndex[level][0]:LeftRightIndex[level][1]]
You don't make it clear exactly what the specification of your function is, but this behaviour seems wrong to me:
>>> ParseNestedParen('(a)(b)(c)', 0)
['a)(b)(c']
>>> nested_paren.ParseNestedParen('(a)(b)(c)', 1)
['b']
>>> nested_paren.ParseNestedParen('(a)(b)(c)', 2)
['']
Other comments on your code:
Docstring says "generate", but function returns a list, not a generator.
Since only one string is ever returned, why return it in a list?
Under what circumstances can the function return the string fail?
Repeatedly calling re.findall and then throwing away the result is wasteful.
You attempt to rebalance the parentheses in the string, but you do so only one parenthesis at a time:
>>> ParseNestedParen(')' * 1000, 1)
RuntimeError: maximum recursion depth exceeded while calling a Python object
As Thomi said in the question you linked to, "regular expressions really are the wrong tool for the job!"
The usual way to parse nested expressions is to use a stack, along these lines:
def parenthetic_contents(string):
    """Generate parenthesized contents in string as pairs (level, contents)."""
    stack = []
    for i, c in enumerate(string):
        if c == '(':
            stack.append(i)
        elif c == ')' and stack:
            start = stack.pop()
            yield (len(stack), string[start + 1: i])
>>> list(parenthetic_contents('(a(b(c)(d)e)(f)g)'))
[(2, 'c'), (2, 'd'), (1, 'b(c)(d)e'), (1, 'f'), (0, 'a(b(c)(d)e)(f)g')]
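Building on the same stack idea, a level-indexed lookup like the one asked for in the question could be sketched as follows. The name contents_at_level is hypothetical, and raising ValueError on unbalanced input is a design choice made here instead of silently rebalancing the string or returning 'fail'.

```python
def contents_at_level(string, level):
    """Return every parenthesized substring nested exactly `level` deep (0 = outermost)."""
    stack = []   # indices of '(' still waiting for their ')'
    found = []
    for i, c in enumerate(string):
        if c == '(':
            stack.append(i)
        elif c == ')':
            if not stack:
                raise ValueError("unmatched ')' at index %d" % i)
            start = stack.pop()
            # After the pop, len(stack) is the depth of the pair just closed.
            if len(stack) == level:
                found.append(string[start + 1:i])
    if stack:
        raise ValueError("unmatched '(' at index %d" % stack[-1])
    return found


print(contents_at_level('(a(b(c)(d)e)(f)g)', 1))  # ['b(c)(d)e', 'f']
```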
Parentheses matching requires a parser with a push-down automaton. Some libraries exist, but the rules are simple enough that we can write it from scratch:
def push(obj, l, depth):
    while depth:
        l = l[-1]
        depth -= 1
    l.append(obj)
def parse_parentheses(s):
    groups = []
    depth = 0
    try:
        for char in s:
            if char == '(':
                push([], groups, depth)
                depth += 1
            elif char == ')':
                depth -= 1
            else:
                push(char, groups, depth)
    except IndexError:
        raise ValueError('Parentheses mismatch')
    if depth > 0:
        raise ValueError('Parentheses mismatch')
    else:
        return groups
print(parse_parentheses('a(b(cd)f)')) # ['a', ['b', ['c', 'd'], 'f']]
Below is my Python solution with a time complexity of O(N)
str1 = "(a(b(c)d)(e(f)g)hi)"
def content_by_level(str1, l):
    level_dict = {}
    level = 0
    level_char = ''
    for s in str1:
        if s == '(':
            if level not in level_dict:
                level_dict[level] = [level_char]
            elif level_char != '':
                level_dict[level].append(level_char)
            level_char = ''
            level += 1
        elif s == ')':
            if level not in level_dict:
                level_dict[level] = [level_char]
            elif level_char != '':
                level_dict[level].append(level_char)
            level_char = ''
            level -= 1
        else:
            level_char += s
    print(level_dict) # {0: [''], 1: ['a', 'hi'], 2: ['b', 'd', 'e', 'g'], 3: ['c', 'f']}
    return level_dict[l]
print(content_by_level(str1,0)) # ['']
print(content_by_level(str1,1)) # ['a', 'hi']
print(content_by_level(str1,2)) # ['b', 'd', 'e', 'g']
print(content_by_level(str1,3)) # ['c', 'f']
#!/usr/bin/env python
import re
def ParseNestedParen(string, level):
    """
    Generate strings contained in nested (), indexing i = level
    """
    if len(re.findall(r"\(", string)) == len(re.findall(r"\)", string)):
        LeftRightIndex = [x for x in zip(
            [Left.start()+1 for Left in re.finditer(r'\(', string)],
            reversed([Right.start() for Right in re.finditer(r'\)', string)]))]
    elif len(re.findall(r"\(", string)) > len(re.findall(r"\)", string)):
        return ParseNestedParen(string + ')', level)
    elif len(re.findall(r"\(", string)) < len(re.findall(r"\)", string)):
        return ParseNestedParen('(' + string, level)
    else:
        return 'fail'
    return [string[LeftRightIndex[level][0]:LeftRightIndex[level][1]]]
Tests:
if __name__ == '__main__':
    teststring = "outer(first(second(third)second)first)outer"
    print(ParseNestedParen(teststring, 0))
    print(ParseNestedParen(teststring, 1))
    print(ParseNestedParen(teststring, 2))
    teststring_2 = "outer(first(second(third)second)"
    print(ParseNestedParen(teststring_2, 0))
    print(ParseNestedParen(teststring_2, 1))
    print(ParseNestedParen(teststring_2, 2))
    teststring_3 = "second(third)second)first)outer"
    print(ParseNestedParen(teststring_3, 0))
    print(ParseNestedParen(teststring_3, 1))
    print(ParseNestedParen(teststring_3, 2))
output:
Running tool: python3.1
['first(second(third)second)first']
['second(third)second']
['third']
['first(second(third)second)']
['second(third)second']
['third']
['(second(third)second)first']
['second(third)second']
['third']