Python regex that repeats \d number of times - python

Using Python regex, I am trying to match as many p characters as the digit first matched in the pattern.
Sample Input
1pp
2p
3ppp
4ppppppppp
Expected Output
1p
None
3ppp
4pppp
Code Tried
I have tried the following code, where I use a named group and give the name 'dig' to the matched digit. I then want to use dig in the repetition {m}, but the following code does not find any match in pattern.
pattern = "2pppp"
reTriple = '((?P<dig>\d)p{(?P=dig)})'
regex = re.compile(reTriple,re.IGNORECASE)
matches = re.finditer(regex,pattern)
I think the problem is that repetition {m} expects an int m, whereas dig is a string. But I can't find a way to concatenate an int to a string while keeping it an int! I tried casting as follows:
reTrip = '((?P<dig>\d)p{%d}'%int('(?P=dig)')+')'
But I get the following error:
ValueError: invalid literal for int() with base 10: '(?P=dig)'
I feel stuck. Can someone please guide.
And it's weird that if I instead break reTriple up as follows (save the matched digit in a variable first and then concatenate this variable into reTriple), it works, and the expected output is achieved. But this is a workaround, and I am looking for a better method.
reTriple = '(?P<dig>\d)'
dig = re.search(reTriple , pattern).group('dig')
reTriple = reTriple + '(p{1,' + dig + '})'

It seems that what you are trying basically comes down to (\d+)p{\1}, where capture group 1 would control how often "p" has to be matched. However, a backreference cannot be used as a repeat count: the captured group is only available as text while matching, and the {m} quantifier needs a fixed number when the pattern is compiled, so you find no results.
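To see why the pattern from the question finds nothing, it may help to check what the engine does with the braces. This is a small sketch (the test strings are made up for illustration): because `{` is not followed by a number, Python's re treats `{` and `}` as literal characters, so the pattern only matches text that really contains braces.

```python
import re

# The pattern from the question. The braces do not form a valid {m} quantifier,
# so re treats them as literal '{' and '}' characters.
pattern = re.compile(r'(?P<dig>\d)p{(?P=dig)}', re.IGNORECASE)

no_match = pattern.search('2pp')         # what the question hoped would match
literal_match = pattern.search('2p{2}')  # text that contains literal braces

print(no_match)
print(literal_match.group(0) if literal_match else None)
```

So the count has to be read in Python code first and then put into the pattern (or checked after matching).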
Maybe it helps to split this into two operations. For example:
import re
def val_txt(txt):
    i = int(re.search(r'\d+', txt).group(0))
    fnd = re.compile(fr'(?i)\d+p{{{i}}}')
    if fnd.search(txt):
        return fnd.search(txt).group(0)
print(val_txt('2p'))

You can also do pure string operations without depending on any module for the mentioned strings in the question (digits < 10):
def val_txt(txt):
    dig = int(txt[0])
    rest_val = 'p' * dig
    return f'{dig}{rest_val}' if txt[1:1+dig] == rest_val else None
print(val_txt('1ppp'))
# 1p

Another approach, without regex:
from typing import Union
def test(txt: str, var: str ='p') -> Union[str, None]:
    var_count = txt.count(var)
    number = int(txt[0:len(txt) - var_count:])
    if number <= var_count:
        return f'{number}{number * var}'
    return None
Let's test it:
t = ['1pp', '2p', '3ppp', '4ppppppppp', '10pppppppppp']
for i in t:
    print(test(i))
Output:
1p
None
3ppp
4pppp
10pppppppppp

Here's a single step regex solution which uses a lambda function to check if there are sufficient p's to match the digits at the beginning of the string; if there are it returns the appropriate string (e.g. 1p or 3ppp), otherwise it returns an empty string:
import re
strs = ['1pp',
        '2p',
        '3ppp',
        '4ppppppppp'
       ]
for s in strs:
    print(re.sub(r'^(\d+)(p+).*', lambda m: m.group(1) + m.group(2)[:int(m.group(1))] if len(m.group(2)) >= int(m.group(1)) else '', s))
Output (for 2p an empty line is printed, omitted here):
1p
3ppp
4pppp
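If None is preferred over an empty string for inputs with too few p's (as in the expected output of the question), the same idea can be written as a small function. This is only a sketch; the name `match_counted` and the `char` parameter are made up for illustration.

```python
import re

def match_counted(txt, char='p'):
    """Return the leading number plus that many copies of char, or None."""
    # Group 1: the leading digits, group 2: the run of char that follows them.
    m = re.match(r'(\d+)(' + re.escape(char) + r'*)', txt)
    if m is None:
        return None
    count = int(m.group(1))
    run = m.group(2)
    if len(run) < count:
        return None  # not enough repetitions after the number
    return m.group(1) + run[:count]

samples = ['1pp', '2p', '3ppp', '4ppppppppp']
results = [match_counted(s) for s in samples]
print(results)
```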

Related

Replacing sub-string occurrences with elements of a given list

Suppose I have a string that has the same sub-string repeated multiple times and I want to replace each occurrence with a different element from a list.
For example, consider this scenario:
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
The goal is to obtain a string of the form:
s = "a(_001_), b(_002_), c(_003_)"
The number of occurrences is known, and the list r has the same length as the number of occurrences (3 in the previous example) and contains increasing integers starting from 0.
I came up with this solution:
import re
pattern = "_____"
s = "a(_____), b(_____), c(_____)"
l = [m.start() for m in re.finditer(pattern, s)]
i = 0
for el in l:
    s = s[:el] + f"_{str(i).zfill(5 - 2)}_" + s[el + 5:]
    i += 1
print(s)
Output: a(_000_), b(_001_), c(_002_)
This solves my problem, but it seems to me a bit cumbersome, especially the for-loop. Is there a better way, maybe more "pythonic" (intended as concise, possibly elegant, whatever it means) to solve the task?
You can simply use re.sub() method to replace each occurrence of the pattern with a different element from the list.
import re
pattern = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0,1,2]
for val in r:
    s = re.sub(pattern, f"_{val:03d}_", s, count=1)
print(s)
You can also go without re and build the string directly from the values in the r list together with their indexes:
r = [0,1,2]
s = ", ".join(f"{'abc'[i]}(_{val:03d}_)" for i, val in enumerate(r))
print(s)
a(_000_), b(_001_), c(_002_)
TL;DR
Use re.sub with a replacement callable and an iterator:
import re
p = re.compile("_____")
s = "a(_____), b(_____), c(_____)"
r = [0, 1, 2]
it = iter(r)
print(re.sub(p, lambda _: f"_{next(it):03d}_", s))
Long version
Generally speaking, it is a good idea to re.compile your pattern once ahead of time. If you are going to use that pattern repeatedly later, this makes the regex calls much more efficient. There is basically no downside to compiling the pattern, so I would just make it a habit.
As for avoiding the for-loop altogether, the re.sub function allows us to pass a callable as the repl argument, which takes a re.Match object as its only argument and returns a string. Wouldn't it be nice, if we could have such a replacement function that takes the next element from our replacements list every time it is called?
Well, since you have an iterable of replacement elements, we can leverage the iterator protocol to avoid explicit looping over the elements. All we need to do is give our replacement function access to an iterator over those elements, so that it can grab a new one via the next function every time it is called.
The string format specification that Jamiu used in his answer is great if you know that the sub-string to be replaced will always be exactly five underscores (_____) and that your replacement numbers will never exceed 999 (larger numbers make the replacement wider than the original five characters).
So in its simplest form, a function doing what you described, could look like this:
import re
from collections.abc import Iterable
def multi_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)
    def repl(_match: re.Match[str]) -> str:
        return f"_{next(iterator):03d}_"
    return re.sub(pattern, repl, string)
Trying it out with your example data:
if __name__ == "__main__":
    p = re.compile("_____")
    s = "a(_____), b(_____), c(_____)"
    r = [0, 1, 2]
    print(multi_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_)
In this simple application, we aren't doing anything with the Match object in our replacement function.
If you want to make it a bit more flexible, there are a few avenues possible. Let's say the sub-strings to replace might (perhaps unexpectedly) be a different number of underscores. Let's further assume that the numbers might get bigger than 999.
First of all, the pattern would need to change a bit. And if we still want to center the replacement in an arbitrary number of underscores, we'll actually need to access the match object in our replacement function to check the number of underscores.
The format specifiers are still useful because they allow centering the inserted object with the ^ align code.
import re
from collections.abc import Iterable
def dynamic_replace(
    pattern: re.Pattern[str],
    replacements: Iterable[int],
    string: str,
) -> str:
    iterator = iter(replacements)
    def repl(match: re.Match[str]) -> str:
        replacement = f"{next(iterator):03d}"
        length = len(match.group())
        return f"{replacement:_^{length}}"
    return re.sub(pattern, repl, string)
if __name__ == "__main__":
    p = re.compile("(_+)")
    s = "a(_____), b(_____), c(_____), d(_______), e(___)"
    r = [0, 1, 2, 30, 4000]
    print(dynamic_replace(p, r, s))
Output: a(_000_), b(_001_), c(_002_), d(__030__), e(4000)
Here we are building the replacement string based on the length of the match group (i.e. the number of underscores) to ensure the number is always centered.
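The two format specifications used above (zero-padding with 03d and centering with ^) can be tried in isolation. A minimal sketch, where `center_number` and its parameters are hypothetical names chosen for this illustration:

```python
def center_number(value, width, fill='_', min_digits=3):
    # First zero-pad the number to at least min_digits digits ...
    digits = f"{value:0{min_digits}d}"
    # ... then center it in a field of the given width using the fill character.
    return f"{digits:{fill}^{width}}"

print(center_number(0, 5))
print(center_number(30, 7))
print(center_number(4000, 3))
```

A number that is already wider than the field is returned unpadded, which is why e(___) became e(4000) in the output above.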
I think you get the idea. As always, separation of concerns is a good idea. You can put the replacement logic in its own function and refer to that, whenever you need to adjust it.
I don't think regex is the best fit for this situation.
pattern = "_____" # repeated pattern
s = "a(_____), b(_____), c(_____)"
r = [0,1,2] # elements to insert
fstring = s.replace(pattern, "_{}_")
str_out = fstring.format(*r)
str_out_pad = fstring.format(*[str(entry).zfill(3) for entry in r])
print(str_out)
print(str_out_pad)
Output:
a(_0_), b(_1_), c(_2_)
a(_000_), b(_001_), c(_002_)

Complex string filtering with python

I have a long string that is a phylogenetic tree and I want to do a very specific filtering.
(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically every x#y is a species#gene_id information. What I am trying to do is trimming this down so that I will only have x instead of x#y.
(Esy, Aar,(Spa,Cpl))...
I tried splitting the string first but the problem is string has different 'split points' for what I want to achieve i.e. some parts x#y is ending with a , and others with a ). I searched for a solution and saw regular expression operations, but I am new to Python and I couldn't be sure if that is what I should be focusing on. I also thought about strip() but it seems like I need to specify the characters to be stripped for this.
Main problem is there is no 'pattern' for me to tell Python to follow. Only thing is that all species ids are 3 letters and they are before an # character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.
Give this a try:
import re
pat = re.compile(r'(\w{3})#')
txt = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)
Result:
['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the structure intact, we can try to remove the unnecessary parts instead:
pat = re.compile(r'(#|:)[^/),]*')
pat.sub('', txt).replace(',', ', ')
Result:
'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
How about this kind of function:
def parse_string(string):
    new_string = ''
    skip = False
    for char in string:
        if char == '#':
            skip = True
        if char == ',':
            skip = False
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string
Calling it on your string:
string = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
you can use regex:
import re
s = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = r"...?(?=#)|\(|\)"
result = re.findall(p, s)
This gives you the result as a list, which you can join into a string or process further.
What is happening:
p is the regular expression pattern.
. matches any single character.
...?(?=#) matches two or three characters that are immediately followed by #; the lookahead (?=#) checks for the # without including it in the match. Since every species id has three letters, this captures the three characters before each #.
| means "or"; it is used here to add alternative patterns.
\( and \) match the literal parentheses.
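A quick way to check what this pattern captures is to run it on a short string. The tree below is a made-up miniature in the same species#gene:length layout, not data from the question:

```python
import re

# Hypothetical short tree, same layout as the one in the question.
sample = "(Esy#g1:0.1,(Aar#g2:0.2,Spa#g3:0.3):0.4);"

# Two or three characters followed by '#', or a literal parenthesis.
tokens = re.findall(r"...?(?=#)|\(|\)", sample)
print(tokens)
print(''.join(tokens))
```

Note that commas are not captured by the pattern, so joining the tokens loses the separators between sibling species.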
Try this regex if you need the brackets in the output:
import re
regex = r"#[A-Za-z0-9_\.:]+|[0-9:\.;e-]+"
phylogenetic_tree = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
print(re.sub(regex,"",phylogenetic_tree))
Output (note that Hse is cut to Hs, because the second alternative also matches a lone letter e):
(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hs)),(Esa,Aal))))
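One way to avoid cutting letters out of the species names is to require the colon in front of the branch length. This is a sketch under the assumption that gene ids consist of letters, digits, underscores and dots, and that branch lengths are plain or scientific-notation numbers; the short tree is made up for illustration:

```python
import re

# Hypothetical short tree in the same species#gene:length layout.
tree = "((Hla#DN22821_c0:0.0225,Hse#DN23412_c0:0.0485):9.05e-05,Esa#Thhalv1000m:1.0e-06):0.0129;"

# Remove '#gene_id', then ':branch_length', then the trailing ';'.
strip_pattern = re.compile(r"#[A-Za-z0-9_.]+|:[0-9.e+-]+|;")
stripped = strip_pattern.sub("", tree)
print(stripped)
```

Because the numeric alternative only starts at a colon, the e in Hse is left alone.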
Because you are trying to parse a phylogenetic tree, I highly suggest to let BioPython do the heavy lifting for you.
You can easily parse and display a phylogenetic tree with Bio.Phylo. Then it is just a matter of iterating over all tree elements and splitting the names at the '#' sign.
Because Phylo expects the input to be in a file, we create an in-memory file-like object with io.StringIO. Getting the complete tree is then as easy as
Phylo.read(io.StringIO(s), 'newick')
In order to check if the parsed tree looks sane, I print it once with print(tree).
Now we want to change all node names that contain a '#'. With tree.find_elements we get access to all nodes. Some nodes don't have a name and some might not contain a '#'. So to be extra careful, we first check if n.name and '#' in n.name. Only then do we split each node's name at the '#' and take just the first part (index 0) of it:
n.name = n.name.split('#')[0]
In order to recreate the initial string representation, we use Phylo.write:
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Again, write wants to get a file argument - if we just want to get a string, we can use a StringIO object again.
Full code:
import io
from Bio import Phylo
if __name__ == '__main__':
    s = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
    tree = Phylo.read(io.StringIO(s), 'newick')
    print(' before '.center(20, '='))
    print(tree)
    for n in tree.find_elements():
        if n.name and '#' in n.name:
            n.name = n.name.split('#')[0]
    print(' result '.center(20, '='))
    out = io.StringIO()
    Phylo.write(tree, out, "newick")
    print(out.getvalue())
Output:
====== before ======
Tree(rooted=False, weight=1.0)
    Clade(branch_length=0.0129090235079)
        Clade(branch_length=0.0726396855636, name='Esy#ESY15_g64743_DN3_SP7_c0')
        Clade(branch_length=0.137507902808, name='Aar#AA_maker7399_1')
        Clade(branch_length=0.0129090235079)
            Clade(branch_length=9.05326020871e-05)
                Clade(branch_length=0.0318934795022, name='Spa#Tp2g18720')
                Clade(branch_length=0.0273465005242, name='Cpl#CP2_g48793_DN3_SP8_c')
            Clade(branch_length=0.00328120860999)
                Clade(branch_length=0.00859075940423)
                    Clade(branch_length=0.0340484449097)
                        Clade(branch_length=0.0332592496158, name='Bst#Bostr_13083s0053_1')
                        Clade(branch_length=0.0150356382287)
                            Clade(branch_length=0.0205924636564)
                                Clade(branch_length=0.0328569260951, name='Aly#AL8G21130_t1')
                                Clade(branch_length=0.0391706378372, name='Ath#AT5G48370_1')
                            Clade(branch_length=0.00998579652059)
                                Clade(branch_length=0.0954469923893, name='Chi#CARHR183840_1')
                                Clade(branch_length=0.0570981548016, name='Cru#Carubv10026342m')
                    Clade(branch_length=0.0372829371381)
                        Clade(branch_length=0.0206478928557)
                            Clade(branch_length=0.0144626717872)
                                Clade(branch_length=0.00823215335663, name='Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
                                Clade(branch_length=0.0085462978729, name='Hlo#DN13684_c0_g1_i1_p1')
                            Clade(branch_length=0.0225079453622, name='Hla#DN22821_c0_g1_i1_p1')
                        Clade(branch_length=0.048590776459, name='Hse#DN23412_c0_g1_i3_p1')
                Clade(branch_length=1.00000050003e-06)
                    Clade(branch_length=0.0378509854703, name='Esa#Thhalv10004228m')
                    Clade(branch_length=0.0712272454125, name='Aal#Aa_G102140_t1')
====== result ======
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;
The default format of Phylo uses fewer digits than your original tree. In order to keep the numbers unchanged, just override the branch length format string with a '%s':
Phylo.write(tree, out, "newick", format_branch_length="%s")
Parsing code can be hard to follow. Tatsu lets you write readable parsing code by combining grammars and python:
text = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
import sys
import tatsu
grammar = """
start = things ';'
;
things = thing [ ',' things ]
;
thing = x '#' y ':' number
| '(' things ')' ':' number
;
x = /\w+/
;
y = /\w+/
;
number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
;
"""
class Semantics:
    def x(self, ast):
        # the method name matches the rule name
        print('X =', ast)
parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)

Printing substrings' patterns from a string in Python

The input to this problem is a string that has a specific form. For example, if s is a string then inputs can be s='3(a)2(b)' or s='3(aa)2(bbb)' or s='4(aaaa)'. The output should be a string in which the substring inside each pair of brackets is repeated as many times as the number that precedes the brackets.
For example,
Input ='3(a)2(b)'
Output='aaabb'
Input='4(aaa)'
Output='aaaaaaaaaaaa'
and similarly for other inputs. The program should print an empty string for wrong or invalid inputs.
This is what I've tried so far
s='3(aa)2(b)'
p=''
q=''
for i in range(0,len(s)):
    #print(s[i],end='')
    if s[i]=='(':
        k=int(s[i-1])
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
print(q)
Can anyone tell what's wrong with my code?
A oneliner would be:
''.join(int(y[0])*y[1] for y in (x.split('(') for x in Input.split(')')[:-1]))
It works like this. We take the input, and split on the close paren
In [1]: Input ='3(a)2(b)'
In [2]: a = Input.split(')')[:-1]
In [3]: a
Out[3]: ['3(a', '2(b']
This gives us the integer, character pairs we're looking for, but we need to get rid of the open paren, so for each x in a, we split on the open paren to get a two-element list where the first element is the int (as a string still) and the character. You'll see this in b
In [4]: b = [x.split('(') for x in a]
In [5]: b
Out[5]: [['3', 'a'], ['2', 'b']]
So for each element in b, we need to cast the first element as an integer with int() and multiply by the character.
In [6]: c = [int(y[0])*y[1] for y in b]
In [7]: c
Out[7]: ['aaa', 'bb']
Now we join on the empty string to combine them into one string with
In [8]: ''.join(c)
Out[8]: 'aaabb'
Try this:
import re
s = '3(a)2(b)'
a = re.findall(r'[\d]+', s)
b = re.findall(r'[a-zA-Z]+', s)
c = ''
for i, j in zip(a, b):
    c+=(int(i)*str(j))
print(c)
Here is how you could do it:
Step 1: Simple case, getting the data out of a really simple template
Let's assume your template string is 3(a). That's the simplest case I could think of. We'll need to extract pieces of information from that string. The first one is the count of chars that will have to be rendered. The second is the char that has to be rendered.
You are in a case where regex are more than suited (hence, the use of re module from python's standard library).
I won't do a full course on regex; you'll have to do that on your own. However, I'll quickly explain the steps I used. count (the variable that holds the number of times we should render the char) is one or more digits, hence our first capturing group will be something like (\d+). Then we have a char to extract that is enclosed in parentheses, hence \((\w+)\) (this actually allows several chars to be rendered at once). Put together, we get (\d+)\((\w+)\).
Applied to our case, a straight forward use of the re module is:
import re
# Our template
template = '3(a)'
# Run the regex
match = re.search(r'(\d+)\((\w+)\)', template)
if match:
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    # Print as many times the string as count was given
    print(count * string)
Output:
aaa
Yeah!
Step 2: Full case, with several templates
Okay, we know how to do it for 1 template, how to do the same for several, for instance 3(a)4(b)? Well... How would we do it "by hand"? We'd read the full template from left to right and apply each template one by one. Then this is what we'll do with python!
Fortunately for us, the re module has a function just for that: finditer. It does exactly what we described above.
So, we'll do something like:
import re
# Our template
template = '3(a)4(b)'
# Iterate through found templates
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    # Get the count from the first capturing group
    count = int(match.group(1))
    # Get the string to render from the second capturing group
    string = match.group(2)
    print(count * string)
Output:
aaa
bbbb
Okay... Just remains the combination of that stuff. We know we can put everything at each step in an array, and then join each items of this array at the end, no?
Let's do it!
import re
template = '3(a)4(b)'
parts = []
for match in re.finditer(r'(\d+)\((\w+)\)', template):
    parts.append(int(match.group(1)) * match.group(2))
print(''.join(parts))
Output:
aaabbbb
Yeah!
Step 3: Final step, optimization
Because we can always do better, we won't stop. for loops are cool. But what I love (it's personal) about python is that there is so much stuff you can actually just write with one line! Is it the case here? Well yes :).
First we can remove the for loop and the append using a list comprehension:
parts = [int(match.group(1)) * match.group(2) for match in re.finditer(r'(\d+)\((\w+)\)', template)]
rendered = ''.join(parts)
Finally, let's remove the two lines with parts populating and then join and let's do all that in a single line:
import re
template = '3(a)4(b)'
rendered = ''.join(
    int(match.group(1)) * match.group(2)
    for match in re.finditer(r'(\d+)\((\w+)\)', template))
print(rendered)
Output:
aaabbbb
Yeah! Still the same output :).
Hope it helped!
The value of 'p' should be refreshed after each iteration.
s='1(aaa)2(bb)'
p=''
q=''
i=0
while i<len(s):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            p+=(s[i+1])
            i+=1
    if s[i]==')':
        q+=k*p
    i+=1
print(q)
The code is not behaving the way I want it to behave. The problem here is the placement of 'p'. 'p' is the variable that adds the substring inside the ( )s. I'm repeating the process even after sufficient adding is done. Placing 'p' inside the 'if' block will do the job.
s='2(aa)2(bb)'
q=''
for i in range(0,len(s)):
    if s[i]=='(':
        k=int(s[i-1])
        p=''
        while(s[i+1]!=')'):
            #print(i,'first time')
            p+=s[i+1]
            i+=1
        q+=p*k
        #print(i,'second time')
print(q)
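The loops above read only one digit before the bracket and do not reject malformed input, although the question asks for an empty string in that case. A minimal regex-based sketch that handles both, assuming a valid input is nothing but a sequence of count(text) groups without nested brackets (the names `GROUP` and `expand` are made up for this illustration):

```python
import re

# One group: a number followed by non-empty text in parentheses.
GROUP = re.compile(r'(\d+)\(([^()]+)\)')

def expand(template):
    # The whole input must consist of one or more <count>(<text>) groups;
    # anything else counts as invalid and yields an empty string.
    if not re.fullmatch(r'(?:\d+\([^()]+\))+', template):
        return ''
    return ''.join(int(count) * text for count, text in GROUP.findall(template))

print(expand('3(a)2(b)'))
print(expand('4(aaa)'))
print(repr(expand('3(a')))
```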
What you want is not really printing substrings. The real purpose is most likely to generate text from regular expressions or commands.
You can write a function that parses the template, or use something like this:
The python library rstr has the function xeger() to do what you need by using random strings and only returning ones that match:
Example
Install with pip install rstr
In [1]: from __future__ import print_function
In [2]: import rstr
In [3]: for dummy in range(10):
...: print(rstr.xeger(r"(a|b)[cd]{2}\1"))
...:
acca
bddb
adda
bdcb
bccb
bcdb
adca
bccb
bccb
acda
Warning
For complex re patterns this might take a long time to generate any matches.

How to get a string after keyword

I would like to get the string after a specific keyword.
For example:
import re
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
abc = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
if findWholeWord("SeedNumber")(abc):
    dddd = re.search('(?<=ThreepointShooter)(.\w+)', abc)
    mvp = dddd.group()
    print (mvp)
    print ("found")
else:
    print ("not found")
I expect the result to be 'MVP1times'.
Is there a better method to find a specific string after a keyword? The result may be a string, digits, or a mix like the result above.
Thanks for help!
You can use look-arounds to get the string surrounded by > and < (assuming this stays consistent):
>>> s = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"
>>> re.search(r'(?<=\>)[^<]+(?=\<)', s).group(0)
'MVP1times'
You can change the regular expression to: (?<=ThreepointShooter['|"]>)(.\w+). See it live on http://pythex.org/
I'm not sure what exactly you're going to do, but you don't even need a lookbehind expression here.
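As a sketch of the approach without lookbehind: match the keyword, an optional closing quote and the >, and capture what follows in a group. The helper name `text_after` is made up for this illustration, and it assumes the wanted text runs up to the next <.

```python
import re

abc = "<StephenCurry Pro='ThreepointShooter'>MVP1times</StephenCurry>"

def text_after(keyword, text):
    # keyword, an optional quote character, '>' and then everything up to the next '<'
    m = re.search(re.escape(keyword) + r"""['"]?>([^<]+)""", text)
    return m.group(1) if m else None

print(text_after('ThreepointShooter', abc))
print(text_after('SeedNumber', abc))
```

Returning None when the keyword is missing avoids calling .group() on a failed search.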

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" its content, gets split positions and applies them to the original string, but this does not appear very elegant, there must be some prepackaged pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h='zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob=strfind(parameters,brc[0])
    dp=strfind(parameters,defs)
    out={}
    if len(ob)>0:
        if ob[0]<dp[0]:
            #opening function
            out['fn_']=parameters[:ob[0]]
            parameters=parameters[(ob[0]+1):-1]
    if len(dp)>0:
        temp=smart_tokenize(parameters,sep,brc);
        for v in temp:
            defp=strfind(v,defs)
            pname=v[:defp[0]]
            pval=v[1+defp[0]:]
            if len(strfind(pval,brc[0]))>0:
                out[pname]=pparams(pval,sep,defs,brc);
            else:
                out[pname]=pval
    else:
        out['fn_']=parameters
    return out
def smart_tokenize( instr, sep=';', brc='()' ):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr=instr;
    ob=strfind(instr,brc[0])
    while len(ob)>0:
        cb=findclsbrc(tstr,ob[0])
        tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
        ob=strfind(tstr,brc[1])
    sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
    out=[]
    for i in range(1,len(sepp)):
        out.append(instr[(sepp[i-1]+1):(sepp[i])])
    return out
def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr=instr[brc_pos:]
    o=strfind(tstr,brc[0])
    c=strfind(tstr,brc[1])
    p=o+c
    p.sort()
    s1=[1 if v in o else 0 for v in p]
    s2=[-1 if v in c else 0 for v in p]
    s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
    s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
    return p[s.index(0)]+brc_pos
def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i=0
    out=[]
    while i<=len(instr):
        try:
            p=instr[i:].index(substr)
            out.append(i+p)
            i+=p+1
        except:
            i=len(instr)+1
    return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
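For the narrower task of splitting only at separators outside parentheses, a depth counter may already be enough. The following is a sketch, not a full parser: it assumes the parentheses are balanced and does no error checking, and the function names `split_top_level` and `parse_pairs` are made up for this illustration.

```python
def split_top_level(text, sep=';', brackets='()'):
    """Split text at sep, ignoring separators nested inside brackets."""
    opening, closing = brackets
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == opening:
            depth += 1
        elif ch == closing:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator at nesting depth 0 ends the current item.
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

def parse_pairs(text):
    """Turn 'k1=v1;k2=(k3=v3)' into a dict, recursing into parenthesised values."""
    result = {}
    for item in split_top_level(text):
        key, _, value = item.partition('=')
        if value.startswith('(') and value.endswith(')'):
            value = parse_pairs(value[1:-1])
        result[key] = value
    return result

s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
print(split_top_level(s))
print(parse_pairs(s))
```

Because the depth is tracked per character, nested parentheses are handled as well; malformed input would still need the validation a parsing library provides.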
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
                       Group, Suppress, Dict)
collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))
coll = collection.parseString(
    "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print(coll['key1']) # value1
print(coll['key2']) # value2
print(coll['key3']['key3.1']) # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w+) # Find and capture a group with 1 or more word characters (letters, digits, underscores)
= # Followed by the literal character '='
(\w+ # Followed by a group with 1 or more word characters
|\([^)]+\) # or a group that starts with an open paren (parens escaped as '\(' and '\)'), followed by anything up to the first closing paren
);? # this closes the alternation group, which may optionally be followed by a semicolon
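To see what this returns, and where it stops working, the pattern can be run on the string from the question and on a nested variant. The nested string is made up for illustration:

```python
import re

pattern = re.compile(r'(\w+)=(\w+|\([^)]+\));?')

s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
pairs = dict(pattern.findall(s))
print(pairs)

# With nested parentheses, [^)]+ stops at the first ')' it meets,
# so the value of 'a' is cut short and 'e' leaks out to the top level.
nested = "a=(b=(c=1;d=2);e=3);f=4"
nested_pairs = dict(pattern.findall(nested))
print(nested_pairs)
```

So the regex covers one level of parentheses, but not the nested case mentioned in the question.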
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!
