python regex - no match in script, although it should - python

I am programming a parser for an old dictionary and I'm trying to find a pattern like re.findall("{.*}", string) in a string.
A control print after the check proves, that only a few strings match, although all strings contain a pattern like {...}.
Even copying the string and matching it interactively in the idle shell
gives a match, but inside the rest of the code, it simply does not.
Is it possible that this problem is caused by the actual python interpreter?
I cannot figure out any other problem...
thanks for your help
the code snippet looks like that:
for aParse in chunklist:
aSigle = aParse[1]
aParse = aParse[0]
print("to be parsed", aParse)
aContext = Context()
aContext._init_("")
aContext.ID = contextID
aContext.source = aSigle
# here, aParse is the string containing {Abriss}
# which is part of a lexicon entry
metamatches = re.findall("\{.*\}", aParse)
print("metamatches: ", metamatches)
for meta in metamatches:
aMeta = meta.replace("{", "").replace("}", "")
aMeta = aMeta.split()
for elem in aMeta:
...

Try this:
re = {0: "{.test1}",1: "{.test1}",2: "{.test1}",3: "{.test1}"}
for value in re.itervalues():
if "{" in value:
value = value.replace("{"," ")
print value
or if you want to remove both "{}”
for value in re.itervalues():
if "{" in value:
value = value.strip('{}')
print value

Try this
data=re.findall(r"\{([^\}]*)}",aParse,re.I|re.S)
DEMO

So, in a really simplified scenario, a lexical entry looks like that:
"headword" {meta, meaning} context [reference for context].
So, I was chunking (split()) the entry at [...] with a regex. that works fine so far. then, after separating the headword, I tried to find the meta/meaning with a regex that finds all patterns of the form {...}. Since that regex didn't work, I replaced it with this function:
def findMeta(self, string, alist):
opened = 0
closed = 0
for char in enumerate(string):
if char[1] == "{":
opened = char[0]
elif char[1] == "}":
closed = char[0]
meta = string[opened:closed+1]
alist.append(meta)
string.replace(meta, "")
Now, its effectively much faster and the meaning component is correctly analysed. The remaining question is: in how far are the regex which I use to find other information (e.g. orthographic variants, introduced by "s.}") reliable? should they work or is it possible that the IDLE shell is simply not capable of parsing a 1000 line program correctly (and compiling all regex)? an example for a string whose meta should actually have been found is: " {stm.} {der abbruch thut, den armen das gebührende vorenthält} [Renn.]"
the algorithm finds the first, saying this word is a noun, but the second, it's translation, is not recognized.
... This is medieval German, sorry for that! Thank you for all your help.

Related

Complex string filtering with python

I have a long string that is a phylogenetic tree and I want to do a very specific filtering.
(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically every x#y is a species#gene_id information. What I am trying to do is trimming this down so that I will only have x instead of x#y.
(Esy, Aar,(Spa,Cpl))...
I tried splitting the string first but the problem is string has different 'split points' for what I want to achieve i.e. some parts x#y is ending with a , and others with a ). I searched for a solution and saw regular expression operations, but I am new to Python and I couldn't be sure if that is what I should be focusing on. I also thought about strip() but it seems like I need to specify the characters to be stripped for this.
Main problem is there is no 'pattern' for me to tell Python to follow. Only thing is that all species ids are 3 letters and they are before an # character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.
Give this a try:
import re:
pat = re.compile(r'(\w{3})#')
txt = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(t)
Result:
['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the structure intact, we can try to remove the unnecessary parts instead:
pat = re.compile(r'(#|:)[^/),]*')
pat.sub('',t).replace(',', ', ')
Result:
'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
Regex demo
How about this kind of function:
def parse_string(string):
new_string = ''
skip = False
for char in string:
if char == '#':
skip = True
if char == ',':
skip = False
if not skip or char in ['(', ')']:
new_string += char
return new_string
Calling it on your string:
string = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
you can use regex:
import re
s = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = "...?(?=#)|\(|\)"
result = re.findall(p, s)
and you have your result as a list, so you can make it string or do anything with it
for explaining what is happening :
p is regular expression pattern
so in this pattern:
. means matching any word
...?(?=#) means match any word until I get to a word ? wich ? is #, so this whole pattern means that you get any three words before #
| is or statement, I used it here to find another pattern
and the rest of them is to find ) and (
Try this regex if you need the brackets in the output:
import re
regex = r"#[A-Za-z0-9_\.:]+|[0-9:\.;e-]+"
phylogenetic_tree = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
print(re.sub(regex,"",phylogenetic_tree))
Output:
(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hs)),(Esa,Aal))))
Because you are trying to parse a phylogenetic tree, I highly suggest to let BioPython do the heavy lifting for you.
You can easily parse and display a phylogenetic with Bio.Phylo. Then it is just iterating over all tree elements and splitting the names at the 'at'-sign.
Because Phylo expects the input to be in a file, we create an in-memory file-like object with io.StringIO. Getting the complete tree is then as easy as
Phylo.read(io.StringIO(s), 'newick')
In order to check if the parsed tree looks sane, I print it once with print(tree).
Now we want to change all node names that contain a '#'. With tree.find_elements we get access to all nodes. Some nodes don't have a name and some might not contain a '#'. So to be extra careful, we first check if n.name and '#' in n.name. Only then do we split each node's name at the '#' and take just the first part (index 0) of it:
n.name = n.name.split('#')[0]
In order to recreate the initial string representation, we use Phylo.write:
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Again, write wants to get a file argument - if we just want to get a string, we can use a StringIO object again.
Full code:
import io
from Bio import Phylo
if __name__ == '__main__':
s = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
tree = Phylo.read(io.StringIO(s), 'newick')
print(' before '.center(20, '='))
print(tree)
for n in tree.find_elements():
if n.name and '#' in n.name:
n.name = n.name.split('#')[0]
print(' result '.center(20, '='))
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Output:
====== before ======
Tree(rooted=False, weight=1.0)
Clade(branch_length=0.0129090235079)
Clade(branch_length=0.0726396855636, name='Esy#ESY15_g64743_DN3_SP7_c0')
Clade(branch_length=0.137507902808, name='Aar#AA_maker7399_1')
Clade(branch_length=0.0129090235079)
Clade(branch_length=9.05326020871e-05)
Clade(branch_length=0.0318934795022, name='Spa#Tp2g18720')
Clade(branch_length=0.0273465005242, name='Cpl#CP2_g48793_DN3_SP8_c')
Clade(branch_length=0.00328120860999)
Clade(branch_length=0.00859075940423)
Clade(branch_length=0.0340484449097)
Clade(branch_length=0.0332592496158, name='Bst#Bostr_13083s0053_1')
Clade(branch_length=0.0150356382287)
Clade(branch_length=0.0205924636564)
Clade(branch_length=0.0328569260951, name='Aly#AL8G21130_t1')
Clade(branch_length=0.0391706378372, name='Ath#AT5G48370_1')
Clade(branch_length=0.00998579652059)
Clade(branch_length=0.0954469923893, name='Chi#CARHR183840_1')
Clade(branch_length=0.0570981548016, name='Cru#Carubv10026342m')
Clade(branch_length=0.0372829371381)
Clade(branch_length=0.0206478928557)
Clade(branch_length=0.0144626717872)
Clade(branch_length=0.00823215335663, name='Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
Clade(branch_length=0.0085462978729, name='Hlo#DN13684_c0_g1_i1_p1')
Clade(branch_length=0.0225079453622, name='Hla#DN22821_c0_g1_i1_p1')
Clade(branch_length=0.048590776459, name='Hse#DN23412_c0_g1_i3_p1')
Clade(branch_length=1.00000050003e-06)
Clade(branch_length=0.0378509854703, name='Esa#Thhalv10004228m')
Clade(branch_length=0.0712272454125, name='Aal#Aa_G102140_t1')
==== result =====
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;
The default format of Phylo uses less digits than in your original tree. In order to keep the numbers unchanged, just override the branch length format string with a '%s':
Phylo.write(tree, out, "newick", format_branch_length="%s")
Parsing code can be hard to follow. Tatsu lets you write readable parsing code by combining grammars and python:
text = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
import sys
import tatsu
grammar = """
start = things ';'
;
things = thing [ ',' things ]
;
thing = x '#' y ':' number
| '(' things ')' ':' number
;
x = /\w+/
;
y = /\w+/
;
number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
;
"""
class Semantics:
def x(self, ast):
# the method name matches the rule name
print('X =', ast)
parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)

Change part of a word (string) into in a different string if a sign occurs. Python

How most effectively do I cut out a part of a word if the character '=#=' appears and then finish cutting the word if the character '=#=' appears? For example:
From a large string
'321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
The python code returns:
'I-LOVE-STACK-OVER-FLOW'
Any help will be appreciated.
Using split():
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
st = '=#='
ed = '=#='
print((s.split(st))[1].split(ed)[0])
Using regex:
import re
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319'
print(re.search('%s(.*)%s' % (st, ed), s).group(1))
OUTPUT:
I-LOVE-STACK-OVER-FLOW
In addition to #DirtyBit's answer, if you want to also handle cases of more than 2 '=#='s, you can split the string, and then add every other element:
s = '321#5=85#45#41=#=I-LOVE-STACK-OVER-FLOW=#=3234#41#=q#$^1=#=xx$q=#=xpa$=4319=#=|I-ALSO-LOVE-SO=#=3123123'
parts = s.split('=#=')
print(''.join([parts[i] for i in range(1,len(parts),2)]))
Output
I-LOVE-STACK-OVER-FLOW|I-ALSO-LOVE-SO
The explanation is in the code.
import re
ori_list = re.split("=#=",ori_str)
# you can imagine your goal is to find the string wrapped between signs of "=#="
# so after the split, the even number position must be the parts outsides of "=#="
# and the odd number position is what you want
for i in range(len(ori_list)):
if i%2 == 1:#odd position
print(ori_list[i])

Using unicode char code in regular expression

Simplifying my task, lets say I want to find any words written in Hebrew in some web page.
So I know that Hebrew char codes are U+05D0 to U+05EA.
I want to write something like:
expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"
web_handle = urllib2.urlopen(url)
website_text = website_handle.read()
matches = sre.findall(exp, website_text)
for item in matches:
print item
The output I would expect is:
עברית
But instead the out put is a lot of Chinese/Japanese chars.
You can just use standard representation of unicode in python within a character class :
re.findall([\u05D0-\u05EA], website_text,re.U)
The expression should be:
expr = u"[\u05D0-\u05EA]+"
Notice the 'u' at the beginning.

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" its content, gets split positions and applies them to the original string, but this does not appear very elegant, there must be some prepackaged pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
'''
unpackages parameter string to struct
for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
a: '21'
b: '35'
c.fn: 'pluto'
c.h='zzz'
d: '2d3f'
fn_: 'pippo'
'''
ob=strfind(parameters,brc[0])
dp=strfind(parameters,defs)
out={}
if len(ob)>0:
if ob[0]<dp[0]:
#opening function
out['fn_']=parameters[:ob[0]]
parameters=parameters[(ob[0]+1):-1]
if len(dp)>0:
temp=smart_tokenize(parameters,sep,brc);
for v in temp:
defp=strfind(v,defs)
pname=v[:defp[0]]
pval=v[1+defp[0]:]
if len(strfind(pval,brc[0]))>0:
out[pname]=pparams(pval,sep,defs,brc);
else:
out[pname]=pval
else:
out['fn_']=parameters
return out
def smart_tokenize( instr, sep=';', brc='()' ):
'''
tokenize string ignoring separators contained within brc
'''
tstr=instr;
ob=strfind(instr,brc[0])
while len(ob)>0:
cb=findclsbrc(tstr,ob[0])
tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
ob=strfind(tstr,brc[1])
sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
out=[]
for i in range(1,len(sepp)):
out.append(instr[(sepp[i-1]+1):(sepp[i])])
return out
def findclsbrc(instr, brc_pos, brc='()'):
'''
given a string containing an opening bracket, finds the
corresponding closing bracket
'''
tstr=instr[brc_pos:]
o=strfind(tstr,brc[0])
c=strfind(tstr,brc[1])
p=o+c
p.sort()
s1=[1 if v in o else 0 for v in p]
s2=[-1 if v in c else 0 for v in p]
s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
return p[s.index(0)]+brc_pos
def strfind(instr, substr):
'''
returns starting position of each occurrence of substr within instr
'''
i=0
out=[]
while i<=len(instr):
try:
p=instr[i:].index(substr)
out.append(i+p)
i+=p+1
except:
i=len(instr)+1
return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
Group, Suppress, Dict)
collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))
coll = collection.parseString(
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print coll['key1'] # value1
print coll['key2'] # value2
print coll['key3']['key3.1'] # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile('(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w)+ # Find and capture a group with 1 or more word characters (letters, digits, underscores)
= # Followed by the literal character '='
(\w+ # Followed by a group with 1 or more word characters
|\([^)]+\) # or a group that starts with an open paren (parens escaped with '\(' or \')'), followed by anything up until a closed paren, which terminates the alternate grouping
);? # optionally this grouping might be followed by a semicolon.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then using a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, Optional, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
for line in f:
try:
result = parser.parseString(line)
codes = [c[1] for c in result[1:-1]]
# Do something with teh codez...
except ParseException exc:
# Oh noes: string doesn't match!
continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

Categories