Say I define a string in Python like the following:
my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
I would like to parse that string in Python in a way that allows me to index the different structures of the language.
For example, the output could be a dictionary parsing_result that allows me to index the different elements in a structured manner.
For example, the following:
parsing_result['something']['names']
would hold a list of strings: ['name1', 'name2']
whereas parsing_result['something']['options'] would hold a dictionary so that:
parsing_result['something']['options']['opt2'] holds the string "text"
parsing_result['something_else']['options']['opt1'] holds the string "58"
My first question is: How do I approach this problem in Python? Are there any libraries that simplify this task?
For a working example, I am not necessarily interested in a solution that parses the exact syntax I defined above (although that would be fantastic), but anything close to it would be great.
Update
It looks like the generally right solution is to use a parser and a lexer such as ply (thank you @Joran), but the documentation is a bit intimidating. Is there an easier way to get this done when the syntax is lightweight?
I found this thread where the following regular expression is provided to partition a string around outer commas:
r = re.compile(r'(?:[^,(]|\([^)]*\))+')
r.findall(s)
But this assumes that the grouping characters are () (and not {}). I am trying to adapt this, but it doesn't look easy.
I highly recommend pyparsing:
The pyparsing module is an alternative approach to creating and
executing simple grammars, vs. the traditional lex/yacc approach, or
the use of regular expressions.
The Python representation of the grammar is quite
readable, owing to the self-explanatory class names, and the use of
'+', '|' and '^' operator definitions. The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.
Sample code (Hello world from the pyparsing docs):
from pyparsing import Word, alphas
greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
hello = "Hello, World!"
print (hello, "->", greet.parseString( hello ))
Output:
Hello, World! -> ['Hello', ',', 'World', '!']
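As a small illustration of the named-results access mentioned in the quote above (the first/second result names are my own additions, not part of the docs example):
from pyparsing import Word, alphas

greet = Word(alphas)("first") + "," + Word(alphas)("second") + "!"
result = greet.parseString("Hello, World!")
print(result.asList())                 # ['Hello', ',', 'World', '!']
print(result["first"], result.second)  # Hello World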
Edit: Here's a solution to your sample language:
from pyparsing import *
import json
identifier = Word(alphas + nums + "_")
expression = identifier("lhs") + Suppress("=") + identifier("rhs")
struct_vals = delimitedList(Group(expression | identifier))
structure = Group(identifier + nestedExpr(opener="{", closer="}", content=struct_vals("vals")))
grammar = delimitedList(structure)
my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parse_result = grammar.parseString(my_string)
result_list = parse_result.asList()
def list_to_dict(l):
    d = {}
    for struct in l:
        d[struct[0]] = {}
        for ident in struct[1]:
            if len(ident) == 2:
                d[struct[0]][ident[0]] = ident[1]
            elif len(ident) == 1:
                d[struct[0]][ident[0]] = None
    return d

print(json.dumps(list_to_dict(result_list), indent=2))
Output: (pretty printed as JSON)
{
  "something_else": {
    "opt1": "58",
    "name3": null
  },
  "something": {
    "opt1": "2",
    "opt2": "text",
    "name2": null,
    "name1": null
  }
}
Use the pyparsing API as your guide to exploring the functionality of pyparsing and understanding the nuances of my solution. I've found that the quickest way to master this library is trying it out on some simple languages you think up yourself.
As stated by @Joran Beasley, you'd really want to use a parser and a lexer. They are not easy to wrap your head around at first, so you'd want to start off with a very simple tutorial on them.
If you are really trying to write a lightweight language, then you're going to want to go with a parser/lexer, and learn about context-free grammars.
If you are really just trying to write a program to strip data out of some text, then regular expressions are the way to go.
If this is not a programming exercise, and you are just trying to get structured data in text format into Python, check out JSON.
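For instance, data close to your example expressed as JSON parses straight into nested Python structures (this layout is just one possible encoding):
import json

text = '{"something": {"names": ["name1", "name2"], "options": {"opt1": "2", "opt2": "text"}}}'
parsing_result = json.loads(text)
print(parsing_result["something"]["options"]["opt2"])  # text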
Here is a test of the regular expression, modified to react to {} instead of ():
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
print(r.findall(s))
You'll get a list of separate 'named blocks' as a result:
['something{name1, name2, opt1=2, opt2=text}', ' something_else{name3, opt1=58}']
Here is more complete code that can parse your simple example. You should, for example, catch exceptions to detect syntax errors, and restrict valid block names and parameter names more tightly:
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
rblock = re.compile(r'\s*(\w+)\s*{(.*)}\s*')
rparam = re.compile(r'\s*([^=\s]+)\s*(=\s*([^,]+))?')

blocks = r.findall(s)
for block in blocks:
    resb = rblock.match(block)
    blockname = resb.group(1)
    blockargs = resb.group(2)
    print("block name =", blockname)
    print("args:")
    for arg in re.split(",", blockargs):
        resp = rparam.match(arg)
        paramname = resp.group(1)
        paramval = resp.group(3)
        if paramval is None:
            print('param name="{0}" no value'.format(paramname))
        else:
            print('param name="{0}" value="{1}"'.format(paramname, paramval))
Related
I have a long string that is a phylogenetic tree and I want to do a very specific filtering.
(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically every x#y is species#gene_id information. What I am trying to do is trim this down so that I will only have x instead of x#y.
(Esy, Aar,(Spa,Cpl))...
I tried splitting the string first, but the problem is that the string has different 'split points' for what I want to achieve, i.e. some x#y parts end with a , and others with a ). I searched for a solution and saw regular expression operations, but I am new to Python and I couldn't be sure if that is what I should be focusing on. I also thought about strip(), but it seems like I need to specify the characters to be stripped for that.
The main problem is that there is no single 'pattern' for me to tell Python to follow. The only constants are that all species ids are 3 letters and they come before a # character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.
Give this a try:
import re
pat = re.compile(r'(\w{3})#')
txt = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)
Result:
['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the structure intact, we can try to remove the unnecessary parts instead:
pat = re.compile(r'(#|:)[^/),]*')
pat.sub('', txt).replace(',', ', ')
Result:
'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
How about this kind of function:
def parse_string(string):
    new_string = ''
    skip = False
    for char in string:
        if char == '#':
            skip = True
        if char == ',':
            skip = False
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string
Calling it on your string:
string = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
You can use a regex:
import re
s = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = "...?(?=#)|\(|\)"
result = re.findall(p, s)
and you have your result as a list, so you can join it back into a string or do anything else with it.
To explain what is happening:
p is the regular expression pattern
so in this pattern:
. matches any single character
...?(?=#) matches two or three characters that are immediately followed by a # ((?=#) is a lookahead: it checks for the # without consuming it), so this part grabs the species id in front of each #
| is alternation ("or"); I used it here to combine this with two more patterns
and the rest of them, \( and \), match the literal ( and )
Try this regex if you need the brackets in the output:
import re
regex = r"#[A-Za-z0-9_\.:]+|[0-9:\.;e-]+"
phylogenetic_tree = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
print(re.sub(regex,"",phylogenetic_tree))
Output:
(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))
Because you are trying to parse a phylogenetic tree, I highly suggest letting BioPython do the heavy lifting for you.
You can easily parse and display a phylogenetic tree with Bio.Phylo. Then it is just a matter of iterating over all tree elements and splitting the names at the 'at' sign (#).
Because Phylo expects the input to be in a file, we create an in-memory file-like object with io.StringIO. Getting the complete tree is then as easy as
Phylo.read(io.StringIO(s), 'newick')
In order to check if the parsed tree looks sane, I print it once with print(tree).
Now we want to change all node names that contain a '#'. With tree.find_elements we get access to all nodes. Some nodes don't have a name and some might not contain a '#'. So to be extra careful, we first check if n.name and '#' in n.name. Only then do we split each node's name at the '#' and take just the first part (index 0) of it:
n.name = n.name.split('#')[0]
In order to recreate the initial string representation, we use Phylo.write:
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Again, write wants to get a file argument - if we just want to get a string, we can use a StringIO object again.
Full code:
import io

from Bio import Phylo

if __name__ == '__main__':
    s = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'

    tree = Phylo.read(io.StringIO(s), 'newick')
    print(' before '.center(20, '='))
    print(tree)

    for n in tree.find_elements():
        if n.name and '#' in n.name:
            n.name = n.name.split('#')[0]

    print(' result '.center(20, '='))
    out = io.StringIO()
    Phylo.write(tree, out, "newick")
    print(out.getvalue())
Output:
====== before ======
Tree(rooted=False, weight=1.0)
    Clade(branch_length=0.0129090235079)
        Clade(branch_length=0.0726396855636, name='Esy#ESY15_g64743_DN3_SP7_c0')
        Clade(branch_length=0.137507902808, name='Aar#AA_maker7399_1')
        Clade(branch_length=0.0129090235079)
            Clade(branch_length=9.05326020871e-05)
                Clade(branch_length=0.0318934795022, name='Spa#Tp2g18720')
                Clade(branch_length=0.0273465005242, name='Cpl#CP2_g48793_DN3_SP8_c')
            Clade(branch_length=0.00328120860999)
                Clade(branch_length=0.00859075940423)
                    Clade(branch_length=0.0340484449097)
                        Clade(branch_length=0.0332592496158, name='Bst#Bostr_13083s0053_1')
                        Clade(branch_length=0.0150356382287)
                            Clade(branch_length=0.0205924636564)
                                Clade(branch_length=0.0328569260951, name='Aly#AL8G21130_t1')
                                Clade(branch_length=0.0391706378372, name='Ath#AT5G48370_1')
                            Clade(branch_length=0.00998579652059)
                                Clade(branch_length=0.0954469923893, name='Chi#CARHR183840_1')
                                Clade(branch_length=0.0570981548016, name='Cru#Carubv10026342m')
                    Clade(branch_length=0.0372829371381)
                        Clade(branch_length=0.0206478928557)
                            Clade(branch_length=0.0144626717872)
                                Clade(branch_length=0.00823215335663, name='Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
                                Clade(branch_length=0.0085462978729, name='Hlo#DN13684_c0_g1_i1_p1')
                            Clade(branch_length=0.0225079453622, name='Hla#DN22821_c0_g1_i1_p1')
                        Clade(branch_length=0.048590776459, name='Hse#DN23412_c0_g1_i3_p1')
                Clade(branch_length=1.00000050003e-06)
                    Clade(branch_length=0.0378509854703, name='Esa#Thhalv10004228m')
                    Clade(branch_length=0.0712272454125, name='Aal#Aa_G102140_t1')
====== result ======
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;
The default format of Phylo uses fewer digits than your original tree. In order to keep the numbers unchanged, just override the branch length format string with a '%s':
Phylo.write(tree, out, "newick", format_branch_length="%s")
Parsing code can be hard to follow. Tatsu lets you write readable parsing code by combining grammars and Python:
text = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
import tatsu
grammar = """
start = things ';'
;
things = thing [ ',' things ]
;
thing = x '#' y ':' number
| '(' things ')' ':' number
;
x = /\w+/
;
y = /\w+/
;
number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
;
"""
class Semantics:
    def x(self, ast):
        # the method name matches the rule name
        print('X =', ast)
parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)
I have template strings for formatting, with named substitution variables, like
mystr = "Some {title} text {body}"
mystr_ready = mystr.format(title='abc', body='bcd')
There can be many different substitution variable names in {} there, and we don't know them in advance, so before I fetch their values from the database for substitution, I need to know which names occur (fetching all variants from a huge table in the database is too slow).
So I need to implement this logic:
mystr = "Some {title} text {body}"
subs = SOMETHING(mystr) # title, body
I know this can be solved with regular expressions, but I suppose there is a more elegant and Pythonic solution.
Use the string.Formatter:
import string

parser = string.Formatter().parse

def fmt_fields(fmt):
    # parse() yields (literal_text, field_name, format_spec, conversion)
    # tuples; field_name is None for a literal-only chunk, hence the filter
    return [f[1] for f in parser(fmt) if f[1] is not None]

print(fmt_fields("Some {title} text {body}"))  # ['title', 'body']
I would like to parse JSON-like strings. Their only difference from normal JSON is the presence of contiguous commas in arrays. When there are two such commas, it implicitly means that a null should be inserted in between. Example:
JSON-like: ["foo",,,"bar",[1,,3,4]]
Javascript: ["foo",null,null,"bar",[1,null,3,4]]
Decoded (Python): ["foo", None, None, "bar", [1, None, 3, 4]]
The native json.JSONDecoder class doesn't allow me to change the behavior of the array parsing. I can only modify the parsing of objects (dicts), ints, floats and constants, by passing hook functions as keyword arguments to JSONDecoder() (please see the docs).
So, does this mean I have to write a JSON parser from scratch? The Python source of the json module is available, but it's quite a mess. I would prefer to reuse its internals instead of duplicating its code!
Since what you're trying to parse isn't JSON per se, but rather a different language that's very much like JSON, you may need your own parser.
Fortunately, this isn't as hard as it sounds. You can use a Python parser generator like pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to fit your needs.
Small & simple workaround to try out:
Convert JSON-like data to strings.
Replace ",," with ",null,".
Convert it to whatever is your representation.
Let JSONDecoder(),
do the heavy lifting.
& 3. can be omitted if you already deal with strings.
(And if converting to string is impractical, update your question with this info!)
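A minimal sketch of that workaround, assuming ",," never occurs inside string literals (note that a single replace pass misses overlapping runs like ",,,", hence the loop):
import json

s = '["foo",,,"bar",[1,,3,4]]'
while ',,' in s:
    # str.replace only handles non-overlapping matches, so runs of
    # commas need repeated passes until none remain
    s = s.replace(',,', ',null,')
print(json.loads(s))  # ['foo', None, None, 'bar', [1, None, 3, 4]]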
You can do the comma replacement of Lattyware's/przemo_li's answers in one pass by using a lookbehind expression, i.e. "replace all commas that are preceded by just a comma":
>>> s = '["foo",,,"bar",[1,,3,4]]'
>>> re.sub(r'(?<=,)\s*,', ' null,', s)
'["foo", null, null,"bar",[1, null,3,4]]'
Note that this will work for small things where you can assume there aren't consecutive commas in string literals, for example. In general, regular expressions aren't enough to handle this problem, and Taymon's approach of using a real parser is the only fully correct solution.
It's a hackish way of doing it, but one solution is to simply do some string modification on the JSON-ish data to get it in line before parsing it.
import re
import json
not_quite_json = '["foo",,,"bar",[1,,3,4]]'
not_json = True
while not_json:
not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)
Which leaves us with:
'["foo", null, null, "bar",[1, null, 3,4]]'
We can then do:
json.loads(not_quite_json)
Giving us:
['foo', None, None, 'bar', [1, None, 3, 4]]
Note that it's not as simple as a single replace, as the replacement itself inserts commas that may in turn need replacing. Given this, you have to loop through until no more replacements can be made. Here I have used a simple regex to do the job.
I've had a look at Taymon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs.
It works well at simulating JavaScript eval(), but fails in one situation: trailing commas. There should be an optional trailing comma – see the tests below – but I can't find any proper way to implement this.
from pyparsing import *
TRUE = Keyword("true").setParseAction(replaceWith(True))
FALSE = Keyword("false").setParseAction(replaceWith(False))
NULL = Keyword("null").setParseAction(replaceWith(None))
jsonString = dblQuotedString.setParseAction(removeQuotes)
jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                     Optional('.' + Word(nums)) +
                     Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))
jsonObject = Forward()
jsonValue = Forward()
# black magic begins
commaToNull = Word(',,', exact=1).setParseAction(replaceWith(None))
jsonElements = ZeroOrMore(commaToNull) + Optional(jsonValue) + ZeroOrMore((Suppress(',') + jsonValue) | commaToNull)
# black magic ends
jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL)
memberDef = Group(jsonString + Suppress(':') + jsonValue)
jsonMembers = delimitedList(memberDef)
jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))
jsonComment = cppStyleComment
jsonObject.ignore(jsonComment)
def convertNumbers(s, l, toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)
jsonNumber.setParseAction(convertNumbers)
def test():
    tests = (
        '[1,2]',      # ok
        '[,]',        # ok
        '[,,]',       # ok
        '[ , , , ]',  # ok
        '[,1]',       # ok
        '[,,1]',      # ok
        '[1,,2]',     # ok
        '[1,]',       # failure, I got [1, None], I should have [1]
        '[1,,]',      # failure, I got [1, None, None], I should have [1, None]
    )
    for test in tests:
        results = jsonArray.parseString(test)
        print(results.asList())
For those looking for something quick and dirty to convert general JS objects (to dicts): part of the page of one real site gives me an object I'd like to tackle. There are 'new' constructs for dates, and it's all on one line with no spaces in between, so these two lines suffice:
from re import sub

data = sub(r'new Date\(([^)]*)\)', r'\1', data)
data = sub(r'([,{])(\w*):', r'\1"\2":', data)
Then json.loads() worked fine. Your mileage may vary:)
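For instance, on a hypothetical one-line object in that style (this input is made up for illustration):
import json
from re import sub

data = '{x:1,y:new Date(12345),z:"s"}'
data = sub(r'new Date\(([^)]*)\)', r'\1', data)  # strip the Date constructor
data = sub(r'([,{])(\w*):', r'\1"\2":', data)    # quote the bare keys
print(json.loads(data))  # {'x': 1, 'y': 12345, 'z': 's'}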
I currently have a string that has variables in it.
domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh
I'm trying to delete
&thingy=(all text that is in here)
The parameters might not always be in that order, and the text after the = will change.
I started doing something like this, but I feel there has to be a quicker alternative:
cleanlist = []
variables = url.split('&')
for t in variables:
    if not t.split('=', 1)[0] == 'thingy':
        cleanlist.append(t)
I don't know Python, but from experience with other programming languages, the question I think you should have asked is "How do you parse a URL in Python?" or "How do you parse a url query string in Python?"
Just Googling this, I got the following info that may help:
>>> from urllib.parse import urlparse, parse_qs
>>> o = urlparse('domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh')
>>> q = parse_qs(o.query)
>>> q['hello']
['randomtext']
>>> q['thingy']
['randotext2']
Once you parse the URL and query string, just grab what you want.
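As a sketch of that last step, dropping thingy and rebuilding the URL with the standard library (doseq keeps list-valued query params intact):
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

url = 'domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh'
parts = urlparse(url)
query = parse_qs(parts.query)
query.pop('thingy', None)  # drop the unwanted parameter
print(urlunparse(parts._replace(query=urlencode(query, doseq=True))))
# domain.com/?hello=randomtext&stuff=1231kjh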
You can substitute using a regex:
import re

p = re.compile(r'(&thingy=.*?)&')
test_str = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
subst = "&"
result = re.sub(p, subst, test_str)

>>> result
'domain.com/?hello=randomtext&stuff=1231kjh'
If I get your question right, then you're trying to delete everything from "&thingy=" onward, which here is "&thingy=randotext2&stuff=1231kjh".
This can be easily achieved by doing something like this:
current_str = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
cursor = current_str.find("&thingy=")
clean_str = current_str[:cursor]
Now the clean_str variable is what you're looking for.
This will give a clean result which is only:
domain.com/?hello=randomtext
If you wish to delete only a query string argument's value, such as the one after &thingy=, a regular expression would look like this:
import re
domain = "domain.com/?hello=randomtext&thingy=randotext2&stuff=1231kjh"
x = re.sub(r'(&thingy=)[^&]*(&?.*)$', r'\1\2', domain)
It works regardless of what follows the given argument.
Is there a similar function in Python that takes search (array) and replace (array) as parameters, then takes a value from each array and uses them to do search and replace on a subject (string)?
I know I can achieve this using for loops, but I'm just looking for a more elegant way.
I believe the answer is no.
I would specify your search/replace strings in a list, and then iterate over it:
edits = [(search0, replace0), (search1, replace1), (search2, replace2)]  # etc.

for search, replace in edits:
    s = s.replace(search, replace)
Even if python did have a str_replace-style function, I think I would still separate out my search/replace strings as a list, so really this is only taking one extra line of code.
Finally, this is a programming language after all. If it doesn't supply the function you want, you can always define it yourself.
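For example, a minimal sketch of such a helper (str_replace is just a name chosen here to echo PHP's):
def str_replace(search, replace, subject):
    # apply each (search, replace) pair to the subject in order
    for s, r in zip(search, replace):
        subject = subject.replace(s, r)
    return subject

print(str_replace(['cat', 'dog'], ['dog', 'pig'], 'cat dog'))  # pig pig
Note how sequential replacement lets an earlier result be replaced again ('cat dog' becomes 'pig pig', not 'dog pig'); the regex-based answer below avoids exactly that.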
Heh - you could use the one-liner below, whose elegance is second only to its convenience :-P
(It acts like PHP when search is longer than replace, too, if I read the PHP docs correctly.)
Edit: this new version works for substrings of any size.
>>> subject = "Coming up with these convoluted things can be very addictive."
>>> search = ['Coming', 'with', 'things', 'addictive.', ' up', ' these', 'convoluted ', ' very']
>>> replace = ['Making', 'Python', 'one-liners', 'fun!']
>>> reduce(lambda s, p: s.replace(p[0],p[1]),[subject]+zip(search, replace+['']*(len(search)-len(replace))))
'Making Python one-liners can be fun!'
Do it with regexps:
import re

def replace_from_list(replacements, text):
    def escape_string_to_regex(s):
        return re.sub(r"([\\.^$*+?{}[\]|\(\)])", r"\\\1", s)

    def get_replacement(match):
        return replacements[match.group(0)]

    replacements = dict(replacements)
    replace_from = [escape_string_to_regex(r) for r in replacements.keys()]
    regex = "|".join(["(%s)" % r for r in replace_from])
    repl = re.compile(regex)
    return repl.sub(get_replacement, text)

# Simple replacement:
assert replace_from_list([("in1", "out1")], "in1") == "out1"

# Replacements are never themselves replaced, even if later search strings match
# earlier destination strings:
assert replace_from_list([("1", "2"), ("2", "3")], "123") == "233"

# These are plain strings, not regexps:
assert replace_from_list([("...", "out")], "abc ...") == "abc out"
Using regexps for this makes the searching fast. This won't iteratively replace replacements with further replacements, which is usually what's wanted.
I made a tiny recursive function for this:
def str_replace(sbjct, srch, rplc):
    if len(sbjct) == 0:
        return ''
    if len(srch) == 1:
        return sbjct.replace(srch[0], rplc[0])
    lst = sbjct.split(srch[0])
    reslst = []
    for s in lst:
        reslst.append(str_replace(s, srch[1:], rplc[1:]))
    return rplc[0].join(reslst)
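A quick check of the non-cascading behaviour (this call is illustrative):
# 'a' -> 'b' and 'b' -> 'x': the 'b's substituted for 'a' are inserted
# by the join after the later pair has run, so they are not re-replaced
print(str_replace('abcabc', ['a', 'b'], ['b', 'x']))  # bxcbxc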