Decode Google Translate JSON response in Python [duplicate]

I would like to parse JSON-like strings. Their only difference from standard JSON is the presence of consecutive commas in arrays. Where two commas are adjacent, a null is implied between them. Example:
JSON-like: ["foo",,,"bar",[1,,3,4]]
Javascript: ["foo",null,null,"bar",[1,null,3,4]]
Decoded (Python): ["foo", None, None, "bar", [1, None, 3, 4]]
The native json.JSONDecoder class doesn't let me change the behavior of array parsing. I can only customize the parsing of objects (dicts), ints, floats and strings, by passing hook functions to JSONDecoder() as keyword arguments (see the docs).
So does that mean I have to write a JSON parser from scratch? The Python source of the json module is available, but it's quite a mess. I would prefer to reuse its internals instead of duplicating its code!
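For reference, a minimal sketch of the hooks JSONDecoder() does expose (none of them covers arrays):

import json

# These keyword hooks customize objects and scalars, but there is no
# equivalent hook for arrays/lists.
dec = json.JSONDecoder(
    object_hook=lambda d: d,   # called with each decoded object (dict)
    parse_int=int,             # called with the text of each integer literal
    parse_float=float,         # called with the text of each float literal
)
print(dec.decode('{"a": [1, 2]}'))  # {'a': [1, 2]}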

Since what you're trying to parse isn't JSON per se, but rather a different language that's very much like JSON, you may need your own parser.
Fortunately, this isn't as hard as it sounds. You can use a Python parser generator like pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to fit your needs.

Small & simple workaround to try out:
1. Convert the JSON-like data to a string.
2. Replace ",," with ",null,".
3. Convert it to whatever your representation is.

Let JSONDecoder() do the heavy lifting.
Steps 1 and 3 can be omitted if you already deal with strings.
(And if converting to string is impractical, update your question with this info!)
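A minimal sketch of the idea (assuming, as noted elsewhere in this thread, that there are no consecutive commas inside string literals):

import json

s = '["foo",,,"bar",[1,,3,4]]'
# Matches can't overlap, so a run of commas needs repeated passes.
while ',,' in s:
    s = s.replace(',,', ',null,')
print(json.loads(s))  # ['foo', None, None, 'bar', [1, None, 3, 4]]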

You can do the comma replacement of Lattyware's/przemo_li's answers in one pass by using a lookbehind expression, i.e. "replace all commas that are preceded by just a comma":
>>> s = '["foo",,,"bar",[1,,3,4]]'
>>> re.sub(r'(?<=,)\s*,', ' null,', s)
'["foo", null, null,"bar",[1, null,3,4]]'
Note that this will only work for small things where you can assume there are no consecutive commas inside string literals, for example. In general, regular expressions aren't enough to handle this problem, and Taymon's approach of using a real parser is the only fully correct solution.

It's a hackish way of doing it, but one solution is to simply do some string modification on the JSON-ish data to get it in line before parsing it.
import re
import json

not_quite_json = '["foo",,,"bar",[1,,3,4]]'
not_json = True
while not_json:
    not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)
Which leaves us with:
'["foo", null, null, "bar",[1, null, 3,4]]'
We can then do:
json.loads(not_quite_json)
Giving us:
['foo', None, None, 'bar', [1, None, 3, 4]]
Note that it's not as simple as a single replace, as the replacement itself inserts commas that may in turn need replacing. Given this, you have to loop until no more replacements can be made. Here I have used a simple regex to do the job.

I've had a look at Taymon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs.
It works well at simulating JavaScript's eval() but fails in one situation: trailing commas. There should be an optional trailing comma – see the tests below – but I can't find a proper way to implement this.
from pyparsing import *

TRUE = Keyword("true").setParseAction(replaceWith(True))
FALSE = Keyword("false").setParseAction(replaceWith(False))
NULL = Keyword("null").setParseAction(replaceWith(None))

jsonString = dblQuotedString.setParseAction(removeQuotes)
jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                     Optional('.' + Word(nums)) +
                     Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))

jsonObject = Forward()
jsonValue = Forward()

# black magic begins
commaToNull = Word(',,', exact=1).setParseAction(replaceWith(None))
jsonElements = ZeroOrMore(commaToNull) + Optional(jsonValue) + ZeroOrMore((Suppress(',') + jsonValue) | commaToNull)
# black magic ends

jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL)
memberDef = Group(jsonString + Suppress(':') + jsonValue)
jsonMembers = delimitedList(memberDef)
jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))
jsonComment = cppStyleComment
jsonObject.ignore(jsonComment)

def convertNumbers(s, l, toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)

jsonNumber.setParseAction(convertNumbers)

def test():
    tests = (
        '[1,2]',      # ok
        '[,]',        # ok
        '[,,]',       # ok
        '[ , , , ]',  # ok
        '[,1]',       # ok
        '[,,1]',      # ok
        '[1,,2]',     # ok
        '[1,]',       # failure, I got [1, None], I should have [1]
        '[1,,]',      # failure, I got [1, None, None], I should have [1, None]
    )
    for test in tests:
        results = jsonArray.parseString(test)
        print(results.asList())
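For what it's worth, one way to get those last two tests passing is to forbid the None-producing comma right before the closing bracket, and then allow a single suppressed trailing comma. A sketch of a replacement for the jsonElements definition above (reasoned through rather than battle-tested; FollowedBy is pyparsing's non-consuming lookahead):

# A lone comma only counts as an elision when it is NOT followed by ']';
# a final comma before ']' is swallowed as a trailing comma instead.
jsonElements = (ZeroOrMore(commaToNull)
                + Optional(jsonValue)
                + ZeroOrMore((Suppress(',') + jsonValue)
                             | (commaToNull + ~FollowedBy(']')))
                + Optional(Suppress(',')))

With this, '[1,]' parses to [1] and '[1,,]' to [1, None], while the other tests keep their previous results.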

For those looking for something quick and dirty to convert general JS objects to dicts: part of the page of one real site gives me an object I'd like to tackle. There are 'new' constructs for dates, and it's all on one line with no spaces in between, so two substitutions suffice:
from re import sub

data = sub(r'new Date\(([^)]*)\)', r'\1', data)
data = sub(r'([,{])(\w*):', r'\1"\2":', data)

Then json.loads() worked fine. Your mileage may vary :)
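For illustration, with a made-up one-line input in the same spirit (hypothetical data, not from the original site):

from re import sub
import json

data = '[new Date(1354000000000),{name:"foo",count:3}]'
data = sub(r'new Date\(([^)]*)\)', r'\1', data)   # unwrap the Date constructor
data = sub(r'([,{])(\w*):', r'\1"\2":', data)     # quote bare object keys
print(json.loads(data))  # [1354000000000, {'name': 'foo', 'count': 3}]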

Related

Extend Formatter features to f-string syntax

In a project of mine, I'm passing strings to a Formatter subclass which formats them using the format specifier mini-language. In my case it is customized (using the features of the Formatter class) by adding additional bang converters: !u converts the resulting string to uppercase, !c to titlecase, !q doubles any square bracket (because reasons), and some others.
For example, using a = "toFu", "{a!c}" becomes "Tofu".
How could I make my system use f-string syntax, so I can have "{a+a!c}" turned into "Tofutofu"?
NB: I'm not asking for a way of making f"{a+a!c}" (note the presence of an f) resolve itself as "Tofutofu", which is what "hook into the builtin python f-string format machinery" covers; I'm asking if there is a way for a function or any form of Python code to turn "{a+a!c}" (note the absence of an f) into "Tofutofu".
Not sure I still fully understand what you need, but from the details given in the question and some comments, here is a function that parses strings with the format you specified and gives the desired results:
import re

def formatter(s):
    def replacement(match):
        expr, frmt = match[1].split('!')
        if frmt == 'c':
            return eval(expr).title()
    return re.sub(r"{([^{]+)}", replacement, s)

a = "toFu"
print(formatter("blah {a!c}"))
print(formatter("{a+a!c}blah"))
Outputs:
blah Tofu
Tofutofublah
This uses the function form of the repl argument of re.sub. The function (replacement) can be further extended to support all the other !xs.
Main disadvantages:
Using eval is evil.
It doesn't take into account regular format specifiers, e.g. :0.3.
Maybe someone can take it from here and improve it.
Evolved from @Tomerikoo's life-saving answer, here's the code:
import re

def formatter(s):
    def replacement(match):
        pre, bangs, suf = match.group(1, 2, 3)
        # pre   : the part before the first bang
        # bangs : the bang (if any) and the characters going with it
        # suf   : the colon (if any) and the characters going with it
        if not bangs:
            return eval("f\"{" + pre + suf + "}\"")
        conversion = set(bangs[1:])  # the first character is always a bang
        sra = conversion - set("tiqulc")
        conversion = conversion - sra
        if sra:
            sra = "!" + "".join(sra)
        value = eval("f\"{" + pre + (sra or "") + suf + "}\"")
        if "q" in conversion:
            value = value.replace("{", "{{")
        if "u" in conversion:
            value = value.upper()
        if "l" in conversion:
            value = value.lower()
        if "c" in conversion and value:
            value = value.capitalize()
        return value
    return re.sub(r"{([^!:\n]+)((?:![^!:\n]+)?)((?::[^!:\n]+)?)}", replacement, s)
The massive regex results in the three groups I commented about at the top.
Caveats: it still uses eval (there's no acceptable way around it anyway), it doesn't allow multiline replacement fields, and putting spaces between the ! and the : may cause issues and/or discrepancies.
But these are acceptable for the use I have.
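A quick usage check (reusing a = "toFu" from the question; a must be visible as a global for the eval inside replacement to find it):

a = "toFu"
print(formatter("{a+a!c} is {a!u}"))  # Tofutofu is TOFU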
Please check the specification: only the conversion characters 's', 'r' and 'a' are allowed.
https://peps.python.org/pep-0498/

How to split a string and keep the pattern

This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}/n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
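To illustrate the pitfall: when the substring is missing, find() returns -1 and the slice quietly produces garbage, while index() raises:

>>> s = '012za}/n}ddfsdfk'
>>> s[:s.find('missing') + len('missing')]
'012za}'
>>> s[:s.index('missing') + len('missing')]
Traceback (most recent call last):
  ...
ValueError: substring not found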
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient or the most Pythonic way to achieve what you want; I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:

for item_with_delimiter in output:
    ...
It might be useful to look up os.linesep if you're not sure what the line ending will be: os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. Which to use depends on where the input data comes from and how flexible your code needs to be across environments.
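For instance (a small sketch; note it reflects the OS the code runs on, not necessarily where the data came from):

import os

sep = '}' + os.linesep + '}'   # '}\r\n}' on Windows, '}\n}' on Linux/Mac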
Adapted from Slice a string after a certain phrase?, you can combine find and slicing to get the first part of the string and retain '}/n}':

s = "012za}/n}ddfsdfk"
s[:s.find("}/n}") + 4]  # 4 == len("}/n}")

This will result in '012za}/n}'.

Parsing a lightweight language in Python

Say I define a string in Python like the following:
my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
I would like to parse that string in Python in a way that allows me to index the different structures of the language.
For example, the output could be a dictionary parsing_result that allows me to index the different elements in a structured manner.
For example, the following:
parsing_result['names']
would hold a list of strings: ['name1', 'name2']
whereas parsing_result['options'] would hold a dictionary so that:
parsing_result['something']['options']['opt2'] holds the string "text"
parsing_result['something_else']['options']['opt1'] holds the string "58"
My first question is: How do I approach this problem in Python? Are there any libraries that simplify this task?
For a working example, I am not necessarily interested in a solution that parses the exact syntax I defined above (although that would be fantastic), but anything close to it would be great.
Update
It looks like the generally right solution is to use a parser and a lexer such as ply (thank you @Joran), but the documentation is a bit intimidating. Is there an easier way to get this done when the syntax is lightweight?
I found this thread where the following regular expression is provided to partition a string around outer commas:
import re
r = re.compile(r'(?:[^,(]|\([^)]*\))+')
r.findall(s)
But this assumes that the grouping characters are () (and not {}). I am trying to adapt it, but it doesn't look easy.
I highly recommend pyparsing:
The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. The Python representation of the grammar is quite readable, owing to the self-explanatory class names, and the use of '+', '|' and '^' operator definitions. The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.
Sample code (Hello world from the pyparsing docs):
from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"  # <-- grammar defined here
hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
Output:
Hello, World! -> ['Hello', ',', 'World', '!']
Edit: Here's a solution to your sample language:
from pyparsing import *
import json

identifier = Word(alphas + nums + "_")
expression = identifier("lhs") + Suppress("=") + identifier("rhs")
struct_vals = delimitedList(Group(expression | identifier))
structure = Group(identifier + nestedExpr(opener="{", closer="}", content=struct_vals("vals")))
grammar = delimitedList(structure)

my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"

parse_result = grammar.parseString(my_string)
result_list = parse_result.asList()

def list_to_dict(l):
    d = {}
    for struct in l:
        d[struct[0]] = {}
        for ident in struct[1]:
            if len(ident) == 2:
                d[struct[0]][ident[0]] = ident[1]
            elif len(ident) == 1:
                d[struct[0]][ident[0]] = None
    return d

print(json.dumps(list_to_dict(result_list), indent=2))
Output: (pretty printed as JSON)
{
  "something_else": {
    "opt1": "58",
    "name3": null
  },
  "something": {
    "opt1": "2",
    "opt2": "text",
    "name2": null,
    "name1": null
  }
}
Use the pyparsing API as your guide to exploring the functionality of pyparsing and understanding the nuances of my solution. I've found that the quickest way to master this library is trying it out on some simple languages you think up yourself.
As stated by @Joran Beasley, you'd really want to use a parser and a lexer. They are not easy to wrap your head around at first, so you'd want to start off with a very simple tutorial on them.
If you are really trying to write a light weight language, then you're going to want to go with parser/lexer, and learn about context-free grammars.
If you are really just trying to write a program to strip data out of some text, then regular expressions would be the way to go.
If this is not a programming exercise, and you are just trying to get structured data in text format into python, check out JSON.
Here is a test of the regular expression modified to react to {} instead of ():
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
print(r.findall(s))
You'll get a list of separate 'named blocks' as a result:
['something{name1, name2, opt1=2, opt2=text}', ' something_else{name3, opt1=58}']
I've made better code that can parse your simple example; to go further, you should for example catch exceptions to detect syntax errors, and restrict valid block names and parameter names more tightly:
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
rblock = re.compile(r'\s*(\w+)\s*{(.*)}\s*')
rparam = re.compile(r'\s*([^=\s]+)\s*(=\s*([^,]+))?')

blocks = r.findall(s)
for block in blocks:
    resb = rblock.match(block)
    blockname = resb.group(1)
    blockargs = resb.group(2)
    print("block name=", blockname)
    print("args:")
    for arg in re.split(",", blockargs):
        resp = rparam.match(arg)
        paramname = resp.group(1)
        paramval = resp.group(3)
        if paramval is None:
            print("param name =\"{0}\" no value".format(paramname))
        else:
            print("param name =\"{0}\" value=\"{1}\"".format(paramname, str(paramval)))

Replacing leading text in Python

I use Python 2.6 and I want to replace each instance of certain leading characters (., _ and $ in my case) in a string with another character or string. Since in my case the replacement string is the same, I came up with this:
def replaceLeadingCharacters(string, old, new=''):
    t = string.lstrip(old)
    return new * (len(string) - len(t)) + t
which seems to work fine:
>>> replaceLeadingCharacters('._.!$XXX$._', '._$', 'Y')
'YYY!$XXX$._'
Is there a better (simpler or more efficient) way to achieve the same effect in Python ?
Is there a way to achieve this effect with a string instead of characters? Something like str.replace() that stops once something other than the string-to-be-replaced comes up in the input string? Right now I've come up with this:
def replaceLeadingString(string, old, new=''):
    n = 0
    o = 0
    s = len(old)
    while string.startswith(old, o):
        n += 1
        o += s
    return new * n + string[o:]
I am hoping that there is a way to do this without an explicit loop.
EDIT:
There are quite a few answers using the re module. I have a couple of questions/issues with it:
Isn't it significantly slower than the str methods when used as a replacement for them?
Is there an easy way to properly quote/escape strings that will be used in a regular expression? For example, if I wanted to use re for replaceLeadingCharacters, how would I ensure that the contents of the old variable don't mess things up in ^[old]+? I'd rather have a "black-box" function that does not require its users to pay attention to the list of characters they provide.
Your replaceLeadingCharacters() seems fine as is.
Here's a replaceLeadingString() implementation that uses the re module (without the while loop):
#!/usr/bin/env python
import re

def lreplace(s, old, new):
    """Return a copy of string `s` with leading occurrences of
    substring `old` replaced by `new`.

    >>> lreplace('abcabcdefabc', 'abc', 'X')
    'XXdefabc'
    >>> lreplace('_abc', 'abc', 'X')
    '_abc'
    """
    return re.sub(r'^(?:%s)+' % re.escape(old),
                  lambda m: new * (m.end() // len(old)),
                  s)
Isn't it significantly slower than the str methods when used as a replacement for them?
Don't guess. Measure it for expected input.
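For example, a quick timeit sketch (the input here is made up; substitute your own expected data):

import timeit

setup = "import re; s = '._.!$XXX$._' * 100"
print(timeit.timeit("s.lstrip('._$')", setup=setup, number=100000))
print(timeit.timeit("re.sub(r'^[._$]+', '', s)", setup=setup, number=100000))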
Is there an easy way to properly quote/escape strings that will be used in a regular expression?
re.escape(). For example:

>>> import re
>>> re.sub(r'^[._$]+', lambda m: 'Y' * m.end(0), '._.!$XXX$._')
'YYY!$XXX$._'
But IMHO your first solution is good enough.

Similar function to PHP's str_replace in Python?

Is there a similar function in Python that takes search (array) and replace (array) as parameters, then takes a value from each array and uses them to do search and replace on subject (string)?
I know I can achieve this using for loops, but I'm just looking for a more elegant way.
I believe the answer is no.
I would specify your search/replace strings in a list, and then iterate over it:

edits = [(search0, replace0), (search1, replace1), (search2, replace2)]  # etc.
for search, replace in edits:
    s = s.replace(search, replace)
Even if Python did have a str_replace-style function, I think I would still separate out my search/replace strings as a list, so really this is only taking one extra line of code.
Finally, this is a programming language after all. If it doesn't supply the function you want, you can always define it yourself.
Heh - you could use the one-liner below, whose elegance is second only to its convenience :-P
(It acts like PHP when search is longer than replace, too, if I read the PHP docs correctly.)
Edit: this new version works for all sizes of substrings to replace.
>>> subject = "Coming up with these convoluted things can be very addictive."
>>> search = ['Coming', 'with', 'things', 'addictive.', ' up', ' these', 'convoluted ', ' very']
>>> replace = ['Making', 'Python', 'one-liners', 'fun!']
>>> reduce(lambda s, p: s.replace(p[0], p[1]), [subject] + zip(search, replace + [''] * (len(search) - len(replace))))
'Making Python one-liners can be fun!'
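Note this is Python 2 (reduce is a builtin there, and zip returns a list). A Python 3 spelling of the same idea:

from functools import reduce

padded = replace + [''] * (len(search) - len(replace))
result = reduce(lambda s, p: s.replace(p[0], p[1]), zip(search, padded), subject)
print(result)  # Making Python one-liners can be fun!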
Do it with regexps:
import re

def replace_from_list(replacements, str):
    def escape_string_to_regex(str):
        return re.sub(r"([\\.^$*+?{}[\]|\(\)])", r"\\\1", str)
    def get_replacement(match):
        return replacements[match.group(0)]
    replacements = dict(replacements)
    replace_from = [escape_string_to_regex(r) for r in replacements.keys()]
    regex = "|".join(["(%s)" % r for r in replace_from])
    repl = re.compile(regex)
    return repl.sub(get_replacement, str)

# Simple replacement:
assert replace_from_list([("in1", "out1")], "in1") == "out1"

# Replacements are never themselves replaced, even if later search strings match
# earlier destination strings:
assert replace_from_list([("1", "2"), ("2", "3")], "123") == "233"

# These are plain strings, not regexps:
assert replace_from_list([("...", "out")], "abc ...") == "abc out"
Using regexps for this makes the searching fast. This won't iteratively replace replacements with further replacements, which is usually what's wanted.
I made a tiny recursive function for this:
def str_replace(sbjct, srch, rplc):
    if len(sbjct) == 0:
        return ''
    if len(srch) == 1:
        return sbjct.replace(srch[0], rplc[0])
    lst = sbjct.split(srch[0])
    reslst = []
    for s in lst:
        reslst.append(str_replace(s, srch[1:], rplc[1:]))
    return rplc[0].join(reslst)
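A quick check that, like the regex answer above, this doesn't re-replace earlier replacements -- each piece is split off before its replacement is joined back in:

print(str_replace("123", ["1", "2"], ["2", "3"]))  # 233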
