Using Python's (2.7) 'json' module I'm looking to process various JSON feeds. Unfortunately some of these feeds do not conform to the JSON standard - in particular, some keys are not wrapped in double quote marks ("). This is causing Python to bail out.
Before writing an ugly-as-hell piece of code to parse and repair the incoming data, I thought I'd ask: is there any way to get Python either to parse this malformed JSON or to 'repair' the data so that it becomes valid JSON?
Working example
>>> import json
>>> json.loads('{"key1":1,"key2":2,"key3":3}')
{'key3': 3, 'key2': 2, 'key1': 1}
Broken example
>>> import json
>>> json.loads('{key1:1,key2:2,key3:3}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\json\__init__.py", line 310, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 346, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python27\lib\json\decoder.py", line 362, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)
I've written a small regex to fix the JSON coming from this particular provider, but I foresee this being an issue in the future. Below is what I came up with.
>>> import re
>>> s = '{key1:1,key2:2,key3:3}'
>>> s = re.sub('([{,])([^{:\s"]*):', lambda m: '%s"%s":'%(m.group(1),m.group(2)),s)
>>> s
'{"key1":1,"key2":2,"key3":3}'
You're trying to use a JSON parser to parse something that isn't JSON. Your best bet is to get the creator of the feeds to fix them.
I understand that isn't always possible. You might be able to fix the data using regexes, depending on how broken it is:
j = re.sub(r"{\s*(\w)", r'{"\1', j)
j = re.sub(r",\s*(\w)", r',"\1', j)
j = re.sub(r"(\w):", r'\1":', j)
Another option is to use the demjson module, which can parse JSON in non-strict mode.
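A minimal sketch of that route, assuming the third-party demjson package is installed (in its default non-strict mode the decoder accepts unquoted keys):

import demjson  # third-party: pip install demjson

obj = demjson.decode('{key1:1,key2:2,key3:3}')
# obj is now {'key1': 1, 'key2': 2, 'key3': 3}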
The regular expressions pointed out by Ned and cheeseinvert don't take into account the case where the match is inside a string.
See the following example (using cheeseinvert's solution):
>>> fixLazyJsonWithRegex ('{ key : "a { a : b }", }')
'{ "key" : "a { "a": b }" }'
The problem is that the expected output is:
'{ "key" : "a { a : b }" }'
Since JSON tokens are a subset of Python tokens, we can use Python's tokenize module.
Please correct me if I'm wrong, but the following code will fix a lazy JSON string in all these cases:
import tokenize
import token
from StringIO import StringIO

def fixLazyJson (in_text):
    tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

    result = []
    for tokid, tokval, _, _, _ in tokengen:
        # fix unquoted strings
        if (tokid == token.NAME):
            if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
                tokid = token.STRING
                tokval = u'"%s"' % tokval

        # fix single-quoted strings
        elif (tokid == token.STRING):
            if tokval.startswith ("'"):
                tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

        # remove invalid commas
        elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
            if (len(result) > 0) and (result[-1][1] == ','):
                result.pop()

        result.append((tokid, tokval))

    return tokenize.untokenize(result)
So, in order to parse a JSON string, you might want to wrap the call to fixLazyJson so it only runs once json.loads fails (to avoid a performance penalty for well-formed JSON):
import json

def json_decode (json_string, *args, **kwargs):
    try:
        return json.loads(json_string, *args, **kwargs)
    except ValueError:
        # fall back to repairing the string, then parse again
        json_string = fixLazyJson(json_string)
        return json.loads(json_string, *args, **kwargs)
The only problem I see with fixing lazy JSON is that if the JSON is still malformed, the error raised by the second json.loads will reference the line and column in the modified string, not the original one.
As a final note, I just want to point out that it would be straightforward to update any of the methods to accept a file object instead of a string.
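For instance, a thin wrapper for file objects could look like this (fixLazyJsonFile is a hypothetical name, not part of the code above):

def fixLazyJsonFile(fp):
    # fp is an already-open file object; fixLazyJson does the actual work.
    # Alternatively, tokenize.generate_tokens(fp.readline) could be fed the
    # file's readline directly, since tokenize only needs a readline callable.
    return fixLazyJson(fp.read())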
BONUS: Apart from this, people usually like to include C/C++-style comments when JSON is used for configuration files. In that case, you can either remove the comments with a regular expression, or use the extended version below and fix the JSON string in one pass:
import tokenize
import token
from StringIO import StringIO

def fixLazyJsonWithComments (in_text):
    """ Same as fixLazyJson but removing comments as well
    """
    result = []
    tokengen = tokenize.generate_tokens(StringIO(in_text).readline)

    sline_comment = False
    mline_comment = False
    last_token = ''

    for tokid, tokval, _, _, _ in tokengen:

        # ignore single line and multi line comments
        if sline_comment:
            if (tokid == token.NEWLINE) or (tokid == tokenize.NL):
                sline_comment = False
            continue

        # ignore multi line comments
        if mline_comment:
            if (last_token == '*') and (tokval == '/'):
                mline_comment = False
            last_token = tokval
            continue

        # fix unquoted strings
        if (tokid == token.NAME):
            if tokval not in ['true', 'false', 'null', '-Infinity', 'Infinity', 'NaN']:
                tokid = token.STRING
                tokval = u'"%s"' % tokval

        # fix single-quoted strings
        elif (tokid == token.STRING):
            if tokval.startswith ("'"):
                tokval = u'"%s"' % tokval[1:-1].replace ('"', '\\"')

        # remove invalid commas
        elif (tokid == token.OP) and ((tokval == '}') or (tokval == ']')):
            if (len(result) > 0) and (result[-1][1] == ','):
                result.pop()

        # detect single-line comments
        elif tokval == "//":
            sline_comment = True
            continue

        # detect multiline comments
        elif (last_token == '/') and (tokval == '*'):
            result.pop() # remove previous token
            mline_comment = True
            continue

        result.append((tokid, tokval))
        last_token = tokval

    return tokenize.untokenize(result)
Expanding on Ned's suggestion, the following has been helpful for me:
j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
In a similar case, I have used ast.literal_eval. AFAIK, this fails only when a constant such as null (corresponding to Python None) appears in the JSON; the lowercase true and false literals are not valid Python either.
Given that you know about the null/None predicament, you can:
import ast
decoded_object= ast.literal_eval(json_encoded_text)
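If the feed does contain those keywords, one rough workaround is a textual substitution before the call; a sketch, assuming the keyword never occurs inside string values (the sample text is made up):

import ast

# hypothetical feed text: Python-literal syntax except for the JSON null
json_encoded_text = '{"key1": null, "key2": [1, 2], "key3": "abc"}'

# naive fix-up; true/false would need the same treatment (True/False)
decoded_object = ast.literal_eval(json_encoded_text.replace('null', 'None'))
# decoded_object is {'key1': None, 'key2': [1, 2], 'key3': 'abc'}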
In addition to Ned's and cheeseinvert's suggestions, adding (?!/) should avoid the mentioned problem with URLs:
j = re.sub(r"{\s*'?(\w)", r'{"\1', j)
j = re.sub(r",\s*'?(\w)", r',"\1', j)
j = re.sub(r"(\w)'?\s*:(?!/)", r'\1":', j)
j = re.sub(r":\s*'(\w+)'\s*([,}])", r':"\1"\2', j)
j = re.sub(r",\s*]", "]", j)
The following code works in Python without any issue. It opens the JSON file and replaces all bananas with apples:
import json

replacements = {"banana": "apple"}

with open(mycodepath, 'r') as file:
    data = file.read()

for old, new in replacements.items():
    data = data.replace(old, new)
However, I want to replace "ArmReoriented", "visible": true with "ArmReoriented", "visible": false,
I tried using triple quotes, but it does not work.
replacements = """ArmReoriented", "visible": true""" : """ArmReoriented", "visible": false,"""
How I replace a text containing quotes on JSON using Python?
Using """ is basically a comment or a doc-string and would not work.
Put double quote between a single quote string like below.
replacements = {'"ArmReoriented", "visible": true' : '"ArmReoriented", "visible": false'}
Try switching to single quotes in your replacements dict:
replacements = {'"ArmReoriented", "visible": true': '"ArmReoriented", "visible": false,'}
data = 'asd asd asd "ArmReoriented", "visible": true asd asd'
for old, new in replacements.items():
data = data.replace(old, new)
print(data)
You can also escape the double quotes in both the key and the value of your replacements dict:
replacements = {"\"ArmReoriented\", \"visible\": true": "\"ArmReoriented\", \"visible\": false,"}
If we know which key the value ArmReoriented is attached to, we can do this the right way, processing your content as structured data rather than as a string:
import json

def hideReorientedArm(obj):
    if isinstance(obj, dict):
        if obj.get("LastEvent") == "ArmReoriented" and obj.get("visible") is True:
            obj["visible"] = False
    return obj

def walk(obj, updateFn):
    if isinstance(obj, list):
        obj = [walk(elem, updateFn) for elem in obj]
    elif isinstance(obj, dict):
        obj = {k: walk(v, updateFn) for k, v in obj.items()}
    return updateFn(obj)

with open(mycodepath, 'r') as file:
    data = json.load(file)

data = walk(data, hideReorientedArm)
See this running at https://ideone.com/Zr6546
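To persist the change, the updated structure can then be written back out; a minimal sketch, reusing mycodepath as the output path (an assumption):

with open(mycodepath, 'w') as file:
    json.dump(data, file, indent=2)  # overwrite the file with the updated structure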
Even if you don't know the name of the specific key having that value, you can still search all of them if that's necessary for some reason, replacing the shorter hideReorientedArm definition above with something more like the following:
def hideReorientedArm(obj):
    if isinstance(obj, dict):
        if obj.get("visible") is True:
            foundArmReoriented = False
            for (k, v) in obj.items():
                if v == "ArmReoriented":
                    foundArmReoriented = True
                    break
            if foundArmReoriented:
                obj["visible"] = False
    return obj
...see this version running at https://ideone.com/z34Mx1
I have the string below, and I am able to grab the text I want (the text is wrapped between patterns). The code is given below:
import re

val1 = '[{"vmdId":"Text1","vmdVersion":"text2","vmId":"text3"},{"vmId":"text4","vmVersion":"text5","vmId":"text6"}]'

temp = val1.split(',')
list_len = len(temp)
for i in range(0, list_len):
    var = temp[i]
    found = re.findall(r':"([^(]*)\"\;', var)
    print ''.join(found)
I would like to replace the values (Text1, text2, text3, etc.) with new values provided by the user, or read from another XML. (Text1, text2, ... are totally random alphanumeric data.) Some details below:
Text1 = somename
text2 = alphanumeric value
text3 = somename
Text4 = somename
text5 = alphanumeric value
text6 = somename
anstring =
[{"vmdId":"newText1","vmdVersion":"newtext2","vmId":"newtext3"},{"vmId":"newtext4","vmVersion":"newtext5","vmId":"newtext6"}]
I decided to go with replace(), but later realized the data is not constant, hence I am seeking help again.
Any help would be appreciated. Also, let me know if I can improve the way I am grabbing the values right now, as I am new to regex.
You can do this by using backreferences in combination with re.sub:
import re
val1 = '[{"vmdId":"Text1","vmdVersion":"text2","vmId":"text3"},{"vmId":"text4","vmVersion":"text5","vmId":"text6"}]'
ansstring = re.sub(r'(?<=:")([^(]*)', r'new\g<1>' , val1)
print ansstring
\g<1> refers to the text matched by the first group ().
EDIT
Maybe a better approach would be to decode the string, change the data, and encode it again. This should let you access the values more easily.
import sys

# python2 version
if sys.version_info[0] < 3:
    import HTMLParser
    html = HTMLParser.HTMLParser()

    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c, c) for c in text)

    html.escape = html_escape
else:
    import html

import json

val1 = '[{"vmdId":"Text1","vmdVersion":"text2","vmId":"text3"},{"vmId":"text4","vmVersion":"text5","vmId":"text6"}]'
print(val1)

unescaped = html.unescape(val1)
json_data = json.loads(unescaped)

for d in json_data:
    d['vmId'] = 'new value'

new_unescaped = json.dumps(json_data)
new_val = html.escape(new_unescaped)
print(new_val)
I hope this helps.
I have a file containing text like this:
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
Does anyone know how I could extract variables like the ones below:
appList=["application1","application2"]
ServerOfapp1=["127.0.0.1:8082","127.0.0.1:8083","127.0.0.1:8084"]
ServerOfapp2=["127.0.0.1:8092","127.0.0.1:8093","127.0.0.1:8094"]
... and so on.
If the lines you want always start with upstream and server, this should work:
app_dic = {}
with open('file.txt', 'r') as f:
    for line in f:
        if line.startswith('upstream'):
            app_i = line.split()[1]
            server_of_app_i = []
            for line in f:
                if not line.startswith('server'):
                    break
                server_of_app_i.append(line.split()[1][:-1])
            app_dic[app_i] = server_of_app_i
app_dic should then be a dictionary of lists:
{'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'],
'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']}
EDIT
If the input file does not contain any newline characters, then as long as the file is not too large you can read it into a list and iterate over it:
app_dic = {}
with open('file.txt', 'r') as f:
    txt_iter = iter(f.read().split())  # iterator over the whitespace-separated words
    for word in txt_iter:
        if word == 'upstream':
            app_i = next(txt_iter)
            server_of_app_i = []
            for word in txt_iter:
                if word == 'server':
                    server_of_app_i.append(next(txt_iter)[:-1])
                elif word == '}':
                    break
            app_dic[app_i] = server_of_app_i
This is uglier, as one has to search for the closing curly bracket in order to break. If it gets any more complicated, a regex should be used.
If you are able to use the newer regex module by Matthew Barnett, you can use the following solution, see an additional demo on regex101.com:
import regex as re

rx = re.compile(r"""
    (?:(?P<application>application\d)\s{\n|  # "application" + digit + { + newline
       (?!\A)\G\n)                           # assert that the next match starts here
    server\s                                 # match "server"
    (?P<server>[\d.:]+);                     # followed by digits, . and :
    """, re.VERBOSE)
string = """
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
"""
result = {}

for match in rx.finditer(string):
    if match.group('application'):
        current = match.group('application')
        result[current] = list()
    if current:
        result[current].append(match.group('server'))

print result
# {'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094'], 'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084']}
This makes use of the \G anchor, named capture groups, and some programming logic.
This is the basic method:
import re

# each of your objects here
objText = "xyz xcyz 244.233.233.2:123"

listOfAll = re.findall(r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):[0-9]{1,5}", objText)

for eachMatch in listOfAll:
    print "Here's one! %s" % eachMatch
Obviously that's a bit rough around the edges, but it will perform a full-scale regex search of whatever string it's given. Probably a better solution would be to pass it the objects themselves, but for now I'm not sure what you would have as raw input. I'll try to improve on the regex, though.
I believe this can be solved with re as well:
>>> import re
>>> from collections import defaultdict
>>>
>>> APP = r'\b(?P<APP>application\d+)\b'
>>> IP = r'server\s+(?P<IP>[\d\.:]+);'
>>>
>>> pat = re.compile('|'.join([APP, IP]))
>>>
>>>
>>> scan = pat.scanner(s)  # s holds the config text from the question
>>> d = defaultdict(list)
>>>
>>> for m in iter(scan.search, None):
        group = m.lastgroup
        if group == 'APP':
            keygroup = m.group(group)
            continue
        else:
            d[keygroup].append(m.group(group))
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Or similarly with re.finditer method and without pat.scanner:
>>> for m in re.finditer(pat, s):
        group = m.lastgroup
        if group == 'APP':
            keygroup = m.group(group)
            continue
        else:
            d[keygroup].append(m.group(group))
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Using json.dumps I can convert a dictionary to json format like this:
>>> from json import dumps
>>> dumps({'height': 100, 'title': 'some great title'})
'{"title": "some great title", "height": 100}'
However, I'm looking to turn my dictionary into a JavaScript object literal, like this:
{title: "some great title", height: 100}
(Notice there are no double quotes around title and height.)
Is there a Python library to do this?
If you know all your keys are valid tokens you can use this simple code:
import json

def js_list(encoder, data):
    pairs = []
    for v in data:
        pairs.append(js_val(encoder, v))
    return "[" + ", ".join(pairs) + "]"

def js_dict(encoder, data):
    pairs = []
    for k, v in data.iteritems():
        pairs.append(k + ": " + js_val(encoder, v))
    return "{" + ", ".join(pairs) + "}"

def js_val(encoder, data):
    if isinstance(data, dict):
        val = js_dict(encoder, data)
    elif isinstance(data, list):
        val = js_list(encoder, data)
    else:
        val = encoder.encode(data)
    return val

encoder = json.JSONEncoder(ensure_ascii=False)
js_val(encoder, {'height': 100, 'title': 'some great title'})
The return value of js_val() is in your desired format.
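For illustration, the call above yields a string in the requested shape (key order may vary, since a Python 2 dict is unordered):

print js_val(encoder, {'height': 100, 'title': 'some great title'})
# e.g. {title: "some great title", height: 100}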
Dirty hack, but it could work: subclass JSONEncoder and override the _iterencode_dict method so that it yields keys formatted the way you want. In Python 3.3 it is at line 367 of the json.encoder module, in the line yield _encoder(key). You would probably want to copy the whole method and change that line to something like yield key if isinstance(key, str) else _encoder(key).
Again, this is really dirty; unless you have no other option, don't do it. Still, it is doable, which is worth knowing if you really need it.
In Python, is there a way to check if a string is valid JSON before trying to parse it?
For example working with things like the Facebook Graph API, sometimes it returns JSON, sometimes it could return an image file.
You can try to do json.loads(), which will throw a ValueError if the string you pass can't be decoded as JSON.
In general, the "Pythonic" philosophy for this kind of situation is called EAFP, for Easier to Ask for Forgiveness than Permission.
Example Python script returns a boolean if a string is valid json:
import json
def is_json(myjson):
try:
json.loads(myjson)
except ValueError as e:
return False
return True
Which prints:
print is_json("{}") #prints True
print is_json("{asdf}") #prints False
print is_json('{ "age":100}') #prints True
print is_json("{'age':100 }") #prints False
print is_json("{\"age\":100 }") #prints True
print is_json('{"age":100 }') #prints True
print is_json('{"foo":[5,6.8],"foo":"bar"}') #prints True
Convert a JSON string to a Python dictionary:
import json
mydict = json.loads('{"foo":"bar"}')
print(mydict['foo']) #prints bar
mylist = json.loads("[5,6,7]")
print(mylist) #prints [5, 6, 7]
Convert a python object to JSON string:
foo = {}
foo['gummy'] = 'bear'
print(json.dumps(foo)) #prints {"gummy": "bear"}
If you want access to low-level parsing, don't roll your own, use an existing library: http://www.json.org/
Great tutorial on python JSON module: https://pymotw.com/2/json/
To check whether a string is JSON and show syntax errors and error messages (using the Perl json_xs command-line tool):
sudo cpan JSON::XS
echo '{"foo":[5,6.8],"foo":"bar" bar}' > myjson.json
json_xs -t none < myjson.json
Prints:
, or } expected while parsing object/hash, at character offset 28 (before "bar}
at /usr/local/bin/json_xs line 183, <STDIN> line 1.
json_xs is capable of syntax checking, parsing, prettifying, encoding, decoding and more:
https://metacpan.org/pod/json_xs
I would say parsing it is the only way you can really, entirely tell. An exception will be raised by Python's json.loads() function (almost certainly) if it's not the correct format. However, for the purposes of your example you can probably just check the first couple of non-whitespace characters...
I'm not familiar with the JSON that Facebook sends back, but most JSON strings from web apps will start with an open square [ or curly { bracket. No image formats I know of start with those characters.
Conversely if you know what image formats might show up, you can check the start of the string for their signatures to identify images, and assume you have JSON if it's not an image.
Another simple hack to identify a graphic, rather than a text string, in the case where you're looking for a graphic, is just to test for non-ASCII characters in the first couple of dozen characters of the string (assuming the JSON is ASCII).
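A rough sketch of that kind of pre-check (the function name is made up and the signature list is illustrative, not exhaustive):

def looks_like_json(blob):
    # Common image formats announce themselves with magic bytes, while JSON
    # from a web API normally starts with '{' or '['.
    image_signatures = (b'\x89PNG', b'\xff\xd8\xff', b'GIF87a', b'GIF89a')  # PNG, JPEG, GIF
    if blob.startswith(image_signatures):
        return False
    return blob.lstrip()[:1] in (b'{', b'[')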
I came up with a generic, interesting solution to this problem:
class SafeInvocator(object):
    def __init__(self, module):
        self._module = module

    def _safe(self, func):
        def inner(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return None
        return inner

    def __getattr__(self, item):
        obj = getattr(self._module, item)  # note: must use self._module, as set in __init__
        return self._safe(obj) if hasattr(obj, '__call__') else obj
and you can use it like so:
safe_json = SafeInvocator(json)
text = "{'foo':'bar'}"
item = safe_json.loads(text)
if item:
    # do something
An effective and reliable way to check for valid JSON: if the 'get' accessor doesn't throw an AttributeError, then the JSON is valid.
import json

valid_json = {'type': 'doc', 'version': 1, 'content': [{'type': 'paragraph', 'content': [{'text': 'Request for widget', 'type': 'text'}]}]}
invalid_json = 'opo'

def check_json(p, attr):
    doc = json.loads(json.dumps(p))
    try:
        doc.get(attr)  # we don't care if the value exists, only that 'get()' is accessible
        return True
    except AttributeError:
        return False
To use, we call the function and look for a key.
# Valid JSON
print(check_json(valid_json, 'type'))
Returns 'True'
# Invalid JSON / Key not found
print(check_json(invalid_json, 'type'))
Returns 'False'
Much simpler with a try block. You can then validate whether the body is valid JSON:
async def get_body(request: Request):
    # Request is the web framework's request class (e.g. FastAPI/Starlette)
    try:
        body = await request.json()
    except Exception:
        body = await request.body()
    return body