How to fix missing double quotes issue when parsing JSON data?

How to fix missing double quotes issue when parsing JSON data? - python

I am running a piece of code in Python3 where I am consuming JSON data from the source. I don't have control over the source. While reading the json data I am getting following error:
simplejson.errors.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2
Here is the code
import logging
import simplejson as json
logging.basicConfig(level=logging.INFO)
consumer = KafkaConsumer(
bootstrap_servers='localhost:9092',
api_version=(1,0,0))
consumer.subscribe(['Test_Topic-1'])
for message in consumer:
msg_str=message.value
y = json.loads(msg_str)
print(y["city_name"])
As I can not change the source, I need to fix it at my end. I found out this post helpful as my data contains the timestamps with : in it: How to Fix JSON Key Values without double-quotes?
But it also fails for some values in my json data as those values contain : in it. e.g.
address:"1600:3050:rf02:hf64:h000:0000:345e:d321"
Is there any way where I can add double quotes to keys in my json data?

You can try to use module dirtyjson - it can fix some mistakes.
import dirtyjson
d = dirtyjson.loads('{address:"1600:3050:rf02:hf64:h000:0000:345e:d321"}')
print( d['address'] )
d = dirtyjson.loads('{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}')
print( d['abc'] )
It creates AttributedDict so it may need dict() to create normal dictionary
d = dirtyjson.loads('{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}')
print( d )
print( dict(d) )
Result:
AttributedDict([('abc', '1:2:3:4'), ('efg', '5:6:7:8'), ('hij', 'foo')])
{'abc': '1:2:3:4', 'efg': '5:6:7:8', 'hij': 'foo'}

I think your problem is that you have strings like this:
{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo"}
which are not valid JSON. You could try to repair it with a regular expression substitution:
import re
jtxt_bad ='{abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo", klm:"bar"\n}'
jtxt = re.sub(r'\b([a-zA-Z]+):("[^"]+"[,\n}])', r'"\1":\2', jtxt_bad)
print(f'Original: {jtxt_bad}\nRepaired: {jtxt}')
The output of this is:
Original: {abc:"1:2:3:4", efg:"5:6:7:8", "hij":"foo", klm:"bar"
}
Repaired: {"abc":"1:2:3:4", "efg":"5:6:7:8", "hij":"foo", "klm":"bar"
}
The regular expression \b([a-zA-Z]+):("[^"]+"[,\}]) means: boundary, followed by one or more letters, followed by a :, followed by double-quoted string, followed by one of ,, }, \n. However, this will fail if there is a quote inside the string, such as "1:\"2:3".

Related

Python put string into dictionary

I want to convert a string into a dictionary. I saved this dictionary previously in a text file.
The problem is now, that I am not sure, how the structure of the keys are. The values are generated with Counter(dictionaryName). The dictionary is really large, so I cannot check every key to see how it would be possible.
The keys can contain simple quotes like ', double quotes ", commas and maybe other characters. So is there any possibility to convert it back into a dictionary?
For example this is stored in the file:
Counter({'element0':512, "'4,5'element1":50, '4:55foobar':23,...})
I found previous solutions with for example json, but I have problems with the double quotes and I cannot simply split for the commas.

If you trust the source, load from collections import Counter and eval() the string

How about something like:
>> from collections import Counter
>> line = '''Counter({'element0':512, "'4,5'element1":50, '4:55foobar':23})'''
>> D = eval(line)
>> D
Counter({"'4,5'element1": 50, '4:55foobar': 23, 'element0': 512})

You could remove the Counter( and ) parts, then parse the rest with ast.literal_eval as long as it only involves basic Python data types:
import ast
def parse_Counter_string(s):
s = s.strip()
if not (s.startswith('Counter(') and s.endswith(')')):
raise ValueError('String does not match expected format')
# Counter( is 8 characters
# 12345678
s = s[8:-1]
return Counter(ast.literal_eval(s))
In the future, I recommend picking a different way to serialize your data.

you can use demjson library for doing this, you can have the text directly in your program
import demjson
counter = demjson.decode("enter your text here")
if it is in the file ,you can do the following steps :
WD = dirname(realpath(__file__))
file = open(WD, "filename"), "r")
counter = demjson.decode(file.read())
file.close()

Importing wrongly concatenated JSONs in python

I've a text document that has several thousand jsons strings in the form of: "{...}{...}{...}". This is not a valid json it self but each {...} is.
I currently use the following a regular expression to split them:
fp = open('my_file.txt', 'r')
raw_dataset = (re.sub('}{', '}\n{', fp.read())).split('\n')
Which basically breaks every line where a curly bracket closes and other opens (}{ -> }\n{) so I can split them into different lines.
The problem is that few of them have a tags attribute written as "{tagName1}{tagName2}" which breaks my regular expression.
An example would be:
'{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
Is parsed into
'{"name":"Bob Dylan", "tags":"{Artist}'
'{Singer}"}'
'{"name": "Michael Jackson"}'
instead of
'{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}'
'{"name": "Michael Jackson"}'
What is the proper way of achieve this for further json parsing?

Use the raw_decode method of json.JSONDecoder
>>> import json
>>> d = json.JSONDecoder()
>>> x='{"name":\"Bob Dylan\", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}'
>>> d.raw_decode(x)
({'tags': '{Artist}{Singer}', 'name': 'Bob Dylan'}, 47)
>>> x=x[47:]
>>> d.raw_decode(x)
({'name': 'Michael Jackson'}, 27)
raw_decode returns a 2-tuple, the first element being the decoded JSON and the second being the offset in the string of the next byte after the JSON ended.
To loop until the end or until an invalid JSON element is encountered:
>>> while True:
... try:
... j,n = d.raw_decode(x)
... except ValueError:
... break
... print(j)
... x=x[n:]
...
{'name': 'Bob Dylan', 'tags': '{Artist}{Singer}'}
{'name': 'Michael Jackson'}
When the loop breaks, inspection of x will reveal if it has processed the whole string or had encountered a JSON syntax error.
With a very long file of short elements you might read a chunk into a buffer and apply the above loop, concatenating anything that's left over with the next chunk after the loop breaks.

You can use the jq command line utility to transfer your input to json. Let's say you have the following input:
input.txt:
{"name":"Bob Dylan", "tags":"{Artist}{Singer}"}{"name": "Michael Jackson"}
You can use jq -s, which consumes multiple json documents from input and transfers them into a single output array:
jq -s . input.txt
Gives you:
[
{
"name": "Bob Dylan",
"tags": "{Artist}{Singer}"
},
{
"name": "Michael Jackson"
}
]
I've just realized that there are python bindings for libjq. Meaning you
don't need to use the command line, you can use jq directly in python.
https://github.com/mwilliamson/jq.py
However, I've not tried it so far. Let me give it a try :) ...
Update: The above library is nice, but it does not support the slurp mode so far.

you need to make a parser ... I dont think regex can help you for
data = ""
curlies = []
def get_dicts(file_text):
for letter in file_text:
data += letter
if letter == "{":
curlies.append(letter)
elif letter == "}":
curlies.pop() # remove last
if not curlies:
yield json.loads(data)
data = ""
note that this does not actually solve the problem that {name:"bob"} is not valid json ... {"name":"bob"} is
this will also break in the event you have weird unbalanced parenthesis inside of strings ie {"name":"{{}}}"} would break this
really your json is so broken based on your example your best bet is probably to edit it by hand and fix the code that is generating it ... if that is not feasible you may need to write a more complex parser using pylex or some other grammar library (effectively writing your own language parser)

Parsing JSON failed

I am trying to parse this data (from the Viper malware analysis framework API specifically). I am having a hard time figure out the best way to do this. Ideally, I would just do a:
jsonObject.get("SSdeep")
... and I would get the value.
I don't think this is valid JSON unfortunately, and without editing the source of the project, how can I make this proper JSON or easily get these values?
[{
'data': {
'header': ['Key', 'Value'],
'rows': [
['Name', u 'splwow64.exe'],
['Tags', ''],
['Path', '/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['Size', 125952],
['Type', 'PE32+ executable (GUI) x86-64, for MS Windows'],
['Mime', 'application/x-dosexec'],
['MD5', '4b1d2cba1367a7b99d51b1295b3a1d57'],
['SHA1', 'caf8382df0dcb6e9fb51a5e277685b540632bf18'],
['SHA256', '8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781'],
['SHA512', '709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d'],
['SSdeep', ''],
['CRC32', '7106095E']
]
},
'type': 'table'
}]
Edit 1
Thank you! So I have tried this:
jsonObject = r.content.replace("'", "\"")
jsonObject = jsonObject.replace(" u", "")
and the output I have now is:
"[{"data": {"header": ["Key", "Value"], "rows": [["Name","splwow64.exe"], ["Tags", ""], ["Path", "/home/ubuntu/viper-master/projects/../binaries/8/e/e/5/8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["Size", 125952], ["Type", "PE32+ executable (GUI) x86-64, for MS Windows"], ["Mime", "application/x-dosexec"], ["MD5", "4b1d2cba1367a7b99d51b1295b3a1d57"], ["SHA1", "caf8382df0dcb6e9fb51a5e277685b540632bf18"], ["SHA256", "8ee5b228bd78781aa4e6b2e15e965e24d21f791d35b1eccebd160693ba781781"], ["SHA512", "709ca98bfc0379648bd686148853116cabc0b13d89492c8a0fa2596e50f7e4d384e5c359081a90f893d8d250cfa537193cbaa1c53186f29c0b6dedeb50d53d4d"], ["SSdeep", ""], ["CRC32", "7106095E"]]}, "type": "table"}]"
and now I'm getting this error:
File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 5 - line 1 column 716 (char 4 - 715)
Note: I'd really rather not do the find and replaces like that.. especially the " u" one, as this could have unintended consequences.
Edit 2:
Figured it out! Thank you everyone!
Here's what I ended up doing, as someone mentioned the original text from the server was a "list of dicts":
r = requests.post(url, data=data) #Make the server request
listObject = r.content #Grab the content (don't really need this line)
listObject = listObject[1:-1] #Get rid of the quotes
listObject = ast.literal_eval(listObject) #Create a list out of the literal characters of the string
dictObject = listObject[0] #My dict!

JSON specifies double quotes "s for strings, from the JSON standard
A value can be a string in double quotes, or a number, or true or false or null, or an object or an array.
So you would need to replace all the single quotes with double quotes:
data.replace("'", '"')
There is also a spurious u in the Name field that will need to be removed.
However if the data is valid Python and you trust it you could try evaluating it, this worked with your original data (without the space after the u):
result = eval(data)
Or more safely:
result = ast.literal_eval(data)

Now you appear to have quotes "wrapping" the entire thing. Which is causing all the brackets to be strings. Remove the quotes at the start and end of the JSON.
Also, in JSON, start the structure with either '[' or '{' (usually '{'), not both.

No need to use eval(), just replace the malformed characters (use escape \ character) and parse it with json will be fine:
resp = r.content.replace(" u \'", " \'").replace("\'", "\"")
json.loads(resp)

How to dump json without quotes in python

Here is how I dump a file
with open('es_hosts.json', 'w') as fp:
json.dump(','.join(host_list.keys()), fp)
The results is
"a,b,c"
I would like:
a,b,c
Thanks

Before doing a string replace, you might want to strip the quotation marks:
print '"a,b,c"'.strip('"')
Output:
a,b,c
That's closer to what you want to achieve. Even just removing the first and the last character works: '"a,b,c"'[1:-1].
But have you looked into this question?

To remove the quotation marks in the keys only, which may be important if you are parsing it later (presumably with some tolerant parser or maybe you just pipe it directly into node for bizarre reasons), you could try the following regex.
re.sub(r'(?<!: )"(\S*?)"', '\\1', json_string)
One issue is that this regex expects fields to be seperated key: value and it will fail for key:value. You could make it work for the latter with a minor change, but similarly it won't work for variable amounts of whitespace after :
There may be other edge cases but it will work with outputs of json.dumps, however the results will not be parseable by json. Some more tolerant parsers like yaml might be able to read the results.
import re
regex = r'(?<!: )"(\S*?)"'
o = {"noquotes" : 127, "put quotes here" : "and here", "but_not" : "there"}
s = json.dumps(o)
s2 = json.dumps(o, indent=3)
strip_s = re.sub(regex,'\\1',s)
strip_s2 = re.sub(regex,'\\1',s2)
print(strip_s)
print(strip_s2)
assert(json.loads(strip_s) == json.loads(s) == json.loads(strip_s2) == json.loads(s2) == object)
Will raise a ValueError but prints what you want.

Well, that's not valid json, so the json module won't help you to write that data. But you can do this:
import json
with open('es_hosts.json', 'w') as fp:
data = ['a', 'b', 'c']
fp.write(json.dumps(','.join(data)).replace('"', ''))
That's because you asked for json, but since that's not json, this should suffice:
with open('es_hosts.json', 'w') as fp:
data = ['a', 'b', 'c']
fp.write(','.join(data))

Use python's built-in string replace function
with open('es_hosts.json', 'w') as fp:
json.dump(','.join(host_list.keys()).replace('\"',''), fp)

Just use for loop to assign list to string.
import json
with open('json_file') as f:
data = json.loads(f.read())
for value_wo_bracket in data['key_name']:
print(value_wo_bracket)
Note there is difference between json.load and json.loads

Python Regex: find all lines that start with '{' and end with '}'

I am receiving data over a socket, a bunch of JSON strings. However, I receive a set amount of bytes, so sometimes the last of my JSON strings is cut-off. I will typically get the following:
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}
{"pitch":-30.816765,"yaw":-125
With Python, I would like to create a string array of the first 18 complete { data... } strings.
Here is what I have tried: cleanData = re.search('{.*}', data) but it seems like this is only giving me the very first { data... } entry. How can I get the full string array of complete { } sets?

To get all, you can use re.finditer or re.findall.
>>> re.findall(r'{.*}', s)
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>
OR
>>> [x.group() for x in re.finditer(r'{.*}', s)]
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>

You need re.findall() (or re.finditer)
>>> import re
>>> for r in re.findall(r'{.*}', data)[:18]:
print r
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}

Extracting lines that start and end with a specific character can be done without any regex, use str.startswith and str.endswith methods when iterating through the lines in a file:
results = []
with open(filepath, 'r') as f:
for line in f:
if line.startswith('{') and line.rstrip('\n').endswith('}'):
results.append(line.rstrip('\n'))
Note the .rstrip('\n') is used before .endswith to make sure the final newline does not interfere with the } check at the end of the string.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to fix missing double quotes issue when parsing JSON data? - python

Related

Python put string into dictionary

Importing wrongly concatenated JSONs in python

Parsing JSON failed

How to dump json without quotes in python

Python Regex: find all lines that start with '{' and end with '}'

Categories

Resources