Parsing JSON using Python with multiple dictionaries [duplicate]

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:
{field1: {}, field2: "some value", field3: {}, ...}
and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().
Any suggestions on how I can solve this problem? Is there a known parser that does this?

This decodes your "list" of JSON objects from a string:
from json import JSONDecoder
def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)
    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)
    return objs
The bonus here is that you play nicely with the parser, so it keeps telling you exactly where it found an error.
Examples
>>> loads_invalid_obj_list('{}{}')
[{}, {}]
>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "decode.py", line 9, in loads_invalid_obj_list
obj, end = decoder.raw_decode(s, idx=end)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
Clean Solution (added later)
import json
import re
#shameless copy paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)
class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)
        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs
Examples
>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]
>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]
>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
return cls(encoding=encoding, **kw).decode(s)
File "decode.py", line 15, in decode
obj, end = self.raw_decode(s, idx=_w(s, end).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)

Sebastian Blask has the right idea, but there's no reason to use regexes for such a simple change.
objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))
Or, more legibly
raw_objs_string = open('your_file.name').read() #read in raw data
raw_objs_string = raw_objs_string.replace('}{', '},{') #insert a comma between each object
objs_string = '[%s]'%(raw_objs_string) #wrap in a list, to make valid json
objs = json.loads(objs_string) #parse json

How about something like this:
import re
import json
jsonstr = open('test.json').read()
p = re.compile( '}\s*{' )
jsonstr = p.sub( '}\n{', jsonstr )
jsonarr = jsonstr.split( '\n' )
for jsonstr in jsonarr:
    jsonobj = json.loads( jsonstr )
    print json.dumps( jsonobj )

Solution
As far as I know, }{ does not appear in valid JSON outside of string values, so the following should be perfectly safe when trying to get strings for the separate objects that were concatenated (txt is the content of your file). It does not require any imports (not even the re module):
retrieved_strings = map(lambda x: '{'+x+'}', txt.strip('{}').split('}{'))
or, if you prefer list comprehensions (as David Zwicker mentioned in the comments), you can use one like this:
retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]
It will result in retrieved_strings being a list of strings, each containing a separate JSON object. See proof here: http://ideone.com/Purpb
Example
The following string:
'{field1:"a",field2:"b"}{field1:"c",field2:"d"}{field1:"e",field2:"f"}'
will be turned into:
['{field1:"a",field2:"b"}', '{field1:"c",field2:"d"}', '{field1:"e",field2:"f"}']
as proven in the example I mentioned.

Why don't you load the file as a string, replace all }{ with },{ and surround the whole thing with []? Something like:
re.sub(r'\}\s*\{', '}, {', string_read_from_a_file)
Or a simple string replace if you are sure you always have }{ with no whitespace in between.
In case you expect }{ to occur inside strings as well, you could also split on }{ and evaluate each fragment with json.loads; if you get an error, the fragment wasn't complete, so you add the next fragment to it and try again, and so forth — a sketch follows.
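A minimal sketch of that retry idea (illustrative only; it assumes txt holds the file contents and that objects are separated by exactly }{ with no whitespace):
import json

def split_concatenated(txt):
    # Split on '}{', then keep accumulating fragments until each parses;
    # a fragment that fails to parse is assumed to be incomplete because
    # '}{' occurred inside a string value.
    objs = []
    pending = ''
    fragments = txt.split('}{')
    for i, frag in enumerate(fragments):
        if i > 0:
            frag = '{' + frag          # restore the brace eaten by split
        if i < len(fragments) - 1:
            frag = frag + '}'
        pending += frag
        try:
            objs.append(json.loads(pending))
            pending = ''
        except ValueError:             # incomplete: keep accumulating
            continue
    return objs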

import json
file1 = open('filepath', 'r')
data = file1.readlines()
for line in data:
    values = json.loads(line)
    # Now you can access all the objects using values.get('key')

How about reading through the file, incrementing a counter every time a { is found and decrementing it when you come across a }? When your counter reaches 0, you'll know that you've come to the end of the first object, so send that slice through json.loads and start counting again. Then just repeat to completion.
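A minimal sketch of that counter idea (illustrative only: it assumes braces never appear inside string values, which a robust version would have to handle by tracking whether the scanner is inside a quoted string):
import json

def split_by_brace_depth(text):
    objs = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == '{':
            if depth == 0:
                start = i              # beginning of a new top-level object
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:             # back at depth 0: one complete object
                objs.append(json.loads(text[start:i + 1]))
    return objs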

Suppose you added a [ to the start of the text in a file, and used a version of json.load() which, when it detects the error of finding a { instead of an expected comma (or hits the end of the file), spits out the just-completed object?

To fix a file with that junk in it, in place:
$ sed -i -e 's;}{;}, {;g' foo
Do it on the fly in Python:
junkJson.replace('}{', '}, {')

Related

TypeError: cannot use a string pattern on a bytes-like object python3

I have updated my project to Python 3.7 and Django 3.0. Here is the code from models.py:
def get_fields(self):
    fields = []
    html_text = self.html_file.read()
    self.html_file.seek(0)
    # for now just find singleline, multiline, img editable
    # may put repeater in there later (!!)
    for m in re.findall("(<(singleline|multiline|img editable)[^>]*>)", html_text):
        # m is ('<img editable="true" label="Image" class="w300" width="300" border="0">', 'img editable')
        # or similar
        # first is full tag, second is tag type
        # append as a list
        # MUST also save value in here
        data = {'tag': m[0], 'type': m[1], 'label': '', 'value': None}
        title_list = re.findall("label\s*=\s*\"([^\"]*)", m[0])
        if len(title_list) == 1:
            data['label'] = title_list[0]
        # store the data
        fields.append(data)
    return fields
Here is my error traceback
File "/home/harika/krishna test/dev-1.8/mcam/server/mcam/emails/models.py", line 91, in get_fields
for m in re.findall("(<(singleline|multiline|img editable)[^>]*>)", html_text):
File "/usr/lib/python3.7/re.py", line 225, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
How can I solve my issue?
The thing is that Python 3's read returns bytes (i.e. the "raw" representation) and not a string. You can convert between bytes and strings if you specify an encoding, i.e. how characters are converted to bytes:
>>> '☺'.encode('utf8')
b'\xe2\x98\xba'
>>> '☺'.encode('utf16')
b'\xff\xfe:&'
The b before the string signifies that the value is not a string but rather bytes. You can also supply raw bytes by using that prefix:
>>> bytes_x = b'x'
>>> string_x = 'x'
>>> bytes_x == string_x
False
>>> bytes_x.decode('ascii') == string_x
True
>>> bytes_x == string_x.encode('ascii')
True
Note that you can only use basic (ASCII) characters when using the b prefix:
>>> b'☺'
File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
So to fix your problem you need to either convert the input to a string with the appropriate encoding:
html_text = self.html_file.read().decode('utf-8') # or 'ascii' or something else
Or -- probably the better option -- use bytes in the findall calls instead of strings:
for m in re.findall(b"(<(singleline|multiline|img editable)[^>]*>)", html_text):
    ...
    title_list = re.findall(b"label\s*=\s*\"([^\"]*)", m[0])
(note the b in front of each "string")
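One hedged follow-up (an assumption about the surrounding code, not something from the question): with a bytes pattern the matched groups are bytes too, so if the rest of get_fields expects text you would decode them when building data, for example:
data = {'tag': m[0].decode('utf-8'),    # decode the bytes match back to str
        'type': m[1].decode('utf-8'),
        'label': '', 'value': None}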

Python regex problem when using the sub() method to replace content in html code to respective translations [duplicate]

This question already has an answer here:
python re.sub group: number after \number
(1 answer)
Closed 3 years ago.
I am a beginner at Python and have run into a coding problem that I can't solve.
What I have:
the source sentences and their respective translations in two columns in a spreadsheet;
the html code which contains sentences and html tags
What I'm trying to do: use the Python regex method sub() to find English sentences and replace them with their respective translated sentences.
For example, three sentences in the html code:
Pumas are large animals.
They are found in America.
They don't eat grass
I have the translations of each sentence in the html code. I want to replace the sentences one at a time and also keep the html tags. Normally I can use the sub() method like this:
regex1 = re.compile(r'(\>.*)SOURCE_SENTENCE_HERE ?(.*\<)')
resultCode = regex1.sub(r'\1TRANSLATION_SENTENCE_HERE\2', originalHtmlCode)
I've written a Python script to do this. I save the html code in a txt file and access it in my Python code (succeeded). Then I create a dictionary to store the source-target pairs from the spreadsheet mentioned above (succeeded). Lastly, I use the regex sub() method to find and replace the sentences in the html code (failed). This last part didn't work at all for some reason. Link to my Python code - https://pastebin.com/ZSUNB4yg or below:
import re, openpyxl, pyperclip
buynavFile = open('C:\\Users\\zs\\Documents\\PythonScripts\\buynavCode.txt')
buynavCode = buynavFile.read()
buynavFile.close()
wb = openpyxl.load_workbook('buynavSegments.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
segDict = {}
maxRow = sheet.max_row
for i in range(2, maxRow + 1):
    segDict[sheet.cell(row=i, column=3).value] = sheet.cell(row=i, column=4).value
for k, v in segDict.items():
    k = '(\\>.*)' + str(k) + ' ?(.*\\<)'
    v = '\\1' + str(v) + '\\2'
    buynavRegex = re.compile(k)
    buynavResult = buynavRegex.sub(v, buynavCode)
pyperclip.copy(buynavResult)
print('Result copied to clipboard')
Error message below:
Traceback (most recent call last):
File "C:\Users\zs\Documents\PythonScripts\buynav.py", line 20, in <module>
buynavResult = buynavRegex.sub(v, buynavCode)
File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\re.py", line 326, in _subx
template = _compile_repl(template, pattern)
File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\re.py", line 317, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 943, in parse_template
addgroup(int(this[1:]), len(this) - 1)
File "C:\Users\zs\AppData\Local\Programs\Python\Python36\lib\sre_parse.py", line 887, in addgroup
raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 11 at position 1
Could someone enlighten me on this please? I would really appreciate it.
Consider if you want to use a replacement text where you have to take the contents of group 1 and concatenate it with the string 2. You could write r'\12', but this won't work, because the regex parser will think you are referencing group 12 instead of group 1 followed by the string 2!
>>> re.sub(r'(he)llo', r'\12', 'hello')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python3.6/re.py", line 326, in _subx
template = _compile_repl(template, pattern)
File "/usr/lib/python3.6/re.py", line 317, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File "/usr/lib/python3.6/sre_parse.py", line 943, in parse_template
addgroup(int(this[1:]), len(this) - 1)
File "/usr/lib/python3.6/sre_parse.py", line 887, in addgroup
raise s.error("invalid group reference %d" % index, pos)
sre_constants.error: invalid group reference 12 at position 1
You can solve this using the \g<1> syntax to refer to the group: r'\g<1>2':
>>> re.sub(r'(he)llo', r'\g<1>2', 'hello')
'he2'
In your case your replacement string contains dynamic content like str(v), which can be anything. If it happens to start with a number, you end up in the case described above, so you want to use \g<1> to avoid this issue.
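Here is a hedged sketch of how the loop from the question could be adapted (as an aside, it also escapes the source sentence with re.escape and applies each substitution to the accumulated result rather than to the original code, since overwriting buynavResult on every iteration looks like a separate bug; treat this as an illustration, not a drop-in fix):
result = buynavCode
for k, v in segDict.items():
    pattern = re.compile('(\\>.*)' + re.escape(str(k)) + ' ?(.*\\<)')
    # \g<1>/\g<2> keep the group references unambiguous even when the
    # translation starts with a digit; escaping backslashes in v makes it
    # literal in the replacement template
    template = '\\g<1>' + str(v).replace('\\', '\\\\') + '\\g<2>'
    result = pattern.sub(template, result)
pyperclip.copy(result)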

What is the Python version of this?

How would you do this in Python?
(It goes through a file, grabs the string between author": " and the next ", and between text": " and the next \, and then writes them to their respective files.)
Here is an example string before it goes through this:
{"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}
#!/bin/bash
cat html.txt | awk -F 'author": "' {'print $2'} | cut -d '"' -f1 >> users.txt
cat html.txt | awk -F 'text": "' {'print $2'} | cut -d '\' -f1 >> comments.txt
I tried to do it like this in Python (didn't work):
import re
start = '"author": "'
end = '", '
st = open("html.txt", "r")
s = st.readlines()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
#print u.group(1)
Not sure if I'm close.
I get this error:
Traceback (most recent call last):
File "test.py", line 9, in <module>
u = re.search('%s(.*)%s' % (start, end), s).group(1)
File "/usr/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Before getting into any of this: As chepner pointed out in a comment, this input looks like, and therefore is probably intended to be, JSON. Which means you shouldn't be parsing it with regular expressions; just parse it as JSON:
>>> s = ''' {"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}'''
>>> import json
>>> obj = json.loads(s)
>>> obj['author']
'HasRah'
Actually, it's not clear whether your input is a JSON file (a file containing one JSON text), or a JSONlines file (a file containing a bunch of lines, each of which is a JSON text with no embedded newlines).1
For the former, you want to parse it like this:
obj = json.load(st)
For the latter, you want to loop over the lines, and parse each one like this:
for line in st:
    obj = json.loads(line)
… or, alternatively, you can get a JSONlines library off PyPI.
But meanwhile, if you want to understand what's wrong with your code:
The error message is telling you the problem, although maybe not in the user-friendliest way:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 148, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
As the docs for re.search make clear:
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance…
You haven't passed it a string; you've passed it a list of strings. That's the whole point of readlines, after all.
There are two obvious fixes here.
First, you could read the whole file into a single string, instead of reading it into a list of strings:
s = st.read()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
Alternatively, you could loop over the lines, trying to match each one. And, if you do this, you still don't need readlines(), because a file is already an iterable of lines:
for line in st:
    u = re.search('%s(.*)%s' % (start, end), line).group(1)
While we're at it: if any of your lines don't match the pattern, this is going to raise an AttributeError. After all, search returns None if there's no match, and then you're trying to call None.group(1).
There are two obvious fixes here as well.
You could handle that error:
try:
    u = re.search('%s(.*)%s' % (start, end), line).group(1)
except AttributeError:
    pass
… or you could check whether you got a match:
m = re.search('%s(.*)%s' % (start, end), line)
if m:
    u = m.group(1)
1. In fact, there are at least two other formats that are nearly, but not quite, identical to JSONlines. I think that if you only care about reading, not creating files, and you don't have any numbers, you can parse all of them with a loop around json.loads or with a JSONlines library. But if you know who created the file, and know that they intended it to be, say, NDJ rather than JSONlines, you should read the docs on NDJ, or get a library made for NDJ, rather than just trusting that some guy on the internet thinks it's OK to treat it as JSONlines.

Read Null terminated string in python

I'm trying to read a null-terminated string, but I'm having issues when unpacking a char and putting it together with a string.
This is the code:
def readString(f):
    str = ''
    while True:
        char = readChar(f)
        str = str.join(char)
        if (hex(ord(char))) == '0x0':
            break
    return str

def readChar(f):
    char = unpack('c', f.read(1))[0]
    return char
Now this is giving me this error:
TypeError: sequence item 0: expected str instance, int found
I'm also trying the following:
char = unpack('c',f.read(1)).decode("ascii")
But it throws me:
AttributeError: 'tuple' object has no attribute 'decode'
I don't even know how to read the chars and add them to the string. Is there a proper way to do this?
Here's a version that (ab)uses the lesser-known two-argument "sentinel" form of iter(), which calls a callable repeatedly until it returns the sentinel value:
with open('file.txt', 'rb') as f:
    val = ''.join(iter(lambda: f.read(1).decode('ascii'), '\x00'))
How about:
myString = myNullTerminatedString.split("\x00")[0]
For example:
myNullTerminatedString = "hello world\x00\x00\x00\x00\x00\x00"
myString = myNullTerminatedString.split("\x00")[0]
print(myString) # "hello world"
This works by splitting the string on the null character. Since the string should terminate at the first null character, we simply grab the first item in the list after splitting. split will return a list of one item if the delimiter doesn't exist, so it still works even if there's no null terminator at all.
It also will work with byte strings:
myByteString = b'hello world\x00'
myStr = myByteString.split(b'\x00')[0].decode('ascii') # "hello world" as normal string
If you're reading from a file, you can do relatively large reads instead; estimate how much you'll need to read to find your null terminator. This is a lot faster than reading byte-by-byte. For example (with the file opened in binary mode):
resultingStr = b''
while True:
    buf = f.read(512)
    resultingStr += buf
    if len(buf) == 0:
        break
    if b"\x00" in resultingStr:
        # bytes read past (and including) the null terminator
        extraBytes = len(resultingStr) - resultingStr.index(b"\x00")
        resultingStr = resultingStr.split(b"\x00")[0]
        break
# now "resultingStr" contains the string (as bytes)
f.seek(-extraBytes, 1)  # seek backwards; the pointer is now on the null byte in the file
# or f.seek(1 - extraBytes, 1) to skip past the null byte in the file
(edit version 2: added an extra way at the end)
Maybe there are libraries out there that can help you with this, but as I don't know of any, let's attack the problem at hand with what we know.
In Python 2, bytes and str are basically the same thing; that changed in Python 3, where str is what py2 calls unicode and bytes is its own separate type. This means that in py2 you don't need to define a readChar, as no extra work is required, so I don't think you need that unpack function for this particular case. With that in mind, let's define the new readString:
def readString(myfile):
    chars = []
    while True:
        c = myfile.read(1)
        if not c or c == chr(0):   # stop at the null terminator (or at EOF)
            return "".join(chars)
        chars.append(c)
Just like with your code, I read one character at a time, but I save them in a list instead; the reason is that strings are immutable, so doing str += char results in unnecessary copies. When I find the null character, I return the joined string. (chr is the inverse of ord: it gives you the character for a given ASCII value.) This excludes the null character; if you need it, just move the append above the check.
Now let's test it with your sample file. For instance, let's try to read "Sword_Wea_Dummy" from it:
with open("sword.blendscn","rb") as archi:
#lets simulate that some prior processing was made by
#moving the pointer of the file
archi.seek(6)
string=readString(archi)
print "string repr:", repr(string)
print "string:", string
print ""
#and the rest of the file is there waiting to be processed
print "rest of the file: ", repr(archi.read())
and this is the output
string repr: 'Sword_Wea_Dummy'
string: Sword_Wea_Dummy
rest of the file: '\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
other tests
>>> with open("sword.blendscn","rb") as archi:
print readString(archi)
print readString(archi)
print readString(archi)
sword
Sword_Wea_Dummy
ÍÌÌ=p=Š4:¦6¿JÆ=
>>> with open("sword.blendscn","rb") as archi:
print repr(readString(archi))
print repr(readString(archi))
print repr(readString(archi))
'sword'
'Sword_Wea_Dummy'
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6='
>>>
Now that I think about it, you mention that the data portion is of fixed size. If that is true for all files, and the structure of all of them is as follows:
[unknown-size data][known-size data]
then that is a pattern we can exploit. We only need to know the size of the file, and we can get both parts smoothly, as follows:
import os

def getDataPair(filename, knownSize):
    size = os.path.getsize(filename)
    with open(filename, "rb") as archi:
        unknown = archi.read(size - knownSize)
        known = archi.read()
    return unknown, known
and by knowing the size of the data portion, its use is simple (I found the 80 by playing with the prior example):
>>> string_data, data = getDataPair("sword.blendscn", 80)
>>> string_data
'sword\x00Sword_Wea_Dummy\x00'
>>> data
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
>>> string_data.split(chr(0))
['sword', 'Sword_Wea_Dummy', '']
>>>
Now, to get each string, a simple split will suffice, and you can pass the rest of the file (contained in data) to the appropriate function to be processed.
Doing file I/O one character at a time is horribly slow.
Instead use readline0, now on pypi: https://pypi.org/project/readline0/ . Or something like it.
In 3.x, there's a "newline" argument to open, but it doesn't appear to be as flexible as readline0.
Here is my implementation:
import struct

def read_null_str(f):
    r_str = ""
    while 1:
        back_offset = f.tell()
        try:
            r_char = struct.unpack("c", f.read(1))[0].decode("utf8")
        except UnicodeDecodeError:
            # not a single-byte character: rewind and read it as a
            # 2-byte little-endian code unit instead
            f.seek(back_offset)
            temp_char = struct.unpack("<H", f.read(2))[0]
            r_char = chr(temp_char)
        if ord(r_char) == 0:
            return r_str
        else:
            r_str += r_char

Python: json.loads chokes on escapes [duplicate]

This question already has an answer here:
json reading error json.decoder.JSONDecodeError: Invalid \escape
(1 answer)
Closed 7 months ago.
I have an application that is sending a JSON object (formatted with Prototype) to an ASP server. On the server, the Python 2.6 "json" module tries to loads() the JSON, but it's choking on some combination of backslashes. Observe:
>>> s
'{"FileExists": true, "Version": "4.3.2.1", "Path": "\\\\host\\dir\\file.exe"}'
>>> tmp = json.loads(s)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
{... blah blah blah...}
File "C:\Python26\lib\json\decoder.py", line 155, in JSONString
return scanstring(match.string, match.end(), encoding, strict)
ValueError: Invalid \escape: line 1 column 58 (char 58)
>>> s[55:60]
u'ost\\d'
So column 58 is the escaped backslash. I thought this WAS properly escaped! UNC is \\host\dir\file.exe, so I just doubled up on the slashes. But apparently this is no good. Can someone assist? As a last resort I'm considering converting the \ to / and then back again, but this seems like a real hack to me.
Thanks in advance!
The correct JSON is:
r'{"FileExists": true, "Version": "4.3.2.1", "Path": "\\\\host\\dir\\file.exe"}'
Note the letter r: if you omit it, you need to escape \ for Python too.
>>> import json
>>> d = json.loads(s)
>>> d.keys()
[u'FileExists', u'Path', u'Version']
>>> d.values()
[True, u'\\\\host\\dir\\file.exe', u'4.3.2.1']
Note the difference:
>>> repr(d[u'Path'])
"u'\\\\\\\\host\\\\dir\\\\file.exe'"
>>> str(d[u'Path'])
'\\\\host\\dir\\file.exe'
>>> print d[u'Path']
\\host\dir\file.exe
The Python REPL prints repr(obj) by default for an object obj:
>>> class A:
...     __str__ = lambda self: "str"
...     __repr__ = lambda self: "repr"
...
>>> A()
repr
>>> print A()
str
Therefore your original s string is not properly escaped for JSON: it contains the unescaped '\d' (and '\f', which JSON reads as a form-feed escape rather than a literal backslash and letter). print s must show '\\d', otherwise it is not correct JSON.
NOTE: JSON string is a collection of zero or more Unicode characters, wrapped in double quotes, using backslash escapes (json.org). I've skipped encoding issues (namely, transformation from byte strings to unicode and vice versa) in the above examples.
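As an illustration of the sending side (a sketch, not the asker's actual Prototype code): building the JSON with json.dumps does the backslash doubling for you, and loads round-trips it:
>>> import json
>>> s = json.dumps({"FileExists": True, "Version": "4.3.2.1",
...                 "Path": "\\\\host\\dir\\file.exe"})
>>> json.loads(s)[u'Path']
u'\\\\host\\dir\\file.exe'
>>> print json.loads(s)[u'Path']
\\host\dir\file.exe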
Since the exception gives you the index of the offending escape character, this little hack I developed might be nice :)
def fix_JSON(json_message=None):
    result = None
    try:
        result = json.loads(json_message)
    except Exception as e:
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
        # Remove the offending character:
        json_message = list(json_message)
        json_message[idx_to_replace] = ' '
        new_message = ''.join(json_message)
        return fix_JSON(json_message=new_message)
    return result
>>> s
'{"FileExists": true, "Version": "4.3.2.1", "Path": "\\\\host\\dir\\file.exe"}'
>>> print s
{"FileExists": true, "Version": "4.3.2.1", "Path": "\\host\dir\file.exe"}
You've not actually escaped the string, so it's trying to parse invalid escape codes like \d or \f. Consider using a well-tested JSON encoder, such as json2.js.
