Python - json.loads large file of dictionaries that are not connected

I have a large (50,000+ lines) file that is a collection of JSON outputs from another application, which I would like to read in as JSON and perform some analysis on. The issue is that while a single entry is valid JSON, I can't read the entire file in as JSON because the entries aren't connected.
Snippet:
{"action":"Iops","idg":"2214472975167211","idx":537994,"system":"Qos","utc":"2019-07-02T11:45:09.606765Z","ver":"1.1","xQosIops":{"ActualReadOps":{"avg":0,"ct":60,"max":0,"min":0,"std":0,"tmax":29880,"tmin":29880}}}
{"action":"Latency","idg":"2214472975167211","idx":537995,"system":"Qos","utc":"2019-07-02T11:45:09.606829Z","ver":"1.1","xQosLatency":{"AverageLocalWriteLatencyUS":{"avg":0,"ct":60,"max":0,"min":0,"std":0,"tmax":29880,"tmin":29880}}}
Individually they are both valid, but what I would like to achieve is to dynamically connect all of these into a single JSON object. It is important to note that these JSON responses could span multiple lines, so I can't just read the file in line by line. Any help would be appreciated.

You can load the contents of the file with vanilla Python (not using the json package), then use json to parse each individual line.
Example:
import json
data_fp = "/path/to/data.txt"
with open(data_fp, "r") as f:
    lines = f.readlines()
# now, parse each line as a JSON string
json_object = [json.loads(l) for l in lines]
# optional: dump as a JSON file
with open("/path/to/output.json", "w") as f:
    json.dump(json_object, f)
Edit: if each dictionary is not necessarily limited to a single line, you could try parsing JSON for a variable number of lines until it succeeds (continuing from the example above):
start_line = 0
end_line = 1
json_object = []
while end_line <= len(lines):
    try:
        data = json.loads("".join(lines[start_line:end_line]))
    except json.JSONDecodeError:
        # not a complete JSON value yet: include one more line and retry
        end_line += 1
    else:
        json_object.append(data)
        start_line = end_line
        end_line = start_line + 1

If each line is valid JSON, you could wrap this in a script that reads them in individually and appends them to a list. Something like:
import json
data = []
with open("fakejson.txt") as data_f:
    for line in data_f:
        data.append(json.loads(line))
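One practical caveat with the line-by-line approach: json.loads raises an error on empty lines, which often appear at the end of such files. Here is a minimal sketch that skips blank lines and reports the offending line number. The helper name load_json_lines and the sample data are made up for illustration and are not part of the answers above.

```python
import io
import json

def load_json_lines(lines):
    """Parse an iterable of text lines, one JSON value per non-blank line."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines instead of failing on them
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # re-raise with the line number so the bad entry is easy to find
            raise ValueError(f"line {lineno}: {err.msg}") from err
    return records

# Any iterable of lines works; a file object from open() behaves the same way.
sample = io.StringIO('{"idx": 537994}\n\n{"idx": 537995}\n')
records = load_json_lines(sample)
print([r["idx"] for r in records])
```

This still assumes one complete JSON value per line, so it does not cover entries that span several lines.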

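You can create a function that recognizes a JSON object by looking for matching pairs of { and }. See below:
def isjson(t):
    # Return the first balanced {...} substring of t and the index just past it.
    # Note: braces inside string values are counted too.
    for i in range(len(t)):
        if t[i]=='{':
            s=t[i]
            c=1  # current nesting depth
            n=1  # offset from the opening brace
            while c>0:
                s+=t[i+n]
                if t[i+n]=='{':
                    c+=1
                elif t[i+n]=='}':
                    c-=1
                n+=1
            return (s, i+n)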
You can now load your entire file as text with the following:
with open('yourfile.txt') as f:
    t=f.read()
And extract all JSON objects using the above function:
d={}
n=1
while True:
    s, end = isjson(t)  # call once and reuse both results
    d[n]=s
    t=t[end+1:]
    n+=1
    if t.count('{')==0:
        break
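The brace counter above treats every { and } as structural, so a brace inside a string value would throw it off. As a rough sketch of how that could be handled, the variant below also tracks whether the scanner is inside a string and whether the previous character was a backslash. The name split_objects and the sample text are assumptions for illustration, and the input is assumed to be a well-formed sequence of JSON objects.

```python
import json

def split_objects(text):
    """Yield each top-level {...} substring, ignoring braces inside strings."""
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False          # this character was escaped; ignore it
            elif ch == "\\":
                escaped = True           # next character is escaped
            elif ch == '"':
                in_string = False        # closing quote
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i                # start of a new top-level object
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

# Braces and an escaped quote inside string values, plus an object spanning two lines.
text = '{"a": "x}{y",\n "b": {"c": 1}}\n{"d": "\\"}"}'
objects = [json.loads(chunk) for chunk in split_objects(text)]
print(objects)
```

Unlike the dictionary of raw strings built above, this parses each chunk with json.loads so the result is a list of Python dicts.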

Related

Append to a json array multiple times in Python

I want to write Python code that creates a large sample JSON file from an existing smaller JSON file by duplicating the objects in its "array" field x times, keeping the required format.
smaller JSON sample file:
{"item":"book1","price":"10.00","array":[{"object1":"var1","object2":"var2"}]}
output file:
{"item":"book1","price":"10.00","array":[{"object1":"var1","object2":"var2"},{"object1":"var1","object2":"var2"},{"object1":"var1","object2":"var2"},......]}
I have tried this, but I can't figure out how to duplicate just the objects in the array:
result = ''
x = 2
with open("test.json", "r") as infile:
    for i in range(x):
        infile.seek(0)
        result += infile.read() + ','
with open("merged.json", "w") as outfile:
    outfile.writelines(result)
which gives me this:
{"item":"book1","price":"10.00","array":[{"object1":"var1","object2":"var2"}]},{"item":"book1","price":"10.00","array":[{"object1":"var1","object2":"var2"}]}
You can use the json module and do something like this:
import json
myfile='test.json'
with open(myfile) as jfile:
    data=json.loads(jfile.read())
x=5 #x can be how many duplicates you want to make
data['array']=data['array']*x
with open(myfile,'w') as jfile:
    json.dump(data,jfile)  # json.dump writes to a file; json.dumps only returns a string
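To check the list-repetition idea without touching any files, here is a small in-memory sketch. It assumes the sample is valid JSON (with a colon after "array"); the variable names source and merged are made up for illustration.

```python
import json

source = '{"item":"book1","price":"10.00","array":[{"object1":"var1","object2":"var2"}]}'
data = json.loads(source)

x = 3  # number of copies wanted
# List repetition: the resulting list references the same dict x times,
# which is fine when the only goal is to serialize it.
data["array"] = data["array"] * x

merged = json.dumps(data)
print(len(data["array"]))
print(merged)
```

Because the repeated entries are the same dict object, changing one in memory changes all of them. If the copies need to be edited independently before writing, copy.deepcopy on each element would be one way around that.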

Accessing items in a dump of dictionary objects in Python

I have a strange dataset from our customer. It is a .json file but inside it looks like below
{"a":"aaa","b":"bbb","text":"hello"}
{"a":"aaa","b":"bbb","text":"hi"}
{"a":"aaa","b":"bbb","text":"hihi"}
As you notice, this is just a dump of dictionary objects. It is neither a list (no [] and no comma separators between objects) nor proper JSON, although the file extension is .json. So I am really confused about how to read this file.
All I care about is reading all the text keys from each of the dictionary objects.
This "strange dataset" is actually an existing format built on JSON, called JSON Lines (JSONL).
As @user655321 said, you can parse each line. Here's a more complete example that collects the whole dataset in a list of dicts named dataset:
import json
dataset = []
with open("my_file.json") as file:
    for line in file:
        dataset.append(json.loads(line))
In [51]: [json.loads(i)["text"] for i in open("file.json").readlines()]
Out[51]: ['hello', 'hi', 'hihi']
Use a list comprehension; it's easier.
You can read the file line by line, convert each line to a JSON object, and extract the data you need ("text" in your case).
You can do something as follows:
import json
lines = open("file.txt").readlines()
for line in lines:
    dictionary = json.loads(line)
    print(dictionary["text"])
Since it's not a single JSON document, you can read in the input line by line and deserialize each line independently:
import json
with open('my_file.json') as fh:
    for line in fh:
        json_obj = json.loads(line)
        keys = json_obj.keys() # eg, 'a', 'b', 'text'
        text_val = json_obj['text'] # eg, 'hello', 'hi', or 'hihi'
How about splitting the content by \n and then using json to load each dictionary? Something like:
import json
with open(your_file) as f:
    data = f.read()
my_dicts = []
# splitlines() splits on line breaks only; split() would also split on spaces inside values
for line in data.splitlines():
    my_dicts.append(json.loads(line))
import ast
with open('my_file.json') as fh:
    for line in fh:
        try:
            dict_data = ast.literal_eval(line)
            assert isinstance(dict_data,dict)
            ### Process Dictionary Data here or append to list to convert to list of dicts
        except (SyntaxError, ValueError, AssertionError):
            print('ERROR - {} is not a dictionary'.format(line))
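A note on the ast.literal_eval variant: it parses Python literals, not JSON, so it only works when a line happens to be valid in both. A short sketch of where the two differ, using a made-up line that contains the JSON literals true and null:

```python
import ast
import json

line = '{"a": "aaa", "ok": true, "note": null}'

# json.loads understands the JSON literals true, false and null.
parsed = json.loads(line)
print(parsed)

# ast.literal_eval only accepts Python literals, so the same line is rejected.
try:
    ast.literal_eval(line)
    literal_eval_ok = True
except (ValueError, SyntaxError):
    literal_eval_ok = False
print(literal_eval_ok)
```

For the dataset in the question (strings only) both approaches give the same result, but json.loads is the safer default for JSON input.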

How to parse a single line json file containing multiple objects

I need to read some JSON data for processing. I have a single-line file that contains multiple JSON objects. How can I parse this?
I want the output to be a file with a single line per object.
I have tried a brute-force method that calls json.loads repeatedly to check whether the JSON is valid, but I'm getting different results every time I run the program:
import json
with open('sample.json') as inp:
    s = inp.read()
jsons = []
start, end = s.find('{'), s.find('}')
while True:
    try:
        jsons.append(json.loads(s[start:end + 1]))
        print(jsons)
    except ValueError:
        end = end + 1 + s[end + 1:].find('}')
    else:
        s = s[end + 1:]
        if not s:
            break
        start, end = s.find('{'), s.find('}')
for x in jsons:
    writeToFilee(x)
The json format can be seen here
https://pastebin.com/DgbyjAG9
Why not just use the pos attribute of the JSONDecodeError to tell you where to delimit things?
Something like:
import json
def json_load_all(buf):
    while True:
        try:
            yield json.loads(buf)
        except json.JSONDecodeError as err:
            yield json.loads(buf[:err.pos])
            buf = buf[err.pos:]
        else:
            break
This works with your demo data as:
with open('data.json') as fd:
    arr = list(json_load_all(fd.read()))
which gives me exactly two elements, but I presume you have more?
To complete this using the standard library, writing out would look something like:
with open('data.json') as inp, open('out.json', 'w') as out:
    for obj in json_load_all(inp.read()):
        json.dump(obj, out)
        print(file=out)
Otherwise, the jsonlines package is good for dealing with this data format.
The code below worked for me:
import json
with open(input_file_path) as f_in:
    file_data = f_in.read()
    file_data = file_data.replace("}{", "},{")
    file_data = "[" + file_data + "]"
data = json.loads(file_data)
Following @Chris A's comment, I've prepared this snippet, which should work just fine:
import json
import re
with open('my_jsons.file') as file:
    json_string = file.read()
json_objects = re.sub(r'}\s*{', '}|!|{', json_string).split('|!|')
# replace |!| with whatever suits you best
for json_object in json_objects:
    print(json.loads(json_object))
This example, however, will become worthless as soon as the string '}{' appears in some value inside your JSON, so I strongly recommend using @Sam Mason's solution.
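To make that caveat concrete, here is a small sketch with a made-up input whose first value contains "}{". The plain string replacement still yields parseable JSON, but the value is silently altered, whereas decoding one object at a time with json.JSONDecoder.raw_decode keeps it intact. The loop assumes there is no whitespace between the objects, as in a single-line file.

```python
import json

raw = '{"msg": "a}{b"}{"msg": "ok"}'

# The replacement trick also rewrites the "}{" inside the first string value.
naive = json.loads("[" + raw.replace("}{", "},{") + "]")
print(naive[0]["msg"])

# raw_decode parses one value starting at pos and returns where it stopped.
decoder = json.JSONDecoder()
objs, pos = [], 0
while pos < len(raw):
    obj, pos = decoder.raw_decode(raw, pos)
    objs.append(obj)
print(objs[0]["msg"])
```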

How to extract multiple JSON objects from one file?

I am very new to JSON files. If I have a JSON file with multiple JSON objects such as the following:
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
"Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
"Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
"Code":[{"event1":"B","result":"0"},…]}
…
I want to extract all "Timestamp" and "Usefulness" values into a data frame:
   Timestamp Usefulness
0   20140101        Yes
1   20140102         No
2   20140103         No
…
Does anyone know a general way to deal with such problems?
Update: I wrote a solution that doesn't require reading the entire file in one go. It's too big for a stackoverflow answer, but can be found here jsonstream.
You can use json.JSONDecoder.raw_decode to decode arbitrarily big strings of "stacked" JSON (so long as they can fit in memory). raw_decode stops once it has a valid object and returns the position just past the end of the parsed object. It's not documented, but you can pass this position back to raw_decode and it will start parsing again from that position. Unfortunately, raw_decode doesn't accept strings that have leading whitespace, so we need to search for the first non-whitespace character before each call.
from json import JSONDecoder, JSONDecodeError
import re
NOT_WHITESPACE = re.compile(r'\S')
def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()
        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj
s = """
{"a": 1}
[
1
,
2
]
"""
for obj in decode_stacked(s):
    print(obj)
prints:
{'a': 1}
[1, 2]
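Applied to the question's data, the same raw_decode idea can collect just the two wanted fields. This is a sketch on a trimmed, made-up version of the sample (the elided "Code" entries are left out), with whitespace skipped by a simple character loop instead of a regex.

```python
import json

document = """
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
 "Code":[{"event1":"A","result":"1"}]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
 "Code":[{"event1":"B","result":"1"}]}
"""

decoder = json.JSONDecoder()
rows = []
pos = 0
while True:
    # Skip whitespace between objects; raw_decode must start on a value.
    while pos < len(document) and document[pos].isspace():
        pos += 1
    if pos == len(document):
        break
    obj, pos = decoder.raw_decode(document, pos)
    rows.append({"Timestamp": obj["Timestamp"], "Usefulness": obj["Usefulness"]})

print(rows)
```

If pandas is available, passing this list of dicts to pandas.DataFrame(rows) should give the two-column frame described in the question.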
Use a json array, in the format:
[
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
"Code":[{"event1":"A","result":"1"},…]},
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
"Code":[{"event1":"B","result":"1"},…]},
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
"Code":[{"event1":"B","result":"0"},…]},
...
]
Then import it into your Python code:
import json
with open('file.json') as json_file:
    data = json.load(json_file)
Now the content of data is a list of dictionaries, one for each of the elements.
You can access it easily, e.g.:
data[0]["ID"]
So, as was mentioned in a couple of comments, containing the data in an array is simpler, but that solution does not scale well in terms of efficiency as the data set grows. You really should only use a list when you need random access to items in the array; otherwise, generators are the way to go. Below I have prototyped a reader function which reads each JSON object individually and returns a generator.
The basic idea is to have the reader split on the newline character "\n" (or "\r\n" on Windows). Python does this for you when you iterate over a file object, or when you call file.readline().
import json
def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)
However, this method only really works when each object is written on its own line, separated from the next by a newline character. Below I wrote an example of a writer that takes an array of JSON objects and saves each one on a new line.
def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")
You could also do the same operation with file.writelines() and a list comprehension:
...
json_strs = [json.dumps(j) + "\n" for j in json_objects]
f.writelines(json_strs)
...
And if you wanted to append the data instead of writing a new file just change open(file, "w") to open(file, "a").
In the end I find this helps a great deal not only with readability when I try and open json files in a text editor but also in terms of using memory more efficiently.
On that note if you change your mind at some point and you want a list out of the reader, Python allows you to put a generator function inside of a list and populate the list automatically. In other words, just write
lst = list(json_reader(file))
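As a quick check of the writer and reader working together, here is a self-contained round trip through a temporary file. The two functions are restated so the snippet runs on its own; the sample records and the file name records.jsonl are made up for illustration.

```python
import json
import os
import tempfile

def json_writer(path, json_objects):
    # One JSON document per line.
    with open(path, "w") as f:
        for obj in json_objects:
            f.write(json.dumps(obj) + "\n")

def json_reader(path):
    # Lazily yield one parsed object per line.
    with open(path) as f:
        for line in f:
            yield json.loads(line)

records = [{"ID": "12345", "Usefulness": "Yes"}, {"ID": "1A35B", "Usefulness": "No"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.jsonl")
    json_writer(path, records)
    # Filter while streaming: only one record is held in memory at a time.
    useful_ids = [r["ID"] for r in json_reader(path) if r["Usefulness"] == "Yes"]
    round_trip = list(json_reader(path))

print(useful_ids)
```

The filtering step is where the generator pays off: the list comprehension consumes records one at a time instead of loading the whole file first.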
Added streaming support based on the answer of @dunes:
import re
from json import JSONDecoder, JSONDecodeError
NOT_WHITESPACE = re.compile(r"[^\s]")
def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
    buf = ""
    ex = None
    while True:
        block = file_obj.read(buf_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            match = NOT_WHITESPACE.search(buf, pos)
            if not match:
                break
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except JSONDecodeError as e:
                ex = e
                break
            else:
                ex = None
                yield obj
        buf = buf[pos:]
    if ex is not None:
        raise ex

open a .json file with multiple dictionaries

I have a problem that I can't solve with python, it is probably very stupid but I didn't manage to find the solution by myself.
I have a .json file where the results of a simulation are stored. The result is stored as a series of dictionaries like
{"F_t_in_max": 709.1800264942982, "F_t_out_max": 3333.1574129603068, "P_elec_max": 0.87088836042046958, "beta_max": 0.38091242406098391, "r0_max": 187.55175182942901, "r1_max": 1354.8636763521174, " speed ": 8}
{"F_t_in_max": 525.61428305710433, "F_t_out_max": 2965.0538075438467, "P_elec_max": 0.80977406754203796, "beta_max": 0.59471606595464666, "r0_max": 241.25371753877008, "r1_max": 688.61786996066826, " speed ": 9}
{"F_t_in_max": 453.71124051199763, "F_t_out_max": 2630.1763649193008, "P_elec_max": 0.64268078173342935, "beta_max": 1.0352896471221695, "r0_max": 249.32706230502498, "r1_max": 709.11415981343885, " speed ": 10}
I would like to open the file and access the values, for example to plot "r0_max" as a function of "speed", but I can't open it unless there is only one dictionary.
I use
with open('./results/rigid_wing_opt.json') as data_file:
    data = json.load(data_file)
but when the file contains more than one dictionary I get the error
ValueError: Extra data: line 5 column 1 - line 6 column 1 (char 217 - 431)
If your input data is exactly as provided, then you should be able to interpret each individual dictionary using json.loads. If each dictionary is on its own line, then this should be sufficient:
import json
with open('filename', 'r') as handle:
    json_data = [json.loads(line) for line in handle]
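Once the lines are parsed, pulling out the two series for the plot is a pair of list comprehensions. A sketch using shortened, made-up copies of the question's lines, so that no file is needed:

```python
import json

lines = [
    '{"r0_max": 187.55175182942901, " speed ": 8}',
    '{"r0_max": 241.25371753877008, " speed ": 9}',
    '{"r0_max": 249.32706230502498, " speed ": 10}',
]
records = [json.loads(line) for line in lines]

# Note the key is " speed " with surrounding spaces, exactly as in the file.
speeds = [rec[" speed "] for rec in records]
r0_max = [rec["r0_max"] for rec in records]
print(speeds)
```

These two lists can then be handed to a plotting call, for example matplotlib's plt.plot(speeds, r0_max), assuming matplotlib is what you use for plotting.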
I would recommend reading the file line by line and converting each line independently to a dictionary.
You can place each line into a list with the following code:
import ast
# Read all lines into a list
with open(fname) as f:
    content = f.readlines()
# Convert each list item to a dict
content = [ ast.literal_eval( line ) for line in content ]
Or an even shorter version performing the list comprehension on the same line:
import ast
# Read the lines and convert each one to a dict in a single step
with open(fname) as f:
    content = [ ast.literal_eval( l ) for l in f.readlines() ]
{...} {...} is not proper JSON. It is two JSON objects separated by whitespace. Unless you can change the format of the input file to correct this, I'd suggest you try something a little different. If the data is as simple as in your example, then you could do something like this:
import json
import re
with open('filename', 'r') as handle:
    text_data = handle.read()
    # \s* also covers objects separated by no whitespace or by several characters
    text_data = '[' + re.sub(r'\}\s*\{', '},{', text_data) + ']'
    json_data = json.loads(text_data)
This should work even if your dictionaries are not on separate lines.
That is not valid JSON. You can't have multiple objects at the top level without wrapping them in a list and inserting commas between them.
