How to parse a JSON object into smaller objects using Python?

I have a very large JSON object that I need to split into smaller objects and write those smaller objects to file.
Sample Data
raw = '[{"id":"1","num":"2182","count":-17}{"id":"111","num":"3182","count":-202}{"id":"222","num":"4182","count":12},{"id":"33333","num":"5182","count":12}]'
Desired Output (In this example, split the data in half)
output_file1.json = [{"id":"1","num":"2182","count":-17},{"id":"111","num":"3182","count":-202}]
output_file2.json = [{"id":"222","num":"4182","count":12},{"id":"33333","num":"5182","count":12}]
Current Code
import pandas as pd
import itertools
import json
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
raw = '[{"id":"1","num":"2182","count":-17}{"id":"111","num":"3182","count":-202}{"id":"222","num":"4182","count":12},{"id":"33333","num":"5182","count":12}]'
#split the data into manageable chunks + write to files
for i, group in enumerate(grouper(raw, 4)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
Current Output of first file "outputbatch_0.json"
["[", "{", "\"", "s"]
I feel like I'm making this much harder than it needs to be.

Assuming raw should be a valid JSON string (I added the missing commas), here is a simple but working solution.
import json
raw = '[{"id":"1","num":"2182","count":-17},{"id":"111","num":"3182","count":-202},{"id":"222","num":"4182","count":12},{"id":"33333","num":"5182","count":12}]'
json_data = json.loads(raw)
def split_in_files(json_data, amount):
    step = len(json_data) // amount
    pos = 0
    for i in range(amount - 1):
        with open('output_file{}.json'.format(i + 1), 'w') as file:
            json.dump(json_data[pos:pos + step], file)
        pos += step
    # last one
    with open('output_file{}.json'.format(amount), 'w') as file:
        json.dump(json_data[pos:], file)
split_in_files(json_data, 2)

If raw is valid JSON, this works as well; the saving part is not fleshed out.
import json
raw = '[{"id":"1","num":"2182","count":-17},{"id":"111","num":"3182","count":-202},{"id":"222","num":"4182","count":12},{"id":"33333","num":"5182","count":12}]'
raw_list = eval(raw)
raw_zipped = list(zip(raw_list[0::2], raw_list[1::2]))
for i, item in enumerate(raw_zipped):
    # write each pair to its own file rather than overwriting 'a.json' every iteration
    with open('output_file{}.json'.format(i + 1), 'w') as f:
        json.dump(list(item), f)
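Since raw is valid JSON at this point, json.loads is a safer drop-in for eval (a one-line sketch; nothing else changes):
raw_list = json.loads(raw)  # parses the string without executing it as Python code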

If you need exactly half of the data you can use slicing:
import json
raw = '[{"id":"1","num":"2182","count":-17},{"id":"111","num":"3182","count":-202},{"id":"222","num":"4182","count":12},{"id":"33333","num":"5182","count":12}]'
json_data = json.loads(raw)
size_of_half = len(json_data) // 2  # integer division so it can be used as a slice index
print(json_data[:size_of_half])
print(json_data[size_of_half:])
The shared code does not handle basic edge cases, such as an odd length. In short, you can do anything here that you can do with a list.
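For an odd length, which the snippet above does not handle, a minimal sketch is to round the midpoint up so the extra element lands in the first half:
size_of_half = (len(json_data) + 1) // 2  # ceiling division
first_half = json_data[:size_of_half]
second_half = json_data[size_of_half:]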

Related

How to parse a single line json file containing multiple objects

I need to read some JSON data for processing. I have a single-line file that contains multiple JSON objects; how can I parse it?
I want the output to be a file with a single line per object.
I have tried a brute-force method that calls json.loads repeatedly to check whether the JSON is valid, but I get different results every time I run the program:
import json

with open('sample.json') as inp:
    s = inp.read()

jsons = []
start, end = s.find('{'), s.find('}')
while True:
    try:
        jsons.append(json.loads(s[start:end + 1]))
        print(jsons)
    except ValueError:
        end = end + 1 + s[end + 1:].find('}')
    else:
        s = s[end + 1:]
        if not s:
            break
        start, end = s.find('{'), s.find('}')

for x in jsons:
    writeToFilee(x)
The json format can be seen here
https://pastebin.com/DgbyjAG9
Why not just use the pos attribute of the JSONDecodeError to tell you where to delimit things?
Something like:
import json
def json_load_all(buf):
    while True:
        try:
            yield json.loads(buf)
        except json.JSONDecodeError as err:
            yield json.loads(buf[:err.pos])
            buf = buf[err.pos:]
        else:
            break
works with your demo data as:
with open('data.json') as fd:
    arr = list(json_load_all(fd.read()))
gives me exactly two elements, but I presume you have more?
to complete this using the standard library, writing out would look something like:
with open('data.json') as inp, open('out.json', 'w') as out:
    for obj in json_load_all(inp.read()):
        json.dump(obj, out)
        print(file=out)
otherwise the jsonlines package is good for dealing with this data format
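A sketch of the jsonlines approach (assuming the package is installed and the file names below are placeholders):
import jsonlines

with jsonlines.open('data.jsonl') as reader:        # one JSON object per line
    objects = list(reader)

with jsonlines.open('out.jsonl', mode='w') as writer:
    writer.write_all(objects)                       # writes each object on its own line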
The code below worked for me:
import json

with open(input_file_path) as f_in:
    file_data = f_in.read()

file_data = file_data.replace("}{", "},{")
file_data = "[" + file_data + "]"
data = json.loads(file_data)
Following #Chris A's comment, I've prepared this snippet which should work just fine:
import json
import re

with open('my_jsons.file') as file:
    json_string = file.read()

json_objects = re.sub(r'}\s*{', '}|!|{', json_string).split('|!|')
# replace |!| with whatever suits you best
for json_object in json_objects:
    print(json.loads(json_object))
This example, however, will become worthless as soon as the '}{' string appears in some value inside your JSON, so I strongly recommend using #Sam Mason's solution

How to extract multiple JSON objects from one file?

I am very new to Json files. If I have a json file with multiple json objects such as following:
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
"Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
"Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
"Code":[{"event1":"B","result":"0"},…]}
…
I want to extract all "Timestamp" and "Usefulness" values into a data frame:
Timestamp Usefulness
0 20140101 Yes
1 20140102 No
2 20140103 No
…
Does anyone know a general way to deal with such problems?
Update: I wrote a solution that doesn't require reading the entire file in one go. It's too big for a stackoverflow answer, but can be found here jsonstream.
You can use json.JSONDecoder.raw_decode to decode arbitrarily big strings of "stacked" JSON (as long as they can fit in memory). raw_decode stops once it has a valid object and returns the position just past where that object ended. It's not documented, but you can pass this position back to raw_decode and it starts parsing again from that position. Unfortunately, raw_decode doesn't accept strings that have leading whitespace, so we need to search for the first non-whitespace part of the document.
from json import JSONDecoder, JSONDecodeError
import re
NOT_WHITESPACE = re.compile(r'\S')
def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()
        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj
s = """
{"a": 1}
[
1
,
2
]
"""
for obj in decode_stacked(s):
    print(obj)
prints:
{'a': 1}
[1, 2]
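To get from the parsed objects back to the data frame the question asks for, a minimal sketch (assuming pandas is available and the stacked records live in data.json) could be:
import pandas as pd

with open('data.json') as f:
    records = list(decode_stacked(f.read()))

df = pd.DataFrame(records)[['Timestamp', 'Usefulness']]  # keep only the two requested columns
print(df)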
Use a json array, in the format:
[
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
"Code":[{"event1":"A","result":"1"},…]},
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
"Code":[{"event1":"B","result":"1"},…]},
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
"Code":[{"event1":"B","result":"0"},…]},
...
]
Then import it into your python code
import json
with open('file.json') as json_file:
    data = json.load(json_file)
Now the content of data is an array with dictionaries representing each of the elements.
You can access it easily, e.g.:
data[0]["ID"]
So, as was mentioned in a couple of comments, containing the data in an array is simpler, but the solution does not scale well in terms of efficiency as the data set grows. You should only load everything into a list when you need random access to items; otherwise, generators are the way to go. Below I have prototyped a reader function which reads each JSON object individually and returns a generator.
The basic idea is to signal the reader to split on the newline character "\n" (or "\r\n" for Windows). Python can do this with the file.readline() function.
import json
def json_reader(filename):
    with open(filename) as f:
        for line in f:
            yield json.loads(line)
However, this method only really works when the file is written as you have it -- with each object separated by a newline character. Below I wrote an example of a writer that separates an array of json objects and saves each one on a new line.
def json_writer(file, json_objects):
    with open(file, "w") as f:
        for jsonobj in json_objects:
            jsonstr = json.dumps(jsonobj)
            f.write(jsonstr + "\n")
You could also do the same operation with file.writelines() and a list comprehension:
...
json_strs = [json.dumps(j) + "\n" for j in json_objects]
f.writelines(json_strs)
...
And if you wanted to append the data instead of writing a new file just change open(file, "w") to open(file, "a").
In the end I find this helps a great deal not only with readability when I try and open json files in a text editor but also in terms of using memory more efficiently.
On that note, if you change your mind at some point and you want a list out of the reader, Python allows you to pass the generator to list() and it will populate the list automatically. In other words, just write
lst = list(json_reader(file))
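Put together, a round trip with the two helpers above might look like this (the file name is just an example):
objects = [{"ID": "12345", "Usefulness": "Yes"}, {"ID": "1A35B", "Usefulness": "No"}]

json_writer('records.jsonl', objects)       # one object per line
for obj in json_reader('records.jsonl'):    # lazily stream them back
    print(obj["ID"], obj["Usefulness"])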
Added streaming support based on the answer of #dunes:
import re
from json import JSONDecoder, JSONDecodeError
NOT_WHITESPACE = re.compile(r"[^\s]")
def stream_json(file_obj, buf_size=1024, decoder=JSONDecoder()):
    buf = ""
    ex = None
    while True:
        block = file_obj.read(buf_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            match = NOT_WHITESPACE.search(buf, pos)
            if not match:
                break
            pos = match.start()
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except JSONDecodeError as e:
                ex = e
                break
            else:
                ex = None
                yield obj
        buf = buf[pos:]
    if ex is not None:
        raise ex
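A usage sketch for the streaming reader above (the file name is assumed):
with open('large_stacked.json') as f:
    for obj in stream_json(f):
        print(obj)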

writing data into file with binary packed format in python

I am reading some values from a file and want to write the modified values back to a file. My file is in .ktx format [a binary packed format].
I am using struct.pack() but seems that something is going wrong with that:
bytes = file.read(4)
bytesAsInt = struct.unpack("l",bytes)
number=1+(bytesAsInt[0])
number=hex(number)
no=struct.pack("1",number)
outfile.write(no)
I want to write in both ways little-endian and big-endian.
no_little = struct.pack("<l", bytesAsInt[0])  # '<' is little-endian
no_big = struct.pack(">l", bytesAsInt[0])     # '>' is big-endian; with no prefix the native byte order is used
again you can check the docs and see the format characters you need
https://docs.python.org/3/library/struct.html
>>> struct.unpack("l","\x05\x04\x03\03")
(50529285,)
>>> struct.pack("l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack("<l",50529285)
'\x05\x04\x03\x03'
>>> struct.pack(">l",50529285)
'\x03\x03\x04\x05'
also note that it is a lowercase L, not a one (as also covered in the docs)
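As a minimal Python 3 sketch of the read/increment/write the question is attempting (assuming the value is a 4-byte signed integer and that input.ktx / output.ktx are placeholder file names):
import struct

with open('input.ktx', 'rb') as infile, open('output.ktx', 'wb') as outfile:
    raw_bytes = infile.read(4)
    (value,) = struct.unpack('<i', raw_bytes)    # '<i' = little-endian 4-byte signed int
    outfile.write(struct.pack('<i', value + 1))  # swap '<' for '>' to write big-endian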
I haven't tested this but the following function should solve your problem. At the moment it reads the file contents completely, creates a buffer and then writes out the updated contents. You could also modify the file buffer directly using unpack_from and pack_into but it might be slower (again, not tested). I'm using the struct.Struct class since you seem to want to unpack the same number many times.
import struct
from StringIO import StringIO

def modify_values(in_file, out_file, increment=1, num_code="i", endian="<"):
    with open(in_file, "rb") as file_h:
        content = file_h.read()
    num = struct.Struct(endian + num_code)
    buf = StringIO()
    try:
        while len(content) >= num.size:
            value = num.unpack(content[:num.size])[0]
            value += increment
            buf.write(num.pack(value))
            content = content[num.size:]
    except Exception as err:
        # handle the error as appropriate; re-raising keeps the failure visible
        raise
    else:
        buf.seek(0)
        with open(out_file, "wb") as file_h:
            file_h.write(buf.read())
An alternative is to use the array module, which makes it quite easy. I don't know how to implement endianness with an array.
from array import array

def modify_values(filename, increment=1, num_code="i"):
    with open(filename, "rb") as file_h:
        arr = array(num_code, file_h.read())
    for i in range(len(arr)):
        arr[i] += increment
    with open(filename, "wb") as file_h:
        arr.tofile(file_h)
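Regarding the endianness question above, one option is array.byteswap(); a sketch, assuming the file's byte order differs from the machine's native order:
from array import array

def modify_values_swapped(filename, increment=1, num_code="i"):
    with open(filename, "rb") as file_h:
        arr = array(num_code, file_h.read())
    arr.byteswap()                    # foreign byte order -> native
    for i in range(len(arr)):
        arr[i] += increment
    arr.byteswap()                    # native -> foreign byte order before writing
    with open(filename, "wb") as file_h:
        arr.tofile(file_h)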

How to read a simple binary file

I have a binary file that consists of consecutive binary subsequences of fixed and equal length. Each subsequence can be unpacked into the same number of values. I know the length of each subsequence and the binary format of the values.
How can I work through the binary file, chopping out the subsequences, unpacking their content, and writing them out as CSV as I go?
I know how to write out as CSV. My problem is the reading-from-file and unpacking part. This is my non-working code:
import csv
import sys
import struct
writer = csv.writer(sys.stdout, delimiter=',', quoting=csv.QUOTE_NONE,escapechar='\\')
? rows = sys.stdin. ?
? header = id, time ....
? write the header with csv
i = 0
for row in rows:
    unpacked_row = unpack('QqqqddiBIBcsbshlshhlQB', row)
    writer.writerow(unpacked_row)
    i += 1
A possible solution, using "Reading binary file in Python and looping over each byte" and Ignacio's answer below:
first calculate the chunk size with struct.calcsize().
def bytes_from_file(filename, chunksize=8192):
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                yield chunk
            else:
                break

# example:
for chunk in bytes_from_file('filename'):
    # row = unpack(chunk)
    # write out row as csv
    pass
You need to calculate the size of the structure (Hint: struct.calcsize()) and read some multiple of that from the file at a time. You cannot directly iterate over the input as you can with a text file, since there is no delimiter as such.
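Concretely, the hint boils down to something like this (using the format string from the question):
import struct

record_size = struct.calcsize('QqqqddiBIBcsbshlshhlQB')  # number of bytes in one packed record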
You could use struct.Struct to unpack values from a file:
#!/usr/bin/env python
import csv
import sys
from struct import Struct
record = Struct('QqqqddiBIBcsbshlshhlQB')
with open('input_filename', 'rb') as file:
    writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE, escapechar='\\')
    while True:
        buf = file.read(record.size)
        if len(buf) != record.size:
            break
        writer.writerow(record.unpack_from(buf))
You could also write the while-loop as:
from functools import partial
for buf in iter(partial(file.read, record.size), b''):
    writer.writerow(record.unpack_from(buf))

How to save a dictionary to a file?

I have a problem with changing a dict value and saving the dict to a text file (the format must stay the same); I only want to change the member_phone field.
My text file is the following format:
memberID:member_name:member_email:member_phone
and I split the text file with:
mdict = {}
for line in file:
    x = line.split(':')
    a = x[0]
    b = x[1]
    c = x[2]
    d = x[3]
    e = b + ':' + c + ':' + d
    mdict[a] = e
When I try to change the member_phone stored in d, the value changes but is not kept with the corresponding key:
def change(mdict, b, c, d, e):
    a = input('ID')
    if a in mdict:
        d = str(input('phone'))
        mdict[a] = b + ':' + c + ':' + d
    else:
        print('not')
And how do I save the dict to a text file with the same format?
Python has the pickle module just for this kind of thing.
These functions are all that you need for saving and loading almost any object:
import pickle
with open('saved_dictionary.pkl', 'wb') as f:
    pickle.dump(dictionary, f)

with open('saved_dictionary.pkl', 'rb') as f:
    loaded_dict = pickle.load(f)
For saving collections of Python objects there is also the shelve module.
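A minimal shelve sketch (the file name is just an example; a shelf behaves like a persistent dict):
import shelve

with shelve.open('members_shelf') as db:
    db['12345'] = 'John:john@example.com:555-0100'  # store a value under a key
    print(db['12345'])                              # read it back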
Pickle is probably the best option, but in case anyone wonders how to save and load a dictionary to a file using NumPy:
import numpy as np
# Save
dictionary = {'hello':'world'}
np.save('my_file.npy', dictionary)
# Load
read_dictionary = np.load('my_file.npy', allow_pickle=True).item()
print(read_dictionary['hello']) # displays "world"
FYI: NPY file viewer
We can also use the json module in the case when dictionaries or some other data can be easily mapped to JSON format.
import json
# Serialize data into file:
json.dump( data, open( "file_name.json", 'w' ) )
# Read data from file:
data = json.load( open( "file_name.json" ) )
This solution brings many benefits, e.g. it works for Python 2.x and Python 3.x in an unchanged form, and in addition, data saved in JSON format can be easily transferred between many different platforms or programs. This data is also human-readable.
Save and load dict to file:
def save_dict_to_file(dic):
    f = open('dict.txt', 'w')
    f.write(str(dic))
    f.close()

def load_dict_from_file():
    f = open('dict.txt', 'r')
    data = f.read()
    f.close()
    return eval(data)
As Pickle has some security concerns and is slow (source), I would go for JSON, as it is fast, built-in, human-readable, and interchangeable:
import json
data = {'another_dict': {'a': 0, 'b': 1}, 'a_list': [0, 1, 2, 3]}
# e.g. file = './data.json'
with open(file, 'w') as f:
    json.dump(data, f)
Reading is similarly easy:
with open(file, 'r') as f:
    data = json.load(f)
This is similar to this answer, but implements the file handling correctly.
If the performance improvement is still not enough, I highly recommend orjson, a fast, correct JSON library for Python built on Rust.
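A sketch with orjson (assuming it is installed; note that it works with bytes, so the file is opened in binary mode):
import orjson

data = {'another_dict': {'a': 0, 'b': 1}, 'a_list': [0, 1, 2, 3]}

with open('data.json', 'wb') as f:
    f.write(orjson.dumps(data))    # orjson.dumps returns bytes

with open('data.json', 'rb') as f:
    data = orjson.loads(f.read())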
I'm not sure what your first question is, but if you want to save a dictionary to file you should use the json library. Look up the documentation of the loads and dumps functions.
I would suggest saving your data using the JSON format instead of pickle format as JSON's files are human-readable which makes your debugging easier since your data is small. JSON files are also used by other programs to read and write data. You can read more about it here
The json module is part of the Python standard library, so there is nothing extra to install.
# To save the dictionary into a file:
json.dump( data, open( "myfile.json", 'w' ) )
This creates a json file with the name myfile.
# To read data from file:
data = json.load( open( "myfile.json" ) )
This reads and stores the myfile.json data in a data object.
For a dictionary of strings such as the one you're dealing with, it could be done using only Python's built-in text processing capabilities.
(Note this wouldn't work if the values are something else.)
with open('members.txt') as file:
    mdict = {}
    for line in file:
        a, b, c, d = line.strip().split(':')
        mdict[a] = b + ':' + c + ':' + d

a = input('ID: ')
if a not in mdict:
    print('ID {} not found'.format(a))
else:
    b, c, d = mdict[a].split(':')
    d = input('phone: ')
    mdict[a] = b + ':' + c + ':' + d  # update entry

with open('members.txt', 'w') as file:  # rewrite file
    for id, values in mdict.items():
        file.write(':'.join([id] + values.split(':')) + '\n')
I like using the pretty print module to store the dict in a very user-friendly readable form:
import pprint
def store_dict(fname, dic):
    with open(fname, "w") as f:
        f.write(pprint.pformat(dic, indent=4, sort_dicts=False))
        # note some of the defaults are: indent=1, sort_dicts=True
Then, when recovering, read in the text file and eval() it to turn the string back into a dict:
def load_file(fname):
    try:
        with open(fname, "r") as f:
            dic = eval(f.read())
    except:
        dic = {}
    return dic
Unless you really want to keep the dictionary, I think the best solution is to use the csv Python module to read the file. Then, you get rows of data and you can change member_phone or whatever you want; finally, you can use the csv module again to save the file in the same format as you opened it.
Code for reading:
import csv
with open("my_input_file.txt", "r") as f:
reader = csv.reader(f, delimiter=":")
lines = list(reader)
Code for writing:
with open("my_output_file.txt", "w") as f:
writer = csv.writer(f, delimiter=":")
writer.writerows(lines)
Of course, you need to adapt your change() function:
def change(lines):
    a = input('ID')
    for line in lines:
        if line[0] == a:
            d = str(input("phone"))
            line[3] = d
            break
    else:
        print("not")
I haven't timed it but I bet h5 is faster than pickle; the filesize with compression is almost certainly smaller.
import deepdish as dd
dd.io.save(filename, {'dict1': dict1, 'dict2': dict2}, compression=('blosc', 9))
file_name = open("data.json", "w")
json.dump(test_response, file_name)
file_name.close()
or use context manager, which is better:
with open("data.json", "w") as file_name:
json.dump(test_response, file_name)
