I have some JSON files of 500 MB each.
If I use the "trivial" json.load() to load their content all at once, it consumes a lot of memory.
Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
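To make the prefix idea concrete, here is a minimal sketch (json_file_name and the file layout are just assumptions for illustration) that pulls a single field out of every element of a top-level array without building the whole tree in memory:

import ijson

# Hypothetical file: a top-level array of objects, each with a "name" key,
# e.g. [{"name": "a", ...}, {"name": "b", ...}]
with open(json_file_name, 'rb') as f:
    for prefix, the_type, value in ijson.parse(f):
        # 'item.name' is the prefix for the "name" key of each array element
        if prefix == 'item.name':
            print(value)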
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
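For example, a hedged sketch of that driver (process_one.py is a hypothetical script that parses a single file; subprocess.run is used here for brevity instead of raw Popen):

import subprocess
import sys

# Each file is parsed in its own child process, so all the memory it used is
# returned to the OS when that process exits.
for json_file in sys.argv[1:]:
    subprocess.run([sys.executable, "process_one.py", json_file], check=True)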
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks. You can get it here and check out the README for examples. It's fast because it uses the 'C' yajl library.
It can be done by using ijson. The workings of ijson have been very well explained by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, the file content is as below:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below.
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        # 'item' matches each element of the top-level array
        jsonobj = ijson.items(input_file, 'item')
        for j in jsonobj:
            print(j)
Note: 'item' is the prefix ijson uses for the elements of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way.
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # 'item.drug' matches the nested "drug" object in each array element
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those objects whose drug type is tablet.
On your mention of running out of memory I must question if you're actually managing memory. Are you using the "del" keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
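In other words, something like this (list_of_files and process() are placeholders):

import json

for json_file in list_of_files:
    with open(json_file) as f:
        data = json.load(f)
    process(data)
    del data   # drop the reference so the parsed tree can be collected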
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string content parsing to get the chunking of the json file right.
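As a rough sketch of that idea (assuming the file is a single top-level array, and glossing over edge cases a real parser would handle):

def iter_array_elements(path, chunk_size=65536):
    """Rough sketch: yield each top-level element of a JSON array as a raw
    string, by tracking brace/bracket depth and string literals. Not a full
    JSON parser; assumes well-formed input whose outermost value is an array."""
    depth = 0          # nesting depth relative to the outer array
    in_string = False
    escaped = False
    started = False    # True once the opening '[' has been seen
    buf = []
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            for ch in chunk:
                if not started:
                    if ch == '[':
                        started = True
                    continue
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch in '[{':
                    depth += 1
                    buf.append(ch)
                elif ch == ',' and depth == 0:
                    yield ''.join(buf).strip()
                    buf = []
                elif ch == ']' and depth == 0:
                    # end of the outer array
                    last = ''.join(buf).strip()
                    if last:
                        yield last
                    return
                elif ch in ']}':
                    depth -= 1
                    buf.append(ch)
                else:
                    buf.append(ch)

Each yielded string is one array element, small enough to pass to json.loads() on its own; in practice, a streaming parser such as ijson does this job properly.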
I don't know what generates your json content. If possible, I would consider generating a number of manageable files, instead of one huge file.
Another idea is to try load it into a document-store database like MongoDB.
It deals with large blobs of JSON well. Although you might run into the same problem loading the JSON - avoid the problem by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
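If you go that route, a minimal sketch with pymongo might look like this (list_of_files, the database/collection names and the query are placeholders; it assumes each file holds a top-level array of objects and a local mongod is running):

import json
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["big_json"]["records"]

# Load one file at a time so only one blob is ever held in memory.
for json_file in list_of_files:
    with open(json_file) as f:
        docs = json.load(f)        # still loads this one file fully
    collection.insert_many(docs)

# Afterwards, query individual documents instead of keeping everything in RAM.
print(collection.find_one({"some_key": "some_value"}))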
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
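A hedged before/after sketch of what that refactoring looks like (summarize() and the paths are made up for illustration):

import json
import sys

# Instead of module-level variables that keep every parsed file alive for the
# whole run, do the work inside functions so each file's data can be freed as
# soon as the function returns.
def summarize(json_path):
    with open(json_path) as f:
        data = json.load(f)        # exists only inside this call
    return len(data)               # keep only the small result

def main(paths):
    return {p: summarize(p) for p in paths}

if __name__ == "__main__":
    print(main(sys.argv[1:]))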
In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Make a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests - break the file up into smaller chunks, etc.
You can convert the JSON file to a CSV file and then parse that line by line:
import ijson
import csv

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'rb'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                # one complete object parsed: write the headers once, then the row
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() will take a lot of time. Instead, if your file contains one JSON object per line, you can load the data line by line, build a dictionary from each line's key/value pairs, collect those into a final dictionary, and convert that to a pandas DataFrame for further analysis.
import json
import pandas as pd

def get_data():
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data_dict = {}
for i, line in enumerate(get_data()):
    each = {}
    # k and v are the key and value pair of each record
    for k, v in json.loads(line).items():
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

# data_dict gives you the data in DataFrame (table format) but it will be in
# transposed form, so finally transpose the DataFrame:
Data = pd.DataFrame(data_dict)
Data_1 = Data.T
Related
When working with json.dump() I noticed that it appears to be rewriting the entire document. Is this correct, and is there another way to append to the dictionary like .append() does with lists?
When I write the function like this and change the key value (name), it would appear that the item is being appended.
filename = "infohere.json"
name = "Bob"
numbers = 20
#Write to JSON
def writejson(name = name, numbers = numbers):
with open(filename, "r") as info:
xdict = json.load(info)
xdict[name] = numbers
with open(filename, "w") as info:
json.dump(xdict, info)
When you write it out like this however, you can see that the code clearly writes over the entire dictionary/json file.
filename = "infohere.json"
dict = {"Bob": 23, "Mark": 50}
dict2 = {"Ricky": 40}

#Write to JSON
def writejson2(dict):
    with open(filename, "w") as info:
        json.dump(dict, info)

writejson2(dict)
writejson2(dict2)
In the second example only the last data input ever shows up, leading me to believe that this is rewriting the entire document. If it does write the whole document during each json.dump, does this cause issues with larger JSON files, and if so, is there another method like .append() for dealing with JSON?
Thanks in advance.
Neither.
json.dump doesn't decide whether to delete prior content when it writes to a file. That decision happens when you run open(filehere, "w"); that is what deletes old content.
But: Normal JSON isn't amenable to appends.
A single JSON document is one object. There are variants on the format that allow multiple documents in one file, the most common of which is JSONL (which has one JSON document per line). Unless you're using such a format, trying to append JSON to a non-empty file usually won't result in something that can be successfully parsed.
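For completeness, a small sketch of the JSONL-style approach (the .jsonl filename is just a convention, not required):

import json

def append_record(filename, record):
    # "a" mode appends, so nothing already in the file is rewritten
    with open(filename, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_records(filename):
    # one JSON document per line
    with open(filename) as f:
        return [json.loads(line) for line in f if line.strip()]

append_record("infohere.jsonl", {"Bob": 23})
append_record("infohere.jsonl", {"Ricky": 40})
print(load_records("infohere.jsonl"))   # [{'Bob': 23}, {'Ricky': 40}]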
I've found a couple of others asking for help with this, but not specifically what I'm trying to do. I have a dictionary full of various formats (int, str, bool, etc) and I'm trying to save it so I can load it at a later time. Here is a basic version of the code without all the extra trappings that are irrelevant for this.
petStats = { 'name':"", 'int':1, 'bool':False }

def petSave(pet):
    with open(pet['name']+".txt", "w+") as file:
        for k,v in pet.items():
            file.write(str(k) + ':' + str(v) + "\n")

def digimonLoad(petName):
    dStat = {}
    with open(petName+".txt", "r") as file:
        for line in file:
            (key, val) = line.split(":")
            dStat[str(key)] = val
    print(petName,"found. Loading",petName+".")
    return dStat
In short I'm just brute forcing it by saving a text file with a Key:Value on each line, then splitting them all back up on load. Unfortunately this turns all of my int and bool values into strings. Is there a file format I could use to save a dictionary (I don't need to be able to read it, but the convenience would be nice) that I could easily load back in?
This works for a basic dictionary but if I start adding things like arrays this is going to get out of hand as it is.
Use the json module.
import json

def save_pet(pet):
    filename = <Whatever filename you want>
    with open(filename, 'w') as f:
        f.write(json.dumps(pet))

def load_pet(filename):
    with open(filename) as f:
        pet = json.loads(f.read())
    return pet
Use pickle. This is part of the standard library, so you can just import it.
import pickle

pet_stats = {'name':"", 'int':1, 'bool':False}

def pet_save(pet):
    with open(pet['name'] + '.pickle', 'wb') as f:
        pickle.dump(pet, f, pickle.HIGHEST_PROTOCOL)

def digimon_load(pet_name):
    with open(pet_name + '.pickle', 'rb') as f:
        return pickle.load(f)
Pickle works on more data types than JSON, and automatically loads them as the right Python type. (There are ways to save more types with JSON, but it takes more work.) JSON (or XML) is better if you need the output to be human-readable, or need to share it with non-Python programs, but neither appears to be necessary for your use case. Pickle will be easiest.
If you need to see what's in the file, just load it using Python or
python -m pickle foo.pickle
instead of a text editor. (Only do this to pickle files from sources you trust, pickle is not at all secure against hacking.)
Q: Is there a file format I could use to save a dictionary to load back in?
A: Yes, there are many. XML and JSON come immediately to mind.
For example:
data.txt
{
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
Here's an example reading the file into a dictionary:
import json
with open('data.txt','r') as json_file:
data = json.load(json_file)
... and an example writing the dictionary to JSON:
import json
with open('data.txt','w') as fp:
fp.write(json.dumps(data))
If you prefer XML, there are many libraries, including xmltodict:
import xmltodict
with open('path/to/file.xml') as fd:
doc = xmltodict.parse(fd.read())
There are two useful words that you may not know about yet: serialization and pickle.
Serialization refers to the process of converting a data structure (like your dictionary) to a stream of bytes that can be written to storage, and later retrieved from storage to recreate that data structure. This is a common task and your intuition is correct: trying to do this all by yourself will quickly get out of hand.
Pickle is the standard python module for implementing serialization. It’s easy to use, mature and works with a large set of Python data types. You can read more about pickle here : https://docs.python.org/3/library/pickle.html
What is the best way to store a dictionary of strings in a file (as they are big) and load it partially in Python? A dictionary of strings here means the keys are strings and the values are lists of strings.
The dictionary should be stored in an appendable form so I can check whether a key is already present and update only when it isn't; the keys are then used for post-processing.
Usually a dictionary is stored in JSON.
I'll leave here a link:
Convert Python dictionary to JSON array
You could simply write the dictionary to a text file, and then create a new dictionary that only pulls certain keys and values from that text file.
But you're probably best off exploring the json module.
Here's a straightforward way to write a dict called "sample" to a file with the json module:
import json

with open('result.json', 'w') as fp:
    json.dump(sample, fp)
On the loading side, we'd need to know more about how you want to choose which keys to load from the JSON file.
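If the file is large and you only want certain top-level keys, one hedged option is ijson's kvitems (the key names here are made up; this assumes the top level of result.json is a JSON object):

import ijson

wanted = {"alpha", "beta"}        # hypothetical keys you care about
partial = {}
with open('result.json', 'rb') as f:
    # kvitems with an empty prefix streams (key, value) pairs of the top-level object
    for key, value in ijson.kvitems(f, ''):
        if key in wanted:
            partial[key] = value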
The above answers are great, but I hate using JSON, and I have had issues with pickle corrupting my data before, so what I do is use numpy's save and load.
To save: np.save(filename, my_dict)
To load: my_dict = np.load(filename, allow_pickle=True).item()
Really simple and it works well. As far as loading partially goes, you could always split the dictionary into multiple smaller dictionaries and save them as individual files; maybe not a very concrete solution, but it could work.
To split the dictionary you could do something like this:
temp_dict = {}
for i, k in enumerate(my_dict.keys()):
    if i % 1000 == 0 and i > 0:
        # save each batch of 1000 entries to its own file
        np.save("records-" + str(i - 1000) + "-" + str(i) + ".npy", temp_dict)
        temp_dict = {}
    temp_dict[k] = my_dict[k]
# save whatever is left over in the final partial batch
np.save("records-last.npy", temp_dict)
Then for loading, just do something like:
import glob
import numpy as np

my_dict = {}
all_files = glob.glob("records-*.npy")
for f in all_files:
    part = np.load(f, allow_pickle=True).item()
    my_dict.update(part)
If this is for some sort of database-type use then save yourself the headache and use TinyDB. It uses the JSON format when saving to disk and will provide the "partial" loading that you're looking for.
I only recommend TinyDB as this seems to be the closest to what you're looking to achieve; maybe try googling for other databases if this isn't to your fancy - there are TONS of them out there!
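A quick hedged sketch of what that looks like with TinyDB (the record shape and file name are just examples):

from tinydb import TinyDB, Query   # pip install tinydb

db = TinyDB("db.json")             # persisted as JSON on disk
db.insert({"keyword": "foo", "values": ["a", "b"]})
db.insert({"keyword": "bar", "values": ["c"]})

# Pull back only the records you ask for instead of the whole blob.
Record = Query()
print(db.search(Record.keyword == "foo"))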
I am trying to retrieve the names of the people from my file. The file size is 201 GB.
import json

with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])
Whenever I try to run this program on windows, I encounter a Memory error, which says insufficient memory.
Is there a reliable way to parse a single key-value pair without loading the whole file? I have in mind reading the file in chunks, but I don't know how to start.
Here is a sample: test.json
Every line is separated by a newline. Hope this helps.
You may want to give ijson a try: https://pypi.python.org/pypi/ijson
Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.
I want to create a dictionary with values from a file.
The problem is that it would have to be read line by line to be added to the dictionary because I don't think I have enough memory to load in all the information to be appended to the dictionary.
The key can be default but the value will be one selected from each line in the file. The file is not CSV, but I always split the lines so I am able to select a value from it.
import sys

def prod_check(dirname):
    dict1 = {}
    k = 0
    with open('select_sha_sub_hashes.out') as inf:
        for line in inf:
            pline = line.split('|')
            value = pline[3]
            dict1[line] = dict1[k]
            k += 1
    print dict1

if __name__ == "__main__":
    dirname = sys.argv[1]
    prod_check(dirname)
This is the code I am working with, and the variable I have set as value is the index into the line from the file which I am pulling data from. I seem to be running into a problem when I try to call the dictionary to print the values, but I think it may be a problem in my syntax or maybe an assignment I made. I want the values to be added to the keys, but the keys to remain as regular numbers like 0-100.
If you don't have enough memory to store the entire dictionary in RAM at once, try anydbm, bsddb and/or gdbm. These are dictionary-like objects that keep key-value pairs on disk in a single-table, keystring-valuestring database.
Optionally, consider:
http://stromberg.dnsalias.org/~strombrg/cachedb.html
...which will allow you to convert between serialized and non-serialized representations fairly transparently.
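To make the dbm-style suggestion concrete, here is a minimal sketch using Python 3's built-in dbm module (the file name and key are placeholders; values must be bytes or str, so serialize anything richer yourself, e.g. with json.dumps):

import dbm

# Keys and values live on disk, not in RAM, so the "dictionary" can be huge.
with dbm.open("hashes.db", "c") as db:   # "c" = create the file if it doesn't exist
    db["some_sha"] = "3"                 # stored immediately on disk
    print(db.get("some_sha"))            # b'3'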
Have a look at something like "Tokyo Cabinet" at http://fallabs.com/tokyocabinet/ which has Python bindings and is fairly efficient. There's also Kyoto Cabinet, but its licensing is a little restrictive.
Also check out this previous S/O post: Reliable and efficient key-value database for Linux?
So it sounds as if the main problem is reading the file line-by-line. To read a file line-by-line you can do this:
with open('data.txt') as inf:
    for line in inf:
        # do the rest of your processing here
        pass
The advantage of using with is that the file is closed for you automagically when you are done or an exception occurs.
--
Note, the original post didn't contain any code, it now seems to have incorporated a copy of this code to help further explain the problem.