Read large json file - update? - python

I'm having a large json file which I'm struggling to read and work with in python. It seems I can for instance run json.loads() but then it crashes after a while.
There are two questions which are basically the same thing:
Reading rather large JSON files
Is there a memory efficient and fast way to load big JSON files?
But these questions are from 2010 and 2012, so I was wondering if there's a newer/better/faster way to do things?
My file is on the format:
import json
f = open('../Data/response.json')
data = json.load(f)
dict_keys(['item', 'version'])
# Path to data : data['item']
Thanks.

Related

REPL.it JSON Files

So recently I've been using REPL as python code source, but whenever I'm offline, any information stored in the JSON File is rolled back after a bit of time. Now I know this is a REPL specific problem after doing some research, but is there any way I can fix this? My code itself is quite a few lines long, so I would rather not want to use a completely different storage method.
To successfully store data in json files in replit.com, it's important to load and dump it the correct way.
An example of storing data in json files:
with open("sample.json", "r") as file:
sample = json.load(file)
sample["item"] = "Value"
with open("sample.json", "w") as file:
json.dump(sample, file)
Let me know if you've already followed these steps.

how can I reliably access a single key-value pair from a JSON file that's too large to load into memory?

I am trying to retrieve the names of the people from my file. The file size is 201GB
import json
with open("D:/dns.json", "r") as fh:
for l in fh:
d = json.loads(l)
print(d["name"])
Whenever I try to run this program on windows, I encounter a Memory error, which says insufficient memory.
Is there a reliable way to parse a single key, value pair without loading the whole file? I have reading the file in chunks in mind, but I don't know how to start.
Here is sample: test.json
Every line is seperated by newline. Hope this helps.
You may want to give ijson a try : https://pypi.python.org/pypi/ijson
Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.

Loading extremely large JSON file without knowing the schema?

I'm trying to load an extremely large JSON file in Python. I've tried:
import json
data = open('file.json').read()
loaded = json.loads(data)
but that gives me a SIGKILL error.
I've tried:
import pandas as pd
df = pd.read_json('file.json')
and I get an out-of-memory error.
I'd like to try to use ijson to stream my data and only pull a subset into it at a time. However, you need to know what the schema of the JSON file is so that you know what events to look for. I don't actually know what the schema of my JSON file is. So, I have two questions:
Is there a way to load or stream a large json file in Python without knowing the schema? Or a way to convert a JSON file into another format (or into a postgresql server, for example)?
Is there a tool for spitting out what the schema of my JSON file is?
UPDATE:
Used head file.json to get an idea of what my JSON file looks like. From there it's a bit easier.
I would deal with smaller pieces of the file. Take a look at Lazy Method for Reading Big File in Python?. You can adapt the proposed answer to parse your JSON object by object.
You can read in chunks, something like this
f=open("file.json")
while True:
data = f.read(1024)
if not data:
break
yield data
Line by line option
data = []
with open('file') as f:
for line in f:
data.append(json.loads(line))
Also look at
https://www.dataquest.io/blog/python-json-tutorial/
Look for more answers with jsonline

Best way to store a large python dictionary in a file

I have a python script (script 1) which generate a large python dictionary. This dictionary has to be read by an another script (script 2).
Could any one suggest me the best way to write the python dictionary generated by script 1 and to be read by script 2.
In past I have used cPickle to write and read such large dictionaries.
Is there a beter way to do this?
shelve will give you access to each item separately, instead of requiring you to serialize and deserialize the entire dictionary each time.
If you want your dictionary to be readable by different types of scripts (i.e. not just Python), JSON is a good option as well.
It's not as fast as shelve, but it's easy to use and quite readable to the human eye.
import json
with open("/tmp/test.json", "w") as out_handle:
json.dump(my_dict, out_handle) # save dictionary
with open("/tmp/test.json", "r") as in_handle:
my_dict = json.load(in_handle) # load dictionary

Python: Converting Entire Directory of JSON to Python Dictionaries to send to MongoDB

I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files that are in JSON to move them into MongoDB. So, my approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then over-write that line that was JSON as a dictionary. Then it'll be in a format to send to MongoDB
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
line = file.readline()
d = json.loads(lines)
return d
for subdir, dirs, files in os.walk(rootdir):
for file in files:
f=open(file, 'r')
lines = f.readlines()
f.close()
f=open(file, 'w')
for line in lines:
newline = convert(line)
f.write(newline)
f.close()
But it isn't writing.
Which... As a rule of thumb, if you're not getting the effect that you're wanting, you're making a mistake somewhere.
Does anyone have any suggestions?
When you decode a json file you don't need to convert line by line as the parser will iterate over the file for you (that is unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary which is a data structure and cannot be directly written back to file without first serializing it into a certain format such as json, yaml or many others (the format mongodb uses is called bson but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
from glob import glob
from pymongo import Connection
db = Connection().test
for filename in glob('~/Tweets/*.txt'):
with open(filename) as fp:
doc = json.load(fp)
db.tweets.save(doc)
a dictionary in python is an object that lives within the program, you can't save the dictionary directly to a file unless you pickle it (pickling is a way to save objects in files so you can retrieve it latter). Now I think a better approach would be to read the lines from the file, load the json which converts that json to a dictionary and save that info into mongodb right away, no need to save that info into a file.

Categories