I'm trying to load an extremely large JSON file in Python. I've tried:
import json
data = open('file.json').read()
loaded = json.loads(data)
but the process gets killed (SIGKILL).
I've tried:
import pandas as pd
df = pd.read_json('file.json')
and I get an out-of-memory error.
I'd like to try ijson to stream the data and only pull a subset into memory at a time. However, you need to know the schema of the JSON file so that you know which events to look for, and I don't actually know the schema of my file. So, I have two questions:
Is there a way to load or stream a large JSON file in Python without knowing the schema? Or a way to convert a JSON file into another format (or load it into a PostgreSQL server, for example)?
Is there a tool for spitting out what the schema of my JSON file is?
UPDATE:
I used head file.json to get an idea of what my JSON file looks like. From there it's a bit easier.
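A rough sketch of exploring the structure further with ijson, without assuming any particular schema (this assumes the ijson package is installed; the prefixes it prints can then be fed to ijson.items):
import ijson
# Stream low-level parse events; each event carries a "prefix" describing
# where in the document it occurs, which effectively reveals the schema.
seen = set()
with open('file.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        if prefix not in seen:
            seen.add(prefix)
            print(prefix, event)
        if len(seen) > 100:  # stop once we've seen enough of the shape
            break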
I would deal with smaller pieces of the file. Take a look at Lazy Method for Reading Big File in Python?. You can adapt the proposed answer to parse your JSON object by object.
You can read it in chunks, something like this:
def read_in_chunks(path, chunk_size=1024):
    # Yield the file in fixed-size chunks instead of loading it all at once
    with open(path) as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield data
# usage: for chunk in read_in_chunks("file.json"): ...
Line-by-line option:
import json
data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
Also look at
https://www.dataquest.io/blog/python-json-tutorial/
and look for more answers about JSON Lines (jsonlines).
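If the file turns out to be one JSON document per line (JSON Lines), the third-party jsonlines package wraps the line-by-line pattern above; a rough sketch, assuming the package is installed:
import jsonlines
# Iterate over the file one parsed object at a time
with jsonlines.open('file.json') as reader:
    for obj in reader:
        print(obj)  # process each record here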
I have a large json file which I'm struggling to read and work with in Python. It seems I can, for instance, run json.loads(), but then it crashes after a while.
There are two questions which are basically the same thing:
Reading rather large JSON files
Is there a memory efficient and fast way to load big JSON files?
But these questions are from 2010 and 2012, so I was wondering if there's a newer/better/faster way to do things?
My file has the following format:
import json
f = open('../Data/response.json')
data = json.load(f)
data.keys()
# dict_keys(['item', 'version'])
# Path to the data: data['item']
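For what it's worth, a rough sketch of streaming the records under data['item'] with ijson, assuming 'item' holds a list (ijson appends '.item' to the prefix for array elements, so the exact prefix depends on the real structure):
import ijson
with open('../Data/response.json', 'rb') as f:
    # Stream each record under the top-level "item" key without loading the whole file
    for record in ijson.items(f, 'item.item'):
        print(record)  # process one record at a time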
Thanks.
I am pretty new at working with Python and coding in general, so I feel as though the answer is something I don't yet understand about how Python works.
I have been using Tweepy to collect streams of data in Python to measure sentiment about different things. That part worked fine. When I ran the program, I had the data written to a txt file and then tried to use the data within that file to look at things such as common words or locations. But I am running into problems when reading the data. I have been searching online and found a number of different ways that people have read this kind of data, but as I am unfamiliar with json files in general, I don't understand why these methods would or wouldn't work.
The main error I seem to be running into is something similar to this:
JSONDecodeError: Expecting value: line 1 column 1 (char 0).
From my understanding, this means that the data is not being read in correctly and can't be parsed as a json file. But I have also experienced the error where it reads like this:
JSONDecodeError: Expecting value: line 4 column 1 (char 0).
I don't understand why the line number changes. I have tried reading the file in as the original txt file and then saving it again as a json file. I got the first error when trying it as a json file, with the second coming from the txt file.
I have read a number of different threads discussing similar problems, but their solutions keep giving me these types of errors. Just as an example, here is what my code looked like for the most recent error:
import json
source = open("../twitterdata24.json")
json_data = json.load(source)
One of my other attempts:
import json
tweets = []
for line in open("fileinfo"):
    tweets.append(json.load(line))
One other point of interest: the data I am working with contains many individual tweets, and from what I have read, I think the problem is that each individual tweet is a separate dictionary. I tried to make the whole data file a list using [], but that just moved the error down a line.
So if there is anything anyone could tell me or point me to that would help me understand what I am supposed to do to read this data, I would really appreciate it.
Thanks
Edit:
Here is a small sample of the data. The whole data file is a little large, so here are the first two tweets in the file.
https://drive.google.com/file/d/1l6uiCzBTYf-SqUpCThQ3WDXmslMcUnPA/view?usp=sharing
Looking at your sample data, I suspect that the problem is that it isn't a valid json document. You effectively have data like:
{"a": "b"}
{"c": "d"}
{"a": "b"} is valid json, and {"c": "d"} is valid json, but {"a": "b"}\n{"c": "d"} is not valid json. This explains why json.load(source) fails.
You're on the right track with your second attempt: by reading through the file line-by-line, you can extract the valid json data objects individually. But your implementation has two problems:
line is a string and you can't call json.load on a string. That's what json.loads is for.
you can't convert an empty line to a json object.
So if you check for empty lines and use loads, you should be able to fill your tweets list without any problems.
import json
tweets = []
with open("sampledata.txt") as source:
    for line in source:
        if line.strip():
            tweets.append(json.loads(line))
print("Successfully loaded {} tweets.".format(len(tweets)))
Result:
Successfully loaded 2 tweets.
I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (the .bz2 file from here, size ~ 18 GB).
However, I cannot open the file, it's just too big for my computer.
Is there a way to look into the file without extracting the full .bz2 file, preferably using Python? I know that there is a PHP dump reader (here), but I can't use it.
I came up with a strategy that allows me to use the json module to access the information without decompressing the whole file at once:
import bz2
import json
with bz2.open(filename, "rt") as bzinput:
    lines = []
    for i, line in enumerate(bzinput):
        if i == 10:
            break
        record = json.loads(line)
        lines.append(record)
In this way lines will be a list of dictionaries that you can easily manipulate, for example to reduce their size by removing keys you don't need.
Note also that the condition i == 10 can obviously be changed to fit your needs. For example, you could parse a few lines at a time, analyze them, and write to a txt file the indices of the lines you actually want from the original file. Then it would be sufficient to read only those lines (using a similar condition on i in the for loop).
You can use the BZ2File interface to manipulate the compressed file, but you cannot simply load the whole thing with the json module: it would take too much memory. You would have to index the file, meaning you read it line by line and save the position and length of each interesting object in a dictionary (hashtable); then you can seek to a given object and load it with the json module.
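A rough sketch of that indexing idea, assuming one entity per line as in the Wikidata dump (the key name "id" and the file name are illustrative; note that seeking in a bz2 stream still decompresses up to the target, so this trades speed for memory):
import bz2
import json

# First pass: remember where each entity starts in the decompressed stream.
index = {}
with bz2.BZ2File("latest-all.json.bz2") as f:
    while True:
        offset = f.tell()
        raw = f.readline()
        if not raw:
            break
        line = raw.decode().strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        index[entity["id"]] = offset  # e.g. "Q42" -> position in the stream

# Later: jump back to a single entity without keeping everything in memory.
with bz2.BZ2File("latest-all.json.bz2") as f:
    f.seek(index["Q42"])
    entity = json.loads(f.readline().decode().strip().rstrip(","))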
You'd have to do line-by-line processing:
import bz2
import json
path = "latest.json.bz2"
with bz2.BZ2File(path) as file:
    for line in file:
        line = line.decode().strip()
        if line in {"[", "]"}:
            continue
        if line.endswith(","):
            line = line[:-1]
        entity = json.loads(line)
        # do your processing here
        print(str(entity)[:50] + "...")
Seeing as WikiData is now 70GB+, you might wish to process it directly from the URL:
import bz2
import json
from urllib.request import urlopen
path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"
with urlopen(path) as stream:
    with bz2.BZ2File(stream) as file:  # decompress the HTTP stream on the fly
        ...
I am trying to retrieve the names of the people from my file. The file size is 201GB
import json
with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])
Whenever I try to run this program on Windows, I encounter a MemoryError saying there is insufficient memory.
Is there a reliable way to parse a single key, value pair without loading the whole file? I have reading the file in chunks in mind, but I don't know how to start.
Here is a sample: test.json
Every line is separated by a newline. Hope this helps.
You may want to give ijson a try: https://pypi.python.org/pypi/ijson
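Since the asker says each line is its own JSON object, a rough sketch of pulling out just the "name" values might look like this (this assumes a recent ijson release, where multiple_values=True accepts a stream of newline-delimited documents; with older versions, plain json.loads per line does the same job):
import ijson

with open("D:/dns.json", "rb") as fh:
    # Yields each top-level document one at a time without loading the whole file
    for record in ijson.items(fh, "", multiple_values=True):
        print(record.get("name"))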
Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.
I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files that contain JSON and move them into MongoDB. My approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then overwrite that line (which was JSON) with the dictionary. Then it'll be in a format to send to MongoDB.
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
    line = file.readline()
    d = json.loads(lines)
    return d
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        f = open(file, 'r')
        lines = f.readlines()
        f.close()
        f = open(file, 'w')
        for line in lines:
            newline = convert(line)
            f.write(newline)
        f.close()
But it isn't writing.
Which... As a rule of thumb, if you're not getting the effect you want, you're making a mistake somewhere.
Does anyone have any suggestions?
When you decode a json file you don't need to convert it line by line, as the parser will iterate over the file for you (that is, unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary, which is an in-memory data structure and cannot be written back to a file without first serializing it into a format such as json, yaml or many others (the format mongodb uses is called bson, but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
import os
from glob import glob
from pymongo import MongoClient  # pymongo's old Connection class has been removed
db = MongoClient().test
for filename in glob(os.path.expanduser('~/Tweets/*.txt')):  # glob doesn't expand ~ by itself
    with open(filename) as fp:
        doc = json.load(fp)
    db.tweets.insert_one(doc)  # save() is deprecated in favour of insert_one()
A dictionary in Python is an object that lives within the program; you can't save a dictionary directly to a file unless you pickle it (pickling is a way to save objects in files so you can retrieve them later). I think a better approach would be to read the lines from the file, load the json (which converts each line into a dictionary), and save that info into mongodb right away; there's no need to write it back to a file. A sketch of that idea is below.
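A minimal sketch of that approach, assuming one JSON tweet per line (the file name and collection name are placeholders):
import json
from pymongo import MongoClient

db = MongoClient().test
with open('twitterdata.txt') as fp:  # placeholder input file
    for line in fp:
        if line.strip():  # skip blank lines
            db.tweets.insert_one(json.loads(line))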