How to parse WIkidata JSON (.bz2) file using Python?

How to parse WIkidata JSON (.bz2) file using Python? - python

I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (from here .bz2 file, size ~ 18 GB).
However, I cannot open the file, it's just too big for my computer.
Is there a way to look into the file without extracting the full .bz2
file. Especially using Python, I know that there is a PHP dump
reader (here ), but I can't use it.

I came up with a strategy that allows to use json module to access information without opening the file:
import bz2
import json
with bz2.open(filename, "rt") as bzinput:
lines = []
for i, line in enumerate(bzinput):
if i == 10: break
tweets = json.loads(line)
lines.append(tweets)
In this way lines will be a list of dictionaries that you can easly manipulate and, for example, reduce their size by removing keys that you don't need.
Note also that (obviously) the condition i==10 can be arbitrarly changed to fit anyone(?) needings. For example, you may parse some line at a time, analyze them and writing on a txt file the indices of the lines you really want from the original file. Than it will be sufficient to read only those lines (using a similar condition in i in the for loop).

you can use BZ2File interface to manipulate the compressed file. But you can NOT use json module to access information for it, it will take too much space. You will have to index the file meaning you have to read the file line by line and save position and length of interesting object in a Dictionary (hashtable) and then you could extract a given object and load it with the json module.

You'd have to do line-by-line processing:
import bz2
import json
path = "latest.json.bz2"
with bz2.BZ2File(path) as file:
for line in file:
line = line.decode().strip()
if line in {"[", "]"}:
continue
if line.endswith(","):
line = line[:-1]
entity = json.loads(line)
# do your processing here
print(str(entity)[:50] + "...")
Seeing as WikiData is now 70GB+, you might wish to process it directly from the URL:
import bz2
import json
from urllib.request import urlopen
path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"
with urlopen(path) as stream:
with bz2.BZ2File(path) as file:
...

Related

When extracting my .json.gz file, some characters are added to it - and the file cannot be stored as a json file

I am trying to unzip some .json.gz files, but gzip adds some characters to it, and hence makes it unreadable for JSON.
What do you think is the problem, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
with gzip.open('filename' , 'rb') as f:
json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used this code:
with gzip.open ('filename', mode='rb') as f:
print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think b' is what makes the file unworkable for the next stage. Do you have any solution to remove the b'? There are millions of this zipped file, and I cannot manually do that.
I uploaded a sample of these files in the following link
just a few json.gz files

The problem isn't with that b prefix you're seeing with print(f.read()), which just means the data is a bytes sequence (i.e. integer ASCII values) not a sequence of UTF-8 characters (i.e. a regular Python string) — json.loads() will accept either. The JSONDecodeError is because the data in the gzipped file isn't in valid JSON format, which is required. The format looks like something known as JSON Lines — which the Python standard library json module doesn't (directly) support.
Dunes' answer to the question #Charles Duffy marked this—at one point—as a duplicate of wouldn't have worked as presented because of this formatting issue. However from the sample file you added a link to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line-by-line.
Here's what I mean:
import json
import gzip
filename = '00_activities.json.gz' # Sample file.
json_content = []
with gzip.open(filename , 'rb') as gzip_file:
for line in gzip_file: # Read one line.
line = line.rstrip()
if line: # Any JSON data on it?
obj = json.loads(line)
json_content.append(obj)
print(json.dumps(json_content, indent=4)) # Pretty-print data parsed.
Note that the output it prints shows what valid JSON might have looked like.

how can I reliably access a single key-value pair from a JSON file that's too large to load into memory?

I am trying to retrieve the names of the people from my file. The file size is 201GB
import json
with open("D:/dns.json", "r") as fh:
for l in fh:
d = json.loads(l)
print(d["name"])
Whenever I try to run this program on windows, I encounter a Memory error, which says insufficient memory.
Is there a reliable way to parse a single key, value pair without loading the whole file? I have reading the file in chunks in mind, but I don't know how to start.
Here is sample: test.json
Every line is seperated by newline. Hope this helps.

You may want to give ijson a try : https://pypi.python.org/pypi/ijson

Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.

Loading extremely large JSON file without knowing the schema?

I'm trying to load an extremely large JSON file in Python. I've tried:
import json
data = open('file.json').read()
loaded = json.loads(data)
but that gives me a SIGKILL error.
I've tried:
import pandas as pd
df = pd.read_json('file.json')
and I get an out-of-memory error.
I'd like to try to use ijson to stream my data and only pull a subset into it at a time. However, you need to know what the schema of the JSON file is so that you know what events to look for. I don't actually know what the schema of my JSON file is. So, I have two questions:
Is there a way to load or stream a large json file in Python without knowing the schema? Or a way to convert a JSON file into another format (or into a postgresql server, for example)?
Is there a tool for spitting out what the schema of my JSON file is?
UPDATE:
Used head file.json to get an idea of what my JSON file looks like. From there it's a bit easier.

I would deal with smaller pieces of the file. Take a look at Lazy Method for Reading Big File in Python?. You can adapt the proposed answer to parse your JSON object by object.

You can read in chunks, something like this
f=open("file.json")
while True:
data = f.read(1024)
if not data:
break
yield data
Line by line option
data = []
with open('file') as f:
for line in f:
data.append(json.loads(line))
Also look at
https://www.dataquest.io/blog/python-json-tutorial/
Look for more answers with jsonline

Create hash table from the contents of a file

How can I open a text file, read the contents of the file and create a hash table from this content? So far I have tried:
import json
json_data = open(/home/azoi/Downloads/yes/1.txt).read()
data = json.loads(json_data)
pprint(data)

I suggest this solution:
import json
with open("/home/azoi/Downloads/yes/1.txt") as f:
data=json.load(f)
pprint(data)
The with statement ensures that your file is automatically closed whatever happens and that your program throws the correct exception if the open fails. The json.load function directoly loads data from an open file handle.
Additionally, I strongly suggest reading and understanding the Python tutorial. It's essential reading and won't take too long.

To open a file you have to use the open statment correctly, something like:
json_data=open('/home/azoi/Downloads/yes/1.txt','r')
where the first string is the path to the file and the second is the mode: r = read, w = write, a = append

Python: Converting Entire Directory of JSON to Python Dictionaries to send to MongoDB

I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files that are in JSON to move them into MongoDB. So, my approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then over-write that line that was JSON as a dictionary. Then it'll be in a format to send to MongoDB
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
line = file.readline()
d = json.loads(lines)
return d
for subdir, dirs, files in os.walk(rootdir):
for file in files:
f=open(file, 'r')
lines = f.readlines()
f.close()
f=open(file, 'w')
for line in lines:
newline = convert(line)
f.write(newline)
f.close()
But it isn't writing.
Which... As a rule of thumb, if you're not getting the effect that you're wanting, you're making a mistake somewhere.
Does anyone have any suggestions?

When you decode a json file you don't need to convert line by line as the parser will iterate over the file for you (that is unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary which is a data structure and cannot be directly written back to file without first serializing it into a certain format such as json, yaml or many others (the format mongodb uses is called bson but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
from glob import glob
from pymongo import Connection
db = Connection().test
for filename in glob('~/Tweets/*.txt'):
with open(filename) as fp:
doc = json.load(fp)
db.tweets.save(doc)

a dictionary in python is an object that lives within the program, you can't save the dictionary directly to a file unless you pickle it (pickling is a way to save objects in files so you can retrieve it latter). Now I think a better approach would be to read the lines from the file, load the json which converts that json to a dictionary and save that info into mongodb right away, no need to save that info into a file.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to parse WIkidata JSON (.bz2) file using Python? - python

Related

When extracting my .json.gz file, some characters are added to it - and the file cannot be stored as a json file

how can I reliably access a single key-value pair from a JSON file that's too large to load into memory?

Loading extremely large JSON file without knowing the schema?

Create hash table from the contents of a file

Python: Converting Entire Directory of JSON to Python Dictionaries to send to MongoDB

Categories

Resources