Python: ijson.parse(in_file) vs json.load(in_file)

I am trying to read a large JSON file (~2 GB) in Python.
The following code works well on small files but fails on large files with a MemoryError on the second line:
in_file = open(sys.argv[1], 'r')
posts = json.load(in_file)
I looked at similar posts and almost everyone suggested using ijson, so I decided to give it a try:
in_file = open(sys.argv[1], 'r')
posts = list(ijson.parse(in_file))
This handled the big file, but ijson.parse didn't return a JSON object the way json.load does, so the rest of my code failed with:
TypeError: tuple indices must be integers or slices, not str
If I print out "posts" when using json.load, the o/p looks like a normal JSON
[{"Id": "23400089", "PostTypeId": "2", "ParentId": "23113726", "CreationDate": ... etc
If I print out "posts" after using ijson.parse, the o/p looks like a hash map
[["", "start_array", null], ["item", "start_map", null],
["item", "map_key", "Id"], ["item.Id", "string ... etc
My question:
I don't want to change the rest of my code, so I am wondering if there is any way to convert the o/p of ijson.parse(in_file) back to a JSON object so that it's exactly the same as if we were using json.load(in_file)?

Maybe this works for you:
import sys
import ijson
in_file = open(sys.argv[1], 'r')
posts = []
# ijson.items(f, 'item') yields each element of the top-level array one at a time
data = ijson.items(in_file, 'item')
for post in data:
    posts.append(post)
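If the assembled posts list is itself too big for memory, a streaming variant can process each post as ijson yields it instead of collecting them all. A minimal sketch (the per-post handling is just a placeholder):
import sys
import ijson
# stream posts one at a time; only the current post is held in memory
with open(sys.argv[1], 'rb') as in_file:
    for post in ijson.items(in_file, 'item'):
        # hypothetical per-post processing
        print(post['Id'])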

Related

Convert json with data from each id in different lines into one line per id with python

I have a json file with the following format:
{
  "responses": [
    {
      "id": "123",
      "cid": "01A",
      "response": {nested lists and dictionaries}
    },
    {
      "id": "456",
      "cid": "54G",
      "response": {nested lists and dictionaries}
    }
  ]
}
And so on.
And I want to convert it into a json file like this:
{"id":"123", "cid":"01A", "response":{nested lists and dictionaries}},
{"id":"456", "cid":"54G", "response":{nested lists and dictionaries}}
or
{responses:[
{"id":"123", "cid":"01A", "response":{nested lists and dictionaries}},
{"id":"456", "cid":"54G", "response":{nested lists and dictionaries}}
]}
I don't care about the surrounding format as long as I have the information for each ID in just one line.
I have to do this while reading it because things like pd.read_json don't read this kind of file.
Thanks!
Maybe just dump it line-wise? But maybe I didn't understand your question right?
import json
input_lines = {"responses": ...}
with open("output.json", "w") as f:
    for line in input_lines["responses"]:
        f.write(json.dumps(line) + "\n")
You can use the built-in json library to print each response on a separate line. The json.dumps() function has an indent option if you want that, but its default is to put everything on one line, which is what you want.
Here's an example that works for the input you showed in your post.
#!/usr/bin/env python3
import json
import sys
with open(sys.argv[1]) as json_file:
    obj = json.load(json_file)
print("{responses:[")
for response in obj['responses']:
    print(json.dumps(response))
print("]}")
Usage (assuming you named the program format_json.py):
$ chmod +x format_json.py
$ format_json.py my_json_input.json > my_json_output.json
Or, if you're not in a command-line environment, you can also hardcode the input and output filenames:
#!/usr/bin/env python3
import json
infile = 'my_json_input.json'
outfile = 'my_json_output.json'
with open(infile) as json_file:
    obj = json.load(json_file)
# print() needs a file object, not a filename string, so open the output file first
with open(outfile, 'w') as out:
    print("{responses:[", file=out)
    for response in obj['responses']:
        print(json.dumps(response), file=out)
    print("]}", file=out)

Reading a text file of dictionaries stored in one line

Question
I have a text file that records metadata of research papers requested with the SemanticScholar API. However, when I wrote the requested data, I forgot to add "\n" after each individual record. This results in something that looks like
{<metadata1>}{<metadata2>}{<metadata3>}...
whereas it should look like this if I had added "\n":
{<metadata1>}
{<metadata2>}
{<metadata3>}
...
Now I would like to read the data. As all the metadata is stored in one line, I need to do some hacks:
First, I split the concatenated dicts on "{".
Then I tried to convert each string back to a dict. Note that I do consider that a line might not be in proper JSON format.
import json
with open("metadata.json", "r") as f:
    for line in f.readline().split("{"):
        print(json.loads("{" + line.replace("\'", "\"")))
However, I still get an error message:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I am wondering what I should do to recover all the metadata I collected.
MWE
Note: to get the metadata.json file I use, run the following code; it should work out of the box.
import json
import urllib.parse
import requests
baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
               "b407a81019650fe8b0acf7e4f8f18451f9c803d5",
               "ff118a6a74d1e522f147a9aaf0df5877fd66e377"]
for paperID in paperIDList:
    response = requests.get(urllib.parse.urljoin(baseURL, paperID))
    metadata = response.json()
    record = dict()
    record["title"] = metadata["title"]
    record["abstract"] = metadata["abstract"]
    record["paperId"] = metadata["paperId"]
    record["year"] = metadata["year"]
    record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
    record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
    with open("metadata.json", "a") as fileObject:
        fileObject.write(json.dumps(record))
The problem is that when you do the split("{"), you get a first item that is empty, corresponding to the opening {. Just ignore the first element and everything works fine (I added an r to your quote replacements so Python treats them as string literals and replaces them properly):
with open("metadata.json", "r") as f:
for line in f.readline().split("{")[1:]:
print(json.loads("{" + line).replace(r"\'", r"\""))
As suggested in the comments, I would actually recommend recreating the file or saving a new version where you replace }{ by }\n{:
with open("metadata.json", "r") as f:
data = f.read()
data_lines = data.replace("}{","}\n{")
with open("metadata_mod.json", "w") as f:
f.write(data_lines)
That way you will have the metadata of a paper per line as you want.
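For completeness, a minimal sketch of reading the repaired file back, one record per line (using the metadata_mod.json produced above):
import json
records = []
with open("metadata_mod.json", "r") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))
print(len(records), "records recovered")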

Read JSON file correctly

I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        for line in data_file.readlines():
            if not line:
                continue
            items = json.loads(line)
            text = items["text"]
            label = items.get("label")
My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line, you can do either of these:
import json
def open_json(path):
    with open(path, 'r') as file:
        return json.load(file)
data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET request the json from GitHub
import json
import requests
url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
response = requests.get(url)
data = response.json()
These will both give the same output. The data variable will be a list of dictionaries that you can iterate over in a for loop to do your further processing.
Your code is reading one line at a time and parsing each line individually as JSON. Unless the creator of the file actually wrote it in that line-delimited format (unlikely, given the .json extension), that won't work, as JSON does not use line breaks to indicate the end of an object.
Load the whole file content as JSON instead, then process the resulting items in the array.
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        data = json.load(data_file)
        for item in data:
            text = item["text"]
label appears to be buried in item["interaction"]

How to remove comment lines from a JSON file in python

I am getting a JSON file with the following format:
// 20170407
// http://info.employeeportal.org
{
  "EmployeeDataList": [
    {
      "EmployeeCode": "200005ABH9",
      "Skill": CT70,
      "Sales": 0.0,
      "LostSales": 1010.4
    }
  ]
}
I need to remove the extra comment lines present in the file.
I tried the following code:
import json
import commentjson
with open('EmployeeDataList.json') as json_data:
    employee_data = json.load(json_data)
    '''employee_data = json.dump(json.load(json_data))'''
    '''employee_data = commentjson.load(json_data)'''
print(employee_data)
I am still not able to remove the comments from the file and bring the JSON into the correct format.
I am not sure where things are going wrong. Any direction in this regard is highly appreciated. Thanks in advance.
You're not using commentjson correctly. It has the same interface as the json module:
import commentjson
with open('EmployeeDataList.json', 'r') as handle:
    employee_data = commentjson.load(handle)
print(employee_data)
Although in this case, your comments are simple enough that you probably don't need to install an extra module to remove them:
import json
with open('EmployeeDataList.json', 'r') as handle:
    fixed_json = ''.join(line for line in handle if not line.startswith('//'))
employee_data = json.loads(fixed_json)
print(employee_data)
Note that the difference between the two code snippets is that json.loads is used instead of json.load, since you're parsing a string instead of a file object.
Try JSON-minify:
JSON-minify minifies blocks of JSON-like content into valid JSON by removing all whitespace and JS-style comments (single-line // and multiline /* .. */).
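A minimal usage sketch, assuming the PyPI package JSON-minify and its json_minify() function (the import path is an assumption; check the package's README):
import json
from json_minify import json_minify  # assumed import for the JSON-minify package
with open('EmployeeDataList.json', 'r') as handle:
    employee_data = json.loads(json_minify(handle.read()))
print(employee_data)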
I usually read the JSON as a normal file, delete the comments and then parse it as a JSON string. It can be done in one line with the following snippet:
with open(path,'r') as f: jsonDict = json.loads('\n'.join(row for row in f if not row.lstrip().startswith("//")))
IMHO it is very convenient because it does not need commentjson or any other non-standard library.
Well, that's not valid JSON, so just open it like you would a text document and skip anything from // to \n.
with open("EmployeeDataList.json", "r") as rf:
with open("output.json", "w") as wf:
for line in rf.readlines():
if line[0:2] == "//"
continue
wf.write(line)
Your file is parsable using HOCON.
pip install pyhocon
>>> from pyhocon import ConfigFactory
>>> conf = ConfigFactory.parse_file('data.txt')
>>> conf
ConfigTree([('EmployeeDataList',
             [ConfigTree([('EmployeeCode', '200005ABH9'),
                          ('Skill', 'CT70'),
                          ('Sales', 0.0),
                          ('LostSales', 1010.4)])])])
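If you then need plain JSON out of the ConfigTree, pyhocon also ships a converter; a small sketch, assuming pyhocon's HOCONConverter.to_json (verify the import against your installed version):
>>> from pyhocon.converter import HOCONConverter
>>> print(HOCONConverter.to_json(conf))  # emits valid JSON with the comments stripped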
If it is the same number of lines every time you can just do:
with open('EmployeeDataList.NOTjson', "r") as fh:
    rawText = fh.read()
# skip the leading comment lines (two in the example) by splitting on the first two newlines
json_data = rawText.split("\n", 2)[2]
This way json_data is now the string of text without the leading comment lines.

How to load json data from serialised file and process them in python 3?

I am trying to create a program in Python 3 (Mac OS X) with tkinter. It takes an incremental id, datetime.now and a third string as variables. For example,
a window opens displaying: id / date time / "hello world". The user makes a choice and presses a save button. The inputs are serialised as JSON and saved in a file.
mytest = dict([('testId', testId), ('testDate', testDate), ('testStyle', testStyle)])
with open('data/test.txt', mode='a', encoding='utf-8') as myfile:
    json.dump(mytest, myfile, indent=2)
    myfile.close()
The result in the file is:
{
"testStyle": "blabla",
"testId": "8",
"testDate": "2013-05-09 13:32"
}{
"testDate": "2013-05-09 13:41",
"testId": "9",
"testStyle": "blabla"
}
As a Python newbie, I want to load the file data and make some checks, like "If the user made another entry on 2013-05-09, display a message saying that you already entered data for today." What is the proper way to load all this JSON data? The list will expand each day and will contain lots of data.
Instead of directly storing each dictionary, you can store a list of dictionaries, which can be loaded back into a list that you can modify and append to:
import json
mytest1 = dict([('testId','testId1'), ('testDate','testDate1'), ('testStyle','testStyle1')])
json_values = []
json_values.append(mytest1)
s = json.dumps(json_values)
print(s)
json_values = None
mytest2 = dict([('testId','testId2'), ('testDate','testDate2'), ('testStyle','testStyle2')])
json_values = json.loads(s)
json_values.append(mytest2)
s = json.dumps(json_values)
print(s)
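A rough file-based sketch of the same idea applied to the file from the question (the load_entries/save_entries helper names are hypothetical):
import json
import os
PATH = 'data/test.txt'  # same file as in the question
def load_entries(path=PATH):
    # return the stored list, or an empty list if the file doesn't exist yet
    if not os.path.exists(path):
        return []
    with open(path, mode='r', encoding='utf-8') as f:
        return json.load(f)
def save_entries(entries, path=PATH):
    # rewrite the whole list each time instead of appending single dicts
    with open(path, mode='w', encoding='utf-8') as f:
        json.dump(entries, f, indent=2)
entries = load_entries()
entries.append({'testId': '10', 'testDate': '2013-05-10 09:00', 'testStyle': 'blabla'})
save_entries(entries)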
You could simply load the file and parse it:
with open(path, mode="r", encoding="utf-8") as myfile:
    data = json.loads(myfile.read())
Now you can do with data whatever you want.
If your file is really big, then I suppose you should use a proper database instead.
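For the "you already entered data for today" check from the question, a minimal sketch (assuming the entries are stored as a list of dicts as in the first answer, with 'testDate' strings like "2013-05-09 13:32"):
import json
from datetime import date
def already_entered_today(path='data/test.txt'):
    with open(path, mode='r', encoding='utf-8') as myfile:
        entries = json.loads(myfile.read())
    today = date.today().isoformat()  # e.g. "2013-05-09"
    return any(entry['testDate'].startswith(today) for entry in entries)
if already_entered_today():
    print("You already entered data for today.")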
