Convert to 'Strict' JSON - python

I have a JSON file, which the creators say is not 'Strict' JSON and have given python code to convert it into Strict JSON. I am new to python and keep getting error messages.
JSON example:
{'asin': '0078764343', 'description': 'Brand new sealed!', 'price': 37.98, 'imUrl': 'http://ecx.images-amazon.com/images/I/513h6dPbwLL._SY300_.jpg', 'related': {'also_bought': ['B000TI836G', 'B003Q53VZC', 'B00EFFW0HC', 'B003VWGBC0', 'B003O6G5TW', 'B0037LTTRO', 'B002I098JE', 'B008OQTS0U', 'B005EVEODY', 'B008B3AVNE', 'B000PE0HBS', 'B00354NAYG', 'B0050SYPV2', 'B00503E8S2', 'B0050SY77E', 'B0022TNO7S', 'B0056WJA30', 'B0023CBY4E', 'B002SRSQ72', 'B005EZ5GQY', 'B004XACA60', 'B00273Z9WM', 'B004HX1QFY', 'B002I0K50U'], 'bought_together': ['B002I098JE'], 'buy_after_viewing': ['B0050SY5BM', 'B000TI836G', 'B0037LTTRO', 'B002I098JE']}, 'salesRank': {'Video Games': 28655}, 'categories': [['Video Games', 'Xbox 360', 'Games']]}
Python Code:
import json
import gzip
def parse(file_path=r"c:\Users\kiero\PycharmProjects\untitled\source\reviews_Video_Games.json.gz"):
    g = gzip.open(file_path, 'r')
    for l in g:
        yield json.dumps(eval(l))
    f = open("C:\\Users\\kiero\\PycharmProjects\\untitled\\source\\reviews_Video_Games.json.gz",'w')
    for l in parse("C:\\Users\\kiero\\PycharmProjects\\untitled\\source\\reviews_Video_Games.json.gz"):
        f.write(l + '\n')
The process keeps finishing without changing any files. The script also runs for less than a second and prints no error messages.
Any help would be appreciated.

There are several problems. The first is that using eval on data you get from other people is a serious security risk, because eval executes whatever code the data contains.
The second is that your code is indented wrong: the part from f = open( onwards, which calls parse, shouldn't be indented. As written it is part of parse, and that function is never called, so nothing happens.
Third, you open the exact same file for writing (with 'w') that you then read on the next line, but opening a file for writing empties it. So there is no data left to read; what was there is destroyed.
Fourth, the file you open for writing has a ".gz" filename but is written to as plain text, so the result would never be a gzip file.
Real code could look like:
import ast, gzip, json
INFILE = 'c://path/to/infile.gz' # Gzip file
OUTFILE = 'c://path/to/other/file/out.json' # Text file with generated JSON
# Open the gzip file in text mode ('rt') so each line is a str, which ast.literal_eval requires
with gzip.open(INFILE, 'rt') as in_f, open(OUTFILE, 'w') as out_f:
    for line in in_f:
        data = ast.literal_eval(line) # Assuming the line is a valid Python literal
        out_f.write(json.dumps(data) + '\n')
Fifth, the end result would still have one data object per line, so the file as a whole would still not be valid JSON: a JSON document holds a single value. Each individual line, however, would be valid JSON.
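If a single valid JSON document is wanted rather than one object per line, the parsed records could be collected into a list and dumped once as a JSON array. This is only a minimal sketch, assuming the records fit in memory; the helper name lines_to_json_array and the sample lines are made up for illustration and are not part of the original dataset code:

```python
import ast
import json


def lines_to_json_array(lines):
    """Parse each Python-literal line and return one JSON array string."""
    # Blank lines are skipped; every other line must be a valid Python literal.
    records = [ast.literal_eval(line) for line in lines if line.strip()]
    return json.dumps(records)


# Hypothetical sample input: two lines in the non-strict (single-quoted) format.
sample = [
    "{'asin': '0078764343', 'price': 37.98}\n",
    "{'asin': 'B000TI836G', 'price': 12.5}\n",
]
result = lines_to_json_array(sample)
print(result)
```

The trade-off is memory: the one-object-per-line layout can be streamed, while a single array is easiest to build when everything is loaded at once.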

Related

GZip and output file

I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip
jsonString = 'JSON encoded string here created by a previous process in the application'
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))
# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
  File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
    with gzip.GzipFile(fileobj=out, mode="w") as f:
  File "C:\Python38\lib\gzip.py", line 204, in __init__
    self._write_gzip_header(compresslevel)
  File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
    self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
PS ignore the rubbish indenting here...I know it doesn't look right.
What I'm wanting to do is to create a json file and gzip it in place in memory before saving the gzipped file to the filesystem (windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.
You have to use bytes everywhere when working with gzip, not strings and text. First, use BytesIO instead of StringIO. Second, open the output file in binary mode: 'wb' instead of 'w' (and likewise 'ab' instead of 'a' when appending). The 'b' means the file takes bytes rather than text, so the encoding argument has to go as well. Full corrected code below:
from io import BytesIO
import gzip
jsonString = 'JSON encoded string here created by a previous process in the application'
out = BytesIO()
with gzip.GzipFile(fileobj = out, mode = 'wb') as f:
    f.write(str.encode(jsonString))
currenttimestamp = '2021-01-29'
# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
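To check that the in-memory buffer really holds a valid gzip stream before writing it to disk, one option is a quick round trip. A minimal sketch, assuming UTF-8 text; the json_string value is a hypothetical stand-in for the real payload:

```python
import gzip
from io import BytesIO

json_string = '{"example": "payload"}'  # hypothetical stand-in for the real JSON

buffer = BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(json_string.encode("utf-8"))  # gzip expects bytes, not str

# The buffer stays open after the GzipFile is closed, so its bytes can be read.
compressed = buffer.getvalue()

# Decompress again and compare with the original text.
restored = gzip.decompress(compressed).decode("utf-8")
print(restored == json_string)
```

The compressed value is a bytes object, which is why the file it is written to has to be opened with "wb".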

Reading a text file of dictionaries stored in one line

Question
I have a text file that records metadata of research papers requested from the Semantic Scholar API. However, when I wrote the requested data, I forgot to add "\n" after each individual record. This results in something that looks like
{<metadata1>}{<metadata2>}{<metadata3>}...
whereas it should look like this, had I added "\n":
{<metadata1>}
{<metadata2>}
{<metadata3>}
...
Now, I would like to read the data. As all the metadata is now stored in one line, I need to do some hacks:
First, I split the run-together dicts using "{".
Then I tried to convert each string back to a dict. Note that I do allow for the line not being in proper JSON format.
import json
with open("metadata.json", "r") as f:
    for line in f.readline().split("{"):
        print(json.loads("{" + line.replace("\'", "\"")))
However, there is still an error message
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
What should I do to recover all the metadata I collected?
MWE
Note: to get the metadata.json file I use, run the following code; it should work out of the box.
import json
import urllib.parse
import requests
baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
               "b407a81019650fe8b0acf7e4f8f18451f9c803d5",
               "ff118a6a74d1e522f147a9aaf0df5877fd66e377"]
for paperID in paperIDList:
    response = requests.get(urllib.parse.urljoin(baseURL, paperID))
    metadata = response.json()
    record = dict()
    record["title"] = metadata["title"]
    record["abstract"] = metadata["abstract"]
    record["paperId"] = metadata["paperId"]
    record["year"] = metadata["year"]
    record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
    record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
    with open("metadata.json", "a") as fileObject:
        fileObject.write(json.dumps(record))
The problem is that when you do the split("{") you get a first item that is empty, corresponding to the text before the opening {. Passing "{" plus that empty string to json.loads is what raises the error. Just ignore the first element and everything works fine, as long as no value in the records contains a { character (I added an r to your quote replacements so Python treats them as raw string literals):
with open("metadata.json", "r") as f:
    for line in f.readline().split("{")[1:]:
        print(json.loads("{" + line.replace(r"\'", r"\"")))
As suggested in the comments, I would actually recommend recreating the file or saving a new version where you replace }{ by }\n{:
with open("metadata.json", "r") as f:
    data = f.read()
data_lines = data.replace("}{","}\n{")
with open("metadata_mod.json", "w") as f:
    f.write(data_lines)
That way you will have the metadata of a paper per line as you want.
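Splitting on "{" or replacing "}{" can go wrong if a title or abstract happens to contain those characters. A possible alternative, shown here only as a sketch, is to let the standard library's json.JSONDecoder.raw_decode find where each object ends. The function name iter_concatenated_json and the sample string are hypothetical:

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it manually.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: values contain "{" and "}{", which would break a plain split.
sample = '{"title": "a {b} c", "year": 2020}{"title": "x}{y", "year": 2021}'
records = list(iter_concatenated_json(sample))
print(records)
```

For the file in the question, the text argument would be the result of f.read() on metadata.json.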

random/empty characters while re-editing a json file

I apologize for the vague definition of my problem in the title, but I really can't figure out what sort of problem I'm dealing with. So, here it goes.
I have python file:
edit-json.py
import os, json
def add_rooms(data):
    if(not os.path.exists('rooms.json')):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read() # take existing data from file
        f.truncate(0) # empty the json file
        if(d == ''): rooms = [] # check if data is empty i.e the file was just created
        else: rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms})) # write new data(rooms list) to the json file
add_rooms({'roomname': 'friends'})
This python script basically creates a file rooms.json(if it doesn't exist), grabs the data(array) from the json file, empties the json file, then finally writes the new data into the file. All this is done in the function add_rooms(), which is then called at the end of the script, pretty simple stuff.
So, here's the problem, I run the file once, nothing weird happens, i.e the file is created and the data inside it is:
{"rooms": [{"name": "friends"}]}
But the weird stuff happens when I run the script again.
What I should see:
{"rooms": [{"name": "friends"}, {"name": "friends"}]}
What I see instead:
I apologize that I had to post an image; for some reason I couldn't copy the text I got.
Obviously I can't run the script a third time, because the JSON parser raises an error on those characters.
I obtained this result in an online compiler. On my local Windows system, I get extra whitespace instead of those extra symbols.
I can't figure out what causes it. Maybe I'm not doing the file handling correctly? Or is it due to the json module? Or am I the only one getting this result?
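When you truncate the file, the file pointer is still where read() left it, at the old end of the file. The next write lands at that offset, and the gap before it is filled with null bytes, which show up as odd symbols or whitespace. Use f.seek(0) to move back to the start of the file: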
import os, json
def add_rooms(data):
    if(not os.path.exists('rooms.json')):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read() # take existing data from file
        f.truncate(0) # empty the json file
        f.seek(0) # <<<<<<<<< add this line
        if(d == ''): rooms = [] # check if data is empty i.e the file was just created
        else: rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms})) # write new data(rooms list) to the json file
add_rooms({'roomname': 'friends'})
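A variant of the same fix, sketched under the assumption that the whole file is rewritten each time: seek back to the start first, write the new data, and truncate afterwards so that any leftover tail is cut off. The function name add_room, its path parameter, and the temporary directory are illustrative choices, not part of the original script:

```python
import json
import os
import tempfile


def add_room(path, name):
    """Append a room to the JSON file at path, creating the file if needed."""
    if not os.path.exists(path):
        with open(path, "w"):
            pass
    with open(path, "r+") as f:
        content = f.read()
        rooms = json.loads(content)["rooms"] if content else []
        rooms.append({"name": name, "active": 1})
        f.seek(0)      # move back to the start before writing
        f.write(json.dumps({"rooms": rooms}))
        f.truncate()   # cut off anything left over after the new data


# Hypothetical usage in a temporary directory: run the update twice.
with tempfile.TemporaryDirectory() as tmp:
    rooms_path = os.path.join(tmp, "rooms.json")
    add_room(rooms_path, "friends")
    add_room(rooms_path, "friends")
    with open(rooms_path) as f:
        final = json.load(f)
print(final)
```

Either order works as long as the file pointer is at position 0 when the write happens.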

Read JSON file correctly

I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        for line in data_file.readlines():
            if not line:
                continue
            items = json.loads(line)
            text = items["text"]
            label = items.get("label")
My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line, you can do either of these:
import json
def open_json(path):
    with open(path, 'r') as file:
        return json.load(file)
data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET request the json from GitHub
import json
import requests
url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
response = requests.get(url)
data = response.json()
Both give the same result: the data variable will be a list of dictionaries that you can iterate over in a for loop for further processing.
Your code is reading one line at a time and parsing each line individually as JSON. Unless the creator of the file wrote it in that format (which, given its .json extension, is unlikely), that won't work, because JSON does not use line breaks to mark the end of an object.
Load the whole file content as JSON instead, then process the resulting items in the array.
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        data = json.load(data_file)
    for item in data:
        text = item["text"]
        # label appears to be buried in item["interaction"]
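The difference between the two approaches can be seen on a small pretty-printed array. This is only an illustrative sketch; the sample text is made up and merely resembles a list of sentence objects:

```python
import json

# Hypothetical pretty-printed JSON array, similar in shape to a list of sentences.
text = '[\n  {"text": "first sentence"},\n  {"text": "second sentence"}\n]'

# Parsing line by line fails: the first line is just "[".
try:
    json.loads(text.splitlines()[0])
    line_by_line_ok = True
except json.JSONDecodeError:
    line_by_line_ok = False

# Parsing the whole document works and yields a list of dicts.
items = json.loads(text)
print(line_by_line_ok, [item["text"] for item in items])
```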

Not able to fix file handling issue in python

I wrote Python code to search for a pattern in a Tcl file and replace it with a string. It prints the output, but the change is not saved in the Tcl file.
import re
import fileinput
filename=open("Fdrc.tcl","r+")
for i in filename:
    if i.find("set qa_label")!=-1:
        print(i)
        a=re.sub(r'REL.*','harsh',i)
        print(a)
filename.close()
Actual result:
set qa_label REL_ts07n0g42p22sadsl01msaA04_2018-09-11-11-01
set qa_label harsh
The expected result is that my file reflects the same output as above, but it does not.
You need to actually write your changes back to disk if you want to see them reflected there. As @ImperishableNight says, you don't want to do this by trying to write to a file you're also reading from. You want to write to a new file. Here's an expanded version of your code that does that:
import re
import fileinput
fin=open("/tmp/Fdrc.tcl")
fout=open("/tmp/FdrcNew.tcl", "w")
for i in fin:
    if i.find("set qa_label")!=-1:
        print(i)
        a=re.sub(r'REL.*','harsh',i)
        print(a)
        fout.write(a)
    else:
        fout.write(i)
fin.close()
fout.close()
Input and output file contents:
> cat /tmp/Fdrc.tcl
set qa_label REL_ts07n0g42p22sadsl01msaA04_2018-09-11-11-01
> cat /tmp/FdrcNew.tcl
set qa_label harsh
If you wanted to overwrite the original file, then you would want to read the entire file into memory and close the input file stream, then open the file again for writing, and write modified content to the same file.
Here's a cleaner version of your code that does this...produces an in memory result and then writes that out using a new file handle. I am still writing to a different file here because that's usually what you want to do at least while you're testing your code. You can simply change the name of the second file to match the first and this code will overwrite the original file with the modified content:
import re
lines = []
with open("/tmp/Fdrc.tcl") as fin:
    for i in fin:
        if i.find("set qa_label")!=-1:
            print(i)
            i=re.sub(r'REL.*','harsh',i)
            print(i)
        lines.append(i)
with open("/tmp/FdrcNew.tcl", "w") as fout:
    fout.writelines(lines)
Alternatively, write the updated contents to a temporary file while reading the original.
After modifying the lines, copy the temporary file's contents back over the original file.
import re
import fileinput
from tempfile import TemporaryFile
with TemporaryFile() as t:
    with open("Fdrc.tcl", "r") as file_reader:
        for line in file_reader:
            if line.find("set qa_label") != -1:
                t.write(
                    str.encode(
                        re.sub(r'REL.*', 'harsh', str(line))
                    )
                )
            else:
                t.write(str.encode(line))
    t.seek(0)
    with open("Fdrc.tcl", "wb") as file_writer:
        file_writer.writelines(t)
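Another way to get the same effect, sketched here as one possible variant rather than a definitive implementation, is to write the modified lines to a named temporary file in the same directory and then swap it in with os.replace. The helper replace_in_file and its parameters are hypothetical names, and the usage below runs on a throwaway file:

```python
import os
import re
import tempfile


def replace_in_file(path, pattern, replacement, marker):
    """Rewrite path, applying re.sub only to lines that contain marker."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it in.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as tmp:
        for line in src:
            if marker in line:
                line = re.sub(pattern, replacement, line)
            tmp.write(line)
    os.replace(tmp.name, path)


# Hypothetical usage on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    tcl_path = os.path.join(workdir, "Fdrc.tcl")
    with open(tcl_path, "w") as f:
        f.write("set qa_label REL_ts07n0g42p22sadsl01msaA04_2018-09-11-11-01\n")
        f.write("set other value\n")
    replace_in_file(tcl_path, r"REL.*", "harsh", "set qa_label")
    with open(tcl_path) as f:
        updated = f.read()
print(updated)
```

With this approach the original file is only replaced once the new content has been written completely.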
