Question
I have a text file that records metadata of research papers requested from the Semantic Scholar API. However, when I wrote the requested data, I forgot to add "\n" after each individual record. This results in something that looks like
{<metadata1>}{<metadata2>}{<metadata3>}...
whereas it should look like this if I had added "\n":
{<metadata1>}
{<metadata2>}
{<metadata3>}
...
Now, I would like to read the data. As all the metadata is now stored on one line, I need to do some hacks.
First I split the cluttered dicts using "{".
Then I tried to convert each string back to a dict. Note that I allow for the possibility that a line is not in proper JSON format.
import json
with open("metadata.json", "r") as f:
    for line in f.readline().split("{"):
        print(json.loads("{" + line.replace("\'", "\"")))
However, there is still an error message
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
What should I do to recover all the metadata I collected?
MWE
Note: to reproduce the metadata.json file I use, run the following code; it should work out of the box.
import json
import urllib.parse  # urljoin lives in the urllib.parse submodule
import requests
baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
               "b407a81019650fe8b0acf7e4f8f18451f9c803d5",
               "ff118a6a74d1e522f147a9aaf0df5877fd66e377"]
for paperID in paperIDList:
    response = requests.get(urllib.parse.urljoin(baseURL, paperID))
    metadata = response.json()
    record = dict()
    record["title"] = metadata["title"]
    record["abstract"] = metadata["abstract"]
    record["paperId"] = metadata["paperId"]
    record["year"] = metadata["year"]
    record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
    record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
    # Records are appended with no separator, which produces the one-line file
    with open("metadata.json", "a") as fileObject:
        fileObject.write(json.dumps(record))
The problem is that split("{") returns an empty first item, corresponding to the text before the opening {, and "{" on its own is not valid JSON. Just skip the first element and everything works (I also added an r prefix to your quote replacements so Python treats them as raw string literals):
with open("metadata.json", "r") as f:
    for line in f.readline().split("{")[1:]:
        print(json.loads("{" + line.replace(r"\'", r"\"")))
As suggested in the comments, I would actually recommend recreating the file or saving a new version where you replace }{ by }\n{:
with open("metadata.json", "r") as f:
    data = f.read()
data_lines = data.replace("}{", "}\n{")
with open("metadata_mod.json", "w") as f:
    f.write(data_lines)
That way you will have the metadata of a paper per line as you want.
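Both approaches above rely on "{" or "}{" never appearing inside a title or abstract. As an alternative, here is a minimal sketch that lets the standard library's json.JSONDecoder.raw_decode find the end of each record instead of splitting on braces. The function name iter_concatenated_json and the sample string are made up for illustration; in practice you would pass in the contents of metadata.json.

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: two records written without a separator,
# the first with braces inside a string value.
sample = '{"title": "A {braced} title", "year": 2019}{"title": "B", "year": 2020}'
records = list(iter_concatenated_json(sample))
for record in records:
    print(record["title"], record["year"])
```

Because the decoder tracks string boundaries itself, braces inside a title or abstract do not confuse it.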
Related
I apologize for the vague definition of my problem in the title, but I really can't figure out what sort of problem I'm dealing with. So, here it goes.
I have python file:
edit-json.py
import os, json
def add_rooms(data):
    if(not os.path.exists('rooms.json')):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read() # take existing data from file
        f.truncate(0) # empty the json file
        if(d == ''): rooms = [] # check if data is empty i.e the file was just created
        else: rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms})) # write new data(rooms list) to the json file
add_rooms({'roomname': 'friends'})
This python script basically creates a file rooms.json(if it doesn't exist), grabs the data(array) from the json file, empties the json file, then finally writes the new data into the file. All this is done in the function add_rooms(), which is then called at the end of the script, pretty simple stuff.
So, here's the problem, I run the file once, nothing weird happens, i.e the file is created and the data inside it is:
{"rooms": [{"name": "friends"}]}
But the weird stuff happens when I run the script again.
What I should see:
{"rooms": [{"name": "friends"}, {"name": "friends"}]}
What I see instead:
I apologize I had to post the image because for some reason I couldn't copy the text I got.
and I can't obviously run the script again(for the third time) because the json parser gives error due to those characters
I obtained this result in an online compiler. In my local windows system, I get extra whitespace instead of those extra symbols.
I can't figure out what causes it. Maybe I'm doing the file handling incorrectly? Or is it due to the json module? Or am I the only one getting this result?
When you truncate the file, the file pointer is still at the end of the file. Use f.seek(0) to move back to the start of the file:
import os, json
def add_rooms(data):
    if(not os.path.exists('rooms.json')):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read() # take existing data from file
        f.truncate(0) # empty the json file
        f.seek(0) # <<<<<<<<< add this line
        if(d == ''): rooms = [] # check if data is empty i.e the file was just created
        else: rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms})) # write new data(rooms list) to the json file
add_rooms({'roomname': 'friends'})
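A different way to avoid the stale file pointer altogether is to read and write in two separate open() calls, since mode "w" truncates the file and starts writing at offset 0. The following is only a sketch of that idea, not the code from the answer above: the function add_room and the temporary path are invented for the demonstration.

```python
import json
import os
import tempfile

def add_room(path, roomname):
    """Append a room to the JSON file at `path`, creating the file if needed."""
    rooms = []
    if os.path.exists(path):
        with open(path, "r") as f:
            content = f.read()
        if content:  # an empty file means nothing has been stored yet
            rooms = json.loads(content)["rooms"]
    rooms.append({"name": roomname, "active": 1})
    # Mode "w" truncates and writes from offset 0, so no seek is needed.
    with open(path, "w") as f:
        json.dump({"rooms": rooms}, f)
    return rooms

# Run twice against a throwaway file to mimic running the script twice.
demo_path = os.path.join(tempfile.mkdtemp(), "rooms.json")
add_room(demo_path, "friends")
add_room(demo_path, "friends")
with open(demo_path) as f:
    result = json.load(f)
print(result)
```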
I am trying to read a JSON file (BioRelEx dataset: https://github.com/YerevaNN/BioRelEx/releases/tag/1.0alpha7) in Python. The JSON file is a list of objects, one per sentence.
This is how I try to do it:
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        for line in data_file.readlines():
            if not line:
                continue
            items = json.loads(line)
            text = items["text"]
            label = items.get("label")
My code is failing on items = json.loads(line). It looks like the data is not formatted as the code expects it to be, but how can I change it?
Thanks in advance for your time!
Best,
Julia
With json.load() you don't need to read each line, you can do either of these:
import json
def open_json(path):
    with open(path, 'r') as file:
        return json.load(file)
data = open_json('./1.0alpha7.dev.json')
Or, even cooler, you can GET request the json from GitHub
import json
import requests
url = 'https://github.com/YerevaNN/BioRelEx/releases/download/1.0alpha7/1.0alpha7.dev.json'
response = requests.get(url)
data = response.json()
These will both give the same output. The data variable will be a list of dictionaries that you can iterate over in a for loop for further processing.
Your code is reading one line at a time and parsing each line individually as JSON. Unless the creator of the file created the file in this format (which given it has a .json extension is unlikely) then that won't work, as JSON does not use line breaks to indicate end of an object.
Load the whole file content as JSON instead, then process the resulting items in the array.
def _read(self, file_path):
    with open(cached_path(file_path), "r") as data_file:
        data = json.load(data_file)
    for item in data:
        text = item["text"]
The label appears to be buried in item["interaction"].
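To make the difference between the two layouts concrete, here is a small sketch that first tries to parse the text as a single JSON document and only falls back to one JSON value per line if that fails. The helper name load_records and both sample strings are assumptions for illustration; they are not taken from the BioRelEx files.

```python
import json

def load_records(text):
    """Parse `text` as one JSON document; fall back to one JSON value per line."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumed fallback: one complete JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]

# A pretty-printed JSON array: no single line is valid JSON on its own.
array_text = '[\n  {"text": "first"},\n  {"text": "second"}\n]'
# The line-per-object layout that the original loop expected.
lines_text = '{"text": "first"}\n{"text": "second"}\n'

print(load_records(array_text))
print(load_records(lines_text))
```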
I am running this program to get the page source of a website I put in. It saves the source to a file, and I want it to look for a specific string, which is basically @, to find the emails. However, I can't get it to work.
import requests
import re
url = 'https://www.youtube.com/watch?v=GdKEdN66jUc&app=desktop'
data = requests.get(url)
# dump resulting text to file
with open("data6.txt", "w") as out_f:
    out_f.write(data.text)
with open("data6.txt", "r") as f:
    searchlines = f.readlines()
for i, line in enumerate(searchlines):
    if "@" in line:
        for l in searchlines[i:i+3]: print(l)
You can use the regex function findall to find all email addresses in your text content, and use file.read() instead of file.readlines() to get all the content together rather than split into separate lines.
For example:
import re
with open("data6.txt", "r") as file:
    content = file.read()
emails = re.findall(r"[\w\.]+@[\w\.]+", content)
Maybe cast to a set for uniqueness afterwards, and then save to a file however you like.
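The deduplication step could look like the sketch below, assuming the addresses use the usual @ separator. The pattern is deliberately simple and will not cover every valid address; EMAIL_PATTERN, extract_emails and the sample text are illustrative names, not part of the original code.

```python
import re

# Simplified pattern: local part, "@", then a dotted domain name.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_emails(content):
    """Return the unique email-like strings in `content`, sorted."""
    return sorted(set(EMAIL_PATTERN.findall(content)))

sample = "Contact alice@example.com or bob@example.org. Again: alice@example.com."
emails = extract_emails(sample)
print(emails)
```

Sorting the set gives a stable order, which is convenient if the result is written to a file and compared between runs.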
I am getting a JSON file with following format :
// 20170407
// http://info.employeeportal.org
{
"EmployeeDataList": [
{
"EmployeeCode": "200005ABH9",
"Skill": CT70,
"Sales": 0.0,
"LostSales": 1010.4
}
]
}
Need to remove the extra comment lines present in the file.
I tried with the following code :
import json
import commentjson
with open('EmployeeDataList.json') as json_data:
    employee_data = json.load(json_data)
    '''employee_data = json.dump(json.load(json_data))'''
    '''employee_data = commentjson.load(json_data)'''
    print(employee_data)
Still not able to remove the comments from the file and bring
the JSON file in correct format.
I can't see where things are going wrong. Any direction in this regard is highly appreciated. Thanks in advance.
You're not using commentjson correctly. It has the same interface as the json module:
import commentjson
with open('EmployeeDataList.json', 'r') as handle:
    employee_data = commentjson.load(handle)
print(employee_data)
Although in this case, your comments are simple enough that you probably don't need to install an extra module to remove them:
import json
with open('EmployeeDataList.json', 'r') as handle:
    fixed_json = ''.join(line for line in handle if not line.startswith('//'))
employee_data = json.loads(fixed_json)
print(employee_data)
Note the difference here between the two code snippets is that json.loads is used instead of json.load, since you're parsing a string instead of a file object.
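As a self-contained illustration of that distinction, the sketch below filters the comment lines from a file-like object and then hands the remaining text to json.loads. It uses io.StringIO as a stand-in for the file on disk, and the sample quotes "CT70" so that the payload is valid JSON once the comments are gone; both are assumptions made for the demo, as is the function name.

```python
import io
import json

def load_json_without_line_comments(handle):
    """Parse JSON from a text file object, skipping lines that start with //."""
    kept = [line for line in handle if not line.lstrip().startswith("//")]
    # json.loads takes a string; json.load would expect a file object instead.
    return json.loads("".join(kept))

# Stand-in for the file on disk, with "CT70" quoted so the JSON is valid.
sample = io.StringIO(
    '// 20170407\n'
    '// http://info.employeeportal.org\n'
    '{"EmployeeDataList": [{"EmployeeCode": "200005ABH9", "Skill": "CT70", "Sales": 0.0}]}\n'
)
employee_data = load_json_without_line_comments(sample)
print(employee_data["EmployeeDataList"][0]["Skill"])
```

Only lines that begin with // are dropped, so a URL inside a string value is left alone.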
Try JSON-minify:
JSON-minify minifies blocks of JSON-like content into valid JSON by removing all whitespace and JS-style comments (single-line // and multiline /* .. */).
I usually read the JSON as a normal file, delete the comments and then parse it as a JSON string. It can be done in one line with the following snippet:
with open(path,'r') as f: jsonDict = json.loads('\n'.join(row for row in f if not row.lstrip().startswith("//")))
IMHO it is very convenient because it does not need CommentJSON or any other non standard library.
Well, that's not a valid JSON format, so just open it like you would a text document and then drop every line that starts with //.
with open("EmployeeDataList.json", "r") as rf:
    with open("output.json", "w") as wf:
        for line in rf.readlines():
            if line[0:2] == "//":
                continue  # skip comment lines
            wf.write(line)
Your file is parsable using HOCON.
pip install pyhocon
>>> from pyhocon import ConfigFactory
>>> conf = ConfigFactory.parse_file('data.txt')
>>> conf
ConfigTree([('EmployeeDataList',
[ConfigTree([('EmployeeCode', '200005ABH9'),
('Skill', 'CT70'),
('Sales', 0.0),
('LostSales', 1010.4)])])])
If it is the same number of comment lines every time (two in the file above), you can just do:
with open('EmployeeDataList.NOTjson', "r") as fh:
    rawText = fh.read()
json_data = rawText.split("\n", 2)[2]  # keep everything after the second newline
This way json_data is the text of the file without the first 2 lines.
So I'm new in python and I desperately need help.
I have a text file which has a bunch of ids (integer values) written in it.
Now I need to pass each id inside the file into a url.
For example "https://example.com/[id]"
It will be done in this way
A = json.load(urllib.urlopen("https://example.com/(the first id present in the text file)"))
print A
What this will essentially do is that it will read certain information about the id present in the above url and display it. I want this to work in a loop format where in it will read all the ids inside the text file and pass it to the url mentioned in 'A' and display the values continuously..is there a way to do this?
I'd be very grateful if someone could help me out!
Old-style string formatting can be used:
>>> id = "3333333"
>>> url = "https://example.com/%s" % id
>>> print url
https://example.com/3333333
>>>
The new style string formatting:
>>> url = "https://example.com/{0}".format(id)
>>> print url
https://example.com/3333333
>>>
Reading the ids from the file, as mentioned by avasal, with a small change:
import json, urllib
f = open('file.txt', 'r')
for line in f.readlines():
    id = line.strip('\n')
    url = "https://example.com/{0}".format(id)
    body = urllib.urlopen(url).read()  # read the response once
    try:
        json_data = json.loads(body)  # json.loads parses a string
        print json_data
    except ValueError:
        print body  # not JSON, so show the raw response
lazy style:
url = "https://example.com/" + first_id
A = json.load(urllib.urlopen(url))
print A
old style:
url = "https://example.com/%s" % first_id
A = json.load(urllib.urlopen(url))
print A
new style 2.6+:
url = "https://example.com/{0}".format( first_id )
A = json.load(urllib.urlopen(url))
print A
new style 2.7+:
url = "https://example.com/{}".format( first_id )
A = json.load(urllib.urlopen(url))
print A
Python 3+
New String formatting is supported in Python 3 which is a more readable and better way to format a string.
Here's the good article to read about the same: Python 3's f-Strings
In this case, it can be formatted as
url = f"https://example.com/{id}"
Detailed example
When you want to pass multiple params to the URL it can be done as below.
name = "test_api_4"
owner = "jainik@test.com"
url = f"http://localhost:5001/files/create" \
      f"?name={name}" \
      f"&owner={owner}"
We are using multiple f-strings here, joined by a trailing backslash (\) on each continued line. Adjacent string literals are concatenated into a single string, without any newline character inserted between them.
For values which have space
For such values, add from urllib.parse import quote to your Python file and then quote the string like this: quote("firstname lastname")
This will replace space character with %20.
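When there are several parameters, urllib.parse.urlencode can escape and join them in one step. The sketch below reuses the localhost URL from the example above; the parameter values are made up. Note that urlencode encodes a space as + by default, whereas quote produces %20.

```python
from urllib.parse import quote, urlencode

base_url = "http://localhost:5001/files/create"
# Hypothetical values, chosen to contain characters that need escaping.
params = {"name": "first last", "owner": "user@example.com"}

# urlencode escapes each value and joins the pairs with "&".
url = f"{base_url}?{urlencode(params)}"
print(url)

# quote() on its own is handy for a single value or path segment.
print(quote("firstname lastname"))
```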
The first thing you need to do is know how to read each line from a file. First, you have to open the file; you can do this with a with statement:
with open('my-file-name.txt') as intfile:
This opens a file and stores a reference to that file in intfile, and it will automatically close the file at the end of your with block. You then need to read each line from the file; you can do that with a regular old for loop:
for line in intfile:
This will loop through each line in the file, reading them one at a time. In your loop, you can access each line as line. All that's left is to make the request to your website using the code you gave. The one bit you're missing is what's called "string interpolation", which allows you to format a string with other strings, numbers, or anything else. In your case, you'd like to put a string (the line from your file) inside another string (the URL). To do that, you use the %s flag along with the string interpolation operator, %. Each line still ends with a newline character, so strip it before building the URL:
url = 'http://example.com/?id=%s' % line.strip()
A = json.load(urllib.urlopen(url))
print A
Putting it all together, you get:
with open('my-file-name.txt') as intfile:
    for line in intfile:
        # strip() removes the trailing newline that each line carries
        url = 'http://example.com/?id=%s' % line.strip()
        A = json.load(urllib.urlopen(url))
        print A
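The snippets above are written for Python 2. For Python 3, a rough equivalent might look like the sketch below, where urllib.request.urlopen replaces urllib.urlopen and print is a function. The names BASE_URL, build_urls and fetch_json are invented here, and https://example.com/ is the placeholder host from the question, so the actual request is left commented out.

```python
import json
import urllib.request

BASE_URL = "https://example.com/"  # placeholder host from the question

def build_urls(lines):
    """Turn an iterable of text lines into request URLs, skipping blank lines."""
    return [f"{BASE_URL}{line.strip()}" for line in lines if line.strip()]

def fetch_json(url):
    """Fetch `url` and decode the response body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# In practice `lines` would be an open file; a list stands in for it here.
urls = build_urls(["101\n", "\n", "202\n"])
print(urls)
# for url in urls:
#     print(fetch_json(url))  # needs a real endpoint that returns JSON
```

Keeping URL construction separate from the network call makes the first part easy to check without an internet connection.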