I want to read multiple JSON objects from a single file in a local directory. So far this is my attempt:
Data:
[{
"uuid": "6f476e26",
"created": "2018-09-26T06:57:04.142232",
"creator": "admin"
}, {
"uuid": "11d1e78a",
"created": "2019-09-21T11:19:39.845876",
"creator": "admin"
}]
Code:
import json
with open('/home/data.json') as f:
    for line in f:
        data = json.load(f)
Error:
File "/usr/lib64/python3.8/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 8 (char 7)
My question is similar to Loading and parsing a JSON file with multiple JSON objects, and I've tried that approach, but the same issue appears. What should I do to solve this?
for line in f:
    data = json.load(f)
This makes no sense. You are trying to parse the file over and over again, as many times as the number of lines in the file. This is more problematic than it sounds since f is exhausted after the first call to json.load(f).
You don't need the loop, just pass f to json.load:
with open('/home/data.json') as f:
    data = json.load(f)

print(data)
outputs
[{'uuid': '6f476e26', 'created': '2018-09-26T06:57:04.142232', 'creator': 'admin'},
{'uuid': '11d1e78a', 'created': '2019-09-21T11:19:39.845876', 'creator': 'admin'}]
Now you can loop over data or directly access a specific index, ie data[0] or data[1].
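For example, with the sample array from the question (inlined here as a string so the snippet is self-contained, rather than read from a file):

```python
import json

# the sample array from the question, inlined for illustration
raw = '''[{"uuid": "6f476e26", "created": "2018-09-26T06:57:04.142232", "creator": "admin"},
{"uuid": "11d1e78a", "created": "2019-09-21T11:19:39.845876", "creator": "admin"}]'''

data = json.loads(raw)

for entry in data:            # loop over the parsed list of dicts
    print(entry['uuid'])      # prints 6f476e26, then 11d1e78a

print(data[1]['creator'])     # or index directly: prints admin
```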
Related
I have a function that I apply to a json file. It works if it looks like this:
import json
def myfunction(dictionary):
    # does things
    return new_dictionary
data = """{
"_id": {
"$oid": "5e7511c45cb29ef48b8cfcff"
},
"description": "some text",
"startDate": {
"$date": "5e7511c45cb29ef48b8cfcff"
},
"completionDate": {
"$date": "2021-01-05T14:59:58.046Z"
},
"videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]
}"""
info = json.loads(data)
refined = key_replacer(info)
new_data = json.dumps(refined)
print(new_data)
However, I need to apply it to a whole file, and the input looks like this (there are multiple objects, not separated by commas; they come one after another):
{"_id":{"$oid":"5f06cb272cfede51800b6b53"},"company":{"$oid":"5cdac819b6d0092cd6fb69d3"},"name":"SomeName","videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]}
{"_id":{"$oid":"5ddb781fb4a9862c5fbd298c"},"company":{"$oid":"5d22cf72262f0301ecacd706"},"name":"SomeName2","videos":[{"$oid":"5dd3f09727658a1b9b4fb5fd"},{"$oid":"5d78b5a536e59001a4357f4c"},{"$oid":"5de0b85e129ef7026f27ad47"}]}
How could I do this? I tried opening and reading the file, using load and dump instead of loads and dumps, and it still doesn't work. Do I need to read, or iterate over every line?
You are dealing with the ndjson (newline-delimited JSON) data format.
Read the whole data string, split it into lines, and parse each line as a JSON object, giving you a list of dicts:
import json

def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\Users\\test.json', 'r', encoding="utf8") as handle:
    data = handle.read()

dicts = parse_ndjson(data)

for d in dicts:
    new_d = my_function(d)
    print("New dict", new_d)
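If the file is large, you don't have to read it all into memory first: a file object already iterates line by line, so each line can be parsed as it is read. A minimal sketch (using io.StringIO as a stand-in for the real file handle):

```python
import io
import json

# stand-in for open('C:\\Users\\test.json'); each line is one JSON object
handle = io.StringIO('{"a": 1}\n{"a": 2}\n')

dicts = []
for line in handle:
    line = line.strip()
    if line:                      # skip blank lines
        dicts.append(json.loads(line))

print(dicts)  # [{'a': 1}, {'a': 2}]
```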
with open('data.txt', 'r') as file:
    dat2 = file.read()

post2 = {
    "id": 5,
    "method": "set",
    "params": [
        {
            "data": [
                dat2
            ],
            "url": "/config/url"
        },
    ],
    "session": sessionkey,
    "verbose": 1
}
Data from the file I am trying to read looks as so...
{"name": "Host1","type": "ipmask","subnet": ["0.0.0.0","255.255.255.255"],"dynamic_mapping": null},
{"name": "Host2","type": "ipmask","subnet": ["0.0.0.0","255.255.255.255"],"dynamic_mapping": null},
I am trying to read this data into a variable so I can put it into post2 for a request. What I have tried so far: reading the file and replacing null with None so Python can read it, as well as stripping all of the whitespace. I have tried using json.loads(), json.load() and json.dumps(), but nothing seems to work. When I try to use json.load() I get the following error.
File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 296, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 18 (char 145)
After the data is placed into dat2, it gets inserted into post2 as '{data}' instead of {data}. Also, yes, I know that file.read() reads the contents of the file into a string, but I have been trying everything since I am struggling to have success using json. I have been stuck on this part of my code for the longest time and would appreciate any ideas. NOTE: I HAVE LOOKED AT MULTIPLE PYTHON/JSON POSTS FOR READING JSON AND NOTHING WORKS, SO PLEASE DON'T MARK AS DUPLICATE.
Remove the trailing comma and put the dictionaries in a list in your file, so it looks like this:
[{"name": "Host1","type": "ipmask","subnet": ["0.0.0.0","255.255.255.255"],"dynamic_mapping": null}, {"name": "Host2","type": "ipmask","subnet": ["0.0.0.0","255.255.255.255"],"dynamic_mapping": null}]
Then use json.loads to turn it into a list:
import json

with open('data.txt', 'r') as file:
    dat2 = file.read()

post2 = {"data": json.loads(dat2)}
And this will make post2 be
{'data': [{'name': 'Host1', 'type': 'ipmask', 'subnet': ['0.0.0.0', '255.255.255.255'], 'dynamic_mapping': None}, {'name': 'Host2', 'type': 'ipmask', 'subnet': ['0.0.0.0', '255.255.255.255'], 'dynamic_mapping': None}]}
Hope this helps!
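If editing the file by hand isn't an option, the same repair can be done in code: strip the trailing comma and wrap the text in brackets before parsing. A sketch, assuming the file always ends with `},` as shown (data shortened for illustration):

```python
import json

# raw text as it appears in data.txt, shortened for illustration
raw = '{"name": "Host1", "dynamic_mapping": null},\n{"name": "Host2", "dynamic_mapping": null},'

fixed = '[' + raw.strip().rstrip(',') + ']'   # drop the trailing comma, wrap in a list
post2 = {"data": json.loads(fixed)}
print(post2)
```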
I had code which gave me an empty DataFrame with no saved tweets.
I tried to debug it by putting print(line) under the for line in json_file: and after json_data = json.loads(line).
That resulted in a KeyError.
How do I fix it?
Thank you.
list_df = list()

# read the .txt file, line by line, and append the json data in each line to the list
with open('tweet_json.txt', 'r') as json_file:
    for line in json_file:
        print(line)
        json_data = json.loads(line)
        print(line)

        tweet_id = json_data['tweet_id']
        fvrt_count = json_data['favorite_count']
        rtwt_count = json_data['retweet_count']

        list_df.append({'tweet_id': tweet_id,
                        'favorite_count': fvrt_count,
                        'retweet_count': rtwt_count})

# create a pandas DataFrame using the list
df = pd.DataFrame(list_df, columns=['tweet_id', 'favorite_count', 'retweet_count'])
df.head()
Your comment says you're trying to save to a file, but your code kind of says that you're trying to read from a file. Here are examples of how to do both:
Writing to JSON
import json
import pandas as pd

content = {  # this is just dummy data, in the form of a dictionary
    "tweet1": {
        "id": 1,
        "msg": "Yay, first!"
    },
    "tweet2": {
        "id": 2,
        "msg": "I'm always second :("
    }
}

# write it to a file called "tweet_json.txt" in JSON
with open("tweet_json.txt", "w") as json_file:
    json.dump(content, json_file, indent=4)  # indent=4 is optional, it makes the file easier to read
Note the w (as in write) in open("tweet_json.txt", "w"). You're using r (as in read), which doesn't give you permission to write anything. Also note the use of json.dump() rather than json.load(). We then get a file that looks like this:
$ cat tweet_json.txt
{
    "tweet1": {
        "id": 1,
        "msg": "Yay, first!"
    },
    "tweet2": {
        "id": 2,
        "msg": "I'm always second :("
    }
}
Reading from JSON
Let's read the file that we just wrote, using pandas read_json():
import pandas as pd
df = pd.read_json("tweet_json.txt")
print(df)
Output looks like this:
>>> df
tweet1 tweet2
id 1 2
msg Yay, first! I'm always second :(
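As an aside, if the file had instead been written one JSON object per line (the ndjson format that keeps coming up in these questions), pandas can read it directly with lines=True. A sketch, with the records inlined via io.StringIO instead of a real file:

```python
import io
import pandas as pd

# two newline-delimited JSON records, inlined instead of a file for illustration
ndjson = io.StringIO('{"tweet_id": 1, "favorite_count": 10}\n'
                     '{"tweet_id": 2, "favorite_count": 5}\n')

df = pd.read_json(ndjson, lines=True)  # lines=True parses one object per line
print(df)
```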
import json

file = open('webtext.txt', 'a+')
with open('output-dataset_v1_webtext.test.jsonl') as json_file:
    data = json.load(json_file)
    for item in data:
        file.write(item)
        print(item)
I am getting this error:
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 656)
I have already tried json.loads() as well.
My JSON file contains multiple objects, one per line:
{"id": 255000, "ended": true, "length": 134, "text": "Is this restaurant fami"}
{"id": 255001, "ended": true, "length": 713, "text": "Clinton talks about her time of 'refle"}
Any advice on how to resolve this issue and write dict['text'] to a text file would be highly appreciated.
You need to loop through it:
import json

with open('output-dataset_v1_webtext.test.jsonl', 'r') as json_file:
    for line in json_file.readlines():
        data = json.loads(line)
        for item in data:
            print(item)
Looks like you need to iterate each line in the file and then use json.loads.
Ex:
with open('output-dataset_v1_webtext.test.jsonl') as json_file:
    for line in json_file:                # iterate each line
        data = json.loads(line.strip())   # use json.loads
        for item in data:
            file.write(item)
            print(item)
I'm certainly not a JSON expert, so there might be a better way to do this, but you should be able to resolve your issue by putting your top-level data into an array:
[
{"id": 255000, "ended": true, "length": 134, "text": "Is this restaurant fami"},
{"id": 255001, "ended": true, "length": 713, "text": "Clinton talks about her time of 'refle"}
]
The error you're getting is basically telling you, that there may be no more than one top-level JSON entity. If you want more, they have to be put in an array.
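If you can't modify the file, json.JSONDecoder.raw_decode offers another way: it parses one top-level object from a string and tells you where it ended, so concatenated objects can be consumed one at a time. A sketch, with the data inlined for illustration:

```python
import json

# two top-level objects back to back, as in the .jsonl file
text = '{"id": 255000, "ended": true}\n{"id": 255001, "ended": true}\n'

decoder = json.JSONDecoder()
objects, pos = [], 0
while pos < len(text):
    obj, end = decoder.raw_decode(text, pos)        # parse one object, get its end offset
    objects.append(obj)
    pos = end
    while pos < len(text) and text[pos].isspace():  # skip whitespace between objects
        pos += 1

print(objects)
```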
As others have pointed out, your JSON must be surrounded in square brackets, as it can only have one top level object.
Such as like this:
[
{"id": 255000,"ended": true, "length": 134, "text": "Is this restaurant fami"},
{"id": 255001, "ended": true, "length": 713, "text": "Clinton talks about her time of 'refle"}
]
Then you should be able to use this code to do what you're trying:
import json

file = open('webtext.txt', 'a')
with open('test.json') as json_file:
    data = json.load(json_file)
    for item in data:
        file.write(str(item))
        print(item)
In order to fix your file.write issue you need to cast item as a string, like so: str(item).
I have a .txt file with JSON structures. The problem is that the file does not contain only JSON structures, but also raw text, like log errors:
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "1111",
"results": [{
"filename": "xxxx",
"numberID": "7412"
}, {
"filename": "xgjhh",
"numberID": "E52"
}]
}
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
"filename": "hhhhh",
"numberID": "478962"
}, {
"filename": "jkhgfc",
"number": "12544"
}]
}
I read the .txt file, but when trying to parse the JSON structures I get an error:
IN :
import json

with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    json_data = json.load(f)
OUT : json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
I would like to parse the JSON and save it as a CSV file.
A more general solution for parsing a file of JSON objects mixed with other content, without any assumptions about the non-JSON content, is to split the file content into fragments on the curly brackets, start with the first fragment that is an opening curly bracket, and then join the following fragments one by one until the joined string parses as JSON:
import json
import re

with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    fragments = iter(re.split('([{}])', f.read()))

while True:
    try:
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        break
This outputs:
{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}
This solution will strip out the non-JSON structures and wrap the remaining JSON objects in a containing JSON structure. This should do the job for you; a fuller explanation follows the code:
import json

with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-'
                       for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')

json_data = json.loads('{"entries":[' + ''.join([entry + ', ' for entry in cleaned])[:-2] + ']}')
Output:
{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}
What's going on here?
In the cleaned = ... line, we're using a list comprehension that creates a list of the lines in the file (f.readlines()) that do not contain the string |INFO| and adds the string -split_here- to the list whenever there's a blank line (where .strip() yields '').
Then we join that list of lines into a single string (''.join()).
Finally, we split that string on '-split_here-' (.split('-split_here-')), producing a list of strings, one per JSON structure, as marked by the blank lines in data.txt.
In the json_data = ... line, we append ', ' to each of the JSON structures using a list comprehension.
Then we convert that list back into a single string, stripping off the last ', ' (''.join(...)[:-2]; [:-2] slices the last two characters off the string).
We then wrap the string with '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and parse it with json.loads to load your data as a Python object.
You could do one of several things:
On the Command Line, remove all lines where, say, "|INFO|Technical|" appears (assuming this appears in every line of raw text):
sed -i '' -e '/\|INFO\|Technical/d' yourfilename (if on Mac),
sed -i '/\|INFO\|Technical/d' yourfilename (if on Linux).
Move these raw lines into their own JSON fields
Use the "text structures" as a delimiter between JSON objects.
Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point parse the lines you've saved as a JSON object.
import re
import json

def is_text(line):
    # returns True if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|')  # you said some lines start with a leading |, remove it
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []

with open("data.txt") as f:
    json_lines = []
    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there are multiple text lines in a row, json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []
    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))

print(json_objects)
The repeated logic in the last two lines is a bit ugly, but you need to handle the case where the last line in your file is not a text line: when the for loop finishes, you still need to parse the last object sitting in json_lines, if there is one.
I'm assuming there's never more than one JSON object between text lines and also my regex expression for a date will break in 8,000 years.
You could count curly brackets in your file to find the beginning and end of each JSON object, and store the objects in a list, here found_jsons:
import json

with open("data.txt", encoding="utf-8", errors='ignore') as f:
    content = f.read()

open_chars = 0
saved_content = []
found_jsons = []

for i in content.splitlines():
    open_chars += i.count('{')
    if open_chars:
        saved_content.append(i)
    open_chars -= i.count('}')
    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []

for i in found_jsons:
    print(json.dumps(i, indent=4))
Output
{
    "results": [
        {
            "numberID": "7412",
            "filename": "xxxx"
        },
        {
            "numberID": "E52",
            "filename": "xgjhh"
        }
    ],
    "name": "1111"
}
{
    "results": [
        {
            "numberID": "478962",
            "filename": "hhhhh"
        },
        {
            "number": "12544",
            "filename": "jkhgfc"
        }
    ],
    "name": "jfkjgjkf"
}