Parse JSON structures in a txt file containing JSON and text structures - python

I have a txt file with JSON structures. The problem is that the file contains not only JSON structures but also raw text such as log lines:
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "1111",
"results": [{
"filename": "xxxx",
"numberID": "7412"
}, {
"filename": "xgjhh",
"numberID": "E52"
}]
}
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
"filename": "hhhhh",
"numberID": "478962"
}, {
"filename": "jkhgfc",
"number": "12544"
}]
}
I can read the .txt file, but when I try to parse the JSON structures I get an error:
IN :
import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    json_data = json.load(f)
OUT : json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
I would like to parse the JSON and save it as a CSV file.

A more general solution to parsing a file with JSON objects mixed with other content without any assumption of the non-JSON content would be to split the file content into fragments by the curly brackets, start with the first fragment that is an opening curly bracket, and then join the rest of fragments one by one until the joined string is parsable as JSON:
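import json
import re
with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    # split on braces, keeping each brace as its own fragment
    fragments = iter(re.split('([{}])', f.read()))
while True:
    try:
        # skip ahead to the next opening brace
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        # keep appending fragments until the candidate parses as JSON
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        # no fragments left
        break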
This outputs:
{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}
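As a possible alternative to joining fragments until they parse, the standard library's json.JSONDecoder.raw_decode can parse one value starting at a given offset and report where it stopped. The following is only a sketch under that approach; the helper name iter_json_objects and the shortened sample text are assumptions made for illustration, not part of the answer above.

```python
import json

def iter_json_objects(text):
    """Yield every top-level JSON object found in text, skipping other content."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode returns the parsed value and the index just past it
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON starting at this brace; try the next brace.
            pos = text.find("{", pos + 1)
            continue
        yield obj
        pos = text.find("{", end)

# Shortened, hypothetical sample in the same shape as the question's file
sample = (
    "2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|\n"
    '{\n"name": "1111",\n"results": [{"filename": "xxxx", "numberID": "7412"}]\n}\n'
    "2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List :\n"
    '{"name": "jfkjgjkf", "results": []}\n'
)
objects = list(iter_json_objects(sample))
print(objects)
```

Because the decoder tracks string quoting itself, this variant should not be confused by braces that appear inside JSON string values, although a stray { in a log line still costs one failed parse attempt.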

This solution strips out the non-JSON lines and wraps the remaining JSON structures in a containing JSON structure:
import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-' for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')
wrapped = '{"entries":[' + ''.join([entry + ', ' for entry in cleaned if entry])[:-2] + ']}'
json_data = json.loads(wrapped)
The wrapped string:
{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}
What's going on here?
In the cleaned = ... line, we're using a list comprehension that creates a list of the lines in the file (f.readlines()) that do not contain the string |INFO|, strips each one, and substitutes the string -split_here- whenever there's a blank line (where .strip() yields '').
Then, we're converting that list of lines into a single string (''.join()).
Finally, we're splitting that string (.split('-split_here-')) into a list of strings, one per JSON structure. This relies on the JSON structures being separated by blank lines in data.txt.
In the wrapped = ... line, we're appending a ', ' to each of the non-empty JSON structures using a list comprehension.
Then, we join that list back into a single string and strip off the trailing ', ' ([:-2] slices the last two characters off the string).
We then wrap the string with '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and feed it to json.loads to load your data as a Python object.

You could do one of several things:
On the command line, remove all lines where, say, "|INFO|Technical" appears (assuming this appears in every line of raw text):
sed -i '' -e '/|INFO|Technical/d' yourfilename (if on Mac),
sed -i '/|INFO|Technical/d' yourfilename (if on Linux).
Move these raw lines into their own JSON fields

Use the "text structures" as a delimiter between JSON objects.
Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point parse the lines you've saved as a JSON object.
import re
import json

def is_text(line):
    # returns a match (truthy) if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|') # you said some lines start with a leading |, remove it
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []
with open("data.txt") as f:
    json_lines = []
    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there's multiple text lines in a row json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []
    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))
print(json_objects)
Repeating the logic in the last two lines is a bit ugly, but you need to handle the case where the last line in your file is not a text line, so when you're done with the for loop you need to parse the last object sitting in json_lines if there is one.
I'm assuming there's never more than one JSON object between text lines, and also my regex for a date will break in 8,000 years.
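One way to avoid repeating the flush logic might be itertools.groupby, which batches consecutive lines by whether they look like log text. This is a sketch of that idea rather than the answer's own method; parse_mixed and the shortened sample lines are hypothetical, and it keeps the same assumption of at most one JSON object between text lines.

```python
import itertools
import json
import re

TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def is_text(line):
    # True for log lines, which start with "YYYY-MM-DD HH:MM:SS"
    return TIMESTAMP.match(line.lstrip("|")) is not None

def parse_mixed(lines):
    """Parse each run of consecutive non-log lines as one JSON object."""
    objects = []
    for text_run, group in itertools.groupby(lines, key=is_text):
        if text_run:
            continue  # a run of log lines only separates the JSON blocks
        chunk = "".join(group)
        if chunk.strip():  # ignore runs that contain only blank lines
            objects.append(json.loads(chunk))
    return objects

# Shortened, hypothetical sample; with a real file, pass the open file object
sample_lines = [
    "2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|\n",
    "{\n", '"name": "1111",\n', '"results": []\n', "}\n",
    "2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List :\n",
    '{"name": "jfkjgjkf", "results": []}\n',
]
json_objects = parse_mixed(sample_lines)
print(json_objects)
```

groupby ends the final run when the input is exhausted, so a file that does not end in a text line needs no special case.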

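You could count curly brackets in your file to find the beginning and end of each JSON object, and store the parsed objects in a list, here found_jsons.
import json
# content holds the full text of the file
with open("data.txt", "r", encoding="utf-8", errors="ignore") as f:
    content = f.read()
open_chars = 0
saved_content = []
found_jsons = []
for i in content.splitlines():
    open_chars += i.count('{')
    # we are inside an object as long as there are unclosed brackets
    if open_chars:
        saved_content.append(i)
    open_chars -= i.count('}')
    # all brackets closed: the saved lines form one complete object
    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []
for i in found_jsons:
    print(json.dumps(i, indent=4))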
Output
{
    "results": [
        {
            "numberID": "7412",
            "filename": "xxxx"
        },
        {
            "numberID": "E52",
            "filename": "xgjhh"
        }
    ],
    "name": "1111"
}
{
    "results": [
        {
            "numberID": "478962",
            "filename": "hhhhh"
        },
        {
            "number": "12544",
            "filename": "jkhgfc"
        }
    ],
    "name": "jfkjgjkf"
}
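The question also asks for the parsed data to be saved as CSV. A minimal sketch of that last step follows, assuming one CSV row per entry in "results" with the parent "name" repeated on each row; that column layout is a guess, since the question does not specify one. The list found_jsons is typed in by hand here from the sample data.

```python
import csv
import io

# Parsed objects, as any of the approaches above would produce them
found_jsons = [
    {"name": "1111", "results": [
        {"filename": "xxxx", "numberID": "7412"},
        {"filename": "xgjhh", "numberID": "E52"}]},
    {"name": "jfkjgjkf", "results": [
        {"filename": "hhhhh", "numberID": "478962"},
        {"filename": "jkhgfc", "number": "12544"}]},
]

# One row per entry in "results", carrying the parent "name" along
rows = [
    {"name": obj["name"], **result}
    for obj in found_jsons
    for result in obj.get("results", [])
]

# Collect columns in first-seen order, since keys differ between entries
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# io.StringIO keeps the demo in memory; to write a real file, use
# open("output.csv", "w", newline="") instead (the filename is hypothetical)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

restval="" leaves a cell empty when a row lacks a column, which covers the sample entry that has "number" instead of "numberID".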

Related

Load from a text file containing multiple JSONs into Python

I have a text file temp.txt of the sort --
{
"names" : [ {"index" : 0, "cards": "\n\nbingo" ...} ]
"more stuff": ...
}
{
"names" : [ {"index" : 0, "cards": "\nfalse" ...} ]
"more stuff": ...
}
.
.
Here's how I am trying to load it --
def read_op(filename):
    with open("temp.txt", 'r') as file:
        for line in file:
            print (json.load(line))
    return lines
But this throws the error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1
I'm not sure what I am doing wrong here. Are there alternatives to reading this file another way?
Reading it line-by-line will not work because every line is not a valid JSON object by itself.
You should pre-process the data before loading it as a JSON, for example by doing the following:
Read the whole content
Add commas between every 2 objects
Add [] to contain the data
Load with json.loads
import re
import json
with open(r'test.txt', 'r') as fp:
    data = fp.read()
    concat_data = re.sub(r"\}\n\{", "},{", data)
    json_data_as_str = f"[{concat_data}]"
    json_data = json.loads(json_data_as_str)
    print(json_data)

Reading a json file that has multiple lines

I have a function that I apply to a json file. It works if it looks like this:
import json
def myfunction(dictionary):
    #does things
    return new_dictionary
data = """{
"_id": {
"$oid": "5e7511c45cb29ef48b8cfcff"
},
"description": "some text",
"startDate": {
"$date": "5e7511c45cb29ef48b8cfcff"
},
"completionDate": {
"$date": "2021-01-05T14:59:58.046Z"
},
"videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]
}"""
info = json.loads(data)
refined = key_replacer(info)
new_data = json.dumps(refined)
print(new_data)
However, I need to apply it to a whole file, and the input looks like this (there are multiple elements and they are not separated by commas; they come one after another):
{"_id":{"$oid":"5f06cb272cfede51800b6b53"},"company":{"$oid":"5cdac819b6d0092cd6fb69d3"},"name":"SomeName","videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]}
{"_id":{"$oid":"5ddb781fb4a9862c5fbd298c"},"company":{"$oid":"5d22cf72262f0301ecacd706"},"name":"SomeName2","videos":[{"$oid":"5dd3f09727658a1b9b4fb5fd"},{"$oid":"5d78b5a536e59001a4357f4c"},{"$oid":"5de0b85e129ef7026f27ad47"}]}
How could I do this? I tried opening and reading the file, using load and dump instead of loads and dumps, and it still doesn't work. Do I need to read, or iterate over every line?
You are dealing with the NDJSON (newline-delimited JSON) data format.
You have to read the whole data string, split it by lines and parse each line as a JSON object, resulting in a list of dicts:
import json
def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]
with open('C:\\Users\\test.json', 'r', encoding="utf8") as handle:
    data = handle.read()
    dicts = parse_ndjson(data)
for d in dicts:
    new_d = my_function(d)
    print("New dict", new_d)
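For large files it may be preferable to parse line by line instead of reading everything into memory first. The sketch below assumes one object per line as in the question; iter_ndjson is a hypothetical helper name, and io.StringIO stands in for the real file so the example is self-contained.

```python
import io
import json

def iter_ndjson(handle):
    """Yield one parsed object per non-empty line of an open text file."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # report which line was malformed instead of failing silently
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc

# io.StringIO stands in for open(path, encoding="utf8")
handle = io.StringIO(
    '{"_id":{"$oid":"5f06cb272cfede51800b6b53"},"name":"SomeName"}\n'
    '\n'
    '{"_id":{"$oid":"5ddb781fb4a9862c5fbd298c"},"name":"SomeName2"}\n'
)
names = [d["name"] for d in iter_ndjson(handle)]
print(names)
```

Each yielded dict can be passed straight to the function from the question, one at a time.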

Regular Expression to remove selective string

I want to remove a particular substring from within a JSON string.
For example, my JSON string is:
{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\", \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\", \"\": {\"updated_at\": \"2020-09-21T10:17:48.307874Z\", \"updated_by\": \"Def Ghi<def_ghi#uuvvww.com>\"}}
}]
}
I want to remove: \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}
Expected output :
{"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{\"name\": \"state\", \"dispN\": \"c_d_test\"}, {\"name\": \"stClu\", \"dNme\": \"tab(s) Updatedd\"}
}]
}
I tried the pattern ( \\"\\": {\\"updated_\w+)(.*)(>\\")
and used this in my code:
import re
line = re.sub(r"updated_\w+(.*)(.com>)", '', json_str)
But it also matches everything in between, because there are two occurrences of \"\": {\"updated_at and \"updated_by,
and it leaves the characters "": {""} behind.
How can I completely remove \"\": {\"updated_at\": \"2020-09-16T06:33:07.684504Z\", \"updated_by\": \"Abc_xyz<abc_xyz#uuvvww.com>\"}?
Try this:
\{\"updated_at[^{]+\}
This matches from the relevant opening { to the relevant closing } by allowing any character except { to occur one or more times in between.
With a Python JSON string I was able to remove those unwanted fields as shown below.
This completely removes the unwanted empty key and replaces it with }, so that the JSON stays well-formed.
regex as \,\s\\\"\\\":\s\{\\\"updated_at[^{]+\}[^\]]
json_str = str({"tableName":"avzConf","rows":[{"Comp":"mster","Conf": "[{"name": "state", "dispN": "c_d_test", "": {"updated_at": "2020-09-16T06:33:07.684504Z", "updated_by": "Abc_xyzabc_xyz#uuvvww.com"}}, {"name": "stClu", "dNme": "tab(s) Updatedd", "": {"updated_at": "2020-09-21T10:17:48.307874Z", "updated_by": "Def Ghidef_ghi#uuvvww.com"}} }] })
import re
line = re.sub(r"\,\s\\\"\\\":\s\{\\\"updated_at[^{]+\}",'},', json_str)
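If the input is valid JSON, a regex-free route may be more robust: decode the outer document, decode the nested "Conf" string (which is itself JSON), drop the empty-named key, and encode both levels again. The sketch below uses a shortened, hypothetical sample modelled on the question's data, and it assumes the key to remove is always the empty string "".

```python
import json

# Hypothetical, shortened sample modelled on the question's data;
# "Conf" holds a JSON document encoded as a string inside the outer JSON
json_str = json.dumps({
    "tableName": "avzConf",
    "rows": [{
        "Comp": "mster",
        "Conf": json.dumps([
            {"name": "state", "dispN": "c_d_test",
             "": {"updated_at": "2020-09-16T06:33:07.684504Z", "updated_by": "Abc_xyz"}},
            {"name": "stClu", "dNme": "tab(s) Updatedd",
             "": {"updated_at": "2020-09-21T10:17:48.307874Z", "updated_by": "Def Ghi"}},
        ]),
    }],
})

doc = json.loads(json_str)
for row in doc["rows"]:
    conf = json.loads(row["Conf"])   # decode the nested JSON string
    for item in conf:
        item.pop("", None)           # drop the empty-named key if present
    row["Conf"] = json.dumps(conf)   # re-encode the nested list as a string

cleaned_str = json.dumps(doc)
print(cleaned_str)
```

This does not depend on spacing, key order, or how many times the unwanted block occurs, which are the points where the regex attempts run into trouble.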

How to read JSON objects from Tweet.py results

I am trying to read the JSON file created by Tweet.py. However, whatever I try, I consistently get a ValueError.
ValueError: Expecting property name: line 1 column 3 (char 2)
JSON results are in the format of:
{ 'Twitter Data' : [ {
"contributors": null,
"coordinates": null,
"created_at": "Tue Oct 24 15:55:21 +0000 2017",
"entities": {
"hashtags": ["#football"]
}
} , {
"contributors": johnny,
"coordinates": null,
"created_at": "Tue Oct 24 15:55:21 +0000 2017",
"entities": {
"hashtags": ["#football" , "#FCB"]
}
} , ... ] }
There are at least 50 of these JSON objects in the file, which are separated by commas.
My Python script to read this json file is:
twitter_data=[]
with open('#account.json' , 'r') as json_data:
    for line in json_data:
        twitter_data.append(json.loads(line))
print twitter_data
Tweet.py writes these Json objects by using:
json.dump(status._json,file,sort_keys = True,indent = 4)
I would appreciate any help and guidance on how to read this file!
Thank you.
The { 'Twitter Data' bit should be { "Twitter Data", and johnny should be "johnny".
That is to say, keys and string values must be enclosed in double quotes.
with open("#account.json","r") as json_data:
    data = json_data.read()
    twitter_data.append(json.loads(data))
Also, I haven't used this myself, but it might be of help as well: https://jsonlint.com
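The double-quote rule is easy to check in isolation. The short sketch below feeds json.loads a single-quoted key and then a corrected version; both strings are made-up fragments in the shape of the question's data.

```python
import json

# Made-up fragments shaped like the question's data
bad = "{ 'Twitter Data' : [ { \"contributors\": null } ] }"
good = '{ "Twitter Data" : [ { "contributors": "johnny" } ] }'

try:
    json.loads(bad)
    bad_parsed = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    bad_parsed = False

data = json.loads(good)
print(bad_parsed, data["Twitter Data"][0]["contributors"])
```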
First off, as both @Rob and @silent have noted, 'Twitter Data' should be "Twitter Data". JSON needs double quotes, not single quotes, to delimit a string.
Secondly, json.load() expects a file object, so when calling json.load(), just pass in json_data and it will read the whole JSON file into memory:
with open('#account.json' , 'r') as json_data:
    contents = json.load(json_data)
EDIT:
for handling multiple objects at once:
def get_objs(f):
    content = f.read()
    # Get each object in the contents of the file object.
    # This is kinda clunky and inelegant, but it should work
    objs = ['{}{}'.format(i, '}') for i in content.split('},')]
    # Last json_obj probably got an unnecessary "}" at the end, so trim the
    # last character from it
    objs[-1] = objs[-1][0:-1]
    json_objs = [json.loads(i) for i in objs]
    return json_objs
and then just go:
with open('#account.json', 'r') as json_data:
    json_objs = get_objs(json_data)
Hopefully this will work for you. It did for me when I tested it on a similarly formatted JSON file.

In JSON output, force every opening curly brace to appear in a new separate line

With json.dumps(some_dict,indent=4,sort_keys=True) in my code:
I get something like this:
{
    "a": {
        "x":1,
        "y":2
    },
    "b": {
        "z":3,
        "w":4
    }
}
But I want something like this:
{
    "a":
    {
        "x":1,
        "y":2
    },
    "b":
    {
        "z":3,
        "w":4
    }
}
How can I force each opening curly brace to appear at the beginning of a new separate line?
Do I have to write my own JSON serializer, or is there a special argument that I can use when calling json.dumps?
You can use a regular expression replacement on the result.
better_json = re.sub(r'^((\s*)".*?":)\s*([\[{])', r'\1\n\2\3', json, flags=re.MULTILINE)
The first capture group matches everything up to the : after the property name, the second capture group matches the whitespace before the property name, and the third capture group captures the { or [ before the object or array. The whitespace is then copied after the newline, so that the indentation will match properly.
Building on Barmar's excellent answer, here's a more complete demo showing how you can convert and customize your JSON in Python:
import json
import re

# JSONifies dict and saves it to file
def save(data, filename):
    with open(filename, "w") as write_file:
        write_file.write(jsonify(data))

# Converts Python dict to a JSON string. Indents with tabs and puts opening
# braces on their own line.
def jsonify(data):
    default_json = json.dumps(data, indent = '\t')
    better_json = re.sub(
        r'^((\s*)".*?":)\s*([\[{])',
        r'\1\n\2\3',
        default_json,
        flags=re.MULTILINE
    )
    return better_json

# Sample data for demo
data = {
    "president":
    {
        "name": "Zaphod Beeblebrox",
        "species": "Betelgeusian"
    }
}
filename = 'test.json'

# Demo
print("Here's your pretty JSON:")
print(jsonify(data))
print()
print('Saving to file:', filename)
save(data, filename)
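Since the regex edits the serialized text after the fact, it seems worth confirming that the result is still valid JSON. The sketch below repeats a compact jsonify and checks that the output parses back to the original data; the "moons" key is added here only to exercise the array case.

```python
import json
import re

def jsonify(data):
    # Same substitution as above: move an opening { or [ that follows a
    # property name onto its own line, reusing the property's indentation
    default_json = json.dumps(data, indent="\t")
    return re.sub(
        r'^((\s*)".*?":)\s*([\[{])',
        r'\1\n\2\3',
        default_json,
        flags=re.MULTILINE,
    )

data = {
    "president": {"name": "Zaphod Beeblebrox", "species": "Betelgeusian"},
    "moons": [1, 2],  # extra sample key, not in the original demo
}
pretty = jsonify(data)
print(pretty)

# Only whitespace between tokens was changed, so the data should survive
round_trip = json.loads(pretty)
```

One caveat, stated as an assumption rather than a tested guarantee: a string value that itself contains text like ": { could be matched by the pattern and gain a line break, so a round-trip check like this is a cheap safeguard.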
