Trying to load multiple json files and merge into one pandas dataframe - python

I am trying to load multiple json files from a directory in my Google Drive into one pandas dataframe.
I have tried quite a few solutions but nothing seems to be yielding a positive result.
This is what I have tried so far
path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index]=[participants,messages,active,threadtype,threadpath]
jsons_data
And this is the full traceback of error message I am receiving:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
1 for index, js in enumerate(json_files):
2 with open(os.path.join(path_to_json, js)) as json_file:
----> 3 json_text = json.load(json_file)
4 participants = json_text['participants']
5 messages = json_text['messages']
/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301
/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder
/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I have added a sample of the json files I am trying to read from
Link to Jsons
Example of jsons:
{
participants: [
{
name: "Test 1"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Person",
timestamp_ms: 1485467319139,
content: "Hie",
type: "Generic"
}
],
title: "Test 1",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/xyz"
}
#second example
{
participants: [
{
name: "Clearance"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Emmanuel Sibanda",
timestamp_ms: 1212242073308,
content: "Dear",
share: {
link: "http://www.example.com/"
},
type: "Share"
}
],
title: "Clearance",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/Clearance"
}

There were a few challenges in working with the JSON files you provided, and a few more in converting them to dataframes and merging them. First, the keys in the JSON files were not quoted strings. Second, the arrays in the resulting "valid" JSON were of different lengths and could not be converted to dataframes directly. Third, you did not specify the shape of the dataframe.
Nevertheless, this is an important problem: malformed JSON is commonplace, and although several SO answers show how to fix such JSON strings, every malformed JSON problem is a little different.
I've broken down the problem into the following parts:
convert malformed JSON's in the files to valid JSON's
flatten the dict's in the valid JSON files to prepare for dataframe conversion
create dataframes from the files and merge into one dataframe
Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.
Part 1: convert malformed JSON's in the files to valid JSON's:
The two example JSON strings that you provided are not valid JSON, because the keys are not quoted strings. So when you open a file and try to parse its contents, an error appears.
with open('./Test/test.json') as f:
    data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
The only way I found to get around this problem was to:
rename all the JSON files to .txt files, so that their contents are handled as plain text
apply a regex to the text in each file to add quotes (" ") around the keys
rename each file back to .json
The above three steps are accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves the files in JSON format again.
import json
import os
import re
import pandas as pd

#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith("json"):
            new_filename = filename.split('.')[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames

#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test',filename)
        with open(filepath,'r+') as f:
            data = f.read()
            dict_converted = re.sub(r"(\w+):(.+)", r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
        #rename
        new_filename = filename[:-4] + '.json'
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)
    print("All files converted to valid JSON!")
    return json_validated_files
So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":
#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}
If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:
df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
In the next part, we fix this problem by flattening the dict in the JSON.
Part 2: Flatten dict for dataframe conversion:
As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:
#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)
    return d
This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:
#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}
When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame function(discussed later):
with open('./Test/test1.json') as f:
    data = json.load(f)
print(flatten(data))
#Output:
{'title': 'Clearance',
'is_still_participant': 'True',
'thread_type': 'Regular',
'thread_path': 'inbox/Clearance',
'participants_name': ['Clearance', 'Person'],
'messages_sender_name': 'Emmanuel Sibanda',
'messages_timestamp_ms': '1212242073308',
'messages_content': 'Dear',
'messages_type': 'Share',
'messages_share_link': 'http://www.example.com/'}
Part 3: Create Dataframes and merge them into one:
So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:
open the JSON files one by one, load each JSON as a dict in memory using json.load().
call the flatten function on each dict
convert the flattened dicts to dataframes
append all dataframes to an empty list.
merge all dataframes with pd.concat() passing the list of dataframes as an argument.
The code to accomplish these tasks:
#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list,sort=False, ignore_index=True)
    return merged_df
Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3, to get a merged dataframe.
#rename invalid JSON files to text
files = rename_to_text_files()
#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)
#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)
The final Dataframe:
       title  is_still_participant  thread_type      thread_path  \
0     Test 1                  True      Regular        inbox/xyz
1     Test 1                  True      Regular        inbox/xyz
2  Clearance                  True      Regular  inbox/Clearance
3  Clearance                  True      Regular  inbox/Clearance

   participants_name  messages_sender_name  messages_timestamp_ms  \
0             Test 1                Person          1485467319139
1             Person                Person          1485467319139
2          Clearance      Emmanuel Sibanda          1212242073308
3             Person      Emmanuel Sibanda          1212242073308

   messages_content  messages_type      messages_share_link
0               Hie        Generic                      NaN
1               Hie        Generic                      NaN
2              Dear          Share  http://www.example.com/
3              Dear          Share  http://www.example.com/
You can change the order of columns as you like.
Note:
The code does not have exception handling and assumes the keys will be the same for all dicts, as shown in your examples.
The shape and columns of the dataframes have also been assumed.
You may add all the functions into one Python script; wherever "./Test" is used for the JSON folder path, enter your own path. The folder should only contain malformed JSON files to begin with.
The whole script can be further modularized by putting the functions into a class.
It can also be further optimized with the use of hashable data types like tuples and sped up with threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.
Some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.
The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.
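One limitation of the flatten function in Part 2 is that it writes every message into the same keys, so a thread with several messages keeps only the last one. Below is a minimal sketch of one way around this. It assumes the key names from the two example files and that one row per message is the desired shape (the question does not say); rows_from_thread is a hypothetical helper name, not part of the code above.

```python
def rows_from_thread(thread):
    """Return one flat dict per message in a parsed thread (hypothetical helper)."""
    participants = [p["name"] for p in thread.get("participants", [])]
    rows = []
    for message in thread.get("messages", []):
        rows.append({
            "title": thread.get("title"),
            "thread_type": thread.get("thread_type"),
            "thread_path": thread.get("thread_path"),
            "is_still_participant": thread.get("is_still_participant"),
            # Joined into one string so that every row stays scalar-valued.
            "participants": ", ".join(participants),
            "sender_name": message.get("sender_name"),
            "timestamp_ms": message.get("timestamp_ms"),
            "content": message.get("content"),
            "type": message.get("type"),
            # "share" is optional in the examples, so fall back to None.
            "share_link": message.get("share", {}).get("link"),
        })
    return rows


# Made-up thread with two messages, shaped like the example files.
thread = {
    "participants": [{"name": "Clearance"}, {"name": "Person"}],
    "messages": [
        {"sender_name": "Person", "timestamp_ms": 1485467319139,
         "content": "Hie", "type": "Generic"},
        {"sender_name": "Emmanuel Sibanda", "timestamp_ms": 1212242073308,
         "content": "Dear", "share": {"link": "http://www.example.com/"},
         "type": "Share"},
    ],
    "title": "Clearance",
    "is_still_participant": True,
    "thread_type": "Regular",
    "thread_path": "inbox/Clearance",
}

rows = rows_from_thread(thread)
for row in rows:
    print(row["sender_name"], row["content"], row["share_link"])
```

The rows collected from all files can then be handed to pd.DataFrame(rows) in a single call, which avoids building and concatenating one dataframe per file.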

I've checked your JSON files and found the same problem in document1.json, document2.json and document3.json: the property names are not enclosed in double quotes.
For example, document1.json should be corrected as:
{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}
EDIT: you can use the following line to add double quotes to the keys of a JSON file:
re.sub("([^\s^\"]+):(.+)", '"\\1":\\2', s)
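To see that substitution at work without touching any files, here is a small sketch that applies the same pattern to a trimmed copy of one of the example documents and then parses the result. quote_keys is a hypothetical helper name. The pattern quotes only the first key on each line, so this assumes one key per line, as in the example files.

```python
import json
import re

# Same pattern as in the line above, written with raw strings.
KEY_PATTERN = re.compile(r'([^\s^"]+):(.+)')


def quote_keys(text):
    """Wrap the first unquoted key on each line in double quotes (hypothetical helper)."""
    return KEY_PATTERN.sub(r'"\1":\2', text)


malformed = """{
  participants: [
    {
      name: "Test 1"
    }
  ],
  title: "Test 1",
  is_still_participant: true,
  thread_path: "inbox/xyz"
}"""

fixed = quote_keys(malformed)
data = json.loads(fixed)  # raises json.JSONDecodeError if the text is still malformed
print(data["title"], data["participants"])
```

Compact input such as {a: 1, b: 2} would not survive this, because everything after the first colon on a line is left untouched.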

Related

Unable to successfully divide the JSON file using python in DataBricks

Hi, I am writing Databricks Python code that takes a huge JSON file and divides it into two parts: index 0 ("reporting_entity_name") through index 3 ("version") go into one file, and index 4 through the end goes into the other. It successfully divides the file when I start from index 1, but when I provide index 0 it fails and says
Datasource does not support writing empty or nested empty schemas. Please make sure the data schema has at least one or more column(s).
Here is the SAMPLE Data of large JSON file.
{
"reporting_entity_name": "launcher",
"reporting_entity_type": "launcher",
"last_updated_on": "2020-08-27",
"version": "1.0.0",
"in_network": [
{
"negotiation_arrangement": "ffs",
"name": "Boosters",
"billing_code_type": "CPT",
"billing_code_type_version": "2020",
"billing_code": "27447",
"description": "Boosters On Demand",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
0
],
"tin": {
"type": "ein",
"value": "11-1111111"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 123.45,
"expiration_date": "2022-01-01",
"billing_class": "organizational"
}
]
}
]
}
]
}
Here is the python code.
from pyspark.sql.functions import explode, col
import itertools
# Read the JSON file from Databricks storage
df_json = spark.read.option("multiline","true").json("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile.json")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
# Convert the dataframe to a dictionary
data = df_json.toPandas().to_dict()
# Split the data into two parts
d1 = dict(itertools.islice(data.items(), 1))
d2 = dict(itertools.islice(data.items(), 1, len(data.items())))
# Convert the first part of the data back to a dataframe
df1 = spark.createDataFrame([d1])
# Write the first part of the data to a JSON file in Databricks storage
df1.write.format("json").save("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile_detail.json")
# Convert the second part of the data back to a dataframe
df2 = spark.createDataFrame([d2])
# Write the second part of the data to a JSON file in Databricks storage
df2.write.format("json").save("/mnt/BigData_JSONFiles/SampleDatafilefrombigfile_header.json")
Here is the output of the two files. The detail file should contain only the "in_network" data, but it also has the index 0 data ("reporting_entity_name"), which shouldn't be in the detail file; it should be in the header file.
{
"in_network": [
{
"negotiation_arrangement": "ffs",
"name": "Boosters",
"billing_code_type": "CPT",
"billing_code_type_version": "2020",
"billing_code": "27447",
"description": "Boosters On Demand",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
0
],
"tin": {
"type": "ein",
"value": "11-1111111"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 123.45,
"expiration_date": "2022-01-01",
"billing_class": "organizational"
}
]
}
]
}
]
},"negotiation_arrangement":"ffs"}]}}
The output of the header file, which starts from index 1:
{"reporting_entity_type": "launcher",
"last_updated_on": "2020-08-27",
"version": "1.0.0"}
Please help me with this error.
Guidance on the code would be helpful.
Here is a screenshot of the large JSON file, which is an exact copy of the file attached above. I increased the cluster from 2 GB to 8 GB, but the error is the same. The dict inside in_network occurs 714 times in the file. Why does it fail on the big file if the structure is exactly the same?
I also changed the code line from the answer to this:
df_network=df_json.select(df_json.columns[714:])
Here is the TraceBack
AnalysisException Traceback (most recent call last)
<command-863551447189973> in <cell line: 13>()
11 df_version=df_json.select(df_json.columns[:1])
12
---> 13 df_network.write.format("json").save("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2_detail.json")
14 df_version.write.format("json").save("/mnt/BigData_JSONFiles/2022-10_040_05C0_in-network-rates_2_of_2_header.json")
15 display(df_network)
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
46 start = time.perf_counter()
47 try:
---> 48 res = func(*args, **kwargs)
49 logger.log_success(
50 module_name, class_name, function_name, time.perf_counter() - start, signature
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
966 self._jwrite.save()
967 else:
--> 968 self._jwrite.save(path)
969
970 #since(1.4)
/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
1319
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1323
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
200 # Hide where the exception came from that shows a non-Pythonic
201 # JVM exception message.
--> 202 raise converted from None
203 else:
204 raise
AnalysisException:
Datasource does not support writing empty or nested empty schemas.
Please make sure the data schema has at least one or more column(s).
I have reproduced your code and got the results below for one file.
{"last_updated_on":{"0":"2020-08-27"},"reporting_entity_name":{"0":"launcher"},"reporting_entity_type":{"0":"launcher"},"version":{"0":"1.0.0"}}
The inner "0" key is probably due to the round trip through pandas: to_dict() maps each column to an {index: value} dictionary.
As your JSON files have the same structure, you can try the workaround below, which divides the JSON with select rather than converting it to a dictionary.
This is the original dataframe from the JSON file.
So, use select to generate the required JSON files.
df_network=df_json.select(df_json.columns[:1])
df_version=df_json.select(df_json.columns[1:])
display(df_network)
display(df_version)
Dataframes:
Result after writing to JSON files:
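One possible reason the columns[714:] variant from the question fails: the dataframe has only the five top-level fields as columns, so a slice that starts at 714 selects nothing, and an empty selection has an empty schema. The plain-Python sketch below illustrates the slicing and shows a split by key name instead of by position. It assumes the columns come out in alphabetical order (as the reproduced output above suggests) and works on a trimmed, in-memory copy of the sample; it is not Spark code.

```python
import json

# Trimmed copy of the sample document from the question.
sample = json.loads("""{
  "reporting_entity_name": "launcher",
  "reporting_entity_type": "launcher",
  "last_updated_on": "2020-08-27",
  "version": "1.0.0",
  "in_network": [{"negotiation_arrangement": "ffs", "name": "Boosters"}]
}""")

# Assumption: column order is alphabetical, so "in_network" comes first.
columns = sorted(sample)
print(columns[:1])     # the single detail column
print(columns[714:])   # empty: there are only five columns to slice

# Splitting by name does not depend on column order or count.
DETAIL_KEY = "in_network"
detail = {DETAIL_KEY: sample[DETAIL_KEY]}
header = {key: value for key, value in sample.items() if key != DETAIL_KEY}

print(json.dumps(header))
print(json.dumps(detail))
```

In Spark, the analogous selection by name would be df_json.select("in_network") for the detail part and df_json.drop("in_network") for the header part.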

Inconsistent error: json.decoder.JSONDecodeError: Extra data: line 30 column 2 (char 590)

I have .json documents generated by the same code, in which multiple nested dicts are dumped to each document. When loading with json.load(opened_json), I get an error like json.decoder.JSONDecodeError: Extra data: line 30 column 2 (char 590) for some of the files but not for others, and I don't understand why. What is the proper way to dump multiple (possibly nested) dicts into JSON documents, and in my current case, how can I read them all? (Extra: dicts can span multiple lines, so splitting on lines probably does not work.)
Ex: Say I call json.dump(data, file) with data = {'meta_data':{some_data}, 'real_data':{more_data}}.
Let us take these two fake files:
{
"meta_data": {
"id": 0,
"start": 1238397024.0,
"end": 1238397056.0,
"best": []
},
"real_data": {
"YAS": {
"t1": [
1238397047.2182617
],
"v1": [
5.0438767766574255
],
"v2": [
4.371670270544587
]
}
}
}
and
{
"meta_data": {
"id": 0,
"start": 1238397056.0,
"end": 1238397088.0,
"best": []
},
"real_data": {
"XAS": {
"t1": [
1238397047.2182617
],
"v1": [
5.0438767766574255
],
"v2": [
4.371670270544587
]
}
}
}
and try to load them using json.load(open(file_path)) to reproduce the problem.
You chose not to offer a
reprex.
Here is the code I'm running
which is intended to represent what you're running.
If there is some discrepancy, update the original
question to clarify the details.
import json
from io import StringIO
some_data = dict(a=1)
more_data = dict(b=2)
data = {"meta_data": some_data, "real_data": more_data}
file = StringIO()
json.dump(data, file)
file.seek(0)
d = json.load(file)
print(json.dumps(d, indent=4))
output
{
"meta_data": {
"a": 1
},
"real_data": {
"b": 2
}
}
As is apparent, under the circumstances you have
described the JSON library does exactly what we
would expect of it.
EDIT
Your screenshot makes it pretty clear
that a bunch of ASCII NUL characters are appended
to the 1st file.
We can easily reproduce that JSONDecodeError: Extra data
symptom by adding a single line:
json.dump(data, file)
file.write(chr(0))
(Or perhaps chr(0) * 80 more closely matches the truncated screenshot.)
If your file ends with extraneous characters, such as NUL,
then it will no longer be valid JSON and compliant
parsers will report a diagnostic message when they
attempt to read it.
And there's nothing special about NUL, as a simple
file.write("X") suffices to produce that same
diagnostic.
You will need to trim those NULs from the file's end
before attempting to parse it.
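A minimal sketch of that trimming step, assuming the stray characters are NULs (and possibly whitespace) at the very end of the text; loads_ignoring_trailing_nuls is a hypothetical helper name, and the data is made up to mirror the reproduction above:

```python
import json


def loads_ignoring_trailing_nuls(text):
    """Parse JSON text after stripping NULs and whitespace from its end."""
    return json.loads(text.rstrip("\x00 \t\r\n"))


# Reproduce the symptom: valid JSON followed by a run of NUL characters.
raw = json.dumps({"meta_data": {"a": 1}, "real_data": {"b": 2}}) + "\x00" * 80

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    print("plain json.loads failed:", err.msg)

data = loads_ignoring_trailing_nuls(raw)
print(data)
```

This only hides the symptom. If the NULs come from the code that writes the files, fixing the writer is the more durable route.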
For best results, use UTF8 unicode encoding with no
BOM.
Your editor should have settings for
switching to utf8.
Use $ file foo.json to verify encoding details,
and $ iconv --to-code=UTF-8 < foo.json
to alter an unfortunate encoding.
You need to read the file; either of these works.
data = json.loads(open("data.json").read())
or
with open("data.json", "r") as file:
    data = json.load(file)

Reading a json file that has multiple lines

I have a function that I apply to a json file. It works if it looks like this:
import json
def myfunction(dictionary):
    #does things
    return new_dictionary
data = """{
"_id": {
"$oid": "5e7511c45cb29ef48b8cfcff"
},
"description": "some text",
"startDate": {
"$date": "5e7511c45cb29ef48b8cfcff"
},
"completionDate": {
"$date": "2021-01-05T14:59:58.046Z"
},
"videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]
}"""
info = json.loads(data)
refined = myfunction(info)
new_data = json.dumps(refined)
print(new_data)
However, I need to apply it to a whole file, and the input looks like this (there are multiple elements; they are not separated by commas but simply follow one another):
{"_id":{"$oid":"5f06cb272cfede51800b6b53"},"company":{"$oid":"5cdac819b6d0092cd6fb69d3"},"name":"SomeName","videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]}
{"_id":{"$oid":"5ddb781fb4a9862c5fbd298c"},"company":{"$oid":"5d22cf72262f0301ecacd706"},"name":"SomeName2","videos":[{"$oid":"5dd3f09727658a1b9b4fb5fd"},{"$oid":"5d78b5a536e59001a4357f4c"},{"$oid":"5de0b85e129ef7026f27ad47"}]}
How could I do this? I tried opening and reading the file, using load and dump instead of loads and dumps, and it still doesn't work. Do I need to read, or iterate over every line?
You are dealing with the NDJSON (newline-delimited JSON) data format.
You have to read the whole data string, split it by lines and parse each line as a JSON object, resulting in a list of dicts:
def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\Users\\test.json', 'r', encoding="utf8") as handle:
    data = handle.read()
    dicts = parse_ndjson(data)

for d in dicts:
    new_d = myfunction(d)
    print("New dict", new_d)
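For larger files, a variation that avoids holding the whole text in memory is to iterate over the file object line by line. The sketch below assumes one JSON document per line and skips blank lines, which would otherwise make json.loads fail. iter_ndjson is a hypothetical helper name, and io.StringIO stands in for a real file handle.

```python
import io
import json


def iter_ndjson(lines):
    """Yield one parsed object per non-blank line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# io.StringIO stands in for open(path, encoding="utf8") in this sketch.
sample = io.StringIO(
    '{"_id": {"$oid": "5f06cb272cfede51800b6b53"}, "name": "SomeName"}\n'
    '\n'
    '{"_id": {"$oid": "5ddb781fb4a9862c5fbd298c"}, "name": "SomeName2"}\n'
)

records = list(iter_ndjson(sample))
for record in records:
    print(record["name"])
```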

Parse JSON structures in a txt file containing JSON and text structures

I have a txt file with json structures. the problem is the file does not only contain json structures but also raw text like log error:
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "1111",
"results": [{
"filename": "xxxx",
"numberID": "7412"
}, {
"filename": "xgjhh",
"numberID": "E52"
}]
}
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "jfkjgjkf",
"results": [{
"filename": "hhhhh",
"numberID": "478962"
}, {
"filename": "jkhgfc",
"number": "12544"
}]
}
I read the .txt file, but when trying to parse the JSON structures I get an error:
IN :
import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    json_data = json.load(f)
OUT : json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
I would like to parse the JSON and save it as a CSV file.
A more general solution to parsing a file with JSON objects mixed with other content without any assumption of the non-JSON content would be to split the file content into fragments by the curly brackets, start with the first fragment that is an opening curly bracket, and then join the rest of fragments one by one until the joined string is parsable as JSON:
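import json
import re

# read the file from the question and split it into fragments at every curly bracket
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    fragments = iter(re.split('([{}])', f.read()))

while True:
    try:
        # skip ahead to the next opening bracket
        while True:
            candidate = next(fragments)
            if candidate == '{':
                break
        # keep appending fragments until the candidate parses as JSON
        while True:
            candidate += next(fragments)
            try:
                print(json.loads(candidate))
                break
            except json.decoder.JSONDecodeError:
                pass
    except StopIteration:
        break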
This outputs:
{'name': '1111', 'results': [{'filename': 'xxxx', 'numberID': '7412'}, {'filename': 'xgjhh', 'numberID': 'E52'}]}
{'name': 'jfkjgjkf', 'results': [{'filename': 'hhhhh', 'numberID': '478962'}, {'filename': 'jkhgfc', 'number': '12544'}]}
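A variation on the same idea, sketched here as an alternative rather than as part of the answer above, lets the standard library's json.JSONDecoder.raw_decode find the end of each object, so no fragments have to be re-joined and re-parsed. extract_json_objects is a hypothetical helper name, and the log text is a shortened copy of the sample from the question.

```python
import json


def extract_json_objects(text):
    """Yield every JSON object that starts at a '{' in text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            # raw_decode parses one value starting at `start` and returns
            # the value together with the index just past its end.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not a valid object here, try the next '{'
            continue
        yield obj
        pos = end


log = """2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
2019-01-18 21:00:08.8740|INFO|Technical|Got Entities List from 20160101 00:00 :
{
"name": "1111",
"results": [{"filename": "xxxx", "numberID": "7412"}]
}
2019-01-18 21:00:05.4521|INFO|Technical|Batch Started|
{"name": "jfkjgjkf", "results": []}
"""

objects = list(extract_json_objects(log))
print(objects)
```

Because the decoder itself decides where an object ends, curly brackets inside string values do not confuse it.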
This solution strips out the non-JSON lines and wraps the remaining JSON structures in a containing JSON structure, which should do the job for you. It assumes that the JSON structures in data.txt are separated by blank lines:
import json
with open("data.txt", "r", encoding="utf-8", errors='ignore') as f:
    cleaned = ''.join([item.strip() if item.strip() != '' else '-split_here-' for item in f.readlines() if '|INFO|' not in item]).split('-split_here-')
wrapped = '{"entries":[' + ''.join([entry + ', ' for entry in cleaned])[:-2] + ']}'
json_data = json.loads(wrapped)
Output (the wrapped string):
{"entries":[{"name": "1111","results": [{"filename": "xxxx","numberID": "7412"}, {"filename": "xgjhh","numberID": "E52"}]}, {"name": "jfkjgjkf","results": [{"filename": "hhhhh","numberID": "478962"}, {"filename": "jkhgfc","number": "12544"}]}]}
What's going on here?
In the cleaned = ... line, we're using a list comprehension that creates a list of the lines in the file (f.readlines()) that do not contain the string |INFO| and adds the string -split_here- to the list whenever there's a blank line (where .strip() yields '').
Then, we're converting that list of lines (''.join()) into a string.
Finally we're converting that string (.split('-split_here-')) into a list of strings, one per JSON structure, using the blank lines in data.txt as separators.
In the wrapped = ... line, we're appending a ', ' to each of the JSON structures using a list comprehension.
Then, we convert that list back into a single string, stripping off the last ', ' (''.join(...)[:-2]; the [:-2] slices off the last two characters of the string).
We then wrap the string with '{"entries":[' and ']}' to make the whole thing a valid JSON structure, and feed it to json.loads to load your data as a Python object.
You could do one of several things:
On the Command Line, remove all lines where, say, "|INFO|Technical|" appears (assuming this appears in every line of raw text):
sed -i '' -e '/|INFO|Technical/d' yourfilename (if on Mac),
sed -i '/|INFO|Technical/d' yourfilename (if on Linux).
Move these raw lines into their own JSON fields
Use the "text structures" as a delimiter between JSON objects.
Iterate over the lines in the file, saving them to a buffer until you encounter a line that is a text line, at which point parse the lines you've saved as a JSON object.
import re
import json

def is_text(line):
    # returns True if line starts with a date and time in "YYYY-MM-DD HH:MM:SS" format
    line = line.lstrip('|') # you said some lines start with a leading |, remove it
    return re.match(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})", line)

json_objects = []

with open("data.txt") as f:
    json_lines = []

    for line in f:
        if not is_text(line):
            json_lines.append(line)
        else:
            # if there's multiple text lines in a row json_lines will be empty
            if json_lines:
                json_objects.append(json.loads("".join(json_lines)))
                json_lines = []

    # we still need to parse the remaining object in json_lines
    # if the file doesn't end in a text line
    if json_lines:
        json_objects.append(json.loads("".join(json_lines)))

print(json_objects)
Repeating logic in the last two lines is a bit ugly, but you need to handle the case where the last line in your file is not a text line, so when you're done with the for loop you need parse the last object sitting in json_lines if there is one.
I'm assuming there's never more than one JSON object between text lines and also my regex expression for a date will break in 8,000 years.
You could count curly brackets in your file to find beginning and ending of your jsons, and store them in list, here found_jsons.
import json

# content holds the text of the file (file name taken from the question)
with open("data.txt") as f:
    content = f.read()

open_chars = 0
saved_content = []
found_jsons = []
for i in content.splitlines():
    open_chars += i.count('{')
    if open_chars:
        saved_content.append(i)
    open_chars -= i.count('}')
    if open_chars == 0 and saved_content:
        found_jsons.append(json.loads('\n'.join(saved_content)))
        saved_content = []

for i in found_jsons:
    print(json.dumps(i, indent=4))
Output
{
"results": [
{
"numberID": "7412",
"filename": "xxxx"
},
{
"numberID": "E52",
"filename": "xgjhh"
}
],
"name": "1111"
}
{
"results": [
{
"numberID": "478962",
"filename": "hhhhh"
},
{
"number": "12544",
"filename": "jkhgfc"
}
],
"name": "jfkjgjkf"
}
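The question also asks for CSV output, which none of the approaches above covers. Here is a minimal sketch with the standard csv module, assuming one CSV row per entry in "results" with the parent "name" repeated on each row (the question does not specify a layout). rows_for_csv is a hypothetical helper name, and io.StringIO stands in for a real output file.

```python
import csv
import io


def rows_for_csv(objects):
    """Yield one flat row per entry in each object's "results" list."""
    for obj in objects:
        for result in obj.get("results", []):
            row = {"name": obj.get("name")}
            row.update(result)
            yield row


# The two objects parsed from the sample log.
found = [
    {"name": "1111", "results": [
        {"filename": "xxxx", "numberID": "7412"},
        {"filename": "xgjhh", "numberID": "E52"}]},
    {"name": "jfkjgjkf", "results": [
        {"filename": "hhhhh", "numberID": "478962"},
        {"filename": "jkhgfc", "number": "12544"}]},
]

rows = list(rows_for_csv(found))

# Collect column names in first-seen order; the sample mixes "numberID" and "number".
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stand-in for open("out.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```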

List Indices in json in Python

I've got a json file that I've pulled from a web service and am trying to parse it. I see that this question has been asked a whole bunch, and I've read whatever I could find, but the json data in each example appears to be very simplistic in nature. Likewise, the json example data in the python docs is very simple and does not reflect what I'm trying to work with. Here is what the json looks like:
{"RecordResponse": {
"Id": blah
"Status": {
"state": "complete",
"datetime": "2016-01-01 01:00"
},
"Results": {
"resultNumber": "500",
"Summary": [
{
"Type": "blah",
"Size": "10000000000",
"OtherStuff": {
"valueOne": "first",
"valueTwo": "second"
},
"fieldIWant": "value i want is here"
The code block in question is:
jsonFile = r'C:\Temp\results.json'
with open(jsonFile, 'w') as dataFile:
    json_obj = json.load(dataFile)
    for i in json_obj["Summary"]:
        print(i["fieldIWant"])
Not only am I not getting into the field I want, but I'm also getting a key error on trying to suss out "Summary".
I don't know how the indices work within the array; once I even get into the "Summary" field, do I have to issue an index manually to return the value from the field I need?
The example you posted is not valid JSON (there is no comma after the "Id" field, and its value blah is unquoted), so it's hard to dig in much. If it's straight from the web service, something's messed up. Note also that the file has to be opened for reading: mode 'w' truncates the file and does not allow reading from it. Once the JSON is valid, the "Summary" key is inside the "Results" object, which in turn is inside "RecordResponse", so you'd need to change your loop to
with open(jsonFile, 'r') as dataFile:
    json_obj = json.load(dataFile)
for i in json_obj["RecordResponse"]["Results"]["Summary"]:
    print(i["fieldIWant"])
If you don't know the structure at all, you could look through the resulting object recursively:
def findfieldsiwant(obj, keyname="Summary", fieldname="fieldIWant"):
    try:
        for key,val in obj.items():
            if key == keyname:
                return [ d[fieldname] for d in val ]
            else:
                # pass the names along so non-default arguments survive the recursion
                sub = findfieldsiwant(val, keyname, fieldname)
                if sub:
                    return sub
    except AttributeError: #obj is not a dict
        pass
    #keyname not found
    return None
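The function above only descends into dicts, so a key nested inside a list would be missed. Here is a sketch of a more general walk that handles both dicts and lists and yields every value stored under a given key. iter_values is a hypothetical helper name, and the record is a corrected, completed guess at the truncated sample from the question.

```python
def iter_values(obj, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from iter_values(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_values(item, key)


# Assumed shape: the sample in the question is cut off, so the closing
# brackets and the quoted "blah" values are guesses.
record = {
    "RecordResponse": {
        "Id": "blah",
        "Status": {"state": "complete", "datetime": "2016-01-01 01:00"},
        "Results": {
            "resultNumber": "500",
            "Summary": [
                {
                    "Type": "blah",
                    "Size": "10000000000",
                    "OtherStuff": {"valueOne": "first", "valueTwo": "second"},
                    "fieldIWant": "value i want is here",
                }
            ],
        },
    }
}

values = list(iter_values(record, "fieldIWant"))
print(values)
```

As for the indexing question: iterating over the "Summary" list visits each element in turn, so no manual index is needed unless only one particular element is wanted.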
