Parsing incomplete json array - python

I have downloaded 5MB of a very large json file. From this, I need to be able to load that 5MB to generate a preview of the json file. However, the file will probably be incomplete. Here's an example of what it may look like:
[{
"first": "bob",
"address": {
"street": 13301,
"zip": 1920
}
}, {
"first": "sarah",
"address": {
"street": 13301,
"zip": 1920
}
}, {"first" : "tom"
From here, I'd like to "rebuild it" so that it can parse the first two objects (and ignore the third).
Is there a json parser that can infer or cut off the end of the string to make it parsable? Or perhaps to 'stream' the parsing of the json array, so that when it fails on the last object, I can exit the loop? If not, how could the above be accomplished?

If your data will always look somewhat similar, you could do something like this:
import json
json_string = """[{
"first": "bob",
"address": {
"street": 13301,
"zip": 1920
}
}, {
"first": "sarah",
"address": {
"street": 13301,
"zip": 1920
}
}, {"first" : "tom"
"""
while True:
    if not json_string:
        raise ValueError("Couldn't fix JSON")
    try:
        data = json.loads(json_string + "]")
    except json.decoder.JSONDecodeError:
        json_string = json_string[:-1]
        continue
    break
print(data)
This assumes that the data is a list of dicts. Step by step, the last character is removed and a missing ] appended. If the new string can be interpreted as JSON, the infinite loop breaks. Otherwise the next character is removed and so on. If there are no characters left ValueError("Couldn't fix JSON") is raised.
For the above example, it prints:
[{'first': 'bob', 'address': {'zip': 1920, 'street': 13301}}, {'first': 'sarah', 'address': {'zip': 1920, 'street': 13301}}]
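The question also asks about streaming the array. A minimal sketch of that idea, using json.JSONDecoder.raw_decode from the standard library to decode one element at a time and stop at the first one that fails. The function name parse_complete_items is made up for illustration, and the sketch assumes the text holds a single top-level array:

```python
import json


def parse_complete_items(text):
    """Return the array elements that decode fully; stop at the first that fails."""
    decoder = json.JSONDecoder()
    items = []
    pos = text.index("[") + 1  # assumption: the text is one top-level JSON array
    while True:
        # Skip whitespace and the comma that separates two elements.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            break
        try:
            # raw_decode returns the decoded value and the index where it ended.
            item, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # truncated element: keep what was decoded so far
        items.append(item)
    return items


sample = '[{"first": "bob"}, {"first": "sarah"}, {"first" : "tom"'
print(parse_complete_items(sample))
```

Each element is decoded only once, so this avoids re-parsing the whole string for every removed character. One caveat: if the array holds bare numbers, a number cut off in the middle (13 out of 1345) still decodes, so this is only safe for arrays of objects or strings.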

For the specific structure in the example we can walk through the string and track occurrences of curly brackets and their closing counterparts. If at the end one or more curly brackets remain unmatched, we know that this indicates an incomplete object. We can then strip any intermediate characters such as commas or whitespace and close the resulting string with a square bracket.
This method ensures that the string is only parsed twice, one time manually and one time by the JSON parser, which might be advantageous for large text files (with incomplete objects consisting of many characters).
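# `string` holds the truncated JSON text.
# Note: curly brackets inside string values are not handled here.
brackets = []  # positions of '{' that have not been closed yet
for i, c in enumerate(string):
    if c == '{':
        brackets.append(i)
    elif c == '}':
        brackets.pop()
if brackets:
    # Cut at the outermost unmatched '{' and drop the trailing separator.
    string = string[:brackets[0]].rstrip(', \n')
if not string.endswith(']'):
    string += ']'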
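The same bracket-tracking idea can be wrapped in a function that also ignores curly brackets inside string values. This is only a sketch under the same assumption (a top-level array of objects); the name close_truncated_array is hypothetical:

```python
import json


def close_truncated_array(text):
    """Cut off an incomplete trailing object and close the array."""
    open_braces = []   # positions of '{' that are still unmatched
    in_string = False  # True while scanning inside a JSON string value
    escaped = False    # True right after a backslash inside a string
    for i, c in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == "{":
            open_braces.append(i)
        elif c == "}" and open_braces:
            open_braces.pop()
    if open_braces:
        # Drop everything from the outermost unmatched '{' onwards.
        text = text[:open_braces[0]]
    text = text.rstrip(", \t\r\n")
    if not text.endswith("]"):
        text += "]"
    return text


truncated = '[{"first": "bob"}, {"note": "a } in text"}, {"first" : "tom"\n'
repaired = close_truncated_array(truncated)
print(json.loads(repaired))
```

The string is scanned once by hand and once by json.loads, as described above.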

Related

How to Convert Text in Non-structured Format to JSON Format

Content of a Sample Input Text
{'key1':'value1','msg1':"content1"} //line 1
{'key2':'value2','msg2':"content2"} //line 2
{'key3':'value3','msg3':"content3"} //line 3
Also, pointing out some notable characteristics of the input text
Lacks a proper delimiter, currently each object {...} takes a new line "\n"
Contains single quotes, which can be an issue since JSON (the expected output) accepts only double quotes
Does not have the opening and closing curly brackets required by JSON
Expected Output JSON
{
{
"key1":"value1",
"msg1":"content1"
},
{
"key2":"value2",
"msg2":"content2"
},
{
"key3":"value3",
"msg3":"content3"
}
}
What I have tried, but failed
json.dumps(input_text), but it cannot identify "\n" as the "delimiter"
Appending a comma at the end of each object {...}, but encountered the issue of extra comma when it comes to the last object
If you have one dictionary per line, you can replace the newlines with commas and enclose the whole text in square brackets [ ] (you get a list of dictionaries).
You can use ast.literal_eval to import your file as a list of dictionaries.
Finally export it to json:
import json
import ast
with open("file.txt", "r") as f:
    dic_list = ast.literal_eval("[" + f.read().replace('\n',',') + "]")
print(json.dumps(dic_list, indent=4))
Output:
[
{
"key1": "value1",
"msg1": "content1"
},
{
"key2": "value2",
"msg2": "content2"
},
{
"key3": "value3",
"msg3": "content3"
}
]
Just use ast
import ast
with open('test.txt') as f:
    data = [ast.literal_eval(l.strip()) for l in f.readlines()]
print(data)
output
[{'key1': 'value1', 'msg1': 'content1'}, {'key2': 'value2', 'msg2': 'content2'}, {'key3': 'value3', 'msg3': 'content3'}]

How to convert string json to python float and back to number in json

Hi I'm trying to solve a problem. I have a json list of products, ex:
[{
"id": 5677240,
"name": "Cønjuntø de Pænelæs æntiæderentes ¢øm 05 Peçæs Pæris",
"quantity": 21,
"price": "192.84",
"category": "Panelas"
},
{
"id": 9628920,
"name": "Lava & Seca 10,2 Kg Sæmsung E¢ø ßußßle ßræn¢æ ¢øm 09 Prøgræmæs de Lævægem",
"quantity": 57,
"price": 3719.70,
"category": "Eletrodomésticos"
}]
But I basically need the "price" to be float like the second product. I have a large list of these products
(Ignore the weird characters I managed to fix it with help from a teacher.) I converted them to python object using this
import json
with open('br2.json', 'r', encoding='utf8') as json_data:
    data = json.load(json_data)
I've tried something like this but it doesn't work
for product in data:
    product["price"] = product["price"].replace(",", "")
I want to convert the values that are strings (the ones in "") to float.
thanks in advance sorry I'm new to python so I don't understand much
You can convert a string to float with float(). So instead of your replace line, try:
product['price'] = float(product['price'])
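A short sketch of the full round trip, assuming every "price" is either a number or a numeric string like "192.84". The inline sample is a trimmed stand-in for the contents of br2.json:

```python
import json

# Trimmed stand-in for the product list loaded from the file.
raw = '''[
    {"id": 5677240, "quantity": 21, "price": "192.84"},
    {"id": 9628920, "quantity": 57, "price": 3719.70}
]'''

data = json.loads(raw)
for product in data:
    # float() accepts both numeric strings and values that are already numbers.
    product["price"] = float(product["price"])

# Serialising again writes the prices as JSON numbers (no quotes).
result = json.dumps(data, ensure_ascii=False, indent=4)
print(result)
```

ensure_ascii=False keeps non-ASCII characters such as those in the product names readable in the output instead of escaping them.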

Converting nested JSON into Python dictionary

I'm receiving a string server side which I then convert to JSON:
127.0.0.1:8000/devices/f751/?json={ "DeviceId":"192-2993-2993", "Date":"1/4/2019 9:52:2", "Location":"-1.000000000,-1.000000000", "Key":"{XXXX-XXXX-XXXX}", "Data":" { \"Value0\":\"{ \"ReferenceValue\":\"Elevation\", \"Prediction\":\"22.216558464\"}\", \"Value1\":\"{ \"ReferenceValue\":\"Wind Speed\", \"Prediction\":\"42.216558464\"}\" } "}
After conversion using json.loads() I get the following output:
updatedRequest = json.loads(jsonRequest)
updatedRequest
{'DeviceId': '192-2993-2993',
'Date': '1/4/2019 9:52:2',
'Location': '-1.000000000,-1.000000000',
'Key': '{XXXX-XXXX-XXXX}',
'Data': '{ "Value0":"{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}", "Value1":"{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'}
So far so good, I can access the Data value via updatedRequest['Data'].
updatedRequest['Data']
'{ "Value0":"{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}", "Value1":"{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}" }'
My issue arises when attempting to convert this into a Python usable dictionary (e.g. updatedRequest['Data']['Value0']['ReferenceValue']). Because there is an unknown number of 'Value' keys, I'm uncertain as to what the best procedure would be to move this into workable data.
You have received a JSON document with a nested JSON document, itself containing further JSON documents, inside one another like a Matryoshka doll.
Unfortunately, you can only decode one level, because the next level is broken. There should be \ escapes in front of the " quote characters used for the 3rd level of JSON documents, just like the second level quotes were escaped when it was embedded in the top-level JSON document. Those are missing so no JSON parser can decode it anymore. The delimiters around JSON strings have been derailed by stray, unescaped " characters that were meant to be part of a JSON string value.
You need to either repair the client sending this data or discard these malformed values as an invalid request.
For completeness' sake, a valid document would look like this:
>>> import json
>>> v0 = '''{ "ReferenceValue":"Elevation", "Prediction":"22.216558464"}'''
>>> v1 = '''{ "ReferenceValue":"Wind Speed", "Prediction":"42.216558464"}'''
>>> data_value = json.dumps({'Value0': v0, 'Value1': v1})
>>> print(json.dumps({'Data': data_value, 'Date': '1/4/2019 9:52:2', 'DeviceId': '192-2993-2993', 'Key': '{XXXX-XXXX-XXXX}', 'Location': '-1.000000000,-1.000000000'}, indent=4))
{
    "Data": "{\"Value0\": \"{ \\\"ReferenceValue\\\":\\\"Elevation\\\", \\\"Prediction\\\":\\\"22.216558464\\\"}\", \"Value1\": \"{ \\\"ReferenceValue\\\":\\\"Wind Speed\\\", \\\"Prediction\\\":\\\"42.216558464\\\"}\"}",
    "Date": "1/4/2019 9:52:2",
    "DeviceId": "192-2993-2993",
    "Key": "{XXXX-XXXX-XXXX}",
    "Location": "-1.000000000,-1.000000000"
}
Note the \" and \\\" escapes in the Data value. On decoding, the string value for Data will have one level of escape sequences removed, forming " and \" sequences, where the " quotes are part of the JSON syntax and \" are part of the string values, which in turn can be decoded to " used in the innermost JSON document.
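Once the client sends correctly escaped data, the layers can be peeled off one json.loads() call at a time. A minimal sketch, assuming every value under "Data" is itself a JSON document; the request is built here with json.dumps so that the escaping is guaranteed to be valid:

```python
import json

# Build a correctly escaped three-level document (stand-in for the client's request).
v0 = json.dumps({"ReferenceValue": "Elevation", "Prediction": "22.216558464"})
v1 = json.dumps({"ReferenceValue": "Wind Speed", "Prediction": "42.216558464"})
request_text = json.dumps({
    "DeviceId": "192-2993-2993",
    "Data": json.dumps({"Value0": v0, "Value1": v1}),
})

# Level 1: the outer request.
request = json.loads(request_text)
# Level 2: the "Data" value is still a string holding JSON.
data = json.loads(request["Data"])
# Level 3: each ValueN is again a string holding JSON, however many there are.
request["Data"] = {key: json.loads(value) for key, value in data.items()}

print(request["Data"]["Value0"]["ReferenceValue"])
```

The dictionary comprehension handles any number of 'Value' keys, so their count does not need to be known in advance.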
It really depends what you want to do with the data. Note that updatedRequest['Data'] is still a string after the first json.loads(); once it has been decoded into a dictionary, you can loop through it with:
for k, v in updatedRequest['Data'].items():
    pass  # do some stuff
This will allow you to process the values without having to deal with the variable number of items in this dictionary. Hard to say what is best without knowing exactly what you wish to do though!

Iterating over JSON list in Python

I'm trying to iterate over a JSON list to print out all of the results of the following:
"examples": [
{
"text": "carry all of the blame"
},
{
"text": "she left all her money to him"
},
{
"text": "we all have different needs"
},
{
"text": "he slept all day"
},
{
"text": "all the people I met"
},
{
"text": "10% of all cars sold"
}
],
I've tried to iterate over it by doing:
iterator = 0
json_example = str(json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples'][iterator]['text']).capitalize()
for i in json_example:
    print(i)
    iterator += 1
But this is only printing each letter of the first example, as opposed to the entire example, followed by the other entire examples.
Can I iterate over these as I would like to, or do I need to create separate variables with each example?
Following your code and example, it looks like what you need is:
for example in json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples']:
    print(example["text"])
In your code, by doing json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples'][iterator]['text'] you were only accessing the iteratorth item, so, always the first one (iterator=0), and then iterating on the content of the "text" member.
Only index the json data out to 'examples':
json_example = json_data['results'][0]['lexicalEntries'][0]['entries'][0]['senses'][0]['examples']
then treat each element of 'examples' like a dictionary:
for dictionary in json_example:
    for key in dictionary:
        print(dictionary[key])
This will print out each value associated with the key 'text', like you want.
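For reference, a self-contained sketch of the same loop that can be run as is. The json_data literal is a cut-down stand-in for the real API response, keeping only the nesting used above, and .capitalize() is applied per example as in the question:

```python
# Cut-down stand-in for the parsed response (only the path used in the question).
json_data = {"results": [{"lexicalEntries": [{"entries": [{"senses": [{"examples": [
    {"text": "carry all of the blame"},
    {"text": "she left all her money to him"},
    {"text": "he slept all day"},
]}]}]}]}]}

# Index down to the list once, then iterate over its elements.
examples = json_data["results"][0]["lexicalEntries"][0]["entries"][0]["senses"][0]["examples"]
sentences = [example["text"].capitalize() for example in examples]

for sentence in sentences:
    print(sentence)
```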

How does json determine write/output order

Playing with json in Python's standard library and came up with this..
import json as j
cred = j.dumps({'Name': 'John Doe', 'Occupation': 'Programmer'},
               sort_keys = True,
               indent = 4,
               separators = (',', ': '))
_f = open('credentials', 'w')
_f.write(cred)
_f.close()
The output is below and all is fine..
{
"Name": "John Doe",
"Occupation": "Programmer"
}
However, I accidentally typed name in lowercase like this..
cred = j.dumps({'name': 'John Doe', 'Occupation': 'Programmer'},
               sort_keys = True,
               indent = 4,
               separators = (',', ': '))
and the result was this..
{
"Occupation": "Programmer",
"name": "John Doe"
}
How does json determine the write/output order of the values passed to it, what precedence does uppercase have over lowercase or vice versa and is there a way to preserve order?
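JSON objects do not have an order, so a program reading JSON shouldn't care about field order. Python dictionaries preserve insertion order since Python 3.7 (in older versions the order was arbitrary and could change at any time), and json.dumps writes the keys in the dictionary's order unless told otherwise. If the order itself must carry meaning in the JSON, you'll need to use an array instead of an object.
With sort_keys=True the keys are sorted before they are written. Strings are compared by code point, so every uppercase letter sorts before every lowercase one, which is why "Occupation" comes out before "name". The option is mainly there to make the output reproducible and more readable for humans.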
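A small sketch of how the key order comes out of json.dumps, assuming Python 3.7 or later, where dictionaries keep their insertion order. The case-insensitive variant is just one possible workaround, not something the json module offers as an option:

```python
import json

record = {"name": "John Doe", "Occupation": "Programmer"}

# sort_keys=True sorts by code point: uppercase "O" (79) comes before lowercase "n" (110).
sorted_output = json.dumps(record, sort_keys=True)

# Without sort_keys the keys are written in the dictionary's insertion order.
insertion_output = json.dumps(record)

# To ignore case, sort the items yourself and build a new dict before dumping.
by_lowercase = dict(sorted(record.items(), key=lambda item: item[0].lower()))
case_insensitive_output = json.dumps(by_lowercase)

print(sorted_output)
print(insertion_output)
print(case_insensitive_output)
```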
