Parse JSON from unicode string as dictionaries - python

I have a JSON input that consists of a list of dictionaries as unicode characters:
Example:
input = u'[{
attributes: {
NAME: "Name_1ĂĂÎÎ",
TYPE: "Tip1",
LOC_JUD: "Bucharest",
LAT_LON: "234343/432545",
S70: "2342345",
MAP: "Map_one",
SCH: "1:5000",
SURSA: "PPP"
}
}, {
attributes: {
NAME: "NAME_2șțț",
TYPE: "Tip2",
LOC_JUD: "cea",
LAT_LON: "123/54645",
S70: "4324",
MAP: "Map_two",
SCH: "1:578000",
SURSA: "PPP"
}
}
]
'
How can I parse this string into a list of dictionaries? I tried to do this using:
import json
json_d = json.dumps(input)
print type(json_d) # string object / Not list of dicts
json_obj = json.loads(json_d) # unicode object / Not list of dicts
I cannot parse the contents of the JSON:
print json_obj[0]["attributes"]
TypeError: string indices must be integers
I am using Python 2.7.11. Thanks for any help!

Try a simplified example:
s = '[{attributes: { a: "foo", b: "bar" } }]'
The main problem is that your string is not valid JSON:
>>> json.loads(s)
[...]
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
If the input is generated by you, then fix it. If it comes from somewhere else, then you will need to edit it before loading it with the json module.
Note that with valid JSON, json.loads() works as expected:
>>> s = '[{"attributes": { "a": "foo", "b": "bar" } }]'
>>> json.loads(s)
[{'attributes': {'a': 'foo', 'b': 'bar'}}]
>>> type(json.loads(s))
list
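To check quickly whether a string is valid JSON, and to see why parsing fails when it is not, you can catch the exception that json.loads raises. This is only a minimal sketch: the helper name check_json is made up for illustration, and it assumes Python 3, where json.JSONDecodeError is a subclass of ValueError (on Python 2.7 a plain ValueError is raised, so the same except clause still applies).

```python
import json

def check_json(text):
    """Return (True, parsed_value) for valid JSON, else (False, error_message)."""
    try:
        return True, json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return False, str(exc)

# Unquoted keys: rejected by the json module
ok_bad, info_bad = check_json('[{attributes: {a: "foo"}}]')
# Same data with quoted keys: accepted
ok_good, info_good = check_json('[{"attributes": {"a": "foo"}}]')

print(ok_bad, info_bad)
print(ok_good, info_good)
```

Returning the error message instead of re-raising makes it easy to log which inputs need fixing upstream.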

As others have mentioned, your input data is not JSON. Ideally, that should be fixed upstream so that you do get valid JSON.
However, if that's out of your control you can convert that data to JSON.
The main problem is all those unquoted keys. We can fix that by using a regex to search for a valid name in the first field on each line. If a valid name is found we wrap it in double quotes.
import json
import re
source = u'''[{
attributes: {
NAME: "Name_1ĂĂÎÎ",
TYPE: "Tip1",
LOC_JUD: "Bucharest",
LAT_LON: "234343/432545",
S70: "2342345",
MAP: "Map_one",
SCH: "1:5000",
SURSA: "PPP"
}
}, {
attributes: {
NAME: "NAME_2șțț",
TYPE: "Tip2",
LOC_JUD: "cea",
LAT_LON: "123/54645",
S70: "4324",
MAP: "Map_two",
SCH: "1:578000",
SURSA: "PPP"
}
}
]
'''
# Split source into lines, then split lines into colon-separated fields
a = [s.strip().split(': ') for s in source.splitlines()]
# Wrap names in first field in double quotes
valid_name = re.compile(r'(^\w+$)')
for row in a:
    row[0] = valid_name.sub(r'"\1"', row[0])
# Recombine the data and load it
data = json.loads(' '.join([': '.join(row) for row in a]))
# Test
print data[0]["attributes"]
print '- ' * 30
print json.dumps(data, indent=4, ensure_ascii=False)
output
{u'LOC_JUD': u'Bucharest', u'NAME': u'Name_1\u0102\u0102\xce\xce', u'MAP': u'Map_one', u'SURSA': u'PPP', u'S70': u'2342345', u'TYPE': u'Tip1', u'LAT_LON': u'234343/432545', u'SCH': u'1:5000'}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[
{
"attributes": {
"LOC_JUD": "Bucharest",
"NAME": "Name_1ĂĂÎÎ",
"MAP": "Map_one",
"SURSA": "PPP",
"S70": "2342345",
"TYPE": "Tip1",
"LAT_LON": "234343/432545",
"SCH": "1:5000"
}
},
{
"attributes": {
"LOC_JUD": "cea",
"NAME": "NAME_2șțț",
"MAP": "Map_two",
"SURSA": "PPP",
"S70": "4324",
"TYPE": "Tip2",
"LAT_LON": "123/54645",
"SCH": "1:578000"
}
}
]
Note that this code is a little fragile. It works with data in the format shown in the question, but it won't work if there is more than one key-value pair on a line.
As I said earlier, the best way to fix this problem is upstream, where the non-JSON is being produced.
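If the line-based approach is too restrictive, a single regex over the whole text is one possible alternative. The sketch below is not the method used above: it quotes any bare identifier that directly follows { or , and is followed by a colon. The names UNQUOTED_KEY and quote_keys are made up for illustration, and it assumes Python 3. It is still fragile: a string value that itself contains something like ", word:" would be altered too.

```python
import json
import re

# A bare key: an identifier right after '{' or ',' (whitespace allowed)
# and right before a colon.
UNQUOTED_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')

def quote_keys(text):
    """Wrap bare object keys in double quotes (illustrative helper)."""
    return UNQUOTED_KEY.sub(r'\1"\2":', text)

# Several key-value pairs share a line here, which the line-based code cannot handle
source = '''[{attributes: {NAME: "Name_1", SCH: "1:5000"}},
{attributes: {NAME: "NAME_2", SCH: "1:578000"}}]'''

data = json.loads(quote_keys(source))
print(data[0]["attributes"]["SCH"])
print(data[1]["attributes"]["NAME"])
```

The colon inside "1:5000" is left alone because it is preceded by a digit inside a quoted string, not by an identifier that follows { or a comma.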

Related

Assign variable value in JSON payload format

I am using Python 3.6.3a. I would like to generate a payload for each of the JSON records. I am using the variable each to access the record. How can I insert the value of a variable (each in this case) into the payload? I tried {each} and other methods, but they didn't work.
Code snippet below:
json_records = [{"description":"<p>This is scenario1<\/p>","owner":"deb",
"priority":"high"},
{"description":"<p>This is scenario2<\/p>","owner":"deb",
"priority":"medium"}]
json_object = json.loads(json_records)
for each in json_object:
payload = """
{
"subject": "test",
"fieldValues": [
{each}
]
}
"""
There are two ways to approach this problem.
One way could be creating a dict() object and inserting keys as you wish, then json.dumps(object) to convert into string payload as in:
import json
json_records = [{"description":"This is scenario1</p>","owner":"deb","priority":"high"}
,{"description":"This is scenario2</p>","owner":"deb","priority":"medium"}]
for obj in json_records:
    payload = dict()
    payload['subject'] = 'test'
    # Note: each pass overwrites 'fieldName', so only the last pair survives
    for key, value in obj.items():
        payload['fieldName'] = {
            key: value
        }
    print(json.dumps(payload))
#{"subject": "test", "fieldName": {"priority": "high"}}
#{"subject": "test", "fieldName": {"priority": "medium"}}
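If the aim is to embed each whole record in the fieldValues list, as the payload in the question suggests, the same dict-then-json.dumps idea could look like the sketch below. The helper name build_payload is made up for illustration, and the choice to wrap the record in a one-element list is an assumption based on the question's template.

```python
import json

json_records = [
    {"description": "<p>This is scenario1</p>", "owner": "deb", "priority": "high"},
    {"description": "<p>This is scenario2</p>", "owner": "deb", "priority": "medium"},
]

def build_payload(record, subject="test"):
    """Return a JSON string that wraps one record in the payload envelope."""
    return json.dumps({"subject": subject, "fieldValues": [record]})

# One payload string per record
payloads = [build_payload(record) for record in json_records]
for payload in payloads:
    print(payload)
```

Because json.dumps does the serialization, quoting and escaping inside the record are handled for you and the result is always valid JSON.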
The second way is to build the payload as text with string formatting. However, if you need valid JSON at the end, this requires a validation step afterwards (something like trying json.loads(payload)), so I'd just use the first method. I would use the second method only if I had a specific requirement to generate the payload in a certain way.
import json
json_records = [{"description":"This is scenario1</p>","owner":"deb","priority":"high"}
,{"description":"This is scenario2</p>","owner":"deb","priority":"medium"}]
# json_object = json.loads(json_records)  # json.loads expects a JSON string (str or bytes); json_records is already a Python list here.
for obj in json_records:
    payload = """
{
"subject": "test",
"fieldValues": [
%s
]
}
""" % (obj["priority"])
    print(payload)
#{
# "subject": "test",
# "fieldValues": [
# high
# ]
# }
#
#
# {
# "subject": "test",
# "fieldValues": [
# medium
# ]
# }
You could make payload a Template string and use it to put the data in each JSON record into the format you want. Brace characters {} have no special meaning in Templates on their own, which is what makes Templates easy to use here.
Doing that will create a valid string representation of a dictionary containing everything. You can turn this into an actual Python dictionary data-structure using the ast.literal_eval() function, and then convert that into JSON string format — which I think is the final format you're after.
from ast import literal_eval
import json
from string import Template
from textwrap import dedent
json_records = '''[{"description":"<p>This is scenario1<\/p>","owner":"deb",
"priority":"high"},
{"description":"<p>This is scenario2<\/p>","owner":"deb",
"priority":"medium"}]'''
json_object = json.loads(json_records)
payload = Template(dedent("""
{
"subject": "test",
"fieldValues": [
$each
]
}""")
)
for each in json_object:
    obj = literal_eval(payload.substitute(dict(each=each)))
    print(json.dumps(obj, indent=2))
Output:
{
"subject": "test",
"fieldValues": [
{
"description": "<p>This is scenario1</p>",
"owner": "deb",
"priority": "high"
}
]
}
{
"subject": "test",
"fieldValues": [
{
"description": "<p>This is scenario2</p>",
"owner": "deb",
"priority": "medium"
}
]
}

format a json and then open it with the json.load () [duplicate]

I currently have JSON in the below format.
Some of the Key values are NOT properly formatted as they are missing double quotes (")
How do I fix these key values to have double-quotes on them?
{
Name: "test",
Address: "xyz",
"Age": 40,
"Info": "test"
}
Required:
{
"Name": "test",
"Address": "xyz",
"Age": 40,
"Info": "test"
}
Using the below post, I was able to find such key values in the above INVALID JSON.
However, I could NOT find an efficient way to replace these found values with double-quotes.
s = "Example: String"
out = re.findall(r'\w+:', s)
How to Escape Double Quote inside JSON
Using Regex:
import re
data = """{ Name: "test", Address: "xyz"}"""
print( re.sub(r"(\w+):", r'"\1":', data) )
Output:
{ "Name": "test", "Address": "xyz"}
You can use PyYaml. Since JSON is a subset of Yaml, pyyaml may overcome the lack of quotes.
Example
import yaml
dirty_json = """
{
key: "value",
"key2": "value"
}
"""
yaml.load(dirty_json, yaml.SafeLoader)
I ran into a few more issues with my JSON, so here is the final solution that worked for me.
jsonStr = re.sub(r"((?=\D)\w+):", r'"\1":', jsonStr)
jsonStr = re.sub(r": ((?=\D)\w+)", r':"\1"', jsonStr)
The first line adds the missing double quotes around a key, e.g. Name: "test".
The second line adds the missing double quotes around a value, e.g. "Info": test.
Because of the (?=\D) lookahead, tokens made up only of digits are left alone, so numbers and numeric times such as 12:30 are not wrapped in quotes.
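Here is a small runnable sketch of those two substitutions applied to a sample string, assuming Python 3. The function name quote_bare_tokens and the sample data are made up for illustration. Keep in mind that the second substitution would also turn the JSON literals true, false and null into strings, and that a string value containing something like "word:" would be altered by the first one.

```python
import json
import re

def quote_bare_tokens(json_str):
    """Apply the two substitutions to quote bare keys and bare values."""
    # Bare key: a run of word characters beginning with a non-digit, directly before ':'
    json_str = re.sub(r"((?=\D)\w+):", r'"\1":', json_str)
    # Bare value: a run of word characters beginning with a non-digit, directly after ': '
    json_str = re.sub(r": ((?=\D)\w+)", r':"\1"', json_str)
    return json_str

dirty = '{ Name: "test", "Info": test, "Age": 40 }'
fixed = quote_bare_tokens(dirty)
record = json.loads(fixed)

print(fixed)
print(record)
```

Already-quoted keys such as "Info" are skipped because the closing quote sits between the word and the colon, and 40 is skipped because it starts with a digit.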
You can use an online formatter. Most of them throw an error when double quotes are missing, but the one below seems to handle it nicely:
JSON Formatter
The regex approach can be brittle. I suggest you find a library that can parse the JSON text that is missing quotes.
For example, in Kotlin 1.4, the standard way to parse a JSON string is using Json.decodeFromString. However, you can use Json { isLenient = true }.decodeFromString to relax the requirements for quotes. Here is a complete example in JUnit.
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json
import org.junit.jupiter.api.Assertions
import org.junit.jupiter.api.Test
@Serializable
data class Widget(val x: Int, val y: String)
class JsonTest {
    @Test
    fun `Parsing Json`() {
        val w: Widget = Json.decodeFromString("""{"x":123, "y":"abc"}""")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
    @Test
    fun `Parsing Json missing quotes`() {
        // Json.decodeFromString("{x:123, y:abc}") failed to decode due to missing quotes
        val w: Widget = Json { isLenient = true }.decodeFromString("{x:123, y:abc}")
        Assertions.assertEquals(123, w.x)
        Assertions.assertEquals("abc", w.y)
    }
}

Parse json string in Python

A simple one, but I've just not yet been able to wrap my head around parsing nested lists and json structures in Python...
Here is the raw message I am trying to parse.
{
"Records": [
{
"messageId": "1b9c0952-3fe3-4ab4-a8ae-26bd5d3445f8",
"receiptHandle": "AQEBy40IsvNDy33dOhn4KB8+7apBecWpSuw5OgL9sw/Nf+tM2esLgqmWjGsd4n0oqB",
"body": "{\n \"Type\" : \"Notification\",\n \"MessageId\" : \"dce5c301-029f-55e1-8cee-959b1ad4e500\",\n \"TopicArn\" : \"arn:aws:sns:ap-southeast-2:062497424678:vid\",\n \"Message\" : \"ChiliChallenge.mp4\",\n \"Timestamp\" : \"2020-01-16T07:51:39.807Z\",\n \"SignatureVersion\" : \"1\",\n \"Signature\" : \"oloRF7SzS8ipWQFZieXDQ==\",\n \"SigningCertURL\" : \"https://sns.ap-southeast-2.amazonaws.com/SimpleNotificationService-a.pem\",\n \"UnsubscribeURL\" : \"https://sns.ap-southeast-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:ap-southeast-2:062478:vid\"\n}",
"attributes": {
"ApproximateReceiveCount": "1",
"SentTimestamp": "1579161099897",
"SenderId": "AIDAIY4XD42",
"ApproximateFirstReceiveTimestamp": "1579161099945"
},
"messageAttributes": {},
"md5OfBody": "1f246d643af4ea232d6d4c91f",
"eventSource": "aws:sqs",
"eventSourceARN": "arn:aws:sqs:ap-southeast-2:062497424678:vid",
"awsRegion": "ap-southeast-2"
}
]
}
I am trying to extract the Message in the body section, ending up with the string "ChiliChallenge.mp4".
Thanks!
Essentially I just keep getting either TypeError: string indices must be integers or parsing the body but not getting any further into the list without an error.
Here's my attempt:
import json
with open("event_testing.txt", "r") as myfile:
    event = myfile.read().replace('\n', '')
str(event)
event = json.loads(event)
key = event['Records'][0]['body']
print(key)
You can call json.loads a second time, because the body value is itself a JSON string:
with open("event_testing.txt", "r") as fp:
    event = json.loads(fp.read())
key = json.loads(event['Records'][0]['body'])['Message']
print(key)
'ChiliChallenge.mp4'
Say your parsed message is stored in phrase.
I would restructure your code like this:
phrase_2 = phrase["Records"]
print(phrase_2[0]["body"])
Then it works. Records holds a list, so you need to index into it (here with [0]) before you can access body.
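To make the two decoding steps explicit without needing the file, here is a self-contained sketch. It builds a much smaller stand-in for the event (the messageId value and the reduced set of fields are made up for illustration), where body is a JSON document stored as a string inside the outer JSON document.

```python
import json

# Stand-in for the real event: "body" holds JSON that is encoded as a string
inner = {"Type": "Notification", "Message": "ChiliChallenge.mp4"}
raw_event = json.dumps({"Records": [{"messageId": "example-id", "body": json.dumps(inner)}]})

# First decode: the outer document
event = json.loads(raw_event)
# "Records" is a list, so pick the first element; its "body" is still a str
body_text = event["Records"][0]["body"]
# Second decode: the JSON embedded in the body string
body = json.loads(body_text)
message = body["Message"]

print(message)  # ChiliChallenge.mp4
```

The TypeError about string indices shows up when the second json.loads is skipped and body_text is indexed with ["Message"] directly.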

Parsing a nested JSON keys and getting the values in a CSV format

I have a nested JSON data like this of about 5000 records.
{
"data": {
"attributes": [
{
"alert_type": "download",
"severity_level": "med",
"user": "10.1.1.16"
},
{
"alert_type": "download",
"severity_level": "low",
"user": "10.2.1.18"
}
]
}
}
Now I need to parse this JSON and get only certain fields in CSV format. Let's say we need alert_type and user.
I tried to parse this JSON dictionary:
>>> import json
>>> resp = '{"data":{"attributes":[{"alert_type":"download","severity_level":"med","user":"10.1.1.16"},{"alert_type":"download","severity_level":"low","user":"10.2.1.18"}]}}'
>>> user_dict = json.loads(resp)
>>> event_cnt = user_dict['data']['attributes']
>>> print event_cnt[0]['alert_type']
download
>>> print event_cnt[0]['user']
10.1.1.16
>>> print event_cnt[0]['alert_type'] + "," + event_cnt[0]['user']
download,10.1.1.16
>>>
How can I get all the values of particular keys in CSV format in a single iteration?
Output:
download,10.1.1.16
download,10.2.1.18
Simple list comprehension:
>>> jdict=json.loads(resp)
>>> ["{},{}".format(d["alert_type"],d["user"]) for d in jdict["data"]["attributes"]]
['download,10.1.1.16', 'download,10.2.1.18']
Which you can join for your desired output:
>>> li=["{},{}".format(d["alert_type"],d["user"]) for d in jdict["data"]["attributes"]]
>>> print '\n'.join(li)
download,10.1.1.16
download,10.2.1.18
Since the value under data and then attributes is a list, you can loop over it and print the values for the desired keys (d is the user dict):
for item in d['data']['attributes']:
    print(item['alert_type'],',',item['user'], sep='')
You could make it somewhat data-driven like this:
import json
DESIRED_KEYS = 'alert_type', 'user'
resp = '''{ "data": {
"attributes": [
{
"alert_type": "download",
"severity_level": "med",
"user": "10.1.1.16"
},
{
"alert_type": "download",
"severity_level": "low",
"user": "10.2.1.18"
}
]
}
}
'''
user_dict = json.loads(resp)
for attribute in user_dict['data']['attributes']:
    print(','.join(attribute[key] for key in DESIRED_KEYS))
To handle attributes that don't have all the keys, you could use this as the last line instead. It substitutes a default value (a blank string here) for missing keys rather than raising an exception.
print(','.join(attribute.get(key, '') for key in DESIRED_KEYS))
Using jq, a one-line solution is straightforward:
$ jq -r '.data.attributes[] | [.alert_type, .user] | @csv' input.json
"download","10.1.1.16"
"download","10.2.1.18"
If you don't want the strings to be quoted, use join(",") instead of @csv.
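Staying in Python, the standard csv module is another option, and it takes care of quoting if a value ever contains a comma. The sketch below assumes Python 3 and writes to an in-memory io.StringIO buffer, which stands in for a real file opened with newline=""; the header row is an addition that was not asked for in the question.

```python
import csv
import io
import json

DESIRED_KEYS = ("alert_type", "user")
resp = ('{"data":{"attributes":['
        '{"alert_type":"download","severity_level":"med","user":"10.1.1.16"},'
        '{"alert_type":"download","severity_level":"low","user":"10.2.1.18"}]}}')

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(DESIRED_KEYS)  # header row (optional)
for attribute in json.loads(resp)["data"]["attributes"]:
    # .get() with a default keeps going when a record lacks one of the keys
    writer.writerow([attribute.get(key, "") for key in DESIRED_KEYS])

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with open("out.csv", "w", newline="") in a with block; the file name here is only an example.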

String format a JSON string gives KeyError

Why does this code give a KeyError?
output_format = """
{
"File": "{filename}",
"Success": {success},
"ErrorMessage": "{error_msg}",
"LogIdentifier": "{log_identifier}"
}
"""
print output_format.format(filename='My_file_name',
success=True,
error_msg='',
log_identifier='123')
Error message:
KeyError: ' "File"'
You need to double the outer braces; otherwise Python thinks { "File".. is a reference too:
output_format = '{{ "File": "{filename}", "Success": {success}, "ErrorMessage": "{error_msg}", "LogIdentifier": "{log_identifier}" }}'
Result:
>>> print output_format.format(filename='My_file_name',
... success=True,
... error_msg='',
... log_identifier='123')
{ "File": "My_file_name", "Success": True, "ErrorMessage": "", "LogIdentifier": "123" }
If, incidentally, you are producing JSON output, you'd be better off using the json module:
>>> import json
>>> print json.dumps({'File': 'My_file_name',
... 'Success': True,
... 'ErrorMessage': '',
... 'LogIdentifier': '123'})
{"LogIdentifier": "123", "ErrorMessage": "", "Success": true, "File": "My_file_name"}
Note the lowercase true in the output, as required by the JSON standard.
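The difference matters as soon as something tries to parse the result. The following sketch (Python 3, with a shortened template and a made-up helper named is_valid_json) formats the same data both ways and checks each string with json.loads.

```python
import json

def is_valid_json(text):
    """Return True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False

# Doubled outer braces survive str.format as literal braces
template = '{{"File": "{filename}", "Success": {success}}}'
formatted = template.format(filename="My_file_name", success=True)

# The json module serializes the Python boolean as a JSON boolean
dumped = json.dumps({"File": "My_file_name", "Success": True})

print(formatted, is_valid_json(formatted))
print(dumped, is_valid_json(dumped))
```

The formatted string contains the Python literal True, which is not a JSON token, so it fails to parse even though the braces are correct.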
As mentioned by Tudor in a comment to another answer, the Template class was the solution that worked best for me. I'm dealing with nested dictionaries or lists of dictionaries, and handling those was not as straightforward.
Using Template, though, the solution is quite simple.
I start with a dictionary that is converted into a string. I then replace all instances of { with ${, which is the Template syntax for a placeholder.
The key point of getting this to work is using the Template method safe_substitute. It will replace all valid placeholders like ${user_id} but ignore any invalid ones that are part of the dictionary structure, like ${'name': 'John', ....
After the substitution is done I remove any leftover $ and convert the string back to a dictionary.
In the code below, resolve_placeholders returns a dictionary where each key matches a placeholder in the payload string and each value is what the Template class substitutes for it.
from string import Template
.
.
.
payload = json.dumps(payload)
payload = payload.replace('{', '${')
replace_values = self.resolve_placeholders(payload)
if replace_values:
    string_template = Template(payload)
    payload = string_template.safe_substitute(replace_values)
payload = payload.replace('${', '{')
payload = json.loads(payload)
To extend Martijn Pieters' answer and comment:
According to Martijn's comment, escaping the {..} pairs that are not placeholders is the way to go with nested dictionaries. I haven't succeeded in doing that, so I suggest the following method.
For nested dictionaries I tried doubling up on any { and } of the nested dictionaries.
a='{{"names":{{"a":"{name}"}}}}'
a.format(name=123)
output: '{"names":{"a":"123"}}'
But this makes using format to change values inside a JSON string an overly complex method, so I use a twist on the format approach.
I replace ${param_name} placeholders in a JSON string. For example:
My predefined JSON looks like this:
my_json_dict = {
'parameter': [
{
'name': 'product',
'value': '${product}'
},
{
'name': 'suites',
'value': '${suites}'
},
{
'name': 'markers',
'value': '${markers}'
}
]
}
I provide this dictionary of values to replace the parameters with:
parameters = {
'product': 'spam',
'suites': 'ham',
'markers': 'eggs'
}
And use this code to do the replacement:
json_str = json.dumps(my_json_dict)
for parameter_name, parameter_value in parameters.items():  # .iteritems() exists only on Python 2
    parameter_name = '${'+parameter_name+'}'
    json_str = json_str.replace(parameter_name, parameter_value)
json_dict = json.loads(json_str)
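A possible variation, which is not what the code above does, is to skip the round trip through a JSON string and substitute directly in the string values of the nested structure. The sketch below assumes Python 3; the function name fill_placeholders and the ${missing} placeholder are made up for illustration. Dictionary keys are left untouched, only values are processed.

```python
from string import Template

def fill_placeholders(node, values):
    """Recursively substitute ${name} placeholders in every string value."""
    if isinstance(node, dict):
        return {key: fill_placeholders(child, values) for key, child in node.items()}
    if isinstance(node, list):
        return [fill_placeholders(child, values) for child in node]
    if isinstance(node, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising
        return Template(node).safe_substitute(values)
    return node  # numbers, booleans and None pass through unchanged

my_json_dict = {
    "parameter": [
        {"name": "product", "value": "${product}"},
        {"name": "suites", "value": "${suites}"},
        {"name": "markers", "value": "${missing}"},
    ]
}
filled = fill_placeholders(my_json_dict, {"product": "spam", "suites": "ham"})
print(filled)
```

Working on the parsed structure means a replacement value that contains a double quote or a backslash cannot break the JSON, which can happen when replacing inside the serialized string.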
