I've seen a fair share of ungainly XML->JSON code on the web, and having interacted with Stack's users for a bit, I'm convinced that this crowd can help more than the first few pages of Google results can.
So, we're parsing a weather feed, and we need to populate weather widgets on a multitude of web sites. We're looking now into Python-based solutions.
This public weather.com RSS feed is a good example of what we'd be parsing (our actual weather.com feed contains additional information because of a partnership w/them).
In a nutshell, how should we convert XML to JSON using Python?
xmltodict (full disclosure: I wrote it) can help you convert your XML to a dict+list+string structure, following this "standard". It is Expat-based, so it's very fast and doesn't need to load the whole XML tree in memory.
Once you have that data structure, you can serialize it to JSON:
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
There is no "one-to-one" mapping between XML and JSON, so converting one to the other necessarily requires some understanding of what you want to do with the results.
That being said, Python's standard library has several modules for parsing XML (including DOM, SAX, and ElementTree). As of Python 2.6, support for converting Python data structures to and from JSON is included in the json module.
So the infrastructure is there.
You can use the xmljson library to convert using different XML JSON conventions.
For example, this XML:
<p id="1">text</p>
translates via the BadgerFish convention into this:
{
'p': {
'#id': 1,
'$': 'text'
}
}
and via the GData convention into this (attributes are not supported):
{
'p': {
'$t': 'text'
}
}
... and via the Parker convention into this (attributes are not supported):
{
'p': 'text'
}
It's possible to convert from XML to JSON and from JSON to XML using the same
conventions:
>>> import json, xmljson
>>> from lxml.etree import fromstring, tostring
>>> xml = fromstring('<p id="1">text</p>')
>>> json.dumps(xmljson.badgerfish.data(xml))
'{"p": {"#id": 1, "$": "text"}}'
>>> xmljson.parker.etree({'ul': {'li': [1, 2]}})
# Creates [<ul><li>1</li><li>2</li></ul>]
Disclosure: I wrote this library. Hope it helps future searchers.
To anyone that may still need this. Here's a newer, simple code to do this conversion.
from xml.etree import ElementTree as ET
xml = ET.parse('FILE_NAME.xml')
parsed = parseXmlToJson(xml)
def parseXmlToJson(xml):
response = {}
for child in list(xml):
if len(list(child)) > 0:
response[child.tag] = parseXmlToJson(child)
else:
response[child.tag] = child.text or ''
# one-liner equivalent
# response[child.tag] = parseXmlToJson(child) if len(list(child)) > 0 else child.text or ''
return response
If some time you get only response code instead of all data then error like json parse will be there so u need to convert it as text
import xmltodict
data = requests.get(url)
xpars = xmltodict.parse(data.text)
json = json.dumps(xpars)
print json
Here's the code I built for that. There's no parsing of the contents, just plain conversion.
from xml.dom import minidom
import simplejson as json
def parse_element(element):
dict_data = dict()
if element.nodeType == element.TEXT_NODE:
dict_data['data'] = element.data
if element.nodeType not in [element.TEXT_NODE, element.DOCUMENT_NODE,
element.DOCUMENT_TYPE_NODE]:
for item in element.attributes.items():
dict_data[item[0]] = item[1]
if element.nodeType not in [element.TEXT_NODE, element.DOCUMENT_TYPE_NODE]:
for child in element.childNodes:
child_name, child_dict = parse_element(child)
if child_name in dict_data:
try:
dict_data[child_name].append(child_dict)
except AttributeError:
dict_data[child_name] = [dict_data[child_name], child_dict]
else:
dict_data[child_name] = child_dict
return element.nodeName, dict_data
if __name__ == '__main__':
dom = minidom.parse('data.xml')
f = open('data.json', 'w')
f.write(json.dumps(parse_element(dom), sort_keys=True, indent=4))
f.close()
There is a method to transport XML-based markup as JSON which allows it to be losslessly converted back to its original form. See http://jsonml.org/.
It's a kind of XSLT of JSON. I hope you find it helpful
I'd suggest not going for a direct conversion. Convert XML to an object, then from the object to JSON.
In my opinion, this gives a cleaner definition of how the XML and JSON correspond.
It takes time to get right and you may even write tools to help you with generating some of it, but it would look roughly like this:
class Channel:
def __init__(self)
self.items = []
self.title = ""
def from_xml( self, xml_node ):
self.title = xml_node.xpath("title/text()")[0]
for x in xml_node.xpath("item"):
item = Item()
item.from_xml( x )
self.items.append( item )
def to_json( self ):
retval = {}
retval['title'] = title
retval['items'] = []
for x in items:
retval.append( x.to_json() )
return retval
class Item:
def __init__(self):
...
def from_xml( self, xml_node ):
...
def to_json( self ):
...
You may want to have a look at http://designtheory.org/library/extrep/designdb-1.0.pdf. This project starts off with an XML to JSON conversion of a large library of XML files. There was much research done in the conversion, and the most simple intuitive XML -> JSON mapping was produced (it is described early in the document). In summary, convert everything to a JSON object, and put repeating blocks as a list of objects.
objects meaning key/value pairs (dictionary in Python, hashmap in Java, object in JavaScript)
There is no mapping back to XML to get an identical document, the reason is, it is unknown whether a key/value pair was an attribute or an <key>value</key>, therefore that information is lost.
If you ask me, attributes are a hack to start; then again they worked well for HTML.
Well, probably the simplest way is just parse the XML into dictionaries and then serialize that with simplejson.
When I do anything with XML in python I almost always use the lxml package. I suspect that most people use lxml. You could use xmltodict but you will have to pay the penalty of parsing the XML again.
To convert XML to json with lxml you:
Parse XML document with lxml
Convert lxml to a dict
Convert list to json
I use the following class in my projects. Use the toJson method.
from lxml import etree
import json
class Element:
'''
Wrapper on the etree.Element class. Extends functionality to output element
as a dictionary.
'''
def __init__(self, element):
'''
:param: element a normal etree.Element instance
'''
self.element = element
def toDict(self):
'''
Returns the element as a dictionary. This includes all child elements.
'''
rval = {
self.element.tag: {
'attributes': dict(self.element.items()),
},
}
for child in self.element:
rval[self.element.tag].update(Element(child).toDict())
return rval
class XmlDocument:
'''
Wraps lxml to provide:
- cleaner access to some common lxml.etree functions
- converter from XML to dict
- converter from XML to json
'''
def __init__(self, xml = '<empty/>', filename=None):
'''
There are two ways to initialize the XmlDocument contents:
- String
- File
You don't have to initialize the XmlDocument during instantiation
though. You can do it later with the 'set' method. If you choose to
initialize later XmlDocument will be initialized with "<empty/>".
:param: xml Set this argument if you want to parse from a string.
:param: filename Set this argument if you want to parse from a file.
'''
self.set(xml, filename)
def set(self, xml=None, filename=None):
'''
Use this to set or reset the contents of the XmlDocument.
:param: xml Set this argument if you want to parse from a string.
:param: filename Set this argument if you want to parse from a file.
'''
if filename is not None:
self.tree = etree.parse(filename)
self.root = self.tree.getroot()
else:
self.root = etree.fromstring(xml)
self.tree = etree.ElementTree(self.root)
def dump(self):
etree.dump(self.root)
def getXml(self):
'''
return document as a string
'''
return etree.tostring(self.root)
def xpath(self, xpath):
'''
Return elements that match the given xpath.
:param: xpath
'''
return self.tree.xpath(xpath);
def nodes(self):
'''
Return all elements
'''
return self.root.iter('*')
def toDict(self):
'''
Convert to a python dictionary
'''
return Element(self.root).toDict()
def toJson(self, indent=None):
'''
Convert to JSON
'''
return json.dumps(self.toDict(), indent=indent)
if __name__ == "__main__":
xml='''<system>
<product>
<demod>
<frequency value='2.215' units='MHz'>
<blah value='1'/>
</frequency>
</demod>
</product>
</system>
'''
doc = XmlDocument(xml)
print doc.toJson(indent=4)
The output from the built in main is:
{
"system": {
"attributes": {},
"product": {
"attributes": {},
"demod": {
"attributes": {},
"frequency": {
"attributes": {
"units": "MHz",
"value": "2.215"
},
"blah": {
"attributes": {
"value": "1"
}
}
}
}
}
}
}
Which is a transformation of this xml:
<system>
<product>
<demod>
<frequency value='2.215' units='MHz'>
<blah value='1'/>
</frequency>
</demod>
</product>
</system>
I found for simple XML snips, use regular expression would save troubles. For example:
# <user><name>Happy Man</name>...</user>
import re
names = re.findall(r'<name>(\w+)<\/name>', xml_string)
# do some thing to names
To do it by XML parsing, as #Dan said, there is not one-for-all solution because the data is different. My suggestion is to use lxml. Although not finished to json, lxml.objectify give quiet good results:
>>> from lxml import objectify
>>> root = objectify.fromstring("""
... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
... <a attr1="foo" attr2="bar">1</a>
... <a>1.2</a>
... <b>1</b>
... <b>true</b>
... <c>what?</c>
... <d xsi:nil="true"/>
... </root>
... """)
>>> print(str(root))
root = None [ObjectifiedElement]
a = 1 [IntElement]
* attr1 = 'foo'
* attr2 = 'bar'
a = 1.2 [FloatElement]
b = 1 [IntElement]
b = True [BoolElement]
c = 'what?' [StringElement]
d = None [NoneElement]
* xsi:nil = 'true'
While the built-in libs for XML parsing are quite good I am partial to lxml.
But for parsing RSS feeds, I'd recommend Universal Feed Parser, which can also parse Atom.
Its main advantage is that it can digest even most malformed feeds.
Python 2.6 already includes a JSON parser, but a newer version with improved speed is available as simplejson.
With these tools building your app shouldn't be that difficult.
My answer addresses the specific (and somewhat common) case where you don't really need to convert the entire xml to json, but what you need is to traverse/access specific parts of the xml, and you need it to be fast, and simple (using json/dict-like operations).
Approach
For this, it is important to note that parsing an xml to etree using lxml is super fast. The slow part in most of the other answers is the second pass: traversing the etree structure (usually in python-land), converting it to json.
Which leads me to the approach I found best for this case: parsing the xml using lxml, and then wrapping the etree nodes (lazily), providing them with a dict-like interface.
Code
Here's the code:
from collections import Mapping
import lxml.etree
class ETreeDictWrapper(Mapping):
def __init__(self, elem, attr_prefix = '#', list_tags = ()):
self.elem = elem
self.attr_prefix = attr_prefix
self.list_tags = list_tags
def _wrap(self, e):
if isinstance(e, basestring):
return e
if len(e) == 0 and len(e.attrib) == 0:
return e.text
return type(self)(
e,
attr_prefix = self.attr_prefix,
list_tags = self.list_tags,
)
def __getitem__(self, key):
if key.startswith(self.attr_prefix):
return self.elem.attrib[key[len(self.attr_prefix):]]
else:
subelems = [ e for e in self.elem.iterchildren() if e.tag == key ]
if len(subelems) > 1 or key in self.list_tags:
return [ self._wrap(x) for x in subelems ]
elif len(subelems) == 1:
return self._wrap(subelems[0])
else:
raise KeyError(key)
def __iter__(self):
return iter(set( k.tag for k in self.elem) |
set( self.attr_prefix + k for k in self.elem.attrib ))
def __len__(self):
return len(self.elem) + len(self.elem.attrib)
# defining __contains__ is not necessary, but improves speed
def __contains__(self, key):
if key.startswith(self.attr_prefix):
return key[len(self.attr_prefix):] in self.elem.attrib
else:
return any( e.tag == key for e in self.elem.iterchildren() )
def xml_to_dictlike(xmlstr, attr_prefix = '#', list_tags = ()):
t = lxml.etree.fromstring(xmlstr)
return ETreeDictWrapper(
t,
attr_prefix = '#',
list_tags = set(list_tags),
)
This implementation is not complete, e.g., it doesn't cleanly support cases where an element has both text and attributes, or both text and children (only because I didn't need it when I wrote it...) It should be easy to improve it, though.
Speed
In my specific use case, where I needed to only process specific elements of the xml, this approach gave a suprising and striking speedup by a factor of 70 (!) compared to using #Martin Blech's xmltodict and then traversing the dict directly.
Bonus
As a bonus, since our structure is already dict-like, we get another alternative implementation of xml2json for free. We just need to pass our dict-like structure to json.dumps. Something like:
def xml_to_json(xmlstr, **kwargs):
x = xml_to_dictlike(xmlstr, **kwargs)
return json.dumps(x)
If your xml includes attributes, you'd need to use some alphanumeric attr_prefix (e.g. "ATTR_"), to ensure the keys are valid json keys.
I haven't benchmarked this part.
check out lxml2json (disclosure: I wrote it)
https://github.com/rparelius/lxml2json
it's very fast, lightweight (only requires lxml), and one advantage is that you have control over whether certain elements are converted to lists or dicts
jsonpickle or if you're using feedparser, you can try feed_parser_to_json.py
You can use declxml. It has advanced features like multi attributes and complex nested support. You just need to write a simple processor for it. Also with the same code, you can convert back to JSON as well. It is fairly straightforward and the documentation is awesome.
Link: https://declxml.readthedocs.io/en/latest/index.html
If you don't want to use any external libraries and 3rd party tools, Try below code.
Code
import re
import json
def getdict(content):
res=re.findall("<(?P<var>\S*)(?P<attr>[^/>]*)(?:(?:>(?P<val>.*?)</(?P=var)>)|(?:/>))",content)
if len(res)>=1:
attreg="(?P<avr>\S+?)(?:(?:=(?P<quote>['\"])(?P<avl>.*?)(?P=quote))|(?:=(?P<avl1>.*?)(?:\s|$))|(?P<avl2>[\s]+)|$)"
if len(res)>1:
return [{i[0]:[{"#attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,i[1].strip())]},{"$values":getdict(i[2])}]} for i in res]
else:
return {res[0]:[{"#attributes":[{j[0]:(j[2] or j[3] or j[4])} for j in re.findall(attreg,res[1].strip())]},{"$values":getdict(res[2])}]}
else:
return content
with open("test.xml","r") as f:
print(json.dumps(getdict(f.read().replace('\n',''))))
Sample Input
<details class="4b" count=1 boy>
<name type="firstname">John</name>
<age>13</age>
<hobby>Coin collection</hobby>
<hobby>Stamp collection</hobby>
<address>
<country>USA</country>
<state>CA</state>
</address>
</details>
<details empty="True"/>
<details/>
<details class="4a" count=2 girl>
<name type="firstname">Samantha</name>
<age>13</age>
<hobby>Fishing</hobby>
<hobby>Chess</hobby>
<address current="no">
<country>Australia</country>
<state>NSW</state>
</address>
</details>
Output
[
{
"details": [
{
"#attributes": [
{
"class": "4b"
},
{
"count": "1"
},
{
"boy": ""
}
]
},
{
"$values": [
{
"name": [
{
"#attributes": [
{
"type": "firstname"
}
]
},
{
"$values": "John"
}
]
},
{
"age": [
{
"#attributes": []
},
{
"$values": "13"
}
]
},
{
"hobby": [
{
"#attributes": []
},
{
"$values": "Coin collection"
}
]
},
{
"hobby": [
{
"#attributes": []
},
{
"$values": "Stamp collection"
}
]
},
{
"address": [
{
"#attributes": []
},
{
"$values": [
{
"country": [
{
"#attributes": []
},
{
"$values": "USA"
}
]
},
{
"state": [
{
"#attributes": []
},
{
"$values": "CA"
}
]
}
]
}
]
}
]
}
]
},
{
"details": [
{
"#attributes": [
{
"empty": "True"
}
]
},
{
"$values": ""
}
]
},
{
"details": [
{
"#attributes": []
},
{
"$values": ""
}
]
},
{
"details": [
{
"#attributes": [
{
"class": "4a"
},
{
"count": "2"
},
{
"girl": ""
}
]
},
{
"$values": [
{
"name": [
{
"#attributes": [
{
"type": "firstname"
}
]
},
{
"$values": "Samantha"
}
]
},
{
"age": [
{
"#attributes": []
},
{
"$values": "13"
}
]
},
{
"hobby": [
{
"#attributes": []
},
{
"$values": "Fishing"
}
]
},
{
"hobby": [
{
"#attributes": []
},
{
"$values": "Chess"
}
]
},
{
"address": [
{
"#attributes": [
{
"current": "no"
}
]
},
{
"$values": [
{
"country": [
{
"#attributes": []
},
{
"$values": "Australia"
}
]
},
{
"state": [
{
"#attributes": []
},
{
"$values": "NSW"
}
]
}
]
}
]
}
]
}
]
}
]
This stuff here is actively maintained and so far is my favorite: xml2json in python
I published one on github a while back..
https://github.com/davlee1972/xml_to_json
This converter is written in Python and will convert one or more XML files into JSON / JSONL files
It requires a XSD schema file to figure out nested json structures (dictionaries vs lists) and json equivalent data types.
python xml_to_json.py -x PurchaseOrder.xsd PurchaseOrder.xml
INFO - 2018-03-20 11:10:24 - Parsing XML Files..
INFO - 2018-03-20 11:10:24 - Processing 1 files
INFO - 2018-03-20 11:10:24 - Parsing files in the following order:
INFO - 2018-03-20 11:10:24 - ['PurchaseOrder.xml']
DEBUG - 2018-03-20 11:10:24 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 11:10:24 - Parsing PurchaseOrder.xml
DEBUG - 2018-03-20 11:10:24 - Writing to file PurchaseOrder.json
DEBUG - 2018-03-20 11:10:24 - Completed PurchaseOrder.xml
I also have a follow up xml to parquet converter that works in a similar fashion
https://github.com/blackrock/xml_to_parquet
Related
import dicttoxml
json_req_body = {
"req": {
"SessionKey": "aad3584e-ce40-4937-9eae-5084ab693986",
"ObjectId": "1f79ed70-77c4-ec11-997e-281878c34860",
"PhoneNumber": {
"EntryId": 0,
"CountryCode": "",
"PhoneNumberType": "Home",
"_PhoneNumber": "7073861807",
"_IsPrimary": "true",
"OptOut": "false"
}
}
}
json_to_xml_data = dicttoxml.dicttoxml(json_req_body, attr_type=False,root=False)
#Byte to String
json_to_xml_data = json_to_xml_data.decode("utf-8")
print(json_to_xml_data)
current output:
<req>
<SessionKey>aad3584e-ce40-4937-9eae-5084ab693986</SessionKey>
<ObjectId>1f79ed70-77c4-ec11-997e-281878c34860</ObjectId>
<PhoneNumber>
<EntryId>0</EntryId>
<CountryCode></CountryCode>
<PhoneNumberType>Home</PhoneNumberType>
<_PhoneNumber>7073861807</_PhoneNumber>
<_IsPrimary>true</_IsPrimary>
<OptOut>false</OptOut>
</PhoneNumber>
</req>
I am using dicttoxml package to convert JSON to XML, conversation is happening well. but I have a scenario where JSON keys start with _ . in that scenario need JSON key and values have to be part of the XML parent tag as follows.
expected output;
<req>
<SessionKey>aad3584e-ce40-4937-9eae-5084ab693986</SessionKey>
<ObjectId>1f79ed70-77c4-ec11-997e-281878c34860</ObjectId>
<PhoneNumber PhoneNumber='7073861807' IsPrimary='true'>
<EntryId>0</EntryId>
<CountryCode></CountryCode>
<PhoneNumberType>Home</PhoneNumberType>
<OptOut>false</OptOut>
</PhoneNumber>
</req>
is there a way to achieve this using dicttoxml package, or are there any other packages that support these scenarios?
I don't think dicttoxml includes such a specific function. However, you can easily make all the changes you want to your xml with ElementTree:
import dicttoxml
import xml.etree.ElementTree as ET
json_req_body = {
"req": {
"SessionKey": "aad3584e-ce40-4937-9eae-5084ab693986",
"ObjectId": "1f79ed70-77c4-ec11-997e-281878c34860",
"PhoneNumber": {
"EntryId": 0,
"CountryCode": "",
"PhoneNumberType": "Home",
"_PhoneNumber": "7073861807",
"_IsPrimary": "true",
"OptOut": "false"
}
}
}
json_to_xml_data = dicttoxml.dicttoxml(json_req_body, attr_type=False,root=False).decode("utf-8")
elt_tree = ET.XML(json_to_xml_data)
for phone_nb in elt_tree.iter('PhoneNumber'):
remove_children = []
for child in phone_nb:
if child.tag[0] == '_':
phone_nb.set(child.tag.strip('_'), child.text)
remove_children.append(child)
for child in remove_children:
phone_nb.remove(child)
ET.indent(elt_tree)
print(ET.tostring(elt_tree, encoding='unicode'))
Output:
<req>
<SessionKey>aad3584e-ce40-4937-9eae-5084ab693986</SessionKey>
<ObjectId>1f79ed70-77c4-ec11-997e-281878c34860</ObjectId>
<PhoneNumber PhoneNumber="7073861807" IsPrimary="true">
<EntryId>0</EntryId>
<CountryCode />
<PhoneNumberType>Home</PhoneNumberType>
<OptOut>false</OptOut>
</PhoneNumber>
</req>
Edit: for all keys (not only "PhoneNumber") that are directly under root, replace the loop with:
for key_tag in elt_tree:
remove_children = []
for child in key_tag:
if child.tag[0] == '_':
key_tag.set(child.tag.strip('_'), child.text)
remove_children.append(child)
for child in remove_children:
key_tag.remove(child)
(loop over elt_tree.iter() instead of elt_tree if you want to process all elements recursively)
I am using python 3.6.3a. I would like to generate payload for each of the json records. I am using each variable to access the record. How to assign variable value (each in this case) in payload? I tried {each} and other methods but didn't work.
code snippet below.
json_records = [{"description":"<p>This is scenario1<\/p>","owner":"deb",
"priority":"high"},
{"description":"<p>This is scenario2<\/p>","owner":"deb",
"priority":"medium"}]
json_object = json.loads(json_records)
for each in json_object:
payload = """
{
"subject": "test",
"fieldValues": [
{each}
]
}
"""
There are two ways to approach this problem.
One way could be creating a dict() object and inserting keys as you wish, then json.dumps(object) to convert into string payload as in:
import json
json_records = [{"description":"This is scenario1</p>","owner":"deb","priority":"high"}
,{"description":"This is scenario2</p>","owner":"deb","priority":"medium"}]
for obj in json_records:
payload = dict()
payload['subject'] = 'test'
for key,value in obj.items():
payload['fieldName'] = {
key:value
}
print(json.dumps(payload))
#{"subject": "test", "fieldName": {"priority": "high"}}
#{"subject": "test", "fieldName": {"priority": "medium"}}
Second way is to create a textual payload from string as in, however if you need a valid JSON at the end, this would require a post-step of validation (something like try json.loads(payload) - So I'd just use the first method. I would use this method only if I have a specific requirements to generate the payload in a certain way.
import json
json_records = [{"description":"This is scenario1</p>","owner":"deb","priority":"high"}
,{"description":"This is scenario2</p>","owner":"deb","priority":"medium"}]
# json_object = json.loads(json_records) # json.loads works only on byte-like strings. your object is already in python in this case.
for obj in json_records:
payload = """
{
"subject": "test",
"fieldValues": [
%s
]
}
""" % (obj["priority"])
print(payload)
#{
# "subject": "test",
# "fieldValues": [
# high
# ]
# }
#
#
# {
# "subject": "test",
# "fieldValues": [
# medium
# ]
# }
You could make payload a Template string and use it to put the data in each JSON record into the format you want. Bracket {} characters have not special meaning in Templates, which is what makes using them easy.
Doing that will create a valid string representation of a dictionary containing everything. You can turn this into an actual Python dictionary data-structure using the ast.literal_eval() function, and then convert that into JSON string format — which I think is the final format you're after.
rom ast import literal_eval
import json
from string import Template
from textwrap import dedent
json_records = '''[{"description":"<p>This is scenario1<\/p>","owner":"deb",
"priority":"high"},
{"description":"<p>This is scenario2<\/p>","owner":"deb",
"priority":"medium"}]'''
json_object = json.loads(json_records)
payload = Template(dedent("""
{
"subject": "test",
"fieldValues": [
$each
]
}""")
)
for each in json_object:
obj = literal_eval(payload.substitute(dict(each=each)))
print(json.dumps(obj, indent=2))
Output:
{
"subject": "test",
"fieldValues": [
{
"description": "<p>This is scenario1</p>",
"owner": "deb",
"priority": "high"
}
]
}
{
"subject": "test",
"fieldValues": [
{
"description": "<p>This is scenario2</p>",
"owner": "deb",
"priority": "medium"
}
]
}
I have a python code that interacts with multiple APIs. All of the APIs return some json but each has different structure. Let's say I'm looking for people's names in all these jsons:
json_a = {
"people": [
{"name": "John"},
{"name": "Peter"}
]
}
json_b = {
"humans": {
"names": ["Adam", "Martin"]
}
}
As you can see above the dictionaries from jsons have arbitrary structures. I'd like to define something that will serve as a "blueprint" for navigating each json, something like this:
all_jsons = {
"json_a": {
"url": "http://endpoint",
"json_structure": "people -> list -> name"
},
"json_b": {
"url": "http://someotherendpoint",
"json_structure": "humans -> names -> list"
}
}
So that if I'm working with json_a I'll just look into all_jsons["json_a"]["json_structure"] and I have an information on how to navigate this exact json. What would be the best way to achieve this?
Why not define concrete retrieval functions for each api:
def retrieve_a(data):
return [d["name"] for d in data["people"]]
def retrieve_b(data):
return data["humans"]["names"]
and store them for each endpoint:
all_jsons = {
"json_a": {
"url": "http://endpoint",
"retrieve": retrieve_a
},
"json_b": {
"url": "http://someotherendpoint",
"retrieve": retrieve_b
}
}
I have found this approach more workable than trying to express code-logic by configuration. Then you can easily collect names:
for dct in all_jsons.values():
data = ... # requests.get(dct["url"]).json() # or similar
names = dct["retrieve"](data)
To get a value from a dictionary with a key if it may not exist, dict.get(key) is used for. To distinguish which type list or dict or else, type(val) is useful for. Combination of them should achieve your problem.
I am using python2.7
I have a json i pull that is always changing when i request it.
I need to pull out Animal_Target_DisplayName under Term7 Under Relation6 in my dict.
The problem is sometimes the object Relation6 is in another part of the Json, it could be leveled deeper or in another order.
I am trying to create code that can just export the values of the key Animal_Target_DisplayName but nothing is working. It wont even loop down the nested dict.
Now this can work if i just pull it out using something like ['view']['Term0'][0]['Relation6'] but remember the JSON is never returned in the same structure.
Code i am using to get the values of the key Animal_Target_DisplayName but it doesnt seem to loop through my dict and find all the values with that key.
array = []
for d in dict.values():
row = d['Animal_Target_DisplayName']
array.append(row)
JSON Below:
dict = {
"view":{
"Term0":[
{
"Id":"b0987b91-af12-4fe3-a56f-152ac7a4d84d",
"DisplayName":"Dog",
"FullName":"Dog",
"AssetType1":[
{
"AssetType_Id":"00000000-0000-0000-0000-000000031131",
}
]
},
{
"Id":"ee74a59d-fb74-4052-97ba-9752154f015d",
"DisplayName":"Dog2",
"FullName":"Dog",
"AssetType1":[
{
"AssetType_Id":"00000000-0000-0000-0000-000000031131",
}
]
},
{
"Id":"eb548eae-da6f-41e8-80ea-7e9984f56af6",
"DisplayName":"Dog3",
"FullName":"Dog3",
"AssetType1":[
{
"AssetType_Id":"00000000-0000-0000-0000-000000031131",
}
]
},
{
"Id":"cfac6dd4-0efa-4417-a2bf-0333204f8a42",
"DisplayName":"Animal Set",
"FullName":"Animal Set",
"AssetType1":[
{
"AssetType_Id":"00000000-0000-0000-0001-000400000001",
}
],
"StringAttribute2":[
{
"StringAttribute_00000000-0000-0000-0000-000000003114_Id":"00a701a8-be4c-4b76-a6e5-3b0a4085bcc8",
"StringAttribute_00000000-0000-0000-0000-000000003114_Value":"Desc"
}
],
"StringAttribute3":[
{
"StringAttribute_00000000-0000-0000-0000-000000000262_Id":"a81adfb4-7528-4673-8c95-953888f3b43a",
"StringAttribute_00000000-0000-0000-0000-000000000262_Value":"meow"
}
],
"BooleanAttribute4":[
{
"BooleanAttribute_00000000-0000-0000-0001-000500000001_Id":"932c5f97-c03f-4a1a-a0c5-a518f5edef5e",
"BooleanAttribute_00000000-0000-0000-0001-000500000001_Value":"true"
}
],
"SingleValueListAttribute5":[
{
"SingleValueListAttribute_00000000-0000-0000-0001-000500000031_Id":"ef51dedd-6f25-4408-99a6-5a6cfa13e198",
"SingleValueListAttribute_00000000-0000-0000-0001-000500000031_Value":"Blah"
}
],
"Relation6":[
{
"Animal_Id":"2715ca09-3ced-4b74-a418-cef4a95dddf1",
"Term7":[
{
"Animal_Target_Id":"88fd0090-4ea8-4ae6-b7f0-1b13e5cf3d74",
"Animal_Target_DisplayName":"Animaltheater",
"Animal_Target_FullName":"Animaltheater"
}
]
},
{
"Animal_Id":"6068fe78-fc8e-4542-9aee-7b4b68760dcd",
"Term7":[
{
"Animal_Target_Id":"4e87a614-2a8b-46c0-90f3-8a0cf9bda66c",
"Animal_Target_DisplayName":"Animaltitle",
"Animal_Target_FullName":"Animaltitle"
}
]
},
{
"Animal_Id":"754ec0e6-19b6-4b6b-8ba1-573393268257",
"Term7":[
{
"Animal_Target_Id":"a8986ed5-3ec8-44f3-954c-71cacb280ace",
"Animal_Target_DisplayName":"Animalcustomer",
"Animal_Target_FullName":"Animalcustomer"
}
]
},
{
"Animal_Id":"86b3ffd1-4d54-4a98-b25b-369060651bd6",
"Term7":[
{
"Animal_Target_Id":"89d02067-ebe8-4b87-9a1f-a6a0bdd40ec4",
"Animal_Target_DisplayName":"Animalfact_transaction",
"Animal_Target_FullName":"Animalfact_transaction"
}
]
},
{
"Animal_Id":"ea2e1b76-f8bc-46d9-8ebc-44ffdd60f213",
"Term7":[
{
"Animal_Target_Id":"e398cd32-1e73-46bd-8b8f-d039986d6de0",
"Animal_Target_DisplayName":"Animalfact_transaction",
"Animal_Target_FullName":"Animalfact_transaction"
}
]
}
],
"Relation10":[
{
"TargetRelation_b8b178ff-e957-47db-a4e7-6e5b789d6f03_Id":"aff80bd0-a282-4cf5-bdcc-2bad35ddec1d",
"Term11":[
{
"AnimalId":"3ac22167-eb91-469a-9d94-315aa301f55a",
"AnimalDisplayName":"Animal",
"AnimalFullName":"Animal"
}
]
}
],
"Tag12":[
{
"Tag_Id":"75968ea6-4c9f-43c9-80f7-dfc41b24ec8f",
"Tag_Name":"AnimalAnimaltitle"
},
{
"Tag_Id":"b1adbc00-aeef-415b-82b6-a3159145c60d",
"Tag_Name":"Animal2"
},
{
"Tag_Id":"5f78e4dc-2b37-41e0-a0d3-cec773af2397",
"Tag_Name":"AnimalDisplayName"
}
]
}
]
}
}
The output i am trying to get is a list of all the values from key Animal_Target_DisplayName like this ['Animaltheater','Animaltitle', 'Animalcustomer', 'Animalfact_transaction', 'Animalfact_transaction'] but we need to remember the nested structure of this json always changes but the keys for it are always the same.
I guess your only option is running through the entire dict and get the values of Animal_Target_DisplayName key, I propose the following recursive solution:
def run_json(dict_):
animal_target_sons = []
if type(dict_) is list:
for element in dict_:
animal_target_sons.append(run_json(element))
elif type(dict_) is dict:
for key in dict_:
if key=="Animal_Target_DisplayName":
animal_target_sons.append([dict_[key]])
else:
animal_target_sons.append(run_json(dict_[key]))
return [x for sublist in animal_target_sons for x in sublist]
run_json(dict_)
Then calling run_json returns a list with what you want. By the way, I recommend you to rename your json from dict to, for example dict_, since dict is a reserved word of Python for the dictionary type.
Since you're getting JSON, why not make use of the json module? That will do the parsing for you and allow you to use dictionary functions+features to get the information you need.
#!/usr/bin/python2.7
from __future__ import print_function
import json
# _somehow_ get your JSON in as a string. I'm calling it "jstr" for this
# example.
# Use the module to parse it
jdict = json.loads(jstr)
# our dict has keys...
# view -> Term0 -> keys-we're-interested-in
templist = jdict["view"]["Term0"]
results = {}
for _el in range(len(templist)):
if templist[_el]["FullName"] == "Animal Set":
# this is the one we're interested in - and it's another list
moretemp = templist[_el]["Relation6"]
for _k in range(len(moretemp)):
term7 = moretemp[_k]["Term7"][0]
displayName = term7["Animal_Target_DisplayName"]
fullName = term7["Animal_Target_FullName"]
results[fullName] = displayName
print("{0}".format(results))
Then you can dump the results dict plain, or with pretty-printing:
>>> print(json.dumps(results, indent=4))
{
"Animaltitle2": "Animaltitle2",
"Animalcustomer3": "Animalcustomer3",
"Animalfact_transaction4": "Animalfact_transaction4",
"Animaltheater1": "Animaltheater1"
}
I would like to convert the following nltk Tree representation into JSON format:
Desired output:
{
"scores": {
"filler": [
[
"scores"
],
[
"for"
]
],
"extent": [
"highest"
],
"team": [
"India"
]
}
}
It looks like the input tree may contain children with the same name. To support the general case, you could convert each Tree into a dictionary that maps its name to its children list:
from nltk import Tree # $ pip install nltk
def tree2dict(tree):
return {tree.node: [tree2dict(t) if isinstance(t, Tree) else t
for t in tree]}
Example:
import json
import sys
tree = Tree('scores',
[Tree('extent', ['highest']),
Tree('filler',
[Tree('filler', ['scores']),
Tree('filler', ['for'])]),
Tree('team', ['India'])])
d = tree2dict(tree)
json.dump(d, sys.stdout, indent=2)
Output:
{
"scores": [
{
"extent": [
"highest"
]
},
{
"filler": [
{
"filler": [
"scores"
]
},
{
"filler": [
"for"
]
}
]
},
{
"team": [
"India"
]
}
]
}
Convert Tree to dict and then to JSON.
def tree_to_dict(tree):
tdict = {}
for t in tree:
if isinstance(t, nltk.Tree) and isinstance(t[0], nltk.Tree):
tdict[t.node] = tree_to_dict(t)
elif isinstance(t, nltk.Tree):
tdict[t.node] = t[0]
return tdict
def dict_to_json(dict):
return json.dumps(dict)
output_json = dict_to_json({tree.node: tree_to_dict(tree)})
the will covert the tree to a dictionary with the tree labels as the key, then you can convert it into JSON using by using JSON dumps easily
import nltk.tree.Tree
def tree_to_dict(tree):
tree_dict = dict()
leaves = []
for subtree in tree:
if type(subtree) == nltk.tree.Tree:
tree_dict.update(tree_to_dict(subtree))
else:
(expression,tag) = subtree
leaves.append(expression)
tree_dict[tree.label()] = " ".join(leaves)
return tree_dict
A related alternative. For my purposes, I didn't need an exact tree preserved, but instead wanted to extract entities as keys and tokens as lists of values. For the sentence "Tom and Larry play for the Patriots." I wanted the following JSON:
{
"PERSON": [
"Tom",
"Larry"
],
"ORGANIZATION": [
"Patriots"
]
}
This preserves order of tokens (per entity type), while also not "stomping" values set for an entity key. You can reuse the same json.dump code in the other answers to return this dict to json.
from nltk import tag,chunk,tokenize
def prep(sentence):
return chunk.ne_chunk(tag.pos_tag(tokenize.word_tokenize(sentence)))
t = prep("Tom and Larry play for the Patriots.")
def tree_to_dict(tree):
tree_dict = dict()
for st in tree:
# not everything gets a NE tag,
# so we can ignore untagged tokens
# which are stored in tuples
if isinstance(st, nltk.Tree):
if st.label() in tree_dict:
tree_dict[st.label()] = tree_dict[st.label()] + [st[0][0]]
else:
tree_dict[st.label()] = [st[0][0]]
return tree_dict
print(tree_to_dict(t))
# {'PERSON': ['Tom', 'Larry'], 'ORGANIZATION': ['Patriots']}