Parse a special XML in Python

I have a special XML file like the one below:
<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
...
</alarm-dictionary>
I know that in Python I can get attributes such as "code" and "severity" from the alarm tag with:
for alarm_tag in dom.getElementsByTagName('alarm'):
    if alarm_tag.hasAttribute('code'):
        alarmcode = str(alarm_tag.getAttribute('code'))
And I can get the text in the message tag like below:
for messages_tag in dom.getElementsByTagName('message'):
    messages = ""
    for message_tag in messages_tag.childNodes:
        if message_tag.nodeType in (message_tag.TEXT_NODE, message_tag.CDATA_SECTION_NODE):
            messages += message_tag.data
But I also want to get values such as dnKinds (database), type (quality_of_service), perceived_severity (minor) and probable_cause (thresholdCrossed) from the description tag.
That is, I also want to parse the text content inside a tag.
Could anyone help me with this?
Thanks a lot!

Once you have the text from the description tag, the rest has nothing to do with XML parsing. You just need some simple string parsing to turn the key/value lines (such as type = quality_of_service) into something nicer to use in Python, like a dictionary.
With some slightly simpler parsing thanks to ElementTree, it would look like this:
messages = """
<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
...
</alarm-dictionary>
"""
import xml.etree.ElementTree as ET
# Parse XML
tree = ET.fromstring(messages)
for alarm in tree:  # iterating over an element yields its child elements
    # Get code and severity
    print(alarm.get("code"))
    print(alarm.get("severity"))
    # Grab description text
    descr = alarm.find("description").text
    # Parse "thing=other" into dict like {'thing': 'other'}
    info = {}
    for dl in descr.splitlines():
        if len(dl.strip()) > 0:
            key, _, value = dl.partition("=")
            info[key.strip()] = value.strip()
    print(info)

I'm not completely sure about Python, but after some quick research:
Since you can already get all of the content of the description tag, can you not split it on line breaks and then split each line on the equals sign with str.split() to get the name and value separately?
e.g.
for description_tag in dom.getElementsByTagName('description'):
    description = ""
    for node in description_tag.childNodes:
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            description += node.data
    tag = description.split('=')
    tagName = tag[0]
    tagValue = tag[1]
(I haven't taken into account splitting each line up and looping)
But that should get you on the right track :)

AFAIK there is no library that handles this text as DOM elements.
You can however (once you have the description text in the message variable) do:
description = {}
messageParts = message.split("\n")
for part in messageParts:
    if "=" not in part:
        continue  # skip blank or malformed lines
    descInfo = part.split("=", 1)
    description[descInfo[0].strip()] = descInfo[1].strip()
Then description holds the information you need as a key-value map.
You should also add more error handling to this code.
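Putting the answers above together, here is a minimal sketch that stays with minidom (as in the question) and turns each description into a dictionary keyed by alarm code. The helper names node_text and parse_description are made up for illustration, and the sample XML is the one from the question with the elided part left out.

```python
from xml.dom import minidom

# Sample data copied from the question (the "..." part is omitted).
xml_text = """<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
</alarm-dictionary>"""


def node_text(element):
    """Concatenate the text and CDATA children of a minidom element."""
    return "".join(
        n.data for n in element.childNodes
        if n.nodeType in (n.TEXT_NODE, n.CDATA_SECTION_NODE)
    )


def parse_description(text):
    """Turn 'key = value' lines into a dict; lines without '=' are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


dom = minidom.parseString(xml_text)
alarms = {}
for alarm_tag in dom.getElementsByTagName("alarm"):
    descr = alarm_tag.getElementsByTagName("description")[0]
    alarms[alarm_tag.getAttribute("code")] = parse_description(node_text(descr))

print(alarms["402"]["probable_cause"])
```

Using partition() instead of split() means a value that itself contains an equals sign is kept whole.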

Related

Downloading emails with UTF-8 B encoded header

I have a problem with code that is supposed to download emails as .eml files.
It's supposed to go through the INBOX email listing, retrieve each email's content and attachments (if any), and create an .eml file containing all of that.
It works for emails with a text content type and for the majority of multipart ones. But if an email in the listing has a UTF-8 B-encoded header, the code simply behaves as if it had reached the end of the listing, without displaying any error.
The code in question is:
result, data = p.uid('search', None, search_criteria)  # search_criteria is defined earlier in code
if result == 'OK':
    data = get_newer_emails_first(data)  # get_newer_emails_first() is a function defined to return the list of UIDs in reverse order (newer first)
    context['emailsum'] = len(data)  # total amount of emails based on the search_criteria parameter.
    for num in data:
        mymail2 = {}
        result, data1 = p.uid('fetch', num, '(RFC822)')
        email_message = email.message_from_bytes(data1[0][1])
        fullemail = email_message.as_bytes()
        default_charset = 'ASCII'
        if email_message.is_multipart():
            m_subject = make_header(decode_header(email_message['Subject']))
        else:
            m_subject = r''.join([ six.text_type(t[0], t[1] or default_charset) for t in email.header.decode_header(email_message['Subject']) ])
        m_from = string(make_header(decode_header(email_message['From'])))
        m_date = email_message['Date']
My tests show that the fullemail variable contains the email correctly (so the data is read from the actual email successfully), which means the problem should be in the if/else immediately after it, but I cannot find what exactly it is.
Any ideas?
PS: I accidentally posted this question as a guest, but I opted to delete it and repost it from my account.
Apparently the error was in my own code, in the silliest of ways: the built-in conversion function is str, not string.
Instead of:
m_from = string(make_header(decode_header(email_message['From'])))
m_date = email_message['Date']
It should be:
m_from = str(make_header(decode_header(email_message['From'])))
m_date = str(make_header(decode_header(email_message['Date'])))
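For reference, a small sketch of how a UTF-8 B-encoded header decodes with the standard library. The sample subject is made up, and the encoded word is built by hand here only so the example is self-contained; on a real message you would pass email_message['Subject'] (or 'From', 'Date') instead.

```python
import base64
from email.header import decode_header, make_header

# Hypothetical subject, used only to build a sample encoded header.
subject = "Hello Wörld"

# RFC 2047 "B" encoded word: =?charset?B?base64-of-the-bytes?=
raw_header = "=?UTF-8?B?" + base64.b64encode(subject.encode("utf-8")).decode("ascii") + "?="

# decode_header() splits the header into (bytes, charset) parts;
# make_header() reassembles them and str() yields readable text.
decoded = str(make_header(decode_header(raw_header)))

print(raw_header)
print(decoded)
```

The same str(make_header(decode_header(...))) chain also works for headers that are not encoded at all, so it can be applied to every header uniformly.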

Is there a way to parse a XML according to its attributes?

I'm trying to parse my XML using minidom.parse, but the program crashes when the debugger reaches the line xmldoc = minidom.parse(self)
Here is what I have tried:
attribValList = list()
xmldoc = minidom.parse(path)
equipments = xmldoc.getElementsByTagName(xmldoc , elementName)
equipNames = equipments.getElementsByTagName(xmldoc , attributeKey)
for item in equipNames:
    attribValList.append(item.value)
return attribValList
Maybe my XML is too specific for minidom. Here is what it looks like:
<TestSystem id="...">
<Port>58</Port>
<TestSystemEquipment>
<Equipment type="BCAFC">
<Name>System1</Name>
<DU-Junctions>
...
</DU-Junctions>
</Equipment>
Basically I need to get for each Equipment its name and write the names into a list.
Can anybody tell what I'm doing wrong?
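No answer was recorded for this question, so the following is only a sketch of one possible approach, not the asker's code. It relies on two things about minidom: getElementsByTagName() takes just the tag name, and it has to be called on a document or an element rather than on the node list it returns. The id value, the second Equipment entry and the closing tags are invented to make the sample complete, and equipment_names is a hypothetical helper name.

```python
from xml.dom import minidom

# Sample shaped like the XML in the question; id and System2 are assumptions.
xml_text = """<TestSystem id="demo">
  <Port>58</Port>
  <TestSystemEquipment>
    <Equipment type="BCAFC">
      <Name>System1</Name>
    </Equipment>
    <Equipment type="BCAFC">
      <Name>System2</Name>
    </Equipment>
  </TestSystemEquipment>
</TestSystem>"""


def equipment_names(doc):
    """Collect the text of the first Name element under each Equipment."""
    names = []
    for equipment in doc.getElementsByTagName("Equipment"):
        name_nodes = equipment.getElementsByTagName("Name")
        # Guard against a missing or empty Name element.
        if name_nodes and name_nodes[0].firstChild is not None:
            names.append(name_nodes[0].firstChild.data.strip())
    return names


doc = minidom.parseString(xml_text)
names = equipment_names(doc)
print(names)
```

Note that getElementsByTagName() searches all descendants, so if nested elements such as DU-Junctions also contain Name tags, only the first match in document order is used here.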

python ElementTree.Element missing text?

So, I'm parsing this xml file of moderate size (about 27K lines). Not far into it, I'm seeing unexpected behavior from ElementTree.Element where I get Element.text for one entry but not the next, yet it's there in the source XML as you can see:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
<xs:annotation>
<xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
<xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
</xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
<xs:annotation>
<xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
<xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
</xs:annotation>
</xs:enumeration>
When I encounter an enumeration tag I call this function:
import xml.etree.cElementTree as ElementTree
...
def _parse_list_item(xmlns: str, list_id: int, itemElement: ElementTree.Element) -> ListItem:
    if isinstance(itemElement, ElementTree.Element):
        if itemElement.attrib['value'] is not None:
            item_id = itemElement.attrib['value']  # string
            if list_id == 6 and (item_id == '25' or item_id=='24'):
                print(list_id, item_id)  # <== debug break point here
            desc = None
            notes = ""
            for child in itemElement:
                if child.tag == (xmlns + 'annotation'):
                    for grandchild in child:
                        if grandchild.tag == (xmlns + 'documentation'):
                            if desc is None:
                                desc = grandchild.text
                            else:
                                if len(notes)>0:
                                    notes += " "  # add a space
                                notes += grandchild.text or ""
            if item_id is not None and desc is not None:
                return Codex.ListItem({'itemId': item_id, 'listId': list_id, 'description': desc, 'notes': notes})
If I place a breakpoint at the print statement, when I get to the enumeration node for "24" I can look at the text for the grandchild nodes and they are as shown in the XML, i.e. "UPC12..." or "AKA item...", but when I get to the enumeration node for "25", and look at the grandchild text, it's None.
When I remove the xs: namespace by pre-filtering the XML file, the grandchild text comes through fine.
Is it possible I'm over some size limit or is there some syntax problem?
Sorry for less-than-pythonic code but I wanted to be able to examine all the intermediate values in pycharm. It's python 3.6.
Thanks for any insights you may have!
In the for loop, this condition is never met: if child.tag == (xmlns + 'annotation'):.
Why?
Try to output the child's tag. If we suppose your namespace (xmlns) is 'Steve' then:
print(child.tag) will output: {Steve}annotation, not Steveannotation.
So given this fact, if child.tag == (xmlns + 'annotation'): is always False.
You should change it to: if child.tag == ('{'+xmlns+'}annotation'):
With the same logic, you will find out you will also have to change this condition:
if grandchild.tag == (xmlns + 'documentation'):
to:
if grandchild.tag == ('{'+xmlns+'}documentation'):
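To see what ElementTree actually stores in .tag, here is a short sketch using a trimmed copy of the XML from the question. The XS constant and the variable names are just for illustration; whether this matches the asker's xmlns argument depends on what they passed in, which the question does not show.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:xs declaration in the question.
XS = "http://www.w3.org/2001/XMLSchema"

xml_text = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="25">
    <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>"""

root = ET.fromstring(xml_text)

# Qualified names are written as "{namespace-uri}localname".
enum = root.find("{%s}enumeration" % XS)
annotation_tag = enum[0].tag

# iter() with a qualified name walks all matching descendants.
docs = [d.text for d in enum.iter("{%s}documentation" % XS)]

print(annotation_tag)
print(docs)
```

Comparing against the full "{uri}name" form avoids having to strip the xs: prefix from the file beforehand.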
So, ultimately, I solved my problem by pre-processing the XML file to remove the xs: namespace prefix from all of the opening and closing tags, after which the function above processed the file successfully. I'm not sure why namespaces are causing problems; perhaps there is a bug in cElementTree's handling of namespace prefixes in large XML files. To @mzjn: I expect it would be difficult to construct a minimal example, as it processes hundreds of items correctly before it fails, so I would at least have to provide a fairly large XML file. Nevertheless, thanks for being a sounding board.

How do I detect proper nouns in the Google NLP API?

Apologies if this isn't totally clear - I'm a Python copy-the-code-and-try-to-make-it-work developer.
I'm using the Google NLP API in Python 2.7.
When I use analyze_entities(), I can get and print the name, entity type and salience.
Mentions is supposed to contain the noun type: PROPER or COMMON, per this page:
https://cloud.google.com/natural-language/docs/reference/rest/v1beta1/Entity#EntityMention
I can't get mention type from the returned dictionary.
Here's my hideous code:
def entities_text(text, client):
    """Detects entities in the text."""
    language_client = client
    # Instantiates a plain text document.
    document = language_client.document_from_text(text)
    # Detects entities in the document. You can also analyze HTML with:
    # document.doc_type == language.Document.HTML
    entities = document.analyze_entities()
    return entities
articles = os.listdir('articles')
for f in articles:
    language_client = language.Client()
    fname = "articles/" + f
    thisfile = open(fname,'r')
    content = thisfile.read()
    entities = entities_text(content, language_client)
    for e in entities:
        name = e.name.strip()
        type = e.entity_type.strip()
        if e.name.strip()[0].isupper() and len(e.name.strip()) > 2:
            print name, type, e.salience, e.mentions
That returns this:
RELATED OTHER 0.0019081507 [u'RELATED']
Zoe 3 PERSON 0.0016676666 [u'Zoe 3']
Where the value in [] is the mentions.
If I try to get mentions.type, I get an attribute not found error.
I'd appreciate any input.
1) Do not call the "AnalyzeEntities" function, but call the "AnnotateText" one instead.
2) Check for "Proper". Examine its value, it should be "PROPER" and not "PROPER_UNKNOWN" nor "NOT_PROPER".

Python Minidom XML parsing dotted quad/nested children

I've got a gigantic list of varying objects I need to parse, and have multiple questions:
1. The string values within the XML I'm able to parse quite easily (hostname, color, class_name etc.), however anything numerical in nature (IP address, subnet mask etc.) I'm not handling correctly. How do I get it to display the correct dotted quad?
2. What is the correct method (using minidom) to pull information out of deeper children? (see Group object - need 'name' under reference)
3. How can I sanitize (remove) the erroneous [] when a field does not contain a value (netmask, for instance)?
XML looks like one of the two outputs(sanitized):
a) Host object:
<network_object>
<Name>DB1</Name>
<Class_Name>host_plain</Class_Name>
<color><![CDATA[black]]></color>
<ipaddr><![CDATA[192.168.100.100]]></ipaddr>
b) Group object (contains multiple members):
<network_object>
<Name>DB_Servers</Name>
<Class_Name>network_object_group</Class_Name>
<members>
<reference>
<Name>DB1</Name>
<Table>network_objects</Table>
</reference>
<reference>
<Name>DB2</Name>
<Table>network_objects</Table>
</reference>
</members>
<color><![CDATA[black]]></color>
Current output of my code looks like this for a host object:
DB1 host_plain black [<DOM Element: ipaddr at 0x2d05a50>] []
For a network object:
Net_192.168.100.0 network black [<DOM Element: ipaddr at 0x399add0>] [<DOM Element: netmask at 0x399af10>]
For a group object:
DB_Servers network_object_group black [] []
My code:
from xml.dom import minidom
net_xml = minidom.parse("network_objects.xml")
NetworkObjectsTag = net_xml.getElementsByTagName("network_objects")[0]
# Pull individual network objects
NetworkObjectTag = NetworkObjectsTag.getElementsByTagName("network_object")
for network_object in NetworkObjectTag:
    name = network_object.getElementsByTagName("Name")[0].firstChild.data
    class_name = network_object.getElementsByTagName("Class_Name")[0].firstChild.data
    color = network_object.getElementsByTagName("color")[0].firstChild.data
    ipaddr = network_object.getElementsByTagName("ipaddr")
    netmask = network_object.getElementsByTagName("netmask")
    print(name,class_name,color,ipaddr,netmask)
Edit: I've been able to get some output that resolves #1, but it seems I'm hitting a limit I'm unaware of.
New code:
ipElement = network_object.getElementsByTagName("ipaddr")[0]
ipaddr = ipElement.firstChild.data
maskElement = network_object.getElementsByTagName("netmask")[0]
netmask = maskElement.firstChild.data
This gives me the output I'm looking for, but it stops after 6-9 entries with 'builtins.IndexError: list index out of range'.
I've been able to answer all of my questions except how to properly handle the network_group_object. I'll make another post for that specifically.
Here's my new code:
from xml.dom import minidom
net_xml = minidom.parse("network_objects.xml")
NetworkObjectsTag = net_xml.getElementsByTagName("network_objects")[0]
# Pull individual network objects
NetworkObjectTag = NetworkObjectsTag.getElementsByTagName("network_object")
for network_object in NetworkObjectTag:
    # Reset optional fields so values from the previous object do not carry over
    ipaddr = netmask = ipaddr_first = ipaddr_last = ""
    name = network_object.getElementsByTagName("Name")[0].firstChild.data
    class_name = network_object.getElementsByTagName("Class_Name")[0].firstChild.data
    color = network_object.getElementsByTagName("color")[0].firstChild.data
    ipElement = network_object.getElementsByTagName("ipaddr")
    if ipElement:
        ipElement = network_object.getElementsByTagName("ipaddr")[0]
        ipaddr = ipElement.firstChild.data
    maskElement = network_object.getElementsByTagName("netmask")
    if maskElement:
        maskElement = network_object.getElementsByTagName("netmask")[0]
        netmask = maskElement.firstChild.data
    #address_ranges
    ipaddr_firstElement = network_object.getElementsByTagName("ipaddr_first")
    if ipaddr_firstElement:
        ipaddr_firstElement = network_object.getElementsByTagName("ipaddr_first")[0]
        ipaddr_first = ipaddr_firstElement.firstChild.data
    ipaddr_lastElement = network_object.getElementsByTagName("ipaddr_last")
    if ipaddr_lastElement:
        ipaddr_lastElement = network_object.getElementsByTagName("ipaddr_last")[0]
        ipaddr_last = ipaddr_lastElement.firstChild.data
    if ipaddr_firstElement:
        print(name,class_name,ipaddr,netmask,ipaddr_first,ipaddr_last,color)
    else:
        print(name,class_name,ipaddr,netmask,color)
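For the group objects that were left open, one possible approach (a sketch only, not the poster's solution) is to read direct children for the object's own fields and walk the reference elements for the members. The helpers child_text and member_names are hypothetical names, and the wrapping network_objects element and closing tags in the sample are assumptions, since the question shows only fragments.

```python
from xml.dom import minidom

# Sample assembled from the fragments in the question; wrapper and closing tags are assumed.
xml_text = """<network_objects>
<network_object>
<Name>DB1</Name>
<Class_Name>host_plain</Class_Name>
<color><![CDATA[black]]></color>
<ipaddr><![CDATA[192.168.100.100]]></ipaddr>
</network_object>
<network_object>
<Name>DB_Servers</Name>
<Class_Name>network_object_group</Class_Name>
<members>
<reference><Name>DB1</Name><Table>network_objects</Table></reference>
<reference><Name>DB2</Name><Table>network_objects</Table></reference>
</members>
<color><![CDATA[black]]></color>
</network_object>
</network_objects>"""


def child_text(parent, tag, default=""):
    """Text of the first direct child element named tag, or default if absent."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            return "".join(
                c.data for c in node.childNodes
                if c.nodeType in (c.TEXT_NODE, c.CDATA_SECTION_NODE)
            ).strip()
    return default


def member_names(network_object):
    """Names listed under members/reference; empty for non-group objects."""
    return [child_text(ref, "Name")
            for ref in network_object.getElementsByTagName("reference")]


doc = minidom.parseString(xml_text)
rows = []
for obj in doc.getElementsByTagName("network_object"):
    rows.append((child_text(obj, "Name"), child_text(obj, "ipaddr"),
                 child_text(obj, "netmask"), member_names(obj)))

for row in rows:
    print(row)
```

Looking only at direct children keeps the group's own Name separate from the Name elements nested under reference, and the default value replaces the empty [] for missing fields.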
