Python Minidom XML parsing dotted quad/nested children

Python Minidom XML parsing dotted quad/nested children - python

I've got a gigantic list of varying objects I need to parse, and have multiple questions:
The string values within XML I'm able to parse quite easily (hostname, color,class_name etc), however anything numerical in nature (ip address/subnet mask etc) I'm not doing correctly. How do I get it to display the correct dotted quad?
What is the correct method (using minidom) to pull information out of deeper children? (see Group object - need 'name' under reference)
How can I sanitize (remove) the erroneous [] when a field does not contain a value (netmask for instance).
XML looks like one of the two outputs(sanitized):
a) Host object:
<network_object>
<Name>DB1</Name>
<Class_Name>host_plain</Class_Name>
<color><![CDATA[black]]></color>
<ipaddr><![CDATA[192.168.100.100]]></ipaddr>
b) Group object (contains multiple members):
<network_object>
<Name>DB_Servers</Name>
<Class_Name>network_object_group</Class_Name>
<members>
<reference>
<Name>DB1</Name>
<Table>network_objects</Table>
</reference>
<reference>
<Name>DB2</Name>
<Table>network_objects</Table>
</reference>
</members>
<color><![CDATA[black]]></color>
Current output of my code looks like this for a host object:
DB1 host_plain black [<DOM Element: ipaddr at 0x2d05a50>] []
For a network object:
Net_192.168.100.0 network black [<DOM Element: ipaddr at 0x399add0>] [<DOM Element: netmask at 0x399af10>]
For a group object:
DB_Servers network_object_group black [] []
My code:
from xml.dom import minidom
net_xml = minidom.parse("network_objects.xml")
NetworkObjectsTag = net_xml.getElementsByTagName("network_objects")[0]
# Pull individual network objects
NetworkObjectTag = NetworkObjectsTag.getElementsByTagName("network_object")
for network_object in NetworkObjectTag:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
class_name = network_object.getElementsByTagName("Class_Name")[0].firstChild.data
color = network_object.getElementsByTagName("color")[0].firstChild.data
ipaddr = network_object.getElementsByTagName("ipaddr")
netmask = network_object.getElementsByTagName("netmask")
print(name,class_name,color,ipaddr,netmask)
Edit: I've been able to get some output to resolve #1, however it seems I'm reaching a limit I'm unware of.
New code:
ipElement = network_object.getElementsByTagName("ipaddr")
ipaddr = ipElement.firstChild.data
maskElement = network_object.getElementsByTagName("netmask")
netmask = maskElement.firstChild.data
Gives me the output I'm looking for, however it seems to stop after 6-9 entries noting that 'builtins.IndexError: list index out of range'

I've been able to answer all of my questions except how to properly handle the network_group_object. I'll make another post for that specifically.
Here's my new code:
from xml.dom import minidom
net_xml = minidom.parse("network_objects.xml")
NetworkObjectsTag = net_xml.getElementsByTagName("network_objects")[0]
# Pull individual network objects
NetworkObjectTag = NetworkObjectsTag.getElementsByTagName("network_object")
for network_object in NetworkObjectTag:
name = network_object.getElementsByTagName("Name")[0].firstChild.data
class_name = network_object.getElementsByTagName("Class_Name")[0].firstChild.data
color = network_object.getElementsByTagName("color")[0].firstChild.data
ipElement = network_object.getElementsByTagName("ipaddr")
if ipElement:
ipElement = network_object.getElementsByTagName("ipaddr")[0]
ipaddr = ipElement.firstChild.data
maskElement = network_object.getElementsByTagName("netmask")
if maskElement:
maskElement = network_object.getElementsByTagName("netmask")[0]
netmask = maskElement.firstChild.data
#address_ranges
ipaddr_firstElement = network_object.getElementsByTagName("ipaddr_first")
if ipaddr_firstElement:
ipaddr_firstElement = network_object.getElementsByTagName("ipaddr_first")[0]
ipaddr_first = ipaddr_firstElement.firstChild.data
ipaddr_lastElement = network_object.getElementsByTagName("ipaddr_last")
if ipaddr_lastElement:
ipaddr_lastElement = network_object.getElementsByTagName("ipaddr_last")[0]
ipaddr_last = ipaddr_lastElement.firstChild.data
if ipaddr_firstElement:
print(name,class_name,ipaddr,netmask,ipaddr_first,ipaddr_last,color)
else:
print(name,class_name,ipaddr,netmask,color)

Related

Extract Text from a word document

I am trying to scrape data from a word document available at:-
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
stat = content.paragraphs[i].text
if 'Email' in stat:
location.append(i)
for i in location:
print(content.paragraphs[i].text)
I tried to use the steps mentioned:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above.
Still facing issues with the same.

There are some inconsistencies in the document - phone numbers starting with Tel: sometimes, and Tel.: other times, and even Te: once, and I noticed one of the emails is just in the last line for that distributor without the Email: prefix, and the State isn't always in the last line.... Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
try:
return BeautifulSoup(
para.paragraph_format.element.xml, 'xml'
).find('color').get('w:val')
except:
return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
ptc = [(
p.text, getParaColor(p), p.paragraph_format.element.xml
) for p in paras]
curSectn = 'UNKNOWN'
splitBlox = [{}]
for pt, pc, px in ptc:
# double-check for missing text
xmlText = BeautifulSoup(px, 'xml').text
xmlText = ' '.join([s for s in xmlText.split() if s != ''])
if len(xmlText) > len(pt): pt = xmlText
# initiate
if not pt:
if splitBlox[-1] != {}:
splitBlox.append({})
continue
if pc == '20752E':
curSectn = pt.strip()
continue
if splitBlox[-1] == {}:
splitBlox[-1]['section'] = curSectn
splitBlox[-1]['raw'] = []
splitBlox[-1]['Name'] = []
splitBlox[-1]['address_raw'] = []
# collect
splitBlox[-1]['raw'].append(pt)
if pc == 'D12229':
splitBlox[-1]['Name'].append(pt)
elif re.search("^Te.*:.*", pt):
splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
elif re.search("^Mob.*:.*", pt):
splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
else:
splitBlox[-1]['address_raw'].append(pt)
# some cleanup
if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
for i in range(len(splitBlox)):
addrsParas = splitBlox[i]['address_raw'] # for later
# join lists into strings
splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
for k in ['raw', 'address_raw']:
splitBlox[i][k] = '\n'.join(splitBlox[i][k])
# search address for City, State and PostCode
apLast = addrsParas[-1].split(',')[-1]
maybeCity = [ap for ap in addrsParas if '–' in ap]
if '–' not in apLast:
splitBlox[i]['State'] = apLast.strip()
if maybeCity:
maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
splitBlox[i]['City'] = maybeCity.strip()
splitBlox[i]['PostCode'] = maybePIN.strip()
# add mobile to tel
if 'mobile_raw' in splitBlox[i]:
if 'tel_raw' not in splitBlox[i]:
splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
else:
splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
del splitBlox[i]['mobile_raw']
# split tel [as needed]
if 'tel_raw' in splitBlox[i]:
tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
telNum = []
for t in range(len(tel_i)):
if '/' in tel_i[t]:
tns = [t.strip() for t in tel_i[t].split('/')]
tel1 = tns[0]
telNum.append(tel1)
for tn in tns[1:]:
telNum.append(tel1[:-1*len(tn)]+tn)
else:
telNum.append(tel_i[t])
splitBlox[i]['Tel_1'] = telNum[0]
splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
#import docx
#import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
'section', 'Name', 'address_raw', 'City',
'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]

Python - Display key values before and after Findall(regex, output)

I am trying to extract MAC addresses for each NIC from Dell's RACADM output such that my output should be like below:
NIC.Slot.2-2-1 --> 24:84:09:3E:2E:1B
I have used the following to extract the output
output = subprocess.check_output("sshpass -p {} ssh {}#{} racadm {}".format(args.password,args.username,args.hostname,args.command),shell=True).decode()
Part of output
https://pastebin.com/cz6LbcxU
Each component details are displayed between ------ lines
I want to search Device Type = NIC and then print Instance ID and Permanent MAC.
regex = r'Device Type = NIC'
match = re.findall(regex, output, flags=re.MULTILINE|re.DOTALL)
match = re.finditer(regex, output, flags=re.S)
I used both the above functions to extract the match but how do I print [InstanceID: NIC.Slot.2-2-1] and PermanentMACAddress of the Matched regex.
Please help anyone?

If I understood correctly,
you can search for the pattern [InstanceID: ...] to get the instance id,
and PermanentMACAddress = ... to get the MAC address.
Here's one way to do it:
import re
match_inst = re.search(r'\[InstanceID: (?P<inst>[^]]*)', output)
match_mac = re.search(r'PermanentMACAddress = (?P<mac>.*)', output)
inst = match_inst.groupdict()['inst']
mac = match_mac.groupdict()['mac']
print('{} --> {}'.format(inst, mac))
# prints: NIC.Slot.2-2-1 --> 24:84:09:3E:2E:1B
If you have multiple records like this and want to map NIC to MAC, you can get a list of each, zip them together to create a dictionary:
inst = re.findall(r'\[InstanceID: (?P<inst>[^]]*)', output)
mac = re.findall(r'PermanentMACAddress = (?P<mac>.*)', output)
mapping = dict(zip(inst, mac))

Your output looks like INI file content, you could try to parse them using configparser.
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.read_string(output)
>>> for section in config.sections():
... print(section)
... print(config[section]['Device Type'])
...
InstanceID: NIC.Slot.2-2-1
NIC
>>>

How do I instantiate a group of objects from a text file?

I have some log files that look like many lines of the following:
<tickPrice tickerId=0, field=2, price=201.81, canAutoExecute=1>
<tickSize tickerId=0, field=3, size=25>
<tickSize tickerId=0, field=8, size=534349>
<tickPrice tickerId=0, field=2, price=201.82, canAutoExecute=1>
I need to define a class of type tickPrice or tickSize. I will need to decide which to use before doing the definition.
What would be the Pythonic way to grab these values? In other words, I need an effective way to reverse str() on a class.
The classes are already defined and just contain the presented variables, e.g., tickPrice.tickerId. I'm trying to find a way to extract these values from the text and set the instance attributes to match.
Edit: Answer
This is what I ended up doing-
with open(commandLineOptions.simulationFilename, "r") as simulationFileHandle:
for simulationFileLine in simulationFileHandle:
(date, time, msgString) = simulationFileLine.split("\t")
if ("tickPrice" in msgString):
msgStringCleaned = msgString.translate(None, ''.join("<>,"))
msgList = msgStringCleaned.split(" ")
msg = message.tickPrice()
msg.tickerId = int(msgList[1][9:])
msg.field = int(msgList[2][6:])
msg.price = float(msgList[3][6:])
msg.canAutoExecute = int(msgList[4][15:])
elif ("tickSize" in msgString):
msgStringCleaned = msgString.translate(None, ''.join("<>,"))
msgList = msgStringCleaned.split(" ")
msg = message.tickSize()
msg.tickerId = int(msgList[1][9:])
msg.field = int(msgList[2][6:])
msg.size = int(msgList[3][5:])
else:
print "Unsupported tick message type"

I'm not sure how you want to dynamically create objects in your namespace, but the following will at least dynamically create objects based on your loglines:
Take your line:
line = '<tickPrice tickerId=0, field=2, price=201.81, canAutoExecute=1>'
Remove chars that aren't interesting to us, then split the line into a list:
line = line.translate(None, ''.join('<>,'))
line = line.split(' ')
Name the potential class attributes for convenience:
line_attrs = line[1:]
Then create your object (name, base tuple, dictionary of attrs):
tickPriceObject = type(line[0], (object,), { key:value for key,value in [at.split('=') for at in line_attrs]})()
Prove it works as we'd expect:
print(tickPriceObject.field)
# 2

Approaching the problem with regex, but with the same result as tristan's excellent answer (and stealing his use of the type constructor that I will never be able to remember)
import re
class_instance_re = re.compile(r"""
<(?P<classname>\w[a-zA-Z0-9]*)[ ]
(?P<arguments>
(?:\w[a-zA-Z0-9]*=[0-9.]+[, ]*)+
)>""", re.X)
objects = []
for line in whatever_file:
result = class_instance_re.match(line)
classname = line.group('classname')
arguments = line.group('arguments')
new_obj = type(classname, (object,),
dict([s.split('=') for s in arguments.split(', ')]))
objects.append(new_obj)

parse a special xml in python

I have s special xml file like below:
<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
...
</alarm-dictionary>
I know in python, I can get the "alarm code", "severity" in tag alarm by:
for alarm_tag in dom.getElementsByTagName('alarm'):
if alarm_tag.hasAttribute('code'):
alarmcode = str(alarm_tag.getAttribute('code'))
And I can get the text in tag message like below:
for messages_tag in dom.getElementsByTagName('message'):
messages = ""
for message_tag in messages_tag.childNodes:
if message_tag.nodeType in (message_tag.TEXT_NODE, message_tag.CDATA_SECTION_NODE):
messages += message_tag.data
But I also want to get the value like dnkind(database), type(quality_of_service), perceived_severity(thresholdCrossed) and probable_cause(Database memory usage low threshold crossed
) in tag description.
That is, I also want to parse the content in the tag in xml.
Could anyone help me with this?
Thanks a lot!

Once you have the text from the description tag, it's nothing to do with XML parsing. You just need do simple string-parsing to get the type = quality_of_service keys/values strings into something nicer to use in Python like a dictionary
With some slightly simpler parsing thanks to ElementTree, it would look like this
messages = """
<alarm-dictionary source="DDD" type="ProxyComponent">
<alarm code="402" severity="Alarm" name="DDM_Alarm_402">
<message>Database memory usage low threshold crossed</message>
<description>dnKinds = database
type = quality_of_service
perceived_severity = minor
probable_cause = thresholdCrossed
additional_text = Database memory usage low threshold crossed
</description>
</alarm>
...
</alarm-dictionary>
"""
import xml.etree.ElementTree as ET
# Parse XML
tree = ET.fromstring(messages)
for alarm in tree.getchildren():
# Get code and severity
print alarm.get("code")
print alarm.get("severity")
# Grab description text
descr = alarm.find("description").text
# Parse "thing=other" into dict like {'thing': 'other'}
info = {}
for dl in descr.splitlines():
if len(dl.strip()) > 0:
key, _, value = dl.partition("=")
info[key.strip()] = value.strip()
print info

I'm not completely sure on Python, but after quick research.
Seeing as you can already get all of the content from the description tag in XML, can you not split by line breaks, and then split each line using the str.split() function on the equals signs to give you name / value separately?
e.g.
for messages_tag in dom.getElementsByTagName('message'):
messages = ""
for message_tag in messages_tag.childNodes:
if message_tag.nodeType in (message_tag.TEXT_NODE, message_tag.CDATA_SECTION_NODE):
messages += message_tag.data
tag = str.split('=');
tagName = tag[0]
tagValue = tag[1]
(I haven't taken into account splitting each line up and looping)
But that should get you on the right track :)

AFAIK there is no library to handle the text as DOM elements.
You can however (after you have the message in the message variable) do:
description = {}
messageParts = message.split("\n")
for part in messageParts:
descInfo = part.split("=")
description[descInfo[0].strip()] = descInfo[1].strip()
then you'll have inside description the information you need in the form of a key-value map.
You should also add error handling on my code...

Generating Xml using python

Kindly have a look at below code i am using this to generate a xml using python .
from lxml import etree
# Some dummy text
conn_id = 5
conn_name = "Airtelll"
conn_desc = "Largets TRelecome"
ip = "192.168.1.23"
# Building the XML tree
# Note how attributes and text are added, using the Element methods
# and not by concatenating strings as in your question
root = etree.Element("ispinfo")
child = etree.SubElement(root, 'connection',
number = str(conn_id),
name = conn_name,
desc = conn_desc)
subchild_ip = etree.SubElement(child, 'ip_address')
subchild_ip.text = ip
# and pretty-printing it
print etree.tostring(root, pretty_print=True)
This will produce:
<ispinfo>
<connection desc="Largets TRelecome" number="5" name="Airtelll">
<ip_address>192.168.1.23</ip_address>
</connection>
</ispinfo>
But i want it to be like :
<ispinfo>
<connection desc="Largets TRelecome" number='1' name="Airtelll">
<ip_address>192.168.1.23</ip_address>
</connection>
</ispinfo>
Mean number attribute should be come in a single quote .Any idea ....How can i achieve this

There is no flag in lxml to do this, so you have to resort to manual manipulation.
import re
re.sub(r'number="([0-9]+)"',r"number='\1'", etree.tostring(root, pretty_print=True))
However, why do you want to do this? As there is no difference other than cosmetics.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Minidom XML parsing dotted quad/nested children - python

Related

Extract Text from a word document

Python - Display key values before and after Findall(regex, output)

How do I instantiate a group of objects from a text file?

parse a special xml in python

Generating Xml using python

Categories

Resources