python lxml get the name of a node

This is my xml file:
<FuzzyComparison>
    <Modules>
        <Module>
            <name>AutosoukModelMakeFuzzyComparisonModule</name>
            <configurationLoader>DefaultLoader</configurationLoader>
            <configurationFile>MakesModels.conf</configurationFile>
            <settings></settings>
        </Module>
        <Module>
            <name>DefaultFuzzyComparisonModule</name>
            <configurationLoader>DefaultLoader</configurationLoader>
            <configurationFile>Buildings.conf</configurationFile>
            <settings>
                <attribute>building</attribute>
            </settings>
        </Module>
    </Modules>
</FuzzyComparison>
This is the code I've been trying to parse it with:
from lxml import etree
class AttributesXMLParser():
    def __init__(self):
        self.doc = etree.parse('Items.xml')
    def getValueOfTag(self, tagName):
        # Returns the text of a specific tag; for example, tagName could be "FirstDate"
        return self.doc.find(tagName).text
    def loadFuzzySettings(self):
        modulesDict = list()
        modules = self.doc.findall('FuzzyComparison/Modules/Module')
        for module in modules:
            moduleDict = dict()
            moduleName = module.find('name').text
            moduleDict['name'] = moduleName
            moduleConfigurationLoader = module.find('configurationLoader').text
            moduleDict['configurationLoader'] = moduleConfigurationLoader
            moduleConfigurationFile = module.find('configurationFile').text
            moduleDict['moduleConfigurationFile'] = moduleConfigurationFile
            settings = module.findall('settings')
            settingsDict = dict()
            for oneSetting in settings:
                settingsDict[oneSetting] = oneSetting.text
            moduleDict['settings'] = settingsDict
            modulesDict.append(moduleDict)
        return modulesDict
And this is the result:
[{'moduleConfigurationFile': 'MakesModels.conf', 'configurationLoader': 'DefaultLoader', 'name': 'AutosoukModelMakeFuzzyComparisonModule', 'settings': {<Element settings at 0x25257c8>: None}},
 {'moduleConfigurationFile': 'Buildings.conf', 'configurationLoader': 'DefaultLoader', 'name': 'DefaultFuzzyComparisonModule', 'settings': {<Element settings at 0x2525e48>: '\n\t\t\t\t'}}]
My problem
I don't know how to get the name and value of the children of the settings node. As you can see, everything works except the settings. I need them to look like this:
"attribute": building
But my code gives me:
{<Element settings at 0x2525e48>: '\n\t\t\t\t'}}]
Could you please help me solve this?

Since findall() returns a list of elements, you want to iterate over the children of each element in that list, rather than over the list itself. You also want to use each child's tag as the key, rather than the element itself.
settingsDict = {}
for settingsNode in module.findall('settings'):
    for setting in settingsNode:
        settingsDict[setting.tag] = setting.text
Or, if you only have one settings tag,
settingsDict = {}
for setting in module.find('settings'):
    settingsDict[setting.tag] = setting.text
Which can be simplified to:
settingsDict = {setting.tag: setting.text
                for setting in module.find('settings')}
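For reference, here is a minimal end-to-end sketch of this approach applied to a trimmed copy of the XML from the question. It is an illustration, not the asker's final code: the XML is inlined as a string, the function name load_fuzzy_settings is made up, and the import falls back to the standard library's xml.etree.ElementTree (which offers the same find/findall/tag/text calls used here) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree

XML = """
<FuzzyComparison>
  <Modules>
    <Module>
      <name>AutosoukModelMakeFuzzyComparisonModule</name>
      <settings></settings>
    </Module>
    <Module>
      <name>DefaultFuzzyComparisonModule</name>
      <settings>
        <attribute>building</attribute>
      </settings>
    </Module>
  </Modules>
</FuzzyComparison>
"""


def load_fuzzy_settings(xml_text):
    """Return one dict per <Module>, with its settings keyed by child tag name."""
    # Here FuzzyComparison is the root element, so the path starts at Modules.
    root = etree.fromstring(xml_text.strip())
    modules = []
    for module in root.findall('Modules/Module'):
        settings_node = module.find('settings')
        # find() returns None if <settings> is missing; treat that as "no settings".
        children = settings_node if settings_node is not None else []
        modules.append({
            'name': module.find('name').text,
            'settings': {child.tag: child.text for child in children},
        })
    return modules


modules = load_fuzzy_settings(XML)
print(modules[1]['settings'])
```

An empty <settings></settings> element simply yields an empty dict, because it has no children to iterate over.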

Related

python docxtpl - cant create dynamic form

I have been trying to create a dynamic table in word but the created file has all the contents except the list inside the options tag. I have tried a lot of stuff but can't get it to work.
from docxtpl import DocxTemplate
from registration.models import stuff
def create_doc(options, model, file='resources/stuff.docx'):
    doc = DocxTemplate(file)
    log = stuff(options_object=model)
    log.save()
    l = []
    for i in range(1, len(options) + 1):
        x = {}
        x['number'] = i
        x['course'] = options[i - 1]
        l.append(x)
    context = {
        'name': model.name,
        'reg_no': model.regno,
        'roll_no': model.rollno,
        'rank': model.student_rank,
        'course': model.alloted_to,
        'cat': model.alloted_cat,
        'options': l
    }
    doc.render(context)
    doc.save(f'media/request_{model.rollno}.docx')
The template
The output

I solved it by using a different name than "options". I'm guessing it messed with the parameter that was being passed somehow and caused the issue.
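A sketch of what that rename might look like when building the context. The key name course_options and the helper build_context are hypothetical (the answer does not say which name was chosen), and the loop tag in the template would have to refer to the same name. The docxtpl calls are left out so the snippet runs without a template file.

```python
def build_context(options, name, roll_no):
    """Build a render context whose row list is stored under a key other than 'options'."""
    rows = [{'number': number, 'course': course}
            for number, course in enumerate(options, start=1)]
    return {
        'name': name,
        'roll_no': roll_no,
        'course_options': rows,  # hypothetical replacement for the 'options' key
    }


context = build_context(['Physics', 'Chemistry'], 'Alice', '42')
print(context['course_options'])
```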

Using "info.get" for a child element in Python / lxml

I'm trying to get the attribute of a child element in Python, using lxml.
This is the structure of the xml:
<GroupInformation groupId="crid://thing.com/654321" ordered="true">
    <GroupType value="show" xsi:type="ProgramGroupTypeType"/>
    <BasicDescription>
        <Title type="main" xml:lang="EN">A programme</Title>
        <RelatedMaterial>
            <HowRelated href="urn:eventis:metadata:cs:HowRelatedCS:2010:boxCover">
                <Name>Box cover</Name>
            </HowRelated>
            <MediaLocator>
                <mpeg7:MediaUri>file://ftp.something.com/Images/123456.jpg</mpeg7:MediaUri>
            </MediaLocator>
        </RelatedMaterial>
    </BasicDescription>
The code I've got is below. The bit I want to return is the 'value' attribute ("show" in the example) of the GroupType element, which the code assigns to gtype (third line from the bottom):
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010', 'mpeg7': 'urn:tva:mpeg7:2008'}
with open(file_name + '.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:GroupInformation', namespaces=nsmap):
        crid = (info.get('groupId'))
        grouptype = info.find('.//xmlns:GroupType', namespaces=nsmap)
        gtype = grouptype.get('value')
        titlex = info.find('.//xmlns:BasicDescription/xmlns:Title', namespaces=nsmap)
        title = titlex.text if titlex is not None else 'Missing'
Can anyone explain to me how to implement it? I had a quick look at the xsi namespace, but was unable to get it to work (and didn't know if it was the right thing to do).

Is this what you are looking for?
grouptype.attrib['value']
PS: Why the parentheses around the assigned values? They look unnecessary.
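A small self-contained sketch of reading that attribute from the child element. The XML is a trimmed version of the sample wrapped in a made-up Root element, the tva prefix is an arbitrary label for the urn:tva:metadata:2010 namespace taken from the question, and the import falls back to xml.etree.ElementTree when lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree

XML = (
    '<Root xmlns="urn:tva:metadata:2010" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<GroupInformation groupId="crid://thing.com/654321" ordered="true">'
    '<GroupType value="show" xsi:type="ProgramGroupTypeType"/>'
    '</GroupInformation>'
    '</Root>'
)

# The prefix is only a local label; it must map to the document's namespace URI.
ns = {'tva': 'urn:tva:metadata:2010'}

root = etree.fromstring(XML)
results = []
for info in root.findall('.//tva:GroupInformation', namespaces=ns):
    grouptype = info.find('tva:GroupType', namespaces=ns)
    # find() returns None when the child is missing, so guard before reading.
    # .get('value') returns None for a missing attribute, while
    # .attrib['value'] would raise KeyError instead.
    gtype = grouptype.get('value') if grouptype is not None else 'Missing'
    results.append((info.get('groupId'), gtype))

print(results)
```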

Difficulty creating lxml Element subclass

I’m trying to create a subclass of the Element class. I’m having trouble getting started though.
from lxml import etree
try:
    import docx
except ImportError:
    from docx import docx
class File(etree.ElementBase):
    def _init(self):
        etree.ElementBase._init(self)
        self.body = self.append(docx.makeelement('body'))
f = File()
relationships = docx.relationshiplist()
title = 'File'
subject = 'A very special File'
creator = 'Me'
keywords = ['python', 'Office Open XML', 'Word']
coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                                keywords=keywords)
appprops = docx.appproperties()
contenttypes = docx.contenttypes()
websettings = docx.websettings()
wordrelationships = docx.wordrelationships(relationships)
docx.savedocx(f, coreprops, appprops, contenttypes, websettings,
              wordrelationships, 'file.docx')
When I try to open the document that is outputted from this code, my version of Word (2003 with compatibility pack) gives me the following error: “This file was created by a previous beta version of Word 2007 and cannot be opened in this version.” When I replace the File object with a different Element created with docx.newdocument(), the document comes out fine. Any ideas/advice?

I don't really see why you want a separate class named File.
As Michael0x2a said, you didn't add a document tag, so it won't work (I don't think Word 2007 can read your file either).
Here is the corrected code:
from lxml import etree
try:
    import docx
except ImportError:
    from docx import docx
class File(object):
    def makeelement(self, tagname, tagtext=None, nsprefix='w', attributes=None,
                    attrnsprefix=None):
        '''Create an element & return it'''
        # Deal with list of nsprefix by making namespacemap
        # (nsprefixes is the prefix-to-namespace map defined in the docx module)
        namespacemap = None
        if isinstance(nsprefix, list):
            namespacemap = {}
            for prefix in nsprefix:
                namespacemap[prefix] = docx.nsprefixes[prefix]
            # FIXME: rest of code below expects a single prefix
            nsprefix = nsprefix[0]
        if nsprefix:
            namespace = '{'+docx.nsprefixes[nsprefix]+'}'
        else:
            # For when namespace = None
            namespace = ''
        newelement = etree.Element(namespace+tagname, nsmap=namespacemap)
        # Add attributes with namespaces
        if attributes:
            # If they haven't bothered setting attribute namespace, use an empty
            # string (equivalent of no namespace)
            if not attrnsprefix:
                # Quick hack: it seems every element that has a 'w' nsprefix for
                # its tag uses the same prefix for its attributes
                if nsprefix == 'w':
                    attributenamespace = namespace
                else:
                    attributenamespace = ''
            else:
                attributenamespace = '{'+docx.nsprefixes[attrnsprefix]+'}'
            for tagattribute in attributes:
                newelement.set(attributenamespace+tagattribute,
                               attributes[tagattribute])
        if tagtext:
            newelement.text = tagtext
        return newelement
    def __init__(self):
        super(File,self).__init__()
        self.document = self.makeelement('document')
        self.document.append(self.makeelement('body'))
f = File()
relationships = docx.relationshiplist()
title = 'File'
subject = 'A very special File'
creator = 'Me'
keywords = ['python', 'Office Open XML', 'Word']
coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                                keywords=keywords)
appprops = docx.appproperties()
contenttypes = docx.contenttypes()
websettings = docx.websettings()
wordrelationships = docx.wordrelationships(relationships)
docx.savedocx(f.document, coreprops, appprops, contenttypes, websettings,
              wordrelationships, 'file.docx')
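The essential change is composition instead of subclassing: the wrapper owns a w:document element with a w:body child, and that element is what gets passed to savedocx. A stripped-down sketch of just that idea, without the docx helper module, might look like this. The class name DocumentWrapper is made up, the namespace URI is the WordprocessingML main namespace, and the import falls back to xml.etree.ElementTree when lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree

# WordprocessingML main namespace (the one bound to the 'w' prefix).
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


class DocumentWrapper(object):
    """Holds a document element instead of trying to be one."""

    def __init__(self):
        self.document = etree.Element('{%s}document' % W_NS)
        # SubElement creates the child and appends it in one step.
        self.body = etree.SubElement(self.document, '{%s}body' % W_NS)


wrapper = DocumentWrapper()
print(wrapper.document.tag)
```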

Python OOP Project Organization

I'm a bit new to Python dev -- I'm creating a larger project for some web scraping. I want to approach this as "Pythonically" as possible, and would appreciate some help with the project structure. Here's how I'm doing it now:
Basically, I have a base class for an object whose purpose is to go to a website and parse some specific data on it into its own array, jobs[]
minion.py
class minion:
    # Empty getJobs() function to be defined by object pre-instantiation
    def getJobs(self):
        pass
    # Constructor for a minion that requires site authorization
    # Ex: minCity1 = minion('Some city', 'http://portal.com/somewhere', 'user', 'password')
    # or  minCity2 = minion('Some city', 'http://portal.com/somewhere')
    def __init__(self, title, URL, user='', password=''):
        self.title = title
        self.URL = URL
        self.user = user
        self.password = password
        self.jobs = []
        if (user == '' and password == ''):
            self.reqAuth = 0
        else:
            self.reqAuth = 1
    def displayjobs(self):
        for j in self.jobs:
            j.display()
I'm going to have about 100 different data sources. The way I'm doing it now is to just create a separate module for each "Minion", which defines (and binds) a more tailored getJobs() function for that object
Example: minCity1.py
from minion import minion
from BeautifulSoup import BeautifulSoup
import urllib2
from job import job
# MINION CONFIG
minTitle = 'Some city'
minURL = 'http://www.somewebpage.gov/'
# Here we define a function that will be bound to this object's getJobs function
def getJobs(self):
    page = urllib2.urlopen(self.URL)
    soup = BeautifulSoup(page)
    # For each row
    for tr in soup.findAll('tr'):
        tJob = job()
        span = tr.findAll(['span', 'class="content"'])
        # If row has 5 spans, pull data from span 2 and 3 ( [1] and [2] )
        if len(span) == 5:
            tJob.title = span[1].a.renderContents()
            tJob.client = 'Some City'
            tJob.source = minURL
            tJob.due = span[2].div.renderContents().replace('<br />', '')
            self.jobs.append(tJob)
# Don't forget to bind the function to the object!
minion.getJobs = getJobs
# Instantiate the object
mCity1 = minion(minTitle, minURL)
I also have a separate module which simply contains a list of all the instantiated minion objects (which I have to update each time I add one):
minions.py
from minion_City1 import mCity1
from minion_City2 import mCity2
from minion_City3 import mCity3
from minion_City4 import mCity4
minionList = [mCity1,
              mCity2,
              mCity3,
              mCity4]
main.py references minionList for all of its activities for manipulating the aggregated data.
This seems a bit chaotic to me, and was hoping someone might be able to outline a more Pythonic approach.
Thank you, and sorry for the long post!

Instead of creating functions and assigning them to objects (or whatever minion is, I'm not really sure), you should definitely use classes instead. Then you'll have one class for each of your data sources.
If you want, you can even have these classes inherit from a common base class, but that isn't absolutely necessary.
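A minimal sketch of that suggestion, assuming one subclass per data source and a shared base class. The names Minion, CityOneMinion, CityTwoMinion, and fetch_rows are illustrative, and the scraping is replaced by hard-coded rows so the example runs without network access.

```python
class Minion(object):
    """Base class: one subclass per data source."""
    title = ''
    url = ''

    def __init__(self, user='', password=''):
        self.user = user
        self.password = password
        # Authorization is needed as soon as either credential is given.
        self.requires_auth = bool(user or password)
        self.jobs = []

    def fetch_rows(self):
        # Subclasses override this with their site-specific scraping.
        raise NotImplementedError

    def get_jobs(self):
        # Tag every row with the source it came from.
        self.jobs = [dict(row, client=self.title, source=self.url)
                     for row in self.fetch_rows()]
        return self.jobs


class CityOneMinion(Minion):
    title = 'Some city'
    url = 'http://www.somewebpage.gov/'

    def fetch_rows(self):
        # Stand-in for the real page parsing.
        return [{'title': 'Road repair', 'due': '2024-01-31'}]


class CityTwoMinion(Minion):
    title = 'Another city'
    url = 'http://portal.com/somewhere'

    def fetch_rows(self):
        return []


# Every direct subclass is a data source, so no hand-maintained list is needed.
minions = [cls() for cls in Minion.__subclasses__()]
all_jobs = [job for m in minions for job in m.get_jobs()]
print(all_jobs)
```

With this layout, adding a data source means adding one subclass. Nothing is rebound on the shared class, so one source's get_jobs can no longer overwrite another's, which is what happens when each module assigns its own function to minion.getJobs.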

Node.TEXT_NODE has the value, but I need the Attribute

I have an xml file like so:
<host name='ip-10-196-55-2.ec2.internal'>
    <hostvalue name='arch_string'>lx24-x86</hostvalue>
    <hostvalue name='num_proc'>1</hostvalue>
    <hostvalue name='load_avg'>0.01</hostvalue>
</host>
I can get the Node.data out of a Node.TEXT_NODE, but I also need the attribute name. For example, I want to know that load_avg = 0.01 without writing out load_avg, num_proc, etc. one by one. I want them all.
My code looks like this, but I can't figure out what part of the Node has the attribute name for me.
for stat in h.getElementsByTagName("hostvalue"):
    for node3 in stat.childNodes:
        attr = "foo"
        val = "poo"
        if node3.nodeType == Node.ATTRINUTE_NODE:
            attr = node3.tagName
        if node3.nodeType == Node.TEXT_NODE:
            #attr = node3.tagName
            val = node3.data
From the above code, I'm able to get val, but not attr (compile error:

Here's a short example of what you could do:
from xml.dom import minidom
xmldoc = minidom.parse("so.xml")
values = {}
for stat in xmldoc.getElementsByTagName("hostvalue"):
    attr = stat.attributes["name"].value
    value = "\n".join([x.data for x in stat.childNodes])
    values[attr] = value
print(repr(values))
This outputs, given your XML file:
$ ./parse.py
{u'num_proc': u'1', u'arch_string': u'lx24-x86', u'load_avg': u'0.01'}
Be warned that this is not failsafe: for example, it breaks if you have nested elements inside <hostvalue>.
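A Python 3 sketch of the same idea that tolerates nested elements by joining only the direct text nodes. The XML is inlined with parseString so the snippet is self-contained, and the helper name host_values is made up.

```python
from xml.dom import minidom, Node

XML = """<host name='ip-10-196-55-2.ec2.internal'>
  <hostvalue name='arch_string'>lx24-x86</hostvalue>
  <hostvalue name='num_proc'>1</hostvalue>
  <hostvalue name='load_avg'>0.01</hostvalue>
</host>"""


def host_values(xml_text):
    """Map each hostvalue's name attribute to its text content."""
    doc = minidom.parseString(xml_text)
    values = {}
    for stat in doc.getElementsByTagName('hostvalue'):
        # Attributes live on the element itself, not among its child nodes.
        name = stat.getAttribute('name')
        # Keep only direct text children; nested elements are skipped.
        text = ''.join(child.data for child in stat.childNodes
                       if child.nodeType == Node.TEXT_NODE)
        values[name] = text
    return values


values = host_values(XML)
print(values)
```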
