Extracting XML Element and Attribute Data with Python 3

I'm looking to extract the values of a particular attribute from a particular element, using Python 3.
An example of the element in question (Atom3d):
<Atom3d ID="18" Mapping="43" Parent="2" Name="C7"
XYZ="0.0148299997672439,0.283699989318848,1.0291999578476" Connections="33,39"
TemperatureType="Isotropic" IsotropicTemperature="0.0677"
AnisotropicTemperature="0,0,0,0,0,0,0,0,0" Occupancy="0.708" Components="C"/>
I need to extract the XYZ value, and further need to take this value and separate the comma-separated numbers within it. I need to use these numbers in another input file of a different format, so I was thinking to assign them to three separate variables and take it from there.
I'm very inexperienced with Python, and completely so when it comes to XML. I'm not sure of which libraries I would need to use, if such libraries even exist and how to use them if they do.

http://docs.python.org/3/library/xml.etree.elementtree.html
>>> from xml.etree import ElementTree as ET
>>> elem = ET.fromstring('''<Atom3d ID="18" Mapping="43" Parent="2" Name="C7"
... XYZ="0.0148299997672439,0.283699989318848,1.0291999578476" Connections="33,39"
... TemperatureType="Isotropic" IsotropicTemperature="0.0677"
... AnisotropicTemperature="0,0,0,0,0,0,0,0,0" Occupancy="0.708" Components="C"/>
... ''')
Get the attribute value using get('attribute-name'):
>>> elem.get('XYZ')
'0.0148299997672439,0.283699989318848,1.0291999578476'
Split the string on ',':
>>> elem.get('XYZ').split(',')
['0.0148299997672439', '0.283699989318848', '1.0291999578476']
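The question also asks for three separate numeric variables. A minimal sketch of that last step, assuming the attribute always holds exactly three comma-separated numbers (the element below is a shortened copy of the one in the question):

```python
from xml.etree import ElementTree as ET

# Shortened copy of the Atom3d element from the question.
elem = ET.fromstring(
    '<Atom3d ID="18" Name="C7" '
    'XYZ="0.0148299997672439,0.283699989318848,1.0291999578476"/>'
)

# Split the attribute on commas and convert each piece to a float.
# Unpacking raises ValueError if there are not exactly three values.
x, y, z = (float(part) for part in elem.get('XYZ').split(','))

print(x, y, z)
```

For a whole document, the same two lines could be applied to every element returned by iterating over the parsed tree with iter('Atom3d').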

Related

Find for multiple tags' values with lxml

I am using lxml to parse an XML like this sample one:
<compounddef xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="d2/db7/class_foo" kind="class">
<compoundname>FooClass</compoundname>
<sectiondef kind="public-type">
<memberdef kind="typedef" id="d2/db7/class_bar">
<type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
<definition>StructFooDefinition</definition>
</memberdef>
</sectiondef>
</compounddef>
I'm trying to get the element whose <ref> has the refid "d3/d73/struct_foo" and whose <definition> contains the text "Foo".
There could be many refs with that refid and many definitions containing Foo, but only one element has this combination.
I am able to first find all the elements with that refid and then filter this list by checking which of them contains "Foo" in the <definition>, but since I'm working with a really big XML file (~1 GB) and the application is time sensitive, I wanted to avoid this.
I tried combining the various etree paths using the keyword 'and' or '//precede:...', but without success.
My last try was:
self.dox_tree_root_.xpath(".//compounddef[@kind = 'class']//memberdef[@kind='typedef'][/type/ref[@refid='%s'] and contains(definition, 'name')]" % (independent_type_refid, name)))
but it is giving me an error.
Is there a way to combine the two filters inside one command?

You can use XPATH
//a[.//ref[@refid="12345"] and contains(c, "Good")]

If I understand you correctly, this should get you close enough:
.//compounddef[@kind = 'class']//memberdef[@kind='typedef'][./type/ref[@refid='d3/d73/struct_foo']][contains(.//definition, 'Foo')]//definition
Output:
StructFooDefinition
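For comparison, here is a rough sketch of the same two-condition filter using only the standard library. xml.etree.ElementTree supports a limited XPath subset without contains(), so the substring test is done in Python instead. The helper name find_typedefs is made up for this illustration, and the XML is a trimmed copy of the sample from the question:

```python
import xml.etree.ElementTree as ET

XML = """<compounddef id="d2/db7/class_foo" kind="class">
  <compoundname>FooClass</compoundname>
  <sectiondef kind="public-type">
    <memberdef kind="typedef" id="d2/db7/class_bar">
      <type><ref refid="d3/d73/struct_foo" kindref="compound">StructFoo</ref></type>
      <definition>StructFooDefinition</definition>
    </memberdef>
  </sectiondef>
</compounddef>"""


def find_typedefs(root, refid, text):
    """Return typedef memberdefs whose ref has `refid` and whose definition contains `text`."""
    matches = []
    for compound in root.iter('compounddef'):
        if compound.get('kind') != 'class':
            continue
        for member in compound.iter('memberdef'):
            if member.get('kind') != 'typedef':
                continue
            # Attribute equality is part of ElementTree's XPath subset.
            has_ref = member.find("./type/ref[@refid='%s']" % refid) is not None
            # The substring check has to happen in Python.
            definition = member.findtext('definition', default='')
            if has_ref and text in definition:
                matches.append(member)
    return matches


root = ET.fromstring(XML)
found = find_typedefs(root, 'd3/d73/struct_foo', 'Foo')
print([m.findtext('definition') for m in found])
```

Both conditions are checked in a single pass over the tree, so no intermediate list of refid matches is built.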

How Can I Extract Variables from a LaTeX Doc into a Python Dictionary So That I Can Pull it into Django?

I'm pretty new to Django and LaTeX so I'm hoping that someone out there has done something like this before:
I'm trying to create a Django app that can read a LaTeX file, extract all of the variables (things of this form: "\newcommand{\StartDate}{January 1, 2018}") and place them as key/value pairs into a dictionary that I can work with inside Django.
The idea is that each variable in the LaTeX file starts with a placeholder value. I'll be building a dynamic form that uses the dictionary to create fields and values and lets a user replace each placeholder value with a real one. After a user has set all of the values, I'd like to be able to write those new values back into the LaTeX file and generate a PDF from it.
I've tried regular expressions but have run into trouble because some of the 'variables' will contain blocks of LaTeX, such as lists. I've also looked at TexSoup, which seems very promising, but I haven't been able to figure it out completely yet. Here is a section from the preamble of an example LaTeX file like the ones I'll be dealing with:
%% Project Name
\newcommand{\projectName}{Project Name}
%% Start and End dates
\newcommand{\startDate}{January 1, 2018}
\newcommand{\finDate}{December 31, 2018}
%% Name of User
\newcommand{\userName}{aUser}
% What tasks will be a part of this process?
\newcommand{\tasks}{
\begin{itemize}[noitemsep,topsep=0pt]
\item Planning of \projectName{} on \startDate{}
\item Construction of \projectName{}
\item Configuration of \projectName{} by \userName{} on \finDate{}
\end{itemize}
}
Using TexSoup, I'm able to pull the LaTeX file into an object and collect all instances of '\newcommand' from a generator that I can iterate over:
from TexSoup import TexSoup
soup = TexSoup(open('slatex.tex'))
newcommands = list(soup.find_all('newcommand'))
I know that this is pulling each '\newcommand' into its own element and maintaining the formats properly because I can easily print them out one at a time.
I'm stuck trying to figure out how to pull the '\newcommand' from each item, get the name of the item into a dictionary key and the value into a dictionary value. I'd like to think that TexSoup exposes those with some kind of attribute or method but I can't find anything about it. If it doesn't, am I back to looking at regular expressions again?

Each of the \newcommands has two required arguments, denoted using {}. As a result, we can access each newcommand's arguments, and then access the value of each argument.
With your definition of slatex.tex above, we can obtain
{'\\finDate': 'December 31, 2018', '\\startDate': 'January 1, 2018'}
using the following script:
from pprint import pprint
from TexSoup import TexSoup
soup = TexSoup(open('slatex.tex'))
newcommands = list(soup.find_all('newcommand'))
result = {}
for newcommand in newcommands:
    key, value = newcommand.args
    result[key.value] = value.value
pprint(result)
*On a side note, TexSoup doesn't yet understand that these redefined variables will have tangible impact on the rest of the document. It treats them as any other command, passively.
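If TexSoup is not an option, the nesting problem that defeats plain regular expressions can also be handled by counting braces. The following is only a sketch under simplifying assumptions: every definition has the form \newcommand{\name}{value}, there are no escaped braces such as \{, and no braces appear in comments. The function name extract_newcommands is made up for this illustration.

```python
def extract_newcommands(tex):
    r"""Collect \newcommand{\name}{value} pairs by counting braces."""

    def read_group(start):
        # `start` must index an opening brace.
        # Returns (contents, index just after the matching closing brace).
        depth = 0
        for i in range(start, len(tex)):
            if tex[i] == '{':
                depth += 1
            elif tex[i] == '}':
                depth -= 1
                if depth == 0:
                    return tex[start + 1:i], i + 1
        raise ValueError('unbalanced braces')

    result = {}
    marker = r'\newcommand{'
    pos = tex.find(marker)
    while pos != -1:
        name, after_name = read_group(pos + len(marker) - 1)
        if tex.startswith('{', after_name):
            value, end = read_group(after_name)
            result[name] = value.strip()
        else:
            # Forms with optional arguments, e.g. [2], are skipped here.
            end = after_name
        pos = tex.find(marker, end)
    return result


sample = r"""
\newcommand{\startDate}{January 1, 2018}
\newcommand{\tasks}{
\begin{itemize}
\item Planning on \startDate{}
\end{itemize}
}
"""
commands = extract_newcommands(sample)
print(commands[r'\startDate'])
```

Because the scan resumes after each closing brace, nested groups such as \begin{itemize} stay inside the value they belong to.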

Parse XML attribute to variable with ElementTree

Hello, I'm writing a bit of code in Maya and running into some issues with ElementTree. I need help reading in this XML, or something similar. The XML is generated based on a selection, so it can change.
<root>
<Locations>
<1 name="CacheLocation">C:\Users\daunish\Desktop</1>
</Locations>
<Objects>
<1 name="Sphere">[u'pSphere1', u'pSphere2']</1>
<2 name="Cube">[u'pCube1']</2>
</Objects>
</root>
I need a way of searching for a particular "name" inside "Locations", and passing the text to a variable.
I also need a way of going through each line inside "Objects" and performing a function on it, as in a for loop.
I'm open to all suggestions; I have been going crazy trying to get this to work. If you think I should format the XML differently, I'm up for that as well. Thanks in advance for the help.

[Note: your XML is not well formed because you can't have tags that start with a number]
Not sure what you've tried but there are many ways to do this, here's one:
Find the first element with name=CacheLocation in Locations:
>>> filename = root.find("./Locations/*[@name='CacheLocation']").text
>>> filename
'C:\\Users\\daunish\\Desktop'
Iterating over all the elements in Objects:
>>> import ast
>>> for target in root.find("./Objects"):
...     for i in ast.literal_eval(target.text):
...         print(target.get('name'), i)
...
Sphere pSphere1
Sphere pSphere2
Cube pCube1
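Put together as a script, this could look roughly like the sketch below. It assumes the XML is regenerated in a well-formed shape: the numeric tags are replaced with a hypothetical <item> tag, which is not part of the original question.

```python
import ast
import xml.etree.ElementTree as ET

# Well-formed variant of the question's XML. The <item> tag is a
# hypothetical replacement for the numeric tags <1> and <2>.
XML = r"""<root>
  <Locations>
    <item name="CacheLocation">C:\Users\daunish\Desktop</item>
  </Locations>
  <Objects>
    <item name="Sphere">[u'pSphere1', u'pSphere2']</item>
    <item name="Cube">[u'pCube1']</item>
  </Objects>
</root>"""

root = ET.fromstring(XML)

# Look up one location by its name attribute.
cache_location = root.find("./Locations/*[@name='CacheLocation']").text

# Turn each Objects entry into a real Python list, keyed by name.
objects = {}
for target in root.find("./Objects"):
    objects[target.get('name')] = ast.literal_eval(target.text)

print(cache_location)
print(objects)
```

ast.literal_eval only evaluates Python literals, so it is a safer way to read the stored lists than eval.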

Search for specific XML element Attribute values

Using Python ElementTree to construct and edit test messages:
Part of XML as follows:
<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201" MtchTyp="4" LastMkt="ABCD" LastPx="104.11">
The attribute TrdID contains a value beginning with $$ to mark it as variable data that needs to be amended once the message is constructed from a template, in this case to the next sequential number. That number is stored in a dictionary: the overall idea is to load a dictionary from a file that lists each attribute key with its associated value, such as the next sequential number (e.g. the dictionary file contains "$$+TrdID# 12345", using a space as the delimiter).
So far my script iterates the parsed XML and examines each indexed element in turn. There will be several fields in the xml file that require updating so I need to avoid using hard coded references to element tags.
How can I search the element/attribute to identify if the attribute contains a key where the corresponding value starts with or contains the specific string $$?
And for reasons unknown to me we cannot use lxml!

You can use XPath.
import lxml.etree as etree
from io import StringIO
xml = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#"
RptTyp="0"
TrdDt="20120201"
MtchTyp="4"
LastMkt="ABCD"
LastPx="104.11"/>
</FIXML>"""
tree = etree.parse(StringIO(xml))
To find elements TrdMtchRpt where the attribute TrdID starts with $$:
r = tree.xpath("//TrdMtchRpt[starts-with(@TrdID, '$$')]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
If you want to find any element where at least one attribute starts with $$ you can do this:
r = tree.xpath("//*[@*[starts-with(., '$$')]]")
r[0].tag == 'TrdMtchRpt'
r[0].get("TrdID") == '$$+TrdID#'
Look at the documentation:
http://lxml.de/xpathxslt.html#the-xpath-method
http://www.w3schools.com/xpath/xpath_functions.asp#string
http://www.w3schools.com/xpath/xpath_syntax.asp

You can use the ElementTree package. It gives you an object with a hierarchical data structure built from the XML document.
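Since the question rules out lxml, here is a minimal standard-library sketch of that idea. It walks every element, checks every attribute value for the $$ prefix, and swaps in a replacement. The substitutions dictionary is a stand-in for whatever would be loaded from the dictionary file.

```python
import xml.etree.ElementTree as ET

XML = """<FIXML>
<TrdMtchRpt TrdID="$$+TrdID#" RptTyp="0" TrdDt="20120201"
            MtchTyp="4" LastMkt="ABCD" LastPx="104.11"/>
</FIXML>"""

# Hypothetical substitutions, as they might be loaded from the dictionary file.
substitutions = {'$$+TrdID#': '12345'}

root = ET.fromstring(XML)
for element in root.iter():
    # Copy the items so the attributes can be updated while looping.
    for key, value in list(element.attrib.items()):
        if value.startswith('$$'):
            # Leave the placeholder untouched if no replacement is known.
            element.set(key, substitutions.get(value, value))

print(ET.tostring(root, encoding='unicode'))
```

No tag or attribute names are hard-coded, so the same loop covers any number of placeholder fields in the template.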

xml missing element in python

The system uses a DOM parser in Python 2.7.2. The goal is to extract the .db file and use it on SQL Server. I currently have no problem with the sqlite3 library. I have read similar questions and answers about how to handle a missing element while parsing XML files, but I still couldn't figure out the solution. The XML has 15000+ elements. Here is the basic structure of the XML:
<topo>
<vlancard>
<id>4545</id>
<nodeValue>21</nodeValue>
<vlanName>voice</vlanName>
</vlancard>
<vlancard>
<id>1234</id>
<nodeValue>42</nodeValue>
<vlanName>camera</vlanName>
</vlancard>
<vlancard>
<id>9876</id>
<nodeValue>84</nodeValue>
</vlancard>
</topo>
Like the 3rd element, several elements do not have the <vlanName> node. That makes the element counts inconsistent, i.e.
from xml.dom import minidom
xmldoc = minidom.parse(r'c:\vlan.xml')  # raw string, so that \v is not read as an escape
vlId = xmldoc.getElementsByTagName('id')
vlValue = xmldoc.getElementsByTagName('nodeValue')
vlName = xmldoc.getElementsByTagName('vlanName')
After running the module:
IndexError: list index out of range
>>> len(vlId)
16163
>>> len(vlName)
16155
Because of this, the order of the elements gets mixed up: when filling the table, the parser skips the missing elements, so values end up paired with the wrong records. I use a simple while loop to insert the values into the table.
x = 0
while x < len(vlId):
    c.execute('''insert into vlan ('id','nodeValue','vlanName') values ('%s','%s','%s') ''' % (vlId[x].firstChild.nodeValue, vlValue[x].firstChild.nodeValue, vlName[x].firstChild.nodeValue))
    x = x + 1
How else can I do this? Any help will be appreciated.
Yusuf

Instead of collecting every id, nodeValue and vlanName across the whole XML and then inserting, go through each vlancard, retrieve its id/value/name, and insert those into the DB.
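A sketch of that per-vlancard approach, using the sample XML from the question. The in-memory database, the table definition and the helper child_text are assumptions made for this illustration. A missing vlanName simply becomes NULL, so the columns can no longer drift out of step.

```python
import sqlite3
from xml.dom import minidom

XML = """<topo>
  <vlancard><id>4545</id><nodeValue>21</nodeValue><vlanName>voice</vlanName></vlancard>
  <vlancard><id>1234</id><nodeValue>42</nodeValue><vlanName>camera</vlanName></vlancard>
  <vlancard><id>9876</id><nodeValue>84</nodeValue></vlancard>
</topo>"""


def child_text(card, tag):
    """Text of the first <tag> inside `card`, or None when it is missing or empty."""
    nodes = card.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.nodeValue
    return None


conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute('create table vlan (id text, nodeValue text, vlanName text)')

xmldoc = minidom.parseString(XML)
for card in xmldoc.getElementsByTagName('vlancard'):
    row = (child_text(card, 'id'),
           child_text(card, 'nodeValue'),
           child_text(card, 'vlanName'))
    # Placeholders let sqlite3 handle quoting; None is stored as NULL.
    conn.execute('insert into vlan (id, nodeValue, vlanName) values (?, ?, ?)', row)

rows = conn.execute('select id, nodeValue, vlanName from vlan').fetchall()
print(rows)
```

Using ? placeholders instead of % string formatting also avoids quoting problems when a value contains an apostrophe.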
