using python to parse xml data - python

I have a question about XML and Python. I want to comb through an XML file, look for certain tags, and within those tags find data separated by a comma, then split it so that each value goes on a new line. I have the logic down; I'm just not familiar enough with Python to know which modules I should be researching. Any pointers on where to start would help.
172.28.18.142,10.0.0.2
thanks

When it comes to XML parsing in Python there are a few options: lxml, the standard library's xml package, and BeautifulSoup. Most of my experience is with the first two, and I've found lxml to be much faster than xml. Here's an lxml snippet that finds all elements with a particular tag and stores the comma-separated text of a child tag as a list. You'll probably want to add try/except blocks and tinker with the details, but this should get you started.
from lxml import etree
file_path = r'C:\Desktop\some_file.xml'
tree = etree.parse(file_path)
info_list = []
my_tag_path = tree.xpath('//topTag')
for elem in my_tag_path:
    # Use the first <childTag> descendant, if it exists and has text
    child = elem.find('.//childTag')
    if child is not None and child.text:
        info_list.append(child.text.split(','))
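If installing lxml is not an option, roughly the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch under assumptions: the sample document, the tag name `addresses`, and the helper `split_tag_values` are made up for illustration, since the question does not show the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are placeholders.
SAMPLE = """<hosts>
  <host><addresses>172.28.18.142,10.0.0.2</addresses></host>
  <host><addresses>192.168.1.5</addresses></host>
  <host><addresses/></host>
</hosts>"""


def split_tag_values(xml_text, tag):
    """Return one entry per comma-separated value found in every <tag>."""
    root = ET.fromstring(xml_text)
    values = []
    for elem in root.iter(tag):
        text = elem.text or ""  # an empty element has text == None
        values.extend(part.strip() for part in text.split(",") if part.strip())
    return values


lines = split_tag_values(SAMPLE, "addresses")
print("\n".join(lines))  # one value per line
```

Each comma-separated value ends up as its own list entry, so writing them out one per line is just a join.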


python to parse xml to get value

I have an XML response from one of my systems and I am trying to get a value out of it using Python. I would appreciate an expert's view on where my mistake is.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns3:loginResponse xmlns:ns2="http://ws.core.product.xxxxx.com/groupService/" xmlns:ns3="http://ws.core.product.xxxxx.com/loginService/" xmlns:ns4="http://ws.core.product.xxxxx.com/userService/"><ns3:return>YWeVDwuZwHdxxxxxxxxxxx_GqLtkNTE.</ns3:return></ns3:loginResponse>
I am using the code below and have had no luck getting the value YWeVDwuZwHdxxxxxxxxxxx_GqLtkNTE. I haven't parsed XML with namespaces before. response.text contains the XML response above.
responsetree = ET.ElementTree(ET.fromstring(response.text))
responseroot = responsetree.getroot()
for a in responseroot.iter('return'):
    print(a.attrib)
YWeVDwuZwHdxxxxxxxxxxx_GqLtkNTE is not in attrib; it is the element's text.
The attrib in this case is an empty dict.
See https://www.cmi.ac.in/~madhavan/courses/prog2-2012/docs/diveintopython3/xml.html about parsing XML documents that use namespaces.
The reference from the other answer helped me understand the concepts.
Once I understood the XML structure, it was plain and simple. I'm adding the result here in case it helps someone in the future as a quick reference.
responsetree = ET.ElementTree(ET.fromstring(response.text))
responseroot = responsetree.getroot()
responseroot[0].text
I'm keeping it simple for the sake of understanding. You might need to check len(responseroot) and/or loop over the children with a condition to get the right value. You can also use find or findall to get the item you are interested in.
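For a quick reference, here is a minimal sketch of looking the element up by name instead of by position, using the response from the question. The prefix `login` in the mapping is an arbitrary name I picked; only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

# The response from the question, standing in for response.text.
response_text = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<ns3:loginResponse '
    'xmlns:ns2="http://ws.core.product.xxxxx.com/groupService/" '
    'xmlns:ns3="http://ws.core.product.xxxxx.com/loginService/" '
    'xmlns:ns4="http://ws.core.product.xxxxx.com/userService/">'
    '<ns3:return>YWeVDwuZwHdxxxxxxxxxxx_GqLtkNTE.</ns3:return>'
    '</ns3:loginResponse>'
)

# Map a prefix of our choosing to the namespace URI used by <ns3:return>.
ns = {"login": "http://ws.core.product.xxxxx.com/loginService/"}

root = ET.fromstring(response_text)

# A bare tag name does not match, because the real tag is '{uri}return'.
unqualified = list(root.iter("return"))

# A namespace-qualified lookup finds the element; the value is its text.
token_elem = root.find("login:return", ns)
token = token_elem.text if token_elem is not None else None
print(token)
```

This also shows why the loop in the question printed nothing useful: the unqualified name `return` never matches the namespaced tag.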

lxml xpath and find return nothing

Python 2.7
I assume I'm missing something incredibly basic having to do with lxml but I have no idea what it is. By way of background, I have not used lxml much before but have used Xpaths extensively in Selenium and have also done a bit of parsing with BS4.
So, I'm making a call to this API that returns some XML as a string. Easy enough:
from lxml import etree
from io import StringIO
myXML = 'xml here'
tree = etree.parse(StringIO(myXML))
print tree.xpath('/IKnowThisTagExistsInMyXML')
It always returns [] or None. I've tried tree.find() and tree.findall() as well, to no avail.
I'm hoping someone has seen this before and can tell me what's going on.
An XPath of /IKnowThisTagExistsInMyXML assumes the tag IKnowThisTagExistsInMyXML is the top-level (root) element of your XML document, which I really doubt it is.
Try searching the whole XML document for this tag instead:
print tree.xpath('//*/IKnowThisTagExistsInMyXML')
See: XPath Syntax

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
for Reel in root.findall('Reel'):
    id = Reel.findtext('Id')
    print(id)
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root/child etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata that worked perfectly, but when I tried to use it for different values in another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags below Reel, at any depth, you can use:
ids = [id.text for id in Reel.findall(".//Id")]
This gives you a list with the text of every Id tag that is a descendant of Reel.
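To make the difference concrete, here is a small sketch using a shortened copy of the Reel from the question (I trimmed the other child tags so it stays short):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <Reel> from the question.
REEL_XML = """<Reel>
  <Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
  <AssetList>
    <MainPicture>
      <Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
    </MainPicture>
    <MainSound>
      <Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
    </MainSound>
  </AssetList>
</Reel>"""

reel = ET.fromstring(REEL_XML)

# 'Id' is a relative path: it only looks at direct children of <Reel>.
direct_id = reel.findtext("Id")

# './/Id' searches all descendants, in document order.
all_ids = [elem.text for elem in reel.findall(".//Id")]

print(direct_id)
print(all_ids)
```

The first lookup returns only the Reel's own Id, while the second also picks up the Ids nested under MainPicture and MainSound.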
edit:
Your updated example uses namespaces. In this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so you need to include that namespace in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
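If you are limited to the standard library, ElementTree in Python 3.8 and later accepts a `{*}` wildcard for the namespace part of a tag, which gives a similar "ignore the namespace" search. A minimal sketch, using a cut-down copy of the KDM from the question (only two KeyId entries kept):

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the KDM document from the question.
KDM_XML = """<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">
  <AuthenticatedPublic Id="ID_AuthenticatedPublic">
    <RequiredExtensions>
      <KeyIdList>
        <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
        <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
      </KeyIdList>
    </RequiredExtensions>
  </AuthenticatedPublic>
</DCinemaSecurityMessage>"""

root = ET.fromstring(KDM_XML)

# Without a namespace the search finds nothing, because every tag
# inherits the default namespace declared on the root element.
plain = root.findall(".//KeyId")

# '{*}KeyId' matches KeyId in any namespace (requires Python 3.8+).
key_ids = [elem.text for elem in root.findall(".//{*}KeyId")]

print(plain)
print(key_ids)
```

On older Python versions the explicit namespace map shown earlier is the safer route.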
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
for node in tree.iter():
    if 'KeyId' in node.tag:
        key_id = node.text
        print(key_id)

Speedier/less resource-demolishing way to strip html from large files than BeautifulSoup? Or, a better way to use BeautifulSoup?

Currently I am having trouble typing this because, according to top, my processor is at 100% and my memory is at 85.7%, all being taken up by python.
Why? Because I had it go through a 250-meg file to remove markup. 250 megs, that's it! I've been manipulating these files in python with so many other modules and things; BeautifulSoup is the first code to give me any problems with something so small. How are nearly 4 gigs of RAM used to manipulate 250megs of html?
The one-liner that I found (on stackoverflow) and have been using was this:
''.join(BeautifulSoup(corpus).findAll(text=True))
Additionally, this seems to remove everything BUT markup, which is sort of the opposite of what I want to do. I'm sure that BeautifulSoup can do that, too, but the speed issue remains.
Is there anything that will do something similar (remove markup, leave text reliably) and NOT require a Cray to run?
lxml.html is FAR more efficient.
http://lxml.de/lxmlhtml.html
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Looks like this will do what you want.
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
A couple of other similar questions: python [lxml] - cleaning out html tags
lxml.etree, element.text doesn't return the entire text from an element
Filter out HTML tags and resolve entities in python
UPDATE:
You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content()
from lxml import html
from lxml.html.clean import clean_html
tree = html.parse('http://www.example.com')
tree = clean_html(tree)
text = tree.getroot().text_content()
(From: Remove all html in python?)
Use Cleaner from lxml.html.clean:
>>> import lxml.html
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(style=True) # to delete scripts styles objects comments etc;)
>>> html = lxml.html.fromstring(content).xpath('//body')[0]
>>> print cleaner.clean_html(html)
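If neither BeautifulSoup nor lxml is available, a different technique from the ones above is to stream the file through the standard library's html.parser, which never builds a tree in memory. This is a rough sketch rather than a drop-in replacement: the class `TextExtractor` and the function `strip_markup` are names I made up, and html.parser is more forgiving than exact about broken markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_markup(chunks):
    """Feed an iterable of HTML string chunks and return the plain text."""
    parser = TextExtractor()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return "".join(parser.parts)


# Chunks may split the markup at arbitrary points.
sample = [
    "<html><head><style>p {color: red}</style></head>",
    "<body><p>Hello, <b>wor",
    "ld</b> &amp; all</p>",
    "<script>var x = 1;</script></body></html>",
]
text = strip_markup(sample)
print(text)
```

For a large file you could pass a generator that reads the file in blocks, so only one block of markup is held at a time; the extracted text itself is still accumulated in a list here.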

regex to parse tables wrapped into xml

Suppose we have a table:
Key|Val|Flag
01 |AAA| Y
02 |BBB| N
...
wrapped into xml this way:
<Data>
<R><F>Key</F><F>Val</F><F>Flag</F></R>
<R><F>01</F><F>AAA</F><F>Y</F></R>
<R><F>02</F><F>BBB</F><F>N</F></R>
...
</Data>
There can be more columns and rows, obviously.
Now I'd like to parse XML back to table using single regex.
I can find all fields with '<F>([\w\d]*)</F>', but I need them to be grouped by rows somehow.
I thought about <R>(<F>([\w\d]*)</F>)*</R>, but the Python implementation finds nothing.
Can someone please help me compose the regex?
UPDATE
Some context of the question.
I'm aware of the many XML parsing libraries, but unfortunately my environment is limited to the standard library. Anyway, thanks to everyone who warned against using regexes for XML parsing.
I needed a quick and dirty solution, so I decided to start with regexes and switch to a parser later.
So far I have the code:
...
row_p = r'<R>(.*?)</R>'
field_p = r'<F>(.*?)</F>'
table = ''
for row in re.finditer(row_p, xml):
    table += '|'.join(re.findall(field_p, row.group(1))) + '\n'
...
It works for small datasets (about 10,000 rows) but fails for tables larger than 500,000 rows.
Maybe I'll investigate why it fails, but the next step I'm going to take is to switch to a standard XML parser. ElementTree is the first candidate.
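For the ElementTree route, a possible sketch is to use iterparse so that rows are handled one at a time instead of loading the whole document. The function name `xml_to_table` is made up for illustration, and an in-memory BytesIO stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET


def xml_to_table(source):
    """Yield one '|'-joined line per <R> row, reading the XML incrementally."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "R":
            yield "|".join((field.text or "") for field in elem.findall("F"))
            # Free the row's <F> children; the emptied <R> element
            # itself stays attached to the root.
            elem.clear()


# Small in-memory stand-in for the real file.
sample = io.BytesIO(
    b"<Data>"
    b"<R><F>Key</F><F>Val</F><F>Flag</F></R>"
    b"<R><F>01</F><F>AAA</F><F>Y</F></R>"
    b"<R><F>02</F><F>BBB</F><F>N</F></R>"
    b"</Data>"
)
table = "\n".join(xml_to_table(sample))
print(table)
```

With a real file you would pass the file name or an open binary file to xml_to_table and write each yielded line out as it arrives, rather than joining everything into one string.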
Mandatory links:
RegEx match open tags except XHTML self-contained tags and
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Use an XML parser. lxml is very good and even provides (among other XML-related thingies) XPath - if you got a fetish with oneliners, I'm sure there is an XPath oneliner to extract these elements ;)
import libxml2
txt = '\n<Data>\n <R><F>Key</F><F>Val</F><F>Flag</F></R>\n <R><F>01</F><F>AAA</F><F>Y</F></R>\n <R><F>02</F><F>BBB</F><F>N</F></R>\n</Data>\n'
rows = []
for elem in libxml2.parseDoc(txt):
    if elem.name == 'R':
        curRow = []
        rows.append(curRow)
    elif elem.name == 'F':
        curRow.append(elem.get_content())
returns:
rows = [['Key', 'Val', 'Flag'], ['01', 'AAA', 'Y'], ['02', 'BBB', 'N']]
If this question were tagged Perl, I could post a full solution with code, but since this is Python, here is just the main idea.
I suggest you load the XML file and read it line by line. Loop over each line until the end of the file and find all fields within that line; re.findall returns the matches as a list. Roughly:
import re
with open('data.xml') as xml_file:  # 'data.xml' is a placeholder file name
    for line in xml_file:
        fields = re.findall(r'<F>([\w\d]*)</F>', line)
        if fields:
            print('|'.join(fields))
DISCLAIMER: The above code is just a rough sketch.
Oh by the way, if possible, use an XML parser instead.
lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
