Can we search for multiple patterns using etree findall() in XML?

In my case, I have to find a few elements in an XML file and update their values through the text attribute. To do that, I have to search for the XML elements A, B and C. My project uses xml.etree and Python. Currently I am using:
self.get_root.findall('H/A/T')
self.get_root.findall('H/B/T')
self.get_root.findall('H/C/T')
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through a similar question, "XPath to select multiple tags", but it didn't work for my case.
Update: maybe regular expression inside the findall()?

The XML in your question is malformed (the closing tags are in the wrong order); assuming it's properly formed (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
    item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>
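ElementTree's limited XPath subset has no union operator, so a pattern such as H|(A,B,C)|T cannot be passed to findall() directly. One possible workaround is to match the middle level with a wildcard and filter on the tag name in Python. This is only a sketch: the sample data (including the extra D element) and the names `wanted` and `texts` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative data: the D branch is an assumption, added to show
# that elements outside the wanted set are left untouched.
data = """<root>
<H><A><T>old</T></A></H>
<H><B><T>old</T></B></H>
<H><D><T>keep</T></D></H>
</root>"""

root = ET.fromstring(data)
wanted = {"A", "B", "C"}  # middle-level tags whose T child should change

# "./H/*" selects every direct child of every H element under the root.
for middle in root.findall("./H/*"):
    if middle.tag in wanted:
        for t in middle.findall("T"):
            t.text = "modified text"

texts = [t.text for t in root.iter("T")]
print(texts)
```

Unlike './/H//T', this leaves T elements under other middle tags alone, at the cost of one extra filtering step in Python.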

Related

How to get values from this XML?

I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
<round_1>
<match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<home_age_average>27.4</home_age_average>
<away_age_average>28.3</away_age_average>
<score>0:2</score>
<ball_possession>46:54</ball_possession>
<shots>8:19</shots>
<shots_on_target>2:6</shots_on_target>
<shots_off_target>5:10</shots_off_target>
<blocked_shots>1:3</blocked_shots>
<corner_kicks>3:4</corner_kicks>
<fouls>10:12</fouls>
<offsides>0:0</offsides>
</match_1>
</round_1>
</matches>
I use the standard library xml package, but I can't get values from the inner tags. Here is my example code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
    print(elem.attrib)
It should work, but it doesn't. I tried to find a problem in the XML structure but couldn't find one. I always get an empty dict. Can you tell me what's wrong?

You are reading .attrib on the element, but those elements have no attributes, so you get an empty dict. If you want to print the inner text of the element, use .text instead of .attrib:
for elem in root.iter('home_goals_time'):
    print(elem.text)
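To make the difference concrete, here is a small sketch on a shortened copy of the match data. The trimmed XML and the variable names are assumptions for illustration, and splitting the goal times on ";" is only a guess at how the field is meant to be read.

```python
import xml.etree.ElementTree as ET

# Shortened version of the match data, for illustration only.
xml_text = """<matches><round_1><match_1>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
</match_1></round_1></matches>"""

root = ET.fromstring(xml_text)
elem = next(root.iter("away_goals_time"))

attrs = elem.attrib  # attributes written inside the start tag (none here)
text = elem.text     # character data between the start and end tags

# Assumed interpretation: goal minutes separated by semicolons.
minutes = [int(m) for m in text.split(";")]
print(attrs, text, minutes)
```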

The reason you're having issues is that you need to parse through the xml level by level. Using findall, I was able to get the value inside <home_goals_time>.
for i in root.findall('.//home_goals_time'):
    print(i.text)
None

Issue with XML parsing in python

I'm having difficulty parsing an XML tree using xml.etree.ElementTree in Python. Basically, I'm making a request to an API that gives an XML response, and trying to extract the values of several elements in the tree.
This is what I've done so far with no success:
root = etree.fromstring(resp_arr[0])
walkscore = root.find('./walkscore')
Here is my XML tree:
<result>
<status>1</status>
<walkscore>95</walkscore>
<description>walker's paradise</description>
<updated>2009-12-25 03:40:16.006257</updated>
<logo_url>https://cdn.walk.sc/images/api-logo.png</logo_url>
<more_info_icon>https://cdn.walk.sc/images/api-more-info.gif</more_info_icon>
<ws_link>http://www.walkscore.com/score/1119-8th-Avenue-Seattle-WA-98101/lat=47.6085/lng=-122.3295/?utm_source=myrealtysite.com&utm_medium=ws_api&utm_campaign=ws_api</ws_link>
<help_link>https://www.redfin.com/how-walk-score-works</help_link>
<snapped_lat>47.6085</snapped_lat>
<snapped_lon>-122.3295</snapped_lon>
</result>
Essentially, I'm trying to pull the walkscores from the XML document but my code isn't returning a value. Does anyone with experience using ElementTree have any advice to help me extract the values I'm after?
Sam

Your XML appears to be malformed. But if I replace instances of & with &amp;, then it's parseable:
>>> from xml.etree import ElementTree as ET
>>> tree = ET.fromstring(xml)
>>> tree.find('./walkscore').text
'95'
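If the response cannot be fixed at the source, the bare ampersands can be escaped before parsing. The following is only a sketch: the pattern name `BARE_AMP` and the shortened sample response are assumptions, and the regular expression leaves alone anything that already looks like an entity or character reference.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, made-up response with two bare '&' and one valid '&amp;'.
raw = ("<result><walkscore>95</walkscore>"
       "<ws_link>http://example.com/?a=1&b=2&amp;c=3</ws_link></result>")

# Match '&' unless it starts something shaped like '&name;', '&#123;' or '&#x1F;'.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

fixed = BARE_AMP.sub("&amp;", raw)
root = ET.fromstring(fixed)

score = root.find("./walkscore").text
link = root.find("ws_link").text
print(score, link)
```

Note that this only repairs unescaped ampersands; named entities the parser does not know (for example `&nbsp;`) would still raise a parse error.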

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
for Reel in root.findall('Reel'):
    id = Reel.findtext('Id')
    print(id)
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to Root/child etc?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>

The expression Reel.findtext('Id') only looks at direct children of Reel and returns the text of the first match. If you want to find all Id tags below Reel, at any depth, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This gives you a list with the text of every Id element that is a descendant of Reel.
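The difference between the two search depths can be seen in a short sketch. The Reel document below is a trimmed stand-in with made-up placeholder ids, not the asker's data.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the Reel document; the id values are placeholders.
reel = ET.fromstring("""<Reel>
  <Id>urn:uuid:reel</Id>
  <AssetList>
    <MainPicture><Id>urn:uuid:picture</Id></MainPicture>
    <MainSound><Id>urn:uuid:sound</Id></MainSound>
  </AssetList>
</Reel>""")

direct = [e.text for e in reel.findall("Id")]      # direct children only
nested = [e.text for e in reel.findall(".//Id")]   # all descendants
via_iter = [e.text for e in reel.iter("Id")]       # same result via iter()
print(direct, nested)
```

Both findall(".//Id") and iter("Id") walk the tree in document order, so neither requires knowing how deeply the tag is nested.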
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset that ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
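With the standard library alone, two other spellings are possible: the fully qualified `{uri}tag` form, and (assuming Python 3.8 or newer) the `{*}` namespace wildcard. The sketch below uses a cut-down copy of the KDM document; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the KDM document: same default namespace, two KeyId entries.
xml_text = """<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">
  <KeyIdList>
    <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
    <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
  </KeyIdList>
</DCinemaSecurityMessage>"""

root = ET.fromstring(xml_text)
NS = "http://www.digicine.com/PROTO-ASDCP-KDM-20040311#"

# Fully qualified tag name: "{namespace-uri}localname".
by_qualified_name = [e.text for e in root.findall(f".//{{{NS}}}KeyId")]

# Namespace wildcard, available in Python 3.8+: matches KeyId in any namespace.
by_wildcard = [e.text for e in root.findall(".//{*}KeyId")]

# A plain ".//KeyId" finds nothing, because the elements are namespaced.
unqualified = root.findall(".//KeyId")
print(by_wildcard)
```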

It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.

Here's what I needed to do. This works for finding whatever I need.
for node in tree.iter():  # getiterator() was removed in Python 3.9
    if 'KeyId' in node.tag:
        mylist = node.tag
        print(mylist)

Grab Content from XML using Python? almost there

I'm using ElementTree and I can get tags and attributes, but not the actual content between elements.
from this XML:
<tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
here's my python code:
import urllib2
import xml.etree.ElementTree as ET
XML = urllib2.urlopen("http://URL/file.xml")
Tree = ET.parse(XML)
for node in Tree.getiterator():
    print node.tag, node.attrib
This prints most of the XML file, and I understand what 'tag' and 'attrib' are, but how do I get the 'Content'? I tried looking through ElementTree's docs, but I think this might be too basic of a question.

The .text attribute should give you the required text value.
for node in Tree.getiterator():
    print node.tag, node.attrib, node.text
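One caveat, shown in the sketch below: .text only holds the character data up to the first child element. For mixed content, itertext() walks all nested text. The sample element with an inner `<b>` tag is an assumption added for illustration, and the code is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Variation on the question's element, with a nested <b> tag added
# (an assumption) to show where .text stops.
root = ET.fromstring(
    '<doc><tag_name attrib="1">I WANT <b>THIS</b> INFO HERE</tag_name></doc>'
)
node = root.find("tag_name")

direct = node.text                # text before the first child element
tail = node.find("b").tail        # text that follows the </b> end tag
full = "".join(node.itertext())   # all text inside the element, in order
print(repr(direct), repr(tail), repr(full))
```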

Did you try XPath?
There are a lot of libraries that extract content from tags with a very easy yet powerful syntax.
Here is an example:
import XmlXPathSelector
xs = XmlXPathSelector(text="<tags>your xml</tags>")
print xs.select("//tag_name[@attrib='1']/text()").extract()
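A comparable query also works with the standard library, since ElementTree's XPath subset includes the `[@attrib='value']` predicate, although not the `text()` step. The sketch below assumes a small made-up document with two tag_name elements.

```python
import xml.etree.ElementTree as ET

# Made-up document: the second tag_name is added to show the filter working.
xml_text = """<tags>
  <tag_name attrib="1">I WANT THIS INFO HERE</tag_name>
  <tag_name attrib="2">not this one</tag_name>
</tags>"""

root = ET.fromstring(xml_text)

# Select by attribute value, then read .text in Python,
# because ElementTree has no text() step.
matches = [node.text for node in root.findall(".//tag_name[@attrib='1']")]
print(matches)
```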

Targeting specific sub-elements when parsing XML with Python

I'm working on building a simple parser to handle a regular data feed at work. This post, XML to csv(-like) format , has been very helpful. I'm using a for loop like in the solution, to loop through all of the elements/subelements I need to target but I'm still a bit stuck.
For instance, my xml file is structured like so:
<root>
<product>
<identifier>12</identifier>
<identifier>ab</identifier>
<contributor>Alex</contributor>
<contributor>Steve</contributor>
</product>
</root>
I want to target only the second identifier, and only the first contributor. Any suggestions on how I might do that?
Cheers!

The other answer you pointed to has an example of how to turn all instances of a tag into a list. You could just loop through those and discard the ones you're not interested in.
However, there's a way to do this directly with XPath: the mini-language supports item indexes in brackets:
import xml.etree.ElementTree as etree
document = etree.parse("your.xml")
# find() returns an Element (or None); its content is in .text
secondIdentifier = document.find(".//product/identifier[2]")
firstContributor = document.find(".//product/contributor[1]")
print(secondIdentifier.text, firstContributor.text)
prints
ab Alex
Note that in XPath, the first index is 1, not 0.
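A self-contained sketch of the positional predicates follows, parsing the sample from a string rather than from your.xml. The `last()` form and the out-of-range lookup are additions for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<root>
  <product>
    <identifier>12</identifier>
    <identifier>ab</identifier>
    <contributor>Alex</contributor>
    <contributor>Steve</contributor>
  </product>
</root>""")

# Positions are 1-based and count only siblings with the same tag.
second_identifier = root.find("./product/identifier[2]").text
first_contributor = root.find("./product/contributor[1]").text

# last() selects the final matching sibling.
last_contributor = root.find("./product/contributor[last()]").text

# find() returns None when the position does not exist.
missing = root.find("./product/identifier[3]")
print(second_identifier, first_contributor, last_contributor, missing)
```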
ElementTree's find and findall only support a subset of XPath, described in the ElementTree documentation. Full XPath, described in brief on W3Schools and more fully in the W3C's normative document, is available from lxml, a third-party package, but one that is widely available. With lxml, the example would look like this:
import lxml.etree as etree
document = etree.parse("your.xml")
# xpath() returns a list of matching elements
secondIdentifier = document.xpath(".//product/identifier[2]")[0]
firstContributor = document.xpath(".//product/contributor[1]")[0]
print(secondIdentifier.text, firstContributor.text)
