Parsing XML in Python - stumped how to do this

I've looked through a number of support pages, examples and documents, but I am still stumped as to how to achieve what I am after using Python.
I need to parse an XML feed and take only very specific values from the document, and that is where I am stuck.
The XML looks like the following:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed>
<title type="text">DailyTreasuryYieldCurveRateData</title>
<id></id>
<updated>2014-12-03T07:44:30Z</updated>
<link rel="self" title="DailyTreasuryYieldCurveRateData" href="DailyTreasuryYieldCurveRateData" />
<entry>
<id></id>
<title type="text"></title>
<updated>2014-12-03T07:44:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6235)" />
<category />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">6235</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-01T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double">0.01</d:BC_1MONTH>
<d:BC_3MONTH m:type="Edm.Double">0.03</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">0.08</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">0.13</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">0.49</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">0.9</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">1.52</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">1.93</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">2.22</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">2.66</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">2.95</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">2.95</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
<entry>
<id></id>
<title type="text"></title>
<updated>2014-12-03T07:44:30Z</updated>
<author>
<name />
</author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6236)" />
<category />
<content type="application/xml">
<m:properties>
<d:Id m:type="Edm.Int32">6236</d:Id>
<d:NEW_DATE m:type="Edm.DateTime">2014-12-02T00:00:00</d:NEW_DATE>
<d:BC_1MONTH m:type="Edm.Double">0.04</d:BC_1MONTH>
<d:BC_3MONTH m:type="Edm.Double">0.03</d:BC_3MONTH>
<d:BC_6MONTH m:type="Edm.Double">0.08</d:BC_6MONTH>
<d:BC_1YEAR m:type="Edm.Double">0.14</d:BC_1YEAR>
<d:BC_2YEAR m:type="Edm.Double">0.55</d:BC_2YEAR>
<d:BC_3YEAR m:type="Edm.Double">0.96</d:BC_3YEAR>
<d:BC_5YEAR m:type="Edm.Double">1.59</d:BC_5YEAR>
<d:BC_7YEAR m:type="Edm.Double">2</d:BC_7YEAR>
<d:BC_10YEAR m:type="Edm.Double">2.28</d:BC_10YEAR>
<d:BC_20YEAR m:type="Edm.Double">2.72</d:BC_20YEAR>
<d:BC_30YEAR m:type="Edm.Double">3</d:BC_30YEAR>
<d:BC_30YEARDISPLAY m:type="Edm.Double">3</d:BC_30YEARDISPLAY>
</m:properties>
</content>
</entry>
</feed>
This XML document gets a new entry appended each day for the duration of the month, then resets and starts again on the 1st of the next month.
I need to extract the date from d:NEW_DATE and the value from d:BC_10YEAR. With a single entry this is no problem, but I am struggling to work out how to go through the file and extract the relevant date and value from each entry block.
Any assistance is very much appreciated.

BeautifulSoup is probably the easiest way to do what you're looking for:
# BeautifulSoup 3 (Python 2 era); tag names are looked up in lower case here
from BeautifulSoup import BeautifulSoup
xmldoc = open('datafile.xml', 'r').read()
bs = BeautifulSoup(xmldoc)
entryList = bs.findAll('entry')
for entry in entryList:
    # Each <entry> holds its values under content > m:properties
    props = entry.content.find('m:properties')
    print(props.find('d:new_date').contents[0])
    print(props.find('d:bc_10year').contents[0])
You can then replace the print with whatever you want to do with the data (add to a list etc.).
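If installing BeautifulSoup is not an option, the same loop can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch under one assumption: the snippet in the question shows no xmlns declarations, so the namespace URIs below (the usual Atom and OData ones) are assumed and should be checked against the root element of the real feed. The SAMPLE string and the extract_rates name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: these namespace URIs are not shown in the question; verify them
# against the xmlns declarations on the real feed's root element.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Hypothetical, trimmed-down sample with two entries.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:NEW_DATE m:type="Edm.DateTime">2014-12-01T00:00:00</d:NEW_DATE>
        <d:BC_10YEAR m:type="Edm.Double">2.22</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:NEW_DATE m:type="Edm.DateTime">2014-12-02T00:00:00</d:NEW_DATE>
        <d:BC_10YEAR m:type="Edm.Double">2.28</d:BC_10YEAR>
      </m:properties>
    </content>
  </entry>
</feed>"""


def extract_rates(xml_text):
    """Return a list of (date string, 10-year rate) tuples, one per entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NS):
        props = entry.find("atom:content/m:properties", NS)
        if props is None:
            continue  # entry without a properties block
        date = props.findtext("d:NEW_DATE", namespaces=NS)
        rate = props.findtext("d:BC_10YEAR", namespaces=NS)
        if date is not None and rate is not None:
            rows.append((date, float(rate)))
    return rows


for date, rate in extract_rates(SAMPLE):
    print(date, rate)
```

Collecting the pairs in a list keeps the parsing separate from whatever is done with the data afterwards.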


How can I reference a parent and remove the parent element in an RSS XML through LXML in Python?

I've been having trouble cracking this one. I have an RSS feed in the form of an XML file. Simplified, it looks like this:
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[View Example]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
My objective is to check whether the second description tag contains certain strings. If it does, I'd like to remove the whole item. Currently in my code I have this:
import lxml.etree
doc = lxml.etree.fromstring(testString)
found = doc.findall('channel/item/description')
for desc in found:
    if "FORBIDDENSTRING" in desc.text:
        desc.getparent().remove(desc)
This removes just the second description tag, which makes sense, but I want the whole item gone.
I don't know how to get hold of the 'item' element if I only have the 'desc' reference.
I've tried googling as well as searching on here, but the cases I find only remove the tag itself, as I'm doing now. Oddly, I haven't come across sample code that gets rid of the entire parent element.
Any pointers to documentation or tutorials, or any other help, are very welcome.
I'm a big fan of XSLT, but another option is to just select the item instead of the description (select the element you want to delete; not its child).
Also, if you use xpath(), you can put the check for the forbidden string directly in the xpath predicate.
Example...
from lxml import etree
testString = """
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description></description>
<item>...</item>
<item>...</item>
<item>...</item>
<item>
<guid></guid>
<pubDate></pubDate>
<author/>
<title>Title of the item</title>
<link href="https://example.com" rel="alternate" type="text/html"/>
<description>
<![CDATA[View Example]]>
</description>
<description>
<![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]>
</description>
</item>
<item>...</item>
</channel>
</rss>
"""
forbidden_string = "I want to get rid of the whole item"
parser = etree.XMLParser(strip_cdata=False)
doc = etree.fromstring(testString, parser=parser)
found = doc.xpath('.//channel/item[description[contains(.,"{}")]]'.format(forbidden_string))
for item in found:
    # Remove the whole <item> from its parent <channel>
    item.getparent().remove(item)
print(etree.tostring(doc, encoding="unicode", pretty_print=True))
this prints...
<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<link href="https://www.examplefeedurl.com">Feed</link>
<description/>
<item>...</item>
<item>...</item>
<item>...</item>
<item>...</item>
</channel>
</rss>
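For comparison, here is a minimal standard-library sketch of the same idea (pick the item, then remove it from its parent). xml.etree.ElementTree has no getparent(), so the loop starts from channel and inspects each item's descriptions. The SAMPLE feed and the function name are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the second item carries the forbidden text.
SAMPLE = """<rss version="2.0">
<channel>
<title>My RSS Feed</title>
<item><title>keep me</title><description><![CDATA[<p>harmless text</p>]]></description></item>
<item><title>drop me</title>
<description><![CDATA[View Example]]></description>
<description><![CDATA[<p>contains FORBIDDENSTRING here</p>]]></description>
</item>
<item><title>keep me too</title><description>plain</description></item>
</channel>
</rss>"""


def remove_items_containing(root, forbidden):
    """Remove every <item> that has a <description> containing `forbidden`."""
    channel = root.find("channel")
    # findall() returns a list, so removing while looping over it is safe.
    for item in channel.findall("item"):
        texts = [d.text or "" for d in item.findall("description")]
        if any(forbidden in text for text in texts):
            channel.remove(item)
    return root


root = remove_items_containing(ET.fromstring(SAMPLE), "FORBIDDENSTRING")
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)
```

Because the parent (channel) is known up front, no parent lookup is needed.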
Consider XSLT, the special-purpose language designed to transform XML files, for example by removing nodes conditionally by value. Python's lxml can run XSLT 1.0 scripts and even pass a parameter from the Python script to XSLT (not unlike passing parameters in SQL!). This way, you avoid any for loops, if logic, or rebuilding of the tree at the application layer.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" cdata-section-elements="description"/>
<xsl:strip-space elements="*"/>
<!-- VALUE TO BE PASSED IN FROM PYTHON -->
<xsl:param name="search_string" />
<!-- IDENTITY TRANSFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- KEEP ONLY item NODES THAT DO NOT CONTAIN $search_string -->
<xsl:template match="channel">
<xsl:copy>
<xsl:apply-templates select="item[not(contains(description[2], $search_string))]"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python (for demo, below runs two searches using posted sample)
import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_String.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('FORBIDDENSTRING')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>
# <guid/>
# <pubDate/>
# <author/>
# <title>Title of the item</title>
# <link href="https://example.com" rel="alternate" type="text/html"/>
# <description><![CDATA[View Example]]></description>
# <description><![CDATA[<p>This actually contains a bunch of text I want to work with. If this text contains certain strings, I want to get rid of the whole item.</p>]]></description>
# </item>
# <item>...</item>
# </channel>
# </rss>
# RUN TRANSFORMATION WITH PARAM
n = et.XSLT.strparam('bunch of text')
result = transform(doc, search_string=n)
print(result)
# <?xml version="1.0"?>
# <rss version="2.0">
# <channel>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# <item>...</item>
# </channel>
# </rss>
# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Unreachable xml feed entries

I'm working on a Python application that queries a phonebook search API and formats the received data.
The entries are sent back as an XML feed that looks like the example at the bottom.
I'm using feedparser to split up the information.
What I'm struggling with is extracting the e-mail field.
This information is contained in the tag <tel:extra type="email">.
I could only get the value of "type" for the last extra entry.
The earlier extra entries and the content between the tags are unreachable.
Does anyone have experience with this kind of feed?
Thank you for helping me.
API information
Python code:
import feedparser
data = feedparser.parse(xml)
entry = data.entries[0]
print(entry.tel_extra)
XML example:
<?xml version="1.0" encoding="utf-8" ?>
<feed xml:lang="de" xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:tel="http://tel.search.ch/api/spec/result/1.0/">
<id>https://tel.search.ch/api/04b361c38a40dc3aab2355d79f221f86/5acc2bdfc4554dfd5a4bb10424cd597e</id>
<title type="text">tel.search.ch API Search Results</title>
<generator version="1.0" uri="https://tel.search.ch">tel.search.ch</generator>
<updated>2018-02-12T03:00:00Z</updated>
<link href="https://tel.search.ch/result.html?was=nestle&amp;wo=broc&amp;private=0" rel="alternate" type="text/html" />
<link href="http://tel.search.ch/api/?was=nestle&amp;wo=broc&amp;private=0&amp;key=04b361c38a40dc3aab2355d79f221f86" type="application/atom+xml" rel="self" />
<openSearch:totalResults>1</openSearch:totalResults>
<openSearch:startIndex>1</openSearch:startIndex>
<openSearch:itemsPerPage>20</openSearch:itemsPerPage>
<openSearch:Query role="request" searchTerms="nestle broc" startPage="1" />
<openSearch:Image height="1" width="1" type="image/gif">https://www.search.ch/audit/CP/tel/de/api</openSearch:Image>
<entry>
<id>urn:uuid:ca71838ddcbb6a92</id>
<updated>2018-02-12T03:00:00Z</updated>
<published>2018-02-12T03:00:00Z</published>
<title type="text">Nestlé Suisse SA</title>
<content type="text">Nestlé Suisse SA
Fabrique de Broc
rue Jules Bellet 7
1636 Broc/FR
026 921 51 51</content>
<tel:nopromo>*</tel:nopromo>
<author>
<name>tel.search.ch</name>
</author>
<link href="https://tel.search.ch/broc/rue-jules-bellet-7/nestle-suisse-sa" title="Details" rel="alternate" type="text/html" />
<link href="https://tel.search.ch/vcard/Nestle-Suisse-SA.vcf?key=ca71838ddcbb6a92" type="text/x-vcard" title="VCard Download" rel="alternate" />
<link href="https://tel.search.ch/edit/?id=ca71838ddcbb6a92" rel="edit" type="text/html" />
<tel:pos>1</tel:pos>
<tel:id>ca71838ddcbb6a92</tel:id>
<tel:type>Organisation</tel:type>
<tel:name>Nestlé Suisse SA</tel:name>
<tel:occupation>Fabrique de Broc</tel:occupation>
<tel:street>rue Jules Bellet</tel:street>
<tel:streetno>7</tel:streetno>
<tel:zip>1636</tel:zip>
<tel:city>Broc</tel:city>
<tel:canton>FR</tel:canton>
<tel:country>fr</tel:country>
<tel:category>Schokolade</tel:category>
<tel:phone>+41269215151</tel:phone>
<tel:extra type="Fax Service technique">+41269215154</tel:extra>
<tel:extra type="Fax">+41269215525</tel:extra>
<tel:extra type="Besichtigung">+41269215960</tel:extra>
<tel:extra type="email">maisoncailler@nestle.com</tel:extra>
<tel:extra type="website">http://www.cailler.ch</tel:extra>
<tel:copyright>Daten: Swisscom Directories AG</tel:copyright>
</entry>
</feed>
You may want to check out BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml, 'xml')
soup.find("tel:extra", attrs={"type":"email"}).text
Out[111]: 'maisoncailler@nestle.com'
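The same lookup can also be sketched with the standard library alone, since xml.etree.ElementTree understands namespace prefixes when given a mapping. The two namespace URIs are taken from the feed's root element in the question. The SAMPLE entry, its placeholder values and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping, copied from the xmlns declarations in the question.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "tel": "http://tel.search.ch/api/spec/result/1.0/",
}

# Hypothetical, shortened feed with placeholder data.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:tel="http://tel.search.ch/api/spec/result/1.0/">
<entry>
<tel:name>Example Org</tel:name>
<tel:extra type="Fax">+41000000000</tel:extra>
<tel:extra type="email">info@example.com</tel:extra>
<tel:extra type="website">http://www.example.com</tel:extra>
</entry>
</feed>"""


def extras_by_type(entry):
    """Map each tel:extra 'type' attribute to the text between its tags."""
    return {e.get("type"): e.text for e in entry.findall("tel:extra", NS)}


def find_email(entry):
    """Return the text of the tel:extra element with type="email", or None."""
    node = entry.find("tel:extra[@type='email']", NS)
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)
for entry in root.findall("atom:entry", NS):
    print(find_email(entry))
    print(extras_by_type(entry))
```

Unlike the single attribute feedparser exposed, this keeps every extra element reachable, keyed by its type.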

xml parsing (Removing parent nodes)

Hi, I'm seriously stuck trying to filter my XML document. Here is an example of the contents:
<sentence id="1" document_id="Perseus:text:1999.02.0029" >
<primary>millermo</primary>
<word id="1" />
<word id="2" />
<word id="3" />
<word id="4" />
</sentence>
<sentence id="2" document_id="Perseus:text:1999.02.0029" >
<primary>millermo</primary>
<word id="1" />
<word id="2" />
<word id="3" />
<word id="4" />
<word id="5" />
<word id="6" />
<word id="7" />
<word id="8" />
</sentence>
There are many sentences (over 3000), but all I want to do is write some code (preferably in Java or Python) that goes through my XML file and removes every sentence that has more than 5 word ids,
so in other words I will be left with just the sentence tags that have 5 or fewer word ids. Thanks. (Just to note, my XML knowledge isn't great; I get mixed up between nodes, tags, elements and ids.)
I'm trying this at the moment, but I'm not sure about it:
import xml.etree.ElementTree as ET
tree = ET.parse('treebank.xml')
root = tree.getroot()
parent_map = dict((c, p) for p in tree.getiterator() for c in p)
iterator = list(root.getiterator('word id'))
for item in iterator:
    old = item.find('word id')
    text = old.text
    if 'id=16' in text:
        parent_map[item].remove(item)
        continue
tree.write('out.xml')
Consider an XSLT solution, where no looping is required. For background, XSLT is a declarative, special-purpose language designed to transform XML documents into other formats and structures. Here, the identity transform copies the entire document as is, and an empty template drops every <word> node whose position is greater than 5. (To drop whole sentences instead, the empty template could match sentence[count(word) > 5].)
XSLT script (save as .xsl or .xslt file)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="word[position() > 5]"/>
</xsl:transform>
Python Script
import lxml.etree as ET
# LOAD XML AND XSL
dom = ET.parse('C:/Path/To/Input.xml')
xslt = ET.parse('C:/Path/To/XSLTscript.xsl')
# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)
# PRETTY PRINT OUTPUT
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out.decode("utf-8"))
# SAVE TO FILE (binary mode, because tree_out is bytes)
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)
And the beauty of XSLT is that it is transferable, as practically all general-purpose languages have XSLT processors, including Java:
Java program
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;

public class Sentence {
    public static void main(String[] args) throws IOException, URISyntaxException, TransformerException {
        String currentDir = new File("").getAbsolutePath();
        String xml = "C:/Path/To/Input.xml";
        String xsl = "C:/Path/To/XSLTScript.xsl";

        // Transformation
        TransformerFactory factory = TransformerFactory.newInstance();
        Source xslt = new StreamSource(new File(xsl));
        Transformer transformer = factory.newTransformer(xslt);

        Source text = new StreamSource(new File(xml));
        transformer.transform(text, new StreamResult(new File("C:/Path/To/Output.xml")));
    }
}
OUTPUT (using posted content)
<?xml version='1.0' encoding='UTF-8'?>
<root>
<sentence id="1" document_id="Perseus:text:1999.02.0029">
<primary>millermo</primary>
<word id="1"/>
<word id="2"/>
<word id="3"/>
<word id="4"/>
</sentence>
<sentence id="2" document_id="Perseus:text:1999.02.0029">
<primary>millermo</primary>
<word id="1"/>
<word id="2"/>
<word id="3"/>
<word id="4"/>
<word id="5"/>
</sentence>
</root>
The thing is that word is a tag and id is its attribute; you can't pass them both to .find().
Also, the result of parsing is a tree, where attributes and text are represented differently than in an XML file.
I suppose you have a root element which has <sentence> elements as children.
Then you have to look at each <sentence> node, count its <word> elements, and remove the sentence if needed.
# We cannot iterate over a tree and modify it at the same time.
# Remember the nodes to remove later.
elements_to_kill = []
for sentence_node in root.findall('sentence'):
    # Sentences with more than 5 <word> children are the ones to drop.
    if len(sentence_node.findall('word')) > 5:
        elements_to_kill.append(sentence_node)
# Now it's safe to remove them
for node in elements_to_kill:
    root.remove(node)
# Serialize as file, etc
Hope this helps.
You seem to lack a grasp of how ETree works. Please feel free to read the docs and experiment in a Python REPL to build that understanding.
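The counting approach described above can be sketched end to end as follows. It assumes, as the answer does, that the <sentence> elements are direct children of a single root element. The SAMPLE document, the max_words parameter and the function name are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: sentence 1 has 4 words, sentence 2 has 8.
SAMPLE = """<root>
<sentence id="1" document_id="Perseus:text:1999.02.0029">
<primary>millermo</primary>
<word id="1" /><word id="2" /><word id="3" /><word id="4" />
</sentence>
<sentence id="2" document_id="Perseus:text:1999.02.0029">
<primary>millermo</primary>
<word id="1" /><word id="2" /><word id="3" /><word id="4" />
<word id="5" /><word id="6" /><word id="7" /><word id="8" />
</sentence>
</root>"""


def drop_long_sentences(root, max_words=5):
    """Remove every direct <sentence> child with more than max_words <word> children."""
    # Collect first, remove afterwards, so the tree is not modified mid-iteration.
    too_long = [
        s for s in root.findall("sentence")
        if len(s.findall("word")) > max_words
    ]
    for sentence in too_long:
        root.remove(sentence)
    return root


root = drop_long_sentences(ET.fromstring(SAMPLE))
remaining_ids = [s.get("id") for s in root.findall("sentence")]
print(remaining_ids)
```

With a real file, ET.parse('treebank.xml') followed by tree.write('out.xml') would replace the in-memory SAMPLE.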

XML Prettifying from file in Python

I have an xml file which looks like the example below.
Many of the text values start with a space, have \n (newline) at the beginning, or contain other stray whitespace. I'm working with xml.etree.ElementTree, and it parses this file without problems.
But I want more! :) I tried to prettify this mess, without success. I tried many tutorials, but none of them produced pretty XML.
<?xml version="1.0"?>
<import>
<article>
<name> Name with space
</name>
<source> Daily Telegraph
</source>
<number>72/2015
</number>
<page>10
</page>
<date>2015-03-26
</date>
<author> Tomas First
</author>
<description>Economy
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text is here
</text>
</article>
<article>
<name> How to parse
</name>
<source> Internet article
</source>
<number>72/2015
</number>
<page>1
</page>
<date>2015-03-26
</date>
<author>Some author
</author>
<description> description
</description>
<attachment>
</attachment>
<region>
</region>
<text>
My text here
</text>
</article>
</import>
When I tried other answers from SO, they generated the same file or even messier XML.
bs4 can do it:
from bs4 import BeautifulSoup
doc = BeautifulSoup(xmlstring, 'xml')
print(doc.prettify())
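Since the question already uses xml.etree.ElementTree, a standard-library alternative can be sketched too: strip the whitespace around each text value, then let ET.indent() rebuild the indentation. ET.indent() requires Python 3.9 or newer. The shortened SAMPLE and the tidy name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the messy input.
SAMPLE = """<import>
<article>
<name> Name with space
</name>
<page>10
</page>
<attachment>
</attachment>
</article>
</import>"""


def tidy(xml_text):
    """Strip surrounding whitespace from text nodes and re-indent the tree."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text:
            # Drops leading spaces and trailing newlines; whitespace-only
            # text becomes empty and is replaced by indentation below.
            elem.text = elem.text.strip()
    ET.indent(root, space="  ")  # Python 3.9+
    return ET.tostring(root, encoding="unicode")


pretty = tidy(SAMPLE)
print(pretty)
```

Stripping before indenting matters because ET.indent() only rewrites whitespace-only text and leaves real text values untouched.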

Text parsing XML File with Python

So I have been able to request an RSS page over HTTP, save it as a .txt file, and query the elements within the XML with minidom.
What I am trying to do next is create a selective list of links that meet my requirements.
Here is an example XML file that has a similar architecture to my file:
<xml>
<Document name = "example_file.txt">
<entry id = "1">
<link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/>
</entry>
<entry id = "2">
<link href="http://wwww.examplesite.com/files/test_image_1.jpg"/>
</entry>
<entry id = "3">
<link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/>
</entry>
<entry id = "4">
<link href="http://wwww.examplesite.com/files/test_image_1.png"/>
</entry>
<entry id = "5">
<link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/>
</entry>
<entry id = "6">
<link href="http://wwww.examplesite.com/files/test_image_2.jpg"/>
</entry>
<entry id = "7">
<link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/>
</entry>
<entry id = "8">
<link href="http://wwww.examplesite.com/files/test_image_2.png"/>
</entry>
</Document>
</xml>
With minidom, I can get it down to a list of just links, but I think I can skip this step if I can create a list based off of text-searching parameters. I do not want all links, I only want these links:
http://wwww.examplesite.com/files/test_image_1.jpg
http://wwww.examplesite.com/files/test_image_2.jpg
Being new to Python, I am not sure how to say "only grab links that do not have '.png', 'Big', or 'Small' in the link name".
My end goal is to have python download these files one at a time. Would a list be best for this?
To make this even more complicated, I am limited to the stock library with Python 2.6. I won't be able to implement any great 3rd party APIs.
Using lxml and cssselect this is easy:
from pprint import pprint
import cssselect # noqa
from lxml.html import fromstring
doc = fromstring(open("foo.html", "r").read())
links = [e.attrib["href"] for e in doc.cssselect("link")]
pprint(links)
Output:
['http://wwww.examplesite.com/files/test_image_1_Big.jpg',
'http://wwww.examplesite.com/files/test_image_1.jpg',
'http://wwww.examplesite.com/files/test_image_1_Small.jpg',
'http://wwww.examplesite.com/files/test_image_1.png',
'http://wwww.examplesite.com/files/test_image_2_Big.jpg',
'http://wwww.examplesite.com/files/test_image_2.jpg',
'http://wwww.examplesite.com/files/test_image_2_Small.jpg',
'http://wwww.examplesite.com/files/test_image_2.png']
If you only want two of the links (which two?):
links = links[:2]
This is called Slicing in Python.
Being new to Python, I am not sure how to say "only grab links that do not have '.png', 'Big', or 'Small' in the link name". Any help would be great
You can filter your list like this:
doc = fromstring(open("foo.html", "r").read())
links = [e.attrib["href"] for e in doc.cssselect("link")]
predicate = lambda l: not any([s in l for s in ("png", "Big", "Small")])
links = [l for l in links if predicate(l)]
pprint(links)
This will give you:
['http://wwww.examplesite.com/files/test_image_1.jpg',
'http://wwww.examplesite.com/files/test_image_2.jpg']
import re
from xml.dom import minidom
_xml = '''<?xml version="1.0" encoding="utf-8"?>
<xml >
<Document name="example_file.txt">
<entry id="1">
<link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/>
</entry>
<entry id="2">
<link href="http://wwww.examplesite.com/files/test_image_1.jpg"/>
</entry>
<entry id="3">
<link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/>
</entry>
<entry id="4">
<link href="http://wwww.examplesite.com/files/test_image_1.png"/>
</entry>
<entry id="5">
<link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/>
</entry>
<entry id="6">
<link href="http://wwww.examplesite.com/files/test_image_2.jpg"/>
</entry>
<entry id="7">
<link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/>
</entry>
<entry id="8">
<link href="http://wwww.examplesite.com/files/test_image_2.png"/>
</entry>
</Document>
</xml>
'''
doc = minidom.parseString(_xml)  # minidom.parse(your_file_path) gives the same result
entries = doc.getElementsByTagName('entry')
# Lazily yield the href of the first <link> inside each <entry>
link_ref = (
    entry.getElementsByTagName('link').item(0).getAttribute('href')
    for entry in entries
)
plain_jpg = re.compile(r'.*\.jpg$')  # adjust the regex to your needs
result = (link for link in link_ref if plain_jpg.match(link))
print(list(result))
This code produces [u'http://wwww.examplesite.com/files/test_image_1_Big.jpg', u'http://wwww.examplesite.com/files/test_image_1.jpg', u'http://wwww.examplesite.com/files/test_image_1_Small.jpg', u'http://wwww.examplesite.com/files/test_image_2_Big.jpg', u'http://wwww.examplesite.com/files/test_image_2.jpg', u'http://wwww.examplesite.com/files/test_image_2_Small.jpg'].
Note that this regex only drops the .png links; the Big and Small variants still match, so tighten it if you want to exclude them as well.
That said, xml.etree.ElementTree may be a better choice.
It is faster, uses less memory and has a friendlier interface.
It is also bundled with the standard library.
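As a rough illustration of the ElementTree route, here is a minimal sketch that uses only the standard library and applies the exclusion rule from the question directly. The shortened SAMPLE, the EXCLUDED tuple and the wanted_links name are illustrative. The calls used (fromstring, findall with './/link', get) are long-standing ElementTree basics, but the sketch has only been reasoned through for current Python versions.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the question's file.
SAMPLE = """<xml>
<Document name="example_file.txt">
<entry id="1"><link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/></entry>
<entry id="2"><link href="http://wwww.examplesite.com/files/test_image_1.jpg"/></entry>
<entry id="3"><link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/></entry>
<entry id="4"><link href="http://wwww.examplesite.com/files/test_image_1.png"/></entry>
<entry id="6"><link href="http://wwww.examplesite.com/files/test_image_2.jpg"/></entry>
<entry id="8"><link href="http://wwww.examplesite.com/files/test_image_2.png"/></entry>
</Document>
</xml>"""

# Substrings that disqualify a link, as listed in the question.
EXCLUDED = (".png", "Big", "Small")


def wanted_links(xml_text):
    """Return the href of every <link> whose URL contains none of EXCLUDED."""
    root = ET.fromstring(xml_text)
    hrefs = [link.get("href") for link in root.findall(".//link")]
    return [h for h in hrefs if not any(word in h for word in EXCLUDED)]


links = wanted_links(SAMPLE)
print(links)
```

A plain list is a reasonable container here, since the files can then be downloaded one at a time by looping over it.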
from feedparser import parse
data = parse("foo.html")
for elem in data['entries']:
    if 'link' in elem.keys():
        print(elem['link'])
The feedparser library parses the XML content and returns dictionary-like objects.
