I want to scrape TheRegister.com's Security section and parse the XML into a data structure.
In the Scrapy Shell I've tried:
>>> fetch('https://www.theregister.com/security/headlines.atom')
resulting in the response:
2020-11-07 09:34:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.theregister.com/security/headlines.atom> (referer: None)
The response has a body that can be viewed; see a snippet below (I only selected the first couple of lines):
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<id>tag:theregister.com,2005:feed/theregister.com/security/</id>
<title>The Register - Security</title>
<link rel="self" type="application/atom+xml" href="https://www.theregister.com/security/headlines.atom"/>
<link rel="alternate" type="text/html" href="https://www.theregister.com/security/"/>
<rights>Copyright © 2020, Situation Publishing</rights>
<author>
<name>Team Register</name>
<email>webmaster#theregister.co.uk</email>
<uri>https://www.theregister.com/odds/about/contact/</uri>
</author>
<icon>https://www.theregister.com/Design/graphics/icons/favicon.png</icon>
<subtitle>Biting the hand that feeds IT — Enterprise Technology News and Analysis</subtitle>
<logo>https://www.theregister.com/Design/graphics/Reg_default/The_Register_r.png</logo>
<updated>2020-11-06T23:58:13Z</updated>
<entry>
<id>tag:theregister.com,2005:story211912</id>
<updated>2020-11-06T23:58:13Z</updated>
<author>
<name>Thomas Claburn</name>
<uri>https://search.theregister.com/?author=Thomas%20Claburn</uri>
</author>
<link rel="alternate" type="text/html" href="https://go.theregister.com/feed/www.theregister.com/2020/11/06/android_encryption_certs/"/>
<title type="html">Let's Encrypt warns about a third of Android devices will from next year stumble over sites that use its certs</title>
<summary type="html" xml:base="https://www.theregister.com/"><h4>Expiration of cross-signed root certificates spells trouble for pre-7.1.1 kit... unless they're using Firefox</h4> <p>Let's Encrypt, a Certificate Authority (CA) that puts the "S" in "HTTPS" for about <a target="_blank" rel="nofollow" href="https://letsencrypt.org/stats/">220m domains</a>, has issued a warning to users of older Android devices that their web surfing may get choppy next year.…</p> <p><!--#include virtual='/data_centre/_whitepaper_textlinks_top.html' --></p></summary>
</entry>
Why can't I parse any data with the usual XPath methods? I've tried:
>>> response.xpath('entry')
[]
>>> response.xpath('/entry')
[]
>>> response.xpath('//entry')
[]
>>> response.xpath('.//entry')
[]
>>> response.xpath('entry/text()')
[]
>>> response.xpath('/entry/text()')
[]
>>> response.xpath('//entry/text()')
[]
>>> response.xpath('.//entry/text()')
[]
All with no luck. I also can't extract the other XML tags, like title, link, and author.
TL;DR: execute response.selector.remove_namespaces() before running response.xpath().
It essentially removes xmlns="http://www.w3.org/2005/Atom" from the response so you can write simpler XPath expressions.
Alternatively, you can register the namespace and change your selectors to include it:
response.selector.register_namespace('n', 'http://www.w3.org/2005/Atom')
response.xpath('//n:entry')
You can read more details in the Scrapy documentation on selectors and XML namespaces.
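The same namespace issue bites outside Scrapy too. As an illustrative sketch with the standard library's xml.etree.ElementTree, using an abridged version of the feed above:

```python
import xml.etree.ElementTree as ET

# Abridged version of the Atom feed from the question
atom = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>The Register - Security</title>
  <entry>
    <id>tag:theregister.com,2005:story211912</id>
    <title type="html">Let's Encrypt warns about a third of Android devices</title>
  </entry>
</feed>"""

root = ET.fromstring(atom)

# Without the namespace nothing matches -- the same symptom as in the shell
assert root.findall('entry') == []

# Map a prefix to the Atom namespace and use it in every path step
ns = {'a': 'http://www.w3.org/2005/Atom'}
entries = root.findall('a:entry', ns)
print(entries[0].find('a:title', ns).text)
```

The prefix name ('a' here) is arbitrary; what matters is that every element in the path carries it, because every element in the feed lives in the Atom namespace.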
Related
I'm working on a Python application that is supposed to make a request to a phonebook search API and format the received data.
The entries are sent back as an XML feed, as in the example at the bottom.
I'm using feedparser to split the information.
What I'm struggling with is the extraction of the e-mail field.
This information is contained under the tag <tel:extra type="email">
I could only manage to get the value of "type" for the last extra entry.
The ones before, and the content between the tags, are unreachable.
Does anyone have some experience with this kind of feeds?
Thank you for helping me.
API information
Python code:
import feedparser
data = feedparser.parse(xml)
entry = data.entries[0]
print(entry.tel_extra)
XML example:
<?xml version="1.0" encoding="utf-8" ?>
<feed xml:lang="de" xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:tel="http://tel.search.ch/api/spec/result/1.0/">
<id>https://tel.search.ch/api/04b361c38a40dc3aab2355d79f221f86/5acc2bdfc4554dfd5a4bb10424cd597e</id>
<title type="text">tel.search.ch API Search Results</title>
<generator version="1.0" uri="https://tel.search.ch">tel.search.ch</generator>
<updated>2018-02-12T03:00:00Z</updated>
<link href="https://tel.search.ch/result.html?was=nestle&wo=broc&private=0" rel="alternate" type="text/html" />
<link href="http://tel.search.ch/api/?was=nestle&wo=broc&private=0&key=04b361c38a40dc3aab2355d79f221f86" type="application/atom+xml" rel="self" />
<openSearch:totalResults>1</openSearch:totalResults>
<openSearch:startIndex>1</openSearch:startIndex>
<openSearch:itemsPerPage>20</openSearch:itemsPerPage>
<openSearch:Query role="request" searchTerms="nestle broc" startPage="1" />
<openSearch:Image height="1" width="1" type="image/gif">https://www.search.ch/audit/CP/tel/de/api</openSearch:Image>
<entry>
<id>urn:uuid:ca71838ddcbb6a92</id>
<updated>2018-02-12T03:00:00Z</updated>
<published>2018-02-12T03:00:00Z</published>
<title type="text">Nestlé Suisse SA</title>
<content type="text">Nestlé Suisse SA
Fabrique de Broc
rue Jules Bellet 7
1636 Broc/FR
026 921 51 51</content>
<tel:nopromo>*</tel:nopromo>
<author>
<name>tel.search.ch</name>
</author>
<link href="https://tel.search.ch/broc/rue-jules-bellet-7/nestle-suisse-sa" title="Details" rel="alternate" type="text/html" />
<link href="https://tel.search.ch/vcard/Nestle-Suisse-SA.vcf?key=ca71838ddcbb6a92" type="text/x-vcard" title="VCard Download" rel="alternate" />
<link href="https://tel.search.ch/edit/?id=ca71838ddcbb6a92" rel="edit" type="text/html" />
<tel:pos>1</tel:pos>
<tel:id>ca71838ddcbb6a92</tel:id>
<tel:type>Organisation</tel:type>
<tel:name>Nestlé Suisse SA</tel:name>
<tel:occupation>Fabrique de Broc</tel:occupation>
<tel:street>rue Jules Bellet</tel:street>
<tel:streetno>7</tel:streetno>
<tel:zip>1636</tel:zip>
<tel:city>Broc</tel:city>
<tel:canton>FR</tel:canton>
<tel:country>fr</tel:country>
<tel:category>Schokolade</tel:category>
<tel:phone>+41269215151</tel:phone>
<tel:extra type="Fax Service technique">+41269215154</tel:extra>
<tel:extra type="Fax">+41269215525</tel:extra>
<tel:extra type="Besichtigung">+41269215960</tel:extra>
<tel:extra type="email">maisoncailler#nestle.com</tel:extra>
<tel:extra type="website">http://www.cailler.ch</tel:extra>
<tel:copyright>Daten: Swisscom Directories AG</tel:copyright>
</entry>
</feed>
You may want to check out BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml, 'xml')
soup.find("tel:extra", attrs={"type":"email"}).text
Out[111]: 'maisoncailler#nestle.com'
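If you'd rather stay in the standard library, xml.etree.ElementTree can do the same with a namespace map. A sketch against an abridged version of the feed above (the # in the address is how it appears in the question):

```python
import xml.etree.ElementTree as ET

# Abridged version of the tel.search.ch feed from the question
xml = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:tel="http://tel.search.ch/api/spec/result/1.0/">
  <entry>
    <tel:extra type="Fax">+41269215525</tel:extra>
    <tel:extra type="email">maisoncailler#nestle.com</tel:extra>
    <tel:extra type="website">http://www.cailler.ch</tel:extra>
  </entry>
</feed>"""

ns = {
    'a': 'http://www.w3.org/2005/Atom',
    'tel': 'http://tel.search.ch/api/spec/result/1.0/',
}
root = ET.fromstring(xml)
entry = root.find('a:entry', ns)

# All tel:extra elements keep their order and their "type" attribute,
# so pick the one whose type is "email"
email = next(e.text for e in entry.findall('tel:extra', ns)
             if e.get('type') == 'email')
print(email)
```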
I have an RSS feed from a news source. Amongst the news text and other metadata, the feed also contains a URL reference to the comments section, which can also be fetched in RSS format. I want to download and include the contents of the comments section for each news article. My aim is to create an RSS feed with the articles and the comments for each article included, then convert this new RSS feed to PDF in calibre.
Here is an example XML:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<author>
<name>Some Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is the news text.</content>
<id>123abc</id>
<link href="http://thenews.com/article/123abc/comments" />
<updated>2016-04-29T13:44:00+00:00</updated>
<title>The Title</title>
</entry>
<entry>
<author>
<name>Some other Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is another news text.</content>
<id>123abd</id>
<link href="http://thenews.com/article/123abd/comments" />
<updated>2016-04-29T14:46:00+00:00</updated>
<title>The other Title</title>
</entry>
</feed>
Now I want to replace the <link href="http://thenews.com/article/123abc/comments" /> with the content of the URL. The RSS feed can be fetched by adding a /rss at the end of the URL. So in the end, a single entry would look like this:
<entry>
<author>
<name>Some Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is the news text.</content>
<id>123abc</id>
<comments>
<comment>
<author>A commenter</author>
<timestamp>2016-04-29T16:00:00+00:00</timestamp>
<text>Cool story, yo!</text>
</comment>
<comment>
<author>Another commenter</author>
<timestamp>2016-04-29T16:01:00+00:00</timestamp>
<text>This is interesting news.</text>
</comment>
</comments>
<updated>2016-04-29T13:44:00+00:00</updated>
<title>The Title</title>
</entry>
I'm open to any programming language. I tried this with Python and lxml but couldn't get far. I was able to extract the comments URL and download the comments feed, but couldn't replace the actual <link> tag.
Without having to download the actual RSS, here's how far I've come:
import lxml.etree as et
import urllib2
import re
# These will be downloaded from the RSS feed source when the code works
xmltext = """[The above news feed, too long to paste]"""
commentsRSS = """[The above comments feed]"""
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
article = et.fromstring(xmltext)
for elem in article.xpath('//feed/entry'):
    commentsURL = elem.xpath('link/@href')
#request = urllib2.Request(commentsURL[0] + '.rss', headers=hdr)
#comments = urllib2.urlopen(request).read()
comments = commentsRSS
# Now the <link>-tag should be replaced by the comments feed without the <?xml ...> tag
For each <link> element, download the XML from its href attribute and parse it into a new Element. Then replace the <link> with the corresponding new Element, something like this:
....
article = et.fromstring(xmltext)
ns = {'d': 'http://www.w3.org/2005/Atom'}
for elem in article.xpath('//d:feed/d:entry/d:link', namespaces=ns):
request = urllib2.Request(elem.attrib['href'] + '.rss', headers=hdr)
comments = urllib2.urlopen(request).read()
newElem = et.fromstring(comments)
elem.getparent().replace(elem, newElem)
# print the result
print et.tostring(article)
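If lxml is not available, the stdlib's xml.etree.ElementTree can do the same replacement; it has no getparent(), but here the entry is already the known parent. A self-contained sketch with the network call stubbed out (the <comments> payload is a made-up stand-in for the downloaded feed):

```python
import xml.etree.ElementTree as ET

ATOM = 'http://www.w3.org/2005/Atom'
ET.register_namespace('', ATOM)  # keep the default namespace on output

# Abridged version of the news feed from the question
xmltext = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>123abc</id>
    <link href="http://thenews.com/article/123abc/comments" />
  </entry>
</feed>"""

# Stand-in for the downloaded comments feed
comments = "<comments><comment><text>Cool story, yo!</text></comment></comments>"

feed = ET.fromstring(xmltext)
ns = {'a': ATOM}
for entry in feed.findall('a:entry', ns):
    link = entry.find('a:link', ns)
    # Swap the <link> for the parsed comments at the same position
    idx = list(entry).index(link)
    entry.remove(link)
    entry.insert(idx, ET.fromstring(comments))

print(ET.tostring(feed, encoding='unicode'))
```

In a real run you would replace the comments string with the body fetched from link's href, exactly as in the lxml version above.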
So I have been able to query and receive an HTTP RSS webpage, convert it to a .txt file, and query the elements within the XML with minidom.
What I am trying to do next is create a selective list of links that meet my requirements.
Here is an example XML file that has a similar architecture to my file:
<xml>
<Document name = "example_file.txt">
<entry id = "1">
<link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/>
</entry>
<entry id = "2">
<link href="http://wwww.examplesite.com/files/test_image_1.jpg"/>
</entry>
<entry id = "3">
<link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/>
</entry>
<entry id = "4">
<link href="http://wwww.examplesite.com/files/test_image_1.png"/>
</entry>
<entry id = "5">
<link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/>
</entry>
<entry id = "6">
<link href="http://wwww.examplesite.com/files/test_image_2.jpg"/>
</entry>
<entry id = "7">
<link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/>
</entry>
<entry id = "8">
<link href="http://wwww.examplesite.com/files/test_image_2.png"/>
</entry>
</Document>
</xml>
With minidom, I can get it down to a list of just links, but I think I can skip this step if I can create a list based on text-search parameters. I do not want all links, I only want these links:
http://wwww.examplesite.com/files/test_image_1.jpg
http://wwww.examplesite.com/files/test_image_2.jpg
Being new to Python, I am not sure how to say "only grab links that do not have ".png", "Big", or "Small" in the link name.
My end goal is to have python download these files one at a time. Would a list be best for this?
To make this even more complicated, I am limited to the stock library with Python 2.6. I won't be able to implement any great 3rd party APIs.
Using lxml and cssselect this is easy:
from pprint import pprint
import cssselect # noqa
from lxml.html import fromstring
doc = fromstring(open("foo.html", "r").read())
links = [e.attrib["href"] for e in doc.cssselect("link")]
pprint(links)
Output:
['http://wwww.examplesite.com/files/test_image_1_Big.jpg',
'http://wwww.examplesite.com/files/test_image_1.jpg',
'http://wwww.examplesite.com/files/test_image_1_Small.jpg',
'http://wwww.examplesite.com/files/test_image_1.png',
'http://wwww.examplesite.com/files/test_image_2_Big.jpg',
'http://wwww.examplesite.com/files/test_image_2.jpg',
'http://wwww.examplesite.com/files/test_image_2_Small.jpg',
'http://wwww.examplesite.com/files/test_image_2.png']
If you only want two of the links (which two?):
links = links[:2]
This is called Slicing in Python.
Being new to Python, I am not sure how to say "only grab links that do not have ".png", "Big", or "Small" in the link name. Any help would be great
You can filter your list like this:
doc = fromstring(open("foo.html", "r").read())
links = [e.attrib["href"] for e in doc.cssselect("link")]
predicate = lambda l: not any([s in l for s in ("png", "Big", "Small")])
links = [l for l in links if predicate(l)]
pprint(links)
This will give you:
['http://wwww.examplesite.com/files/test_image_1.jpg',
'http://wwww.examplesite.com/files/test_image_2.jpg']
import re
from xml.dom import minidom
_xml = '''<?xml version="1.0" encoding="utf-8"?>
<xml >
<Document name="example_file.txt">
<entry id="1">
<link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/>
</entry>
<entry id="2">
<link href="http://wwww.examplesite.com/files/test_image_1.jpg"/>
</entry>
<entry id="3">
<link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/>
</entry>
<entry id="4">
<link href="http://wwww.examplesite.com/files/test_image_1.png"/>
</entry>
<entry id="5">
<link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/>
</entry>
<entry id="6">
<link href="http://wwww.examplesite.com/files/test_image_2.jpg"/>
</entry>
<entry id="7">
<link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/>
</entry>
<entry id="8">
<link href="http://wwww.examplesite.com/files/test_image_2.png"/>
</entry>
</Document>
</xml>
'''
doc = minidom.parseString(_xml)  # minidom.parse(your-file-path) gives the same result
entries = doc.getElementsByTagName('entry')
link_ref = (
entry.getElementsByTagName('link').item(0).getAttribute('href')
for entry in entries
)
plain_jpg = re.compile(r'.*\.jpg$')  # matches any URL ending in .jpg
result = (link for link in link_ref if plain_jpg.match(link))
print list(result)
This code gives the result [u'http://wwww.examplesite.com/files/test_image_1_Big.jpg', u'http://wwww.examplesite.com/files/test_image_1.jpg', u'http://wwww.examplesite.com/files/test_image_1_Small.jpg', u'http://wwww.examplesite.com/files/test_image_2_Big.jpg', u'http://wwww.examplesite.com/files/test_image_2.jpg', u'http://wwww.examplesite.com/files/test_image_2_Small.jpg']. Note that .*\.jpg$ still matches the _Big and _Small variants, so you would need a stricter pattern to exclude them.
But you may be better served by xml.etree.ElementTree: etree is faster, uses less memory, has a cleaner interface, and is bundled in the standard library.
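Taking up that etree suggestion, here is roughly the same extraction and filtering with xml.etree.ElementTree; the XML is inlined (and abridged) so the sketch is self-contained:

```python
import xml.etree.ElementTree as ET

# Abridged version of the example file from the question
_xml = """<xml>
<Document name="example_file.txt">
  <entry id="1"><link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/></entry>
  <entry id="2"><link href="http://wwww.examplesite.com/files/test_image_1.jpg"/></entry>
  <entry id="4"><link href="http://wwww.examplesite.com/files/test_image_1.png"/></entry>
  <entry id="6"><link href="http://wwww.examplesite.com/files/test_image_2.jpg"/></entry>
</Document>
</xml>"""

root = ET.fromstring(_xml)
links = [link.get('href') for link in root.iter('link')]

# Keep .jpg files while excluding the Big/Small variants, as the question asks
wanted = [l for l in links
          if l.endswith('.jpg') and 'Big' not in l and 'Small' not in l]
print(wanted)
```

This prints only the two plain .jpg URLs. ElementTree has been in the standard library since Python 2.5, so it fits the stock-library constraint too.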
import feedparser

data = feedparser.parse("foo.html")
for elem in data['entries']:
    if 'link' in elem.keys():
        print(elem['link'])
The library "feedparser" returns dictionaries parsed from the XML content.
I'm trying to get information from this site:
http://www.gocrimson.com/sports/mbkb/2011-12/roster
If you look at that page in a browser, you see a nice <table> that contains all the player info, with the coach's info below it.
When I pull that page into a Python program (using urllib2) or a Ruby program (using nokogiri), the table is represented as a bunch of div elements. I thought there might be some JavaScript running, so I disabled JavaScript in my browser and revisited the page. It still loads up with the tables in place.
If I use Selenium to pull in the page source, I do get the table format.
Any idea on why the page comes in with the divs?
Python:
page = urllib2.urlopen(url)
html = page.read()
print html output (I put one of the divs on the last line to draw attention to it. That is a tr in the browser page. Shortened to stay under character limit):
'\t\t\t\r\n\t\t\r\n\t\t\r\n\t\t\r\n\r\n\r\n\r\n\r\n\r\n\t\t\t\t\r\n\r\n\r\n<?xml version="1.0" encoding="iso-8859-1"?>\r\n<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN" "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=iso-8859-1"/> <meta name="viewport" content="width=device-width,minimum-scale=1.0,maximum-scale=1.0"/>\r\n<meta forua="true" http-equiv="Cache-Control" content="must-revalidate" />\r\n<meta http-equiv="Pragma" content="no-cache, must-revalidate" />\r\n
<title>The Official Website of Harvard University Athletics: Harvard Athletics - GoCrimson.com : Men\'s Basketball - 2011-12 Roster </title>\r\n<link rel="stylesheet" href="/info/mobile/mobile.css" type="text/css"></link>\r\n<link rel="stylesheet" href="/mobile-overwrite.css" type="text/css"></link>\r\n</head>\r\n
<body class="classic">\r\n\r\n\r\n\t<strong>News</strong>\r\n | \r\n\tScores\r\n<br /><br />\r\n\r\n<p class="goBack-link"><<< Back</p>\r\n\r\n\r\n<div class="roster ">\r\n\t\t\t<div class="title">Men\'s Basketball - 2011-12 Roster</div>\r\n\t\t<div class="table">\r\n\t\t<div class="titles">\r\n\t\t\t
<div class="number">No.</div>\r\n\t\t\t<div class="name">Name</div>\r\n\t\t\t<div class="positions">Position</div>\r\n\t\t</div>\r\n\t\t\r\n\t\t\t\t\t<div class="item even clearfix">\r\n\t\t\t\t<div class="data">\r\n\t\t\t\t\t<div class="number">\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t3\r\n\t\t\t\t\t\t\t\t\t\t\t</div>\r\n\t\t\t\t\t<div class="name">
ruby:
doc = Nokogiri::HTML(open("http://www.google.com/search?q=doughnuts"))
doc.css('tr').each do |node|
puts node.text
end
finds no trs, but
doc.css('div').each do |node|
puts node.text
end
finds the divs
I was able to get a <table> instead of divs by adding User-Agent headers. Specifically I pretended to be a known popular browser.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent',
('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
'AppleWebKit/535.1 (KHTML, like Gecko) '
'Chrome/13.0.782.13 Safari/535.1'))
]
response = opener.open('http://www.gocrimson.com/sports/mbkb/2011-12/roster')
print response.readlines() # divs are now a table
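For reference, a Python 3 sketch of the same trick with urllib.request (only building the request here, so nothing is fetched; the actual fetch line is left commented out):

```python
import urllib.request

url = 'http://www.gocrimson.com/sports/mbkb/2011-12/roster'
req = urllib.request.Request(url, headers={
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                   'AppleWebKit/535.1 (KHTML, like Gecko) '
                   'Chrome/13.0.782.13 Safari/535.1'),
})

# The header is attached to every request made with this object
# (urllib normalizes header names to "User-agent" internally)
print(req.get_header('User-agent'))

# html = urllib.request.urlopen(req).read()  # uncomment to actually fetch
```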
I have this complicated problem that I can't find a answer to.
I have a Python HTTPServer running that serves webpages. These webpages are created at runtime with the help of Beautiful Soup. The problem is that Firefox shows the HTML code for the webpage and not the rendered page. I really don't know what is causing this problem:
- Python HTTPServer
- Beautiful Soup
- HTML Code
In any case, I have copied part of the webpage's HTML:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>
My title
</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<script src="./123_ui.js">
</script>
</head>
<body>
<div>
Hellos
</div>
</body>
</html>
Just to help you, here are the things that I have already tried:
- I have made sure that Python HTTPServer is sending the MIME header as text/html
- Just copying and pasting the HTML Code will show you correct page as its static. I can tell from here that the problem is in HTTPServer side
- Firebug shows that the element is empty, and "This element has no style rules. You can create a rule for it." is displayed
I just want to know if the error is in Beautiful Soup or HTTPServer or HTML?
Thanks,
Amit
Why are you adding this at the top of the document?
<?xml version="1.0" encoding="iso-8859-1"?>
That will make the browser think the entire document is XML and not XHTML. Removing that line should make it render correctly. I assume Firefox is displaying a page with a bunch of elements which you can expand/collapse to see the content like it normally would for an XML document, even though the HTTP headers might say it's text/html.
So guys,
I have finally solved this problem. The reason was because I wasn't sending MIME header (even though I thought I was) with content type "text/html"
In Python's HTTPServer, before writing anything to wfile you should always do this:
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
# Once you have called the above methods, you can send the HTML to the client
self.wfile.write('ANY HTML CODE YOU WANT TO WRITE')
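For completeness, here is a minimal runnable Python 3 sketch of that recipe, serving a made-up fixed page on a throwaway port and checking the Content-type header it sends:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Status and headers must be sent before any body bytes
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write(b'<html><body><div>Hellos</div></body></html>')

    def log_message(self, *args):  # keep the demo output quiet
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen('http://127.0.0.1:%d/' % server.server_port) as r:
    ct = r.headers['Content-type']
    body = r.read()
server.shutdown()

print(ct)
```

With text/html in place, the browser renders the page instead of showing the source.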