Scrapy: Define items dynamically - python

As I started to learn scrapy, i have come accross a requirement to dynamically build the Item attributes. I'm just scraping a webpage which has a table structure and I wanted to form the item and field attributes while crawling. I have gone through this example Scraping data without having to explicitly define each field to be scraped but couldn't make much of it.
Should I be writing an item pipleline to capture the info dynamically. I have also looked at Item loader function, but if anyone can explain in detail, it will be really helpful.

Just use a single Field as an arbitrary data placeholder. And then when you want to get the data out, instead of saying for field in item, you say for field in item['row']. You don't need pipelines or loaders to accomplish this task, but they are both used extensively for good reason: they are worth learning.
spider:
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
class TableItem(Item):
row = Field()
class TestSider(BaseSpider):
name = "tabletest"
start_urls = ('http://scrapy.org?finger', 'http://example.com/toe')
def parse(self, response):
item = TableItem()
row = dict(
foo='bar',
baz=[123, 'test'],
)
row['url'] = response.url
if 'finger' in response.url:
row['digit'] = 'my finger'
row['appendage'] = 'hand'
else:
row['foot'] = 'might be my toe'
item['row'] = row
return item
outptut:
stav#maia:/srv/stav/scrapie/oneoff$ scrapy crawl tabletest
2013-03-14 06:55:52-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: oneoff)
2013-03-14 06:55:52-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'oneoff.spiders', 'SPIDER_MODULES': ['oneoff.spiders'], 'USER_AGENT': 'Chromium OneOff 24.0.1312.56 Ubuntu 12.04 (24.0.1312.56-0ubuntu0.12.04.1)', 'BOT_NAME': 'oneoff'}
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Enabled item pipelines:
2013-03-14 06:55:53-0600 [tabletest] INFO: Spider opened
2013-03-14 06:55:53-0600 [tabletest] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-14 06:55:53-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://scrapy.org?finger> (referer: None)
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://scrapy.org?finger>
{'row': {'appendage': 'hand',
'baz': [123, 'test'],
'digit': 'my finger',
'foo': 'bar',
'url': 'http://scrapy.org?finger'}}
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example/> from <GET http://example.com/toe>
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Redirecting (302) to <GET http://www.iana.org/domains/example> from <GET http://www.iana.org/domains/example/>
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Crawled (200) <GET http://www.iana.org/domains/example> (referer: None)
2013-03-14 06:55:53-0600 [tabletest] DEBUG: Scraped from <200 http://www.iana.org/domains/example>
{'row': {'baz': [123, 'test'],
'foo': 'bar',
'foot': 'might be my toe',
'url': 'http://www.iana.org/domains/example'}}
2013-03-14 06:55:53-0600 [tabletest] INFO: Closing spider (finished)
2013-03-14 06:55:53-0600 [tabletest] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1066,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 3833,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 848735),
'item_scraped_count': 2,
'log_count/DEBUG': 13,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2013, 3, 14, 12, 55, 53, 99635)}
2013-03-14 06:55:53-0600 [tabletest] INFO: Spider closed (finished)

Use this class:
class Arbitrary(Item):
def __setitem__(self, key, value):
self._values[key] = value
self.fields[key] = {}

The custom __setitem__ solution didn't work for me when using item loaders in Scrapy 1.0.3 because the item loader accesses the fields attribute directly:
value = self.item.fields[field_name].get(key, default)
The custom __setitem__ is only called for item-level accesses like item['new field']. Since fields is just a dict, I realized I could simply create an Item subclass that uses a defaultdict to gracefully handle these situations.
In the end, just two extra lines of code:
from collections import defaultdict
class FlexItem(scrapy.Item):
"""An Item that creates fields dynamically"""
fields = defaultdict(scrapy.Field)

In Scrapy 1.0+ the better way could be to yield Python dicts instead of Item instances if you don't have a well-defined schema. Check e.g. an example on http://scrapy.org/ front page - there is no Item defined.

I was more xpecting about explanation in handling the data with item loaders and pipelines
Assuming:
fieldname = 'test'
fieldxpath = '//h1'
It's (in recent versions) very simple...
item = Item()
l = ItemLoader(item=item, response=response)
item.fields[fieldname] = Field()
l.add_xpath(fieldname, fieldxpath)
return l.load_item()

Related

Scrapy terminated unexpected

I follow book published by o'Reilly to create a spider as below:
articleSpider.py
from scrapy.spiders import CrawlSpider, Rule
from TestScrapy.items import Article
from scrapy.linkextractors import LinkExtractor
class ArticleSpider(CrawlSpider):
name = "article"
allowed_domains = ["en.wikipedia.org"]
start_urls = ["https://en.wikipedia.org/wiki/Object-oriented_programming"]
rules = [Rule(LinkExtractor(allow=('(/wiki/)((?!:).)*$'), ),callback="parse_item", follow=True)]
def parse(self, response):
item = Article()
title = response.xpath('//h1/text()')[0].extract()
print "Title is:" + title
item['title'] = title
return item
Items.py
from scrapy import Item, Field
class Article(Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = Field()
However, when I run this spider, it just display one result and the terminates. I expect it to run until I terminate it.
Please see the result and debug info from Scrapy:
2016-06-06 15:45:28 [scrapy] INFO: Scrapy 1.0.3 started (bot: TestScrapy)
2016-06-06 15:45:28 [scrapy] INFO: Optional features available: ssl, http11
2016-06-06 15:45:28 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TestScrapy.spiders', 'SPIDER_MODULES': ['TestScrapy.spiders'], 'BOT_NAME': 'TestScrapy'}
2016-06-06 15:45:29 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-06 15:45:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-06 15:45:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-06 15:45:30 [scrapy] INFO: Enabled item pipelines:
2016-06-06 15:45:30 [scrapy] INFO: Spider opened
2016-06-06 15:45:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-06 15:45:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-06 15:45:33 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Object-oriented_programming> (referer: None)
Title is:Object-oriented programming
2016-06-06 15:45:33 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/Object-oriented_programming>
{'title': u'Object-oriented programming'}
2016-06-06 15:45:33 [scrapy] INFO: Closing spider (finished)
2016-06-06 15:45:33 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 246,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 51238,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 6, 7, 45, 33, 441000),
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 6, 6, 7, 45, 30, 614000)}
2016-06-06 15:45:33 [scrapy] INFO: Spider closed (finished)
Change the method name from parse to parse_item.
Now you're just crawling the start url but when filtering the rules there is no method to callback, thus the spider ends the execution.
Check this example of CrawlSpider:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example
You can also use start_product_requests instead of parse here.

Why doesn't the spider return any response for this site?

I'm using scrapy to scrap this site but when I run the spider I don't see any response.
I tried reddit.com and quora.com and they both returned data (started to crawl) but not the site I want.
Here is my simple spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
class FirstSpider(CrawlSpider):
name = "jobs"
allowed_domains = ["bayt.com"]
start_urls = (
'http://www.bayt.com/',
)
rules = [
Rule(
LinkExtractor(allow=['.*']),
)
]
I tried several combinations of urls in the start_urls but nothing seemed to work.
Here is the log after running the spider:
2015-12-13 20:31:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: bayt)
2015-12-13 20:31:45 [scrapy] INFO: Optional features available: ssl, http11
2015-12-13 20:31:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bayt.spiders', 'SPIDER_MODULES': ['bayt.spiders'], 'BOT_NAME': 'bayt'}
2015-12-13 20:31:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-13 20:31:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-13 20:31:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-13 20:31:45 [scrapy] INFO: Enabled item pipelines:
2015-12-13 20:31:45 [scrapy] INFO: Spider opened
2015-12-13 20:31:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-13 20:31:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-13 20:31:45 [scrapy] DEBUG: Redirecting (302) to <GET http://www.bayt.com/en/jordan/> from <GET http://www.bayt.com/>
2015-12-13 20:31:46 [scrapy] DEBUG: Crawled (200) <GET http://www.bayt.com/en/jordan/> (referer: None)
2015-12-13 20:31:46 [scrapy] INFO: Closing spider (finished)
2015-12-13 20:31:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 881,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2320,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 12, 13, 18, 31, 46, 212468),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 12, 13, 18, 31, 45, 138408)}
2015-12-13 20:31:46 [scrapy] INFO: Spider closed (finished)
the problem is that you are not using the rules as you mentioned, you have your own parse method, which is not ok, CrawlSpider uses the parse method so you shouldn't override that method.
Now, if you are still getting items when overriding the parse method, it is because parse is the default method for the start_urls requests, so the requests are not really following the rules, but only crawling urls inside start_urls
Just change the name of your parsing method from parse to a different one, and specify that on your rule as a callback.
I did Curl www.bayt.com in the command line and it seems that they redirect the request to http://www.bayt.com/en/jordan/
I put that as My start_urls and it worked and changed the user agent in the settings.py to localhost and it worked.

Scrapy: only scrape parts of website

I'd like to scrape parts of a number of very large websites using Scrapy. For instance, from northeastern.edu I would like to scrape only pages that are below the URL http://www.northeastern.edu/financialaid/, such as http://www.northeastern.edu/financialaid/contacts or http://www.northeastern.edu/financialaid/faq. I do not want to scrape the university's entire web site, i.e. http://www.northeastern.edu/faq should not be allowed.
I have no problem with URLs in the format financialaid.northeastern.edu (by simply limiting the allowed_domains to financialaid.northeastern.edu), but the same strategy doesn't work for northestern.edu/financialaid. (The whole spider code is actually longer as it loops through different web pages, I can provide details. Everything works apart from the rules.)
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem
class DomainSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['northestern.edu/financialaid']
start_urls = ['http://www.northestern.edu/financialaid/']
rules = (
Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
)
def parse_item(self, response):
i = testItem()
#i['domain_id'] = response.xpath('//input[#id="sid"]/#value').extract()
#i['name'] = response.xpath('//div[#id="name"]').extract()
#i['description'] = response.xpath('//div[#id="description"]').extract()
return i
The results look like this:
2015-05-12 14:10:46-0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: finaid_scraper)
2015-05-12 14:10:46-0700 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:10:46-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finaid_scraper.spiders', 'SPIDER_MODULES': ['finaid_scraper.spiders'], 'FEED_URI': '/Users/hugo/Box Sync/finaid/ScrapedSiteText_check/Northeastern.json', 'USER_AGENT': 'stanford_sociology', 'BOT_NAME': 'finaid_scraper'}
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:10:46-0700 [graphspider] INFO: Spider opened
2015-05-12 14:10:46-0700 [graphspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-12 14:10:46-0700 [graphspider] DEBUG: Redirecting (301) to <GET http://www.northeastern.edu/financialaid/> from <GET http://www.northeastern.edu/financialaid>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/financialaid/> (referer: None)
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'assistive.usablenet.com': <GET http://assistive.usablenet.com/tt/http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.northeastern.edu': <GET http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/pages/Boston-MA/NU-Student-Financial-Services/113143082891>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/NUSFS>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'nusfs.wordpress.com': <GET http://nusfs.wordpress.com/>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'northeastern.edu': <GET http://northeastern.edu/howto>
2015-05-12 14:10:47-0700 [graphspider] INFO: Closing spider (finished)
2015-05-12 14:10:47-0700 [graphspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 431,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9574,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 12, 21, 10, 47, 94112),
'log_count/DEBUG': 10,
'log_count/INFO': 7,
'offsite/domains': 6,
'offsite/filtered': 32,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 5, 12, 21, 10, 46, 566538)}
2015-05-12 14:10:47-0700 [graphspider] INFO: Spider closed (finished)
The second strategy I attempted was to use allow-rules of the LxmlLinkExtractor and to limit the crawl to everything within the sub-domain, but in that case the entire web page is scraped. (Deny-rules do work.)
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem
class DomainSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['www.northestern.edu']
start_urls = ['http://www.northestern.edu/financialaid/']
rules = (
Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
)
def parse_item(self, response):
i = testItem()
#i['domain_id'] = response.xpath('//input[#id="sid"]/#value').extract()
#i['name'] = response.xpath('//div[#id="name"]').extract()
#i['description'] = response.xpath('//div[#id="description"]').extract()
return i
I also tried:
rules = (
Rule(LxmlLinkExtractor(allow=(r"northeastern.edu/financialaid",)), callback='parse_site', follow=True),
)
The log is too long to be posted here, but these lines show that Scrapy ignores the allow-rule:
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/tag/north-carolina/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Scraped from <200 http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/>
Here is my items.py:
from scrapy.item import Item, Field
class FinAidScraperItem(Item):
# define the fields for your item here like:
url=Field()
linkedurls=Field()
internal_linkedurls=Field()
external_linkedurls=Field()
http_status=Field()
title=Field()
text=Field()
I am using Mac, Python 2.7, Scrapy version 0.24.4. Similar questions have been posted before, but none of the suggested solutions fixed my problem.
You have a typo in your URLs used inside spiders, see:
northeastern
vs
northestern
Here is the spider that worked for me (it follows "financialaid" links only):
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class DomainSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['northeastern.edu']
start_urls = ['http://www.northeastern.edu/financialaid/']
rules = (
Rule(LinkExtractor(allow=r"financialaid/"), callback='parse_item', follow=True),
)
def parse_item(self, response):
print response.url
Note that I'm using LinkExtractor shortcut and a string for the allow argument value.
I've also edited your question and fixed the indentation problems assuming they were just "posting" issues.

Quickly mass scrape Facebook fan pages' numeric IDs [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Many Facebook fan pages are now in the following format - https://www.facebook.com/TiltedKiltEsplanade where "TiltedKiltEsplanade" is an example of the name claimed by the page owner. However, the same page's RSS feed is found at https://www.facebook.com/feeds/page.php?id=414117051979234&format=rss20 where 414117051979234 is an ID that can be determined by going to https://graph.facebook.com/TiltedKiltEsplanade and looking for the last numeric ID listed on the page (there are two similar-looking IDs at the top of the page but these can be ignored).
I have a long list of Facebook fan pages in the format described above and I would like to quickly grab the numeric IDs that correspond to those pages so that I can add all of them to an RSS reader. What would be the simplest way to scrape these pages? I am familiar with Scrapy but I'm not sure if it can be used because the graph version of the page isn't marked up in a way that allows for easy scraping (as far as I can tell)
Thanks.
The output of the graph request is a JSON object. That's way more easy to process than HTML content.
This would be a simple implementation of what you are looking for:
# file: myspider.py
import json
from scrapy.http import Request
from scrapy.spider import BaseSpider
class MySpider(BaseSpider):
name = 'myspider'
start_urls = (
# Add here more urls. Alternatively, make the start urls dynamic
# reading them from a file, db or an external url.
'https://www.facebook.com/TiltedKiltEsplanade',
)
graph_url = 'https://graph.facebook.com/{name}'
feed_url = 'https://www.facebook.com/feeds/page.php?id={id}&format=rss20'
def start_requests(self):
for url in self.start_urls:
# This assumes there is no trailing slash
name = url.rpartition('/')[2]
yield Request(self.graph_url.format(name=name), self.parse_graph)
def parse_graph(self, response):
data = json.loads(response.body)
return Request(self.feed_url.format(id=data['id']), self.parse_feed)
def parse_feed(self, response):
# You can use the xml spider, xml selector or the feedparser module
# to extract information from the feed.
self.log('Got feed: %s' % response.body[:100])
The output:
$ scrapy runspider myspider.py
2014-01-11 02:19:48-0400 [scrapy] INFO: Scrapy 0.21.0-97-g21a8a94 started (bot: scrapybot)
2014-01-11 02:19:48-0400 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-01-11 02:19:48-0400 [scrapy] DEBUG: Overridden settings: {}
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Enabled item pipelines:
2014-01-11 02:19:49-0400 [myspider] INFO: Spider opened
2014-01-11 02:19:49-0400 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-11 02:19:49-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-11 02:19:49-0400 [myspider] DEBUG: Crawled (200) <GET https://graph.facebook.com/TiltedKiltEsplanade> (referer: None)
2014-01-11 02:19:50-0400 [myspider] DEBUG: Crawled (200) <GET https://www.facebook.com/feeds/page.php?id=414117051979234&format=rss20> (referer: https://graph.facebook.com/TiltedKiltEsplanade)
2014-01-11 02:19:50-0400 [myspider] DEBUG: Got feed: <?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
xmlns:media="http://search.yahoo.com
2014-01-11 02:19:50-0400 [myspider] INFO: Closing spider (finished)
2014-01-11 02:19:50-0400 [myspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 578,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 6669,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 1, 11, 6, 19, 50, 849162),
'log_count/DEBUG': 9,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 1, 11, 6, 19, 49, 221361)}
2014-01-11 02:19:50-0400 [myspider] INFO: Spider closed (finished)

Scrapy parse_item callback not being called

I'm having a problem getting my Scrapy spider to run its callback method.
I don't think it's an indentation error which seems to be the case for the other previous posts, but perhaps it is and I don't know it? Any ideas?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import tldextract
class CrawlerSpider(CrawlSpider):
name = "crawler"
def __init__(self, initial_url):
log.msg('initing...', level=log.WARNING)
CrawlSpider.__init__(self)
if not initial_url.startswith('http'):
initial_url = 'http://' + initial_url
ext = tldextract.extract(initial_url)
initial_domain = ext.domain + '.' + ext.tld
initial_subdomain = ext.subdomain + '.' + ext.domain + '.' + ext.tld
self.allowed_domains = [initial_domain, 'www.' + initial_domain, initial_subdomain]
self.start_urls = [initial_url]
self.rules = [
Rule(SgmlLinkExtractor(), callback='parse_item'),
Rule(SgmlLinkExtractor(allow_domains=self.allowed_domains), follow=True),
]
self._compile_rules()
def parse_item(self, response):
log.msg('parse_item...', level=log.WARNING)
hxs = HtmlXPathSelector(response)
links = hxs.select("//a/#href").extract()
for link in links:
log.msg('link', level=log.WARNING)
Sample output is below; it should show a warning message with "parse_item..." printed but it doesn't.
$ scrapy crawl crawler -a initial_url=http://www.szuhanchang.com/test.html
2013-02-19 18:03:24+0000 [scrapy] INFO: Scrapy 0.16.4 started (bot: crawler)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Enabled item pipelines:
2013-02-19 18:03:24+0000 [scrapy] WARNING: initing...
2013-02-19 18:03:24+0000 [crawler] INFO: Spider opened
2013-02-19 18:03:24+0000 [crawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-19 18:03:24+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-19 18:03:25+0000 [crawler] DEBUG: Crawled (200) <GET http://www.szuhanchang.com/test.html> (referer: None)
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
2013-02-19 18:03:25+0000 [crawler] INFO: Closing spider (finished)
2013-02-19 18:03:25+0000 [crawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 363,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 2, 19, 18, 3, 25, 84855),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 2, 19, 18, 3, 24, 805064)}
2013-02-19 18:03:25+0000 [crawler] INFO: Spider closed (finished)
Thanks in advance!
The start_urls of http://www.szuhanchang.com/test.html has only one anchor link, namely:
Test
which contains a link to the domain 20130219-0606.com and according to your allowed_domains of:
['szuhanchang.com', 'www.szuhanchang.com', 'www.szuhanchang.com']
this Request gets filtered by the OffsiteMiddleware:
2013-02-19 18:03:25+0000 [crawler] DEBUG: Filtered offsite request to 'www.20130219-0606.com': <GET http://www.20130219-0606.com/>
therefore parse_item will not be called for this url.
Changing the name of your callback to parse_start_url seems to work, although since the test URL provided is quite small, I cannot be sure if this will still be effective. Give it a go and let me know. :)

Categories