Aim: Trigger the execution of an XMLFeedSpider by passing the response as an argument (i.e. no need for start_urls).
Example Command:
scrapy crawl spider_name -a response_as_string="<xml><sometag>abc123</sometag></xml>"
Example Spider:
class ExampleXmlSpider(XMLFeedSpider):
    name = "spider_name"
    itertag = 'sometag'

    def parse_node(self, response, node):
        response2 = XmlResponse(url="Some URL", body=self.response_as_string)
        ProcessResponse().get_data(response2)

    def __init__(self, response_as_string=''):
        self.response_as_string = response_as_string
Problem: Terminal complains that there is no start_urls. I can only get the above to work if I include a dummy.xml within start_urls.
E.g.
start_urls = ['file:///home/user/dummy.xml']
Question: Is there any way to have an XMLFeedSpider that is purely driven by a response provided as an argument (as per the original command)? In that case I would need to suppress the XMLFeedSpider's need to seek out a start_url to execute a request.
Thanks Paul, you were spot on. Updated example code below: I stopped referring to the class as an XMLFeedSpider and made the Python script a plain object class that takes the url and body as arguments.
from scrapy.http import XmlResponse
class ExampleXmlSpider(object):
    def __init__(self, response_url='', response_body=''):
        self.response_url = response_url
        self.response_body = response_body

    def run(self):
        response = XmlResponse(url=self.response_url, body=self.response_body)
        print response.url
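A hedged usage sketch (the URL and body values here are just placeholders I made up): the class can now be driven entirely from a script, with no scrapy crawl involved.

spider = ExampleXmlSpider(
    response_url='http://example.com/feed.xml',
    response_body='<xml><sometag>abc123</sometag></xml>',
)
spider.run()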
Related
I'm just getting started with Scrapy and I'd like to do the following:
Have a list of n domains
i = 0
loop for i to n:
    use a (mostly) generic CrawlSpider to get all links (a href) of domain[i]
    save the results as JSON lines
To do this, the spider needs to receive the domain it has to crawl as an argument.
I already successfully created the CrawlSpider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
class MyItem(Item):
    # MyItem fields
    pass

class SubsiteSpider(CrawlSpider):
    name = "subsites"
    start_urls = []
    allowed_domains = []
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def __init__(self, starturl, allowed, *args, **kwargs):
        print(args)
        self.start_urls.append(starturl)
        self.allowed_domains.append(allowed)
        super().__init__(**kwargs)

    def parse_obj(self, response):
        item = MyItem()
        # fill item fields
        return item
process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(SubsiteSpider)
process.start()
If I call it with scrapy crawl subsites -a starturl=http://example.com -a allowed=example.com -o output.jl
the result is exactly what I want, so this part is fine already.
What I fail to do is create multiple instances of SubsiteSpider, each with a different domain as its argument.
I tried (in SpiderRunner.py):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('subsites', ['https://example.com', 'example.com'])
process.start()
Variant:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
allowed = ["example.com"]
start = ["https://example.com"]
process.crawl('subsites', start, allowed)
process.start()
But I get an error that occurs, I presume, because the arguments are not properly passed to __init__, for example TypeError: __init__() missing 1 required positional argument: 'allowed' or TypeError: __init__() missing 2 required positional arguments: 'starturl' and 'allowed'.
(The loop is yet to be implemented.)
So, here are my questions:
1) What is the proper way to pass arguments to __init__ if I do not start crawling via scrapy crawl in the shell, but from within Python code?
2) How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)
3) I am fine with this running each spider after another - would it still be considered best / good practice to do it that way? Could you point to a more extensive tutorial about "running the same spider again and again, with different arguments (= target domains), optionally in parallel", if there is one?
Thank you all very much in advance!
If there are any spelling mistakes (I'm not a native English speaker), or if the question / details are not precise enough, please tell me how to correct them.
There are a few problems with your code:
start_urls and allowed_domains are class attributes which you modify in __init__(), making them shared across all instances of your class.
What you should do instead is make them instance attributes:
class SubsiteSpider(CrawlSpider):
    name = "subsites"
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def __init__(self, starturl, allowed, *args, **kwargs):
        self.start_urls = [starturl]
        self.allowed_domains = [allowed]
        super().__init__(*args, **kwargs)
The last three lines (the CrawlerProcess part) should not be in the file with your spider class, since you probably don't want to run that code each time your spider is imported.
Your calling of CrawlerProcess.crawl() is slightly wrong. You can use it like this, passing the arguments in the same manner you'd pass them to the spider class's __init__():
process = CrawlerProcess(get_project_settings())
process.crawl('subsites', 'https://example.com', 'example.com')
process.start()
How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)
You can achieve the same effect using custom_settings, giving each instance a different FEED_URI setting.
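The answer mentions custom_settings; since custom_settings is a class attribute, a closely related per-crawl variant is sketched below as an assumed illustration (not code from the answer): each scheduled crawl gets its own Crawler object whose settings carry a per-domain FEED_URI. SubsiteSpider is assumed to be importable, and the domain list is made up.

from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.utils.project import get_project_settings

# assumed list of (start URL, allowed domain) pairs
targets = [('https://example.com', 'example.com'),
           ('https://example.org', 'example.org')]

process = CrawlerProcess(get_project_settings())
for starturl, allowed in targets:
    settings = get_project_settings().copy()
    settings.set('FEED_URI', '%s.jl' % allowed)    # one .jl file per domain
    settings.set('FEED_FORMAT', 'jsonlines')
    crawler = Crawler(SubsiteSpider, settings)     # each crawl keeps its own settings
    process.crawl(crawler, starturl, allowed)
process.start()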
Unfortunately I don't have enough reputation to post a comment, so I have to ask this as a new question, referring to https://stackoverflow.com/questions/23105590/how-to-get-the-pipeline-object-in-scrapy-spider
I have many URLs in a database, and I want to get the start URLs from my DB. So far, not a big problem.
Well, I don't want the MySQL code inside the spider, and in the pipeline I run into a problem.
If I try to hand over the pipeline object to my spider like in the referenced question, I only get an AttributeError with the message
'NoneType' object has no attribute 'getUrl'
I think the actual problem is that the function spider_opened doesn't get called (I also inserted a print statement, which never showed its output in the console).
Does anybody have an idea how to get the pipeline object inside the spider?
MySpider.py
import scrapy

class MySpider(scrapy.Spider):
    def __init__(self):
        self.pipe = None

    def start_requests(self):
        url = self.pipe.getUrl()
        yield scrapy.Request(url, callback=self.parse)
Pipeline.py
from scrapy import signals

class Pipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)

    def spider_opened(self, spider):
        spider.pipe = self

    def getUrl(self):
        ...
Scrapy item pipelines already have the expected methods open_spider and close_spider.
Taken from docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html#open_spider
open_spider(self, spider)
This method is called when the spider is opened.
Parameters: spider (Spider object) – the spider which was opened
close_spider(self, spider)
This method is called when the spider is closed.
Parameters: spider (Spider object) – the spider which was closed
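For reference, a minimal sketch (reusing the asker's Pipeline class from above) of what that means here: renaming spider_opened to open_spider lets Scrapy call it automatically, with no signal wiring in from_crawler.

class Pipeline(object):
    def open_spider(self, spider):
        # called automatically when the spider opens
        spider.pipe = self

    def getUrl(self):
        ...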
However, your original approach doesn't make much sense: why do you want to assign a pipeline reference to your spider? That seems like a very bad idea.
What you should do instead is open up the db and read the urls in your spider itself.
from scrapy import Spider
class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.start_urls = self.get_urls_from_db()
        return spider

    def get_urls_from_db(self):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
I'm using the accepted solution but it doesn't work as expected:
TypeError: get_urls_from_db() missing 1 required positional argument: 'self'
Here's the version that worked for me:
import os

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = []

    def __init__(self, db_dsn):
        self.db_dsn = db_dsn
        self.start_urls = self.get_urls_from_db(db_dsn)

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls(
            db_dsn=os.getenv('DB_DSN', 'mongodb://localhost:27017'),
        )
        spider._set_crawler(crawler)
        return spider

    def get_urls_from_db(self, db_dsn):
        db = ...  # get db cursor here
        urls = ...  # use cursor to pop your urls
        return urls
I am running a CrawlSpider and I want to implement some logic to stop following some of the links in mid-run, by passing a function to process_request.
This function uses the spider's class variables in order to keep track of the current state, and depending on it (and on the referrer URL), links get dropped or continue to be processed:
from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5
    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        # <some code>
        pass

    def filter_requests(self, request):
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request
I think that if I were to run several spiders on the same machine, they would all use the same class variables which is not my intention.
Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?
I could probably work around it with a dictionary with values per process ID, but that will be ugly...
I think spider arguments would be the solution in your case.
When invoking scrapy like scrapy crawl some_spider, you could add arguments like scrapy crawl some_spider -a foo=bar, and the spider would receive the values via its constructor, e.g.:
class SomeSpider(scrapy.Spider):
    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo
What's more, as scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to explicitly override the __init__ method but just access the .foo attribute. :)
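A small sketch of that shortcut (the spider name and the foo attribute are assumed for illustration): after scrapy crawl some_spider -a foo=bar, the base __init__ stores foo on the instance, so it can be read directly; note that the value arrives as a string.

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some_spider'
    start_urls = ['http://scrapy.org']

    def parse(self, response):
        # self.foo was set by Spider.__init__ from the -a foo=... argument
        foo = getattr(self, 'foo', None)
        self.logger.info('foo is %r', foo)
        yield {'url': response.url, 'foo': foo}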
I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of every modified URL.
For this task I decided to use Scrapy because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of that documentation I created my spider, but I want to change the URLs before they are fetched. For this I tried to take help from this link, which suggested using rules and implementing process_request(). I am not able to make use of these; the little I tried is commented out below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    #sitemap_rules = [some_rules, process_request='process_request')]

    #def process_request(self, request, spider):
    #    modified_url = orginal_url_from_sitemap + 'myword'
    #    return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can connect a function to the request_scheduled signal and do what you want in that function. For example:
from scrapy import signals
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        modified_url = orginal_url_from_sitemap + 'myword'
        request.url = modified_url
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.
class MySpider(SitemapSpider):
    ...

    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + myword
            yield entry
Each of those entry objects is a dict with a structure like this:

{'loc': 'https://example.com/',
 'lastmod': '2019-01-04T08:09:23+00:00',
 'changefreq': 'weekly',
 'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the 1.6.0 release notes, "new extensibility features" section).
I've just faced this. Apparently you can't really use process_request, because sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider - only the latter can take this argument.
After examining the code it looks like this can be avoided by manually overriding part of the SitemapSpider implementation:
from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        # ...
        pass
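For the original question (appending a word to each sitemap URL), modify_request could plausibly be filled in along these lines; 'myword' is just the placeholder from the question.

    def modify_request(self, request):
        # append the extra word to the URL taken from the sitemap
        return request.replace(url=request.url + 'myword')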
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task

@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider

class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function, which in short copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
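Concretely, the fixed constructor described above looks like this (the same code as in the question's __init__, with only the super() call moved to the end):

    def __init__(self, **kw):
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]
        # called last, so CrawlSpider._compile_rules() sees self.rules
        super(WebSpider, self).__init__(**kw)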
Update: There is a little mistake in the code from the previously mentioned SO answer. It causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your Django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.