Scrapy Django Limit links crawled - python

I just got scrapy setup and running and it works great, but I have two (noob) questions. I should say first that I am totally new to scrapy and spidering sites.
Can you limit the number of links crawled? I have a site that doesn't use pagination and just lists a lot of links (which I crawl) on their home page. I feel bad crawling all of those links when I really just need to crawl the first 10 or so.
How do you run multiple spiders at once? Right now I am using the command scrapy crawl example.com, but I also have spiders for example2.com and example3.com. I would like to run all of my spiders using one command. Is this possible?

for #1: Don't use rules attribute to extract links and follow, write your rule in parse function and yield or return Requests object.
for #2: Try scrapyd

Credit goes to Shane, here https://groups.google.com/forum/?fromgroups#!topic/scrapy-users/EyG_jcyLYmU
Using a CloseSpider should allow you to specify limits of this sort.
http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider
Haven't tried it yet since I didn't need it. Looks like you also might have to enable as an extension (see top of same page) in your settings file.

Related

How to use one crawler for multiple domains?

I'm working on a project that involves crawling multiple domains in their entirety. My scraper simply crawls the whole domain and doesn't check specific parts of the html, it just gets all the html.
For some domains, I would only want to crawl one subdomain, but other than that everything about the crawl itself would be the same for each domain.
The rest of the things that are different between the domains would be handled once the crawl is finished, in a separate python script.
My question is, do I need to write a unique Scrapy crawler for each domain, or can I use one, and pass to it parameters for allowed_domains/start_urls.
I'm using Scraping Hub, and I may need to run the crawl on all my domains at the same time.

How to crawl specific ASP.NET pages using Python?

I want to crawl an ASP.NET website but the urls are all the same how can I crawl specific pages using python?
here is the website I want to crawl:
http://www.fveconstruction.ch/index.htm
(I am using beautifulsoup, urllib and python 3)
What information should I get to distinguish a page from the other?
If the target website is just a single page application, it can't be crawled. As a workaround you can see the requests (GET, POST etc) that actually go when you manually navigate through the website and ask the crawler to use those. Or, teach your crawler to execute javascript at least what's on the target website.
It's the website who need to change to be easily crawlable, they need to provide a reasonable non-AJAX version of every page that needs to be indexed, or links to a page that needs to be indexed. Or use something like what pushState does in angularJs.

Browse links recursively using selenium

I'd like to know if is it possible to browse all links in a site (including the parent links and sublinks) using python selenium (example: yahoo.com),
fetch all links in the homepage,
open each one of them
open all the links in the sublinks to three four levels.
I'm using selenium on python.
Thanks
Ala'a
You want "web-scraping" software like Scrapy and possibly Beautifulsoup4 - the first is used to build a program called a "spider" which "crawls" through web pages, extracting structured data from them, and following certain (or all) links in them. BS4 is also for extracting data from web pages, and combined with libraries like requests can be used to build your own spider, though at this point something like Scrapy is probably more relevant to what you need.
There are numerous tutorials and examples out there to help you - just start with the google search I linked above.
Sure it is possible, but you have to instruct selenium to enter these links one by one as you are working within one browser.
In case, the pages are not having the links rendered by JavaScript in the browser, it would be much more efficient to fetch these pages by direct http request and process it this way. In this case I would recommend using requests. However, even with requests it is up to your code to locate all urls in the page and follow up with fetching those pages.
There might be also other Python packages, which are specialized on this kind of task, but here I cannot serve with real experience.

Scraping forums with scrapy

I am trying to scrap some forums with scrapy and store the data in a database. But I don't know to do it efficiently when it comes to updating the database. This is what my spider looks like:
class ForumSpider(CrawlSpider):
name = "forum"
allowed_domains= ["forums.example.com"]
start_urls = ["forums.example.com/index.php"]
rules = (
Rule(SgmlLinkExtractor(allow=(r'/forum?id=\d+',)),
follow=True, callback='parse_index'),
)
def parse_index(self, response):
hxs = HtmlXPathSelector(response)
#parsing....looking for threads.....
#pass the data to pipeline and store in to the db....
My problem is when I scrap the same forum again, say a week later, there is no point to go through all the pages, because the new threads or any threads with new post would be on top of other inactive threads. My idea is to check the first pages of a forum(forums.example.com/forum?id=1), if it found a thread with the same URL and the same number of reply on page one. There is no point to go to the second page. So the spider should proceed to another forum(forums.example.com/forum?id=2). I tried modifying the start_urls and rules, but it seemed like they are not responding once the spider is running. Is there a way to do it in scrapy?
My second problem is how to use different pipeline for different spiders. I found something on stack overflow. But it seems like scrapy isn't built to do this, it seems like you suppose to create a new project for different sites.
Am I using the wrong tool to do this? Or I am missing something. I thought about using mechanize and lxml to do it. But I need to implement twisted and unicode handling and so on which makes me want to stick with scrapy
Thanks
What you are asking for is to create a http requests on fly.
Inside the parse_index function do this.
request = self.make_requests_from_url(http://forums.example.com/forum?id=2)
return request
If you want to submit multiple http requests return a array.
See this Request in scrapy
You are right about the second thing, you are suppose to write different spiders if you want to extract different type of data from different websites.

Scrapy - Change rules after spider starts crawling

My query is for the CrawlSpider
I understand the link extractor rules is a static variable,
Can i change the rules in runtime say, like
#classmethod
def set_rules(cls,rules):
cls.rules = rules
by
self.set_rules(rules)
Is this the acceptable practice for the CrawlSpider ? if not please suggest the appropriate method
My use case,
I'm using scrapy to crawl certain categories A,B,C....Z of a particular website. each category has 1000 links spread over 10 pages
and when scrapy hits a link in a some category which is "too old". I'd like the crawler to stop following/crawling the remainder of the 10 pages ONLY for that category alone and thus my requirement of dynamic rule changes.
Please point me out on the right direction.
Thanks!
The rules in a spider aren't meant to be changed dynamically. They are compiled at instantiation of the CrawlSpider. You could always change your spider.rules and re-run spider._compile_rules(), but I advise against it.
The rules create a set of instructions for the Crawler in what to queue up to crawl (ie. it queues Requests). These requests aren't revisited and re-evaluated before they are dispatched, as the rules weren't "designed" to change. So even if you did change the rules dynamically, you may still end up making a bunch of requests you didn't intend to, and still crawl a bunch of content you didn't mean to.
For instance, if your target page is setup so that the page for "Category A" contains links to pages 1 to 10 of "Category A", then Scrapy will queue up requests for all 10 of these pages. If Page 2 turns out to have entries that are "too old", changing the rules will do nothing because requests for pages 3-10 are already queued to go.
As #imx51 said, it would be much better to write a Downloader Middleware. These would be able to drop each request that you not longer want to make as they trigger for every request going through it before it's downloaded.
I would suggest to you to write your own custom downloader middleware. These would allow you to filter out those requests that you not longer want to make.
Further details about the architecture overview of Scrapy can you find here: http://doc.scrapy.org/en/master/topics/architecture.html
And about downloader middleware and how to write your custom one: http://doc.scrapy.org/en/master/topics/downloader-middleware.html

Categories