Managing multiple spiders with scrapy - python

I am creating an aggregator and I started with Scrapy as my initial tool set.
At first I only had a few spiders, but as the project grows it seems like I may end up with hundreds or even a thousand different spiders as I scrape more and more sites.
What is the best way to manage these spiders, given that some websites only need to be crawled once while others need to be crawled more regularly?
Is Scrapy still a good tool when dealing with so many sites, or would you recommend some other technology?

You can check out the project scrapely, which is from the creators of Scrapy. But, as far as I know, it is not suitable for sites whose content is generated by JavaScript (more precisely, it only works if the parsed data is not generated by JavaScript).
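For reference, scrapely learns from annotated example pages rather than relying on hand-written selectors; here is a minimal sketch of its documented train/scrape usage, with placeholder URLs and field values:

```python
from scrapely import Scraper

s = Scraper()

# Train on one example page by supplying the values you expect to extract from it
s.train("http://example.com/product/1", {"name": "Example product", "price": "9.99"})

# The trained scraper can then extract the same fields from similarly structured pages
print(s.scrape("http://example.com/product/2"))
```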

Related

How to use one crawler for multiple domains?

I'm working on a project that involves crawling multiple domains in their entirety. My scraper simply crawls the whole domain and doesn't check specific parts of the HTML; it just gets all the HTML.
For some domains I only want to crawl one subdomain, but other than that everything about the crawl itself would be the same for each domain.
Everything else that differs between the domains would be handled once the crawl is finished, in a separate Python script.
My question is: do I need to write a unique Scrapy crawler for each domain, or can I use one and pass it parameters for allowed_domains/start_urls?
I'm using Scraping Hub, and I may need to run the crawl on all my domains at the same time.
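One common approach is a single generic spider that takes its domains and start URLs as spider arguments (passed with -a on the command line). A minimal sketch, assuming hypothetical argument names domains and urls supplied as comma-separated strings:

```python
import scrapy

class GenericSpider(scrapy.Spider):
    # Hypothetical generic spider: the target domains/URLs are supplied at run time
    name = "generic"

    def __init__(self, domains="", urls="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # e.g. scrapy crawl generic -a domains=example.com -a urls=https://example.com/
        self.allowed_domains = [d for d in domains.split(",") if d]
        self.start_urls = [u for u in urls.split(",") if u]

    def parse(self, response):
        # Store the raw HTML of every page and keep following in-domain links
        yield {"url": response.url, "html": response.text}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Scrapyd and Scrapy Cloud both accept spider arguments when scheduling a job, so the same spider can be scheduled once per domain and the jobs can run in parallel.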

Need to implement a web-scraper to compile a database of images from https://diatoms.org/species

For a research project, I am trying to implement a script that will go through this site and save the set of images from each species, with each file saved as "genus_species_index.jpeg". I have also been looking at Beautiful Soup tutorials. The main issue is that accessing each species page via a script has proved to be quite difficult.
I would recommend looking at Scrapy to solve your problem. Beautiful Soup is a parser (that does a great job of what you are looking for) but does not handle the crawling. Generally, for tasks like this, you first crawl the site and then parse the pages to extract the data; frameworks like Scrapy were invented for the first purpose. (Here is a link for some context: https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
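To show how the crawling and parsing fit together, here is a minimal Scrapy sketch; the CSS selectors are placeholders, since the actual markup of diatoms.org would need to be inspected first:

```python
import scrapy

class SpeciesSpider(scrapy.Spider):
    # Hypothetical spider: the selector strings below are placeholders, not the site's real structure
    name = "diatoms"
    start_urls = ["https://diatoms.org/species"]

    def parse(self, response):
        # Follow each link from the species index to an individual species page
        for href in response.css("a.species-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_species)

    def parse_species(self, response):
        # Collect the image URLs on the species page; Scrapy's ImagesPipeline
        # (or a separate download step) can then fetch and rename the files.
        yield {
            "species": response.css("h1::text").get(),
            "image_urls": response.css("img::attr(src)").getall(),
        }
```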

Web crawler - how to build the visited url set?

I have implemented a distributed web crawler on RabbitMQ. Everything is almost done except the visited-URL set. I want some kind of shared variable between the different crawlers.
Furthermore, from what I have been reading, this URL set will be huge and should be stored on disk.
What is the best way to store, access and share this visited-URLs list in a distributed environment?
As majidkabir says, Nutch is quite a good solution...but that doesn't answer the question since it's about how to track state when building your own crawler.
I'll offer the approach I took when I created a crawler in Node (https://www.npmjs.com/package/node-nutch). As you can see from the name, the approach I've taken is in turn modelled on the approach taken in Nutch.
All I did was to use the URL as the key (after normalising), and then store a simple JSON file in S3 that contained the state of the crawl. When it was time to run the next crawl I'd whizz through each of these JSON files looking for candidates to be crawled and then after retrieving the page, set the JSON to indicate when to crawl next.
The number of pages I was crawling was never very large so this worked fine, but if it did get larger I'd put the JSON into something like ElasticSearch and then search for URLs to be crawled based on a date field.
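A rough Python sketch of that idea (the original was written in Node; the bucket name, key scheme and state fields here are my own assumptions):

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "crawl-state"  # hypothetical bucket name


def state_key(url):
    # The (normalised) URL is the key; hashing it keeps the S3 key a fixed length
    return "state/" + hashlib.sha1(url.encode("utf-8")).hexdigest() + ".json"


def mark_crawled(url, recrawl_after_days=7):
    # Store a small JSON document recording when the URL should be fetched next
    now = datetime.now(timezone.utc)
    state = {
        "url": url,
        "last_crawled": now.isoformat(),
        "next_crawl": (now + timedelta(days=recrawl_after_days)).isoformat(),
    }
    s3.put_object(Bucket=BUCKET, Key=state_key(url), Body=json.dumps(state).encode("utf-8"))


def is_due(url):
    # A URL is due if it has never been seen, or its next_crawl time has passed
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=state_key(url))
    except s3.exceptions.NoSuchKey:
        return True
    state = json.loads(obj["Body"].read())
    return datetime.fromisoformat(state["next_crawl"]) <= datetime.now(timezone.utc)
```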
Ideally, any storage that is scalable and supports indexing can be used for such use cases.
Some of the systems I know are used for this purpose are Solr, Elasticsearch, Redis, or any SQL database that can scale.
I have used Redis for this and have stored approximately 2 million URLs. I am quite sure that by adding nodes I would be able to scale easily.
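For the Redis route, a minimal sketch of a shared visited set (the key name is my own; SADD returns 1 only for a member it hasn't stored before, which gives an atomic check-and-mark across workers):

```python
import redis

# One Redis instance (or cluster) shared by every crawler worker
r = redis.Redis(host="localhost", port=6379)

VISITED_KEY = "crawler:visited"  # hypothetical key name


def mark_and_check(url):
    """Return True if the URL had not been visited before (and mark it as visited)."""
    # SADD is atomic, so two workers cannot both claim the same URL
    return r.sadd(VISITED_KEY, url) == 1
```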
You can use Apache Nutch for crawling; it can re-crawl URLs periodically and uses adaptive algorithms for this purpose.
For example, when a page at a specific URL hasn't changed by the second crawl, Nutch increases the interval until the next crawl; if it has changed, it decreases the interval.
You can write your own Nutch plugin for parsing the data that Nutch crawls, or use the predefined Nutch plugins.

Scrapy's System Performance Requirement?

I am planning to create a website for price comparison of services provided by several companies.
The main idea is that a visitor enters their search requirements on the website and starts a search, and the crawl results are displayed instantly on the website rather than being written to a file.
I am new to Python and Scrapy; I'm not sure whether Scrapy can do this.
I am going to use it daily, possibly many times a day, crawling 30+ websites. I am afraid the searches may overload the server. Can shared web hosting support this kind of crawling? Is there any system performance requirement?

Scrapy Django Limit links crawled

I just got Scrapy set up and running and it works great, but I have two (noob) questions. I should say first that I am totally new to Scrapy and to spidering sites.
Can you limit the number of links crawled? I have a site that doesn't use pagination and just lists a lot of links (which I crawl) on its home page. I feel bad crawling all of those links when I really only need to crawl the first 10 or so.
How do you run multiple spiders at once? Right now I am using the command scrapy crawl example.com, but I also have spiders for example2.com and example3.com. I would like to run all of my spiders using one command. Is this possible?
For #1: don't use the rules attribute to extract and follow links; write your own logic in the parse method and yield or return Request objects from there (see the sketch after this answer).
For #2: try scrapyd.
Credit goes to Shane, here: https://groups.google.com/forum/?fromgroups#!topic/scrapy-users/EyG_jcyLYmU
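To illustrate #1, a minimal sketch that follows only the first 10 links from the home page (the selector and item fields are placeholders):

```python
import scrapy

class LimitedSpider(scrapy.Spider):
    # Hypothetical spider: takes just the first 10 links instead of following everything
    name = "limited"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Slice the link list rather than relying on link-extraction rules
        for href in response.css("a::attr(href)").getall()[:10]:
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```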
Using the CloseSpider extension should allow you to specify limits of this sort.
http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider
I haven't tried it yet since I didn't need it. It looks like you might also have to enable it as an extension (see the top of the same page) in your settings file.
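For example, the limits are plain settings in settings.py (the numbers here are just an illustration):

```python
# settings.py -- CloseSpider limits
CLOSESPIDER_PAGECOUNT = 10   # stop after this many responses have been downloaded
CLOSESPIDER_ITEMCOUNT = 10   # ...or after this many items have been scraped
```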
