I'm working on a project that involves crawling multiple domains in their entirety. My scraper doesn't look for specific parts of the HTML; it simply crawls the whole domain and collects all the HTML.
For some domains I would only want to crawl one subdomain, but other than that everything about the crawl itself would be the same for each domain.
Everything else that differs between the domains would be handled once the crawl is finished, in a separate Python script.
My question is: do I need to write a unique Scrapy crawler for each domain, or can I use one and pass it parameters for allowed_domains/start_urls?
I'm using Scrapinghub, and I may need to run the crawl on all my domains at the same time.
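A single generic spider can take these values as arguments at run time; here is a minimal sketch of that idea (the spider name, argument names and example command are my own, not something from the question). As far as I know, Scrapinghub also lets you set spider arguments when scheduling a job, so the same spider can be scheduled once per domain, in parallel.

    import scrapy

    class DomainSpider(scrapy.Spider):
        # One generic spider, parameterised per domain, e.g.:
        #   scrapy crawl domain_spider -a domain=example.com -a start_url=http://example.com/
        # Passing a subdomain (e.g. blog.example.com) restricts the crawl to that subdomain.
        name = "domain_spider"

        def __init__(self, domain=None, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.allowed_domains = [domain]
            self.start_urls = [start_url]

        def parse(self, response):
            # Store the raw HTML; per-domain processing happens later in a separate script.
            yield {"url": response.url, "html": response.text}
            # Follow every link; the offsite middleware drops anything outside allowed_domains.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)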
I am running a scraper on a set of domains. The list of domains is provided externally, and a test run with Scrapy made clear that for some websites the http prefix is not correctly specified. Some websites throw a DNS error when you try to navigate to http://www.example.com instead of http://example.com.
I tried to address this with a for loop that generates, for each domain, the combinations with the most common prefixes (http://www., http://, https://, https://www.). Yet I found out that for some websites this results in them being scraped twice (with nothing but duplicate content), which is not only inefficient on my side, but also not in line with web etiquette.
I have two questions:
Is this how Scrapy should behave?
How can I prevent this from happening?
I have seen How check if website support http, htts and www prefix with scrapy, but it feels like a detour; this should be part of Scrapy itself.
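As far as I know, Scrapy's duplicate filter compares request fingerprints, which are based on the exact URL, so http://example.com/page and http://www.example.com/page count as different requests; generating every prefix combination can therefore crawl the same site twice. One way around this is to try a single candidate per domain and fall back to the next prefix only when the request fails (e.g. with a DNS error). A rough sketch of that idea (the spider, helper and domain names are mine):

    import scrapy

    PREFIXES = ["http://", "http://www.", "https://", "https://www."]

    class PrefixFallbackSpider(scrapy.Spider):
        name = "prefix_fallback"
        # Bare domains as provided externally, without any scheme/prefix.
        domains = ["example.com", "example.org"]

        def start_requests(self):
            for domain in self.domains:
                yield self.make_request(domain, prefix_index=0)

        def make_request(self, domain, prefix_index):
            url = PREFIXES[prefix_index] + domain
            return scrapy.Request(
                url,
                callback=self.parse,
                errback=self.handle_error,
                meta={"domain": domain, "prefix_index": prefix_index},
            )

        def handle_error(self, failure):
            # On DNS/connection errors, retry the same domain with the next prefix,
            # so each site is only ever crawled under one working URL.
            meta = failure.request.meta
            next_index = meta["prefix_index"] + 1
            if next_index < len(PREFIXES):
                yield self.make_request(meta["domain"], next_index)

        def parse(self, response):
            yield {"domain": response.meta["domain"], "resolved_url": response.url}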
I want to scrape multiple webpages using Scrapy. I already have a prototype, but I am not satisfied with the "maintenance" cost.
How can I structure my Scrapy project better, improving the following features:
Having an independent code base for each page (webpage) scraper
Using "extraction methods" across multiple pages while maintaining them only once, e.g. a page screenshot util or an image downloader
Testing a scraper before a full run (unit test -> is a value returned?); see the sketch after this post
If Portia is mature, maybe use it if the unit tests fail?
What I currently have:
I run my scrapers via cron (scrapy crawl tagesschau and scrapy crawl spiegel)
After the scrapers run, I run a second script that fetches a screenshot of every new entry (a bare Python script using MySQL)
This lacks a way of testing whether the scrapers still work, and it is hard to manage.
What can I do better?
Thank you,
-lony
P.S.: Scrapy was selected following #elias's advice.
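On the testing point above, one common approach is to feed a saved HTML fixture into the spider's parse method inside an ordinary unit test, so you can check that values still come back before doing a full run. A minimal sketch, assuming a hypothetical TagesschauSpider whose parse yields dict-like items with a title field:

    import unittest
    from scrapy.http import HtmlResponse, Request

    # Hypothetical import path; replace with your actual spider class.
    from myproject.spiders.tagesschau import TagesschauSpider

    def fake_response(file_path, url="http://www.tagesschau.de/"):
        # Build an HtmlResponse from a saved HTML file, without any network access.
        with open(file_path, "rb") as f:
            body = f.read()
        return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")

    class TagesschauSpiderTest(unittest.TestCase):
        def test_parse_returns_items(self):
            spider = TagesschauSpider()
            results = list(spider.parse(fake_response("tests/fixtures/tagesschau.html")))
            # The crucial check: the scraper still extracts something.
            self.assertTrue(results)
            self.assertIn("title", results[0])

    if __name__ == "__main__":
        unittest.main()

Scrapy's built-in spider contracts are a lighter-weight alternative for this kind of smoke test.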
I want to crawl an ASP.NET website, but the URLs are all the same. How can I crawl specific pages using Python?
Here is the website I want to crawl:
http://www.fveconstruction.ch/index.htm
(I am using BeautifulSoup, urllib and Python 3)
What information should I get to distinguish one page from another?
If the target website is just a single-page application, it can't be crawled directly. As a workaround, you can watch the requests (GET, POST etc.) that actually go out when you manually navigate through the website, and have the crawler issue those same requests. Or teach your crawler to execute JavaScript, at least whatever is on the target website.
It's the website that needs to change to be easily crawlable: it should provide a reasonable non-AJAX version of every page that needs to be indexed, or links to such pages, or use something like what pushState does in AngularJS.
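A rough sketch of the "replay the requests you see in the browser" idea, assuming the site behaves like a classic ASP.NET WebForms page that keeps one URL and switches content via postbacks (the __EVENTTARGET value below is a placeholder you would copy from the browser's network tab; I have not inspected this particular site):

    import urllib.request
    import urllib.parse
    from bs4 import BeautifulSoup

    URL = "http://www.fveconstruction.ch/index.htm"

    # 1. Fetch the page and collect the hidden ASP.NET state fields from the form.
    html = urllib.request.urlopen(URL).read()
    soup = BeautifulSoup(html, "html.parser")
    form_data = {
        field.get("name"): field.get("value", "")
        for field in soup.select("input[type=hidden]")
        if field.get("name")
    }

    # 2. Set the postback target observed in the browser's dev tools (placeholder value).
    form_data["__EVENTTARGET"] = "link_id_seen_in_dev_tools"
    form_data["__EVENTARGUMENT"] = ""

    # 3. Replay the POST the browser would have sent; the response is the "other" page.
    body = urllib.parse.urlencode(form_data).encode("utf-8")
    with urllib.request.urlopen(URL, data=body) as response:
        print(response.read()[:500])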
I am creating an aggregator, and I started with Scrapy as my initial tool set.
At first I only had a few spiders, but as the project grows it seems like I may end up with hundreds or even a thousand different spiders as I scrape more and more sites.
What is the best way to manage these spiders, given that some websites only need to be crawled once and others on a more regular basis?
Is Scrapy still a good tool when dealing with so many sites, or would you recommend some other technology?
You can check out the project scrapely, which is from the creators of Scrapy. But, as far as I know, it is not suitable for sites that rely on JavaScript (more precisely, it won't work if the parsed data is generated by JavaScript).
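For what it's worth, scrapely learns an extraction template from one annotated example page instead of per-site spider code, which is one way to keep hundreds of sites manageable. A small sketch based on its documented train/scrape API (URLs and field values are just examples):

    from scrapely import Scraper

    scraper = Scraper()

    # Train on one example page by listing values that appear on it.
    train_url = "http://example.com/articles/some-article"
    scraper.train(train_url, {"title": "Some article title", "author": "Jane Doe"})

    # The learned template then extracts the same fields from similar pages.
    print(scraper.scrape("http://example.com/articles/another-article"))

    # Templates can be saved and reloaded, so one generic runner can serve many sites.
    with open("example_com.json", "w") as f:
        scraper.tofile(f)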
I just got Scrapy set up and running, and it works great, but I have two (noob) questions. I should say first that I am totally new to Scrapy and spidering sites.
Can you limit the number of links crawled? I have a site that doesn't use pagination and just lists a lot of links (which I crawl) on its home page. I feel bad crawling all of those links when I really just need to crawl the first 10 or so.
How do you run multiple spiders at once? Right now I am using the command scrapy crawl example.com, but I also have spiders for example2.com and example3.com. I would like to run all of my spiders using one command. Is this possible?
For #1: Don't use the rules attribute to extract and follow links; write your own logic in the parse function and yield or return Request objects there.
For #2: Try scrapyd.
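A minimal sketch of the first suggestion, following only the first 10 links found on the home page (the spider name, start URL and selectors are placeholders):

    import scrapy

    class LimitedSpider(scrapy.Spider):
        name = "example.com"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Follow only the first 10 links listed on the home page.
            for href in response.css("a::attr(href)").getall()[:10]:
                yield response.follow(href, callback=self.parse_item)

        def parse_item(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}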
Credit goes to Shane, here https://groups.google.com/forum/?fromgroups#!topic/scrapy-users/EyG_jcyLYmU
Using the CloseSpider extension should allow you to specify limits of this sort.
http://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.contrib.closespider
I haven't tried it yet since I didn't need it. It looks like you might also have to enable it as an extension (see the top of the same page) in your settings file.
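For reference, the extension is driven by settings like these in settings.py (the numbers are arbitrary examples):

    # settings.py -- the CloseSpider extension stops the crawl once a limit is hit.
    CLOSESPIDER_PAGECOUNT = 10    # close the spider after this many responses
    CLOSESPIDER_ITEMCOUNT = 10    # ...or after this many scraped items
    CLOSESPIDER_TIMEOUT = 3600    # ...or after this many seconds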