I've created a script in Scrapy to parse the author names of different posts from the site's landing page and then pass each name to the parse_page method using the meta keyword, so that the post content and the author name are printed together.
I've used download_slot within the meta keyword, which allegedly makes the script run faster. Although it is not necessary for the logic I'm applying here, I would like to keep it just to understand how download_slot works within a script and why. I searched a lot to learn more about download_slot but only ended up with a few links like this one.
An example usage of download_slot (I'm not quite sure about it though):
from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy

class ConventionSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for link in response.css('.summary'):
            name = link.css('.user-details a::text').extract_first()
            url = link.css('.question-hyperlink::attr(href)').extract_first()
            nurl = response.urljoin(url)
            yield Request(nurl, callback=self.parse_page, meta={'item': name, "download_slot": name})

    def parse_page(self, response):
        elem = response.meta.get("item")
        post = ' '.join([item for item in response.css("#question .post-text p::text").extract()])
        yield {'Name': elem, 'Main_Content': post}

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    process.crawl(ConventionSpider)
    process.start()
The above script runs flawlessly.
My question: how does download_slot work within Scrapy?
Let's start with the Scrapy architecture. When you create a scrapy.Request, the Scrapy engine passes the request to the downloader to fetch the content. The downloader puts incoming requests into slots, which you can imagine as independent queues of requests. The queues are then polled and each individual request gets processed (the content gets downloaded).
Now, here's the crucial part. To determine which slot to put an incoming request into, the downloader checks request.meta for the download_slot key. If it's present, it puts the request into the slot with that name (and creates the slot if it doesn't yet exist). If the download_slot key is not present, it puts the request into the slot for the domain (more accurately, the hostname) the request's URL points to.
This explains why your script runs faster. You create multiple downloader slots because they are keyed on the author's name. If you did not, all the requests would be put into the same slot, keyed on the domain (which is always stackoverflow.com). Thus, you effectively increase the parallelism of downloading content.
This explanation is a little bit simplified but it should give you a picture of what's going on. You can check the code yourself.
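To make the selection rule concrete, here is a simplified, illustrative sketch of how a slot key could be chosen; it mirrors the behaviour described above and is not Scrapy's actual source (the real logic lives inside the downloader):

from urllib.parse import urlparse

# Simplified sketch of how the downloader chooses a slot key for a request;
# this mirrors the behaviour described above, not Scrapy's exact code.
def get_slot_key(request):
    # An explicit download_slot in meta wins...
    if 'download_slot' in request.meta:
        return request.meta['download_slot']
    # ...otherwise the hostname of the request URL is used.
    return urlparse(request.url).hostname or ''

def enqueue_request(request, slots):
    # Each distinct key gets its own queue (slot), created lazily.
    key = get_slot_key(request)
    slots.setdefault(key, []).append(request)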
For example, suppose there is a target website which allows processing only 1 request per 20 seconds, and we need to parse/process 3000 pages of product data from it.
With a common spider and DOWNLOAD_DELAY set to 20, the application will finish its work in ~17 hours (3000 pages * 20 seconds download delay).
If you aim to increase the scraping speed without getting banned by the website and you have, for example, 20 valid proxies, you can uniformly allocate the request URLs across all your proxies using the proxy and download_slot meta keys and significantly reduce the application completion time (since DOWNLOAD_DELAY is enforced per slot, 20 slots bring the run down to roughly 3000 / 20 * 20 seconds, i.e. about 50 minutes).
from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxy'
    start_urls = ['https://example.com/products/1', 'https://example.com/products/2', '....']  # list with 3000 product urls
    proxies = [',,,']  # list with 20 proxies

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            chosen_proxy = self.proxies[index % len(self.proxies)]
            yield Request(url, callback=self.parse,
                          meta={"proxy": chosen_proxy, "download_slot": chosen_proxy})

    def parse(self, response):
        ....
        yield item
        # yield Request(details_url,
        #               callback=self.parse_additional_details,
        #               meta={"download_slot": response.request.meta["download_slot"],
        #                     "proxy": response.request.meta["proxy"]})

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 'DOWNLOAD_DELAY': 20, "COOKIES_ENABLED": False
    })
    process.crawl(ProxySpider)
    process.start()
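If you want to double-check that the requests really are being spread across separate slots, one option is to log the slot table from a callback. Note that this peeks at Scrapy internals (crawler.engine.downloader.slots), so treat it as a debugging aid rather than a stable API:

def parse(self, response):
    # Debugging aid only: the downloader's slot table is internal API
    # and may change between Scrapy versions.
    slots = self.crawler.engine.downloader.slots
    self.logger.info("Active download slots: %s", list(slots.keys()))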
I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I'm logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" no longer opens the website - it only does if I put it after parse or parse2). My understanding is that I chain multiple "parse" functions one after another to navigate through the website.
Datacamp says I always need to start with a "start request function", which is what I tried, but in the YouTube videos, etc., I saw that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all the links in the header of the website and then open each of those links to see the main page it refers to.
To achieve this, first identify what you need to scrape, write CSS or XPath selectors for those links, and put them in a rule. A rule without a callback simply follows the extracted links; otherwise you can assign it a callback method of your own. I am attaching a dummy example of creating rules, and you can map it to your case (a fuller sketch follows the snippet):
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
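For context, here is a minimal CrawlSpider sketch built around such rules; the example.com domain and the CSS selectors are placeholders to be replaced with your own site's:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HeaderCrawlSpider(CrawlSpider):
    name = 'header_crawl'
    allowed_domains = ['example.com']          # placeholder domain
    start_urls = ['https://example.com/']      # placeholder start page

    rules = (
        # Follow every link found in the site header (no callback: just follow).
        Rule(LinkExtractor(restrict_css=['header nav'])),
        # Parse the pages those links lead to.
        Rule(LinkExtractor(restrict_css=['div.main-content']), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}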
I'm scraping news.crunchbase.com with Scrapy. The callback for the recursively followed links doesn't fire when I follow the actual link encountered, but it works fine if I crawl some test link instead. I assume the problem is timing, hence I want to delay the recursive request.
EDIT: the answer from here does set a global delay, but it doesn't adjust the recursive delay. The recursive link crawl happens instantly - as soon as the data has been scraped.
def parse(self, response):
    time.sleep(5)
    for post in response.css('div.herald-posts'):
        article_url = post.css('div.herald-post-thumbnail a::attr(href)').get()
        if article_url is not None:
            print('\nGot article...', article_url, '\n')
            yield response.follow(article_url, headers=self.custom_headers, callback=self.parse_article)
        yield {
            'title': post.css('div.herald-post-thumbnail a::attr(title)').get(),
        }
This was actually enough.
custom_settings = {
    "DOWNLOAD_DELAY": 5,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1
}
Child requests are put into the request queue and processed after the parent requests. If a request is not related to the current domain, the DOWNLOAD_DELAY is not applied and the request is made instantly.
P.S. I just didn't wait until start_requests(self) had processed the entire list of URLs, and hence thought I was banned. Hope this helps somebody.
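For completeness, here is a minimal sketch of where such custom_settings live in a spider; the spider name is an illustrative placeholder, while the selectors are borrowed from the snippet above:

import scrapy

class HeraldSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, for illustration only.
    name = 'herald'
    start_urls = ['https://news.crunchbase.com/']

    # Per-spider settings override the project-wide settings.py values.
    custom_settings = {
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    def parse(self, response):
        for href in response.css('div.herald-post-thumbnail a::attr(href)').getall():
            # Scheduled requests to the same domain now respect the 5 s delay.
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {'title': response.css('title::text').get()}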
I am making a spider which will crawl the entire site on the first run and store the data in my database.
But I will keep running this spider on a weekly basis to get updates of the crawled site into my database, and I don't want Scrapy to crawl the pages which are already present in my database. How can I achieve this? I have made two plans -
1] Make a crawler that fetches the entire site and somehow stores the first fetched URL in a csv file, then keeps following the next pages. Then make another crawler which starts fetching backwards, i.e. it takes its input from the URL in the csv and keeps running until prev_page exists; this way I will get the data, but the URL in the csv will be crawled twice.
2] Make a crawler that checks a condition: if the data is already in the database, stop. Is that possible? This would be the most productive way, but I can't find a way to do it. Maybe log files might help in some way?
Update
The site is a blog which updates frequently and is sorted with the latest post on top.
Something like this:
from scrapy import Spider
from scrapy.http import Request, FormRequest

class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class = "post-inner post-hover"]/h2/a/@href').extract()
        for url in urls:
            if never_visited(url, database):
                yield Request(url, callback=self.parse_foo)

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        save_url(response.request.url, database)
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()
        yield {
            'name': name,
            'info': info
        }
You just need to implement the never_visited and save_url functions:
never_visited checks in your database whether the url is already there, and save_url adds the url to your database. A possible sketch of both is shown below.
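A minimal sketch of those two helpers, assuming a local SQLite database; the table name and schema are invented for illustration, so swap in your own database layer:

import sqlite3

# Assumed schema: a single table holding the URLs that have already been crawled.
database = sqlite3.connect('crawled.db')
database.execute('CREATE TABLE IF NOT EXISTS visited_urls (url TEXT PRIMARY KEY)')

def never_visited(url, db):
    # True if the url has not been stored yet.
    row = db.execute('SELECT 1 FROM visited_urls WHERE url = ?', (url,)).fetchone()
    return row is None

def save_url(url, db):
    # INSERT OR IGNORE keeps re-runs idempotent.
    db.execute('INSERT OR IGNORE INTO visited_urls (url) VALUES (?)', (url,))
    db.commit()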
OS: Ubuntu 16.04
Stack - Scrapy 1.0.3 + Selenium
I'm pretty new to Scrapy and this might sound very basic, but in my spider only __init__ gets executed. Any code/function after that is not called and the spider just halts.
import time

import scrapy
from selenium import webdriver

class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self, *args, **kwargs):
        print "----------------inside init----------------"
        self.browser = webdriver.Firefox()
        self.browser.get("http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum")
        print "----------------Going to sleep------------------"
        time.sleep(5)
        # self.parse()

    def __exit__(self):
        print "------------Exiting----------"
        self.browser.quit()

    def parse(self, response):
        print "----------------Inside Parse------------------"
        print "------------Exiting----------"
        self.browser.quit()
The spider gets the browser object, prints "Going to sleep" and just halts. It doesn't go inside the parse function.
Following are the contents of the run logs:
----------------inside init----------------
----------------Going to sleep------------------
There are a few problems you need to address or be aware of:
You're not calling super() in the __init__ method, so none of the inherited class's initialization is going to happen. Scrapy won't do anything (like calling its parse() method), as all of that is set up in scrapy.Spider.
After fixing the above, your parse() method will be called by Scrapy, but it won't be operating on your Selenium-fetched webpage. It will have no knowledge of that page whatsoever and will re-fetch the URL (based on start_urls). It's very likely that these two sources will differ (often drastically).
You're going to be bypassing almost all of Scrapy's functionality by using Selenium the way you are. All of Selenium's get() calls will be executed outside of the Scrapy framework. Middleware won't be applied (cookies, throttling, filtering, etc.), nor will any of the expected/created objects (like request and response) be populated with the data you expect.
Before you fix all of that, you should consider a couple of better options/alternatives:
Create a downloader middleware that handles all Selenium-related functionality. Have it intercept request objects right before they hit the downloader, populate new response objects, and return them for processing by the spider.
This isn't optimal, as you're effectively creating your own downloader and short-circuiting Scrapy's. You'll have to re-implement the handling of any desired settings the downloader usually takes into account and make them work with Selenium.
Ditch Selenium and use the Splash HTTP API and the scrapy-splash middleware for handling JavaScript (a minimal settings sketch follows this list).
Ditch Scrapy altogether and just use Selenium and BeautifulSoup.
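If you go the scrapy-splash route, its README documents a settings setup along these lines; the SPLASH_URL below assumes a Splash instance running locally (for example via Docker on port 8050):

# settings.py additions for scrapy-splash (per the project's README);
# assumes a Splash instance reachable at the URL below.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Requests are then issued with scrapy_splash.SplashRequest instead of plain scrapy.Request, so that pages are rendered by Splash before reaching your callbacks.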
Scrapy is useful when you have to crawl a large number of pages. Selenium is normally useful for scraping when you need the DOM source after the JS has loaded. If that's your situation, there are two main ways to combine Selenium and Scrapy. One is to write a download handler, like the one you can find here.
The code goes as follows:
# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't response when switch window until page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py at the root of the "scraper" folder, then you could add this to your settings.py:
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}
Another way is to write a download middleware, as described here. The download middleware has the downside of preventing some key features from working out of the box, such as caching and retries.
In any case, starting the Selenium webdriver in the __init__ of the Scrapy spider is not the usual way to go (a minimal middleware sketch is shown below).
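As an illustration of the downloader-middleware approach, here is a minimal sketch assuming Selenium with headless Firefox; the class name and module path are made up for this example, and the scrapy-selenium package offers a maintained version of the same idea:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    """Fetch requests with a shared Selenium driver and return the rendered
    page to the spider as an HtmlResponse (sketch only)."""

    def __init__(self):
        options = webdriver.FirefoxOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Firefox(options=options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser when the spider finishes.
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here short-circuits the regular downloader.
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8', request=request)

    def spider_closed(self, spider):
        self.driver.quit()

It would then be enabled through DOWNLOADER_MIDDLEWARES, pointing at whatever module path you save the class under.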
Is there any Scrapy module available to build referrer chains while crawling URLs?
Let's say, for instance, I start my crawl from http://www.example.com, move to http://www.new-example.com, and then from http://www.new-example.com to http://very-new-example.com.
Can I create URL chains (in a csv or json file) like this:
http://www.example.com, http://www.new-example.com
http://www.example.com, http://www.new-example.com, http://very-new-example.com
and so on? If there's no module or implementation available at the moment, what other options can I try?
Yes, you can keep track of referrals by making a list that is accessible by all methods, for example as a spider attribute:
referral_url_list = []

def call_back1(self, response):
    self.referral_url_list.append(response.url)

def call_back2(self, response):
    self.referral_url_list.append(response.url)

def call_back3(self, response):
    self.referral_url_list.append(response.url)
After spider completion, which you can detect with spider signals, you can write the csv or json file in the signal handler (a sketch follows).
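A minimal sketch of that idea, assuming a flat list of visited URLs in crawl order is enough; the spider name, start URL, and output file name are placeholders:

import json

from scrapy import Spider, signals

class ReferralSpider(Spider):
    name = 'referral'                        # placeholder name
    start_urls = ['http://www.example.com']  # placeholder start URL

    # Shared across callbacks; records every URL in the order it was parsed.
    referral_url_list = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Dump the list once the crawl has finished.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        self.referral_url_list.append(response.url)
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

    def spider_closed(self, spider):
        with open('referral_chain.json', 'w') as f:
            json.dump(self.referral_url_list, f, indent=2)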