I have a problem running Scrapy. It seems like Scrapy is skipping the last pages. For example, I've set 20 pages to scrape, but Scrapy misses the last 10 or 7 of them. It has no problem when I set a single page, e.g. "for page in range(6,7)". The terminal shows that it is scraping all pages from 1 to 100, but the output in my database ends at a random page. Any ideas why that is happening?
Maybe there is a way to run Scrapy synchronously, i.e. scrape every item on the first page -> second page -> third page and so on.
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from websitescrapper.items import WebsitescrapperItem  # adjust to your project's items module

class SomeSpider(scrapy.Spider):
    name = 'default'
    urls = [f'https://www.somewebsite.com/pl/c/cat?page={page}' for page in range(1, 101)]

    service = Service(ChromeDriverManager().install())
    options = Options()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument("--headless")
    options.add_argument("--allow-running-insecure-content")
    options.add_argument("--enable-crash-reporter")
    options.add_argument("--disable-popup-blocking")
    options.add_argument("--disable-default-apps")
    options.add_argument("--incognito")
    driver = webdriver.Chrome(service=service, options=options)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response):
        for videos in response.css('div.card-img'):
            item = WebsitescrapperItem()
            link = f'https://www.somewebsite.com{videos.css("a.item-link").attrib["href"]}'
            SomeSpider.driver.get(link)
            domain_name = SomeSpider.driver.current_url
            SomeSpider.driver.back()

            item['name'] = videos.css('span.title::text').get().strip()
            item['duration'] = videos.css('span.duration::text').get().strip()
            item['image'] = videos.css('img.thumb::attr(src)').get()
            item['url'] = domain_name
            item['hd'] = videos.css('span.hd-icon').get()
            yield item
Try running the code using this format of calling:
def parse(self, response):
    # do some stuff
    for page in range(self.total_pages):
        yield scrapy.Request(f'https://example.com/search?page={page}',
                             callback=self.parse)
Also, if you yield multiple requests from start_requests, or have multiple URLs in start_urls, those will be handled asynchronously according to your concurrency settings (Scrapy’s default is 8 concurrent requests per domain, 16 total). Make sure you adjust these in settings.py accordingly.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
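The same limits can also be applied per spider through custom_settings instead of project-wide in settings.py; a minimal sketch (the values below are only an example, not a recommendation):
import scrapy

class SomeSpider(scrapy.Spider):
    name = 'default'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,  # effectively serializes the downloads
        'DOWNLOAD_DELAY': 0.5,     # optional pause between requests
    }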
If you want to run it synchronously, you would do it like so:
def parse(self, response):
    url = 'https://www.somewebsite.com/pl/c/cat?page={}'
    # do some stuff
    self.current_page += 1
    yield scrapy.Request(url.format(self.current_page), callback=self.parse)
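A more complete, self-contained sketch of that sequential pattern might look like the following. The URL template and page count are placeholders based on the question, and cb_kwargs requires Scrapy 1.7 or newer:
import scrapy

class SequentialSpider(scrapy.Spider):
    name = 'sequential'
    total_pages = 100
    page_url = 'https://www.somewebsite.com/pl/c/cat?page={}'

    def start_requests(self):
        # Only page 1 is requested up front; every later page is yielded
        # from parse(), so pages are fetched strictly one after another.
        yield scrapy.Request(self.page_url.format(1),
                             callback=self.parse, cb_kwargs={'page': 1})

    def parse(self, response, page):
        # ... extract and yield the items for this page here ...
        if page < self.total_pages:
            next_page = page + 1
            yield scrapy.Request(self.page_url.format(next_page),
                                 callback=self.parse, cb_kwargs={'page': next_page})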
Related
TL;DR
In Scrapy, I want each Request to wait until all spider parse callbacks have finished, so the whole process needs to be sequential. Like this:
Request1 -> Crawl1 -> Request2 -> Crawl2 ...
But what is happening now:
Request1 -> Request2 -> Request3 ...
Crawl1
Crawl2
Crawl3 ...
Long version
I am new to scrapy + selenium web scraping.
I am trying to scrape a website whose contents are heavily updated with JavaScript. First I open the website with Selenium and log in. After that, I use a downloader middleware that handles the requests with Selenium and returns the responses. Below is the middleware's process_request implementation:
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class XYZDownloaderMiddleware:
    '''Other functions are as is. I just changed this one.'''

    def process_request(self, request, spider):
        driver = request.meta['driver']
        if request.meta['load_url']:
            # We are opening a new link.
            driver.get(request.url)
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath']))
            )
        elif request.meta['click_bet']:
            # We are clicking on an element to get new data using JavaScript.
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath']))
            )
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)
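The middleware is enabled in settings.py in the usual way; the module path and priority below are placeholders, not taken from the project:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.XYZDownloaderMiddleware': 543,
}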
In settings I have also set CONCURRENT_REQUESTS = 1 so that multiple driver.get() calls are not made at once and Selenium can peacefully load responses one by one.
Now what I see happening is: Selenium opens each URL, Scrapy lets Selenium wait for the response to finish loading, and then the middleware returns the response properly (it goes into the if request.meta['load_url'] block).
But after I get the response, I want to use the Selenium driver (in the parse(response) functions) to click on each of the elements by yielding a Request and have the middleware return the updated HTML (the elif request.meta['click_bet'] block).
The Spider is minimally like this:
import scrapy

class XYZSpider(scrapy.Spider):

    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b'
        ]
        self.driver = self.getSeleniumDriver()
        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()
        for url in urls:
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...
So what is happening is: Scrapy is not waiting for the spider to parse. Scrapy gets the response and then calls the parse callback and fetches the next response in parallel. But the Selenium driver needs to be used by the parse callback before the next request is processed.
I want each request to wait until the previous parse callback has finished.
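One way to approximate the Request1 -> Crawl1 -> Request2 ordering (a general sketch, not code from the question) is to combine CONCURRENT_REQUESTS = 1 with chaining: each follow-up request is yielded only from inside the previous callback, so the next download cannot start before the previous parse has finished:
import scrapy

class ChainedSpider(scrapy.Spider):
    name = 'chained'
    custom_settings = {'CONCURRENT_REQUESTS': 1}
    pages = ['https://www.example.com/a', 'https://www.example.com/b']

    def start_requests(self):
        # Request only the first URL; the rest are chained from parse().
        yield scrapy.Request(self.pages[0], callback=self.parse,
                             cb_kwargs={'index': 0})

    def parse(self, response, index):
        # ... do the selenium-backed parsing for this page here ...
        next_index = index + 1
        if next_index < len(self.pages):
            yield scrapy.Request(self.pages[next_index], callback=self.parse,
                                 cb_kwargs={'index': next_index})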
I am creating a web crawler using Scrapy and Selenium.
The code looks like this:
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class MySpider(scrapy.Spider):
    urls = []  # a very long list of URLs

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        item = Item()  # the project's item class
        item['field1'] = response.xpath('some xpath').extract()[0]
        yield item
        sub_item_url = response.xpath('some another xpath').extract()[0]

        # Sub-items are JavaScript-generated, so they need a web driver.
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(chrome_options=options)
        driver.set_window_size(1920, 1080)

        sub_item_generator = self.get_sub_item_generator(driver, sub_item_url)
        while True:
            try:
                yield next(sub_item_generator)
            except StopIteration:
                break
        driver.close()

    def get_sub_item_generator(self, driver, url):
        # Crawling using the web driver goes here, which takes a long time to finish
        yield sub_item
The problem is that the crawler runs for a while and then crashes because it runs out of memory. Scrapy keeps scheduling new URLs from the list, so there are too many web driver processes running.
Is there any way to control the Scrapy scheduler so that it does not schedule a new URL while some number of web driver processes are already running?
You could try setting CONCURRENT_REQUESTS to something lower than the default of 16 (as shown here):
class MySpider(scrapy.Spider):
    # urls = []  # a very long list of URLs
    custom_settings = {
        'CONCURRENT_REQUESTS': 5  # the default of 16 seemed like it was too much?
    }
Try using driver.quit() instead of driver.close()
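In the spider above, that would mean wrapping the generator loop so the driver is always shut down, even if parsing fails; a sketch based on the question's code:
try:
    while True:
        try:
            yield next(sub_item_generator)
        except StopIteration:
            break
finally:
    # quit() terminates the browser and the driver process;
    # close() only closes the current window and can leak processes.
    driver.quit()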
I have had the same problem despite using driver.close(), so I did this: kill all Firefox instances before the script starts.
from subprocess import call
call(["killall", "firefox"])
This question is very similar to Force my scrapy spider to stop crawling and some others asked several years ago. However, the solutions suggested there are either dated (Scrapy 1.1.1) or not precisely relevant. The task is to close the spider when it reaches a certain URL. You definitely need this when, for instance, crawling a news website for your media project.
Among the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT, the item count and page count options come close but are not enough, since you never know the number of pages or items in advance.
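For reference, those are plain settings (settings.py or a spider's custom_settings); the values below are only illustrative:
CLOSESPIDER_TIMEOUT = 3600     # stop after the spider has run this many seconds
CLOSESPIDER_ITEMCOUNT = 500    # stop after this many items have been scraped
CLOSESPIDER_PAGECOUNT = 100    # stop after this many responses have been downloaded
CLOSESPIDER_ERRORCOUNT = 10    # stop after this many errors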
The raise CloseSpider(reason='some reason') exception seems to do the job, but so far it does it in a somewhat odd way. I follow the “Learning Scrapy” textbook, and the structure of my code looks like the one in the book.
In items.py I define the item fields:
class MyProjectItem(scrapy.Item):
    Headline = scrapy.Field()
    URL = scrapy.Field()
    PublishDate = scrapy.Field()
    Author = scrapy.Field()
In myspider.py I use the start_requests() method to generate the pages to process, parse each index page in parse(), and specify the XPath for each item field in parse_item():
class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']

    def start_requests(self):
        for i in range(1, 3000):
            yield scrapy.Request('http://domain.name.com/news/index.page' + str(i) + '.html', self.parse)

    def parse(self, response):
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            # The URLs are absolute in this case, so there's no need to use urllib.parse.urljoin().
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
If the raise CloseSpider(reason='some reason') exception is placed in parse_item(), it still scrapes a number of items before it finally stops:
if l.get_output_value('URL') == 'http://domain.name.com/news/1234567.html':
    raise CloseSpider('No more news items.')
If it’s placed in the parse() method to stop when the specific URL is reached, it stops after grabbing only the first item from the index page which contains that specific URL:
def parse(self, response):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'
    urls = response.xpath('XPath for the URLs on index page').extract()
    if most_recent_url_in_db not in urls:
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)
    else:
        for url in urls[:urls.index(most_recent_url_in_db)]:
            yield scrapy.Request(url, callback=self.parse_item)
        raise CloseSpider('No more news items.')
For example, if you have 5 index pages (each of them has 25 item URLs) and most_recent_url_in_db is on page 4, it means that you’ll have all items from pages 1-3 and only the first item from page 4. Then the spider stops. If most_recent_url_in_db is number 10 in the list, items 2-9 from index page 4 won’t appear in your database.
The “hacky” tricks with crawler.engine.close_spider() suggested in many cases or the ones shared in How do I stop all spiders and the engine immediately after a condition in a pipeline is met? don’t seem to work.
What should be the method to properly complete this task?
I'd recommend changing your approach. Scrapy crawls many requests concurrently and without a linear order; that's why closing the spider as soon as you find what you're looking for won't do, since a request scheduled after that point may already have been processed.
To tackle this you could make Scrapy crawl sequentially, meaning one request at a time in a fixed order. This can be achieved in different ways; here's an example of how I would go about it.
First of all, you should crawl a single page at a time. This could be done like this:
class MyProjectSpider(scrapy.Spider):
    pagination_url = 'http://domain.name.com/news/index.page{}.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1},
        )

    def parse(self, response):
        # code handling item links
        ...
        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Once that's implemented, you could do something similar with the links on each page. However, since you can filter them without downloading their content, you could do something like this:
class MyProjectSpider(scrapy.Spider):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Putting it all together, you'll have:
class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']
    pagination_url = 'http://domain.name.com/news/index.page{}.html'
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1}
        )

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000 and not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
Hope that gives you an idea on how to accomplish what you're looking for, good luck!
When you raise the CloseSpider exception, the ideal assumption is that Scrapy should stop immediately, abandoning all other activity (any future page requests, any processing in pipelines, etc.).
But this is not the case. When you raise the CloseSpider exception, Scrapy will try to close its current operations gracefully, meaning it will stop the current request but wait for any other requests pending in any of the queues (there are multiple queues!).
(i.e. if you are not overriding the default settings and have more than 16 start URLs, Scrapy makes 16 requests at a time.)
Now if you want to stop the spider as soon as you raise the CloseSpider exception, you will want to clear three queues:
-- at the spider middleware level --
spider.crawler.engine.slot.scheduler.mqs -> the in-memory queue of future requests
spider.crawler.engine.slot.inprogress -> any in-progress requests
-- at the downloader middleware level --
spider.requests_queue -> pending requests in the request queue
Flush all these queues by overriding the proper middleware to prevent Scrapy from visiting any further pages.
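The attribute paths above are Scrapy internals rather than a stable API, so any flush like this is version-dependent. Purely as an illustration of where such a hook could live (assuming the attributes named above exist in your Scrapy version), a spider middleware might look roughly like this:
from scrapy.exceptions import CloseSpider

class FlushQueuesOnCloseMiddleware:
    """Hypothetical sketch: drop queued requests once CloseSpider is raised."""

    def process_spider_exception(self, response, exception, spider):
        if not isinstance(exception, CloseSpider):
            return None
        slot = spider.crawler.engine.slot
        # mqs is the in-memory scheduler queue named above; pop it until
        # empty so no further pages get scheduled (version-dependent!).
        while len(slot.scheduler.mqs):
            slot.scheduler.mqs.pop()
        return None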
I'm trying to use PhantomJS and Selenium (with Scrapy) to scrape a list of links inside a website. I'm new to PhantomJS and Selenium, so I'll ask here.
I think the website has a session timeout, because I can scrape only the first of those links. Then I get this error:
NoSuchWindowException: Message: {"errorMessage":"Currently Window handle/name is invalid (closed?)","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"460","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:33038","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\"url\":
That's part of my code:
class bllanguage(scrapy.Spider):
    handle_httpstatus_list = [302]
    name = "bllanguage"
    download_delay = 1
    allowed_domains = ["http://explore.com/"]

    f = open("link")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def __init__(self):
        self.driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')

    def start_requests(self):
        for u in self.start_urls:
            r = scrapy.Request(url=u, dont_filter=True, callback=self.parse)
            r.meta['dont_redirect'] = True
            yield r

    def parse(self, response):
        self.driver.get(response.url)
        # print response.url
        search_field = []
Etc.
The session timeout is just my interpretation; I've seen other messages like that, but none of them has a solution. What I would like to try is to "close" the request to each link inside the "link" file. I don't know if this is something PhantomJS does naturally or whether I have to add something: I've seen there is a resourceTimeout setting. Is it the right thing to use, and where can I put it in my code?
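Regarding resourceTimeout: with PhantomJS/GhostDriver it is usually passed as a page setting through the desired capabilities when the driver is created, not per request. A sketch under that assumption (the 10-second value is arbitrary):
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = dict(DesiredCapabilities.PHANTOMJS)
# phantomjs.page.settings.* keys map to PhantomJS page settings;
# resourceTimeout is given in milliseconds.
caps['phantomjs.page.settings.resourceTimeout'] = 10000

driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs',
                             desired_capabilities=caps)
In your spider this would go inside __init__, replacing the plain webdriver.PhantomJS(...) call.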
I am scraping data using Scrapy into an items.json file. The data is getting stored, but the problem is that only 25 entries are stored, while the website has more entries. I am using the following spider:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["justdial.com"]
    start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//section[@class="rslwrp"]/section')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
            items.append(item)
        return items
The command I'm using to run the spider is:
scrapy crawl dmoz -o items.json -t json
Is there any setting I am not aware of, or is the page not fully loaded before scraping? How do I resolve this?
Abhi, here is some code, but please note that it isn't complete and working; it is just to show you the idea. Usually you have to find the next-page URL and try to recreate the appropriate request in your spider. In your case AJAX is used. I used Firebug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s" # this isn't the complete next page URL
next_page = 2 # how to handle next_page counter is up to you
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
yield item
# build you pagination URL and send a request
url = self.URL % self.next_page
yield Request(url) # Request is Scrapy request object here
# increment next_page counter if required, make additional
# checks and actions etc
Hope this will help.
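To make the sketch above complete, the pagination also needs a stop condition. One simple option (an assumption on my part, not something the site guarantees) is to stop requesting further AJAX pages once a page comes back with no result blocks; imports, DmozItem and the URL/next_page attributes are as in the snippet above:
def parse(self, response):
    sites = response.xpath('//section[@class="rslwrp"]/section')
    if not sites:
        return  # empty AJAX page: assume there are no more results
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item
    # request the next AJAX page and bump the counter
    yield Request(self.URL % self.next_page, callback=self.parse)
    self.next_page += 1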