Scraping many pages using scrapy

I am trying to scrape multiple web pages using Scrapy. The page URLs look like this:
http://www.example.com/id=some-number
On each subsequent page, the number at the end is reduced by 1.
So I am trying to build a spider that navigates to the other pages and scrapes them too. The code I have is given below:
import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range(starting_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # ... parsing data from the webpage ...
        pass
This runs into an infinite loop: when I print the page number, I get negative numbers. I think this happens because I am requesting a page from within my parse() function.
But then the example given here works okay. Where am I going wrong?

The first page requested is "http://www.example.com/id=1000" (starting_number).
Its response goes through parse(), and with for i in range (0, 500):
you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997...http://www.example.com/id=500
self.page_number is a spider attribute, so because you decrement its value on every iteration, you have self.page_number == 500 after the first parse().
So when Scrapy calls parse for the response of http://www.example.com/id=999, you generate requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497...http://www.example.com/id=0
You can guess what happens the third time: http://www.example.com/id=-1, http://www.example.com/id=-2...http://www.example.com/id=-500
For each response, you generate 500 requests.
You can stop the loop by testing self.page_number >= 0.
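The runaway can be reproduced without Scrapy. The following is a minimal sketch, assuming the original parse() decremented a shared self.page_number 500 times per call, as described above. CounterDemo, parse_unguarded and parse_guarded are hypothetical names used only for illustration, and the guard is one possible reading of the suggested test.

```python
class CounterDemo:
    """Stand-in for a spider whose parse() shares one counter across calls."""

    def __init__(self, start=1000, per_call=500):
        self.page_number = start
        self.per_call = per_call

    def parse_unguarded(self):
        # Every call decrements the shared counter per_call times
        ids = []
        for _ in range(self.per_call):
            self.page_number -= 1
            ids.append(self.page_number)
        return ids

    def parse_guarded(self):
        # Same loop, but stop once the counter has reached zero
        ids = []
        for _ in range(self.per_call):
            if self.page_number <= 0:
                break
            self.page_number -= 1
            ids.append(self.page_number)
        return ids


demo = CounterDemo()
for call in range(1, 4):
    ids = demo.parse_unguarded()
    # first and last page ID requested by this call
    print(call, ids[0], ids[-1])
```

The third unguarded call already produces negative IDs, while a third call to parse_guarded() on a fresh CounterDemo returns an empty list.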
Edit after the OP's question in the comments:
There is no need for multiple threads. Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting one page and then returning Request instances from the parse method).
Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.
See the start_requests documentation.
Something like this would work:
# URL, starting_number, number_of_pages and the imports are as in the question
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']

    def __init__(self, *args, **kwargs):
        # let scrapy.Spider run its own initialisation first
        super().__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # ... parsing data from the webpage ...
        pass
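To check which URLs such a loop enqueues, the range can be evaluated on its own. This is a small sketch that reuses the constants from the question; page_urls is a hypothetical helper. Note that, despite its name, number_of_pages acts as the exclusive lower bound of the IDs here.

```python
URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500


def page_urls(start, stop):
    """Return the URLs for IDs start, start-1, ..., stop+1 (stop is excluded)."""
    return [URL % i for i in range(start, stop, -1)]


urls = page_urls(starting_number, number_of_pages)
print(len(urls))   # number of requests that would be enqueued
print(urls[0])     # first URL requested
print(urls[-1])    # last URL requested
```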

Related

Python Scrapy: Return list of URLs scraped

I am using Scrapy to scrape all the links from a single domain. I follow all links on the domain, but I also save all links that point off the domain. The following scraper works correctly, but I can't access the scraper's member variables afterwards, since I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url and url not in self.on_domain_urls:
                print('On domain links found: {}'.format(
                    len(self.on_domain_urls)))
                self.on_domain_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
# Need access to off_domain_urls
How can I access off_domain_urls? I could probably move it to global scope, but that seems hacky. I could also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?

Did you check the item pipeline? I think you'll have to use that in this scenario and decide there what needs to be done with the variable.
See:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
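As a rough illustration of the pipeline idea, here is a minimal sketch, assuming the spider yields plain dict items with an "off_domain_url" key. Both that key and the CollectUrlsPipeline class are hypothetical names, not part of the original code. The pipeline is an ordinary class, so it can be exercised by hand without running a crawl:

```python
class CollectUrlsPipeline:
    """Item pipeline that keeps every off-domain URL it sees in memory."""

    # Class-level set so the result can be read after the crawl ends
    collected = set()

    def process_item(self, item, spider):
        # "off_domain_url" is a hypothetical item key chosen for this sketch
        url = item.get("off_domain_url")
        if url:
            self.collected.add(url)
        return item  # pass the item on to any later pipeline


# Stand-alone check without running a crawl: feed items by hand
pipeline = CollectUrlsPipeline()
for u in ["https://other.org/a", "https://other.org/b", "https://other.org/a"]:
    pipeline.process_item({"off_domain_url": u}, spider=None)

print(sorted(CollectUrlsPipeline.collected))
```

In a real run the pipeline would be enabled through the ITEM_PIPELINES setting, for example CrawlerProcess(settings={"ITEM_PIPELINES": {"__main__.CollectUrlsPipeline": 100}}) when everything lives in one script, and CollectUrlsPipeline.collected could be read after process.start() returns.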

Scrapy: How To Start Scraping Data From a Search Result that uses Javascript

I am new to Scrapy and Python.
I want to scrape data from a search result. When you load the page, the default content appears, but what I need to scrape is the filtered content, while also following the pagination.
Here's the URL:
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the items from the Time Filter: "Today" result.
I tried different approaches, but none of them works.
What I have so far is mostly just the layout of the spider:
class TmcnfSpider(scrapy.Spider):
    name = 'tmcnf'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']

    def start_requests(self):
        # Show form from a filtered search result
        pass

    def parse(self, response):
        # some code scraping item
        # Yield url for pagination
        pass

To get the posts for the "Today" filter, you need to send a POST request to https://teslamotorsclub.com/tmc/post-ratings/6/posts along with a payload. The following should fetch the results you are interested in.
import scrapy

class TmcnfSpider(scrapy.Spider):
    name = "teslamotorsclub"
    start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]

    def parse(self, response):
        # form fields sent along with the POST request
        payload = {'time_chooser': '4', '_xfToken': ''}
        yield scrapy.FormRequest(response.url, formdata=payload, callback=self.parse_results)

    def parse_results(self, response):
        for items in response.css("h3.title > a::text").getall():
            yield {"title": items.strip()}

Stop Scrapy spider when it meets a specified URL

This question is very similar to "Force my scrapy spider to stop crawling" and some others asked several years ago. However, the solutions suggested there are either outdated for Scrapy 1.1.1 or not precisely relevant. The task is to close the spider when it reaches a certain URL. You definitely need this when crawling a news website for your media project, for instance.
Among the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT, the item count and page count options come close but are not enough, since you never know the number of pages or items in advance.
Raising the CloseSpider(reason='some reason') exception seems to do the job, but so far it does it in a rather odd way. I follow the “Learning Scrapy” textbook, and the structure of my code looks like the one in the book.
In items.py I define the item fields:
class MyProjectItem(scrapy.Item):
    Headline = scrapy.Field()
    URL = scrapy.Field()
    PublishDate = scrapy.Field()
    Author = scrapy.Field()
    pass
In myspider.py I use start_requests() to generate the pages to process, parse each index page in parse(), and specify the XPath for each item field in parse_item():
import scrapy
from scrapy.exceptions import CloseSpider  # used in the snippets further down
from scrapy.loader import ItemLoader

class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']

    def start_requests(self):
        for i in range(1, 3000):
            yield scrapy.Request('http://domain.name.com/news/index.page' + str(i) + '.html', self.parse)

    def parse(self, response):
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            # The urls are absolute in this case. There’s no need to use urllib.parse.urljoin()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
If the raise CloseSpider(reason='some reason') statement is placed in parse_item(), the spider still scrapes a number of items before it finally stops:
if l.get_output_value('URL') == 'http://domain.name.com/news/1234567.html':
    raise CloseSpider('No more news items.')
If it’s placed in the parse() method to stop when the specific URL is reached, the spider stops after grabbing only the first item from the index page that contains that specific URL:
def parse(self, response):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'
    urls = response.xpath('XPath for the URLs on index page').extract()
    if most_recent_url_in_db not in urls:
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)
    else:
        for url in urls[:urls.index(most_recent_url_in_db)]:
            yield scrapy.Request(url, callback=self.parse_item)
        raise CloseSpider('No more news items.')
For example, if you have 5 index pages (each of them has 25 item URLs) and most_recent_url_in_db is on page 4, it means that you’ll have all items from pages 1-3 and only the first item from page 4. Then the spider stops. If most_recent_url_in_db is number 10 in the list, items 2-9 from index page 4 won’t appear in your database.
The “hacky” tricks with crawler.engine.close_spider() suggested in many cases, or the ones shared in "How do I stop all spiders and the engine immediately after a condition in a pipeline is met?", don’t seem to work.
What is the proper way to complete this task?

I'd recommend changing your approach. Scrapy crawls many requests concurrently, in no fixed linear order. That's why closing the spider when you find what you're looking for won't do: a request that comes after it could already have been processed.
To tackle this, you could make Scrapy crawl sequentially, meaning one request at a time in a fixed order. This can be achieved in different ways; here's an example of how I would go about it.
First of all, you should crawl a single page at a time. This could be done like this:
class MyProjectSpider(scrapy.Spider):
    pagination_url = 'http://domain.name.com/news/index.page{}.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1},
        )

    def parse(self, response):
        # code handling item links
        ...
        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Once that's implemented, you could do something similar with the links on each page. However, since you can filter them without downloading their content, you could do something like this:
class MyProjectSpider(scrapy.Spider):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                # everything from here on is already in the database
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)
        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Putting it all together, you'll have:
import scrapy
from scrapy.loader import ItemLoader

class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']
    pagination_url = 'http://domain.name.com/news/index.page{}.html'
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def start_requests(self):
        # start with the first index page only
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1}
        )

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                # everything from here on is already in the database
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)
        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        # request the next index page only while the known URL has not shown up
        if next_page_number <= 3000 and not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
Hope that gives you an idea on how to accomplish what you're looking for, good luck!
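The URL filtering step inside parse() can also be tried in isolation, without Scrapy. This is a minimal sketch with a hypothetical helper, split_new_urls, and made-up item URLs. It returns the URLs listed before the known one, plus a flag telling the caller whether to stop paginating:

```python
def split_new_urls(urls, most_recent_url):
    """Return (urls_to_request, found) for one index page.

    urls_to_request holds every URL listed before most_recent_url;
    found is True when most_recent_url appears on the page.
    """
    to_request = []
    for url in urls:
        if url == most_recent_url:
            return to_request, True
        to_request.append(url)
    return to_request, False


# Hypothetical index page with 25 item URLs; the known URL is the 10th entry
page = ['http://domain.name.com/news/{}.html'.format(n) for n in range(1, 26)]
new_urls, found = split_new_urls(page, page[9])
print(len(new_urls), found)
```

With the known URL in tenth position, the nine URLs before it are returned for crawling and the flag signals that no further index page is needed.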

When you raise the CloseSpider exception, the natural assumption is that Scrapy stops immediately, abandoning all other activity (any future page requests, any processing in the pipeline, etc.).
But this is not the case. When you raise the CloseSpider exception, Scrapy tries to close its current operations gracefully: it stops the current request, but it waits for any other requests pending in any of the queues (there are multiple queues!).
(That is, if you are not overriding the default settings and have more than 16 start URLs, Scrapy makes 16 requests at a time.)
Now, if you want to stop the spider as soon as you raise the CloseSpider exception, you will want to clear three queues:
-- At spider middleware level --
spider.crawler.engine.slot.scheduler.mqs -> memory queue of future requests
spider.crawler.engine.slot.inprogress -> any in-progress requests
-- At downloader middleware level --
spider.requests_queue -> pending requests in the request queue
Flush all of these queues by overriding the appropriate middleware to prevent Scrapy from visiting any further pages.

In Scrapy, how to loop over several start_urls which are themselves scraped

I'm trying to collect data on houses for sale in Amsterdam on http://www.funda.nl/koop/amsterdam/. The main page shows only a limited number of houses, and at the bottom there is a pager which looks like this:
("Volgende" means "Next" in Dutch). From this I infer that there are 255 pages in total. Each of these pages has the URL http://www.funda.nl/koop/amsterdam/p2/, http://www.funda.nl/koop/amsterdam/p3/, and so on. To get data on all the houses, I would like to 'loop over' all the subpages p1, p2, ..., p255.
I'm trying to see how I could 'set this up'. So far I've written the following code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem
# from scrapy.shell import inspect_response

class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    # Link to the page of an individual house, such as http://www.funda.nl/koop/amsterdam/huis-49805292-nieuwendammerdijk-21/
    le1 = LinkExtractor(allow=r'%s+huis-\d{8}' % start_urls[0])
    # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/
    le2 = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])

    rules = (
        Rule(le1, callback='parse_item'),
        Rule(le2, callback='get_max_page_number')
    )

    def parse_item(self, response):
        links = self.le1.extract_links(response)
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                item = FundaItem()
                item['url'] = link.url
                yield item

    def get_max_page_number(self, response):
        links = self.le2.extract_links(response)
        max_page_number = 0
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):
                page_number = int(link.url.split("/")[-2].strip('p'))
                if page_number > max_page_number:
                    max_page_number = page_number
        return max_page_number
The LinkExtractor le2 calls back get_max_page_number, which simply returns the number 255. I would then like to use this number to 'synthesize' different start_urls to be applied to LinkExtractor le1, which gets the links to the individual houses on each page.
The problem is that, as I understand it, Scrapy processes these requests asynchronously, so I can't ensure that it will first get the number 255 and then use that number to generate other requests. If this is so, I would need to use two spiders in sequence and call them from a script, and in the second spider the start_url would have to be passed in as a variable.
Any pointers on how to 'set this up'?

You are over-complicating the issue here: you don't need to know the max page.
Scrapy has a URL dupefilter, so you can use a LinkExtractor to extract all visible pages every time, and Scrapy will be smart enough not to visit pages it has already been to unless you force it.
So all you need here is two Rules with LinkExtractors: one that extracts all of the links to individual houses and has a callback of parse_item, and one that extracts all of the visible pager pages and either has no callback (in which case follow defaults to True) or sets follow=True explicitly. See the docs here.
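Why the maximum page number is not needed can be seen in a small simulation. This is only a sketch of the idea, not Scrapy's actual dupefilter: it assumes a hypothetical pager that links to the two neighbouring pages on either side of the current one (visible_pages), and a crawl loop that follows each link once.

```python
from collections import deque

LAST_PAGE = 255  # total number of result pages, unknown to the crawler


def visible_pages(current, window=2):
    """Hypothetical pager: links to the neighbours of the current page."""
    first = max(1, current - window)
    last = min(LAST_PAGE, current + window)
    return [p for p in range(first, last + 1) if p != current]


def crawl(start=1):
    """Follow every pager link once, skipping pages already seen."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for linked in visible_pages(page):
            if linked not in seen:  # plays the role of the duplicate filter
                seen.add(linked)
                queue.append(linked)
    return seen


visited = crawl()
print(len(visited), min(visited), max(visited))
```

Starting from page 1 and only ever following the links that are visible, the loop ends up visiting every page exactly once, which is the behaviour the two-Rule setup relies on.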

Scrapy Spider authenticate and iterate

Below is a Scrapy spider I have put together to pull some elements from a web page. I borrowed this solution from another Stack Overflow answer. It works, but I need more: I need to be able to walk the series of pages specified in the for loop inside the start_requests method after authenticating.
Yes, I did locate the Scrapy documentation discussing this, along with a previous solution for something very similar. Neither one seems to make much sense to me. From what I can gather, I need to somehow create a request object and keep passing it along, but I cannot figure out how to do this.
Thank you in advance for your help.
import re

import scrapy

class MyBasicSpider(scrapy.Spider):
    name = "awBasic"
    allowed_domains = ["americanwhitewater.org"]

    def start_requests(self):
        '''
        Override Spider.start_requests to crawl all reaches in series
        '''
        # for every integer from one to 5000
        for i in range(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # request the page
            yield scrapy.Request('https://mycrawlsite.com/{0}/'.format(iStr), callback=self.parse)

    def parse(self, response):
        # get part of url string
        url = response.url
        id = re.findall(r'/(\d{4})/', url)[0]
        # selector 01
        attribute01 = response.xpath('//div[@id="block_1"]/text()').re('([^,]*)')[0]
        # selector for river section
        attribute02 = response.xpath('//div[@id="block_1"]/div[1]/text()').extract()[0]
        # print results
        print('\tID: {0}\n\tAttr01: {1}\n\tAttr02: {2}'.format(id, attribute01, attribute02))

You may have to approach the problem from a different angle:
First of all, scrape the main page. It contains a login form, so you can use FormRequest to simulate a user login. Your parse method will likely look something like this:
def parse(self, response):
    return [FormRequest.from_response(response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login)]
In after_login you check whether the authentication was successful, usually by scanning the response for error messages. If all went well and you're logged in, you can start generating requests for the pages you're after:
def after_login(self, response):
    # response.text is the decoded body, so it can be compared with a str
    if "Login failed" in response.text:
        self.logger.error("Login failed")
    else:
        for i in range(1, 50):  # 1 to 50 for testing
            # convert to string
            iStr = str(i)
            # add leading zeros to get to four digit length
            while len(iStr) < 4:
                iStr = '0{0}'.format(iStr)
            # call make requests
            yield Request(url='https://mycrawlsite.com/{0}/'.format(iStr),
                          callback=self.scrape_page)
scrape_page will be called with each of the pages you created a request for; there you can finally extract the information you need using XPath, regex, etc.
BTW, you shouldn't 0-pad numbers manually; format will do it for you if you use the right format specifier.
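For instance, the zero-padding loop can be replaced by a width specifier in the format string. This is a short sketch; page_url is a hypothetical helper name, and the URL pattern is the one from the question.

```python
def page_url(i):
    # "{:04d}" pads the integer with leading zeros to a width of four digits
    return 'https://mycrawlsite.com/{:04d}/'.format(i)


for i in (1, 49, 5000):
    print(page_url(i))
```

Numbers that already have four or more digits are left unchanged, so the same call covers the whole 1 to 5000 range.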
