Scrapy best practice for crawling paging site - python

I am writing a crawler for HTML pages.
Each page has one link (an <a> tag) like this:
Next >>
The last page does not have this tag.
So how can I fetch every page?
At first I thought of something like this, but somehow the last self.start_request is not called.
page = 0

def start_requests(self, page=0):
    urls = ['https://www.exmaple.com/page={0}'.format(page)]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    # check if there is an <a> tag ??
    xlink = LinkExtractor()
    for link in xlink.extract_links(response):
        yield scrapy.Request(url=link.url, callback=self.parse_each)
    if there is a tag:
        page = page + 1
        self.start_request(page)
What is the best practice for this kind of crawl?

You are probably experiencing problems because you do not yield the result of start_requests. It is a generator function, so calling it only creates a generator object; none of its requests reach Scrapy unless parse yields them. Note also that the method is named start_requests, not start_request.
In your if statement, try this:
    yield from self.start_requests(page)
Also, I personally would not use start_requests like this, since Scrapy calls start_requests automatically when your spider starts. Instead, yield a request directly from parse, using the URL of the next page scraped from the current page. That would make your code clearer.

Related

How can I parse several differently structured websites with the same spider?

All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all links I need. So in the parse() method I yield a request for each of these links and in parse_item_page I extract the first part of the information I need - which worked completely fine.
My problem is: I thought I could just do the same thing again and call parse_entry for each link on my item page. But I tried many different versions of this and I just can't get it to work. The URLs are correct, but Scrapy just doesn't seem to want to call a third parse function; nothing in there ever gets executed.
How can I get scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
    for href in response.xpath("//listItem/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_item_page)

def parse_item_page(self, response):
    for sel in response.xpath("//div"):
        item = items.FirstItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        for href in response.xpath("//entry/@href"):
            yield response.follow(href.extract(), callback=self.parse_entry)
        yield item

def parse_entry(self, response):
    for sel in response.xpath("//textBlock"):
        item = items.SecondItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        yield item

When parsing many websites, how to make Scrapy stop yielding requests in the for loop if data is found and move to the next website

I am parsing emails from many websites.
1) I take them from the front page and from the contacts section ('kont' or 'cont' in hrefs). There could be many links with 'kont' or 'cont' on the front page, and I don't want to visit all of them in the "for" loop. I would like the program to move on to another website as soon as the data is found in one of those links (email_list_2 != []). How can I do that?
2) There is some redundancy in the code: I yield data at the front page because I am afraid the request from the for loop would be unsuccessful, in which case I would lose the data from the front page. Can I yield just one of the following, without double yielding?
If data is not found:
{'site': site,
 'email_list_1': email_list_1,
 'email_list_2': []}
If data is found:
{'site': site,
 'email_list_1': email_list_1,
 'email_list_2': ['xyz']}
Please help.
Regards,
class QuotesSpider(scrapy.Spider):
    name = 'enrichment'
    start_urls = website_list

    def parse(self, response):
        site = response.url
        data = response.text
        email_list_1 = emailRegex.findall(data)
        yield {'lvl': '1',
               'site': site,
               'email_list_1': email_list_1,
               'email_list_2': [],
               }
        soup = BeautifulSoup(data, 'lxml')
        for link in soup.find_all('a'):
            raw_url = link.get('href')
            full_url = str(site) + str(raw_url)
            if (re.search('cont', full_url) != None or
                    re.search('kont', full_url) != None):
                yield scrapy.Request(url=full_url,
                                     callback=self.parse_2d_level,
                                     meta={'site': site, 'email_list_1': email_list_1}
                                     )

    def parse_2d_level(self, response):
        site = response.meta['site']
        email_list_1 = response.meta['email_list_1']
        data_2 = response.text
        email_list_2 = emailRegex.findall(data_2)
        yield {'lvl': '2',
               'site': site,
               'email_list_1': email_list_1,
               'email_list_2': email_list_2,
               }
I'm not sure I fully understand your question, but here goes:
1 - You want to scrape PAGE1, look for 'cont' or 'kont', and if these components exist, make a new request for PAGE2. In PAGE2 you search for email_list_2 and yield the results. You asked:
"I would like the program to go to another website when the data is found in one of those links (email_list_2 != []). how to do that?"
Which website do you want it to go to? Is it a link on the page you are already scraping? Is it another website in your start_urls?
In its current state, after parsing PAGE2 (in the parse_2d_level method) your spider will yield results whether or not it found values for email_list_2. If there are other requests in the queue, Scrapy will go on to execute those; if there aren't, the spider will finish.
2 - You want to make sure the data you already found before the loop is yielded in case the request from inside the loop fails. Since you said
"the request from the for loop would be unsuccessful"
I'll assume you are only worried about the REQUEST failing, although there are other ways your parsing could fail.
For a failed request, you can catch and handle the issue by passing an errback to the Request. Exceptions raised inside a callback are reported through a Scrapy signal called spider_error; see the Scrapy documentation on signals.
3 - You should take a look at Scrapy's selectors; they are a very powerful tool. You don't need Beautiful Soup for the parsing, and selectors will help a lot with precision.

Scrapy XHR Pagination on TripAdvisor

Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks as such:
import scrapy
from tripadvisor.items import TripadvisorItem

class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page but the pagination doesn't flip to the next page of reviews. From what I suspect, this is because of this line:
    next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further I've read that these requests are sending a POST request using XHR. By exploring the "Network" tab in Firefox "Inspect" I can see both a Request URL and Form Data that might be needed to flip the page according to other posts on SO regarding the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL always changes based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX- where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether or not it is possible to paginate using the XHR request/form data or do I need to manually build a list of URLs for each hotel that adds the -orXX-?
Well I ended up discovering an xpath that apparently allowed pagination of the reviews, but it's funny because every time I checked the underlying HTML the href link never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even if I was on page 10 for example. It seems the "-orXX-" part of the link always increments the XX by 5 so I'm not sure why this works.
All I did was change the line:
    next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
    next_page = response.xpath('//link[@rel="next"]/@href')
and I now have more than 41K extracted reviews. I would love to get others' opinions on handling this problem in other situations.
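For the manual alternative raised in the question (building the -orXX- URLs by hand), a minimal sketch might look like the following. The review_page_urls helper and its total_reviews parameter are hypothetical; the review count would have to be scraped from the hotel page, and the step of 5 is taken from the URLs listed above.

```python
def review_page_urls(first_page_url, total_reviews, page_size=5):
    """Build the -orXX- URLs for every review page of one hotel."""
    urls = [first_page_url]
    for offset in range(page_size, total_reviews, page_size):
        # Insert "or<offset>-" right after the first "-Reviews-" segment.
        urls.append(first_page_url.replace("-Reviews-", f"-Reviews-or{offset}-", 1))
    return urls


first = (
    "https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-"
    "Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html"
)
for url in review_page_urls(first, total_reviews=18):
    print(url)
```

With 18 reviews this prints the four URLs shown in the question (no offset, or5, or10, or15). Following the rel="next" link is still simpler, because it does not depend on knowing the review count.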

Scrapy - no list page, but I know the url for each item page

I'm using Scrapy to scrape a website. The item page that I want to scrape looks like: http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page resembling this.
The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). This guy here seemed to have the same issue, but couldn't figure it out.
Does anyone have a solution?
You could list all the known URLs in your Spider class' start_urls attribute:
class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]

    def parse(self, response):
        # ...
If it's just a one time thing, you can create a local html file file:///c:/somefile.html with all the links. Start scraping that file and add somepage.com to allowed domains.
Alternately, in the parse function, you can return a new Request which is the next url to be scraped.

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.
But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).
And I'm not sure how to organize code to achieve that, since the two links (listings link, and one particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths = '//div[@class="pagination"]'), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow = False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item = item, selector = sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, which gives you the listings of items, and, very importantly, to add follow=True to that rule.
In parse_item() you have to yield a request instead of an item, but only after you load the item. The request goes to the item's detail URL, and you have to send the loaded item along to that request's callback. You do your job with the response there, and that is where you yield the item.
So the end of parse_item() will look like this:
    itemloaded = l.load_item()
    # fill url contents
    url = sel.select(item_url_xpath).extract()[0]
    request = Request(url, callback=self.parse_url_contents)
    request.meta['item'] = itemloaded
    yield request
And then parse_url_contents() will look like this:
    def parse_url_contents(self, response):
        item = response.request.meta['item']
        item['url_contents'] = response.body
        yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no-one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution will be to use urllib and BeautifulSoup to load the second page manually, extract that information myself, and save it as part of the Item. Yes, it is much more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
