Creating multiple requests from same method in Scrapy - python

I am parsing webpages that have a similar structure to this page.
I have the following two functions:
def parse_next(self, response):
    # implementation goes here
    # create Request(the_next_link, callback=parse_next)
    # for link in discovered_links:
    #     create Request(link, callback=parse_link)

def parse_link(self, response):
    pass
I want parse_next() to create a request for the *Next link on the web page. At the same time, I want it to create requests for all the URLs discovered on the current page, using parse_link() as the callback. Note that I want parse_next to recursively use itself as a callback, because that seems to me to be the only possible way to generate requests for all the *Next links.
*Next: the link that appears beside all the page numbers on that page.
How am I supposed to solve this problem?

Use a generator function: loop through your links and yield a Request for each link you want to fetch:
for link in links:
    yield Request(link.url)
Since you are using scrapy, I'm assuming you have link extractors set up.
So, just declare your link extractor as a variable like this:
link_extractor = SgmlLinkExtractor(allow=('.+'))
Then, in the parse function, call the link extractor on the response for the_next_link:
links = self.link_extractor.extract_links(response)
Here you go:
http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained
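Putting both ideas together, here is a minimal sketch of what the spider could look like. It is only an illustration under assumptions: the spider name, the start URL, and the a.next selector for the *Next link are placeholders, and LinkExtractor is used in place of the older SgmlLinkExtractor.
from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor

class PaginatedSpider(Spider):
    name = 'paginated'                              # placeholder name
    start_urls = ['http://example.com/page/1']      # placeholder start URL

    link_extractor = LinkExtractor(allow=('.+',))

    def parse_next(self, response):
        # Yield a request for every link discovered on the current page.
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse_link)

        # Yield a request for the *Next page, reusing this method as the callback.
        next_href = response.css('a.next::attr(href)').get()   # selector is an assumption
        if next_href:
            yield Request(response.urljoin(next_href), callback=self.parse_next)

    def parse_link(self, response):
        pass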

Related

Python / Scrapy - code skips products, even though xpath is the same for all products listed

I am trying to scrape information from www.archive.org, which contains historical product data. My code below tries to click on every product listed, scrape the information for each product, and do the same for subsequent pages.
The problem is that it SKIPS some products (20 in particular), even though the xpath:
products = response.xpath("//article[contains(@class,'product result-prd')]")
is the same for all products. Please see my complete code below.
class CurrysSpider(scrapy.Spider):
    name = 'currys_mobiles_2015'
    #allowed_domains = ['www.currys.co.uk']
    start_urls = ['https://web.archive.org/web/20151204170941/http://www.currys.co.uk/gbuk/phones-broadband-and-sat-nav/mobile-phones-and-accessories/mobile-phones/362_3412_32041_xx_xx/xx-criteria.html']

    def parse(self, response):
        products = response.xpath("//article[contains(@class,'product result-prd')]")  # done
        for product in products:
            brand = product.xpath(".//span[@data-product='brand']/text()").get()  # done
            link = product.xpath(".//div[@class='productListImage']/a/@href").get()  # done
            price = product.xpath(".//strong[@class='price']/text()").get().strip()  # done
            description = product.xpath(".//ul[@class='productDescription']/li/text()").getall()  # done
            absolute_url = link  # done
            yield scrapy.Request(url=absolute_url, callback=self.parse_product,
                                 meta={'brand_name': brand,
                                       'product_price': price,
                                       'product_description': description})  # done

        # process next page
        next_page_url = response.xpath("//ul[@class='pagination']//li[last()]//@href").get()
        absolute_next_page_url = next_page_url
        if next_page_url:
            yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

    def parse_product(self, response):
        .....
I have noticed this problem on many websites that I have tried to scrape, and I am not sure why some products are skipped, since the xpath is the same for all of the product listings.
Would appreciate some feedback on this.
Take a look at whether those products are present in the page HTML or loaded via JS.
Just press Ctrl+U and check the HTML body for those products.
It's possible the individual pages are not loading properly, likely due to JS loading, as the rest of your code looks fine (though I would recommend using normalize-space() in your XPath instead of .strip() for the price).
To test this (in Chrome), visit your target web page, open Chrome DevTools (F12), click "Console", and press Ctrl+Shift+P to pull up the command window.
Next, type "Disable JavaScript" and select that option when it shows up. Now press Ctrl+R to refresh the page; this is the "view" that your web scraper gets. Check your XPath expressions against it.
If you do have issues, consider using scrapy-splash or scrapy-selenium to load this JS.
EDIT: I would also check for the possibility of a memory leak. According to the Scrapy docs, passing data through the meta attribute of your requests can sometimes cause leaks.
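To illustrate the normalize-space() suggestion, here is a minimal sketch of the price line from the spider above (the XPath comes from the question; whether it matches the archived page is an assumption):
# Instead of calling .strip() on a value that may be None:
#     price = product.xpath(".//strong[@class='price']/text()").get().strip()
# let XPath normalize the whitespace; .get() then returns '' instead of raising
# an AttributeError when nothing matches:
price = product.xpath("normalize-space(.//strong[@class='price']/text())").get()
if not price:
    price = None  # no price node found for this product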

Passing a list as an argument to a function in python

I'm new to Python and I'm struggling with passing a list as an argument to a function.
I've written a block of code that takes a URL, extracts all the links from the page, and puts them into a list (links = []). I want to pass this list to a function that filters out any link that is not from the same domain as the starting link (i.e. the first one in the list) and outputs a new list (filtered_links = []).
This is what I have:
import requests
from bs4 import BeautifulSoup

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)

print(filter_links(links))
When I run this, I get an unfiltered list and below that, I get None.
Eventually I want to pass the filtered list to a function that grabs the HTML from each page of the domain linked on the homepage, but I'm trying to tackle this problem one step at a time. Any tips would be much appreciated, thank you :)
EDIT
I can now pass the list of URLs to the filter_links() function; however, I'm filtering out too much now. Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I have used the built-in startswith method, but it's filtering out everything except the starting URL itself. I think I could use a regex, but this approach should work too, shouldn't it?
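For illustration, here is one hedged sketch of domain-based filtering (not from the original post): comparing hostnames with urllib.parse instead of string prefixes, and returning the result so the caller does not get None.
from urllib.parse import urljoin, urlparse

def filter_links(links, start_url):
    # Keep only links that resolve to the same host as start_url.
    start_host = urlparse(start_url).netloc
    filtered_links = []
    for link in links:
        absolute = urljoin(start_url, link)          # resolve relative hrefs like '/about'
        if urlparse(absolute).netloc == start_host:  # compare hostnames, not prefixes
            filtered_links.append(absolute)
    return filtered_links                            # without the return, the caller gets None

print(filter_links(links, start_url))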

How to use Scrapy Request and get response at same place?

I am writing a Scrapy crawler to scrape data from an e-commerce website.
The website has color variants, and each variant has its own price, sizes, and stock for those sizes.
To get the price, sizes, and stock for a variant, I need to visit the link of the variant (color).
And all the data needs to be in one record.
I have tried using requests, but it is slow and sometimes fails to load the page.
I have written the crawler using requests.get(), passing the response to scrapy.selector.Selector() and parsing the data from it.
My question is: is there any way to use scrapy.Request() to get the response somewhere other than the callback function? I need the response in the same place, something like this:
response = scrapy.Request(url=variantUrl)
sizes = response.xpath('sizesXpath').extract()
I know scrapy.Request() requires a parameter such as callback=self.callbackparsefunction, which is called when Scrapy generates the response, to handle that response. I do not want to use callback functions; I want to handle the response in the current function.
Or is there any way to return the response from the callback function to the function where scrapy.Request() is written, something like below:
def parse(self, response):
    variants = response.xpath('variantXpath').extract()
    for variant in variants:
        res = scrapy.Request(url=variant, callback=self.parse_color)
        # use of the res response

def parse_color(self, response):
    return response
Take a look at the scrapy-inline-requests package; I think it's exactly what you are looking for.
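A rough sketch of how scrapy-inline-requests is typically used, based on its documented @inline_requests decorator; the spider name and start URL below are placeholders, the XPath expressions are taken from the question, and this is an illustration rather than a tested implementation.
import scrapy
from inline_requests import inline_requests

class VariantSpider(scrapy.Spider):
    name = 'variants'                              # placeholder name
    start_urls = ['http://example.com/products']   # placeholder start URL

    @inline_requests
    def parse(self, response):
        for variant_url in response.xpath('variantXpath').extract():
            # Yielding a Request inside a decorated callback hands the
            # response back right here instead of to a separate callback.
            variant_response = yield scrapy.Request(variant_url)
            yield {
                'url': variant_url,
                'sizes': variant_response.xpath('sizesXpath').extract(),
            }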

How to go to next page when web crawling using Python

I am trying to do web crawling using Python, but I can't figure out how to change pages automatically.
I found the pattern, but I don't know how to go to the next page automatically until it reaches the last page.
The pattern is:
'http://.../sortBy=helpful&pageNumber=0'
'http://.../sortBy=helpful&pageNumber=1'
'http://.../sortBy=helpful&pageNumber=2'
'http://.../sortBy=helpful&pageNumber=3'
and so on ...
import re
from urllib.parse import urljoin

def review_next_page(page=1):
    list_url = 'https://www.amazon.com/Quest-Nutrition-Protein-Apple-2-12oz/product-reviews/B00U3RGAMW/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber={0}'.format(page)
    list_url = [urljoin(list_url, review_link) for review_link in ???]
    return list_url
I am trying to make the last number increase by 1 until it reaches the end...
Should I use a for loop?
Thanks in advance!
Not directly answering the question, but this is something that can be easily and conveniently handled by Scrapy's CrawlSpider class and its link extractors. You can configure which patterns an href should match for the link to be followed. In your case, it would be something like:
Rule(LinkExtractor(allow=r'sortBy=helpful&pageNumber=\d+$'), callback='parse_page')
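For context, a minimal sketch of how that rule could sit inside a CrawlSpider. The spider name, the follow=True flag, and the parse_page body with its CSS selectors are placeholders, and the start URL is adapted from the question (with sortBy=helpful so it matches the rule's pattern).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ReviewSpider(CrawlSpider):
    name = 'amazon_reviews'   # placeholder name
    start_urls = [
        'https://www.amazon.com/Quest-Nutrition-Protein-Apple-2-12oz/product-reviews/'
        'B00U3RGAMW/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&sortBy=helpful&pageNumber=1'
    ]

    rules = (
        # Follow every link whose href ends with the pagination pattern
        # and hand each fetched page to parse_page.
        Rule(LinkExtractor(allow=r'sortBy=helpful&pageNumber=\d+$'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Placeholder extraction logic; the selectors are assumptions.
        for review in response.css('div.review'):
            yield {'review_text': review.css('span.review-text::text').get()}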

Passing href text and referencing web page in Scrapy

Any way to do this using a CrawlSpider, rather than yielding requests myself? Just an example would suffice. I want to use the href text as the title of the web page and keep a link to the URL that contained the link. I'm just using basic selectors to fill my item, but I'm not sure how to get this information.
Edit:
I looked into it, and I want to be able to pass the href text and the referencing URL as meta data, and also to comply with the rules I've defined, rather than having to collect all URLs and filter them myself.
meta={"hrefText" : ..., "refURL": ...}
See the CrawlSpider source code:
for link in links:
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=n, link_text=link.text)
    yield rule.process_request(r)
This means you can get the href text from response.meta['link_text'].
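As a hedged sketch of how that could be used in a rule callback (the item keys mirror the meta dict from the question; reading the Referer header assumes Scrapy's default RefererMiddleware is enabled):
def parse_item(self, response):
    # link_text is added to meta by CrawlSpider when it builds the request (see the excerpt above).
    href_text = response.meta.get('link_text', '').strip()
    # The referring page is taken from the Referer header that Scrapy sets by default.
    ref_url = response.request.headers.get('Referer', b'').decode()
    yield {
        'hrefText': href_text,
        'refURL': ref_url,
        'url': response.url,
    }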
