scrapy : different URLs - python

I'm having a problem making a request to a URL.
While inspecting the main page, I get the URL in the href as:
But when the link opens, it appears to be:
The two links are different. How can I make a request for this?
Here is what my shell says:

The problem is with the allowed_domains spider attribute. Per the log (DEBUG: Filtered offsite request to ...), your current setting doesn't allow requests to jobview.monster.ca to be followed. Set that attribute a bit more loosely:
allowed_domains = ['monster.ca']
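To see why the looser value helps, here is a minimal standard-library sketch of the kind of host matching involved. It is not Scrapy's actual offsite filter: the helper name is_allowed, the example path, and the narrower www.monster.ca setting are all assumptions for illustration, since the original spider's setting is not shown.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of it.

    Hypothetical helper that only approximates domain-based offsite filtering.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# URL on the subdomain named in the log; the path is made up.
job_url = "http://jobview.monster.ca/some-job"

narrow = is_allowed(job_url, ["www.monster.ca"])  # assumed too-strict setting
loose = is_allowed(job_url, ["monster.ca"])       # setting suggested above
print(narrow, loose)  # False True
```

With the bare domain, every subdomain (www, jobview, and so on) passes the check.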

Related

Scraping Product Links at Coles.com.au 429 error with 1 request

I am new to webscraping and would like to scrape the links from the site below using scrapy:
https://shop.coles.com.au/a/national/everything/search/bread?pageNumber=1
I created the below xpath to scrape the links and when I test it out by going to inspect and pressing ctrl + f I get 51 matches which is equal to the number of products and so seems to be correct:
//span[@class="product-name"]/../../@href
However when I go into scrapy shell with the link and apply the command:
response.xpath('//span[@class="product-name"]/../../@href').extract()
with or without a User agent I just get an empty list.
When I run the shell I get a 429 error, which indicates I have made too many requests. But as far as I am aware I have only made 1 request.
In addition I have also set up a spider for this where I set CONCURRENT_REQUESTS = 1 and also get a 429 error.
Does anyone know why my xpath doesn't work on this site?
Thanks
Edit
Below is the spider code:
import scrapy

class ColesSpider(scrapy.Spider):
    name = 'coles'
    allowed_domains = ['shop.coles.com.au']
    start_urls = ['https://shop.coles.com.au/a/national/everything/search/bread/']

    def parse(self, response):
        prod_urls = response.xpath('//span[@class="product-name"]/../../@href').extract()
        for prod_url in prod_urls:
            yield {"Product_URL": prod_url}
I've had a quick look around the website, and it seems to invoke a cookie challenge as well as requiring your IP address.
It may be worth trying scrapy-splash to render the page and get through the JS cookie challenge if you're set on using Scrapy.
Strangely, I managed to get a 200 status code with headers, params and cookies using the requests package, but couldn't get Scrapy to reproduce that response with the same headers and cookies.

How to scrape data from multiple pages using Scrapy?

I'm trying to scrape data from multiple pages using Scrapy. I'm using the code below. What am I doing wrong?
import scrapy

class CollegeSpider(scrapy.Spider):
    name = 'college'
    allowed_domains = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha']
    start_urls = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha/']

    def parse(self,response):
        for college in response.css('div.title'):
            if college.css('a::text').extract_first():
                yield {'college_name':college.css('a::text').extract_first()}
        next_page_url=response.css('li.page-next>a::attr(href)').extract_first()
        next_page_url=response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url,callback=self.praise)
Why do you think you are doing something wrong? Does it show any error? If so, the output should be included in the question in the first place. If it's not doing what you expected, again, you should tell us.
Anyway, looking at the code, there are at least two possible errors:
allowed_domains should contain just a domain name, not a full URL, as documented.
When you yield a new Request for the next page, you should pass callback=self.parse instead of self.praise, so the response is processed the same way as the first URL.
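As a minimal sketch of the first point, the domain can be pulled out of the start URL with the standard library. The variable names here are illustrative only; in the spider you would simply write the resulting string into allowed_domains.

```python
from urllib.parse import urlparse

start_url = (
    "https://engineering.careers360.com/colleges/"
    "list-of-engineering-colleges-in-India?sort_filter=alpha"
)

# hostname drops the scheme, path and query string, leaving only the domain.
domain = urlparse(start_url).hostname
allowed_domains = [domain]
print(allowed_domains)  # ['engineering.careers360.com']
```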

python selenium: possible to cancel redirect on driver.get()?

Is there a way to stop a url from redirecting?
driver.get('http://loginrequired.com')
This redirects me to another page but I want it to stay on that page without redirecting by default.
There are two ways that what users call "redirection" typically happens:
You load a page and the page loads some JavaScript code which performs a test and decides to load a different page. This process can be interrupted in some browsers by hitting the ESCAPE key. Selenium can send an ESCAPE key.
However, this redirection could happen before Selenium gives control back to your script. Whether it would work in any specific case depends on the page being loaded.
You load a page and get an HTTP 3xx redirect response (301, 302, 303, 307, etc.) from the server. There are no opportunities for users to interrupt these redirections in their browser, so Selenium does not provide the means to interrupt or prevent them.
So there is no surefire way to prevent a redirection in Selenium.
A solution, in case you do not need to visualize the page but only need access to the source of "http://loginrequired.com", would be to use Selenium with Scrapy.
Basically, you tell Scrapy's redirect middleware to stop following redirects, and the spider handles the 302 response itself when it accesses the page.
In settings.py you have to set
REDIRECT_ENABLED = False
The spider code is:
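from scrapy import Request
from scrapy.spiders import CrawlSpider
from selenium import webdriver

class LoginSpider(CrawlSpider):
    name = "login"
    allowed_domains = ['loginrequired.com']
    start_urls = ['http://loginrequired.com']
    # Let 302 responses reach the callback instead of being dropped
    handle_httpstatus_list = [302]

    def __init__(self, *args, **kwargs):
        super(LoginSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            return Request(url="http://loginrequired.com", callback=self.after_302)

    def after_302(self, response):
        print(response.url)
        # Your code to analyze the page goes here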
Idea taken from how to handle 302 redirect in scrapy

How does scrapy use rules?

I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.
If I have a rule where I'm crawling through the yellowpages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule? Specifically, how does it activate the restrict_xpaths attribute?
Thanks.
The rules attribute of a CrawlSpider specifies how to extract the links from a page and which callbacks should be called for those links. The rules are handled by the default parse() method implemented in that class -- look here to read the source.
So, whenever you want to trigger the rules for a URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
The extraction of the links (which may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It basically searches for all the <a> and <area> elements in the whole page, or only in the elements obtained after applying the restrict_xpaths expressions if that attribute is set.
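The restriction idea can be illustrated with a small standard-library sketch. This is not Scrapy's LinkExtractor: the extract_links function below, the page fragment, and its URLs are made up for the illustration, and ElementTree only supports a limited XPath subset (hence the leading ".//").

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment used only for this illustration.
PAGE = """
<html><body>
  <ul class="menu-categories">
    <li><a href="/category/cakes">Cakes</a></li>
    <li><a href="/category/bread">Bread</a></li>
  </ul>
  <div id="footer"><a href="/about">About</a></div>
</body></html>
"""


def extract_links(root, restrict_xpaths=None):
    """Collect href values from <a> and <area> elements.

    If restrict_xpaths is given, only search inside the matching elements;
    otherwise search the whole document.
    """
    regions = [root]
    if restrict_xpaths:
        regions = [el for xp in restrict_xpaths for el in root.findall(xp)]
    links = []
    for region in regions:
        for el in region.iter():
            if el.tag in ("a", "area") and "href" in el.attrib:
                links.append(el.attrib["href"])
    return links


root = ET.fromstring(PAGE.strip())
all_links = extract_links(root)
menu_links = extract_links(root, [".//ul[@class='menu-categories']"])
print(all_links)   # ['/category/cakes', '/category/bread', '/about']
print(menu_links)  # ['/category/cakes', '/category/bread']
```

Without the restriction the footer link is picked up as well; with it, only links inside the matching list are returned.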
For example, say you have a CrawlSpider like so:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='parse'
        ),
        Rule(
            # '.' and '?' are escaped so they match literally
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product_page'
        ),
    )

    def parse_product_page(self, response):
        # yield product item here
        pass
The engine starts sending requests to the urls in start_urls and executing the default callback (the parse() method in CrawlSpider) for their response.
For each response, the parse() method will execute the link extractors on it to get the links from the page. Namely, it calls the LinkExtractor.extract_links(response) for each response object to get the urls, and then yields scrapy.Request(url, <rule_callback>) objects.
The example code is a skeleton for a spider that crawls an e-commerce site, following the links of product categories and subcategories to get the links for each of the product pages.
For the rules registered specifically in this spider, it would crawl the links inside the lists of "categories" and "subcategories" with the parse() method as callback (which will trigger the crawl rules to be called for these pages), and the links matching the regular expression product\.php\?id=\d+ with the callback parse_product_page() -- which would finally scrape the product data.
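One detail worth checking when writing such a pattern: in a regular expression, "." and "?" are metacharacters, so a pattern meant to match product.php?id=... literally needs both escaped. A quick standard-library check, using a hypothetical product URL:

```python
import re

url = "http://someurlhere.com/product.php?id=42"  # hypothetical product URL

# Unescaped: '.' matches any character and 'p?' makes the last 'p' optional,
# so the literal '?' in the URL is never matched.
unescaped = re.compile(r"/product.php?id=\d+")
# Escaped: '.' and '?' are matched literally.
escaped = re.compile(r"/product\.php\?id=\d+")

unescaped_hit = bool(unescaped.search(url))
escaped_hit = bool(escaped.search(url))
print(unescaped_hit)  # False
print(escaped_hit)    # True
```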
As you can see, pretty powerful stuff. =)
Read more:
CrawlSpider - Scrapy docs
Link extractors - Scrapy docs

Scrapy - no list page, but I know the url for each item page

I'm using Scrapy to scrape a website. The item page that I want to scrape looks like http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page resembling this.
The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). This guy here seemed to have the same issue, but couldn't figure it out.
Does anyone have a solution?
You could list all the known URLs in your Spider class' start_urls attribute:
import scrapy

class SomepageSpider(scrapy.Spider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    # One start URL per known item page, 1 through 100
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in range(1, 101)]

    def parse(self, response):
        # ...
        pass
If it's just a one time thing, you can create a local html file file:///c:/somefile.html with all the links. Start scraping that file and add somepage.com to allowed domains.
Alternately, in the parse function, you can return a new Request which is the next url to be scraped.
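For that last approach, the next URL can be derived from the current one. The following is a minimal standard-library sketch; the helper name next_page_url and the LAST_PAGE constant are assumptions for illustration, based on the 1 to 100 range given in the question.

```python
import re

BASE = "http://www.somepage.com/itempage/&page=%s"
LAST_PAGE = 100  # assumed upper bound, taken from the question


def next_page_url(current_url):
    """Return the URL of the following page, or None after the last page."""
    match = re.search(r"&page=(\d+)$", current_url)
    if match is None:
        return None
    page = int(match.group(1))
    if page >= LAST_PAGE:
        return None
    return BASE % (page + 1)


print(next_page_url(BASE % 1))    # http://www.somepage.com/itempage/&page=2
print(next_page_url(BASE % 100))  # None
```

Inside parse, a non-None result could then be passed on as scrapy.Request(url, callback=self.parse), with response.url as the input.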
