Below is my spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
import urlparse

class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'
Here I am trying to combine each href with the base URL, but I am getting the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone let me know why I am getting this error, and how to join the base URL with the href and yield a request?
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes a step further: Scrapy works out the base for the join from the response itself, so you don't have to provide the obvious http://www.example.com for joining.
This makes your code reusable in the future if you want to change the domain you are crawling.
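For context, here is a minimal sketch of how the parse method from the question could look with response.urljoin, assuming a more recent Scrapy API (response.xpath instead of HtmlXPathSelector, and import scrapy at the top); the XPath and callback come from the question above:
def parse(self, response):
    for href in response.xpath('//div[@class="bookListingBookTitle"]/a/@href').extract():
        # response.urljoin() resolves the relative href against response.url,
        # so the scheme and domain are filled in for you
        yield scrapy.Request(response.urljoin(href), callback=self.parse_url)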
It is because you didn't add the scheme, e.g. http://, in your base URL.
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
Or, even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
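To see why the scheme matters, here is a quick illustration of urljoin's behaviour (shown with Python 3's urllib.parse; in Python 2 the same function lives in the urlparse module):
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# without a scheme, the "base" is treated as a plain relative path,
# so the result has no scheme either and Scrapy rejects it
print(urljoin('www.domain.com/', 'bookstore/new'))
# -> www.domain.com/bookstore/new

# with the scheme present, urljoin produces a proper absolute URL
print(urljoin('http://www.domain.com/', 'bookstore/new'))
# -> http://www.domain.com/bookstore/new

# joining against response.url works because response.url is already absolute
print(urljoin('http://www.domain.com/bookstore/new', '/bookstore/detail/3271993'))
# -> http://www.domain.com/bookstore/detail/3271993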
The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest.
See the documentation for more info.
Quote from the docs:
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.
Also, you can pass an <a> element directly as an argument.
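A minimal sketch for the spider in the question (the CSS selector mirrors the XPath used above; passing an <a> selector directly requires Scrapy 1.4+):
def parse(self, response):
    # response.follow() accepts relative hrefs and even <a> selectors directly,
    # resolving them against response.url, so no urljoin call is needed
    for a in response.css('div.bookListingBookTitle a'):
        yield response.follow(a, callback=self.parse_url)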
I am new to Scrapy and Python in general, and I am trying to make a scraper that extracts links from a page, edits those links, and then goes through each one of them. I am using Playwright with Scrapy.
This is where I am at, but for some reason it only scrapes the first link.
def parse(self, response):
    for link in response.css('div.som a::attr(href)'):
        yield response.follow(link.get().replace('docs', 'www').replace('com/', 'com/#'),
                              cookies={'__utms': '265273107'},
                              meta=dict(
                                  playwright=True,
                                  playwright_include_page=True,
                                  playwright_page_coroutines=[
                                      PageCoroutine('wait_for_selector', 'span#pple_numbers')]
                              ),
                              callback=self.parse_c)

async def parse_c(self, response):
    yield {
        'text': response.css('div.pple_numb span::text').getall()
    }
It would be nice if you could add more details about the data you are trying to get. Therefore, could you add the indicated line to see if it is going through different links?
def parse(self, response):
    for link in response.css('div.som a::attr(href)'):
        print(link)  # <-- could you add this line to check if it prints all the links?
According to the documentation, there are two methods for following links:
follow:
Return a Request instance to follow a link url. It accepts the same
arguments as Request.__init__ method, but url can be not only an
absolute URL, but also a relative URL, a Link object, e.g. the result of Link Extractors, ...
follow_all
A generator that produces Request instances to follow all links in
urls. It accepts the same arguments as the Request’s __init__ method,
except that each urls element does not need to be an absolute URL, it
can be any of the following: a relative URL, a Link object, e.g. the result of Link Extractors, ...
If you try your code with follow_all instead of follow, it should probably do the trick.
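For example, here is a minimal sketch of the parse method above rewritten with follow_all (available in Scrapy 2.0+); the selector and callback are taken from the question, while the URL rewriting and the Playwright meta are left out for brevity:
def parse(self, response):
    # follow_all() builds one Request per matched link and resolves
    # relative URLs against response.url for you
    yield from response.follow_all(
        css='div.som a::attr(href)',
        callback=self.parse_c,
    )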
All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all the links I need. So in the parse() method I yield a request for each of these links, and in parse_item_page I extract the first part of the information I need, which works completely fine.
My problem is: I thought I could just do the same thing a second time and call parse_entry for each link on my item page. But I have tried so many versions of this and I just can't get it to work. The URLs are correct, but Scrapy just doesn't seem to want to call a third parse function; nothing in there ever gets executed.
How can I get scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
    for href in response.xpath("//listItem/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_item_page)

def parse_item_page(self, response):
    for sel in response.xpath("//div"):
        item = items.FirstItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        for href in response.xpath("//entry/@href"):
            yield response.follow(href.extract(), callback=self.parse_entry)
        yield item

def parse_entry(self, response):
    for sel in response.xpath("//textBlock"):
        item = items.SecondItem()
        item['attribute'] = sel.xpath("//h1/text()").get().strip()
        yield item
How can I go to a link, get its sub-links, and then again get the sub-links of those? For example,
I want to go to
"https://stackoverflow.com"
then extract its links e.g
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and then go to each of those sub-links and extract their sub-links as well.
I would recommend using Scrapy for this. With Scrapy, you create a spider object which then is run by the Scrapy module.
First, to get all the links on a page, you can create a Selector object and find all of the hyperlink objects using the XPath:
hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/@href').extract()
Since hxs.xpath returns an iterable list of hrefs, you can iterate over them directly without storing them in a variable. Also, each URL found should be passed back into this function via the callback argument, allowing it to recursively find all the links within each URL found:
hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)
Each href found might be relative and not contain the original URL, so that check has to be made:
if not (url.startswith('http://') or url.startswith('https://')):
    url = "https://stackoverflow.com/" + url
Finally, each URL can be passed to a different function to be parsed; in this case it's just printed:
self.handle(url)
All of this put together in a full Spider object looks like this:
import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from the page
        for url in hxs.xpath('*//a/@href').extract():
            # make it a valid url
            if not (url.startswith('http://') or url.startswith('https://')):
                url = "https://stackoverflow.com/" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
Also, keep in mind that running this code will get you rate limited by Stack Overflow. You might want to find a different target for testing, ideally a site that you're hosting yourself.
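If you do run it against a live site, a few standard Scrapy settings help keep the crawl polite; this is just a sketch, and the exact values are illustrative rather than prescriptive:
# settings.py (values are illustrative)
ROBOTSTXT_OBEY = True        # respect robots.txt
DOWNLOAD_DELAY = 1.0         # wait between requests to the same site
AUTOTHROTTLE_ENABLED = True  # back off automatically when the server slows down
CONCURRENT_REQUESTS_PER_DOMAIN = 2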
I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re
def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
        recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
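As a rough, hedged starting point, the broad-crawl recommendations boil down to settings like the following (the setting names are standard Scrapy options; the numbers depend on your hardware and targets and are only illustrative):
# settings.py - sketch of the broad-crawl recommendations; tune the numbers yourself
CONCURRENT_REQUESTS = 100          # crawl many domains in parallel
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # less logging overhead
COOKIES_ENABLED = False            # broad crawls rarely need cookies
RETRY_ENABLED = False              # don't retry failed pages
DOWNLOAD_TIMEOUT = 15              # give up on slow pages quickly

# crawl in breadth-first order to keep memory usage down
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'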
To recreate the behaviour you need in Scrapy, you must:
set your start URL in your spider, and
write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.
An untested example (that can be, of course, refined):
class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self):
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
So I'm trying to crawl the popular.ebay.com page, and I get the error Missing scheme in request url: #mainContent for the # anchor links.
The following is my code:
def parse_links(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('//a')
    #domain = 'http://popular.ebay.com/'
    for link in links:
        anchor_text = ''.join(link.select('./text()').extract())
        title = ''.join(link.select('./@title').extract())
        url = ''.join(link.select('./@href').extract())
        meta = {'title': title,}
        meta = {'anchor_text': anchor_text,}
        yield Request(url, callback=self.parse_page, meta=meta,)
I can't just prepend the base URL, because that doubles it up for the URLs that already have a full scheme. I end up getting URLs like this: http://popular.ebay.comhttp://www.ebay.com/sch/i.html?_nkw=grande+mansion
The links I want to get are normal links such as "Antique Chairs", but I get the error because of links like this: <a id="gh-hdn-stm" class="gh-acc-a" href="#mainContent">Skip to main content</a>
How would I go about adding the base URL to only the hash anchor links, or ignoring links without the base URL in them? As a simple solution I've tried setting the rule deny=(#mainContent) and restrict_xpaths, but the crawler still spits out the same error.
The error Missing scheme in request url: #mainContent is caused by requesting a URL without a scheme (the "http://" part of the URL).
#mainContent is an internal link, referring to an HTML element with the id "mainContent". You probably don't want to follow these links, as they only point to a different part of the page you're already on.
I'd suggest looking at this part of the documentation http://doc.scrapy.org/en/latest/topics/link-extractors.html#scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor. You can tell Scrapy to follow links which conform to a certain format and restrict what part of the page it will fetch links from. Take note of the "restrict_xpaths" and "allow" parameters.
Hope this helps :)
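For illustration, here is a minimal sketch using a modern Scrapy, where scrapy.linkextractors.LinkExtractor is the successor of the SgmlLinkExtractor linked above; the spider name, allow pattern and XPath below are placeholders, not taken from the question. Because link extractors return absolute URLs, fragment-only hrefs such as #mainContent never reach the scheduler as bare "#..." request URLs.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PopularSpider(CrawlSpider):              # hypothetical spider name
    name = 'popular'
    allowed_domains = ['ebay.com']
    start_urls = ['http://popular.ebay.com/']

    rules = [
        Rule(
            LinkExtractor(
                allow=(r'/sch/',),                              # placeholder URL pattern
                restrict_xpaths=('//div[@id="mainContent"]',),  # placeholder page region
            ),
            callback='parse_page',
        ),
    ]

    def parse_page(self, response):
        pass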
In your for loop:
meta = {'anchor_text': anchor_text,}
url = link.select('./@href').extract()[0]
if '#' not in url:  # or: if url[0] != '#'
    yield Request(url, callback=self.parse_page, meta=meta,)
This will avoid yielding #foobar as a URL. You could add the base URL to #foobar in an else branch, but since that would just point back to a page Scrapy has already scraped, I don't think there's much point in it.
I found links other than #mainContent that were missing the scheme, so using @Robin's logic I made sure that the URL contained the base URL before calling parse_page.
for link in links:
    anchor_text = ''.join(link.select('./text()').extract())
    title = ''.join(link.select('./@title').extract())
    url = ''.join(link.select('./@href').extract())
    meta = {'title': title,}
    meta = {'anchor_text': anchor_text,}
    if domain in url:
        yield Request(url, callback=self.parse_page, meta=meta,)