How can i go to link and get its sub links and again get its sub sub links?like for example,
I want to go to
"https://stackoverflow.com"
then extract its links e.g
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and again go to that sub link and extract those sub links links.
I would recommend using Scrapy for this. With Scrapy, you create a spider object which then is run by the Scrapy module.
First, to get all the links on a page, you can create a Selector object and find all of the hyperlink objects using the XPath:
hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/#href').extract()
Since the hxs.xpath returns an iterable list of paths, you can just iterate over them directly without storing them in a variable. Also each URL found should be passed back into this function using the callback argument, allowing it to recursively find all the links within each URL found:
hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/#href').extract():
yield scrapy.http.Request(url=url, callback=self.parse)
Each path found might not contain the original URL, so that check has to be made:
if not ( url.startswith('http://') or url.startswith('https://') ):
url = "https://stackoverflow.com/" + url
Finally, the each URL can be passed to a different function to be parsed, in this case it's just printed:
self.handle(url)
All of this put together in a full Spider object looks like this:
import scrapy
class StackSpider(scrapy.Spider):
name = "stackoverflow.com"
# limit the scope to stackoverflow
allowed_domains = ["stackoverflow.com"]
start_urls = [
"https://stackoverflow.com/",
]
def parse(self, response):
hxs = scrapy.Selector(response)
# extract all links from page
for url in hxs.xpath('*//a/#href').extract():
# make it a valid url
if not ( url.startswith('http://') or url.startswith('https://') ):
url = "https://stackoverflow.com/" + url
# process the url
self.handle(url)
# recusively parse each url
yield scrapy.http.Request(url=url, callback=self.parse)
def handle(self, url):
print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
Also, keep in mind that running this code will get you rate limited from stack overflow. You might want to find a different target for testing, ideally a site that you're hosting yourself.
Related
I am using scrapy to scrape all the links off single domain. I am following all links on the domain but saving all links off the domain. The following scraper works correctly, but I can't access member variables from within the scraper since I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://example.com']
on_domain_urls = set()
off_domain_urls = set()
def parse(self, response):
links = response.xpath('//a/#href')
for link in links:
url = link.get()
if 'example.com' in url and url not in self.on_domain_urls:
print('On domain links found: {}'.format(
len(self.on_domain_urls)))
self.on_domain_urls.add(url)
yield scrapy.Request(url, callback=self.parse)
elif url not in self.off_domain_urls:
print('Offf domain links found: {}'.format(
len(self.on_domain_urls)))
self.off_domain_urls.add(url)
process = CrawlerProcess()
process.crawl(GoodOnYouSpider)
process.start()
# Need access to off_domain_links
How can I access off_domain_links? I could probably move it to a global scope but this seems hack. I can also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?
Did you check the Itempipeline? I think you'll have to use that in this scenario and decide what needs to be done with the variable.
See:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
Hi guys I am very new in scraping data, I have tried the basic one. But my problem is I have 2 web page with same domain that I need to scrape
My Logic is,
First page www.sample.com/view-all.html
*This page open all the list of items and I need to get all the href attr of every item.
Second page www.sample.com/productpage.52689.html
*this is the link came from the first page so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data like title, description etc on the second page.
what I am thinking is for loop but Its not working on my end. I am searching on google but no one has the same problem as mine. please help me
import scrapy
class SalesItemSpider(scrapy.Spider):
name = 'sales_item'
allowed_domains = ['www.sample.com']
start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']
def parse(self, response):
for product_item in response.css('li.product-item'):
item = {
'URL': product_item.css('a::attr(href)').extract_first(),
}
yield item`
Inside parse you can yield Request() with url and function's name to scrape this url in different function
def parse(self, response):
for product_item in response.css('li.product-item'):
url = product_item.css('a::attr(href)').extract_first()
# it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
yield scrapy.Request(url=url, callback=self.parse_subpage)
def parse_subpage(self, response):
# here you parse from www.sample.com/productpage.52689.html
item = {
'title': ...,
'description': ...
}
yield item
Look for Request in Scrapy documentation and its tutorial
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically add www.sample.com to urls so you don't have to do it on your own in
Request(url = "www.sample.com/" + url, callback=self.parse_subpage)
See A shortcut for creating Requests
If you interested in scraping then you should read docs.scrapy.org from first page to the last one.
I am a newbie and I've written a script in python scrapy to get information recursively.
Firstly, it scrapes links of city including information of tours then it tracks down each cities and reach their pages. Next, it get needed information of tours related to city before move to next pages then so on. Pagination is running on java-script without visible link.
The command I used to get the result along with a csv output is:
scrapy crawl pratice -o practice.csv -t csv
The expected result is csv file:
title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...
The problem is that csv file is empty. The running is stopped at "parse_page" and callback="self.parse_item" doesn't work. I don't know how to fix it. Maybe my workflow is invalid or my code has issues. Thanks for your help.
name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]
def parse(self, response): # Extract cities from country
hxs = HtmlXPathSelector(response)
urls = hxs.select("//div[#class='swiper-wrapper cityData']/a/#href").extract()
for url in urls:
url = urllib.parse.urljoin(response.url, url)
self.log('Found city url: %s' % url)
yield response.follow(url, callback=self.parse_page) # Link to city
def parse_page(self, response): # Move to next page
url_ = response.request.url
yield response.follow(url_, callback=self.parse_item)
# I will use selenium to move next page because of next button is running
# on javascript without fixed url.
def parse_item(self, response): # Extract tours
for block in response.xpath("//div[#class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
article = {}
article['title'] = block.xpath('.//h3[#class="title"]/text()').extract()
article['city'] = response.xpath(".//div[#class='g_v_c_mid t_mid']/h1/text()").extract()# fixed
article['price'] = re.sub(" +","",block.xpath(".//span[#class='latest_price']/b/text()").extract_first()).strip()
article['tour_url'] = 'www.klook.com'+block.xpath(".//a/#href").extract_first()
yield article
hxs = HtmlXPathSelector(response) #response is already in Selector, use direct `response.xpath`
url = urllib.parse.urljoin(response.url, url)
use as:
url = response.urljoin(url)
yes it will stop as its a duplicate request to prev. url, you need to add dont_filter=True check
Instead of using Selenium, figure out what request the website performs using JavaScript (watch the Network tab of the developer tools of your browser while you navigate) and reproduce a similar request.
The website uses JSON requests undernead to fetch the items, which is much easier to parse than the HTML.
Also, if you are not familiar with Scrapy’s asynchronous nature, you are likely to get unexpected issues while using it in combination with Selenium.
Solutions like Splash or Selenium are only meant to be used as last resource, when everything else fails.
I use Scrapy to scrape data from the first URL.
The first URL returns a response contains a list of URLs.
So far is ok for me. My question is how can I further scrape this list of URLs? After searching, I know I can return a request in the parse but it seems only can process one URL.
This is my parse:
def parse(self, response):
# Get the list of URLs, for example:
list = ["http://a.com", "http://b.com", "http://c.com"]
return scrapy.Request(list[0])
# It works, but how can I continue b.com and c.com?
May I do something like that?
def parse(self, response):
# Get the list of URLs, for example:
list = ["http://a.com", "http://b.com", "http://c.com"]
for link in list:
scrapy.Request(link)
# This is wrong, though I need something like this
Full version:
import scrapy
class MySpider(scrapy.Spider):
name = "mySpider"
allowed_domains = ["x.com"]
start_urls = ["http://x.com"]
def parse(self, response):
# Get the list of URLs, for example:
list = ["http://a.com", "http://b.com", "http://c.com"]
for link in list:
scrapy.Request(link)
# This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
# Get the list of URLs, for example:
list = ["http://a.com", "http://b.com", "http://c.com"]
for link in list:
request = scrapy.Request(link)
yield request
For this purpose, you need to subclass scrapy.spider and define a list of URLs to start with. Then, Scrapy will automatically follow the links it finds.
Just do something like this:
import scrapy
class YourSpider(scrapy.Spider):
name = "your_spider"
allowed_domains = ["a.com", "b.com", "c.com"]
start_urls = [
"http://a.com/",
"http://b.com/",
"http://c.com/",
]
def parse(self, response):
# do whatever you want
pass
You can find more information on the official documentation of Scrapy.
# within your parse method:
urlList = response.xpath('//a/#href').extract()
print(urlList) #to see the list of URLs
for url in urlList:
yield scrapy.Request(url, callback=self.parse)
This should work
below is my spider code,
class Blurb2Spider(BaseSpider):
name = "blurb2"
allowed_domains = ["www.domain.com"]
def start_requests(self):
yield self.make_requests_from_url("http://www.domain.com/bookstore/new")
def parse(self, response):
hxs = HtmlXPathSelector(response)
urls = hxs.select('//div[#class="bookListingBookTitle"]/a/#href').extract()
for i in urls:
yield Request(urlparse.urljoin('www.domain.com/', i[1:]),callback=self.parse_url)
def parse_url(self, response):
hxs = HtmlXPathSelector(response)
print response,'------->'
Here i am trying to combine the href link with the base link , but i am getting the following error ,
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone let me know why i am getting this error and how to join base url with href link and yield a request
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes even a step further: here Scrapy works out the domain base for joining. And as you can see, you don't have to provide the obvious http://www.example.com for joining.
This makes your code reusable in the future if you want to change the domain you are crawling.
It is because you didn't add the scheme, eg http:// in your base url.
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
Or even more easy: urlparse.urljoin(response.url, i[1:]) as urlparse.urljoin will sort out the base URL itself.
The best way to follow a link in scrapy is to use response.follow(). scrapy will handle the rest.
more info
Quote from docs:
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.
Also, you can pass <a> element directly as argument.