I am new to Scrapy and have written a Python spider that is meant to collect information recursively.
First, it scrapes the links to each city that has tours, then follows each link to the city page. Next, it extracts the tour details for that city before moving on to the next page of results, and so on. Pagination is driven by JavaScript, with no visible link.
The command I use to run it and write CSV output is:
scrapy crawl practice -o practice.csv -t csv
The expected result is a CSV file:
title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...
The problem is that the CSV file is empty. The crawl stops at parse_page, and the request with callback=self.parse_item never runs. I don't know how to fix it. Maybe my workflow is wrong, or my code has bugs. Thanks for your help.
name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]

def parse(self, response): # Extract cities from country
    hxs = HtmlXPathSelector(response)
    urls = hxs.select("//div[@class='swiper-wrapper cityData']/a/@href").extract()
    for url in urls:
        url = urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page) # Link to city

def parse_page(self, response): # Move to next page
    url_ = response.request.url
    yield response.follow(url_, callback=self.parse_item)
    # I will use selenium to move next page because of next button is running
    # on javascript without fixed url.

def parse_item(self, response): # Extract tours
    for block in response.xpath("//div[@class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
        article = {}
        article['title'] = block.xpath('.//h3[@class="title"]/text()').extract()
        article['city'] = response.xpath(".//div[@class='g_v_c_mid t_mid']/h1/text()").extract()# fixed
        article['price'] = re.sub(" +","",block.xpath(".//span[@class='latest_price']/b/text()").extract_first()).strip()
        article['tour_url'] = 'www.klook.com'+block.xpath(".//a/@href").extract_first()
        yield article
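The response already exposes the selector API, so this line is unnecessary; call response.xpath(...) directly:
    hxs = HtmlXPathSelector(response)
Likewise, instead of:
    url = urllib.parse.urljoin(response.url, url)
use:
    url = response.urljoin(url)
And yes, the crawl stops at parse_page: the request it yields is for the same URL that was just fetched, so Scrapy's duplicate filter drops it. Pass dont_filter=True to that request.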
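To make the duplicate-request point concrete, here is a toy model of a duplicate filter in plain Python. It is not Scrapy's implementation (Scrapy fingerprints whole requests rather than comparing URL strings), and schedule and the city URL are made up for illustration. It only shows why a second request for an already seen URL is dropped unless dont_filter is set.

```python
def schedule(requests):
    """Return the URLs that would be fetched, given (url, dont_filter) pairs.

    Toy stand-in for a duplicate filter: a URL that was seen before is
    dropped unless the request was flagged with dont_filter=True.
    """
    seen = set()
    accepted = []
    for url, dont_filter in requests:
        if url in seen and not dont_filter:
            continue  # duplicate: silently dropped
        seen.add(url)
        accepted.append(url)
    return accepted


city = "https://www.klook.com/vi/city/example"  # hypothetical city URL

# parse() requests the city page, then parse_page() asks for the same URL again.
dropped = schedule([(city, False), (city, False)])
# Same sequence, but the second request opts out of filtering.
kept = schedule([(city, False), (city, True)])

print(len(dropped), len(kept))
```

Since parse_page already holds the response for that URL, another option is to skip the second request altogether and run the extraction on the same response, for example with yield from self.parse_item(response).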
Instead of using Selenium, figure out what request the website performs using JavaScript (watch the Network tab of the developer tools of your browser while you navigate) and reproduce a similar request.
The website uses JSON requests under the hood to fetch the items, and JSON is much easier to parse than the HTML.
Also, if you are not familiar with Scrapy’s asynchronous nature, you are likely to get unexpected issues while using it in combination with Selenium.
Solutions like Splash or Selenium are only meant to be used as a last resort, when everything else fails.
How can I go to a link, get its sub-links, and then get the sub-links of those? For example,
I want to go to
"https://stackoverflow.com"
then extract its links, e.g.
['https://stackoverflow.com/questions/ask', 'https://stackoverflow.com/?tab=bounties']
and then go to each of those sub-links and extract the links on that page.
I would recommend using Scrapy for this. With Scrapy, you create a spider object which then is run by the Scrapy module.
First, to get all the links on a page, you can create a Selector object and find all of the hyperlinks using XPath:
hxs = scrapy.Selector(response)
urls = hxs.xpath('*//a/@href').extract()
Since extract() returns a list of URLs, you can iterate over it directly without storing it in a variable. Each URL found should also be passed back into this function using the callback argument, so that the spider recursively finds all the links within each URL found:
hxs = scrapy.Selector(response)
for url in hxs.xpath('*//a/@href').extract():
    yield scrapy.http.Request(url=url, callback=self.parse)
A link found this way may be relative rather than a full URL, so that check has to be made:
if not ( url.startswith('http://') or url.startswith('https://') ):
    url = "https://stackoverflow.com/" + url
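Prefixing the site root by hand can produce malformed URLs (a doubled slash when the href already starts with "/", or a bogus address for mailto: and javascript: links). A minimal sketch of an alternative using the standard library's urljoin follows. The helper name absolute_http_url and the sample hrefs are made up for illustration. Inside a spider, response.urljoin(url) does the same joining against the current page URL.

```python
from urllib.parse import urljoin, urlsplit


def absolute_http_url(page_url, href):
    """Resolve href against page_url; return None for non-HTTP(S) links."""
    absolute = urljoin(page_url, href)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    return absolute


base = "https://stackoverflow.com/"
sample_hrefs = [
    "/questions/ask",                  # root-relative
    "?tab=bounties",                   # query only
    "https://stackoverflow.com/tags",  # already absolute
    "mailto:team@example.com",         # not crawlable
]
for href in sample_hrefs:
    print(href, "->", absolute_http_url(base, href))
```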
Finally, each URL can be passed to a different function to be processed; in this case it is just printed:
self.handle(url)
All of this put together in a full Spider object looks like this:
import scrapy

class StackSpider(scrapy.Spider):
    name = "stackoverflow.com"
    # limit the scope to stackoverflow
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        hxs = scrapy.Selector(response)
        # extract all links from page
        for url in hxs.xpath('*//a/@href').extract():
            # make it a valid url
            if not ( url.startswith('http://') or url.startswith('https://') ):
                url = "https://stackoverflow.com/" + url
            # process the url
            self.handle(url)
            # recursively parse each url
            yield scrapy.http.Request(url=url, callback=self.parse)

    def handle(self, url):
        print(url)
And the spider would be run like this:
$ scrapy runspider spider.py > urls.txt
Also, keep in mind that running this code will get you rate limited by Stack Overflow. You might want to find a different target for testing, ideally a site that you host yourself.
Hi guys, I am very new to scraping and have only tried the basics. My problem is that I have two web pages on the same domain that I need to scrape.
My logic is:
First page: www.sample.com/view-all.html
* This page lists all the items, and I need to get the href attribute of every item.
Second page: www.sample.com/productpage.52689.html
* This link comes from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data, such as title and description, from the second page.
I was thinking of a for loop, but it is not working on my end. I have searched on Google, but no one seems to have the same problem. Please help me.
import scrapy

class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the URL and a callback, so that the URL is scraped in a different function:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    # here you parse from www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look up Request in the Scrapy documentation and its tutorial.
There is also
response.follow(url, callback=self.parse_subpage)
which resolves relative URLs against the current page automatically, so you don't have to prepend www.sample.com yourself as in
Request(url = "www.sample.com/" + url, callback=self.parse_subpage)
See "A shortcut for creating Requests".
If you are interested in scraping, you should read docs.scrapy.org from the first page to the last.
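One more thing to watch for in the question's spider: the entries in start_urls have no scheme, and Scrapy refuses to build a request for such a URL (it raises a "Missing scheme in request url" error). Below is a small standard-library sketch of that check. has_http_scheme is a hypothetical helper, and that the site is served over https is an assumption.

```python
from urllib.parse import urlsplit


def has_http_scheme(url):
    """True if url starts with an explicit http:// or https:// scheme."""
    return urlsplit(url).scheme in ("http", "https")


candidates = [
    "www.sample.com/view-all.html",          # as written in the question
    "https://www.sample.com/view-all.html",  # assumed corrected form
]
for url in candidates:
    print(url, has_http_scheme(url))
```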
I have an intranet page with multiple input fields. I need Scrapy to run a search using the page's "search for products" input field, which has an id of "searchBox".
I have been able to locate the correct search box using both Scrapy and Beautiful Soup, but I am not sure how to pass that data back to Scrapy's form submission function correctly.
In Method 1 I tried to simply pass the results to Scrapy's FormRequest.from_response function as an input, but it does not work.
Method 1 - Using Scrapy to find the data
#Search for products
def parse(self, response):
    ##Let's try search using scrapy only
    sel = Selector(response)
    results = sel.xpath("//*[contains(@id, 'searchBox')]")
    for result in results:
        print (result.extract()) #Print out what scrapy found
    return scrapy.FormRequest.from_response(results, formdata = {'Item': 'Whirlpool Washing Machine'}) #formdata is the data we are sending
Method 2 - Using Beautiful Soup to find the data
#Search for products
def parse(self, response):
    ##Let's try search using Beautiful Soup only
    soup = BeautifulSoup(response.text, 'html.parser')
    product_search = []
    product_search.append(soup.find("input", id="searchBox"))
    print(product_search) #Print what BS found
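About the Scrapy variant:
Prefer yield over return for the request.
from_response expects the response itself as its first argument, not a selector. As far as I can tell from your code, you are passing it the matched input element instead. To pick a particular form on the page, use the formid, formname, formxpath or formcss argument.
Try something like:
yield scrapy.FormRequest.from_response(response, formcss='form', formdata={'Item': 'Whirlpool Washing Machine'})
Just adjust the form selector in this expression to match your page. The keys in formdata must be the name attributes of the form's inputs, so check what the searchBox input is called. Also check what else the request needs, such as headers or cookies.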
The website I am trying to crawl has the following structure:
there are various modules (for which I generate links without issues) - let's call them "module_urls"
each module page has a random number of links to various pages with videos (let's call them "lesson_urls")
each page has one video
The idea is to print links to all videos.
I have successfully managed to, separately: (1) generate the module_urls, (2) scrape the links to lesson_urls, and (3) scrape the videos. However, I am struggling with creating the appropriate loop to make it all work together.
The following script correctly generates module_urls, but, contrary to my expectations, the request to crawl each url (and then to crawl each sub-url) is never fulfilled. I am sure that this comes from my pure ignorance of the topic - this is the first time I am trying to use Scrapy.
Thank you very much for your help!
video_links = []

def after_login(self, response):
    module_urls = self.generate_links()
    for module_url in module_urls:
        print("This is one module URL: %s" % module_url)
        Request(module_url, self.get_lesson_urls)
    print(self.video_links)

def get_lesson_urls(self, response):
    print("Entered get_lesson_urls")
    urls = response.xpath('//*[starts-with(@id,"post")]//li/a/@href').extract()
    for lesson_url in urls:
        Request(lesson_url, self.get_video_link)

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    self.video_links.append(video_address)
I believe you will need to yield your request objects:
video_links = []

def after_login(self, response):
    module_urls = self.generate_links()
    for module_url in module_urls:
        print("This is one module URL: %s" % module_url)
        yield Request(module_url, self.get_lesson_urls)

def get_lesson_urls(self, response):
    print("Entered get_lesson_urls")
    urls = response.xpath('//*[starts-with(@id,"post")]//li/a/@href').extract()
    for lesson_url in urls:
        yield Request(lesson_url, self.get_video_link)

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    # a callback must yield requests or items (e.g. a dict), not a bare string;
    # the key name 'video_url' is just an illustrative choice
    yield {'video_url': video_address}
Edit:
Rather than print, if you then yield the urls you want, you can output them to json (and other formats) using:
scrapy crawl myspider -o data.json
You can do further parsing with Scrapy's Item Pipeline: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
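Why the original version never crawled anything can be seen without Scrapy at all. In the sketch below, FakeRequest is a made-up stand-in for scrapy.Request. Building the object without yielding it hands nothing back to the caller, whereas a generator passes each one along (in a real spider, to the Scrapy engine, which then schedules it).

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; it only stores the URL."""

    def __init__(self, url):
        self.url = url


def without_yield(urls):
    for url in urls:
        FakeRequest(url)  # object is created and immediately discarded


def with_yield(urls):
    for url in urls:
        yield FakeRequest(url)  # object is handed to whoever iterates


lessons = ["https://example.com/lesson-1", "https://example.com/lesson-2"]

print(without_yield(lessons))                # nothing comes back to the caller
print([r.url for r in with_yield(lessons)])  # every request reaches the caller
```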
Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks as such:
import scrapy
from tripadvisor.items import TripadvisorItem

class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page but the pagination doesn't flip to the next page of reviews. From what I suspect, this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further I've read that these requests are sending a POST request using XHR. By exploring the "Network" tab in Firefox "Inspect" I can see both a Request URL and Form Data that might be needed to flip the page according to other posts on SO regarding the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL will always change based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX- where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether or not it is possible to paginate using the XHR request/form data or do I need to manually build a list of URLs for each hotel that adds the -orXX-?
Well I ended up discovering an xpath that apparently allowed pagination of the reviews, but it's funny because every time I checked the underlying HTML the href link never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even if I was on page 10 for example. It seems the "-orXX-" part of the link always increments the XX by 5 so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[@rel="next"]/@href')
and I now have more than 41K extracted reviews. I would love to get others' opinions on handling this problem in other situations.
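For comparison, the manual alternative raised in the question, building the -orXX- URLs directly, could look roughly like this. review_page_urls is a hypothetical helper. The step of 5 is taken from the example URLs above, and the number of pages would have to be read from the page itself, which this sketch does not attempt.

```python
def review_page_urls(first_page_url, pages, step=5):
    """Build review page URLs by inserting -orN- after "-Reviews".

    first_page_url: URL of the first review page (no -orN- part).
    pages: total number of pages to generate, including the first.
    step: offset increment per page (assumed to be 5, as in the examples).
    """
    head, marker, tail = first_page_url.partition("-Reviews-")
    if not marker:
        raise ValueError("URL does not contain '-Reviews-'")
    urls = [first_page_url]
    for page in range(1, pages):
        urls.append(f"{head}-Reviews-or{page * step}-{tail}")
    return urls


first = (
    "https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-"
    "Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html"
)
pages = review_page_urls(first, 4)
for url in pages:
    print(url)
```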