I am assigned to create a crawler using Python and Scrapy to get the reviews of a specific hotel. I have read quite a number of tutorials and guides, but my code still just keeps generating an empty CSV file.
Item.py
import scrapy

class AgodaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    StarRating = scrapy.Field()
    Title = scrapy.Field()
    Comments = scrapy.Field()
Agoda_reviews.py
import scrapy

class AgodaReviewsSpider(scrapy.Spider):
    name = 'agoda_reviews'
    allowed_domains = ['agoda.com']
    start_urls = ['https://www.agoda.com/holiday-inn-express-kuala-lumpur-city-centre/hotel/kuala-lumpur-my.html?checkIn=2020-04-14&los=1&adults=2&rooms=1&searchrequestid=41af11cc-eaa6-42cc-874d-383761d3523c&travellerType=1&tspTypes=9']

    def parse(self, response):
        StarRating = response.xpath('//span[@class="Review-comment-leftScore"]/span/text()').extract()
        Title = response.xpath('//span[@class="Review-comment-bodyTitle"]/span/text()').extract()
        Comments = response.xpath('//span[@class="Review-comment-bodyText"]/span/text()').extract()
        count = 0
        for item in zip(StarRating, Title, Comments):
            # create a dictionary to store the scraped info
            scraped_data = {
                'StarRating': item[0],
                'Title': item[1],
                'Comments': item[2],
            }
            # yield or give the scraped info to scrapy
            yield scraped_data
Can anybody please kindly let me know where the problems are? I am totally clueless...
Your results are empty because Scrapy is receiving a response that contains very little content. You can see this by starting a Scrapy shell from your terminal and sending a request to the page you are trying to crawl:
scrapy shell 'https://www.agoda.com/holiday-inn-express-kuala-lumpur-city-centre/hotel/kuala-lumpur-my.html?checkIn=2020-04-14&los=1&adults=2&rooms=1&searchrequestid=41af11cc-eaa6-42cc-874d-383761d3523c&travellerType=1&tspTypes=9'
Then you can view the response that scrapy received by running:
view(response)
That should open the response that Scrapy received and stored in your browser. As you will see, there are no reviews to extract.
Also, as you are trying to extract some information from span-elements, you can run response.css('span').extract() and you will see that there are some span-elements in the response but none of them has a class that has anything to do with Reviews.
So to sum up, Agoda is sending you a mostly empty response, and as a consequence Scrapy is extracting empty lists. Possible reasons: Agoda has figured out that you are trying to crawl their website, for example based on your user agent, and is therefore hiding the content from you - or they are using JavaScript to generate the content.
To solve your problem you should either use the Agoda API, make yourself familiar with user-agent spoofing, or check out the Selenium package, which can help with JavaScript-heavy websites.
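For the user-agent route, here is a minimal sketch; the header string below is just an example of a browser-like value (an assumption, not something Agoda requires), and if the reviews are rendered with JavaScript, spoofing the user agent alone will not help:
import scrapy

class AgodaReviewsSpider(scrapy.Spider):
    name = 'agoda_reviews'
    allowed_domains = ['agoda.com']
    # Override Scrapy's default "Scrapy/x.y" user agent for this spider only
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/80.0 Safari/537.36'),
    }
    # ... start_urls and parse() exactly as in the question ...
You can test the same idea in the shell with scrapy shell -s USER_AGENT='Mozilla/5.0 ...' '<url>' and call view(response) again to check whether the reviews now show up.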
I am making a spider which will crawl the entire site on the first run and store the data in my database.
I will keep running this spider on a weekly basis to get updates of the crawled site into my database, and I don't want Scrapy to crawl pages that are already present in my database. How can I achieve this? I have made two plans:
1] Make a crawler that fetches the entire site and somehow stores the first fetched URL in a CSV file, then keeps following the next pages. Then make another crawler that fetches backwards: it takes the URL from the CSV as input and keeps running until prev_page exists. This way I will get the data, but the URL in the CSV will be crawled twice.
2] Make a crawler that checks whether the data is already in the database and stops if it is. Is that possible? This would be the most productive way, but I can't figure out how to do it. Maybe writing log files might help in some way?
Update
The site is a blog that updates frequently, with the latest post sorted to the top.
Something like this:
from scrapy import Spider
from scrapy.http import Request, FormRequest

class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class="post-inner post-hover"]/h2/a/@href').extract()
        for url in urls:
            if never_visited(url, database):
                yield Request(url, callback=self.parse_lyrics)  # do you mean parse_foo?

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        save_url(response.request.url, database)
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()
        yield {
            'name': name,
            'info': info
        }
You just need to implement the never_visited and save_url functions.
never_visited will check whether the URL is already in your database; save_url will add the URL to your database.
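For illustration, here is a minimal sketch of those two functions backed by SQLite; the table name and schema are assumptions, not part of the original answer:
import sqlite3

def open_database(path='crawled.db'):
    # Open (or create) the database and make sure the table exists
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')
    conn.commit()
    return conn

def never_visited(url, database):
    # True if the url has not been stored by a previous run
    row = database.execute('SELECT 1 FROM visited WHERE url = ?', (url,)).fetchone()
    return row is None

def save_url(url, database):
    # Remember the url so future runs skip it
    database.execute('INSERT OR IGNORE INTO visited (url) VALUES (?)', (url,))
    database.commit()
The database handle could be created once (for example in the spider's __init__) and passed to these functions as shown in the spider above.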
I am new to Python and Scrapy and am trying to work through a small example; however, I am having some problems!
I am able to crawl only the first given URL, but I am unable to crawl more than one page, let alone an entire website!
Please help me or give me some advice on how I can crawl an entire website, or more pages in general...
The example I am doing is very simple...
My items.py
import scrapy

class WikiItem(scrapy.Item):
    title = scrapy.Field()
my wikip.py (the spider)
import scrapy
from wiki.items import WikiItem

class CrawlSpider(scrapy.Spider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org/wiki/"]
    start_urls = (
        'http://en.wikipedia.org/wiki/Portal:Arts',
    )

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = WikiItem()
            item['title'] = sel.xpath('//h1[@id="firstHeading"]/text()').extract()
            yield item
When I run scrapy crawl wikip -o data.csv in the root project directory, the result is:
title
Portal:Arts
Can anyone give me insight as to why it is not following URLs and crawling deeper?
I have checked some related SO questions, but they have not helped to solve the issue.
scrapy.Spider is the simplest spider. Change the name of your class from CrawlSpider, since CrawlSpider is one of the generic spiders that Scrapy provides.
One of the options below can be used:
1. class WikiSpider(scrapy.Spider)
2. class WikiSpider(CrawlSpider)
If you use the first option, you need to code the logic for following the links on that webpage yourself.
For the second option, define the rules after the start URLs, as below:
rules = (
    Rule(LinkExtractor(allow=('https://en.wikipedia.org/wiki/Portal:Arts\?.*?')), callback='parse_item', follow=True,),
)
Also, please change the name of the function defined as "parse" if you use CrawlSpider. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider doesn't work.
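Putting both suggestions together, a rough sketch of the CrawlSpider variant could look like the following; the allow pattern is illustrative and not tested against Wikipedia, and note that allowed_domains should contain only the domain, without a path:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wiki.items import WikiItem

class WikiSpider(CrawlSpider):
    name = "wikip"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Portal:Arts"]

    rules = (
        # Follow wiki links and hand every response to parse_item;
        # the callback is deliberately NOT called "parse".
        Rule(LinkExtractor(allow=(r'/wiki/',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = WikiItem()
        item['title'] = response.xpath('//h1[@id="firstHeading"]/text()').extract()
        yield item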
I am using Scrapy to extract all the posts of my blog. The problem is that I cannot figure out how to create a rule that reads all the posts in any given blog category.
Example: on my blog, the category "Environment setup" has 17 posts. In the Scrapy code I can hard code the URLs as given below, but this is not a very practical approach:
start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
              "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
              "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]
I have read similar posts related to this question here on SO, like 1, 2, 3, 4, 5, 6, 7, but I can't seem to find the answer in any of them. As you can see, the only difference between the above URLs is the page count. How can I write a rule in Scrapy that can read all the blog posts in a category? And another trivial question: how can I configure the spider to crawl my blog such that when I post a new blog entry, the crawler immediately detects it and writes it to a file?
This is what I have so far for the spider class
from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = "nextpage"  # give your spider a unique name because it will be used for crawling the webpages
    # allowed_domains restricts the spider's crawling
    allowed_domains = ["https://edumine.wordpress.com/"]
    # in start_urls you have to specify the urls to crawl from
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]

    '''
    start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]

    rules = [
        Rule(SgmlLinkExtractor
            (allow=("https://edumine.wordpress.com/category/ide-configuration/environment-setup/\d"), unique=False, follow=True))
    ]
    '''

    rules = Rule(LinkExtractor(allow='https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/'), follow=True, callback='parse_page')

    def parse_page(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//h1[@class='entry-title']")
        items = []
        with open("itemLog.csv", "w") as f:
            for title in titles:
                item = BlogscraperItem()
                item["post_title"] = title.xpath("//h1[@class='entry-title']//text()").extract()
                item["post_time"] = title.xpath("//time[@class='entry-date']//text()").extract()
                item["text"] = title.xpath("//p//text()").extract()
                item["link"] = title.xpath("a/@href").extract()
                items.append(item)
                f.write('post title: {0}\n, post_time: {1}\n, post_text: {2}\n'.format(item['post_title'], item['post_time'], item['text']))
            print "#### \tTotal number of posts= ", len(items), " in category####"
            f.close()
Any help or suggestions to solve it?
There are some things you can improve in your code, and you have two problems to solve: reading the posts and automatic crawling.
If you want to get the contents of a new blog post, you have to re-run your spider; otherwise you would have an endless loop. Naturally, in this case you have to make sure that you do not scrape entries you have already scraped (use a database, read the available files at spider start, and so on). But you cannot have a spider that runs forever and waits for new entries; that is not what a spider is for.
Your approach of storing the posts in a file inside the spider is wrong: why do you scrape a list of items and then do nothing with them? And why do you save the items in the parse_page function? This is what item pipelines are for; you should write one and do the exporting there, as sketched below. Also, the f.close() is not necessary because you use the with statement, which closes the file for you at the end.
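A minimal sketch of such a pipeline, assuming the class and file names below (they are arbitrary, not part of your project):
# pipelines.py
from scrapy.exporters import CsvItemExporter

class CsvExportPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open('itemLog.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        # Every item yielded by the spider passes through here
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
Enable it with ITEM_PIPELINES = {'BlogScraper.pipelines.CsvExportPipeline': 300} in settings.py (the module path depends on your project layout). For a plain CSV dump you could even skip the pipeline entirely and run scrapy crawl nextpage -o items.csv.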
Your rules variable should throw an error because it is not iterable. I wonder if you even tested your code. The Rule is also more complex than it needs to be; you can simplify it to this:
rules = [Rule(LinkExtractor(allow='page/*'), follow=True, callback='parse_page'),]
And it follows every URL which has /page in it.
If you start your scraper you will see that the requests are filtered because of your allowed_domains:
Filtered offsite request to 'edumine.wordpress.com': <GET https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/>
To solve this change your domain to:
allowed_domains = ["edumine.wordpress.com"]
If you want to get other wordpress sites, change it simply to
allowed_domains = ["wordpress.com"]
I'm trying to scrape an e-commerce website, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a paginated list of products
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy and I'm following some tutorials to achieve this. I'm wondering how complex the parse functions can be and how rules work. My spiders currently look like this:
GeneralSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links
ItemSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    # Each URL in the file has pagination if it has more than 30 elements
    # I don't know how to paginate over each URL
    f.close()

    def parse(self, response):
        item = ShopItem()
        item['name'] = response.xpath("//h1[@id='u_name']").extract()
        item['description'] = response.xpath("//h3[@id='desc_item']").extract()
        item['prize'] = response.xpath("//div[@id='price_eur']").extract()
        return item
Which is the best way to make the spider follow the pagination of a URL?
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
I've also googled looking for any book related to Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.
Does anyone know if a Scrapy book will be released soon?
Edit:
These 2 URLs fit this example. On the Eroski home page you can get the URLs to the product pages.
On the product pages you have a paginated list of items (Eroski Items):
URL to get links: Eroski Home
URL to get items: Eroski Fruits
On the Eroski Fruits page, the pagination of the items seems to be jQuery/AJAX, because more items are shown when you scroll down. Is there a way to get all these items with Scrapy?
Which is the best way to make the spider follow the pagination of a URL?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology - it is highly configurable - you can have multiple rules, some of them would follow specific links that match specific criteria, or located in a specific section of a page. Having a single spider with multiple rules should be preferred comparing to having multiple spiders.
Speaking about your specific use case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is where restrict_xpaths would help
in the callback, for every category or subcategory yield a Request that mimics the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback), parse the available items and yield another Request for the same category/subcategory but with an increased page GET parameter (getting the next page)
Example working implementation:
import re
import urllib

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()


class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]

    def parse_categories(self, response):
        pattern = re.compile(r'/(\d+)\-\w+')
        groups = pattern.findall(response.url)
        params = {'page': 1, 'categoria': groups.pop(0)}
        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)

        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})

    def parse_products(self, response):
        for product in response.xpath('//div[@class="product_element"]'):
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').extract()[0]
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').extract()[0]
            yield item

        params = response.meta['params']
        params['page'] += 1

        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know if a Scrapy book will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has plans to maybe release a book about web scraping - but I'm not supposed to tell you that.
Hi all, I am trying to get all the results from the link given in the code below, but my code is not returning all of them. The link says it contains 2132 results, but only 20 are returned:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart

class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []
        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)
        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may do so when passed an optional query string, for example).
Handle JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically, or use something like Selenium with PhantomJS to let a browser deal with it. I'd recommend the latter, since it's more foolproof than the manual approach of interpreting the JS yourself. See this question for more information, and Google around - there's plenty of information on this topic.
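If you go the Selenium route, here is a rough sketch of how it could be combined with Scrapy selectors; the driver choice, the fixed sleep, and the scroll-until-stable loop are assumptions, and the XPath is the one from the question:
import time
from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Firefox()  # or PhantomJS()/Chrome(), whichever is installed
driver.get("http://www.flipkart.com/mobiles/pr?sid=tyy,4io")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom so the page's JavaScript loads the next batch of results
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude wait; an explicit wait would be more robust
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Parse the fully rendered HTML with the same selector used in the spider
sel = Selector(text=driver.page_source)
titles = sel.xpath('//div[@class="pu-details lastUnit"]/div[1]/a/text()').extract()
driver.quit()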